Methodology
The aim of this study is to determinate the selenoproteins encoded in the genome of Manis javanica as well as the machinery required for their synthesis and selenium metabolism. In order to find these selenoproteins, it has been performed an homology-based approach using the human genome as its closest specie.
The reference genome was provided by our teachers introducing the following command in the terminal:
$ /cursos/20428/BI/genomes/2016/Manis_javanica/genome.fa
Queries acquisition
As mentioned before, Homo sapiens genome is required as a reference to do the homology-based approach. The objective is to find out the selenoproteins and proteins involved in their synthesis in Manis javanica by comparing both genomes.
In order to obtain these proteins, it is necessary the access to SelenoDB, the reference database for selenoproteins. Each protein is accurately described and its transcripts are sequenced in fasta format.
Every sequence must be copied into an EMACS text editor, without the “hash” [#] that is found at the end. Despite selenocysteines are represented with an “U”, the executed programs do not recognise this character. For this reason, all the “U” must be replaced by an “X”, which represents any possible amino acid. Also, if there is any @ or other symbol, it must be removed. Once saved and stored as a fasta file (prot_name.fa), it can be used for the following steps.
Search of a gene in a concrete genome’s region: tBLASTn
The NCBI BLAST software is required for the next step. It is available by introducing the following modules:
$ module load modulepath/goolf-1.7.20
$ module load BLAST+/2.2.30-goolf-1.7.20
BLAST (Basic Local Alignment Search Tool) is a complex program based on a heuristic algorithm, which locally aligns a problem protein (query) with sequences from a database (
Manis javanica genome). There are many types of BLAST depending on the query and the reference sequence used. In this case, tblastn is used since the function of this program is to convert the protein sequence into a nucleotide one and to find out those high-homology regions with different parts of our query.
The tBLASTn command required is:
$ tblastn –query prot_name.fa –db /cursos/20428/BI/genomes/2016/Manis_javanica/genome.fa –out prot_name.blast
Where:
- query: Our problem sequence.
- db: Database containing our animal’s genome (
Manis javanica).
- out: Output file; no concrete format is required (better write an intuitive ending).
It is useful to tabulate the information by using –outfmt 7. Moreover, the parameter “evalue > 0.05” is also useful. By introducing this parameter, tblastn removes the scaffolds that are not enough significant.
$ tblastn –query prot_name.fa –db /cursos/20428/BI/genomes/2016/Manis_javanica/genome.fa –outfmt 7 -evalue 0.05 –out prot_name.blast
Once the tBLASTn is done, a list of scaffolds appears combined with other parameters. The different scaffolds are high-homology regions between the genome and the query. Each one has an e-value, which represents the amount of random possible sequences with high-homology with the sequence. Scaffolds with low e-values become good candidates for our query’s alignment. Another parameter used is the bit score, which indicates the overall quality of the alignment. In contrast with the e-value, a high score is required. It is also indicated the length of the query which is useful to evaluate the scaffold’s choice, as it should contain the largest possible length. Otherwise, some regions in the query won’t get aligned.
Taking all these parameters into account, one scaffold not used previously must be chosen in order to proceed with this protocol.
Delimiting the genomic region of interest: Fastafetch and Fastasubseq
The genome is already structured by an index previously codified. However, it is important to delimit a more accurate region in which the scaffold is present. The commands used are:
$fastafetch /cursos/20428/BI/genomes/2016/Manis_javanica/genome.fa
/cursos/20428/BI/genomes/2016/Manis_javanica/genome.index scaffold_name > output_file.fetch
Where:
- scaffold_name: The elected scaffold (subject id).
In order to delimit the sequence to some thousands of nucleotides, the following commands are required:
$ module load Exonerate/2.2.0-goolf-1.7.20
$fastasubseq output_file.fetch start_nucleotide length > output_file.subseq
Where:
- start nucleotide: The first nucleotide corresponding to the lowest nucleotide in the query, minus 50.000 nucleotides if possible.
- length: The difference between the start nucleotide pointed before and the last one in our scaffold, plus 50.000 nucleotides if possible.
This makes sure that the selected region is big enough to fit the aligned region.
Gene prediction: Exonerate
The exonerate program predicts the gene by comparing the protein sequence from database with the genomic region obtained with fastasubseq. This file includes the alignment of introns and exons from human protein with the
Manis javanica genome. The command used is the following one:
$ exonerate –m p2g –showtargetgff –q prot_name.fa –t output_file.subseq > output_file.exonerate
Where:
- m p2g: It makes possible an alignment protein-genome.
- showtargetgff: The output file is given in GFF format.
A new file is obtained from the exonerate program with the GFF notation which includes the position of the exons:
$ export PATH=/cursos/20428/BI/bin:$PATH
$ exonerate –m p2g –showtargetgff –q prot_name.fa –t output_file.subseq | egrep –w exon > output_file.exonerate.gff
Where:
- w exon: Exons extraction.
The positions of the exon are extracted from the whole sequence using the program fastaseqfromGFF:
$ fastaseqfromGFF.pl output_file.subseq output_file.exonerate.gff > output_file.nt.fa
This nucleotide sequence obtained from fastaseqfromGFF has to be translated to protein in order to perform the alignment using t-coffee. The fastatranslate program uses all the possible open reading frameworks (ORF) to do this task. However, only the first one is used in this case as it is highly reliable:
$ fastatranslate -f output_file.nt.fa -F 1 > output_file.prot
Where:
- F 1: First ORF used for nucleotide sequence translation.
After this step, the * symbol of protein file must be replaced for X symbol in order to avoid errors when running the t-coffee program.
Final alignment: T-coffee
Once the predicted gene is translated into a protein sequence, T-coffee (Tree-based Consistency Objective Function for alignment Evaluation) program is used in order to compare both sequences. The resulting file shows the homology between both sequences and amino acid one to one most probable alignment. The commands used in this step are:
$ module load T-Coffee/11.00.8cbe486-goolf-1.7.20
$ t_coffee output_file.prot prot_blast.fa
SECIS and Seblastian prediction: SECISearch3
SECIS elements are located in the 3’ - UTR regions of the genes and are essential for the synthesis of selenoproteins. That’s why SECIS elements are used in order to confirm the prediction of selenoproteins.
This elements are predicted by using in silico
SECISearch3 and
Seblastian that identify if there is any SECIS element in the sequence given. Even though, this programmes have no specificity, giving many false positives. That is why it is needed to revise manually each result obtained in order to determine whether the protein is a selenoprotein or not.