Varanus komodoensis’ genome was obtained from a file given by the bioinformatics teachers as a Fasta file. It can be found in the following directory:
mnt/NFS_UPF/soft/genomes/2019/Varanus_komodoensis/genome.fa
The lizard Anolis Carolinensis genome has the closest phylogenetic relationship with our specie. However, its selenoprotein annotation is incomplete, some of its proteins were really short and incomplete, so we could not make an accurate prediction. Therefore, we finally decided to compare Varanus komodoensis genome also with a further relative species that has a very well annotated selenoproteome: the Homo sapiens’.
In order to obtain the selenoproteins of our chosen species we used the SelenoDB website. Human proteins were taken from the 1.0 version of the database, and Anolis Carolinensis’ proteins from the 2.0 one, as it was not in the first version. We downloaded the protein aminoacid sequences in two fasta files, one for each specie, and then we created a single file for every query, separately.
First, we created a directory for Human and another one for Anolis Carolinensis where we put all the sequences. Then we created two other directories to perform the analysis and store the results and the information obtained from the comparisons.
To split the big fasta files in multiple files for each query we used the next commands:
awk '/^>/ {OUT=substr($0,2) ".fasta"}; OUT {print >OUT}' lizardx.fa
awk '/^>/ {OUT=substr($0,2) ".fasta"}; OUT {print >OUT}' humanx.fa
Here, an explanation of the different steps will be done in order to have a better understanding of the development of our prediction process. A bash programme was created to automate the prediction and annotation process, script. It consists in the generation of different files with the protein alignments prediction in the genome for all the possibilities. Finally, after obtaining the results, data analysis will be performed.
BLAST
BLAST (Basic Local Alignment Search Tool) is an algorithm that finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. In order to compare each Homo sapiens and Anolis carolinensis queries to the Varanus komodoensis genome, it is a very useful tool to do it. Specifically, we used a subtype of BLAST called TBLASTN, a tool which compares a genome sequence with query proteins.
tblastn -outfmt 6 -query ${protein} -db /mnt/NFS_UPF/soft/genomes/2019/Varanus_komodoensis/genome.fa -out ${prefix}_blast.txt
With the BLAST algorithm, an output file is generated with the following elements (for every hit):
- Contig: the genome fragment which fits with the hit.
- Scaffold: the genome part in which the contig is found.
- Start: the hit initial point.
- End: the hit final point.
- E-value: a measurement of the hit significance.
To evaluate the hits obtained and choose the ones which are more significant, we took into account the E-value and the identity percentage of each one. The E-value is the expression of the possibility to find an alignment as good or better as the one found in your database by chance. Therefore, the lower E-values have the higher significance. We established a threshold for each conditions: the e-value must be under 0,0001 and the identity percentage above 50%. Another point to take into account is the hit’s position. We observe that in some cases, the majority of hits are found in the same scaffold. This fact is interesting because it is possible that these genes are contained in an specific cluster.
awk '{ if(($3 > 50) && ($11 < 0.00001)) { print }}' ${prefix}_blast.txt > ${prefix}_blast_filtered.txt
It is important to underline that the protein sequences obtained from online SelenoDB database contain the Selenocysteine annotation as ’U’. This symbol is incompatible with the bash programme, because U is not recognised as an amino acid, and it shows an error instead. Therefore, in order to analyse the protein sequences, before the TBLASTN performance, it is necessary to convert all the Selenocysteines (U) into X (any possible amino acid). Then, we observed that in the several sequences, different symbols were included in the end, such as @, #, %... In the following line, we have removed all this symbols and changed U into X:
sed 's/\#//g' lizardnet.fa | sed 's/@//g' | sed 's/%//g' | sed 's/U/X/g' > lizardx.fa
Fastafetch: extraction of scaffold of interest
Thanks to the previous step, the scaffolds of interest are obtained. Now, we need to extract them into a different Fasta file. To do that, we will use the Fastafetch programme (using the database showed before).
fastafetch /mnt/NFS_UPF/soft/genomes/2019/Varanus_komodoensis/genome.fa /mnt/NFS_UPF/soft/genomes/2019/Varanus_komodoensis/genome.index ${SCAFFOLD} > ${SCAFFOLD}.fa
The output will be classified in different Fasta files, named each one with the name of the specific scaffold: ${SCAFFOLD}.fa.
Fastasubseq: extraction of exact region of interest
Once we have the scaffolds in the new fasta files, we have to obtain the exact region of interest. To do that, we extended the search 50.000 nucleotides in both 5’ and 3’ directions. This was done in order to not miss part of the sequence due to the presence of introns (hits correspond only to coding sequences).
If 50.000 exceeds the total sequence length, the program establishes the start or the end as the maximum or minimum length of the sequence. (reduces from 1 nucleotide to 1 nucleotide until it reaches the maximum length of the sequence.)
Exonerate and egrep function
At this point, we have the subsequences of the scaffolds selected before in several files. Now, the programme Exonerate will allow us to predict the gene coordinates of every gene predicted. Then, with the Egrep programme, we will be able to extract the exon coordinates from gene coordinates by searching the word “exon” in the sequence.
The exonerate programme with egrep command:
exonerate -n 1 -m p2g --showtargetgff -q ${protein} -t ${SCAFFOLD}_${START_COORD}_${END_COORD}.fa | egrep -w exon > ${SCAFFOLD}_${START_COORD}_${END_COORD}_exons.gff
Fasta seq from GFF
At this point, we use another command to extract the cDNA sequence of the predicted protein. This is also saved in a new file:
fastaseqfromGFF.pl ${SCAFFOLD}_${START_COORD}_${END_COORD}.fa ${SCAFFOLD}_${START_COORD}_${END_COORD}_exons.gff > ${SCAFFOLD}_${START_COORD}_${END_COORD}_exons_nucl.fa
Fasta translate
Then, we translate the cDNA sequences into amino acid sequences.
fastatranslate -f ${SCAFFOLD}_${START_COORD}_${END_COORD}_exons_nucl.fa -F 1 > ${SCAFFOLD}_${START_COORD}_${END_COORD}_aa.fa
It is important to mention that the obtained results will show the Selenocysteine amino acids as *, which will not be recognized by the T-Coffee programme, which is the following step. To solve this, it is necessary to change this * for X’s.
sed 's/*/X/g' ${SCAFFOLD}_${START_COORD}_${END_COORD}_aa.fa > ${SCAFFOLD}_${START_COORD}_${END_COORD}_aa_U_repl.fa
T-coffee
The T-coffee programme has been used to perform a global alignment between the final predicted Varanus komodoensis protein and the query protein from the Anolis carolinensis’ and Homo sapiens’ genomes.
t_coffee ${protein} ${SCAFFOLD}_${START_COORD}_${END_COORD}_aa_U_repl.fa > ${SCAFFOLD}_${START_COORD}_${END_COORD}_tcoffee
The last thing we performed was a SECIS prediction. To do so, we used the programs Seblastian and SECISearch3 with the results of the subseq as inputs. (these tools can be found in http://seblastian.crg.es/). This prediction is important since a SECIS element must be found in all selenoproteins. It is also important to compare the positions of the predicted SECIS element with the absolute positions of our predicted gene, in order to determine if the SECIS predicted is found in the 3’ UTR region. If not, it will be dismissed.
Finally, a phylogenetic tree was performed for each protein family in order to study the the relationship between the queries and the predicted proteins. It also helped us to check if the scaffold chosen correlated with the phylogeny or not. The input was the human and/or lizard protein and our predicted proteins. The tool used in this case can be found in http://www.phylogeny.fr/simple_phylogeny.cgi.