The aim of our study is to predict and identify the Laticauda laticaudata’s selenoproteins and selenoprotein machinery factors. In order to do that, we performed an homology-based approach from the L. laticaudata genome. So, we compared our specie with Homo sapiens genome due to their close phylogenetic relationship and to the fact that the human selenoprotein are described.
The L. laticaudata genome was download from the file provided by the teachers of the Bioinformatics subject in the next directory:
mnt/NFS_UPF/soft/genomes/2019/Laticauda_laticaudata/genome.fa |
In order to identify the selenoproteins of L. laticaudata, we have chosen Homo sapiens genome because it is the closest specie that has all the selenoproteins, except two. We haven’t found these two selenoproteins in seleno db1, seleno db2 nor uniprot. So, we believe that they haven’t a good query and the prediction can not be made.
The sequences of queries has been obtained from SelenoDB 1.0 database. We have copied the sequence into an emacs file named as query.fa, where query is the name of the protein. Moreover, we changed “U” for “X” in order to perform a correct analysis because the software doesn’t recognise it. Also, we removed the symbols like # or @.
We perform the protein prediction analysis faster using two automatic perl programs created by us. Firstly, we have introduced a file with all the queries in the terminal in order to perform the tBLASTn. The following steps have been automatized in script1.pl and script2.pl. The first one creates a folder for each hit obtained in tBLASTn and execute the script2.pl for each one. The script2.pl includes all the commands to make the prediction.
The previous steps of data acquisition has been performed manually, we have created a multifasta.fa with all the queries. We selected the main hits of tBLASTn according to query coverage percentage and low e-value (< 0,002) and then, we created a file with the selected hits. The next step has been automatized in script1.pl in which news folders creates for each hit selected. Some of these commands include output and input file reference names written in a standardized manner ($identifyer).
We ran a tBLASTn for each query using the following command:
tblastn -query sel15_human.prot.fa -db /mnt/NFS_UPF/soft/genomes/2019/Laticauda_laticaudata/genome.fa -evalue 0.002 -outfmt ‘7 qseqid sseqid evalue qstart qend sstart send length pident qcovs qcovus’ -out queryname |
BLAST performs comparisons between queries and the L. Laticaudata genome. So, we have used tBLASTn function because we want to compare proteins with the translated genomic sequences. In this step, the main parameter which we take count in each hit is E-value, because it shows the times that we expect to found a hit by chance. So, the lower ones correspond to a higher hit significance, that’s why we chose hits with a value lower than 0,002. We generated an output document including the next elements for each hit:
The output document is used after to do the data analysis.
We select hits using basically percentage of query coverage higher than 60% and E-value lower than 0,002. We use the following command to create a file with the scaffold sequence:
my $scaffold_file = "$SCAFFOLD.fa"; my $cmd = "fastafetch -f $genome -i $genomeindex -q $SCAFFOLD > $scaffold_file"; my $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
We generate a document with length scaffold sequence using the next command:
my $scaffold_length = "$SCAFFOLD.length.fa"; $cmd = "fastalength $scaffold_file > $scaffold_length"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
We use fastasubseq software in order to extract our genomic region of interest inside the scaffold.
my $scaffold_subseq = "$SCAFFOLD.subseq.fa"; $cmd = "fastasubseq -f $scaffold_file -s $start -l $long > $scaffold_subseq"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd"); } |
In order to avoid missing parts of the sequence, we have added 50,000 base pairs up and 50.000 base pairs down to each scaffold sequence downstream of the BLAST hit. This step has been done automatically using the next command:
my $length; open (FILE, '<', "$scaffold_length"); while ( my @taula = split(/ /,$_); } close (FILE); print "$length\n"; $start = $start - 50000; if ($start < 0){ $start = 0; } $end = $end + 50000; if ($end > $length) { $end = $length; } print "$start\n"; print "$end\n"; my $long = $end - $start; print "$long\n"; |
Exonerate software extracts the predicted gene for our genomic region of interest. We decide not to include exhaustive prediction because we consider a high level of homology between L. Laticaudata and Homo sapiens selenoproteins. We extract the genes and exons using egrep command.
my $scaffold_exonerate = "$SCAFFOLD.exonerate.fa"; $cmd = "exonerate -m p2g --showtargetgff -n 1 -q $query -t $scaffold_subseq > $scaffold_exonerate"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
my $scaffold_genes_exonerate = "$SCAFFOLD.genes.exonerate.gff"; $cmd = "egrep -w gene $scaffold_exonerate > $scaffold_genes_exonerate"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
my $scaffold_all_exonerate = "$SCAFFOLD.all.exonerate.gff"; $cmd = "egrep -w exon $scaffold_exonerate > $scaffold_all_exonerate"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
We use Fastaseq to generate the cDNA sequence of the predicted genes from Exonerate.
my $scaffold_prednuc = "$SCAFFOLD.prednuc.fa"; $cmd = "fastaseqfromGFF.pl $scaffold_subseq $scaffold_all_exonerate > $scaffold_prednuc"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
This program translates the cDNA protein obtained with fastaseqfromGFF into protein sequence.
my $scaffold_aa = "$SCAFFOLD.aa.gff"; $cmd = "fastatranslate -f $scaffold_prednuc -F 1 > $scaffold_aa"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
The output file is scanned for * symbols in order to replace it by X.
my $prediction_aa = "$SCAFFOLD.prediction.txt"; $cmd = "sed -i 's/\*/X/g' $scaffold_aa > $prediction_aa"; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
We have used T-Coffee to evaluate the quality of alignments. So, T-Coffee compares the predicted proteins of L. Laticaudata with the queries of Homo sapiens. We execute it with the following commands:
my $scaffold_tcoffee = "$SCAFFOLD.tcoffee"; $cmd = "t_coffee $query $scaffold_aa > $scaffold_tcoffee "; $exit_code = system($cmd); if($exit_code!=0){ die("exit code $exit_code failed to run $cmd "); } |
As we have told in the introduction, SECIS elements are responsible of translational readthrough because they present Sec residue. They are situated in 3’UTR end of selenoprotein-coding mRNA.
We identify SECIS elements in our predicted sequences using SeciSearch3 tool. In order to corroborate the results, we confirm the predictions with Seblastian. If the program predicts more than one SECIS element, we establish as a criteria that the SECIS element must be in the same strand and it has to be on the 3’UTR end of the gene (between 200 and 5,000 positions more respect the end).
We have used phylogeny.fr in several protein families in order to check the alignments making a phylogeny tree. The output file shows the correlation between the L. laticaudata proteins and the human proteins.