Our aim was to search for the already known selenoprotein families in the Saimiri boliviensis genome. Because the synthesis of selenoproteins requires a specific machinery, we also searched for the proteins of the machinery. In both cases, we used an homology-based approach.
In order to find the putative regions coding for selenoproteins and selenoprotein homologues, we used gene prediction based on homology complemented with SECIS element prediction. Because of the phylogenetic proximity between S. boliviensis and Homo sapiens, we used the human selenoproteins as queries to search for regions encoding homologous proteins in the S. boliviensis genome. First, BLAST was used to search for the genomic regions with similarity (after translation) to the human selenoproteins. Hits were then extended, yielding subsequences that were used for gene prediction with exonerate and GeneWise. T-Coffee was then used to align the predictions with their respective queries. The incorporation of Sec into selenoproteins requires a SECIS element. Therefore, in order to find the SECIS elements corresponding to the predicted selenoproteins, we used SECISearch to predict putative SECIS elements. Besides, we also used blast to find sequences homologous to the human SECIS elements. Finally, Selenoprofiles was used to confirm our results.
The search of the machinery was also done using the human machinery proteins as queries and using blast followed by exonerate and GeneWise to predict the homologues. In this case, we did not search for SECIS elements because these proteins do not contain Sec (except for SPS2, which was analysed together with the other selenoproteins).
Because of the phylogenetic proximity and the availability of the sequences, we used the human proteins and SECIS elements as queries for our analysis. We also searched for four extra selenoprotein families which are not found in mammals.
BLAST (Basic Local Alignment Search Tool) is an algorithm that searches for local similarity between a query sequence and a sequence database. There are many programs of BLAST. The selection of a specific program depends on the query, the database and the aim of the search.
In order to find genomic regions encoding proteins homologous to the known selenoproteins and the machinery, we used tblastn. This program searches for similarity between a protein query and a nucleotide database translated into the six reading frames. An important parameter in blast is the expectation value (E-value) of a hit, which reflects the chances of finding the hit by chance. Therefore, the lower the E-value, the better. For the human queries, multiple hits were obtained with a wide range of E-values, including E-values as low as 2x10-120, consistent with the phylogenetic proximity between humans and S. boliviensis. We used an E-value cut-off of 0.001 (hits with higher e-values were discarded), and all the reported hits were used for further analysis. For the non-human selenoproteins, no E-value cut-off was used.
We also searched for sequences with similarity to the human SECIS elements found in SelenoDB. For this, because they are nucleotide sequences, we used blastn, which searches for similarity between a nucleotide sequence and a nucleotide database. We also used an E-value cut-off of 0.001.
In order to run blast from the shell, a database from the genome has to be created. This had already been created for us, but can be done with the following command line:
Blast can be run with the following command line (parenthesis indicate optionality):
After finding the genomic regions with similarity to the protein queries (blast hits), a genomic region was selected for each blast hit so that it included the hit and the surrounding nucleotides. In case we obtained more than one hit on the same scaffold for a given query, we "merged" the hits so that the subsequence included all the hits and the surrounding nucleotides after the first and last hits. This subsequence was then used for gene prediction using exonerate and GeneWise (the subsequence is necessary in order to reduce the computing time, which would be unreasonably high if we run these programs for the whole scaffold).
Extracting a subsequence from a genome sequence can be done with the program fastasubseq, which is part of the exonerate package. First, we need to index the genome and then get the scaffolds which contain the subsequences to be extracted.
We made a program that generates a file with the scaffolds that will be needed (forfastafetch.pl)
The scaffolds can be retrieved in two steps:
Then we can extract the subsequence. For this, we need the name of the scaffold from which the subsequence must be extracted, the start position and the length of the sequence to extract. H. sapiens and S. boliviensis are closely-related species and the protein sequences are expected to be similar. However, proteins are rarely reported as a unique blast hit because they are usually encoded by more than one exon and different exons are reported as different blast hits. Therefore, for a given query and a given scaffold, the smallest and biggest position in the scaffold for the blast hits must be identified. Because it might be that some exons are less conserved and therefore not reported as blast hits, we wanted to reduce as much as possible the possibility of omitting these sequences from the subsequence. Therefore, we extended 150,000 nucleotides before the smallest position detected by blast and after the biggest one, and extracted the subsequence in between. We made a program that calculated the start position and the length (forfastasubseq.pl).
The subsequences can be extracted with the following command line:
See fastaindex, fastafetch and fastasubseq in our script.
Exonerate is a tool for sequence alignment. In one of its modalities, it aligns a protein sequence to a translated DNA sequence, allowing for introns and therefore predicting the structure of a gene encoding a similar protein. We used exonerate to align the protein queries to the DNA subsequences extracted in the previous step, so that the similar proteins encoded in the genome were predicted.
Exonerate can be run using the following command line:
Then, the GFF output is used to get the sequence of the predicted cDNA in fasta format, which will be then translated.
For this, first a file is made with the information for the exons:
Then the fasta sequence is obtained with the program fastaseqfromGFF.pl:
Finally, fastatranslate, one of the programs included in the exonerate package, is used to translate the cDNA sequence.
GeneWise uses a combination of gene prediction model and a protein homology model to compare a protein sequence to a DNA sequence and, whenever existing, predict the structure of the gene (introns and exons) that encodes for a similar protein. It is similar to exonerate but it does not use heuristics (Birney et al, 2004).
Genewise can be run with the following command line:
After predicting the similar proteins with exonerate and GeneWise, each sequence is aligned to its respective query with T-Coffee. This is a multiple sequence alignment package, which, therefore, allows to align two or more sequences.
T-Coffee can be run with the following command line:
See T-Coffee in our script: for the exonerate predictions and for the GeneWise predictions.
The recoding of the UGA codon into Sec requires a SECIS element in the 3'UTR of the mRNA. Therefore, to further characterize the predicted proteins belonging to selenoprotein families (not for the machinery, except for SPS2), we searched for SECIS elements around the predicted proteins. For this, we used SECISearch and blast.
SECIS elements have conserved features in the primary and secondary structure. SECISearch is a program that predicts SECIS elements based on this conservation, together with an estimation of the free energy of the predicted secondary structure, which reflects the stability of the structure (Kryukov et al, 2003).
SECISearch can be run on the command line for any sequence for which SECIS elements must be predicted. We used the same subsequences as for exonerate and GeneWise.
The conservation exhibited by SECIS elements is quite loose, making SECIS prediction challenging. SECISearch predictions include a number of false positives, and some SECIS elements might be missed. It is known that SECIS elements of mammalian orthologous selenoproteins exhibit sequence similarity (Kryukov et al, 2003). Therefore, to complement the SECIS prediction, we used blast to identify genomic regions with similarity to the human SECIS elements, as previously explained.
See SECISearch in our script.
Selenoprofiles is a computational pipeline to identify all members of a protein family encoded in a target genome sequence. It was developed to identify selenoproteins. It uses a similar approach to ours. The main difference is that the initial queries are alignments of selenoprotein families instead of a single protein. From this alignments, a profile is built and used for finding homologous regions with psitblastn (this modality of blast compares a profile (position specific scoring matrix) to a nucleotide database translated into the six frames). Subsequently, exonerate and GeneWise are used to predict candidate genes. The parameters used for running these programs are slighlty different to ours. For instance, the scoring matrix is modified so that the extension of the alignment over a TGA codon in the target genome is favoured when it aligns with selenocysteine, cysteine, arginine and threonine. Predicted candidates are then filtered on the basis of a minimum spanning length in relation to the profile alignment. SECISearch is then used to search for SECIS elements. Finally, results are refiltered (Mariotti et al, 2010).
We executed Selenoprofiles with the following command line. We searched for all the eukaryotic selenoproteins.
Here we show a script, which we call script1, which allows to automatically perform all the steps for a given set of protein queries (e.g. selenoproteins, machinery). This script uses three perl programs: extractu.pl, forfastafetch.pl, forfastasubseq.pl.
#!/bin/bash ### First we have already created a folder named proteins, which contains the file with the queries (humanselenoprot_prot.fa) and the program extractu.pl. ### We also have a folder named fastafetch, with the programs forfastafetch.pl and forfastasubseq.pl. ### BLAST. We start in our "home" folder. We create a folder named blast. We copy the file containing the queries (humanselenoprot_prot.fa) and run tblastn in it. We set the-m9 option and filter out the hits with an e-value smaller than 0.001. mkdir blast cd blast cp ../proteins/humanselenoprot_prot.fa . export PATH=/cursos/BI/bin/ncbiblast/bin:$PATH #In order to run the different programs of the script, we need to export their paths. blastall -p tblastn -i humanselenoprot_prot.fa -d /cursos/BI/genomes/project_2013/Saimiri_boliviensis/genome.fa -m9 -e 0.001 -o blast ### We make a file without the lines starting with "#". egrep -v "#" blast > ../inputfromblast.txt ### Next, Us in protein sequences are replaced by Xs. For this, the programs extractu.pl is required in the proteins folder.Then, individual protein sequences are extracted into indindividual files. We do this with the programs fastaindex and fastafetch, as we will do for the genomic sequences. cd ../proteins ./extractu.pl < humanselenoprot_prot.fa > protnou.fa export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH fastaindex protnou.fa protnou.index for protein in `cut -f 1 ../inputfromblast.txt | uniq` ; do { fastafetch protnou.fa protnou.index $protein > "${protein}.fa" } done ### In order to run fastafetch, first we use the program forfastafetch.pl to get a file with the queries and the names of the scaffolds we want to retrieve. cd ../fastafetch ./forfastafetch.pl < ../inputfromblast.txt > protein_genomic.txt cut -f 2 protein_genomic.txt > inputforfastafetch.txt #This makes a file with the names of the scaffolds. ### Then we run FASTAFETCH (first fastaindex). export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH fastaindex /cursos/BI/genomes/project_2013/Saimiri_boliviensis/genome.fa genome.index for name in `cat inputforfastafetch.txt`; do { fastafetch /cursos/BI/genomes/project_2013/Saimiri_boliviensis/genome.fa genome.index $name > "${name}.fa" } done ### We change the names of the scaffolds to make the handling of the files easier. Each scaffold is named with the name of the protein and the hit. When two or more proteins have hits on the same scaffold, a copy is generated for each protein. nom=a while read protein genome do if [ "$nom" != "$protein" ]; then num=1 nom=$protein cp "${genome}.fa" "${protein}_${num}_genomic.fa" else num=$[$num+1] cp "${genome}.fa" "${protein}_${num}_genomic.fa" fi done < protein_genomic.txt ### Then we generate a file with the information we need for running fastasubseq with the program forfastasubseq.pl. It is an extension of the program forfastafetch.pl and in this case the name of the scaffold is the new name. ./forfastasubseq.pl < ../inputfromblast.txt > inputforfastasubseq.txt ### Then we run fastasubseq. First we make a folder named subsequences. mkdir ../subsequences while read query genomic start length orientation do fastasubseq $genomic $start $length > ../subsequences/"${genomic}_exonerate.fa" 2> error.temp error=`cat error.temp` #this is to ammend those cases in which the length goes beyond the end of the scaffold, which gives an error. if [ "$error" != "" ]; then final=`cat error.temp | egrep "before end" | cut -d '(' -f 2 | sed 's/)//'` length=$(( $final - $start)) fastasubseq $genomic $start $length > ../subsequences/"${genomic}_exonerate.fa" fi done < inputforfastasubseq.txt ### Then we run exonerate (from home). cd .. mkdir outputexoneratealign mkdir outputexonerate mkdir multiplepredictions export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH export PATH=/cursos/BI/bin:$PATH while read query genomic start length orientation do genomicexonerate="${genomic}_exonerate.fa" exonerate -m p2g --showtargetgff -q ./proteins/$query -t ./subsequences/$genomicexonerate > outputexoneratealign/"${genomicexonerate}.gff" done < ./fastafetch/inputforfastasubseq.txt ###Then we transform the exonerate output into fasta and translate the sequence. For that, we need to take into account that exonerate might give more than one prediction. In this case, we only consider those predictions in the same orientation as the blast hit. ls outputexoneratealign > listexonerate.txt for name in `cat listexonerate.txt`; do { subsequence=`echo $name | sed 's/\.gff//'` numprediction=`egrep "C4 Alignment" ./outputexoneratealign/$name | uniq -c | cut -d "C" -f 1` if [ $numprediction == 1 ]; then cat ./outputexoneratealign/$name | egrep -w exon > ./outputexonerate/"${name}.gff" fastaseqfromGFF.pl ./subsequences/$subsequence ./outputexonerate/"${name}.gff" > ./outputexonerate/"${name}fromgff.fa" else i=0 region1=no region2=no while read file #read the file line by line do if [ $region1 = "no" ]; then field=`echo $file | cut -d ":" -f 1` if [ "$field" = "Target range" ]; then #here is where exonerate gives the start and ending position of the prediction start="`echo $file | cut -d ":" -f 2 | cut -d "-" -f 1 | cut -d " " -f 2`" end="`echo $file | cut -d ">" -f 2 | cut -d " " -f 2`" if [ "$start" -lt "$end" ]; then orientation="+" else orientation="-" fi hit=`basename $name _exonerate.fa.gff`; orientationblast=`cat ./fastafetch/inputforfastasubseq.txt | egrep "$hit" | cut -f 5` echo orientationblast $orientationblast if [ "$orientationblast" = "$orientation" ]; then #then do the other steps region1=yes fi fi fi if [ "$region1" = "yes" ]; then if [ "$region2" = "no" ]; then compare=`echo $file | cut -c 7-11` if [ "$compare" = "START" ]; then i=$[$i + 1] region2=yes fi fi if [ "$region2" = "yes" ]; then echo "$file" >> ./multiplepredictions/"${name}${i}" compare=`echo $file | cut -c 7-9` if [ "$compare" = "END" ]; then region2=no region1=no fi fi fi done < ./outputexoneratealign/$name fi } done ls multiplepredictions > listgff.txt for file in `cat listgff.txt` ; do { subsequence=`echo $file | sed 's/\.gff[0-9]//'` cat ./multiplepredictions/$file | egrep -w exon > ./outputexonerate/"${file}.gff" fastaseqfromGFF.pl ./subsequences/$subsequence ./outputexonerate/"${file}.gff" > ./outputexonerate/"${file}fromgff.fa" } done # We make a file with the names of the fasta files generated in the previous step. These files contain the exonerate predictions that will be translated next. ls outputexonerate | egrep "fromgff" > ./outputexonerate/totranslate.txt mkdir translated cd outputexonerate export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH for sequence in `cat totranslate.txt` ; do { i=1 while [ $i -lt 4 ] #although we expect the good translation will be for the frame 1, we also consider the frames 2 and 3 do fastatranslate -F $i $sequence > ../translated/"${sequence}_${i}_translated.fa" i=$[i+1] done } done ### T-COFFEE. Exonerate predictions and queries are aligned using T-COFFEE. cd .. mkdir tcoffeeexonerate cd tcoffeeexonerate ls ../translated > namestcoffeeexonerate.txt export PATH=/cursos/BI/bin:$PATH for sequence in `cat namestcoffeeexonerate.txt` ; do { protein=`echo $sequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa\.gff[0-9]*fromgff\.fa_[0-9]_translated\.fa//' ` t_coffee ../translated/$sequence ../proteins/"${protein}.fa" } done cd .. ### SECISEARCH mkdir secis cd secis ls ../subsequences > subsequences.txt export PATH=/cursos/BI/bin:$PATH while read subseq do SECISearch.pl ../subsequences/$subseq done < subsequences.txt cd .. ### GENEWISE preparation. ls subsequences > subsequencesgenewise.txt mkdir genewise cd genewise mkdir outputgenewise mkdir outputgenewiserev ###First we run genewise for the forward strand. for subsequence in `cat ../subsequencesgenewise.txt`; do { query=`echo $subsequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa//'` genewise -pep -pretty -cdna -gff ../proteins/${query}.fa ../subsequences/$subsequence > ./outputgenewise/"${subsequence}_genewise.gff" } done ###Then we run genewise for the reverse strand. for subsequence in `cat ../subsequencesgenewise.txt`; do { query=`echo $subsequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa//'` genewise -pep -pretty -cdna -trev -gff ../proteins/${query}.fa ../subsequences/$subsequence > ./outputgenewiserev/"${subsequence}_genewiserev.gff" } done ###Then we need to get the protein prediction from the genewise output in order to align it with the query using T-Coffee. ls outputgenewise | egrep "genewise.gff" > namesgenewise.txt ls outputgenewiserev | egrep "genewiserev.gff" > namesgenewiserev.txt cd outputgenewise cp ../namesgenewise.txt . for name in `cat namesgenewise.txt` ; do { fet=false region=no while read file do if [ "$fet" = "false" ]; then if [ $region = "no" ]; then compare="`echo $file | cut -c 1-3`" if [ "$compare" = ">gi" ]; then region=yes i=0 fi else compare="`echo $file | cut -c 1-2`" if [ $compare != "//" ]; then i=$[i + 1] else fet=true fi fi fi done < $name cat $name | egrep "pep" -A $i > ${name}_fortcoffee.fa } done cd ../outputgenewiserev cp ../namesgenewiserev.txt . for name in `cat namesgenewiserev.txt` ; do { fet=false region=no while read file do if [ "$fet" = "false" ]; then if [ $region = "no" ]; then compare="`echo $file | cut -c 1-3`" if [ "$compare" = ">gi" ]; then region=yes i=0 fi else comparar="`echo $file | cut -c 1-2`" if [ $comparar != "//" ]; then i=$[i + 1] else fet=true fi fi fi done < $name cat $name | egrep "pep" -A $i > ${name}_fortcoffee.fa } done cd .. mkdir tcoffeegenewise mkdir tcoffeegenewiserev ls outputgenewise | egrep "fortcoffee.fa" > ./tcoffeegenewise/namestcoffee.txt ls outputgenewiserev | egrep "fortcoffee.fa" > ./tcoffeegenewiserev/namestcoffeerev.txt cd tcoffeegenewise export PATH=/cursos/BI/bin:$PATH for sequence in `cat namestcoffee.txt` ; do { protein=`echo $sequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa_genewise\.gff_fortcoffee\.fa//'` t_coffee ../outputgenewise/$sequence ../../proteins/"${protein}.fa" } done cd ../tcoffeegenewiserev for sequence in `cat namestcoffeerev.txt` ; do { protein=`echo $sequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa_genewiserev\.gff_fortcoffee\.fa//'` t_coffee ../outputgenewiserev/$sequence ../../proteins/"${protein}.fa" } done
#!/usr/bin/perl -w use strict; #This program is used to substitute the seleneocysteines (U) in the protein queries by X. my @line; my $i; my $newline; while (<STDIN>){ chomp($_); $i = 0; @line = split ("", $_); if ($line[$i] eq ">"){ $newline = join ("", @line); } else { while ($i < scalar(@line)){ if ($line[$i] eq "U"){ $line[$i] = "X"; } $i = $i + 1; } $newline = join ("", @line); } print "$newline\n"; }
#!/usr/bin/perl -w use strict; #this program generates a file with the names of the queries and the scaffolds for which we got blast hits. my @linia; my $a; my $human;#protein name my $nom;#scaffold name my @output; my $humancomparacio; $human = "a"; $nom = "a"; $inici = 1; $fi = 1; $humancomparacio = "a"; while (<STDIN>){ chomp($_); $a = $_; @linia = split ("\t", $a); if ($linia[0] ne $humancomparacio) {#first line or new protein if ($humancomparacio ne "a") {#if not first line @output = join("\t",$human, $nom); print "@output\n"; } $humancomparacio=$linia[0]; $human = $linia[0]; $nom = $linia[1]; } else{#it's still the same protein if ($linia[1] ne $nom){#new scaffold @output = join("\t", $human, $nom); print "@output\n"; $human = $linia[0]; $nom = $linia [1]; } } } @output = join("\t", $human, $nom); print "@output\n";
#!/usr/bin/perl -w use strict; #This program uses the blast tabular output (without the lines starting by #) and produces a file with five columns:query name, scaffold name, start for fastasubseq, length for fastasubseq, gene orientation. my @linia; my $a; my $human;#protein name my $nom;#scaffold name my $inici;#start my $fi;#end my @output; my $length; my $numero;#scaffold counter my $humancomparacio;#variable with the name of the protein query. It is used to know whether every line refers to the same protein as the previous line or a new one. my $orientacio;#orientation $human = "a"; $nom = "a"; $inici = 1; $fi = 1; $humancomparacio = "a"; while (<STDIN>){ chomp($_); $a = $_; @linia = split ("\t", $a); if ($linia[0] ne $humancomparacio) {#the first file line or new protein if ($humancomparacio ne "a"){ #unless it is the first line $length = ($fi - $inici + 300000); $inici = ($inici - 150000); if ($inici < 0) { $inici = 1;} $nom = $human."_".$numero."_"."genomic.fa"; $human = $human.".fa"; @output = join("\t", $human, $nom, $inici, $length, $orientacio); print "@output\n"; } $numero = 1; $human = $linia[0]; $humancomparacio = $linia[0]; $nom = $linia[1]; if ($linia[8] < $linia[9]){ $inici = $linia[8]; $fi = $linia[9]; $orientacio = "+"; } else { $inici = $linia[9]; $fi = $linia[8]; $orientacio = "-"; } } else{#the protein is not new if ($linia[1] ne $nom) {#new scaffold of the same protein #then we print what we have from the previous scaffolds $length = ($fi - $inici + 300000); $inici = ($inici - 150000); if ($inici < 0) { $inici = 1;} $nom = $human."_".$numero."_"."genomic.fa"; $human = $human.".fa"; @output = join("\t", $human, $nom, $inici, $length, $orientacio); print "@output\n"; $numero = $numero + 1;#we need to add 1 because it is a new scaffold $human = $linia[0]; $nom = $linia[1]; if ($linia[8] < $linia[9]){ $inici = $linia[8]; $fi = $linia[9]; $orientacio = "+"; } else { $inici = $linia[9]; $fi = $linia[8]; $orientacio = "-"; } } if ($linia[1] eq $nom){ #the protein and the scaffold are the same, there is no need to add anything else if ($linia[8] < $linia[9]) {#forward if ($linia[8] < $inici) {#substitute the start value if it is smaller $inici = $linia[8]; } if ($linia[9] > $fi){#substitute the end value if it is greater $fi = $linia[9]; } } else #reverse { if ($linia[9] < $inici){ $inici = $linia[9]; } if ($linia[8] > $fi){ $fi = $linia[8]; } } } } } $length = ($fi - $inici + 300000); $inici = ($inici - 150000); if ($inici < 0) {$inici = 1;} $nom = $human."_".$numero."_"."genomic.fa"; $human = $human.".fa"; @output = join("\t", $human, $nom, $inici, $length, $orientacio); print "@output\n"; #print "$human","$nom","$inici","$fi","\n";