Interesting links
Methods
1. Obtaining Latimeria chalumnae genome
4. Extraction of the region of interest
1. Obtaining Latimeria chalumnae genome
Latimeria chalumnae genome was obtained from Ensembl website, in the following direction:
However, the genome was also provided by the Bioinformatic Department group at Pompeu Fabra university. It was available at this directory:
~/cursos/BI/genomes/project_2013/Latimeria_chalumnae
Most of the selenoproteins sequences were obtained from the NCBI Protein database, and some of them were obtained from the Selenodb database.
We obtained the selenoproteins sequence of Xenopus tropicalis and Danio rerio, as explained in the introduction they are closely related to Latimeria chalumnae.
Blast is a program that compares sequences by doing local alignments. There are different types of Blast (Blastp, tBlastn, ...). In our case we are using tBlastn, the one that compares an aminoacid sequence against a nucleotid sequence or a DNA database.
This program will give us the best alignments between the different querys and our genome, and these alignments are going to be what we call hits.
To use this program we will need to export the software. We can do it by typing this command into Shell:
$ export PATH=/cursos/BI/bin/ncbiblast/bin:$PATH
$ cp /cursos/BI/bin/ncbiblast/.ncbirc ~/
Once we have done this, we can call Blast and execute it with this command:
blastall -p tblastn -i query.fa -d genome.fa -o output
where:
-p is the type of Blast we want to use (tBlastn in our case)
-i is the query we choose
-d is our genome
-o is the file with the results of the program
Once we obtain the results, we have to look for the hits that are stadistically significant. The parameter that tells us the significance of a hit is the e-value, that describes the number of hits one can expect to see by chance when searching a database of a particular size. The bigger this value is, the less significant is teh result. So it's important to establish a good threshold, not too big to take into account bad alignments and not too little to dismiss significant hits. The threshold that we have chosen is e-values under 1x10-10 (0.000000001).
Now we have the hits and its region in the genome, we are going to extract the genomic region where we have found this significant alignment. We will need two commands to do this.
The first one allows us to make an index of our genome (we only need to do this index once for all the querys). This command is fastaindex:
fastaindex genome.fa genome.index
where genome.fa is the file with our genome and genome.index is the file with the output.
After having done an index of our genome, now we can execute fastafetch, the program that will extract the region of interest:
fastafetch genome.fa genome.index JH126696.1 > fastafetch_ouput.fa
where:
genome.fa is our genome
genome.index is the genome's index
JH126696.1 is the region of interest, the region where we have found the hit. In our case this region is an scaffold, a piece of genome made as a part of genome sequencing
fastafetch_output.fa is the file that contains the region of interest (or scaffold in our case)
Exonerate is a program that predicts any functional structure (or gene) in a genomic sequence using the alignment between the query and the genome.
To use this program we need to introduce the following command
export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH
And now we can execute the program:
exonerate -m p2g --showtargetgff -q query.fa -t fastafetch_output.fa > query.gff
where:
-m is the type of alingment, in this case protein to genome (p2g)
--showtargetgff is an order to get the ouput file in a GFF format
-q is the query
-t is the region of interest
selenoprotein.gff is the output file (we named it with the name of the selenoprotein, i.e SelK.gff)
As we want to obtain the protein sequence, we have to obtain the cDNA sequence from this gene prediction. To do this we have to extract the exons information from the exonerate output:
egrep -w exon selenoprotein.gff > cDNA.gff
where:
-w determines the word we are interested in (exon, in our case)
selenoprotein.gff is the exonerate output
cDNA.gff is the file with the output, that contains the genomic position of the exons
After this we will use a perl program, fastaseqfromGFF, to extract the cDNA sequence. First we need this command before executing the program:
export PATH=/cursos/BI/bin:$PATH
And then we can execute the program:
fastaseqfromGFF.pl fastafetch_output.fa cDNA.gff > cDNA.fa
where:
fastafetch_output is the region of interest, where we have found the hit
cDNA.gff is the file that contains the position of the exons in the genome
cDNA.fa is the file that contains the cDNA sequence
Now we want to obtain the protein sequence of our prediction. For this we will use fastatranslate program. This program translates the 6 possible open reading frames (ORFs) of a DNA sequence into a protein sequence. This is the command to execute the program:
fastatranslate cDNA.fa > 6pautes.fa
where:
cDNA.fa is the cDNA sequence
6pautes.fa are the 6 possible ORFs of the DNA sequence
Once we have the 6 possible ORF, we have to decide which one fits the best, and copy it in a new file. To do this we can use the option -F 1 , the one that selects the ORF. So the command would be like this:
fastatranslate cDNA.fa -F 1 > 1pauta.fa
where -F 1 selects the ORF, in this case, the number one.
T-Coffee is a program that does global alignments between two sequences. This program will tell us how similar are the two sequences and which is the aminoacid aligned with the selenocystein of the query. The command to execute the program is the following:
t_coffee 1pauta.fa query.fa
where:
1pauta.fa is our protein predicted
query.fa is the query
Apart from looking for the selenoproteins of Latimeria, we have also looked for the proteins that participate in the synthesis of the selenocysteins. This search is useful as it is confirmatory because if we have found several selenoproteins in the genome, it means we should find also the machinery involved in the selenoproteins transcription.
To search these proteins we have used the same method that we have already described for the selenoproteins.
We have searched these machinery proteins: SPS1, SPS2, PSTK, Secp43, SBP2, eEFsec and SecS.
Finally we have also searched the presence of the SECIS elements in the genome. These are structures located near the selenoprotein sequence and are one of the elements that makes it possible to put a selenocystein in a stop codon. So this means this is also a good tool to confirm the presence of selenoproteins.
We have used the SECISearch webpage to look for the SECIS elements.
To increase the speed of the data obtaining, we have developed two programs:
- Selenosearch.sh : this is the main program. It executes the different commands that we use to search manually the selenoproteins in our genome (Blast, Exonerate, T_coffee...). The next program (minim_evalue.pl) is included inside this program as it is executed by this main program.
- minim_evalue.pl : this perl program takes the Blast output and looks for the lowest e-value between all the hits that we have obtained. This step is critical so we want to take the best hit.
3. TBlastn and hit selection
4. Extraction of the region of interest
6. Protein alignment: T-coffee
7. Transcription Machinery
8. SECIS element search
9. Automation