Methods & Materials
The aim of this project was to identify in silico all the selenoproteins in the genome of Spermophilus dauricus, whose selenoproteins have never been annotated before, by using an homology-based approach.
Queries
It is reasonable to think that the phylogenetically closer one specie and the target organism are, the more accurate the analysis will be. Therefore, the most appropriate organism to use as query is the one philogenetically closest to Spermophilus dauricus. In this case, the closest organism with annotated selenoproteins is Ictidomys tridecemlineatus. However, human selenoproteins are more studied and better annotated, so we complemented the search by analysing both organisms.
The selenoproteins of Ictidomys tridecemlineatus were obtained from the source:
Spermophilus dauricus genome
Spermophilus dauricus genome was needed in order to identify its selenoproteome.
Such genome was provided by the University Pompeu Fabra, and can be found in the following directory:
/cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.fa
Manual annotation
The procedure allows only to work with one query at the same time, and try to identify whether there is a hit in the genome. A hit means a sequence in the genome that is significantly similar to the query that we are analyzing. This process has to be repeated for each query. First of all, all the needed programs were exported for the analysis:
- module load modulepath/goolf-1.7.20
- module load BLAST+/2.2.30-goolf-1.7.20
- module load Exonerate/2.2.0-goolf-1.7.20
- module load T-Coffee/11.00.8cbe486-goolf-1.7.20
- export PATH=/cursos/20428/BI/bin:$PATH
- export PATH=/cursos/20428/BI/soft/genewise/x86_64/bin:$PATH
- export WISECONFIGDIR=/cursos/20428/BI/soft/genewise/x86_64/wise2.2.0/wisecfg/
To locate the genomic region where our selenoproteins can potentially be, and since we were given protein queries, tBLASTN (Basic Local Alignment Search Tool) program was used. This program searches translated nucleotide databases using a protein query. Spermophilus dauricus genome was, therefore, translated from nucleotides to proteins, and Ictidomys Tridecemlineatus protein queries were used to find potential alignments. The command used in shell was:
tblastn -query query1.fa -db /cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.fa -out homologia1.blast
The significant hits obtained have to be extracted from the genome. First of all, it is important to export the region sequence where the scaffold is present by using Fastafetch, which provides a new file with the regions where the hits are included. By doing this, it is necessary to run Fasta index, a command used to arrange data in the most suitable way, which is easier for the program.
fastafetch /cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.fa /cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.index "KZ294443.1" > hit1_query1.fa
From the region obtained in fastafetch, a command is applied to further delimit the region where the hits are located:
fastasubseq hit1_query1.fa start length > subseqHIT1_query1.fa
The next step is to predict the gene of the genome sequence and align it with the sequence of the query. The command is:
exonerate -m p2g --showtargetgff -q query1.fa subseqHIT1_query1.fa --exhaustive yes > HIT1_query1.exonerate
This program allows to extract the exonic part from the exonerate file, using the command:
egrep -w exon HIT1_query1.exonerate > HIT1_query1.exonerate.gff
The next step is to extract the cDNA sequence that encodes for the final protein, using the next command:
fastaseqfromGFF.pl subseqHIT1_query1.fa HIT1_query1.exonerate.gff > HIT1_query1.nt.fa
This program translates the cDNA sequence obtained in the step before into aminoacids, with this command:
fastatranslate -f HIT1_query1.nt.fa -F 1 > HIT1_query1.predicted.fa
GeneWise compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors. Therefore it predicts the protein directly from the subseq output, which allows to skip some of the steps performed before. The command used is the following:
genewise -pep -pretty -cdna -gff query.fa subseq_contigs.fa > genewise_bo.fa
genewise -pep -pretty -cdna -gff -trev query.fa subseq_contigs.fa > genewise_bo.fa
Finally, this program compares the aminoacids obtained from the genome of Spermophilus Dauricus with a query that was used as reference. The command is:
t_coffee query1.fa HIT1_query1.predicted.fa > HIT1_query1.final
The t-coffee confirmed the alignment between the sequence of the genome and the query of Ictidomys Tridecemlineatus, but to confirm that this genome sequence refers to a selenoprotein, it is important to determine whether there is a SECIS element in this genome region. This is achieved using SECISearch3 and Seblastian, programs that are based on the SECIS and selenoproteins prediction.
Data acquisition
In order to automatize all the process, we first ran a parsing program script_1, in order to organize all the queries in specific directories. Having done that, we then ran script_2, which first performed tblastn. Then, it processed the results by ruling out all hits with e-Value smaller than 0.01. Also all hits that were found in the same scaffold and had the same direction were joint together into a unique hit with the minimum of the starts and the maximum of the ends. Finally, subseq, exonerate, egrep, fastaseqfromGFF, fastatranslate and tcoffe were called as described above for every hit that had been selected.
Script_3 performed genewise-based protein prediction for all the hits that were selected before. All 3 programs were modified from M. Schikora C et al, 2016. Finally, we used script_4 and script_5 to create the tables and images in the discussion.Data analysis
Genewise and exonerate predicted a lot of proteins from the tblastn results. In order to find the correct selenoprotein or cys homologue, we manually analysed the different output obtained. We focused our attention on hits with small e-values and great identities. We only selected hits that had a really good alignment with the query in both Exonerate and Genewise when possible. Also we made sure to check that a Seci in 3’UTR existed in a plausible position if the predicted protein contained a Sec residue. Finally, the prediction performed by Seblastian helped us decide which selenoprotein was the correct one.