The aim of our project is to find, identify and annotate all the Miichthys miiuy's selenoproteins. The methods we have used to achieve this objective are based on the software NCBI BLAST. We have compared the genome of our specie of interest (Miichthys miiuy) with another specie that we have considered is phylogenetically close with it and is fully annotated, the Zebrafish (Danio rerio). Also, we have used the human (Homo sapiens) selenoproteome to compare those selenoproteins that were not present in Zebrafish.
Queries selection
In order to in order to select the queries that are going to be compared with Miichthys miuuy's genome, we have chosen Zebrafish as a reference because it is the closest phylogenetic relative whose selenoproteins are well described due to its importance as a research model. The Zebrafish proteome was provided by the database SelenoDB (a selenoprotein specialized database). However, we have found some problems during this process. On the one hand, some of the proteins were not properly annotated in selenoDB 2.0 so they were obtained from selenodb.crg.eu where this selenoprotein was well annotated. On the other hand, some of the protein families were not properly described since some of the proteins did not have the subfamily specified (the subfamily appeared as "NONE"). These proteins that do not have a subfamily name have been named as the name of the protein family and a number in brackets. In this case to check if the scaffold is correctly assigned to a concrete query, we have used a phylogeny software: http://www.phylogeny.fr
According to proteins related to selenium metabolism prediction, all of them have been predicted blasting Miichthys miiuy genome against the different families of Homo sapiens found in SelenoDB.
Procedure:
Genome selection
We were provided with the genome from Miichthys miiuy that could be obtained from the path /cursos/20428/BI/genomes/2016/Miichthys_miiuy/genome.fa -out.
Blast
We have used Blast to compare each query, to the Miichthys miiuy genome. The query file (nomdelaselenoproteina_zf.aa.fa) is where the Danio rerio selenoprotein is annotated with the U changed by an X. The -db must be followed by the name of the database BLAST where we want to search our query. The output file will be the blast where we can see the different hits we obtain from this protein in our genome and in which scaffolds are situated and their E-values.
Our command is: tblastn -query nomdelaselenoproteina_zf.aa.fa -db /cursos/20428/BI/genomes/2016/Miichthys_miiuy/genome.fa -out nomdelaselenoproteina.blast
Fasta fetch
The fasta index command indexes the genome of Miichthys miiuy, making easier for the program to run and find the sequence in the genome. We do not use this command because the genome file is already indexed.
To extract the genomic region that potentially contains the gene, a fasta file containing only one scaffold has to be done. In order to choose which scaffold is the best to make the fasta fetch we need to study the E-value and choose the lower or at least one scaffold lower than 0.05. We also need to consider that the different proteins of a family cannot have the same chosen scaffold.
To do this we use this command: fastafetch /cursos/20428/BI/genomes/2016/Miichthys_miiuy/genome.fa /cursos/20428/BI/genomes/2016/Miichthys_miiuy/genome.index nomscaffold > nomdelaselenoproteina_nomscaffold.fa
Nomscaffold is the name of the scaffold, and nomdelaselenoproteina_nomscaffold.fa is the file where we want to save the output of this command.
Fasta subseq
The fastasubseq command extracts just the region of the chosen scaffold we are interested in, that was aligned in the blast.
Our command is: fastasubseq nomdelaselenoproteina_nomscaffold.fa start length > genomenomedelaselenoproteina_nomscaffold.fa
Nomdelaselenoproteina_nomscaffold.fa is the output from fastafetch command, start is the first nucleotide that will be selected by fastasubseq and length is the distance between the end and the start of the sequence we want to study. For almost all the genes we have extended the search 50000 nucleotides 5' and 50000 3' because the alignment is just represented by the coding part of the genome, and we wanted to ensure we were including all the gene. When the extension is longer than the length of the scaffold we extend the search as long as possible. When the gene is located closer to the start of the scaffold we start the prediction in the nucleotide 1.
Exonerate
The function of this program is to predict the genes and align it with the sequence of the selenoprotein. Additionally, we have used the command egrep -w exon in order to only have the exons and separate them in another file.
The command we use is: exonerate -m p2g --showtargetgff -q nomdelaselenoproteïna_zf.aa.fa -t genomenomedelaselenoproteina_nomscaffold.fa > nomdelaselenoproteina_nomscaffold.exonerate
-m p2g: compares the protein with the genome.
--showtargetgff: allows us to get all the information about the coordinates in a GFF file format.
-q: shows that the file we are using is a FASTA format.
-t: is the target, the sequence we obtain from doing the fastasubseq command.
egrep -w exon nomdelaselenoproteina_nomscaffold.exonerate > nomdelaselenoproteina_nomscaffold.exonerate.gff
egrep -w exon: takes just the exonic regions from all the information that the exonerate produces and copy them in a file, which we called nomdelaselenoproteina_nomscaffold.exonerate.gff.
Fastaseq from GFF
The fastaseqfromGFF program extracts the cDNA sequence that encodes for the final protein in fasta format and sends the output into a file called nomdelaproteina_nomscaffold.nt.fa.
The command of this program is: fastaseqfromGFF.pl genomenomdelaproteina_nomscaffold.fa nomdelaselenoproteina_nomscaffold.exonerate.gff > nomdelaproteina_nomscaffold.nt.fa
Fasta translate
This program translates the sequence previously obtained, that was cDNA, into the amino acids, that will be our output.
It is executed with the command : fastatranslate -f nomdelaproteina_nomscaffold.nt.fa -F 1 > nomdelaselenoproteina_zf_nomscaffold.predicted.fa
-f is followed by the file in the nucleotidic format containing just the exon sequences.
-F specifies the reading frame, in this case -F1.
T-coffee
This program is used to align 2 protein sequences. In this case we are going to compare our query ( the Danio rerio's selenoprotein) with the sequence we have predicted that is the homologous in Miichthys miiuy.
Before using this program it has been changed the * in the fastatranslate output by an U because the t-coffee program does not recognise the * symbol and eliminates it.
Our command is: t_coffee nomdelaselenoproteina_zf.aa.fa nomdelaselenoproteina_zf_nomscaffold.predicted.fa > nomdelaselenoproteina_nimscaffold.final
Nomdelaselenoproteina_nimscaffold.final is the final output resulting from t_coffee command and contains the alignment between the query and the predicted protein. In this last file is where we have been able to see if the selenoprotein predicted from Miichthys miuuy's genome is correctly aligned or not with Danio rerio's selenoprotein.
Genewise
This program is a gene predictor for homologous sequences. We used it when the exonerate file contains more than one result in order to contrast whether the data obtained with this program is better than the data obtained with the exonerate.
This is our command: genewise -cdna -pep -pretty -gff nomdelaselenoproteina_especie.aa.fa genomenomdelaselenoproteina_nomscaffold.fa > nomdelaselenoproteina_genewise.fa
-cdna: is used to show the predicted cDNA sequence aligned.
-pep: is used to show the predicted peptide.
-pretty: is used to show the ASCII alignment.
-gff: is used to create the output in a gff format.
SECISearch3 and Seblastian predictions
These two programs have been used in order to predict SECIS elements as well as selenoproteins.
Seblastian looks for a SECIS element and also tries to find the selenoprotein corresponding to this SECIS element. Sometimes, even though it is able to find a SECIS element, it cannot find the selenoprotein, but since selenoproteins have already been correctly aligned using tcoffee, we can confirm that selenoproteins are found in Miichthys miuuy's genome.
SECISearch3 only looks for SECIS elements, without predicting selenoproteins. When more than one SECIS element has been predicted we have used some criteria to choose the best one. Firstly, the SECIS element has to be found in the same strand (+ or -) than the gene codifying for the selenoprotein. Secondly, we have to make sure that the SECIS element is found in the 3'UTR region of the gene.
Automation:
Once obtained the genome of the Miichthys miiuy and the selenoprotein queries, we have created a bioinformatic semiautomatic program written in perl for every species we wanted to compare with Miichthys miiuy genome. In our case, we compared this organism's with Danio rerio's and Homo sapiens' selenoproteins queries. The programs used for the protein prediction are called programa_zf and programa_hs. These programs have been designed to make all the commands explained above (except genewise) to predict the selenoproteins, one by one, in the Miichthys miiuy genome. The name of the query and the scaffold, the start and the end of the prediction have to be entered in the program manually.