Materials and Methods
The aim of this study was to determinate the selenoproteins in the genome of P. schlosseri as well as the machinery required for their synthesis. In order to find these selenoproteins, an homology-based approach using the genome of zebrafish has been performed.
Queries
Zebrafish genome was chosen as the reference to do the homology-based approach because it is the closest phylogenetic relative that has a well described selenoproteome.
SelenoDB (a selenoprotein database) provided the majority of zebrafish proteins. However, some challenges were faced during the obtention of the queries. First, in some cases proteins were not properly designated in selenoDB as the subfamily name was not specified (appeared as «NONE») and more than one query had the same name. In these cases, to try to find a more specific denomination, a BLAST was done using ENSEMBL.
Second, certain zebrafish proteins were not well annotated, as their first amino acid wasn’t methionine. Consequently, these proteins were searched in the database UniProt, to find out if it was an annotation error in SelenoDB. If the outcome of the search in UniProt was a Met-starting protein, it was used as the novel reference query. If, on the contrary, the obtained sequence was neither starting by a methionine the Homo sapiens protein from SelenoDB was used to make the homology-based approach.
Comparision process
1. Exports
The first point was to export the programs (NCBI Blast, Exonerate, T-Coffee, GeneWise...) that have been used throughout the analysis. The paths to realize the exports are:
- $ export PATH=/cursos/BI/bin:$PATH
- $ export PATH=/cursos/BI/bin/ncbiblast/x64/bin:$PATH
- $ cp /cursos/BI/bin/ncbiblast/.ncbirc ~/
- $ export PATH=/cursos/BI/soft/exonerate/x86_64/bin:$PATH
- $ export PATH=/cursos/BI/soft/t_coffee/x86_64/bin:$PATH
- $ export PATH=/cursos/BI/soft/genewise/x86_64/bin:$PATH
- $ export WISECONFIGDIR=/cursos/BI/soft/genewise/x86_64/wise2.2.0/wisecfg/
2. Fasta Index
The second step was to create an index of the P. schlosseri genome, which was registered under the name of PS.index. This has allowed successfully comparisons between the P. schlosseri genome and the queries. The command used was:
$ fastaindex /cursos/BI/genomes/2015/Periophthalmodon_schlosseri/genome.fa PS.index
3. Contig Length Index
Then a contig length index was created. Basically, it is an index where the length of each contig is recorded. The command used was:
$ fastalength /cursos/BI/genomes/2015/Periophthalmodon_schlosseri/genome.fa > llargada_cromosomes
>4. Change the U symbols for X symbols and eliminate artifacts
The next step was to change U characters (present in the queries, representing the selenocysteine residue) for X symbols. This was made to avoid errors when running programs like Exonerate, T-COFFEE or Genewise.
Furthermore, artifacts or incorrect symbols present in the queries such as #, % or @ were also eliminated.
5. Blast
TLater, a BLAST (Basic Local Alignment Search Tool) was conducted. A BLAST is an heuristic algorithm that performs comparisons between queries (zebra fish or human) with the genome of P. schlosseri. In order to obtain the alignment, the BLAST quickly select sequences (hits) from the reference genome with a higher homology with the query.
The type of blast used was tBLASTn (p tblastn) because proteins were being compared with nucleotides. At this point, it is highly interesting to consider that one of the most important parameters obtained from blast is the expectation value (E-value) for each hit. The E-value means the number of hits that can be expected to be found when searching the query in a database only by chance. The higher this value is, the less significant is the match. The threshold defined to consider an alignment as a valuable hit and to further analyse it was an E-value lower than 0,01.
Moreover, “-m8” BLAST option was used to obtain all the results presented as a chart. This option was chosen due to the fat that it facilitates the visualization of the outputs. The command used was:
$ blastall -p tblastn -i nomdelaproteina.z.aa.fa -d /cursos/BI/genomes/2015/Periophtalmodon_schlosseri/genome.fa -m8 -e 0.01 > nomdelaproteina.z.blast.fa
The information obtained from each hit after running the blast is detailed below:
- Query id: the name of the analysed sequence.
- Subject id: scaffold, it is the hit sequence.
- %identity: the % of similarity between the subsequence of the query and the one in the database.
- Alignment length: length of the alignment between the subsequence of the query and the one in the database.
- Mismatches: number of mismatches in the alignment.
- Gap openings: number of gap openings in the alignment.
- q.start: the first position of the subsequence of the query that has been aligned.
- q.end: the final position of the subsequence of the query that has been aligned.
- s.start: the first position of the subsequence of the hit found in the target genome.
- s.end: the final position of the subsequence of the hit found in the target genome.
- E-value: expectation value of the hit.
Finally, from the results of the BLAST the scaffold’s strand was defined. The strand sign was defined by a simple substractions of s.start - s.end, so that if the result was positive the strand was defined as positive and if not, negative. Knowing the sign of the strand was useful to perform the Genewise (it will be explained later).
6. Fastafetch
After obtaining the blast output for each query, the next step was to extract the sequences of interest of each hit from the P. schlosseri genome by using fastafetch. Importantly, the length between the q.start and the q.end is usually elongated by 50.000 positions to make sure that any region susceptible of containing encoding information is discarted.
However, the P. schlosseri genome is annotated in very small fragments or contigs. The information obtained from the contig length index revealed that contig lengths oscillated from aproximately 200 positions in the shorter one to 110.000 in the longer one.
That’s why, instead of elongating both ends of the subsequence, the whole contig was considered. This is why fastasubseq program was not used in this analysis. The command used was:
$ fastafetch /cursos/BI/genomes/2015/Periophthalmodon_schlosseri/genome.fa PS.index nomcromosoma > nom.seq.fa
7. Exonerate
It was necessary to ensure that the hit was taken from an exonic regions and, therefore, encodes a protein. For this reason, the program exonerate was used to align the DNA fragment extracted with the fastafetch program with the initial query. Exonerate provides a more exact and accurate alignment than BLAST. Besides, it provides more gene features such as introns, exons, splice areas, etc. The command used was:
$ exonerate -m p2g --showtargetgff -q ALL_selenoproteins.fasta.nº -t nom.genomic.fa --exhaustive | egrep -w exon > nom.exonerate.gff
egrep function was used to determine the positions of the exons, which were later extracted from the whole sequence using the program fastaseqfromGFF. The comand used was:
$ fastaseqfromGFF.pl nom.genomic.fa nom.exonerate.gff > nom.ps.nt.cds
In order to perform a third P. schlosseri-zebrafish alignment using t-coffe and genewise, the sequences obtained from the prior needed to be translated to protein. To do this, the program fastatranslate was used. The command used was:
$ fastatranslate -F 1 nom.ps.nt.cds > nom.ps.aa.fa
8. Change the * symbols for X symbols
The next step was to change U characters (present in the queries) for X symbols. This was made to avoid errors when running programs like Exonerate, T-COFFEE or Genewise.
9. T-Coffee
Once the sequence had been translated into protein, the program T-Coffee (Tree-based Consistency Objective Function Evaluation for alignment) was used to determine either the presence or lack of homology between the two proteins compared (the one predicted for P. Schlosseri and the one from zebrafish). This program allows multiple alignments of protein, DNA and RNA. The command used was:
$ t_coffee nomdelaproteina.z.aa.fa nom.ps.aa.fa > nom.tcoffee
10. Genewise
In order to contrast the results obtained with Exonerate and T-coffee, the program Genewise (which generates a gene annotation using a different algorithm) was used. In case the original sequence of the P. schlosseri protein was coded in the negative strand, the function –trev was added. The command used was:
$ genewise -pretty -cdna -gff sel15.ps.aa.fa sel15.genomic.fa -trev > sel15.genwise
11. Secis prediction
SECIS elements are located in the 3’-UTR regions of the genes and are essential for the synthesis of selenoproteins. That's why searching for them can help confirming the prediction of selenoproteins.[26]
Two online tools were used: SECISearch3 and Seblastian .[26]
The SECISearch3 core uses a model of curated, secondary structure based on the alignment of 1122 SECIS elements across all eukaryotic lineages and is useful to predict SECIS structures from given predicted selenoproteins transcrits. [26]
Seblastian is a pipeline for the prediction of selenoproteins in nucleotide sequences: at first it searches for SECIS elements, and then it looks upstream trying to find their selenoprotein coding sequence.The prediction of SECIS element in Seblastian is relied to SECIS Search3. [26]
Automatization process
a)multifasta_split.pl
The first perl program, called “Multifasta_split.pl”, was used to process the file downloaded from selenoDB database. The objective of running this program was to convert a single multifasta document containing all the zebrafish proteins into several documents, each of them containing a single protein. The script of the program can be find by clicking on the following link.
b)export_v2.sh
"Export_v2", is a bash program that was used to give the system the different paths so the programs needed for the analysis could be used. In addition, fastaindex and contig length index were created. Finally, it automically executes the program “prova.pl”. The script of the program can be find by clicking on the following link.
c)prova.pl
Finally, “prova.pl” is a program written in perl that was used to perform the steps already described (from changing U symbols for X to Genewise) in order to compare P. schlosseri’s genome and the queries. The script of the program can be find by clicking on the following link.