Materials and methods



    The aim of this project was to identify all selenoproteins and selenoprotein machinery genes present in the Xiphophorus couchianus genome. To do so, the zebrafish genome was used as a reference to identify homologous regions, as described below.

Genome

The Xiphophorus couchianus genome was provided by the university and can be found at:

/cursos/BI/genomes/2015/Xiphophorus_couchianus/genome.fa

Export

First, the directories containing all the programs (NCBI BLAST, Exonerate, T-Coffee…) were added to the PATH environment variable so that the programs could be run during the analysis.

$ export PATH=/cursos/BI/bin:$PATH
$ export PATH=/cursos/BI/bin/ncbiblast/x64/bin:$PATH
$ cp /cursos/BI/bin/ncbiblast/.ncbirc ~/
$ export PATH=/cursos/BI/soft/exonerate/x86_64/bin:$PATH
$ export PATH=/cursos/BI/soft/t_coffee/x86_64/bin:$PATH
$ export PATH=/cursos/BI/soft/genewise/x86_64/bin:$PATH
$ export WISECONFIGDIR=/cursos/BI/soft/genewise/x86_64/wise2.2.0/wisecfg/
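
After the exports, it can be useful to confirm that the executables are actually reachable. The sketch below is only an illustration: the helper name missing_tools is hypothetical, and the list of executables is taken from the commands used later in this section.

```shell
#!/bin/sh
# Report which of the given commands cannot be found on the current PATH.
# Prints one missing command name per line; prints nothing if all are found.
# (missing_tools is a hypothetical helper, not part of any of the packages.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Executables used later in this section.
missing_tools tblastn exonerate fastaindex fastafetch fastasubseq \
    fastatranslate t_coffee
```

An empty output means that every program was found; any name printed points to an export line that needs checking.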

Queries


    Zebrafish is the organism evolutionarily closest to Xiphophorus, so its genome was used as a reference. The comparison between the two organisms was possible because selenoproteins are well conserved between species and retain a high degree of homology.

    First, the selenoprotein database SelenoDB was used to obtain the zebrafish proteins. When these proteins were aligned against the Xiphophorus genome, some annotation errors were found. Some sequences did not start with a methionine residue, which meant that they were not properly defined. Moreover, some proteins from the same family were annotated under the same name, making them difficult to distinguish.

    When the analysis was unsuccessful, proteins from other species or from the human genome were searched against the genome of interest in the same way.

Programs



Blast

    First, homologous regions had to be identified in order to locate potential selenoprotein sequences. For this purpose, a BLAST (Basic Local Alignment Search Tool) search was run with tblastn, which aligns the query proteins against the translated genome. The following command was used:

tblastn -query GPx.fa -db /cursos/20428/BI/genomes/2016/Xiphophorus_couchianus/genome.fa -out AlineamentGPx.fa

    The program aligned the queries against the Xiphophorus couchianus genome and returned a list of hits ordered by score. The best hit was selected as the one with the highest score (the most similar to the query) and the lowest E-value (the least likely to have arisen by chance); E-values below 0.05 were considered good.
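
The selection criteria above can be sketched as a small filter. This is a minimal illustration under assumptions, not the procedure actually followed: it assumes the search is rerun with tabular output (-outfmt 6, whose default columns place the E-value in column 11 and the bit score in column 12), whereas the command in the text writes the default report. The function name best_hits is hypothetical.

```shell
#!/bin/sh
# Select the best hit per query from tabular BLAST output read on stdin.
# Assumed input: the 12 default columns of "-outfmt 6" (column 1 = query,
# column 2 = scaffold, column 11 = E-value, column 12 = bit score).
# The 0.05 default cutoff follows the text; pass another value as argument 1.
best_hits() {
    max_evalue=${1:-0.05}
    awk -F '\t' -v max="$max_evalue" '
        ($11 + 0) < (max + 0) {
            # Keep the row with the highest bit score for each query.
            if (!($1 in best) || ($12 + 0) > score[$1]) {
                best[$1] = $0
                score[$1] = $12 + 0
            }
        }
        END { for (q in best) print best[q] }
    '
}
```

It could be used, for example, as tblastn -query GPx.fa -db /cursos/20428/BI/genomes/2016/Xiphophorus_couchianus/genome.fa -outfmt 6 | best_hits. When a file contains several queries, the order of the output lines is not defined.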

    Some problems arose at this step of the analysis. Each hit lies on a specific scaffold, and different members of the same family cannot be assigned to the same scaffold. However, some families did not show enough distinct scaffolds for all of their members, or the best scaffolds were shared between members of the same family, which made it difficult to identify the correct scaffold for each query.

    Finally, to verify that the scaffolds chosen for each protein family were correct, a phylogenetic tree was built with a phylogeny program. This allowed the assignments to be confirmed.

Fastaindex

    Fastaindex builds an index of a FASTA file so that individual sequences can later be retrieved quickly. It was used to index the Xiphophorus couchianus genome. The command used was:

fastaindex /cursos/BI/genomes/2015/Xiphophorus_couchianus/genome.fa Xiphophorus_couchianus.index

Fastafetch

    Once the hits were obtained, the scaffolds containing the most statistically significant ones were extracted with Fastafetch, which writes the scaffold where the hit was predicted to a new file. The command used was:

fastafetch /cursos/20428/BI/genomes/2016/Xiphophorus_couchianus/genome.fa /cursos/20428/BI/genomes/2016/Xiphophorus_couchianus/genome.index KQ557211.1 > nameprot_KQ557211.1.fa

Fastasubseq

    This command extracts the region of the scaffold where the hit is located. To make sure that the selected sequence contained the whole gene of interest, the region was extended beyond the hit on both sides: the extraction started 50,000 nucleotides upstream (5') of the hit and spanned 100,000 nucleotides in total. In the command, the second argument is the start position and the third is the length of the region to extract. The command used was:

fastasubseq nameprot_KQ557211.1.fa [50.000nt less] 100000 > nameprot_KQ557211.1.subseq
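
The start and length passed to fastasubseq can be derived from the hit coordinates. The sketch below is an illustration under assumptions: subseq_window is a hypothetical helper, the same margin is applied on both sides of the hit, and coordinates are treated as plain offsets without checking whether fastasubseq counts from 0 or 1. The window is clamped so that it never runs past either end of the scaffold.

```shell
#!/bin/sh
# Compute the <start> and <length> arguments for fastasubseq so that the
# extracted window covers a hit plus a flanking margin on each side.
# Arguments: hit start, hit end (scaffold coordinates), scaffold length, margin.
# Output: "<start> <length>" on one line.
subseq_window() {
    hit_start=$1; hit_end=$2; scaffold_len=$3; margin=$4
    # tblastn reports start > end for hits on the reverse strand.
    if [ "$hit_start" -gt "$hit_end" ]; then
        tmp=$hit_start; hit_start=$hit_end; hit_end=$tmp
    fi
    start=$((hit_start - margin))
    if [ "$start" -lt 0 ]; then
        start=0          # clamp to the beginning of the scaffold
    fi
    end=$((hit_end + margin))
    if [ "$end" -gt "$scaffold_len" ]; then
        end=$scaffold_len    # clamp to the end of the scaffold
    fi
    echo "$start $((end - start))"
}
```

The two numbers printed by subseq_window can then be used as the second and third arguments of fastasubseq in place of values worked out by hand.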
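
Exonerate

    Exonerate aligns the query protein to the genomic region using its protein-to-genome model (-m p2g) and thereby predicts the exons that encode the selenoprotein. The exons were predicted within the region extracted with Fastasubseq, and only the exon lines of the GFF output were kept. The command used was: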

exonerate -m p2g --showtargetgff -q nameprot.fa -t nameprot_KQ557211.1.subseq | egrep -w exon > nameprot_KQ557211.1.exonerate.gff
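
A quick sanity check on the resulting file is to count the predicted exons and add up their lengths. The sketch below is only an illustration: exon_summary is a hypothetical helper, and it assumes the usual tab-separated GFF layout with the feature type in column 3 and the start and end coordinates in columns 4 and 5.

```shell
#!/bin/sh
# Print "<number of exons> <total exon length>" for a GFF file (or stdin).
# Assumes tab-separated GFF: column 3 = feature, columns 4 and 5 = start, end.
exon_summary() {
    awk -F '\t' '
        $3 == "exon" {
            n++
            total += $5 - $4 + 1   # GFF coordinates are 1-based and inclusive
        }
        END { printf "%d %d\n", n, total }
    ' "$@"
}
```

For example, exon_summary nameprot_KQ557211.1.exonerate.gff prints the two numbers for the file produced above. A total length far from three times the query protein length may indicate a partial or spurious prediction.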

FastaseqfromGFF

    After the Exonerate analysis, the predicted exons had to be joined into a single cDNA sequence. FastaseqfromGFF takes the genomic region and the exon coordinates in the .gff file and writes the corresponding sequence in FASTA format, as shown below:

fastaseqfromGFF.pl GPx_KQ557202.1.subseq GPx_KQ557202.1.exonerate.gff > GPx_KQ557202.1.fafromGFF.fa

Fastatranslate

    Once the cDNA sequence was available, the amino acid sequence of the protein was obtained with Fastatranslate. A nucleotide sequence can be translated in several reading frames; -F 1 was set because forward reading frame 1 is the one of interest.

fastatranslate -f GPx_KQ557202.1.fafromGFF.fa -F 1 > GPx_KQ557202.1.translate

T-Coffee

    Finally, to assess the homology between the predicted protein and the query, the two sequences were aligned with T-Coffee. In the output, the * symbol marks positions where the residues are identical. The command used was the following:

t_coffee GPx_KQ557202.1.translate GPx.fa > GPx_KQ557202.1.tcoffee

SECIS prediction


    Since the SECIS element is a structure essential for selenoprotein synthesis, it was important to determine whether each predicted gene contained one. This helped to confirm the selenoprotein predictions.

    To do so, two computational methods, SECISearch3 and Seblastian, were used to identify and analyse selenoproteins. While SECISearch3 predicts SECIS elements in selenoprotein transcripts, Seblastian uses the information provided by SECISearch3 to predict selenoprotein sequences encoded upstream of SECIS elements (12).