RESULTS


The table below resumes the results obtained in our analysis of Mus spretus selenoproteome. All selenoproteins predicted in Mus spretus can be found in this table, as well as the proteins involved in their synthesis. Every protein has been located in a given contig of Mus spretus genome and the exact location within this contig has been identified (columns 1 and 2, respectively).

The relevant documents for protein prediction can be found in the table. These include the results of tBLASTn, T-Coffees obtained with Exonerate or Genewise, SECIS information and the photograph of the chosen SECIS, and Seblastian. Moreover, in the last column we have included the Matlab figures where all blast hits can be seen, together with Exonerate and Genewise predictions (see below for more information). Genewise and Exonerate T-Coffees have only been added when they were accurate and relevant for obtaining the final prediction. In the third column, final protein predictions can be found.

Since we have used both Mus musculus and human selenoproteomes to make our predictions, we have attached the documents obtained from Mus musculus with a mouse icon and the ones obtained from humans with a person icon.



SELENOPROTEINS AND CYSTEINE HOMOLOGUES

Protein Name Contig Gene Location Predicted Protein tBlastn Exonerate Genewise Secis Info Secis Photo Seblastian Matlab Figure
Glutathione peroxidase (GPx)
GPx1 CM004102.1 109207309-109208129
GPx2-a CM004106.1 70218431-70221164
GPx2-b CM004100.1 89512429-89512994
GPx3 CM004105.1 53888501-53895197
GPx4 CM004103.1 78895105-78898745
GPx5 CM004107.1 17270357-17275565
GPx6 CM004107.1 17297628-17304819
GPx7 CM004097.1 105404803-105410611
GPx8 CM004107.1 110154109-11057254
Iodothyronine deiodinase (DIO)
DIO1 CM004097.1 104286841-104301651
DIO2 CM004106.1 84555859-84564721
DIO3 CM004106.1 105018284-105019117
Thioredoxin reductase (TXNRD)
TXNRD1 CM004103.1 81689812-81714116
TXNRD2 CM004110.1 15356078-15410943
TXNRD3 CM004099.1 88803548-88833559
Methionine sulfoxide reductase A (MsrA)
MsrA CM004108.1 54488572-54815934
Methionine-R-sufoxide reductase (MSRB)
MSRB1 CM004111.1 21502796-21509320
MSRB3 CM004103.1 121537637-121663987
15kDa selenoprotein (Sel15)
Sel15 CM004096.1 144217792-144243497
Selenoprotein H (SELENOH)
SELENOH CM004095.1 84546376-84546943
Selenoprotein I (SELENOI)
SELENOI CM004098.1 27268100-27306511
Selenoprotein K (SELENOK)
SELENOK-a CM004108.1 22613900-22618809
SELENOK-b CM004097.1 132604011-132604226
SELENOK-c CM004095.1 169487792-169487989
SELENOK-d LVXV01025633.1_8 9521-9799
Selenoprotein M (SELENOM)
SELENOM CM004105.1 417415-419622
Selenoprotein N (SELENON)
SELENON CM004097.1 131159360-131172024
Selenoprotein O (SELENOO)
SELENOO CM004109.1 89147087-89157897
Selenoprotein P (SELENOP)
SELENOP CM004109.1 192628-197787
Selenoprotein S (SELENOS)
SELENOS CM004100.1 53733901-53743134
Selenoprotein T (SELENOT)
SELENOT CM004096.1 56410358-56427006
Selenoprotein U (SELENOU)
SELENOU1 CM004108.1 34057328-34066959
SELENOU2 CM004107.1 61511789-615300095
SELENOU3 CM004097.1 151529697-151532124
Selenoprotein W (SELENOW)
SELENOW-1 CM004100.1 8372322-8374762
SELENOW-2 CM004105.1 99815374-99815374

MACHINERY PROTEINS

Protein Name Contig Gene Location Predicted Protein tBlastn Exonerate Genewise Secis Info Secis Photo Seblastian Matlab Figure
tRNA Sec 1 associated protein 1 (SECp43)
SECp43 CM004097.1 128762881-128781272
Selenophosphate synthetase (SEPHS)
SEPHS2 CM004100.1 117216271-1172117623
Selenocysteine synthase (SecS)
SECS CM004098.1 50454061-50480812
Phosphoseryl-tRNA kinase (PSTK)
PSTK CM004100.1 121562581-121571150
SECIS binding protein 2 (SBP2)
SBP2 CM004107.1 48925499-48962157
Eukaryotic elongation factor (eEFsec)
eEFsec CM004099.1 87343010-87537215


We attach a text file with the predicted exon locations and SECIS coordinates for each protein within the contig. We also add a visual representation of this data, which is a Matlab figure that allows the user to browse across through the Mus spretus selenoproteome.

Example of protein prediction

To illustrate the process of prediction we want to show an example. We chose SELENOI because it could be predicted from the mouse query and the analysis is easy to understand.

After data acquisition we generated a Matlab figure that contains the relevant information for screening a candidate protein (every figure is attached in the Results table). Below there is a screenshot of how this figure looks (in this case it refers to the mouse query SPP00001577_2.0):



This figure shows the following elements that we got in data acquisition (see methods for a proper understanding of how these files were generated):

  • Mus spretus contigs in which we got tBLASTn hits (black lines).
  • The location of these BLAST hits (purple boxes).
  • The SUBSEQ regions generated from these hits (black boxes).
  • Genes predicted by Exonerate (blue boxes, with exons and introns).
  • Genes predicted by Genewise (yellow boxes, with exons and introns).
  • T-Coffee results for both Exonerate and Genewise predictions(each amino acid is a box with a different color, see below for more information).
Both genes predicted have the locations within the contig annotated. All BLAST, predicted genes and T-Coffee boxes are above or below the contig depending on if they are in the forward or reverse strand, respectively.

The most important thing of this overview figure is the coloring of the T-Coffee text. Each color indicates the homology of the predicted protein to the query (in this case the mouse SPP00001577_2.0 protein). The color code is:

  • Red: Less than 30% homology.
  • Magenta: Between 30 - 60 % homology.
  • Blue: Between 60 - 90 % homology.
  • Green: More than 90 % homology.
This allowed us to automatically screen for relevant predictions. In this case there is a very good prediction in contig CM004098.1, and 2 predictions with very low homology. The next step was to zoom into the interesting region to get more information about that prediction. Below is the zoomed CM004098.1 region:



We can see how the BLAST hits, Genewise and Exonerate predictions (first 3 boxes) which overlap completely. The two green boxes are the visual representations of tCoffee mentioned before. The first box (Ex-Tc) refers to the Exonerate T-Coffee, and the second (GW-Tc) refers to the Genewise one. Note that there is a text line that indicates which is the T-Coffee score, the % of homology and information about the alignment of the Sec (in this case it was perfectly aligned, and this is labeled as ''true'').

To understand better how this T-Coffee visual representation works we present the predicted genes in contig CM004096.1, which have very low homology. Below is a zoomed region of the T-Coffee obtained:



Each of the colors describes how is each query amino acid aligned with the predicted protein:

  • Green is match
  • Red is miss-match
  • Yellow is a gap in the predicted protein
  • Purple is a gap in the query protein
There is also a blue outlined box which corresponds to the query Sec. In this case we can see how this prediction had very low homology and the Sec was aligned with a gap (yellow box), so that we discarded it from the analysis.

We used this analysis framework combined with manual verifications to browse among all our files and make precise protein predictions (see discussion for a detailed explanation), in a high-throughput and friendly-interface manner. All figures are attached to the results table.