Selenoproteins of Mus spretus

RESULTS

The table below summarizes the results of our analysis of the Mus spretus selenoproteome. It lists all selenoproteins predicted in Mus spretus, as well as the proteins involved in their synthesis. Every protein has been located in a given contig of the Mus spretus genome, and its exact location within this contig has been identified (Contig and Gene Location columns, respectively).

The relevant documents for protein prediction can be found in the table. These include the tBLASTn results, the T-Coffee alignments obtained from the Exonerate or Genewise predictions, the SECIS information together with an image of the chosen SECIS, and the Seblastian output. The last column contains the Matlab figures in which all BLAST hits can be seen together with the Exonerate and Genewise predictions (see below for more information). Genewise and Exonerate T-Coffee alignments have only been added when they were accurate and relevant for obtaining the final prediction. The final protein predictions can be found in the Predicted Protein column.

Since we used both the Mus musculus and human selenoproteomes to make our predictions, the documents obtained from Mus musculus are marked with a mouse icon and those obtained from human with a person icon.

SELENOPROTEINS AND CYSTEINE HOMOLOGUES

Protein Name | Contig | Gene Location | Predicted Protein | tBLASTn | Exonerate | Genewise | SECIS Info | SECIS Photo | Seblastian | Matlab Figure

Glutathione peroxidase (GPx)
GPx1 CM004102.1 109207309-109208129

GPx2-a CM004106.1 70218431-70221164

GPx2-b CM004100.1 89512429-89512994

GPx3 CM004105.1 53888501-53895197

GPx4 CM004103.1 78895105-78898745

GPx5 CM004107.1 17270357-17275565

GPx6 CM004107.1 17297628-17304819

GPx7 CM004097.1 105404803-105410611

GPx8 CM004107.1 110154109-11057254

Iodothyronine deiodinase (DIO)
DIO1 CM004097.1 104286841-104301651

DIO2 CM004106.1 84555859-84564721

DIO3 CM004106.1 105018284-105019117

Thioredoxin reductase (TXNRD)
TXNRD1 CM004103.1 81689812-81714116

TXNRD2 CM004110.1 15356078-15410943

TXNRD3 CM004099.1 88803548-88833559

Methionine sulfoxide reductase A (MsrA)
MsrA CM004108.1 54488572-54815934

Methionine-R-sulfoxide reductase (MSRB)
MSRB1 CM004111.1 21502796-21509320

MSRB3 CM004103.1 121537637-121663987

15kDa selenoprotein (Sel15)
Sel15 CM004096.1 144217792-144243497

Selenoprotein H (SELENOH)
SELENOH CM004095.1 84546376-84546943

Selenoprotein I (SELENOI)
SELENOI CM004098.1 27268100-27306511

Selenoprotein K (SELENOK)
SELENOK-a CM004108.1 22613900-22618809

SELENOK-b CM004097.1 132604011-132604226

SELENOK-c CM004095.1 169487792-169487989

SELENOK-d LVXV01025633.1_8 9521-9799

Selenoprotein M (SELENOM)
SELENOM CM004105.1 417415-419622

Selenoprotein N (SELENON)
SELENON CM004097.1 131159360-131172024

Selenoprotein O (SELENOO)
SELENOO CM004109.1 89147087-89157897

Selenoprotein P (SELENOP)
SELENOP CM004109.1 192628-197787

Selenoprotein S (SELENOS)
SELENOS CM004100.1 53733901-53743134

Selenoprotein T (SELENOT)
SELENOT CM004096.1 56410358-56427006

Selenoprotein U (SELENOU)
SELENOU1 CM004108.1 34057328-34066959

SELENOU2 CM004107.1 61511789-615300095

SELENOU3 CM004097.1 151529697-151532124

Selenoprotein W (SELENOW)
SELENOW-1 CM004100.1 8372322-8374762

SELENOW-2 CM004105.1 99815374-99815374

MACHINERY PROTEINS

Protein Name | Contig | Gene Location | Predicted Protein | tBLASTn | Exonerate | Genewise | SECIS Info | SECIS Photo | Seblastian | Matlab Figure

tRNA Sec 1 associated protein 1 (SECp43)
SECp43 CM004097.1 128762881-128781272

Selenophosphate synthetase (SEPHS)
SEPHS2 CM004100.1 117216271-1172117623

Selenocysteine synthase (SecS)
SECS CM004098.1 50454061-50480812

Phosphoseryl-tRNA kinase (PSTK)
PSTK CM004100.1 121562581-121571150

SECIS binding protein 2 (SBP2)
SBP2 CM004107.1 48925499-48962157

Eukaryotic elongation factor (eEFsec)
eEFsec CM004099.1 87343010-87537215

We attach a text file with the predicted exon locations and SECIS coordinates for each protein within its contig. We also add a visual representation of these data: a Matlab figure that allows the user to browse through the Mus spretus selenoproteome.
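For readers who want to work with the coordinates programmatically, the rows of the tables above (protein name, contig, start-end) can be parsed as in the following minimal Python sketch. It is an illustration only and not part of our pipeline: the names GeneLocation and parse_row are hypothetical, and the length calculation assumes that the coordinates are 1-based and inclusive, which is not stated in the table.

```python
from typing import NamedTuple, Tuple


class GeneLocation(NamedTuple):
    """Location of a predicted gene within a contig (hypothetical helper type)."""
    contig: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumption: coordinates are 1-based and inclusive.
        return self.end - self.start + 1


def parse_row(row: str) -> Tuple[str, GeneLocation]:
    """Parse a table row of the form 'NAME CONTIG START-END'."""
    fields = row.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 whitespace-separated fields, got {len(fields)}")
    name, contig, span = fields
    start_text, end_text = span.split("-")
    start, end = int(start_text), int(end_text)
    if end < start:
        # Report inconsistent ranges instead of silently swapping them.
        raise ValueError(f"end ({end}) precedes start ({start}) for {name}")
    return name, GeneLocation(contig, start, end)


name, location = parse_row("SELENOK-d LVXV01025633.1_8 9521-9799")
print(name, location.contig, location.length)
```

Rejecting a range whose end precedes its start, rather than reordering it, keeps transcription errors visible when the table is loaded.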

Example of protein prediction

To illustrate the prediction process, we show an example. We chose SELENOI because it could be predicted from the mouse query and the analysis is easy to follow.

After data acquisition, we generated a Matlab figure containing the information relevant for screening a candidate protein (every figure is attached in the Results table). Below is a screenshot of this figure (in this case for the mouse query SPP00001577_2.0):

This figure shows the following elements, all obtained during data acquisition (see Methods for how these files were generated):

Mus spretus contigs in which we got tBLASTn hits (black lines).

The location of these BLAST hits (purple boxes).

The SUBSEQ regions generated from these hits (black boxes).

Genes predicted by Exonerate (blue boxes, with exons and introns).

Genes predicted by Genewise (yellow boxes, with exons and introns).

T-Coffee results for both Exonerate and Genewise predictions (each amino acid is a box with a different color, see below for more information).

Both predicted genes are annotated with their locations within the contig. All BLAST hits, predicted genes and T-Coffee boxes are drawn above or below the contig depending on whether they lie on the forward or reverse strand, respectively.

The most important feature of this overview figure is the coloring of the T-Coffee text. Each color indicates the homology of the predicted protein to the query (in this case the mouse SPP00001577_2.0 protein). The color code is:

Red: Less than 30% homology.

Magenta: Between 30% and 60% homology.

Blue: Between 60% and 90% homology.

Green: More than 90% homology.
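The color code above can be expressed as a small lookup. The following Python sketch is only an illustration (our figures were produced in Matlab, and the function name homology_color is hypothetical). The color code does not say which color the exact boundary values receive, so the sketch assumes that 30% falls in magenta, 60% in blue, and 90% in blue (green being strictly more than 90%).

```python
def homology_color(percent: float) -> str:
    """Map a percent homology (0-100) to the color used in the overview figure.

    Boundary handling is an assumption: 30 -> magenta, 60 -> blue, 90 -> blue.
    """
    if not 0 <= percent <= 100:
        raise ValueError(f"percent homology out of range: {percent}")
    if percent < 30:
        return "red"
    if percent < 60:
        return "magenta"
    if percent <= 90:
        return "blue"
    return "green"


for value in (12.5, 45, 75, 97):
    print(value, homology_color(value))
```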

This allowed us to screen automatically for relevant predictions. In this case there is a very good prediction in contig CM004098.1 and two predictions with very low homology. The next step was to zoom into the interesting region to get more information about that prediction. Below is the zoomed CM004098.1 region:

We can see that the BLAST hits and the Genewise and Exonerate predictions (first three boxes) overlap completely. The two green boxes are the visual representations of the T-Coffee alignments mentioned before. The first box (Ex-Tc) refers to the Exonerate T-Coffee, and the second (GW-Tc) refers to the Genewise one. Note the text line that reports the T-Coffee score, the percentage of homology and whether the Sec is aligned (in this case it was perfectly aligned, which is labeled as "true").

To better understand how this T-Coffee visual representation works, we present the predicted genes in contig CM004096.1, which have very low homology. Below is a zoomed region of the T-Coffee obtained:

Each color describes how each query amino acid is aligned with the predicted protein:

Green is match

Red is mismatch

Yellow is a gap in the predicted protein

Purple is a gap in the query protein

There is also a blue-outlined box, which corresponds to the query Sec. In this case the prediction had very low homology and the Sec was aligned with a gap (yellow box), so we discarded it from the analysis.
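The per-residue coloring and the Sec check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Matlab code behind our figures: the function names are hypothetical, the two sequences are assumed to be already aligned rows of equal length with "-" as the gap character and "U" as Sec, and the Sec is counted as aligned only when every query "U" sits opposite a "U" in the predicted protein.

```python
from typing import List

GAP = "-"  # assumed gap character in the aligned sequences
SEC = "U"  # one-letter code for selenocysteine


def color_alignment(query: str, predicted: str) -> List[str]:
    """Return one color per alignment column, following the legend above."""
    if len(query) != len(predicted):
        raise ValueError("aligned sequences must have the same length")
    colors = []
    for query_aa, predicted_aa in zip(query, predicted):
        if query_aa == GAP and predicted_aa == GAP:
            raise ValueError("column with gaps in both sequences")
        if predicted_aa == GAP:
            colors.append("yellow")   # gap in the predicted protein
        elif query_aa == GAP:
            colors.append("purple")   # gap in the query protein
        elif query_aa == predicted_aa:
            colors.append("green")    # match
        else:
            colors.append("red")      # mismatch
    return colors


def sec_aligned(query: str, predicted: str) -> bool:
    """True if every Sec in the query is opposite a Sec in the prediction (assumed criterion)."""
    if len(query) != len(predicted):
        raise ValueError("aligned sequences must have the same length")
    sec_columns = [i for i, aa in enumerate(query) if aa == SEC]
    return bool(sec_columns) and all(predicted[i] == SEC for i in sec_columns)


# Toy alignment (not real data): the query Sec falls on a gap in the prediction.
query_row = "MKU-LA"
predicted_row = "MR-GLA"
print(color_alignment(query_row, predicted_row))
print(sec_aligned(query_row, predicted_row))
```

In the toy alignment the query Sec lies opposite a gap, so its column is colored yellow and the Sec check fails, which is the situation that led us to discard the low-homology prediction.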

We used this analysis framework, combined with manual verification, to browse all our files and make precise protein predictions (see Discussion for a detailed explanation) in a high-throughput and user-friendly manner. All figures are attached to the results table.