ABSTRACT

Selenoproteins are proteins containing one or more selenocysteine (Sec) residues. Sec is the 21st naturally occurring amino acid and contains selenium (Se). It is encoded by the UGA codon, which normally serves as a stop codon. Dedicated protein machinery and SECIS (Sec insertion sequence) elements are required for Sec to be incorporated into a protein chain.

This dual function of the UGA codon makes the characterization of selenoproteomes challenging, and most selenoprotein genes are misannotated. Although the basic mechanisms of Sec synthesis and insertion into proteins have been studied in great detail, the identity and functions of many selenoproteins remain largely unknown in most organisms.

The aim of this project is to characterize the Cricetomys gambianus selenoproteome by identifying and annotating the selenoproteins of this organism, which have never been described before. Using a semi-automatic, homology-based computational approach with the well-annotated Mus musculus genome as reference, we have identified 15 selenoproteins in the Cricetomys gambianus genome as well as 9 Cys-containing homologues.

This analysis provides a first insight into the selenoproteome of this mammal.

INTRODUCTION

Selenoproteins

Selenium and selenocysteine

Selenium (Se) is an essential trace element for humans, plants and microorganisms. Although it is toxic when intake exceeds certain doses, Se has important cellular functions and its deficiency can contribute to pathophysiological conditions, including heart disease, neuromuscular disorders and cancer (1). The importance of Se to biological systems is underlined by the fact that it is the only trace element to be specified in the genetic code, in the form of selenocysteine (Sec). Sec is now recognized as the 21st amino acid, as it has its own codon and a Sec-specific biosynthetic and insertion machinery (4). Sec is a proteinogenic amino acid: it is inserted into proteins during translation, in contrast to post-translationally modified amino acids. The majority of selenoproteins perform oxidoreductase roles (2).

Fig. 1 Genetic code (4)

Selenoproteins

Proteins containing Sec are known as selenoproteins (selenium-containing proteins). Mammalian selenoproteins can be broadly classified into two classes: housekeeping selenoproteins and stress-related selenoproteins. Housekeeping selenoproteins serve functions critical to cell survival and are therefore less affected by Se-deficient conditions, whereas stress-related selenoproteins are not essential for survival and show decreased expression in Se-deficient conditions (3). Selenoproteins can be found in all three domains of life (Eukarya, Archaea and Bacteria). In Eukarya, selenoproteins are largely present in animals. They are well characterised in humans and mice, which have twenty-five and twenty-four selenoprotein genes, respectively (2). However, other eukaryotes, such as higher plants and yeast, do not have selenoproteins because the machinery necessary to incorporate Sec has been lost (4). Red algae, insects and nematodes have fewer than five selenoproteins (3).

The selenoproteome (that is, the set of selenoproteins in the proteome of an organism) therefore differs widely between organisms. Recent studies have shown that aquatic organisms generally have larger selenoproteomes than terrestrial organisms, and that mammalian selenoproteomes show a trend toward reduced use of selenoproteins. Several selenoproteins were lost across vertebrates after the terrestrial environment was colonised (fishes make greater use of these proteins than mammals). This suggests that selenoproteins may help adaptation to a given environment, and thus that the environment plays a key role in the development of selenoproteomes. For example, it has been suggested that the higher amount of oxygen in water increases the need for selenoproteins with redox functions, so the selenoproteome of water-living organisms needs to be larger (3).

Sec biosynthesis

Sec is the only amino acid in Eukarya whose biosynthesis occurs on its own tRNA, known as Sec tRNA [Ser]Sec (1). We will now proceed to briefly describe this process.

Fig. 2 Schematic representation of Sec aminoacid biosynthesis (1)

Sec tRNA (tRNA[Ser]Sec)

tRNA[Ser]Sec is initially aminoacylated with serine in a reaction catalysed by seryl-tRNA synthetase (SerS) to form seryl-tRNA[Ser]Sec, which provides the backbone for Sec biosynthesis. tRNA[Ser]Sec has several unique features that distinguish it from all other tRNAs. For example, it is the longest tRNA sequenced so far (90 nucleotides), it has two isoforms that form phosphoseryl-tRNA[Ser]Sec, and it has an atypically long acceptor stem and a long D-stem that may have up to 6 base pairs (bp) instead of the 3 or 4 bp found in other tRNAs. Crystal structures of human tRNA[Ser]Sec have recently been obtained and revealed unusual secondary structures and unique tertiary interactions. The unusually long 6-bp D-stem of this tRNA does not interact with the variable arm, resulting in an open cavity instead of the tertiary hydrophobic core, as shown in the figure (1).

Seryl-tRNA synthetase

Aminoacylation of tRNA[Ser]Sec with Ser is the first step in the biosynthesis of Sec in Eukaryotes: the Ser moiety serves as the backbone for Sec. The fact that tRNA[Ser]Sec is initially aminoacylated with serine by seryl-tRNA synthetase (SerS) suggests that it has identity elements for Ser, but not for Sec. These elements are located in the discriminator base and the long variable arm, both of which are essential for aminoacylation by SerS (1).

Phosphoseryl-tRNA[Ser]Sec Kinase and Sec Synthase

The conversion of the serine moiety on tRNA[Ser]Sec to selenocysteyl-tRNA[Ser]Sec is catalysed by Sec synthase (SecS), which incorporates selenophosphate, the active form of Se, into the amino acid backbone and forms Sec-tRNA. Biosynthesis of Sec in Archaea and Eukaryotes requires an additional catalytic step that produces an O-phosphoseryl-tRNA[Ser]Sec intermediate, which serves as the substrate for eukaryotic SecS. The kinase responsible for the phosphorylation of seryl-tRNA to phosphoseryl-tRNA, designated phosphoseryl-tRNA kinase (PSTK), and SecS were identified using computational and comparative genomic approaches. The functions of these proteins were confirmed experimentally. Interaction of tRNA[Ser]Sec with the active site of SecS induces a conformational change in the enzyme that allows binding of O-phosphoseryl-tRNA[Ser]Sec, but not free phosphoserine, so that the reaction can occur. In eukaryotic and archaeal systems, SecS recognizes the phosphate moiety of the O-phosphoseryl-tRNA[Ser]Sec synthesized by PSTK, while PSTK discriminates Ser-tRNA[Ser]Sec from Ser-tRNASer (1).

Selenophosphate synthetase

In bacteria, selenophosphate synthetase (SPS) synthesizes monoselenophosphate from selenide and ATP, and monoselenophosphate acts as the active Se donor. Two proteins, SPS1 and SPS2, were identified in mammals and were initially thought to function like bacterial SPS. However, further analysis revealed that SPS2 could generate selenophosphate in vitro, whereas SPS1 could not. This suggests that SPS2 is required for de novo synthesis of selenophosphate, while SPS1 may have a role in Sec recycling through a Se salvage system. It has also been suggested that SPS1 may be involved in Se metabolism by regulating biosynthesis and/or uptake of vitamin B6. Since SPS2 is itself a selenoprotein, it possibly serves as an autoregulator of selenoprotein synthesis. In addition to its role in Sec biosynthesis, SPS2 has recently been implicated in Cys biosynthesis. This means that Cys can be synthesized de novo by the Sec biosynthetic machinery and then inserted into selenoproteins during Se deficiency (1).

Sec Lyase

The mechanism by which Sec is decomposed involves the enzyme Sec lyase (SCL), which catalyses a PLP-dependent degradation of Sec to L-alanine and elemental Se. More recently, structural studies revealed the details of its catalytic mechanism and how this unique enzyme distinguishes Sec from Cys and Ser residues, which only differ from Sec by a single atom. SCL is expressed in a variety of tissues with the highest activity observed in liver and kidney. However, SCL activity was not detected in blood and fat tissue (1).

Sec incorporation into proteins

Cotranslational incorporation of Sec into proteins is dictated by in-frame UGA codons present in selenoprotein mRNAs. Sec is introduced into selenoproteins by a complex mechanism that requires special trans-acting protein factors, Sec-tRNA[Ser]Sec and a cis-acting Sec insertion sequence (SECIS) element. When a ribosome encounters the UGA codon, which normally signals translation termination, the Sec machinery interacts with the canonical translation machinery to prevent premature termination. SECIS elements are the signals that dictate recoding of UGA as Sec. In response to the SECIS element in the selenoprotein mRNA, Sec-tRNA[Ser]Sec, which has an anticodon complementary to UGA, translates UGA as Sec. At least two trans-acting factors are required for efficient recoding of UGA as Sec in eukaryotes: SECIS binding protein 2 (SBP2) and the Sec-specific translation elongation factor (eEFSec). SBP2 is stably associated with ribosomes and contains a distinct L7Ae RNA-binding domain that binds SECIS elements with high affinity and specificity (1). Aside from binding to ribosomes and SECIS elements, SBP2 also interacts with eEFSec, which recruits Sec-tRNA[Ser]Sec and facilitates incorporation of Sec into the nascent, growing polypeptide (1). Additional SECIS-binding proteins have been identified and their roles in selenoprotein synthesis characterized, including ribosomal protein L30, eukaryotic initiation factor 4a3 (eIF4a3) and nucleolin. While ribosomal protein L30 has been predicted to constitute a part of the basal Sec insertion machinery, nucleolin and eIF4a3 serve as regulatory proteins that modulate synthesis of selenoproteins and may contribute to the hierarchy of selenoprotein expression (1).
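The dual reading of the UGA codon described above can be illustrated with a small sketch. It is a deliberate simplification: the flag recode_tga_as_sec is a hypothetical stand-in for "a SECIS element is present", every in-frame TGA is recoded when it is set, and the example sequence is invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, recode_tga_as_sec=False):
    """Translate a coding sequence in frame 1.

    Stops are written as '*'. If recode_tga_as_sec is True, TGA is
    written as 'U' (Sec) instead. Unknown codons become 'X'.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA input
    protein = []
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[pos:pos + 3]
        if codon == "TGA" and recode_tga_as_sec:
            protein.append("U")
        else:
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


# Invented toy sequence: ATG TGA GGT TAA
standard = translate("ATGTGAGGTTAA")
recoded = translate("ATGTGAGGTTAA", recode_tga_as_sec=True)
```

With the standard reading the TGA codon appears as a stop, while with recoding it appears as U. In a real mRNA the decision depends on the SECIS element and the Sec machinery, not on a global flag.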

Fig. 3 Schematic representation of Sec-tRNA[Ser]Sec incorporation onto the mRNA chain (1)

Mechanism of incorporation

SECIS Elements

SECIS elements are cis-acting stem-loop RNA structures found in the 3’-untranslated regions of all eukaryotic selenoprotein mRNAs, although they differ in sequence, motifs and structure among organisms. Several features distinguish SECIS elements from other functional mRNA stem-loop structures. Eukaryotic SECIS elements are formed by two helices separated by an internal loop, a GA quartet (SECIS core) structure, and an apical loop or bulge. The SECIS core is the main functional element of the SECIS and is required for interaction with SBP2. The apical loop is used to classify SECIS elements into two types: elements whose apical loop lacks the additional bulge called the ministem are classified as type 1, and those containing the ministem belong to type 2. In addition to the SECIS core, a conserved AAR motif in the apical region of the SECIS is required for Sec incorporation. The function of the AAR motif remains unknown, and no AAR sequence-specific binding proteins have been identified (1).

SBP2

SBP2 contains three distinct domains: an NH2-terminal domain, whose function is currently unknown and which is completely absent in some organisms, suggesting a regulatory role; a Sec incorporation domain (SID) in the middle part of the protein; and a COOH-terminal RNA-binding domain (RBD), which interacts with RNAs containing kink-turn motifs, such as SECIS elements. Both the RBD and the SID are involved in SECIS binding. The RBD, which contains the L7Ae motif, is required for the SECIS-binding activity of SBP2, and it has been suggested that the SID may enhance SECIS binding by the RBD (1).

Sec-Specific Elongation Factor

A Sec-specific eukaryotic elongation factor (eEFSec) is responsible for recruiting tRNA[Ser]Sec and, together with SBP2, inserting Sec into nascent protein chains in response to UGA codons (1). Like the canonical elongation factor eEF1A, which is involved in the incorporation of the other 20 amino acids, eEFSec has GTPase activity. Unlike eEF1A, however, it has high specificity for aminoacylated tRNA[Ser]Sec and does not bind phosphoseryl or other aminoacylated tRNAs (1). It is likely that the complex between SBP2 and eEFSec is mediated by the SECIS element and may be further stabilized by tRNA[Ser]Sec, as the interaction between SBP2 and eEFSec is enhanced by overexpression of the tRNA[Ser]Sec gene (1). The presence of two separate proteins in Eukaryotes might be dictated by the distant location of the SECIS element (in the 3’-UTR). This differs from Bacteria, in which the SECIS element lies immediately downstream of the Sec-encoding UGA codon. To decode UGA as Sec, these factors undergo a conformational change upon binding a SECIS element, which stimulates functional interactions with the ribosome (1).

Ribosomal Protein L30

Ribosomal protein L30 is a component of the large (60S) ribosomal subunit in eukaryotes and has been shown to bind SECIS elements through an L7Ae RNA-binding motif, which is also found in SBP2, at a site overlapping with the SBP2-binding site (1). Although the exact function of L30 in the selenoprotein synthesis pathway remains unknown, some observations suggest that it might constitute a part of the basal Sec insertion machinery (1).

Regulation of incorporation

Regulation of Selenoprotein expression

Previous studies using both cell culture and animal models have shown that expression of selenoproteins is differentially regulated by Se availability (1). While expression of stress-related selenoproteins such as GPx1, MsrB1, SelW and SelH is strongly regulated by dietary Se, expression of housekeeping selenoproteins such as TR1 and TR3 is less dependent on dietary Se. The underlying mechanisms are not fully understood, but eIF4a3 seems to be implicated (1). Another SECIS-binding protein that seems to influence the efficiency of UGA translation is nucleolin. Nucleolin is an abundant phosphoprotein located in the nucleolus, where it is involved in rRNA synthesis and ribosome biogenesis. This protein also plays a role in the regulation of transcription and chromatin remodeling and contains four RNA recognition motifs, which helped to identify nucleolin as a potential SECIS-binding protein. However, subsequent studies reported conflicting data on nucleolin’s affinity for SECIS elements and its role in the regulation of selenoprotein synthesis (1). Considering that SECIS elements can be bound by a number of different proteins, including SBP2, SBP2L, eIF4a3, L30, nucleolin and possibly others, it would not be surprising if multiple SECIS-binding proteins regulate the expression of selenoproteins in a combinatorial manner (1).


About Cricetomys gambianus

African giant pouched rats belong to a group of rats with cheek pouches, which are used to carry food to their shelter, where it is eaten or stored. The other species belonging to the genus Cricetomys is the African pouched rat or Emin’s pouched rat. Both species are quite similar in appearance and size, but the Gambian pouched rat has coarse, brown fur and a dark ring around the eye, whilst Emin’s pouched rat has soft, light fur and a characteristic white line on the belly. Gambian pouched rats are nocturnal animals that depend on smell and hearing to orient themselves. They can live in different habitats, such as savannah, woodland and mountains, and they are found in central and southern Africa. Some are kept as pets, and others have been trained to detect explosives and people infected with tuberculosis. More information about this species can be found in our Wikipedia page.

Sec characterisation in Cricetomys gambianus

Although the basic mechanisms of Sec synthesis and insertion into proteins in both Prokaryotes and Eukaryotes have been studied in great detail, the identity and functions of many selenoproteins remain largely unknown. In the last decade, there has been significant progress in characterizing selenoproteins and selenoproteomes and in understanding their physiological functions (1). Because UGA serves both as a stop signal and as a selenocysteine codon in the genetic code, and there are no general computational methods for identifying the coding function of UGA, most selenoprotein genes are misannotated. The methods currently used to look for selenoproteins in a genome rely on the identification of selenocysteine insertion structures, the coding potential of UGA codons and the presence of cysteine-containing homologs (5). Based on these premises and using the well-characterised Mus musculus genome as a reference, our aim is to characterise the Cricetomys gambianus selenoproteome and so contribute to the growing knowledge of the role of selenoproteins in living organisms.














MATERIALS AND METHODS

The aim of this study is to find, characterize and annotate the selenoproteins in the Cricetomys gambianus genome. To achieve this goal, a homology-based study is performed: the Cricetomys gambianus genome is compared with the genome of a well-annotated species that is as closely related as possible. We chose Mus musculus, as it is the closest mammal whose genome is well annotated in the SelenoDB database.

The steps followed to obtain the selenoproteome of C. gambianus are described below.

Query identification

The queries used to identify selenoproteins in the Cricetomys gambianus genome were extracted from SelenoDB 2.0, although we also used SelenoDB 1.0 to check whether both versions contained the same information. SelenoDB 1.0 was curated manually, so biological sense was taken into account when annotating selenoproteins in genomes, whereas SelenoDB 2.0 was generated automatically. In this step, the amino acid sequences of the Mus musculus selenoproteins are obtained.

Queries were saved in files named query plus the number of the protein (e.g. query1). Then, we manually deleted any stray characters and replaced the “U” characters with “X”. This last step is needed because the program does not recognize selenocysteine (U), so we replace it with “X”, which stands for any amino acid.
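The manual replacement of U by X could also be scripted. The following is only a sketch of that idea, not the procedure used in the project: the function name mask_selenocysteine and the example record are invented for illustration.

```python
def mask_selenocysteine(fasta_text):
    """Replace U (Sec) with X in sequence lines, leaving FASTA headers untouched."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # header: keep exactly as it is
        else:
            cleaned.append(line.strip().upper().replace("U", "X"))
    return "\n".join(cleaned) + "\n"


# Invented example record (not a real selenoprotein sequence).
example = ">query1 Mus musculus\nMKVLAUGTTQ\n"
masked = mask_selenocysteine(example)
```

Only sequence lines are modified, so a letter u in the header (as in "Mus musculus") is not affected.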

Prediction process

To use the programs required for the prediction process, we need access to the “public” folder, which requires our username (uXXXXXX) and password. This folder gives us access to the genome sequences and all the programs.

In addition, every time a program is run, the whole path to its folder must be written. In the explanation that follows, the full paths are omitted for simplicity, but they always appear in the script.

From this point on, the steps were automated with a script. We only had to enter the name of the query manually and then, after running tblastn, the name of the scaffold and the start and end positions of the hit. The remaining steps were then performed automatically, ending with the T-Coffee alignment.

To look for the Cricetomys gambianus genomic regions where the previously identified queries could be found, we use BLAST (Basic Local Alignment Search Tool). As the query is an amino acid sequence, we use TBLASTN, which compares a protein query against a nucleotide database translated in all six reading frames. The results of the search are the potential alignments, or hits, in the genome of Cricetomys gambianus.

The genome of Cricetomys gambianus was obtained from:

/mnt/NFS_UPF/soft/genomes/2019/Cricetomys_gambianus

The path to the program is:

/mnt/NFS_UPF/soft/ncbi-blast-2.7.1+/bin/tblastn

The command used is:

tblastn -query query -db /mnt/NFS_UPF/soft/genomes/2019/Cricetomys_gambianus/genome.fa -out q_scaffold

Here, query is the query file of the selenoprotein of interest and q_scaffold is the output file, which lists the potential positions of the query in the genome of Cricetomys gambianus.

The output file contains all the potential alignments, or hits, each with an associated expected value (E-value). The E-value is the number of hits with an equal or better score that would be expected by chance in a database of this size, so lower values indicate more significant hits. The hit with the lowest E-value is chosen to continue the procedure. If the whole process is completed and the final results are not as expected, we pick the hit with the second lowest E-value and repeat the following steps. If this does not work either, we look for alternatives to obtain a correct alignment; these cases are explained in the results.
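Choosing the hit with the lowest E-value can be sketched as follows. This is an illustration under an assumption that differs from the command shown here: it expects tblastn to have been run with the tabular output option (-outfmt 6), whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The function name rank_hits and the two example hits are invented.

```python
def rank_hits(tabular_text):
    """Parse BLAST tabular output (-outfmt 6) and sort hits by ascending E-value."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        start, end = int(fields[8]), int(fields[9])
        hits.append({
            "scaffold": fields[1],
            "start": start,
            "end": end,
            # In tblastn output, sstart > send means a reverse-strand hit.
            "strand": "+" if start <= end else "-",
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        })
    # Lowest E-value first; ties broken by highest bit score.
    return sorted(hits, key=lambda h: (h["evalue"], -h["bitscore"]))


# Invented example with two hits on hypothetical scaffolds.
example = (
    "query1\tscaffold_A\t85.0\t120\t18\t0\t1\t120\t5000\t5359\t3e-40\t150\n"
    "query1\tscaffold_B\t92.0\t130\t10\t0\t1\t130\t9400\t9011\t2e-67\t230\n"
)
ranked = rank_hits(example)
best = ranked[0]
```

The first element of the ranked list is the hit tried first; the next elements are the fallbacks if the final alignment is not satisfactory.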

Since the procedure is automated, we only need to enter the name of the chosen scaffold and the first and last nucleotides of the hit.

The fastafetch program then retrieves the scaffold of interest, which contains the potential region of C. gambianus where the selenoprotein can be found.

The command used is:

fastafetch /mnt/NFS_UPF/soft/genomes/2019/Cricetomys_gambianus/genome.fa /mnt/NFS_UPF/soft/genomes/2019/Cricetomys_gambianus/genome.index $scaffold > q_scaffold

Here, q_scaffold is the nucleotide sequence of the Cricetomys gambianus scaffold most likely to contain the selenoprotein gene.

The hit only covers the fragments that are similar between the Cricetomys gambianus genome and the query protein, so a longer region is needed to make sure that the whole selenoprotein gene is included and not just a small fragment. We decided to extend the hit region by 100,000 nucleotides in both the 5’ and 3’ directions. The extended region cannot run past the ends of the scaffold.

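# $i = first nucleotide of the hit, $f = last nucleotide of the hit,
# $l = length of the region to extract, $length_scaffold = scaffold length.
if ($i < $f) {                          # hit on the forward strand
    $i = $i - 100000;                   # extend 100,000 nt before the hit
    if ($i < 0) {
        $i = 1;                         # do not run past the scaffold start
    }
    $f = $f + 100000;                   # extend 100,000 nt after the hit
    $l = $f - $i;
    if (($i + $l) > $length_scaffold) {
        $l = $length_scaffold - $i - 5; # do not run past the scaffold end
    }
}
else {                                  # hit on the reverse strand
    $f = $f - 100000;
    if ($f < 0) {
        $f = 1;
    }
    $i = $i + 100000;
    $l = $i - $f;
    if (($f + $l) > $length_scaffold) {
        $l = $length_scaffold - $f - 5;
    }
}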


First, the script checks whether the hit lies on the forward strand (first nucleotide smaller than the last one) or on the reverse strand (last nucleotide smaller than the first one). It also checks that the extended start and end positions fall inside the scaffold. If they do not, the start position is set to 1 and the length is shortened so that the region ends within the scaffold.
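The same windowing logic can be written compactly. The sketch below is not a line-by-line port of the script: it assumes 1-based inclusive coordinates, clips exactly at the scaffold ends instead of leaving a 5-nucleotide margin, and uses the invented name extend_hit. The scaffold lengths in the examples are hypothetical.

```python
def extend_hit(start, end, scaffold_length, flank=100_000):
    """Extend a hit by `flank` nucleotides on both sides, clipped to the scaffold.

    Returns (window_start, window_length, strand) in 1-based inclusive
    coordinates. start > end is taken to mean a reverse-strand hit.
    """
    strand = "+" if start < end else "-"
    low, high = min(start, end), max(start, end)
    window_start = max(1, low - flank)
    window_end = min(scaffold_length, high + flank)
    return window_start, window_end - window_start + 1, strand


# Forward-strand hit near the start of a hypothetical 150,000-nt scaffold.
forward = extend_hit(94403, 95193, 150_000)
# Reverse-strand hit in the middle of a hypothetical 1,000,000-nt scaffold.
reverse = extend_hit(250_000, 240_000, 1_000_000)
```

In the first example the window is clipped at both ends and covers the whole scaffold; in the second it covers the hit plus the full 100,000 nucleotides on each side.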

To isolate the whole region, we use the program fastasubseq. This command needs the initial position, the length and the file obtained in the previous step.

The command used is:
fastasubseq q_scaffold1 $i $l > q_subseq

fastasubseq q_scaffold1 $f $l > q_subseq

The first command is executed if the hit is found in the forward strand, and the second one if the hit is in the reverse strand. The output file is named q_subseq.

To predict the presence of SECIS sequences, the output file of Exonerate is needed. However, the file obtained when extending the hit by 200,000 bases in total is too long for the program. As SECIS elements can only be located downstream of the coding sequence (3’ direction), the input file for fastasubseq is the region extended only in the 3’ direction. The maximum length that Seblastian accepts is 120,000 base pairs, so if the strand is positive the region ends at the first nucleotide of the hit ($i) + 120,000, and if the strand is negative it ends at the last nucleotide of the hit ($f) - 120,000.

A new file, called q_secis, is created.


Protein prediction

To predict the potential exons found in the previous sequence, we use Exonerate. Exhaustive mode is chosen in order to obtain more accurate results. With egrep, the lines describing exons are extracted from the Exonerate output into a single file.

The command used is:

exonerate -m p2g --showtargetgff --exhaustive yes -q $query -t q_subseq > q_exonerate

egrep -w exon q_exonerate > q_gff
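The exon lines kept by egrep can be summarised with a short script, for example to count the exons and check their strand. This sketch assumes the tab-separated GFF columns written by --showtargetgff (sequence, source, feature, start, end, score, strand, frame, attributes); the function name read_exons and the two example lines are invented.

```python
def read_exons(gff_text):
    """Return a sorted list of (start, end, strand) for every exon line in GFF text."""
    exons = []
    for line in gff_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 8 and fields[2] == "exon":
            exons.append((int(fields[3]), int(fields[4]), fields[6]))
    return sorted(exons)


# Invented example with two exons on a hypothetical scaffold.
example = (
    "scaffold_A\texonerate:protein2genome:local\texon\t101200\t101489\t.\t+\t.\tinsertions 0 ; deletions 0\n"
    "scaffold_A\texonerate:protein2genome:local\texon\t100401\t100650\t.\t+\t.\tinsertions 0 ; deletions 0\n"
)
exons = read_exons(example)
n_exons = len(exons)
strands = {strand for _, _, strand in exons}
coding_length = sum(end - start + 1 for start, end, _ in exons)
```

The number of exons and the strand obtained this way correspond to the Exons and Sense columns of the results table.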

Exonerate also shows whether the predicted exons are in the positive (5’-3’) or negative (3’-5’) strand. It must be taken into account that this program counts positions starting from the 5’ end.

The fastaseqfromGFF.pl script is used to obtain a file containing the cDNA of the predicted protein based on the Exonerate results.

The command used is:

fastaseqfromGFF.pl q_subseq q_gff > q_fastaseqgff

The next step translates the predicted exon sequences into an amino acid sequence. When the option -F 1 is used in the command, only the first reading frame is translated. Only a few queries required a different frame, and these cases are always specified in the Discussion.

The command used is:

fastatranslate -f q_fastaseqgff -F 1 > q_predita

Where q_predita is the amino acid sequence of the potential selenoprotein.

The problem with fastatranslate is that a TGA codon, which can encode either a stop or a selenocysteine (if a SECIS element is present), is always translated as *. This can lead to a wrong alignment when T-Coffee is executed. To solve this, we replace * with X (meaning any amino acid).

The command used is:

sed 's/*/X/g' q_predita > q_preditax
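The sed command replaces every * in the file, including any that might occur in a FASTA header. A variant that touches only sequence lines could look like the sketch below; the function name mask_stops and the example record are invented for illustration.

```python
def mask_stops(fasta_text):
    """Replace '*' with 'X' in sequence lines only, keeping FASTA headers unchanged."""
    lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            lines.append(line)
        else:
            lines.append(line.replace("*", "X"))
    return "\n".join(lines) + "\n"


# Invented example: an internal TGA and the final stop were both written as '*'.
example = ">pred_protein frame*1\nMKVLA*GTTQ*\n"
masked = mask_stops(example)
```

As with the sed command, the genuine stop codon at the end of the protein also becomes X, which is worth keeping in mind when reading the last column of the alignment.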



Alignment with reference protein

After all these steps, the amino acid sequence of the potential selenoprotein has been obtained. Now, T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation) will be used.

T-Coffee is a multiple sequence alignment software that generates a library of pairwise alignments to guide the multiple sequence alignment. Both the initial query (from Mus musculus) and our predicted amino acid sequence (from Cricetomys gambianus), obtained with the previous step, will be aligned.

The command used is:

t_coffee query1 pred_protein_aa1

The output is the alignment itself, with a consensus line that marks columns where the amino acid is identical in both proteins (indicated with *), columns with a conservative substitution (indicated with :), columns with a semi-conservative substitution (indicated with .) and columns with no conservation (indicated with a space); gaps are shown as - within the sequences. Some scores are also reported by T-Coffee. The meaning and interpretation of the results are discussed in the next sections.
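The quantities examined in an alignment (identical positions, amino acid changes, gaps and the residue aligned with the masked Sec) can be counted from the two aligned sequences. The sketch below is independent of T-Coffee and does not reproduce its score; the function name alignment_summary and the toy sequences are invented.

```python
def alignment_summary(reference, predicted):
    """Count identities, mismatches and gap columns in a pairwise alignment.

    Both strings must be aligned (same length, '-' for gaps). Also reports
    the predicted residue in every column where the reference has X, i.e.
    the masked Sec position of the query.
    """
    if len(reference) != len(predicted):
        raise ValueError("aligned sequences must have the same length")
    summary = {"identical": 0, "mismatches": 0, "gaps": 0, "at_sec_columns": []}
    for ref_aa, pred_aa in zip(reference, predicted):
        if ref_aa == "-" or pred_aa == "-":
            summary["gaps"] += 1
        elif ref_aa == pred_aa:
            summary["identical"] += 1
        else:
            summary["mismatches"] += 1
        if ref_aa == "X":
            summary["at_sec_columns"].append(pred_aa)
    return summary


# Invented toy alignment: one substitution (L/I), one gap, X aligned with X.
result = alignment_summary("MKVLAXGTTQ", "MKVIAXGT-Q")
```

An X aligned with the query X is compatible with a conserved TGA codon, whereas a C in that column points to a Cys-containing homologue.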

Identification of SECIS elements

In order to predict SECIS elements, Seblastian (Selenoprotein prediction server) will be used. This is a new method for selenoprotein gene detection that uses SECISearch3 and then predicts selenoprotein sequences encoded upstream of SECIS elements.

The input to this program will be a fastasubseq file q_secis.

The output will be the SECIS elements predicted downstream of our potential selenoprotein. Some scores will be obtained as a result of Seblastian performance. Their meaning and interpretation will be discussed in the next sections.

Phylogeny

We have built phylogenetic trees for the main selenoprotein families studied in this project: GPx, DI, SPS, SelW and TR. This analysis makes it easy to see whether the results obtained in the previous steps are coherent.

The phylogenetic trees were built using the website Phylogeny.fr, which takes a multi-FASTA file of the query and predicted proteins and generates the phylogenetic tree with the distances between branches. The results of this analysis are shown in the figures in the discussion.

RESULTS

Table of results

The results of our study of the selenoproteome are:


All queries are from Mus musculus. For each protein, the Tblastn, Exonerate, T-Coffee (predicted protein) and Seblastian SECIS output files are provided where available.

Protein   Residue in C. gambianus   Scaffold          E-value     Length   Sense   Exons

Iodothyronine deiodinase family
DI1       Cys                       PVKD010015863.1   1.83e-55    329      -       1
DI2       Sec                       PVKD010004213.1   4.87e-117   8801     -       2

Glutathione peroxidase family
GPx1      Sec                       PVKD010000564.1   3.40e-67    790      +       2
GPx2      Sec                       PVKD010000980.1   1.07e-70    3321     +       2
GPx3      Sec                       PVKD010003256.1   3.51e-41    7166     +       5
GPx4      Sec                       PVKD010015909.1   5.77e-62    2398     +       6
GPx5      Cys                       PVKD010003030.1   1.43e-28    72343    -       5
GPx7      Cys                       PVKD010353829.1   8.79e-55    272      +       1
GPx8      Cys                       PVKD010031108.1   7.81e-40    3322     +       3

Methionine sulfoxide reductase A
MsrA      Cys                       PVKD010019530.1   7.51e-26    2854     -       2
          Cys                       PVKD010004692.1   5.60e-18    99393    -       5

Sel15
Sel15     Sec                       PVKD010011498.1   9.41e-19    30975    +       5

SelH
SelH      Sec                       PVKD010007063.1   3.31e-35    13235    +       3

SelI family
SelI      Sec                       PVKD010011778.1   2.79e-36    37244    +       10

SelK family
SelK      Sec                       PVKD010007843.1   1.71e-16    284      +       1

SelM family
SelM      Sec                       PVKD010014587.1   5.98e-23    2142     -       5

SelN
SelN      Sec                       PVKD010009337.1   1.62e-31    10853    -       12

SelS family
SelS      Cys                       PVKD010001728.1   6.22e-08    8200     +       6

SelT family
SelT      Sec                       PVKD010015842.1   2.55e-20    503      -       5

SelW family
SelW1     Sec                       PVKD010002365.1   7.67e-5     477      +       3
SelW2     Cys                       PVKD010009993.1   2.62e-21    881      +       4

Thioredoxin reductase family
TR1       Sec                       PVKD010006181.1   3.55e-38    22659    +       14
TR2       Sec                       PVKD010006181.1   1.40e-19    51613    +       13
TR3       Sec                       PVKD010012172.1   4.08e-35    29557    +       16

Sec machinery

SECIS binding protein 2 (SBP2)
SBP2      Cys                       PVKD010006502.1   3.93e-28    23564    +       17

Selenophosphate synthetase (SPS)
SPS1      Cys                       PVKD010000387.1   2.03e-38    20377    +       8
SPS2      Sec                       PVKD010013165.1   0           1343     +       1




















DISCUSSION

The main goal of this project is to predict and annotate the whole selenoproteome of C. gambianus, whose genome has been recently sequenced. To achieve this, a comparison with the M. musculus genome was performed. We chose M. musculus because it is the species closest to the Gambian pouched rat with a well-annotated genome.

The 27 mouse proteins were obtained from SelenoDB 1.0. Some of them are selenoproteins or cysteine-containing homologues, while others belong to the machinery that processes selenoproteins. Some of these proteins belong to the same family, giving a total of 15 families (1). To structure the discussion, we took into account the parameters explained in the following paragraphs.

First of all, we looked at the E-value to decide which hit should be aligned. The scaffold with the lowest E-value was always tried first, and the rest of the steps were then performed. If the final alignment was correct and the Seblastian results indicated that we had found a selenoprotein, we stopped there. If the hit with the lowest E-value did not give satisfactory results, we took into account other parameters, which are explained in each particular case.

Then, we examined some parameters of the T-Coffee results: the score, the number of gaps and the number of amino acid changes. If the results were good, we moved to the final step.

Finally, we looked for SECIS elements in the 3’-UTR regions of all the predicted proteins with the SECISearch3 software. As an initial screening, all SECIS elements with an Infernal score lower than 10 were discarded. Then we checked whether each SECIS element was found in the same strand as the predicted protein and rejected those that were not. Afterwards, we analysed the Infernal and Covels scores of the remaining SECIS elements, bearing in mind that most vertebrate SECIS elements have an Infernal score higher than 20 and a Covels score higher than 15. Finally, we checked whether the distance between the SECIS element and the selenocysteine (Sec) codon was biologically coherent (between 2 and 5.2 kb in mammals) (6). In the discussion, we also mention the grade assigned by the software to each SECIS prediction: A, B or C, from best to worst.
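These screening criteria can be expressed as a small filter. The sketch below only illustrates the logic; it does not parse SECISearch3 or Seblastian output, and the class Secis, the function screen_secis and the example candidates are invented. The two hard filters (Infernal score and strand) discard candidates, while the typical vertebrate scores and the distance range are reported as flags for manual inspection.

```python
from dataclasses import dataclass


@dataclass
class Secis:
    start: int        # position of the SECIS element on the scaffold
    strand: str       # "+" or "-"
    infernal: float   # Infernal score
    covels: float     # Covels score


def screen_secis(candidates, gene_strand, sec_position,
                 min_infernal=10.0, distance_range=(2000, 5200)):
    """Apply the screening steps described in the text to SECIS candidates."""
    kept = []
    for cand in candidates:
        if cand.infernal < min_infernal:
            continue  # initial screening on the Infernal score
        if cand.strand != gene_strand:
            continue  # SECIS must be on the same strand as the gene
        # Downstream distance from the Sec codon, in transcript direction.
        if gene_strand == "+":
            distance = cand.start - sec_position
        else:
            distance = sec_position - cand.start
        kept.append({
            "secis": cand,
            "distance": distance,
            "distance_in_range": distance_range[0] <= distance <= distance_range[1],
            "typical_vertebrate_scores": cand.infernal > 20 and cand.covels > 15,
        })
    # Best Infernal score first.
    return sorted(kept, key=lambda item: item["secis"].infernal, reverse=True)


# Invented candidates around a hypothetical Sec codon at position 1000 (+ strand).
candidates = [
    Secis(start=5000, strand="+", infernal=25.0, covels=18.0),
    Secis(start=4000, strand="-", infernal=30.0, covels=20.0),  # wrong strand
    Secis(start=9000, strand="+", infernal=8.0, covels=5.0),    # score too low
    Secis(start=3500, strand="+", infernal=15.0, covels=12.0),
]
screened = screen_secis(candidates, gene_strand="+", sec_position=1000)
```

Keeping the distance as a flag rather than a filter leaves the final decision to the manual check of each protein.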

Additionally, for each family of proteins studied we also performed a phylogenetic analysis using the phylogeny.fr website. The tree depiction, with the distances between branches, are represented in the figures throughout the discussion.

Based on all the data we got, we can now discuss our results of the analysis of each protein. There is also a representation of the genes and SECIS elements. It is important to say that the positions of the genes predicted and the positions shown in the schemes correspond to relative positions, which means that the number of nucleotides correspond to the position inside each particular scaffold. In the figures, it is also shown in which exon the hit was found (coloured in green).

This link leads to a text file listing each scaffold and its approximate location in the C. gambianus genome.

We must also mention that the C. gambianus genome is organised in relatively small scaffolds, which made our analysis harder when proteins spanned multiple scaffolds and when searching for SECIS elements in truncated 3’-UTR regions.

GLUTATHIONE PEROXIDASE (GPx) FAMILY

Proteins of the Glutathione Peroxidase (GPx) family are found in the three domains of life and represent the largest selenoprotein family in vertebrates. In mammals, there are 8 paralogs (GPx1, GPx2, GPx3, GPx4, GPx5, GPx6, GPx7 and GPx8). Not all of them have a Sec residue: in some of them this residue is replaced by a Cys. They all have antioxidant activity and are involved in many physiological processes (1).

In Mus musculus, our reference genome, we found that only GPx1, GPx2 and GPx4 are selenoproteins, that GPx5, GPx6, GPx7 and GPx8 have a Cys residue, and that GPx3 has neither. The analysis of selenoproteomes demonstrates a trend towards reduced selenoprotein usage in mammals. Indeed, some family members arose in mammals by gene duplication followed by replacement of Sec with Cys (GPx5 and GPx6).

In C. gambianus we found four GPx proteins with Sec residues (GPx1-4). GPx5, GPx7 and GPx8 have Cys residues instead. GPx6 could not be found in this species; this is discussed further below.


GPx1

GPx1 is the most abundant selenoprotein in mammals (1). When we ran Blast, a large number of hits were found in different scaffolds. This was expected, since this family has many members that are very similar to each other. The T-Coffee alignment gave a high score, with few gaps (only at one end) and few amino acid changes. The Sec residue is conserved, as expected, and at the same position.

The predicted gene is found between positions 94403 and 95193 of the scaffold PVKD010000564.1, on the positive strand, and it is 790 nucleotides long. The gene has 2 exons and the hit was found in the second one. Using SECIS Search3, a SECIS element was found in the 3’-UTR of GPx1, with good Infernal and Covels scores, so we consider this SECIS element correctly predicted. It was found between positions 94808 and 95193 of the scaffold.

For all these reasons we can accept our prediction, confirming that C. gambianus has GPx1.

GPx2

This protein is only present in vertebrates and is mainly found in the epithelium of the gastrointestinal tract.

According to all the results obtained with Tblast and T-Coffee, it is very likely that this protein is found in the scaffold PVKD010000980.1, between positions 51940 and 55261 of the forward strand. It has 2 exons and the hit was found in the second one. Moreover, the predicted protein starts with a Met, which suggests that this is the true beginning of the protein. It contains Sec at the same position as in the reference genome. SECIS Search3 detected two possible SECIS elements for this sequence, one of which was discarded because it was on the reverse strand. The accepted SECIS element was found in the 3’-UTR, between positions 107425 and 107490 of the scaffold, and had an overall A grade, with good Infernal and Covels scores.

For all these reasons we can accept our prediction, confirming that C. gambianus has GPx2.

GPx3

The results obtained indicate that GPx3 is very likely present in C. gambianus. Only a few amino acid changes and a small gap were found. Although the Sec residue is not present in Mus musculus, we did see it in C. gambianus. The predicted gene is located in the scaffold PVKD010003256.1, between positions 128652 and 135818 of the forward strand; it has 5 exons and is 7166 nucleotides long. SECIS Search3 predicted a grade A SECIS element on the positive strand, between positions 165008 and 165082.

For all these reasons we can accept our prediction, confirming that C. gambianus has GPx3.

GPx4

We obtained very good results for this alignment, which could be explained by the fact that this protein is the most conserved in evolution (6). Only four amino acid changes and no gaps were found in the T-Coffee alignment. Both sequences have the Sec residue at the same position. The protein is found in the scaffold PVKD010015909.1, between positions 23844 and 26242 of the forward strand. It has 6 exons and a length of 2398 nucleotides. One grade A SECIS element was found for this sequence. It lies on the same strand, at the end of the predicted protein, between positions 25198 and 25271 of the scaffold, and was therefore validated.

For all these reasons we can accept our prediction, confirming that C. gambianus has GPx4.

GPx5

We had some problems with the alignment of this protein. First, we tried the hit with the lowest E-value, but we got a very poor T-Coffee score and almost none of the amino acids were conserved. We then tried other significant hits in other scaffolds, but the results were the same.

Finally, we changed the reading frame (-f 2) and the results were much better. T-Coffee gave a good score, with some amino acid changes and just two gaps at the ends. Moreover, both sequences start with a Met, so it is very likely that the predicted protein corresponds to GPx5. In M. musculus, this protein has two Sec residues, but neither of them is conserved in the C. gambianus genome.

The predicted gene is found on the reverse strand, between positions 100514 and 93950 of the scaffold PVKD010003030.1. Its length is 72343 and it has 5 exons. Neither a selenoprotein nor SECIS elements could be predicted for the GPx5 of C. gambianus.

For all these reasons we can conclude that C. gambianus has GPx5 but it is no longer a selenoprotein as its Sec is not conserved and no SECIS elements were found.

GPx6

When running Blast and T-Coffee, GPx3 and GPx6 were found in the same scaffold and between the same positions, which indicates that one of the two is not present in C. gambianus. As both belong to the same evolutionary group, they may be so similar that the programs assigned them to the same locus. Biologically, however, this does not make sense, since two different proteins cannot be encoded by exactly the same region. For this reason, even though we obtained very high scores with T-Coffee and Blast in both cases, we tried another scaffold with another E-value for GPx6. The results were not satisfactory either, so we used the phylogenetic tree to decide which protein corresponds to the scaffold.

The tree suggests that the protein found corresponds to GPx3, because both proteins predicted for C. gambianus are more closely related to the GPx3 of Mus musculus.

In addition, SECIS elements were only found for GPx3, further supporting our hypothesis that only this predicted protein acts as a selenoprotein in C. gambianus and that the GPx6 prediction in C. gambianus is in fact GPx3.
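The situation described for GPx3 and GPx6, two queries predicted at the same locus, can be detected automatically with a simple interval comparison. The sketch below is only an illustration: the function name `overlap_fraction` is hypothetical. The GPx3 coordinates are those reported above, and the GPx6 prediction is given the same coordinates because it fell between the same positions of the same scaffold.

```python
def overlap_fraction(a, b):
    """Fraction of the shorter prediction covered by the overlap.

    Each prediction is a (scaffold, start, end) tuple. Predictions on
    different scaffolds never overlap and return 0.0.
    """
    scaf_a, start_a, end_a = a
    scaf_b, start_b, end_b = b
    if scaf_a != scaf_b:
        return 0.0
    lo_a, hi_a = sorted((start_a, end_a))
    lo_b, hi_b = sorted((start_b, end_b))
    overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
    shorter = min(hi_a - lo_a, hi_b - lo_b)
    return max(overlap, 0) / shorter if shorter > 0 else 0.0


gpx3 = ("PVKD010003256.1", 128652, 135818)
gpx6 = ("PVKD010003256.1", 128652, 135818)   # same locus as GPx3
elsewhere = ("PVKD010003256.1", 200000, 205000)  # hypothetical non-overlapping hit

print(overlap_fraction(gpx3, gpx6))       # 1.0: both queries hit the same locus
print(overlap_fraction(gpx3, elsewhere))  # 0.0: distinct loci
```

A fraction close to 1 signals that two family members have been mapped to one gene, which then has to be resolved by other means, such as the phylogenetic tree used here.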



GPx7

To align this protein, we chose the hit in scaffold PVKD010353829.1, found on the forward strand between positions 21 and 293. The T-Coffee results were good: the alignment presented some gaps at both ends, but the rest was well conserved and the sequence starts with a methionine. No Sec residue was found in M. musculus GPx7 nor, according to our results, in C. gambianus. The sequence is 272 nucleotides long and has a single exon. In this case, no SECIS element was found with SECIS Search3.

For all these reasons we can conclude that C. gambianus has GPx7, but it is not a selenoprotein, as it carries no Sec residue and no SECIS elements were found.

GPx8

According to the T-Coffee results, it is very likely that the predicted protein is found between positions 334 and 3656 of the scaffold PVKD010031108.1. There are a few amino acid changes and very few gaps. Its length is 3322 nucleotides, it has 3 exons and it is found on the forward strand. We did not find Sec in either of the two proteins, nor did we find any SECIS element.

For all these reasons we can conclude that C. gambianus has GPx8, but it is not a selenoprotein, as it carries no Sec residue and no SECIS elements were found.


Phylogenetic tree

Three evolutionary groups are known within the GPx family: GPx1/GPx2, GPx3/GPx5/GPx6 and GPx4/GPx7/GPx8 (1). In this phylogenetic tree, 4 evolutionary groups are found, which agrees quite well with the literature. The results showed a cluster formed by GPx7 and GPx8 for both Mus musculus and C. gambianus proteins. Even though GPx4 is not in the same cluster, it is the protein closest to them, so all three can be considered the same evolutionary group, as shown in the paper of Vyacheslav M. et al. GPx1 and GPx2 belong to the same cluster. Finally, the remaining proteins are found in the same group. As stated before, the GPx3 and GPx6 predictions for C. gambianus are the same protein, corresponding to GPx3. We included both predictions in the tree in order to determine which one does not exist. All clusters place the Mus musculus and Cricetomys gambianus homologous proteins together, indicating that our predictions agree with the reference genome.



IODOTHYRONINE DEIODINASE (DI) FAMILY

The second family of selenoproteins found in mammals is the thyroid hormone deiodinases. This family consists of three paralogous membrane proteins involved in the activation and inactivation of thyroid hormone. The three differ in subcellular location, tissue expression and function. DI1 and DI2 activate the hormone by deiodinating the outer ring, which yields T3, whereas DI3 acts as an inactivator in order to regulate the levels of this hormone. In some cases, though, DI1 catalyzes the removal of the inner ring iodine, leading to the formation of inactive thyroid hormone (1).

Deiodinases possess a thioredoxin-fold and show significant intra-family homology since the three of them share a common ancestor (6).

In some species, such as M. musculus, only the DI1 and DI2 paralogs are found. The same result was obtained for C. gambianus.


DI1

In this protein, the Sec residue is not conserved in either species. The T-Coffee results were good: although there were some gaps at the ends, the central part of the sequence was conserved, with few amino acid changes. The hit was found between positions 2734 and 3063 of the scaffold PVKD010015863.1. It lies on the reverse strand, has 1 exon and is 329 nucleotides long. SECIS Search3 was unable to find any SECIS elements in the 3’-UTR region of the predicted DI1 gene sequence.

For all these reasons we can conclude that C. gambianus has DI1, but it is not a selenoprotein, as it carries no Sec residue and no SECIS elements were found.

DI2

In this protein, the Sec residue is only found in C. gambianus. In the T-Coffee results, which are otherwise very good, Sec is aligned with a gap. The scaffold chosen was PVKD010004213.1, and the hit lies between positions 65800 and 56999. It is located on the reverse strand, has a length of 8801 nucleotides and has 2 exons. SECIS Search3 found three SECIS elements for this predicted gene sequence. All three were on the reverse strand and 3’ of the Sec, although one had a better score than the others. This SECIS element, of grade A, was found between positions 52444 and 52370.

For all these reasons we can accept our prediction, confirming that C. gambianus has DI2.


Phylogenetic tree

The results observed in this phylogenetic tree are consistent with the ones we extracted from our data. The C. gambianus proteins are closely related to their homologues in M. musculus, and these pairs then join to form the whole DI family.



SELENOPROTEINS W1 (SelW1), W2 (SelW2), H (SelH) AND T (SelT)

All these proteins belong to the Rdx family. They all have a thioredoxin-like fold and are characterized by a conserved motif (Cys-x-x-Sec). Based on their structure, they are thought to act as thiol-based oxidoreductases, but their exact function is unknown.

SelW is one of the most abundant selenoproteins in mammals. It belongs to the stress-related group and its expression is highly regulated by the availability of Se in the diet.

SelT is thought to play a role in calcium homeostasis and neuroendocrine function. More recently, it was found to be implicated in the regulation of pancreatic cell function and glucose homeostasis as well.

Finally, SelH specifically binds to sequences that contain heat shock and stress response elements, and it is involved in the regulation of transcription of enzymes implicated in GPx de novo synthesis (1).


SelW1

SelW1 was found in the scaffold PVKD010002365.1. The predicted gene is located between positions 85411 and 85888 on the forward strand and contains 3 exons. Its length is 477 nucleotides. In the T-Coffee output, there were some gaps and amino acid changes, but the score was very high. This protein contains two Sec residues in C. gambianus. Remarkably, in M. musculus these two residues have changed to S and W. Two SECIS elements were predicted for this protein, and only one was on the same strand as the predicted protein, the positive strand. This SECIS element had an overall B grade and was located between positions 177101 and 177181 of the scaffold PVKD010002365.1, 3’ of the gene. However, the distance separating it from the gene was about 90 kb, which is not a valid distance by literature standards (2 to 5.2 kb between SECIS and protein in mammals) (6). Thus, no viable SECIS elements were predicted for the SelW1 protein.
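The distance check for SelW1 can be reproduced with the coordinates given above. This small sketch measures from the end of the predicted gene to the start of the SECIS element, which is an assumption, since the exact Sec position is not stated here; the variable names are hypothetical.

```python
GENE_END = 85888      # end of the predicted SelW1 gene on the scaffold
SECIS_START = 177101  # start of the same-strand SECIS prediction

distance_nt = SECIS_START - GENE_END
distance_kb = distance_nt / 1000

# Accepted range for mammals according to the text: 2 to 5.2 kb.
within_range = 2.0 <= distance_kb <= 5.2

print(distance_nt, round(distance_kb, 1), within_range)  # 91213 91.2 False
```

The result, roughly 91 kb, is in line with the approximate 90 kb quoted above and lies far outside the accepted range.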

For all these reasons we can conclude that, although it seems that SelW1 in C. gambianus is still a selenoprotein (as Sec residues are conserved), we cannot conclude it is a functional selenoprotein, as no viable SECIS elements were found. In this case, the UGA codon is probably read as a STOP signal instead of a Sec.

SelW2

According to the T-Coffee results, this protein is found in the scaffold PVKD010009993.1, between positions 46239 and 47120, and it has a length of 881 nucleotides. The alignment showed almost no amino acid changes or gaps. The gene lies on the forward strand and consists of 4 exons. We did not find Sec residues in either species; both have a Cys instead. No SECIS elements were predicted for the SelW2 protein sequence either.

For all these reasons we can conclude that C. gambianus has SelW2, but it is not a selenoprotein, as it carries a Cys instead of Sec and no SECIS elements were found.

Phylogenetic tree

In this phylogenetic tree, both predicted proteins of C. gambianus are closely related to their homologues in M. musculus, and these pairs then join to form the whole family. The results observed are therefore consistent with the ones we extracted from our data.


SelH

SelH is present in the mouse and rat genomes, and both proteins conserve the Cys-x-x-Sec motif. We obtained a high score in the alignment, which showed only 5 amino acid changes and no gaps. This gene is found on the positive strand of the scaffold PVKD010007063.1, between positions 88174 and 87671. It has 3 exons and its length is 13235 nucleotides. SECIS Search3 found two valid SECIS elements in the 3’-UTR region of the predicted SelH protein, with grades B and A respectively. Both SECIS elements were predicted on the reverse strand, and their positions on the scaffold PVKD010015863.1 were 5171-5246 and 86972-87039. We chose the second one (87039-86972), as it had the best score. Only the best scoring SECIS element is depicted in the results table.

For all these reasons we can accept our prediction, confirming that C. gambianus has SelH.

SelT

According to the T-Coffee output, it is very likely that SelT is found between positions 24130 and 37365 of the reverse strand of the scaffold PVKD010015842.1. We obtained the highest score in the alignment, and only a few gaps were found in the 5’ region. The sequence starts with a Met, and the Sec residue is conserved in both species. The length is 503 nucleotides and it has 5 exons. SECIS Search3 predicted one grade A SECIS element for the SelT protein. It was found on the positive strand, between positions 28865 and 28938 of the scaffold PVKD010007042.1.

For all these reasons we can accept our prediction, confirming that C. gambianus has SelT.

METHIONINE SULFOXIDE REDUCTASE A (MsrA)

MsrA is a selenoprotein that acts as a stereospecific methionine-S-sulfoxide reductase, that is, it catalyzes the repair of the S enantiomer of oxidized methionine residues in proteins. It contains the Sec residue in the active site. The Sec-containing MsrA proteins display 10-50 times higher activity than MsrA homologues naturally containing Cys, suggesting that Sec provides catalytic advantages in these redox-active enzymes (1).

As for the rest of the proteins, we started by analyzing the scaffold with the lowest E-value (PVKD010019530.1). However, the T-Coffee results showed an alignment in which the first part was full of gaps, while the second part was very well aligned. This made us think that the protein might be split across two different scaffolds. We therefore analyzed the scaffold with the second lowest E-value (PVKD010004692.1). This alignment showed exactly the opposite: the first part was very well aligned, while the second one was not. This supported our hypothesis: both scaffolds encode parts of the same protein, and its sequence is split between them. This is due to the short length of the scaffolds in the C. gambianus assembly, as mentioned before.
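The reasoning above, two partial hits that together cover the whole query, can be expressed as a coverage check on the query protein. The sketch below uses hypothetical residue ranges and a hypothetical query length, since the exact alignment boundaries are not listed here; the function names are also illustrative.

```python
def merged_coverage(ranges):
    """Merge 1-based inclusive (start, end) query ranges into non-overlapping blocks."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent to the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(ranges, query_len):
    """Fraction of the query protein covered by at least one hit."""
    covered = sum(end - start + 1 for start, end in merged_coverage(ranges))
    return covered / query_len


# Hypothetical values: a 230-residue query and two partial hits.
query_len = 230
hit_first_scaffold = (110, 230)   # aligns well only in the second part
hit_second_scaffold = (1, 115)    # aligns well only in the first part

print(covered_fraction([hit_first_scaffold], query_len))
print(covered_fraction([hit_first_scaffold, hit_second_scaffold], query_len))  # 1.0
```

When neither hit covers the query alone but their union does, the most likely explanation is a gene split across scaffolds rather than two separate genes.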


Scaffold PVKD010019530.1

According to the results obtained with T-Coffee, it is probable that the predicted protein is found in this scaffold, as the score is quite high (940). No Sec residues are found in either the C. gambianus or the M. musculus sequence. The predicted gene is located on the reverse strand, from nucleotide 17092 to nucleotide 14238, and it has a length of 2854 nucleotides. There are 2 exons in this scaffold sequence.


Scaffold PVKD010004692.1

The score obtained in the T-Coffee output was very high (999), which means that it is probable that the predicted protein is located in this scaffold. No Sec residue was found in either species’ sequence here either. The predicted gene is found between nucleotide 102269 and nucleotide 2876 on the reverse strand, and its length is 99393 nucleotides. There are 5 exons in this scaffold sequence.

As stated before, this protein is located on the reverse strand, which means that its 3’ end lies at the beginning of the scaffold sequence. For this reason we ran SECIS Search3 on the scaffold PVKD010004692.1.

SECIS Search3 predicted a SECIS element between positions 99675 and 99567 of the scaffold PVKD010004692.1. This SECIS element had an overall B grade, with low Infernal and Covels scores, and it was found on the same strand as the protein.

For all these reasons we can conclude that C. gambianus has MsrA, but it is no longer a selenoprotein as its Sec is not conserved.


SECIS BINDING PROTEIN 2 (SBP2)

SBP2 is a protein that takes part in the mechanism and regulation of Sec incorporation into proteins, as it is required for efficient recoding of UGA as Sec in eukaryotes, together with another protein (eEFSec) (1).

We started by analyzing the scaffold with the lowest E-value, but the T-Coffee score was too low and the alignment showed many gaps and amino acid changes, with only a few scattered matches. We then tried the scaffold with the second lowest E-value, but the outcome was similar. We then noticed that the exonerate results showed a protein with too many exons, some of which even overlapped. We thought that this could be due to alternative exons of the protein, so we retried using only the exons corresponding to one protein.

The T-Coffee results were also poor, so we decided to repeat the process with a different reading frame (-f 2). Finally, we obtained a very good result using the scaffold with the best E-value (PVKD010006502.1) and the new reading frame.

The score obtained from the T-Coffee output is 998, with almost no gaps and just some amino acid changes. No Sec residues were found in either species. The predicted gene is located between nucleotide 112116 and nucleotide 35680 of the forward strand, and its length is 23564 nucleotides. It has 17 exons, which is quite a lot, but since this protein has three domains it makes sense that the coding region is large (1). SECIS Search3 did not predict any SECIS elements in the 3’-UTR region of the predicted SBP2 protein sequence, which makes sense since we did not find Sec.

For all these reasons we can conclude that SBP2 is not a selenoprotein. This result makes sense, since this protein belongs to the machinery involved in the synthesis of selenoproteins (1).

SELENOPHOSPHATE SYNTHETASE (SPS) FAMILY

This family comprises two proteins: SPS1 and SPS2. Initially, they were grouped together because both were thought to have SPS function. However, it has been shown that SPS2 can generate selenophosphate, whereas SPS1 cannot. Several findings suggest that SPS2 is required for de novo synthesis of selenophosphate, while SPS1 may have a role in Sec recycling through a Se salvage system. In addition, since SPS2 is a selenoprotein itself, it possibly serves as an autoregulator of selenoprotein synthesis (1).


SPS1

The T-Coffee score is so high (1000) that we can be almost certain that the SPS1 protein is found in the scaffold PVKD010000387.1. The alignment is also perfect, with the exception of a single amino acid change. No Sec residue is found in either of the compared sequences. The predicted gene is located between nucleotide 390929 and nucleotide 411306 of the forward strand, and it is 20377 nucleotides long. It has 8 exons. No SECIS elements were found in this sequence.

For all these reasons we can conclude that C. gambianus has SPS1, but it is not a selenoprotein, as it carries no Sec residue and no SECIS elements were found.

SPS2

From the T-Coffee output, it is probable that the protein is located in the scaffold PVKD010013165.1, as the score is very high (999). The alignment shows a single gap at the 5’ end and a few amino acid changes. Our predicted protein also starts with a methionine, so we consider it a good prediction. We found a Sec residue in both sequences, M. musculus and C. gambianus. This gene has a single exon and is located between nucleotides 10245 and 11588 of the forward strand, with a length of 1343 nucleotides. SECIS Search3 detected one grade A SECIS element for the SPS2 protein, between positions 12115 and 12191 of the scaffold PVKD010013165.1, in the 3’-UTR region of this protein.

For all these reasons we can accept our prediction, confirming that C. gambianus has SPS2.


Phylogenetic tree

The results observed in this phylogenetic tree are coherent with the ones we extracted from our data. Both proteins of the C. gambianus are closely related to their homologues in M. musculus and afterwards they link the whole SPS family of proteins.



SELENOPROTEIN I (SelI)

The transmembrane protein SelI is a recently evolved selenoprotein which is found only in vertebrates. The physiological function of the Sec-containing SelI has to be further examined (1).

The highest possible T-Coffee score (1000) was obtained, meaning that it is probable that the predicted protein is located in the scaffold PVKD010011778.1. The alignment was almost perfect, with only a few amino acid changes and small gaps. We found the Sec residue in both sequences, C. gambianus and M. musculus. The predicted gene is located between nucleotide 1782 and nucleotide 39026 of the forward strand, and it is 37244 nucleotides long. It has 10 exons. The SECIS element found for the predicted SelI gene sequence was located 3’ of the selenocysteine, on the forward strand, between positions 13706 and 13783 of the scaffold PVKD010011778.1. It had an overall A grade, with good Infernal and Covels scores.

For all these reasons we can accept our prediction, confirming that C. gambianus has SelI.

SELENOPROTEIN 15 (Sel15) AND SELENOPROTEIN M (SelM)

Both Sel15 and SelM are thioredoxin-like fold ER-resident proteins that form a distinct selenoprotein family. On the one hand, Sel15 is thought to mediate the cancer prevention effect of dietary Se and regulation of redox homeostasis in the ER. On the other hand, SelM is a distant homolog of Sel15, which was identified by bioinformatics approaches. The presence of redox-active motifs and structural similarities to other thioredoxin-fold oxidoreductases suggest that Sel15 and SelM may catalyze the reduction or rearrangement of disulfide bonds in the ER-localized or secretory proteins (1).


Sel15

As for the other proteins, we started by analyzing the scaffold with the lowest E-value. However, the results obtained with T-Coffee were not good, so we decided to work with the second lowest E-value, corresponding to the scaffold PVKD010011498.1. The T-Coffee score was very good (1000), so we can be almost sure that our predicted protein is located in this scaffold. The alignment was very good overall, with only some amino acid changes and one gap. We also found the Sec residue in both of the studied sequences. The predicted gene is located between nucleotide 10111 and nucleotide 41086 on the forward strand, and it has a length of 30975 nucleotides. It has 5 exons. One grade A SECIS element was predicted for the Sel15 protein. It was found between positions 10795 and 10873 of the scaffold PVKD010011498.1, 3’ of the predicted selenocysteine and on the same strand as the predicted gene.

For all these reasons we can accept our prediction, confirming that C. gambianus has Sel15.

SelM

According to the results obtained with T-Coffee, we can be fairly sure that the predicted protein is found in the scaffold PVKD010014587.1, as the score is the highest possible (1000). The alignment is almost perfect, although some amino acid changes can be found. The Sec residue is conserved in both sequences, C. gambianus and M. musculus. The predicted gene is located between nucleotide 3785 and nucleotide 5927 on the reverse strand, and it is 2142 nucleotides long. It has 5 exons. SECIS Search3 found a grade A SECIS element on the reverse strand, the same strand as the predicted gene. It was located 3’ of the selenocysteine, between positions 3672 and 3743 of the scaffold PVKD010014587.1, and had good Infernal and Covels scores.

For all these reasons we can accept our prediction, confirming that C. gambianus has SelM.


SELENOPROTEIN K (SelK) AND SELENOPROTEIN S (SelS)

Even though SelK and SelS show no sequence similarity, they can be assigned to a single SelK/SelS family of related selenoproteins based on their topology, the presence of a glycine-rich segment and a characteristic location of the Sec residue at the COOH-terminal end of the protein. This family is the most widespread eukaryotic selenoprotein family, with members present in nearly all known Se-utilizing organisms. Both have been implicated in ER-associated degradation of misfolded proteins (1).


SelK

The T-Coffee score indicates that it is probable that this protein is located in the scaffold PVKD010007843.1, as the value is high (997). There are some amino acid changes and a few small gaps, but the fact that the sequence starts with a methionine suggests that it is a good prediction. We found the Sec residue in both of the studied sequences. The predicted gene is located between nucleotide 63806 and nucleotide 64090 on the forward strand, and it is 284 nucleotides long. It has 1 exon. No SECIS elements were found for this sequence.

For all these reasons we can conclude that, although it seems that SelK in C. gambianus is still a selenoprotein (as Sec residues are conserved), we cannot conclude it is a functional selenoprotein, as no SECIS elements were found. The UGA codon is probably read as a STOP codon in C. gambianus.

SelS

In the T-Coffee output we obtained a very high score (999), indicating that it is probable that the predicted protein is located in the scaffold PVKD010001728.1. Although there are some amino acid changes and small gaps, the alignment is very good. We found the Sec residue in the M. musculus sequence, but not in the C. gambianus one. The predicted gene is located between nucleotide 134279 and nucleotide 142479 on the forward strand, and it has a length of 8200 nucleotides. It has 6 exons. Three SECIS elements were predicted for SelS, but one was discarded for being on the wrong strand. Of the remaining two, only one is considered valid, because the other had low Infernal and Covels scores. The valid SECIS element, shown in the results table, is found on the positive strand between positions 177085 and 177172 of the scaffold PVKD010001728.1, and has an overall A grade.

For all these reasons we can conclude that C. gambianus has SelS but it is no longer a selenoprotein as its Sec is not conserved even if viable SECIS elements were found.

SELENOPROTEIN N (SelN)

SelN was one of the first selenoproteins identified through bioinformatic approaches. It is an ER-resident transmembrane glycoprotein that plays an important role in the maintenance of satellite cells and is required for regeneration of skeletal muscle tissue following stress or injury (1).

The T-Coffee score was very high (998), meaning that it is probable that the predicted protein is located in the scaffold PVKD010009337.1. There are some amino acid changes and a single gap, but overall the alignment is well conserved. The Sec residue is also conserved in both sequences, M. musculus and C. gambianus. The predicted gene is located between nucleotide 35198 and nucleotide 46051 on the reverse strand, and it has a length of 10853 nucleotides. It has 12 exons. One SECIS element was found in the 3’-UTR region of this protein. It had good Infernal and Covels scores and an overall A grade. It was found on the same strand as the predicted protein, the reverse strand, between positions 34061 and 34151 of the scaffold PVKD010009337.1.

For all these reasons we can accept our prediction, confirming that C. gambianus has SelN.

THIOREDOXIN REDUCTASE (TR) FAMILY

This family of selenoproteins, together with thioredoxin, constitutes the major reduction system of the cell. In mammals, there are three isoforms (TR1, TR2, TR3), all of which contain selenocysteine.

TR1 is involved in the NADPH-dependent reduction of Trx1. This protein takes part in the control of many physiological processes, such as antioxidant defense, apoptosis and transcription regulation. It also acts as an electron donor for enzymes activated by redox reactions. The second member of the family, TR3, is involved in the reduction of mitochondrial thioredoxin. Finally, the third isoform is TR2. This protein differs slightly from the other two in its structure: it has a glutaredoxin (Grx) domain that gives it Grx activity in addition to the Trx one. However, its function remains unknown.

When we analyzed the exon distribution of TR1 and TR2, we found that they were almost identical, differing in only one exon. In our opinion, this might be due to alternative splicing, which would allow these two isoenzymes to exist (1).


TR1

Given the T-Coffee results, it is probable that the predicted protein is located in scaffold PVKD010006181.1, as the score is very high (999). There are some amino acid changes at the beginning of the sequence, but the rest is well conserved. The protein also starts with a methionine, which supports the prediction. The Sec residue is conserved in both sequences. The predicted gene is located between nucleotides 53709 and 72789 on the forward strand and is 22659 nucleotides long. It has 14 exons. SECISearch3 predicted a grade A SECIS element for this protein. It lies on the same strand as the predicted TR1 protein, the forward strand, between positions 66841 and 66918 of scaffold PVKD010006181.1, 3’ of the selenocysteine.

For all these reasons we can accept our prediction, confirming that C. gambianus has TR1.

TR2

To study this protein, we had to choose the scaffold with the second-lowest E-value, as the first one was not good enough. The T-Coffee score is high (997), which makes it probable that the predicted protein is located in scaffold PVKD010006181.1. The alignment shows a few amino acid changes and a single small gap. The Sec residue is conserved in both the M. musculus and C. gambianus sequences. The predicted gene is located between nucleotides 24690 and 76368 on the forward strand and has a length of 51613 nucleotides. It has 13 exons. One grade A SECIS element was predicted for TR2 on the forward strand, between positions 37828 and 37905 of scaffold PVKD010006181.1. It had good Infernal and Covels scores.

For all these reasons we can accept our prediction, confirming that C. gambianus has TR2.

TR3

This alignment gave the highest score (1000), with no gaps and few amino acid changes. It is therefore very likely that the protein is located between positions 5981 and 35538 on the forward strand of scaffold PVKD010012172.1. The Sec residue is conserved in both the M. musculus and C. gambianus sequences. The predicted gene has 16 exons and spans 29557 nucleotides. SECISearch3 predicted one SECIS element for the predicted TR3 protein, on the forward strand between positions 11127 and 11202 of scaffold PVKD010012172.1. It lies 3’ of the selenocysteine and has an overall grade A.

For all these reasons we can accept our prediction, confirming that C. gambianus has TR3.


Phylogenetic tree

According to the literature, TR2 differs somewhat from the other two isoforms, TR1 and TR3 (1). In this phylogenetic analysis, TR1 and TR3 are indeed each closely related to their homologue in the other species. However, the results for TR2 are surprising: TR2 from Mus musculus follows the pattern described in the literature, but TR2 from Cricetomys gambianus does not. If we look at the scaffolds from which the proteins were extracted, TR1 and TR2 both come from the same scaffold, and in our analysis the two proteins are nearly identical, differing in only one exon. In our opinion, this might be due to alternative splicing, which would allow these two isoenzymes to exist in Cricetomys gambianus. It could also be that our predicted protein does not include the Grx domain, which normally sets TR2 apart from the other TR family isoforms, because this domain may have been lost from the Cricetomys gambianus TR2 gene.

Conclusions

Creation of a semi-automatic program.

We managed to automate nearly all the steps detailed in the methods section. The result was a Perl program that takes the query protein from the reference genome in FASTA format and, once the correct scaffold has been selected from the tblastn output and the first and last nucleotides of the hit have been entered, outputs the predicted protein sequence and its T-Coffee alignment with the query.
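The manual step in this workflow, choosing a scaffold from the tblastn output and noting the first and last nucleotide of the hit, could itself be scripted. The following Python sketch is not the Perl program described above. It assumes that tblastn was run with the standard 12-column tabular output (`-outfmt 6`), which the text does not state, and all function names and sample values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    scaffold: str
    sstart: int
    send: int
    evalue: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse BLAST tabular lines (qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore) into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(fields[1], int(fields[8]), int(fields[9]), float(fields[10])))
    return hits


def rank_scaffolds(hits: List[Hit]) -> List[str]:
    """Order scaffolds by their best (lowest) E-value."""
    best = {}
    for hit in hits:
        if hit.scaffold not in best or hit.evalue < best[hit.scaffold]:
            best[hit.scaffold] = hit.evalue
    return sorted(best, key=best.get)


def hit_range(hits: List[Hit], scaffold: str) -> Tuple[int, int]:
    """First and last scaffold nucleotide covered by any hit on the scaffold."""
    coords = [c for h in hits if h.scaffold == scaffold for c in (h.sstart, h.send)]
    return min(coords), max(coords)


# Hypothetical tblastn output: two hits on scafA (reverse strand), one on scafB
sample = (
    "query\tscafA\t90.0\t100\t5\t0\t1\t100\t500\t201\t1e-50\t200\n"
    "query\tscafB\t80.0\t100\t9\t0\t1\t100\t10\t309\t1e-20\t100\n"
    "query\tscafA\t85.0\t50\t4\t0\t101\t150\t900\t751\t1e-10\t80\n"
)
hits = parse_tabular(sample)
ranking = rank_scaffolds(hits)
first_nt, last_nt = hit_range(hits, ranking[0])
```

Keeping the full ranking, and not only the top scaffold, makes it easy to fall back to the scaffold with the second-lowest E-value when the first one does not give a usable prediction, as happened for TR2.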

A total of 15 selenoproteins have been found in the Cricetomys gambianus genome. We also found 9 Cys-containing homologues and 2 UGA codons that are likely to be read as STOP codons.

Although not all of these proteins begin with a methionine, we have considered them selenoproteins because they have a selenocysteine and a SECIS element located 3’ of the predicted selenocysteine residue. These proteins are GPx1, GPx2, GPx3, GPx4, SelH, SelT, SelI, Sel15, SelM, SelN, TR1, TR2, TR3, DI2 and SPS2. DI1, GPx5, GPx7, GPx8, MsrA, SelS, SelW2, SBP2 and SPS1 have a Cys residue in place of the Sec residue and are therefore Cys-containing homologues. SelK and SelW1 conserve the UGA codon, which is probably read as a STOP codon, as discussed earlier.

The Mus musculus selenoprotein GPx6 was not found in the Cricetomys gambianus genome.
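The decision rule used in this classification can be written down explicitly. The sketch below is a simplification under stated assumptions: the residue aligned to the reference Sec is given as a one-letter code ("U" for selenocysteine, "C" for cysteine), and a UGA codon without a SECIS element 3’ of it is taken as a probable STOP. The function name and category labels are hypothetical.

```python
def classify(aligned_residue: str, has_3prime_secis: bool) -> str:
    """Classify a predicted protein from the residue aligned to the reference Sec.

    aligned_residue: one-letter code at the Sec position ("U" = Sec, "C" = Cys).
    has_3prime_secis: whether a SECIS element was predicted 3' of that position.
    """
    if aligned_residue == "U":
        if has_3prime_secis:
            return "selenoprotein"
        return "UGA probably read as STOP"
    if aligned_residue == "C":
        return "Cys-containing homologue"
    return "unclassified"


# Illustrative calls following the cases described in the text
seln_class = classify("U", True)    # Sec conserved, SECIS found
sels_class = classify("C", False)   # Cys in place of Sec
selk_class = classify("U", False)   # UGA conserved, no viable SECIS
```

The presence of an initial methionine is deliberately left out of the rule, in line with the choice described above.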

3 machinery proteins have been found; only 1 conserves the selenocysteine residue.

SPS2 is the only machinery protein found in Cricetomys gambianus that retains its selenocysteine. Both SPS1 and SBP2 are present, but they have a cysteine residue in place of the selenocysteine and are therefore cysteine-containing homologues.

13 selenoproteins are conserved in both the Mus musculus and Cricetomys gambianus genomes.

GPx1, GPx2, GPx4, SelH, SelT, SelI, Sel15, SelM, SelN, TR1, TR2, TR3 and SPS2 have been conserved throughout evolution.

3 selenoproteins are present in Mus musculus but their selenocysteine is not conserved in the Cricetomys gambianus genome: they are either Cys homologues or their UGA codon is read as a STOP signal.

GPx5, SelK and SelS are selenoproteins in Mus musculus. GPx5 and SelS have a cysteine in place of the selenocysteine in Cricetomys gambianus. SelK contains two selenocysteines, but no viable SECIS element could be found 3’ of this gene, so its UGA codons are probably read as STOP signals.

2 selenoproteins are present in Cricetomys gambianus but they are Cys homologues in the Mus musculus genome.

GPx3 and DI2 retain their selenocysteine in Cricetomys gambianus. Their homologues in the Mus musculus genome did not conserve it.

8 selenoproteins were not conserved in either the Cricetomys gambianus or the Mus musculus genome.

GPx7, GPx8, DI1, SelW1, SelW2, MsrA, SBP2 and SPS1 have been identified in the genomes of both species, but their selenocysteines were not conserved, which suggests that these proteins have lost their selenocysteine during evolution in these species.
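The four tallies above (13, 3, 2 and 8) can be cross-checked with simple set operations. In this Python sketch, the Cricetomys gambianus sets are copied from the lists given in the conclusions, while the Mus musculus set is only inferred from the statements in this section (the 13 shared selenoproteins plus GPx5, SelK and SelS); it is not an independent annotation.

```python
# Cricetomys gambianus classification, as listed in the conclusions
cg_sec = {"GPx1", "GPx2", "GPx3", "GPx4", "SelH", "SelT", "SelI", "Sel15",
          "SelM", "SelN", "TR1", "TR2", "TR3", "DI2", "SPS2"}
cg_cys = {"DI1", "GPx5", "GPx7", "GPx8", "MsrA", "SelS", "SelW2", "SBP2", "SPS1"}
cg_stop = {"SelK", "SelW1"}

# Mus musculus Sec-containing proteins, inferred from the text (assumption)
mm_sec = {"GPx1", "GPx2", "GPx4", "SelH", "SelT", "SelI", "Sel15", "SelM",
          "SelN", "TR1", "TR2", "TR3", "SPS2", "GPx5", "SelK", "SelS"}

sec_in_both = cg_sec & mm_sec                 # Sec conserved in both species
sec_only_mouse = mm_sec - cg_sec              # Sec in mouse, lost in C. gambianus
sec_only_cg = cg_sec - mm_sec                 # Sec in C. gambianus only
sec_in_neither = (cg_cys | cg_stop) - mm_sec  # Sec in neither species

for label, group in [("both", sec_in_both), ("mouse only", sec_only_mouse),
                     ("C. gambianus only", sec_only_cg), ("neither", sec_in_neither)]:
    print(f"{label}: {len(group)} -> {', '.join(sorted(group))}")
```

Under these assumptions, the four groups contain 13, 3, 2 and 8 proteins, matching the lists given in this section.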

The main weakness of this project is the shortness of the scaffolds provided for the Cricetomys gambianus genome.

Both our protein prediction and our SECIS analysis could have been affected by the short length of the available scaffolds. Protein sequence prediction becomes more complex when protein alignments extend across multiple scaffolds in an assembly with an identity percentage lower than 80%, as is the case for the Cricetomys gambianus genome. The search for SECIS elements is also affected, because short scaffolds limit the number of nucleotides that can actually be searched for SECIS elements.

REFERENCES

(1) Labunskyy VM, Hatfield DL, Gladyshev VN. Selenoproteins: molecular pathways and physiological roles. Physiol Rev 2014; 94(3):739-777

(2) Qazi IH, Angel C, Yang H, Pan B, Zoidis E, Zeng CJ et al. Selenium, selenoproteins, and female reproduction: a review. Molecules 2018; 23(12):3053

(3) Mariotti M, Ridge PG, Zhang Y, Lobanov AV, Pringle TH, Guigó R et al. Composition and evolution of the vertebrate and mammalian selenoproteomes. PLoS One 2012; 7(3):e33066

(4) Mangiapane E, Pessione A, Pessione E. Selenium and selenoproteins: an overview on different biological systems. Curr Protein Pept Sci 2014; 15(6):598-607

(5) Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, Guigó R, Gladyshev VN. Characterization of mammalian selenoproteomes. Science 2003; 300(5624):1439-1443

(6) Mariotti M, Lobanov AV, Guigó R, Gladyshev VN. SECISearch3 and Seblastian: new tools for prediction of SECIS elements and selenoproteins. Nucleic Acids Res 2013; 41(15):e149

(Photographs) All photographs on this website are the intellectual property of Joel Sartore, photographer for National Geographic.

Contact

The Team

We are a group of fourth-year Human Biology students from University Pompeu Fabra in Barcelona. This project is part of the Bioinformatics course, carried out from October to December 2019. Please feel free to contact us with any questions about our work.

Susana Güerri Ferrández

susanapilar.guerri01@estudiant.upf.edu

Clàudia Río Bergé

claudia.rio01@estudiant.upf.edu

Maria Sopena Rios

maria.sopena01@estudiant.upf.edu

Paula Wegbrans Giró

paula.wegbrans01@estudiant.upf.edu