Conclusions

The aim of our work was to annotate the selenoproteins found in the genome of Ceratotherium simum using an in silico approach. The data generated will be useful to understand the evolution of selenoproteins and to improve our knowledge of them, thus allowing further functional studies related to physiological processes or diseases. The study was based on the idea that there is enough homology between phylogenetically close species to predict new selenoproteins from a well-annotated organism.

Homology is the main strength of the study; we therefore chose our query set carefully so as not to miss any relevant hit. The queries were obtained from SelenoDB for three carefully selected species: human, mouse and horse. There were two main selection criteria: how well the queries are annotated and how phylogenetically close the species is to C. simum. The human genome is the best annotated, whereas the horse genome is the closest to C. simum, which reduces the risk of missing proteins; the mouse queries help to deal with any possible random divergence. In this way we tried to make the search exhaustive and robust to random divergence, since the queries come from partially different evolutionary origins. Furthermore, the amount of data was not a problem a priori, because our program handled all the given inputs automatically. These analyses were complemented with SECIS analysis and Selenoprofiles predictions to check that our predictions were correct.

In general, the human queries were enough to find the protein in C. simum. However, on some occasions the other queries in the system were really useful. For instance, the correct prediction of the SBP2/SBP2-like selenoproteins was only possible with those queries, as was the correct characterization of Sel15 and the SPS family; they also supported some of the isoform suggestions.

The results of this project are the following:

  • 29 predicted selenoproteins
  • 8 predicted cysteine homologous proteins
  • 5 predicted selenoprotein synthesis machinery proteins
  • 3 predicted tRNAs
  • 17 studied unidentified genomic elements
  • 3 identified duplications along evolution

Some of these results are especially interesting. For instance, our results support the MsrA and SelR2 cluster described in the literature: these selenoproteins are reported to be sometimes clustered or fused, and our system suggests that such a cluster may exist. Moreover, we found the SBP2-like selenoprotein and described the phenomenon that took place in SelV evolution. We were also able to describe several possible pseudogenes, and we suggest some isoforms that should be tested experimentally to confirm them. In fact, we suggest that further research on isoforms is needed to obtain a clear picture of them.

To conclude, we would like to highlight that our study supports the evolutionary theory previously proposed. Specifically, this applies to our results for the GPx family (namely GPx5 and GPx6 and their tandem duplication, and the deletions present in GPx4, GPx7 and GPx8), together with the origin of SelV and SelW. Moreover, the SBP2 duplication, present only in vertebrates, is also consistent with it, as are the functional duplication in SelO and the slightly divergent C-terminal domain of SelP.

Regarding the limitations of our project, we recognize three main weak points in our approach. The first is that we trusted the queries found in databases, which could be wrongly annotated. To address this, we used a set of complementary programs with their own database selection, Seblastian and Selenoprofiles, and compared their results with ours. However, another limiting factor remains: all the programs cited above also rely on homology for prediction, so we could be missing hits that have not been described yet. This also raises the problem of the homology threshold, which restricts the proteins that are predicted. Further research should aim to improve selenoprotein prediction techniques; otherwise, selenoproteins that are very divergent or absent from the databases will remain invisible to our methods. Moreover, those programs had limitations of their own. For instance, in certain situations Selenoprofiles did not give all the expected data. This happened specifically with the GPx family, due to a bug, and also with the SelV/SelW case. Another limitation is related to the exonerate program, which can wrongly predict some selenoproteins, as we described in the discussion for SelS. In this case, the stop codon of the predicted protein is UGA, but exonerate predicted a selenocysteine there, which is possibly not the case. Finally, it was not able to predict some proteins that we suggest are actually present in C. simum, such as SelO2 or SelU2. In the end, with all the information obtained with these methods and an extensive literature search for each protein, we were able to determine the different selenoproteins in C. simum.
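The UGA ambiguity mentioned above for SelS can be illustrated with a small check. The following is only a minimal sketch, not part of our pipeline: the function name classify_tga_codons and the toy sequences are hypothetical, and it assumes that the predicted coding sequence is already in frame. It separates in-frame UGA (TGA) codons found inside the sequence, which are candidate selenocysteine positions, from a UGA found as the last codon, which may simply be the stop codon.

```python
def classify_tga_codons(cds):
    """Classify in-frame TGA (UGA) codons of a coding sequence.

    Hypothetical helper for illustration only. Assumes `cds` starts in frame.
    Returns (internal, terminal):
      internal: 0-based codon indices of TGA codons before the last codon
                (candidate selenocysteine positions).
      terminal: True if the last complete codon is TGA (possibly just a stop).
    """
    # Accept RNA or DNA input, in upper or lower case.
    seq = cds.upper().replace("U", "T")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    internal = [i for i, codon in enumerate(codons[:-1]) if codon == "TGA"]
    terminal = bool(codons) and codons[-1] == "TGA"
    return internal, terminal


# Toy sequences (made up, not C. simum data).
print(classify_tga_codons("ATGGCTTGAGGTTAA"))  # internal TGA, TAA stop
print(classify_tga_codons("ATGGCTGGTTGA"))     # TGA only as the last codon
```

A terminal UGA flagged in this way would still need supporting evidence, such as a SECIS element or conservation in the queries, before being called a selenocysteine.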

Further research should aim to characterize more selenoproteomes, in order to fill in the evolutionary tree and to understand the specific function of those selenoproteins whose function remains unknown.