Haplochromis burtoni

Conclusion

The purpose of this project was to contribute to the annotation of the Haplochromis burtoni genome by documenting the selenoproteins found within it. To achieve this goal, homologues of zebrafish selenoproteins were searched for in the Haplochromis burtoni genome. The zebrafish genome was chosen as the reference because zebrafish is phylogenetically close to our organism of interest and its genome is very well annotated. For misannotated zebrafish proteins (those that did not start with methionine), we used queries from the human database instead.

To accomplish this, a Python program was created. With this program, it was possible to take the queries from the reference species (human and Danio rerio) and look for homologous regions in the Haplochromis burtoni genome. This yielded a set of hits from which the corresponding protein in our model organism could be predicted. Once a prediction was obtained, it was aligned with its reference query using T-Coffee and the alignment was analyzed. All of this information allowed us to determine whether a human or zebrafish selenoprotein homologue exists in the Haplochromis burtoni genome and, if so, whether the selenocysteine residue is conserved or has been lost. Additionally, the selenoprotein prediction server Seblastian was used to assess whether SECIS elements were present for the predicted proteins, and therefore whether these could indeed be considered selenoproteins.
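The decision between "selenocysteine conserved" and "lost" can be illustrated with a minimal sketch. It is not the program used in this project: the function name is hypothetical, and it assumes that the query and the predicted protein are given as two rows of a pairwise alignment of equal length, with gaps written as "-" and selenocysteine written as "U" in both sequences.

```python
def classify_sec_positions(aligned_query, aligned_target):
    """For every selenocysteine (U) in the query, report what the target
    carries in the same alignment column.

    Assumption: both strings are rows of the same alignment (equal length,
    gaps as '-') and selenocysteine is written as 'U'.
    Returns a list of (query_position, label), positions being 1-based
    in the ungapped query.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")

    results = []
    query_pos = 0  # position in the ungapped query sequence
    for q, t in zip(aligned_query.upper(), aligned_target.upper()):
        if q != "-":
            query_pos += 1
        if q == "U":
            if t == "U":
                label = "selenocysteine conserved"
            elif t == "C":
                label = "cysteine homologue"
            elif t == "-":
                label = "position not aligned"
            else:
                label = f"replaced by {t}"
            results.append((query_pos, label))
    return results


# Toy alignment: the query has U at position 3, the target has C there.
print(classify_sec_positions("MKU-AC", "MKCGAC"))
```

A label such as "position not aligned" would still need to be checked by eye, since it may reflect an incomplete prediction rather than a real loss.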

Of the 41 proteins characterized, the classification is as follows:

Selenoproteins (selenocysteine): GPx1, GPx2, GPx3, GPx4, DI1, DI2, DI3, Fep15, Sel15, SelH, SelI, SelJ, SelK, SelM, SelN, SelO, SelP, SelR1, SelT, SelU1, SelW, TR1, TR2, TR3.

Cys-containing homologues: SelU2, SelU3, SPS, GPx7, GPx8, MsrA, SelI*, SelL*, SelR2, SelR3, SelS*.

Selenoprotein machinery identified in H. burtoni: PSTK, SecS, eEFSec, Secp43, SBP2, SPS2.

Proteins not found in H. burtoni: SelV, SelW2**, GPx5, GPx6, SPS1.

*SelI, SelL and SelS: these have lost the selenocysteine residue when compared with the reference genome.
**SelW2: this protein is not found in H. burtoni but is present in humans.

Some of these proteins were predicted with the initial methionine and others without it:

Selenoproteins with Met: DI, GPx2, Secp43, SecS, SelI, SelK, SelO, SelP, SelR, SelR1, SelR3, SelT, SPS, SPS1, SPS2.

Selenoproteins without Met: DI1, DI2, DI3, eEFSec, Fep15, GPx1, GPx3, GPx4, GPx5, GPx6, GPx7, GPx8, MsrA, PSTK, SBP2, Sel15, SelH, SelJ, SelL, SelM, SelN, SelR2, SelS, SelU, SelU1, SelU2, SelU3, SelW, SelW2, TR1, TR2, TR3.

For several proteins, the initial methionine was not correctly predicted, for different reasons. Because a homology-based approach was used, it was assumed that the protein in H. burtoni starts at the same position as the zebrafish or human protein, which is not necessarily the case. In addition, finding another amino acid where methionine was expected may be explained by misannotations in the H. burtoni genome.

With the program and methodology explained above, 41 of the 150 evaluated predicted proteins were identified in the H. burtoni genome: 24 selenoproteins, 11 cysteine-containing homologues and 6 selenoprotein machinery proteins. Among these predictions, some queries align to the same position in the genome, so the number of predicted proteins is overestimated. Moreover, we added human homologues of some zebrafish proteins as queries to avoid losing proteins. This has also contributed to predicting more proteins than are really present in the H. burtoni genome, because homologous queries from different species produce duplicate predictions of the same orthologue.
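One possible way to reduce this overestimation is to group the hits that fall on the same genomic region before counting them. The sketch below is only an illustration under stated assumptions, not part of the program described here: the function name, the tuple layout (query, scaffold, strand, start, end) and the example identifiers are hypothetical, and coordinates are assumed to be 1-based and inclusive with start <= end.

```python
from collections import defaultdict


def cluster_overlapping_hits(hits):
    """Group hits that overlap on the same scaffold and strand.

    hits: iterable of (query, scaffold, strand, start, end) tuples
    (hypothetical layout; start <= end, inclusive coordinates).
    Returns a list of dicts, one per genomic region, each listing
    the queries that hit that region.
    """
    by_location = defaultdict(list)
    for query, scaffold, strand, start, end in hits:
        by_location[(scaffold, strand)].append((start, end, query))

    clusters = []
    for (scaffold, strand), group in sorted(by_location.items()):
        group.sort()  # by start, then end
        current = None
        for start, end, query in group:
            if current is not None and start <= current["end"]:
                # Overlaps the region being built: extend it.
                current["end"] = max(current["end"], end)
                current["queries"].append(query)
            else:
                current = {"scaffold": scaffold, "strand": strand,
                           "start": start, "end": end, "queries": [query]}
                clusters.append(current)
    return clusters


example_hits = [
    ("zebrafish_GPx1", "scaffold_1", "+", 100, 500),
    ("human_GPx1", "scaffold_1", "+", 120, 510),
    ("zebrafish_SelK", "scaffold_2", "-", 10, 90),
]
for region in cluster_overlapping_hits(example_hits):
    print(region)
```

Counting regions rather than hits would treat the zebrafish and human queries in this toy example as a single predicted gene. Each region listing more than one query would still have to be reviewed manually.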

Limitations of the program

This program has some restrictions that must be taken into account. The first limitation arises when predicting proteins from families whose members have very similar sequences. In these cases, each query aligns to each of the genes that make up the family in the H. burtoni genome. As a consequence, the number of possible genes coding for each protein is overestimated. The alignments must therefore be analyzed manually to determine which gene actually codes for each query protein. This manual step can itself introduce annotation errors, which can be rectified by performing phylogenetic analyses.
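As a first pass before the manual inspection, the hits within a family could be ranked so that each query is paired with the gene it matches best. The following sketch uses a simple greedy one-to-one assignment. It is a hypothetical helper, not part of our program, and it assumes that every hit comes with some numeric score (for instance an alignment score or percent identity) where higher means better. It does not replace a phylogenetic analysis.

```python
def assign_queries_to_loci(scored_hits):
    """Greedy one-to-one pairing of queries and loci.

    scored_hits: iterable of (query, locus, score) tuples, where a
    higher score means a better match (assumed scoring scheme).
    Hits are visited from best to worst; a pair is accepted only if
    neither its query nor its locus has been assigned already.
    Returns a dict mapping query -> locus.
    """
    assignment = {}
    used_loci = set()
    ranked = sorted(scored_hits, key=lambda hit: hit[2], reverse=True)
    for query, locus, _score in ranked:
        if query in assignment or locus in used_loci:
            continue
        assignment[query] = locus
        used_loci.add(locus)
    return assignment


# Toy family: both queries hit both loci, with different scores.
family_hits = [
    ("GPx1", "locus_A", 95.0),
    ("GPx1", "locus_B", 70.0),
    ("GPx2", "locus_A", 72.0),
    ("GPx2", "locus_B", 90.0),
]
print(assign_queries_to_loci(family_hits))
```

Queries left without a locus, or pairs with nearly equal scores, are the cases where manual analysis and phylogenies remain necessary.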

The second limitation concerns the optimization of the code: the program could be further automated. At present, the user has to provide a single document containing a FASTA list of all the well-annotated query proteins, and queries from different species have to be merged into that document by hand. A point to improve would therefore be for the program to accept the different FASTA files directly, select the well-annotated proteins (i.e. those that start with methionine) and start processing the queries automatically.
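A minimal sketch of how this improvement might look is given below. It is not implemented in our program: the function names and example headers are hypothetical, the FASTA content is passed in as text (in practice it would be read from each species' file), and "well annotated" is reduced to the single criterion used in this project, namely that the sequence starts with methionine.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_well_annotated(fasta_texts):
    """Merge several FASTA texts, keeping sequences that start with M.

    Returns (kept, rejected): kept is a list of (header, sequence),
    rejected is a list of headers that did not start with methionine.
    """
    kept, rejected = [], []
    for text in fasta_texts:
        for header, sequence in parse_fasta(text):
            if sequence.upper().startswith("M"):
                kept.append((header, sequence))
            else:
                rejected.append(header)
    return kept, rejected


def to_fasta(records):
    """Write (header, sequence) pairs back as FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


# Toy input: one zebrafish entry lacks the initial methionine.
zebrafish = ">GPx1_zebrafish\nMKT\nUAA\n>SelK_zebrafish\nKLL\n"
human = ">SelK_human\nMVY\n"
kept, rejected = select_well_annotated([zebrafish, human])
print(to_fasta(kept))
print("Rejected:", rejected)
```

The list of rejected headers indicates for which proteins a query from another species is still needed.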