Chelonoidis abingdonii

Do you want to know how we made it?

Materials and methods

To characterize the selenoproteome and the selenoprotein machinery of Chelonoidis abingdonii, we used a homology-based approach. Most of the query proteins came from Homo sapiens, whose selenoproteome is correctly annotated in SelenoDB 1.0. Some proteins could not be obtained from the human or were not in SelenoDB 1.0, so we searched for them in other databases, such as SelenoDB 2.0 or UniProt, and in other species (Anolis carolinensis (lizard) and Gallus gallus (chicken)). This was the case for SelPb (which was lost in humans and is not annotated for the lizard in SelenoDB 2.0) and for SecS, PSTK and SECp43 (which were not in SelenoDB 1.0, so we took both the human and the lizard versions from SelenoDB 2.0). The software we used does not recognize the character “U” (Sec) as an amino acid, so every “U” was replaced with an “X”. In addition, all symbols (such as $ or &) found at the end of the sequences were removed. The resulting sequences are the queries, and they were saved in a .fa file.
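The query clean-up described above can be sketched in a few lines of Python. This is a minimal illustration, not the code we actually ran: the function names clean_query and clean_fasta are hypothetical, and the sketch assumes that the unwanted symbols only appear at the end of a sequence line.

```python
import re


def clean_query(sequence: str) -> str:
    """Strip trailing non-letter symbols and replace Sec (U) with X."""
    # Remove anything that is not a letter at the end of the line ($, &, *, ...).
    trimmed = re.sub(r"[^A-Za-z]+$", "", sequence.strip())
    # Downstream tools do not accept U as an amino acid, so mask it with X.
    return trimmed.replace("U", "X")


def clean_fasta(text: str) -> str:
    """Apply clean_query to every sequence line of a FASTA-formatted string."""
    cleaned = []
    for line in text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # header lines are kept untouched
        elif line.strip():
            cleaned.append(clean_query(line))
    return "\n".join(cleaned) + "\n"
```

For a toy (not real) sequence, clean_query("MKUAG$&") returns "MKXAG".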

The analysis included the following proteins:

Selenoproteins and Cys-containing homologs: GPx1, GPx2, GPx3, GPx4, GPx5, GPx6, GPx7, GPx8, TR1, TR2 (TGR), TR3, Dio1 (DI1), Dio2 (DI2), Dio3 (DI3), SelH, SelI, SelK, SelM, SelN, SelO, SelP, SelPb, MsrB1 (SelR1), MsrB2 (SelR2), MsrB3 (SelR3), SelS, SelT1 (SelT), SelU1, SelU2, SelU3, SelW1, Rdx12 (SelW2), Sep15 (Sel15), MsrA and SelV.
Selenoprotein machinery: eEFSec, SPS1, SPS2a/b (SEPHS2), PSTK, SECp43, SecS and SBP2.

The genome was available at the following path:

/mnt/NFS_UPF/bioinfo/BI/genomes/2018/Chelonoidis_abingdonii/genome.fa

Its index was also provided:

/mnt/NFS_UPF/bioinfo/BI/genomes/2018/Chelonoidis_abingdonii/genome.index

First, we manually ran Tblastn to find which positions in the Chelonoidis abingdonii genome could encode an amino acid sequence similar to the given query (the human, lizard or chicken selenoprotein). We kept the hits containing at least one region with a minimum identity of 60% and an E-value lower than 1·10^-4. These hits were recorded in a table containing the name of the human selenoprotein, the Chelonoidis abingdonii scaffold and the positions of the hit. For the next step, we calculated the initial position to extract from the scaffold (50,000 base pairs before the start of the predicted gene) and the length of the sequence to consider (the length of the gene predicted by Tblastn plus 100,000 base pairs). However, some scaffolds were too short for the programs to run under these conditions. In those cases, the length was reduced so that it would not exceed the limits of the scaffold.
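The hit filter and the window calculation can be illustrated as follows. This is a sketch under stated assumptions rather than our original code: the Hit class and the function names are hypothetical, the gene length is taken as end minus start, and the window start is clamped to 0 when the hit lies less than 50,000 bp from the beginning of the scaffold.

```python
from dataclasses import dataclass

FLANK = 50_000        # base pairs added on each side of the hit
MIN_IDENTITY = 60.0   # percent
MAX_EVALUE = 1e-4


@dataclass
class Hit:
    scaffold: str
    start: int        # hit coordinates on the scaffold
    end: int          # may be smaller than start for reverse-strand hits
    identity: float   # percent identity reported by Tblastn
    evalue: float


def passes_thresholds(hit: Hit) -> bool:
    """Keep hits with identity of at least 60% and E-value below 1e-4."""
    return hit.identity >= MIN_IDENTITY and hit.evalue < MAX_EVALUE


def extraction_window(hit: Hit, scaffold_length: int) -> tuple:
    """Return (start, length) of the region to extract around the hit."""
    gene_start = min(hit.start, hit.end)
    gene_end = max(hit.start, hit.end)
    window_start = max(0, gene_start - FLANK)
    length = (gene_end - gene_start) + 2 * FLANK
    # Short scaffolds: trim the length so the window stays inside the scaffold.
    length = min(length, scaffold_length - window_start)
    return window_start, length
```

With these assumptions, a hit at positions 120,000 to 121,500 on a 1,000,000 bp scaffold gives the window (70000, 101500), whereas a hit at 10,000 to 12,000 on a 30,000 bp scaffold is trimmed to (0, 30000).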

Then, we wrote a Python program from scratch that runs fastafetch (extracts the scaffold from the genome using the index), fastasubseq (extracts the target region from the scaffold), exonerate (predicts the structure of possible genes in the region), fastaseqfromGFF (extracts the sequence of the exons), fastatranslate (translates the nucleotide sequence into an amino acid sequence) and t_coffee (aligns the predicted amino acid sequence with the query). The code for the program is the following: Code.
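To give an idea of how the six steps chain together, the sketch below builds the shell commands for one hit without executing them. It is an illustration only and not the program linked above: build_pipeline and the intermediate file names are hypothetical, the script name fastaseqfromGFF.pl is an assumption, and the argument order and flags follow the usual usage of the exonerate utilities and T-Coffee, so they should be checked against the installed versions before running anything.

```python
import shlex

GENOME = "/mnt/NFS_UPF/bioinfo/BI/genomes/2018/Chelonoidis_abingdonii/genome.fa"
INDEX = "/mnt/NFS_UPF/bioinfo/BI/genomes/2018/Chelonoidis_abingdonii/genome.index"


def build_pipeline(scaffold, start, length, query_fa, prefix):
    """Return the shell commands for one hit, in execution order."""
    q = shlex.quote
    scaffold_fa = f"{prefix}.scaffold.fa"   # hypothetical intermediate files
    region_fa = f"{prefix}.region.fa"
    exons_gff = f"{prefix}.exons.gff"
    cdna_fa = f"{prefix}.cdna.fa"
    protein_fa = f"{prefix}.protein.fa"
    return [
        # 1. Extract the whole scaffold from the indexed genome.
        f"fastafetch {q(GENOME)} {q(INDEX)} {q(scaffold)} > {q(scaffold_fa)}",
        # 2. Cut the region around the hit.
        f"fastasubseq {q(scaffold_fa)} {start} {length} > {q(region_fa)}",
        # 3. Predict the gene structure and keep only the exon lines of the GFF.
        f"exonerate --model protein2genome --showtargetgff yes "
        f"-q {q(query_fa)} -t {q(region_fa)} | grep -w exon > {q(exons_gff)}",
        # 4. Extract the exon sequences (script name is an assumption).
        f"fastaseqfromGFF.pl {q(region_fa)} {q(exons_gff)} > {q(cdna_fa)}",
        # 5. Translate the predicted cDNA in frame 1.
        f"fastatranslate -F 1 {q(cdna_fa)} > {q(protein_fa)}",
        # 6. Align the prediction with the query.
        f"t_coffee {q(query_fa)} {q(protein_fa)}",
    ]
```

Each command could then be passed to subprocess.run(command, shell=True, check=True), so that the loop stops at the first failing step. Building the commands separately from running them makes it easy to print and inspect them first.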

A second Python program was written to run Genewise (a different program that performs the previous four steps in a single one, but using other criteria). The code for the program is the following: Code 2.

These programs took their input from the table of Tblastn results. The table can be found here. Each row is one Tblastn hit. The first column indicates the strand («for» stands for forward and «rev» stands for reverse). The second column indicates the scaffold of each hit. The third column indicates the name of each protein followed by a number (in case there is more than one hit). The fourth column indicates the start of the predicted gene minus 50,000. The fifth column indicates the length of the region to extract (the length of the predicted gene plus 100,000, reduced where the scaffold is too short). The sixth column indicates the name of the query protein. The seventh and eighth columns indicate the start and the end of the predicted gene.
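A row of this table could be read as sketched below. The snippet assumes that the eight columns are separated by whitespace, which may not match the actual file. The TableRow type, the parse_row function and the example row in the usage note are hypothetical.

```python
from typing import NamedTuple


class TableRow(NamedTuple):
    strand: str        # "for" or "rev"
    scaffold: str
    hit_name: str      # protein name plus hit number
    window_start: int  # start of the predicted gene minus 50,000
    length: int        # length of the region to extract
    query: str         # name of the query protein
    gene_start: int
    gene_end: int


def parse_row(line: str) -> TableRow:
    """Parse one whitespace-separated row of the Tblastn results table."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    if fields[0] not in ("for", "rev"):
        raise ValueError(f"unknown strand label: {fields[0]!r}")
    return TableRow(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], int(fields[6]), int(fields[7]),
    )
```

For instance, parse_row("for scaffold_12 SelK_1 70000 101500 SelK 120000 121500") yields a row whose window_start is 70000 and whose length is 101500. The values are invented for illustration.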

The sequences extracted with fastasubseq were analysed with Seblastian and SECISearch3. Seblastian predicts whether the given sequence contains any known selenoprotein. SECISearch3 reports all the potential SECIS elements in the sequence. Among the SECIS elements predicted by SECISearch3, we considered (and annotated in the results) only the one located on the same strand as the predicted gene, downstream of it (not inside or before the translated region), i.e., in the 3' UTR, and as close as possible to the predicted gene.
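The SECIS selection rule can be written down as a small function. This is only a sketch of the rule as stated above, which we applied by hand: the Secis class and pick_secis are hypothetical names, coordinates are assumed to satisfy start < end on both strands, and strands are encoded as "+" and "-".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Secis:
    start: int
    end: int
    strand: str  # "+" or "-"


def pick_secis(candidates: Sequence[Secis], gene_start: int, gene_end: int,
               gene_strand: str) -> Optional[Secis]:
    """Return the closest SECIS downstream of the gene on the same strand."""

    def distance(secis: Secis) -> int:
        # Downstream means higher coordinates on "+" and lower ones on "-".
        if gene_strand == "+":
            return secis.start - gene_end
        return gene_start - secis.end

    downstream = [s for s in candidates
                  if s.strand == gene_strand and distance(s) > 0]
    return min(downstream, key=distance, default=None)
```

The function returns None when no candidate satisfies the rule, which corresponds to the cases where no viable SECIS was found.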

Among all the candidate genes predicted by Tblastn, the final gene was selected using the following criteria (for the selenoprotein machinery, Seblastian and SECISearch3 were not taken into account):

The scaffold and positions of the prediction do not overlap with another gene. When they did, we had to decide which was the correct gene predicted in the region.
T_Coffee shows a good alignment, with high identity and similarity.
The lengths of the query and the predicted protein are similar.
A viable SECIS is predicted in the fastasubseq sequence of the scaffold where the protein is predicted (when the protein contains a selenocysteine).
Seblastian finds a protein that matches the queried one.
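The first criterion, detecting predictions that fall in the same region, can be checked automatically. The sketch below is an assumption about how such a check could look, not part of our programs: predictions are given as a hypothetical dictionary mapping a hit name to (scaffold, start, end), with start <= end.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Prediction = Tuple[str, int, int]  # (scaffold, start, end)


def overlaps(a: Prediction, b: Prediction) -> bool:
    """True when both predictions share a scaffold and their ranges intersect."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def overlapping_pairs(predictions: Dict[str, Prediction]) -> List[Tuple[str, str]]:
    """List every pair of hit names whose predicted regions overlap."""
    return [
        (name_a, name_b)
        for (name_a, pred_a), (name_b, pred_b) in combinations(predictions.items(), 2)
        if overlaps(pred_a, pred_b)
    ]
```

Each reported pair still has to be resolved by hand using the remaining criteria, since the check only flags the conflict and does not decide which gene is the correct one.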

In the cases where Exonerate could not predict the structure of the gene, we used the data obtained from Genewise. This happened twice: in the lizard-based prediction of SelH and in one of the possible predictions for SelO. We assume that these events were due to the extremely short length of the amino acid sequences being studied.

Once we had established the best prediction for each Chelonoidis abingdonii gene, we examined the predictions that did not show a good T_Coffee alignment (too many gaps or low identity) and the selenoproteins for which no viable SECIS was predicted. To check whether a prediction that aligned poorly with the human protein was nevertheless correct (because the gene had changed significantly between humans and Chelonoidis abingdonii), we aligned the lizard protein with our predicted protein. This was applied to the human-based predictions of SBP2, MsrA, SelO, SelP, SelH, SelR2, SelR3, SelU2, GPx3 and TR3. For SelPb, SPS2 and SelU3, the lizard homologs were not available in SelenoDB 2.0, so no comparison could be made. Most of the re-analysed proteins aligned better with their lizard homologs, so we accepted their lizard-based predictions. For SBP2, SelO, SelR2, SelP and SelH, the lizard-based predictions yielded better alignments, and in three cases (SBP2, SelP and SelR2) we could reconstruct a much better Chelonoidis abingdonii protein.
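What "too many gaps or a low identity" means can be made concrete with a simple measure computed on a pairwise alignment. The function below is a hypothetical helper, not the score reported by T_Coffee: it assumes two aligned sequences of equal length with "-" as the gap character, and it defines identity as identical columns divided by the total number of alignment columns, which is only one of several possible conventions.

```python
def alignment_stats(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (identity, gap_fraction) for two aligned sequences."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = 0
    gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gaps += 1        # column with a gap in either sequence
        elif a == b:
            matches += 1     # identical residues
    columns = len(aligned_a)
    return matches / columns, gaps / columns
```

Comparing these two numbers for the human-based and the lizard-based alignment of the same prediction gives a quick way to see which query fits better.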

Finally, we created a phylogenetic tree for each protein family with Phylogeny.fr in order to study the relationship between the queries and the predicted proteins.