Materials and methods

The aim of this project was to identify and annotate all the selenoproteins of Paroedura picta, which had not been done before. To annotate these selenoproteins, together with their SECIS elements and the genes encoding their synthesis machinery, we performed a homology-based study in which all described selenoproteins and related genes of Homo sapiens were compared against the Paroedura picta genome.

Although Paroedura picta is phylogenetically closer to other species whose selenoproteomes are available in the SelenoDB 2.0 database, such as the lizard Anolis carolinensis, we chose the Homo sapiens selenoproteome as the reference for protein annotation. We considered it to be better annotated, and prioritized annotation quality over phylogenetic proximity.

To speed up the process, we used a Python script to automate each of the steps discussed below:

Figure: Pipeline for protein prediction

Obtaining the genome of Paroedura picta

The genome of Paroedura picta was obtained from a directory provided by the coordinators of the project and the Bioinformatics subject at the UPF. It can be reached through the following path:

/mnt/NFS_UPF/soft/genomes/2021/Paroedura_picta/genome.fa

Since this genome is stored in a multi-FASTA file, we also needed the index file of the genome, which is located in the same directory:

/mnt/NFS_UPF/soft/genomes/2021/Paroedura_picta/genome.index

Acquisition of the queries

The amino acid sequences of all selenoproteins and selenoprotein-related genes of Homo sapiens, our reference organism, were obtained from the SelenoDB 2.0 database. These query sequences were saved locally in their order of appearance in the database, each in a file named “selenoprotein_isoform”.fa.

In addition, since part of the software used does not recognise the “U” character that marks a Sec residue in the query sequences, every “U” was manually replaced by an “X”. Any symbols marking the end of a sequence, such as “@” or “#”, were also removed for the same reason.
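Although we did this cleanup by hand, it could also be scripted. The following is a minimal sketch rather than the procedure we actually used; the function name clean_query and the default marker set "@#" are assumptions made for illustration.

```python
def clean_query(fasta_text: str, end_markers: str = "@#") -> str:
    """Replace Sec ('U') with 'X' and strip marker symbols from FASTA text.

    Hypothetical helper: header lines (starting with '>') are left untouched,
    because they may legitimately contain the letter 'U'.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)
            continue
        # Remove every marker symbol found on the sequence line.
        for marker in end_markers:
            line = line.replace(marker, "")
        line = line.replace("U", "X")
        if line:  # skip lines that held only a marker
            cleaned.append(line)
    return "\n".join(cleaned) + "\n"
```

Leaving the header untouched matters here, since a blanket search-and-replace would also rewrite protein names that contain a "U".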

tBLASTn

To locate genomic regions of Paroedura picta that might contain genes homologous to our queries, we used the BLAST (Basic Local Alignment Search Tool) program with the tBLASTn algorithm, since we needed to compare amino acid queries (the human selenoproteome) against a nucleotide database (the Paroedura picta genome).

The output file obtained for each query protein contained all the alignments (hits) found, together with their parameters. Of these, we considered the contig name, start position, end position, length and expected value (E-value), which estimates the number of alignments with an equal or better score that would be expected by chance in a database of the same size. To filter the hits by quality, only those with an E-value smaller than 0.0001 were kept as candidate genes and carried forward to the following steps.
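The E-value filter can be sketched as follows. This is an illustrative example, not our actual script: it assumes the hits were written in BLAST's default 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), and the names Hit and filter_hits are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    scaffold: str
    start: int    # leftmost subject coordinate
    end: int      # rightmost subject coordinate
    length: int   # alignment length
    evalue: float


def filter_hits(blast_tabular: str, max_evalue: float = 1e-4) -> List[Hit]:
    """Keep hits whose E-value is strictly below max_evalue.

    Assumes the default 12-column tabular layout (an assumption, see above).
    """
    hits = []
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        evalue = float(cols[10])
        if evalue < max_evalue:
            sstart, send = int(cols[8]), int(cols[9])
            # On the reverse strand sstart > send, so order the coordinates.
            hits.append(Hit(cols[1], min(sstart, send), max(sstart, send),
                            int(cols[3]), evalue))
    return hits
```

Ordering the subject coordinates up front means later steps can treat forward and reverse-strand hits in the same way when extracting the region.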

Fastafetch

For each selected hit, this command retrieved from the genome the scaffold in which the hit was located.

Fastasubseq

The next step was to extract the region containing the hit. To do so, the hit region within the selected scaffold was extended by 50,000 nucleotides in both the 5' and 3' directions, to ensure that no part of the gene was lost because of the presence of introns.

In addition, for each scaffold we checked that the extended subsequence did not exceed the total length of the scaffold, since an out-of-range length causes errors during program execution.
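The window calculation and the length check can be illustrated with a short sketch. The function name extended_window is hypothetical, and the 0-based start coordinate and the clamping at the 5' end are assumptions added for completeness.

```python
from typing import Tuple


def extended_window(hit_start: int, hit_end: int, scaffold_length: int,
                    flank: int = 50_000) -> Tuple[int, int]:
    """Return (start, length) of the hit region extended by `flank` nt per side.

    The window is clamped to the interval [0, scaffold_length], so it can never
    extend past either end of the scaffold. Coordinates are 0-based here
    (an assumption for this sketch).
    """
    start = max(0, hit_start - flank)
    end = min(scaffold_length, hit_end + flank)
    return start, end - start
```

For example, a hit spanning positions 1,000 to 2,000 on a 30,000 nt scaffold yields the whole scaffold (start 0, length 30,000) rather than a request for 102,000 nucleotides.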

Exonerate

To predict the exons present in the extracted sequence, we used Exonerate. Using egrep, we collected all the exon entries from the Exonerate output into a single file. Exonerate also reports the strand, forward or reverse, on which the predicted exons are located.
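The exon selection performed with egrep can be mimicked in Python. This sketch assumes that Exonerate was asked to write its prediction as GFF lines, so that each feature is a tab-separated record with the feature type in the third column and the strand in the seventh; the function name extract_exon_lines is hypothetical.

```python
from typing import List, Set, Tuple


def extract_exon_lines(exonerate_output: str) -> Tuple[List[str], Set[str]]:
    """Return the GFF lines describing exons and the set of strands they lie on."""
    exons: List[str] = []
    strands: Set[str] = set()
    for line in exonerate_output.splitlines():
        cols = line.split("\t")
        # Comment and alignment lines have no tab-separated GFF columns.
        if len(cols) >= 8 and cols[2] == "exon":
            exons.append(line)
            strands.add(cols[6])
    return exons, strands
```

Matching on the feature column rather than on the bare word "exon" avoids picking up lines that merely mention exons in a comment or an attribute.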

fastaseqfromGFF

From the exons extracted from the Exonerate prediction, this command derives the cDNA sequence that encodes the predicted protein.

Fastatranslate

In this step the predicted cDNA was translated into a protein. Ideally, the result should resemble the query protein, of which it is a homolog. Before moving to the next step, every '*' was changed to 'X', because TGA can encode either a stop codon or Sec, and T-Coffee may not recognize a Sec residue written as '*'.
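The replacement of '*' by 'X' can be sketched as below. The function name mask_stops is hypothetical; returning the number of replaced symbols is an addition for illustration, since more than one '*' suggests that some of them are genuine stop codons rather than Sec.

```python
from typing import Tuple


def mask_stops(protein_fasta: str) -> Tuple[str, int]:
    """Replace '*' with 'X' on sequence lines and count the replacements."""
    out = []
    replaced = 0
    for line in protein_fasta.splitlines():
        if not line.startswith(">"):  # leave FASTA headers untouched
            replaced += line.count("*")
            line = line.replace("*", "X")
        out.append(line)
    return "\n".join(out) + "\n", replaced
```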

TCOFFEE

This is the last step of our automated pipeline. The program aligns the predicted selenoprotein with the original query and provides a score for the alignment.

Seblastian

This step was for validation purposes. We manually ran Seblastian on each scaffold subsequence to check whether a SECIS element was predicted.