methods

Query acquisition



The major decision was choosing the organism we were going to use as a reference to search for conserved selenoproteins and its machinery proteins in the Colinus virginianus’ genome. After some searching in the SelenoDB 2.0 database, we decided to use the chicken (Gallus gallus), because it was a bird and therefore more closely related to C.virginianus than the other organisms. It is important to mention that we had the chance to use the turkey (Meleagris gallopavo), but we decided to choose the chicken because we consider that it is a better studied organism than the others, so it should be better characterised.

From this database we obtained the FASTA sequences that served as queries to compare with the C.virginianus’ genome. We pasted each sequence in a different EMACS file that was named as Query.fa, where Query refers to the protein name.

tree

Figure 5: Fragment of Phylogeny of some birds. As we can see, both Collinus and Gallus genera belong to the order of Galliformes, having differed less than 30 M years (a not specially long time). This fact was crucial for us to decide using Gallus gallus as the reference animal. Completed phylogenetic tree can be found in Prum et al,. 2015 (see References).


Genome acquisition



The aim of our project was to find the selenoproteins and cysteine-containing homologs or selenium machinery proteins in the organism Colinus virginianus. The general scheme for doing this job is summarized in the following image:

Methodologies draft

First of all, we obtained all the programs needed to perform the analysis from a server provided by our UPF bioinformatics professors, which can be accessed by writing the following line at the other locations tab in the unix file browser:

smb://fs-aules.ds.upf.edu/PUBLIC/

Then we obtained the Colinus virginianus genome and other necessary files from the server public directory. If you want to recreate the results, you can download the same files we used by clicking here or writing the following line in the shell:

/mnt/NFS_UPF/soft/genomes/2019/Colinus_virginianus/


Programs



At this point, we wanted to develop a program to analyse the possible selenoproteins (and machinery proteins) faster. With this program we were able to perform all the steps necessary for the analysis automatically except for:

1-. The introduction of each protein’s name, which was introduced manually by us at the beginning of the program.
2-. The scaffold selection, which was chosen manually by us after each tblastn.
3-. The first and last nucleotide position, that were chosen manually by us before each fastasubseq.


You can download our program by clicking here.

Changing U to X

Before performing the tblastn program, we ought to change the U that represents the selenocystein to an X (in case there was any). That way the tblastn program could read the full sequence without any error. This action is accomplished with the following command line:

sed s/U/X/g $prot > $prot.x

Blast

Once we had changed the possible U to X, we were ready to perform the tblastn program. This program makes a comparison between the query protein and the C. virginianus’ genome that gives an output containing the significant hits it has found. Our program considered only those hits with an e-value equal or minor to 0.001. In order to perform the tblastn, our program contains the following command line:

tblastn -query $prot.x -db /mnt/NFS_UPF/soft/genomes/2019/Colinus_virginianus/genome.fa -out $prot.tblastn -outfmt 7 -evalue 0.001

As we mentioned before, the scaffold selection was made manually by us after each tblastn; and the criteria to do so was picking the scaffold with the lowest e-value, and from that scaffold we chose those hits with more than 70% of identity. In other words, we chose the most significant scaffold for each protein. Since the tblastn is an heuristic program, we considered that it was not necessary to pick other scaffolds, because their significance was stated lower by the program and they had less hits, which did not match our objective to achieve the highest possible significance.
However, with proteins belonging to the same family, it was difficult to distinguish which scaffold was the most suitable for each of them. In these cases, we picked those scaffolds with an e-value equal or minor to 0.001 and with the highest percentage of identity possible. Then, we performed a phylogenetic tree in order to choose the best scaffold for each protein.

Fastafetch

The fastafetch program allowed us to extract the scaffold with the possible selenocysteines from the C. virginianus’ genome, which was done with the following command line:

fastafetch /mnt/NFS_UPF/soft/genomes/2019/Colinus_virginianus/genome.fa /mnt/NFS_UPF/soft/genomes/2019/Colinus_virginianus/genome.index $scaffold > $prot.fastafetch

Calculating nucleotide positions and length

As we mentioned before, the first and last nucleotide position were chosen manually by us before each fastasubseq. To do so, we ignored those that were very far away from the others (only in a few proteins), or the ones that, despite being in the same scaffold, had a very high e-value and a low percentage of identity; both situations used to happen together.
Doing this was necessary because then we avoided problems of excessive length in the subseq file when performing the Seblastian or SECISearch3, otherwise those webservers emitted an error and some subseq files could not be processed for being too large.
After defining the positions, we extended them 50,000 more nucleotides in both 5 prime and 3 prime ends in order to avoid missing part of the C. virginianus’ sequence due to the presence of introns (the scaffold only contains the coding region). But there were two special situations to take into account:

1-. If the first position had a negative number (meaning that it begun before the first nucleotide of the genome), the first nucleotide position was set to 1.
2-. If the last position had exceeded the scaffold length, the last nucleotide position was set to the last position of the scaffold.

It is important to mention also that the scaffold length was calculated by multiplying the number of lines minus one in the fastafetch file by the number of nucleotides in one row. Finally, the length we have used to do the fastasubseq was the last nucleotide position minus the first nucleotide position (after applying all changes specified above).

Fastasubseq

Once we extracted the scaffold thanks to the fastafetch program and we defined the start and length position, we were able to obtain the region of interest with the fastasubseq program using the following command line:

fastasubseq $prot.fastafetch $start $length > $prot.subseq

Exonerate

With exonerate we were able to generate a FASTA file containing the exons of the predicted protein, using the following command line:

exonerate -m p2g --showtargetgff -q $prot.x -t $prot.subseq | egrep -w exon > $prot.exonerate.gff

FASTA seq from GFF

This program allowed us to obtain the cDNA from the FASTA file created with the exonerate, using the following command line:

fastaseqfromGFF.pl $prot.subseq $prot.exonerate.gff > $prot.pred.nuc

FASTA translate

Fastatranslate transforms the cDNA obtained with the fastaseqfromgff into the protein sequence. It is important to mention that we used the first option of the program (-F 1). This was done by our program with the following command line: fastatranslate -F 1 $prot.pred.nuc > $prot.pred.aa

fastaseqfromGFF.pl $prot.subseq $prot.exonerate.gff > $prot.pred.nuc

Changing * to X

Before performing the t-coffee, it is necessary to change the * to an X for the correct functioning of the program, which is done with the following command line:

sed s/*/X/g $prot.pred.aa > $prot.pred.aa.x


T-Coffee

Finally, we performed a global alignment between the query protein and the predicted protein sequence using the t-coffee program with the following command line:

t_coffee $prot.x $prot.pred.aa.x > tcoffee_$prot

Here ends our program, and the following step was performed in the Seblastian/SECISearch3 webpage.


SECIS sequence prediction



Finally, we used the Seblastian webserver to find the selenoproteins’ genes by detecting the SECIS elements in the 3’-UTR region. If that did not work, we used SECISearch3 instead, which does the same as Seblastian but it does not take into account that the SECIS elements have to be at the 3’-UTR and their positions should be between 200-5000 nucleotides far away from Sec.