MATERIALS AND METHODS
Genome selection
The philogenetically closest species to Parus major was determined by searching in SelenoDB. Although the closest one is Taeniopygia guttata, the species used as a reference genome was Gallus gallus, since it is a model organism and its genome is better characterized. The genome of interest (P. major) was obtained from the following source:
/cursos/20428/BI/genomes/2016/Parus_major/genome.fa
Similarly, we were also provided with the indexed genome in several contigs:
/cursos/20428/BI/genomes/2016/Parus_major/index.fa
The proteins from Gallus gallus were obtained from SelenoDB, and each one of the "Green" and "Red" proteins were saved as individual fasta files substituting every U from the sequence to X manually.
Red proteins: DIO1, GPx3, SELENOI, SELENOO1, SELENOP2, SELENOU1, TXNRD3, DIO2, MSRB1, SELENOK, SELENOO3, SELENOS, TXNRD1, DIO3, Sel15, SELENON, SELENOP1, SELENOT, TXNRD2.
Green proteins: eEFsec, MsrA, SBP2_1, SECp43, SELENOH, SELENOO_2, GPx7, MSRB3, SBP2_2, SecS_1, SELENOK_2, SELENOO_4, GPx8, PSTK, SBP2_3, SecS_2, SELENOK_3, SEPHS
Firstly, we loaded a series of modules so that we could run all the programmes:
module load modulepath/goolf-1.7.20
module load BLAST+/2.2.30-goolf-1.7.20
module load Exonerate/2.2.0-goolf-1.7.20
module load T-Coffee/11.00.8cbe486-goolf-1.7.20
export PATH=/cursos/20428/BI/bin:$PATH
export PATH=/cursos/20428/BI/soft/genewise/x86_64/bin:$PATH
export WISECONFIGDIR=/cursos/20428/BI/soft/genewise/x86_64/wise2.2.0/wisecfg
Obtention of the queries
A tBLASTn was performed in order to align each one of our sequences of interest to the genome of Gallus gallus. The tBLASTn modality takes an aminoacid sequence as a query and compares it to the reference genome, which is translated to its 6 possible frames.
Selection of the queries
The best queries obtained were selected manually, considering only those that matched all the following criteria:
- Having a good score and a good E-value (discarding all those with a value lower than 0.01).
- If a sequence did not contain the selenoprotein (X) aligned with a STOP codon or a cysteine it was discarded.
Region of interest extraction
Once all the hits had been selected, the regions of interest were extracted from the Gallus gallus genome. Firstly, the scaffold containing the region was extracted using fastafetch:
$ fastafetch genoma.fa index.fa seq > seq.fa
Then, the region was extracted from the output sequence of fastafetch, as fastasubseq only works on a fasta file containing one sequence:
$fastasubseq seq.fa start length > genomic.fa
The start and end of the hit were extended 50,000 nucleotides per site, in order to obtain a region of interest that contained the whole gene, and always taking into account that the start must be higher than 1 and the length shorter than the contig's length.
Exonerate
The gene structure was predicted from the region of interest of each selected hit. Exonerate aligned and compared the genomic region obtained by using Fastasubseq with the initial query and anotated the introns and exons of the region The command used was:
$ exonerate -m p2g --showtargetgff -q $sample -t genomic.fa > $sample.exonerate.gff
In which:
- m p2g refers to the protein to gene mode.
- showtargetgff lets us get the output in GFF format.
- q refers to the query (Gallus gallus).
- t refers to the target (Parus major's region of interest).
Subsequently, only the exons predictions from the whole gene structure were extracted:
egrep -w exon $sample.exonerate.gff > $sample.exon.gff
Once we obtained the gff file, it was converted to fasta format using the Perl program fastaseqfromGFF.pl:
fastaseqfromGFF.pl genomic.fa $sample.exon.gff > $sampl.pred.nuc.fa
Finally, the predicted sequence was translated from cDNA to protein:
fastatranslate -F 1 $sample_pred.nuc.fa > $sample_pred.aa.fa
The -F option indicates 'forward' and the nucleotide position where the traduction will begin. As the protein prediction is done from cDNA (exons), we used the option -F 1. When the fastatraslate finds a stop codon TGA, it writes a *, which may supose a wrong tcoffee alignment. For this reason, we changed the * for X, so we can identify an alignment between the selenocystein of the query (also changed to X at the beginnig of the process) with a TGA or a Cystein of the predicted protein.
Gallus gallus from the same protein, by aligning them using tcoffee.
t_coffee $sample.pred.aa.fa $sample > $sample.tcoffee
Concerning to the homologous proteins, a previous alignment must be performed with its selenoproteins in order to identify in which position the cystein is located. This can allow us to detect if it aligns with our predicted sequence in the tcoffee output.
SECIS elements
To corroborate if the predictions correspond with Selenoproteins or not, the SECIs elements of the predicted sequences were determined by using the program SECISearch3/Seblastian (seblastian.crg.es) which also predicts homologous Selenoproteins to our prediction as well as the number of exons.
SrciSearch is a tool for prediction of eukaryotic SECIs elements. It is an open software available online.
Once the selenoprotein gene in Parus major had been obtained, SecisSearch3 software was used in order to predict SECIs elements in the genomic region of each protein. The genomic.fa file was given as input and the option to obtain all the SECIS elements was selected so that we the best one could be chosen according to the following criteria:
- The SECIS element and the gene must be in the same frame.
- The SECIS elements have to be located 3' nearby the last exon.
- The SECIS elements usually are at a maximum distance of 6000 nucleotides from the gene region.
- The gene coding for the SECIs element must contain an UGA codon.
Automatization process
In order to facilitate the process of results obtention different programes have been created.
Firstly, the tBLASTn of Gallus gallus against the genome of Parus major was automatised by using a bash script that made it possible to obtain the blast output for each protein. The scripts can be found in the following link: automatised tBLASTn.
Secondly, once the blast outputs had been obtained, they were analyzed manually as described in point 3. When a hit fulfilled our criteria, its contig's ID, start and lenght were anoted in a new text file, which was used to extract the region of interest. A margin of 50,000 nucleotides was added at the start and the end in order to ensure that the region contained all the selenoprotein gene. The value noted in the text file had the value already corrected.
The rest of the process to search selenoproteins within the Parus major genome was done automatically with a bash script, which can be found on this link.