Selenoproteins of Mus spretus
METHODS and MATERIALS
This project presents a novel prediction of the Mus spretus selenoproteins and selenoprotein machinery factors. We used a homology-based approach, taking the annotated Mus musculus genome as the reference. For most of the predicted proteins we also performed a similar analysis using the Homo sapiens genome.
Obtaining the Mus spretus genome
The genome under study is that of Mus spretus, which was made available by the professors of the Bioinformatics course. We downloaded it from:
/cursos/20428/BI/genomes/2016/Genus_species/genome.fa
We chose Mus musculus as the reference for identifying the selenoproteins of Mus spretus because it is the closest species with a well-annotated genome (see Introduction). The selenoproteins and the machinery proteins involved in their synthesis were obtained from the SelenoDB database in multi-FASTA format, and each protein was then written to its own FASTA file using the program multifasta.pl. The human proteins were obtained in the same way.
We used a combination of Perl, Bash, Matlab and SeciSearch3 to perform the protein prediction analysis. One Bash script (exportpath.sh) and one Perl script (launchall.pl) were used sequentially to generate automatically all the files containing the data relevant to the prediction (see Data acquisition). The Perl script is invoked from within the Bash script. Data analysis and the final selenoprotein prediction were performed afterwards (see Data analysis).
All the data acquisition steps were fully automated, and the sections below describe the steps and commands used in the code. Some of these commands include input and output file reference names written in a standardized manner (hereafter written in capital letters). These are not exactly the filenames defined in the code we executed (see the exportpath.sh and launchall.pl scripts for details on file naming); we present them this way to make the prediction process easier to follow.
• Loading program modules
The first step was to load the modules and paths of all the programs used during the analysis (Exonerate, GeneWise, Blast, etc.) using the following commands:
• Change U symbols for X and remove artifacts
We replaced the U symbols with X in all query files so that Exonerate and Genewise would produce proper predictions. We also deleted the artifact symbols #, % and @ from all files so that the steps below would run correctly.
• Blast
We ran BLAST for each query using the following command:
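The exact commands are in the authors' scripts (exportpath.sh and launchall.pl). As a minimal sketch of these two steps in Python, not the authors' Bash/Perl code, the query clean-up and a tBLASTn call could look as follows. The function names, the choice to leave U untouched in FASTA header lines, the use of the BLAST+ `tblastn` program with a pre-built database, and the tabular output format are all assumptions made for illustration.

```python
import re

ARTIFACTS = re.compile(r"[#%@]")  # symbols removed from the query files


def clean_query(fasta_text):
    """Remove '#', '%', '@' everywhere and replace U with X in sequence lines.

    Assumption: header lines (starting with '>') keep their letters,
    so a 'U' inside a protein name is not altered.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        line = ARTIFACTS.sub("", line)
        if not line.startswith(">"):
            line = line.replace("U", "X")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


def tblastn_command(query_path, db_path, out_path):
    """Build (but do not run) a BLAST+ tblastn command line.

    Assumption: a nucleotide database for the genome already exists at
    db_path; '-outfmt 6' requests tab-separated hits.
    """
    return [
        "tblastn",
        "-query", query_path,
        "-db", db_path,
        "-outfmt", "6",
        "-out", out_path,
    ]


if __name__ == "__main__":
    print(clean_query(">SelP#1\nMKU@LL\n"), end="")
    print(" ".join(tblastn_command("QUERY.fa", "genome.fa", "BLAST_HITS.tsv")))
```

The command is returned as a list so that it can be passed to `subprocess.run` without shell quoting problems; the file names in the usage lines are placeholders.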
We then clustered the BLAST hits by contig and strand (the strand being inferred from the Start and End coordinates) and, for each cluster, extracted a genomic sequence extending 100,000 base pairs upstream and 100,000 base pairs downstream of the BLAST hits (hereafter called SUBSEQ). The commands used were:
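The clustering logic can be illustrated with a short Python sketch. It is an assumption-laden reconstruction rather than the authors' code: it assumes hits are given as (contig, start, end) tuples, that a hit with start greater than end lies on the reverse strand, that one window is built per contig and strand, and that windows are clipped to the contig boundaries. The names `subseq_windows` and `contig_lengths` are hypothetical.

```python
from collections import defaultdict

FLANK = 100_000  # base pairs added on each side of the clustered hits


def subseq_windows(hits, contig_lengths, flank=FLANK):
    """Group hits by (contig, strand) and return one window per group.

    hits: iterable of (contig, start, end) with 1-based coordinates.
    contig_lengths: dict mapping contig name to its length in bp.
    Returns {(contig, strand): (window_start, window_end)}, 1-based inclusive.
    """
    clusters = defaultdict(list)
    for contig, start, end in hits:
        # Strand is inferred from the order of the subject coordinates.
        strand = "+" if start <= end else "-"
        clusters[(contig, strand)].append((min(start, end), max(start, end)))

    windows = {}
    for (contig, strand), spans in clusters.items():
        leftmost = min(s for s, _ in spans)
        rightmost = max(e for _, e in spans)
        # Clip the flanked window so it never leaves the contig.
        win_start = max(1, leftmost - flank)
        win_end = min(contig_lengths[contig], rightmost + flank)
        windows[(contig, strand)] = (win_start, win_end)
    return windows


if __name__ == "__main__":
    example_hits = [
        ("c1", 150000, 150300),
        ("c1", 160000, 160200),
        ("c1", 5000, 4700),  # start > end: reverse strand
    ]
    for key, window in subseq_windows(example_hits, {"c1": 400000}).items():
        print(key, window)
```

Keeping forward and reverse hits in separate windows means each SUBSEQ can later be handed to the gene predictors in a single orientation.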
• Gene prediction
The next step was to predict, within every SUBSEQ, the coordinates of every possible gene, and then to align each predicted protein to the query protein as a quality control. Exonerate and Genewise were used for gene prediction and T-Coffee for protein alignment. The procedure for each step is detailed below:
⇒ Exonerate
Gene prediction with Exonerate involves several steps. First, we launched the program on the SUBSEQ with this command:
And another with the gene coordinates:
Next, for each predicted gene, we created an amino acid sequence file for subsequent alignment with the Mus musculus or human query protein. The commands used were:
These files were scanned for "*" symbols, which were replaced by X, and then saved for subsequent protein alignment with T-Coffee using:
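The substitution itself is simple. As an illustration in Python, not the command the authors ran, the sketch below replaces every "*" in the sequence lines of a FASTA file with X and reports how many were replaced. The function name `mask_stops` and the decision to leave header lines untouched are assumptions.

```python
def mask_stops(fasta_text):
    """Replace '*' with 'X' in sequence lines of a FASTA string.

    Returns (cleaned_text, number_of_replacements). Header lines
    (starting with '>') are copied unchanged.
    """
    cleaned = []
    replaced = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)
        else:
            replaced += line.count("*")
            cleaned.append(line.replace("*", "X"))
    return "\n".join(cleaned) + "\n", replaced


if __name__ == "__main__":
    text, n = mask_stops(">predicted_1\nMKL*GG\nAA*\n")
    print(text, end="")
    print("replaced:", n)
```

Counting the replacements is optional, but it gives a quick indication of how many in-frame stop symbols a prediction contains before it is aligned.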
The Genewise software predicts exons in a given sequence, so we used it to validate the Exonerate prediction for all the proteins. We ran it against the SUBSEQ sequence (which contains all tBLASTn hits within the same contig and genomic orientation) with the following commands:
The GENE_WISE output contains the location of the predicted exons and the amino acid sequence (hereafter referred to as PREDICTED_AA_GENEWISE) that we used for the alignment to the Mus musculus or human query protein.
• T-Coffee
We used T-Coffee as a global alignment tool to compare the predicted Mus spretus proteins (CLEAN_PREDICTED_AA_EXONERATE or PREDICTED_AA_GENEWISE, depending on the prediction software used) with the Mus musculus or human query proteins. We ran it with the following commands:
The alignment was used to quantify the similarity of each predicted protein to its query, on the basis that highly similar proteins are more likely to be true homologs in Mus spretus.
• Screening for candidate selenoproteins
The software described in Data acquisition generated many predicted proteins for each Mus musculus or human query protein. We aimed to design an analysis procedure that gathers all the relevant information for each putative Mus spretus selenoprotein or selenoprotein synthesis machinery protein. We created a Matlab script (plot_all) that, for each query protein, takes all the data acquired (see Data acquisition) and generates a figure with the information needed to make a sound prediction (contigs, BLAST hits, predicted exons and T-Coffee results). For a detailed explanation of the figure content, see Results. We screened the figures for predicted genes with high homology to the query protein, proper Sec alignment and, where possible, agreement between the Genewise and Exonerate predictions.
• SECIS and Seblastian search
SECIS elements are structured elements in the 3' UTRs of selenoprotein-coding mRNAs and are responsible for the translational readthrough. We used the SeciSearch3 tool on the predicted genes thought to encode selenoproteins in order to confirm the presence of this regulatory element. The Seblastian software was also run on each SUBSEQ of interest as an additional selenoprotein prediction tool, and we used it to confirm our selenoprotein predictions.
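The screening relies on how similar each predicted protein is to its query. As a rough illustration only, and not the score that T-Coffee itself reports, a percent identity can be computed from two sequences that have already been aligned. The function name `percent_identity`, the use of "-" as the gap character and the choice to ignore gapped columns are assumptions.

```python
def percent_identity(aligned_query, aligned_pred, gap="-"):
    """Percent of identical residues over columns where neither row has a gap.

    Both inputs must be rows of the same pairwise alignment, so they
    must have equal length.
    """
    if len(aligned_query) != len(aligned_pred):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for q, p in zip(aligned_query, aligned_pred):
        if q == gap or p == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if q == p:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Toy alignment: X marks the Sec position in both rows.
    print(percent_identity("MKX-LL", "MKXALV"))
```

Because the Sec residue is written as X in both the cleaned query and the cleaned prediction, a correctly aligned Sec position counts as a match under this measure.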