METHODS and MATERIALS

This project includes the novel prediction of the Mus spretus selenoproteins and selenoprotein machinery factors. We performed an homology-based approach from the Mus musculus genome. We also performed a similar analysis from the Homo sapiens genome for most of the predicted proteins.

Obtention of Mus spretus genome

Our genome of study is the genome of Mus spretus, which was created by the professors of the Bioinformatics subject. We downloaded it from: /cursos/20428/BI/genomes/2016/Genus_species/genome.fa

Obtention of queries

The reference genome we have chosen to identify the selenoproteins of Mus spretus is the genome of Mus musculus because it is the closest species that has a well-anotated genome (see Introduction). The selenoproteins and the machinery proteins involved in the synthesis were obtained from the database SelenoDB in multifasta format. Every protein was classified in a single fasta file using the following program: multifasta.pl. The human proteins were also obtained in the same way.

Process of prediction

We used a combination of Perl, Bash, Matlab and SeciSearch3 software to perform the protein prediction analysis. One bash exportpath.sh and one perl scripts launchall.pl were used sequentially to automatically generate all the necessary files that contain relevant data for the prediction (see Data acquisition). The perl script is contained inside the bash script. Data analysis and final selenoprotein prediction was performed afterwards (see Data analysis).

Data acquisition

All the steps of data acquisition were completely automatized and the sections below describe the steps and commands used in the code. Some of these commands include output and input file reference names written in a standardized manner (hereafter written in capital letters). However, these are not exactly the same filenames we defined in the code we executed (see exportpath.sh and launchall.pl scripts for detailed information about file naming). Therefore, we decided to present the methods section in a way that makes the prediction process understandable.

• Loading program modules

The first step was loading the modules and paths of all the programs used during the analysis (Exonerate, GeneWise, Blast, etc.) using the following commands:

⇒ module load modulepath/goolf-1.7.20
⇒ module load BLAST+/2.2.30-goolf-1.7.20
⇒ module load Exonerate/2.2.0-goolf-1.7.20
⇒ module load T-Coffee/11.00.8cbe486-goolf-1.7.20
⇒ export PATH=/cursos/20428/BI/bin:$PATH
⇒ export PATH=/cursos/20428/BI/soft/genewise/x86_64/bin:$PATH
⇒ export WISECONFIGDIR=/cursos/20428/BI/soft/genewise/x86_64/wise2.2.0/wisecfg/


• Change U symbols for X and remove artifacts

We changed the U symbols for X of all query files to get proper Exonerate and Genewise predictions. Also the #, % or @ artifact symbols were deleted from all files to ensure a good functioning of the steps below.

• Blast

We ran a BLAST for each query using the following command:

tblastn -query $fasta_file_X -db $genome -out Blast_output -outfmt 6

BLAST performs comparisons between queries (Mus musculus or human proteins) and the already indexed Mus spretus genome. Then it selects relevant hits for further analysis. We wanted to compare proteins vs translated genomic sequences, so the tBLASTn function was used. One important parameter for each hit is the E-value, which is the number of hits that are expected to be found when a search of the query in a database by chance is performed, so that lower E-values correspond to a higher hit significance. Thus, we took for further analysis those hits with an E-value < 0.01. We generated an output document containing the following elements for each hit:
  • Contig: corresponds to the genome fragment in which the hit was found.
  • Start: Position of the start of the hit.
  • End: Position of the end of the hit.
  • E-value: Measure of the significance of the hit.
This file was useful for further analysis.

• Selection of the regions of interest

Then, we clustered the BLAST hits by strand (calculated from the Start and End coordinates) so we generated a contig sequence for each cluster that had 100.000 base pairs up and 100.000 base pairs downstream of the BLAST hit (hereafter called as SUBSEQ). The commands used were:

fastafetch $genome /cursos/20428/BI/genomes/2016/Mus_spretus/genome.index CONTIG > CONTIG

(Generates a file with the contig sequence)

fastasubseq CONTIG START LENGTH > SUBSEQ

(Generates a file with the subseq sequence)


• Gene prediction performance

The next step was to predict, within every subseq, the gene coordinates of every possible gene followed by a sequence alignment to the query protein (as a quality control for each predicted gene). Exonerate and Genewise softwares were used for the gene prediction and T-Coffee was used for protein alignment. Below are the detailed procedures of each step:

Exonerate

The exonerate software includes different steps for gene prediction. First, we launched the program on the subseq with this command:

exonerate -m p2g --showtargetgff -q QUERY_PROTEIN -t SUBSEQ > EXONERATE_OUTPUT

which creates a visually interpretable output. We decided not to include the exhaustive prediction for a better optimum analysis. We considered that the homology between Mus spretus and Mus musculus is very high (see Introduction), making unnecessary an exhaustive prediction. Next, we created a file with all Exonerate-predicted exon coordinates with:

egrep -w exon EXONERATE_OUTPUT > EXONERATE_EXONS


And another with the gene coordinates:

egrep -w gene EXONERATE_OUTPUT > EXONERATE_GENE


We next created, for each gene predicted, an amino acidic sequence file for further sequence alignment with the Mus musculus or human query protein. The commands used were:

fastaseqfromGFF.pl SUBSEQ EXONERATE_EXONS > PREDICTED_NUCLEOTIDES

(creates the nucleotide sequence)

fastatranslate -f PREDICTED_NUCLEOTIDES -F 1 > PREDICTED_AA

(creates the amino acidic sequence)

This files were scanned for "*" symbols, substituted by X and saved for further protein alignment with T-Coffee with:

sed 's/*/X/g' PREDICTED_AA > CLEAN_PREDICTED_AA_EXONERATE

Genewise

The Genewise software predicts exons in a given sequence so we used it to validate the exonerate prediction for all the proteins. We ran it against the SUBSEQ sequence (that contains all tBLASTn hits within the same contig and genomic orientation) with the following commands:

genewise -pep -pretty -cdna -gff QUERY_PROTEIN SUBSEQ > GENEWISE_OUTPUT

(for forward oriented sequences) or:

genewise -pep -pretty -cdna -gff -trev QUERY_PROTEIN SUBSEQ > GENEWISE_OUTPUT

(for reverse oriented sequences).

The GENE_WISE output contains the location of the predicted exons and the aminoacid sequence (hereafter referred as PREDICTED_AA_GENEWISE) that we used for the alignment to the query protein of Mus musculus or human.

• T-Coffee

We used T-Coffee as a global alignment tool in order to compare the predicted proteins of Mus spretus (CLEAN_PREDICTED_AA_EXONERATE and PREDICTED_AA_GENEWISE depending on the prediction software used) with the query of the Mus musculus or human proteins. We ran it with the following commands:

t_coffee QUERY_PROTEIN CLEAN_PREDICTED_AA_EXONERATE > TCOFFEE_EXONERATE

(for the Exonerate prediction) and:

t_coffee QUERY_PROTEIN PREDICTED_AA_GENEWISE > TCOFFEE_GENEWISE

(for the GeneWise prediction)

This was used to quantify the similarity of the predicted protein to the query, so that highly similar proteins would be more likely to be actual homologs in Mus spretus.

Data analysis

• Screening for candidate selenoproteins

The software described in Data acquisition generated lots of predicted proteins for each query protein of Mus musculus or human. We aimed to design an analysis procedure to come up with all relevant information of each putative Mus spretus selenoprotein or selenoprotein synthesis machinery.
We created a Matlab script (plot_all) that, for each query protein, takes all the data acquired (see Data acquisition) and generates a figure with all relevant information for making a good prediction (contigs, BLAST hits, exons predicted and T-Coffee results). For a detailed figure content explanation see Results.

We screened the figures for predicted genes with high homology to the query protein, proper Sec alignment and matching Genewise vs Exonerate prediction (if possible).

• SECIS and Seblastian search

SECIS are structured elements in 3'UTRs of selenoprotein-coding mRNAs and are responsible of the translational readthrough. We used the SeciSearch3 tool on those predicted genes supposed to be selenoproteins in order to confirm the existence of this regulatory element.

The Seblastian software was also ran for each interesting SUBSEQ as an additional selenoprotein-prediction tool. We used it as a confirmation for our selenoprotein prediction.