In aim to predict novel genes containing IREs we have analized 60,770 full-length cDNAs based on a study of The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team Link to the article.
In our analysis, we used the program Geneid that predicts genes within anonymous sequences designed with a hierarchical structure.First, splice sites, start and stop codons are predicted and scored along the sequence using Position Weight Arrays. Secondly, exons are built from those sites and scored (i.e, score of the definning site plus the log-likelihood ratio of Markov Model for coding DNA). Finally, the gene structure is assembled maximizing the sum of the scores of the exons.
Firstable, the sequences of the 60,770 mouse cDNAs were obtained from the database: Click here to download the sequences.
We ran Geneid on those sequences using the command:
            $ geneid -v genomes/M.musculus/cDNA/fantom2.00.seq.ri.fa > fantom2.00.seq.ri.geneid 2 > fantom2.00.seq.ri.error
Our output, fantom2.00.seq.ri.geneid (data not shown) was analized, step by step, in order to extract as much information as possible.
On a first step, Geneid roughly predicted 43,461 proteins, as we checked using the following command:
            $ grep -c ">ri_" fantom2.00.seq.ri.geneid
After that, we were interested in knowing whether the cDNAs coding for those proteins were in forward (5'-3') or reverse (3'-5') strand. We tested that possibility using the following commands:
            $ grep -c "(Forward)" fantom2.00.seq.ri.geneid
            $ grep -c "(Reverse)" fantom2.00.seq.ri.geneid
This analysis resulted in 37,761 predicted genes in foward and, for our surprise, 5,700 in reverse strand.
Attending to the information contained in the paper by the FANTOM Consortium, we did not expect to find reverse cDNA (3'-5'). The paper said that all the cDNAs were cloned in plasmids in forward strand (although, it is possible to clone cDNAs both in forward and reverse strand). Thus, we considered two different possibilities: i)That there were indeed some cDNAs cloned in reverse strand in the initial data, or ii) that Geneid was not predicting properly.
In order to see whether these reverse sequences could reflect bona-fide cDNAs we rated them taking into account their scores. For that purpose we used the following commands:
            $ grep "(Forward)" fantom2.00.seq.ri.geneid | gawk '$11 >= choosed score' | wc -l
            $ grep "(Reverse)" fantom2.00.seq.ri.geneid | gawk '$11 >= choosed score' | wc -l
and the results obtained are shown below:
                  Score Number of cDNA sequences
                      Reverse Forward
                  >=5 850 29440
                  >=10 368 24050
                  >=15 181 19346
                  >=20 109 15166
                  >=25 67 11858
                  >=50 8 3278
                  >=60 3 1885
                  >=70 2 1065
                  >=80 1 610
                  >=90 1 324
                  >=100 0 170
Attending to the scores above (very low number of reverse cDNAs with scores >60), we decided to discard the possibility that those reverse stranded cDNAs would be able to code for the predicted proteins, so we didn't consider them any further.
The command shown below allowed us to identify 18,087 cDNAs without prediction:
            $ grep -c "# Optimal Gene Structure. 0 genes. Score = 0.000000" fantom2.00.seq.ri.geneid
Revising the figures obtained with several of our commands we were surprised to observe that substraction of the "unpredicted genes" (18,087) from the total number of cDNAs (60,770) resulted in 42,683 cDNAS, a number that did not match the initial number of predicted proteins (43,461). This discrepancy was interpreted in terms of the production of false positives by Geneid. We realized that 771 cDNAs were considered as coding for more than one protein. We verified this hypotesis with the following commands:
            $ grep "# Optimal" fantom2.00.seq.ri.geneid | gawk '$5==1' | wc -l (41,912 cDNAs)
            $ grep "# Optimal" fantom2.00.seq.ri.geneid | gawk '$5==2' | wc -l (764 cDNAs)
            $ grep "# Optimal" fantom2.00.seq.ri.geneid | gawk '$5==3' | wc -l (7 cDNAs)
Thus, we consider that the real number of predicted proteins is 41,912. The scores for the 771 predictions apearing in the two bottom lines were very low neither provide evidence for alternative splice variants.
Our collegues Uribesalgo, I & Igea, A, predicted cDNAs containing IREs from the total of coding genes (42,683, both in forward and reverse strand) using two different patterns: strict and lax. Finally, they decided to consider only forward strand and lax pattern results, generating a file with 601 IREs genes in fasta format IRESpatrolaxeforward.fasta.
The next step was to transform the sequences from fasta to gff format, that we easely solved with some commands.
Transforming fasta to gff:
            $ grep \> IRESpatrolaxeforward.fasta | sed -e 's/[>:,]/ /g' -e 's/\[/ /g' -e 's/\]/ /g' | awk 'BEGIN{OFS="\t"}{print $1, "PatScan", "IRE", $2, $3, "1", "+", "0", "0"}'> IRESpatrolaxeforward.gff
            Example of FASTA FORMAT:
            >ri_0610006G13_R000001C10_918:[37,59]             UUGCUUCAACAGUGUUUGAACGG
            Example of GFF FORMAT:
            ri_0610006G13_R000001C10_918 PatScan IRE 37 59 "score" "forward strand" "frame (0, 1, 2)" "group"
            Some Results:
            ri_0610008J09_R000001F06_1035 PatScan IRE 596 614 1 + 0 0             ri_0610006K08_R000001M11_922 PatScan IRE 35 57 1 + 0 0             ri_0610039G03_R000004J13_924 PatScan IRE 1 21 1 + 0 0             ri_0710001E08_R000005E21_643 PatScan IRE 55 76 1 + 0 0             ri_0710007M18_R000005G22_742 PatScan IRE 128 148 1 + 0 0             ri_0910001A18_R000005H19_1429 PatScan IRE 95 113 1 + 0 0             ri_0910001K24_R000005L11_1151 PatScan IRE 1009 1029 1 + 0 0             ri_1300014A17_R000011F21_1296 PatScan IRE 535 556 1 + 0 0             ri_1110001D10_R000013E09_745 PatScan IRE 339 361 1 + 0 0             ri_1110008A10_R000014G03_824 PatScan IRE 247 265 1 + 0 0             ri_1110017F19_R000014N21_975 PatScan IRE 515 535 1 + 0 0             ri_1110004C05_R000015D01_649 PatScan IRE 236 261 1 + 0 0             ri_1110049L20_R000018E10_721 PatScan IRE 111 131 1 + 0 0
Once we generated the file IRESpatrolaxeforward.gff containing the 601 IREs, we found seven cases in wich there two different IRE motifs predicted for the same cDNA. In order to clasify them in 594 different files, we split the gff file containing the 601 predicted IRES into 594 files each one called by his own identification number. We used the following gawk command:
            $ gawk '{id=$1; system(">> gff/"id".gff"); print $0 >> "gff/"id".gff"}' IRESpatrolaxeforward.gff
The 594 files were saved in the gff directory.
In order to predict genes containing IREs, we used the program predictor2.0.pl written by our collegues Guix, X & Lambea, E. This program is able to introduce external evidences (IREs) in a Geneid prediction. We expected to find three different kinds of predictions:
Predictor2.0.pl scores IREs with 1. As the used pattern to find IREs is lax we were expecting to find a high number of random predictions (false positives). Indeed, in most of the cases, the predicted IRE were located inside the ORF (cases 1 and 2). This can lead to two possibilities: i) if the predicted ORF has a score higher than 1, the program only predicts the gene (1) or ii) when an IRE is predicted inside an ORF with a score lower than 1, the program discards the gene and only predicts the IRE (2).
And, finally, if the IRE is located in the 5' or 3' UTRs, it predicts both IRE and gene (3).
We ran predictor2.0.pl using the following command:
            $ ./predictor.pl IRESfantom.gff IRESfantom.geneid Param/human3iso.param
The program took the 594 gff files as input, read them one by one and line by line and integrated the prediction from Geneid introducing simultaneously the external evidences. The parameters used by Geneid were modified by Guix, X & Lambea, E in order to force it to predict genes with one single exon.
The output file (IRESfantom.geneid) was written in order as it was received from the processed input file.
In the cases 1 and 2, we obtained one file per prediction, -either a gene or a IRE-, while in the third case, we found two different files with the same id number followed by 1 or 2. As an example:
>ri_A730079M18_PX00152B20_934_1 >ri_A730081K01_PX00152L16_1786_1
>ri_A730090O07_PX00153E15_3726_1 >ri_A730090O07_PX00153E15_3726_2
Once the prediction made, those are the commands we used to discriminate between the outfile IRESfantom.geneid:
      ORF or IRE: $ grep \> IRESfantom.geneid | gawk -F"|" '{print $1}' | sed 's/_[0-9]$//' | uniq -u | wc -l (279 + 264 = 543 cDNAs)
      IRE and ORF: $ grep \> IRESfantom.geneid | gawk -F"|" '{print $1}' | sed 's/_[0-9]$//' | uniq -d | wc -l (51 cDNAs)
The command selects the secuences by identifying their ID, deletes the figure indicating the number of prediction in each cDNA (i.e, _1 or _2) and prints i)The IDs apearing only one time by the uniq -u command (i.e, one predicted ORF or IRE) or ii) The IDs apearing more than one time by the uniq -d command (i.e, when both ORF and IRE are predicted).
Our collegues Pons, F & Nolla, M performed a termodynamic filtration with a Delta G cut-off <= -2 (i.e, threshold that recovers 90% of real IREs) on the 51 predctions containing both ORF and IRE. They obtained a total of 45 cDNAs.
Then the proteins predicted by Geneid were extracted and compared with a human protein database (i.e, known or potential proteins with functional annotation) through a BLAST.
From the initial 45 protein set we found:
Our final objective was to find new IRE regulated proteins and those seven hypothetical proteins seem to be good candidates. In order to confirm our hypothesis those seven proteins need further analysis as explained in the conclusions.
To bear in mind...
All along the report we have been considering that our initial Geneid predictions where quite reliable while it is well known that nowadays the predicting genes programs still produce false positives and sometimes false negatives (when they try to predict through rich A+T zones without canonical motifs) despite their use of Markov Models for coding DNA.
This work would not have been possible without the guidance and courtesy of S. Castellano and the collaborative efforts of our fellow class-mates.
                                UP!