IREs

Prediction of genes containing IRE sequences.

by Francesc Xavier Guix Ràfols & Eva Lambea Martínez

The aim of our project is to identify the mouse genes that contain IRE motifs in their structure.


Our cDNAs were obtained from the FANTOM database, which contains 60,770 full-length sequenced RIKEN cDNA clones. We used these sequences, in FASTA format, to make exon predictions with the Geneid program.

We also wrote a Perl program called Predictor.pl, which allowed us to run a prediction on each cDNA sequence independently.

We were provided with an IRE evidence file called IRESpatrolaxeforward.IDs, which we used to extract the genes containing IRE motifs.

Data mining was performed with Unix commands in a shell terminal.

METHODS & PROCEDURES

The first step was to write a program, Predictor1.0.pl, in Perl to perform the following tasks:

Read the source file containing the 60,770 cDNAs in FASTA format one sequence at a time.
Write the current sequence to an intermediate file called mid.fa, so that this file contains exactly one sequence whenever it is read.
Run Geneid to predict an exon structure for the sequence in mid.fa.
Append the resulting prediction to an outfile whose name is specified by the user before running Predictor.pl. At the end, this outfile contains all the predictions made by Geneid.
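
The four tasks above can be sketched in shell as follows. This is only an illustration, not the authors' Perl code: the function name predict_each, the MID variable and the toy sequences are assumptions, and because the exact Geneid call is not given in the text, the predictor command is passed in by the caller (here `grep -c .` stands in for it and simply counts the lines of mid.fa).

```shell
# predict_each INPUT.fa OUTFILE COMMAND [ARGS...]
# Feeds one FASTA record at a time to COMMAND, called as: COMMAND ARGS... mid.fa
predict_each() {
    input=$1
    outfile=$2
    shift 2
    mid=${MID:-mid.fa}          # intermediate single-sequence file
    have_record=0
    : > "$outfile"              # start with an empty outfile

    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*)
                # New header: run the predictor on the previous record first.
                if [ "$have_record" -eq 1 ]; then
                    "$@" "$mid" < /dev/null >> "$outfile"
                fi
                printf '%s\n' "$line" > "$mid"
                have_record=1
                ;;
            *)
                if [ "$have_record" -eq 1 ]; then
                    printf '%s\n' "$line" >> "$mid"
                fi
                ;;
        esac
    done < "$input"

    # Run the predictor on the last record.
    if [ "$have_record" -eq 1 ]; then
        "$@" "$mid" < /dev/null >> "$outfile"
    fi
}

# Toy demonstration with three made-up sequences.
demo_dir=$(mktemp -d)
printf '>seqA\nACGT\nACGT\n>seqB\nGGCC\n>seqC\nTTAA\nTTAA\nTTAA\n' > "$demo_dir/toy.fa"
MID="$demo_dir/mid.fa"
predict_each "$demo_dir/toy.fa" "$demo_dir/outfile" grep -c .
cat "$demo_dir/outfile"
```

In a real run the stand-in command would be replaced by the Geneid invocation with its parameter file, so that the outfile accumulates one prediction per cDNA.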

ANALYSIS OF THE OUTFILE

Once the outfile was obtained, we compared its content with our expectations. We evaluated the following points:

Is there a prediction for each cDNA sequence? To answer this question we used the following Unix command:

$ grep "## date" outfile | wc -l

This command counts the lines of the outfile that contain the string "## date". We chose this string because it appears in every prediction that Geneid makes. The returned value was 60,770, so we concluded that there was a prediction for each cDNA sequence.

Number of predicted proteins: Since we are working with cDNA sequences, we would expect one protein per sequence, that is, 60,770 proteins. To check this we used the following Unix command:

$ grep ">" outfile | wc -l

This command counts the lines of the outfile that contain the character ">", which precedes the ID of each predicted protein in FASTA format. Contrary to our expectation, the number of predicted proteins (43,480) was lower than the number of cDNA sequences.

How many genes are predicted in forward? And in reverse? Does it make sense? In theory, we would expect all the productive predictions to be in forward, because according to the article (Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs) all the cDNAs had been entered into the database in forward orientation.
To check this we used the following Unix commands:

$ grep "Forward" outfile | wc -l

$ grep "Reverse" outfile | wc -l

We obtained these results:

Forward: 37,758
Reverse: 5,722 (*)

(*) We observed some reverse predictions with a good score in the outfile, which means that not all the sequences had been entered in forward orientation, as the article stated.

To check what proportion of the 5,722 reverse predictions had a significant score, we set a threshold of score > 10:

$ grep "Reverse" outfile | gawk '$NF>10' | wc -l

The result of this command was 369 cDNA sequences.
Having found so many reverse predictions with a good score, we raised the threshold to see how many cDNA sequences had a score higher than 20:

$ grep "Reverse" outfile | gawk '$NF>20' | wc -l

The result of this command was 109 cDNA sequences.
Given these results, we wanted to know whether the high scores were due to bad predictions or to the fact that some of the sequences had been entered in reverse. It seemed likely that the sequences with the highest scores corresponded to true proteins.
To validate this, we performed a protein BLAST (BLASTP) search with some of the proteins predicted by Geneid (score > 20).
All the proteins we checked turned out to be real proteins; e.g. the sequence with a score > 50 corresponded to a well-characterized mouse kinase.
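
The two threshold counts can also be obtained in a single pass over the outfile. The sketch below assumes, as the commands above do, that the score is the last field of each gene line; the four sample lines are made up for illustration and only imitate the general shape of a Geneid gene line.

```shell
# Made-up gene lines; the real outfile holds one such line per predicted gene.
demo_dir=$(mktemp -d)
cat > "$demo_dir/toy_outfile" <<'EOF'
# Gene 1 (Forward). 2 exons. 210 aa. Score = 31.20
# Gene 1 (Reverse). 1 exons. 45 aa. Score = 3.10
# Gene 1 (Reverse). 1 exons. 130 aa. Score = 12.75
# Gene 1 (Reverse). 3 exons. 402 aa. Score = 54.02
EOF

# Count reverse predictions in total, above score 10 and above score 20.
reverse_tally=$(awk '
    /Reverse/ {
        total++
        score = $NF + 0          # force a numeric comparison
        if (score > 10) over10++
        if (score > 20) over20++
    }
    END { printf "%d %d %d\n", total, over10, over20 }
' "$demo_dir/toy_outfile")

echo "reverse total, >10, >20: $reverse_tally"
```

Adding `+ 0` makes awk compare the score as a number even if the field carries stray characters, which a plain `$NF>10` might compare as a string.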

How many predicted exons are single, first, internal or terminal? Does it make sense? To answer this question we used the following Unix command:

$ egrep "Single|First|Internal|Terminal" outfile | awk '{print $1}'| sort | uniq -c

The result was:

Single: 10,139

First: 10,322

Internal: 17,993

Terminal: 13,327

In theory, we would expect only a single exon for each predicted protein, because cDNAs are copies of spliced mRNAs, which contain no introns and are monocistronic in eukaryotes. These results could be explained by Geneid predicting donor and acceptor sites that are not real.

Moreover, with only one exon per predicted protein, the total number of exons should equal the number of proteins, but we verified that this was not the case:

$ egrep "Single|First|Internal|Terminal" outfile | wc -l

This command gave us the total number of predicted exons (51,781), which should coincide with the number of predicted proteins (43,480).

As the two numbers differ, more than one exon is predicted for some cDNAs.
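
As a quick consistency check, the counts reported above can be added up in the shell. The variable names are ours; the figures are the ones given in this section.

```shell
# Counts reported above.
forward=37758
reverse=5722
single=10139
first=10322
internal=17993
terminal=13327

proteins=$((forward + reverse))                   # genes in both orientations
exons=$((single + first + internal + terminal))   # exons of all four types
extra=$((exons - proteins))                       # exons beyond one per protein
per_protein=$(LC_ALL=C awk -v e="$exons" -v p="$proteins" 'BEGIN { printf "%.2f", e / p }')

echo "predicted proteins: $proteins"
echo "predicted exons:    $exons"
echo "surplus exons:      $extra"
echo "exons per protein:  $per_protein"
```

The forward and reverse counts add up to the 43,480 predicted proteins, and the four exon types add up to the 51,781 exons, so the figures are mutually consistent.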

IMPROVING OUR PREDICTIONS

Our aim was to improve the Geneid predictions by modifying the parameter file. To do so, we created a gene model in which we inactivated the intronic connections, because we are working with cDNAs, which have no introns. We also restricted the intergenic connections so that only single exons are predicted, inactivating the connections that involve non-single exons.

The modified gene model is shown below:

#Intronic connections

# First-:Single-

# aataaa-

# Promoter-

# Single-:Terminal-

50:4000

Intergenic connections

Single+

Single-

# aataaa+:Terminal+:Single+

# Promoter-:First-:Single-

Single+

Single-

Single+

# Single-:Terminal-:aataaa-

# Single+:First+:Promoter+

# Single-:Terminal-:aataaa-

500:Infinity

Table. The symbol (#) marks the parameters that have been inactivated.

After making the predictions, we counted the single exons predicted in forward and in reverse, as well as the total number of predictions made by Geneid.

The result was:

Single: 31,592 (equal to the number of predictions)

Forward: 30,088

Reverse: 1,504

At this point, we decided to work only with the forward predictions because of the biological restrictions discussed above.

THE EVIDENCE

The next step was to extract the cDNAs containing IRE structures from the input file (60,770 sequences). This was possible thanks to the file provided by Ana Igea & Iris Uribesalgo, which listed the IDs of the cDNA sequences (594) showing IRE evidence (601 entries). The final objective was to use a gff file with the predicted IREs as external evidence in the Geneid prediction on these 594 cDNAs. We created a new version of the program, Predictor2.0.pl, to perform this task.

The protocol that we followed was:

The file with the 60,770 cDNA sequences in FASTA format was converted to TBL format:
$ FastaToTbl genomes/M.musculus/cDNA/fantom2.00.seq.ri.fa > fantom2.00.seq.ri.tbl
We checked that the 60,770 sequences were present in the TBL file:
$ egrep -c "ri" fantom2.00.seq.ri.tbl
The cDNA sequences in TBL format were sorted, and the result was redirected to another file called fantom2.00.seq.ri.tbl.ord:
$ sort fantom2.00.seq.ri.tbl > fantom2.00.seq.ri.tbl.ord
We checked that the 60,770 sequences were present in the fantom2.00.seq.ri.tbl.ord file:
$ egrep "ri" fantom2.00.seq.ri.tbl.ord | wc -l
The file with the 601 IDs of the predicted IREs was sorted so that it could be joined with the sorted file of 60,770 cDNAs. The matching lines were then counted (601 sequences obtained by similarity):
$ sort IRESpatrolaxeforward.IDs | join fantom2.00.seq.ri.tbl.ord - | egrep -c "ri"
A file containing the cDNA sequences that match the IRE IDs was generated (IRESfantom.tbl):
$ sort IRESpatrolaxeforward.IDs | join fantom2.00.seq.ri.tbl.ord - > IRESfantom.tbl
We checked that the cDNA sequences matching the IRE IDs were present in IRESfantom.tbl:
$ egrep -c "ri" IRESfantom.tbl
We found 594 cDNA sequences. Because some of the IRE IDs matched the same cDNA, the number of cDNA sequences is lower than 601.
Afterwards, the IRESfantom.tbl file was converted back to FASTA format in order to make the predictions with the Geneid program:
$ TblToFasta IRESfantom.tbl > IRESfantom.fa
We checked that the 594 cDNA sequences were present in the IRESfantom.fa file:
$ egrep -c "ri" IRESfantom.fa
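
The protocol above can be tried on toy data with the sketch below. Several things in it are assumptions: the awk functions fasta_to_tbl and tbl_to_fasta are stand-ins for the FastaToTbl and TblToFasta utilities (taken here to produce and consume one "ID sequence" line per record), the IDs and sequences are made up, and the `sort -u` step is only one possible way of keeping each cDNA once, since join emits one line per occurrence of an ID.

```shell
# Stand-in for FastaToTbl: one "ID sequence" line per FASTA record.
fasta_to_tbl() {
    awk '
        /^>/ { if (id != "") print id, seq
               id = substr($1, 2); seq = ""; next }
             { seq = seq $0 }
        END  { if (id != "") print id, seq }
    ' "$1"
}

# Stand-in for TblToFasta: back to a header line plus a sequence line.
tbl_to_fasta() {
    awk '{ print ">" $1; print $2 }' "$1"
}

export LC_ALL=C                     # sort and join must agree on the ordering
d=$(mktemp -d)
printf '>ri|toyC\nACGT\nAC\n>ri|toyA\nGGCC\n>ri|toyB\nTTAA\n' > "$d/toy.fa"
printf 'ri|toyB\nri|toyA\nri|toyB\n' > "$d/toy.IDs"   # ri|toyB is listed twice

# Convert to TBL and sort on the ID field, which is the join key.
fasta_to_tbl "$d/toy.fa" | sort -k1,1 > "$d/toy.tbl.ord"

# Joining the plain sorted IDs gives one line per ID occurrence ...
sort "$d/toy.IDs" | join "$d/toy.tbl.ord" - > "$d/hits_all.tbl"
# ... while sort -u keeps each cDNA only once.
sort -u "$d/toy.IDs" | join "$d/toy.tbl.ord" - > "$d/hits_unique.tbl"

tbl_to_fasta "$d/hits_unique.tbl" > "$d/hits_unique.fa"

n_all=$(grep -c . "$d/hits_all.tbl")
n_unique=$(grep -c '>' "$d/hits_unique.fa")
echo "lines joined: $n_all, distinct cDNAs: $n_unique"
```

Sorting the TBL file on its first field and fixing the locale keeps the ordering consistent with what join expects, which matters when one ID is a prefix of another.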

RESULTS

We obtained a file called IRESfantom.fa containing 594 cDNA sequences with possible IRE structures.

The next step was to use a gff file with the predicted IREs as external evidence in the Geneid prediction on these 594 cDNAs. This was carried out by Selma Serra & Mateu Lichtenstein.

DISCUSSION

At this point, we would like to comment in general terms on the results obtained during our project (the specific results are shown in the general conclusions page). It is important to emphasize that Geneid did not predict a protein for every cDNA sequence, and that not all the cDNAs had been entered into the FANTOM database in forward orientation, contrary to what the article states.

Part of our work consisted of preparing a program (Predictor V2.0) that runs Geneid on the cDNAs with possible IREs and, at the same time, selects the gff evidence file corresponding to the cDNA being read (there is one for each cDNA sequence with a possible IRE structure, 594 in total). The program worked as intended and allowed us to obtain the general results shown in the conclusions.
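
The selection of the evidence file can be illustrated with a short shell sketch. It is not the Predictor V2.0 code: the function name predict_with_evidence, the ID.gff naming scheme, the stub predictor and the toy IDs are all hypothetical, and the predictor command is supplied by the caller because the exact Geneid call is not given here.

```shell
# predict_with_evidence TBL EVIDENCE_DIR OUTFILE COMMAND [ARGS...]
# For each "ID sequence" line: write mid.fa, look up EVIDENCE_DIR/ID.gff
# (hypothetical naming scheme) and run: COMMAND ARGS... mid.fa ID.gff
predict_with_evidence() {
    tbl=$1
    evidence_dir=$2
    outfile=$3
    shift 3
    mid=${MID:-mid.fa}
    : > "$outfile"
    while read -r id seq || [ -n "$id" ]; do
        if [ -z "$id" ]; then
            continue                                  # skip empty lines
        fi
        gff="$evidence_dir/$id.gff"
        if [ ! -f "$gff" ]; then
            echo "no evidence file for $id, skipped" >&2
            continue
        fi
        printf '>%s\n%s\n' "$id" "$seq" > "$mid"
        "$@" "$mid" "$gff" < /dev/null >> "$outfile"
    done < "$tbl"
}

# Stub standing in for the real predictor: reports which files it received.
stub_predictor() {
    echo "predicted $(basename "$1") with $(basename "$2")"
}

# Toy data: three made-up cDNAs, evidence files for two of them.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/evidence"
printf 'cdna1 ACGT\ncdna2 GGCC\ncdna3 TTAA\n' > "$demo_dir/toy.tbl"
: > "$demo_dir/evidence/cdna1.gff"
: > "$demo_dir/evidence/cdna3.gff"

MID="$demo_dir/mid.fa"
predict_with_evidence "$demo_dir/toy.tbl" "$demo_dir/evidence" "$demo_dir/outfile" stub_predictor
cat "$demo_dir/outfile"
```

Sequences without a matching evidence file are reported on standard error and skipped, so the outfile only contains predictions that actually used external evidence.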

REFERENCES

Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563-573 (2002); doi:10.1038/nature01266
http://www1.imim.es/courses/BioinformaticaUPF/2002/projects/4.2/pages/english/page00.htm
http://www.ldc.usb.ve/~vtheok/webmaestro/ (HTML Course Online).
http://www.geocities.com/SiliconValley/Station/8266/perl/ (Perl Course Online).
http://www.ncbi.nlm.nih.gov/
http://www1.imim.es/geneid.html (Geneid program).