Roderic Guigó, IMIM/UPF/CRG, Barcelona

COMPUTATIONAL GENE IDENTIFICATION



The Problem

The Gene Identification Problem can be formulated as the problem of deducing the amino acid sequences encoded in a given genomic DNA sequence.

Why is the problem relevant?

Why is the problem difficult?

In higher eukaryotic organisms, genes are neither contiguous nor continuous. First, genes coding for different proteins are separated by large intergenic regions that do not code for proteins. Second, a given protein sequence is not usually specified by a continuous DNA sequence; instead, genes are often split into a number (possibly large) of (small) coding fragments known as Exons, separated by (larger) non-coding intervening fragments known as Introns (see figure below). Intronic and intergenic DNA often makes up most of the genome in higher eukaryotic organisms. In the human genome, for instance, only a very small fraction of the DNA, which can be as low as 2%, corresponds to protein-coding exons.

Figure: The pathway from DNA to protein sequences.

In the next sections, we show that although signals exist on the DNA sequence that instruct the cellular machinery along the pathway from DNA to protein sequences, our knowledge of how the cell recognizes and processes these signals is still limited, and it is usually impossible to infer the genes encoded in a given DNA sequence by relying only on these signals.

A few types of signals on the DNA sequence are involved in gene specification

The figure below schematizes the pathway from DNA to protein sequences in a higher eukaryotic cell. The main steps in this pathway are:

  1. Transcription. The continuous sequence of DNA corresponding to a single gene is copied to an RNA sequence.

  2. Splicing. The primary RNA transcript is spliced to remove intron sequences, producing a shorter RNA molecule, known as messenger RNA (mRNA).

  3. Translation. The mRNA sequence is translated into a protein sequence by a sub-cellular structure known as the ribosome. The ribosome binds to an initiation codon and scans the sequence, synthesizing the amino acid sequence specified by consecutive non-overlapping codons. Scanning of the mRNA proceeds until the ribosome finds one of the three codons that do not specify amino acids (the Stop Codons). At that point, elongation of the amino acid sequence ends, and the final protein product is released.
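The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the toy gene, its intron coordinates, and the function names are hypothetical, the input is taken to be the coding strand, and translation simply starts at the first AUG found, which is a cruder rule than what the ribosome actually does.

```python
# Standard genetic code, built compactly with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def transcribe(gene_dna):
    """Step 1: copy the coding-strand DNA into RNA (T becomes U)."""
    return gene_dna.upper().replace("T", "U")


def splice(pre_mrna, introns):
    """Step 2: remove introns given as 0-based, half-open (start, end) intervals."""
    pieces, pos = [], 0
    for start, end in sorted(introns):
        pieces.append(pre_mrna[pos:start])
        pos = end
    pieces.append(pre_mrna[pos:])
    return "".join(pieces)


def translate(mrna):
    """Step 3: read codons from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon: release the product
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: leader, exon 1, intron (GT...AG), exon 2, trailer.
dna = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA" + "GG"
mrna = splice(transcribe(dna), introns=[(8, 20)])
protein = translate(mrna)  # "MAK" for this toy gene
```

Note that the sketch is handed the intron coordinates. Gene identification is the inverse problem: those coordinates are exactly what is unknown.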

Signals in the DNA sequence (short strings of nucleotides) instruct the cellular machinery during these steps: the Promoter Elements and the Transcription Termination Motif during Transcription, the Donor Sites and Acceptor Sites during Splicing, and the Initiation Codon and the Stop Codon during Translation. Although eventually recognized by the cellular machinery through intermediate RNA molecules, the signals involved in gene specification are all ultimately encoded in the primary DNA sequence.
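To get a feel for how short these signals are, the sketch below lists every position in a sequence that matches a minimal consensus for each of them. The consensus strings are assumptions made for illustration, not taken from the text above: ATG for the Initiation Codon, TAA/TAG/TGA for the Stop Codons, and the canonical GT and AG dinucleotides found at the start and end of most introns for Donor and Acceptor Sites. Promoter and termination motifs are left out, and the function names are hypothetical.

```python
import re


def find_all(seq, motif):
    """Return 0-based start positions of every occurrence of motif, overlaps included."""
    # A lookahead matches at each position without consuming characters.
    return [m.start() for m in re.finditer("(?=" + re.escape(motif) + ")", seq)]


def candidate_signals(seq):
    """Map each signal type to the positions matching its minimal consensus."""
    seq = seq.upper()
    return {
        "start": find_all(seq, "ATG"),
        "donor": find_all(seq, "GT"),      # assumed consensus: intron begins with GT
        "acceptor": find_all(seq, "AG"),   # assumed consensus: intron ends with AG
        "stop": sorted(p for codon in ("TAA", "TAG", "TGA")
                       for p in find_all(seq, codon)),
    }


# Hypothetical 28 bp toy sequence containing one short intron.
signals = candidate_signals("CCATGGCCGTAAGTTTTCAGAAATGAGG")
# signals["start"] == [2, 22], signals["donor"] == [8, 12]
```

Even in this 28 bp toy sequence, several matches appear for every signal type, most of them unrelated to the single intron it contains.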

DNA signals involved in gene specification are apparently ill-defined and highly unspecific

DNA signals involved in gene specification are ill-defined, lack generality, and are highly unspecific. With currently available detection methods, it is usually impossible to distinguish the signals truly processed by the cellular machinery from the much more frequent ones that are apparently non-functional. As a consequence, attempting to predict gene structure by processing DNA sequence signals alone often results in a computationally intractable combinatorial explosion of potential products.
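The combinatorial explosion can be made concrete with a rough count. The sketch below is a deliberately simplified model, not a gene finder: it pairs every candidate acceptor with every downstream candidate donor to form a potential internal exon, then counts how many ordered sets of non-overlapping exons can be assembled from them. Reading-frame compatibility, first and last exons, and the `max_len` limit are all assumptions or omissions of this illustration.

```python
from bisect import bisect_left


def candidate_exons(acceptors, donors, max_len=300):
    """Pair each acceptor with every downstream donor at most max_len away."""
    return [(a, d) for a in acceptors for d in donors if 0 < d - a <= max_len]


def count_assemblies(exons):
    """Count the non-empty chains of pairwise non-overlapping exons."""
    exons = sorted(exons, key=lambda e: e[1])  # order by exon end
    ends = [end for _, end in exons]
    prefix = [0]  # prefix[i]: number of chains whose last exon is among exons[:i]
    for start, _ in exons:
        k = bisect_left(ends, start)   # exons ending strictly before this start
        chains_ending_here = 1 + prefix[k]
        prefix.append(prefix[-1] + chains_ending_here)
    return prefix[-1]


# Toy input: two candidate acceptors and two candidate donors.
exons = candidate_exons(acceptors=[0, 10], donors=[5, 15])
total = count_assemblies(exons)  # 3 candidate exons, 4 possible assemblies
```

Feeding the same two functions a few dozen evenly spaced hypothetical sites, for instance `candidate_exons(range(0, 2000, 25), range(10, 2000, 25))`, makes the count grow far faster than the number of sites, since every additional exon can extend every compatible chain before it.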

In the figure below, we plot the potential start sites, acceptor sites and donor sites that can be identified along the 2000 bp sequence containing the three-exon beta-globin gene. Sites have been identified using a Position Weight Matrix, with a cutoff chosen so that no true site is missed. From these signals, hundreds of potential exons can be constructed, which in turn can be combined into millions of potential genes. The cell apparently finds its way precisely through this puzzle, and only one (or a few) of these genes appear to be actually specified.
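A Position Weight Matrix scan of the kind described above might look like the following sketch. It assumes a log-odds matrix against a uniform background with a pseudocount, which is one common construction and not necessarily the one used for the figure. The training sites, the cutoff, and the function names are hypothetical, and sequences are assumed to contain only A, C, G and T.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({base: math.log2((column.count(base) + pseudocount)
                                    / total / background)
                    for base in "ACGT"})
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds of a window as long as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, seq, cutoff):
    """Return (position, score) for every window scoring at least cutoff."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(pwm, seq[i:i + width])
        if s >= cutoff:
            hits.append((i, s))
    return hits


# Hypothetical aligned donor sites, used only to illustrate the mechanics.
donor_pwm = build_pwm(["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"])
hits = scan(donor_pwm, "TTTTAGGTAAGTCCCC", cutoff=0.0)
```

Lowering `cutoff` until no true site is missed, as done for the figure, is what lets the many non-functional look-alikes through.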

Information other than sequence signals can be used to infer the genes potentially encoded in a DNA sequence.

Information from a number of sources, other than the sequence signals recognized by the cellular machinery, can be used to infer the genes encoded in a DNA sequence. Roughly, this information can be categorized as follows:

In the figure below, we show how this additional information can help us locate the exons of the beta-globin gene.


All this intrinsic information can be used to score the predicted exons and eventually filter out unlikely candidates.

Up-to-date electronic bibliographies on Computational Gene Identification are maintained by


Roderic Guigo (i Serra), IMIM and UB. rguigo@indy.imim.es