gene prediction: a solution

Bioinformàtica - 2on trimestre curs 2002/2003 - UPF

1 The Problem (page under revision)

Given the following 2000 bp DNA sequence
AGGACAGGTACGGCTGTCATCACTTAGACCTCACCCTGTGGAGCCACACCCTAGGGTTGG
CCAATCTACTCCCAGGAGCAGGGAGGGCAGGAGCCAGGGCTGGGCATAAAAGTCAGGGCA
GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACA
GACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAG
GTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGAC
AGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT
GATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGT
CTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGT
TATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGG
CCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGA
CAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACCCTTGATGTTTTCT
TTCCCCTTCTTTTCTATGGTTAAGTTCATGTCATAGGAAGGGGAGAAGTAACAGGGTACA
GTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCTCAGGATCGTTTTAG
TTTCTTTTATTTGCTGTTCATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCTTTTT
TTTTCTTCTCCGCAATTTTTACTATTATACTTAATGCCTTAACATTGTGTATAACAAAAG
GAAATATCTCTGAGATACATTAAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTA
CATTACTATTTGGAATATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTT
TCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGTGTAATGTT
TTAATATGTGTACACATATTGACCAAATCAGGGTAATTTTGCATTTGTAATTTTAAAAAA
TGCTTTCTTCTTTTAATATACTTTTTTGTTTATCTTATTTCTAATACTTTCCCTAATCTC
TTTCTTTCAGGGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA
ACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATTTCTGCATATAAATATTTCTGCAT
ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACC
ATTCTGCTTTTATTTTATGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCC
TTTTGCTAATCATGTTCATACCTCTTATCTTCCTCCCACAGCTCCTGGGCAACGTGCTGG
TCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATC
AGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTC
TTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGG
GATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATT
GCAATGATGTATTTAAATTATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAG
TGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCAAACCTTGGGAAAATACACTATATC
TTAAACTCCATGAAAGAAGGTGAGGCTGCAAACAGCTAATGCACATTGGCAACAGCCCTG
ATGCCTATGCCTTATTCATCCCTCAGAAAAGGATTCAAGTAGAGGCTTGATTTGGAGGTT
the problems is to find the exonic structure of the genes encoded in this sequence, and to infer its amino acid sequence

To simplify we will assume that we know already that this sequence codes for a single complete gene in the forward strand.

2 Predicting splice signals

We know that we can use a PWM to locate potential splice signals along the sequence. For instance, for donor sites we could use a matrix such as $D$

      -3     -2    -1    +1    +2    +3    +4    +5    +6
A   0.34   0.87 -1.05  -inf  -inf  0.71  1.06 -1.27 -0.46
C   0.33  -0.63 -2.22  -inf  -inf -2.17 -1.19 -1.68 -0.38
G  -0.30  -0.64  1.17  1.39  -inf  0.56 -0.72  1.20 -0.29
T  -0.77  -0.59 -1.18  -inf  1.39 -2.29 -1.13 -1.58  0.66

where each coefficient $D_{ij}$ is computed as $\log
(\frac{P_{ij}}{Q_{i}})$, where $P_{ij}$ is the probability of nucleotide $i$ ( $i \in \{A,C,G,T\}$) at position $j$ in real donor sites, and $Q_i$ is the background probability of nucleotide $i$.

This matrix can be used to score each potential donor site (GT), along a given sequence. The score of a potential donor site $S = s_1 s_2 \ldots s_{9}$ within the sequence is computed as

\begin{displaymath}
L_D (S) = \sum_{i=1}^{10} D_{s_ii}
\end{displaymath} (1)

This is the log-likelihood ratio of the probability of observing this particular sequence $S$ in an actual site versus the probability of observing $S$ in any false GT site.

If we compute these scores along our sequence problem we would obtain something like:

We could simlarly compute the potential acceptor sites along the sequence, $L_A(S)$

3 Predicting coding regions

As we have seen, we can also compute codon usage to help locating coding regions (see Coding Statistics. Indeed, let $F(c)$ be the probability of codon $c$ in the genes of the species under consideration. Then, given a sequence $S = s_1 s_2 \cdots s_l$, and assuming independence between adjacent codons

\begin{displaymath}
P(C) = F(s_1s_2s_3) F(s_4s_5s_6) \cdots F(s_{l-2}s_{l-1}s_l)
\end{displaymath}

is the probability of finding the sequence $S$ in a coding region. On the other hand, let's asume that $F_0(c)$ is the probability of codon $c$ in a non-coding sequence. Then assuming independence between adjacent codons,

\begin{displaymath}
P_0(C) = F_0(s_1s_2s_3) F_0(s_4s_5s_6) \cdots F_0(s_{l-2}s_{l-1}s_l)
\end{displaymath}

is the probability of finding the sequence $S$ in non-coding regions. We usually compute the log-likelihood ratio of the sequence as


\begin{displaymath}
L_M (C) = \sum_{i=1}^{l-2} \log \frac{F(s_{3i-2}s_{3i-1}s_{3i})}{F_0(s_{3i-2}s_{3i-1}s_{3i})}
\end{displaymath}

We can run an sliding window window along the query seqeunce above, and compute this log-likelihood ratio on each window. Below the plot obtained when the window size is 120 pb.

Actually to solve the gene prediction problem as stated above--infer the amino acid sequence of the encoded gene, we also need to locate potential start and stop codons. We can similarly built a PWM for start codons, because in around the codon ATG there is some conserved sequence context. There is no such context for stop codons. Thus, each TAA, TAG or TGA should be considered a potential stop codon.

4 Predicting and scoring exons

The problem is now how to combine the information from sequence signals and codon bias to infer the most likely gene structure.

In one approach--that we call here ``exon-chaining approach''--, we just predict all potential exons that can be defined from the predicted splice signals, and score them according to the scores of the signals and of the codon bias in the exon sequence. Then, we just assemble the frame compatible chain of exons that maximizes the sum of the scores of the assembled exons.

4.1 construction of exons

An exon is an Open Reading Frame (ORF) defined by an acceptor site and a donor site. (Strictly speaking this corresponds to the so-called internal exon. In practice, we just want to locate the coding fraction of the gene, so we usually define as well, initial exons, defined by an start codon and a donor site, and terminal exons, defined by an acceptor site and a stop codon. We should also consider the possibility of intronless genes: ORFs defined by an start and a stop codons)

4.2 scoring of exons

The score of a potential exon, $S$, $L_E(S)$ defined by sites $s_a$ (start/acceptor) and $s_d$ (donor/stop) is computed as


\begin{displaymath}
L_E(S) = L_A(s_a) + L_D(s_d) + L_M (S)
\end{displaymath} (2)

This score can be assumed to be the log-likelihood ratio of the probability of finding such sites and sequence composition given an actual exon over the probability of finding it on a random sequence bounded by AG and GT dinucleotides. Actually, because $L_M$ is the logarithm of the ratio of the probability of the sequence under the coding model over the probability under the non-coding model (not the under a random model), $L_E$ only approximates such a log-likelihood ratio.

4.3 assembly of genes

If a gene structure $g$ is a sequence of exons, $e_1 , e_2 , \cdots ,
e_n$, a natural scoring function is

\begin{displaymath}
L_G (g) = L_E(e_1)+L_E(e_2)+ \cdots + L_E(e_n)
\end{displaymath} (3)

$L_G$ (g) can be approximately interpreted as the log-likelihood ratio of the probability of the defining sites and the codon composition of the resulting product given a gene sequence, over this probability given a non-gene sequence. Among all the gene structures that can be assembled from the set of predicted exons for the sequence, we will assume the solution to be the gene prediction maximixing $L_G (g)$.

The number of potential gene structures grows exponentially with the number of predicted exons, however a dynamic programming chaining algorithm that grows only linearly can be used to obtain the optimal solution.

The exons of the real gene are
  1. 187-278
  2. 409-631
  3. 1482-1610
And the predicted exons are:
  1. 409-631
  2. 1482-1610





rguigo@imim.es