Roderic Guigó, IMIM/UPF/CRG, Barcelona

Gene Prediction Accuracy

An issue that naturally arises is that of the reliability of the predictions obtained by Gene Identification and Gene Structure prediction programs. The issue concerns both users and developers. It is particularly important for the users confronted with the need to identify those functional domains potentially encoded in large uncharacterized genomic regions. Indeed, experiments are often planned on the basis of such predictions, and these may involve a considerable amount of effort and resources. It is also important for the developers in order to assess the actual status of the problem. Moreover, identification of coding genes is the first necessary step to convert into biologically relevant knowledge the nucleotide sequence of an organism genome. Automatic annotation, however, is only possible if highly reliable gene prediciton programs exist.

Burset and Guigó (Genomics 34, 353-367, 1996) attempted to develop data set and performance metrics standards for consistently testing of Gene Identification software. They tested a number of programs on a large set of vertebrate sequences with known gene structure, and several measures of predictive accuracy were computed at the nucleotide, exon, and protein product levels. The data set used by Burset and Guigó was highly biased including only relatively short DNA sequences with simple gene structure and high coding density. Thus, the performance of the programs is likely to provide an overoptimistic estimation of the actual accuracy of the progrmas when confronted with large DNA sequences from randomly sequenced genomic regions of less standard composition and more complex structure.

Recently, Rogic et al. (Genome Research 11, 817-832, 2001) published a new independent comparative analysis of seven gene prediction programs. The programs were again tested in single gene sequences--from human and rodent species. In order to avoid overlap with the training sets of the programs, only sequences were selected that had been entered in GenBank, after the programs were developed and trained.

The paper by Rogic et al. represented a valuable update on the accuracy of gene finders, but it suffered from the same limitation as did the previous work by Burset and Guigó and others: gene finders were tested in controlled data sets made of short genomic sequences encoding a single gene with a simple gene structure. These data sets are not representative of the genome sequences being currently produced: large sequences of low coding density, encoding several genes and/or incomplete genes, with complex gene structure. Thus, accuracy measures inferred from them are likely to be substantial overstimations of the actual accuracy of the programs in real genomic sequences. Indeed, the initial analysis on chromosome 22 suggests that the ab initio gene finders predict a large number of false positive genes, and while they tend to predict most of the known genes, only an small fraction of the genes have all exons correctly predicted. For instance, GENSCAN predicts 94% of the known genes in chromosome 22, but only about 20% of them had all the exons correctly predicted.

Ideally, one would like to benchmark the gene identification programs in real genomic sequences. The main problem is that until very recently on very few of these sequences the structure of the genes has been exhaustively verified by experimental means, making it almost impossible to construct large enough test sets of well characterized and large enough genomic sequences with which the accuracy of the gene predictions can be calibrated. Because of the lack of well characterized human sequences, and because genes or exons may remain undetected even if a detailed experimental analysis is performed, Guigó et al. (Genome Research 10, 1631-1642, 200) investigated the accuracy of ab initio gene finders by constructing a set of semi-artificial sequences to emulate large well annotated genomic sequences. In these semi-artificial sequences, known genic sequences have been embedded in simulated intergenic DNA. Therefore, the location of all coding exons is known.

Measures of Prediction Accuracy

The accuracy of the predictions were measured at three different levels

coding nucleotide (base level)
exonic structure (exon level)
protein product (protein level)

The figure below illustrates graphically the measures at this three different levels.

Base Level

Prediction accuracy per base coding/non-coding.

Sn: Sensitivity = TP/(TP+FN)
Sp: Specificity = TN/(TN+FP)
AC: Approximate Coefficient = 0.5x((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1

TP:true positives
TN:true negatives
FP:false positives
FN:false negatives

Exon Level

Prediction accuracy with respect to exact prediction of exon start and end points

Sn: Sensitivity = fraction of correct exons over actual exons
Sp: Specificity = fraction of correct exons over predicted exons
ME: Missing Exons: fraction of true exons that are not identified at all
WE: Wrong Exons: fraction of predicted exons that do not overlap any true exon

Protein Level

Predcition accuracy with respect to the protein product encoded by the predicted gene.

% Sim: percentage similarity between the amino acid sequence encoded by the predicted gene and the amino acid sequence encoded by the actual gene

Burset and Guigó (1996)

For a postscript version of the paper click here

The evaluation by Burset and Guigó consisted of three steps:

Extraction of a reliable set of vertebrate sequences of know genic structure (Test Set)
Definition of a number of measures of predictive accuracy
Evaluation of a number of programs on the Test Set using the measures of predictive accuracy

Generation of the Test Set

The details of the protocol used to generate the Test data set (ALLSEQ) are schematized in the figure below.

The ALLSEQ data set is characterized by two features.
First, it includes only short sequences encoding a single complete gene with simple structure, with high coding density, and free of sequencing errors.
Second, ALLSEQ was built attempting to minimize the overlap with the training sets used during the development of the programs analyzed. Thus, only sequences entered into the public databases after January 1993 were selected into ALLSEQ. In addition, a subset of ALLSEQ was considered separately (NEWSEQ), comprising only those sequences that did not show a significant similarity to the vertebrate sequences entered into the database prior to January 1993. The ALLSEQ data set has 570 vertebrate sequences, 196 of them included in the NEWSEQ subset.

Files containing the DNA sequences in ALLSEQ, the locations of the exons in the sequences, and the corresponding encoded amino acid sequences can be found at http://genome.imim.es/databases/genomics96/index.html

Generation of the Test Set

The table below summarizes the results obtained when the performance of a number of programs was analyzed on the ALLSEQ and NEWSEQ data sets was analyized.

**Table:** Accuracy of gene predictions programs on single gene vertebrate sequences. Adapted from Burset and Guigó (1996)
Program	dataset	Sn	Sp	CC	Sn	Sp	(Sn+Sp)/2	ME	WE
		Nucleotide			Exon
fgeneh	allseq	0.77	0.88	0.80	0.61	0.64	0.64	0.15	0.12
	newseq	0.70	0.83	0.73	0.51	0.54	0.54	0.22	0.18

geneid	allseq	0.63	0.81	0.65	0.44	0.46	0.45	0.28	0.24
	newseq	0.58	0.78	0.60	0.41	0.43	0.42	0.34	0.27

geneparser2	allseq	0.66	0.79	0.65	0.35	0.40	0.37	0.29	0.17
	newseq	0.63	0.76	0.62	0.33	0.39	0.36	0.32	0.20

genlang	allseq	0.72	0.79	0.71	0.51	0.52	0.52	0.21	0.22
	newseq	0.63	0.73	0.63	0.39	0.44	0.43	0.29	0.25

grail 2	allseq	0.72	0.87	0.76	0.36	0.43	0.40	0.25	0.11
	newseq	0.69	0.85	0.72	0.34	0.41	0.38	0.30	0.13

sorfind	allseq	0.71	0.85	0.72	0.42	0.47	0.45	0.24	0.14
	newseq	0.65	0.79	0.65	0.36	0.39	0.38	0.29	0.19

xpound	allseq	0.61	0.87	0.69	0.15	0.18	0.17	0.33	0.13
	newseq	0.58	0.83	0.64	0.12	0.15	0.14	0.36	0.16


geneid+	allseq	0.91	0.91	0.88	0.73	0.70	0.71	0.07	0.13
	newseq	0.88	0.87	0.84	0.68	0.64	0.66	0.10	0.15
geneparser3	allseq	0.86	0.91	0.85	0.56	0.58	0.57	0.14	0.09
	newseq	0.83	0.89	0.82	0.50	0.53	0.51	0.17	0.09

GeneID+ and GeneParser3 incorporate results from amino acid database searches to make the gene predictions.

Average CC for the programs analyzed in the ALLSEQ data set ranged from 0.65 to 0.78 at the nucletoide level, while the average exon prediction accuracy ranged from 0.37 to 0.64. Accuracy was lower and less variable between programs in the NEWSEQ dataset: average $CC$ ranged here from 0.60 to 0.70, and average exon accuracy ranged from 0.36 to 0.54. The authors of the study also found that the programs showed some dependency---usually only slight---on the phylogenetic group within vertebrates in which they had been trained and tended to perform worse on low G+C content sequences, but this was not a general trend. Programs were also sensitive to high rates of sequence errors, but their robustness varied widely.

Rogic et al. (2001)

After the publication of this study, and during the second half of the 1990s, there were substantial developments in the field of computational gene identification. A new generation of programs---notably those based on Hidden Markov Models---, showed substantially increased prediction accuracy, and overcame some of the technical limitations of earlier programs: these programs were able to analyze larger sequences, predicting multiple genes in both strands. Accuracy of most such programs was still being evaluated in the test dataset described in \cite{burset:1996a}. While this allowed for effective comparison between programs, with the years this was also making it increasingly difficult to guarantee a clear separation between the sequence data sets in which the programs had been trained, and the sequence set in which they were being tested. To address this concern Rogic et al. (2000) published recently a new independent comparative analysis of seven gene prediction programs. The programs were again tested in single gene sequences---from human and rodent species. In order to avoid overlap with the training sets of the programs, only sequences were selected that had been entered in GenBank, after the programs were developed and trained. Their extraction process resulted in 195 genes, the HMR195 data set. The table below shows the the accuracy measures averaged over the set of sequences effectively analyzed for each of the tested programs.

**Table:** Accuracy of gene prediction programs on single gene mammalian sequences. Adapted from Rogic et al. (2000)
Program	Sn	Sp	CC	Sn	Sp	(Sn+Sp)/2	ME	WE
	Nucleotide			Exon
fgenes	0.86	0.88	0.83	0.67	0.67	0.67	0.12	0.09
genemark.hmm	0.87	0.89	0.83	0.53	0.54	0.54	0.13	0.11
genie	0.91	0.90	0.88	0.71	0.70	0.71	0.19	0.11
genscan	0.95	0.90	0.91	0.70	0.70	0.70	0.08	0.09
hmmgene	0.93	0.93	0.91	0.76	0.77	0.76	0.12	0.07
morgan	0.75	0.74	0.69	0.46	0.41	0.43	0.20	0.28
mzef	0.70	0.73	0.66	0.58	0.59	0.59	0.32	0.23

Obviously, the programs tested by Rogic et al (2001) showed substantially higher accuracy than the programs tested by Burset and Guigó (1996). Now, average CC at the nucleotide level ranged from 0.66 to 0.91, while average exon prediction accuracy ranged from 0.43 to 0.76. The authors also found that while earlier gene finders were tuned to have a low false-positive rate which resulted in, on average, 20% higher specificity than sensitivity, programs of the new generation have more equilibrated sensitivity vs. specificity. The authors also found a large gap between accuracy at the nucleotide level and at the exon level. This indicates that, while the programs correctly predict a large proportion of the coding region of the genes, they have trouble identifying the exact exon boundaries. In this regard, the gene prediction problem---formulated as the problem of identifying the nucleid acid signals involved in gene specification, and delineating only from them, the corresponding gene structure---is still far from being solved.

Other findings of Rogic et al. (2001) include dependency of some programs on G+C content, but little on phylogenetic specific training within mammals, and the general trend among the programs tested to have a very low proportion of correctly predicted short exons or very long exons. This, together with the fact that splice sites seem to be better predicted than start and stop codons, could explain the tendency, consistent among the programs, of better predicting internal exons than either internal or terminal exons. This, in turn, leads to rather poor predictions of the gene termini, as we will see later.

Accuracy in large DNA sequences. Guigó et al. (2000)

To overcame the lack of large well annotated genomic sequences in the human genome Guigó et al. constructed a set of semi-artificial genomic sequences. They proceeded in the following way. First, a typical benchmark set was extracted made of sequences from the EMBL database release 50 (1997). These are 178 human genomic sequences coding for single complete genes for which both the mRNA and the coding exons are known. We will refer to this set here as h178. The coding density of this set is 21%. Second, to simulate large sequence contigs, the sequences in h178 were randomly placed in a background of intergenic DNA, randomly generated using a Markov Model of order 5. The length of each artificial contig is generated randomly according to a normal distribution. Genomic fragments containing genes and random-sized segments of intergenic sequence are then concatenated until their combined lengths exceed the predetermined contig target length. The strands are also chosen at random for each genic subsequence. After this process, the 178 genic sequences collapsed into 42 semiartificial genomic sequences (SAG). We will refer to this set as Gen178. The coding densitiy on this set was only 2.3\%, very close now to the coding density estimated for the whole human genome.

The table below shows the accuracies of GENSCAN ---as the most accurate representative of "ab initio" gene finders---in the set of single gene sequences h178 and in the set of large artificial contigs Gen178.

**Table:** Accuracy of gene prediction programs on semiartificial multigene human sequences. Adapted from Guigó et al. (2000)
Program	dataset	Sn	Sp	CC	Sn	Sp	Sn+Sp/2	ME	WE	MG	WG
		Nucleotide			Exon					Gene
genscan	Gen178	0.89	0.64	0.76	0.64	0.44	0.54	0.14	0.41	0.03	0.28
	h178	0.92	0.92	0.91	0.76	0.76	0.76	0.09	0.09
blastx nr	h178	0.90	0.92	0.90	0.10	0.12	0.11	0.12	0.07
genewise	Gen178	0.98	0.98	0.97	0.88	0.91	0.89	0.06	0.02
	h178	0.98	0.98	0.97	0.88	0.91	0.89	0.06	0.02
procrustes	Gen178	0.93	0.94	0.93	0.80	0.75	0.77	0.10	0.16
	h178	0.93	0.95	0.93	0.76	0.82	0.79	0.11	0.04

As it is possible to see, while the accuracy of GENSCAN in the single gene sequences is similar to that estimated by Rogic et al (2001), it is substantially lower in the set of larger multiple gene SAG sequences: specificity at the nucleotide level drops to 0.64 from 0.92, and the proportion of wrong exons climbs to 41% from 9% in the single gene sequences. In addition, a significant decrease in sensitivity, is also observed, with GENSCAN failing to predict exons that were correctly identified in the single gene sequences. For instance, the proportion of missing exons increases from 9% to 14%. Almost 30% of the GENSCAN genes are predicted in the simulated-intergenic DNA. The authors concluded that for " ab initio" gene finders, these accuracy values (on SAG sequences) are more representative of their true accuracy on large genomic sequences than those obtained in the typical single gene benchmark experiments.

Accuracy in chromosome 22

The recent availability of the complete sequence of the first human chromosomes allows for a more realistic estimation of the accuracy of the gene finders in human genome sequences. Of the three chromosomes so far completely sequenced, 20, 21 and 22, the annotation of chromosome 22 is likely to be the most complete. We have evaluated the accuracy of two "ab initio" gene finders, GENSCAN and GeneID, in the complete sequence of chromosome 22. The analysis has some circularity, since the annotation of the chromosome 22 is already based on GENSCAN predictions. In that respect, the results here could overestimate the actual accuracy of gene finders. We used the sequence of chromosome 22 as in the April 2001 release of the UCSC assembly the GENSCAN predictions were taken from the same site, while the GeneID predictions were obtained after masking the sequence with repeatmasker (A. Smith and P. Green, unpublished). We used version 1.0 of geneid. The accuracy of the predictions was measured against the set of 554 annotated genes in of httP://www.cs.columbia.edu/~vic/sanger2gbd. Results appear in the table below.

**Table:** Accuracy of gene prediction programs on the complete chromosome 22 sequence
Program	Sn	Sp	CC	Sn	Sp	Sn+Sp/2	ME	WE
	Nucleotide			Exon
geneid	0.80	0.55	0.64	0.60	0.54	0.57	0.21	0.30
genscan	0.89	0.46	0.61	0.69	0.36	0.52	0.13	0.54

GENSCAN sensitivity in chromosome 22 is quite similar to that estimated by Guigó et al. (2000) in the set of artificial genomic sequences, but specificity is somewhat lower. A reason could be the existence of un-annotated genes in the chromosome 22 sequence, and need not necessarily be due to poorer GENSCAN performance. This is consistent with recent updates of the number of genes in chromosome 22, which could be as high as 700 or more, instead of the 550 used here to evaluate the accuracy of the predictions. In any case, it does not appear that the overall exon accuracy of GENSCAN and GeneID on the chromosome 22 sequence will be much higher than 60%.

Accuracy of Sequence Similarity Based Gene Prediction Tools

**Table:** Accuracy of gene prediction tools on semi-artificial multigene human sequences, when either strongly or moderately similar sequences are used as target for genewise and procrustes. Adapted from Guigó et al (2000) ().
	Very similar target						Moderate similar target
	p-value < e-50						e-50 < p-value < e-6
	17 SAG sequences						26 SAG sequences
	Nucleotide			Exon			Nucleotide			Exon
Program	Sn	Sp	CC	Sn	Sp	(Sn+Sp)/2	Sn	Sp	CC	Sn	Sp	(Sn+Sp)/2>
genscan	0.91	0.66	0.77	0.67	0.46	0.56	0.91	0.61	0.74	0.67	0.43	0.55
genewise	0.99	0.99	0.99	0.90	0.93	0.91	0.68	0.98	0.81	0.46	0.63	0.54
procrustes	0.92	0.96	0.94	0.80	0.75	0.77	0.66	0.79	0.72	0.48	0.32	0.40

Roderic Guigó (i Serra), IMIM and UB. rguigo@imim.es