We analyzed an anonymous human genome BAC using a variety of computational tools. We first downloaded the sequence, AC012089, in FASTA format from GenBank (http://www.ncbi.nlm.nih.gov/GenBank). It is also useful to have the sequence in tabular format, because this allows some preliminary analysis of its length and G+C content. RepeatMasker analyzes the presence and distribution of repetitive regions along the sequence, which allowed us to identify all kinds of repeats: SINEs, LINEs, Alu elements, LTRs and simple repeats. The EMBL RepeatMasker server (http://woody.embl-heidelberg.de/repeatmask/) returns a set of files for the submitted sequence: the masked sequence, its repeat annotation and a summary of the repeat content.
After masking the sequence, we used several gene prediction programs to obtain an initial delimitation of the putative genes it encodes:
- GeneId (http://genome.imim.es/geneid.html): Scores candidate sites with position weight matrices and uses recursive optimization techniques that are guaranteed to find the highest-scoring prediction.
- Genscan (http://genes.mit.edu/GENSCAN.html): Uses hidden Markov models, statistical models that capture the dependencies and biases among sets of consecutive nucleotides.
- Grail 2 exons (http://compbio.ornl.gov/Grail-1.3/): Based on neural networks, that is, an engine trained with real examples so that it can distinguish between two input classes: positive and negative.
- GeneMark (http://dixie.biology.gatech.edu/GeneMark/genemark24.cgi): A gene-finding tool that uses an algorithm based on non-homogeneous Markov chain models.
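To make the idea of position weight matrix scoring more concrete, here is a minimal sketch in Python. It is only an illustration of the general technique: the training sites, the uniform background frequency and the function names are all hypothetical, and none of it reproduces GeneId's actual matrices or parameters.

```python
import math


def build_log_odds_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    The uniform background and the pseudocount are illustrative assumptions.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def score_window(pwm, window):
    """Sum the per-position scores; unknown bases (e.g. masked N) score 0."""
    return sum(col.get(base, 0.0) for col, base in zip(pwm, window))


def best_site(pwm, sequence):
    """Slide the matrix along the sequence; return (start, score) of the best window."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = score_window(pwm, sequence[start:start + width])
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy donor-like sites, invented for this example only.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA"]
toy_pwm = build_log_odds_pwm(toy_sites)
start, score = best_site(toy_pwm, "CCCGTAAGTCCC")
print(start, round(score, 2))
```

A real gene finder combines such site scores with coding statistics and then searches for the best overall gene structure; this sketch covers only the site-scoring step.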
The output files had to be converted to GFF format, for which we used some awk scripts. Gff2ps (http://bioweb.pasteur.fr/seqanal/interfaces/gff2ps.html) was used to render the GFF files as PostScript, which we viewed with ghostview.
We then validated the predictions by comparing them against databases of known coding sequences. Because we were interested in nearly identical matches, we used Megablast (http://www.ncbi.nlm.nih.gov/BLAST) against the human EST database. The Blast results also had to be converted to GFF format (awk script). We selected the most interesting ESTs (the spliced ones) with a program called Parseblast (http://genome.imim.es/courses/BioinformaticaUPF/P18/scripts/).
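The conversion of Blast hits to GFF can be sketched as follows. This is not the awk script used here, which is not shown; it is a Python sketch that assumes the hits are available in the standard 12-column tabular Blast output (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The function name and the `source` and `feature` labels are hypothetical.

```python
def blast_tab_to_gff(lines, source="megablast", feature="hsp"):
    """Convert 12-column tabular Blast hits into GFF lines on the query sequence.

    Assumes tab-separated input; blank lines and '#' comment lines are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = fields[11]
        # A subject reported in reverse order means the hit is on the minus strand.
        strand = "+" if sstart <= send else "-"
        start, end = min(qstart, qend), max(qstart, qend)
        records.append("\t".join([
            query, source, feature,
            str(start), str(end), bitscore, strand, ".", subject,
        ]))
    return records


# Invented hit for illustration; EST1 is not a real accession.
example = ["AC012089\tEST1\t99.5\t200\t1\t0\t1000\t1199\t200\t1\t1e-100\t380"]
for gff_line in blast_tab_to_gff(example):
    print(gff_line)
```

Putting the EST identifier in the last (group) column keeps all the blocks of one EST together, which is what makes spliced ESTs, those with several separated blocks on the BAC, easy to pick out afterwards.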
We sought further evidence to support the existence of the predicted genes using other programs, such as TBlastx (same address as Megablast), which compares DNA sequences by translating them into proteins. To characterize the gene products we used Blastp (same address as Megablast) and Swiss-Prot (http://www.expasy.ch/sprot/) to compare proteins, InterPro (http://www.ebi.ac.uk/interpro/) to identify protein domains, and ClustalW (http://www2.ebi.ac.uk/clustalw/) to align sequences. We also needed Blast 2 Sequences (same address as Megablast) to compare a DNA sequence with a protein, or two nucleotide sequences with each other.
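The translation step behind a TBlastx-style comparison can be illustrated with a short Python sketch that translates a nucleotide sequence in all six reading frames using the standard genetic code. It shows only the translation idea, not Blast's search or scoring. The function names are hypothetical, and codons containing anything other than A, C, G or T are written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame (+1, +2, +3, -1, -2, -3) to its protein string."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rev[offset:])
    return frames


for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(frame, protein)
```

Comparing the six translations of one sequence with those of another is what lets protein-level similarity show up even when the nucleotide sequences have diverged.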
© Porta,M
Ros,S
Sancho,M
Trujillo,E / March 2002