Simple Implementation of the BLAST algorithm

The objective of this project is to develop a simple PERL implementation of the ungapped BLAST algorithm for the comparison of two protein (or DNA) sequences according to a specified scoring matrix.

INTRODUCTION

A protein has one or more active sites that characterize its biochemical function. Acceptance of mutations in the middle of an active site is uncommon as they can disrupt the protein function. On the other hand, the segments between the active sites usually allow for more mutations, as they play a role in giving the protein its shape but not in its function.

Accordingly, when comparing two proteins to infer a potential functional homology, we must take into account this variability. It is for this reason that the standard way of performing these searches is with local alignment algorithms. A local alignment of two sequences is an alignment of any substring of one sequence with any substring of the other sequence.

The local similarity of two sequences is defined as the highest score obtained by any local alignment between the two sequences. The Needleman-Wunsch dynamic programming algorithm for the global alignment of two sequences can be modified to produce a local alignment. This new algorithm is known as the Smith-Waterman algorithm (SW). The SW algorithm is, however, very slow when searching a large amount of sequences.

A way of improving the running time of the comparison algorithm is using 'heuristics'. This means that, although they can find the best scoring alignment, there is no guarantee of doing so. BLAST is one of those heuristics methods.

BLAST heuristic relies on the following two assumptions:

  1. The first is that most high-scoring alignments contain one or more high-scoring pairs of W-letter substrings called words. The value used for W is typically 3 for aminoacids (10 for nucleotides), but it can vary. BLAST collects seed-alignments by identifying places in both sequences where high-scoring word pairs occur.
  2. The second is that homologous sequences usually contain gap-free alignments of considerable length. Accordingly, the seed-alignments can be subsequently extended to form gap-free alignment blocks.

The BLAST algorithm can then be divided into the following steps:

SUGGESTED WORK PLANNING

  1. Write a perl program that reads two protein sequences in fasta format and a scoring matrix, e.g.:
    	./simple_blast.pl  seq1.fa  seq2.fa  blosum62
    
  2. The program should hold in memory the sequences and the scoring matrix for later manipulation.
  3. Consider the first file to correspond to the query. Modify the program above to add the processing of the query sequence, i.e. build a query index for the query sequence: For each 3-letter word in the query sequence, find all the neighbouring words ( the word neighbourhood), that is, all the 3-letter words that only differ from the query word in one letter and have a high score when compared to it. A score is usually high when it is >= 11.
  4. Modify the script to allow to introduce the length of the word and the minimum score for the word nieghbourhood as parameters, W and T, respectively, e.g.
    	./simple_blast.pl  seq1.fa  seq2.fa  blosum62 W=4 T=10
    
  5. Extend the program to add a function to scan the target sequence. This function should search the target sequence word by word (note the word length used = W) and use the table built from the query sequence to detect hits. Store the found hits.
  6. Extend the program to add a function that extend each one of the hits found in the target sequence. The extension should stop before we add an alignment that makes the score negative. Store all the alignments.
  7. Extend the program to print the alignments sorted by score and beyond a certain length that we can pass as a paramter, e.g.:
     
    	./simple_blast.pl  seq1.fa  seq2.fa  blosum62 W=4 T=10 L=20
    
  8. It is very important that the program is well documented and for each function there is a clear description of what it does. This documentation must be insde the program and can be in the format of Perl comment using the symbol #.

The description of the project and the results obtained should be clearly explained in a simple web site or in article format. It is important that the work is presented in a way that is reproducible by others. The construction of a web-server will be positively evaluated.