Treball Bioinformatica UPF. Simple Implementation of the BLAST algorithmt

Simple Implementation of the BLAST algorithm

The objective of this project is to develop a simple PERL implementation of the ungapped BLAST algorithm for the comparison of two protein (or DNA) sequences according to a specified scoring matrix.

INTRODUCTION

A protein has one or more active sites that characterize its biochemical function. Acceptance of mutations in the middle of an active site is uncommon as they can disrupt the protein function. On the other hand, the segments between the active sites usually allow for more mutations, as they play a role in giving the protein its shape but not in its function.

Accordingly, when comparing two proteins to infer a potential functional homology, we must take into account this variability. It is for this reason that the standard way of performing these searches is with local alignment algorithms. A local alignment of two sequences is an alignment of any substring of one sequence with any substring of the other sequence.

The local similarity of two sequences is defined as the highest score obtained by any local alignment between the two sequences. The Needleman-Wunsch dynamic programming algorithm for the global alignment of two sequences can be modified to produce a local alignment. This new algorithm is known as the Smith-Waterman algorithm (SW). The SW algorithm is, however, very slow when searching a large amount of sequences.

A way of improving the running time of the comparison algorithm is using 'heuristics'. This means that, although they can find the best scoring alignment, there is no guarantee of doing so. BLAST is one of those heuristics methods.

BLAST heuristic relies on the following two assumptions:

The first is that most high-scoring alignments contain one or more high-scoring pairs of W-letter substrings called words. The value used for W is typically 3 for aminoacids (10 for nucleotides), but it can vary. BLAST collects seed-alignments by identifying places in both sequences where high-scoring word pairs occur.
The second is that homologous sequences usually contain gap-free alignments of considerable length. Accordingly, the seed-alignments can be subsequently extended to form gap-free alignment blocks.

The BLAST algorithm can then be divided into the following steps:

In the first step the query sequence is processed. We consider a table of all the possible words (for 3-letter words: 20*20*20=8000). To each word we associate the list of positions in the query sequence of all the words than give a high score when aligned with the key word. This is called the query index. The score is calculate according to the scoring matrix. The exact 3-letter match is always high scoring, but note that other matches are also possibly high.
In the second step, we scan the target sequence. We consider all the 3-letter words in the target sequence, and look them up in the query index, to obtain the match position and score to the query. Initially one can consider exact matches. In order to increase sensitivity, we can use a parameter T, and use all the words in the target that score T or greater according to an specified substitution matrix.
Each word hit is subsequently extended to the right and to the left to increase the alignment's score, up to whatever position maximizes the overall alignment score. Note that extensions beyond the first negative score are pointless. Each alignment built in this way is an high-scoring pair or HSP.
The minimum score (S) for the reported HSPs ( and the maximum allowed expected value (E) of the HSPs, but we will not work with E values in this case ) can be specified by the user. HSPs that meet these criteria will be reported by BLAST. Additionally, the user can also specify the maximum number of reported 'found targets' (B) and HSPs (alignments) (V). Note that there can be more than one alignment per target.

SUGGESTED WORK PLANNING

Write a perl program that reads two protein sequences in fasta format and a scoring matrix, e.g.:
```
	./simple_blast.pl  seq1.fa  seq2.fa  blosum62
```
The program should hold in memory the sequences and the scoring matrix for later manipulation.
Consider the first file to correspond to the query. Modify the program above to add the processing of the query sequence, i.e. build a query index for the query sequence: For each 3-letter word in the query sequence, find all the neighbouring words ( the word neighbourhood), that is, all the 3-letter words that only differ from the query word in one letter and have a high score when compared to it. A score is usually high when it is >= 11.
Modify the script to allow to introduce the length of the word and the minimum score for the word nieghbourhood as parameters, W and T, respectively, e.g.
```
	./simple_blast.pl  seq1.fa  seq2.fa  blosum62 W=4 T=10
```
Extend the program to add a function to scan the target sequence. This function should search the target sequence word by word (note the word length used = W) and use the table built from the query sequence to detect hits. Store the found hits.
Extend the program to add a function that extend each one of the hits found in the target sequence. The extension should stop before we add an alignment that makes the score negative. Store all the alignments.
Extend the program to print the alignments sorted by score and beyond a certain length that we can pass as a paramter, e.g.:
```
 
	./simple_blast.pl  seq1.fa  seq2.fa  blosum62 W=4 T=10 L=20
```
It is very important that the program is well documented and for each function there is a clear description of what it does. This documentation must be insde the program and can be in the format of Perl comment using the symbol #.

The description of the project and the results obtained should be clearly explained in a simple web site or in article format. It is important that the work is presented in a way that is reproducible by others. The construction of a web-server will be positively evaluated.