Help page

Needleman & Wunsch algorithm [1]

Aligns two sequences completely, from end to end (a global alignment).
Because both sequences must be aligned over their whole length, it is not very practical for sequences that share only a local region of similarity, but it illustrates important concepts.
Used when an objective, optimal measure is needed to compare two sequences and it is reasonable to assume that the sequences are of similar length.

Determines an optimally scoring alignment of two sequences.
In some cases there may be more than one equally good alignment.

Algorithm
Construct a matrix with one index for each sequence.
Fill that matrix recursively with scores representing the best alignment from the beginning to the index point.

Use a PAM or BLOSUM matrix to calculate the score for each pair of aligned amino acids.
Apply a gap penalty wherever an amino acid must be paired with a gap.

The score of an alignment will be the sum of scores for individual amino acid pairs.
Start with a score of zero in the upper left corner of the matrix.

The matrix can then be filled in recursively using the maximum of three possibilities:
No new gap must be inserted.
The score of the sub-alignment is the score of the partial alignment plus the substitution score (PAM or BLOSUM) for the next pair of amino acids.

A gap must be inserted in sequence i.
The score of the sub-alignment is the sum of the partial alignment plus the gap penalty.

A gap must be inserted in sequence j.
The score of the sub-alignment is the sum of the partial alignment plus the gap penalty.

These three possibilities are represented in the matrix by the three adjoining cells of each cell: the one diagonally above and to the left, the one above, and the one to the left.
For each cell in the matrix, fill in the maximum of these three possibilities, and record from which cell the calculation was propagated.

When the entire matrix is full, follow the path of best options from the lower right corner to the upper left to yield the alignment.
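
The fill and traceback steps above can be sketched in Python as follows. This is only an illustration, not the code of GARA.pl: the function names, the +1/-1 scoring function and the tie-breaking order (diagonal first) are assumptions made for the example. In practice score_pair would look up a PAM or BLOSUM value.

```python
def needleman_wunsch(seq_i, seq_j, score_pair, gap=-4):
    """Global alignment; returns (score, aligned_i, aligned_j)."""
    n, m = len(seq_i), len(seq_j)
    # score[a][b] = best score for aligning seq_i[:a] with seq_j[:b]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # origin[a][b] = neighbouring cell the value was propagated from
    origin = [[None] * (m + 1) for _ in range(n + 1)]

    # First column and first row: a prefix aligned only against gaps.
    for a in range(1, n + 1):
        score[a][0] = a * gap
        origin[a][0] = "up"
    for b in range(1, m + 1):
        score[0][b] = b * gap
        origin[0][b] = "left"

    # Fill every other cell with the maximum of the three possibilities.
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            options = [
                (score[a - 1][b - 1] + score_pair(seq_i[a - 1], seq_j[b - 1]), "diag"),
                (score[a - 1][b] + gap, "up"),    # gap inserted in seq_j
                (score[a][b - 1] + gap, "left"),  # gap inserted in seq_i
            ]
            # On ties, max() keeps the first option (an arbitrary choice).
            score[a][b], origin[a][b] = max(options, key=lambda o: o[0])

    # Traceback from the lower right corner to the upper left.
    out_i, out_j = [], []
    a, b = n, m
    while a > 0 or b > 0:
        move = origin[a][b]
        if move == "diag":
            out_i.append(seq_i[a - 1])
            out_j.append(seq_j[b - 1])
            a, b = a - 1, b - 1
        elif move == "up":
            out_i.append(seq_i[a - 1])
            out_j.append("-")
            a -= 1
        else:
            out_i.append("-")
            out_j.append(seq_j[b - 1])
            b -= 1
    return score[n][m], "".join(reversed(out_i)), "".join(reversed(out_j))


def simple_score(x, y):
    # Hypothetical scoring for the example only: +1 match, -1 mismatch.
    return 1 if x == y else -1


score, top, bottom = needleman_wunsch("ACGT", "AGT", simple_score, gap=-1)
print(score)   # 2
print(top)     # ACGT
print(bottom)  # A-GT
```

When several alignments are equally good, this sketch returns only one of them; which one depends on the order in which the three options are listed.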

Sequence input

The sequences must be pasted in FASTA format, as shown below:

                >protein identification (just one line)
                MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPN
                FFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTV
                KTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHI

OR

                >DNA identification (just one line)
                GTCCGCCGCCGCCTGCTGGGCCGGCCGAGGATGCAGCGCAGCGCCTCGGTGGCCAGGCTC
                AGCGTGCTTGCTAACTTCCCCGGCTCCGTCTCTGCCTGCCGGGGTCGCCCCGTGTCCCTG
                GTCTGGTTCTCTAAGCTCTCTGGGCGCTGCCTCCGGGTCCCTTGCAGCCCGCTCGCGAGC
                CTCCTGCGCCCCACCCTCGTCCTCGCCATGCTGCCCTTCGGCCTCGTGGCCGCCCTGCTG
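
As an illustration of the format, the following Python sketch reads one sequence in FASTA format: a single identification line starting with ">" followed by one or more sequence lines. The function name read_fasta and its error messages are hypothetical, and converting the sequence to upper case is an assumption made for the example; this is not how the web form itself is implemented.

```python
def read_fasta(text):
    """Return (identification, sequence) from a single-record FASTA string."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected exactly one FASTA record")
            header = line[1:].strip()
        else:
            if header is None:
                raise ValueError("FASTA input must start with a '>' line")
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA identification line found")
    # Sequence lines are joined into one string without line breaks.
    return header, "".join(chunks)


name, seq = read_fasta(">protein identification\nMPIGSKERPT\nFFEIFKTRCN\n")
print(name)  # protein identification
print(seq)   # MPIGSKERPTFFEIFKTRCN
```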

Comparison Matrix [2]

The PAM family

PAM matrices are based on global alignments of closely related proteins.

The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.

Other PAM matrices are extrapolated from PAM1.

The BLOSUM family

BLOSUM matrices are based on local alignments.

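BLOSUM 62 is a matrix calculated from alignment blocks in which sequences sharing 62% identity or more are clustered together, so the substitutions it counts come from sequences that are less than 62% identical.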

All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

The relationship between BLOSUM and PAM matrices

BLOSUM matrices with high numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences. BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins. If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.

For DNA sequences we propose two matrices:

The "identity matrix" does not penalize transversions (purine to pyrimidine, pyrimidine to purine) more heavily than transitions (purine to purine, pyrimidine to pyrimidine).
The "transition/transversion matrix" penalizes transversions more heavily than transitions, because transversions are much less frequent mutations.
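
The difference between the two DNA matrices can be sketched in Python as two scoring functions. The numeric values (+1, -1, -2) are assumptions chosen for the example; they are not the values used by GARA.pl. The sketch also assumes that the sequences contain only A, C, G and T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_score(x, y, match=1, mismatch=-1):
    # Every mismatch costs the same, whether transition or transversion.
    return match if x == y else mismatch


def transition_transversion_score(x, y, match=1, transition=-1, transversion=-2):
    if x == y:
        return match
    # A transition stays within one chemical class (A<->G or C<->T).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


print(identity_score("A", "G"), identity_score("A", "C"))  # -1 -1
print(transition_transversion_score("A", "G"))  # -1 (transition)
print(transition_transversion_score("A", "C"))  # -2 (transversion)
```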

Gap penalty
Introducing gaps into sequence alignments allows the alignment to be extended into regions where one sequence may have lost or gained characters not found in the other. A penalty is subtracted for each gap introduced into an alignment because the gap increases the uncertainty of the alignment. If gaps are introduced without a penalty, then they can be introduced at random, and eventually all characters will be aligned even in random sequences. For example:
A---SDL-LLL---A-DDDD
|    || |||   |
ADDD-DLALLLAAAAS----
If the gap penalty is too low, many gaps will be introduced into the sequences and only identical amino acids will be paired, as the previous example shows.

If the gap penalty is too high, only a few gaps will be introduced into the sequences and there will be many mismatches between them.

We propose a standard gap penalty of -4.
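
To see the effect of the penalty, the Python sketch below scores the example alignment A---SDL-LLL---A-DDDD / ADDD-DLALLLAAAAS---- twice: once with free gaps and once with a gap penalty of -4. The function names and the +1/-1 pair scores are assumptions made for the example (a real run would use a PAM or BLOSUM matrix).

```python
def score_alignment(aligned_i, aligned_j, score_pair, gap):
    """Sum the column scores of an alignment that has already been built."""
    if len(aligned_i) != len(aligned_j):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_i, aligned_j):
        if x == "-" or y == "-":
            total += gap  # a residue paired with a gap
        else:
            total += score_pair(x, y)
    return total


def unit_score(x, y):
    # Hypothetical scoring for the example only: +1 match, -1 mismatch.
    return 1 if x == y else -1


top = "A---SDL-LLL---A-DDDD"
bottom = "ADDD-DLALLLAAAAS----"
print(score_alignment(top, bottom, unit_score, gap=0))   # 7
print(score_alignment(top, bottom, unit_score, gap=-4))  # -45
```

With free gaps the 7 identical pairs give a positive score, although 13 of the 20 columns are gaps. With a penalty of -4 the same alignment scores -45, so the algorithm would prefer an alignment with fewer gaps.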

GARA.pl
GARA.pl is a Perl program developed to align pairs of sequences. This website uses the program to obtain the sequence alignment in the following format:
seq1,   1       MPIGSK-ERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEP    60
                ||:  |  |||| |||| ||: ||||||||||||||||||||||||| |||| | : |||
seq2,           MPVEYKR-RPTFWEIFKARCSTADLGPISLNWFEELSSEAPPYNSEPPEESEYKPHGYEP
 
seq1,   61      NLFKTPQRKPSYNQLASTPIIFKE--QGLTLPL-YQSPVKELDKFKLDLGRNVPNSRHK-    120
                 ||||||| | |:| ||||| |||  |  ||||  |||      |: :||: | :|:||
seq2,           QLFKTPQRNPPYHQFASTPIMFKERSQ--TLPLD-QSP------FR-ELGKVVASSKHKT
 
seq1,   121     -SLRTVKTKMDQADDV-SCPLLNSCLSESPVVLQCTH-VTPQRDKSVVCGSLFHTPKFVK    180
                 | :  ||| |   || | | | |||||||  | ||: |  ||:| || |||| |||  |
seq2,           HS-KK-KTKVDPVVDVAS-PPLKSCLSESPLTLRCTQAV-LQREKPVVSGSLFYTPK-LK
 
seq1,   181     -GRQTPKHI--------  197
                 | |||| |
seq2,           EG-QTPKPISESLGVEV
 
 
                Score: 667
                Percent of matching: 60.41% 
                Substitution array: PAM30
                Gap penalty: -4
The symbol | marks an exact match; the symbol : marks a mismatch between two similar amino acids (similarity is taken from the substitution array: pairs of amino acids with a positive score are considered similar).
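
The rule for the middle line can be sketched in Python as follows. This is not the code of GARA.pl: the function names are hypothetical, the small similarity table holds made-up values rather than real PAM or BLOSUM scores, and dividing the number of exact matches by the alignment length to obtain the percentage is an assumption.

```python
# Made-up similarity scores for the example; a real run would use the
# chosen substitution array.
TOY_SIMILAR = {frozenset("IV"): 2, frozenset("KR"): 3}


def toy_score(x, y):
    if x == y:
        return 5
    return TOY_SIMILAR.get(frozenset((x, y)), -1)


def match_line(aligned_1, aligned_2, score_pair):
    symbols = []
    for x, y in zip(aligned_1, aligned_2):
        if x == "-" or y == "-":
            symbols.append(" ")  # gap column
        elif x == y:
            symbols.append("|")  # exact match
        elif score_pair(x, y) > 0:
            symbols.append(":")  # mismatch between similar residues
        else:
            symbols.append(" ")  # other mismatch
    return "".join(symbols)


def percent_matching(aligned_1, aligned_2):
    # Assumption: exact matches divided by the number of alignment columns.
    identical = sum(1 for x, y in zip(aligned_1, aligned_2) if x == y and x != "-")
    return 100.0 * identical / len(aligned_1)


print(repr(match_line("MPIGSK-E", "MPVEYKR-", toy_score)))  # '||:  |  '
print(percent_matching("MPIGSK-E", "MPVEYKR-"))             # 37.5
```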
You can also download GARA.pl to use it on your own computer (from a Linux terminal).
To run the program, type:
 
$ ./GARA.pl sequence1_file sequence2_file [substitution array] [gap penalty]
If you want some help, type:
$ ./GARA.pl -h

The authors

Gabriele Praderio
Ramon Roset

Students of Biology at the "Universitat Pompeu Fabra" of Barcelona (4th year).

This project is part of the subject "Bioinformàtica", taught by Roderic Guigó.

References:

[1] Needleman, S. B., Wunsch, C. D., J. Mol. Biol. (1970) 48:443-453
[2] http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html

For comments and suggestions, write to: Gabriele Praderio and Ramon Roset

UPF Barcelona, March 18, 2002