ClustalMU
MULTIPLE SEQUENCE ALIGNMENT (MSA)
Introduction:
ClustalMU is a program written in Perl whose objective is to obtain the best
alignment between multiple protein sequences. It follows the ClustalW method,
designed and optimized by Julie D. Thompson et al., which divides the multiple
alignment into three steps:
- First, it performs a global pairwise alignment of every pair of sequences to
measure their similarity, and from those results it constructs a distance
matrix.
- Next, it builds a guide tree with the Neighbour-Joining method, using the
distances stored in the distance matrix created in the first step.
- Finally, it performs the progressive alignment, following the order given by
the guide tree.
Once these steps are done, it creates an output file in which all the sequences
are aligned. This is a powerful way to discover the amino acids that are
especially conserved across all the proteins and that have been preserved
during their evolution by negative selection. Moreover, these regions can be
important for the proteins' function or for maintaining their functional
structure.
How ClustalMU works:
The process is divided into several steps, detailed below:
Step 1) Extracting the sequences from a FASTA file:
It can read as many sequences as you want from a FASTA file. The data are saved
into a hash and then into two separate arrays: one for the sequences (values)
and another for the names (keys).
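The reading step can be sketched as follows. This is a minimal illustration written in Python rather than ClustalMU's own Perl; the function name read_fasta and the two example records are hypothetical, and taking the first word of the header as the sequence name is an assumption.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping name -> sequence."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word after '>' as the name
            # (assumption; a header with no text gets a generated name).
            parts = line[1:].split()
            name = parts[0] if parts else "seq%d" % (len(sequences) + 1)
            sequences[name] = ""
        elif name is not None:
            # Sequence data may span several lines; concatenate them.
            sequences[name] += line
    return sequences


# Hypothetical input, standing in for the contents of a FASTA file.
example = """>seqA demo protein
MKTAYIAK
QRQISFVK
>seqB
MKTAHIAK
""".splitlines()

records = read_fasta(example)
names = list(records.keys())    # one list for the names (keys)
seqs = list(records.values())   # one list for the sequences (values)
```

The two lists stay in step with each other because a Python dict preserves insertion order, so names[i] always belongs to seqs[i].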
Step 2) Reading a substitution matrix (e.g.: BLOSUM62):
It extracts the score for each pair of amino acids and saves it into a hash,
which will be important when aligning.
e.g.: Aligning an alanine with an arginine.
If aligning an alanine (A) with an arginine (R) has a score of -1, it saves
{"A"}{"R"} as keys and -1 as the value.
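A two-level lookup of this kind could be built as in the sketch below, written in Python for illustration. It assumes the matrix is stored as whitespace-separated text with a header row of residue letters and '#' comment lines; the function name is hypothetical, and only a four-residue excerpt of BLOSUM62 is used to keep the example short.

```python
def read_substitution_matrix(lines):
    """Return a nested dict so that scores[a][b] is the score of a vs b."""
    scores = {}
    columns = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if columns is None:
            columns = fields  # first data line is the column header
            continue
        row, values = fields[0], fields[1:]
        scores[row] = {col: int(v) for col, v in zip(columns, values)}
    return scores


# Excerpt of BLOSUM62 restricted to A, R, N and D.
excerpt = """# BLOSUM62 excerpt
   A  R  N  D
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6  1
D -2 -2  1  6
""".splitlines()

scores = read_substitution_matrix(excerpt)
print(scores["A"]["R"])  # score for aligning alanine with arginine
```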
Step 3) Obtaining an alignment order:
This order is needed for the global pairwise alignment. ClustalMU creates an
array holding every possible combination of two sequences to align.
e.g.: For 5 protein sequences.
With 5 protein sequences, it makes all the combinations without repetition:
1-2; 1-3; 1-4; 1-5; 2-3; 2-4; 2-5; 3-4; 3-5; 4-5.
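The list of pairs above can be reproduced with a few lines of Python, shown here only as an illustration (the function name alignment_order is hypothetical). For n sequences there are n(n-1)/2 pairs, which gives 10 for n = 5.

```python
from itertools import combinations


def alignment_order(n_sequences):
    """All unordered pairs (i, j) with i < j, numbering sequences from 1."""
    return list(combinations(range(1, n_sequences + 1), 2))


pairs = alignment_order(5)
order_text = "; ".join("%d-%d" % (a, b) for a, b in pairs)
print(order_text)
# prints: 1-2; 1-3; 1-4; 1-5; 2-3; 2-4; 2-5; 3-4; 3-5; 4-5
```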
Step 4) The global pairwise alignment:
It aligns every pair of sequences, taking the pairs in the order given by the
order array. The alignment uses the substitution matrix hash created in step 2,
because the possible mismatches must be told apart: aligning an amino acid with
a different one scores differently depending on which two amino acids are
involved.
After each alignment is finished, ClustalMU computes a score for it, which is
needed to construct the distance matrix. The scores are saved in an array that
runs parallel to another array recording which pair of sequences produced each
score.
e.g.: An example of how ClustalMU aligns and calculates the distance.
Aligning fuguercc5 and urokinaseiso2: the distance is 0.958413085087872
Aligning fuguercc5 and apolipoprotein: the distance is 0.973029406646946
Aligning fuguercc5 and homoercc5: the distance is 0.558135704874835
Aligning fuguercc5 and urokinaseiso3: the distance is 0.954585000870019
Aligning fuguercc5 and tetraercc5: the distance is 0.442317730990082
Aligning urokinaseiso2 and apolipoprotein: the distance is 0.89735516372796
Aligning urokinaseiso2 and homoercc5: the distance is 0.962121212121212
Aligning urokinaseiso2 and urokinaseiso3: the distance is 0.32252027448534
Aligning urokinaseiso2 and tetraercc5: the distance is 0.949809545149003
Aligning apolipoprotein and homoercc5: the distance is 0.978919631093544
Aligning apolipoprotein and urokinaseiso3: the distance is 0.88646288209607
Aligning apolipoprotein and tetraercc5: the distance is 0.959220255433565
Aligning homoercc5 and urokinaseiso3: the distance is 0.95998023715415
Aligning homoercc5 and tetraercc5: the distance is 0.645092226613966
Aligning urokinaseiso3 and tetraercc5: the distance is 0.94174322204795
These distances are what is needed to construct the first distance matrix.
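A global pairwise alignment driven by a substitution matrix is commonly done with the Needleman-Wunsch dynamic-programming method, sketched below in Python for illustration. The text does not state which gap penalty or distance formula ClustalMU uses, so the linear gap penalty of -4 and the distance defined as one minus the fraction of identical columns are assumptions, as are all function names. The score table is a four-residue excerpt of BLOSUM62.

```python
SCORES = {  # BLOSUM62 excerpt for A, R, N, D
    "A": {"A": 4, "R": -1, "N": -2, "D": -2},
    "R": {"A": -1, "R": 5, "N": 0, "D": -2},
    "N": {"A": -2, "R": 0, "N": 6, "D": 1},
    "D": {"A": -2, "R": -2, "N": 1, "D": 6},
}
GAP = -4  # assumed linear gap penalty, not taken from ClustalMU


def global_align(a, b, scores=SCORES, gap=GAP):
    """Needleman-Wunsch: return (aligned_a, aligned_b, total_score)."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] is the best score for aligning a[:i] with b[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + scores[a[i - 1]][b[j - 1]],  # (mis)match
                dp[i - 1][j] + gap,                              # gap in b
                dp[i][j - 1] + gap,                              # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + scores[a[i - 1]][b[j - 1]]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), dp[-1][-1]


def distance(aligned_a, aligned_b):
    """Assumed distance: 1 - (identical columns / alignment length)."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 1 - same / len(aligned_a)


al_a, al_b, total = global_align("ARND", "AND")
dist = distance(al_a, al_b)
print(al_a, al_b, total, dist)
```

With these toy inputs the best alignment places a single gap opposite the arginine, because every alternative pairs it with a low-scoring mismatch as well as paying for a gap elsewhere.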
Step 5) Constructing the distance matrix:
Distance matrices are constructed in order to obtain the guide tree. Once it
has a distance matrix, ClustalMU searches the matrix for the minimum distance,
saves references to the two sequences that give that distance, and rebuilds the
matrix, recalculating the distances that involve those two sequences. This step
is repeated until all the sequences have been joined together.
e.g.: An example of a distance matrix constructed by ClustalMU (lower triangle;
the columns follow the same order as the rows).
fuguercc5      | 0 |
urokinaseiso2  | 0.958413085087872 | 0 |
apolipoprotein | 0.973029406646946 | 0.89735516372796 | 0 |
homoercc5      | 0.558135704874835 | 0.962121212121212 | 0.978919631093544 | 0 |
urokinaseiso3  | 0.954585000870019 | 0.32252027448534 | 0.88646288209607 | 0.95998023715415 | 0 |
tetraercc5     | 0.442317730990082 | 0.949809545149003 | 0.959220255433565 | 0.645092226613966 | 0.94174322204795 | 0 |
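The search for the minimum distance can be illustrated on the example matrix itself. The Python sketch below stores the lower triangle as a list of rows and scans every off-diagonal cell; the function name closest_pair is hypothetical and the code is an illustration, not ClustalMU's Perl.

```python
names = ["fuguercc5", "urokinaseiso2", "apolipoprotein",
         "homoercc5", "urokinaseiso3", "tetraercc5"]

# Lower triangle of the example matrix; row i holds distances to 0..i.
lower = [
    [0],
    [0.958413085087872, 0],
    [0.973029406646946, 0.89735516372796, 0],
    [0.558135704874835, 0.962121212121212, 0.978919631093544, 0],
    [0.954585000870019, 0.32252027448534, 0.88646288209607,
     0.95998023715415, 0],
    [0.442317730990082, 0.949809545149003, 0.959220255433565,
     0.645092226613966, 0.94174322204795, 0],
]


def closest_pair(matrix):
    """Return (distance, column, row) of the smallest off-diagonal entry."""
    best = None
    for i, row in enumerate(matrix):
        for j in range(i):  # j < i skips the zero diagonal
            if best is None or row[j] < best[0]:
                best = (row[j], j, i)
    return best


min_dist, first, second = closest_pair(lower)
print(names[first], names[second], min_dist)
# prints: urokinaseiso2 urokinaseiso3 0.32252027448534
```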
Step 6) Creating the guide tree:
The guide tree is built while the distance matrices are being processed. In
step 5, a node is defined every time a minimum distance is found: the two
sequences that give the minimum distance are joined because a node exists
between them. By joining sequences in this way, we finally obtain the guide
tree from which the progressive alignment would be done.
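The join-and-rebuild loop can be sketched as below, in Python for illustration. Note the swap: the text names Neighbour-Joining, but how ClustalMU recalculates distances after a join is not spelled out, so this sketch uses a simpler size-weighted average (UPGMA-style) update instead and may therefore root the tree differently from ClustalMU. The function name and the four-leaf toy distances are hypothetical.

```python
def build_guide_tree(labels, dist):
    """Join the closest pair of clusters until a single cluster remains.

    labels: list of leaf labels (strings).
    dist:   dict mapping frozenset({a, b}) -> distance between a and b.
    Returns (tree, nodes): the nested-parentheses tree and every node
    in the order in which it was created.
    """
    dist = dict(dist)                 # work on a copy
    clusters = list(labels)
    size = {c: 1 for c in clusters}   # number of leaves in each cluster
    nodes = []
    while len(clusters) > 1:
        # Find the pair of clusters with the minimum distance.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: dist[frozenset(pair)],
        )
        merged = "(%s,%s)" % (a, b)
        nodes.append(merged)
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            # Assumed update rule: size-weighted mean of the old distances.
            dist[frozenset((merged, c))] = (
                size[a] * dist[frozenset((a, c))]
                + size[b] * dist[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        clusters = rest + [merged]
    return clusters[0], nodes


# Toy distances between four hypothetical sequences numbered 0 to 3.
toy = {
    frozenset(("0", "1")): 0.2, frozenset(("2", "3")): 0.3,
    frozenset(("0", "2")): 0.8, frozenset(("0", "3")): 0.9,
    frozenset(("1", "2")): 0.7, frozenset(("1", "3")): 0.85,
}
tree, nodes = build_guide_tree(["0", "1", "2", "3"], toy)
print(tree)   # prints: ((0,1),(2,3))
```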
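The results obtained when running 6 protein sequences are shown below:
From these sequences: fuguercc5 urokinaseiso2 apolipoprotein homoercc5 urokinaseiso3 tetraercc5
The order to follow when aligning is: (2,((1,4),(3,(0,5))));
Each number is the position of a sequence in that list, counting from 0 (so 1
and 4 are urokinaseiso2 and urokinaseiso3, the closest pair in the distance
matrix).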
Although we worked very hard to finish the program, it was impossible for us to
complete it. We obtain the order in which ClustalMU would perform the
progressive alignment, but we were unable to write the regular expression that
would have allowed ClustalMU to work out in which order it has to align the
sequences.
When the construction of the guide tree ends, ClustalMU obtains an array (@nodes) with the following structure:
$nodes[0]=(2,((1,4),(3,(0,5))))
$nodes[1]=((1,4),(3,(0,5)))
$nodes[2]=(3,(0,5))
$nodes[3]=(0,5)
$nodes[4]=(1,4)
Its structure shows the order in which ClustalMU has found the nodes. The next step would be to read this array and create the progressive alignment.
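One possible way to read such a nested string, offered only as a sketch and not as part of ClustalMU, is a small recursive parser instead of a regular expression, since nested parentheses are awkward to match with a single pattern. The Python example below turns the string into nested tuples and then walks them so that every node is listed after its children, which is the order a progressive alignment would need. Both function names are hypothetical, and the parser assumes strictly binary nodes with integer leaves.

```python
def parse_tree(text):
    """Parse e.g. '(2,((1,4),(3,(0,5))))' into nested tuples of ints."""
    text = text.strip().rstrip(";")
    pos = 0

    def expect(char):
        nonlocal pos
        if pos >= len(text) or text[pos] != char:
            raise ValueError("expected %r at position %d" % (char, pos))
        pos += 1

    def node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            expect("(")
            left = node()
            expect(",")
            right = node()
            expect(")")
            return (left, right)
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return int(text[start:pos])  # raises ValueError if no digits found

    return node()


def alignment_steps(tree, steps=None):
    """Post-order walk: children are listed before the node joining them."""
    if steps is None:
        steps = []
    if isinstance(tree, tuple):
        for child in tree:
            alignment_steps(child, steps)
        steps.append(tree)
    return steps


tree = parse_tree("(2,((1,4),(3,(0,5))));")
steps = alignment_steps(tree)
for step in steps:
    print(step)
```

For the example tree the first two steps are the pairs (1, 4) and (0, 5), followed by the join of sequence 3 with the (0, 5) group, then the join of the two groups, and finally the addition of sequence 2.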