Introduction | Goals | Methodology | Programs | Results | Conclusions | References | Authors | Acknowledgements |
PROGRAMS
COMPARISON OF THE COMPLEXITY BETWEEN BRUTE-FORCE PATTERN MATCHING ALGORITHM AND KNUTH-MORRIS-PRATT ALGORITHM
In order to compare the efficiency of the brute-force pattern matching algorithm with Knuth-Morris-Pratt algorithm, we made a program in Perl that implements the brute-force pattern matching algorithm. The logic of this algorithm is detailed in Methodology.
The main program function is calculate the number of operations that brute-force pattern matching algorithm does, and then, we can compare with Knuth-Morris-Pratt algorithm. For that reason, the input of the program as a parameter is the sequence of mRNA. The program generates all the possible patterns from mRNA and seek which of them are unique. Moreover, the program contains a meter ($x variable), which has the number of operations that the program does. To read the code of this program, download correspondencia.pl.
To compare the algorithmic complexity, we made a similar program, which also generate pattterns from mRNA and seek if there were unique or not in that mRNA, but with Knuth-Morris-Pratt algorithm. The results of this comparison can be consulted in Results.
|
|||||||||||||||||||||||||||||||||
SEARCH OF UNIQUE SUBSEQUENCES OF ONE MRNA IN A GENOME
|
|||||||||||||||||||||||||||||||||
The main function of the program consists in looking for unique and specfic subsequences from one mRNA in one enterely genome (human, mouse,...).
The way we developed the program to achieve this goal has been creating optimum program's substructures for each problem to solve:
Next table construction
Next table is a sine qua non factor of Knuth-Morris-Pratt algorithm and, for this reason, one of the first steps. The basis of this coincidences table in the Knuth-Morris-Pratt algorithm are available in Methodology.
The program code, which allow us to calculate this next table in the program, is encapsulated inside a function named nexttable:
|
This function receives as first and unique parameter the pattern, which it will use for building the next table. This table is based on the construction of a coincidences table in an array named @coin (coincidences) indexed by the same positions of the pattern. This array always have the value of -1 in the first position (0) and the value of 0 in the second position (1). Then, the algorithm goes through the rest of pattern positions comparing the maximum previous position prefix with the pattern and annotating the coincidences in each position. If you want a detailed explanation of this operation, you can see program code.
Once you have the next table, although it is very important, there are more problems to be solved yet.
Knuth-Morris-Pratt Algorithm
The logic algorithm and why the achievement in efficiency regarding to the brute-force pattern matching algorithm is detailed in Methodology.
Its implementation in situ the program is in two functions: one function named knuth, to check if the patterns are unique in the mRNA; and another function, knuth4genome, to check the same verification in the whole genome.
The first function, knuth, receives three parameters: the next table calculated before, the mRNA sequence and the pattern. And returns 1 (true) or 0 (false) if the pattern is unique or not in the mRNA.
The second function, knuth4genome, receives four parameters: the next table, the genome directory address, the pattern and the pattern length. And returns 1 (true)* or 0 (false) if the pattern is unique or not in the genome, and in the case that is unique, also returns the movement and the chromosome where it is localize.
* The function returns 1 when the unique patterns meter is equal or less than 1. The reason of that is for those patterns which are unique in the mRNA but there are not found in the genome, because the pattern can be located in two different exons with an intron among them.
However, in this second function, before implementing KMP algorithm, we must considered some preparations:
|
|
|
|
By the way, in both functions the KMP algorithm is the next one:
|
This algorithm is based on going through the sequence, where we want to check if the patterns is unique or not, comparing with the pattern. At the moment of modifying the movement, if the pattern matches exactly or not, the program uses the information from the next table. In this way, the program achieves moving the pattern over the sequence enough to save the sequence positions that are not necessary to compare and also, start the next comparison in a fix pattern position (saving pattern positions that doesn't need to be compared neither). For a more detailed explanation, see program code.
Pattern generation
The program is capable of opening the FASTA file of mRNA by itself and generates all the possible patterns, with the Perl function substr, for a fix length. In this way, is sure that the program analyzes all the subsequences susceptible to be unique in mRNA.
|
Automatic strategy of modifying the length Once the program has checked unique patterns for a fix length in mRNA, there are two possible options:
|
|
|
|