After testing our program with different datasets, we can affirm that is not always good enough for our purpouses, because it has big problems refering the initiation value of the random 'P' matrix.
If we run the program, reached the convergence, we realize that the maximum value that predicts the probability that any position could be the motif start point is not statistically significant. We ran the program with the following randomly generated sequences. As we show below, these contain the marked motif (which is 6- nucleotide long):
Sequence 1: CACGTACCTACTCTGTATATACAACTACAA
Sequence 2: GGGGGTATATACTTTTCCTCCGAATCATTA
Sequence 3: GACAGTGGATCTCTAACCGCTATGTATATA
Sequence 4: CGAAACGTATATAACCACGTGAGCACTTAT
Sequence 5: TCTGACAGAGTGCTCTCCCGCACTATATAG
We ran the program 10 times and in all of them the motif found was not the real motif hidden in each sequences.
The motifs found were:
Sequence 1: CACGTACCTACTCTGTATATACAACTACAA
Sequence 2: GGGGGTATATACTTTTCCTCCGAATCATTA
Sequence 3: GACAGTGGATCTCTAACCGCTATGTATATA
Sequence 4: CGAAACGTATATAACCACGTGAGCACTTAT
Sequence 5: TCTGACAGAGTGCTCTCCCGCACTATATAG
And the probabilities of each sequence motif starting point are shown in the next table (these values didn't changed after running the program 10 times):
Sequence |
Max Prob Motif Start Point |
1 | 0.05953 |
2 | 0.07148 |
3 | 0.05777 |
4 | 0.06361 |
5 | 0.06384 |
The probability of any concrete position could be the motif starting point is too low, so in any case, it cannot be considered as a real motif starting point.
So, considering that the motif is not found by our program we conclude that is due to its incapacity to find a position with a higher probability to be the motif starting point.
We also try a lot of different set of sequences, like only 3 sequences and shorter, or more sequences and longer, etc. The best result we had is shown in next table:
CCGTATATAAGCGCGCCAAGAATGGTTATC
CTCCGACAGGATCACGGCGCGCATATATAG
ACTTGTATATATAGCTCCGGCCGCGCGCTG
Sequence |
Max Prob Motif Start Point |
1 | 0.36817 |
2 | 0.39502 |
3 | 0.30516 |
After thinking about why that happened, we believe that EM algorithm can only find the real hidden motif depending on the following parameters :
|
|
|
|
|
Once we have had our program done and working, it seemed not to work as well as expected. It found the motif once for each 8 times we ran the program, and as explained before the Z matrix values were to low to be statistically significant.
In an atempt to find a solution we tried to change the weak part of the program, the random star. So we try to make ten iteration, each one creating a random P matrix and doing only once all the path trough the program, and storing each matrix in a big hash. Then comparing the likelihood of each matrix and take the best one as our starting P matrix for the main body of the program. But by doing it the program didn't compile, and we haven't be able to fix it.
To see this test: Improvement failed