THE SCRIPT

The script written to obtain the results of this study can be divided into three major parts. The paragraphs below give a general explanation of the idea behind each part; more specific comments can be found within the script itself.

When the program is called, three input files are needed: one with the gene sequences in FASTA format, one with the exonerate results, and one with the names of the species used.

Orthologous genes found

First, the file with the sequences is read and the sequences are saved in a hash, with the sequence name as the key and the sequence as the value. The location of each gene is saved in a second hash. After that, the file with the species is read and the names are saved in an array. To determine whether the orthologous gene was found in each species, the information from the two files is compared: a function called 'ortoleg' reads the names of the species, compares them with the keys of the hash, and returns whether each one was found. The script then prints the result of the comparison and, if the gene was found, its location, taken from the second hash.
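The lookup described above can be sketched as follows. This is a minimal illustration in Python, not the script itself: the header layout (name followed by a location string), the function names `read_fasta` and `find_orthologs`, and the example data are all assumptions made for the example.

```python
def read_fasta(lines):
    """Return (sequences, locations), both keyed by the first word of each header.

    Assumes headers of the form '>name location' (hypothetical layout).
    """
    sequences, locations = {}, {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            name = fields[0]
            sequences[name] = ""
            locations[name] = fields[1] if len(fields) > 1 else ""
        elif name is not None:
            # Sequence lines may be wrapped, so they are concatenated.
            sequences[name] += line.upper()
    return sequences, locations


def find_orthologs(species, sequences):
    """Map each species name to True if it is a key of the sequence hash."""
    return {sp: sp in sequences for sp in species}


# Hypothetical input standing in for the FASTA file and the species file.
fasta = [">Homo_sapiens chr1:100-200", "ACGT", "ttga",
         ">Mus_musculus chr4:5-50", "GGCC"]
species = ["Homo_sapiens", "Mus_musculus", "Danio_rerio"]

sequences, locations = read_fasta(fasta)
found = find_orthologs(species, sequences)
for sp in species:
    if found[sp]:
        print(sp, "found at", locations[sp])
    else:
        print(sp, "not found")
```

Keeping the locations in a separate dictionary mirrors the two-hash layout described in the text, so the location is only looked up once a species is known to be present.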

Reading the exonerate output

The next part of the script reads the exonerate file and extracts the information needed in this study. The lines starting with 'vulgar' in the exonerate output are saved in one array and the lines of the GFF output in another. The GFF lines are then read to extract the name of the sequence, the start and end of the intron, the donor and acceptor sites, the terminal nucleotides of the intron and the strand. All these values are saved in an array. The characteristics of the intron are extracted differently depending on the strand of the sequence: on the negative strand, the coordinates must be converted before looking up the site sequences in the hash, and the sequence must be changed to its reverse complement. When all the information has been collected, the resulting matrix is printed. The lines starting with 'vulgar' are read to check whether the query sequence for exonerate was aligned on the negative strand. Such a result is discarded, because the query sequences were supplied in the correct orientation. These lines also show whether an alignment was found, so when there is an alignment but no intron, the script reports its coverage (the percentage of the query sequence aligned).
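The strand handling and the coverage calculation can be illustrated with a short Python sketch. It rests on assumptions: the GFF columns follow the standard nine-column layout with 1-based inclusive coordinates given on the forward strand of the target, the 'vulgar' fields follow the order query id, query start, query end, query strand, and the query length is supplied separately. The function names and the example lines are hypothetical, and the coordinate conversion used in the actual script may differ.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_from_gff(line, sequences):
    """Extract intron features from one tab-separated GFF 'intron' line.

    Assumption: start/end are 1-based, inclusive, forward-strand coordinates.
    """
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    start, end, strand = int(fields[3]), int(fields[4]), fields[6]
    intron = sequences[name][start - 1:end]
    if strand == "-":
        # On the negative strand the intron is read from the other strand.
        intron = reverse_complement(intron)
    return {"name": name, "start": start, "end": end, "strand": strand,
            "terminals": (intron[:2], intron[-2:])}


def query_coverage(vulgar_line, query_length):
    """Return (query strand, percentage of the query covered by the alignment)."""
    fields = vulgar_line.split()
    q_start, q_end, q_strand = int(fields[2]), int(fields[3]), fields[4]
    return q_strand, 100.0 * abs(q_end - q_start) / query_length


# Hypothetical data: the same intron placed on each strand.
sequences = {"fwd": "AAGTCCCCAGTT", "rev": "AACTGGGGACTT"}
plus = intron_from_gff("fwd\texonerate\tintron\t3\t10\t.\t+\t.\tintron_id 1", sequences)
minus = intron_from_gff("rev\texonerate\tintron\t3\t10\t.\t-\t.\tintron_id 1", sequences)
strand, coverage = query_coverage("vulgar: query1 0 90 + fwd 100 400 + 450 M 90 90", 120)
print(plus["terminals"], minus["terminals"], strand, coverage)
```

In this sketch both strands yield the same terminal dinucleotides (GT and AG), which is the purpose of reverse-complementing introns found on the negative strand.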

Donor, acceptor and branch sites

Finally, the donor and acceptor sites are saved in FASTA files and geneid (version ) is called to score them. Another program is called to look for the sequence of the branch point. All this information is then parsed in order to show the name of the sequence, the type of intron (U2 or U12), the scores of the donor and acceptor sites, the branch site sequence and its position within the intron. If a sequence has a donor, an acceptor and a branch site, all of them are saved together in one array. If an acceptor site or a branch site does not have a corresponding donor site, it is saved in a separate array. At the end, all this information is printed out.
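The grouping step can be sketched in Python as below. The sketch starts from already parsed results and does not reproduce the output format of geneid or of the branch point program; the dictionary layout, the function name `group_sites` and the example values are hypothetical.

```python
def group_sites(donors, acceptors, branches):
    """Group site records by sequence name.

    donors, acceptors: {name: score}; branches: {name: (sequence, position)}.
    Returns (complete, orphan_acceptors, orphan_branches).
    """
    complete, orphan_acceptors, orphan_branches = [], [], []
    for name in sorted(donors):
        # A record is complete only when all three sites are present.
        if name in acceptors and name in branches:
            seq, pos = branches[name]
            complete.append((name, donors[name], acceptors[name], seq, pos))
    for name in sorted(acceptors):
        if name not in donors:
            orphan_acceptors.append((name, acceptors[name]))
    for name in sorted(branches):
        if name not in donors:
            seq, pos = branches[name]
            orphan_branches.append((name, seq, pos))
    return complete, orphan_acceptors, orphan_branches


# Hypothetical parsed scores and branch sites.
donors = {"geneA": 3.2, "geneB": 1.1}
acceptors = {"geneA": 2.5, "geneC": 0.7}
branches = {"geneA": ("TACTAAC", 25), "geneD": ("CTCTAAT", 18)}

complete, orphan_acceptors, orphan_branches = group_sites(donors, acceptors, branches)
for row in complete:
    print(*row, sep="\t")
```

Sequences that have a donor site but lack an acceptor or a branch site (geneB in the example) are left out of all three lists here, since the text does not say how they are handled.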

Output

The output of the script therefore shows whether the orthologous genes were found and, if so, their location; a table with the intron features extracted in the second part; and a table with the scores of the donor and acceptor sites, the branch site sequence and its position. If a query on the negative strand or an alignment without an intron was found, it is reported between the second and the third part.