The Blosum45 matrix was used when no orthologous gene was found for A. thaliana, C. elegans and S. cerevisiae. In this link you can find the Ensembl geneID for all these genes (for A. thaliana, the locus name in the database used is given). The species were selected with the aim of the study in mind, which was to look into the evolution of U12 introns, so they cover eukaryotic organisms from different taxa: a plant, a yeast and several animals, including a worm, several arthropods, a frog, two fish, a bird and two mammals. Some of them are phylogenetically very close (F. rubripes and T. nigroviridis; M. musculus and R. norvegicus), so that differences can be observed both between closely related organisms and between more distant ones.
The intron flanking regions, 60 nucleotides upstream and 60 nucleotides downstream, were extracted for each human U12 intron in the chosen genes (the IDs of the exons and their numbers can be found in the geneID file). These sequences were used as queries against the orthologous genes in the other species with exonerate 0.8.2, using the coding2genome model. The command line used for exonerate was:
exonerate --model coding2genome --bestn 1 --showtargetgff TRUE intronX_flankingsequence.fa genesequence.fa
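As a rough illustration of this step, the following Python sketch cuts the 60-nucleotide flanks around an intron and assembles the exonerate call shown above. The helper names, the 0-based end-exclusive coordinates and the choice to join both flanks into a single query record are assumptions made for illustration; the scripts actually used are not reproduced here.

```python
FLANK = 60  # nucleotides taken on each side of the intron, as stated in the text


def flanking_regions(gene_seq, intron_start, intron_end, flank=FLANK):
    """Return (upstream, downstream) flanks of an intron.

    Coordinates are assumed to be 0-based and end-exclusive:
    the intron itself is gene_seq[intron_start:intron_end].
    """
    upstream = gene_seq[max(0, intron_start - flank):intron_start]
    downstream = gene_seq[intron_end:intron_end + flank]
    return upstream, downstream


def fasta_record(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def exonerate_command(query_fasta, target_fasta):
    """Build the argument list for the exonerate call quoted in the text."""
    return [
        "exonerate",
        "--model", "coding2genome",
        "--bestn", "1",
        "--showtargetgff", "TRUE",
        query_fasta,
        target_fasta,
    ]


if __name__ == "__main__":
    # Toy gene: 100 nt exon, a short intron, 100 nt exon (hypothetical data).
    gene = "A" * 100 + "GT" + "C" * 50 + "AG" + "T" * 100
    up, down = flanking_regions(gene, 100, 154)
    # Assumption: both flanks are joined into one query sequence.
    print(fasta_record("intronX_flankingsequence", up + down), end="")
    print(" ".join(exonerate_command("intronX_flankingsequence.fa",
                                     "genesequence.fa")))
```

The argument list returned by `exonerate_command` could be passed to `subprocess.run` on a machine where exonerate is installed.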
The Blosum45 matrix was tried for the plant, the yeast, the worm and the arthropods when nothing was found with the default Blosum62, but this variation yielded no additional results. The whole genome of S. cerevisiae was scanned for all the introns, using the version that is available in Persy.
A script was written to parse the exonerate output. The script also calls geneid v1.2 and a script that scores branch sites, and then parses the outputs of those programs. Geneid was used to score donor and acceptor sites and to see whether they were U12-type or U2-type. The threshold was set to -100 so that the scores for the U2-type and the U12-type could be compared. When the latter is higher, the intron is considered to be a U12-type intron. For the introns which start with AT, geneid only gives a score for U12-type introns, despite the low threshold.
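The decision rule described above can be sketched in Python as follows. The function name and the use of `None` for a missing score are illustrative assumptions. The text does not specify how ties, a missing U12 score, or the combination of donor and acceptor scores were handled, so the choices below for the first two are guesses and the third is left out.

```python
def classify_by_scores(u2_score, u12_score):
    """Classify a splice site as 'U2' or 'U12' from its two geneid scores.

    A score of None means geneid reported no score for that type,
    as happens for the U2 type in introns that start with AT.
    """
    if u2_score is None and u12_score is None:
        return "unscored"
    if u2_score is None:
        # Only a U12-type score is available (e.g. AT-starting introns).
        return "U12"
    if u12_score is None:
        # Assumption: with no U12 score, fall back to U2.
        return "U2"
    # Rule from the text: U12 only when its score is strictly higher.
    return "U12" if u12_score > u2_score else "U2"
```

For example, `classify_by_scores(1.2, 3.4)` returns `"U12"`, while `classify_by_scores(3.4, 1.2)` returns `"U2"`.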
Finally, the analysis was extended to some paralogous genes. Several paralogs present in chicken, mouse and human were selected. The search was performed with the name of the gene, without the final number. ERCC4, ERCC6 and ERCC8 were analysed as paralogs of ERCC5, and NHE1, NHE2 and NHE7 as paralogs of NHE6. The geneID for these genes can be found here. Paralogous genes for KIFAP3 were not found in human when searching with the name (KIFAP), so we decided not to study these paralogs. The results of the first analysis were parsed in order to extract the flanking regions of U12 introns in NHE6 and ERCC5 in mouse and chicken. Then exonerate analyses were performed in the same way as before, but with different query and target sequences. For each species, the query was the flanking region of the corresponding intron from NHE6 or ERCC5 in that species, and the targets were the different paralogous genes in that same species. The exonerate output was then parsed in the same way as before, with the script modified so that it no longer reports anything about the orthologous genes found.
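To make the query and target pairing concrete, here is a small Python sketch that derives the search name by dropping the final number and lists one exonerate job per species, reference gene and paralog. The gene lists come from the text; the function names, the job tuples and the inclusion of human alongside mouse and chicken are assumptions for illustration only.

```python
import re

# Reference genes and the paralogs analysed for each, as listed in the text.
PARALOGS = {
    "ERCC5": ["ERCC4", "ERCC6", "ERCC8"],
    "NHE6": ["NHE1", "NHE2", "NHE7"],
}
# Assumption: the paralog runs cover all three species named in the text.
SPECIES = ["human", "mouse", "chicken"]


def family_stem(gene_name):
    """Drop the final number from a gene name, e.g. 'KIFAP3' -> 'KIFAP'."""
    return re.sub(r"\d+$", "", gene_name)


def paralog_jobs(paralogs=PARALOGS, species=SPECIES):
    """List (species, query_gene, target_gene) triples.

    The query is the intron flanking region from the reference gene in a
    species; the target is a paralogous gene in the same species.
    """
    jobs = []
    for sp in species:
        for query_gene, targets in paralogs.items():
            for target_gene in targets:
                jobs.append((sp, query_gene, target_gene))
    return jobs
```

Each triple corresponds to one exonerate run in which query and target come from the same species.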