Methods
QUERY OBTENTION
The queries represent proteins of Homo sapiens and Equus caballus, downloaded from SelenoDB database. The queries obtained are representative of the most conserved families of selenoproteins in mammalians.
We decided to compare our reference genome with Equus caballus, as it is a good phylogenetically candidate. Besides, we used the species Homo sapiens to complete the information of some of the proteins, as it is better annotated than Equus caballus in the selenoprotein data base (13).
GENOME OBTENTION
We were provided with the genome of Odocoileus virginianus texanus, which was saved in the following directory:
We were supposed to be given a genome.index too but we didn't have one, so we created it with the help of the tutors. It was saved in our own directory and so was the file of the genome:
When we performed the tblastn, we realized that we were obtaining very few scaffolds comparing with the total number of proteins we had downloaded from SelenoDB. Taking into account that very conserved selenoproteins were missing in the analysis of both human and horse queries, we decided to compare our reference genome with a third species: Bos taurus , suspecting that we might have chosen not so phylogenetically related species. As the analysis of Bos taurus gave very similar results as Homo sapiens and Equus caballus, we started to think about a possible problem in our reference genome. Looking back with the path in the original directory that was given, we found that our database had less number of coding proteins (7000) than the same genome package obtained in NHCI (17000). Therefore, we proceeded to download the new package and run a tblastn again.
The results showed a higher number of scaffolds than before, so we loaded the new package into our user account and started working with it. Besides, we concluded that our initial problem with the index was probably due to this too. We saved the new files with the following names:
To export the modules and the necessary paths for the execution of the program, we wrote the following commands in shell:
MANUAL PREDICTION OF SELENOPROTEINS
BLAST (Basic Local Alignment Search Tool) is a program that compares biological sequences to find regions of similarity. This program is based in an heuristic algorism which uses the most efficient alignment strategy in order to find the most realistics results. (20)
In this study we aim to find the region of our database (Odocoileus virginianus texanus genome) that is highly likely to encode a selenoprotein similar to the ones from Homo sapiens and Caballus equus.
To do so, we used the specific program TBLASTN, which compares a database translated in all open reading frames against a protein query sequence which, in our case, would be the proteins from Homo sapiens and Caballus equus.
We used the option "-outfmt 6" so that the result of the blast would appear tabulated. We also used the option "-evalue 0.01" with in order to discard the statistically non significant results.
We used two different commands because the proteins were saved in different directories, depending on the specie they belonged to. The output was saved in a file with the name of the selenoprotein followed by .blast, inside a directory called Humanblast or Horseblast.
After performing TBLASTN we got an output with all the potential matches between the queries and the genome of reference.
In order to do a first screening and selection of the scaffolds, we wrote all the results in an excel page and looked at different parameters as e-value, length and identity percentage (click here to see it). This way, we removed the hits containing a really short length and a bad identity. We also compared the hits that were repeated among families and selected them only in the protein who had the best parameters (click here to see it).
Exonerate is a generic tool for pairwise sequence comparison. It allows you to align sequences using a many alignment models, either exhaustive dynamic programming or a variety of heuristics (10). The aligment model used for the present study is the dynamic programming. This software contains many programs that are necessary to perform a correct alignment, which include fastaindex, fastafetch and fastasubseq.
With the program fastaindex we assigned a position to each region of our reference genome, so that when we executed Fastafetch this tool would be able to read our genome by positions and would select the specific region of the genome where the scaffold is located.
This program was used to obtain the scaffold that is going to be analyzed from the genome of reference.
We only looked for the specific hits we had selected as the high likely candidates, according to the characteristics explained in the section "Obtaining the best scaffold candidates".
The command we used was the following:
The program fastasubseq was used to extract the region of the genome where the scaffold was found. The start point of the region corresponded to the smallest nucleotide position among the selected hits, and the final position coincided with the largest nucleotide position. We extended the region selected 30000 nucleotides on each side, to make sure that the whole gene sequence was extracted.
This program was executed using the following command:
The output was saved in the home directory with the name of the protein followed by ".subseq.fa".
Finally, we compared the genome of reference with the query using exonerate program. (10) This allowed us to obtain the exons of the extracted region. This step would help us to construct the cDNA of each selenoprotein. The "-egrep" command was used to extract the exon coordinates from the exonerate file.
This is the command we used to do so:
The output of the program was saved in a file named after the protein analyzed, and ending with ".exonerate.gff". The gff format describes the locations and attributes of gene and transcript features on the genome.
Once we had a file with all the exons obtained from the exonerate we used the program fastaseqfromGFF in order to concatenate all the exons and get the cDNA of the extraction.
selenoprotein.exonerate.gff > /export/home/u114892/selenoprotein.fastaseq.fa
The output file was called with the name of the selenoprotein followed by "fastaseq.fa".
With the program Fastatranslate we obtained the protein sequence from the cDNA. The command of fastatranslate was the following:
T-COFFEE (Tree-based Consistency Objective Function for alignment evaluation) is a tool that allows the overall alignment of the resulting protein in the fastatranslate with the protein query extracted from the genome of our reference species (Equus caballus and Homo sapiens) (21). We compared proteins to obtain information about the homology between the two sequences, in order to conclude if our alignment could be used for a satisfactory genome annotation.
Before executing T-COFFEE, we changed the "*" characters from fastatranlate files for "U" to be able to see the alignment of selenocysteines, as the program is not capable of working with this characters.
The command used to perform T-COFFEE was the following:
The output was saved in a file named after the protein analyzed, ending with ".tcoffee".
AUTOMATION OF THE COMPUTATIONAL PROCESS
In order to make easier this process, we decided to automatize some of the steps.
First, we made two programs (download1, download2) (one for each species we compared with Odocoileus virginianus texanus) to execute the tblastn automatically for all the queries we had. In these programs we were also converting every "U" into an "X" because tblastn cannot process sequences with this character:
Then, we made a manual selection of the hits (to see all of them obtained with tblastn, click here to see it), to avoid scaffold overlaps among proteins of the same family (to see those we selected click here). We also joined the different hits belonging to the same scaffold in the sense. This way we created big scaffolds englobing all its hits (click here to see the fusion). This selection process and the other steps explained above were realized executing this program (download):
We also made another program (download) to insert manually the selenoprotein and some other data we want to analyze in case we wanted to analyze some particular case individually:
AUTOMATION OF THE WEB CREATION
To link the table elements, we decided to make a program (download) to avoid spending a lot of time linking and checking each link (click here to see the file used in the following program) to see the table :
To convert some files and make them compatible with the web browser, we created the following program (download):