BLAST
BLAST (Basic Local Alignment Search Tool) is one of the most widely used tools for comparing primary biological sequence information, such as protein or nucleotide sequences. It uses a heuristic algorithm to search for regions of local similarity between a query sequence and the sequences in a database.
The appropriate flavour of BLAST depends on the type of query and the type of database. In our analysis we used tblastn and blastn. tblastn compares a protein query sequence against a nucleotide database that is dynamically translated in all six reading frames. blastn compares a nucleotide query sequence against a nucleotide database. More information about BLAST and its flavours can be found on the official website.
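To illustrate what "translated in the six possible reading frames" means, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is only a conceptual illustration, not how tblastn is implemented internally, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate a DNA sequence in frames +1, +2, +3, -1, -2 and -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Stop codons are shown as '*'.
for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

A protein query can match any of the six translations, which is why tblastn hits may lie on either strand of the genome.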
Once the protein queries were selected, we used tblastn to find genomic regions likely to encode a protein similar to the query. An important parameter in BLAST is the expectation value (E-value) of a given hit, which is the number of hits with an equal or better score that one can “expect” to see purely by chance when searching a database of a particular size. We set BLAST to retrieve only those hits with an E-value lower than 1x10^-5 (roughly a 1 in 100,000 chance of finding such a hit by chance).
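As a rough check on what this threshold means, an E-value can be converted into the probability of seeing at least one such hit by chance with the relation P = 1 - e^(-E), which assumes that chance hits follow a Poisson distribution. The short Python sketch below applies it. The function name is illustrative and not part of BLAST.

```python
import math


def evalue_to_pvalue(evalue):
    """Probability of at least one chance hit, given the expected number of chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-evalue)


# For small E-values the probability is almost identical to the E-value itself.
print(evalue_to_pvalue(1e-5))
# With the default cut-off of 10, a chance hit is almost guaranteed.
print(evalue_to_pvalue(10))
```

For small values the two quantities are nearly equal, which is why a cut-off of 1x10^-5 can be read as about a 1 in 100,000 chance.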
Running BLAST from the UNIX terminal requires, apart from installing the program, first creating a database from the genome. This is done with a tool called formatdb, which converts the sequence file (in FASTA format) into binary files that BLAST can search.
formatdb -i genome.fa -p F -n db
- -i: input file, the genomic sequence in FASTA format
- -p: sequence type; F indicates nucleotide (T indicates protein)
- -n: base name of the output database
More information about this program can be found at the following link
The blastall executable accepts many arguments and options for tuning the search. We used the following:
blastall -p tblastn -i query_file.fa -d genome_file -o query_file.blast -m 9 -e 0.00001
- -p: specifies the type of search (BLAST flavour)
- -i: specifies the input query file
- -d: specifies the target database (the base name given to formatdb with -n)
- -e: E-value cut-off (default 10); only hits with an E-value below the specified value are reported
- -m: specifies the output alignment view; the value 9 selects tabular format with comment lines
- -o: specifies the output file
Other options can also be adjusted, such as the gap opening penalty, the gap extension penalty, the mismatch penalty or the HSP threshold, among others. More information about blastall is available on the following website
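For repeated searches, the two commands can be wrapped in a small script. The following Python sketch only assembles the same formatdb and blastall invocations shown in this section and runs them with subprocess. The function names are hypothetical, and it assumes both legacy executables are on the PATH.

```python
import shutil
import subprocess
from typing import List


def formatdb_cmd(genome_fasta: str, db_name: str) -> List[str]:
    """Build the formatdb call for a nucleotide (-p F) database."""
    return ["formatdb", "-i", genome_fasta, "-p", "F", "-n", db_name]


def tblastn_cmd(query_fasta: str, db_name: str, out_path: str,
                evalue: str = "0.00001") -> List[str]:
    """Build the blastall call: tblastn search, tabular output with comments (-m 9)."""
    return ["blastall", "-p", "tblastn", "-i", query_fasta, "-d", db_name,
            "-o", out_path, "-m", "9", "-e", evalue]


def run_search(genome_fasta: str, query_fasta: str, db_name: str, out_path: str) -> None:
    """Format the genome database, then search it with the protein queries."""
    for tool in ("formatdb", "blastall"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} was not found on the PATH")
    # check=True raises CalledProcessError if either program exits with an error.
    subprocess.run(formatdb_cmd(genome_fasta, db_name), check=True)
    subprocess.run(tblastn_cmd(query_fasta, db_name, out_path), check=True)
```

A call such as run_search("genome.fa", "query_file.fa", "db", "query_file.blast") would then reproduce the two steps. Note that the same database name is passed to formatdb (-n) and to blastall (-d).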
The BLAST search writes an output file that contains the following information for each hit found:
- Query id: The query sequence id (e.g. SPP00000081_2.0)
- Subject id (scaffold): The matching subject sequence id (e.g. gi|511782574|gb|CM001960.1|)
- % identity: percentage of identity between a subsequence of the query and the corresponding subsequence found in the database (e.g. 94.61%)
- Alignment length: the length of the alignment between a subsequence of the query and the corresponding subsequence found in the database (e.g. 167)
- Mismatches: number of mismatches of the alignment (e.g. 7)
- Gap openings: number of gap openings in the alignment (e.g. 1)
- q.start: start position of the aligned subsequence in the query (e.g. 23)
- q.end: end position of the aligned subsequence in the query (e.g. 187)
- s.start: start position of the matching subsequence in the target genome (e.g. 32693020)
- s.end: end position of the matching subsequence in the target genome (e.g. 32693520)
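- e-value: expectation value of the given hit (e.g. 2e-57)
- bit score: normalised alignment score of the hit, reported as the final column of the tabular output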
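The tabular output can be read programmatically. The sketch below is a minimal Python parser, assuming the usual twelve tab-separated columns of the blastall tabular format (the fields described above, with the bit score as the last column) and comment lines starting with "#". The class and function names are illustrative.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    pct_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per data line, skipping blank lines and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) < 12:
            raise ValueError(f"expected 12 columns, found {len(f)}: {line!r}")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
                  int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                  float(f[10]), float(f[11]))


def filter_by_evalue(hits: Iterable[Hit], max_evalue: float = 1e-5) -> List[Hit]:
    """Keep only hits whose E-value is below the cut-off."""
    return [hit for hit in hits if hit.evalue < max_evalue]


def load_hits(path: str) -> List[Hit]:
    """Read every hit from a BLAST tabular output file."""
    with open(path) as handle:
        return list(parse_tabular(handle))
```

For example, filter_by_evalue(load_hits("query_file.blast")) would return the hits from the search above that pass the 1x10^-5 cut-off, which is redundant when -e was already set but useful for applying a stricter threshold afterwards.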