Taxonomic filtering to improve your BLAST searches
What is taxonomy, and why BLAST with taxonomic restriction?
Taxonomy is the science of classifying and naming organisms. It aims to organize the vast diversity of life on Earth into a system that reflects evolutionary relationships. A taxon is a related group of organisms. The plural of taxon is taxa. Examples of taxa include:
- “Animalia”: all animals (a taxonomic Kingdom)
- “Mammalia”: all mammals (a taxonomic Class)
- “Felidae”: all cats (a taxonomic Family)
- “Panthera leo”: Lions (a species)
We can use such classification to restrict BLAST to subsets of large databases. Restricting BLAST searches to species or groups of taxa can make it easier to interpret results.
For BLAST, taxonomic filtering requires taxonomic identifiers, or “taxids”. These taxids can be found in the NCBI Taxonomy database. The BLAST database must also contain taxid information. This is the case for very large databases, including NCBI nt/nr and UniProtKB/Swiss-Prot, but only for some smaller databases.
Advantages of taxonomic restriction
Restricting the BLAST search to specific taxonomic groups has some advantages, which include:
- Focused results: By including only relevant taxa you can more easily make sense of your results.
- Smaller output files: Having only relevant hits also reduces the size of output files.
- No custom post-BLAST filtering: If you take results from all species and try to subsequently filter things down, you’re using custom code that may be less reliable than BLAST’s built-in mechanism.
- Increased within-taxon sensitivity: BLAST reports only a limited number of the strongest hits. When restricting results to fewer taxa, hits from other taxa no longer take up places in that list, so you are more likely to find all of the true results for the taxa you care about.
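The last two points can be illustrated with a toy model. This is only a sketch: the hit list, the scores, and the `max_hits` limit are made-up values for illustration, and real BLAST reporting is more involved. It compares filtering after the reporting limit has been applied with restricting before it.

```python
# Toy model: each hit is (taxon, bit_score). All values are invented for illustration.
hits = [
    ("Apis", 900), ("Bombus", 880), ("Camponotus", 860),
    ("Vespa", 850), ("Camponotus", 700), ("Camponotus", 650),
]

def top_hits(hits, max_hits):
    """Keep the `max_hits` highest-scoring hits (a stand-in for BLAST's reporting limit)."""
    return sorted(hits, key=lambda h: h[1], reverse=True)[:max_hits]

def post_filter(hits, taxon, max_hits):
    """Unrestricted search: apply the reporting limit first, then filter by taxon."""
    return [h for h in top_hits(hits, max_hits) if h[0] == taxon]

def restricted(hits, taxon, max_hits):
    """Taxonomic restriction: keep only the taxon, then apply the reporting limit."""
    return top_hits([h for h in hits if h[0] == taxon], max_hits)

filtered_afterwards = post_filter(hits, "Camponotus", max_hits=4)
restricted_first = restricted(hits, "Camponotus", max_hits=4)
print(len(filtered_afterwards), len(restricted_first))  # 1 3
```

In this toy example, filtering afterwards recovers one Camponotus hit, while restricting first recovers all three.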
Taxonomic restriction doesn’t really speed up BLAST
Unfortunately, taxonomic filtering with BLAST does not directly reduce the initial search space. BLAST does not focus on the targeted subset from the start: it first searches the entire database for hits and then applies the taxonomic restriction. Therefore, a taxonomically restricted BLAST search takes a similar amount of computational effort and time as an unrestricted one.
Using different taxonomy levels to restrict BLAST searches
We can limit BLAST searches to specific taxonomic groups at any taxonomic level, such as order, genus, or species. A higher-level taxon includes all of the lower-level taxa within it. For example, if we limit the search to the order Diptera, it will include the genus Drosophila and the species Drosophila melanogaster, since both are within Diptera. Depending on how broad we want the search to be, we could restrict it to:
- The species Drosophila melanogaster, with taxid 7227. This taxid is circled in red in the image below.
- The genus Drosophila, with taxid 7215
- The order Diptera, with taxid 7147
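To make the nesting concrete, here is a minimal sketch of how a higher-level taxid covers everything beneath it. The `PARENT` map and the `is_within` function are hypothetical names for illustration, and the lineage is simplified: real NCBI lineages contain intermediate ranks between these three taxa.

```python
# Simplified child -> parent map using the three taxids from the list above.
# Intermediate ranks in the real NCBI lineage are deliberately omitted.
PARENT = {
    7227: 7215,  # Drosophila melanogaster -> Drosophila
    7215: 7147,  # Drosophila -> Diptera
}

def is_within(taxid, ancestor, parent=PARENT):
    """Return True if `taxid` is `ancestor` itself or one of its descendants."""
    while taxid is not None:
        if taxid == ancestor:
            return True
        taxid = parent.get(taxid)  # walk one level up; None once we run out of parents
    return False

print(is_within(7227, 7147))  # True: the species falls within the order Diptera
print(is_within(7147, 7215))  # False: the order is not within the genus
```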
Example: Finding vitellogenin genes in particular ant genera
Vitellogenin is a precursor of egg yolk protein. We identified a predicted vitellogenin gene in the ant Camponotus floridanus. For our research question, we want to find other possible vitellogenin genes in the ant genera Camponotus, Formica, and Solenopsis. We can search the NCBI nt database using BLAST with taxonomic restriction, so that the returned hits are limited to just these three genera.
For instance, you can use the NCBI taxonomy identifiers for these genera:
- Camponotus (taxid: 13390)
- Formica (taxid: 72766)
- Solenopsis (taxid: 13686)
Once we have our taxids, we enter them in SequenceServer’s “Advanced Parameters” field before starting the BLAST search. The taxids go after the -taxids option, with multiple IDs separated by commas. We can also exclude taxa with the -negative_taxids option, but this cannot be used at the same time as -taxids.
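As a rough sketch of the rules just described, the helper below assembles the text for the advanced parameters field and refuses to combine the two options. The function name `taxid_parameters` is hypothetical and is not part of SequenceServer or BLAST. The taxids are the ones listed above.

```python
def taxid_parameters(include=None, exclude=None):
    """Build a '-taxids' or '-negative_taxids' option string from lists of taxids."""
    if include and exclude:
        # The two options cannot be used in the same search.
        raise ValueError("-taxids and -negative_taxids cannot be combined")
    if include:
        return "-taxids " + ",".join(str(t) for t in include)
    if exclude:
        return "-negative_taxids " + ",".join(str(t) for t in exclude)
    return ""  # no taxonomic restriction requested

# Camponotus, Formica, and Solenopsis, as listed in the text
print(taxid_parameters(include=[13390, 72766, 13686]))
# -taxids 13390,72766,13686
```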
We can use other BLAST parameters at the same time, such as an E-value cutoff. The following advanced parameters retain only the strongest hits for our three focal ant genera.
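For readers who run BLAST+ outside SequenceServer, a comparable search might be assembled as below. This is a sketch under assumptions: the query file name and the E-value cutoff of 1e-50 are hypothetical values chosen for illustration, not the settings used in this example, and the command only works with a BLAST+ installation and database that support taxid filtering. The script only prints the command; it does not run it.

```python
import shlex

taxids = [13390, 72766, 13686]  # Camponotus, Formica, Solenopsis (from the text)

command = [
    "blastn",
    "-db", "nt",
    "-query", "vitellogenin_query.fasta",  # hypothetical file name
    "-taxids", ",".join(map(str, taxids)),
    "-evalue", "1e-50",                    # hypothetical cutoff for strong hits only
]

# Print the command for review; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```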
The top of SequenceServer’s BLAST report also indicates which parameters were used. Below, the taxid restriction is circled in red.
Thanks to the taxid restriction, the above report focuses exclusively on the taxa we are interested in. Without this restriction, we would also have obtained many other hits from related genera (including Polyrhachis, Cataglyphis, Nylanderia, and Cardiocondyla).
NOTE: The E-values for the same subject hits differ with and without taxonomic restriction. They are lower (more significant) in the taxonomically restricted BLAST. This highlights how taxonomic restriction can be used to gain further confidence by limiting searches to only the relevant taxa.
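One plausible reading of this note, which the text itself does not state, is that the E-values are computed against the size of the restricted subset rather than the whole database. The sketch below uses the standard approximation E = m × n × 2^(−S′), where m is the query length, n is the database length, and S′ is the bit score. The lengths and score are invented round numbers, and real BLAST uses corrected "effective" lengths.

```python
def expected_value(bit_score, query_length, database_length):
    """Approximate BLAST E-value: E = m * n * 2^(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Same hit (same bit score and query), scored against two database sizes.
# All numbers are illustrative assumptions, not measurements.
full_database = expected_value(100, 1_000, 10**12)
taxon_subset = expected_value(100, 1_000, 10**9)

# The E-value shrinks in proportion to the size of the searched subset.
print(full_database / taxon_subset)
```

Under this assumption, a hit with an unchanged bit score gets an E-value that is smaller by the same factor as the reduction in database size.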
Taxonomically restricted BLASTs with SequenceServer
SequenceServer makes BLAST with taxonomic restriction straightforward. Your taxonomically restricted results are also saved to your BLAST History, allowing you to keep track of the taxids you have used. Why not have a go at taxonomically restricted BLAST searches with a free trial of SequenceServer?
Happy BLASTing!