BLAST E-values: how they are calculated and what they mean
A crucial measure that accompanies every hit that BLAST reports is the E-value, short for Expectation value (also written E value, e-value, or evalue). Here, we’ll walk through:
- What is an E-value?
- How is it calculated?
- How should you interpret it?
- How can you get the most power and sensitivity from your BLAST analysis?
Figure: A SequenceServer BLAST result report. E-values are shown in two places: in the table of all hits, and as part of the alignment of each hit.
What is an E-value?
The BLAST E-value is:
- Not a p-value.
- Not the exact number of times a sequence was found due to chance.
Instead, it is an estimate of the number of alignments with a particular score or better that one would expect to find by chance in a given database search. In other words, it is an expected count rather than a probability: it can be greater than 1, and it depends on the size of the search.
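To make the difference between an E-value and a p-value concrete, here is a minimal Python sketch. It assumes the standard Karlin-Altschul model, in which the number of chance alignments scoring at least S follows a Poisson distribution with mean E, so the probability of seeing at least one such alignment is 1 − e^(−E). The function name `evalue_to_pvalue` is ours, for illustration only.

```python
import math


def evalue_to_pvalue(evalue: float) -> float:
    """Probability of at least one chance alignment with this score or better.

    Assumes the count of chance alignments is Poisson distributed with
    mean equal to the E-value, so P = 1 - exp(-E).
    """
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-evalue)


for e in (1e-50, 1e-5, 0.01, 1.0, 10.0):
    print(f"E = {e:g}  ->  P(at least one chance hit) = {evalue_to_pvalue(e):.3g}")
```

For very small E-values the two numbers are almost identical, which is why they are often confused. They diverge as E approaches and exceeds 1, where the E-value keeps growing while the probability can never go above 1.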
Interpreting E-values
The E-value describes the number of hits we expect to see by chance when BLASTing a database. It helps us understand whether our hits are relatively unique or not. For example, an E-value of 1 means that one expects to see 1 match with a similar or better score by chance alone. E-values need to be interpreted with care, in the context of the specific research question and datasets, and alongside other factors such as alignment length and sequence identity. However, in general:
- Lower (i.e., stronger) E-values indicate more “significant” alignments, suggesting a higher probability that the sequences share a common evolutionary origin.
- Higher (i.e., weaker) E-values indicate that the alignment might be a random event.
There is no fixed E-value threshold that determines whether an alignment is significant. Always consider the biological context and the datasets used.
In many cases, BLAST analysis is just a first step. Importantly, a stronger E-value does not necessarily imply a stronger evolutionary relationship. Interpreting it like that is a common mistake! To understand relationships across sequences, you should typically also perform multiple sequence alignment followed by phylogenetic reconstruction. Additional evidence also helps (e.g., understanding sequence conservation and domain architecture).
How is the BLAST E-value calculated?
The E-value is calculated from the alignment score (S), the search space size (m × n), and two statistical parameters that depend on the scoring system and the residue composition: the Karlin-Altschul parameters (K and λ). The formula for the E-value is:
E-value = K × m × n × e^(−λS)
Where:
- m is the length of the query sequence.
- n is the length of the database (i.e., the sum of all the lengths of all the sequences in the database).
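- K and λ are the Karlin-Altschul parameters. They depend on the scoring system (the substitution matrix and gap penalties) and on the background residue frequencies, and for gapped alignments they are estimated from large sets of random sequence alignments. The λ parameter normalizes the alignment score for the scoring system, while the K parameter is a constant that scales for the size of the search space.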
- S is the alignment score. It is calculated based on the selected scoring matrix and the given sequence alignment. The score reflects the sum of substitution and gap scores for the aligned residues.
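The formula above can be sketched in a few lines of Python. The values of K and λ used here are illustrative assumptions (roughly those commonly reported for gapped protein searches with BLOSUM62), and the query and database lengths are made up. Real BLAST also applies corrections such as effective sequence lengths, so these numbers will not exactly match a real report.

```python
import math

# Illustrative assumptions, not values taken from any particular BLAST run.
K = 0.041       # Karlin-Altschul K (assumed)
LAMBDA = 0.267  # Karlin-Altschul lambda (assumed)


def blast_evalue(score: float, query_len: int, db_len: int,
                 k: float = K, lam: float = LAMBDA) -> float:
    """E = K * m * n * exp(-lambda * S), using the raw alignment score S."""
    return k * query_len * db_len * math.exp(-lam * score)


m = 300         # hypothetical query length (residues)
n = 50_000_000  # hypothetical total database length (residues)

for s in (50, 100, 200):
    print(f"raw score {s:>3}  ->  E-value {blast_evalue(s, m, n):.3g}")
```

Because the score sits in the exponent, modest increases in score produce E-values that are smaller by many orders of magnitude.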
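The E-value thus depends on the database size. Larger databases offer more chances of producing an alignment like yours by chance, so E-values for the same amount of similarity end up being weaker (higher). Because the E-value grows in proportion to n, a database ten times larger gives a tenfold higher E-value for the same score.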
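Another way to see the effect of database size is to rearrange the formula and ask what raw score is needed to reach a given E-value: S = ln(K × m × n / E) / λ. The sketch below does this for three hypothetical database sizes. As before, the K and λ defaults and all lengths are assumptions chosen for illustration.

```python
import math


def score_needed(target_evalue: float, query_len: int, db_len: int,
                 k: float = 0.041, lam: float = 0.267) -> float:
    """Raw score S at which K * m * n * exp(-lambda * S) equals the target E-value.

    The default k and lam are illustrative assumptions.
    """
    return math.log(k * query_len * db_len / target_evalue) / lam


m = 300  # hypothetical query length (residues)
for label, n in [("single proteome", 10_000_000),
                 ("100x larger database", 1_000_000_000),
                 ("10,000x larger database", 100_000_000_000)]:
    s = score_needed(1e-5, m, n)
    print(f"{label:<24} n = {n:.0e}  ->  raw score needed for E = 1e-5: {s:.1f}")
```

Each hundredfold increase in database size raises the required score by the same fixed amount, ln(100) / λ, so the same alignment looks weaker the larger the database it was found in.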
So how should I tweak my BLAST analysis to get the most power?
- Use the appropriate database. If you’re looking for a particular gene in humans, BLAST only against the human genome, not against a database that is orders of magnitude larger. Searching the larger database would make it less likely for you to get strong E-values, even if the gene is present in the human genome. And the BLAST analysis would also take much longer.
- Use the appropriate BLAST algorithm for your biological question and evolutionary distance. Consider that nucleotides diverge faster than protein sequences. So:
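  - If you’re comparing highly similar sequences (e.g., to help identify intron-exon boundaries, or allelic differences), use BLASTN.
  - If you’re identifying orthologs across species, use BLASTP. To check whether a gene is really absent from a species, rather than just missing from its annotated proteins, use TBLASTN, which compares your protein query against the translated nucleotide sequence.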
- Use an appropriate scoring matrix. BLOSUM62 is used by default. But for longer evolutionary timescales, a matrix designed for more divergent sequences, such as PAM250 or BLOSUM45, is more appropriate.
- Investigate different E-value thresholds to see the impact on the resulting hits.
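As a rough illustration of the last point, the sketch below counts how many hits survive at several E-value thresholds. It assumes BLAST's tabular output with the default twelve columns (`-outfmt 6`), where the E-value is the eleventh column. The hit lines are invented example data, and the function names are ours.

```python
import csv
import io

# Invented example data in tabular format (default -outfmt 6 columns):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
EXAMPLE = """\
query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380
query1\tsubjB\t60.0\t180\t72\t2\t5\t184\t10\t189\t2e-20\t95.1
query1\tsubjC\t35.2\t120\t78\t4\t30\t149\t12\t131\t5e-04\t40.2
query1\tsubjD\t28.0\t50\t36\t1\t100\t149\t3\t52\t3.2\t27.3
"""


def read_evalues(handle):
    """Yield (subject_id, evalue) pairs from tabular BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield row[1], float(row[10])


def hits_below(hits, threshold):
    """Return subject IDs whose E-value is at or below the threshold."""
    return [subject for subject, evalue in hits if evalue <= threshold]


hits = list(read_evalues(io.StringIO(EXAMPLE)))
for threshold in (1e-50, 1e-10, 1e-3, 10):
    kept = hits_below(hits, threshold)
    print(f"E <= {threshold:g}: {len(kept)} hit(s): {', '.join(kept)}")
```

To use this on real results, replace `io.StringIO(EXAMPLE)` with an open file handle to your own tabular BLAST output.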
Aren’t these kinds of adjustments “E-value hacking”?
No. If done appropriately, it’s just using the right tool for the job. In fact, we need to consider all of the above to make sure the E-value is useful for our biological questions.