BLAST E-values: how they are calculated and what they mean

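A crucial measure that accompanies every hit sequence that BLAST identifies is the E-value (Expectation value). Here, we’ll walk through what an E-value is, how to interpret it, how it is calculated, and how to set up your BLAST analysis to get the most power.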

Figure: SequenceServer BLAST result report. E-values are shown in two places: in the table of all hits, and as part of the alignment of each hit.

What is an E-value?

The BLAST E-value is not a probability. Instead, it is an estimate of the expected number of random alignments with a particular score or better that could be found by chance in a given database search. In other words, it indicates how easily an alignment this good could arise by chance rather than from a true biological relationship between the sequences: the closer the E-value is to zero, the less plausible chance becomes as an explanation.
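
To make the difference between an expected count and a probability concrete, here is a minimal Python sketch. It assumes the usual Poisson approximation from Karlin-Altschul statistics, under which the probability of seeing at least one chance alignment with that score or better is P = 1 − e^(−E). The function name is ours, chosen for illustration.

```python
import math

def evalue_to_pvalue(evalue: float) -> float:
    """Probability of at least one chance hit scoring this well or better.

    Assumes the number of chance hits is Poisson-distributed with mean `evalue`.
    """
    # -expm1(-x) computes 1 - exp(-x) accurately even when x is tiny
    return -math.expm1(-evalue)

for e in (1e-50, 0.01, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.3g}")
```

For small E-values the two numbers are practically identical, which is why they are often confused. An E-value of 10, however, is a perfectly valid expected count but could never be a probability.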

Interpreting E-values

E-values should be interpreted in the context of the specific research question and alongside other factors such as alignment length and sequence identity. In general, the lower the E-value, the less likely it is that the alignment arose by chance.

E-values are not fixed thresholds for determining the significance of an alignment. Always consider the biological context.
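
As a rough illustration of looking at E-values together with alignment length and identity, the sketch below parses BLAST’s default 12-column tabular output (-outfmt 6) and keeps only hits that pass all three checks. The cutoff values and the three example rows are made up for illustration; they are not recommendations, and sensible values depend on your question.

```python
from typing import Iterable, Iterator, NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float

def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse default tabular BLAST output (-outfmt 6, 12 tab-separated columns)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))

def keep_hit(hit: Hit, max_evalue: float = 1e-5,
             min_length: int = 50, min_identity: float = 30.0) -> bool:
    """Return True only if the hit passes all three (illustrative) cutoffs."""
    return (hit.evalue <= max_evalue
            and hit.alignment_length >= min_length
            and hit.percent_identity >= min_identity)

# Invented example rows, not real BLAST results.
example_rows = [
    "q1\tsA\t98.5\t320\t4\t0\t1\t320\t10\t329\t1e-150\t580",
    "q1\tsB\t100.0\t25\t0\t0\t5\t29\t40\t64\t1e-06\t46.1",
    "q1\tsC\t27.1\t210\t140\t6\t3\t205\t12\t220\t0.8\t30.1",
]

kept = [hit.subject for hit in parse_tabular(example_rows) if keep_hit(hit)]
print(kept)
```

Here the second row has a low E-value but covers only 25 positions, so it is set aside for a closer look rather than accepted on its E-value alone.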

In many cases, BLAST analysis is just a first step. In particular, a stronger E-value does not necessarily imply a stronger evolutionary relationship.

Interpreting it like that is a common mistake! To understand relationships across sequences, you should typically also perform multiple sequence alignment followed by phylogenetic reconstruction. Additional evidence also helps (e.g., understanding sequence conservation and domain architecture).

How is the BLAST E-value calculated?

The E-value is calculated based on the alignment score (S), the search space size (m × n), and the parameters derived from the scoring system and the database composition, such as the Karlin-Altschul parameters (K and λ). The formula for E-value is:

E-value = K × m × n × e^(−λS)

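Where:

  • K and λ are the Karlin-Altschul parameters, which depend on the scoring system and on sequence composition
  • m is the length of the query sequence
  • n is the total length of all sequences in the database
  • S is the alignment score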

The E-value thus depends on the database size. Larger databases have more chances of producing the alignment you see by chance… so E-values for the same amount of similarity end up being weaker (higher).
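
The effect of database size can be checked directly from the formula. The Python sketch below is a simplification: it uses the raw lengths m and n, whereas BLAST itself applies corrections to obtain effective lengths. The K and λ values are assumed, typical of what BLAST reports for gapped protein searches with BLOSUM62, and the score, query length, and database sizes are hypothetical.

```python
import math

def blast_evalue(score: float, query_len: int, db_len: int,
                 k: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S), using raw (uncorrected) lengths."""
    return k * query_len * db_len * math.exp(-lam * score)

# Assumed Karlin-Altschul parameters (illustrative values).
K, LAMBDA = 0.041, 0.267

raw_score = 100            # hypothetical raw alignment score S
query_len = 300            # hypothetical query length m
small_db = 10_000_000      # hypothetical database of 1e7 residues
large_db = 10_000_000_000  # hypothetical database of 1e10 residues

e_small = blast_evalue(raw_score, query_len, small_db, K, LAMBDA)
e_large = blast_evalue(raw_score, query_len, large_db, K, LAMBDA)

print(f"small database: E = {e_small:.3g}")
print(f"large database: E = {e_large:.3g}")
print(f"ratio: {e_large / e_small:.0f}x")
```

Because n enters the formula as a simple multiplier, the same alignment gets an E-value 1000 times higher in a database 1000 times larger, even though the alignment itself has not changed.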

So how should I tweak my BLAST analysis to get the most power?

  1. Use the appropriate database. If you’re looking for a particular gene in humans… only BLAST against the human genome… not against a database that is orders of magnitude greater. Doing so would make it less likely for you to get strong E-values, even if the gene is present in the human genome. And the BLAST analysis would also take much longer.
  2. Use the appropriate BLAST algorithm for your biological question and evolutionary distance. Consider that nucleotides diverge faster than protein sequences. So:
    • if you’re comparing highly similar sequences (e.g., to help identify intron-exon boundaries, or allelic differences), use BLASTN.
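    • if you’re identifying orthologs across species, use BLASTP. To check whether a gene is truly absent from a species (rather than just missing from its annotated proteins), use TBLASTN, which searches your protein query against the translated nucleotide sequences.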
  3. Use an appropriate scoring matrix. BLOSUM62 is used by default, but for more divergent sequences (longer evolutionary timescales), PAM250 is more appropriate.
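
The three choices above (database, program, scoring matrix) map onto command-line options of the NCBI BLAST+ programs. The Python sketch below only assembles the command so you can inspect it before running it. The helper function, file name, and database names are hypothetical. Depending on your BLAST+ version, you may also need to set gap costs that are compatible with the chosen matrix.

```python
import shlex
from typing import List, Optional

def build_blast_command(program: str, query: str, db: str,
                        evalue: float = 1e-5,
                        matrix: Optional[str] = None,
                        task: Optional[str] = None) -> List[str]:
    """Assemble a BLAST+ command line as a list of arguments (does not run it)."""
    if program not in {"blastn", "blastp", "tblastn"}:
        raise ValueError(f"unsupported program: {program}")
    cmd = [program, "-query", query, "-db", db,
           "-evalue", str(evalue), "-outfmt", "6"]
    if matrix is not None:
        # Substitution matrices only apply to protein-level comparisons.
        if program == "blastn":
            raise ValueError("blastn does not take a protein scoring matrix")
        cmd += ["-matrix", matrix]
    if task is not None:
        cmd += ["-task", task]
    return cmd

# Highly similar nucleotide sequences, searched against one genome only.
print(shlex.join(build_blast_command(
    "blastn", "my_gene.fasta", "human_genome", task="megablast")))

# Distant protein homologs, using a matrix suited to long timescales.
print(shlex.join(build_blast_command(
    "blastp", "my_protein.fasta", "other_species_proteins", matrix="PAM250")))

# Protein query against a genome, to look for unannotated genes.
print(shlex.join(build_blast_command(
    "tblastn", "my_protein.fasta", "other_species_genome")))
```

To actually run one of these, the returned list could be passed to subprocess.run, provided BLAST+ is installed and the databases exist.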

Aren’t these kinds of adjustments “E-value hacking”? No. If done appropriately it’s just using the right tool for the job.
