BLASTing Illumina reads in FASTQ format
BLAST uses FASTA format for queries and for database creation, so it does not directly understand FASTQ format. This is because:
- BLAST was created long before the FASTQ format was created,
- and because FASTQ files are typically inappropriate for BLAST analysis.
FASTQ files typically result from Illumina or Nanopore sequencing. They are typically huge files containing tens to hundreds of millions of reads, with many coming from the same subset of the genome or transcriptome, or from a particular amplicon. Such information is highly redundant. When this is the case:
- BLAST analysis will be slow because the algorithm needs to search through a much larger set of sequences than if redundancy had been removed.
- If results are found, they are likely to include a lot of redundancy (many similar reads obtaining high scores). This makes interpretation difficult.
- Having a particularly large set of redundant sequences to search through also reduces BLAST’s ability to identify sequence similarities. This is because BLAST’s detection power depends on the size of the query and database. Indeed, the E-value of a particular alignment is higher (that is, less significant) if your database or query is larger (E-value is grossly equivalent to “the number of times I would find this match by chance if the database were made up of random nucleotides” - but see here for a detailed explanation of E-value).
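To make the last point concrete, here is a minimal sketch of the standard bit-score relationship E = m × n × 2^(-S'), where m is the query length, n is the total database length and S' is the bit score. The function name and the example numbers are made up for illustration, and real BLAST uses length-corrected "effective" values of m and n, so treat the output as an approximation.

```python
def expected_chance_hits(query_len: int, db_len: int, bit_score: float) -> float:
    """Approximate E-value: E = m * n * 2**(-S').

    m = query length, n = total database length, S' = bit score.
    BLAST itself uses length-corrected (effective) m and n, so this
    is only an approximation for illustration.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative numbers only: one 150 bp read, one alignment with bit score 50.
query_len = 150
bit_score = 50.0
small_db = 10_000_000      # 10 Mb of sequence, e.g. after removing redundancy
large_db = 1_000_000_000   # 1 Gb of sequence, e.g. raw redundant reads

e_small = expected_chance_hits(query_len, small_db, bit_score)
e_large = expected_chance_hits(query_len, large_db, bit_score)

print(f"E-value against the small database: {e_small:.2e}")
print(f"E-value against the large database: {e_large:.2e}")
```

The same alignment, with the same bit score, gets an E-value 100 times higher against the database that is 100 times larger.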
Most of the time if you want to BLAST a FASTQ file, you’re probably not using the best approach
It is likely that you want to first reduce redundancy in your dataset. The most biologically relevant way is often to perform whole genome or transcriptome assembly of your raw reads prior to BLASTing them. Sometimes, simple deduplication or collapsing is sufficient.
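As a rough illustration of what "collapsing" means, the sketch below counts exact duplicate reads and writes each unique sequence once, with its count in the header. It assumes strict four-line FASTQ records and only merges reads that are identical letter for letter (no reverse complements, no tolerance for sequencing errors). The function names and the `uniqN;size=N` header style are choices made for this example, and for real datasets a dedicated, well-tested tool is the safer option.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # line 2 of every 4-line record holds the bases
            yield line.strip().upper()


def collapse_reads(lines: Iterable[str]) -> Counter:
    """Count how many times each exact sequence occurs."""
    return Counter(read_fastq_sequences(lines))


def to_fasta(counts: Counter) -> str:
    """Write one FASTA record per unique sequence, most abundant first."""
    records = []
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        records.append(f">uniq{rank};size={n}\n{seq}")
    return "\n".join(records) + "\n"


# Tiny made-up example: three reads, two of them identical.
example_fastq = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTACGT", "+", "IIIIIIII",
    "@read3", "TTGGCCAA", "+", "IIIIIIII",
]

counts = collapse_reads(example_fastq)
collapsed_fasta = to_fasta(counts)
print(collapsed_fasta, end="")
```

With real data you would pass an open file handle instead of the example list.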
If you do want to work with the raw reads, BLAST often isn’t the best way to perform analysis.
But what if I really do need to run BLAST on FASTQ files?
While it is often inappropriate to BLAST raw reads, gaining biological insight sometimes does depend on it.
If you are just BLASTing a single read, or a few reads, it’s probably easiest to just copy-paste the ACGT sequence into the BLAST search box. You’ll want to:
- replace the @ at the beginning of the sequence identifier line with >
- remove the + line
- remove the quality scores (i.e., the letters, numbers and symbols that aren’t ACGTs).
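The manual edits above can be sketched in a few lines of Python for a single record. This is only meant to show what the conversion does. It assumes the record has exactly four lines (identifier, sequence, + separator, qualities), and the function name and example read are invented for the illustration.

```python
def fastq_record_to_fasta(record: str) -> str:
    """Convert one 4-line FASTQ record to a FASTA record."""
    lines = record.strip().splitlines()
    if len(lines) != 4:
        raise ValueError("expected exactly 4 lines in a FASTQ record")

    header, sequence, separator, _quality = lines
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("this does not look like a FASTQ record")

    # Swap the leading @ for >, keep the bases, and drop the
    # + line and the quality line.
    return ">" + header[1:] + "\n" + sequence + "\n"


# Made-up example read.
example_record = """@read1 sample description
ACGTTGCA
+
IIIIHHHH
"""

fasta_text = fastq_record_to_fasta(example_record)
print(fasta_text, end="")
```

The printed text is what you would paste into the BLAST search box.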
If you are BLASTing a large number of reads, you’ll want to use a more automated approach to convert the FASTQ format file to a FASTA format file. The following seqtk command is one easy way:
seqtk seq -A input.fq > output.fasta
The resulting FASTA file can then be used as a BLAST query.
There are many other ways of converting FASTQ to FASTA - I recommend using a tried and tested tool rather than creating your own thing by creatively using grep or perl.