BLASTing Illumina reads in FASTQ format

BLAST uses FASTA format for queries and for database creation, so the BLAST programs cannot read FASTQ files directly. This is because:

FASTQ files typically result from Illumina or Nanopore sequencing. They are usually huge files containing tens to hundreds of millions of reads, many of them from the same subset of the genome or transcriptome, or from a particular amplicon. Such information is highly redundant. When this is the case:

Most of the time if you want to BLAST a FASTQ file, you’re probably not using the best approach

It is likely that you want to first reduce redundancy in your dataset. The most biologically relevant way is often to perform whole genome or transcriptome assembly of your raw reads prior to BLASTing them. Sometimes, simple deduplication or collapsing is sufficient.
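As a rough illustration of what "collapsing" can mean, the sketch below merges identical reads into one FASTA record each and records how often each sequence occurred. It assumes an unwrapped FASTQ file with exactly four lines per read; the function name `collapse_fastq` and the `seqN_countM` header scheme are made up for this example. For real datasets, a maintained deduplication tool is likely the safer choice.

```shell
# Collapse identical reads from a FASTQ file into FASTA records.
# Assumption: every read occupies exactly 4 lines (no wrapped sequences).
collapse_fastq() {
    # Line 2 of each 4-line record is the sequence; count each distinct one.
    awk 'NR % 4 == 2 { count[$0]++ }
         END { for (seq in count) print count[seq] "\t" seq }' "$1" |
    # Most abundant sequences first; ties are ordered by sequence.
    sort -k1,1nr -k2,2 |
    # Emit FASTA with a hypothetical header of the form >seqN_countM.
    awk -F '\t' '{ printf ">seq%d_count%d\n%s\n", NR, $1, $2 }'
}

# Usage: collapse_fastq input.fq > collapsed.fasta
```

The count in each header keeps track of how abundant a sequence was, so that information is not lost when only one copy is used as a BLAST query.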

If you do want to work with the raw reads, BLAST often isn’t the best way to perform analysis.

But what if I really do need to run BLAST on FASTQ files?

While it is often inappropriate to BLAST raw reads, gaining biological insight sometimes does depend on it.

If you are just BLASTing a single read, or a few reads, it’s probably easiest to copy-paste the ACGT sequence into the BLAST search box. You’ll want to:

If you are BLASTing a large number of reads, you’ll want to use a more automated approach to convert the FASTQ format file to a FASTA format file. The following seqtk command is one easy way:

seqtk seq -A input.fq > output.fasta

The resulting FASTA file can then be used as a BLAST query.
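For a command-line search, a minimal sketch of that step with NCBI BLAST+ might look like the following. The wrapper name `blast_reads`, the database name `my_db`, and the chosen thresholds are illustrative assumptions rather than recommendations; the database is assumed to have been built beforehand with `makeblastdb`.

```shell
# Run a nucleotide BLAST search using the converted reads as the query.
# Assumptions: blastn (NCBI BLAST+) is installed, and the second argument
# names an existing nucleotide database, e.g. a placeholder "my_db".
blast_reads() {
    query_fasta=$1
    database=$2
    # -outfmt 6 produces tab-separated output, one line per alignment.
    # The e-value, target limit and thread count are example values only.
    blastn -query "$query_fasta" \
           -db "$database" \
           -outfmt 6 \
           -evalue 1e-5 \
           -max_target_seqs 5 \
           -num_threads 4 \
           -out "${query_fasta%.fasta}.blastn.tsv"
}

# Usage: blast_reads output.fasta my_db
```

With `output.fasta` as the query, the results would be written to `output.blastn.tsv`. Tabular output is usually easier to filter and summarise than the default report when there are many query reads.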

There are many other ways of converting FASTQ to FASTA, but I recommend using a tried and tested tool rather than improvising your own with grep or Perl.
