BLASTing Illumina reads in FASTQ format

Automatic detection and conversion of FASTQ to FASTA input data

While it is often inappropriate to BLAST raw reads in FASTQ format, gaining biological insight sometimes depends on it.

Sequences typically come in FASTA or FASTQ format, or in compressed versions of either (e.g., with an additional .gz or .bz2 extension).

SequenceServer automatically detects FASTQ input and converts it to FASTA format. Just paste the FASTQ reads into the search box, and SequenceServer will instantly convert them to FASTA for BLASTing.
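To illustrate what such a conversion involves, here is a minimal Python sketch. It is not SequenceServer's actual implementation; the function names `looks_like_fastq` and `fastq_to_fasta` are hypothetical, and the code assumes the common four-line FASTQ layout (header, sequence, `+` separator, quality) with no line-wrapped sequences.

```python
def looks_like_fastq(text):
    """Heuristic: first non-empty line starts with '@' and the third with '+'."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines) >= 4 and lines[0].startswith("@") and lines[2].startswith("+")


def fastq_to_fasta(text):
    """Convert four-line FASTQ records to FASTA, discarding the quality strings."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        record_number = i // 4 + 1
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {record_number}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in record {record_number}")
        out.append(">" + header[1:])  # '@read1' becomes '>read1'
        out.append(seq)               # quality line is dropped
    return "\n".join(out) + "\n"


# Two made-up reads, for illustration only.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
if looks_like_fastq(example):
    print(fastq_to_fasta(example), end="")
```

The only information lost in the conversion is the per-base quality string, which BLAST does not use.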

Using FASTQ/FASTA files with BLAST

BLAST uses FASTA format for queries and for database creation, so it does not directly understand FASTQ format. This is because:

BLAST was created long before the FASTQ format was created,
and because FASTQ files are typically inappropriate for BLAST analysis.

FASTQ files typically result from Illumina or Nanopore sequencing. They are usually huge files containing tens to hundreds of millions of reads, with many coming from the same subset of the genome or transcriptome, or from a particular amplicon. Such information is highly redundant. When this is the case:

BLAST analysis will be slow because the algorithm needs to search through a much larger set of sequences than if redundancy had been removed.
If results are found, they are likely to include a lot of redundancy (many similar reads obtaining high scores). This makes interpretation difficult.
Having a particularly large set of redundant sequences to search through also reduces BLAST’s ability to identify sequence similarities. This is because BLAST’s detection power depends on the size of the query and database. Indeed, the E-value of a particular alignment is higher (that is, less significant) if your database or query is larger (E-value is roughly equivalent to “the number of times I would find this match by chance if the database were made up of random nucleotides” - but see here for a detailed explanation of E-value).
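To illustrate how database size affects the E-value, here is a small Python sketch based on the standard approximation E = m × n × 2^(−S′), where m is the query length, n is the total database length and S′ is the bit score. The function name `expected_hits` and the numbers used are illustrative assumptions, and the sketch ignores the effective-length corrections that BLAST applies in practice.

```python
def expected_hits(query_length, database_length, bit_score):
    """Approximate E-value: E = m * n * 2**(-S'), ignoring edge-effect corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same 150 bp query and same bit score, searched against two database sizes
# (hypothetical values chosen only to show the scaling).
small = expected_hits(150, 1_000_000, 40.0)        # 1 Mbp database
large = expected_hits(150, 1_000_000_000, 40.0)    # 1 Gbp database

print(f"E-value against 1 Mbp: {small:.2e}")
print(f"E-value against 1 Gbp: {large:.2e}")
print(f"fold increase: {large / small:.0f}")
```

Under this approximation the E-value grows in direct proportion to the database size, so an alignment with the same score becomes less significant as redundant sequences are added.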

Most of the time if you want to BLAST a FASTQ file, you’re probably not using the best approach

It is likely that you want to first reduce redundancy in your dataset. The most biologically relevant way is often to perform whole genome or transcriptome assembly of your raw reads prior to BLASTing them. Sometimes, simple deduplication or collapsing is sufficient.
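As a rough illustration of what simple collapsing means, the Python sketch below merges exact duplicate reads and keeps a count for each unique sequence. The function name `collapse_identical` and the output header format are hypothetical; the sketch treats only identical strings as duplicates (it does not handle reverse complements or near-identical reads) and holds everything in memory, so for real datasets a dedicated tool is the safer choice.

```python
from collections import Counter


def collapse_identical(sequences):
    """Return (sequence, count) pairs for exact duplicates, most abundant first."""
    counts = Counter(seq.upper() for seq in sequences)
    return counts.most_common()


# Made-up reads, for illustration only.
reads = ["ACGT", "acgt", "GGCA", "ACGT"]
for i, (seq, n) in enumerate(collapse_identical(reads), start=1):
    # Write each unique sequence once, recording its abundance in the header.
    print(f">unique{i};count={n}")
    print(seq)
```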

If you do want to work with the raw FASTQ reads, BLAST often isn’t the best way to perform analysis.

Command-line batch conversion of FASTQ to FASTA

If you have huge numbers of reads, you’ll want to use a more automated approach to convert the FASTQ file to a FASTA file. Using a tried and tested tool is less risky than writing your own custom script with grep, sed, Python, Perl or ChatGPT. The following seqtk command is one easy way:

seqtk seq -A input.fq > output.fasta

Before using huge numbers of reads for database creation or as queries, it’s often a good idea to remove redundancy. You can directly reduce redundancy with a tool like CD-HIT, but it’s often best to run a quick assembly (e.g. with SPAdes or MEGAHIT).

Contact SequenceServer for custom support options

If you need a transcriptome, metagenome or genome assembly done on your raw data, we can help you with that. Contact support with your details and we’ll get back to you. We offer cheap and fast transcriptome, genome, and metagenome assembly services.

By leveraging cloud computing and publication-ready graphics, SequenceServer Cloud makes it easy to perform BLAST searches and to interpret them.