FASTA and FASTQ formats
Overview
- FASTA stores sequence records using a header line starting with >, followed by one or more sequence lines.
- FASTQ stores reads using a strict 4-line record layout: header (@), sequence, separator (+), and quality string.
- Format structure: how records are delimited and how lines are arranged.
- Alphabet expectations: which letters are meaningful depends on DNA vs RNA vs protein and on the tool or workflow.
FASTA format
A FASTA file is a series of records. Each record begins with a header line starting with >. The header typically contains an identifier, optionally followed by a description. One or more sequence lines follow. Many tools treat the first whitespace-delimited token in the header as the record ID; the rest is description.
>seq1 some optional description
ACGTACGTACGT
ACGTNNNNACGT
>seq2
GATTACA
- FASTA sequences may be wrapped across multiple lines; parsers are expected to concatenate them without altering content.
- Many tools treat FASTA as case-insensitive; lowercase is commonly used to represent (soft) masking.
- Line length limits are typically imposed by tools or pipelines rather than by the FASTA format itself; conventions such as ~80 characters are common but not required.
- Blank lines and embedded whitespace inside sequence lines are frequent sources of parsing errors in real workflows.
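The record structure described above can be sketched as a small reader. This is a minimal illustration, not a reference parser: the names parse_fasta and _make_record are hypothetical, and skipping blank lines (rather than rejecting them) is an assumption that stricter tools may not share.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The first whitespace-delimited token is the ID; the rest is the description.
    parts = header.split(None, 1)
    record_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    # Wrapped sequence lines are concatenated without altering their content.
    return record_id, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (record_id, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif line.strip() == "":
            continue  # assumption: blank lines are skipped, not treated as errors
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Usage with the example records shown earlier in this section.
example = [
    ">seq1 some optional description",
    "ACGTACGTACGT",
    "ACGTNNNNACGT",
    ">seq2",
    "GATTACA",
]
for record_id, description, sequence in parse_fasta(example):
    print(record_id, description, len(sequence))
```

The sketch leaves case and embedded whitespace untouched, so soft-masking is preserved and stray spaces remain visible to any later validation step.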
FASTQ format
FASTQ extends FASTA-style records by associating a quality string with each sequence. The most common representation uses a strict 4-line record:
- @ header line (read identifier with optional description)
- sequence line
- + separator line (may be just +, or may repeat header text)
- quality line (same length as the sequence line)
@read1
ACGTN
+
IIIII
- In the standard FASTQ representation, sequence and quality strings must have identical length.
- The + line marks the boundary between the sequence and its quality string; it may contain only + or may repeat the identifier text.
- FASTQ exists in several historical variants, which differ most notably in how quality scores are encoded; in practice, most modern tools expect the 4-line-per-record representation together with a specific encoding. Older or nonstandard FASTQ variants may wrap lines.
- Whitespace inside sequence or quality strings is almost always unintended and breaks many parsers.
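As a rough illustration of the 4-line layout, the following sketch reads records four lines at a time and stops at the first structural problem. The name parse_fastq is hypothetical, and the sketch assumes the strict unwrapped representation; it does not handle older variants that wrap sequence or quality lines.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from strict 4-line FASTQ text."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Reusing one iterator four times groups the stream into 4-line chunks;
    # an incomplete final chunk is padded with None.
    groups = zip_longest(*[stripped] * 4)
    for number, (header, sequence, separator, quality) in enumerate(groups, start=1):
        if quality is None:
            raise ValueError(f"record {number}: truncated (fewer than 4 lines)")
        if not header.startswith("@"):
            raise ValueError(f"record {number}: header does not start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {number}: separator does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError(f"record {number}: sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Usage with the example record shown earlier in this section.
for header, sequence, quality in parse_fastq(["@read1", "ACGTN", "+", "IIIII"]):
    print(header, sequence, quality)
```

Counting lines instead of searching for @ matters because @ is also a valid quality character, so a quality line can legitimately begin with it.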
Paired-end FASTQ layouts
Paired-end reads are commonly represented using one of two layouts:
- Two-file layout: one file for forward reads (R1) and one for reverse reads (R2). R1 and R2 files must have the same number of reads and matching order (usually matching IDs).
- Interleaved layout: forward and reverse reads alternate within a single file; the file must contain an even number of reads.
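The relationship between the two layouts can be sketched as a function that merges R1 and R2 into an interleaved stream while checking count and order. The names read_id and interleave are hypothetical. Comparing only the first header token, and stripping a trailing /1 or /2 mate suffix, is an assumption about the naming scheme; real datasets may mark mates differently.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def read_id(header: str) -> str:
    """Return the first header token without a trailing /1 or /2 suffix."""
    tokens = header.split()
    token = tokens[0] if tokens else ""
    if token.endswith(("/1", "/2")):  # assumption: mates are marked this way
        token = token[:-2]
    return token


def interleave(r1: Iterable[Read], r2: Iterable[Read]) -> Iterator[Read]:
    """Alternate forward and reverse reads, checking count and ID order."""
    for index, (fwd, rev) in enumerate(zip_longest(r1, r2), start=1):
        if fwd is None or rev is None:
            raise ValueError(f"pair {index}: R1 and R2 have different read counts")
        if read_id(fwd[0]) != read_id(rev[0]):
            raise ValueError(f"pair {index}: IDs differ ({fwd[0]!r} vs {rev[0]!r})")
        yield fwd
        yield rev


# Usage with two small in-memory read lists (made-up records for illustration).
forward = [("read1/1", "ACGT", "IIII"), ("read2/1", "GGCC", "IIII")]
reverse = [("read1/2", "TTTT", "IIII"), ("read2/2", "AAAA", "IIII")]
for header, sequence, quality in interleave(forward, reverse):
    print(header, sequence)
```

Because each pair contributes exactly two reads, the output always contains an even number of reads, which is the invariant the interleaved layout requires.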
Quality scores and encodings (FASTQ)
In modern FASTQ files, quality scores are typically stored as printable ASCII characters (usually in the range 33–126), where each character represents a numeric score after applying an encoding offset. Quality scores are logarithmic: lower values indicate lower confidence in a base call, while higher values indicate higher confidence. For example, in Phred+33, ! corresponds to Q0 (very low confidence), while I corresponds to Q40 (high confidence).
- Phred+33 is the dominant modern encoding.
- Phred+64 appears in older datasets and legacy pipelines.
- Other representations exist in specialized contexts and are not interchangeable with ASCII-encoded FASTQ.
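The offset arithmetic can be illustrated with a short decoder. The function names are hypothetical. The conversion to an error probability uses the standard Phred definition, Q = -10 * log10(P), which the text above summarizes as "logarithmic".

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to numeric scores for a given ASCII offset."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        # A negative score usually means the wrong offset was assumed.
        raise ValueError("character below the encoding offset")
    return scores


def error_probability(score: int) -> float:
    """Return the estimated probability that a base call is wrong."""
    return 10 ** (-score / 10)


# Phred+33: '!' is ASCII 33 and 'I' is ASCII 73.
print(decode_quality("!I"))             # [0, 40]
# Phred+64: 'h' is ASCII 104.
print(decode_quality("h", offset=64))   # [40]
print(error_probability(40))
```

Decoding Phred+33 data with an offset of 64 raises an error in this sketch for any character below ASCII 64, which is one simple way a mismatched encoding shows up.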
Common pitfalls (what breaks parsers)
- FASTA: sequence data appears before the first > header.
- FASTA: a header record not followed by any sequence lines.
- FASTA: blank lines or unexpected empty records.
- FASTA/FASTQ: spaces or tabs embedded inside sequence strings.
- FASTA/FASTQ: newline conversions (Windows CRLF vs Unix LF) are often tolerated, but mixed line endings or other newline corruption can break strict parsers.
- FASTQ: missing or malformed + separator line.
- FASTQ: truncated files (not a multiple of 4 lines).
- FASTQ: sequence and quality lengths do not match.
- FASTQ: non-printable characters introduced into quality strings.
Practical tips
- Be careful with text editors: newline normalization and character re-encoding can corrupt files.
- Keep paired-end layout consistent: do not mix R1/R2 and interleaved layouts unless the tool explicitly supports both.
- If a FASTQ file fails to parse, first check that (1) the line count is a multiple of 4, (2) the 1st line of every record starts with @ and the 3rd with +, and (3) the sequence and quality lines of each record have the same length.
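The three checks in the last tip can be sketched as a small diagnostic that collects every problem instead of stopping at the first one. The name check_fastq is hypothetical, and the sketch assumes the strict 4-line layout and a file small enough to hold in memory.

```python
from typing import List, Sequence


def check_fastq(lines: Sequence[str]) -> List[str]:
    """Return a list of human-readable problems found in FASTQ lines."""
    problems: List[str] = []
    rows = [line.rstrip("\r\n") for line in lines]
    # Check 1: the line count must be a multiple of 4.
    if len(rows) % 4 != 0:
        problems.append(f"line count {len(rows)} is not a multiple of 4")
    # Only complete 4-line records are inspected further.
    complete = len(rows) - len(rows) % 4
    for start in range(0, complete, 4):
        header, sequence, separator, quality = rows[start:start + 4]
        record = start // 4 + 1
        # Check 2: marker characters on the 1st and 3rd lines.
        if not header.startswith("@"):
            problems.append(f"record {record}: line 1 does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record}: line 3 does not start with '+'")
        # Check 3: sequence and quality lengths must match.
        if len(sequence) != len(quality):
            problems.append(f"record {record}: sequence/quality length mismatch")
    return problems


# Usage: a truncated file with one complete record and one stray line.
for problem in check_fastq(["@read1", "ACGTN", "+", "IIIII", "@read2"]):
    print(problem)
```

A missing line early in a file shifts every later record, so a long run of reported problems often traces back to a single deletion near the first one.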