FASTA and FASTQ formats

Overview

FASTA stores sequence records using a header line starting with >, followed by one or more sequence lines.
FASTQ stores reads using a strict 4-line record layout: header (@), sequence, separator (+), and quality string.

Both formats are simple, but parsers depend on exact record boundaries and line structure. In real workflows, most failures trace back to broken record boundaries, truncated files, stray whitespace, or FASTQ records where sequence and quality lengths no longer match.

Keep two ideas separate:

Format structure: how records are delimited and how lines are arranged.
Alphabet expectations: which letters are meaningful depends on DNA vs RNA vs protein and on the tool or workflow.

For IUPAC letters and ambiguity codes, see Sequence alphabets & IUPAC codes.

FASTA format

A FASTA file is a series of records. Each record begins with a header line starting with >. The header typically contains an identifier, optionally followed by a description. One or more sequence lines follow. Many tools treat the first whitespace-delimited token in the header as the record ID; the rest is description.

Minimal example

>seq1 some optional description
ACGTACGTACGT
ACGTNNNNACGT
>seq2
GATTACA

Notes

FASTA sequences may be wrapped across multiple lines; parsers are expected to concatenate them without altering content.
Many tools treat FASTA as case-insensitive; lowercase is commonly used to represent (soft) masking.
Line length limits are typically imposed by tools or pipelines rather than by the FASTA format itself; conventions such as ~80 characters are common but not required.
Blank lines and embedded whitespace inside sequence lines are frequent sources of parsing errors in real workflows.
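The concatenation and ID rules above can be sketched as a small reader. This is a minimal, lenient sketch rather than a reference parser: the names parse_fasta and _finish are made up for illustration, blank lines are silently skipped (a strict validator would likely report them), and embedded whitespace is left untouched.

```python
from typing import Iterable, Iterator, List, Tuple


def _finish(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # First whitespace-delimited token is the ID; the rest is the description.
    parts = header.split(None, 1)
    record_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    # Wrapped sequence lines are joined without altering their content.
    return record_id, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (record_id, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _finish(header, chunks)
            header, chunks = line[1:], []
        elif header is None:
            if line.strip():
                raise ValueError("sequence data before the first '>' header")
        else:
            # Assumption: blank lines contribute nothing and are tolerated here.
            chunks.append(line)
    if header is not None:
        yield _finish(header, chunks)


example = """>seq1 some optional description
ACGTACGTACGT
ACGTNNNNACGT
>seq2
GATTACA
""".splitlines()

records = list(parse_fasta(example))
for record_id, description, sequence in records:
    print(record_id, len(sequence), description)
```

Run on the minimal example, this yields two records, with the two wrapped lines of seq1 joined into one 24-character sequence.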

FASTQ format

FASTQ extends FASTA-style records by associating a quality string with each sequence. The most common representation uses a strict 4-line record:

@ header line (read identifier with optional description)
sequence line
+ separator line (may be just +, or may repeat header text)
quality line (same length as the sequence line)

Minimal example

@read1
ACGTN
+
IIIII

Notes

In the standard FASTQ representation, sequence and quality strings must have identical length.
The + line separates the sequence from the quality string; it may be a bare + or may repeat the identifier text.
FASTQ exists in several historical variants, which differ most notably in how quality scores are encoded. In practice, most modern tools expect the 4-line-per-record representation together with a specific encoding. Older or nonstandard FASTQ variants may wrap sequence and quality lines.
Whitespace inside sequence or quality strings is almost always unintended and breaks many parsers.
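The strict 4-line layout can be read as fixed blocks of four lines. The following is a minimal sketch under that assumption; it does not handle wrapped FASTQ variants, and read_fastq is an illustrative name, not an API from any particular library.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from strict 4-line FASTQ records."""
    it = iter(lines)
    record_no = 0
    while True:
        # Take the next four lines as one record.
        block = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not block:
            return
        record_no += 1
        if len(block) < 4:
            raise ValueError(f"record {record_no}: truncated ({len(block)} of 4 lines)")
        header, sequence, separator, quality = block
        if not header.startswith("@"):
            raise ValueError(f"record {record_no}: header does not start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {record_no}: separator does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError(f"record {record_no}: sequence and quality lengths differ")
        yield header[1:], sequence, quality


reads = list(read_fastq(["@read1", "ACGTN", "+", "IIIII"]))
print(reads)
```

Raising on the first malformed record is a design choice for this sketch; a validator that reports every problem would collect messages instead.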

Paired-end FASTQ layouts

Paired-end reads are commonly represented using one of two layouts:

Two-file layout: one file for forward reads (R1) and one for reverse reads (R2). The R1 and R2 files must contain the same number of reads in the same order (usually verified by matching IDs).
Interleaved layout: forward and reverse reads alternate within a single file; the file must contain an even number of reads.
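Both layout rules can be checked from read headers alone. The sketch below assumes a common convention in which mates share an ID up to an optional /1 or /2 suffix or a description after whitespace; other naming schemes would need a different base_id. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def base_id(header: str) -> str:
    """First token of the header, without a trailing /1 or /2 mate suffix."""
    tokens = header.split()
    token = tokens[0] if tokens else ""
    if token.endswith(("/1", "/2")):
        token = token[:-2]
    return token


def check_two_file(r1_headers: Sequence[str], r2_headers: Sequence[str]) -> List[str]:
    """Report count and order problems between R1 and R2 headers."""
    problems = []
    if len(r1_headers) != len(r2_headers):
        problems.append(f"read counts differ: {len(r1_headers)} vs {len(r2_headers)}")
    for i, (h1, h2) in enumerate(zip(r1_headers, r2_headers), start=1):
        if base_id(h1) != base_id(h2):
            problems.append(f"pair {i}: IDs differ ({h1!r} vs {h2!r})")
            break  # once order is lost, later pairs are usually shifted as well
    return problems


def split_interleaved(headers: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split alternating forward/reverse headers; requires an even count."""
    if len(headers) % 2:
        raise ValueError("interleaved file has an odd number of reads")
    return list(headers[0::2]), list(headers[1::2])


print(check_two_file(["r1/1", "r2/1"], ["r1/2", "r2/2"]))
print(split_interleaved(["a/1", "a/2", "b/1", "b/2"]))
```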

Quality scores and encodings (FASTQ)

In modern FASTQ files, quality scores are typically stored as printable ASCII characters (usually in the range 33–126), where each character represents a numeric score after applying an encoding offset. Quality scores are logarithmic: lower values indicate lower confidence in a base call, while higher values indicate higher confidence. For example, in Phred+33, ! corresponds to Q0 (very low confidence), while I corresponds to Q40 (high confidence).
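As a worked illustration of the offset and the logarithmic scale, the sketch below decodes a quality string and converts a score to an error probability using the standard Phred definition P = 10^(-Q/10). The function names are illustrative only.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its numeric score (default Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Estimated probability that a base call with score q is wrong."""
    return 10 ** (-q / 10)


scores = decode_quality("!I")
print(scores)                          # '!' is ASCII 33, 'I' is ASCII 73
print(error_probability(scores[1]))    # Q40: roughly 1 error in 10,000 calls
```

Each step of 10 in Q corresponds to a tenfold change in error probability, so Q20 is about 1 in 100 and Q30 about 1 in 1,000.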

For a detailed explanation of Phred scores, encodings, and practical interpretation, see FASTQ quality scores.

Phred+33 is the dominant modern encoding.
Phred+64 appears in older datasets and legacy pipelines.
Other representations exist in specialized contexts and are not interchangeable with ASCII-encoded FASTQ.
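When the encoding of a file is unknown, the observed character range can rule offsets out, although it cannot always single one out. The sketch below is a heuristic under that assumption: it only reports which of the two offsets named above are consistent with the characters seen, and candidate_offsets is a made-up name.

```python
from typing import Iterable, List


def candidate_offsets(quality_strings: Iterable[str]) -> List[int]:
    """Return the offsets (33 and/or 64) consistent with the observed characters."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        return []
    lowest, highest = min(codes), max(codes)
    if highest > 126:
        return []  # outside printable ASCII: not a valid quality string
    # An offset is possible only if no character falls below it (score >= 0).
    return [offset for offset in (33, 64) if lowest >= offset]


print(candidate_offsets(["!I", "IIII"]))  # '!' (ASCII 33) rules out Phred+64
print(candidate_offsets(["hhhh"]))        # high characters fit either offset
```

A result containing both offsets means the sample is ambiguous; uniformly high-quality Phred+33 data can look like Phred+64, so a larger sample or knowledge of the data's origin is needed.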

Practical warning: Copy/paste operations, text editors, and re-encoding can introduce non-printable characters or alter line breaks, silently corrupting FASTQ quality strings.

Common pitfalls (what breaks parsers)

FASTA: sequence data appears before the first > header.
FASTA: a header line not followed by any sequence lines.
FASTA: blank lines or unexpected empty records.
FASTA/FASTQ: spaces or tabs embedded inside sequence strings.
FASTA/FASTQ: newline conversions (Windows CRLF vs Unix LF) are often tolerated, but mixed or corrupted line endings can break strict parsers.
FASTQ: missing or malformed + separator line.
FASTQ: truncated files (not a multiple of 4 lines).
FASTQ: sequence and quality lengths do not match.
FASTQ: non-printable characters introduced into quality strings.
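Line-ending and non-printable-character problems are easiest to spot on raw bytes, before any text decoding. The following is a rough sketch: scan_bytes is a hypothetical helper, and it treats tab, LF, and CR as allowed control bytes, which is an assumption rather than a rule of either format.

```python
def scan_bytes(data: bytes) -> dict:
    """Count newline styles and non-printable bytes in raw file content."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf        # LF not preceded by CR
    bare_cr = data.count(b"\r") - crlf   # CR not followed by LF
    allowed_controls = {0x09, 0x0A, 0x0D}
    non_printable = sum(
        1
        for byte in data
        if byte not in allowed_controls and not 0x20 <= byte <= 0x7E
    )
    styles_in_use = sum(1 for count in (lf, crlf, bare_cr) if count)
    return {
        "lf": lf,
        "crlf": crlf,
        "bare_cr": bare_cr,
        "non_printable": non_printable,
        "mixed_newlines": styles_in_use > 1,
    }


report = scan_bytes(b"@r1\r\nACGT\r\n+\r\nIIII\n")
print(report)
```

For a real file, the bytes would come from opening it in binary mode, for example open(path, "rb").read(), so that no newline translation happens first.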

Practical tips

Be careful with text editors: newline normalization and character re-encoding can corrupt files.
Keep paired-end layout consistent: do not mix R1/R2 and interleaved layouts unless the tool explicitly supports both.
If a FASTQ file fails, first check that (1) the line count is a multiple of 4, (2) the 1st line of every record starts with @ and the 3rd with +, and (3) the sequence and quality lines of each record have the same length.
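The three first checks for a failing FASTQ file can be run in one pass that collects every problem instead of stopping at the first. This is a sketch assuming the strict 4-line layout; triage_fastq is an illustrative name.

```python
from typing import Iterable, List


def triage_fastq(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found by the three basic FASTQ checks."""
    rows = [line.rstrip("\r\n") for line in lines]
    problems = []
    # Check 1: line count is a multiple of 4.
    if len(rows) % 4:
        problems.append(f"line count {len(rows)} is not a multiple of 4")
    complete = len(rows) - len(rows) % 4
    for start in range(0, complete, 4):
        header, sequence, separator, quality = rows[start:start + 4]
        record_no = start // 4 + 1
        # Check 2: line 1 starts with '@' and line 3 starts with '+'.
        if not header.startswith("@"):
            problems.append(f"record {record_no}: line 1 does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record_no}: line 3 does not start with '+'")
        # Check 3: sequence and quality have the same length.
        if len(sequence) != len(quality):
            problems.append(
                f"record {record_no}: sequence length {len(sequence)} "
                f"!= quality length {len(quality)}"
            )
    return problems


print(triage_fastq(["@read1", "ACGTN", "+", "IIIII"]))
print(triage_fastq(["@read1", "ACGTN", "+", "IIII"]))
```

An empty list means the file passed these basic checks, not that it is valid in every respect.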

Validate a FASTA/FASTQ file

The FASTA/FASTQ Validator strictly checks FASTA and FASTQ file structure, record boundaries, and FASTQ sequence/quality length matching.
