FASTQ quality scores

Overview

In its common unwrapped form, a FASTQ record consists of four lines: a header beginning with @, the sequence, a separator beginning with +, and a quality string. The quality string must contain exactly one character per base in the sequence.

Minimal example
@read1 optional description
ACGTN
+
I?I5I

Quality characters must be decoded with the same encoding that the basecaller or upstream pipeline used to write them.
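The four-line layout described above can be read with a few lines of Python. This is a minimal sketch rather than a reference parser: the names FastqRecord and read_fastq are hypothetical, and the code assumes unwrapped records (exactly four lines each) with no trailing blank lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped, four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of input
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Drop the leading "@" so that name holds only the header text.
        yield FastqRecord(header[1:], sequence, quality)


example = "@read1 optional description\nACGTN\n+\nI?I5I\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].sequence, records[0].quality)  # ACGTN I?I5I
```

The quality field is kept as raw characters here; decoding it into numeric values is a separate step that depends on the encoding.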

Phred scores

FASTQ quality values are reported on the Phred scale, a logarithmic transformation of an estimated probability that the base call is incorrect (p):

Q = -10 × log10(p)
p = 10^(-Q/10)

For example, Q30 corresponds to an estimated error probability of 10⁻³ (approximately one error per thousand bases).
Higher Q means lower estimated error probability; every +10 in Q corresponds to a 10× decrease in p.

Important: quality scores are model-derived estimates produced by basecalling software. They are not observed error rates and depend on chemistry, signal processing, and training data.
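The two formulas translate directly into code. The sketch below is illustrative only; the function names phred_to_error_prob and error_prob_to_phred are hypothetical and not taken from any particular library.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """p = 10^(-Q/10)"""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(p); p must lie in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    # Each +10 in Q divides the estimated error probability by 10.
    print(q, phred_to_error_prob(q))
```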

Interpreting Phred values

The table below shows common reference points on the Phred scale. These values follow directly from the definition and should not be treated as universal cutoffs.

Phred (Q)   Error probability (p)   Implied accuracy (%)
10          0.1                     90
20          0.01                    99
30          0.001                   99.9
40          0.0001                  99.99

Aggregate metrics such as mean or median quality are derived from per-base values and are not directly comparable across platforms or basecalling pipelines.
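One reason aggregate values can disagree is that "mean quality" can be computed in more than one way. The sketch below contrasts two conventions: averaging the Q values directly, and averaging the error probabilities before converting back to Q. Which convention a given tool uses is not stated here and should be checked in that tool's documentation; the function names are hypothetical.

```python
import math
from typing import Sequence


def arithmetic_mean_q(qs: Sequence[float]) -> float:
    """Plain average of the Q values."""
    return sum(qs) / len(qs)


def error_weighted_mean_q(qs: Sequence[float]) -> float:
    """Average the error probabilities, then convert the mean back to Q."""
    mean_p = sum(10 ** (-q / 10) for q in qs) / len(qs)
    return -10 * math.log10(mean_p)


qs = [40, 40, 40, 10]  # three confident calls and one poor one
print(arithmetic_mean_q(qs))      # 32.5
print(error_weighted_mean_q(qs))  # roughly 16
```

Because the Phred scale is logarithmic, a single low-quality base dominates the error-weighted figure while barely moving the arithmetic mean.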

ASCII encoding (Phred+33)

FASTQ stores quality values as printable ASCII characters. In the widely used Sanger convention, the encoded character is chr(Q + 33). This encoding is commonly referred to as Phred+33.

Q0  → '!'
Q10 → '+'
Q20 → '5'
Q30 → '?'
Q40 → 'I'

The FASTQ file stores the characters (e.g., ?), not the numeric Q values.
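Converting between characters and Q values is a matter of adding or subtracting the offset. The following sketch assumes Phred+33 and the printable ASCII range '!' to '~' (Q0 to Q93); the function names are hypothetical.

```python
from typing import Iterable, List

PHRED33_OFFSET = 33
MAX_Q = 93  # '~' (ASCII 126) is the highest printable character


def decode_phred33(quality: str) -> List[int]:
    """Convert a Phred+33 quality string into a list of integer Q values."""
    scores = []
    for ch in quality:
        q = ord(ch) - PHRED33_OFFSET
        if not 0 <= q <= MAX_Q:
            raise ValueError(f"character {ch!r} is outside the Phred+33 range")
        scores.append(q)
    return scores


def encode_phred33(scores: Iterable[int]) -> str:
    """Convert integer Q values back into a Phred+33 quality string."""
    return "".join(chr(q + PHRED33_OFFSET) for q in scores)


print(decode_phred33("I?I5I"))               # [40, 30, 40, 20, 40]
print(encode_phred33([0, 10, 20, 30, 40]))   # !+5?I
```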

Non-printable characters, unexpected whitespace, or replacement glyphs usually indicate corruption introduced by editors, copy/paste, or character re-encoding.

Short reads vs long reads

FASTQ is used across sequencing technologies, but the operational meaning of quality scores depends on the basecaller and processing stage. Treat Q-scores as platform-specific estimates rather than a universal currency.

Oxford Nanopore (ONT)

ONT basecallers output FASTQ with Phred-scale quality values encoded using the Sanger convention (Phred+33). Summary read Q-scores (when provided) are derived from per-base estimates and should be interpreted in the context of the basecaller and model version.

PacBio

PacBio pipelines can produce FASTQ representations with Phred-scale quality values encoded as Phred+33. Distributions and downstream interpretation depend strongly on whether reads are raw or consensus-derived.

Practical takeaway: avoid hard, platform-agnostic cutoffs. Use QC tools and thresholds appropriate to the data source and processing stage.

Common pitfalls

  • Incomplete records due to truncated files.
  • Sequence and quality strings with mismatched lengths.
  • Line wrapping introduced by editors or copy/paste.
  • Unexpected characters due to encoding conversions or copy/paste.
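  • Legacy quality encodings in older datasets (e.g., the offset-64 conventions written by early Illumina and Solexa pipelines), which decode to wrong values if read as Phred+33.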
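For the legacy-convention pitfall, a rough check on the lowest quality character can help. The sketch below is a heuristic, not a detector: it relies on the fact that the older offset-64 encodings cannot produce characters below ';' (ASCII 59), so any lower character implies Phred+33, while the absence of such characters proves nothing. The function name guess_quality_offset is hypothetical.

```python
from typing import Iterable

LOWEST_OFFSET64_CHAR = 59  # ';' is the lowest character an offset-64 file can contain


def guess_quality_offset(quality_strings: Iterable[str]) -> str:
    """Return 'phred33', 'ambiguous', or 'unknown' for a sample of quality strings."""
    lowest = min(
        (ord(ch) for quality in quality_strings for ch in quality),
        default=None,
    )
    if lowest is None:
        return "unknown"  # no quality characters were seen
    if lowest < LOWEST_OFFSET64_CHAR:
        return "phred33"  # such characters cannot occur with offset 64
    # Only high characters seen: either offset 64, or uniformly high Phred+33.
    return "ambiguous"


print(guess_quality_offset(["I?I5I"]))  # phred33
print(guess_quality_offset(["hhhh"]))   # ambiguous
```

An "ambiguous" result should be resolved from the dataset's provenance, not from the characters alone.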

Practical validation

A strict FASTQ validation typically checks:

  • complete 4-line records
  • proper header and separator lines
  • exact sequence/quality length matching
  • quality characters consistent with the expected encoding (e.g., Phred+33)

Structural validity and alphabet expectations are separate concerns. A file may be structurally valid even if the sequence alphabet is unusual for a given workflow.
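The structural checks listed above can be sketched as a single function that collects problems instead of stopping at the first one. This is an illustration under stated assumptions: validate_fastq is a hypothetical name, the input is assumed to be unwrapped, the whole file is read into memory for simplicity, and the default character range corresponds to Phred+33. The sequence alphabet is deliberately not checked.

```python
import io
from typing import List, TextIO


def validate_fastq(handle: TextIO, min_char: str = "!", max_char: str = "~") -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    lines = [line.rstrip("\r\n") for line in handle]
    leftover = len(lines) % 4
    if leftover:
        problems.append(f"incomplete final record: {leftover} trailing line(s)")
    for start in range(0, len(lines) - leftover, 4):
        header, sequence, separator, quality = lines[start:start + 4]
        record_no = start // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {record_no}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record_no}: separator does not start with '+'")
        if len(sequence) != len(quality):
            problems.append(
                f"record {record_no}: sequence length {len(sequence)} "
                f"!= quality length {len(quality)}"
            )
        bad = sorted({ch for ch in quality if not min_char <= ch <= max_char})
        if bad:
            problems.append(f"record {record_no}: unexpected quality characters {bad}")
    return problems


good = "@read1\nACGTN\n+\nI?I5I\n"
truncated = "@read1\nACGTN\n+\n"
print(validate_fastq(io.StringIO(good)))       # []
print(validate_fastq(io.StringIO(truncated)))  # one problem: incomplete final record
```

For large files, the same checks can be applied record by record while streaming instead of loading every line first.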