Sequence statistics


Overview

Sequence statistics describe the size and distribution of sequences or reads in a dataset. They are commonly used for quick sanity checks, dataset comparison, and to surface unexpected properties before downstream analyses.

These statistics are most commonly applied to nucleotide data (DNA or RNA). For protein sequences, only length-based summaries are meaningful; nucleotide-specific metrics such as GC% do not apply.

What sequence statistics are (and are not):

Descriptive: they summarize how long sequences are and how those lengths are distributed.
Format-agnostic: they apply to both FASTA and FASTQ inputs.
Not validation: they do not check file structure or alphabet correctness.
Not biological interpretation: they do not, on their own, say anything about correctness, completeness, or functional relevance.

Counts and basic length metrics

Sequence / read count: number of records in the file.
Total length: sum of all sequence lengths (bp or aa).
Minimum / maximum length.
Mean length: arithmetic average.
Median length: middle value when lengths are sorted.

Comparing mean and median lengths helps reveal skewed distributions, such as datasets dominated by many short sequences and a small number of long ones.
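The metrics above can be sketched in a few lines of Python. This is only an illustration, assuming the sequence lengths have already been extracted into a list; the function name `length_summary` and the example lengths are hypothetical and not part of any particular tool.

```python
from statistics import mean, median

def length_summary(lengths):
    """Return basic descriptive length metrics for a list of sequence lengths."""
    if not lengths:
        raise ValueError("no sequences")
    return {
        "count": len(lengths),      # number of records
        "total": sum(lengths),      # sum of all sequence lengths
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),      # arithmetic average
        "median": median(lengths),  # middle value of the sorted lengths
    }

# Hypothetical skewed dataset: four short sequences and one long one.
summary = length_summary([100, 100, 100, 100, 5000])
print(summary["mean"], summary["median"])
```

In this made-up dataset the mean (1080) is far above the median (100), which is the kind of gap that points to a skewed length distribution.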

N50 and L50

N50 is the sequence length such that 50% of the total dataset length is contained in sequences of length N50 or longer. L50 is the number of sequences required to reach that 50% threshold.

In other words, sort sequences by length (largest to smallest), then sum lengths until you reach at least 50% of the total. L50 is how many sequences you needed, and N50 is the length of the shortest sequence among them.

Example (N50/L50):
Sequence lengths (sorted, largest → smallest):
1200, 1000, 800, 600, 400, 300, 200, 100 (total = 4600).
Cumulative lengths:
1200 → 2200 → 3000 → 3600 → 4000 → 4300 → 4500 → 4600
The 50% threshold is 2300. It is crossed when the third sequence (800) is added, which brings the cumulative length to 3000.
N50 = 800, L50 = 3.
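The sort-and-accumulate procedure described above could be written roughly as follows. This is a minimal sketch, not a reference implementation; the function name `nx_lx` and its `fraction` parameter are assumptions introduced here for illustration.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction of the total length.

    fraction=0.5 gives (N50, L50).
    """
    if not lengths:
        raise ValueError("no sequences")
    threshold = fraction * sum(lengths)
    cumulative = 0
    # Walk the sequences from longest to shortest, summing lengths.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        cumulative += length
        if cumulative >= threshold:
            # 'length' is the shortest sequence needed to reach the threshold,
            # 'count' is how many sequences were needed.
            return length, count
    raise ValueError("fraction must be between 0 and 1")

lengths = [1200, 1000, 800, 600, 400, 300, 200, 100]
n50, l50 = nx_lx(lengths)
print(n50, l50)  # 800 3
```

Applied to the example lengths, the sketch reproduces N50 = 800 and L50 = 3.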

Visual schematic (how N50 and L50 are computed)

This step-by-step example illustrates how N50 and L50 are calculated in practice, which is often clearer than the formal definition alone.


Sequences sorted by length (largest → smallest)

Seq1  ████████████████████████  1200 bp
Seq2  ████████████████████      1000 bp
Seq3  ████████████████           800 bp  ← N50 (crosses 50%)
Seq4  ████████████               600 bp
Seq5  ████████                   400 bp
Seq6  ██████                     300 bp
Seq7  ████                       200 bp
Seq8  ██                         100 bp
-----------------------------------------
Total length = 4600 bp
50% threshold = 2300 bp

Cumulative length calculation:
Seq1 = 1200
+ Seq2 = 2200
+ Seq3 = 3000   ← crosses 50% threshold
+ Seq4 = 3600
+ Seq5 = 4000
+ Seq6 = 4300
+ Seq7 = 4500
+ Seq8 = 4600

⇒ N50 = 800 bp   (i.e. the length of Seq3)
⇒ L50 = 3        (number of sequences needed)

N90 and L90

N90 and L90 are defined analogously to N50/L50, using 90% of the total sequence length.

These metrics emphasize the short-sequence tail of a distribution and are particularly informative when many fragments are small.
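To make the analogy concrete, here is a small sketch that computes N90 and L90 for the same eight lengths used in the N50 example (1200, 1000, 800, 600, 400, 300, 200, 100). The variable names are illustrative, and the integer comparison is a design choice made here to avoid floating-point rounding at the 90% threshold.

```python
from itertools import accumulate

# Lengths sorted from largest to smallest.
lengths = sorted([1200, 1000, 800, 600, 400, 300, 200, 100], reverse=True)
total = sum(lengths)  # 4600

# L90: number of sequences needed for the running total to reach 90% of the
# total length. "10 * c >= 9 * total" is the same test as "c >= 0.9 * total".
l90 = next(
    i for i, c in enumerate(accumulate(lengths), start=1)
    if 10 * c >= 9 * total
)
# N90: length of the shortest sequence among those L90 sequences.
n90 = lengths[l90 - 1]
print(n90, l90)  # 300 6
```

The 90% threshold is 4140, which is reached only at the sixth sequence, so N90 = 300 and L90 = 6. Compared with N50 = 800 and L50 = 3, the 90% metrics are determined by the shorter sequences in the dataset.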

GC%

GC% is the percentage of guanine (G) and cytosine (C) bases among the unambiguous bases in a sequence (A, C, G, and T, or U in RNA). By convention, ambiguous symbols such as N are excluded from both the numerator and the denominator.

Defined only for nucleotide sequences.
Ambiguous bases are typically ignored.
Not reported for protein data.
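A minimal sketch of this calculation is shown below. The function name `gc_percent` is hypothetical. Counting U alongside T (for RNA input) and returning `None` when a sequence contains no unambiguous bases are assumptions made for this example, not conventions stated by any specific tool.

```python
def gc_percent(sequence):
    """Return GC% over unambiguous bases, or None if there are none."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    # Assumption: U is counted like T so that RNA input also works.
    at = seq.count("A") + seq.count("T") + seq.count("U")
    counted = gc + at  # ambiguous symbols such as N are not counted
    if counted == 0:
        return None
    return 100.0 * gc / counted

# Four G/C bases out of six unambiguous bases; the two N symbols are ignored.
print(round(gc_percent("ACGTNNGC"), 2))  # 66.67
```

Because ambiguous symbols are left out of the denominator, a sequence padded with long runs of N yields the same GC% as the same sequence without them.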

GC content varies substantially between organisms and is often relatively consistent within a genome. As a result, GC% is commonly used as a characteristic property of sequences and genomes.

In practice, GC% is frequently examined together with other metrics such as sequence length or coverage. When plotted against these measures, GC% can help reveal contamination, mixed-species datasets, plasmids, or regions with atypical composition compared to the main genome.

Compute sequence statistics

Compute descriptive statistics for FASTA and FASTQ files, including length metrics, N50/N90, and GC% (for nucleotide data).
