Overview

Sequence statistics describe the size and distribution of sequences or reads in a dataset. They are commonly used for quick sanity checks, for comparing datasets, and for surfacing unexpected properties before downstream analyses.

These statistics are most commonly applied to nucleotide data (DNA or RNA). For protein sequences, only length-based summaries are meaningful; nucleotide-specific metrics such as GC% do not apply.

What sequence statistics are (and are not):
  • Descriptive: they summarize how long sequences are and how those lengths are distributed.
  • Format-agnostic: they apply to both FASTA and FASTQ inputs.
  • Not validation: they do not check file structure or alphabet correctness.
  • Not biological interpretation: they do not, on their own, say anything about correctness, completeness, or functional relevance.

Counts and basic length metrics

  • Sequence / read count: number of records in the file.
  • Total length: sum of all sequence lengths (bp or aa).
  • Minimum / maximum length.
  • Mean length: arithmetic average.
  • Median length: middle value when lengths are sorted.

Comparing mean and median lengths helps reveal skewed distributions, such as datasets dominated by many short sequences and a small number of long ones.
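
The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name length_summary and the sample lengths are hypothetical, and the input is assumed to be a list of lengths that has already been extracted from a FASTA or FASTQ file.

```python
from statistics import mean, median


def length_summary(lengths):
    """Return basic length metrics for a non-empty list of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    return {
        "count": len(lengths),          # number of records
        "total_length": sum(lengths),   # sum of all lengths (bp or aa)
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),   # arithmetic average
        "median_length": median(lengths),
    }


# Hypothetical skewed dataset: several short sequences and one long one.
summary = length_summary([100, 120, 150, 150, 200, 5000])
print(summary)
```

In this made-up dataset the single 5000-length sequence pulls the mean far above the median, which is the kind of skew the comparison is meant to expose.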

N50 and L50

N50 is the largest sequence length such that sequences of that length or longer contain at least 50% of the total dataset length. L50 is the smallest number of sequences required to reach that 50% threshold.

In other words, sort sequences by length (largest to smallest), then sum lengths until you reach at least 50% of the total. L50 is how many sequences you needed, and N50 is the length of the shortest sequence among them.

Example (N50/L50):
Sequence lengths (sorted, largest → smallest):
1200, 1000, 800, 600, 400, 300, 200, 100 (total = 4600).
Cumulative lengths:
1200 → 2200 → 3000 → 3600 → 4000 → 4300 → 4500 → 4600
The 50% threshold is 2300. It is first reached when the third sequence (800) is added, which brings the cumulative length to 3000.
N50 = 800, L50 = 3.

Visual schematic (how N50 and L50 are computed)

This step-by-step example illustrates how N50 and L50 are calculated in practice, which is often clearer than the formal definition alone.

Sequences sorted by length (largest → smallest)

Seq1  ████████████████████████  1200 bp
Seq2  ████████████████████      1000 bp
Seq3  ████████████████           800 bp  ← N50 (crosses 50%)
Seq4  ████████████               600 bp
Seq5  ████████                   400 bp
Seq6  ██████                     300 bp
Seq7  ████                       200 bp
Seq8  ██                         100 bp
-----------------------------------------
Total length = 4600 bp
50% threshold = 2300 bp

Cumulative length calculation:
Seq1 = 1200
+ Seq2 = 2200
+ Seq3 = 3000   ← crosses 50% threshold
+ Seq4 = 3600
+ Seq5 = 4000
+ Seq6 = 4300
+ Seq7 = 4500
+ Seq8 = 4600

⇒ N50 = 800 bp   (i.e. the length of Seq3)
⇒ L50 = 3        (number of sequences needed)
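
The sort-and-accumulate procedure shown above can be sketched in Python as follows. The function name n50_l50 is a hypothetical choice for illustration, and the input is assumed to be a plain list of sequence lengths. The comparison uses integer arithmetic (2 × cumulative ≥ total) to avoid rounding issues when the total is odd.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a non-empty list of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # largest to smallest
    total = sum(ordered)
    cumulative = 0
    for count, length in enumerate(ordered, start=1):
        cumulative += length
        # Stop at the first sequence that brings the running sum
        # to at least 50% of the total length.
        if 2 * cumulative >= total:
            return length, count


# Lengths from the worked example (total = 4600).
n50, l50 = n50_l50([1200, 1000, 800, 600, 400, 300, 200, 100])
print(n50, l50)  # 800 3
```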

Using N50 and L50 in genome assemblies

In genome assembly workflows, N50 and L50 are commonly reported as rough indicators of contiguity, that is, how fragmented or continuous an assembly is. Assemblies with longer contiguous sequences tend to have higher N50 values and lower L50 values.

This makes N50/L50 useful for quick comparisons between assemblies produced from the same organism, similar input data, or different parameter choices. They help answer the narrow question: “Are sequences generally longer or more fragmented?”

Important limitation: N50/L50 and N90/L90 summarize sequence length distributions only. They do not assess correctness, detect misassemblies, or indicate whether biologically meaningful regions are present or complete.

An assembly can have a high N50 while still being biologically poor, for example, if long sequences contain incorrect joins, lack expected genes, or represent collapsed repeats. Conversely, a fragmented assembly may still contain most biologically relevant content.

For biological completeness, additional evidence is needed beyond length metrics, such as whether expected genes, conserved regions, or functional elements are present. N50/L50 should therefore be interpreted as structural descriptors, not as comprehensive quality measures.

N90 and L90

N90 and L90 are defined analogously to N50/L50, using 90% of the total sequence length.

These metrics emphasize the short-sequence tail of a distribution and are particularly informative when many fragments are small.
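
Because N90/L90 differ from N50/L50 only in the threshold, one way to compute both is a single function that takes the percentage as a parameter. The sketch below is illustrative: the name nx_lx and its percent parameter are assumptions, not part of any particular tool.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for a threshold given as a percentage in (0, 100]."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    ordered = sorted(lengths, reverse=True)  # largest to smallest
    total = sum(ordered)
    cumulative = 0
    for count, length in enumerate(ordered, start=1):
        cumulative += length
        # Compare scaled values so that integer percentages
        # never go through floating-point division.
        if 100 * cumulative >= percent * total:
            return length, count


lengths = [1200, 1000, 800, 600, 400, 300, 200, 100]  # total = 4600
print(nx_lx(lengths, 50))  # (800, 3)
print(nx_lx(lengths, 90))  # (300, 6)
```

For these eight lengths the 90% threshold is 4140, which is first reached at the sixth sequence (cumulative 4300). N90 therefore drops to 300 while N50 stays at 800, showing how N90 responds to the short-sequence tail.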

GC%

GC% is the percentage of guanine (G) and cytosine (C) bases among all unambiguous bases (A, C, G, and T, or U in RNA). By convention, ambiguous symbols such as N are excluded from the calculation.

  • Defined only for nucleotide sequences.
  • Ambiguous bases are typically ignored.
  • Not reported for protein data.
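
A minimal sketch of this calculation in Python is shown below. The function name gc_percent is hypothetical, and returning None when a sequence contains no unambiguous bases is an assumption made for this example; a real tool may handle that case differently.

```python
def gc_percent(sequence):
    """Return GC% over unambiguous bases, or None if there are none."""
    seq = sequence.upper()  # treat lower-case (soft-masked) bases the same
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")  # U covers RNA
    unambiguous = gc + at  # N and other ambiguity codes are not counted
    if unambiguous == 0:
        return None  # assumption: undefined rather than 0
    return 100.0 * gc / unambiguous


print(gc_percent("ACGTNNGC"))  # 4 of 6 unambiguous bases are G or C
print(gc_percent("NNNN"))      # None
```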

GC content varies substantially between organisms and is often relatively consistent within a genome. As a result, GC% is commonly used as a characteristic property of sequences and genomes.

In practice, GC% is frequently examined together with other metrics such as sequence length or coverage. When plotted against these measures, GC% can help reveal contamination, mixed-species datasets, plasmids, or regions with atypical composition compared to the main genome.

Compute sequence statistics
Compute descriptive statistics for FASTA and FASTQ files, including length metrics, N50/N90, and GC% (for nucleotide data).