Sequence statistics
Overview
Sequence statistics describe the size and distribution of sequences or reads in a dataset. They are commonly used for quick sanity checks, for comparing datasets, and for surfacing unexpected properties before downstream analyses.
- Descriptive: they summarize how long sequences are and how those lengths are distributed.
- Format-agnostic: they apply to both FASTA and FASTQ inputs.
- Not validation: they do not check file structure or alphabet correctness.
- Not biological interpretation: they do not, on their own, say anything about correctness, completeness, or functional relevance.
Counts and basic length metrics
- Sequence / read count: number of records in the file.
- Total length: sum of all sequence lengths (bp or aa).
- Minimum / maximum length.
- Mean length: arithmetic average.
- Median length: middle value when lengths are sorted.
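As a minimal sketch, the metrics above can be computed from a plain list of sequence lengths. The function name `basic_length_stats` and the dictionary keys are illustrative assumptions, not part of any particular tool; the input lengths are the same illustrative values used in the N50 example.

```python
from statistics import mean, median


def basic_length_stats(lengths):
    """Return count, total, min, max, mean and median of sequence lengths.

    `lengths` is assumed to be a non-empty list of integers (bp or aa).
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    return {
        "count": len(lengths),      # number of records
        "total": sum(lengths),      # sum of all sequence lengths
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),      # arithmetic average
        "median": median(lengths),  # middle value of the sorted lengths
    }


stats = basic_length_stats([1200, 1000, 800, 600, 400, 300, 200, 100])
print(stats)
```

With an even number of sequences, `statistics.median` averages the two middle values, so the median here is 500 even though no sequence has that length.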
N50 and L50
N50 is the largest sequence length such that sequences of that length or longer together contain at least 50% of the total dataset length. L50 is the smallest number of sequences needed to reach that 50% threshold.
In other words, sort sequences by length (largest to smallest), then sum lengths until you reach at least 50% of the total. L50 is how many sequences you needed, and N50 is the length of the shortest sequence among them.
Sequence lengths (sorted, largest → smallest): 1200, 1000, 800, 600, 400, 300, 200, 100 (total = 4600).
Cumulative lengths: 1200 → 2200 → 3000 → 3600 → 4000 → 4300 → 4500 → 4600.
The 50% threshold is 2300. It is crossed when the third sequence (800) is added, bringing the cumulative length to 3000.
N50 = 800, L50 = 3.
Sequences sorted by length (largest → smallest)
Seq1 ████████████████████████ 1200 bp
Seq2 ████████████████████ 1000 bp
Seq3 ████████████████ 800 bp ← N50 (crosses 50%)
Seq4 ████████████ 600 bp
Seq5 ████████ 400 bp
Seq6 ██████ 300 bp
Seq7 ████ 200 bp
Seq8 ██ 100 bp
-----------------------------------------
Total length = 4600 bp
50% threshold = 2300 bp
Cumulative length calculation:
Seq1 = 1200
+ Seq2 = 2200
+ Seq3 = 3000 ← crosses 50% threshold
+ Seq4 = 3600
+ Seq5 = 4000
+ Seq6 = 4300
+ Seq7 = 4500
+ Seq8 = 4600
⇒ N50 = 800 bp (i.e. the length of Seq3)
⇒ L50 = 3 (number of sequences needed)
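The calculation above can be sketched in a few lines of Python. This is one possible implementation under the definition given here; the function name `n50_l50` is a hypothetical choice, and the input is assumed to be a non-empty list of integer lengths.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a non-empty list of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # largest to smallest
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Comparing 2 * running with total avoids a fractional threshold
        # when the total length is odd.
        if 2 * running >= total:
            return length, count


n50, l50 = n50_l50([1200, 1000, 800, 600, 400, 300, 200, 100])
print(n50, l50)  # 800 3
```

The loop stops at the first sequence whose cumulative length reaches at least half of the total, which matches the "crosses 50% threshold" step in the diagram.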
N90 and L90
N90 and L90 are defined analogously to N50/L50, using 90% of the total sequence length.
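To make the analogy concrete, here is a hedged sketch that generalizes the threshold to any percentage. The function name `nx_lx` and its `percent` parameter are assumptions made for illustration. Applied to the example lengths from the N50 section, the 90% threshold is 4140 bp, which is reached at the sixth sequence.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for an integer percentage, e.g. 50 or 90."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(lengths, reverse=True)  # largest to smallest
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison: running / total >= percent / 100
        if 100 * running >= percent * total:
            return length, count


example = [1200, 1000, 800, 600, 400, 300, 200, 100]
print(nx_lx(example, 50))  # (800, 3)
print(nx_lx(example, 90))  # (300, 6)
```

Because the threshold is higher, N90 is never larger than N50 and L90 is never smaller than L50 for the same dataset.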
GC%
GC% is the percentage of guanine (G) and cytosine (C) bases among all unambiguous A, C, G, and T bases; ambiguous symbols such as N are excluded by convention.
- Defined only for nucleotide sequences.
- Ambiguous bases are typically ignored.
- Not reported for protein data.
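A minimal sketch of this definition for a single nucleotide sequence follows. The function name `gc_percent` is hypothetical. Two choices here are assumptions rather than part of the definition above: lowercase (soft-masked) bases are counted like uppercase ones, and `None` is returned when a sequence contains no unambiguous bases.

```python
def gc_percent(sequence):
    """Return GC% over unambiguous A/C/G/T bases, or None if there are none."""
    seq = sequence.upper()  # assumption: treat lowercase bases like uppercase
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    counted = gc + at  # ambiguous symbols such as N are not counted
    if counted == 0:
        return None  # GC% is undefined without unambiguous bases
    return 100.0 * gc / counted


print(gc_percent("GCGCATATNN"))  # 50.0
```

The two N symbols in the example are left out of the denominator, so the result is 4 of 8 counted bases rather than 4 of 10.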
GC content varies substantially between organisms and is often relatively consistent within a genome. As a result, GC% is commonly used as a characteristic property of sequences and genomes.
In practice, GC% is frequently examined together with other metrics such as sequence length or coverage. When plotted against these measures, GC% can help reveal contamination, mixed-species datasets, plasmids, or regions with atypical composition compared to the main genome.