Sequence statistics
Overview
Sequence statistics describe the size and distribution of sequences or reads in a dataset. They are commonly used for quick sanity checks, dataset comparison, and to surface unexpected properties before downstream analyses.
- Descriptive: they summarize how long sequences are and how those lengths are distributed.
- Format-agnostic: they apply to both FASTA and FASTQ inputs.
- Not validation: they do not check file structure or alphabet correctness.
- Not biological interpretation: they do not, on their own, say anything about correctness, completeness, or functional relevance.
Counts and basic length metrics
- Sequence / read count: number of records in the file.
- Total length: sum of all sequence lengths (bp or aa).
- Minimum / maximum length.
- Mean length: arithmetic average.
- Median length: middle value when lengths are sorted.
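The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the sequence lengths have already been extracted from the FASTA or FASTQ file into a list of integers, and the helper name `length_stats` is hypothetical.

```python
from statistics import mean, median


def length_stats(lengths):
    """Summarize a list of sequence lengths (hypothetical helper).

    `lengths` is assumed to be a non-empty list of integers,
    one entry per record in the file.
    """
    if not lengths:
        raise ValueError("no sequences to summarize")
    return {
        "count": len(lengths),      # sequence / read count
        "total": sum(lengths),      # total length (bp or aa)
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),      # arithmetic average
        "median": median(lengths),  # middle value of the sorted lengths
    }


stats = length_stats([1200, 1000, 800, 600, 400, 300, 200, 100])
for name, value in stats.items():
    print(f"{name}: {value}")
```

With an even number of sequences, `statistics.median` returns the average of the two middle lengths.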
N50 and L50
N50 is the largest sequence length such that sequences of that length or longer together contain at least 50% of the total dataset length. L50 is the smallest number of sequences needed to reach that 50% threshold.
In other words, sort sequences by length (largest to smallest), then sum lengths until you reach at least 50% of the total. L50 is how many sequences you needed, and N50 is the length of the shortest sequence among them.
Sequence lengths (sorted, largest → smallest): 1200, 1000, 800, 600, 400, 300, 200, 100 (total = 4600).
Cumulative lengths: 1200 → 2200 → 3000 → 3600 → 4000 → 4300 → 4500 → 4600.
The 50% threshold is 2300. It is crossed when the third sequence (800) is added, bringing the cumulative length to 3000.
N50 = 800, L50 = 3.
Sequences sorted by length (largest → smallest)
Seq1 ████████████████████████ 1200 bp
Seq2 ████████████████████ 1000 bp
Seq3 ████████████████ 800 bp ← N50 (crosses 50%)
Seq4 ████████████ 600 bp
Seq5 ████████ 400 bp
Seq6 ██████ 300 bp
Seq7 ████ 200 bp
Seq8 ██ 100 bp
-----------------------------------------
Total length = 4600 bp
50% threshold = 2300 bp
Cumulative length calculation:
Seq1 = 1200
+ Seq2 = 2200
+ Seq3 = 3000 ← crosses 50% threshold
+ Seq4 = 3600
+ Seq5 = 4000
+ Seq6 = 4300
+ Seq7 = 4500
+ Seq8 = 4600
⇒ N50 = 800 bp (i.e. the length of Seq3)
⇒ L50 = 3 (number of sequences needed)
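The procedure walked through above (sort, accumulate, stop at the 50% threshold) can be sketched as follows. The function name `n50_l50` is hypothetical, and the input is assumed to be a non-empty list of integer lengths.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("no sequences")
    ordered = sorted(lengths, reverse=True)  # largest to smallest
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if 2 * running >= total:
            # `length` is the shortest sequence needed so far (N50);
            # `count` is how many sequences were needed (L50).
            return length, count


n50, l50 = n50_l50([300, 1200, 100, 800, 600, 1000, 200, 400])
print(n50, l50)
```

The input order does not matter because the function sorts the lengths itself.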
Using N50 and L50 in genome assemblies
In genome assembly workflows, N50 and L50 are commonly reported as rough indicators of contiguity, that is, how fragmented or continuous an assembly is. Assemblies with longer contiguous sequences tend to have higher N50 values and lower L50 values.
This makes N50/L50 useful for quick comparisons between assemblies of the same organism, for example assemblies built from similar input data or with different parameter choices. They help answer the narrow question: “Are sequences generally longer or more fragmented?”
An assembly can have a high N50 while still being biologically poor, for example, if long sequences are incorrectly joined, are missing genes, or represent collapsed repeats. Conversely, a fragmented assembly may still contain most biologically relevant content.
For biological completeness, additional evidence is needed beyond length metrics, such as whether expected genes, conserved regions, or functional elements are present. N50/L50 should therefore be interpreted as structural descriptors, not as comprehensive quality measures.
N90 and L90
N90 and L90 are defined analogously to N50/L50, using 90% of the total sequence length.
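Because only the threshold changes, the same calculation can be parameterized by a percentage. The sketch below assumes a hypothetical helper `nx_lx` that takes integer lengths and a whole-number percentage, and it reuses the example lengths from the N50 section.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for the given percentage, e.g. percent=90 for N90/L90.

    Hypothetical helper: `lengths` is a non-empty list of integers and
    `percent` is an integer from 1 to 100.
    """
    if not lengths:
        raise ValueError("no sequences")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(lengths, reverse=True)  # largest to smallest
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer form of: running / total >= percent / 100
        if running * 100 >= total * percent:
            return length, count


lengths = [1200, 1000, 800, 600, 400, 300, 200, 100]
n90, l90 = nx_lx(lengths, 90)
print(n90, l90)
```

For the example dataset, 90% of 4600 is 4140, which is first reached at a cumulative length of 4300 (the sixth sequence), so N90 = 300 and L90 = 6.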
GC%
GC% is the percentage of guanine (G) and cytosine (C) bases among the unambiguous A, C, G, and T bases of a sequence. By convention, ambiguous symbols such as N are excluded from both the numerator and the denominator.
- Defined only for nucleotide sequences.
- Ambiguous bases are typically ignored.
- Not reported for protein data.
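As a rough illustration of the definition above, the following sketch counts only unambiguous A, C, G, and T. The function name `gc_percent` is hypothetical; treating input case-insensitively and returning `None` when a sequence has no unambiguous bases are assumptions made for this example, not conventions stated by any particular tool.

```python
def gc_percent(seq):
    """Return GC% over unambiguous A/C/G/T bases, or None if there are none.

    Hypothetical helper. Any other symbol (N, other IUPAC codes, gaps,
    and U in RNA) is ignored in both numerator and denominator.
    """
    seq = seq.upper()  # assumption: soft-masked lowercase bases are counted
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    unambiguous = gc + at
    if unambiguous == 0:
        return None  # GC% is undefined without any unambiguous bases
    return 100.0 * gc / unambiguous


print(gc_percent("GGCCAATTNN"))
print(gc_percent("NNNN"))
```

In the first call the two N symbols are skipped, so the result is computed over eight bases rather than ten.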
GC content varies substantially between organisms and is often relatively consistent within a genome. As a result, GC% is commonly used as a characteristic property of sequences and genomes.
In practice, GC% is frequently examined together with other metrics such as sequence length or coverage. When plotted against these measures, GC% can help reveal contamination, mixed-species datasets, plasmids, or regions with atypical composition compared to the main genome.