Genome assembly statistics
Genome assembly statistics summarize the records in an assembly file and help describe contiguity, total size, unresolved bases, and nucleotide composition. In FASTA assemblies, each record represents the current assembly level, such as a contig, scaffold, or chromosome-level record. These metrics are useful for inspection and reporting, but they are descriptive rather than proof of correctness or completeness.
What assembly statistics measure
Assembly statistics describe the sequence records in an assembly (contigs, scaffolds, or chromosome-level records, depending on the assembly level). They summarize how many records are present, how long they are, how much total sequence they contain, and how nucleotide composition varies.
| Metric | Meaning | How to interpret it |
|---|---|---|
| Record count | Number of FASTA entries. | More records often means greater fragmentation, but expected values depend on assembly type. |
| Total span | Sum of all record lengths. | Compare with expected genome size, if known; large deviations may indicate missing sequence, contamination, or redundancy. |
| Longest / shortest | Extremes of the record-length distribution. | Useful for spotting dominant scaffolds, tiny fragments, or unexpected record-size ranges. |
| Mean / median | Average and middle record length. | A large gap between mean and median indicates a skewed distribution. |
| N content | Number or percentage of ambiguous N bases. | High N content can indicate gaps, unresolved bases, or scaffold joins. |
| GC% | Fraction of G and C among counted A/C/G/T bases. | Unexpected shifts or multiple composition groups can suggest mixed inputs or unusual records. |
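The length-based rows of the table can be computed from record lengths alone. The following is a minimal Python sketch, assuming a plain, uncompressed FASTA input whose headers start with `>`; the function names `read_fasta_lengths` and `summarize_lengths` and the three-record example are hypothetical and only serve as illustration.

```python
from statistics import mean, median


def read_fasta_lengths(lines):
    """Return one sequence length per FASTA record, in file order."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize_lengths(lengths):
    """Record-length summary for a non-empty list of lengths."""
    return {
        "record_count": len(lengths),
        "total_span": sum(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


# Hypothetical three-record input used only for illustration.
example = """>rec1
ACGTACGT
ACGT
>rec2
GGCC
>rec3
ATATAT
""".splitlines()

summary = summarize_lengths(read_fasta_lengths(example))
print(summary)
```

For real assemblies, an established FASTA parser is usually a safer choice than hand-rolled parsing; the sketch only shows what each metric counts.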
N50, L50, N90, and assembly contiguity
N50 is the record length at which records of that length or longer contain at least 50% of the total assembly span. L50 is the number of longest records needed to reach that same 50% threshold. N90 and L90 use the same idea at 90%.
Sorted lengths: 5000, 3000, 1200, 800, 500; total span = 10,500 bp.
Half the span is 5,250 bp. The cumulative sum crosses that threshold after the first two records: 5000 + 3000 = 8000.
N50 = 3000 bp and L50 = 2.
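The worked example above can be reproduced with a short Python sketch. The function name `nx_lx` and its integer `percent` parameter are assumptions made for illustration; the threshold test uses integer arithmetic so that an exact 50% or 90% boundary is not affected by floating-point rounding.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for an integer percent of total span, e.g. 50 for N50/L50."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ordered = sorted(lengths, reverse=True)  # longest record first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Same test as running / total >= percent / 100, without division.
        if running * 100 >= percent * total:
            return length, count


lengths = [5000, 3000, 1200, 800, 500]
n50, l50 = nx_lx(lengths, 50)
n90, l90 = nx_lx(lengths, 90)
print(f"N50={n50} L50={l50} N90={n90} L90={l90}")
```

For the same five lengths, the 90% threshold is 9,450 bp, which the cumulative sum first reaches at the fourth record (10,000 bp), giving N90 = 800 bp and L90 = 4.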
In comparable assemblies of the same organism or expected genome size, a higher N50 and lower L50 generally indicate greater contiguity: fewer, longer pieces cover a large fraction of the assembly. N90 and L90 are often more sensitive to the shorter-record tail because reaching 90% of the span requires counting further down the sorted list, into the shorter records. For a visual step-by-step schematic of the N50 calculation, see the N50 and L50 visual schematic in the general sequence statistics reference.
GC%, N content, and ambiguous bases
GC% is usually computed as (G + C) / (A + C + G + T). Ambiguous symbols such as N are commonly excluded from the denominator and reported separately. This distinction matters: a scaffold with many N bases can have an ordinary GC% among resolved bases while still containing many unresolved positions.
- N count and N% summarize unresolved bases or scaffold gaps.
- Longest N run can highlight long gap blocks inside scaffolds.
- Ambiguous count includes non-standard IUPAC nucleotide letters beyond A/C/G/T.
- GC outliers may indicate plasmids, low-complexity regions, contamination, or mixed assemblies.
GC composition alone is not diagnostic. It becomes more useful when combined with record length, coverage, taxonomy, gene content, or read mapping evidence.
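The composition metrics in this section can be sketched for a single record as follows. This is an illustrative Python sketch, not a reference implementation: the function name `composition_stats` is hypothetical, the input is assumed to contain only IUPAC nucleotide letters, and the ambiguous count is assumed to cover every symbol other than A/C/G/T, including N. Tools differ on that last point, so check how a given tool defines it.

```python
import re


def composition_stats(sequence):
    """GC%, N content, and ambiguous-base counts for one sequence record."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    resolved = sum(counts.values())           # A + C + G + T
    n_count = seq.count("N")
    n_runs = re.findall(r"N+", seq)           # maximal runs of consecutive N
    return {
        "length": len(seq),
        # GC% over resolved bases only; ambiguous symbols are excluded.
        "gc_percent": 100 * (counts["G"] + counts["C"]) / resolved if resolved else None,
        "n_count": n_count,
        "n_percent": 100 * n_count / len(seq) if seq else None,
        "longest_n_run": max((len(run) for run in n_runs), default=0),
        # Assumption: everything that is not A/C/G/T counts as ambiguous.
        "ambiguous_count": len(seq) - resolved,
    }


# Hypothetical scaffold fragment with an N gap and two IUPAC codes (R, Y).
stats = composition_stats("ACGTNNNNGGCCRYNN")
print(stats)
```

In this example GC% among resolved bases is 75% even though 6 of the 16 positions are N, which is why N content is reported separately.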
Useful plots for assembly summaries
Assembly plots make length distributions easier to interpret than a single table. They are especially useful when an assembly has many fragments or when two assemblies have similar N50 values but different tails.
| Plot | What it shows | Why it helps |
|---|---|---|
| Cumulative span | Records sorted longest to shortest; y-axis is cumulative bases. | Shows how quickly assembly span accumulates and where N50/N90 thresholds fall. |
| Nx curve | Nx value from 0% to 100% of total span. | More informative than reporting only N50 because it shows the whole contiguity profile. |
| Length distribution | Counts of records across length bins. | Reveals whether the assembly is dominated by many small fragments or a few large records. |
| GC% by record | Per-record GC% across records ordered by length. | Can reveal composition outliers, especially when combined with length or coverage information. |
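The data behind the cumulative span plot and the Nx curve can be derived from the sorted record lengths. The sketch below only computes the points and leaves plotting to whatever library is preferred; the function names `cumulative_span` and `nx_curve` and the 10% step size are assumptions made for illustration.

```python
from itertools import accumulate


def cumulative_span(lengths):
    """Return (lengths sorted longest first, running total of bases)."""
    ordered = sorted(lengths, reverse=True)
    return ordered, list(accumulate(ordered))


def nx_curve(lengths, percents=range(10, 101, 10)):
    """Return (percent, Nx) pairs for a non-empty list of record lengths."""
    ordered, cumulative = cumulative_span(lengths)
    total = cumulative[-1]
    curve = []
    for percent in percents:
        for length, running in zip(ordered, cumulative):
            # First record at which the running total reaches the threshold.
            if running * 100 >= percent * total:
                curve.append((percent, length))
                break
    return curve


lengths = [5000, 3000, 1200, 800, 500]
ordered, cumulative = cumulative_span(lengths)
curve = nx_curve(lengths)
print(cumulative)
print(curve)
```

Plotting `cumulative` against record rank gives the cumulative span plot, and plotting the `curve` pairs gives a stepwise Nx curve in which the 50% point is the N50.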
Interpretation limits
Assembly statistics are best treated as first-pass structural descriptors. They can identify obvious issues and help document an assembly, but they cannot answer all quality questions.
- Completeness: use gene-content approaches such as BUSCO or lineage-specific marker checks.
- Correctness: use alignments, read mapping, structural validation, or reference-based tools when appropriate.
- Contamination: combine GC, coverage, taxonomy, and gene evidence; GC alone is insufficient.
- Comparability: compare assemblies produced from similar inputs, filtering, and expected genome sizes.
A compact bacterial assembly and a large eukaryotic genome assembly can have very different expected ranges. The same N50 value can mean different things in different biological and technical contexts.
Reporting checklist
When reporting assembly statistics, include enough context for another person to understand what was counted and how the metrics were produced.
- Whether records are contigs, scaffolds, chromosome-level records, or another unit.
- Any filtering threshold, such as excluding records below 500 bp or 1 kb.
- Total span, record count, longest record, N50/L50, and N90/L90.
- N content and whether ambiguous bases were excluded from the GC% denominator.
- Expected genome size or reference context, if known.
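A filtering threshold can change the reported values substantially, which is why the checklist asks for it. The following Python sketch is one possible way to keep the record unit and filter alongside the metrics; the function names `n50_l50` and `build_report`, the report keys, and the 1 kb cutoff are hypothetical choices, not a standard format.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a non-empty list of record lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:  # reached at least half of the span
            return length, count
    raise ValueError("lengths must not be empty")


def build_report(lengths, record_unit, min_length=0):
    """Summarize records at or above min_length and record how they were filtered."""
    kept = [length for length in lengths if length >= min_length]
    n50, l50 = n50_l50(kept)
    return {
        "record_unit": record_unit,      # e.g. "contig" or "scaffold"
        "min_length_filter": min_length,
        "record_count": len(kept),
        "total_span": sum(kept),
        "longest": max(kept),
        "N50": n50,
        "L50": l50,
    }


lengths = [5000, 3000, 1200, 800, 500]
unfiltered = build_report(lengths, "contig")
filtered = build_report(lengths, "contig", min_length=1000)
print(unfiltered)
print(filtered)
```

With these five lengths, dropping records below 1 kb reduces the span from 10,500 bp to 9,200 bp and moves N50 from 3000 bp to 5000 bp, so two reports are only comparable when the filter is stated.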