Genome assembly statistics


Genome assembly statistics summarize the records in an assembly file and help describe contiguity, total size, unresolved bases, and nucleotide composition. In a FASTA assembly, each record is one sequence at the assembly's level of resolution: a contig, a scaffold, or a chromosome-level sequence. These metrics are useful for inspection and reporting, but they are descriptive; they do not prove that an assembly is correct or complete.

What assembly statistics measure

Assembly statistics describe the sequence records in an assembly (contigs, scaffolds, or chromosome-level records, depending on the assembly level). They summarize how many records are present, how long they are, how much total sequence they contain, and how nucleotide composition varies.

Metric | Meaning | How to interpret it
Record count | Number of FASTA entries. | More records often means greater fragmentation, but expected values depend on assembly type.
Total span | Sum of all record lengths. | Compare with expected genome size, if known; large deviations may indicate missing sequence, contamination, or redundancy.
Longest / shortest | Extremes of the record-length distribution. | Useful for spotting dominant scaffolds, tiny fragments, or unexpected record-size ranges.
Mean / median | Average and middle record length. | A large gap between mean and median indicates a skewed distribution.
N content | Number or percentage of ambiguous N bases. | High N content can indicate gaps, unresolved bases, or scaffold joins.
GC% | Fraction of G and C among counted A/C/G/T bases. | Unexpected shifts or multiple composition groups can suggest mixed inputs or unusual records.
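
The length-based metrics in the table can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names `read_fasta_lengths` and `summarize_lengths` and the tiny in-memory FASTA are assumptions made for the example, and the parser handles only plain, uncompressed FASTA text.

```python
import io
import statistics


def read_fasta_lengths(handle):
    """Yield (record_id, sequence_length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Record ID is the first whitespace-delimited token of the header.
            name = (line[1:].split() or [""])[0]
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def summarize_lengths(lengths):
    """Return record count, total span, extremes, mean, and median."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no records found")
    return {
        "record_count": len(ordered),
        "total_span": sum(ordered),
        "longest": ordered[0],
        "shortest": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
    }


# Hypothetical three-record assembly used only for illustration.
example = io.StringIO(">c1\nACGTACGT\nACGT\n>c2\nACGTAC\n>c3\nAC\n")
summary = summarize_lengths(n for _, n in read_fasta_lengths(example))
print(summary)
```

For a real file, the same functions could be applied to an opened text handle in place of the `io.StringIO` object.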

N50, L50, N90, and assembly contiguity

N50 is the largest record length L such that records of length L or longer together contain at least 50% of the total assembly span. L50 is the smallest number of records, taken from longest to shortest, needed to reach that same 50% threshold. N90 and L90 apply the same definitions at 90%.

Worked example:
Sorted lengths: 5000, 3000, 1200, 800, 500; total span = 10,500 bp.
Half the span is 5,250 bp. The first record alone (5000) falls short of that threshold; the cumulative sum crosses it after the first two records: 5000 + 3000 = 8000.
N50 = 3000 bp and L50 = 2.
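
The worked example can be reproduced with a short function. This is a sketch under stated assumptions: the name `nx_lx` and its `percent` parameter are illustrative, and the threshold comparison uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for the given percentage of total span.

    Nx is the length of the record at which the running total, taken from
    longest to shortest, first reaches `percent` of the span. Lx is the
    number of records counted up to and including that record.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("assembly has no sequence")
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison equivalent to running >= total * percent / 100.
        if running * 100 >= total * percent:
            return length, count


lengths = [5000, 3000, 1200, 800, 500]
print(nx_lx(lengths, 50))  # (3000, 2)
print(nx_lx(lengths, 90))  # (800, 4)
```

With the same five lengths, the 90% threshold is 9,450 bp, which is first reached at the fourth record (cumulative 10,000 bp), so N90 = 800 bp and L90 = 4.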

In comparable assemblies of the same organism or expected genome size, a higher N50 and lower L50 generally indicate greater contiguity: fewer, longer pieces cover a large fraction of the assembly. N90 and L90 are often more sensitive to the shorter-record tail because reaching 90% of the span requires counting further down the sorted list, into the shorter records. For a visual step-by-step schematic of the N50 calculation, see the N50 and L50 visual schematic in the general sequence statistics reference.

Important: Higher N50 does not automatically mean a better assembly. N50 can increase if records are incorrectly joined, if repeats are collapsed, or if contaminant sequence is included. Interpret N50 with total span, record count, N content, GC composition, read support, completeness checks, and the biological question.

GC%, N content, and ambiguous bases

GC% is usually computed as 100 × (G + C) / (A + C + G + T), where each letter stands for the count of that base. Ambiguous symbols such as N are commonly excluded from the denominator and reported separately. This distinction matters: a scaffold with many N bases can have an ordinary GC% among resolved bases while still containing many unresolved positions.

  • N count and N% summarize unresolved bases or scaffold gaps.
  • Longest N run can highlight long gap blocks inside scaffolds.
  • Ambiguous count covers IUPAC ambiguity codes beyond A/C/G/T, such as R, Y, S, or W.
  • GC outliers may indicate plasmids, low-complexity regions, contamination, or mixed assemblies.

GC composition alone is not diagnostic. It becomes more useful when combined with record length, coverage, taxonomy, gene content, or read mapping evidence.
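
The composition metrics above can be sketched for a single sequence as follows. The function name `composition` and the output keys are assumptions for illustration. The sketch excludes everything except A/C/G/T from the GC% denominator and reports N separately from other symbols; a given tool may group these differently.

```python
import re
from collections import Counter


def composition(sequence):
    """Return GC%, N statistics, and a count of other non-ACGT symbols."""
    seq = sequence.upper()
    counts = Counter(seq)
    acgt = sum(counts[base] for base in "ACGT")
    gc = counts["G"] + counts["C"]
    n_count = counts["N"]
    # Longest uninterrupted run of N, or 0 if the sequence has none.
    longest_n_run = max(
        (len(match.group()) for match in re.finditer(r"N+", seq)), default=0
    )
    return {
        "length": len(seq),
        # Denominator is resolved bases only; None if there are none.
        "gc_percent": 100 * gc / acgt if acgt else None,
        "n_count": n_count,
        "n_percent": 100 * n_count / len(seq) if seq else None,
        "longest_n_run": longest_n_run,
        # Assumption: anything that is not A/C/G/T/N is counted here,
        # which includes IUPAC codes such as R or Y.
        "other_ambiguous": len(seq) - acgt - n_count,
    }


# Hypothetical 16-base scaffold fragment with a gap block.
stats = composition("GGCCNNNNNNATATRN")
print(stats)
```

In this example, GC% among resolved bases is 50.0 even though 7 of the 16 positions are N, which illustrates why N content is worth reporting alongside GC%.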

Useful plots for assembly summaries

Assembly plots make length distributions easier to interpret than a single table. They are especially useful when an assembly has many fragments or when two assemblies have similar N50 values but different tails.

Plot | What it shows | Why it helps
Cumulative span | Records sorted longest to shortest; y-axis is cumulative bases. | Shows how quickly assembly span accumulates and where N50/N90 thresholds fall.
Nx curve | Nx value from 0% to 100% of total span. | More informative than reporting only N50 because it shows the whole contiguity profile.
Length distribution | Counts of records across length bins. | Reveals whether the assembly is dominated by many small fragments or a few large records.
GC% by record | Per-record GC% across records ordered by length. | Can reveal composition outliers, especially when combined with length or coverage information.
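
The data behind the first three plots can be derived from the record lengths alone. The sketch below computes the points only and leaves drawing to whichever plotting library is in use. The function names, the 10% step for the Nx curve, and the bin edges are assumptions chosen for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def cumulative_span(lengths):
    """Return (record_length, cumulative_bases) pairs, longest record first."""
    ordered = sorted(lengths, reverse=True)
    return list(zip(ordered, accumulate(ordered)))


def nx_curve(lengths, percents=range(10, 101, 10)):
    """Return {percent: Nx} for each requested percentage of total span."""
    points = cumulative_span(lengths)
    total = points[-1][1] if points else 0
    curve = {}
    for percent in percents:
        for length, running in points:
            if running * 100 >= total * percent:
                curve[percent] = length
                break
    return curve


def length_histogram(lengths, edges=(1000, 2000, 5000)):
    """Count records per bin: below edges[0], between edges, and at or above edges[-1]."""
    counts = [0] * (len(edges) + 1)
    for n in lengths:
        counts[bisect_right(edges, n)] += 1
    return counts


lengths = [5000, 3000, 1200, 800, 500]
print(cumulative_span(lengths))
print(nx_curve(lengths))
print(length_histogram(lengths))  # bins: <1000, 1000-1999, 2000-4999, >=5000
```

Two assemblies with the same N50 can produce different `nx_curve` outputs at 80% or 90%, which is the difference in tails that a single N50 value hides.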

Interpretation limits

Assembly statistics are best treated as first-pass structural descriptors. They can identify obvious issues and help document an assembly, but they cannot answer all quality questions.

  • Completeness: use gene-content approaches such as BUSCO or lineage-specific marker checks.
  • Correctness: use alignments, read mapping, structural validation, or reference-based tools when appropriate.
  • Contamination: combine GC, coverage, taxonomy, and gene evidence; GC alone is insufficient.
  • Comparability: compare assemblies produced from similar inputs, filtering, and expected genome sizes.

A compact bacterial assembly and a large eukaryotic genome assembly can have very different expected ranges. The same N50 value can mean different things in different biological and technical contexts.

Reporting checklist

When reporting assembly statistics, include enough context for another person to understand what was counted and how the metrics were produced.

  • Whether records are contigs, scaffolds, chromosome-level records, or another unit.
  • Any filtering threshold, such as excluding records below 500 bp or 1 kb.
  • Total span, record count, longest record, N50/L50, and N90/L90.
  • N content and whether ambiguous bases were excluded from the GC% denominator.
  • Expected genome size or reference context, if known.
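
One way to keep this context attached to the numbers is to build the report as a single record that stores the unit and the filter next to the metrics. The sketch below is illustrative only: `assembly_report`, its parameters, and the output keys are assumed names, not the format of any existing tool.

```python
def assembly_report(lengths, unit, min_length=0):
    """Summarize record lengths, recording the unit and length filter used."""
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no records pass the length filter")
    total = sum(kept)

    def nx_lx(percent):
        # Walk records longest first until `percent` of the span is covered.
        running = 0
        for count, length in enumerate(kept, start=1):
            running += length
            if running * 100 >= total * percent:
                return length, count

    n50, l50 = nx_lx(50)
    n90, l90 = nx_lx(90)
    return {
        "unit": unit,
        "min_length_filter": min_length,
        "records_kept": len(kept),
        "records_dropped": len(lengths) - len(kept),
        "total_span": total,
        "longest": kept[0],
        "N50": n50, "L50": l50,
        "N90": n90, "L90": l90,
    }


lengths = [5000, 3000, 1200, 800, 500]
unfiltered = assembly_report(lengths, unit="contig")
filtered = assembly_report(lengths, unit="contig", min_length=1000)
print(unfiltered["N50"], filtered["N50"])  # 3000 5000
```

Dropping the two records below 1 kb raises N50 from 3000 to 5000 bp on the same data, which is why the filtering threshold belongs in the report.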