Sequence stats
Deterministic FASTA/FASTQ statistics with interactive length distributions: sequence/read counts, N50/L50, N90/L90, and GC% for nucleotide data.
About this tool
Sequence Stats computes fast, reproducible summary statistics for FASTA and FASTQ inputs: sequence/read counts, total length, length distribution metrics (N50/L50, N90/L90), and GC% for nucleotide sequences. For protein sequences, lengths are reported in amino acids (aa) and GC% is omitted. For definitions, interpretation, and examples, see the Sequence statistics reference.
It is intended for computing descriptive statistics on small to moderately sized FASTA and FASTQ inputs, for inspection and exploratory analysis. The tool does not modify input or infer biological meaning. For strict format checks and optional DNA/RNA/protein alphabet validation, use the FASTA/FASTQ Validator tool. For IUPAC nucleotide and protein codes, see Sequence alphabets.
- ✓ No hidden transformations or guessing
- ✓ Input is processed once and not stored
- ✓ You can optionally create shareable result pages; shared pages include only derived statistics and sequence headers, never raw sequence content
- ✓ Shared result pages are temporary and expire automatically after 20 minutes
Details
- N50: the largest length such that sequences of this length or longer contain at least 50% of the total length.
- L50: number of sequences needed to reach 50% of total length when sequences are sorted by length (descending).
- N90 / L90: analogous to N50/L50, but using 90% of total length. Useful for understanding tail fragmentation in length distributions.
- N50/N90 (protein sets): descriptive length-distribution statistics only; they do not reflect assembly contiguity or biological completeness.
- GC% (nucleotide sequences only): computed from A/C/G/T only; ambiguous or non-standard bases are excluded from the calculation.
- Input validation: malformed FASTA/FASTQ fails loudly; no auto-correction.
- See the Sequence statistics reference for extended explanations and examples.
- Quickly confirm a FASTA/FASTQ file is the expected size and read/sequence count.
- Sanity-check length distributions after trimming, filtering, or deduplication.
- Spot unexpected GC shifts that can indicate contamination or wrong reference data.
- Compare simple QC metrics across samples before running heavier pipelines.
- Interactively explore sequence length distributions and assess how length filtering affects N50/N90, total bases, and contiguity before exporting subsets for downstream analysis.
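The N50/L50 and N90/L90 definitions under Details can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own implementation: the function name `nx_lx` and its integer `percent` parameter are hypothetical, and the tie-handling convention (stop as soon as the running total reaches the threshold) is an assumption, since the page does not state which convention the tool uses.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for a list of sequence lengths.

    percent is an integer such as 50 (N50/L50) or 90 (N90/L90).
    Hypothetical helper; the stopping rule is an assumption.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")

    # Sort longest first, as in the L50 definition.
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count

    # Not reachable: the full sum always satisfies the condition above.
    raise AssertionError("threshold was never reached")


lengths = [100, 80, 60, 40, 20]   # total length 300
n50, l50 = nx_lx(lengths, 50)
n90, l90 = nx_lx(lengths, 90)
print(n50, l50, n90, l90)
```

For these five lengths, the two longest sequences (100 + 80 = 180) already cover half of the 300 total, so N50 is 80 and L50 is 2; four sequences are needed to reach 90%, so N90 is 40 and L90 is 4. Re-running the function on a length-filtered list, for example `[n for n in lengths if n >= 50]`, shows how filtering shifts these values.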
Length statistics, GC%, and N50/N90 are commonly used to assess basic properties of sequence collections. They help detect obvious issues early (unexpected read counts, extreme lengths, or unusual composition) before investing time in larger workflows.
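The GC% rule described under Details (count only A/C/G/T, exclude everything else) can be sketched as follows. The function name `gc_percent` is hypothetical, and two behaviours are assumptions rather than documented properties of the tool: lowercase bases are folded to uppercase, and a sequence with no A/C/G/T at all yields `None` instead of a percentage.

```python
def gc_percent(sequence):
    """Return GC% over A/C/G/T only, or None if no such bases are present.

    Hypothetical helper; case folding and the None result are assumptions.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")

    # Ambiguous or non-standard symbols (N, R, Y, gaps, ...) are in neither
    # count, so they drop out of both numerator and denominator.
    counted = gc + at
    if counted == 0:
        return None
    return 100.0 * gc / counted


print(gc_percent("ACGTNN"))   # the two N bases are ignored
print(gc_percent("NNNN"))     # nothing to count
```

Excluding ambiguous bases from the denominator means a read that is mostly N does not drag the reported GC% toward zero, which keeps the value comparable across samples with different amounts of masking.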
FASTA and FASTQ appear throughout routine bioinformatics work: raw sequencing output, trimmed reads, assemblies/contigs, reference sets, and intermediate QC steps. This tool provides a deterministic, browser-based summary suitable for quick checks and exploratory validation.