FASTA/FASTQ format validator
Strictly validate FASTA and FASTQ file structure (headers, record layout, and FASTQ length matching). Pinpoints formatting errors by record, file line, and column. Optional alphabet checks (DNA, RNA, protein) run after format validation and emit warnings only.
About this tool
Checks whether an input file is structurally valid FASTA or strict FASTQ (4-line records). Validation checks structural correctness (record boundaries, headers, sequence lines, quality lines for FASTQ) and optionally alphabet compliance; it does not attempt to repair malformed files.
After successful format validation, optional alphabet checks can be applied for DNA, RNA, or protein inputs. Alphabet checks generate warnings only and never affect format validity. For FASTQ inputs, the validator verifies that the sequence and quality strings in each record have exactly the same length. Quality strings are also checked to ensure they contain only printable ASCII characters (range 33–126); non-ASCII characters in a quality string generate warnings only.
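The per-record FASTQ checks described above can be illustrated with a minimal Python sketch. It is not the tool's own implementation: the function name check_fastq_record is hypothetical, and the sketch assumes that ASCII characters outside 33–126 (such as a space) are treated as errors, while non-ASCII characters only produce warnings, as stated above.

```python
from typing import List, Optional, Tuple


def check_fastq_record(seq: str, qual: str) -> Tuple[Optional[str], List[str]]:
    """Return (error, warnings) for one record's sequence/quality pair.

    Hypothetical helper for illustration only.
    """
    warnings: List[str] = []

    # Sequence and quality strings must have exactly the same length.
    if len(seq) != len(qual):
        return (
            f"length mismatch: sequence has {len(seq)} characters, "
            f"quality has {len(qual)}",
            warnings,
        )

    for col, ch in enumerate(qual, start=1):
        code = ord(ch)
        if code > 127:
            # Non-ASCII characters are reported as warnings only.
            warnings.append(f"non-ASCII character at column {col}")
        elif not 33 <= code <= 126:
            # Assumption: non-printable ASCII is a hard error in this sketch.
            return (f"non-printable quality character at column {col}", warnings)

    return (None, warnings)
```

Calling check_fastq_record("ACGT", "IIII") returns no error and no warnings, while a three-character quality string for the same sequence returns a length-mismatch error.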
The validator does not assess biological correctness, annotations, or sequence plausibility. Format rules and examples are documented in the FASTA/FASTQ formats reference.
- ✓ Deterministic validation
- ✓ No hidden transformations or auto-fixes
- ✓ Input processed transiently and not stored
- ✓ First-error diagnostics with precise location
Details
- Scope: “Valid” means format-valid (parseable), not biologically valid.
- FASTA: records start with >; sequences can span multiple lines.
- FASTQ (strict): 4-line records; sequence and quality lengths must match.
- Fail loudly: reports the first structural error; no auto-correction.
- Alphabet checks: optional post-validation checks for DNA, RNA, or protein. Warnings only; format validity is never affected.
- FASTQ quality encoding: quality lines are expected to contain printable ASCII characters (33–126). Non-ASCII characters (often introduced by copy/paste) are reported as warnings and do not invalidate the file.
- Line wrapping expectations: FASTQ inputs are expected to contain single-line sequence and quality records, as defined by the FASTQ specification. FASTA inputs may be wrapped or unwrapped; however, extremely long single-line FASTA sequences (≫1 Mb per line) are intentionally rejected to ensure safe and predictable processing.
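The warnings-only alphabet check from the list above could look roughly like the following Python sketch. The alphabets shown (IUPAC-style codes plus a gap character) and the names ALPHABETS and alphabet_warnings are assumptions for illustration; this page does not state which character sets the tool actually accepts.

```python
from typing import List

# Assumed alphabets for illustration; the tool's real sets may differ.
ALPHABETS = {
    "dna": set("ACGTRYSWKMBDHVN-"),
    "rna": set("ACGURYSWKMBDHVN-"),
    "protein": set("ACDEFGHIKLMNPQRSTVWYBXZJUO*-"),
}


def alphabet_warnings(seq: str, kind: str, record: int) -> List[str]:
    """Collect warnings for characters outside the chosen alphabet.

    This never raises and never returns an error: alphabet problems
    are warnings only and do not change format validity.
    """
    allowed = ALPHABETS[kind]
    warnings: List[str] = []
    for col, ch in enumerate(seq, start=1):
        # Comparison is case-insensitive in this sketch (an assumption).
        if ch.upper() not in allowed:
            warnings.append(
                f"record {record}, column {col}: {ch!r} is not in the {kind} alphabet"
            )
    return warnings
```

Because the function only returns a list of warnings, a caller would run it after structural validation has already passed and would leave the format verdict unchanged.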
- Quick sanity check before running expensive pipelines or uploads.
- Debug “malformed FASTQ” errors from downstream tools.
- Confirm a file is truly FASTA vs FASTQ after conversions.
- Spot missing + lines, length mismatches, or indented headers.
FASTA/FASTQ are simple, but tiny formatting issues can break downstream tools. This validator is strict and deterministic so you can catch problems early.
SciDataUtils does not “fix” inputs automatically. Instead, you get a clear first error with a record index and line reference.
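To make the first-error behavior concrete, here is a minimal Python sketch of a FASTA structure check that stops at the first problem and reports the record index, file line, and column. It is not SciDataUtils code. The names FormatError and first_fasta_error, the specific rules (no blank lines, no empty headers, at least one sequence line per record), and the 1,000,000-character default line limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormatError:
    record: int   # 1-based record index (0 if before any record)
    line: int     # 1-based file line
    column: int   # 1-based column
    message: str


def first_fasta_error(text: str, max_line_len: int = 1_000_000) -> Optional[FormatError]:
    """Return the first structural error found, or None if none is found."""
    record = 0        # number of headers seen so far
    seq_lines = 0     # sequence lines seen in the current record
    header_line = 0   # file line of the current record's header

    for lineno, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_line_len:
            return FormatError(record, lineno, max_line_len + 1, "line too long")
        if line.startswith(">"):
            if record and seq_lines == 0:
                return FormatError(record, header_line, 1, "record has no sequence")
            record += 1
            header_line = lineno
            seq_lines = 0
            if not line[1:].strip():
                return FormatError(record, lineno, 2, "empty header")
        elif line[:1].isspace() and line.lstrip().startswith(">"):
            return FormatError(record + 1, lineno, 1, "indented header")
        elif record == 0:
            return FormatError(0, lineno, 1, "expected '>' header")
        elif not line.strip():
            return FormatError(record, lineno, 1, "blank line inside record")
        else:
            seq_lines += 1

    if record == 0:
        return FormatError(0, 1, 1, "no records found")
    if seq_lines == 0:
        return FormatError(record, header_line, 1, "record has no sequence")
    return None
```

For example, first_fasta_error(">a\nACGT\n >b\nTT\n") reports an indented header in record 2 at line 3, column 1, and nothing is repaired or skipped along the way.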