Sequence alphabets and IUPAC codes
DNA, RNA, and protein sequence alphabets.
DNA alphabet (IUPAC)
DNA sequences normally use the four canonical bases A, C, G, T. The IUPAC standard extends this alphabet with N and additional ambiguity codes to represent uncertainty or mixed populations at a position.
| Code | Meaning |
|---|---|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| N | Any base (A/C/G/T) |
| Ambiguity codes | |
| R | A or G |
| Y | C or T |
| S | G or C |
| W | A or T |
| K | G or T |
| M | A or C |
| B | C or G or T (not A) |
| D | A or G or T (not C) |
| H | A or C or T (not G) |
| V | A or C or G (not T) |
- Case is usually ignored by parsers; lowercase is commonly used to indicate masking.
- Characters such as `-` (gap) usually come from alignments; non-standard placeholders like `?` may appear in some pipelines.
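The table and notes above can be turned into a small lookup, sketched here in Python. This is a minimal illustration, not a reference implementation: the names `IUPAC_DNA`, `unexpected_dna_symbols`, and `matches` are hypothetical, and treating case as insignificant is an assumption that mirrors what most parsers do.

```python
# Each IUPAC DNA code mapped to the canonical bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def unexpected_dna_symbols(seq, allow_gap=False):
    """Return the distinct characters in seq that are not IUPAC DNA codes."""
    allowed = set(IUPAC_DNA)
    if allow_gap:
        allowed.add("-")  # gaps are expected only in aligned sequences
    # Case is ignored here, so lowercase (masked) bases are accepted.
    return sorted(set(seq.upper()) - allowed)


def matches(code, base):
    """True if a canonical base is compatible with an IUPAC code."""
    return base.upper() in IUPAC_DNA[code.upper()]


print(unexpected_dna_symbols("acgtNNRY-?"))                  # ['-', '?']
print(unexpected_dna_symbols("acgtNNRY-?", allow_gap=True))  # ['?']
print(matches("R", "g"))                                     # True
```

Uppercasing before the check discards masking information, so a pipeline that relies on soft-masking would need to keep the original string alongside the result.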
RNA alphabet
RNA sequences follow the same conventions as DNA, except that U (uracil) replaces T. In real datasets, T sometimes still appears due to upstream conversions or mixed conventions.
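Because T can still show up in nominally RNA data, it can help to check which convention a sequence actually uses before processing it. The sketch below is one possible approach; the function names `classify_nucleotide_convention` and `to_rna` and the four return labels are assumptions made for illustration.

```python
def classify_nucleotide_convention(seq):
    """Report whether a nucleotide sequence uses T, U, both, or neither."""
    letters = set(seq.upper())
    has_t, has_u = "T" in letters, "U" in letters
    if has_t and has_u:
        return "mixed"         # both conventions present, likely an upstream issue
    if has_u:
        return "rna"
    if has_t:
        return "dna"
    return "undetermined"      # e.g. only A, C, G, N or ambiguity codes


def to_rna(seq):
    """Replace T with U, preserving the case of each character."""
    return seq.replace("T", "U").replace("t", "u")


print(classify_nucleotide_convention("ACGUacgu"))  # rna
print(classify_nucleotide_convention("ACGTU"))     # mixed
print(to_rna("ACGTacgt"))                          # ACGUacgu
```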
A C G U N plus the same IUPAC ambiguity codes (R Y S W K M B D H V).
Protein alphabet (amino acids)
Protein sequences use a one-letter alphabet representing the 20 canonical amino acids. The letter X is widely used to represent an unknown or unspecified residue. Some datasets also include U (selenocysteine) and O (pyrrolysine), which are rare but valid amino acids.
| Code | 3-letter | Amino acid |
|---|---|---|
| A | Ala | Alanine |
| C | Cys | Cysteine |
| D | Asp | Aspartic acid |
| E | Glu | Glutamic acid |
| F | Phe | Phenylalanine |
| G | Gly | Glycine |
| H | His | Histidine |
| I | Ile | Isoleucine |
| K | Lys | Lysine |
| L | Leu | Leucine |
| M | Met | Methionine |
| N | Asn | Asparagine |
| P | Pro | Proline |
| Q | Gln | Glutamine |
| R | Arg | Arginine |
| S | Ser | Serine |
| T | Thr | Threonine |
| V | Val | Valine |
| W | Trp | Tryptophan |
| Y | Tyr | Tyrosine |
| X | — | Unknown / any amino acid |
- `*` (stop) and `-` (gap) commonly appear in translated or aligned protein sequences.
- Rare but valid encoded amino acids: `U` = selenocysteine (Sec), `O` = pyrrolysine (Pyl).
- Common ambiguity letters seen in some databases: `B` = Asx (Asp `D` or Asn `N`), `Z` = Glx (Glu `E` or Gln `Q`), `J` = Leu `L` or Ile `I`. Support varies by tool and database.
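Since support for the extended letters varies, a tool may want to know which category each residue falls into rather than simply accepting or rejecting a sequence. The following is a rough sketch; the category names and the function `summarize_protein` are hypothetical, and grouping `X` with the ambiguity letters is a choice made here, not a standard.

```python
from collections import Counter

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
RARE = set("UO")                         # selenocysteine, pyrrolysine
AMBIGUOUS = set("BZJX")                  # Asx, Glx, Leu/Ile, unknown
SPECIAL = set("*-")                      # stop and gap


def summarize_protein(seq):
    """Count the characters of seq by alphabet category (case-insensitive)."""
    summary = Counter()
    for ch in seq.upper():
        if ch in CANONICAL:
            summary["canonical"] += 1
        elif ch in RARE:
            summary["rare"] += 1
        elif ch in AMBIGUOUS:
            summary["ambiguous"] += 1
        elif ch in SPECIAL:
            summary["special"] += 1
        else:
            summary["unexpected"] += 1
    return dict(summary)


print(summarize_protein("MKUBX*-1"))
# {'canonical': 2, 'rare': 1, 'ambiguous': 2, 'special': 2, 'unexpected': 1}
```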
Practical implications for FASTA and FASTQ
Important distinction: FASTA and FASTQ define record structure, not biological meaning. A file can be structurally valid while containing letters that are unexpected for a given biological context.
- Format validity: whether a parser can read the file.
- Alphabet expectations: depend on biological context, i.e. which letters make sense for DNA, RNA, or protein.
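The two checks can be kept separate in code, as in this minimal sketch. The reader below is a deliberately simplified FASTA parser written for illustration only (real files have more edge cases), and the names `read_fasta`, `off_alphabet`, and `DNA` are assumptions rather than part of any library.

```python
import io

DNA = set("ACGTNRYSWKMBDHV")  # canonical bases plus IUPAC ambiguity codes


def read_fasta(handle):
    """Yield (header, sequence) pairs; raise ValueError on structural problems."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            # Format validity: sequence data must follow a header line.
            raise ValueError("sequence data before first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def off_alphabet(records, alphabet):
    """Map header -> sorted unexpected letters, for records that have any."""
    report = {}
    for header, seq in records:
        extra = sorted(set(seq.upper()) - alphabet)
        if extra:
            report[header] = extra
    return report


text = ">read1\nACGTN\n>read2\nMKLVE\n"
records = list(read_fasta(io.StringIO(text)))  # parses without error
print(off_alphabet(records, DNA))              # {'read2': ['E', 'L']}
```

Here the input is structurally valid FASTA, yet `read2` contains letters that make no sense for DNA. Note that `M`, `K`, and `V` pass the DNA check because they are also IUPAC ambiguity codes, which is why alphabet checks alone cannot reliably tell protein from nucleotide data.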