Sequence alphabets and IUPAC codes

DNA, RNA, and protein sequence alphabets.

DNA alphabet (IUPAC)

DNA sequences normally use the four canonical bases A, C, G, T. The IUPAC standard extends this alphabet with N and additional ambiguity codes to represent uncertainty or mixed populations at a position.

Code	Meaning
A	Adenine
C	Cytosine
G	Guanine
T	Thymine
N	Any base (A/C/G/T)
Ambiguity codes
R	A or G
Y	C or T
S	G or C
W	A or T
K	G or T
M	A or C
B	C or G or T (not A)
D	A or G or T (not C)
H	A or C or T (not G)
V	A or C or G (not T)
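
As a minimal sketch, the table above can be written as a lookup from each code to the bases it may stand for. The names `IUPAC_DNA`, `expand`, and `compatible` are illustrative choices for this example, not part of any particular library.

```python
# IUPAC nucleotide codes mapped to the bases each one can represent,
# transcribed from the table above.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand(code: str) -> frozenset[str]:
    """Return the set of bases a single IUPAC code can represent."""
    return frozenset(IUPAC_DNA[code.upper()])

def compatible(code_a: str, code_b: str) -> bool:
    """True if the two codes share at least one possible base."""
    return bool(expand(code_a) & expand(code_b))

print(sorted(expand("R")))   # ['A', 'G']
print(compatible("R", "Y"))  # False: purines and pyrimidines do not overlap
print(compatible("N", "t"))  # True: N covers every base, case is ignored
```

Storing each code as a set makes comparisons between ambiguous positions a simple set intersection.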

Notes

Case is usually ignored by parsers; lowercase is commonly used to indicate soft-masking (for example, repeats or low-complexity regions).
Characters such as - (gap) usually come from alignments; non-standard placeholders like ? may appear in some pipelines.
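
The notes above could be handled in code roughly as follows. This is a sketch under two assumptions: lowercase runs are treated as masked regions, and `-` is the only gap character. The helper names `masked_spans` and `normalize` are hypothetical.

```python
import re

def masked_spans(seq: str) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans of consecutive lowercase letters."""
    return [m.span() for m in re.finditer(r"[a-z]+", seq)]

def normalize(seq: str) -> str:
    """Drop '-' gap characters and uppercase the remaining letters."""
    return seq.replace("-", "").upper()

seq = "ACGTacgtNN--ACgt"
print(masked_spans(seq))  # [(4, 8), (14, 16)]
print(normalize(seq))     # ACGTACGTNNACGT
```

Recording the masked spans before normalizing keeps the masking information available even after the sequence has been uppercased.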

RNA alphabet

RNA sequences follow the same conventions as DNA, except that U (uracil) replaces T. In real datasets, T sometimes still appears due to upstream conversions or mixed conventions.

Expected letters: A C G U N plus the same IUPAC ambiguity codes (R Y S W K M B D H V).
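
A short sketch of how stray T characters in RNA data might be detected and converted. It assumes that any T is a DNA-style spelling of U, which may not hold for every dataset; `RNA_LETTERS`, `unexpected_rna_letters`, and `to_rna` are names made up for this example.

```python
# Expected RNA letters from the line above: A C G U N plus the ambiguity codes.
RNA_LETTERS = set("ACGUN" + "RYSWKMBDHV")

def unexpected_rna_letters(seq: str) -> set[str]:
    """Return the letters in seq that are not expected in an RNA sequence."""
    return set(seq.upper()) - RNA_LETTERS

def to_rna(seq: str) -> str:
    """Uppercase the sequence and replace T with U (assumes T encodes U)."""
    return seq.upper().replace("T", "U")

print(unexpected_rna_letters("ACGUTTN"))  # {'T'}
print(to_rna("ACGUTTN"))                  # ACGUUUN
```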

Protein alphabet (amino acids)

Protein sequences use a one-letter alphabet representing the 20 canonical amino acids. The letter X is widely used to represent an unknown or unspecified residue. Some datasets also include U (selenocysteine) and O (pyrrolysine), which are rare but genetically encoded amino acids.

Code	3-letter	Amino acid
A	Ala	Alanine
C	Cys	Cysteine
D	Asp	Aspartic acid
E	Glu	Glutamic acid
F	Phe	Phenylalanine
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
K	Lys	Lysine
L	Leu	Leucine
M	Met	Methionine
N	Asn	Asparagine
P	Pro	Proline
Q	Gln	Glutamine
R	Arg	Arginine
S	Ser	Serine
T	Thr	Threonine
V	Val	Valine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
X	Xaa	Unknown / any amino acid

Notes

* (stop) and - (gap) commonly appear in translated or aligned protein sequences.
Rare but genetically encoded amino acids: U = selenocysteine (Sec), O = pyrrolysine (Pyl).
Ambiguity letters seen in some databases: B = Asx (Asp D or Asn N), Z = Glx (Glu E or Gln Q), J = Xle (Leu L or Ile I). Support varies by tool and database.
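
One way to apply the table and notes above is to sort each character of a protein sequence into a group. The grouping and the name `classify_residues` are assumptions made for this sketch; a given tool or database may draw the lines differently.

```python
from collections import Counter

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
RARE = set("UO")                         # selenocysteine, pyrrolysine
AMBIGUOUS = set("BZJX")                  # Asx, Glx, Leu/Ile, unknown
SYMBOLS = set("*-")                      # stop and gap

def classify_residues(seq: str) -> Counter:
    """Count how many characters of seq fall into each group."""
    counts: Counter = Counter()
    for ch in seq.upper():
        if ch in CANONICAL:
            counts["canonical"] += 1
        elif ch in RARE:
            counts["rare"] += 1
        elif ch in AMBIGUOUS:
            counts["ambiguous"] += 1
        elif ch in SYMBOLS:
            counts["symbol"] += 1
        else:
            counts["other"] += 1
    return counts

counts = classify_residues("MKUXB*-?")
print(counts["canonical"], counts["rare"], counts["ambiguous"],
      counts["symbol"], counts["other"])  # 2 1 2 2 1
```

Reporting counts per group, rather than a single pass/fail flag, lets the caller decide which groups are acceptable for a given dataset.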

Practical implications for FASTA and FASTQ

Important distinction: FASTA and FASTQ define record structure, not biological meaning. A file can be structurally valid while containing letters that are unexpected for a given biological context.

Format validity: whether a parser can read the file.
Alphabet expectations: depend on biological context, i.e. which letters make sense for DNA, RNA, or protein.
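
The two checks above can be kept separate in code. The following is a deliberately minimal sketch, not a full FASTA parser: `parse_fasta` only tests structure, while `unexpected_letters` compares each record against an expected DNA alphabet. Both function names and the sample records are hypothetical.

```python
DNA_EXPECTED = set("ACGTN" + "RYSWKMBDHV")

def parse_fasta(text: str) -> dict[str, str]:
    """Minimal FASTA reader: '>' starts a header, other lines are sequence."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is None:
            # Structural error: sequence data with no header before it.
            raise ValueError("sequence data before first header")
        else:
            records[name] += line
    return records

def unexpected_letters(seq: str, expected: set[str]) -> set[str]:
    """Return the letters in seq that are outside the expected alphabet."""
    return set(seq.upper()) - expected

text = ">read1\nACGTNNRY\n>read2\nMKVLEF\n"
records = parse_fasta(text)  # both records parse: the format is valid
for name, seq in records.items():
    print(name, sorted(unexpected_letters(seq, DNA_EXPECTED)))
# read1 []
# read2 ['E', 'F', 'L']
```

Note that `read2` is only partly flagged: M, K, and V are also IUPAC nucleotide codes, so an alphabet check alone cannot always tell protein from DNA.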
