Sequence alphabets and IUPAC codes

DNA, RNA, and protein sequence alphabets.

DNA alphabet (IUPAC)

DNA sequences normally use the four canonical bases A, C, G, T. The IUPAC standard extends this alphabet with N and additional ambiguity codes to represent uncertainty or mixed populations at a position.

Code  Meaning
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
N     Any base (A/C/G/T)

Ambiguity codes

Code  Meaning
R     A or G
Y     C or T
S     G or C
W     A or T
K     G or T
M     A or C
B     C or G or T (not A)
D     A or G or T (not C)
H     A or C or T (not G)
V     A or C or G (not T)

Notes
  • Case is usually ignored by parsers; lowercase is commonly used to indicate soft-masking, for example of repeats or low-complexity regions.
  • Characters such as - (gap) usually come from alignments; non-standard placeholders like ? may appear in some pipelines.
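
The table above maps naturally onto a lookup from each code to the set of bases it can stand for. The Python sketch below is illustrative only: the names IUPAC_DNA, unexpected_dna_letters, and codes_compatible are hypothetical and do not come from any particular library. It ignores case, as most parsers do, and treats the gap character as opt-in.

```python
# Hypothetical lookup table built from the DNA IUPAC table above.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def unexpected_dna_letters(seq, allow_gap=False):
    """Return the sorted distinct characters in seq that are not IUPAC DNA codes.

    Case is ignored. Set allow_gap=True to accept '-' (for aligned input).
    """
    allowed = set(IUPAC_DNA)
    if allow_gap:
        allowed.add("-")
    return sorted(set(seq.upper()) - allowed)


def codes_compatible(a, b):
    """Return True if two IUPAC codes share at least one possible base."""
    return bool(set(IUPAC_DNA[a.upper()]) & set(IUPAC_DNA[b.upper()]))
```

For example, unexpected_dna_letters("ACG-T?") reports both "-" and "?", while passing allow_gap=True leaves only "?". codes_compatible("R", "A") is true because R covers A or G, whereas R and Y share no base.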

RNA alphabet

RNA sequences follow the same conventions as DNA, except that U (uracil) replaces T. In real datasets, T sometimes still appears due to upstream conversions or mixed conventions.

Expected letters: A C G U N plus the same IUPAC ambiguity codes (R Y S W K M B D H V).
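
Because T can leak into RNA data, one common defensive step is to normalize it to U and record how often that happened. The following is a minimal sketch under that assumption; RNA_LETTERS and normalize_rna are hypothetical names chosen for illustration, and whether silent conversion is appropriate depends on the pipeline.

```python
# Expected RNA letters as listed above: A C G U N plus the ambiguity codes.
RNA_LETTERS = set("ACGUN" + "RYSWKMBDHV")


def normalize_rna(seq):
    """Uppercase seq and replace T with U.

    Returns a tuple (normalized_sequence, number_of_T_replaced) so callers
    can log or reject records that used the DNA convention.
    """
    upper = seq.upper()
    replaced = upper.count("T")
    return upper.replace("T", "U"), replaced


def is_expected_rna(seq):
    """Return True if every character of seq (case-insensitive) is an expected RNA letter."""
    return set(seq.upper()) <= RNA_LETTERS
```

Returning the replacement count rather than converting silently keeps the mixed-convention case visible to whoever runs the pipeline.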

Protein alphabet (amino acids)

Protein sequences use a one-letter alphabet representing the 20 canonical amino acids. The letter X is widely used to represent an unknown or unspecified residue. Some datasets also include U (selenocysteine) and O (pyrrolysine), which are rare but genetically encoded amino acids.

Code  3-letter  Amino acid
A     Ala       Alanine
C     Cys       Cysteine
D     Asp       Aspartic acid
E     Glu       Glutamic acid
F     Phe       Phenylalanine
G     Gly       Glycine
H     His       Histidine
I     Ile       Isoleucine
K     Lys       Lysine
L     Leu       Leucine
M     Met       Methionine
N     Asn       Asparagine
P     Pro       Proline
Q     Gln       Glutamine
R     Arg       Arginine
S     Ser       Serine
T     Thr       Threonine
V     Val       Valine
W     Trp       Tryptophan
Y     Tyr       Tyrosine
X               Unknown / any amino acid

Notes
  • * (stop) and - (gap) commonly appear in translated or aligned protein sequences.
  • Rare but genetically encoded amino acids: U = selenocysteine (Sec), O = pyrrolysine (Pyl).
  • Common ambiguity letters seen in some databases: B = Asx (Asp D or Asn N), Z = Glx (Glu E or Gln Q), J = Leu L or Ile I. Support varies by tool and database.
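
Since support for the non-canonical letters varies, it can help to summarize which kinds of characters a protein sequence actually contains before handing it to a tool. The sketch below groups characters into the categories described above; the category names and the function classify_residues are hypothetical, chosen only for this example.

```python
from collections import Counter

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
RARE = set("UO")                         # selenocysteine, pyrrolysine
AMBIGUOUS = set("BZJX")                  # Asx, Glx, Leu/Ile, unknown
SYMBOLS = set("*-")                      # stop and gap


def classify_residues(seq):
    """Count characters of seq (case-insensitive) per category.

    Returns a dict with any of the keys 'canonical', 'rare', 'ambiguous',
    'symbol', and 'other'; categories with zero count are omitted.
    """
    counts = Counter()
    for ch in seq.upper():
        if ch in CANONICAL:
            counts["canonical"] += 1
        elif ch in RARE:
            counts["rare"] += 1
        elif ch in AMBIGUOUS:
            counts["ambiguous"] += 1
        elif ch in SYMBOLS:
            counts["symbol"] += 1
        else:
            counts["other"] += 1
    return dict(counts)
```

A non-empty "other" count points to characters outside every set listed in this section, while "rare" and "ambiguous" counts flag records that a stricter downstream tool might reject.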

Practical implications for FASTA and FASTQ

Important distinction: FASTA and FASTQ define record structure, not biological meaning. A file can be structurally valid while containing letters that are unexpected for a given biological context.

  • Format validity: whether a parser can read the file.
  • Alphabet expectations: depend on biological context, i.e. which letters make sense for DNA, RNA, or protein.
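
The two checks can be kept separate in code as well. The sketch below is a deliberately minimal illustration, not a full FASTA implementation: parse_fasta handles only the record structure (a ">" header line followed by sequence lines), and check_alphabet then compares each sequence against an expected letter set. Both function names and the DNA_EXPECTED constant are assumptions made for this example.

```python
DNA_EXPECTED = set("ACGTN" + "RYSWKMBDHV")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Structural check only: raises ValueError if sequence data appears
    before the first '>' header. Blank lines are skipped.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data before first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_alphabet(text, expected=DNA_EXPECTED):
    """Map each record header to the sorted letters not in the expected set."""
    return {
        header: sorted(set(seq.upper()) - expected)
        for header, seq in parse_fasta(text)
    }
```

Given the text ">seq1\nACGT\nNNRY\n>seq2\nMKLE\n", both records parse without error, but check_alphabet reports "E" and "L" for seq2. Note that M and K pass the DNA check because they are also IUPAC ambiguity codes, so a letter-set test alone cannot always tell a protein sequence from a nucleotide one.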