Interpreting GenBank Accession Numbers
Introduction to GenBank Accession Numbers
The International Nucleotide Sequence Database Collaboration (INSDC) consists of the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank at NCBI. As part of the Collaboration, all three organizations accept new sequence submissions and share sequence data among the three databases. To facilitate the exchange of data, each member of the collaboration is assigned certain accession prefixes. In addition to the accession number, GenBank records also have a GI number. The GI number is simply a series of digits assigned consecutively to sequences submitted to NCBI.
Format of GenBank accession numbers
|Nucleotide||1 letter + 5 digits or 2 letters + 6 digits|
|Protein||3 letters + 5 digits|
|WGS||4 letters + 2 digits for WGS assembly version + 6-8 digits|
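The formats in the table above can be checked mechanically. The following is a minimal sketch, assuming upper-case letters and only the three formats listed here; the function name `classify_accession` and the category labels are illustrative and not part of any NCBI tool.

```python
import re

# Patterns transcribed from the format table; labels are illustrative only.
_FORMATS = [
    ("nucleotide", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("protein", re.compile(r"[A-Z]{3}\d{5}")),
    # 4 letters + 2-digit assembly version + 6 to 8 digits
    ("wgs", re.compile(r"[A-Z]{4}\d{2}\d{6,8}")),
]


def classify_accession(accession):
    """Return 'nucleotide', 'protein', 'wgs', or None if no format matches."""
    # Drop any ".version" suffix and normalise case before matching.
    base = accession.strip().upper().split(".", 1)[0]
    for name, pattern in _FORMATS:
        if pattern.fullmatch(base):
            return name
    return None
```

For example, `classify_accession("AF123456")` returns `"nucleotide"`, while an identifier that fits none of the three formats returns `None`.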
Primary GenBank accession number prefixes
|AE, CP, CY||Genome projects (nucleotide)|
|U, AF, AY, DQ||Direct submissions (nucleotide)|
|AAAA-AZZZ||Whole genome shotgun sequences (nucleotide)|
|EAA-EZZ||WGS protein ID|
|O, P, Q||Swiss-Prot (protein)|
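A prefix lookup based on the table above might look like the sketch below. It covers only the prefixes listed in this section, not the full prefix list, and the function name `describe_prefix` is hypothetical.

```python
import re

# Exact prefixes from the table above (a small subset of all assigned prefixes).
_PREFIX_MEANING = {
    "AE": "Genome projects (nucleotide)",
    "CP": "Genome projects (nucleotide)",
    "CY": "Genome projects (nucleotide)",
    "U": "Direct submissions (nucleotide)",
    "AF": "Direct submissions (nucleotide)",
    "AY": "Direct submissions (nucleotide)",
    "DQ": "Direct submissions (nucleotide)",
    "O": "Swiss-Prot (protein)",
    "P": "Swiss-Prot (protein)",
    "Q": "Swiss-Prot (protein)",
}


def describe_prefix(accession):
    """Return the table description for an accession's prefix, or None."""
    match = re.match(r"[A-Za-z]+", accession.strip())
    if match is None:
        return None
    prefix = match.group(0).upper()
    if prefix in _PREFIX_MEANING:
        return _PREFIX_MEANING[prefix]
    # Range AAAA-AZZZ: any four-letter prefix starting with A.
    if len(prefix) == 4 and prefix.startswith("A"):
        return "Whole genome shotgun sequences (nucleotide)"
    # Range EAA-EZZ: any three-letter prefix starting with E.
    if len(prefix) == 3 and prefix.startswith("E"):
        return "WGS protein ID"
    return None
```

Prefixes that do not appear in the table, such as those assigned to DDBJ or EMBL, return `None` in this sketch.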
Version number suffix
GenBank sequence identifiers consist of the accession number of the record followed by a dot and a version number (i.e. accession.version). The version number is incremented whenever the sequence data in the record changes.
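Splitting an identifier into its accession and version parts is straightforward. This is a minimal sketch; the function name `split_accession_version` is illustrative, and returning `None` for a missing version is a design choice made here, not a GenBank convention.

```python
def split_accession_version(identifier):
    """Split 'accession.version' into (accession, version).

    The version is returned as an int, or None when no version suffix is present.
    """
    accession, dot, version = identifier.strip().partition(".")
    if not dot:
        return accession, None
    if not (version.isascii() and version.isdigit()):
        raise ValueError(f"invalid version suffix in {identifier!r}")
    return accession, int(version)
```

With this helper, `split_accession_version("AF123456.2")` yields `("AF123456", 2)`, so two identifiers can be compared by accession first and by version second.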
RefSeq Accession Format
RefSeq accession numbers do not follow the standards set by INSDC. They have a distinct format of 2 letters + underscore + 6 digits (e.g. NM_012345). RefSeq records can be either curated (manually reviewed by NCBI staff or collaborators) or automated (not individually reviewed).
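Because of the underscore, RefSeq accessions are easy to tell apart from INSDC accessions. The sketch below assumes exactly the format described above (2 letters, an underscore, 6 digits) with an optional version suffix; accessions with a longer numeric part would need a wider pattern. The function name `parse_refseq` is hypothetical.

```python
import re

# 2 letters + underscore + 6 digits, optionally followed by ".version".
_REFSEQ = re.compile(r"([A-Z]{2})_(\d{6})(?:\.(\d+))?")


def parse_refseq(identifier):
    """Return a dict of the parts of a RefSeq accession, or None if it does not match."""
    match = _REFSEQ.fullmatch(identifier.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version is not None else None,
    }
```

A GenBank accession such as `AF123456` contains no underscore, so `parse_refseq` returns `None` for it.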
The complete list of accession number prefixes is available at http://www.ncbi.nlm.nih.gov/Sequin/acc.html.