A publication of EMBL - Outstation Hinxton, The European Bioinformatics Institute

EMBL Nucleotide Sequence Database Release 56

The EMBL Nucleotide Sequence Database was frozen on 16-Sep-1998 to create Release 56. The release contains 2,689,618 sequence entries comprising 1,904,091,473 nucleotides, an increase of about 15.5% over Release 55.

A breakdown of Release 56 by division is shown below: 

Division              Entries   Nucleotides
Bacteriophage            1410       2770820
ESTs                  1805351     681267485
Fungi                   21078      50092308
GSSs                   325033     150179165
HTG                      1766     215099029
Human                   81346     254151712
Invertebrates           32248     132830872
Organelles              31936      28543481
Other Mammals           16448      15557780
Other Vertebrates       15027      17270350
Patents                114488      36841696
Plants                  29698      58621402
Prokaryotes             50789     122916787
Rodents                 41279      55282269
STSs                    59159      21122318
Synthetic                2756       6219908
Unclassified             1837       1744523
Viruses                 57969      53579568
Total                 2689618    1904091473
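
The totals in the breakdown above can be cross-checked by summing the per-division figures. The following is a minimal sketch; the names `RELEASE_56` and `release_totals` are illustrative only and are not part of any EMBL tool.

```python
# Per-division figures copied from the Release 56 breakdown:
# division name -> (entries, nucleotides).
RELEASE_56 = {
    "Bacteriophage": (1410, 2770820),
    "ESTs": (1805351, 681267485),
    "Fungi": (21078, 50092308),
    "GSSs": (325033, 150179165),
    "HTG": (1766, 215099029),
    "Human": (81346, 254151712),
    "Invertebrates": (32248, 132830872),
    "Organelles": (31936, 28543481),
    "Other Mammals": (16448, 15557780),
    "Other Vertebrates": (15027, 17270350),
    "Patents": (114488, 36841696),
    "Plants": (29698, 58621402),
    "Prokaryotes": (50789, 122916787),
    "Rodents": (41279, 55282269),
    "STSs": (59159, 21122318),
    "Synthetic": (2756, 6219908),
    "Unclassified": (1837, 1744523),
    "Viruses": (57969, 53579568),
}


def release_totals(divisions):
    """Return (total_entries, total_nucleotides) summed over all divisions."""
    entries = sum(e for e, _ in divisions.values())
    nucleotides = sum(n for _, n in divisions.values())
    return entries, nucleotides


if __name__ == "__main__":
    # Should match the "Total" row of the breakdown.
    print(release_totals(RELEASE_56))
```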

EST & GSS Database Files

To keep the data files at a manageable size, the EST division has been split into 19 files (EST1.DAT - EST19.DAT) and the GSS division into 4 files (GSS1.DAT - GSS4.DAT).

 

[Figure: database growth graph]

Issues to note:

During the collaborative meeting held at the EBI in May 1998, the International Nucleotide Sequence Database Collaboration (DDBJ/EMBL/GenBank) reached the following decisions:

1. Long sequences - 350 kilobase limit relaxed

The current 350 kb limit on the sequence length of a single database record will be relaxed. The size limit for an EMBL/DDBJ/GenBank entry remains 350 kb, with two exceptions:

  • a single gene exceeds 350 kb, in which case an entry containing a sequence longer than 350 kb is permitted;
  • unfinished HTG data (i.e. high-throughput sequences) may exceed the 350 kb limit; once finished, these are split into segments of less than 350 kb.
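
The length rule above can be expressed as a small check. This is only a sketch under stated assumptions: 350 kb is taken to mean 350,000 bases, the limit is treated as inclusive, and the function name and flags are hypothetical rather than part of any EMBL submission tool.

```python
# Assumption: 350 kb is interpreted as 350,000 bases, limit inclusive.
MAX_ENTRY_LENGTH = 350_000


def entry_length_allowed(length_bp, single_gene=False, unfinished_htg=False):
    """Return True if a record of length_bp bases is permitted.

    Records within the limit are always allowed. Longer records are
    allowed only for the two exceptions: a single gene longer than
    the limit, or unfinished HTG data.
    """
    if length_bp <= MAX_ENTRY_LENGTH:
        return True
    return single_gene or unfinished_htg
```

For example, `entry_length_allowed(500_000)` is rejected, while the same length with `unfinished_htg=True` is accepted until the sequence is finished and split.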

2. Online Journals in reference section

The May 1998 collaborative meeting also discussed the need to represent references that are published only online. Until final specifications for such references are available from library organisations, EMBL will present online references like this:

RA        Miller, A.
RT        Cloning and expression of a phospholipase gene
RL        Online Publication
RC        Online-Journal-name; Article Identifier; URL
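
As an illustration of the interim layout shown above, the four reference lines could be generated as follows. This is a sketch only: the function name, its parameters, and the column padding (copied from the sample above) are assumptions, not an official EMBL formatter.

```python
def format_online_reference(author, title, journal, article_id, url, pad=8):
    """Build RA/RT/RL/RC lines for a reference published only online.

    The RL line carries the fixed text "Online Publication"; the RC line
    holds the journal name, article identifier and URL separated by "; ".
    `pad` is the number of spaces after the two-letter line code and is
    an assumption taken from the sample layout.
    """
    sep = " " * pad
    return "\n".join([
        f"RA{sep}{author}",
        f"RT{sep}{title}",
        f"RL{sep}Online Publication",
        f"RC{sep}{journal}; {article_id}; {url}",
    ])
```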

3. New /country Feature Qualifier

A new source-related qualifier, '/country', will be implemented in Feature Table Definition document 2.1 (December 1998) to specify the country of origin of DNA samples used in epidemiological or population studies.

Qualifier:      /country=" "
Definition:     Country of origin for DNA sample, intended
                for epidemiological or population studies.
Value format:   "any country from annex 7.5.7 of the Feature
                Table documentation"
Example:        "Canada"
Comment:        /country should be a single token taken from
                the controlled list of annex 7.5.7 of the
                feature table documentation. /country can
                also have the following format:
                country:sub_region,
                such as: /country="Canada:Vancouver".
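
A value in either of the two formats described above could be split and checked roughly as follows. This is a sketch: `KNOWN_COUNTRIES` is a hypothetical three-item stand-in for the controlled list in annex 7.5.7, and `parse_country` is an illustrative name.

```python
# Hypothetical subset; the real controlled list is annex 7.5.7 of the
# feature table documentation and is not reproduced here.
KNOWN_COUNTRIES = {"Canada", "France", "Japan"}


def parse_country(value):
    """Split a /country value into (country, sub_region or None).

    Accepts "Canada" or "Canada:Vancouver". Raises ValueError if the
    country part is not in the controlled list.
    """
    country, _, sub_region = value.partition(":")
    country = country.strip()
    if country not in KNOWN_COUNTRIES:
        raise ValueError(f"country not in controlled list: {country!r}")
    # An absent or empty sub-region is reported as None.
    return country, (sub_region.strip() or None)
```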

4. New nucleotide and protein sequence identifiers

Both nucleotide and protein identifiers will consist of a stable part, which will not change, and a version part, which will be incremented whenever the underlying nucleotide sequence or protein translation changes. Compared to the current identifiers, the new form will make it easier for external databases to track changes to nucleotide and protein sequences.

Nucleotide Sequence Identifier

Currently, the line type 'NI' contains an identifier (e.g. e1344565) for each nucleic acid sequence. The value of this identifier changes only when the sequence changes, while the accession number on the AC line may remain unchanged. The new nucleotide sequence identifier will take the form

'Accession.Version' (e.g. Z86131.1),

where the accession number part will be stable, but the version part will be incremented when the sequence changes. A new line type 'SV' (Sequence Version) will be introduced to represent this information.
Example:

ID   DBSELBGEN  standard; DNA; PRO; 2196 BP.
AC   X99911;
SV   X99911.3
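
A consumer of the flat file could split the SV line into its stable and versioned parts along these lines. This is a minimal sketch: the regular expression assumes accessions of the shape "uppercase letters followed by digits", as in the examples above, and the function names are illustrative.

```python
import re

# Assumption: an accession is uppercase letters followed by digits,
# as in X99911 or Z86131; the version is a decimal integer.
SV_LINE = re.compile(r"^SV\s+([A-Z]+\d+)\.(\d+)\s*$")


def parse_sv_line(line):
    """Return (accession, version) from an SV line such as 'SV   X99911.3'."""
    match = SV_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid SV line: {line!r}")
    return match.group(1), int(match.group(2))


def sequence_changed(old_line, new_line):
    """True if both lines share an accession and the version has increased."""
    old_acc, old_ver = parse_sv_line(old_line)
    new_acc, new_ver = parse_sv_line(new_line)
    return old_acc == new_acc and new_ver > old_ver
```

Because the accession part is stable, an external database can key its records on the accession and compare only the version number to detect a sequence change.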

Protein Sequence Identifier

The new protein identifier (replacing PIDs, e.g. /db_xref="PID:e123345") will consist of a stable ID portion (3+5 format: three letters followed by five digits) plus a version number after a decimal point.
Example:

/protein_id="CAA12345.6"

The version number will change only when the protein sequence coded by the CDS changes, while the stable part will remain unchanged. This qualifier will be valid only on CDS features that translate into a valid protein.
During a transition phase, both the old and new forms of identifiers will be provided, e.g.:

FT   CDS             1124..1939
FT                   /db_xref="PID:g45266"
FT                   /protein_id="CAA12345.6"
FT                   /db_xref="SWISS-PROT:P29808"
FT                   /gene="aacC3"
FT                   /product="aminoglycoside-(3)-N-acetyl-transferase isoenzyme
FT                   III"
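
The /protein_id qualifier can be pulled out of feature table lines like those above with a pattern that follows the 3+5 format. This is a sketch: the pattern assumes uppercase letters in the stable part, as in the example, and `extract_protein_id` is an illustrative name.

```python
import re

# Stable part: three letters and five digits (assumed uppercase, as in
# CAA12345); version: the integer after the decimal point.
PROTEIN_ID = re.compile(r'/protein_id="([A-Z]{3}\d{5})\.(\d+)"')


def extract_protein_id(feature_lines):
    """Return (stable_id, version) from the first /protein_id qualifier.

    Returns None if no line carries a /protein_id in the 3+5 format.
    """
    for line in feature_lines:
        match = PROTEIN_ID.search(line)
        if match:
            return match.group(1), int(match.group(2))
    return None
```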

Subject to synchronization among the international databases, we plan to introduce the new form of nucleotide and protein identifiers in early 1999.

Information by: EBI's EMBL team


 


Direct questions or comments to Bioinformer Editor. This page last modified Friday, 16 July, 1999.
ISSN 1462-1363.

(c) 1997-1999 EMBL-EBI. All Rights Reserved.