
A publication of EMBL - Outstation Hinxton, The European Bioinformatics Institute


EMBL Release 53

Introduction

The EMBL Nucleotide Sequence Database was frozen on 16 December 1997 to make Release 53. The release contains 1,917,868 sequence entries comprising 1,281,391,651 nucleotides, an increase of about 8% over Release 52. A breakdown of Release 53 by division is shown below:

Division              Entries    Nucleotides
Bacteriophage            1388        2188305
ESTs                  1343796      496603984
Fungi                   18137       44602064
GSSs                   100154       49099107
HTG                      1868      102763872
Human                   74384      139022655
Invertebrates           28126      107524431
Organelles              24715       22870076
Other Mammals           14429       13785092
Other Vertebrates       13145       14653255
Plants                  22136       37736590
Patent                  91221       29511807
Prokaryotes             42666      102750354
Rodents                 37043       46489741
STSs                    51172       17685717
Synthetic                2424        5377292
Unclassified             2380        2387088
Viruses                 48684       46340221

Total                 1917868     1281391651
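As a quick consistency check, the per-division figures can be summed and compared with the stated totals. The following is a minimal Python sketch; the dictionary layout and the function names `totals` and `share_of_entries` are illustrative choices, not part of any EMBL tool.

```python
# Per-division counts copied from the Release 53 breakdown:
# division -> (entries, nucleotides)
RELEASE_53 = {
    "Bacteriophage": (1388, 2188305),
    "ESTs": (1343796, 496603984),
    "Fungi": (18137, 44602064),
    "GSSs": (100154, 49099107),
    "HTG": (1868, 102763872),
    "Human": (74384, 139022655),
    "Invertebrates": (28126, 107524431),
    "Organelles": (24715, 22870076),
    "Other Mammals": (14429, 13785092),
    "Other Vertebrates": (13145, 14653255),
    "Plants": (22136, 37736590),
    "Patent": (91221, 29511807),
    "Prokaryotes": (42666, 102750354),
    "Rodents": (37043, 46489741),
    "STSs": (51172, 17685717),
    "Synthetic": (2424, 5377292),
    "Unclassified": (2380, 2387088),
    "Viruses": (48684, 46340221),
}


def totals(divisions):
    """Return (total entries, total nucleotides) summed over all divisions."""
    entries = sum(e for e, _ in divisions.values())
    nucleotides = sum(n for _, n in divisions.values())
    return entries, nucleotides


def share_of_entries(divisions, name):
    """Fraction of all entries that belong to the named division."""
    total_entries, _ = totals(divisions)
    return divisions[name][0] / total_entries


print(totals(RELEASE_53))
print(f"{share_of_entries(RELEASE_53, 'ESTs'):.1%}")
```

The sums agree with the Total row, and the EST division alone accounts for roughly 70% of all entries.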

 

EST Database Files

To keep the size of the data files within reasonable limits for handling purposes, we have split the EST division into several files. At this release we have created one extra file of EST data, named EST14.DAT. Additional files will be added in subsequent releases as appropriate.

Feature Table Qualifiers

New SOURCE Qualifier /specimen_voucher

This new source feature qualifier, valid at this release, indicates the source of a sample of the sequenced material, e.g. a museum identification tag.

/specimen_voucher="text"

This is an identifier of the individual or collection of the source organism and the place where it is currently stored, usually an institution. Example:

/specimen_voucher="Smith s. n. 4-IV-1995 (U. S. Natl. Herbarium)"
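For readers who process feature tables programmatically, a quoted-text qualifier of this kind can be pulled apart with a simple pattern. This is a minimal Python sketch; the function name `parse_text_qualifier` and the regular expression are illustrative assumptions, not an official parser, and the sketch ignores values continued over several lines.

```python
import re

# Matches a quoted-text qualifier of the form /name="text".
# Assumption: the value contains no embedded double quotes.
QUALIFIER_RE = re.compile(r'/(?P<name>[A-Za-z_]+)="(?P<value>[^"]*)"')


def parse_text_qualifier(line):
    """Return (name, value) for a /name="text" qualifier, or None if absent."""
    match = QUALIFIER_RE.search(line)
    if match is None:
        return None
    return match.group("name"), match.group("value")


example = '/specimen_voucher="Smith s. n. 4-IV-1995 (U. S. Natl. Herbarium)"'
name, value = parse_text_qualifier(example)
print(name)
print(value)
```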

New SOURCE Qualifier /focus

This new source feature qualifier valid at this release defines the main source feature for records with more than one source feature (e.g. proviral/cellular sequences).

/focus

This qualifier defines the preferred source feature for records that have more than one source feature. Example:

/focus

This qualifier is to be used only if there is more than one source feature. The preferred source feature determines which organism is displayed in the SOURCE and ORGANISM lines and the EMBL division in which the record is placed.
For sequences derived from more than one organism, and therefore containing more than one 'source' feature key, the /focus qualifier will be attached to the source key that represents the major organism, i.e. the one that was the focus of the sequencing effort. If no translation table is specified, the organism with /focus will define the translation table.
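The selection rule described above can be sketched in a few lines of Python. The representation of a source feature as a dictionary, the function name `preferred_source`, and the example organisms are hypothetical; the check that exactly one source carries /focus is an assumption drawn from the wording above, not a stated validation rule.

```python
def preferred_source(sources):
    """Pick the source feature that determines organism and division.

    `sources` is a list of dicts, each a hypothetical in-memory form of one
    source feature, e.g. {"organism": "...", "focus": True}.
    """
    if not sources:
        raise ValueError("record has no source feature")
    if len(sources) == 1:
        # /focus is only used when there is more than one source feature.
        return sources[0]
    focused = [s for s in sources if s.get("focus")]
    if len(focused) != 1:
        # Assumption: a multi-source record marks exactly one source.
        raise ValueError("expected exactly one source feature with /focus")
    return focused[0]


# Illustrative proviral/cellular record with two source features.
record = [
    {"organism": "Homo sapiens", "focus": False},
    {"organism": "Human immunodeficiency virus 1", "focus": True},
]
print(preferred_source(record)["organism"])
```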

Forthcoming changes

Nucleotide And Protein Identifiers

Nucleotide Identifiers

The NI line type of the EMBL flat-file format currently contains a unique identifier for the nucleotide sequence. As long as the sequence remains the same, so does the value of this identifier. When a sequence change occurs, however minor, a new NI value is assigned, whilst the accession number on the AC line may remain unchanged. These identifiers are maintained collaboratively with GenBank and DDBJ. For example:

NI g21954

Feedback from users and other database groups has made it clear that there is confusion about the relationship between these identifiers and the GenBank 'gi' numbers.
It has therefore been decided to introduce a new system of nucleotide identifiers of the form 'accession.version', e.g. X12345.3, where the accession number part is stable and the version part is incremented whenever the sequence changes.
Subject to synchronisation of this change with GenBank and DDBJ, we plan to implement this new form of nucleotide identifier during 1998.
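To illustrate how the proposed 'accession.version' form behaves, here is a minimal Python sketch that splits an identifier into its two parts and produces the identifier expected after a sequence change. The names `SequenceVersion`, `parse_identifier`, `after_sequence_change` and `format_identifier` are hypothetical, and incrementing by exactly one is an assumption.

```python
from typing import NamedTuple


class SequenceVersion(NamedTuple):
    accession: str  # stable part, e.g. "X12345"
    version: int    # changes whenever the sequence changes


def parse_identifier(text):
    """Split an 'accession.version' string into its two parts."""
    accession, sep, version = text.rpartition(".")
    if not sep or not accession or not version.isdecimal():
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return SequenceVersion(accession, int(version))


def after_sequence_change(ident):
    """Identifier after a sequence change: same accession, next version.

    Assumption: the version is incremented by exactly one.
    """
    return ident._replace(version=ident.version + 1)


def format_identifier(ident):
    return f"{ident.accession}.{ident.version}"


current = parse_identifier("X12345.3")
print(format_identifier(after_sequence_change(current)))
```

External databases can then compare the accession parts to recognise the same record and the version parts to detect a sequence change.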

Protein Identifiers

Protein identifiers are currently assigned to all CDS features in the nucleotide sequence database and are found in the feature table qualifier /db_xref, eg:

/db_xref="PID:e123456789"

As for nucleotide identifier values (above), confusion resulted amongst users concerning the relationship of these to GenBank 'gi' numbers also assigned to CDS features. To clarify this, and to adopt a comparable scheme of identifiers for both nucleotides and proteins, the collaborating databases have decided to create a new feature table qualifier /protein_id, eg:

/protein_id="AAA12345.1"

This form of identifier also makes it easier for external databases to track changing protein identifiers than the previous PIDs did.
The qualifier value consists of a stable ID portion (3+5 format: three letters followed by five digits) plus a version number after a decimal point. The version number will change only when the protein sequence coded by the CDS changes, while the stable part will remain unchanged. This qualifier will be valid only on CDS features which translate into a valid protein.
Subject to synchronisation of this change with GenBank and DDBJ, we plan to implement this new form of protein identifier during 1998.
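The 3+5 format with a version suffix lends itself to a simple format check. The following Python sketch is illustrative only: the function name `split_protein_id` is hypothetical, and restricting the prefix to upper-case letters is an assumption based on the example value AAA12345.1.

```python
import re

# Three letters, five digits, a decimal point, then the version number.
# Assumption: prefix letters are upper case, as in the example AAA12345.1.
PROTEIN_ID_RE = re.compile(r"([A-Z]{3})([0-9]{5})\.([0-9]+)")


def split_protein_id(value):
    """Return (stable_id, version) for a /protein_id value, or raise ValueError."""
    match = PROTEIN_ID_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid protein identifier: {value!r}")
    prefix, digits, version = match.groups()
    return prefix + digits, int(version)


print(split_protein_id("AAA12345.1"))
```

An old-style PID such as e123456789 does not match this pattern and is rejected by the sketch.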

Feature Table Qualifiers

New /protein_id

As mentioned above, a new feature table qualifier /protein_id will be created:

/protein_id="<identifier>"

This is a protein identifier issued by the international collaborators. The qualifier value consists of a stable ID portion (3+5 format: three letters followed by five digits) plus a version number after the decimal point. Example:

/protein_id="AAA12345.1"

The version number will change only when the protein sequence coded by the CDS changes, while the stable part will remain unchanged. This qualifier is valid only on CDS features which translate into a valid protein. The list of 3-letter prefixes will be maintained by EBI.
Subject to synchronisation of this change with GenBank and DDBJ, we plan to implement this new form of protein identifiers during 1998.

/translation And Related Feature Qualifiers

The collaborating databases DDBJ/EMBL/GenBank have decided that translation-related qualifiers should be used only with the primary CDS feature key. These translation-related qualifiers are:

/codon
/codon_start
/exception
/translation
/transl_table
/transl_except

Starting with Release 54, translation-related qualifiers will be valid only with the CDS feature key and will be removed from the following non-CDS features:

C_region
D_segment
exon
J_segment
mat_peptide
N_region
sig_peptide
S_region
transit_peptide
V_region
V_segment
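The effect of this change on a single feature can be sketched as follows. This is a minimal Python illustration, not the procedure the databases will actually use; the function name `strip_translation_qualifiers` and the representation of qualifiers as a dictionary are assumptions.

```python
# Translation-related qualifiers listed above (without the leading slash).
TRANSLATION_QUALIFIERS = {
    "codon", "codon_start", "exception",
    "translation", "transl_table", "transl_except",
}

# Non-CDS feature keys from which these qualifiers will be removed.
NON_CDS_KEYS = {
    "C_region", "D_segment", "exon", "J_segment", "mat_peptide", "N_region",
    "sig_peptide", "S_region", "transit_peptide", "V_region", "V_segment",
}


def strip_translation_qualifiers(feature_key, qualifiers):
    """Return a copy of `qualifiers` with translation-related ones removed
    when the feature key is one of the listed non-CDS keys."""
    if feature_key not in NON_CDS_KEYS:
        return dict(qualifiers)
    return {
        name: value
        for name, value in qualifiers.items()
        if name not in TRANSLATION_QUALIFIERS
    }


exon = {"gene": "abc", "codon_start": "1"}
print(strip_translation_qualifiers("exon", exon))
print(strip_translation_qualifiers("CDS", {"codon_start": "1"}))
```

In this sketch the CDS feature keeps its qualifiers untouched, while the exon loses only the translation-related one.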

Written by: Peter Stoehr


 


Direct questions or comments to Bioinformer Editor. This page last modified Friday, 16 July, 1999.
ISSN 1462-1363.
More information about the BioInformer.

(c) 1997-1999 EMBL-EBI. All Rights Reserved.