EMBL Release 53
Introduction
The EMBL Nucleotide Sequence Database was frozen on 16 December 1997 to make Release 53. The release contains 1,917,868 sequence entries comprising 1,281,391,651 nucleotides. This represents an increase of about 8% over Release 52. A breakdown of Release 53 by division is shown below:
Division              Entries    Nucleotides
Bacteriophage            1388        2188305
ESTs                  1343796      496603984
Fungi                   18137       44602064
GSSs                   100154       49099107
HTG                      1868      102763872
Human                   74384      139022655
Invertebrates           28126      107524431
Organelles              24715       22870076
Other Mammals           14429       13785092
Other Vertebrates       13145       14653255
Plants                  22136       37736590
Patent                  91221       29511807
Prokaryotes             42666      102750354
Rodents                 37043       46489741
STSs                    51172       17685717
Synthetic                2424        5377292
Unclassified             2380        2387088
Viruses                 48684       46340221
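As a consistency check, the per-division figures can be summed and compared with the release totals quoted in the introduction. The short Python sketch below only transcribes the figures given in this document; the variable names are illustrative and not part of any EMBL tool.

```python
# (division, entries, nucleotides) as listed for Release 53
DIVISIONS = [
    ("Bacteriophage", 1388, 2188305),
    ("ESTs", 1343796, 496603984),
    ("Fungi", 18137, 44602064),
    ("GSSs", 100154, 49099107),
    ("HTG", 1868, 102763872),
    ("Human", 74384, 139022655),
    ("Invertebrates", 28126, 107524431),
    ("Organelles", 24715, 22870076),
    ("Other Mammals", 14429, 13785092),
    ("Other Vertebrates", 13145, 14653255),
    ("Plants", 22136, 37736590),
    ("Patent", 91221, 29511807),
    ("Prokaryotes", 42666, 102750354),
    ("Rodents", 37043, 46489741),
    ("STSs", 51172, 17685717),
    ("Synthetic", 2424, 5377292),
    ("Unclassified", 2380, 2387088),
    ("Viruses", 48684, 46340221),
]

# Release totals quoted in the introduction.
REPORTED_ENTRIES = 1_917_868
REPORTED_NUCLEOTIDES = 1_281_391_651

total_entries = sum(entries for _, entries, _ in DIVISIONS)
total_nucleotides = sum(nucleotides for _, _, nucleotides in DIVISIONS)

print("entries match:    ", total_entries == REPORTED_ENTRIES)
print("nucleotides match:", total_nucleotides == REPORTED_NUCLEOTIDES)

# Share of each division in the release, largest entry count first.
for name, entries, nucleotides in sorted(DIVISIONS, key=lambda row: -row[1]):
    print(f"{name:<18}{entries / total_entries:7.1%}{nucleotides / total_nucleotides:7.1%}")
```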
EST Database Files
To keep the size of the data files within reasonable limits for handling purposes, we have split the EST division into several files. At this release we have created one extra file of EST data, named EST14.DAT. Additional files will be added in subsequent releases as appropriate.
Feature Table Qualifiers
New SOURCE Qualifier /specimen_voucher
This new source feature qualifier, valid as of this release, indicates the source of a sample of the sequenced material, e.g. a museum identification tag.
/specimen_voucher="text"
This is an identifier of the individual or collection of the source organism and the place where it is currently stored, usually an institution. Example:
/specimen_voucher="Smith s. n. 4-IV-1995 (U. S. Natl. Herbarium)"
New /focus
This new source feature qualifier, valid as of this release, defines the main source feature for records with more than one source feature (e.g. proviral/cellular sequences).
/focus
This qualifier defines the preferred source feature for records that have more than one source feature. Example:
/focus
This qualifier is to be used only if there is more than one source feature. The preferred source feature is used to determine which organism is displayed in the SOURCE and ORGANISM lines and to determine the EMBL division in which it is placed. For sequences derived from more than one organism, and therefore containing more than one 'source' feature key, the /focus qualifier will be attached to the source key which represents the major organism, that which was the focus of the sequencing effort. If no translation table is specified, the organism with /focus will define the translation table.
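The selection rule described above can be sketched in Python. This is a minimal illustration, not EMBL's implementation: the dictionary layout and the function name `preferred_source` are assumptions made for the example, and the text does not say what happens when a record has several source features but no /focus, so the sketch simply returns None in that case.

```python
from typing import List, Optional


def preferred_source(sources: List[dict]) -> Optional[dict]:
    """Pick the source feature that drives the organism lines and the division.

    Each source is a hypothetical dict such as
    {"organism": "Organism A", "focus": True}.
    """
    if not sources:
        raise ValueError("a record needs at least one source feature")
    if len(sources) == 1:
        # /focus is only used when there is more than one source feature.
        return sources[0]
    focused = [s for s in sources if s.get("focus")]
    if len(focused) > 1:
        raise ValueError("only one source feature may carry /focus")
    # Assumption: with several sources and no /focus, no preference is known.
    return focused[0] if focused else None


record = [
    {"organism": "Organism A"},
    {"organism": "Organism B", "focus": True},
]
print(preferred_source(record)["organism"])
```

In this sketch the organism of the returned feature is the one that would appear in the organism lines and decide the division.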
Forthcoming changes
Nucleotide And Protein Identifiers
Nucleotide Identifiers
The NI linetype of the EMBL flat-file format currently contains a unique identifier for the nucleotide sequence. While the sequence remains the same, so does the value of this identifier. When a sequence change occurs, however minor, a new NI value will be assigned whilst the accession number on the AC line may remain unchanged. These identifiers are collaboratively maintained with GenBank and DDBJ, for example:
NI g21954
It has become clear from users and other database groups that confusion has been created about the relationship between these identifiers and the GenBank 'gi' numbers. It has been decided therefore to introduce a new system of nucleotide identifiers of the form 'accession.version', eg: X12345.3, where the accession number part will be stable, but the version part will be incremented if the sequence changes. Subject to synchronisation of this change with GenBank and DDBJ, we plan to implement this new form of nucleotide identifier during 1998.
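To illustrate how the proposed 'accession.version' form might be handled by software, here is a small Python sketch. It assumes only what is stated above (a stable accession part, a dot, and a numeric version); the names `SequenceVersion`, `parse_sequence_version` and `same_record` are hypothetical and the final format may differ once it is agreed with GenBank and DDBJ.

```python
from typing import NamedTuple


class SequenceVersion(NamedTuple):
    accession: str  # stable part, e.g. "X12345"
    version: int    # changes when the sequence changes


def parse_sequence_version(identifier: str) -> SequenceVersion:
    """Split an 'accession.version' string into its two parts."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdecimal():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return SequenceVersion(accession, int(version))


def same_record(a: str, b: str) -> bool:
    """True if both identifiers share an accession, whatever their versions."""
    return parse_sequence_version(a).accession == parse_sequence_version(b).accession


print(parse_sequence_version("X12345.3"))
print(same_record("X12345.3", "X12345.4"))
```

Comparing only the accession part lets external databases keep a stable link to a record, while the version part shows whether the sequence they hold is still current.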
Protein Identifiers
Protein identifiers are currently assigned to all CDS features in the nucleotide sequence database and are found in the feature table qualifier /db_xref, eg:
/db_xref="PID:e123456789"
As for nucleotide identifier values (above), confusion resulted amongst users concerning the relationship of these to GenBank 'gi' numbers also assigned to CDS features. To clarify this, and to adopt a comparable scheme of identifiers for both nucleotides and proteins, the collaborating databases have decided to create a new feature table qualifier /protein_id, eg:
/protein_id="AAA12345.1"
This form of identifier also allows external databases to track changing protein identifiers more easily than the previous PIDs did. The qualifier value consists of a stable ID portion (3+5 format: 3 letters followed by 5 digits) plus a version number after a decimal point. The version number will change only when the protein sequence coded by the CDS changes, while the stable part will remain unchanged. This qualifier will be valid only on CDS features which translate into a valid protein. Subject to synchronisation of this change with GenBank and DDBJ, we plan to implement this new form of protein identifier during 1998.
Feature Table Qualifiers
New /protein_id
As mentioned above, a new feature table qualifier /protein_id will be created:
/protein_id="<identifier>"
This is a protein identifier, issued by the international collaborators. The qualifier value consists of a stable ID portion (3+5 format: 3 letters followed by 5 digits) plus a version number after the decimal point. Example:
/protein_id="AAA12345.1"
The version number will change only when the protein sequence coded by the CDS changes, while the stable part will remain unchanged. This qualifier is valid only on CDS features which translate into a valid protein. The list of 3-letter prefixes will be maintained by EBI. Subject to synchronisation of this change with GenBank and DDBJ, we plan to implement this new form of protein identifier during 1998.
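The 3+5 format and the versioning rule can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive validator: the regular expression assumes uppercase letters (as in the example AAA12345.1), the step of one per change is an assumption, and the function names are hypothetical.

```python
import re

# Assumed syntax: 3 uppercase letters + 5 digits, a dot, then the version.
PROTEIN_ID_PATTERN = re.compile(r"([A-Z]{3}[0-9]{5})\.([0-9]+)")


def parse_protein_id(value):
    """Return (stable_id, version) for a /protein_id value."""
    match = PROTEIN_ID_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid protein_id: {value!r}")
    return match.group(1), int(match.group(2))


def next_protein_id(current_id, old_translation, new_translation):
    """Change the version only if the translated protein sequence changed."""
    stable_id, version = parse_protein_id(current_id)
    if new_translation != old_translation:
        version += 1  # assumption: the version goes up by one per change
    return f"{stable_id}.{version}"


print(parse_protein_id("AAA12345.1"))
print(next_protein_id("AAA12345.1", "MKV", "MKL"))
```

Changes to the entry that leave the translation untouched keep the identifier exactly as it was, which is what makes it suitable for tracking by external databases.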
/translation And Related Feature Qualifiers
The collaborating databases DDBJ/EMBL/GenBank have decided that translation-related qualifiers should be used only with the primary CDS feature key. These translation-related qualifiers are:
/codon /codon_start /exception /translation /transl_table /transl_except
Starting at Release 54, translation-related qualifiers will be valid only with the CDS feature key and will be removed from the following non-CDS features:
C_region D_segment exon J_segment mat_peptide N_region sig_peptide S_region transit_peptide V_region V_segment
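For users who maintain their own copies of entries, the effect of this change can be sketched in Python. The qualifier and feature key names are taken from the lists above; the dictionary representation of a feature and the function name `strip_translation_qualifiers` are assumptions made for the example.

```python
TRANSLATION_QUALIFIERS = frozenset({
    "codon", "codon_start", "exception",
    "translation", "transl_table", "transl_except",
})

NON_CDS_FEATURE_KEYS = frozenset({
    "C_region", "D_segment", "exon", "J_segment", "mat_peptide",
    "N_region", "sig_peptide", "S_region", "transit_peptide",
    "V_region", "V_segment",
})


def strip_translation_qualifiers(feature_key, qualifiers):
    """Return a copy of `qualifiers` without translation-related entries
    when the feature key is one of the listed non-CDS keys."""
    if feature_key not in NON_CDS_FEATURE_KEYS:
        # CDS features (and keys not in the list) are left untouched.
        return dict(qualifiers)
    return {name: value for name, value in qualifiers.items()
            if name not in TRANSLATION_QUALIFIERS}


exon = {"codon_start": "1", "gene": "abc", "number": "2"}
print(strip_translation_qualifiers("exon", exon))
```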
Written by: Peter Stoehr
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
EMBL Nucleotide Sequence Database Homepage http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
EMBL latest release ftp site ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
External sites are not endorsed by EMBL-EBI