EMBL Nucleotide Sequence Database Release 56
The EMBL Nucleotide Sequence Database was frozen to create Release 56 on 16-Sep-1998. The release contains 2,689,618 sequence entries comprising 1,904,091,473 nucleotides. This represents an increase of about 15.5% over Release 55.
A breakdown of Release 56 by division is shown below:
Division | Entries | Nucleotides
Bacteriophage | 1410 | 2770820
ESTs | 1805351 | 681267485
Fungi | 21078 | 50092308
GSSs | 325033 | 150179165
HTG | 1766 | 215099029
Human | 81346 | 254151712
Invertebrates | 32248 | 132830872
Organelles | 31936 | 28543481
Other Mammals | 16448 | 15557780
Other Vertebrates | 15027 | 17270350
Patents | 114488 | 36841696
Plants | 29698 | 58621402
Prokaryotes | 50789 | 122916787
Rodents | 41279 | 55282269
STSs | 59159 | 21122318
Synthetic | 2756 | 6219908
Unclassified | 1837 | 1744523
Viruses | 57969 | 53579568
Total | 2689618 | 1904091473
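The totals in the table can be cross-checked against the individual divisions. The following is a minimal Python sketch of that arithmetic; the names `DIVISIONS`, `RELEASE_TOTAL`, `column_totals` and `share_of_entries` are illustrative and not part of any EMBL tool.

```python
# Division figures copied from the Release 56 table: (entries, nucleotides).
DIVISIONS = {
    "Bacteriophage": (1410, 2770820),
    "ESTs": (1805351, 681267485),
    "Fungi": (21078, 50092308),
    "GSSs": (325033, 150179165),
    "HTG": (1766, 215099029),
    "Human": (81346, 254151712),
    "Invertebrates": (32248, 132830872),
    "Organelles": (31936, 28543481),
    "Other Mammals": (16448, 15557780),
    "Other Vertebrates": (15027, 17270350),
    "Patents": (114488, 36841696),
    "Plants": (29698, 58621402),
    "Prokaryotes": (50789, 122916787),
    "Rodents": (41279, 55282269),
    "STSs": (59159, 21122318),
    "Synthetic": (2756, 6219908),
    "Unclassified": (1837, 1744523),
    "Viruses": (57969, 53579568),
}

# Totals as stated in the release announcement.
RELEASE_TOTAL = (2689618, 1904091473)


def column_totals(divisions):
    """Sum entries and nucleotides over all divisions."""
    entries = sum(e for e, _ in divisions.values())
    nucleotides = sum(n for _, n in divisions.values())
    return entries, nucleotides


def share_of_entries(divisions, name):
    """Fraction of all entries contributed by one division."""
    total_entries, _ = column_totals(divisions)
    return divisions[name][0] / total_entries


if __name__ == "__main__":
    print("Totals match:", column_totals(DIVISIONS) == RELEASE_TOTAL)
    print(f"EST share of entries: {share_of_entries(DIVISIONS, 'ESTs'):.1%}")
```

The same dictionary can be reused to compute the share of any other division, or the share of nucleotides instead of entries.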
EST & GSS Database Files
To keep the data files at a manageable size, the EST division has been split into 19 files (EST1.DAT - EST19.DAT), while the GSS division has been split into 4 files (GSS1.DAT - GSS4.DAT).
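For scripts that need to walk over all parts of a split division, the file names can be generated from the pattern given above. This is a small Python sketch; the helper name `division_files` is hypothetical, and the sketch only builds names, it does not check that the files exist on any server.

```python
def division_files(prefix, count):
    """Return the numbered .DAT file names for a split division."""
    return [f"{prefix}{i}.DAT" for i in range(1, count + 1)]


# File counts as stated in the Release 56 notes.
EST_FILES = division_files("EST", 19)
GSS_FILES = division_files("GSS", 4)

if __name__ == "__main__":
    print(EST_FILES[0], "...", EST_FILES[-1])
    print(GSS_FILES)
```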
Database Growth Graph:
Issues to note:
At its collaborative meeting at the EBI in May 1998, the International Nucleotide Sequence Database Collaboration (DDBJ/EMBL/GenBank) came to the following decisions:
1. Long sequences - 350 kilobase limit relaxed
The current 350 kb limit on the maximum sequence length of a single database record will be relaxed. It was decided that the size limit for an EMBL/DDBJ/GenBank entry will remain 350 kb, with two exceptions:
- if a single gene exceeds 350 kb, an entry containing a sequence longer than 350 kb is permitted.
- unfinished HTG data (i.e. high-throughput sequences) can exceed the 350 kb limit; once finished, these are split into segments <350 kb.
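The rule and its two exceptions can be written down as a simple check. The sketch below is illustrative only: the function name and flags are hypothetical, 1 kb is assumed to mean 1,000 bases, and a sequence of exactly 350 kb is assumed to be within the limit, which the text does not state explicitly.

```python
# Assumption: 1 kb = 1,000 bases, so the limit is 350,000 bases.
LIMIT_BP = 350_000


def entry_length_allowed(length_bp, *, single_gene=False, unfinished_htg=False):
    """Apply the relaxed 350 kb rule to one entry.

    single_gene:    the entry holds one gene that itself exceeds the limit.
    unfinished_htg: the entry is unfinished high-throughput (HTG) data.
    """
    if length_bp <= LIMIT_BP:
        return True
    # Above the limit, only the two agreed exceptions apply.
    return single_gene or unfinished_htg


if __name__ == "__main__":
    print(entry_length_allowed(200_000))
    print(entry_length_allowed(500_000))
    print(entry_length_allowed(500_000, single_gene=True))
    print(entry_length_allowed(500_000, unfinished_htg=True))
```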
2. Online Journals in reference section
The May 1998 collaborative meeting also discussed the need to represent references that are published only online. Until final specifications for such references are available from library organisations, EMBL will present online references like this:
RA   Miller, A.
RT   Cloning and expression of a phospholipase gene
RL   Online Publication
RC   Online-Journal-name; Article Identifier; URL
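As an illustration of the interim layout above, the following Python sketch assembles the four reference lines from their parts. The function name and parameters are hypothetical, and the three-space gap after each line code is an assumption based on the usual EMBL flat-file layout; the placeholder values are the ones used in the example above.

```python
def online_reference_lines(authors, title, journal, article_id, url):
    """Build the RA/RT/RL/RC lines for an online-only reference."""
    return [
        f"RA   {authors}",
        f"RT   {title}",
        "RL   Online Publication",
        f"RC   {journal}; {article_id}; {url}",
    ]


if __name__ == "__main__":
    lines = online_reference_lines(
        "Miller, A.",
        "Cloning and expression of a phospholipase gene",
        "Online-Journal-name",   # placeholder from the example above
        "Article Identifier",    # placeholder from the example above
        "URL",                   # placeholder from the example above
    )
    print("\n".join(lines))
```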
3. New /country Feature Qualifier
A new source-related qualifier, '/country', will be implemented in Feature Table Definition document 2.1 (December 1998) to specify the origin of DNA samples used for epidemiological or population studies.
Qualifier: /country=" "
Definition: Country of origin for DNA sample, intended for epidemiological or population studies.
Value format: "any country from annex 7.5.7 of the Feature Table documentation"
Example: "Canada"
Comment: /country should be a single token taken from the controlled list of annex 7.5.7 of the feature table documentation. /country can also have the following format: country:sub_region, such as: /country="Canada:Vancouver".
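The two value formats described in the comment (a plain country, or country:sub_region) can be told apart with a short parser. This is a minimal Python sketch: `KNOWN_COUNTRIES` is a hypothetical stand-in for the controlled list in annex 7.5.7, which is not reproduced here, and `parse_country` is an illustrative name.

```python
# Hypothetical stand-in for the controlled list in annex 7.5.7;
# the real list is defined by the Feature Table documentation.
KNOWN_COUNTRIES = {"Canada", "Germany", "Japan"}


def parse_country(value):
    """Split a /country value into (country, sub_region or None)."""
    country, sep, sub_region = value.partition(":")
    if country not in KNOWN_COUNTRIES:
        raise ValueError(f"country not in controlled list: {country!r}")
    if sep and not sub_region:
        raise ValueError("empty sub_region after ':'")
    return country, (sub_region or None)


if __name__ == "__main__":
    print(parse_country("Canada"))
    print(parse_country("Canada:Vancouver"))
```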
4. New nucleotide and protein sequence identifiers
Both nucleotide and protein identifiers will consist of a stable part which will not change, and a version part which will be incremented whenever the underlying nucleotide sequence or protein translation changes. The new form of identifiers will allow easier tracking of changes to nucleotide and protein identifiers by external databases compared to the current identifiers.
Nucleotide Sequence Identifier
Currently, the line type 'NI' contains an identifier (e.g. e1344565) for each nucleic acid sequence. The value of this identifier changes only when the sequence changes, while the accession number on the AC line may remain unchanged. The new nucleotide sequence identifier will take the form
'Accession.Version' (e.g. Z86131.1),
where the accession number part will be stable, but the version part will be incremented when the sequence changes. A new linetype 'SV' (Sequence Version) will be introduced to represent this information. Example:
ID   DBSELBGEN standard; DNA; PRO; 2196 BP.
AC   X99911;
SV   X99911.3
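An external database that tracks sequence changes would split the SV value into its stable and version parts. The Python sketch below shows one way to do this; the function names are hypothetical, and the regular expression assumes that an accession number is one or more capital letters followed by digits, as in the examples above.

```python
import re

# Assumption: accession = capital letters followed by digits (e.g. X99911).
SV_LINE = re.compile(r"^SV\s+([A-Z]+\d+)\.(\d+)\s*$")


def parse_sv_line(line):
    """Return (accession, version) from an 'SV' line."""
    match = SV_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid SV line: {line!r}")
    return match.group(1), int(match.group(2))


def sequence_changed(old_sv_line, new_sv_line):
    """True if both lines share an accession but the version differs."""
    old_acc, old_ver = parse_sv_line(old_sv_line)
    new_acc, new_ver = parse_sv_line(new_sv_line)
    if old_acc != new_acc:
        raise ValueError("SV lines belong to different accessions")
    return new_ver != old_ver


if __name__ == "__main__":
    print(parse_sv_line("SV   X99911.3"))
    print(sequence_changed("SV   X99911.2", "SV   X99911.3"))
```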
Protein Sequence Identifier
The new protein identifier (replacing PIDs, e.g. /db_xref="PID:e123345") will consist of a stable ID portion (3+5 format: three letters followed by five digits) plus a version number after a decimal point. Example:
/protein_id="CAA12345.6"
The version number will change only when the protein sequence coded by the CDS changes, while the stable part will remain unchanged. This qualifier will be valid only on CDS features which translate into a valid protein. During a transition phase both the old and new forms of identifiers will be provided, e.g.:
FT   CDS             1124..1939
FT                   /db_xref="PID:g45266"
FT                   /protein_id="CAA12345.6"
FT                   /db_xref="SWISS-PROT:P29808"
FT                   /gene="aacC3"
FT                   /product="aminoglycoside-(3)-N-acetyl-transferase isoenzyme
FT                   III"
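The 3+5 format with a version suffix lends itself to a simple pattern match. The following Python sketch extracts the stable ID and version from a /protein_id qualifier; the function name is hypothetical, and the pattern assumes capital letters for the three-letter part, as in the example.

```python
import re

# Three capital letters, five digits, a dot, then the version number.
PROTEIN_ID = re.compile(r'/protein_id="([A-Z]{3}\d{5})\.(\d+)"')


def parse_protein_id(qualifier_line):
    """Return (stable_id, version) from a /protein_id qualifier line."""
    match = PROTEIN_ID.search(qualifier_line)
    if match is None:
        raise ValueError(f"no valid /protein_id found in: {qualifier_line!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    print(parse_protein_id('FT                   /protein_id="CAA12345.6"'))
```

Lines carrying only the old-style identifier, such as /db_xref="PID:g45266", do not match the pattern and raise `ValueError`, which makes it easy to tell the two forms apart during the transition phase.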
Subject to synchronization amongst the international databases, we plan to introduce the new form of nucleotide and protein identifiers in early 1999.
Information by: EBI's EMBL team
Resources and further information
European Bioinformatics Institute (EMBL-EBI) http://www.ebi.ac.uk/
EMBL Nucleotide Sequence Database homepages http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
Latest EMBL Nucleotide Sequence Database release ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
External sites are not endorsed by EMBL-EBI