Current Developments in the EMBL Nucleotide Sequence Database
by: Günther Stösser, The EMBL Outstation - The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Contents
Introduction
Release Report
Major Changes 1996/97
Nucleotide and Protein Identifiers
Webin
Migration Annotation Tools to New Production Environment
Patent Data
Direct Submissions
Genome Project Data
Introduction
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications, and submitted directly by researchers and genome sequencing groups. The database is produced in collaboration with GenBank and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.
Release Report
During 1996, database releases 46, 47, 48 and 49 were produced. The EMBL Nucleotide Sequence Database was frozen to make Release 50 on 18 February 1997. The release contains 1,187,455 sequence entries comprising 789,755,858 nucleotides. This represents an increase of about 13% over Release 49. A breakdown of Release 50 by taxonomic division is shown below:
Division             Entries    Nucleotides
Bacteriophage           1318        1932976
ESTs                  820648      302434896
Fungi                  15704       39671216
GSSs                    7947        3098838
HTGs                     511       33688644
Human                  64239       73101184
Invertebrates          22683       82803976
Organelles             18062       17876379
Other Mammals          12069       11938942
Other Vertebrates      10956       12408930
Major Changes 1996/1997
HTG - High Throughput Genome Sequences
A new division for High Throughput Genome Sequences (HTG) has been created. This new division is used for genomic sequences which are produced by high-throughput sequencing projects. The records are primarily nematode and human and include long sequences. The annotation for many of these records is expected to be generated through computer analyses. Entries in this division all contain the keyword HTG, and a second keyword to indicate the status of the sequencing as follows:
HTGS_PHASE1    An unordered set of sequence pieces (typically 7-20), with no annotation. The pieces are separated by a long run of 'N's.
HTGS_PHASE2    Ordered pieces (typically 2 or 3), with no annotation, again separated by a long run of 'N's.
HTGS_PHASE3    One contiguous sequence, some annotation.
A single accession number is normally assigned to one clone, and as sequencing progresses and the entry passes from one phase to another, it will retain the same accession number.
IMPORTANT: These data are unfinished and do not necessarily represent the correct sequence. Work on the sequence is in progress and the release of this data is based on the understanding that the sequence may change as work continues.
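As an illustration of the keyword convention described above, the following minimal sketch reads the sequencing phase from a KW line. The function name `htg_phase` and the `PHASES` table are hypothetical and not part of any EMBL software; the sketch assumes keywords are separated by semicolons and the line ends with a full stop.

```python
from typing import Optional

# Status keywords listed in the text, mapped to a phase number.
PHASES = {"HTGS_PHASE1": 1, "HTGS_PHASE2": 2, "HTGS_PHASE3": 3}


def htg_phase(kw_line: str) -> Optional[int]:
    """Return the HTG phase (1-3) found on a KW line, or None.

    None is returned when the HTG keyword is absent or when no
    status keyword accompanies it.
    """
    # Drop the two-letter line code if it is present.
    body = kw_line[2:] if kw_line.startswith("KW") else kw_line
    keywords = [k.strip().rstrip(".").strip() for k in body.split(";")]
    keywords = [k for k in keywords if k]
    if "HTG" not in keywords:
        return None
    for keyword in keywords:
        if keyword in PHASES:
            return PHASES[keyword]
    return None
```

For example, `htg_phase("KW   HTG; HTGS_PHASE1.")` yields 1, while a line without the HTG keyword yields `None`.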
Example entry: Z93042
ID   HS6B17     standard; DNA; HTG; 157343 BP.
XX
AC   Z93042;
XX
NI   e1041791
XX
DT   19-MAR-1997 (Rel. 51, Created)
DT   19-MAR-1997 (Rel. 51, Last updated, Version 1)
XX
DE   Human DNA sequence *** SEQUENCING IN PROGRESS *** from clone 6B17
XX
KW   HTG; HTGS_PHASE1.
XX
OS   Homo sapiens (human)
OC   Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
OC   Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-157343
RA   Buck D.;
RT   ;
RL   Submitted (06-NOV-1996) to the EMBL/GenBank/DDBJ databases.
RL   Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA,
RL   UK. E-mail enquires: humquery@sanger.ac.uk Clone requests:
RL   clonerequest@sanger.ac.uk
XX
CC   IMPORTANT: This sequence is unfinished and does not necessarily
CC   represent the correct sequence. Work on the sequence is in progress
CC   and the release of this data is based on the understanding that the
CC   sequence may change as work continues. The sequence may be
CC   contaminated with foreign sequence from E.coli, yeast, vector,
CC   phage etc.
XX
CC   Order of segments is not known; 800 n's separate segments.
CC   Unfinished sequence: dJ6B17 Contig_ID: 01251 Length: 26837 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01452 Length: 21575 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 00230 Length: 2960 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01187 Length: 4030 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01213 Length: 19770 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01206 Length: 5250 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 00110 Length: 6122 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 02179 Length: 5084 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 00366 Length: 4335 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01691 Length: 1239 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01482 Length: 22035 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 00363 Length: 1346 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 02224 Length: 928 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 00881 Length: 838 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 00427 Length: 1013 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 02076 Length: 1012 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01319 Length: 751 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 00664 Length: 878 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01575 Length: 8438 bp
CC   Unfinished sequence: dJ6B17 Contig_ID: 01216 Length: 7702 bp
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..157343
FT                   /organism="Homo sapiens"
FT                   /clone="6B17"
FT                   /clone="6B17"
FT                   /chromosome="22"
XX
SQ   Sequence 157343 BP; 33669 A; 34723 C; 35752 G; 37649 T; 15550 other;
Z93042.Dat  Length: 157343  March 19, 1997 13:36  Type: N  Check: 3442  ..
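The comment lines of this entry state that 800 n's separate the segments. As a rough sketch, assuming the sequence is available as a plain string, the individual pieces of a phase 1 or phase 2 entry could be recovered by splitting at long runs of 'n'. The function name `split_pieces` and the `min_gap` threshold are hypothetical, chosen only for illustration; the contig lengths are copied from the CC lines of Z93042.

```python
import re
from typing import List


def split_pieces(sequence: str, min_gap: int = 800) -> List[str]:
    """Split a sequence at runs of at least `min_gap` consecutive n/N.

    Shorter runs of n are treated as ambiguous bases inside a piece
    and are left untouched.
    """
    separator = re.compile(r"[nN]{%d,}" % min_gap)
    return [piece for piece in separator.split(sequence) if piece]


# Contig lengths as listed in the CC lines of entry Z93042.
CONTIG_LENGTHS = [
    26837, 21575, 2960, 4030, 19770, 5250, 6122, 5084, 4335, 1239,
    22035, 1346, 928, 838, 1013, 1012, 751, 878, 8438, 7702,
]
GAP_LENGTH = 800

# 20 contigs need 19 separators between them.
expected_total = sum(CONTIG_LENGTHS) + GAP_LENGTH * (len(CONTIG_LENGTHS) - 1)
```

Here `expected_total` works out to 157343, which matches the length given on the ID and SQ lines: the 20 contigs sum to 142143 bp and the 19 separators add 15200 n's.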
New GSS Division
A new division for Genome Survey Sequences (GSS) was added to the database starting with Release 48 (September 1996). This division is similar in nature to the EST division, except that its sequences are genomic rather than cDNA (mRNA). The GSS division will contain (but not be limited to) the following types of data:
- random "single pass read" genome survey sequences
- single pass reads from cosmid/BAC/YAC ends
- exon trapped genomic sequences
- Alu PCR sequences
Example Entry: Z86131
Accession Number Format Change
Until recently, accession numbers used by the nucleotide sequence databases consisted of one prefix letter followed by 5 digits. EST projects and projects to add patent data have accelerated the need to extend the accession number space.
1+5 format: 1 (prefix letter) + 5 (digits), e.g. Y12345
2+6 format: 2 (prefix letters) + 6 (digits), e.g. AC123456
Note: Existing 6-character accession numbers will remain as they are, and will never be transformed to an 8-character form.
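The two accession number formats can be told apart with simple pattern matching. The following is a minimal sketch, assuming upper-case prefix letters as in the examples; the function name `accession_format` is hypothetical.

```python
import re

# 1+5 format: one prefix letter followed by five digits, e.g. Y12345.
ACC_1_5 = re.compile(r"[A-Z][0-9]{5}")
# 2+6 format: two prefix letters followed by six digits, e.g. AC123456.
ACC_2_6 = re.compile(r"[A-Z]{2}[0-9]{6}")


def accession_format(accession: str) -> str:
    """Classify an accession number as '1+5', '2+6' or 'unknown'."""
    if ACC_1_5.fullmatch(accession):
        return "1+5"
    if ACC_2_6.fullmatch(accession):
        return "2+6"
    return "unknown"
```

Since existing 6-character accession numbers are never converted, software reading the database would need to accept both forms side by side.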
Protein and Nucleotide Sequence Identifiers (PIDs and NIDs)
Protein Sequence Identifiers
PIDs have been designed to identify the exact protein translation of a coding region (CDS). For full functionality they need to:
- be maintained collaboratively
- be portable amongst databases
- give access to an exact version without tying the user to one of the collaborating databases.
Problem:
For CDS features created before January 1996, EMBL entries displayed NCBI-assigned gi numbers. From that date on, EMBL assigned its own PIDs to EMBL entries, and GenBank has been carrying those in its records as well. However, the GenBank gi number for the protein (PID) continued to be shown in every record, including EMBL records where EMBL had also assigned a PID, and GenBank's retrieval system, which is based entirely on gi numbers, showed only those. For these reasons external users and external databases could not get the full functionality from the PID information.
Existing format example: /db_xref="PID:e123456789"
Therefore a proposal for a collaboratively maintained qualifier was presented by EMBL and the collaborators agreed on a new qualifier /protein_id:
Qualifier:   /protein_id="<identifier>"
Definition:  Protein identifier, issued by the international collaborators. This qualifier consists of a stable ID portion (3+5 format: 3 letters followed by 5 digits) plus a version number after the decimal point.
Example:     /protein_id="AAA12345.1"
Comment:     The version number changes only when the protein sequence coded by the CDS changes; the stable part remains unchanged. This qualifier is valid only on CDS features which translate into a valid protein. The list of 3-letter prefixes will be maintained by EBI.
Subject to synchronisation of this change with GenBank and DDBJ, implementation of this new form of identifier is planned for Release 53, December 1997.
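To illustrate the structure of the /protein_id qualifier, the following minimal sketch separates the stable ID from the version number. The names `ProteinId`, `parse_protein_id` and `same_protein` are hypothetical; the pattern assumes upper-case letters in the 3+5 stable part, as in the example AAA12345.1.

```python
import re
from typing import NamedTuple


class ProteinId(NamedTuple):
    stable_id: str  # 3 letters + 5 digits, unchanged across versions
    version: int    # changes only when the protein sequence changes


_PROTEIN_ID = re.compile(r'/protein_id="([A-Z]{3}[0-9]{5})\.([0-9]+)"')


def parse_protein_id(qualifier: str) -> ProteinId:
    """Split a /protein_id qualifier into stable ID and version."""
    match = _PROTEIN_ID.fullmatch(qualifier.strip())
    if match is None:
        raise ValueError("not a valid /protein_id qualifier: %r" % qualifier)
    return ProteinId(match.group(1), int(match.group(2)))


def same_protein(a: ProteinId, b: ProteinId) -> bool:
    """True if both identifiers refer to the same CDS translation,
    possibly in different versions."""
    return a.stable_id == b.stable_id
```

Keeping the version separate from the stable part lets a user either follow a protein across updates (compare stable IDs) or pin one exact translation (compare the full identifier).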
Nucleotide Sequence Identifiers
At the Collaborative Meeting in 1996, GenBank and EMBL agreed to attach version identifiers to nucleotide sequences. EMBL NIDs have been assigned to EMBL entries since June 1996; EMBL entries created before this date display NCBI-assigned gi numbers. (NCBI-assigned NIDs are used for DDBJ entries until DDBJ begins assigning its own NIDs.)
Problem:
NIDs are version identifiers for sequences, assigned by the database that owns the entry. Full functionality is only provided if these identifiers are maintained collaboratively and are portable amongst databases, giving the user access to an exact version without tying the user to one of the collaborating databases. Problems with the full functionality of NIDs arose because EMBL NIDs are currently not shown in GenBank flatfiles.
These issues have been discussed again, resulting in a proposal to replace the current NID for nucleotide sequences with an accession+version format analogous to the solution for proteins:
The nucleotide version number will start at 1 for the first version publicly released and be incremented every time the sequence is updated after that. Subject to synchronisation of this change with GenBank and DDBJ, it is planned to implement this new form of nucleotide identifier at release 53, December 1997.
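The versioning rule can be sketched as follows. This is only an illustration under stated assumptions: the class name `VersionedSequence` is hypothetical, the "accession.version" display with a dot is assumed by analogy with the protein identifier example, and the sketch assumes that an update which leaves the sequence itself unchanged does not increment the version.

```python
from dataclasses import dataclass


@dataclass
class VersionedSequence:
    accession: str
    sequence: str
    version: int = 1  # first publicly released version

    def update(self, new_sequence: str) -> None:
        """Store a new sequence and bump the version if it differs."""
        if new_sequence != self.sequence:
            self.sequence = new_sequence
            self.version += 1

    @property
    def identifier(self) -> str:
        # Assumed display form: accession, a dot, then the version.
        return "%s.%d" % (self.accession, self.version)
```

With this model the accession number stays stable for citation, while the full identifier pins one exact version of the sequence.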
Constructed Sequences
Together with EBI's collaborators at GenBank and DDBJ, ways of representing very long sequences, such as complete genomes or other sequences constructed from existing sequences, have been investigated. The conclusion is that a new division of the database, 'CON' for CONstructed sequences, will be created. It will contain entries which are built from others in the normal EMBL divisions. Entries in the CON division will initially contain no feature table or sequence data, but will include information about how the sequence is built from its components, e.g.:
CO join(Z46921:1..38990,Z38059:20..75317,Z46833:1..26142, Z38125:10..43504,Z46728:77..25913,Z37997:1..17731, Z38060:1..37730)
Additionally, for these constructions, a new operator gap() will be introduced, e.g.:
CO join(X00123:66..100, X00124:109..1002, gap(),X99199:990..10000)
where the gap is of unknown length or
CO join(X00123:66..100, X00124:109..1002, gap(2000),X99199:990..10000)
where the gap is of estimated length.
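The CO line examples above can be read mechanically. The following is a minimal sketch of a parser for such join() expressions, assuming only the two element forms shown (accession:start..end and gap() with an optional length) and a single-line expression; the names `Segment`, `Gap`, `parse_join` and `known_length` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Union


class Segment(NamedTuple):
    accession: str
    start: int
    end: int


class Gap(NamedTuple):
    length: Optional[int]  # None when the gap length is unknown


_SEGMENT = re.compile(r"([A-Z]+[0-9]+):([0-9]+)\.\.([0-9]+)")
_GAP = re.compile(r"gap\(([0-9]*)\)")


def parse_join(co_line: str) -> List[Union[Segment, Gap]]:
    """Parse 'CO join(...)' into a list of segments and gaps."""
    text = co_line.strip()
    if text.startswith("CO"):
        text = text[2:].strip()  # drop the line code
    if not (text.startswith("join(") and text.endswith(")")):
        raise ValueError("expected a join(...) expression")
    parts: List[Union[Segment, Gap]] = []
    for item in text[len("join("):-1].split(","):
        item = item.strip()
        segment = _SEGMENT.fullmatch(item)
        gap = _GAP.fullmatch(item)
        if segment:
            parts.append(Segment(segment.group(1),
                                 int(segment.group(2)),
                                 int(segment.group(3))))
        elif gap:
            parts.append(Gap(int(gap.group(1)) if gap.group(1) else None))
        else:
            raise ValueError("unrecognised element: %r" % item)
    return parts


def known_length(parts: List[Union[Segment, Gap]]) -> int:
    """Sum segment lengths and estimated gaps; unknown gaps count as 0."""
    total = 0
    for part in parts:
        if isinstance(part, Segment):
            total += part.end - part.start + 1
        elif part.length is not None:
            total += part.length
    return total
```

For the example with gap(2000), the three segments contribute 35, 894 and 9011 bases, so `known_length` gives 11940 including the estimated gap; with gap() of unknown length the same call gives 9940.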
WEBIN - New WWW Sequence Submission Tool
WebIn is the new WWW Sequence Submission Tool for submitting nucleotide sequence data and associated biological information to the EMBL Nucleotide Sequence Database at the European Bioinformatics Institute (EBI).
To access WebIn at the EBI please use the following URL:
http://www.ebi.ac.uk/submission/webin.html
Database entries created by the new WWW submission tool and submitted to the EMBL Nucleotide Sequence Database at the EBI will be exchanged and shared among the International Collaboration of Nucleotide Sequence Databases (DDBJ/EMBL/GenBank).
Migration of existing Annotation Tools to New Production Environment
The following technical changes, made over the last couple of months, affect the way the EMBL Nucleotide Sequence Database is maintained:
- Upgraded the Oracle RDBMS from version 6 to version 7
- Developed and started using a new, improved database schema
- Moved from a VMS platform to a Unix platform
- Upgraded SQL*Forms from version 3 to version 4.5 (GUI!)
Patent Data
The EMBL database has an ongoing collaboration with the European Patent Office (EPO) and in this context has been involved in the Patent Backfile and Frontfile Projects:
Patent Backfile Project
This project, which was finished during the course of last year, covered patent documents from 1960 to 1993. About 2400 sequence-bearing documents have been processed, resulting in the creation of about 30000 database entries (both nucleotide and protein). Since these documents had been provided by the EPO in hard-copy form only, this effort involved a significant data entry task.
Patent Frontfile Project
This project started after completion of the Backfile project. The EPO has been requesting electronic patent applications from its applicants since 1993, and we have created software to parse these electronic applications in a more automated fashion. This is an ongoing project, and EMBL Release 51 will contain the latest data from the patent frontfile project. At present, the period covered is for application dates from January 1993 to December 1995. The EPO's policy is to release data to the public (and to EMBL) 18 months after the application date, independent of whether a patent has been granted or not. At present we have created a total of 16190 entries from the frontfile data, of which about 10000 are nucleotide sequences.
Direct Submissions
Most journals now expect that DNA and amino acid sequences that appear in articles will be submitted to a sequence database before publication. A new generation of submission tools is now available:
- Webin - WWW Sequence Submission Tool
- Sequin - Multi-platform (Mac/PC/Unix) stand-alone software tool
- E-mail Submission Form
About 400 to 500 sequence submissions from individual scientists are received at the EBI every month, yielding between 1200 and 2000 new sequence entries.
Genome Project Data
Data from the genome sequencing projects appear in the public domain at an exhausting rate, creating a major new challenge: the assignment of function. A number of small genomes (prokaryotic and unicellular eukaryotic) have been finished, e.g.:
- Haemophilus influenzae
- Methanococcus jannaschii
- Mycoplasma genitalium
- Saccharomyces cerevisiae
- Escherichia coli
- etc.
The vast majority of data integrated into the database now originates from genome sequencing projects. As an example, the Sanger Centre (Wellcome Trust Genome Campus) contributed about 13% of the overall database growth between November 1996 and February 1997. After being substantially involved in the yeast genome sequencing effort, the centre's main projects are now C.elegans, Homo sapiens, S.pombe and pathogens.
List of Current Genome Projects submitting to the EBI
Sanger Centre C.elegans nematode project
Sanger Centre Brugia malayi
Sanger Centre S.pombe
Sanger Centre M.tuberculosis
Sanger Centre STS
Sanger Centre Ciona intestinalis
Sanger Centre Mycobacterium leprae
Sanger Centre Plasmodium falciparum
Sanger Centre Homo sapiens
ESSA MIPS Arabidopsis thaliana
European Drosophila Mapping Consortium
French Arabidopsis cDNA project GDR
Genexpress Genethon
HIV project Amsterdam
MHC project Tuebingen
MRC/HGMP Fugu GSS
Mycoplasma capricolum NCHGR
Padova University Human EST project
S.cerevisiae yeast project
UK Human Genome Mapping Project
Written by: Günther Stösser
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
EMBL Nucleotide Sequence Database (latest release) ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
WebIn direct submission tool http://www.ebi.ac.uk/submission/webin.html
The Sanger Centre http://www.sanger.ac.uk/
National Center for Biotechnology Information (NCBI) http://www.ncbi.nlm.nih.gov/
GenBank http://www.ncbi.nlm.nih.gov/Web/Search/index.html
Sequin -- DNA submission and update tool http://www.ncbi.nlm.nih.gov/Sequin/index.html
DNA Data Bank of Japan (DDBJ) http://www.nig.ac.jp/
External sites are not endorsed by EMBL-EBI