The EBI's nucleotide and protein sequence databases and services: current developments
by: Rolf Apweiler, Vivien Junker, Alain Gateau, Claire O'Donovan, Fiona Lang, Nicoletta Mitaritonna
The EMBL Outstation - The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
This article was first published on the POPE 95 CD-ROM: Perspectives in Protein Engineering Proceedings '95, published by Biodigm. Parts of this article have been updated to reflect latest changes and updates since then.
Abstract
Central activities of the European Bioinformatics Institute (EBI), a new outstation of the EMBL, are the development and distribution of the EMBL Nucleotide Sequence Database and, in collaboration with Amos Bairoch of the University of Geneva, of the SWISS-PROT Protein Sequence Data Bank. Over fifty additional specialist molecular biology databases, as well as software and documentation of interest to molecular biologists, are also distributed through EBI releases and network services. The EBI network services include database searching and sequence similarity searching facilities. The EBI is constantly extending the integration of biomolecular databases and developing new expert databases closely linked with both the EMBL database and SWISS-PROT. Probably the most interesting current development is the introduction of TREMBL, a computer-annotated protein database supplementing SWISS-PROT. TREMBL consists of entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL database, except the CDSs already included in SWISS-PROT. From SWISS-PROT release 33 onwards, TREMBL will be distributed with SWISS-PROT on CD-ROM and via the EBI network services (email fileserver, FTP server, gopher, WWW server, Sequence Retrieval System (SRS) and sequence search facilities such as BLITZ and Mail-FastA).
Introduction
The European Bioinformatics Institute (EBI) is an EMBL Outstation, located at the Wellcome Trust Genome Campus in Hinxton, near Cambridge, UK. Since September 1994, all activities previously based at the EMBL Data Library in Heidelberg, Germany, have been located at the EBI. The database services of the EBI are the continuation and extension of the EMBL Data Library. A central activity of the EBI is the development and distribution of the EMBL Nucleotide Sequence Database, Europe's primary nucleotide sequence data resource. The EBI also maintains and distributes the SWISS-PROT Protein Sequence database in collaboration with Amos Bairoch of the University of Geneva. Over fifty additional specialist molecular biology databases, as well as software and documentation of interest to molecular biologists, are also distributed through EBI releases and network services. The EBI network services include database searching and sequence similarity searching facilities. The EBI is constantly extending the integration of biomolecular databases and developing new expert databases closely linked with both the EMBL Database and SWISS-PROT. Probably the most interesting current development is the release of a computer-annotated protein database supplementing SWISS-PROT.
Contents:
EMBL nucleotide sequence database
SWISS-PROT protein sequence database
TREMBL - annotated supplements to SWISS-PROT
RHdb radiation hybrid mapping database
Other databases
Data acquisition and submission
Data distribution and contact points
References
WWW Links referred to here
Abbreviations
The EMBL Nucleotide Sequence Database
The main activity of the EMBL Nucleotide Sequence Database group is the development, maintenance and distribution of a comprehensive database of nucleotide sequences. The EMBL nucleotide sequence database, produced in collaboration with GenBank (NCBI, Bethesda, USA) and the DNA database of Japan (Mishima), is Europe's primary nucleotide sequence data resource. Each of these three groups collects a portion of the total sequence data reported world-wide. All new and updated database entries are exchanged between the groups on a daily basis. The database currently doubles in size every 12 months and, as of February 1997, contains over 696 million bases in 1,047,263 sequence entries.
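To give a feel for what a 12-month doubling time implies, the following minimal sketch projects the database size forward. It assumes, purely for illustration, that the doubling time stays constant. The function name and the chosen horizons are not from the database statistics.

```python
def projected_bases(current_bases, months_ahead, doubling_months=12.0):
    """Project database size assuming steady exponential growth.

    Assumption: the doubling time stays fixed at `doubling_months`.
    """
    return current_bases * 2 ** (months_ahead / doubling_months)


# Size reported in the text for February 1997 ("over 696 million bases").
FEB_1997_BASES = 696_000_000

for months in (6, 12, 24):
    estimate = projected_bases(FEB_1997_BASES, months)
    print(f"+{months:2d} months: about {estimate / 1e6:.0f} million bases")
```

These figures are extrapolations under the stated assumption, not reported release statistics.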
Important sources of data are genomic sequencing projects and other groups, such as phylogenetic research groups, who produce large quantities of new nucleotide sequence data. A collaboration with the European Patent Office has resulted in the capture of nucleotide and protein sequences which were published in patent documents between 1960 and 1993 and previously not publicly available in electronic form.
The complete database is distributed in quarterly releases on compact disc (CD-ROM). The database including daily additions of all new and updated entries is available via the EBI network services (see later) and from nodes of the European Molecular Biology Network (EMBnet).
The nucleotide sequence database entries are distributed in the EMBL flat-file format, which is supported by most sequence analysis software packages. A typical entry contains a sequence, a brief description for cataloging purposes, the taxonomic description of the source organism, bibliographic information, and the feature table, containing locations of coding regions and other biologically significant sites. The feature table follows the DDBJ/EMBL/GenBank Feature Table Definition (a copy of which can be retrieved from the EBI network server). Where appropriate, entries may also be cross-referenced to SWISS-PROT, Eukaryotic Promoter database, TransFac or FlyBase.
The SWISS-PROT Protein Sequence Data Bank
SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute).
The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. Release 34.0 of SWISS-PROT (October 1996) contains 59,021 sequence entries, comprising 21,210,389 amino acids abstracted from about 50,052 references.
The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria:
a) Annotation
In SWISS-PROT, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation. For each sequence entry the core data consists of the sequence data, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein), while the annotation consists of the description of the following items:
- Function(s) of the protein
- Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
- Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Disease(s) associated with deficiency(ies) in the protein
- Sequence conflicts, variants, etc.
The SWISS-PROT group tries to include as much annotation information as possible in SWISS-PROT. To obtain this information SWISS-PROT uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. SWISS-PROT also makes use of external experts, who have been recruited to send SWISS-PROT their comments and updates concerning specific groups of proteins.
In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by `topics'; this approach permits the easy retrieval of specific categories of data from the database.
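Because each line of an entry starts with a two-character line code, annotation can be pulled out of the flat file with very little code. The sketch below groups the lines of one entry by line code and lists the comment topics. The sample entry is invented and heavily abbreviated, not a real SWISS-PROT record. The sketch also assumes that comment topics are written as `-!- TOPIC: text` and that `//` terminates an entry.

```python
from collections import defaultdict

# Invented, heavily abbreviated entry for illustration only (not a real record).
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;      PRT;    12 AA.
CC   -!- FUNCTION: HYPOTHETICAL EXAMPLE PROTEIN.
CC   -!- SUBUNIT: MONOMER.
KW   HYPOTHETICAL PROTEIN.
FT   DOMAIN        1     12       EXAMPLE DOMAIN.
//
"""


def group_by_line_code(entry_text):
    """Return {line_code: [data, ...]} for a single flat-file entry."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):       # assumed entry terminator
            break
        code, data = line[:2], line[2:].strip()
        if code.strip():                # ignore lines without a line code
            grouped[code].append(data)
    return dict(grouped)


def comment_topics(grouped):
    """List the topic names found on CC lines of the form '-!- TOPIC: text'."""
    topics = []
    for data in grouped.get("CC", []):
        if data.startswith("-!-") and ":" in data:
            topics.append(data[3:].split(":", 1)[0].strip())
    return topics


entry = group_by_line_code(SAMPLE_ENTRY)
print(sorted(entry))
print(comment_topics(entry))
```

Grouping by line code first keeps the topic extraction independent of how the entry was read, so the same helper could be applied to the FT or KW lines.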
b) Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. SWISS-PROT tries as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
c) Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. SWISS-PROT is currently cross-referenced with 26 different databases.
Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT.
Table 1: List of the databases cross-referenced to SWISS-PROT
| Database | Description |
|---|---|
| EMBL Database | EMBL Nucleotide Sequence Database |
| DICTYDB | Dictyostelium discoideum genome database |
| ECO2DBASE | Escherichia coli gene-protein database (2D gel spots) |
| ECOGENE | Escherichia coli K12 genome database (EcoGene) |
| ENZYME | ENZYME data bank |
| FLYBASE | Drosophila genome database (FlyBase) |
| GCRDB | G-protein-coupled receptor database (GCRDB) |
| HIV | HIV sequence database |
| HSSP | Homology-derived secondary structure of proteins database (HSSP) |
| LISTA | Yeast (Saccharomyces cerevisiae) genome database |
| MAIZEDB | Maize genome database (MaizeDB) |
| MEDLINE | Medline from the National Library of Medicine (NLM) |
| MIM | Mendelian Inheritance in Man Database |
| PDB | Brookhaven Protein Data Bank |
| PHDP | The Radiation Hybrid Database |
| PIR | Protein sequence database of the Protein Information Resource |
| PROSITE | PROSITE dictionary of sites and patterns in proteins |
| REBASE | Restriction enzyme database |
| AARHUS/GHENT-2DPAGE | Human keratinocyte 2D gel protein database from Aarhus and Ghent universities |
| SGD | Saccharomyces Genome Database |
| STYGENE | Salmonella typhimurium LT2 genome database (StyGene) |
| SUBTILIST | Bacillus subtilis 168 genome database (SubtiList) |
| SWISS-2DPAGE | Human 2D Gel Protein Database from the University of Geneva |
| TRANSFAC | Transcription factor database (Transfac) |
| WORMPEP | Caenorhabditis elegans genome sequencing project protein database (Wormpep) |
| YEPD | Yeast electrophoresis protein database |
Model organisms in SWISS-PROT
We have selected a number of organisms which are the target of genome sequencing and/or mapping projects and for which we intend to:
- Be as complete as possible. All sequences available at a given time should be immediately included in SWISS-PROT. This also includes sequence corrections and updates.
- Provide a higher level of annotation.
- Cross-reference to specialized database(s) that contain, among other data, some genetic information about the genes which code for these proteins.
- Provide specific indices or documents.
The organisms currently selected are: Arabidopsis thaliana (mouse ear cress); Bacillus subtilis; Candida albicans; Caenorhabditis elegans (worm); Dictyostelium discoideum (slime mold); Drosophila melanogaster (fruit fly); Escherichia coli; Haemophilus influenzae; Homo sapiens (human); Mycobacterium tuberculosis; Mycoplasma genitalium; Saccharomyces cerevisiae (budding yeast); Salmonella typhimurium; Schizosaccharomyces pombe (fission yeast); Sulfolobus solfataricus. Details of the database entries for these organisms are given in Table 2.
Table 2: Organisms entered in the data bank
| Organism | Database | Index file | Number of sequences |
|---|---|---|---|
| A.thaliana | None yet | In preparation | 562 |
| B.subtilis | Subtilist | SUBTILIS.TXT | 1783 |
| C.albicans | None yet | CALBICAN.TXT | 124 |
| C.elegans | WormPep | CELEGANS.TXT | 1208 |
| D.discoideum | DictyDB | DICTY.TXT | 265 |
| D.melanogaster | FlyBase | In preparation | 910 |
| E.coli | EcoGene | ECOLI.TXT | 3606 |
| H.influenzae | None yet | HAEINFLU.TXT | 1591 |
| H.sapiens | MIM | MIMTOSP.TXT | 4000 |
| M.tuberculosis | None yet | None yet | 474 |
| M.genitalium | None yet | In preparation | 425 |
| S.cerevisiae | LISTA/SGD | YEAST.TXT | 4340 |
| S.typhimurium | StyGene | SALTY.TXT | 617 |
| S.pombe | None yet | POMBE.TXT | 956 |
| S.solfataricus | None yet | None yet | 42 |
Collectively these organisms represent ~35% of the total number of sequence entries in SWISS-PROT.
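The ~35% figure can be checked directly from the counts in Table 2. The short sketch below sums those counts and divides by the number of entries quoted for release 34.0. It assumes that the Table 2 counts refer to that same release, which the text does not state explicitly.

```python
# Number of sequences per organism, copied from Table 2.
TABLE2_COUNTS = {
    "A.thaliana": 562,
    "B.subtilis": 1783,
    "C.albicans": 124,
    "C.elegans": 1208,
    "D.discoideum": 265,
    "D.melanogaster": 910,
    "E.coli": 3606,
    "H.influenzae": 1591,
    "H.sapiens": 4000,
    "M.tuberculosis": 474,
    "M.genitalium": 425,
    "S.cerevisiae": 4340,
    "S.typhimurium": 617,
    "S.pombe": 956,
    "S.solfataricus": 42,
}

# Entry count quoted in the text for SWISS-PROT release 34.0 (October 1996).
# Assumption: Table 2 describes the same release.
SWISSPROT_ENTRIES = 59_021

model_total = sum(TABLE2_COUNTS.values())
share = model_total / SWISSPROT_ENTRIES
print(f"{model_total} of {SWISSPROT_ENTRIES} entries = {share:.1%}")
```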
In the last few months we have included in SWISS-PROT fully annotated versions of the protein sequence entries encoded on the complete genome of Haemophilus influenzae, as well as entries originating from the full sequence of yeast chromosomes I, II, III, V, VI, VII, VIII, IX, X, XI, XII, XIII, XV, and XVI.
Documentation files
SWISS-PROT is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The following table lists all the documents that are currently available or that will be added in the next few months.
| File name | Description |
|---|---|
| userman.txt | User manual |
| relnotes.txt | Release notes |
| submit.txt | Submission of sequence data to the SWISS-PROT Data Bank |
| shortdes.txt | Short description of entries in SWISS-PROT |
| jourlist.txt | List of abbreviations for journals cited |
| keywlist.txt | List of keywords in use |
| speclist.txt | List of organism identification codes |
| experts.txt | List of on-line experts for PROSITE and SWISS-PROT |
| acindex.txt | Accession number index |
| autindex.txt | Author index |
| citindex.txt | Citation index |
| keyindex.txt | Keyword index |
| speindex.txt | Species index |
| 7tmrlist.txt | List of 7-transmembrane G-linked receptors entries |
| aatrnasy.txt | List of aminoacyl-tRNA synthetases |
| allergen.txt | Nomenclature and index of allergen sequences |
| calbica.txt | Index of Candida albicans entries and their corresponding gene designations |
| cdlist.txt | CD nomenclature for surface proteins of human leucocytes |
| celegans.txt | Index of Caenorhabditis elegans entries and corresponding gene designations and Wormpep cross-references |
| dicty.txt | Index of Dictyostelium discoideum entries and corresponding gene designations and DictyDB cross-references |
| ec2dtosp.txt | Index of Escherichia coli Gene-protein database entries referenced in SWISS-PROT |
| ecoli.txt | Index of Escherichia coli K12 chromosomal entries and corresponding EcoGene cross-references |
| embltosp.txt | Index of EMBL Database entries referenced in SWISS-PROT |
| extradom.txt | Nomenclature of extracellular domains |
| glycosyl.txt | Index of glycosyl hydrolases classified by families on the basis of sequence similarities |
| haeinflu.txt | Index of Haemophilus influenzae RD chromosomal entries |
| hoxlist.txt | Vertebrate homeobox proteins: nomenclature and index |
| humchr21.txt | Index of protein sequence entries encoded on human chromosome 21 |
| humchr22.txt | Index of protein sequence entries encoded on human chromosome 22 |
| humchry.txt | Index of protein sequence entries encoded on human chromosome Y |
| mimtosp.txt | Index of MIM entries referenced in SWISS-PROT |
| nomlist.txt | List of nomenclature related references for proteins |
| pdbtosp.txt | Index of Brookhaven PDB entries referenced in SWISS-PROT |
| peptidas.txt | Classification of peptidase families and index of peptidases entries |
| plastid.txt | List of chloroplast and cyanelle encoded proteins |
| pombe.txt | Index of Schizosaccharomyces pombe entries in SWISS-PROT and corresponding gene designations |
| restric.txt | List of restriction enzymes and methylases entries |
| ribosomp.txt | Index of ribosomal proteins classified by families on the basis of sequence similarities |
| salty.txt | Index of Salmonella typhimurium LT2 chromosomal entries and corresponding StyGene cross-references |
| subtilis.txt | Index of Bacillus subtilis 168 chromosomal entries and corresponding SubtiList cross-references |
| yeast.txt | Index of Saccharomyces cerevisiae entries and corresponding gene designations |
| yeast1.txt | Yeast Chromosome I entries |
| yeast2.txt | Yeast Chromosome II entries |
| yeast3.txt | Yeast Chromosome III entries |
| yeast5.txt | Yeast Chromosome V entries |
| yeast6.txt | Yeast Chromosome VI entries |
| yeast8.txt | Yeast Chromosome VIII entries |
| yeast9.txt | Yeast Chromosome IX entries |
| yeast10.txt | Yeast Chromosome X entries |
| yeast11.txt | Yeast Chromosome XI entries |
TREMBL - a computer-annotated supplement to SWISS-PROT
Ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, a supplement to SWISS-PROT was introduced. The first full release of TREMBL (TRanslation of EMBL nucleotide sequence database) was introduced with release 34 of SWISS-PROT. TREMBL consists of entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT.
The current TREMBL contains 116,379 sequence entries, comprising 31,293,053 amino acids, and is split into two main sections: SP-TREMBL (SWISS-PROT TREMBL) which contains entries which will be added after complete annotation to SWISS-PROT and REM-TREMBL (REMaining TREMBL) which contains entries not for inclusion in SWISS-PROT.
Most of the 116,379 sequence entries currently in SP-TREMBL are additional sequence reports of entries already in SWISS-PROT and will lead to updates of these SWISS-PROT entries. However, some 20,000 to 40,000 entries now in SP-TREMBL will eventually be included as new sequence entries in SWISS-PROT.
Identical sequences in SP-TREMBL from the same species have been merged to reduce redundancy. Currently we are working on a further reduction of redundancy by establishing rules to merge sub-fragments with full-length sequences and also for the identification of sequence differences due to polymorphisms, strain variations and sequencing errors with the goal of eventually establishing rules to merge conflicting sequence reports about one and the same sequence into one entry.
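The first redundancy step described above, merging identical sequences from the same species, can be sketched as a grouping on the pair (species, sequence). This is only an illustration of the idea and not the production procedure. The record layout, field names and `HYP...` identifiers are hypothetical.

```python
def merge_identical(records):
    """Merge records that share both species and sequence.

    Each merged entry keeps the identifiers of all contributing reports.
    """
    merged = {}
    for record in records:
        key = (record["species"], record["sequence"])
        entry = merged.setdefault(
            key,
            {"species": record["species"],
             "sequence": record["sequence"],
             "sources": []},
        )
        entry["sources"].append(record["source_id"])
    return list(merged.values())


# Hypothetical records; identifiers and sequences are invented.
RECORDS = [
    {"source_id": "HYP0001", "species": "Homo sapiens", "sequence": "MKTAYIAK"},
    {"source_id": "HYP0002", "species": "Homo sapiens", "sequence": "MKTAYIAK"},
    {"source_id": "HYP0003", "species": "Mus musculus", "sequence": "MKTAYIAK"},
    {"source_id": "HYP0004", "species": "Homo sapiens", "sequence": "MKTAYIAR"},
]

merged_entries = merge_identical(RECORDS)
for entry in merged_entries:
    print(entry["species"], entry["sequence"], entry["sources"])
```

Only exact matches are merged here. Sub-fragments and sequences differing through polymorphisms or sequencing errors would need the further rules that the text describes as work in progress.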
For SP-TREMBL to act as a computer-annotated supplement to SWISS-PROT, new procedures have been introduced whereby valuable annotation has been added automatically. EMBL entries contain information that could, and indeed should, be added to the SP-TREMBL entry as a way to enhance annotation content. Procedures have been developed to extract all relevant information and to put this into the SP-TREMBL entries. This information comes from the EMBL DR, RX, DE, and KW lines and from an assortment of lines in the feature table. A range of sequence analysis tools and the PROSITE pattern database are also used to detect any consensus sequences/motifs present. Tools were developed to use this analysis for adding information about the potential function of the protein, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location, and other annotation to the entry whenever appropriate. We also make use of the ENZYME database, using the EC number as a reference point. Information such as catalytic activity, cofactors and relevant keywords can be taken from ENZYME and added automatically to SP-TREMBL entries. Furthermore, we make use of specialized databases to parse information like the correct gene nomenclature into TREMBL entries. We are currently investigating methods for scanning Medline abstracts for relevant information that can be added automatically.
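As an illustration of motif detection with PROSITE-style patterns, the sketch below translates a pattern into a Python regular expression and scans a sequence with it. It is not the tool set used for TREMBL, which the text does not name. It assumes the usual PROSITE conventions (`x` for any residue, `[..]` for alternatives, `{..}` for exclusions, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors) and does not handle rarer constructs such as `>` inside square brackets. The test sequence is invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except those listed
        # [ABC] and single-letter residues are already valid regex syntax.
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 1-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() + 1 for m in regex.finditer(sequence)]


# N-glycosylation site pattern in PROSITE notation; the sequence is invented.
N_GLYCOSYLATION = "N-{P}-[ST]-{P}."
print(prosite_to_regex(N_GLYCOSYLATION))
print(find_motif(N_GLYCOSYLATION, "AANGTSAANPTA"))
```

Note that `finditer` reports non-overlapping matches only, so a production scanner would need a different search strategy for overlapping sites.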
REM-TREMBL (REMaining TREMBL) contains the entries (16,586) that we do not wish to include in SWISS-PROT. This section is organized into four subsections:
- Most REM-TREMBL entries are immunoglobulins and T-cell receptors. We stopped entering immunoglobulins and T-cell receptors into SWISS-PROT, because we want to keep only germ line gene-derived translations of these proteins in SWISS-PROT and not all known somatic recombinant variations of these proteins, as this would bias database-wide searches. At the moment, there are more than 11,000 immunoglobulins and T-cell receptors in TREMBL. We would like to create a specialized database, IMGT-TREMBL (ImMunoGeneTics-TREMBL), dealing with these sequences as a further supplement to SWISS-PROT and keep only a representative cross-section in SWISS-PROT.
- Another category of data which will not be included in SWISS-PROT are synthetic sequences (SWISS-PROT represents only naturally occurring sequences). Again, we do not want to leave these entries in TREMBL. Ideally one should build a specialized database for artificial sequences as a further supplement to SWISS-PROT.
- A third subsection consists of fragments with fewer than seven amino acids.
- The last subsection consists of CDS translations where we have strong evidence to believe that these CDS do not code for real proteins.
The production of TREMBL has emphasised the importance of linking not only to the whole EMBL entry but also to individual features within the EMBL entry. This point is highlighted by the numerous genome projects that are currently submitting sequences to the EMBL/GenBank/DDBJ Nucleotide Sequence Database. As these projects continue, longer contiguous sequences will be submitted. These longer contigs will contain many more CDS features, resulting in many more SWISS-PROT/SP-TREMBL entries. In this context, the need for linking at the CDS feature level is evident. This linking has now been achieved by using the PID, the Protein IDentification number found in the /db_xref qualifier tagged to every CDS in the EMBL nucleotide sequence database. The DR lines of SWISS-PROT and TREMBL entries pointing to an EMBL database entry now cite the EMBL AC number as the primary identifier and the PID as the secondary identifier. In all cases where a PID is already integrated into SWISS-PROT, a /db_xref qualifier citing the corresponding SWISS-PROT entry is added to the EMBL nucleotide sequence database CDS feature labelled with this PID. In the remaining cases the /db_xref qualifier points to the corresponding TREMBL entry.
For example, the SWISS-PROT entry with accession number P10662 and the DR line:
DR EMBL; M15160; G171969; -.
is represented in EMBL as:
FT   CDS             80..1045
FT                   /db_xref="PID:g171969"
FT                   /db_xref="SWISS-PROT:P10662"
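The reciprocal link in this example can be followed programmatically. The sketch below parses the DR line and the /db_xref qualifiers quoted above and checks that they point at each other. The helper names are illustrative, and the column spacing of the sample lines is an assumption about the flat-file layout.

```python
import re

# Lines taken from the example above (column spacing is assumed).
DR_LINE = "DR   EMBL; M15160; G171969; -."
FT_LINES = [
    'FT   CDS             80..1045',
    'FT                   /db_xref="PID:g171969"',
    'FT                   /db_xref="SWISS-PROT:P10662"',
]


def parse_dr_line(line):
    """Split a DR line into (database, primary_id, secondary_id)."""
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    return fields[0], fields[1], fields[2]


def parse_db_xrefs(ft_lines):
    """Collect /db_xref qualifiers as {database: identifier}."""
    xrefs = {}
    for line in ft_lines:
        match = re.search(r'/db_xref="([^":]+):([^"]+)"', line)
        if match:
            xrefs[match.group(1)] = match.group(2)
    return xrefs


database, embl_accession, pid = parse_dr_line(DR_LINE)
xrefs = parse_db_xrefs(FT_LINES)

# The PID is written with different letter case on the two sides of the link.
links_agree = pid.lower() == xrefs["PID"].lower()
print(database, embl_accession, pid, xrefs, links_agree)
```

Comparing the PID case-insensitively is a choice made for this example, because the two quoted lines differ in the case of the leading letter.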
This allows an even deeper integration throughout the world of biomolecular databases and to a much finer level of detail than before. This concept of deeper integration, which subsequently leads to a wider scope of other available information, can be illustrated as follows:
FT CDS x..y
- /db_xref="PID:" ->
- /db_xref="SWISS-PROT:" -> SWISS-PROT -> linked to 25 different databases
- /db_xref="SGD:" -> Saccharomyces Genome Database -> cosmid; clones; named genes;
- /db_xref="Flybase:" -> Flybase -> maps; and a whole assortment of other information
- /db_xref="GDB:" -> Human Genome Database -> named genes; gene analysis; nearby genes & markers
- /db_xref="MIM:" -> Mendelian Inheritance in Man -> gene maps; genetic disorders
This approach enables us to point precisely from a given SWISS-PROT or TREMBL entry to one of potentially many CDS in the corresponding EMBL entry, and vice versa. This change will allow the development of software tools that automatically retrieve that part of a nucleotide sequence entry that codes for a specific protein. This will be especially useful in the context of the World Wide Web, as it will render obsolete the current situation where, for example, one needs to retrieve the complete sequence of a yeast chromosome when one wants the nucleotide sequence coding for a specific protein encoded on that chromosome.
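A tool of the kind envisaged here would, at its core, cut the coding region out of the nucleotide entry using the CDS location. The following minimal sketch does this for plain `start..end` locations with 1-based inclusive coordinates. It does not handle compound locations such as joins or complementary strands, and the toy sequence is invented.

```python
def parse_simple_location(location):
    """Parse a plain 'start..end' location into 1-based inclusive integers."""
    start_text, end_text = location.split("..")
    return int(start_text), int(end_text)


def extract_cds(sequence, location):
    """Return the part of `sequence` covered by a simple CDS location."""
    start, end = parse_simple_location(location)
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"location {location} lies outside the sequence")
    # Convert 1-based inclusive coordinates to a Python slice.
    return sequence[start - 1:end]


# Invented toy sequence: two flanking bases on each side of a short CDS.
TOY_SEQUENCE = "ccATGAAATAAgg"
print(extract_cds(TOY_SEQUENCE, "3..11"))

# The CDS location 80..1045 quoted earlier spans this many bases:
start, end = parse_simple_location("80..1045")
print(end - start + 1)
```

With PID-level links, only the matching CDS feature needs to be located before this step, rather than every coding region in a chromosome-sized entry.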
Moreover, the concepts outlined share a common goal: to link features from one dataset to all other relevant datasets. This is a goal that we are determined to achieve at the EBI, not only with SWISS-PROT but also with its supplement TREMBL. Along with the development of tools to achieve automatic addition of relevant information, we have achieved a much deeper integration with the EMBL Nucleotide Sequence Database, which serves to enhance our close collaboration.
The Radiation Hybrid mapping database
The Radiation Hybrid database (RHdb) is a new development at the EBI. This database is an archive of raw data (i.e. PCR results on radiation hybrid panels) with links to other related databases. All cross-references known to the authors or the database maintainers are included. Users can also query the relational database directly (on the World Wide Web), either by using a set of pre-compiled queries or by writing their own ad hoc queries. The database is distributed in a file format similar to that of the EMBL database, with which it is fully cross-referenced. It is distributed on CD-ROM twice a year and can also be retrieved between CD-ROM releases via the EBI network servers (see below).
Submissions to this database are made using a standard format. Various export formats are supported, as well as different ways of accessing the data. The traditional flat file format is used to export text data on a regular basis.
The database is exported in four files:
- panel: hybrid panels, which are sets of clones (radiation hybrid cells)
- rh: the hybridization raw data
- exp: the experimental conditions
- map: the maps
The current working release is 9.0 and contains the following information:
- 38386 RH entries composed of:
  - 19969 ESTs
  - 2156 Généthon Genetic markers
  - 3129 Genetic markers
  - 336 Entirely sequenced cDNA
  - 744 CHLC Genetic markers
  - 3100 Alternative STSs created from genetic loci
  - 1638 STS of no known genetically polymorphic or expressed element
  - 1 Marker found in CpG islands
- 23 Maps
- 69 entries describing experimental conditions
- 123023 cross references to the following databases:
  - ATCC
  - 19 to CGM-WUSM
  - 849 to CHLC
  - 29722 to GDB
  - 5803 to Genethon
  - 91 to Genexpress
  - 1 to IMAGE
  - 375 to KDRI
  - 1238 to NCHGR
  - 109 to PAGE
  - 10286 to RHalloc
  - 9854 to RHdb
  - 284 to SALK
  - 11934 to SHGC
  - 6586 to Sanger_STS
  - 3479 to TIGR
  - 188 to UCHSC
  - 125 to UT
  - 13948 to WICGR
  - 2787 to WTCHG
  - 12216 to dbEST
  - 13123 to dbSTS
Other databases
The ImMunoGeneTics database
The ImMunoGeneTics database (IMGT) is a database containing nucleotide sequence information of genes important in the function of the immune system. It collects and annotates sequences belonging to the immunoglobulin superfamily which are involved in immune recognition. IMGT is produced and maintained in a collaboration of the EBI with three other laboratories in Europe [LIGM (FR), ICRF (UK), Univ. of Köln (DE)]. It is distributed on CD-ROM twice a year and can also be retrieved between CD-ROM releases via the EBI network servers (see below).
The Bio-Catalog
The Bio-Catalog is a list of software of general interest in molecular biology and genetics. First developed at CEPH/Genethon, it is now maintained and distributed by the EBI. In addition to this database, the EBI maintains a repository of biology-related software on its network servers. This software is also distributed once a year on CD-ROM.
EBI distributed databases
The EBI is a major distributor of molecular biological databases produced by other groups in Europe and world-wide. More than 50 databases are available via the EBI network and 30 of them are included on CD-ROM (see Table 3). The EBI also mirrors dbEST, a database of Expressed Sequence Tags developed at the NCBI, offering query and retrieval access through the World Wide Web.
Table 3: Databases distributed by EBI and the mechanism of distribution in each case
| Database | Description | CD-ROM | WWW |
|---|---|---|---|
| ALU | ALU sequences and alignments | | WWW |
| BERLIN | 5S rRNA sequences | CD-ROM | WWW |
| BLOCKS | Protein Blocks Database | CD-ROM | WWW |
| CPGISLE | CpG Islands database | CD-ROM | WWW |
| CUTG | Codon usage tabulated from GenBank | CD-ROM | WWW |
| DSSP | Secondary structure digests of PDB files | CD-ROM | |
| ECD | E. coli map database | CD-ROM | WWW |
| EMBL Database | EMBL Nucleotide Sequence Database | CD-ROM | WWW |
| ENZYME | ENZYME data bank | CD-ROM | WWW |
| EPD | Eukaryotic promoter database | CD-ROM | WWW |
| FANS-REF | Functional analysis bibliography | CD-ROM | |
| FLYBASE | Drosophila genome database (FlyBase) | CD-ROM | WWW |
| HAEMB | Haemophilia B database of mutations | CD-ROM | WWW |
| HLA | HLA class I and II sequence database | CD-ROM | |
| HSSP | Homology-derived secondary structure of proteins database | CD-ROM | WWW |
| IMGT | Immunogenetics database | CD-ROM | WWW |
| LIMB | Listing of mol. biology databases | CD-ROM | WWW |
| KABAT | Proteins of immunological interest | CD-ROM | WWW |
| METHYL | Site specific methylation | CD-ROM | WWW |
| PDB | Brookhaven Protein Data Bank | CD-ROM | |
| PKCDD | Protein kinase catalytic domains | CD-ROM | WWW |
| PROSITE | PROSITE dictionary of sites and patterns in proteins | CD-ROM | WWW |
| REBASE | Restriction enzyme database | CD-ROM | WWW |
| RELIB | Restriction enzyme library | CD-ROM | |
| RLDB | Reference Library Database | CD-ROM | WWW |
| RRNA | Small subunit rRNA sequences | CD-ROM | WWW |
| SEQANALREF | Sequence analysis bibliography | CD-ROM | WWW |
| SMALLRNA | Small RNA sequences | CD-ROM | WWW |
| SRP | Signal recognition particle database | CD-ROM | WWW |
| SWISS-PROT | Protein sequence database | CD-ROM | WWW |
| TFD | Transcription Factor Database | CD-ROM | WWW |
| TRANSFAC | Transcription factor database (Transfac) | CD-ROM | WWW |
| TRANSTERM | Translation termination signals | CD-ROM | WWW |
| TRNA | tRNA sequences | CD-ROM | WWW |
| 3D-ALI | Structure-based sequence alignments | CD-ROM | |
Data acquisition
Today, approximately 95% of all nucleotide sequence data is directly submitted to one of the collaborating databases (EMBL, GenBank and DDBJ). The entries created by each group are exchanged on a daily basis. The remaining 5% are still extracted from the literature (especially patent documents), which is a time-consuming and error-prone task.
Direct submissions
The EBI provides a number of different mechanisms for the direct submission of data (see Table 4). Direct submission of sequence data to the nucleotide sequence databases is the primary means of data acquisition. Sequences submitted can be released either immediately after processing or upon publication. In general, unless otherwise directed by the author, submitted sequences are available to the research community before the sequence appears in a journal. One of the direct submission mechanisms is via the Authorin program, which allows authors to prepare their data interactively using MS-DOS or Macintosh computers. The Authorin program can be obtained on diskettes from NCBI (GenBank/NCBI, NIH, Bldg 38A, Bethesda, MD 20894 USA; email: authorin@ncbi.nlm.nih.gov) or electronically from the EBI network server. The Direct Submission Form can also be used for nucleotide sequence submissions. It can be obtained from the EBI network server or by contacting the EBI directly, and a copy is also published periodically in relevant journals. This submission form can either be sent to the EBI by post or by electronic mail. A new submission system has been developed at the EBI using the World Wide Web (WWW). The URL for this system is http://www.ebi.ac.uk/subs/emblsubs.html.
With regard to submission to SWISS-PROT, there is an automatic data flow from the nucleotide sequence databases to the protein database via the computer-annotated supplement, TREMBL. Therefore protein sequences should only be submitted directly to SWISS-PROT when the peptide(s) have been sequenced. This data can be submitted via the Authorin program or the Direct Submission Form as above.
To submit data to SWISS-PROT and for all enquiries regarding submission, one should contact:
datasubs@ebi.ac.uk (for submission)
junker@ebi.ac.uk (for enquiries)
Table 4: Summary of submission mechanisms for the EMBL database
| Databases | Submission Methods |
| EMBL Nucleotide Sequence Database | Authorin; Direct Submission Form; WWW submission |
| SWISS-PROT | Authorin; Direct Submission Form |
Submission accounts
For groups producing large volumes of nucleotide sequence data over an extended period, submission accounts can be established with the EBI. A submission protocol is agreed upon, and database entries produced at the research site can be deposited and updated directly by the originating group via FTP. A number of genome projects and research groups have established submission accounts in the past few years, and the procedure has proved flexible and efficient both for the research groups and for database staff. Each submission account is 'curated' by EBI biologists, who check that new entries follow database annotation conventions and are consistent with other entries from the same project. The curator also serves as an informed liaison between the sequencing group and the database. A list of groups that already submit data using this method, or are expected to begin doing so in the near future, is given below.
- European Drosophila Mapping Consortium
- French Arabidopsis cDNA project GDR
- Genexpress Genethon (FR)
- Genethon (FR)
- Genexpress Munich (DE)
- HIV project Amsterdam (NL)
- MHC project Tuebingen
- Mycoplasma capricolum NCHGR
- Sanger Centre (UK), C.elegans nematode project
- Sanger Centre (UK) Human genome project
- Sanger Centre (UK) Mycobacterium tuberculosis
- Sanger Centre (UK) S.pombe project
- Sanger Centre (UK) Yeast Chromosome IV
- Sanger Centre (UK) Yeast Chromosome IX
- Sanger Centre (UK) Yeast Chromosome XIII
- Sanger Centre (UK) Yeast Chromosome XVI
- UK Human Genome Mapping Project
- Radiation Hybrid Mapping Consortium
Sequences from patent literature
The protein and nucleotide sequence data reported in the patent literature since 1960 have now been processed, with more than 25,000 protein and nucleotide sequences captured (with first priority given to those from outside the USA and Japan). It should be noted that only a portion of the patent entries are suitable for inclusion in the EMBL nucleotide sequence database; the others are made available in a separate file. The EBI and the European Patent Office (EPO) are collaborating on new measures to ensure that patent sequences appear in the public databases with less delay in the future. Since September 1993, the EPO has required that protein and nucleotide sequences appearing in patent applications be submitted in electronic form, which greatly speeds the incorporation of these sequences into the database as they become publicly available.
Journal-scanning activities
Mandatory sequence submission requirements on the part of many journals, the regular practice of publishing database accession numbers in papers, and the early distribution of 'Table of Contents' listings by some of the most important journals have greatly enhanced the effectiveness of the EBI journal-scanning activities over the past years. The EBI continues to scan all major European molecular biology journals, but the activity is directed more towards updating bibliographic references in existing (submitted) entries than towards capturing new sequences. Unfortunately, a small percentage of published sequence data still has not been submitted to any of the three collaborating databases. When these sequences are identified, the authors are contacted and asked to submit their data. The database regularly makes use of entries produced by the NCBI journal-scanning operations, both for updating bibliographic references in existing entries and for including the NCBI entries in the database when no submission exists.
Data distribution
CD-ROM
CD-ROMs are distributed quarterly as a set of compact discs written in the international ISO 9660 standard format. There is a separate CD-ROM distribution for EMBL and SWISS-PROT databases.
The collaborative databases are distributed on a separate CD-ROM twice a year (see Table 3 for the list of databases included). Software for data query and retrieval is also provided on the CD-ROM.
The programs EMBL-Search for Macintosh and for Windows allow data access by entry name, accession number, keyword, citation, author name, taxonomic classification, database cross-reference, free text, and date. EMBL-Search also provides access to the PROSITE and ENZYME databases, and enables navigation between related entries via the cross-references built into these databases. It uses binary indices whose structure is documented and therefore available for other software systems. The sequence databases are also provided in NBRF format for use with software such as FASTA on Macintosh or MS-DOS systems.
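As an illustration of what reading the NBRF-format files might involve, the following Python sketch parses entries laid out in the commonly documented NBRF/PIR style: a header line of the form `>P1;ID`, a title line, and sequence lines terminated by `*`. This layout is an assumption taken from general descriptions of the format rather than from this article, and the sample entry, the `NbrfEntry` type and the `parse_nbrf` function are hypothetical names introduced here; the files on the CD-ROM may differ in detail.

```python
from typing import Iterator, NamedTuple


class NbrfEntry(NamedTuple):
    seq_type: str   # two-character type code from the header, e.g. "P1"
    entry_id: str   # identifier following the semicolon
    title: str      # free-text description line
    sequence: str   # residues with spaces and the terminating "*" removed


def parse_nbrf(text: str) -> Iterator[NbrfEntry]:
    """Yield entries from text in the assumed NBRF/PIR layout."""
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue  # skip blank lines or anything outside an entry
        seq_type, _, entry_id = line[1:].partition(";")
        title = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            chunk = seq_line.replace(" ", "").strip()
            if chunk.endswith("*"):  # "*" marks the end of the sequence
                residues.append(chunk[:-1])
                break
            residues.append(chunk)
        yield NbrfEntry(seq_type.strip(), entry_id.strip(), title, "".join(residues))


# Hypothetical sample entry, invented for this sketch.
SAMPLE = """>P1;EXAMPLE1
Hypothetical example protein
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
"""

if __name__ == "__main__":
    for entry in parse_nbrf(SAMPLE):
        print(entry.entry_id, entry.title, len(entry.sequence))
```

A generator is used so that a large database file could be processed one entry at a time rather than held in memory as a whole.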
EBI network services
In addition to archiving sequence and genome data, the EBI provides an ever-expanding number of free network services to external users. The EMBL nucleotide sequence database, the SWISS-PROT protein sequence data bank and the other EBI databases are currently accessible via electronic mail fileserver, FTP, and World Wide Web (WWW). New and updated entries from all three collaborating nucleotide sequence databases are added daily to the network servers, making it possible to retrieve entries and perform sequence similarity searches on the very latest nucleotide data. Weekly additions of new and updated SWISS-PROT entries are also available.
The complete collection of additional specialist molecular biology databases is also available. Complementing these extensive data resources is a collection of molecular biology software for MS-DOS, Macintosh, VMS and UNIX. Documents such as subscription and submission forms, and the DDBJ/EMBL/GenBank Feature Table Definition, can also be retrieved.
EBI network fileserver
The EBI network fileserver enables access via electronic mail (e-mail) to the EMBL nucleotide sequence database, the SWISS-PROT protein sequence data bank and to the full collection of other databases, public domain software and documentation maintained by EBI. Items are retrieved from the server by sending a command in an e-mail message to the fileserver address. Detailed instructions on using the fileserver, and a current list of contents, can be obtained by sending a message to the Internet address Netserv@ebi.ac.uk with the word HELP in the body of the message. A full set of instructions will be returned automatically.
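The HELP request described above could be composed along the following lines. This is a minimal Python sketch that only builds the message and does not send it: the fileserver address and the HELP command come from the text, while the function name `build_fileserver_request`, the sender address and the decision to leave the subject empty are assumptions made for illustration.

```python
from email.message import EmailMessage

FILESERVER_ADDRESS = "Netserv@ebi.ac.uk"  # address given in the text


def build_fileserver_request(sender: str, command: str = "HELP") -> EmailMessage:
    """Compose a plain-text message whose body is a single fileserver command."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = FILESERVER_ADDRESS
    # The text only specifies the message body, so no subject is set here
    # (an assumption of this sketch).
    msg.set_content(command + "\n")
    return msg


if __name__ == "__main__":
    # Hypothetical sender address; replace with a real mailbox before sending.
    request = build_fileserver_request("user@example.org")
    print(request)
```

Actually delivering the message would require handing it to a mail transport, for example via the standard `smtplib` module and a locally configured mail server.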
EBI FTP server
This is the main route for retrieving the EMBL nucleotide sequence database, the SWISS-PROT protein sequence data bank and other databas |