A publication of EMBL - Outstation Hinxton, The European Bioinformatics Institute

The EBI's nucleotide and protein sequence databases and services: current developments

by: Rolf Apweiler, Vivien Junker, Alain Gateau, Claire O'Donovan, Fiona Lang,
Nicoletta Mitaritonna

The EMBL Outstation - The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

This article was first published on the POPE 95 CD-ROM: Perspectives in Protein Engineering Proceedings '95, published by Biodigm. Parts of this article have been updated to reflect latest changes and updates since then.

Abstract

Central activities of the European Bioinformatics Institute (EBI), a new outstation of the EMBL, are the development and distribution of the EMBL Nucleotide Sequence Database and, in collaboration with Amos Bairoch of the University of Geneva, of the SWISS-PROT Protein Sequence Data Bank. Over fifty additional specialist molecular biology databases, as well as software and documentation of interest to molecular biologists, are also distributed through EBI releases and network services. The EBI network services include database searching and sequence similarity searching facilities. The EBI is constantly extending the integration of biomolecular databases and developing new expert databases closely integrated with both the EMBL database and SWISS-PROT. Probably the most interesting current development is the introduction of TREMBL, a computer-annotated protein database supplementing SWISS-PROT. TREMBL consists of entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL database, except the CDSs already included in SWISS-PROT. From SWISS-PROT release 33 onwards, TREMBL will be distributed with SWISS-PROT on CD-ROM and via the EBI network services (email fileserver, FTP server, gopher, WWW server, Sequence Retrieval System (SRS) and sequence search facilities such as BLITZ and Mail-FastA).

Introduction

The European Bioinformatics Institute (EBI) is an EMBL Outstation, located at the Wellcome Trust Genome Campus in Hinxton, near Cambridge, UK. Since September 1994, all activities previously based at the EMBL Data Library in Heidelberg, Germany, have been located at the EBI. The database services of the EBI are the continuation and extension of the EMBL Data Library. A central activity of the EBI is the development and distribution of the EMBL Nucleotide Sequence Database, Europe's primary nucleotide sequence data resource. The EBI also maintains and distributes the SWISS-PROT Protein Sequence database in collaboration with Amos Bairoch of the University of Geneva. Over fifty additional specialist molecular biology databases, as well as software and documentation of interest to molecular biologists, are also distributed through EBI releases and network services. The EBI network services include database searching and sequence similarity searching facilities. The EBI is constantly extending the integration of biomolecular databases and developing new expert databases closely integrated with both the EMBL Database and SWISS-PROT. Probably the most interesting current development is the release of a computer-annotated protein database supplementing SWISS-PROT.

The EMBL Nucleotide Sequence Database

The main activity of the EMBL Nucleotide Sequence Database group is the development, maintenance and distribution of a comprehensive database of nucleotide sequences. The EMBL nucleotide sequence database, produced in collaboration with GenBank (NCBI, Bethesda, USA) and the DNA Data Bank of Japan (DDBJ, Mishima), is Europe's primary nucleotide sequence data resource. Each of these three groups collects a portion of the total sequence data reported world-wide. All new and updated database entries are exchanged between the groups on a daily basis. The database currently doubles in size every 12 months and, as of February 1997, contains over 696 million bases in 1,047,263 sequence entries.
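
As a rough illustration of what a 12-month doubling time implies, the following sketch extrapolates the February 1997 figure forward. It assumes the doubling rate stays constant, which is an assumption of this example and not a forecast made here; the function name projected_bases is likewise ours.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 12.0) -> float:
    """Extrapolate database size, assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# "Over 696 million bases" in February 1997; used here as a round starting value.
FEB_1997_BASES = 696e6

if __name__ == "__main__":
    for months in (6, 12, 24):
        size = projected_bases(FEB_1997_BASES, months)
        print(f"+{months} months: about {size / 1e6:.0f} million bases")
```

Under this assumption the database would hold roughly 1.4 billion bases one year later; any change in sequencing throughput would shift that figure.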

Important sources of data are genomic sequencing projects and other groups, such as phylogenetic research groups, who produce large quantities of new nucleotide sequence data. A collaboration with the European Patent Office has resulted in the capture of nucleotide and protein sequences which were published in patent documents between 1960 and 1993 and previously not publicly available in electronic form.

The complete database is distributed in quarterly releases on compact disc (CD-ROM). The database including daily additions of all new and updated entries is available via the EBI network services (see later) and from nodes of the European Molecular Biology Network (EMBnet).

The nucleotide sequence database entries are distributed in the EMBL flat-file format, which is supported by most sequence analysis software packages. A typical entry contains a sequence, a brief description for cataloguing purposes, the taxonomic description of the source organism, bibliographic information, and the feature table, which contains the locations of coding regions and other biologically significant sites. The feature table follows the DDBJ/EMBL/GenBank Feature Table Definition (a copy of which can be retrieved from the EBI network server). Where appropriate, entries may also be cross-referenced to SWISS-PROT, the Eukaryotic Promoter Database, TRANSFAC or FlyBase.

The SWISS-PROT Protein Sequence Data Bank

SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute).

The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. Release 34.0 of SWISS-PROT (October 1996) contains 59,021 sequence entries, comprising 21,210,389 amino acids abstracted from about 50,052 references.

The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria:

a) Annotation

In SWISS-PROT, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation. For each sequence entry the core data consists of the sequence data, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein), while the annotation consists of the description of the following items:

  • Function(s) of the protein
  • Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
  • Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
  • Secondary structure
  • Quaternary structure
  • Similarities to other proteins
  • Disease(s) associated with deficiency(ies) in the protein
  • Sequence conflicts, variants, etc.

The SWISS-PROT group tries to include as much annotation information as possible in SWISS-PROT. To obtain this information SWISS-PROT uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. SWISS-PROT also makes use of external experts, who have been recruited to send SWISS-PROT their comments and updates concerning specific groups of proteins.

In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by `topics'; this approach permits the easy retrieval of specific categories of data from the database.
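
To illustrate how topic-classified comments make retrieval easy, here is a minimal sketch that groups the CC lines of an entry by topic. It assumes comment blocks start with "CC   -!- TOPIC: text" and continue on following CC lines; the sample entry is invented for illustration and is not taken from a real release, and the function name comments_by_topic is ours.

```python
import re
from collections import defaultdict

# Invented entry fragment, for illustration only.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   120 AA.
CC   -!- FUNCTION: HYPOTHETICAL EXAMPLE PROTEIN USED FOR
CC       ILLUSTRATION ONLY.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
KW   EXAMPLE KEYWORD.
"""

# Assumed layout of the first line of a comment block.
TOPIC_START = re.compile(r"^CC   -!- ([A-Z ]+): (.*)$")


def comments_by_topic(entry_text):
    """Return a dict mapping each comment topic to its list of comment texts."""
    topics = defaultdict(list)
    current = None
    for line in entry_text.splitlines():
        if not line.startswith("CC"):
            current = None  # left the comment section
            continue
        match = TOPIC_START.match(line)
        if match:
            current = match.group(1)
            topics[current].append(match.group(2).strip())
        elif current is not None:
            # Continuation line: append to the most recent comment.
            topics[current][-1] += " " + line[2:].strip()
    return dict(topics)


if __name__ == "__main__":
    for topic, texts in comments_by_topic(SAMPLE_ENTRY).items():
        print(topic, "->", texts)
```

With entries indexed this way, a query such as "all entries with a SUBCELLULAR LOCATION comment" reduces to a dictionary lookup per entry.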

b) Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. SWISS-PROT tries as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

c) Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. SWISS-PROT is currently cross-referenced with 26 different databases.

Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT.

Table 1: List of the databases cross-referenced to SWISS-PROT

Database               Description

EMBL Database          EMBL Nucleotide Sequence Database
DICTYDB                Dictyostelium discoideum genome database
ECO2DBASE              Escherichia coli gene-protein database (2D gel spots)
ECOGENE                Escherichia coli K12 genome database (EcoGene)
ENZYME                 ENZYME data bank
FLYBASE                Drosophila genome database (FlyBase)
GCRDB                  G-protein-coupled receptor database (GCRDB)
HIV                    HIV sequence database
HSSP                   Homology-derived secondary structure of proteins database (HSSP)
LISTA                  Yeast (Saccharomyces cerevisiae) genome database
MAIZEDB                Maize genome database (MaizeDB)
MEDLINE                Medline from the National Library of Medicine (NLM)
MIM                    Mendelian Inheritance in Man Database
PDB                    Brookhaven Protein Data Bank
PHDP                   The Radiation Hybrid Database
PIR                    Protein sequence database of the Protein Information Resource
PROSITE                PROSITE dictionary of sites and patterns in proteins
REBASE                 Restriction enzyme database
AARHUS/GHENT-2DPAGE    Human keratinocyte 2D gel protein database from Aarhus and Ghent universities
SGD                    Saccharomyces Genome Database
STYGENE                Salmonella typhimurium LT2 genome database (StyGene)
SUBTILIST              Bacillus subtilis 168 genome database (SubtiList)
SWISS-2DPAGE           Human 2D Gel Protein Database from the University of Geneva
TRANSFAC               Transcription factor database (Transfac)
WORMPEP                Caenorhabditis elegans genome sequencing project protein database (Wormpep)
YEPD                   Yeast electrophoresis protein database

Model organisms in SWISS-PROT

We have selected a number of organisms which are the target of genome sequencing and/or mapping projects and for which we intend to:

  • Be as complete as possible. All sequences available at a given time should be immediately included in SWISS-PROT. This also includes sequence corrections and updates.
  • Provide a higher level of annotation.
  • Cross-reference to specialized database(s) that contain, among other data, some genetic information about the genes which code for these proteins.
  • Provide specific indices or documents.

The organisms currently selected are: Arabidopsis thaliana (mouse ear cress); Bacillus subtilis; Candida albicans; Caenorhabditis elegans (worm); Dictyostelium discoideum (slime mold); Drosophila melanogaster (fruit fly); Escherichia coli; Haemophilus influenzae; Homo sapiens (human); Mycobacterium tuberculosis; Mycoplasma genitalium; Saccharomyces cerevisiae (budding yeast); Salmonella typhimurium; Schizosaccharomyces pombe (fission yeast); Sulfolobus solfataricus. Details of the database entries for these organisms are given in Table 2.

Table 2: Organisms entered in the data bank

Organism          Database     Index file        Number of sequences

A.thaliana        None yet     In preparation     562
B.subtilis        Subtilist    SUBTILIS.TXT      1783
C.albicans        None yet     CALBICAN.TXT       124
C.elegans         WormPep      CELEGANS.TXT      1208
D.discoideum      DictyDB      DICTY.TXT          265
D.melanogaster    FlyBase      In preparation     910
E.coli            EcoGene      ECOLI.TXT         3606
H.influenzae      None yet     HAEINFLU.TXT      1591
H.sapiens         MIM          MIMTOSP.TXT       4000
M.tuberculosis    None yet     None yet           474
M.genitalium      None yet     In preparation     425
S.cerevisiae      LISTA/SGD    YEAST.TXT         4340
S.typhimurium     StyGene      SALTY.TXT          617
S.pombe           None yet     POMBE.TXT          956
S.solfataricus    None yet     None yet            42

Collectively these organisms represent ~35% of the total number of sequence entries in SWISS-PROT.
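
The ~35% figure can be checked with a few lines of arithmetic. The sketch below sums the sequence counts listed in Table 2 and divides by the release 34.0 total quoted earlier (59,021 entries); it assumes that both sets of numbers refer to the same state of the database, which is not stated explicitly.

```python
# Number of sequences per organism, as listed in Table 2.
TABLE_2_COUNTS = {
    "A.thaliana": 562, "B.subtilis": 1783, "C.albicans": 124,
    "C.elegans": 1208, "D.discoideum": 265, "D.melanogaster": 910,
    "E.coli": 3606, "H.influenzae": 1591, "H.sapiens": 4000,
    "M.tuberculosis": 474, "M.genitalium": 425, "S.cerevisiae": 4340,
    "S.typhimurium": 617, "S.pombe": 956, "S.solfataricus": 42,
}

# Total entries in SWISS-PROT release 34.0 (October 1996).
SWISSPROT_34_ENTRIES = 59021

model_total = sum(TABLE_2_COUNTS.values())
share = model_total / SWISSPROT_34_ENTRIES

if __name__ == "__main__":
    # prints "20903 entries, 35.4% of SWISS-PROT"
    print(f"{model_total} entries, {share:.1%} of SWISS-PROT")
```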

In the last few months we have included in SWISS-PROT fully annotated versions of the protein sequence entries encoded on the complete genome of Haemophilus influenzae, as well as entries originating from the full sequence of yeast chromosomes I, II, III, V, VI, VII, VIII, IX, X, XI, XII, XIII, XV, and XVI.

Documentation files

SWISS-PROT is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The following table lists all the documents that are currently available or that will be added in the next few months.

File name       Description

userman.txt     User manual
relnotes.txt    Release notes
submit.txt      Submission of sequence data to the SWISS-PROT Data Bank
shortdes.txt    Short description of entries in SWISS-PROT
jourlist.txt    List of abbreviations for journals cited
keywlist.txt    List of keywords in use
speclist.txt    List of organism identification codes
experts.txt     List of on-line experts for PROSITE and SWISS-PROT
acindex.txt     Accession number index
autindex.txt    Author index
citindex.txt    Citation index
keyindex.txt    Keyword index
speindex.txt    Species index
7tmrlist.txt    List of 7-transmembrane G-linked receptors entries
aatrnasy.txt    List of aminoacyl-tRNA synthetases
allergen.txt    Nomenclature and index of allergen sequences
calbica.txt     Index of Candida albicans entries and their corresponding gene designations
cdlist.txt      CD nomenclature for surface proteins of human leucocytes
celegans.txt    Index of Caenorhabditis elegans entries and corresponding gene designations and Wormpep cross-references
dicty.txt       Index of Dictyostelium discoideum entries and corresponding gene designations and DictyDB cross-references
ec2dtosp.txt    Index of Escherichia coli Gene-protein database entries referenced in SWISS-PROT
ecoli.txt       Index of Escherichia coli K12 chromosomal entries and corresponding EcoGene cross-references
embltosp.txt    Index of EMBL Database entries referenced in SWISS-PROT
extradom.txt    Nomenclature of extracellular domains
glycosyl.txt    Index of glycosyl hydrolases classified by families on the basis of sequence similarities
haeinflu.txt    Index of Haemophilus influenzae RD chromosomal entries
hoxlist.txt     Vertebrate homeobox proteins: nomenclature and index
humchr21.txt    Index of protein sequence entries encoded on human chromosome 21
humchr22.txt    Index of protein sequence entries encoded on human chromosome 22
humchry.txt     Index of protein sequence entries encoded on human chromosome Y
mimtosp.txt     Index of MIM entries referenced in SWISS-PROT
nomlist.txt     List of nomenclature related references for proteins
pdbtosp.txt     Index of Brookhaven PDB entries referenced in SWISS-PROT
peptidas.txt    Classification of peptidase families and index of peptidases entries
plastid.txt     List of chloroplast and cyanelle encoded proteins
pombe.txt       Index of Schizosaccharomyces pombe entries in SWISS-PROT and corresponding gene designations
restric.txt     List of restriction enzymes and methylases entries
ribosomp.txt    Index of ribosomal proteins classified by families on the basis of sequence similarities
salty.txt       Index of Salmonella typhimurium LT2 chromosomal entries and corresponding StyGene cross-references
subtilis.txt    Index of Bacillus subtilis 168 chromosomal entries and corresponding SubtiList cross-references
yeast.txt       Index of Saccharomyces cerevisiae entries and corresponding gene designations
yeast1.txt      Yeast Chromosome I entries
yeast2.txt      Yeast Chromosome II entries
yeast3.txt      Yeast Chromosome III entries
yeast5.txt      Yeast Chromosome V entries
yeast6.txt      Yeast Chromosome VI entries
yeast8.txt      Yeast Chromosome VIII entries
yeast9.txt      Yeast Chromosome IX entries
yeast10.txt     Yeast Chromosome X entries
yeast11.txt     Yeast Chromosome XI entries

TREMBL - a computer-annotated supplement to SWISS-PROT

Ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, a supplement to SWISS-PROT was introduced. The first full release of TREMBL (TRanslation of EMBL nucleotide sequence database) was introduced with release 34 of SWISS-PROT. TREMBL consists of entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT.

The current TREMBL contains 116,379 sequence entries, comprising 31,293,053 amino acids, and is split into two main sections: SP-TREMBL (SWISS-PROT TREMBL), which contains entries that will be added to SWISS-PROT after complete annotation, and REM-TREMBL (REMaining TREMBL), which contains entries not intended for inclusion in SWISS-PROT.

Most of the 116,379 sequence entries currently in SP-TREMBL are additional sequence reports of entries already in SWISS-PROT and will lead to updates of these SWISS-PROT entries. However, some 20,000 to 40,000 entries now in SP-TREMBL will eventually be included as new sequence entries in SWISS-PROT.

Identical sequences in SP-TREMBL from the same species have been merged to reduce redundancy. We are currently working to reduce redundancy further by establishing rules for merging sub-fragments with full-length sequences and for identifying sequence differences due to polymorphisms, strain variations and sequencing errors. The eventual goal is a set of rules for merging conflicting reports of one and the same sequence into a single entry.
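
The first merging step, collapsing identical sequences from the same species, can be sketched as follows. This is only an illustration of the idea, not the production procedure: the entry layout (dicts with 'accession', 'species' and 'sequence' keys) and the function name merge_identical are assumptions of this example.

```python
def merge_identical(entries):
    """Collapse entries that share both species and sequence.

    Each input entry is a dict with the hypothetical keys
    'accession', 'species' and 'sequence'. The merged entry keeps
    every contributing accession number.
    """
    merged = {}
    for entry in entries:
        key = (entry["species"], entry["sequence"].upper())
        if key not in merged:
            merged[key] = {
                "species": entry["species"],
                "sequence": entry["sequence"].upper(),
                "accessions": [],
            }
        merged[key]["accessions"].append(entry["accession"])
    # Dicts preserve insertion order, so output follows first occurrence.
    return list(merged.values())


if __name__ == "__main__":
    reports = [
        {"accession": "A1", "species": "Homo sapiens", "sequence": "MKTAYIAK"},
        {"accession": "A2", "species": "Homo sapiens", "sequence": "MKTAYIAK"},
        {"accession": "A3", "species": "Mus musculus", "sequence": "MKTAYIAK"},
    ]
    for item in merge_identical(reports):
        print(item["species"], item["accessions"])
```

Identical sequences from different species stay separate, as in the text above. Merging sub-fragments or conflicting reports would need additional rules that are not sketched here.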

For SP-TREMBL to act as a computer-annotated supplement to SWISS-PROT, new procedures have been introduced whereby valuable annotation is added automatically. EMBL entries contain information that could, and indeed should, be added to the SP-TREMBL entry as a way to enhance annotation content. Procedures have been developed to extract all relevant information and to put this into the SP-TREMBL entries. This information comes from the EMBL DR, RX, DE, and KW lines and from an assortment of lines in the feature table. A range of sequence analysis tools and the PROSITE pattern database are also used to detect any consensus sequences/motifs present. Tools were developed to use this analysis for adding information about the potential function of the protein, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location, and other annotation to the entry whenever appropriate. We also make use of the ENZYME database, using the EC number as a reference point. Information such as catalytic activity, cofactors and relevant keywords can be taken from ENZYME and added automatically to SP-TREMBL entries. Furthermore, we make use of specialized databases to parse information such as the correct gene nomenclature into TREMBL entries. We are currently investigating methods for scanning Medline abstracts for relevant information that can be added automatically.
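
The ENZYME-based step can be pictured as a lookup keyed on the EC number. The sketch below is illustrative only: the in-memory ENZYME dictionary, its field names and the record content are stand-ins invented for this example and have not been checked against an ENZYME release, and annotate_from_enzyme is a hypothetical helper.

```python
# Hypothetical stand-in for ENZYME records, keyed by EC number.
ENZYME = {
    "1.1.1.1": {
        "catalytic_activity": "AN ALCOHOL + NAD(+) = AN ALDEHYDE OR KETONE + NADH",
        "cofactors": ["ZINC OR IRON"],
        "keywords": ["OXIDOREDUCTASE"],
    },
}


def annotate_from_enzyme(entry, enzyme_db):
    """Return a copy of entry enriched with ENZYME-derived annotation.

    entry is a dict with optional keys 'ec_number', 'comments', 'keywords'.
    Entries without a known EC number are returned unchanged.
    """
    record = enzyme_db.get(entry.get("ec_number"))
    if record is None:
        return entry
    annotated = dict(entry)
    comments = list(entry.get("comments", []))
    comments.append("CATALYTIC ACTIVITY: " + record["catalytic_activity"])
    for cofactor in record["cofactors"]:
        comments.append("COFACTOR: " + cofactor)
    keywords = list(entry.get("keywords", []))
    for keyword in record["keywords"]:
        if keyword not in keywords:  # avoid duplicate keywords
            keywords.append(keyword)
    annotated["comments"] = comments
    annotated["keywords"] = keywords
    return annotated


if __name__ == "__main__":
    raw = {"id": "EXAMPLE", "ec_number": "1.1.1.1"}
    print(annotate_from_enzyme(raw, ENZYME))
```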

REM-TREMBL (REMaining TREMBL) contains the entries (16,586) that we do not wish to include in SWISS-PROT. This section is organized into four subsections:

  1. Most REM-TREMBL entries are immunoglobulins and T-cell receptors. We stopped entering immunoglobulins and T-cell receptors into SWISS-PROT because we want to keep only germline gene-derived translations of these proteins in SWISS-PROT, and not all known somatic recombinant variations of these proteins, as this would bias database-wide searches. At the moment, there are more than 11,000 immunoglobulins and T-cell receptors in TREMBL. We would like to create a specialized database, IMGT-TREMBL (ImMunoGeneTics-TREMBL), dealing with these sequences as a further supplement to SWISS-PROT and keep only a representative cross-section in SWISS-PROT.
  2. Another category of data which will not be included in SWISS-PROT is synthetic sequences (SWISS-PROT represents only naturally occurring sequences). Again, we do not want to leave these entries in TREMBL. Ideally one should build a specialized database for artificial sequences as a further supplement to SWISS-PROT.
  3. A third subsection consists of fragments with fewer than seven amino acids.
  4. The last subsection consists of CDS translations where we have strong evidence to believe that these CDS do not code for real proteins.
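
The four subsections above amount to a routing rule for each translated CDS. A minimal sketch of such a rule is given below; the boolean flags on the entry, the order in which the checks are applied and the function name trembl_section are assumptions of this example, since the text does not describe how each category is actually detected.

```python
MIN_LENGTH = 7  # fragments with fewer than seven amino acids go to REM-TREMBL


def trembl_section(entry):
    """Route an entry to SP-TREMBL or to one of the REM-TREMBL subsections.

    entry is a dict with a 'sequence' string and the hypothetical boolean
    flags 'is_immunoglobulin_or_tcr', 'is_synthetic' and 'doubtful_cds'.
    """
    if entry.get("is_immunoglobulin_or_tcr"):
        return "REM-TREMBL: immunoglobulins and T-cell receptors"
    if entry.get("is_synthetic"):
        return "REM-TREMBL: synthetic sequences"
    if len(entry["sequence"]) < MIN_LENGTH:
        return "REM-TREMBL: small fragments"
    if entry.get("doubtful_cds"):
        return "REM-TREMBL: CDS not coding for real proteins"
    return "SP-TREMBL"


if __name__ == "__main__":
    examples = [
        {"sequence": "MKTAYIAKQR"},
        {"sequence": "MKTAY"},
        {"sequence": "MKTAYIAKQR", "is_synthetic": True},
    ]
    for example in examples:
        print(example["sequence"], "->", trembl_section(example))
```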

The production of TREMBL has emphasised the importance of linking not only to the whole EMBL entry but also within the EMBL entry. This point is highlighted by the numerous genome projects that are currently submitting sequences to the EMBL/GenBank/DDBJ Nucleotide Sequence Database. As these projects continue, longer contiguous sequences will be submitted. These longer contigs will contain many more CDS features, resulting in many more SWISS-PROT/SP-TREMBL entries. In this context, the need for linking at the CDS feature level is evident. This linking has now been achieved by using the PID, the Protein IDentification number found in the /db_xref qualifier tagged to every CDS in the EMBL nucleotide sequence database. The DR lines of SWISS-PROT and TREMBL entries pointing to an EMBL database entry now cite the EMBL AC number as the primary identifier and the PID as the secondary identifier. In all cases where a PID is already integrated into SWISS-PROT, a /db_xref qualifier citing the corresponding SWISS-PROT entry is added to the EMBL nucleotide sequence database CDS feature labelled with this PID. In the remaining cases the /db_xref qualifier points to the corresponding TREMBL entry.

For example, the SWISS-PROT entry with accession number P10662 and the DR line:

DR EMBL; M15160; G171969; -.

is represented in EMBL as:

FT CDS 80..1045
FT     /db_xref="PID:g171969"
FT     /db_xref="SWISS-PROT:P10662"
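
The reciprocal link in this example can be checked mechanically. The sketch below parses the DR line and the two /db_xref qualifiers shown above and confirms that they point at each other. The helper names are ours, and the comparison ignores letter case because the PID appears as G171969 in the DR line and g171969 in the qualifier.

```python
import re

DR_LINE = "DR EMBL; M15160; G171969; -."
FT_LINES = [
    "FT CDS 80..1045",
    'FT     /db_xref="PID:g171969"',
    'FT     /db_xref="SWISS-PROT:P10662"',
]

# Matches /db_xref="DATABASE:identifier"
DB_XREF = re.compile(r'/db_xref="([^:"]+):\s*([^"]*)"')


def parse_dr_line(line):
    """Split a DR line into database name, primary and secondary identifier."""
    code, rest = line.split(None, 1)
    if code != "DR":
        raise ValueError("not a DR line: " + line)
    fields = [field.strip() for field in rest.rstrip(".").split(";")]
    return {"database": fields[0], "primary": fields[1], "secondary": fields[2]}


def parse_db_xrefs(lines):
    """Collect the /db_xref qualifiers of one feature as {database: identifier}."""
    xrefs = {}
    for line in lines:
        match = DB_XREF.search(line)
        if match:
            xrefs[match.group(1)] = match.group(2)
    return xrefs


def links_agree(dr, xrefs, swissprot_accession):
    """True if the protein entry and the CDS feature reference each other."""
    return (dr["secondary"].lower() == xrefs.get("PID", "").lower()
            and xrefs.get("SWISS-PROT") == swissprot_accession)


if __name__ == "__main__":
    dr = parse_dr_line(DR_LINE)
    xrefs = parse_db_xrefs(FT_LINES)
    print(dr, xrefs, links_agree(dr, xrefs, "P10662"))
```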
 

This allows an even deeper integration throughout the world of biomolecular databases and to a much finer level of detail than before. This concept of deeper integration, which subsequently leads to a wider scope of other available information, can be illustrated as follows:

FT CDS x..y
    /db_xref="PID:"
    /db_xref="SWISS-PROT:" -> SWISS-PROT -> linked to 25 different databases
    /db_xref="SGD:" -> Saccharomyces Genome Database -> cosmid; clones; named genes
    /db_xref="Flybase:" -> Flybase -> maps; and a whole assortment of other information
    /db_xref="GDB:" -> Human Genome Database -> named genes; gene analysis; nearby genes & markers
    /db_xref="MIM:" -> Mendelian Inheritance in Man -> gene maps; genetic disorders

This approach enables us to point precisely from a given SWISS-PROT or TREMBL entry to one of potentially many CDS in the corresponding EMBL entry, and vice versa. This change will allow the development of software tools that automatically retrieve that part of a nucleotide sequence entry that codes for a specific protein. This will be especially useful in the context of the World Wide Web, as it will render obsolete the current situation where, for example, one needs to retrieve the complete sequence of a yeast chromosome when one wants the nucleotide sequence coding for a specific protein encoded on that chromosome.
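
A tool of the kind described here would, at its core, cut the coding region out of the nucleotide entry once the matching CDS feature has been found. The following sketch shows that last step for the simplest case only. It assumes a plain "start..end" location with 1-based, inclusive coordinates on the forward strand; joined or complementary locations are not handled, and the function names are ours.

```python
def parse_simple_location(location):
    """Parse a location of the form 'start..end' into two integers."""
    start_text, end_text = location.split("..")
    return int(start_text), int(end_text)


def extract_cds(sequence, location):
    """Return the part of sequence covered by a simple CDS location.

    Coordinates are 1-based and inclusive, so '80..1045' selects
    1045 - 80 + 1 = 966 bases.
    """
    start, end = parse_simple_location(location)
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("location outside sequence: " + location)
    return sequence[start - 1:end]


if __name__ == "__main__":
    # Placeholder sequence, used only to show the coordinate arithmetic.
    dummy_entry_sequence = "A" * 1100
    cds = extract_cds(dummy_entry_sequence, "80..1045")
    print(len(cds), "bases,", len(cds) // 3, "codons")
```

For a whole yeast chromosome the same slice replaces the download of the full sequence with the few hundred bases that actually code for the protein of interest.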

Moreover, the concepts outlined contain a common goal and that is to link features from one dataset to all other relevant datasets. This is a goal that we are determined to achieve at the EBI, not only with SWISS-PROT but also with its supplement TREMBL. Along with the development of tools to achieve automatic addition of relevant information, we have achieved a much deeper integration with the EMBL Nucleotide Sequence Database which serves to enhance our close collaboration.

The Radiation Hybrid mapping database

The Radiation Hybrid database (Rhdb) is a new development at the EBI. This database is an archive of raw data (i.e. PCR results on radiation hybrid panels) with links to other related databases. All cross-references known to the authors or the database maintainers are included. Users can also query the relational database directly (on the World Wide Web), either by using a set of pre-compiled queries or by writing their own ad-hoc queries. The database is distributed in a file format similar to that of the EMBL database, with which it is fully cross-referenced. It is distributed on CD-ROM twice a year and can also be retrieved between CD-ROM releases via the EBI network servers (see below).

Submissions to this database are made using a standard format. Various export formats are supported, as well as different ways of accessing the data. The traditional flat file format is used to export text data on a regular basis.

The database is exported in 4 files:

  • panel: the hybrid panels, which are sets of clones (radiation hybrid cells)
  • rh: the raw hybridization data
  • exp: the experimental conditions
  • map: the maps

The current working release is 9.0 and contains the following information:

  • 38386 RH entries composed of:
    • 19969 ESTs
    • 2156 Généthon Genetic markers
    • 3129 Genetic markers
    • 336 Entirely sequenced cDNA
    • 744 CHLC Genetic markers
    • 3100 Alternative STSs created from genetic loci
    • 1638 STS of no known genetically polymorphic or expressed element
    • 1 Marker found in CpG islands
  • 23 Maps
  • 69 entries describing experimental conditions
  • 123023 cross references to the following databases:

Other databases

The ImMunoGeneTics database

The ImMunoGeneTics database (IMGT) is a database containing nucleotide sequence information of genes important in the function of the immune system. It collects and annotates sequences belonging to the immunoglobulin superfamily which are involved in immune recognition. IMGT is produced and maintained in a collaboration of the EBI with three other laboratories in Europe [LIGM (FR), ICRF (UK), Univ. of Köln (DE)]. It is distributed on CD-ROM twice a year and can also be retrieved between CD-ROM releases via the EBI network servers (see below).

The Bio-Catalog

The Bio-Catalog is a list of software of general interest in molecular biology and genetics. First developed at CEPH/Genethon, it is now maintained and distributed by the EBI. In addition to this database, the EBI maintains a repository of biology-related software on its network servers. This software is also distributed once a year on CD-ROM.

EBI distributed databases

The EBI is a major distributor of molecular biological databases produced by other groups in Europe and world-wide. More than 50 databases are available via the EBI network and 30 of them are included on CD-ROM (see Table 3). The EBI also mirrors dbEST, a database of Expressed Sequence Tags developed at the NCBI, offering query and retrieval access through the World Wide Web.

Table 3: Databases distributed by EBI and the mechanism of distribution in each case

Database        Description                                                  Distribution

ALU             ALU sequences and alignments                                 WWW
BERLIN          5S rRNA sequences                                            CD-ROM, WWW
BLOCKS          Protein Blocks Database                                      CD-ROM, WWW
CPGISLE         CpG Islands database                                         CD-ROM, WWW
CUTG            Codon usage tabulated from GenBank                           CD-ROM, WWW
DSSP            Secondary structure digests of PDB files                     CD-ROM
ECD             E. coli map database                                         CD-ROM, WWW
EMBL Database   EMBL Nucleotide Sequence Database                            CD-ROM, WWW
ENZYME          ENZYME data bank                                             CD-ROM, WWW
EPD             Eukaryotic promoter database                                 CD-ROM, WWW
FANS-REF        Functional analysis bibliography                             CD-ROM
FLYBASE         Drosophila genome database (FlyBase)                         CD-ROM, WWW
HAEMB           Haemophilia B database of mutations                          CD-ROM, WWW
HLA             HLA class I and II sequence database                         CD-ROM
HSSP            Homology-derived secondary structure of proteins database    CD-ROM, WWW
IMGT            Immunogenetics database                                      CD-ROM, WWW
LIMB            Listing of mol. biology databases                            CD-ROM, WWW
KABAT           Proteins of immunological interest                           CD-ROM, WWW
METHYL          Site specific methylation                                    CD-ROM, WWW
PDB             Brookhaven Protein Data Bank                                 CD-ROM
PKCDD           Protein kinase catalytic domains                             CD-ROM, WWW
PROSITE         PROSITE dictionary of sites and patterns in proteins         CD-ROM, WWW
REBASE          Restriction enzyme database                                  CD-ROM, WWW
RELIB           Restriction enzyme library                                   CD-ROM
RLDB            Reference Library Database                                   CD-ROM, WWW
RRNA            Small subunit rRNA sequences                                 CD-ROM, WWW
SEQANALREF      Sequence analysis bibliography                               CD-ROM, WWW
SMALLRNA        Small RNA sequences                                          CD-ROM, WWW
SRP             Signal recognition particle database                         CD-ROM, WWW
SWISS-PROT      Protein sequence database                                    CD-ROM, WWW
TFD             Transcription Factor Database                                CD-ROM, WWW
TRANSFAC        Transcription factor database (Transfac)                     CD-ROM, WWW
TRANSTERM       Translation termination signals                              CD-ROM, WWW
TRNA            tRNA sequences                                               CD-ROM, WWW
3D-ALI          Structure-based sequence alignments                          CD-ROM

Data acquisition

Today, approximately 95% of all nucleotide sequence data is directly submitted to one of the collaborating databases (EMBL, GenBank and DDBJ). The entries created by each group are exchanged on a daily basis. The remaining 5% are still extracted from the literature (especially patent documents), which is a time-consuming and error-prone task.

Direct submissions

The EBI provides a number of different mechanisms for the direct submission of data (see Table 4). Direct submission of sequence data to the nucleotide sequence databases is the primary means of data acquisition. Sequences submitted can be released either immediately after processing or upon publication. In general, unless otherwise directed by the author, submitted sequences are available to the research community before the sequence appears in a journal. One of the direct submission mechanisms is via the Authorin program, which allows authors to prepare their data interactively using MS-DOS or Macintosh computers. The Authorin program can be obtained on diskettes from NCBI (GenBank/NCBI, NIH, Bldg 38A, Bethesda, MD 20894 USA; email: authorin@ncbi.nlm.nih.gov) or electronically from the EBI network server. The Direct Submission Form can also be used for nucleotide sequence submissions. It can be obtained from the EBI network server or by contacting the EBI directly, and a copy is also published periodically in relevant journals. This submission form can be sent to the EBI either by post or by electronic mail. A new submission system has been developed at the EBI using the World Wide Web (WWW). The URL for this system is http://www.ebi.ac.uk/subs/emblsubs.html.

With regard to submission to SWISS-PROT, there is an automatic data flow from the nucleotide sequence databases to the protein database via the computer-annotated supplement, TREMBL. Therefore protein sequences should only be submitted directly to SWISS-PROT when the peptide(s) have been sequenced. These data can be submitted via the Authorin program or the Direct Submission Form as above.

To submit data to SWISS-PROT and for all enquiries regarding submission, one should contact:

datasubs@ebi.ac.uk (for submission)

junker@ebi.ac.uk (for enquiries)

Table 4: Summary of submission mechanisms for the EMBL database

Database                             Submission methods

EMBL Nucleotide Sequence Database    Authorin; Direct Submission Form; WWW submission
SWISS-PROT                           Authorin; Direct Submission Form

Submission accounts

For groups producing large volumes of nucleotide sequence data over an extended period, submission accounts can be established with the EBI. A submission protocol is agreed upon and database entries produced at the research site can be deposited and updated directly by the originating group via FTP. A number of genome projects and research groups have established submission accounts in the past few years, and the procedure has demonstrated itself to be flexible and efficient both for the research groups and for database staff. Each submission account is `curated' by EBI biologists, who check to ensure that new entries follow database annotation conventions and are consistent with other entries from the same project. The curator also serves as an informed liaison between the sequencing group and the database. A list of groups who already submit data using this method or are expected to begin doing so in the near future is given below.

  • European Drosophila Mapping Consortium
  • French Arabidopsis cDNA project GDR
  • Genexpress Genethon (FR)
  • Genethon (FR)
  • Genexpress Munich (DE)
  • HIV project Amsterdam (NL)
  • MHC project Tuebingen
  • Mycoplasma capricolum NCHGR
  • Sanger Centre (UK) C. elegans nematode project
  • Sanger Centre (UK) Human genome project
  • Sanger Centre (UK) Mycobacterium tuberculosis
  • Sanger Centre (UK) S. pombe project
  • Sanger Centre (UK) Yeast Chromosome IV
  • Sanger Centre (UK) Yeast Chromosome IX
  • Sanger Centre (UK) Yeast Chromosome XIII
  • Sanger Centre (UK) Yeast Chromosome XVI
  • UK Human Genome Mapping Project
  • Radiation Hybrid Mapping Consortium

Sequences from patent literature

The protein and nucleotide sequence data reported in the patent literature since 1960 have now been processed, with more than 25,000 protein and nucleotide sequences captured (with first priority given to those from outside the USA and Japan). It should be noted that only a portion of the patent entries are suitable for inclusion in the EMBL nucleotide sequence database; the others are made available in a separate file. The EBI and the European Patent Office (EPO) are collaborating on new measures to ensure that patent sequences appear in the public databases with less delay in the future. Since September 1993, the EPO has required that protein and nucleotide sequences appearing in patent applications be submitted in electronic form, which greatly speeds the incorporation of these sequences into the database as they become publicly available.

Journal-scanning activities

Mandatory sequence submission requirements on the part of many journals, the regular practice of publishing database accession numbers in papers, and the early distribution of `Table of Contents' listings by some of the most important journals have greatly enhanced the effectiveness of the EBI journal-scanning activities in recent years. The EBI continues to scan all major European molecular biology journals, but the activity is directed more towards updating bibliographic references in existing (submitted) entries than towards capturing new sequences. Unfortunately, a small percentage of published sequence data has still not been submitted to any of the three collaborating databases. When such sequences are identified, the authors are contacted and asked to submit their data. The database also regularly makes use of entries produced by the NCBI journal-scanning operations, both to update bibliographic references in existing entries and to include the NCBI entries in the database when no submission exists.

Data distribution

CD-ROM

CD-ROMs are distributed quarterly as a set of compact discs written in the international ISO 9660 standard format. There is a separate CD-ROM distribution for the EMBL and SWISS-PROT databases.

The collaborative databases are distributed on a separate CD-ROM twice a year (see Table 3 for the list of databases included). Software for data query and retrieval is also provided on the CD-ROM.

The programs EMBL-Search for Macintosh and for Windows allow data access by entry name, accession number, keyword, citation, author name, taxonomic classification, database cross-reference, free text, and date. EMBL-Search also provides access to the PROSITE and ENZYME databases, and enables navigation between related entries via the cross-references built into these databases. It uses binary indices whose structure is documented and therefore available for other software systems. The sequence databases are also provided in NBRF format for use with software such as FASTA on Macintosh or MS-DOS systems.

EBI network services

In addition to archiving sequence and genome data, the EBI provides an ever-expanding number of free network services to external users. The EMBL nucleotide sequence database, the SWISS-PROT protein sequence data bank and the other EBI databases are currently accessible via electronic mail fileserver, FTP, and World Wide Web (WWW). New and updated entries from all three collaborating nucleotide sequence databases are added daily to the network servers, making it possible to retrieve entries and perform sequence similarity searches on the very latest nucleotide data. Weekly additions of new and updated SWISS-PROT entries are also available.

The complete collection of additional specialist molecular biology databases is also available. Complementing these extensive data resources is a collection of molecular biology software for MS-DOS, Macintosh, VMS and UNIX. Documents such as subscription and submission forms, and the DDBJ/EMBL/GenBank Features Table Definition, can also be retrieved.

EBI network fileserver

The EBI network fileserver enables access via electronic mail (e-mail) to the EMBL nucleotide sequence database, the SWISS-PROT protein sequence data bank and to the full collection of other databases, public domain software and documentation maintained by EBI. Items are retrieved from the server by sending a command in an e-mail message to the fileserver address. Detailed instructions on using the fileserver, and a current list of contents, can be obtained by sending a message to the Internet address Netserv@ebi.ac.uk with the word HELP in the body of the message. A full set of instructions will be returned automatically.
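The request described above can be sketched in a few lines of Python. This is only an illustration of composing such a message, not an official EBI client: the fileserver address and the HELP command come from the text, while the function name build_fileserver_request and the sender address user@example.org are hypothetical placeholders.

```python
from email.message import EmailMessage

# Fileserver address as given in the article text.
FILESERVER_ADDRESS = "Netserv@ebi.ac.uk"


def build_fileserver_request(sender: str, command: str = "HELP") -> EmailMessage:
    """Compose an e-mail whose body is a single fileserver command.

    `sender` is the address the reply should be returned to; the value used
    below is a placeholder, not a real account.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = FILESERVER_ADDRESS
    # The command goes in the body of the message, as the text describes.
    msg.set_content(command)
    return msg


# Build (but do not send) a HELP request from a hypothetical user.
request = build_fileserver_request("user@example.org")
print(request)
```

Actually sending the message would require a mail relay, for example through the standard-library smtplib.SMTP(...).send_message(request) call; the relay host depends on the local site and is not specified in the article.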

EBI FTP server

This is the main route for retrieving the EMBL nucleotide sequence database, the SWISS-PROT protein sequence data bank and other databas