EMBL Nucleotide Sequence Submissions: From Receipt to Distribution
Guenter Stoesser EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
Introduction
The EMBL Nucleotide Sequence Database, maintained as part of an international collaboration between DDBJ (Japan), EMBL (UK) and GenBank (USA), is growing rapidly as a result of large-scale sequencing efforts. EMBL Release 59 (June 1999) contained over 3.9 million entries comprising about 2924 million nucleotides from over 25000 organisms. This represents an increase of about 24% over Release 58 (March 1999). The EMBL Database and the derived protein sequence databases TrEMBL and SWISS-PROT have made electronic identification of homologues and members of gene families routine. Discovery of novel genes, identification of homologous genes, analysis of alternative splicing and detection of polymorphisms are some of the uses of the database in biomedical research, and these uses will only increase as large-scale sequencing efforts add more high-throughput genome (HTG) sequence data and more complete genomes to the dataset. Bioinformatics tools for database searching, sequence and homology searching, gene prediction, multiple sequence alignments, etc. are available from the databases, allowing 'in silico' analysis. Submission of new sequence data and update information to the public database is an essential prerequisite for building and maintaining a complete and up-to-date dataset on which the scientific community can perform similarity searches and analyses using the latest nucleotide and protein sequence data.
The EMBL Submission Systems
WEBIN
WEBIN is an Internet-based tool for submission of nucleotide sequences to the EMBL database. WEBIN is designed to allow fast submission of single sequences, multiple sequences or even very large numbers of sequences (bulk submissions). WEBIN is available from the EMBL WWW home page or at URL:
http://www.ebi.ac.uk/embl/Submission/webin.html
Soon after its inception in 1997, WEBIN became the preferred submission system. Since then, many new developments have further enhanced it. The submission procedure has been accelerated and simplified by enabling fast-track copying of data from previous WEBIN sessions, by allowing features to be copied from one sequence to the next in multiple sequence submissions, and by providing 'template' forms for frequent biological combinations of features, e.g. rRNA. The quality of flat files generated by WEBIN has been increased by incorporating strict validation checks and by allowing automatic translation of DNA into protein for CDS features. Help and online annotation guides have been expanded and are accessible from within WEBIN.
SEQUIN
Sequin is a stand-alone software tool for submitting and updating nucleotide sequences to the GenBank, EMBL, or DDBJ databases. Sequin contains a number of built-in validation functions for enhanced quality assurance and runs on Macintosh, PC/Windows, and UNIX computers. Sequin is available for download from the EBI servers.
E-MAIL
A new version of the e-mail submission form has been available since December 1998. The form solicits contact, sequence, source, citation and feature information, and is processed manually by the curation team. As a result, assignment of accession numbers may be delayed. Versions of the form which pre-date December 1998 are no longer accepted by the database. Copies of the completed form may be submitted via e-mail only; the EMBL database no longer accepts submissions by post or on disk. Contact datasubs@ebi.ac.uk to request a copy of this form. Please note that this medium should only be used by submitters who do not have a reliable connection to the WWW.
GENOME PROJECT ACCOUNTS
Large-scale sequencing projects have already become the major sources of new sequence data. The EMBL database opens submission accounts for groups producing large volumes of nucleotide sequence data over an extended period. Database entries produced at the research site are deposited and updated directly by the genome project submitter using FTP or e-mail. Each submission account is curated by EBI biologists, who ensure that new entries follow EMBL database annotation conventions. The curators act as a liaison between the sequencing groups and the database, offer advice to project submitters and try to resolve problems if genome sequences repeatedly fail to load into the database. Groups that wish to make use of this submission procedure should contact the database at datasubs@ebi.ac.uk.
Sequence data produced at sequencing centres are included in the database as soon as they are received from the individual sequencing groups, and become immediately available for homology searches via network services. The exact procedure of data acquisition depends on whether the sequence data to be incorporated represent 'unfinished' high-throughput genome data (HTG) or 'finished' sequence data. The recently announced draft human genome sequencing effort, involving 5 major international sequencing centres, will produce over 3 Gigabases of high-throughput phase 1 (HTG1) sequence data to be integrated into the database during the course of this year. About 1 Gigabase (~33%) of this data will be produced in Europe and collected by the database, while the remainder will be incorporated via international collaborative data exchange.
Certificate of submission: The accession number
EBI staff check that new submissions meet the relevant requirements, and database accession numbers are assigned within 2 working days. The accession numbers serve as confirmation that the sequences have been submitted, and are permanent, citable numbers that allow the sequences to be referenced in publications by yourself and others. These same numbers are used to retrieve your sequences from the EMBL Database or from one of the other International Database Collaborators, GenBank and DDBJ. Accession numbers consist of one letter and five digits, or two letters and six digits, and do not change even if the record or its sequence is updated.
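The two accession number layouts described above can be checked mechanically. The following is a minimal sketch in Python, not a tool provided by the database: the function name `is_valid_accession` and the example values are hypothetical, and the restriction to upper-case letters is an assumption, since the text only says "letters".

```python
import re

# One letter + five digits, or two letters + six digits.
# Assumption: letters are upper-case A-Z; the article does not specify case.
ACCESSION_PATTERN = re.compile(r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}")


def is_valid_accession(text: str) -> bool:
    """Return True if text has one of the two accession number layouts."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the two layouts.
    for candidate in ["X12345", "AB123456", "AB12345", "X1234"]:
        print(candidate, is_valid_accession(candidate))
```

A check of this kind only confirms the layout; it cannot tell whether an accession number has actually been assigned.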
Data confidentiality and release dates
If the record is not to be held confidential, it is incorporated into EMBL and made available via the network services immediately. A confidential record will not be released into the public database until you have notified EMBL or it is published, whichever comes first. You may update information in your record at any time. We encourage authors to notify the EMBL Database of publication so that confidential records can be released and public records updated in a timely manner. Use the WWW Update Form or send a message with the new information to update@ebi.ac.uk; please include your accession number with all correspondence.
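The release rule above ("whichever comes first") can be written down as a small decision function. This is only an illustrative sketch in Python: the `Submission` class and its field names are hypothetical and do not correspond to any actual EMBL data structure.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical record state, used only to illustrate the release rule."""
    confidential: bool
    author_requested_release: bool = False
    published: bool = False


def is_publicly_available(sub: Submission) -> bool:
    """Apply the rule: non-confidential records are public at once;
    confidential records become public on author notification or
    publication, whichever happens first."""
    if not sub.confidential:
        return True
    return sub.author_requested_release or sub.published
```

For example, a confidential record with `published=True` is treated as public even if the author has not yet contacted the database.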
Database Curation
Whether you are using the sequence submission tool WEBIN on the WWW, the stand-alone program SEQUIN, or one of the specialised submission procedures for GENOME PROJECT DATA, your submission is received by the EMBL Database staff - a group of biologists and database specialists who manage the collection and distribution of the EMBL Nucleotide Sequence Database. While genome project submissions are processed in large numbers using semi-automated systems, all other types of sequence records are processed manually to ensure biological integrity and consistency with the annotation rules established by the International Nucleotide Sequence Database Collaboration DDBJ/EMBL/GenBank. A team of curators trained in molecular biology and skilled in database production operations annotate, organise and maintain the ever-growing number of database entries. New submissions are checked for mandatory information, biological features, translations of coding regions, etc. EMBL database entries must contain sufficient biological information to describe the sequence and to indicate the hypothetical or experimentally proven function of regions within the sequence. One of the most important features in EMBL entries is the protein coding sequence (CDS). All CDS features within EMBL entries are translated and added automatically to the TrEMBL and SWISS-PROT protein sequence databases. Coding regions in EMBL entries are cross-referenced (via the /db_xref qualifier) to the protein databases, and where appropriate EMBL entries are linked to other specialised databases (species-specific databases). These cross-references allow access to additional information about the entry that is more appropriately stored in other dedicated databases.
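To illustrate what translating a CDS involves, the sketch below converts a coding sequence into protein in Python. It is not the database's own translation code. It assumes the standard genetic code, a CDS that starts in frame at position 1, and translation that stops at the first stop codon; the function name `translate_cds` is hypothetical. Real entries may use other genetic codes or reading frames, which this sketch ignores.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon.

    Codons containing ambiguous bases are reported as 'X'; an incomplete
    trailing codon is ignored.
    """
    sequence = dna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[start:start + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))  # ATG GCC TAA -> "MA"
```

Comparing such a conceptual translation with the one supplied by the submitter is one simple way to catch a wrong frame or a misplaced feature location.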
Figure 1: Acquisition and distribution dataflow
Sequence annotation is an essential part of EMBL sequence records, and current database policy is to reject submissions for which no sequence annotation has been provided, unless these describe ESTs or unfinished high-throughput genome sequences. Both WEBIN and SEQUIN allow sequence annotation to be added: any number of relevant features can easily be added to the sequence feature table via the corresponding feature forms. Additionally, the WEBIN 'Summary and Sequence Features' page presents a sequence flat file which summarises all the data entered so far. This flat file summary also contains hyperlinks to all the WEBIN forms already visited, and so allows the submitter to make any necessary corrections to data entered in previous forms. To help and guide submitters in annotating their sequences, two new Internet guides are now available via hyperlinks from the EMBL-EBI WWW site and from within WEBIN: EMBL Features & Qualifiers and EMBL Annotation Examples. The EMBL Annotation Examples consist of a list of EMBL-approved feature table annotations for common biological sequences. The EMBL Features & Qualifiers guide is a complete list of feature table key and qualifier definitions and provides full explanations of how to use all feature table elements. (Figure 1).
Genome annotation
Initial submissions of genome data by sequencing projects include preliminary gene annotations based on gene prediction programs. The underlying information concerning the methods used and the matches found is currently not available to the user community. Given that a sequence might have a large number of matches to other database entries and might be run through several different algorithms using different parameters, the sheer quantity of analysis information is currently considered to be beyond the scope of the Feature Table. Additionally, sequencing groups may not maintain the annotation after a sequencing project has finished. Efforts are under way to implement a model by which genome analysis information can be maintained by the community - in particular the species-specific databases (e.g. Flybase, SGD, etc.). The sequence database can then link via /db_xref from a given CDS feature to the corresponding external, up-to-date methods/matches/annotation information.
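As an illustration of how /db_xref links can be followed programmatically, the Python sketch below collects the database name and identifier from the qualifier lines of a feature. It is an assumption-laden example rather than an official parser: it relies on the /db_xref="database:identifier" qualifier layout, the function name `extract_db_xrefs` is hypothetical, and the sample feature lines, including the database name EXAMPLEDB and both identifiers, are invented for illustration.

```python
import re

# Matches /db_xref="DATABASE:IDENTIFIER" inside a feature table line.
DB_XREF_PATTERN = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def extract_db_xrefs(feature_lines):
    """Return (database, identifier) pairs found in the given FT lines."""
    xrefs = []
    for line in feature_lines:
        xrefs.extend(DB_XREF_PATTERN.findall(line))
    return xrefs


# Hypothetical feature, written only to show the shape of the input.
SAMPLE_FEATURE = [
    'FT   CDS             1..300',
    'FT                   /db_xref="SWISS-PROT:P00000"',
    'FT                   /db_xref="EXAMPLEDB:gene0001"',
]

if __name__ == "__main__":
    for database, identifier in extract_db_xrefs(SAMPLE_FEATURE):
        print(database, identifier)
```

Keeping only a database name and an identifier in the entry is what lets the external resource hold, and keep current, the bulk of the analysis information.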
Figure 2: EMBL Features & Qualifiers and Annotation Examples WWW resources
Data Access and Information distribution
The turnaround time from submission to distribution is 2 to 5 days. The main mechanisms of data distribution are quarterly database releases and daily updates via network services, which take full advantage of the rapid progress in computer network technologies. Once a record is included in the database, scientists can access it the next day (see also a sample database entry). The nucleotide sequence database is accessible via the World Wide Web, electronic mail fileserver and FTP, providing the most advanced network access to a broad range of molecular biology information resources. New and updated entries are added daily to the network servers, making it possible to retrieve entries and perform sequence similarity searches on the very latest nucleotide data using algorithms such as FASTA and BLAST. Additionally, remote copies of the nucleotide sequence database, updated daily, as well as other molecular biology resources, are held at nationally mandated European nodes.
Bioinformatics tools for database searching, sequence and homology searching, gene prediction, multiple sequence alignments, etc. are available from the databases allowing 'in silico' analysis. A selection of existing tools:
Database Searching
SRS | searching & sequence retrieval

Homology Searches
Fasta3 | similarity & homology searching
WU-Blast2 | Washington University blast2 (blast 1.4 with gaps)
NCBI-Blast2 | NCBI blast2 (blastall) program

Analysis Tools
ClustalW_mp | Multiple sequence alignments
GeneMark | Gene prediction service

Utilities
CpG Islands | CpG Islands finder
Genetic Code Viewer | Review of genetic code differences
Protein Engine | Translate DNA sequences

Article by: Guenter Stoesser
Resources and further information
European Molecular Biology Laboratory, Heidelberg, Germany http://www.embl-heidelberg.de/
European Bioinformatics Institute, Hinxton, Cambridge, UK http://www.ebi.ac.uk/
The EMBL Nucleotide Database http://www.ebi.ac.uk/embl/
The DDBJ/EMBL/GenBank Feature Table Definition http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
WebFeat, EMBL Features and Qualifiers http://www3.ebi.ac.uk/Services/WebFeat/
EMBL Annotation Examples http://www3.ebi.ac.uk/Services/Standards/web/
SEQUIN http://www.ebi.ac.uk/Submissions/index.html (or ftp://ftp.ebi.ac.uk/pub/software/sequin/)
WEBUP linked from http://www.ebi.ac.uk/embl/Submission/webin.html
Alignment submissions http://www.ebi.ac.uk/embl/Submission/alignment.html and ftp://ftp.ebi.ac.uk/pub/databases/embl/align/
Patent data ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/
BlastAll (NCBI Blast2) Vectors Scanning http://www2.ebi.ac.uk/blastall/vectors.html
E-mail search addresses fasta@ebi.ac.uk, blast@ebi.ac.uk, blitz@ebi.ac.uk, bic@ebi.ac.uk
EMBnet http://www.embnet.org/
Genome MOT http://www.ebi.ac.uk/~sterk/genome-MOT/
SRSWWW server http://srs.ebi.ac.uk/
SWISS-PROT & TrEMBL http://www.ebi.ac.uk/swissprot/
DNA Data Bank of Japan, Mishima, Japan http://www.ddbj.nig.ac.jp/
GenBank, NCBI, Bethesda, MD, USA http://www.ncbi.nlm.nih.gov/
External sites are not endorsed by EMBL-EBI