SPTR - A comprehensive, non-redundant and up-to-date protein sequence database
by Henning Hermjakob, Fiona Lang, Rolf Apweiler.
EMBL-Outstation Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.
Introduction
SPTR is a comprehensive protein sequence database that combines the high quality of annotation in SWISS-PROT [1] with the completeness of the weekly updated translation of protein coding sequences from the EMBL [2] nucleotide database. It is composed of three parts:
SWISS-PROT - a manually curated protein sequence database which strives to provide a high quality of annotation, a minimal level of redundancy and a high level of integration with other biomolecular databases. The SWISS-PROT component of SPTR contains the latest SWISS-PROT release as well as the new or updated entries in SWISSNEW.
TrEMBL - a computer-annotated protein sequence database supplementing SWISS-PROT. It contains translations of all protein coding sequences in the EMBL nucleotide sequence database which are not yet in SWISS-PROT. TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL. SP-TrEMBL contains the entries which will be incorporated into SWISS-PROT and is one of the three SPTR components. REM-TrEMBL contains the entries that will not be included in SWISS-PROT for a variety of reasons, e.g. synthetic sequences and pseudogenes. Therefore REM-TrEMBL is not included in SPTR.
TrEMBL-NEW - the weekly update to SP-TrEMBL which contains the protein-coding sequences from EMBLNEW. During the quarterly release building procedure, TrEMBL-NEW entries are moved into SP-TrEMBL.
SPTR Data Flow
During the weekly SPTR building process, all three components undergo a syntax error check and a redundancy check. Entries which are filtered out during the error check or the redundancy check are manually updated and re-integrated into the next weekly SPTR release. This introduces a minimal incompleteness in SPTR, but we regard the current average of five extracted entries or 0.002% of all entries per weekly release as tolerable.
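The weekly filtering step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual SPTR build software: the Entry fields, the residue alphabet, the concrete syntax rules and the function names are all hypothetical, and the redundancy check is reduced to duplicate accession numbers and IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical minimal record; real entries carry far more annotation.
    accession: str
    entry_id: str
    sequence: str


# Assumption: the 20 standard residues plus the ambiguity codes B, Z and X.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBZX")


def passes_syntax_check(entry):
    """Toy syntax check: all fields present, sequence uses known residue codes."""
    return (
        bool(entry.accession)
        and bool(entry.entry_id)
        and bool(entry.sequence)
        and set(entry.sequence) <= AMINO_ACIDS
    )


def weekly_build(entries):
    """Split entries into those released this week and those extracted for manual repair."""
    released, extracted = [], []
    seen_accessions, seen_ids = set(), set()
    for entry in entries:
        duplicate = entry.accession in seen_accessions or entry.entry_id in seen_ids
        if duplicate or not passes_syntax_check(entry):
            extracted.append(entry)  # held back until it has been fixed by hand
            continue
        seen_accessions.add(entry.accession)
        seen_ids.add(entry.entry_id)
        released.append(entry)
    return released, extracted


def extracted_percentage(n_extracted, n_total):
    """Share of entries held back in one weekly release, in percent."""
    return 100.0 * n_extracted / n_total
```

With the figures quoted in this article, extracted_percentage(5, 277133) gives roughly 0.0018, which agrees with the stated 0.002% of all entries.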
SPTR contents
On Friday, Sept 18, 1998, SPTR contained the following entries:
Component | Entries
SWISS-PROT | 74988
TrEMBL | 164285
TrEMBL-NEW | 37860
SPTR complete | 277133
SPTR is comprehensive
Many bioinformatics sites construct non-redundant databases from a number of component databases, using e.g. the NRDB program provided by the NCBI [3], or they use external non-redundant databases, e.g. OWL [4]. Both strategies improve the situation for the end user considerably, but they require the time- and resource-consuming maintenance of multiple databases or the acceptance of a certain time lag between creation of an entry and its appearance in the non-redundant database. Furthermore, both strategies are prone to a loss of information in the individual entry due to the diversity of database formats. While OWL preserves most of the text of an entry and some of its structure, the NRDB program requires a conversion of the component databases to FASTA format, which contains only one description line per entry. With SPTR, we strive to provide a protein sequence database that has a high information content as well as being comprehensive, non-redundant and up-to-date.
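To make the information loss of a FASTA conversion concrete, here is a minimal sketch. It assumes a SWISS-PROT-style flat file with two-letter line codes (ID, AC, DE, OS, CC, SQ) and a "//" terminator. The toy entry, its accession number and the function name are invented for illustration, and this is not how the NRDB program itself is implemented.

```python
# Hypothetical toy entry; the accession, names and sequence are made up.
TOY_ENTRY = """\
ID   TOY_EXAMPLE    STANDARD;      PRT;    12 AA.
AC   X00001;
DE   HYPOTHETICAL TOY PROTEIN.
OS   Examplus fictus.
CC   -!- FUNCTION: Illustrative annotation only.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def flatfile_to_fasta(entry_text):
    """Reduce one flat-file entry to FASTA; also count annotation lines that are lost."""
    accession = ""
    description = []
    residues = []
    dropped = 0
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("//"):  # end of entry
            break
        code, content = line[:2], line[5:].strip()
        if in_sequence:
            residues.append(content.replace(" ", ""))
        elif code == "AC" and not accession:
            accession = content.split(";")[0]  # keep the primary accession only
        elif code == "DE":
            description.append(content)
        elif code == "SQ":
            in_sequence = True
        else:
            dropped += 1  # ID, OS, CC, ... have no place in a FASTA record
    header = ">" + accession + " " + " ".join(description)
    return header + "\n" + "".join(residues), dropped


fasta, dropped = flatfile_to_fasta(TOY_ENTRY)
print(fasta)
print("annotation lines dropped:", dropped)
```

Only the accession number, the description and the sequence survive; the organism and comment lines of the toy entry are discarded, which is the kind of loss SPTR avoids by keeping complete entries.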
To ensure that SPTR is comprehensive, we have examined several papers which describe the construction of non-redundant protein sequence databases and have verified that the content of the additional databases is also contained in SPTR. The following table lists the databases which are components of non-redundant protein databases and how they are included in SPTR.
Database | Inclusion in SPTR
GenPept | Included by translation of EMBL.
NCBI journal scan | Included through cooperation with NCBI.
Wormpep | Included by translation of EMBL.
PIR | Matched against SPTR, < 5% unmatched entries currently being checked.
NRL_3D (Sequences from PDB) | Matched against SPTR, 21 new entries created.
SPTR is non-redundant
The redundancy check carried out during the weekly SPTR production ensures non-redundancy on the level of accession numbers, IDs, and the protein identifiers (PIDs) [5]. We do not automatically merge entries with sequence similarity into single entries because this would also merge entries which should be kept separate, e.g. fragments of different viral strains. When building the quarterly major releases of the component databases, we automatically identify entries with sequence identity and matches of fragments against longer sequences with the LASSAP package [6], but all merges are manually checked.
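The quarterly identification of merge candidates can be illustrated with a small sketch. It does not reproduce LASSAP; it is a naive pairwise comparison under the assumption that "sequence identity" means equal strings and that a "fragment" is an exact substring of a longer sequence. The function name and the accession numbers are hypothetical. The function only reports candidates, mirroring the point that merges are checked by hand and never applied automatically.

```python
def merge_candidates(sequences):
    """Return candidate pairs for manual review.

    sequences maps accession number -> amino acid sequence.
    Each result is (accession_a, accession_b, kind), where kind is
    "identical" or "fragment" (accession_a is contained in accession_b).
    """
    candidates = []
    accessions = sorted(sequences)
    for i, a in enumerate(accessions):
        for b in accessions[i + 1:]:
            seq_a, seq_b = sequences[a], sequences[b]
            if seq_a == seq_b:
                candidates.append((a, b, "identical"))
            elif seq_a in seq_b:
                candidates.append((a, b, "fragment"))
            elif seq_b in seq_a:
                candidates.append((b, a, "fragment"))
    return candidates


# Toy data with made-up accession numbers.
toy = {
    "P1": "MKTAYIAK",
    "P2": "MKTAYIAK",
    "P3": "TAYI",
    "P4": "GGGG",
}
for candidate in merge_candidates(toy):
    print(candidate)
```

The all-against-all loop is quadratic in the number of entries and would not scale to a database of this size; it is meant only to show what kind of pairs end up in front of a curator.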
SPTR is up-to-date
The three international collaborating nucleotide databases DDBJ/EMBL/GenBank exchange their data on a daily basis and all new coding sequences from EMBL are incorporated into SPTR on a weekly basis. Therefore protein coding sequences submitted to DDBJ/EMBL/GenBank will appear in SPTR within one week in the average case and two weeks in the worst case.
Availability
SPTR is available via FTP from the EBI and Expasy FTP servers and as part of the EBI search services:
Resources and further information
European Bioinformatics Institute (EMBL-EBI) http://www.ebi.ac.uk/
SWISS-PROT Web pages http://www.ebi.ac.uk/ebi_docs/swissprot_db/swisshome.html
FTP access to SPTR ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/
Expasy http://www.expasy.ch/
SWISS-PROT Web pages http://www.expasy.ch/sprot/sprot-top.html
References:
1. Bairoch, A. and Apweiler, R. (1997) Nucleic Acids Res. 25, 31-36.
2. Stoesser, G., Moseley, M.A., Sleep, J., McGowran, M., Garcia-Pastor, M., and Sterk, P. (1998) Nucleic Acids Res. 26, 8-15.
3. Gish, W., ftp://ncbi.nlm.nih.gov/pub/nrdb/README
4. Bleasby, A., Akrigg, D., and Attwood, T. (1994) Nucleic Acids Res. 22, 3574-3577.
5. O'Donovan, Martin, M. J., Apweiler, R., Codani, J.-J., and Glemet, E. (1998) In Press.
6. Glemet, E. and Codani, J. (1997) Comp. Appl. BioSci. 13, 137-143.