
A publication of EMBL - Outstation Hinxton, The European Bioinformatics Institute


SPTR - A comprehensive, non-redundant and up-to-date protein sequence database

by Henning Hermjakob, Fiona Lang, Rolf Apweiler.

EMBL-Outstation Hinxton, European Bioinformatics Institute,
Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, United Kingdom.

Introduction

SPTR is a comprehensive protein sequence database that combines the high quality of annotation in SWISS-PROT [1] with the completeness of the weekly updated translation of protein coding sequences from the EMBL [2] nucleotide database. It is composed of three parts:

  • SWISS-PROT - a manually curated protein sequence database which strives to provide a high quality of annotation, a minimal level of redundancy and a high level of integration with other biomolecular databases. The SWISS-PROT component of SPTR contains the latest SWISS-PROT release as well as the new or updated entries in SWISSNEW.
  • TrEMBL - a computer-annotated protein sequence database supplementing SWISS-PROT. It contains translations of all protein coding sequences in the EMBL nucleotide sequence database which are not yet in SWISS-PROT. TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL. SP-TrEMBL contains the entries which will be incorporated into SWISS-PROT and is one of the three SPTR components. REM-TrEMBL contains the entries that will not be included in SWISS-PROT for a variety of reasons, e.g. synthetic sequences and pseudogenes. REM-TrEMBL is therefore not included in SPTR.
  • TrEMBL-NEW - the weekly update to SP-TrEMBL which contains the protein-coding sequences from EMBLNEW. During the quarterly release building procedure, TrEMBL-NEW entries are moved into SP-TrEMBL.

SPTR Data Flow

[Figure: SPTR data flow]

During the weekly SPTR building process, all three components undergo a syntax error check and a redundancy check. Entries which are filtered out during either check are manually updated and re-integrated into the next weekly SPTR release. This makes SPTR slightly incomplete, but we regard the current average of five extracted entries, or 0.002% of all entries, per weekly release as tolerable.
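
The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the actual SPTR build software: the `Entry` fields, the residue alphabet, the stand-in syntax check, and the choice to treat a repeated accession number as the redundancy criterion are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str  # hypothetical field names, not the real SPTR schema
    sequence: str


# Assumed residue alphabet for the stand-in syntax check.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBZXU")


def passes_syntax_check(entry):
    """Stand-in check: non-empty accession and a sequence of known residues."""
    return (
        bool(entry.accession)
        and bool(entry.sequence)
        and set(entry.sequence) <= VALID_RESIDUES
    )


def weekly_build(components):
    """Return (released, held_back) entries for one weekly build.

    Entries failing either check are held back for manual correction
    instead of being discarded.
    """
    released, held_back = [], []
    seen_accessions = set()
    for name in ("SWISS-PROT", "SP-TrEMBL", "TrEMBL-NEW"):
        for entry in components.get(name, []):
            if not passes_syntax_check(entry) or entry.accession in seen_accessions:
                held_back.append(entry)
            else:
                seen_accessions.add(entry.accession)
                released.append(entry)
    return released, held_back
```

The held-back list corresponds to the entries that are corrected by hand and re-integrated in the following week.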

SPTR contents

On Friday, Sept 18, 1998, SPTR contained the following entries:

Component        Entries
SWISS-PROT         74988
TrEMBL            164285
TrEMBL-NEW         37860
SPTR complete     277133
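
As a quick consistency check, the component counts above can be summed and related to the average of five entries held back per weekly release. The variable names are chosen only for this illustration.

```python
# Entry counts reported for Friday, Sept 18, 1998.
counts = {"SWISS-PROT": 74988, "TrEMBL": 164285, "TrEMBL-NEW": 37860}

total = sum(counts.values())  # should equal the "SPTR complete" figure

# Average number of entries held back per weekly release, as stated in the text.
held_back_per_week = 5
fraction_percent = 100 * held_back_per_week / total

print(total)                       # 277133
print(round(fraction_percent, 3))  # 0.002
```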

SPTR is comprehensive

Many bioinformatics sites construct non-redundant databases from a number of component databases, using e.g. the NRDB program provided by the NCBI [3], or they use external non-redundant databases, e.g. OWL [4]. Both strategies improve the situation for the end user considerably, but they require either the time- and resource-consuming maintenance of multiple databases or the acceptance of a certain time lag between the creation of an entry and its appearance in the non-redundant database.
Furthermore, both strategies are prone to a loss of information in the individual entry due to the diversity of database formats. While OWL preserves most of the text of an entry and some of its structure, the NRDB program requires a conversion of the component databases to FASTA format, which contains only one description line per entry. With SPTR, we strive to provide a protein sequence database that has a high information content as well as being comprehensive, non-redundant and up-to-date.
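
To illustrate the information loss mentioned above, the following sketch converts a SWISS-PROT-style flat-file entry into a FASTA record. The example entry is invented, and the header layout (`>accession|ID description`) is an assumption made for this illustration, not the format produced by NRDB. Only the ID, AC and DE lines fit into the single header line; all other annotation is dropped.

```python
def flatfile_to_fasta(entry_text):
    """Convert one SWISS-PROT-style flat-file entry to a FASTA record.

    Lines are identified by their two-letter line code. Annotation lines
    other than ID, AC and DE (for example CC or DR) are silently lost.
    """
    entry_id, accession, description, residues = "", "", [], []
    in_sequence = False
    for line in entry_text.splitlines():
        code, rest = line[:2], line[5:].strip()
        if code == "//":          # end-of-entry marker
            break
        if in_sequence:           # sequence lines follow the SQ line
            residues.append(line.replace(" ", ""))
        elif code == "ID":
            entry_id = rest.split()[0]
        elif code == "AC":
            accession = accession or rest.split(";")[0]
        elif code == "DE":
            description.append(rest)
        elif code == "SQ":
            in_sequence = True
    header = f">{accession}|{entry_id} {' '.join(description)}"
    return header + "\n" + "".join(residues)


# Invented entry, for illustration only.
EXAMPLE = """\
ID   TEST_HUMAN     STANDARD;      PRT;    12 AA.
AC   X00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
CC   -!- FUNCTION: INVENTED ANNOTATION THAT FASTA CANNOT CARRY.
DR   EMBL; X00000; -.
SQ   SEQUENCE   12 AA;
     MKVLAAGIVG LL
//
"""

print(flatfile_to_fasta(EXAMPLE))
```

The comment (CC) and cross-reference (DR) lines of the example entry have no counterpart in the resulting FASTA record.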

To ensure that SPTR is comprehensive, we have examined several papers which describe the construction of non-redundant protein sequence databases and have verified that the content of the additional databases is also contained in SPTR. The following table lists the databases which are components of non-redundant protein databases and how they are included in SPTR.

Database                       Inclusion in SPTR
GenPept                        Included by translation of EMBL.
NCBI journal scan              Included through cooperation with NCBI.
Wormpep                        Included by translation of EMBL.
PIR                            Matched against SPTR, < 5% unmatched entries currently being checked.
NRL_3D (Sequences from PDB)    Matched against SPTR, 21 new entries created.

SPTR is non-redundant

The redundancy check carried out during the weekly SPTR production ensures non-redundancy on the level of accession numbers, IDs, and the protein identifiers (PIDs) [5]. We do not automatically merge entries with sequence similarity into single entries because this would also merge entries which should be kept separate, e.g. fragments of different viral strains. When building the quarterly major releases of the component databases, we automatically identify entries with sequence identity and matches of fragments against longer sequences with the LASSAP package [6], but all merges are manually checked.
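
The two levels of checking described above can be sketched as follows. This is a naive illustration and not the LASSAP-based procedure: the dictionary keys are hypothetical, and plain string comparison stands in for the sequence comparison. The second function only reports candidates, since merges are decided manually.

```python
from collections import defaultdict


def duplicate_identifiers(entries):
    """Return identifiers (accession, ID or PID) shared by more than one entry."""
    owners = defaultdict(set)
    for index, entry in enumerate(entries):
        for key in ("accession", "id", "pid"):  # hypothetical field names
            value = entry.get(key)
            if value:
                owners[(key, value)].add(index)
    return {ident: sorted(idx) for ident, idx in owners.items() if len(idx) > 1}


def merge_candidates(entries):
    """Return (i, j, kind) tuples for manual review; nothing is merged here.

    'identical' marks equal sequences, 'fragment' marks a sequence that
    occurs inside a longer one. The pairwise loop is quadratic and only
    suitable for small examples.
    """
    candidates = []
    for i, a in enumerate(entries):
        for j, b in enumerate(entries):
            if i >= j:
                continue
            if a["sequence"] == b["sequence"]:
                candidates.append((i, j, "identical"))
            elif a["sequence"] in b["sequence"] or b["sequence"] in a["sequence"]:
                candidates.append((i, j, "fragment"))
    return candidates
```

A fragment match alone does not justify a merge, for example when the two sequences come from different viral strains, which is why the candidates are passed on for manual inspection.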

SPTR is up-to-date

The three international collaborating nucleotide databases DDBJ/EMBL/GenBank exchange their data on a daily basis and all new coding sequences from EMBL are incorporated into SPTR on a weekly basis. Therefore protein coding sequences submitted to DDBJ/EMBL/GenBank will appear in SPTR within one week in the average case and two weeks in the worst case.

Availability

SPTR is available via FTP from the EBI and Expasy FTP servers and as part of the EBI search services; see the resources listed below.



 

Resources and further information

  • European Bioinformatics Institute (EMBL-EBI)
    http://www.ebi.ac.uk/
  • SWISS-PROT Web pages (EBI)
    http://www.ebi.ac.uk/ebi_docs/swissprot_db/swisshome.html
  • FTP access to SPTR
    ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/
  • Expasy
    http://www.expasy.ch/
  • SWISS-PROT Web pages (Expasy)
    http://www.expasy.ch/sprot/sprot-top.html
  • References:
    1. Bairoch, A. & Apweiler, R. (1997) Nucleic Acids Res. 25, 31-36.
    2. Stoesser, G., Moseley, M.A., Sleep, J., McGowran, M., Garcia-Pastor, M., & Sterk, P. (1998) Nucleic Acids Res. 26, 8-15.
    3. Gish, W., ftp://ncbi.nlm.nih.gov/pub/nrdb/README
    4. Bleasby, A., Akrigg, D., & Attwood, T. (1994) Nucleic Acids Res. 22, 3574-3577.
    5. O'Donovan, Martin, M. J., Apweiler, R., Codani, J.-J., & Glemet, E. (1998) In Press.
    6. Glemet, E. & Codani, J. (1997) Comp. Appl. BioSci. 13, 137-143.

 

External sites are not endorsed by EMBL-EBI

 


Direct questions or comments to Bioinformer Editor. This page last modified Friday, 16 July, 1999.
ISSN 1462-1363.
More information about the BioInformer.

(c) 1997-1999 EMBL-EBI. All Rights Reserved.