BioInformer

A publication of EMBL - Outstation Hinxton, The European Bioinformatics Institute


nrdb90: a nonredundant sequence database

To maximise the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and a decreasing density of firsthand annotation.

These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database, in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters.

The algorithm was applied to the union of the SWISS-PROT, Swissnew, TrEMBL, TrEMBLnew, GenBank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days of CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, FASTA or BLAST, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily worldwide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy. The nrdb90 database is available for academic use from http://www.ebi.ac.uk/~holm/nrdb90.
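The idea of filtering before comparing can be illustrated with a small Python sketch. It is not the nrdb90 implementation: it uses a single pentapeptide filter (no decapeptide stage), replaces explicit alignment with a gap-free sliding comparison, and all function names are hypothetical. Sequences are visited from longest to shortest, so fragments fall into the cluster of a longer representative.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings (k-peptides) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ungapped_identity(short, long_):
    """Best fraction of identical residues when sliding `short` along
    `long_` without gaps. A crude stand-in for a real alignment."""
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(1 for x, y in zip(short, long_[offset:]) if x == y)
        best = max(best, matches)
    return best / len(short)


def select_representatives(seqs, k=5, threshold=0.9):
    """Greedy clustering: map each representative to the sequences it covers."""
    clusters = {}             # representative -> list of clustered sequences
    reps = []                 # representatives, in order of selection
    index = defaultdict(set)  # k-mer -> positions in `reps` containing it

    # Longest first, so every representative is at least as long as the query.
    for seq in sorted(set(seqs), key=lambda s: (-len(s), s)):
        # Composition filter: count k-mers shared with each representative.
        shared = defaultdict(int)
        for km in kmers(seq, k):
            for r in index.get(km, ()):
                shared[r] += 1

        # Only representatives sharing at least one k-mer are compared,
        # most promising candidates first.
        home = None
        for r in sorted(shared, key=shared.get, reverse=True):
            if ungapped_identity(seq, reps[r]) > threshold:
                home = reps[r]
                break

        if home is not None:
            clusters[home].append(seq)
        else:
            # No close neighbour found: seq becomes a new representative.
            for km in kmers(seq, k):
                index[km].add(len(reps))
            reps.append(seq)
            clusters[seq] = []
    return clusters


if __name__ == "__main__":
    full = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    variant = full[:15] + "W" + full[16:]   # one substitution
    fragment = full[3:22]                   # exact fragment
    unrelated = "GSSGSSGWWWWPPPPLLLLDDDDEEEE"
    for rep, members in select_representatives(
            [fragment, variant, unrelated, full]).items():
        print(rep, len(members))
```

The filter pays off because most pairs of sequences in a large database share no pentapeptide at all, so they are never compared residue by residue. In this gap-free setting, a pair above 90% identity with a shorter sequence of at least nine residues always shares at least one pentapeptide, so for such sequences the filter discards only pairs that could not pass the threshold anyway.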

 Information by: Liisa Holm


 

Resources and further information

  • The European Bioinformatics Institute (EMBL-EBI)
    http://www.ebi.ac.uk/
  • Reference:
    • Holm L, Sander C: Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998 Jun; 14(5):423-429

 

External sites are not endorsed by EMBL-EBI

 


Direct questions or comments to Bioinformer Editor. This page last modified Friday, 16 July, 1999.
ISSN 1462-1363.
More information about the BioInformer.

(c) 1997-1999 EMBL-EBI. All Rights Reserved.