nrdb90: a nonredundant sequence database
To maximise the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of firsthand annotation. These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database, in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the SWISS-PROT, Swissnew, TrEMBL, TrEMBLnew, GenBank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days of CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, FASTA or BLAST, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily worldwide, appropriate use of the nonredundant database can lead to major savings in computer resources without loss of efficacy. Nrdb90 is available for academic use from http://www.ebi.ac.uk/~holm/nrdb90.
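The idea of filtering before aligning can be illustrated with a minimal Python sketch. It is not the nrdb90 implementation: the function names, the greedy longest-first ordering, the single pentapeptide filter and the ungapped identity measure are all assumptions made for illustration, and the published method should be consulted for the actual filters and alignment step.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings (k-peptides) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_filter(a: str, b: str, k: int = 5) -> bool:
    """Cheap pre-filter (assumed form): do a and b share any k-peptide?"""
    return bool(kmers(a, k) & kmers(b, k))


def ungapped_identity(short: str, long: str) -> float:
    """Best fraction of identical residues over all ungapped placements
    of `short` inside `long`. A stand-in for a real alignment."""
    if not short or len(short) > len(long):
        return 0.0
    best = 0
    for offset in range(len(long) - len(short) + 1):
        matches = sum(1 for x, y in zip(short, long[offset:]) if x == y)
        best = max(best, matches)
    return best / len(short)


def select_representatives(
    seqs: Dict[str, str], threshold: float = 0.9, k: int = 5
) -> Dict[str, List[str]]:
    """Greedy clustering sketch: visit sequences longest first; each one
    joins the first representative it matches above `threshold`,
    otherwise it becomes a new representative."""
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        seq = seqs[name]
        for rep in clusters:
            # The expensive comparison runs only if the filter passes.
            if passes_filter(seq, seqs[rep], k) and \
                    ungapped_identity(seq, seqs[rep]) > threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from any database.
    toy = {
        "full": "MKTAYIAKQRQISFVKSHFSRQ",
        "fragment": "AYIAKQRQISFVKSH",
        "unrelated": "GGSPWLLDNCETGHPV",
    }
    for rep, members in select_representatives(toy).items():
        print(rep, members)
```

In this sketch the fragment is absorbed by the longer sequence that contains it, while the unrelated sequence is rejected by the pentapeptide filter without any residue-by-residue comparison. That is the sense in which a composition filter reduces the number of explicit alignments.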
Information by: Liisa Holm
Resources and further information
The European Bioinformatics Institute (EMBL-EBI)
Holm L, Sander C: Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998 Jun; 14(5):423-429