EBI general news
Nucleotide sequence database 1996-1997
Last year an average of one sequence per minute was entered into the EMBL Nucleotide Database, which, by the end of 1996, contained about 700 million base pairs from over 1 million sequences. Collaboration and data exchange with GenBank and DDBJ have proceeded smoothly, and the long-term project to extract sequence data from the 'backfile' of pre-1993 patent literature was completed. Major genomic sequencing began in 1996. The database programming group developed automated procedures for handling the very high volumes of genomic data, in varying stages of completion and analysis, from the Sanger Centre's human and nematode sequencing projects. The genomes of yeast and several bacteria were completed. Substantial effort has been directed at redesigning the relational database schema, which will be implemented in early 1997 on EBI's new multi-processor Digital 8400 servers. The redesign overcomes some inefficiencies that developed as the database grew about 50-fold in size during its current incarnation. It will also provide extra functionality, such as constructed (virtual) sequences and alignments, and will enable still greater integration of nucleotide and protein sequence data. Four releases of the database were created as usual during the year and made available via network servers. The releases are also distributed quarterly on CD-ROM with retrieval software for Windows and Apple Macintosh. Daily batches of updates are made public via EBI's anonymous FTP server for those sites (e.g. EMBNet nodes) that wish to maintain an up-to-date local copy of the database.
A new team at EBI: Macromolecular Structures
During 1996 EBI hired a team to begin our involvement in the Macromolecular Structure Database (MSD), in collaboration with the PDB group at Brookhaven National Laboratory and the NDB group at Rutgers. The team, headed by Phil McNeil, currently consists of four researchers. A mirror service at the EBI for both PDB (Protein Data Bank) and NDB (Nucleic Acid Database) was established to give European researchers better access to these data. Many collaborative discussions took place concerning database schemas and data deposition systems, and a strategy was developed for cleaning up the existing PDB dataset at the EBI.
Radiation Hybrid Database
The Radiation Hybrid Database (RHdb) is an archive of PCR results obtained on radiation hybrid panels, together with mapping information and links to other databases. There are now about 38,000 raw data entries, and radiation hybrid linkage maps for the 23 human chromosomes. The RHdb team took part in the collaborative effort to produce the first Transcripts Map of the Human Genome.
IMGT
The ImmunoGenetics database (IMGT) contains nucleotide sequence information on genes important in the function of the immune system. The project is a collaboration between the EBI, LIGM (Montpellier, France), ICRF, and the University of Köln. Release 9611 contains data from 47 species (mostly human) and 7037 annotated nucleotide sequences.
Network services
Our e-mail, anonymous FTP, and WWW servers continue to be popular and demanding resources. The preferred mode of access to EBI services is now the World Wide Web (WWW). Our WWW server gives access to all databases maintained at the EBI, with SRS being the key application for database query and retrieval. Last year, mirror services were set up on our WWW server for the PDB, NDB, and SubtiList (B. subtilis) databases. The sequence similarity search service based on FASTA received a boost in capacity thanks to the installation of new Digital 8400 servers in a TruCluster configuration, and the Smith-Waterman search service continues on the MasPar using MPsrch. A Bioccelerator from Compugen has been evaluated and acquired as an eventual replacement for the MasPar service. Another specialised device that has been evaluated is the Fast Data Finder (FDF), which can be put to work either on Smith-Waterman database searching or in its original role in large text-scanning applications.
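For readers unfamiliar with the Smith-Waterman method mentioned above, the following is a minimal sketch of the local alignment score it computes. It is an illustration only and says nothing about how MPsrch or the dedicated hardware implement the search. The function name, the scoring values (match, mismatch, gap) and the linear gap penalty are assumptions chosen for brevity; production services normally use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Illustrative scoring only: fixed match/mismatch values and a
    linear gap penalty (assumptions, not the settings of any service).
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Score a query against every entry of a dict {name: sequence}
    and return (name, score) pairs, best first."""
    hits = [(name, smith_waterman_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A database search amounts to repeating this calculation for every database entry, which is why the cost grows with database size and why parallel or special-purpose hardware is attractive for the task.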
Database technology
We began to investigate and compare current database technologies: relational, object-relational and object-oriented. An object-oriented model of RHdb was mapped to ORACLE, Illustra and IDB. Different technologies and software for developing object-oriented models, and for mapping these to relational database tables, have also been investigated. We have chosen CORBA (Common Object Request Broker Architecture) as the central framework in which the EBI databases and services will interoperate. A first CORBA environment was developed for RHdb, building both clients and servers in Java using the VisiBroker object request broker (ORB). Another was developed using Sun NEO and Sybase to produce a phylogenetic tree viewer in Java (see elsewhere in this newsletter).
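The idea of mapping an object-oriented model to relational tables can be sketched in a few lines. The example below is hypothetical throughout: the class RHEntry, its fields, and the table layout are invented for illustration and do not reflect the actual RHdb schema. SQLite is used only because it ships with Python; it is not one of the systems evaluated.

```python
import sqlite3
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass
class RHEntry:
    """Hypothetical object model of one radiation hybrid result."""
    accession: str
    marker: str
    panel: str
    scores: str  # one character per hybrid cell line, e.g. '0' or '1'


def create_schema(conn):
    # One class maps to one table; each attribute maps to one column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rh_entry ("
        "accession TEXT PRIMARY KEY, marker TEXT, panel TEXT, scores TEXT)"
    )


def save(conn, entry):
    # Object to row: attribute values become column values.
    conn.execute("INSERT OR REPLACE INTO rh_entry VALUES (?, ?, ?, ?)", astuple(entry))


def load(conn, accession) -> Optional[RHEntry]:
    # Row to object: column values are passed back to the constructor.
    row = conn.execute(
        "SELECT accession, marker, panel, scores FROM rh_entry WHERE accession = ?",
        (accession,),
    ).fetchone()
    return RHEntry(*row) if row else None
```

A round trip then looks like `conn = sqlite3.connect(":memory:")`, `create_schema(conn)`, `save(conn, entry)`, `load(conn, entry.accession)`. Real mappings also have to deal with inheritance and with references between objects, which this sketch leaves out.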
Sequence analysis
Work centered on the development of sequence alignment methods and software. Genetic algorithms were applied to the problem of multiple sequence alignment. The result is a program called SAGA, which can be used to align sequences using any arbitrarily chosen measure of multiple alignment quality. The program appears to be capable of finding globally optimal solutions in reasonable time when used to optimise the weighted sum-of-pairs objective function. Genetic algorithms have also been applied to the alignment of RNA molecules with known secondary structure. Given two sequences, one of which has a known secondary structure, the two can be aligned taking both primary and secondary structure matching into account. Again, the program, RAGA, appears capable of finding optimal solutions within reasonable time, even for quite long sequences. Finally, we have attempted to optimise the alignment of new small subunit rRNA molecules to existing alignments, using conventional profile alignment methods that incorporate novel sequence weighting schemes. The intention is to combine the genetic algorithm based structural alignments with the conventional profiles in order to help automate the addition of new rRNA sequences to the sequence database.
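As an illustration of the kind of objective function mentioned above, the sketch below computes a weighted sum-of-pairs score for a given multiple alignment. It is not SAGA's code: the function names, the match, mismatch and gap values, and the way weights are supplied are all assumptions made for the example. A genetic algorithm would use such a score as its fitness measure when comparing candidate alignments.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one column of a pairwise comparison (illustrative values)."""
    if x == "-" and y == "-":
        return 0          # gap opposite gap carries no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def weighted_sum_of_pairs(alignment, weights=None):
    """Sum the pairwise scores over all pairs of aligned sequences.

    alignment: list of equal-length strings, with '-' marking gaps.
    weights:   optional dict {(i, j): weight} with i < j; pairs that
               are not listed get a weight of 1.0.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    weights = weights or {}
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        pairwise = sum(pair_score(x, y) for x, y in zip(alignment[i], alignment[j]))
        total += weights.get((i, j), 1.0) * pairwise
    return total
```

The weights make it possible to down-weight pairs of closely related sequences so that they do not dominate the score. Practical objective functions usually replace the fixed match and mismatch values with a substitution matrix and use a more elaborate gap penalty.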
Info supplied by: EBI Services