A CORBA interface to the EMBL Sequence Database
by: Timothy Slidel, The EMBL Outstation - The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Introduction.
The EBI is spearheading a Europe-wide effort to make remote sources of biological information interoperate using an open industry standard - the Common Object Request Broker Architecture (CORBA). As part of this effort the industry support and services research and development teams, in consultation with the database applications group, have developed an initial version of a CORBA wrapper for the EMBL sequence database.
At present most of the information arriving at and leaving the EBI does so in the form of structured text files (flat files). Very often the sources and destinations of this information are databases, where the information is maintained, stored and often used in much smaller units. Communication via flat files therefore restricts the eloquence of the conversations going on between sources of biological information and their users. For example, if a single sequence feature is updated, the whole file for the entry must be downloaded and parsed to retrieve the new data. Furthermore, the lack of a generic framework for flat file formats has led to their proliferation and to a consequent lack of interoperability between sources of what is often very similar information.
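To make the cost of flat-file access concrete, here is a minimal sketch in Python. The entry is a hypothetical, heavily abbreviated record written in the style of an EMBL flat file; the accession, feature values and function names are assumptions made for illustration. It shows that even to read one feature, a client has to hold and scan the whole entry.

```python
# Hypothetical, heavily abbreviated entry in the style of an EMBL flat file.
# The accession, features and sequence are invented for illustration.
ENTRY = """\
ID   X00001; SV 1; linear; mRNA; STD; HUM; 24 BP.
FT   source          1..24
FT                   /organism="Homo sapiens"
FT   CDS             4..21
FT                   /product="example protein"
SQ   Sequence 24 BP;
     atgcatgcat gcatgcatgc atgc                                        24
//
"""


def parse_features(entry_text):
    """Scan every line of the entry and collect its feature table (FT lines).

    Simplification: real entries can wrap locations and qualifier values
    over several lines; this sketch assumes one line per item.
    """
    features = []
    for line in entry_text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if body.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            name, _, value = body[1:].partition("=")
            if features:
                features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # Feature line: a key followed by a location.
            key, location = body.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


def find_feature(entry_text, key):
    """Return the first feature with the given key, or None if absent."""
    for feature in parse_features(entry_text):
        if feature["key"] == key:
            return feature
    return None


cds = find_feature(ENTRY, "CDS")
print(cds["location"], cds["qualifiers"]["product"])
```

Only the CDS feature is wanted, yet the whole entry, including the sequence lines, has to be transferred and scanned before it can be found.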
However, the problem of managing distributed resources is not unique to bioinformatics and has been addressed by the informatics community at large in the form of the Common Object Request Broker Architecture (CORBA). Whilst it is not the panacea for our problems (we will still have to agree on some things!), it does provide an open, industry-approved framework on which to build for the future. By using CORBA we can represent the structure of biological information and the operations we perform on it in an extensible and robust manner, whilst also providing a friendly environment in which it can be manipulated. For example, the use of object-oriented inheritance within CORBA means that a subset of information can be made available to all without restricting the accessibility of more specialist data to those who need it. Clearly, all this will not happen overnight - it will require a good deal of work to be done, but it will provide the most future-proof solution to these bioinformatics problems that we have.
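The point about inheritance can be sketched in a few lines of Python. The class and method names below are hypothetical and are not taken from the EMBL interface; the sketch only shows how a general view and a specialist view can coexist.

```python
class SequenceInfo:
    """General-purpose view: the subset every client is assumed to need."""

    def __init__(self, accession, residues):
        self.accession = accession
        self.residues = residues

    def length(self):
        return len(self.residues)


class AnnotatedSequenceInfo(SequenceInfo):
    """Specialist view: adds feature annotation without changing the base."""

    def __init__(self, accession, residues, features):
        super().__init__(accession, residues)
        self.features = features  # list of (key, location) pairs

    def features_of_type(self, key):
        return [f for f in self.features if f[0] == key]


def report(seq):
    """A generic client written against the base view only."""
    return f"{seq.accession}: {seq.length()} residues"


plain = SequenceInfo("X00001", "atgcatgc")
rich = AnnotatedSequenceInfo("X00002", "atgcatgcatgc", [("CDS", "1..12")])

print(report(plain))
print(report(rich))  # the generic client also accepts the specialist object
print(rich.features_of_type("CDS"))
```

A client that only knows the base view keeps working when handed the richer object, while specialist clients can use the additional operations.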
Developing the server.
The EMBL sequence database is currently maintained in Oracle using a relational schema developed over the last year at the EBI. One of the constraints on the development of the CORBA server was that it should not involve making changes to this schema. However, we also wanted to provide an intuitive interface to the database at the CORBA level, which would hide the details of interacting with Oracle. The main challenges, therefore, were mapping the relational schema to the object-oriented CORBA interface and hiding the complexities of accessing the relational database.
The development of the server can be divided into three stages as follows:
- Conceptual object model
- Relational to object mapping
- CORBA server implementation
Conceptual Object Model.
The importance of a well-designed object model cannot be over-emphasised: it serves as a focal point for discussion within the development team and forms the basis for top-down implementation. At present the object model (and the server itself) covers sequences, features, locations and taxonomy; publication, contact, patent and related information may be added at a later date. CORBA uses an interface definition language (IDL) to specify the exact nature of the interface to the objects within a CORBA server. We used software tools to help us map the object model directly into IDL. The resulting IDL forms the basis for the development of the CORBA server and defines how clients should interact with it.
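As an illustration only, an object model covering sequences, features, locations and taxonomy might be sketched as follows. This is written in Python rather than IDL, and every class, field and method name is an assumption; the article does not list the actual model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Location:
    """A simple span on a sequence (1-based, inclusive); hypothetical."""
    start: int
    end: int

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass(frozen=True)
class Taxon:
    """Taxonomic classification of the source organism; hypothetical."""
    scientific_name: str
    lineage: Tuple[str, ...]


@dataclass
class Feature:
    """An annotated region of a sequence; hypothetical."""
    key: str
    location: Location


@dataclass
class Sequence:
    """A database entry tying the other concepts together; hypothetical."""
    accession: str
    residues: str
    taxon: Taxon
    features: List[Feature] = field(default_factory=list)

    def subsequence(self, location: Location) -> str:
        # Convert the 1-based inclusive span into a Python slice.
        return self.residues[location.start - 1:location.end]


seq = Sequence(
    accession="X00001",
    residues="atgcatgcatgc",
    taxon=Taxon("Homo sapiens", ("Eukaryota", "Metazoa")),
    features=[Feature("CDS", Location(4, 9))],
)
print(seq.subsequence(seq.features[0].location))
```

In a CORBA setting each of these classes would correspond to an IDL interface or struct, and the methods to IDL operations.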
Relational to Object Mapping.
We used a software tool to generate the mapping from the relational database schema to the object model. After some manual intervention, this tool generates one C++ class for each table in the relational schema, giving one C++ object per table row. The package also provides caching and management for the objects created, whilst still allowing SQL access to the database. By crossing the boundary between database tables and objects, it automates much of the most laborious part of the development process.
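The idea of one class per table and one cached object per row can be sketched as follows. This is not the tool used at the EBI: it is a hand-written Python illustration in which an in-memory SQLite database stands in for Oracle, and the table, column and class names are all assumptions.

```python
import sqlite3


class RowObject:
    """Base class: one subclass per table, one instance per row."""
    table = ""
    columns = ()

    def __init__(self, *values):
        for name, value in zip(self.columns, values):
            setattr(self, name, value)


class Entry(RowObject):
    table = "seq_entry"          # hypothetical table name
    columns = ("id", "accession", "residues")


class Feature(RowObject):
    table = "seq_feature"        # hypothetical table name
    columns = ("id", "entry_id", "feature_key", "location")


class ObjectStore:
    """Loads rows as objects and caches them by (table, primary key)."""

    def __init__(self, conn):
        self.conn = conn
        self._cache = {}

    def get(self, cls, row_id):
        cache_key = (cls.table, row_id)
        if cache_key in self._cache:
            return self._cache[cache_key]
        # Table and column names come from the class definitions above,
        # never from user input; the row id is passed as a bound parameter.
        sql = f"SELECT {', '.join(cls.columns)} FROM {cls.table} WHERE id = ?"
        row = self.conn.execute(sql, (row_id,)).fetchone()
        if row is None:
            return None
        obj = cls(*row)
        self._cache[cache_key] = obj
        return obj


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE seq_entry (id INTEGER PRIMARY KEY, accession TEXT, residues TEXT);
    CREATE TABLE seq_feature (id INTEGER PRIMARY KEY, entry_id INTEGER,
                              feature_key TEXT, location TEXT);
    INSERT INTO seq_entry VALUES (1, 'X00001', 'atgcatgcatgc');
    INSERT INTO seq_feature VALUES (1, 1, 'CDS', '4..9');
""")

store = ObjectStore(conn)
first = store.get(Feature, 1)
second = store.get(Feature, 1)   # served from the cache, same object
print(first is second, first.feature_key, first.location)

# Plain SQL access remains available alongside the object layer.
feature_count = conn.execute("SELECT COUNT(*) FROM seq_feature").fetchone()[0]
print(feature_count)
```

Returning the same object for repeated requests keeps a single in-memory copy of each row, which is the kind of object management the text attributes to the mapping package.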
CORBA Server Implementation.
The CORBA part of the server has two parts: one (the "skeleton" code) is automatically generated from the IDL specification and the other (the "implementation" code) was coded manually. The former knows only about CORBA and dispatches and receives the information specified in the IDL; the latter forms the "flesh on the skeleton", providing the link between CORBA and the database C++ wrapper. It was therefore at this second level that we had to hide the complexities of accessing the database and massage the database schema into our top level object model. As this layer forms another tier of objects above those generated through the relational to object mapping, it was also necessary to implement object management strategies that take into account the distributed, concurrent accesses to data that CORBA facilitates.
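The division of labour between skeleton and implementation, and the need for object management under concurrent access, can be sketched without any CORBA library. Everything below is a plain-Python stand-in with hypothetical names: a dictionary replaces the database wrapper, and a lock-protected registry is one possible management strategy, not necessarily the one used in the server.

```python
import threading


class SequenceImpl:
    """Hand-written part: links the interface to the data source."""

    def __init__(self, accession, database):
        self._accession = accession
        self._database = database   # stand-in for the database wrapper

    def get_residues(self):
        return self._database[self._accession]

    def get_length(self):
        return len(self.get_residues())


class SequenceSkeleton:
    """Stand-in for generated code: knows only operation names and dispatch."""
    OPERATIONS = ("get_residues", "get_length")

    def __init__(self, impl):
        self._impl = impl

    def dispatch(self, operation, *args):
        if operation not in self.OPERATIONS:
            raise ValueError("unknown operation: " + operation)
        return getattr(self._impl, operation)(*args)


class ServantManager:
    """Hands out one server object per accession, safely across threads."""

    def __init__(self, database):
        self._database = database
        self._servants = {}
        self._lock = threading.Lock()

    def servant_for(self, accession):
        with self._lock:
            servant = self._servants.get(accession)
            if servant is None:
                impl = SequenceImpl(accession, self._database)
                servant = SequenceSkeleton(impl)
                self._servants[accession] = servant
            return servant


DATABASE = {"X00001": "atgcatgc"}   # hypothetical data
manager = ServantManager(DATABASE)
seen = []


def worker():
    seen.append(manager.servant_for("X00001"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len({id(s) for s in seen}))   # number of distinct server objects
print(manager.servant_for("X00001").dispatch("get_length"))
```

Without the lock, two simultaneous requests for the same accession could each create their own object, leaving clients with inconsistent views of the same entry.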
Availability.
The current server is at alpha release stage, so the rate of change will remain high for some time. However, we are very interested in any feedback regarding this resource and would like to encourage those interested to develop simple clients or add CORBA connectivity to existing programs. This is most easily done using Java - CORBA connectivity can be added to well-written Java code in a matter of hours. The Java code that handles the CORBA connection to the server can be downloaded from the EBI, so there is no need to purchase expensive software. An example Java client that can connect to the server and retrieve and display sequence and feature information has been developed at the EBI and can be used as a starting point for development. More information about this client and the project in general can be found at http://industry.ebi.ac.uk/~slidel/embl-corba/ or via the CORBA home page at the EBI.
Future Plans.
As the server evolves we will incorporate more functionality and improve its stability and performance. As the CORBA network at the EBI and throughout Europe expands, we hope to develop dynamic data update services based on CORBA so that distributed databases can be updated on the fly in a customisable way, rather than by bulk data transfer. We also hope to see more interoperation between data sources and the development of software tools that allow the naïve user to build complex bioinformatics applications that can help us to extract knowledge from the wealth of distributed information around us.
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
Project 'CORBA wrapper for EMBL database' http://industry.ebi.ac.uk/~slidel/embl-corba/
Project Manager: Tomas Flores (tflores@ebi.ac.uk)
Development team: Jeroen Coppieters (jecop@ebi.ac.uk), Carsten Helgesen (carsten@ebi.ac.uk), Philip Lijnzaad (lijnzaad@ebi.ac.uk), Timothy Slidel (slidel@ebi.ac.uk)
EBI's CORBA homepage http://industry.ebi.ac.uk/~corba/