Distributed Computing in the Life Sciences
by Timothy Slidel
EMBL-Outstation Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
It is now increasingly common for computational tasks within life sciences research to be carried out in a heterogeneous, distributed computing environment. Information must be moved from one machine to another or disks cross-mounted so that different programs can be run on multiple systems. Frequently programs have to be re-written in another programming language so that they can be compiled and executed on another architecture. Funding constraints and technological requirements dictate the make and model of workstations and servers so that a variety of platforms are in use. Furthermore, integrated systems are now scaling beyond the intranet to include data and compute resources available throughout the Internet. Maintaining a working software system in this computational jungle is a laborious and time-consuming practice, using up valuable resources.
In addition to the logistics of such a system there is also the issue of how best to represent the data itself. At present most of the information moving around life sciences systems does so in the form of structured text files (flat-files). This method of communication suffers from a number of problems:
- Text files do not represent scientific data very well so it must usually be reconstructed before use. The plethora of existing flat-file formats means that a correspondingly large amount of parsing software must be maintained and used to do this.
- Data related to a number of flat-files must often be repeated in each one due to the lack of a well-defined information structure. This results in redundant network and storage use.
- Frequently only a subset of the information in a file is used at its destination which also represents an inefficient use of resources.
These problems, together with the torrents of data emerging from the genome projects and other sources, will put severe strain on existing systems.
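The flat-file problems listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the two-letter tags (ID, DE, OS, SQ) and the names `Entry` and `parse_record` are hypothetical and do not describe any real database format. The sketch shows that the text must be rebuilt into a structured object before use, and that the whole record is transferred and parsed even when only one field is wanted.

```python
from dataclasses import dataclass

# A made-up flat-file record; the two-letter tags are illustrative only.
RECORD = """\
ID   EXAMPLE1
DE   Hypothetical protein
OS   Homo sapiens
SQ   MKTAYIAKQR
SQ   QISFVKSHFS
//
"""


@dataclass
class Entry:
    identifier: str
    description: str
    organism: str
    sequence: str


def parse_record(text: str) -> Entry:
    """Rebuild a structured object from the tagged lines of one record."""
    fields = {"ID": [], "DE": [], "OS": [], "SQ": []}
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        tag, _, value = line.partition(" ")
        if tag in fields:
            fields[tag].append(value.strip())
    return Entry(
        identifier=" ".join(fields["ID"]),
        description=" ".join(fields["DE"]),
        organism=" ".join(fields["OS"]),
        sequence="".join(fields["SQ"]),  # sequence lines are concatenated
    )


entry = parse_record(RECORD)
# The whole record was moved and parsed, yet only one field is needed here.
print(entry.organism)
```

Each additional flat-file format would need its own version of `parse_record`, which is the maintenance burden described in the first bullet.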
The increasing reliance of business on information technology has resulted in a similar demand on information resources and the informatics community at large is attempting to address the problems of scalable distributed heterogeneous computing. In complex systems the advantages of object-oriented programming are undeniable. This is further accentuated when system components are distributed and need to interoperate robustly in different environments. As a result the IT industry has developed "Distributed Object Technology" - associated acronyms like RMI, CORBA and DCOM are amongst today's most popular IT buzzwords and they have found their way into the parlance of the life sciences informatics community. At the EBI we believe that CORBA (Common Object Request Broker Architecture) is the most promising technological solution to our computing needs and we have embarked on an extensive program of development to augment the EBI's services and resources.
This article introduces CORBA in the context of life sciences research and explains how it differs from some of the other solutions available.
Given an infinite set of distributed heterogeneous computer software and hardware resources required to conduct life sciences research we wish to provide an effective computational infrastructure that will enable them to interoperate efficiently, robustly and flexibly to achieve their purpose.
The CORBA Solution
CORBA is one of many technology specifications that have emerged from the Object Management Group (OMG), the largest non-profit software consortium in the world, with over 800 member organisations (of which the EBI is one). The OMG has a proven process for "technology adoption" which results in standard specifications for software technology; that is, it doesn't standardise software itself, only a description of how conforming software should behave. This process is open to participation from anyone, so adopted standards directly reflect the needs and expectations of the entire community.
CORBA forms the communications infrastructure of the OMG's Object Management Architecture (OMA) - a network of (virtual - because they are specifications) software components that together provide the basic needs for objects interoperating in a distributed environment as well as the more specific needs of individual technology domains. The OMG has a number of Special Interest Groups (SIGs) that are responsible for nurturing standard specifications within these technology domains. In August 1997 a SIG was formed in order to promote the evolution of OMG standards within the life sciences informatics field. Information about the Life Sciences Research SIG can be found at http://lsr.ebi.ac.uk.
How does CORBA work?
Collaborating objects, like people in large organisations, rely on a clear understanding of their respective responsibilities in order to achieve their collective aims. In the OMA world each object has a clearly defined contract in the form of an interface definition that tells other objects what it can be expected to do. The text of this contract is written in Interface Definition Language (IDL), a simple language that only describes what is done and not how to do it (so it doesn't contain any "for" loop instructions, for example). In the CORBA world, once an IDL interface has been written for an object it no longer matters how that object is implemented (on which platform or with which programming language) because other objects only rely on the IDL contract when they interact with it. Of course, there are some hidden practicalities that do the actual translation between objects and there must be some way of identifying them, but in essence that is all there is to CORBA.
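The separation between a contract and its implementations can be sketched in Python using an abstract base class. This is only an analogy: it is not IDL and it uses no CORBA software, and the names `SequenceStore`, `get_sequence`, `InMemoryStore` and `FlatFileStore` are hypothetical. The point is that the client function relies on the contract alone, so either implementation can be substituted without changing it.

```python
from abc import ABC, abstractmethod


class SequenceStore(ABC):
    """The 'contract': says what can be asked, not how it is answered."""

    @abstractmethod
    def get_sequence(self, identifier: str) -> str:
        """Return the sequence stored under the given identifier."""


class InMemoryStore(SequenceStore):
    """One implementation, backed by a dictionary."""

    def __init__(self, data: dict) -> None:
        self._data = dict(data)

    def get_sequence(self, identifier: str) -> str:
        return self._data[identifier]


class FlatFileStore(SequenceStore):
    """A different implementation, backed by 'identifier<TAB>sequence' lines."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_sequence(self, identifier: str) -> str:
        for line in self._text.splitlines():
            key, _, value = line.partition("\t")
            if key == identifier:
                return value
        raise KeyError(identifier)


def sequence_length(store: SequenceStore, identifier: str) -> int:
    """Client code: relies only on the contract, never on the implementation."""
    return len(store.get_sequence(identifier))


stores = [InMemoryStore({"EX1": "ACGT"}), FlatFileStore("EX1\tACGT\nEX2\tGG")]
print([sequence_length(store, "EX1") for store in stores])
```

In a CORBA system the contract would be written in IDL rather than in the implementation language, which is what allows the two sides to use different languages and platforms.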
When implementing a CORBA system much of the more complex CORBA programming is provided automatically by what is called an "IDL compiler". This software tool translates between a specific IDL interface and the underlying implementation written in the developer's chosen programming language. The translation is done according to an OMG standard "mapping" between IDL and the programming language (mappings have been defined for many of the most common programming languages like Java, C and C++). CORBA objects can then communicate via their IDL defined interfaces using a standard communications protocol (this allows them to interact over networks like the Internet).
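The kind of translation code that an IDL compiler produces can be pictured with a hand-written Python sketch. It is a simplification under stated assumptions: the classes `SequenceTools`, `Skeleton` and `SequenceToolsStub` are hypothetical, JSON is used purely as a stand-in for the standard communications protocol, and the "network" is a direct function call. The client-side stub turns a method call into a message, and the server-side skeleton turns the message back into a call on the real implementation.

```python
import json


class SequenceTools:
    """The real implementation, living on the 'server' side."""

    def gc_count(self, sequence: str) -> int:
        return sum(1 for base in sequence.upper() if base in "GC")


class Skeleton:
    """Server side: decodes a request message and calls the implementation."""

    def __init__(self, target: object) -> None:
        self._target = target

    def handle(self, message: bytes) -> bytes:
        request = json.loads(message.decode("utf-8"))
        method = getattr(self._target, request["operation"])
        result = method(*request["arguments"])
        return json.dumps({"result": result}).encode("utf-8")


class SequenceToolsStub:
    """Client side: looks like SequenceTools but only encodes and sends requests."""

    def __init__(self, send) -> None:
        self._send = send  # any callable taking bytes and returning bytes

    def gc_count(self, sequence: str) -> int:
        message = json.dumps({"operation": "gc_count", "arguments": [sequence]})
        reply = self._send(message.encode("utf-8"))
        return json.loads(reply.decode("utf-8"))["result"]


# Here the 'network' is a direct function call; a real system would use a socket.
skeleton = Skeleton(SequenceTools())
tools = SequenceToolsStub(send=skeleton.handle)
print(tools.gc_count("GATTACA"))
```

The developer writes only `SequenceTools`; in a real CORBA system the equivalents of the stub and skeleton would be generated from the IDL interface rather than written by hand.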
Each CORBA object has a unique identifier (called an "object reference") that can be used to look it up and interact with it from any other CORBA-aware system in the network. This "location transparency" is achieved through the use of an "Object Request Broker" or ORB. This software component knows about all the objects in its immediate environment and can resolve an object reference to the corresponding object instance. With knowledge of its object reference and IDL one object can interact with another using exactly the same code, regardless of whether they are within the same process or on opposite sides of the world (see Figure 1). Location transparency affords the software developer incredible flexibility in the implementation of distributed systems.
Figure 1. CORBA Communication. Each object (circles) can be implemented using a different programming language on a different architecture, but all objects can still communicate via their IDL interfaces and ORBs.
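Location transparency can be illustrated with a toy broker in Python. This is a sketch under heavy assumptions, not an ORB: the class `ToyOrb`, its methods and the reference format `name/number` are all hypothetical, both brokers live in one process, and `resolve` returns the object itself where a real ORB would hand back a proxy for a remote object. What it does show is that the client loop is identical for the local and the remote reference.

```python
import itertools


class ToyOrb:
    """A toy broker: hands out references and resolves them to objects."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects = {}
        self._peers = {}
        self._counter = itertools.count(1)

    def connect(self, other: "ToyOrb") -> None:
        """Make two brokers aware of each other (stands in for the network)."""
        self._peers[other.name] = other
        other._peers[self.name] = self

    def register(self, obj: object) -> str:
        """Store an object and return a reference string for it."""
        key = str(next(self._counter))
        self._objects[key] = obj
        return f"{self.name}/{key}"

    def resolve(self, reference: str) -> object:
        """Find the object behind a reference, locally or via a peer broker."""
        orb_name, _, key = reference.partition("/")
        if orb_name == self.name:
            return self._objects[key]
        return self._peers[orb_name].resolve(reference)


class Service:
    def __init__(self, site: str) -> None:
        self.site = site

    def describe(self) -> str:
        return f"object hosted at {self.site}"


orb_a, orb_b = ToyOrb("orb-a"), ToyOrb("orb-b")
orb_a.connect(orb_b)
local_ref = orb_a.register(Service("site A"))
remote_ref = orb_b.register(Service("site B"))

# The client code is the same whether the object is local or held elsewhere.
for reference in (local_ref, remote_ref):
    print(orb_a.resolve(reference).describe())
```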
The EBI has hosted a number of interactive courses on the applications of CORBA in the life sciences. Further information about CORBA at the EBI and from the OMG can be found through the sites listed under "Resources and further information".
Alternatives to CORBA
CORBA is by no means the only solution for distributed object computing. However, it is the most appropriate solution given the above problem statement and the shortfalls of alternative technologies such as those outlined below.
Microsoft's DCOM (Distributed Component Object Model)
DCOM (or ActiveX as it is sometimes known) has been adapted from a single system architecture (OLE/COM) in an attempt to encompass the world of distributed objects. As a result it is technologically inferior to CORBA, which was designed with distributed computing in mind. DCOM is also:
- Only recently publicly available, whereas CORBA has been in active use for a number of years
- Available only for a restricted number of platforms and programming languages
- Almost entirely controlled by Microsoft - so it is not an open industry standard as CORBA is
The OMG have produced specifications for CORBA/DCOM interworking and a number of products now exist that allow DCOM objects to be used from CORBA and vice-versa. So using one technology does not exclude the other. Further information about DCOM and CORBA can be found in the following publications:
Comparing ActiveX and CORBA/IIOP
Andrew Watson, Richard Soley, and Mike Bradley of the Object Management Group, Feb 97
DCOM and CORBA Side by Side, Step by Step, and Layer by Layer
P. E. Chung, Y. Huang, S. Yajnik, D. Liang, J. C. Shih, C.-Y. Wang, and Y. M. Wang
Java RMI (Remote Method Invocation)
Java has proved very popular as a programming language, especially due to its platform independence. However, until Java 1.1 arrived it lacked a way of communicating amongst separate applications. RMI was added to the language as a way of enabling this. However, RMI is not language independent; it was always intended as an extension to Java rather than as a competitor to CORBA. Indeed it is possible that RMI could be implemented using CORBA's Internet communication protocol IIOP, bringing the two technologies into line. For more information about Java and CORBA see:
Java, RMI and CORBA
David Curtis of the Object Management Group, May 97
Extending the Web
The world wide web has made a significant difference to life sciences informatics. However, HTML (HyperText Markup Language), the language used to construct web-pages, is limited to the representation of text-based documents and is therefore a poor medium for scientific information. In addition the protocol used to interact with Web pages, HTTP (HyperText Transfer Protocol), has a very limited number of ways of doing so. As a result current web technology is unable to meet the requirements of distributed object computing - "This old dog cannot learn any new tricks". Enter XML and HTTP-NG (HTTP-Next Generation). XML is a language that can be used to create new "markup languages" - like HTML, but tailored to some specific needs. For example, CML (Chemical Markup Language) is an XML-defined markup language for describing chemical information. However, XML still defines text-based formats. HTTP-NG is a work in progress that will resolve some of the shortfalls of HTTP and facilitate interaction with distributed object technologies like CORBA and DCOM. So, although current Web technology can be extended to interact with distributed objects, it is unlikely to be a suitable solution to our problem.
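The remark that XML still defines text-based formats can be seen in a short Python sketch using the standard library's `xml.etree.ElementTree` module. The markup below is invented for illustration and is not CML; the element and attribute names (`molecule`, `atom`, `element`, `count`) are assumptions. The values come out of the parser as strings, so the receiving program still has to convert them into usable data.

```python
import xml.etree.ElementTree as ET

# An invented markup for illustration only; this is not real CML.
DOCUMENT = """\
<molecule name="water">
  <atom element="O" count="1"/>
  <atom element="H" count="2"/>
</molecule>
"""

root = ET.fromstring(DOCUMENT)
atoms = root.findall("atom")

# Attribute values arrive as text ...
raw_counts = [atom.get("count") for atom in atoms]

# ... and must be converted before they can be used as numbers.
atom_counts = {atom.get("element"): int(atom.get("count")) for atom in atoms}

print(raw_counts)
print(atom_counts)
```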
Designing a Distributed Object System using CORBA
The actual wiring one uses to construct a distributed object system is somewhat secondary to how the system is designed. A badly designed distributed system with the world's best communications infrastructure will not perform. Most object oriented text books will emphasise that the OO analysis and design stage is vital to producing quality software. This is more so if the system is distributed since this adds an extra dimension to the design problem. The OMG has recently adopted the UML (Unified Modeling Language) as the standard way of modelling distributed object systems, and some software design tools (e.g. Rational Rose) will now generate IDL directly from a UML design. Exciting future OMG specifications promise to further ease the process of designing and implementing distributed systems based around CORBA.
This article has discussed many of the reasons why CORBA is a promising technology for distributed object computing in this field and has provided pointers to further information. With some luck, hard work and the standards adopted through the Life Sciences Research SIG the life sciences community can look forward to increased interoperability and utility in informatics software over the next few years.
Resources and further information
European Bioinformatics Institute
Life Sciences Research SIG
Object Management Group (OMG)
External sites are not endorsed by EMBL-EBI