There are three that come to mind. The most important task in bioinformatics is the creation of nucleotide and protein databases, and making them accessible to the bioinformatics community. Good software to search the databases and retrieve relevant data is especially important. In my opinion, we have done a good job at that.
The second accomplishment is the provision of computer support for the Genome Projects. Large scale genome sequencing projects were only possible because of advances in computer technology over the last 10 years. We are currently keeping in step with those advances; computer technology might even be a limiting factor for the advancement of bioinformatics.
Thirdly: advances in protein structure and prediction, which is in my opinion a gradual process -- there are few dramatic leaps forward but still it is an important accomplishment.
Do you think bioinformatics is already a mature science, or is it still in its infancy with regard to results/applications?
It is certainly not a mature science: it is relatively new and full of new opportunities and new ways to think. There are major opportunities for new results, particularly, I believe, in evolutionary genetics.
How would you define bioinformatics?
Bioinformatics is a relatively new word and is still settling in. I myself resisted the term for some time, but now I have come to realise that it is appropriate, because the information is so central. I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with the management and subsequent use of biological information, particularly genetic information.
Is there any specialisation within bioinformatics that you think failed to deliver on its promises?
That is not the way to look at things: there will be more results produced later on. I am wary about making strong promises for results to be obtained by bioinformatics. In my opinion, we are doing a good job in 'riding the wave' of the flood of information and delivering the obvious things; these will be substantial and valuable. For me that seems work enough.
What do you see as the future way to go in bioinformatics? Are there methods or techniques that will play an important role in the next 10 years? Or for which you have high hopes?
In my view, sequence analysis will become increasingly based on probabilistic models. This is happening already. Identifying similarities in sequences is essentially a statistical problem, and more formal probabilistic methods such as hidden Markov models are clearly well suited to these sorts of problems.
Presently, making bioinformatics information available to biologists is coupled with a rapid maturation of Internet technologies. I do not know which flavours will win out in the end; but it is a fact that this rapidly changing field is important for bioinformatics.
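To make the idea of a probabilistic sequence model concrete, here is a minimal sketch of a hidden Markov model decoded with the Viterbi algorithm. Everything in it is an assumption made for illustration: the two states (`AT_rich`, `GC_rich`), all probabilities, and the function name `viterbi` are hypothetical and do not come from the interview or from any particular software package.

```python
import math

# Hypothetical two-state model of DNA composition (all numbers are illustrative).
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for seq and its log probability."""
    if not seq:
        return [], 0.0
    # scores[s] holds the best log probability of any path ending in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_scores = {}
        pointers = {}
        for s in STATES:
            # Pick the predecessor state that maximises the path score into s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][symbol])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(STATES, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]
```

Calling `viterbi("ATATATGCGCGCGC")` labels each base with the state most likely to have produced it, together with the log probability of that labelling. Working in log space avoids numerical underflow on long sequences.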
I would like to say that one principle I do not expect to ever see win out completely is that everything can be done automatically. Information has to be structured in a way that humans can take in the information and use it accordingly. People can handle about ten thousand words or objects. Experts can learn about tens of thousands of objects. For example, an expert gardener knows tens of thousands of plants and their characteristics. It is altogether feasible that a researcher can learn the essential facts about most if not all the genes in a genome.
However, we do not know enough structure to capture this knowledge comfortably yet. Computation and information gathering will be done by computers, and then presented to the researcher in a well-structured manner. It is then up to the researcher of course to draw conclusions; the computer is only a tool - I don't believe it will do our science for us.
The Human Genome Project is scheduled to have sequenced the human genome in 2002. What is, in your opinion, the number one practical application of the database?
The main benefits are that it will definitely speed up work for molecular biologists, and will provide opportunities for new connections to be made. It will reduce cost and time by an enormous amount in the short term, and results will also come quicker. I think there will be a whole set of secondary large scale methods that will feed off the complete genome information - one can already see the prospects for this in some of the genome technology companies that are starting up.
What are in your opinion the key technologies and issues that play a role in bioinformatics databases (developing, maintaining, using)? How will we survive?
There are lots of possibilities. The key issue here is who makes what work. Good graphical visualisation tools are very important. There are various initiatives under way, e.g. using Java (applets) or CORBA. We will have to see when the dust settles, but in my opinion one should concentrate on the human factor because it will always come back to that.
How would the Human Genome Project, or bioinformatics, affect the average person? Which effect or result will they notice most?
The main short term effect on the average person will be that the price of medicine will go up. Correspondingly, there will be some chance that they will be cured of a disease (not necessarily genetic) or live longer. Less tangibly, I think it will have an effect on our conception of ourselves and our relationship to nature, and other humans. In the long term, there will be huge influence on the course of biological science, and by one or two hundred years' time, it will have changed medical research and large areas of medicine. In my view obtaining the complete human genome sequence is primarily an endeavour of long term scientific value. What we are doing now will be regarded as forming a critical foundation for biological science from now on.
How do you see the future of database funding and support?
In my opinion, big nucleotide and protein databases will be supported by international funding agencies. There is a strong realisation in the international community that these databases play a vital role in biological and medical research.
There is less funding available for more specialised databases; these tend to be surprisingly costly in maintenance. This is a problem, because they are important, and will become more so. I think they are best supported within a large biology programme which is relevant to the database. I would like to see them funded by national experimental biology research grants within their own structure, perhaps as adjuncts to successful experimental groups with an interest in managing a public database.
Much bioinformatics research is done in the pharmaceutical and medical industries, often leading to proprietary databases and/or patenting DNA/RNA sequences. What is your opinion on this?
I see the need for patenting certain functions that are attached to sequences. Pharmaceutical companies clearly have to protect their interests somehow. However, I dislike blanket patenting of sequences without function. It seems clear to me the sequence in itself is a discovery - if one finds a particular use, that is patentable, but if someone else finds another use for the same sequence they should be allowed to use that and, if they wish, patent it.
With regard to research results in the pharmaceutical industry: it does not bother me too much that sequences are kept secret - that is the right of those who pay for the information. Although it can be frustrating for academic researchers to know that information they want is available somewhere else, I would much rather that things are kept secret than that strong rights are obtained over anonymous sequences.
I strongly dislike patenting algorithms - although they are indeed inventions, the patent system seems singularly inappropriate there, for what is essentially a pure mathematical idea.
Where do you think academia and industry can work closer together in bioinformatics?
There is a lot of data in the public domain which is not well organised and presented. I would like to see industry and academia work together to help and support structuring and presenting those data. It is a waste of time and effort for each company to work independently on structuring those public databases. A concerted effort reduces the cost and, in the long run, eases the maintenance of internal databases for the companies, as well as benefiting the research community. Also, the models and tools developed for public data could be used for their private databases as well.
With the Human Genome Project there also come the ethical questions. Behavioural genetics is one that springs to mind. What is your opinion on this?
I would plead for strict controls on the use of genetic information. I was appalled that the UK insurance industry wants to have access to genetic data. In America they have put strict limits on the use of genetic information by the health insurance industry (Health Insurance Portability and Accountability Act of 1996).
There are other areas where public debate is important. I believe that society should set the bounds, and that scientists should educate and try to persuade in public debate where they feel issues are important. There is an area of direct personal interest to me. My brother suffers from schizophrenia. Looking at psychological disorders and the genetics of emotions and other facets of the mind seems important to me, but others may be less comfortable with it.
In fact, the situation in Britain today with respect to such things does not seem bad to me - I appreciate the efforts of the media and of popularisers of genetics such as Steve Jones. I think it is important that people do discuss the issues, and that the media, and politicians, take an active interest.
One of your projects today is the ACEDB package. Could you tell the readers what the package is all about?
Since ACEDB is linked with quite a number of special purpose visualisation tools, it is used by molecular biologists to construct genetic and physical maps of genomes. The ACEDB software was written and developed by Jean Thierry-Mieg (CRBM-CNRS, Montpellier, France) and me, starting in 1990. It is written in the C programming language and uses the X11 windowing system to provide a platform independent graphical user interface.
The underlying database engine is an unconventional one: we designed it specifically to store information about biological objects like genes, alleles, clones, sequences, etc. Besides these 'raw' data, we required a structure that could handle additional information gained, for example, from experiments or extracted from papers. Furthermore, we wanted the database structure to be able to evolve as experience is gained. The then popular relational databases Sybase and Oracle, and the object-oriented Object Store for example, didn't seem flexible enough to us at the time. That is why we designed our own database engine.
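As a rough illustration of what a structure that can evolve might look like, the sketch below stores each biological object as a set of free-form tags, so a new kind of information can be attached without redefining a fixed table layout. This is a toy under stated assumptions and is not ACEDB's engine, data model, or file format; the class `TagStore`, its methods, and the object names `geneA`, `allele1`, and `paper1` are all hypothetical.

```python
from collections import defaultdict


class TagStore:
    """Toy in-memory store: each object is a (class, name) key with free-form tags."""

    def __init__(self):
        # Maps (class, name) -> tag -> list of values.
        self._objects = defaultdict(lambda: defaultdict(list))

    def add(self, cls, name, tag, value):
        """Attach a value under a tag; previously unseen tags are accepted as-is."""
        self._objects[(cls, name)][tag].append(value)

    def get(self, cls, name, tag):
        """Return all values stored under a tag (an empty list if there are none)."""
        obj = self._objects.get((cls, name))
        return list(obj[tag]) if obj is not None and tag in obj else []

    def find(self, cls, tag, value):
        """Return the sorted names of objects of a class carrying the given tag value."""
        return sorted(
            name
            for (c, name), tags in self._objects.items()
            if c == cls and value in tags.get(tag, [])
        )


# Hypothetical usage: attach raw data first, then literature-derived data later.
store = TagStore()
store.add("Gene", "geneA", "Allele", "allele1")
store.add("Gene", "geneA", "Reference", "paper1")
```

The point of the design is that the `Reference` tag did not have to be declared before it was used, which is the kind of flexibility the answer above describes wanting; a real engine would also need persistence, validation, and indexing.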
Nowadays, ACEDB has hundreds of regular users: the yeast genome database at Stanford University, the US National Agricultural Library and the C. elegans and other model organism communities to name a few. A WWW interface to ACEDB has been developed in France and at NAL. At the Sanger Centre we use ACEDB to manage all our human mapping data, and annotations of the sequences that we produce.
What are the future developments in ACEDB?
Currently, there is an effort to develop a Java web interface to ACEDB, called JADE, by Jean Thierry-Mieg and Lincoln D. Stein (Whitehead Institute/MIT Centre for Genome Research). Furthermore, we expect to see more tools to handle protein sequences and a growth in the number of visualisation tools that can be used together with ACEDB.