Data Mining conference concludes that novel algorithms are needed and data comparability issues must be resolved
Jean-Jack M. Riethoven EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
HINXTON, United Kingdom - Data comparability and the need for novel algorithms are the key issues, according to the conclusions of the conference on "Data mining in Bioinformatics - towards in silico Biology". Organised by the European Bioinformatics Institute and held at the Wellcome Trust Genome Campus from 10 to 12 November 1999, the conference attracted more than 275 participants. With fifteen speakers from academia and industry, and forty posters, the conference offered a good opportunity both to get an idea of what data mining is all about and to see what it can do, now and in the near future, for the bioinformatics scientist.
"During the last few years bioinformatics has been overwhelmed with increasing floods of data, both in terms of volume and in terms of new databases and new types of data. We are now entering the post-genomic age, where, in addition to complete genome sequences, we are learning about gene expression patterns and protein interactions on genomic scales. This poses new challenges. Old ways of dealing with data item by item are no longer sustainable and it is necessary to create new opportunities for discovering biological knowledge in silico by data mining", Alvis Brazma (EMBL-EBI), chair of the organising committee, explains.
"Data mining is roughly defined as 'exploration and analysis by automatic and semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules'. Our conference on 'Data mining for Bioinformatics' brought together researchers interested in both fields, with the aim of generating new ideas and insights into how to tackle the challenge of floods of data in molecular biology. "
The conference kicked off with a plenary talk by Heikki Mannila (Helsinki University of Technology and Nokia Technology) who, as one of the top data mining experts in the world, gave an overview of the various issues and algorithms that are involved in traditional data mining. Data mining tasks usually involve either finding some global structure, e.g. by clustering or mixture modelling, or finding a local structure, e.g. by discovering motifs or patterns. The trend now is towards combining database methods with statistical procedures.
On data mining in bioinformatics, Heikki said: "I see a very fruitful time of cross-fertilisation of the areas, because first of all in both traditional statistics and computational data-analysis there are a lot of things that are already in place, many of which people maybe don't know much about. But on the other hand what bioinformatics is creating are totally new types of data that statisticians in their very long and solid discipline have never encountered, for example like analysing huge sets of sequences of letters. That simply isn't a traditional statistical data-type. The same is true for gene expression data: that kind of data hasn't been around that long and consequently there are no methods for analysing it. It seems to me that there is much to be done in creating new types of concepts: what do you actually want to find from there. Similarity and clustering are obvious procedures but there might be other things in the data that are biologically even more relevant. Some types of knowledge that are in the data could be obtained and for this new algorithms are needed."
The keynote lecture of Heikki Mannila (left) and Paul Spellman's microarray presentation (right). © 2000 BioInformer. All Rights Reserved.
The next day was dedicated to an introduction to data mining for bioinformatics, with pattern discovery, databases of premium biological information, and the induction of rules and relations from large data sets as prime examples. Aris Floratos of the IBM TJ Watson Research Center presented a talk about discovering and exploiting patterns in biological databases, in which he focused on the use of the Teiresias algorithm to find so-called seqlets in the GenPept database. These seqlets can then be used in homology searches, but also to describe 3D structure. Building further on the previous talk, Inge Jonassen of the University of Bergen discussed methods for the automatic discovery of patterns in sequences, with particular attention to the algorithms used in the Pratt line of programs.
Rolf Apweiler (EMBL-EBI) presented InterPro, the new integrated resource of protein sites and functional domains that can be used for large-scale protein characterisation. The database uses various algorithms to automatically collect and properly unify various types of data from a range of protein-related databases. The EBI uses InterPro, for example, to assign common annotation to unannotated entries in TrEMBL (Translation of EMBL), thus preventing overpredictions and standardising annotation. Continuing the theme of mining databases, Phil Bourne of the UC San Diego Supercomputer Center presented recent results from mining the Protein Data Bank (PDB) and other macromolecular structure databases. David Westhead (University of Leeds) briefed the audience on simplified descriptions of protein 3D structures and their use in searching and structural pattern recognition. Topology diagrams (TOPS) are visual aids that are simple to create and a very powerful way to scale down the complexity of a 3D structure without losing biologically relevant information.
A more traditional field of data mining in other domains, but new to bioinformatics, is the mining of scientific papers to look for interesting nuggets of information. Christos Ouzounis (EMBL-EBI) presented very promising results from a pilot project in which 2,500 MedLine abstracts were automatically analysed and thousands of protein-protein interactions were found. Being able to easily extract these data from abstracts (or full articles) can lead to the prediction of similar interactions in other species where sequence homology allows. The day closed with a presentation from Beatriz de la Iglesia of the University of East Anglia, who talked about the induction of simple, understandable, and interesting rules from large data sets (nugget discovery), an application that is certain to arise in the biological domain very soon.
Discussions during the poster sessions and coffee breaks. © 1997-2000 BioInformer. All Rights Reserved.
The third day, which focused mainly on the analysis of gene expression data, started with a talk from Martin Vingron of the German Cancer Institute (DKFZ), who presented various examples of how to compare and complement heterogeneous information to answer bioinformatics questions. Paul Spellman of Stanford University gave an animated presentation of various techniques that can be used to characterise function when looking at gene expression data generated from microarrays. He exemplified this by looking at the data generated from 25 different stress conditions for yeast, a total of 400 microarray hybridisations. "I think there are some good things coming out of the junction of data mining and bioinformatics, like for example the functional classification of genes, which is really going to be an exciting area combining all the techniques of biology that people are using. Especially anything that you can do systematically to generate data even if it's not the same general format, for example two hybrid expression results, a protein structure, and a crystal structure; combining all that kind of information will be some real neat and exciting stuff", Paul said on the topic of future directions for data mining in bioinformatics. John Aach from the Harvard University Medical School made a strong case for the standardisation of data in his presentation on the development of integrated databases and analysis tools for functional genomics. Besides more traditional bottlenecks, the data comparability issue makes integrating various information resources for the purpose of further computation or information extraction a difficult chore. He briefly mentioned the initiatives of both NCBI and EBI (and partners) to standardise and collect microarray data.
Ronald Taylor of the National Cancer Institute (NIH) presented a Bayesian similarity measure for gene expression array experiments as a better and more correct alternative to the Euclidean distance or correlation measures that are traditionally used. The last three presentations were invited poster presentations. Alex Hartemink (MIT Laboratory for Computer Science) introduced the audience to the use of high-density DNA array data to statistically validate models of genetic regulatory networks; Reinhard Guthke (Hans Knöll Institute for Natural Products Research) talked about data mining and model-based experimental design for bioprocess optimisation and functional genomics; and finally Susanne Kneitz (Albert Einstein College of Medicine) discussed the practical application of expression profiling of preimplantation mouse embryonic development using cDNA microarrays.
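For readers unfamiliar with the two conventional measures named above, the following is a minimal Python sketch of Euclidean distance and Pearson correlation applied to expression profiles. It does not implement the Bayesian measure presented at the conference, whose details are not given here. The function names and the three example profiles are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical expression values over five conditions (illustration only).
gene_a = [0.5, 1.0, 2.0, 1.0, 0.5]
gene_b = [1.0, 2.0, 4.0, 2.0, 1.0]  # same shape as gene_a, twice the amplitude
gene_c = [0.6, 0.9, 0.4, 1.1, 1.8]  # similar magnitude, different shape

if __name__ == "__main__":
    for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
        print(
            name,
            "distance:", round(euclidean_distance(gene_a, other), 3),
            "correlation:", round(pearson_correlation(gene_a, other), 3),
        )
```

In this toy data the two measures disagree: by Euclidean distance gene_a is closer to gene_c, while by correlation it matches gene_b perfectly. Disagreements of this kind are one reason the choice of similarity measure matters for expression data.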
In summary, the conference concluded that the issue of data comparability is very important and needs to be resolved, and "there is a lot to be done in the area of the algorithm development; perhaps the most important thing is to have algorithms that produce robust answers in an understandable form so that the biologist who is using the algorithm really can understand the result", Heikki Mannila concluded as he went on to exemplify this with the easily understandable (visual) result of multiple sequence alignment algorithms.
There are plans to organise a follow-up to this conference next year, because the general feeling amongst the participants is one of extreme interest in this promising field, which warrants close monitoring over the next few years.
Article by: Jean-Jack M. Riethoven
More pictures of the conference are available on a separate page. Note that although thumbnails of the pictures are shown first, loading the complete page including the images takes approximately 1.5 minutes (14k4 modem), 50 seconds (28k8 modem) or 15 seconds (ISDN or faster).
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
Industry Programme http://industry.ebi.ac.uk/
Datamining '99 Conference Site http://industry.ebi.ac.uk/datamining99/
MicroArray page http://www.ebi.ac.uk/microarray/
External sites are not endorsed by EMBL-EBI