Mining the yeast genome expression and sequence data
by Alvis Brazma EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
The first genome-scale data on gene expression have recently started to become available, in addition to complete genome sequence data and annotations. For instance, DeRisi et al. (Science, Vol. 278, 1997) measured the relative changes in the expression levels of almost all yeast genes during the diauxic shift at seven time points taken at 2-hour intervals (see footnote 1). The amount of such data will increase rapidly, presenting researchers with the challenge of finding ways to transform these data into knowledge, while also opening new possibilities for purely in silico studies of various aspects of genome functioning.
We have used publicly available data about the diauxic shift to study some aspects of yeast metabolism and gene regulation. The project was started as part of the EU-funded technology transfer activity (TTN) BIOVIS, and its first results were reported at the EBI workshop "Data mining and bioinformatics" in March 1998. The project currently continues within the framework of the BioStandards project. Its long-term goal is to develop methods and software for using gene expression data in combination with other genome data to obtain insights into metabolic and regulatory pathways. A shorter-term goal is to explore ways to relate gene expression profiles during the diauxic shift to specific functional classes or specific regulation mechanisms. In this paper we try to give some flavour of what has been achieved in the project so far.
To pursue the stated goals we used several approaches in parallel:
- we used visualisation approaches to look for correlations between gene functional classes and their expression levels at different time-points;
- we used decision-trees to find rules predicting different gene functional classes based on their expression levels at various time-points;
- we clustered genes by similarities in their expression profiles and looked for common sequence patterns in their upstream regions to discover binding sites for putative transcription factors participating in the diauxic shift.
For visualisation and decision-tree building we used Decisionhouse, a general-purpose data mining and visualisation tool developed by Quadstone Ltd.
Visualisation
Visualisation of the complete data gave us some flavour of the changes in gene expression during the diauxic shift. For each gene we retrieved its functional class from the yeast genome database MIPS (as annotated by the end of 1997) and, for the genes that belong to the energy or metabolism classes, also the functional subclass. Most genes belong to one of 14 basic functional classes, while about 2500 genes are annotated as unclassified. We were particularly interested in the functional class "energy", which contains approximately 230 genes and has 10 functional subclasses.
We visualised the expression profiles of the genes from the functional subclasses of the energy class to try to spot possible correlations. Although, as expected, most of the respiration genes increase their expression level during the diauxic shift, while the fermentation genes slightly decrease, there are several genes annotated as fermentation genes with a considerable expression increase. We found that all these genes have been annotated based on sequence similarity. It is possible that the function of these genes cannot be correctly inferred from sequence similarities. If our future investigations confirm that these genes belong to different functional classes, then this will be an example of how gene expression data may help in gene functional annotation.
Decision tree building
We used Decisionhouse to build decision-trees discriminating genes from different functional subclasses of the energy class, based on the gene expression levels at various time-points. For this we selected all 3347 genes that have been assigned known functional classes and tried to build decision-trees with the various functional classes as objectives. Such decision-trees can give us rules for predicting the functional classes of as yet unclassified genes.
The most discriminatory decision-tree for a single functional subclass that we succeeded in building was the tree for the "respiration" genes. In total there were 64 such genes. The decision tree produced a rule which identifies a total of 88 genes, 22 of which (i.e., 25%) are "respiration" genes. Applying the rule to unclassified genes we obtain 61 genes from the original population of 2731. This naive prediction rule implies that with 25% "probability" the identified 61 genes are "respiration" genes.
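The arithmetic behind this rule can be checked with a short sketch. The counts are the ones quoted above; the variable names are ours, and the enrichment figure assumes that the rule was evaluated against all 3347 classified genes, which the text does not state explicitly.

```python
# Counts quoted in the text; variable names are illustrative only.
respiration_total = 64        # genes annotated "respiration"
rule_hits_classified = 88     # classified genes selected by the rule
rule_hits_respiration = 22    # selected genes annotated "respiration"
rule_hits_unclassified = 61   # unclassified genes selected by the rule
classified_total = 3347       # ASSUMPTION: population the rule was evaluated on

# Fraction of selected genes that are "respiration" genes (the 25% in the text).
precision = rule_hits_respiration / rule_hits_classified

# Fraction of all "respiration" genes that the rule recovers.
recall = rule_hits_respiration / respiration_total

# Background rate of "respiration" genes among classified genes, and how much
# the rule enriches for them relative to that background.
baseline = respiration_total / classified_total
enrichment = precision / baseline

# Naive expectation for the number of "respiration" genes among the 61.
expected_new = precision * rule_hits_unclassified

print(f"precision  = {precision:.2%}")
print(f"recall     = {recall:.2%}")
print(f"enrichment = {enrichment:.1f}x over baseline")
print(f"expected respiration genes among unclassified hits = {expected_new:.2f}")
```

Under these assumptions the rule recovers roughly a third of the known "respiration" genes, and the naive estimate suggests that about 15 of the 61 unclassified genes might belong to this subclass.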
Decision trees for other functional classes or subclasses were less discriminating. Either there are no strong correlations between the functional (sub)classes and the gene expression profiles, or the annotations by sequence similarity are not always reliable, as indicated by the visualisation experiments. It is possible that better decision trees can be built using only the genes annotated by experimental evidence. Testing this is one of the directions of our future research.
Gene upstream region analysis
In the two approaches described above we started from an a priori gene classification and tried to relate gene classes to the various expression profiles. In the last approach we started from the expression profiles, clustering the genes by similarity and then studying the clusters. For this we implemented a clustering algorithm that discretises the time series of expression measurements and places genes whose expression profiles map to the same discrete pattern in the same cluster. Rigorous selection criteria were used for defining "good clusters", which produced 11 clusters containing at least 25 genes. We hypothesised that some of the genes sharing similar expression profiles may also share similar transcription regulation mechanisms, including common transcription factor binding sites.
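The general idea of clustering by discretised profiles can be sketched as follows. This is a minimal illustration and not the algorithm actually used in the project: the three-symbol alphabet, the threshold on log2 ratio changes, the function names and the toy values are all assumptions made for the example. Only the minimum cluster size of 25 comes from the text.

```python
from collections import defaultdict


def discretise(profile, threshold=1.0):
    """Turn a series of log2 expression ratios into a string of symbols.

    Each step between consecutive time points becomes '+', '-' or '0'.
    The threshold and the alphabet are illustrative assumptions.
    """
    symbols = []
    for previous, current in zip(profile, profile[1:]):
        change = current - previous
        if change > threshold:
            symbols.append("+")
        elif change < -threshold:
            symbols.append("-")
        else:
            symbols.append("0")
    return "".join(symbols)


def cluster_by_pattern(profiles, threshold=1.0, min_size=25):
    """Group genes whose profiles map to the same discrete pattern.

    Only clusters with at least min_size genes are returned.
    """
    clusters = defaultdict(list)
    for gene, profile in profiles.items():
        clusters[discretise(profile, threshold)].append(gene)
    return {p: genes for p, genes in clusters.items() if len(genes) >= min_size}


# Hypothetical log2 ratios at seven time points (not real measurements).
toy_profiles = {
    "geneA": [0.0, 0.1, 0.0, 0.2, 0.1, 2.4, 2.5],
    "geneB": [0.0, 0.0, 0.1, 0.1, 0.0, 2.2, 2.0],
    "geneC": [0.0, -0.2, -1.5, -1.6, -1.4, -1.5, -1.6],
}

# min_size is lowered to 2 only so that the toy data yield a cluster.
toy_clusters = cluster_by_pattern(toy_profiles, min_size=2)
print(toy_clusters)
```

In this toy data geneA and geneB stay flat and then rise sharply at the sixth time point, so they fall into the same cluster, while geneC forms a pattern of its own and is filtered out by the size criterion.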
We extracted the genome sequences upstream from genes for each cluster and used a specifically designed sequence pattern discovery algorithm (J.Vilo) to look for common patterns in each cluster (see footnote 2). The algorithm was able to discover sequence patterns that are contained in known transcription factor binding site descriptions given in the TRANSFAC database and can be expected to participate in the regulation of diauxic shift. For details see "Predicting Gene Regulatory Elements in Silico on a Genomic Scale" (A.Brazma, I.Jonassen, J.Vilo, E.Ukkonen), which is to appear in Genome Research.
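Once a pattern with wild-card positions has been found, such as the CCCCT..T pattern described in the footnotes, checking which upstream regions contain it is straightforward. The sketch below shows only this matching step, not the pattern discovery algorithm itself. The function names and the toy sequences are hypothetical, and only the forward strand is searched, which is a simplifying assumption.

```python
import re


def compile_pattern(pattern):
    """Translate a pattern such as 'CCCCT..T' into a regular expression.

    A dot stands for exactly one arbitrary nucleotide.
    """
    parts = ["[ACGT]" if ch == "." else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def genes_with_pattern(pattern, upstream):
    """Return the genes whose upstream sequence contains the pattern."""
    regex = compile_pattern(pattern)
    return [gene for gene, seq in upstream.items() if regex.search(seq.upper())]


# Hypothetical upstream sequences (not real yeast data).
toy_upstream = {
    "geneA": "AACCCCTGATGG",   # CCCCT + GA + T: match
    "geneB": "TTCCCCTAAGCC",   # CCCCT + AA + G: no match
    "geneC": "ccccttttaa",     # lower case, CCCCT + TT + T: match
}

matching = genes_with_pattern("CCCCT..T", toy_upstream)
print(matching)
```

Counting how many sequences in a cluster contain a pattern, compared with how many contain it in the rest of the genome, is the kind of quantity on which a significance evaluation could be based.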
In conclusion we can say that, although the gene expression data that we used are only the first such data publicly available on a genomic scale, purely in silico studies have already revealed new facts about the genome. This should encourage one to believe that, as more high-quality gene expression data become available, in silico discoveries regarding gene regulation will become a reality. To facilitate this process, a public gene expression database should be established. Such a database would not only help in developing gene expression data analysis tools and methods, but also allow one to compare data obtained by different technologies, to evaluate their reliability, and to establish "gold" standards for gene expression measurements. We would like to encourage the community to support an initiative to establish such a database.
Acknowledgments: A.Ewing and N.Skilling from Quadstone Ltd. gave valuable and regular advice on using the Decisionhouse tool. All 3D figures in this article have been produced by Decisionhouse. The gene upstream region analysis was done in collaboration with I.Jonassen from the University of Bergen, and J.Vilo and E.Ukkonen from Helsinki University. The author also benefited substantially from discussions with the Industry support group at EBI, and with A.Robinson in particular.
Resources and further information
European Bioinformatics Institute (EMBL-EBI) http://www.ebi.ac.uk/
Industry Programme http://industry.ebi.ac.uk/
Alvis Brazma's homepage http://industry.ebi.ac.uk/~brazma/
Stanford University http://www.stanford.edu/
Program in Molecular and Genetic Medicine http://cmgm.stanford.edu/
The Brown Lab http://cmgm.Stanford.edu/pbrown/
Exploring the metabolic and genetic control of gene expression on a genomic scale http://cmgm.stanford.edu/pbrown/explore/index.html
Munich Information Centre for Protein Sequences (MIPS) http://www.mips.biochem.mpg.de/
The Yeast Genome Project http://www.mips.biochem.mpg.de/mips/yeast/
University of Helsinki http://www.helsinki.fi/english.html
Department of Computer Science http://www.cs.helsinki.fi/
Publications of Jaak Vilo on pattern recognition http://www.cs.helsinki.fi/~vilo/Publications/
National Research Centre for Biotechnology Ltd. (GBF) http://www.gbf-braunschweig.de/
Transcription Factor database http://transfac.gbf-braunschweig.de/TRANSFAC/
Quadstone Ltd. http://www.quadstone.co.uk/
DecisionHouse - data mining and visualisation package http://www.quadstone.co.uk/dh/overview/
Footnotes
1. J. F. DeRisi et al. (1997) studied the relative expression rate changes of yeast genes during the diauxic shift. They inoculated yeast cells from an exponentially growing yeast culture into fresh medium and, after some initial period, harvested samples at seven 2-hour intervals, isolated their mRNA, and prepared fluorescently labelled cDNA. Two different fluorescent labels were used: one for cells harvested at each of the successive time-points, the other for the reference, from cells harvested at the first time-point. The cDNA from each time-point, together with the reference cDNA, was hybridised to a microarray with approximately 6400 DNA sequences representing ORFs of the yeast genome. The relative fluorescence intensity measured for each of the approximately 6400 * 7 elements reflects the relative abundance of the corresponding mRNA in each cell population. The measurement data are available on the Internet.
2. Evaluation of the statistical significance of the discovered patterns showed that the most significant pattern was CCCCT..T (the dot "." denotes a wild-card position in the pattern, meaning that the pattern matches all sequences containing the substring CCCCT, followed by any two characters, followed by T). This pattern was discovered for the cluster of 38 genes that do not significantly change their expression level during the first 5 time-points, increase their expression level more than 4-fold at time-point 6, and do not change it significantly at the last time-point. Note that the substring CCCCT is related to a transcription factor binding site known as the stress responsive motif. For details and other discovered patterns see "Predicting Gene Regulatory Elements from their Expression Data in the Complete Yeast Genome" (A.Brazma et al), in the proceedings of the German Bioinformatics Conference, GCB'98, and "Predicting Gene Regulatory Elements in Silico on a Genomic Scale" (A.Brazma, I.Jonassen, J.Vilo, E.Ukkonen), which is to appear in Genome Research.