Establishing a public repository for DNA microarray-based gene expression data.
Alvis Brazma, Alan Robinson, and Jaak Vilo EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
DNA microarray technology is one of the most important recent breakthroughs in experimental molecular biology. It allows gene expression to be monitored on a genomic scale and is already generating considerable amounts of valuable data. Currently these data are scattered across various Internet sites, and in many cases they are not publicly available at all, even after the papers discussing them have been published. With more and more laboratories acquiring this technology, the amount of gene expression data is growing rapidly, and a gene expression data "explosion" is likely soon. At the same time there is no public repository for storing these high-throughput data. Establishing such a repository would bring a number of benefits, among them:
- By combining data obtained by different laboratories, the repository will create the ability to build up progressively detailed gene expression profiles and will give access to this information to third parties.
- It will facilitate the cross-validation of data obtained by different laboratories, the characterisation of various techniques, and the establishment of error rates, benchmarks and "gold" standards.
- It will enable bioinformatics groups, including those not directly associated with microarray laboratories, to participate in the data analysis and to develop new methods and tools for such analysis.
- It will promote a public "sharing" ethos for these crucial data.
- It will create a public resource which can reliably be referenced by the scientific literature, allowing articles to discuss data which have been deposited in the database.
Requiring that the raw experimental data be made public is consistent with the policy of most journals regarding the verifiability of published conclusions. It will not prevent authors from being the first to exploit their own data.
We have discussed the possibility of establishing a public repository for DNA microarray-based gene expression data with Industry Associate partners and with many of the major laboratories developing and using these technologies in Europe and the USA. Among them were Stanford University, the Whitehead Institute (MIT), the National Cancer Institute (NIH) and the National Human Genome Research Institute (NIH) in the US, the German Cancer Research Centre in Germany, and the EMBL Nucleotide Database at the EBI and the Medical Research Council in the UK. A consensus emerged that this is the time to develop standards for storing and annotating these data. Academic laboratories and some industry data providers have also expressed interest in depositing their data in the database once it is established. In the light of these developments the European Bioinformatics Institute has committed itself to establishing a public repository for microarray-based gene expression data (see EMBL Press Release - April 9, 1999, and Nature 398, p 646).
Currently the EBI is establishing a pilot database containing the microarray gene expression data that are publicly available. The top-level structure of the database is a table in which rows represent genes and columns represent experiments. Each gene/experiment cell contains numbers describing the expression level (relative or absolute) of the particular gene in the particular experiment, together with a measurement reliability indicator. The initial fluorescent images of the arrays will also be stored in the archive, when available, so that they can be re-analysed once image analysis algorithms improve.
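The table structure described above can be illustrated with a minimal sketch. It is not the EBI schema: the names `Measurement`, `ExpressionTable` and `image_refs`, the gene and experiment identifiers, and the 0 to 1 scale of the reliability indicator are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    level: float        # expression level of one gene in one experiment
    is_relative: bool   # True: ratio against a reference; False: absolute
    reliability: float  # hypothetical reliability indicator in [0, 1]


class ExpressionTable:
    """Sparse gene-by-experiment table (rows = genes, columns = experiments)."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], Measurement] = {}
        # Optional location of the archived array image for each experiment.
        self.image_refs: Dict[str, str] = {}

    def set(self, gene: str, experiment: str, m: Measurement) -> None:
        self._cells[(gene, experiment)] = m

    def get(self, gene: str, experiment: str) -> Optional[Measurement]:
        # Returns None when the gene was not measured in that experiment.
        return self._cells.get((gene, experiment))

    def genes(self) -> List[str]:
        return sorted({g for g, _ in self._cells})

    def experiments(self) -> List[str]:
        return sorted({e for _, e in self._cells})

    def profile(self, gene: str) -> Dict[str, Measurement]:
        """Return one row: every measurement recorded for a gene."""
        return {e: m for (g, e), m in self._cells.items() if g == gene}


# Made-up identifiers and values, for illustration only.
table = ExpressionTable()
table.set("geneA", "exp1", Measurement(2.5, True, 0.9))
table.set("geneA", "exp2", Measurement(0.4, True, 0.7))
table.set("geneB", "exp1", Measurement(1.1, True, 0.95))
```

A sparse mapping is used here because not every gene need be measured in every experiment; a dense array would be an equally reasonable choice when most cells are filled.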
The most challenging problem in establishing the database is how to annotate the experiments. It is imperative that the annotations are standardised (making at least some comparison of the experiments possible) and machine-readable (i.e. keywords from a limited, possibly hierarchical vocabulary, and a minimal amount of free-text comments). Also, there may be relations between different experiments (for instance, if several experiments are linked in a time series). The development and establishment of standards and methods for annotating the experiments is the main research component of the informatics project. We plan to set up international working groups to develop such experiment annotation and data representation standards. The information about genes will essentially be given by links to the respective EMBL or other genome database entries.
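As a rough sketch of what a machine-readable annotation of this kind might look like, the example below checks keywords against a small hierarchical vocabulary and links experiments into a time series. No annotation standard existed at the time of writing, so the vocabulary terms, the field names (`series_id`, `time_point_min`) and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical hierarchical vocabulary: term -> parent term (None for a root).
VOCABULARY: Dict[str, Optional[str]] = {
    "organism": None,
    "yeast": "organism",
    "treatment": None,
    "heat shock": "treatment",
}


def ancestors(term: str) -> List[str]:
    """Return all broader terms above `term` in the vocabulary."""
    chain: List[str] = []
    parent = VOCABULARY[term]
    while parent is not None:
        chain.append(parent)
        parent = VOCABULARY[parent]
    return chain


@dataclass
class ExperimentAnnotation:
    experiment_id: str
    keywords: List[str]                     # must come from VOCABULARY
    comment: str = ""                       # minimal free text
    series_id: Optional[str] = None         # links experiments in a series
    time_point_min: Optional[float] = None  # position within the series

    def __post_init__(self) -> None:
        unknown = [k for k in self.keywords if k not in VOCABULARY]
        if unknown:
            raise ValueError(f"keywords not in vocabulary: {unknown}")


def matches(annotation: ExperimentAnnotation, term: str) -> bool:
    """True if the experiment is annotated with `term` or a narrower term."""
    return any(k == term or term in ancestors(k) for k in annotation.keywords)


def time_series(annotations: List[ExperimentAnnotation],
                series_id: str) -> List[ExperimentAnnotation]:
    """Return the experiments of one series, ordered by time point."""
    members = [a for a in annotations
               if a.series_id == series_id and a.time_point_min is not None]
    return sorted(members, key=lambda a: a.time_point_min)


# Made-up experiments, for illustration only.
late = ExperimentAnnotation("exp2", ["yeast", "heat shock"],
                            series_id="s1", time_point_min=30.0)
early = ExperimentAnnotation("exp1", ["yeast", "heat shock"],
                             series_id="s1", time_point_min=0.0)
ordered = [a.experiment_id for a in time_series([late, early], "s1")]
```

Because the vocabulary is hierarchical, a query for the broad term `"treatment"` also finds experiments annotated with the narrower `"heat shock"`, which is the kind of comparison a flat free-text comment would not support.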
Another major problem in establishing the database is the development of normalisation procedures and standards that enable the comparison of data from different experiments, possibly carried out on different experimental platforms. Although many experts agree that such normalisation and comparisons are possible, little research towards these ends has so far been done in the public domain. A concerted effort by several microarray laboratories and bioinformatics groups is needed to develop such standards. We are looking into the possibility of establishing a consortium of laboratories working towards these ends. Most academic laboratories in Europe are only now acquiring microarray technology, which puts Europe in a good position to take a lead in establishing common experimental standards. This will be more difficult later, once the laboratories have developed individual practices. Such an initiative will also help European researchers to catch up with their US colleagues in the use and development of microarray technologies.
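To make the idea of normalisation concrete, the sketch below applies median centring of log2 values, one simple and widely used technique. It is chosen here purely for illustration and is not a procedure proposed in this article; the function name `median_centre` and the sample values are assumptions.

```python
import math
from statistics import median
from typing import Dict


def median_centre(levels: Dict[str, float]) -> Dict[str, float]:
    """Return log2 expression levels shifted so their median is 0.

    Non-positive levels cannot be log-transformed and are dropped.
    """
    logs = {gene: math.log2(v) for gene, v in levels.items() if v > 0}
    if not logs:
        return {}
    centre = median(logs.values())
    return {gene: x - centre for gene, x in logs.items()}


# Two made-up experiments measuring the same genes; the second reads
# ten times brighter overall, as might happen on a different platform.
exp_a = {"geneA": 100.0, "geneB": 200.0, "geneC": 400.0}
exp_b = {"geneA": 1000.0, "geneB": 2000.0, "geneC": 4000.0}

norm_a = median_centre(exp_a)
norm_b = median_centre(exp_b)
```

After centring, the two experiments give approximately the same value for each gene, because a constant scale factor becomes a constant offset on the log scale and is removed by subtracting the median. Real cross-platform comparison would need considerably more than this single step.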
We would like to encourage laboratories wishing to discuss submitting their data to the database to contact us.
Article by: Alvis Brazma et al.
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
Industry Programme (Associates) http://industry.ebi.ac.uk/BioStandards/
Micro-Array page http://industry.ebi.ac.uk/~alan/MicroArray/
External sites are not endorsed by EMBL-EBI