Monitoring the Progress of Major Genome Sequencing Projects: The Genome MOT
by Peter Sterk1 and Stephan Beck2
1 EMBL-Outstation Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
2 Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
During the last two years, genome-scale DNA sequencing has taken off. The complete genomes of a small number of microbes and the first eukaryotic genome, that of Saccharomyces cerevisiae, have been sequenced, and the 100 Mb genome of the worm Caenorhabditis elegans is expected to be finished this year, along with a number of smaller microbial genomes. An up-to-date list of ongoing projects can be found at the multipurpose automated genome project investigation environment (MAGPIE) World Wide Web site (http://www.mcs.anl.gov/home/gaasterl/magpie.html), which currently lists over 300 different projects, a number that is growing steadily. Systematic efforts to sequence the genomes of the eukaryotes Arabidopsis thaliana (~100 Mb), Drosophila melanogaster (~120 Mb) and Homo sapiens (~3000 Mb) are well under way and, provided that funding continues, these genomes should be completed in the early years of the next millennium. Because sequencing of these larger genomes is carried out in many different laboratories spread all over the world, the actual progress is difficult to monitor. In the absence of a suitable system to follow the progress of large genome sequencing projects, we have established a genome monitoring table (Genome MOT), which allows the progress of a number of projects to be viewed via the World Wide Web (http://www.ebi.ac.uk/~sterk/genome-MOT; see also Beck and Sterk (1998)).
The Genome MOT
The Genome MOT currently lists the total amounts of public, finished genomic DNA sequence submitted to the EMBL/GenBank/DDBJ databases, broken down per year, for Homo sapiens, Arabidopsis thaliana, Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae and Schizosaccharomyces pombe. The human data are broken down further by chromosome. For the definition of finished sequence and a description of the other criteria we have applied, please refer to the Genome MOT home page (http://www.ebi.ac.uk/~sterk/genome-MOT/). To represent the current status of each project, the total amounts of finished sequence are presented both as absolute values and as percentages of chromosome or genome sizes. For these figures to be meaningful, database redundancy (sequence duplication) had to be taken into account. Since assessing database redundancy is not trivial, we discuss this issue in the next section. The Genome MOT is updated automatically every Monday morning. A graphical representation in the form of a cumulative progress plot, updated monthly, and a prediction of completion dates, updated yearly, are also available.
Figure 1. Graphical representation of cumulative progress. Graph created using datasets from 9 February 1998.
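The per-year totals and cumulative percentages described above can be illustrated with a minimal sketch. It is not the code behind the Genome MOT; the function name `cumulative_progress` and the input layout (one `(year, length_bp)` pair per finished database entry) are assumptions made for illustration only.

```python
from collections import defaultdict


def cumulative_progress(entries, genome_size_bp):
    """Summarise finished sequence per year as a running total.

    entries: iterable of (year, length_bp) pairs, one per finished
             database entry (hypothetical input layout).
    genome_size_bp: estimated genome or chromosome size in base pairs.

    Returns a list of (year, cumulative_bp, percent_of_genome) tuples
    sorted by year.
    """
    # Sum the submitted base pairs for each year.
    per_year = defaultdict(int)
    for year, length_bp in entries:
        per_year[year] += length_bp

    # Accumulate the yearly sums and express each running total as a
    # percentage of the genome size.
    rows = []
    running_total = 0
    for year in sorted(per_year):
        running_total += per_year[year]
        percent = 100.0 * running_total / genome_size_bp
        rows.append((year, running_total, percent))
    return rows
```

Note that these raw percentages can exceed 100% when the same region has been submitted more than once, which is why redundancy has to be considered before the figures are interpreted.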
It is immediately evident from the Genome MOT that the 12 Mb yeast genome appears to have been submitted more than twice over. The level of sequence duplication can essentially be estimated in two ways: (1) by applying a somewhat arbitrary sequence length cutoff, thereby eliminating the contribution of smaller database entries, or (2) by eliminating database entries that are highly similar to a longer database entry. We have deliberately chosen to do both, as one method may be more suitable for a given project than the other. The first method is easy to apply and seems safe for projects that are well managed. We currently calculate the statistics applying cutoffs of 1000, 10000, 30000, 50000 and 100000 base pairs. Even the higher values do not seem unreasonable bearing in mind that, with sequencing techniques constantly improving, most sequences originating from genome sequencing centres are nowadays at least tens of thousands of base pairs long. The second method often involves manual intervention and can be time consuming. We have opted for a mixed and automatable approach, using the program CLEANUP by Grillo et al. (1996) to determine the redundancy in the datasets containing sequences longer than 1000 base pairs. Judging by the result for yeast, about 48% redundancy, this approach appears to be reasonably accurate. With the redundancy values thus obtained, we have adjusted the percentages of completed genome in order to present a more accurate picture of the present status of each of the projects.
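The two adjustments described above, a length cutoff and a redundancy correction, can be sketched as follows. This is an illustrative sketch only, not the Genome MOT implementation: the function names are hypothetical, the strict "longer than" comparison is an assumption, and the redundancy fraction is taken as a given input (for example, a value reported by a tool such as CLEANUP) rather than computed here.

```python
# Length cutoffs in base pairs, as listed in the text.
CUTOFFS_BP = (1000, 10000, 30000, 50000, 100000)


def totals_by_cutoff(entry_lengths_bp, cutoffs=CUTOFFS_BP):
    """Total base pairs contributed by entries longer than each cutoff.

    Assumption: an entry counts only if it is strictly longer than the
    cutoff; shorter entries are ignored for that cutoff.
    """
    return {
        cutoff: sum(n for n in entry_lengths_bp if n > cutoff)
        for cutoff in cutoffs
    }


def adjusted_percent(total_bp, redundancy_fraction, genome_size_bp):
    """Percentage of the genome completed after discounting redundancy.

    redundancy_fraction: share of submitted sequence judged redundant,
    between 0.0 and 1.0 (0.48 would correspond to 48% redundancy).
    """
    if not 0.0 <= redundancy_fraction <= 1.0:
        raise ValueError("redundancy_fraction must be between 0 and 1")
    non_redundant_bp = total_bp * (1.0 - redundancy_fraction)
    return 100.0 * non_redundant_bp / genome_size_bp
```

Comparing the totals across cutoffs gives a quick impression of how much of a dataset consists of short entries, while the redundancy correction scales the raw total down before it is expressed as a percentage of the genome size.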
We are about to include the progress statistics for a number of smaller genomes, presented as the total number of base pairs in the EMBL database and as a percentage of the genome size with database redundancy taken into account. We will also present a list of completed genomes with links to the corresponding database entries in the soon-to-be-implemented CONtig database division. These entries will not contain a sequence and feature table, but will instead hold all the information necessary to build the complete genome from existing database entries.
We have established the current status and progress of the major genome projects and obtained a reasonable indication of the level of redundancy for these projects. Relatively low levels of redundancy can be expected for projects with relatively little duplication of effort, in other words, managed projects. This appears to be the case for H. sapiens, A. thaliana and C. elegans in particular. The knowledge that these projects are managed efficiently is of particular importance when decisions on further funding for these and other projects have to be made. It is important to realize that, at present, funding has been awarded for only one third of the human genome; the target date of 2005 for completion is therefore still very speculative. Check the Genome MOT at regular intervals, and stay up to date!
Resources and further information
European Bioinformatics Institute
Genome MOT homepage
Multipurpose Automated Genome Project Investigation Environment (MAGPIE)
- Beck, S. and Sterk, P. (1998). Genome-scale DNA sequencing: where are we? Curr. Opin. Biotechnol. 9, 116-120.
- Grillo, G., Attimonelli, M., Liuni, S. and Pesole, G. (1996). CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. CABIOS 12, 1-8.