The Human Proteomics Initiative of SIB and EBI
Rolf Apweiler (1) and Amos Bairoch (2)
(1) The EMBL Outstation - The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
(2) Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
Introduction
The life science research community now confidently expects that a first draft of the human genome sequence will be made available in the first few months of the year 2000. As the task of sequencing the genome draws to its end, the focus of research will switch to the protein level by relating sequences to function. The life science research community will need full access to comprehensive, high quality protein sequence data.
With the SWISS-PROT protein sequence data bank, the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI) provide such a resource. SIB and EBI have now initiated a major effort to annotate, describe and distribute to the scientific community a large amount of highly curated information concerning human protein sequences. This initiative, hereafter known as the Human Proteomics Initiative (HPI), is tightly linked to an appeal to the user community to participate actively in this effort.
The HPI has two phases. The aim of the first phase, which will last until spring 2000, is to annotate the protein products of all known human genes. The second is a long-term commitment to continue to release well-annotated sequences of human protein entries as long as researchers produce new data.
The Human Proteomics Initiative
In a few months the combined efforts of a number of sequencing centres will produce a first draft of the human genome sequence. Such an endeavour is only a very preliminary step in the understanding of human biological processes. The first pitfall to overcome is the detection of all coding regions on the genomic sequence. Current algorithms, while being very powerful, are not capable of detecting with certainty all exons, are not well equipped to distinguish different splice variants and are unable to detect small proteins, which are numerous and crucial to many biological processes.
When all potential coding regions have been predicted, the user community will have at its disposal the sequences of 80'000 to 100'000 predicted proteins, but no information on the post-translational modifications (PTMs) that the majority of proteins undergo. Proteins, once synthesised on the ribosomes, are subjected to a multitude of modification steps. They are cleaved (thus eliminating signal sequences, transit- or pro-peptides and initiator methionines); many simple chemical groups, like acetyl-, methyl-, and phosphoryl-groups, can be attached to them, as well as some more complex molecules, such as sugars and lipids. Finally, they can be internally or externally cross-linked via disulfide bonds. More than a hundred different types of PTM are currently known and many more are yet to be discovered. The complexity due to all these modifications is compounded by the high level of diversity that alternative splicing can produce at the level of sequence. Thus the number of different protein molecules expressed by the human genome is probably closer to a million than to the hundred thousand generally considered by genome scientists.
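The jump from roughly a hundred thousand predicted proteins to about a million distinct protein molecules implies an average multiplier per predicted protein. The short sketch below works that out from the figures in the text. It treats "a million" as a round illustrative target, which is an assumption, and the function name forms_per_protein is hypothetical.

```python
def forms_per_protein(total_forms, predicted_proteins):
    """Average number of distinct molecular forms per predicted protein."""
    return total_forms / predicted_proteins

# Range of predicted proteins quoted in the text.
predicted_low, predicted_high = 80_000, 100_000

# Assumption: "closer to a million" is taken as exactly one million
# for the purpose of this back-of-the-envelope calculation.
total_forms = 1_000_000

for predicted in (predicted_low, predicted_high):
    ratio = forms_per_protein(total_forms, predicted)
    print(f"{predicted} predicted proteins -> {ratio:.1f} forms each on average")
```

Under these assumptions, each predicted protein would need to give rise to around ten or more distinct forms through splicing and modification combined.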
Polymorphisms at the protein sequence level add a further layer of complexity. While some of these polymorphisms are linked to disease states, most are not, yet many of them have a direct or indirect effect on the activities of the proteins.
We are therefore initiating a major project to annotate all known human sequences according to the quality standards of SWISS-PROT. This means providing, for each known protein, a wealth of information that includes the description of its function, its domain structure, subcellular location, post-translational modifications, variants, similarities to other proteins, etc. There are currently 5'300 annotated human sequences in SWISS-PROT. These entries are associated with about 14'500 literature references, 16'000 experimental or predicted PTMs, 800 splice variants and 8'000 polymorphisms (most of which are linked with disease states). We will use the current information as the basis for the HPI.
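To give a feel for the annotation density these figures represent, the following minimal sketch divides each total by the number of annotated human entries. The numbers are those quoted in the text; the variable and function names are illustrative only, and the averages say nothing about how annotations are distributed across individual entries.

```python
# Totals quoted in the text for annotated human entries in SWISS-PROT.
human_entries = 5_300
annotation_totals = {
    "literature references": 14_500,
    "PTMs (experimental or predicted)": 16_000,
    "splice variants": 800,
    "polymorphisms": 8_000,
}

def per_entry_averages(totals, entries):
    """Return the mean number of each annotation type per entry."""
    return {name: total / entries for name, total in totals.items()}

averages = per_entry_averages(annotation_totals, human_entries)
for name, value in averages.items():
    print(f"{name}: {value:.2f} per entry")
```

On these figures an average human entry carries close to three literature references and about three PTM annotations.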
The HPI contains a number of sub-components, which are briefly described below:
Annotation of all known human proteins. In the course of the next nine months (from July 1999 to the end of March 2000) the human protein sequences that are not yet in SWISS-PROT will be fully annotated. These sequences are either in the computer-annotated supplement TrEMBL; or will appear during the course of these nine months; or do not appear in any sequence database, because the coding sequence has not been annotated as such in the DNA databases or because the sequence has not been submitted. We will also review and complete the annotation of the human sequences currently in SWISS-PROT. At the end of this nine-month period we expect to be complete and up to date, and thereafter to keep up with the appearance of new data relevant to human proteins.
Annotation of mammalian orthologs of human proteins. We will make sure that for any human protein, existing orthologs in other mammalian species will also be annotated at a level equivalent to that of the human sequences.
Annotation of all known human polymorphisms at the protein sequence level. These are now commonly termed 'c-SNPs' (coding single nucleotide polymorphisms) or 'SAPs' (single amino-acid polymorphisms). As mentioned above, SWISS-PROT already holds information on a sizeable number of such polymorphisms, and it will significantly expand its effort to store and annotate all polymorphisms at the protein level.
Annotation of all known post-translational modifications in human proteins. During the next nine months a major effort will be made to supplement the already quite comprehensive description of known post-translational modifications in human proteins currently provided in SWISS-PROT.
Tight links to structural information. SWISS-PROT is tightly linked to the PDB/RCSB 3D-structure database and already includes many features useful to structural biologists (such as literature references concerning X-ray and NMR papers; links to the HSSP database; DSSP-derived secondary structure information, etc.). These tight links will be further expanded by providing - in close collaboration with the group of Manuel Peitsch (Glaxo Wellcome Experimental Research and SIB) - homology-derived models for all human proteins for which such an approach is scientifically relevant.
Clustering and classification of all known vertebrate proteins. By March 2000 we will have clustered all known vertebrate proteins in a hierarchical manner. This will be done in collaboration with the group of Jean-Jacques Codani (Gene-IT). Furthermore, we will classify all known vertebrate proteins based on the protein domain and protein family information collected in InterPro (Integrated resource of Protein domains and functional sites, an EU-funded collaborative project of the SWISS-PROT group at the EBI, the PROSITE groups at SIB, and the Pfam, PRINTS and ProDom database groups).
For all aspects of the HPI projects, we would appreciate the help and collaboration of the scientific community. Information concerning the human proteome is highly critical to a large section of the life science community. We therefore appeal to the user community to fully participate in this initiative by providing all the necessary information to help and to speed up the comprehensive annotation of the human proteome.
The HPI project has two different time-related aspects: one is a nine-month "marathon" to catch up with the current state of research; the other is a long-term commitment to keep the project alive for as long as necessary. As we are very far from the completion of the human proteome, this is a long-term challenge. It could hardly be met by the SWISS-PROT groups at SIB and EBI without the financial means now being provided by the yearly license fees paid by industrial companies for access to SWISS-PROT and related databases.
For more information on the HPI project, please consult the web pages listed in the 'Resources' section below.
If you would like to participate in the HPI project, please send us an e-mail at: hpi@isb-sib.ch
Article by: Rolf Apweiler, Amos Bairoch
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
Human Proteomics Initiative (HPI) http://www.ebi.ac.uk/swissprot/hpi/
SWISS-PROT http://www.ebi.ac.uk/swissprot/
Swiss Institute of Bioinformatics http://www.isb-sib.ch/
Human Proteomics Initiative (HPI) http://www.expasy.ch/sprot/hpi/