SWISS-PROT Release 36
Release 36.0 of SWISS-PROT contains 74'019 sequence entries, comprising 26'840'295 amino acids abstracted from 59'911 references. This represents an increase of 7% over release 35.
Changes since Release 35
Sequences and annotations
Since release 35, 4'976 sequences have been added, the sequence data of 712 existing entries has been updated, and the annotations of 9'954 entries have been revised.
What's happening with the model organisms
We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:
- Be as complete as possible. All sequences available at a given time should be immediately included in SWISS-PROT. This also includes sequence corrections and updates;
- Provide a higher level of annotation;
- Provide cross-references to specialised database(s) that contain, among other data, some genetic information about the genes that code for these proteins;
- Provide specific indices or documents.
What was done since the last release or in preparation for the next release concerning model organisms:
- We have continued our effort to catch up with the backlog of sequences from other model organisms. In particular, we added about 350 entries from human and from E.coli, 300 from mouse, 250 from S.pombe, 200 from M.jannaschii, 150 from C.elegans, and 100 from B.subtilis, from H.pylori and from M.tuberculosis.
- We plan to finish as quickly as possible the annotation of the Escherichia coli and Haemophilus influenzae sequence entries which are not yet part of SWISS-PROT.
Here is the current status of the model organisms in SWISS-PROT:
Organism       | Database cross-referenced | Index file     | Number of sequences
A.thaliana     | None yet                  | In preparation | 719
B.subtilis     | SubtiList                 | SUBTILIS.TXT   | 1970
C.albicans     | None yet                  | CALBICAN.TXT   | 192
C.elegans      | Wormpep                   | CELEGANS.TXT   | 1887
D.discoideum   | DictyDB                   | DICTY.TXT      | 280
D.melanogaster | FlyBase                   | FLY.TXT        | 1042
E.coli         | EcoGene                   | ECOLI.TXT      | 4416
H.influenzae   | HiDB (TIGR)               | HAEINFLU.TXT   | 1693
H.sapiens      | MIM                       | MIMTOSP.TXT    | 4980
H.pylori       | HpDB (TIGR)               | HPYLORI.TXT    | 334
M.genitalium   | MgDB (TIGR)               | MGENITAL.TXT   | 470
M.musculus     | MGD                       | MGDTOSP.TXT    | 3253
M.jannaschii   | MjDB (TIGR)               | MJANNASC.TXT   | 1283
M.tuberculosis | None yet                  | None yet       | 873
S.cerevisiae   | SGD                       | YEAST.TXT      | 4787
S.typhimurium  | StyGene                   | SALTY.TXT      | 706
S.pombe        | None yet                  | POMBE.TXT      | 1315
S.solfataricus | None yet                  | None yet       | 72
Collectively the entries from the above model organisms represent 40.9% of all SWISS-PROT entries.
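The 40.9% figure can be reproduced from the per-organism counts in the status table and the release total of 74'019 entries. The following is only a quick arithmetic check; the variable names are illustrative and do not come from any SWISS-PROT tool.

```python
# Sequence counts per model organism, copied from the status table.
MODEL_ORGANISM_COUNTS = {
    "A.thaliana": 719, "B.subtilis": 1970, "C.albicans": 192,
    "C.elegans": 1887, "D.discoideum": 280, "D.melanogaster": 1042,
    "E.coli": 4416, "H.influenzae": 1693, "H.sapiens": 4980,
    "H.pylori": 334, "M.genitalium": 470, "M.musculus": 3253,
    "M.jannaschii": 1283, "M.tuberculosis": 873, "S.cerevisiae": 4787,
    "S.typhimurium": 706, "S.pombe": 1315, "S.solfataricus": 72,
}

# Total number of sequence entries in release 36.0.
TOTAL_ENTRIES = 74019

model_total = sum(MODEL_ORGANISM_COUNTS.values())
share = 100 * model_total / TOTAL_ENTRIES

print(f"{model_total} of {TOTAL_ENTRIES} entries = {share:.1f}%")
```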
Changes affecting the accession numbers
With the creation of the TrEMBL database and the rapid increase in the amount of sequence data, we are faced with a shortage of available accession numbers. Currently we use a system based on a one-letter prefix followed by 5 digits. This system was also used by the nucleotide sequence databases, which had originally reserved the prefix letters 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run out of space (due mainly to ESTs), have been forced to start using a new format based on a two-letter prefix followed by 6 digits.
We have used up all possible numbers with 'P' and 'Q', and the only letter prefix which was not used by the nucleotide databases is 'O'. As we believe that changing the format of the accession numbers to that now used by the nucleotide databases would wreak havoc on the numerous software packages using SWISS-PROT, we have decided to keep a system of accession numbers based on a six-character code, but with the following changes:
We have started using 'O'. This extra letter should allow the continuation of the present format (1 prefix letter + 5 digits) for approximately one year.
Once we have used up 'O', we will introduce a system based on the following format:
Position:   1        2      3          4          5          6
Characters: [O,P,Q]  [0-9]  [A-Z,0-9]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
In other words, we will keep a six-character code, but positions 3, 4 and 5 of this code may hold any combination of letters and digits. This format allows a total of about 14 million accession numbers (up from 300'000 with the current system).
We only allow digits in positions 2 and 6 so that SWISS-PROT accession numbers cannot be mistaken for gene names, acronyms, other types of accession numbers or ordinary words.
Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
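The planned format and the quoted capacities can be checked with a short sketch. The pattern names and the helper `is_valid_accession` are illustrative assumptions and are not part of any SWISS-PROT software. Note that the current format (one prefix letter plus five digits) is a special case of the new one.

```python
import re

# Current format: one prefix letter (O, P or Q) followed by five digits.
OLD_FORMAT = re.compile(r"[OPQ][0-9]{5}")

# Planned format: positions 3, 4 and 5 may also hold the letters A-Z;
# positions 2 and 6 remain digits.
NEW_FORMAT = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(accession: str) -> bool:
    """Return True if the string fits the planned six-character format."""
    return NEW_FORMAT.fullmatch(accession) is not None


# Capacity: 3 prefix letters times the choices at each remaining position.
old_capacity = 3 * 10**5              # 300'000
new_capacity = 3 * 10 * 36**3 * 10    # 13'996'800, i.e. about 14 million

for acc in ["P0A3S2", "Q2ASD4", "O13YX2", "P9B123", "PA1234"]:
    print(acc, is_valid_accession(acc))
print(old_capacity, new_capacity)
```

The last string, `PA1234`, is a made-up counterexample: it has a letter in position 2 and is therefore rejected.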
Changes concerning the reference location line (RL)
The (IN) prefix used for books is now also used for references to the electronic Plant Gene Register. Example:
RL (IN) PLANT GENE REGISTER PGR98-023.
Cleaning up of the SIMILARITY comment line (CC) topic
We have started a major overhaul of the "SIMILARITY" topic. We would like the majority of the information stored in this topic to be usable by computer programs (while remaining human-readable). We are therefore standardising the format of this topic using two different subformats. One describes the family to which a protein belongs:
CC   -!- SIMILARITY: BELONGS TO THE {Name1} FAMILY [OF {Name2}].
CC       [{Name3} SUBFAMILY.]
The other describes which domains are found in a given protein:
CC   -!- SIMILARITY: CONTAINS n {Name} [DOMAIN|REPEAT][S].
We have already updated many entries in this release and plan to continue doing so for the next release.
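To illustrate how a program could use the standardised topic, here is a minimal sketch that parses the two subformats with regular expressions. It makes several assumptions: the topic text after "SIMILARITY:" has already been extracted and joined into one string, `n` is written as an integer, and the function name, the result keys and the family and domain names in the usage lines are all invented for the example.

```python
import re

# Subformat 1: BELONGS TO THE {Name1} FAMILY [OF {Name2}]. [{Name3} SUBFAMILY.]
FAMILY_RE = re.compile(
    r"BELONGS TO THE (?P<family>.+?) FAMILY"
    r"(?: OF (?P<group>.+?))?\."
    r"(?: (?P<subfamily>.+?) SUBFAMILY\.)?"
)

# Subformat 2: CONTAINS n {Name} [DOMAIN|REPEAT][S].
DOMAIN_RE = re.compile(
    r"CONTAINS (?P<count>\d+) (?P<name>.+?) (?P<kind>DOMAIN|REPEAT)S?\."
)


def parse_similarity(text):
    """Parse the text of a SIMILARITY topic; return a dict, or None if
    the text follows neither standardised subformat."""
    text = " ".join(text.split())  # collapse line wraps and extra spaces
    match = FAMILY_RE.fullmatch(text)
    if match:
        return {"type": "family", **match.groupdict()}
    match = DOMAIN_RE.fullmatch(text)
    if match:
        return {
            "type": "domain",
            "count": int(match["count"]),
            "name": match["name"],
            "kind": match["kind"],
        }
    return None


# Hypothetical topic texts, not taken from real entries.
print(parse_similarity("BELONGS TO THE EXAMPLE FAMILY. DEMO SUBFAMILY."))
print(parse_similarity("CONTAINS 3 EXAMPLE REPEATS."))
```

Entries that have not yet been converted to the new subformats simply return `None`, so a caller can fall back to treating the comment as free text.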
Changes concerning cross-references (DR line)
We have added cross-references from SWISS-PROT to the Mendel database, a plant gene nomenclature database from the Commission for Plant Gene Nomenclature (CPGN). These cross-references are present in the DR lines:
Data bank identifier: MENDEL
Primary identifier: The Mendel accession number for a gene in a given species.
Secondary identifier: Composed of the acronym of the species (generally the same five-letter code as that defined and used by SWISS-PROT in the entry name), the gene name and a number.
For example:
DR MENDEL; 294; Amahy;psbA;1.
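As a sketch of how such a line might be read by a program, the function below splits the example DR line into its parts. The function name and the keys of the returned dictionary are illustrative assumptions. It also assumes that the three components of the secondary identifier are always separated by semicolons, as in the example.

```python
def parse_mendel_dr(line):
    """Split a MENDEL DR line into data bank, accession, species acronym,
    gene name and number. Raise ValueError for anything else."""
    parts = line.split(None, 1)  # line code, then the rest of the line
    if len(parts) != 2 or parts[0] != "DR":
        raise ValueError("not a DR line")

    # Drop the terminating period, then separate the three identifiers.
    fields = [f.strip() for f in parts[1].strip().rstrip(".").split(";", 2)]
    if len(fields) != 3 or fields[0] != "MENDEL":
        raise ValueError("not a MENDEL cross-reference")

    # The secondary identifier is species acronym; gene name; number.
    secondary = [s.strip() for s in fields[2].split(";")]
    if len(secondary) != 3:
        raise ValueError("unexpected secondary identifier")

    species, gene, number = secondary
    return {
        "database": fields[0],
        "accession": fields[1],
        "species": species,
        "gene": gene,
        "number": int(number),
    }


print(parse_mendel_dr("DR MENDEL; 294; Amahy;psbA;1."))
```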
Information by: Rolf Apweiler
Resources and further information
The European Bioinformatics Institute (EMBL-EBI) http://www.ebi.ac.uk/
SWISS-PROT & TrEMBL homepage http://www.ebi.ac.uk/ebi_docs/swissprot_db/swisshome.html
About TrEMBL (this BioInformer issue) http://bioinformer.ebi.ac.uk/newsletter/archives/4/trembl.html
John Innes Centre, Norwich (UK) http://www.jic.bbsrc.ac.uk/
Commission for Plant Gene Nomenclature (CPGN) http://jiio6.jic.bbsrc.ac.uk:80/index.html
Tarweed Consulting, California (USA) http://www.tarweed.com/
Plant Gene Register http://www.tarweed.com/pgr/
External sites are not endorsed by EMBL-EBI