EMBL Release 63
The EMBL Nucleotide Sequence Database was frozen to make Release 63 on 03-Jun-2000. The release contains 6,760,113 sequence entries comprising 8,255,674,441 nucleotides. This represents an increase of about 228% over Release 59.
A breakdown of Release 63 by division is shown below:
Division | Entries | Nucleotides
Bacteriophage | 1,557 | 4,094,860
ESTs | 4,229,786 | 1,641,059,980
Fungi | 39,343 | 72,346,332
GSSs | 1,524,055 | 837,960,057
HTG | 67,380 | 3,756,057,434
Human | 113,725 | 824,675,648
Invertebrates | 51,775 | 326,125,907
Organelles | 66,587 | 55,930,648
Other Mammals | 24,911 | 23,245,210
Other Vertebrates | 22,846 | 26,434,387
Patents | 205,631 | 66,500,758
Plants | 65,352 | 203,929,799
Prokaryotes | 79,841 | 189,149,365
Rodents | 53,782 | 84,905,348
STSs | 115,362 | 50,539,815
Synthetic | 3,715 | 9,289,768
Unclassified | 1,190 | 1,785,944
Viruses | 93,275 | 81,643,181
Total | 6,760,113 | 8,255,674,441
Eight billion nucleotides
On 23-May-2000 the number of nucleotides in the EMBL Database passed the 8,000,000,000 mark. Over the last 8 months (compare 1-Oct-1999: 3.6 Gigabases), the database size has increased by more than 128%.
Draft Human Genome
The completion of the human draft genome sequence was announced on 26-Jun-2000. The draft sequence data are available from the EMBL Database HTG and HUM divisions. As a direct result of the human draft sequencing, the Release 63 HTG (High Throughput Genome Sequences) division files now include over 3,756 Mb, compared to 1,137 Mb in Release 61 (Dec 1999). See also the Genome Monitoring Table for further detailed information.
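The growth figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name `percent_increase` is our own, and the second calculation assumes that the Release 63 total is compared against the 3.6 Gigabases reported for 1 Oct 1999.

```python
def percent_increase(old: float, new: float) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


# HTG division size in Mb: Release 61 (Dec 1999) vs Release 63.
htg_growth = percent_increase(1_137, 3_756)

# Whole database: 3.6 Gigabases on 1 Oct 1999 vs the Release 63 total
# (assumption: the release total is the figure being compared).
total_growth = percent_increase(3.6e9, 8_255_674_441)

print(f"HTG division growth: {htg_growth:.0f}%")
print(f"Database growth:     {total_growth:.0f}%")
```

Under these assumptions the HTG division has roughly tripled in size since Release 61.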
Base Quality Values
Quality scores from draft HTG data are available on the EBI FTP server. The gzipped files in the directory given below contain base quality values for unfinished human sequences from Japanese, US and European sequencing centres. The FastA-type headers contain the EMBL accession number and version of the corresponding database entries.
Example:
AL009030.9 Phrap Quality (Length:229022, Min: 3, Max: 99)
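A header of this form can be split into its parts with a regular expression. The following is a minimal sketch, not an official parser: the class and function names are hypothetical, and it assumes that every header follows the layout of the single example above, optionally preceded by the usual FastA `>` character.

```python
import re
from typing import NamedTuple


class QualityHeader(NamedTuple):
    accession: str   # EMBL accession number, e.g. AL009030
    version: int     # sequence version, e.g. 9
    method: str      # free text before the parenthesis, e.g. "Phrap Quality"
    length: int
    min_score: int
    max_score: int


# Assumed layout: ACCESSION.VERSION <method> (Length:N, Min: N, Max: N)
HEADER_RE = re.compile(
    r"^>?\s*(?P<acc>[A-Z]+\d+)\.(?P<ver>\d+)\s+(?P<method>.+?)\s*"
    r"\(Length:\s*(?P<len>\d+),\s*Min:\s*(?P<min>\d+),\s*Max:\s*(?P<max>\d+)\)\s*$"
)


def parse_quality_header(line: str) -> QualityHeader:
    """Parse one quality-score header line; raise ValueError if it does not match."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised quality header: {line!r}")
    return QualityHeader(
        m["acc"], int(m["ver"]), m["method"],
        int(m["len"]), int(m["min"]), int(m["max"]),
    )


header = parse_quality_header(
    "AL009030.9 Phrap Quality (Length:229022, Min: 3, Max: 99)"
)
print(header.accession, header.version, header.length)
```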
To keep file sizes manageable, files that are larger than 1 GB in uncompressed form are split into smaller files.
Directory: ftp://ftp.ebi.ac.uk/pub/databases/embl/quality_scores
Current Files:
/htg_sanger1.qscore.gz - /htg_sanger3.qscore.gz
/htg_genoscope1.qscore.gz
/htg_mpimg1.qscore.gz
/htg_gbf1.qscore.gz
/htg_japan1.qscore.gz
/htg_us1.qscore.gz - /htg_us8.qscore.gz
Quality score files are updated on a daily basis.
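For scripted downloads, the file list above can be expanded into full URLs. This is a sketch under two assumptions: that a range such as htg_us1 to htg_us8 denotes consecutively numbered files, and that the number of parts per source is as listed at the time of writing (it may change as the files are updated). The names `PARTS` and `quality_score_urls` are our own.

```python
BASE_URL = "ftp://ftp.ebi.ac.uk/pub/databases/embl/quality_scores"

# Number of file parts per source, taken from the "Current Files" list.
# Assumption: parts are numbered consecutively starting at 1.
PARTS = {"sanger": 3, "genoscope": 1, "mpimg": 1, "gbf": 1, "japan": 1, "us": 8}


def quality_score_urls(parts=PARTS, base_url=BASE_URL):
    """Return the full URL of every quality-score file part."""
    return [
        f"{base_url}/htg_{source}{n}.qscore.gz"
        for source, count in parts.items()
        for n in range(1, count + 1)
    ]


urls = quality_score_urls()
for url in urls:
    print(url)
```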
ENSEMBL automatic annotation
Ensembl provides automatic annotation of the human draft genome data. Ensembl information includes, for example, confirmed peptides, confirmed cDNAs and predicted peptides. Repeat prediction and the integration of map information and SNPs are also available. These data are available through the Ensembl web site. Ensembl is a joint project between the Sanger Centre and EMBL-EBI.
Forthcoming Changes
Genome Representation
At the May 2000 Collaborative Meeting, the DDBJ/EMBL/GenBank sequence database collaboration confirmed that the existing experimental FTP directory representing genome data will be transformed into a database division, CON (Constructed Sequences), representing complete genomes and other long sequences constructed from segment entries. CON division entries will contain the construct information (accession numbers and sequence locations) used to build the genomes, together with all accession numbers relevant to the genome. CON entries and the corresponding information will be included in the daily data exchange between the collaborating databases. Additionally, we are planning to provide the complete entry in EMBL format (DNA and features) plus the complete DNA sequence in Fasta format. These entries will be linked, searchable and retrievable through SRS and available for BLAST and FASTA homology searching.
For an example representation, see the bacterial genome of Chlamydia muridarum (AE002160) in ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Bacteria/cmuridarum/
AE002160.con AE002160.embl AE002160.embl.Z AE002160.fasta AE002160.fasta.Z
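The idea of a sequence constructed from segment entries can be illustrated as follows. This is a toy sketch only: it does not read the .con file format, the accession numbers and sequences are invented, and the use of 1-based inclusive coordinates is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    accession: str  # accession number of the segment entry
    start: int      # first base taken from that entry (1-based, assumed)
    end: int        # last base taken from that entry (inclusive, assumed)


def build_construct(segments, sequences):
    """Join the referenced stretch of each segment entry, in order."""
    parts = []
    for seg in segments:
        seq = sequences[seg.accession]
        if not 1 <= seg.start <= seg.end <= len(seq):
            raise ValueError(f"location out of range for {seg.accession}")
        parts.append(seq[seg.start - 1:seg.end])
    return "".join(parts)


# Hypothetical segment entries, for illustration only.
sequences = {"SEG0001": "acgtacgtac", "SEG0002": "ttttggggcc"}
construct = [Segment("SEG0001", 1, 6), Segment("SEG0002", 5, 10)]

genome = build_construct(construct, sequences)
print(genome)
```

Storing only accession numbers and locations keeps the construct description small, since the sequence data itself remains in the segment entries.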
New HTC (High Throughput cDNA) division
At the May 2000 Collaborative Meeting, DDBJ/EMBL/GenBank agreed to create a new database division, HTC, to represent unfinished High Throughput cDNA sequences. HTC sequences may include 5'UTR and 3'UTR regions and all or part of a coding region; some sequences may also include introns (pre-mature mRNAs). Once finished, these sequences will be moved to the corresponding taxonomic division. HTC sequence entries will include the keyword 'HTC'. The keyword will be removed once the entry has been moved to the appropriate taxonomic division.
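Once the division is in place, unfinished cDNA entries could be recognised by their keyword. The sketch below assumes the usual EMBL flat-file conventions (keywords on KW lines, separated by semicolons and ended by a full stop); the entry shown is invented and the function names are our own.

```python
def entry_keywords(entry_lines):
    """Collect the keywords from the KW lines of one flat-file entry."""
    text = " ".join(
        line[2:].strip() for line in entry_lines if line[:2] == "KW"
    )
    return {kw.strip() for kw in text.rstrip(".").split(";") if kw.strip()}


def is_unfinished_htc(entry_lines):
    """True if the entry still carries the 'HTC' keyword."""
    return "HTC" in entry_keywords(entry_lines)


# Invented entry, for illustration only.
entry = [
    "ID   ZZ999999   standard; RNA; HTC; 1200 BP.",
    "KW   HTC; example keyword.",
    "//",
]

print(is_unfinished_htc(entry))
```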
EMBL cumulative update file
We intend to discontinue the provision of the single cumulative update file. Several sites have reported problems handling the EMBL cumulative update file once it grows beyond 2 GB (uncompressed), because their file systems do not support files larger than 2 GB. Instead of the cumulative.dat.gz file, we will continue to make available on our FTP server a set of smaller data files, named cum_*.dat.gz, which together contain the same data as the full cumulative update file. For further details please check the README file in directory ftp://ftp.ebi.ac.uk/pub/databases/embl/new/
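Sites that currently process the single cumulative file can treat the set of smaller files as one stream without ever creating a file larger than 2 GB. The following is a sketch only; it assumes the cum_*.dat.gz files have already been downloaded to a local directory, that sorted file-name order is acceptable, and that each entry ends with a "//" line as in the EMBL flat-file format. The function names are our own.

```python
import glob
import gzip
import os


def iter_update_lines(directory, pattern="cum_*.dat.gz"):
    """Yield every line of every matching file, one file after another."""
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # Decompress on the fly; nothing is written to disk.
        with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
            yield from handle


def count_entries(directory):
    """Count entries by their '//' terminator lines."""
    return sum(1 for line in iter_update_lines(directory) if line.startswith("//"))
```

A call such as `count_entries("/data/embl/new")` (a hypothetical local path) would then report the number of entries across all parts.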
Information by: EBI's EMBL team
Resources and further information
European Bioinformatics Institute (EMBL-EBI) http://www.ebi.ac.uk/
EMBL Nucleotide Sequence Database homepages http://www.ebi.ac.uk/embl/
Latest EMBL Nucleotide Sequence Database release ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
Genome Monitoring Table http://www.ebi.ac.uk/Databases/Genome_MOT/genome_mot.html
Ensembl http://www.ensembl.org/
The Sanger Centre http://www.sanger.ac.uk/
External sites are not endorsed by EMBL-EBI