INFOGENE
Databases of Genome Known and Predicted Genes and Proteins
INFOGENE was designed by Victor Solovyev and Asaf Salamov, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
Summary and Perspective
Recently a broad agreement has been reached amongst genome centres in the US, the Sanger Centre, the Wellcome Trust and the US funding agencies to go ahead with a plan that will deliver all of the human sequence, part finished and part in draft, into the public domain by the end of 2001.
Using gene prediction, the scientific community can start to work experimentally with any human gene during the next 3 years, because gene-finding programs usually predict at least the major part of the exons in a gene sequence accurately. Our experience shows that prediction accuracy on long genomic sequences is significantly lower (by 10-20%) than in the commonly reported tests. However, an exon predicted by two programs that are based on different approaches is much more likely to be real than an exon predicted by a single program. The predictions presented here were made with our program Fgenes (Solovyev, 1997), based on a pattern recognition approach, and with Genscan (Burge, Karlin, 1997), based on a probabilistic approach. The new Fgenes-H program (Salamov, Solovyev, 1998) will be used for future updates.
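The idea of trusting exons that two independent programs agree on can be illustrated with a minimal sketch. This is not the pipeline used to build the databases: the Exon record, the function names and the coordinates below are hypothetical, and real predictor output would first have to be parsed into this form.

```python
from typing import List, NamedTuple


class Exon(NamedTuple):
    start: int   # 1-based inclusive start coordinate (assumed convention)
    end: int     # 1-based inclusive end coordinate
    strand: str  # "+" or "-"


def consensus_exons(pred_a: List[Exon], pred_b: List[Exon]) -> List[Exon]:
    """Exons predicted with identical start, end and strand by both programs."""
    return sorted(set(pred_a) & set(pred_b))


def overlap_supported(pred_a: List[Exon], pred_b: List[Exon]) -> List[Exon]:
    """Exons from pred_a that overlap at least one same-strand exon in pred_b."""
    supported = []
    for a in pred_a:
        for b in pred_b:
            # Two closed intervals overlap when each starts before the other ends.
            if a.strand == b.strand and a.start <= b.end and b.start <= a.end:
                supported.append(a)
                break
    return supported


# Hypothetical predictions for one sequence from two programs.
program_one = [Exon(100, 250, "+"), Exon(400, 520, "+"), Exon(900, 1010, "+")]
program_two = [Exon(100, 250, "+"), Exon(410, 520, "+"), Exon(1500, 1600, "+")]

exact = consensus_exons(program_one, program_two)
partial = overlap_supported(program_one, program_two)
print("Identical in both programs:", exact)
print("Supported by an overlapping prediction:", partial)
```

Exact agreement is the stricter criterion. The overlap test also keeps exons where the two programs disagree only on a boundary, which may still be useful when the goal is to obtain the corresponding cDNA.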
We present new databases:
- Known GENES Structure and Functioning Database (INFOGENE Rel 1.)
- Sections: HUMAN_G, MOUSE_G, DROSM_G, and ARABT_G
These databases include the structure of known genes and their functional sites, such as the start of transcription, TATA-box, and poly-A signal (where known).
- Nucleotide and Protein sequences of INFOGENE genes
IG_NUC and IG_PRO
- Predicted GENES Structure and protein Databases (INFOGENEP Rel 1.)
Currently this DB includes genes predicted for finished and unfinished sequences from the Sanger Centre. The database contains 1500 loci and 18000 protein sequences corresponding to genes predicted by the Fgenes and Genscan programs.
Similarity to known proteins and ESTs is included in the data, and it will be possible to run keyword searches in SRS to find proteins of interest.
If you find an interesting similarity with your sequence, you can use the INFOGENEP ID to check the gene structure of this protein in the INFOGENEP DB and get the corresponding clone name and sequence.
Because it will not be possible in the near future to verify experimentally all genes in the sequences produced by genome sequencing projects, computational prediction may be of great significance for the study of new genes and proteins.
For example, you can use a BLAST search or keyword search to find predicted genes of current interest. Because most predicted exons should be accurate (on average), you can use them to obtain the corresponding cDNA and verify the exact gene structure.
This DB includes all predicted genes and proteins for the Human genome draft as well as genes and proteins predicted for other model organisms such as Drosophila and Arabidopsis.
We plan to make links between similar genes and connect the genes with known regulatory information in collaboration with the TRRD database developers from IC&G, Novosibirsk (Russia).
Information by: Victor Solovyev
Resources and further information
The Sanger Centre http://www.sanger.ac.uk/
Computational Genomics Group http://genomic.sanger.ac.uk/
Databases of Genome Known and Predicted Genes and Proteins http://genomic.sanger.ac.uk/bla/infogen.html
Gapped BLAST search in the Database of protein sequences of predicted genes of finished and unfinished human sequences. http://genomic.sanger.ac.uk/db.html
Institute of Cytology and Genetics http://www.bionet.nsc.ru/
Siberian Regional Centre for Geneinformatics http://www.bionet.nsc.ru/SRCG/index.html
TRRD - database of transcription regulatory regions on eukaryotic genomes http://wwwmgs.bionet.nsc.ru/systems/TRRD/
External sites are not endorsed by EMBL-EBI