Standards to Create Clean Data Sets for Gene Prediction
T.A. Thanaraj EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
Gene prediction
Computational gene prediction tools are now essential components of every genome sequencing project. Annotation of sequenced DNA regions depends largely on such methods. Current computational approaches involve the following steps:
- Identification of gene structural elements (such as translation start/stop and splice sites) in an unknown sequence using signals observed in sequences of known structural elements;
- Identification of potential coding regions either by homology searches against databases (such as that of protein, EST, and cDNA sequences) or by searching the sequence for signals that characterise coding as opposed to non-coding sequences;
- Identification of potential exons using the outcome of the above two procedures; and
- Assembly of exons to find the optimal gene model by maximising the combined probabilities associated with the constituent elements of the model.
Need for high quality data sets
The above-mentioned procedure to identify the gene structural elements and coding regions requires an accurate encapsulation of information from known sites. Different models such as patterns, profiles, weight matrices, and neural networks have been used to achieve this. The success of these methods depends largely on the quality of the data sets that are used as the training set. Different researchers have created their own data sets in an ad hoc manner and until recently there had been no single common data set. A standardised data set is necessary when the prediction accuracy of different programs needs to be assessed. It is a general observation that accuracies calculated for different programs with a common data set are lower than those reported by the authors originally.
Thus there is a need to define standards as well as to create clean data sets which can be used to train as well as to assess different gene prediction programs.
Methods and Standards for creating clean data sets
1. Standard Cleaning Procedures
The aim of this step is to select from the databases only those nucleotide entries whose annotation is free of errors. The following selection criteria are recommended.
- Gene entries have been determined by individual researchers and are not the outcome of genome projects;
- The entries report genuine nuclear DNA and not any synthetic, artificial, or foreign genes;
- The entries do not contain a false gene or any alternative gene products, conflicts, variations, or mutations in the nucleotide sequence;
- The entry contains a complete coding region for a gene and has at least one intron;
- The description of the gene structure as given in the feature table (such as mRNA, CDS, exon, intron, 5' and 3' UTR, poly-A signal) is checked for consistency in annotation;
- Every nucleotide in a region, the end points of which are defined in the feature table, has been annotated as belonging to either an exon or an intron.
In addition, the entries are to be scrutinised against simple sanity checks, such as that
- the stop and start codons are standard ones (ATG for a start codon; TAA, TAG, TGA for stop codons);
- the coding length is a multiple of 3 nucleotides;
- no in-frame stop codon occurs;
- the splice sites are marked by the universal consensus di-nucleotide sequences, namely GT and AG (with the introns starting with GT and ending with AG).
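The sanity checks listed above lend themselves to a short script. The following is a minimal sketch rather than the procedure actually used to build the data sets: the function name `sanity_check`, the 0-based end-exclusive coordinate convention, and the assumption that the entry is given on the forward strand are all illustrative choices.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sanity_check(genomic, exons):
    """Return a list of problems found for one gene entry.

    genomic: nucleotide sequence of the entry (forward strand assumed).
    exons:   coding segments as (start, end) pairs, 0-based and
             end-exclusive, in transcription order (an assumed convention).
    """
    genomic = genomic.upper()
    problems = []
    cds = "".join(genomic[start:end] for start, end in exons)

    if len(cds) % 3 != 0:
        problems.append("coding length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("non-standard start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("non-standard stop codon")

    # Every codon before the last three nucleotides must not be a stop codon.
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at coding position {i}")
            break

    # Each intron lies between two consecutive exons and should be GT...AG.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genomic[prev_end:next_start]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {prev_end}-{next_start} is not GT...AG")

    return problems
```

An entry would be retained only when the returned list is empty. Checks that need the feature table itself (for example, consistency between mRNA, CDS, and exon features) are not covered by this sketch.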
2. Eliminating redundant sequences
Since the data sets are used as learning sets to derive signals, they should contain a unique representative of each of the different possible nucleotide distributions in the sequences. This is achieved by removing the redundancy in sequences. For each entry, an exonic sequence can be constructed by concatenating its constituent exons. The exonic sequences are searched against each other for similarity using FASTA. In a similar manner, an intronic sequence can be constructed for every entry by concatenating its introns. The intronic sequences are also searched for similarity amongst each other. To ensure non-redundancy in the data set, if any group of entries shares greater than 80% identity (either in the exonic or in the intronic sequences), then only one sequence is retained and the others are discarded.
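The filtering step can be sketched as a greedy selection. This is only an illustration under stated assumptions: the article uses FASTA to measure similarity, whereas the `sequence_identity` function below is a crude stand-in built on Python's `difflib`, and the greedy "keep the first member of each group" policy is one possible reading of the rule, not necessarily the one applied to the published data sets.

```python
from difflib import SequenceMatcher


def sequence_identity(a, b):
    """Crude stand-in for the identity that a FASTA search would report."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(entries, threshold=0.80, identity_fn=sequence_identity):
    """Greedily keep entries that are not too similar to any kept entry.

    entries: list of (name, exonic_seq, intronic_seq) tuples, where the
             exonic and intronic sequences are the concatenated exons
             and introns of the entry.
    An entry is discarded when either its exonic or its intronic sequence
    exceeds the identity threshold against an already retained entry.
    """
    kept = []
    for name, exonic, intronic in entries:
        redundant = any(
            identity_fn(exonic, kept_exonic) > threshold
            or identity_fn(intronic, kept_intronic) > threshold
            for _, kept_exonic, kept_intronic in kept
        )
        if not redundant:
            kept.append((name, exonic, intronic))
    return kept
```

In practice `identity_fn` would be replaced by a lookup into the pairwise results of the actual FASTA runs; the result of a greedy pass also depends on the order in which entries are presented.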
3. Decision Trees to identify unusual splice sites
It is often useful to cluster the nucleotide regions that are annotated to code for a particular function into groups that share common characteristics. Regions that contain erroneous sites would often be highlighted in this process and such regions can be examined further. Below we illustrate usage of a decision tree to validate the annotation of splice sites. A decision tree finds rules that recursively bifurcate the data set in order to produce subsets that are homogeneous within subsets and heterogeneous between subsets. The contents of these subsets can be described by a set of rules that use one or more data fields (termed as analysis candidates) of the data. In situations where the incoming data are of uncertain quality, the unusual data are often highlighted when they do not comply with the rules for groups found with the training set. Thus a decision tree is useful for identifying the unusual splice sites that warrant additional scrutiny.
A mixed population of real splice sites and false splice sites is used as input to the decision tree. Each of the records is composed of 40 data fields that represent the nucleotides at positions -20 to +20 around the splice sites. A four-layer decision tree (see Figure 1) produces eight leaf nodes of different match rates of real to false sites.
Figure 1: Decision tree for human acceptor sites. The li labels denote intronic positions around the acceptor sites; the le labels denote exonic positions.
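As an illustration of the input format described above, the 40 data fields of one record could be extracted as follows. This is a sketch under assumptions: the function name `site_record` and the convention that `junction` is the 0-based index of the first nucleotide downstream of the splice junction are hypothetical, and the fields are keyed simply by their offsets -20 to -1 and +1 to +20.

```python
def site_record(genomic, junction, flank=20):
    """Build the record of nucleotides around one candidate splice site.

    junction: 0-based index of the first nucleotide downstream of the
              splice junction (an assumed convention).
    Returns a dict keyed by offset (-flank..-1 and +1..+flank), or None
    when the window would run past either end of the sequence.
    """
    if junction - flank < 0 or junction + flank > len(genomic):
        return None
    record = {}
    for k in range(1, flank + 1):
        record[-k] = genomic[junction - k].upper()      # upstream side
        record[k] = genomic[junction + k - 1].upper()   # downstream side
    return record
```

Records built this way from annotated (real) sites and from other GT or AG positions (false sites) would together form the mixed population given to the decision tree.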
Donor sites belonging to end node seven follow the rule "(li5 != G) AND (le1 != G) AND (li4 != A)". This rule characterises a population of 1271 sites, of which only two are true donor sites and the rest are false sites. The eleven true donor sites that fall alongside 1567 false donor sites in nodes seven and eight may be interpreted either as the results of incorrect annotation or as exceptions, and may require further study. The other six nodes (nine to fourteen) require further splitting, as the rules generated thus far are not able to differentiate adequately between true and false donor sites. Such unusual sites can be studied further by examining the information in the corresponding SWISS-PROT entries, by matching with the corresponding mRNA entries (if available), and by studying the CLUSTALW alignment of the translated protein sequence with homologous proteins from SWISS-PROT. A typical example of such an exercise is shown in Figure 2. The CLUSTALW alignment of the translated protein of the sequence containing one of the odd splice sites picked up by the decision tree (it corresponds to GAL1_HUMAN) shows a gap of 4 amino acids, implying that the annotated splice site is probably wrong.
Figure 2: CLUSTALW alignment of the translated protein of a sequence with wrong annotation at a splice site.
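The node-seven rule quoted above can be applied directly to flag annotated sites for closer inspection. The sketch below assumes that each record is a dictionary keyed by position labels, with `li4` and `li5` taken to be the fourth and fifth intronic positions counted from the junction and `le1` the exonic position adjacent to it; that numbering and the function names are assumptions made for illustration.

```python
def matches_node_seven(record):
    """Apply the rule (li5 != G) AND (le1 != G) AND (li4 != A)."""
    return (
        record["li5"] != "G"
        and record["le1"] != "G"
        and record["li4"] != "A"
    )


def flag_unusual_true_sites(records):
    """Return names of annotated (true) sites that fall into node seven.

    records: iterable of (name, record, annotated_as_true) triples.
    Sites returned here sit in a node dominated by false sites and are
    therefore candidates for manual checking.
    """
    return [
        name
        for name, record, annotated_as_true in records
        if annotated_as_true and matches_node_seven(record)
    ]
```

A flagged site is not necessarily wrong; as noted above, it may be a genuine exception, which is why the follow-up checks against protein and mRNA entries are needed.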
Validation by matching with cDNA and RNA sequences
The splice sites can be further validated by matching the DNA entry with the corresponding cDNA and RNA entries (if they are available). Such an exercise helps to obtain experimental support for the annotated splice sites. However, it is advisable to use EST sequences, because they outnumber RNA and cDNA sequences in the databases.
Validation by match with human EST sequences
A splice site is characterised by its donor and acceptor junctions (see Figure 3). Since EST sequences contain only the exon regions, the splice sites can be confirmed by comparing the EST sequences with DNA sequence fragments that encompass the sites. This provides experimental support for the annotation of the splice sites. The same exercise also makes it possible to check whether the sites are involved in alternative splicing events. One possible construct of sequence fragments is shown in Figure 3.
Figure 3: A typical query construct for confirming a splice site by matching with EST sequences.
In this type of query sequence, the fifty-nucleotide exon region preceding the donor site (exon_EI) and the fifty-nucleotide exon region following the acceptor site (exon_IE) are concatenated. This 100-nucleotide sequence is searched for similarity against EST sequences in EMBL using FASTA. The objective of this search is to identify at least one EST sequence that matches the query sequence (ideally along all 100 nucleotides) and thus to confirm the annotation of both the 5' and 3' splice sites. EST sequences with mismatches restricted to the point of concatenation (shown in Figure 3) would indicate that the splice site is probably wrongly annotated. An EST sequence that matches only exon_EI or only exon_IE, but not both fragments, indicates possible alternative processing involving the splice site.
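The query construct and the interpretation of a hit can be sketched as follows. The coordinate conventions, the function names, and in particular the `min_overlap` cut-off are assumptions made for illustration; the article does not state how much of each fragment an EST must cover, and the case of mismatches at the point of concatenation is not handled here.

```python
def build_est_query(genomic, donor_pos, acceptor_pos, flank=50):
    """Concatenate exon_EI and exon_IE into one query sequence.

    donor_pos:    0-based index of the first intron nucleotide.
    acceptor_pos: 0-based index of the first nucleotide of the
                  downstream exon.
    (Both conventions are assumptions of this sketch.)
    """
    if donor_pos < flank or acceptor_pos + flank > len(genomic):
        raise ValueError("not enough flanking exon sequence")
    exon_ei = genomic[donor_pos - flank:donor_pos]
    exon_ie = genomic[acceptor_pos:acceptor_pos + flank]
    return exon_ei + exon_ie


def interpret_match(aligned_start, aligned_end, flank=50, min_overlap=10):
    """Classify an EST hit by the query span it aligns to.

    aligned_start, aligned_end: matched region of the query, 0-based and
    end-exclusive. min_overlap is a hypothetical cut-off for counting a
    fragment as matched.
    """
    covers_ei = flank - aligned_start >= min_overlap
    covers_ie = aligned_end - flank >= min_overlap
    if covers_ei and covers_ie:
        return "confirms both splice sites"
    if covers_ei or covers_ie:
        return "possible alternative processing"
    return "uninformative"
```

Because EST sequences are often partial, a one-sided match is only a hint of alternative processing and would need to be checked against the full alignment.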
Announcement of data sets at EBI
We have created data sets conforming to the standards (mentioned in this article) for genes from Homo sapiens (219 gene entries), Mus musculus (110 gene entries), Drosophila melanogaster (127 gene entries), C. elegans (89 gene entries), and Arabidopsis thaliana (132 gene entries). However, EST confirmation has been carried out only in the case of Homo sapiens.
The data set for human is more exhaustive than those for the other species. Around 625 donor and acceptor sites have been confirmed by EST matches. These splice sites can be used as a high quality learning set. We have also provided a set of around 225 EST-confirmed splice sites for use as a test set. In addition, we have provided a list of regions (from these gene entries) that have a high confidence of not possessing any functional splice sites. These regions are useful for generating a control set of false splice sites.
The data sets and further details can be obtained from the EBI Industry Programme's Gene Prediction web pages.
Other related web site resources for clean data sets
Recently HGMP has made efforts to collect and distribute common data sets for gene finding. Data sets by other researchers can be obtained from their 'Genesafe' web site. Databases and data sets relating to gene prediction are also available at Rockefeller's comprehensive site for gene prediction.
Article by: Alphonse Thanaraj
Resources and further information
European Bioinformatics Institute, Hinxton, Cambridge, UK http://www.ebi.ac.uk/
Industry Programme http://industry.ebi.ac.uk/
Gene Prediction web pages http://industry.ebi.ac.uk/~thanaraj/gene.html
Literature:
Thanaraj, T.A. (1999), "A clean data set of EST-confirmed human splice sites and standards for cleanup procedures", Nucleic Acids Research, 27, 2627-2637.
Human Genome Mapping Project - Resource Centre http://www.hgmp.mrc.ac.uk/
Genesafe http://www.hgmp.mrc.ac.uk/Genesafe/
Rockefeller University - Genetic Linkage Analysis http://linkage.rockefeller.edu/
Gene Prediction site http://linkage.rockefeller.edu/wli/gene/