|
Sequence Space
by Chris Dodge, EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Introduction
Direct visualisation and exploration of biological datasets is one of the many challenges facing bioinformatics. Mathematical techniques can be used to process such datasets and present them in forms more amenable to standard visualisation practices. A disadvantage of such techniques is that obtaining an intuitive feel for the data after processing is difficult and can require an understanding of the mathematics used. In this project we have taken a method used to cluster aligned protein sequences and combined views of the resulting data with more conventional protein data viewers. As all of the viewers interact with one another, it is hoped that an understanding of the processed data can be obtained without the need to fully understand the mathematics, but rather by seeing the correspondence between conventional and mathematical views.
We have taken a method developed by Casari, Sander and Valencia [1] that positions protein sequences from a multiple alignment as points in a high dimensional, generalized "sequence space". An ordination method called Principal Component Analysis is then used to reduce the number of dimensions so that the data can be viewed in two or three dimensions. Ordination, as described by Higgins [2], "is used to take a set of objects, initially arranged in a high-dimensional space and represent these in a small number of dimensions, while preserving the inter-object distances as much as possible".
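As a rough illustration of what "positioning a sequence as a point" can mean, the sketch below assumes a simple 0/1 encoding with one axis per (alignment position, residue type) pair. The function name `encode_alignment`, the toy alignment and the treatment of gaps are assumptions made for this example; the exact encoding used in [1] may differ in detail.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def encode_alignment(alignment):
    """Turn each aligned sequence into a 0/1 vector.

    There is one axis per (position, residue type) pair, so an alignment
    of length L gives points in a 20 * L dimensional space.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    vectors = []
    for seq in alignment:
        vec = [0] * (length * len(AMINO_ACIDS))
        for pos, residue in enumerate(seq.upper()):
            idx = AMINO_ACIDS.find(residue)
            if idx >= 0:  # assumption: gaps and unknown symbols contribute nothing
                vec[pos * len(AMINO_ACIDS) + idx] = 1
        vectors.append(vec)
    return vectors


# Hypothetical toy alignment: three sequences, four columns.
toy_alignment = ["ACD-", "ACDE", "GCDE"]
points = encode_alignment(toy_alignment)
print(len(points), "points in", len(points[0]), "dimensions")
```

Two sequences that share a residue at a given alignment position share a 1 on the same axis, so similar sequences end up close together in this space.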
There are many ways in which we could define sequence space, and the advantage of the method used here is that it is based on protein sequence alone. As the input data is a set of aligned sequences, it is reasonable to assume that they will be clustered somewhere in sequence space. Using ordination to reduce the dimensionality of the data while preserving inter-object distance should yield clusters in a subspace viewable in two or three dimensions.
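The ordination step can be sketched with nothing more than inner products between points. The example below is a minimal, assumption-laden illustration rather than the program described in this article: it finds the leading axes by power iteration on the matrix of inner products, and it leaves the data uncentred (classic Principal Component Analysis would first subtract the mean). All function names and the toy points are hypothetical.

```python
import math


def gram_matrix(rows):
    """Inner products between every pair of points (uncentred)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def dominant_eigenpair(matrix, start, iterations=200):
    """Power iteration on a symmetric positive semi-definite matrix."""
    v = list(start)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * len(v)
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    return sum(a * b for a, b in zip(v, mv)), v


def ordinate(points, k):
    """Coordinates of every point on the k leading axes."""
    n = len(points)
    matrix = gram_matrix(points)
    coords = [[] for _ in range(n)]
    for comp in range(k):
        start = [1.0 / (i + comp + 1) for i in range(n)]  # arbitrary fixed start
        eigenvalue, vec = dominant_eigenpair(matrix, start)
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i].append(scale * vec[i])
        # Deflation: remove the axis just found so the next pass finds the next one.
        matrix = [[matrix[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
                  for i in range(n)]
    return coords


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical toy data: four points in seven dimensions, two "subfamilies".
toy_points = [
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0, 1],
]
coords = ordinate(toy_points, 3)
print(distance(toy_points[0], toy_points[2]), distance(coords[0], coords[2]))
```

In this toy case three axes are enough to reproduce every pairwise distance, and the second axis separates the first two points from the last two. With real alignments, keeping only a few axes gives an approximation of the distances rather than an exact copy.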
A further advantage of this method is that we can project each residue into the subspace, and so directly determine which residues are responsible for the protein family under analysis and also the clusters (subfamilies) within the family. Additionally, the conservation of residues within the whole family and the subfamilies can be clearly identified. Conserved residues are often of functional significance, so this technique can be used to predict functional residues including those which may modulate family function (i.e. be responsible for a subfamily) without prior biological knowledge of the protein family.
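One common way to place residues in the same subspace as the proteins is to project each column of the 0/1 encoding onto the protein axes. The sketch below assumes that approach; it is not necessarily the exact projection used in [1]. The residue labels, the toy encoding and the two hard-coded axes (what an uncentred ordination yields for this particular toy matrix) are all hypothetical.

```python
def project_residues(encoded, protein_axes):
    """Project each residue column onto the protein axes.

    encoded: one 0/1 row per protein, one column per residue.
    protein_axes: unit-length axes over the proteins, one list per axis.
    Returns one coordinate list per residue column.
    """
    n_columns = len(encoded[0])
    return [
        [sum(row[col] * axis[i] for i, row in enumerate(encoded))
         for axis in protein_axes]
        for col in range(n_columns)
    ]


# Hypothetical residues: two per subfamily, plus G7 present in every protein.
labels = ["L12", "K40", "V12", "E40", "G7"]
encoded = [
    [1, 1, 0, 0, 1],  # subfamily 1
    [1, 1, 0, 0, 1],  # subfamily 1
    [0, 0, 1, 1, 1],  # subfamily 2
    [0, 0, 1, 1, 1],  # subfamily 2
]
# Assumed output of an ordination step for this toy matrix.
protein_axes = [
    [0.5, 0.5, 0.5, 0.5],    # axis 1: shared by the whole family
    [0.5, 0.5, -0.5, -0.5],  # axis 2: separates the two subfamilies
]
residue_coords = dict(zip(labels, project_residues(encoded, protein_axes)))
for label, point in residue_coords.items():
    print(label, point)
```

In this toy case the residue present in every protein (G7) lies furthest from the origin along the first axis, while the subfamily-specific residues split in opposite directions along the second axis, mirroring the split between the proteins that carry them.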
Implementation
There are two main parts to the code:
- A program that computes the protein and residue coordinates according to the algorithm given in [1]. This has been written in C++ and runs on most types of computer. Input is a multiple alignment in one of the following forms: HSSP [3], MSF [4] or a plain text file. This program has also been wrapped to create a CORBA server that runs at EBI, so that the viewers (see below) can load calculated data from HSSP files directly over the network.
- A set of Java viewers for interactive exploration of the sequence space results. As the program is written in Java, it runs on most platforms (albeit with minor variations in appearance). There are seven types of viewer, which interact with one another so that a selection made in one viewer is reflected across all viewers with related data:
  - 2D protein coordinate viewer.
  - 3D protein coordinate viewer.
  - 2D residue coordinate viewer.
  - 3D residue coordinate viewer.
  - A standard table view.
  - A multiple alignment.
  - A 3D protein structure viewer.
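The way a selection in one viewer is reflected in the others can be pictured as a shared selection model that notifies every registered viewer. The sketch below is a generic illustration of that idea, written in Python for brevity; the class names `SelectionModel` and `Viewer` are hypothetical and are not taken from the Java viewers themselves.

```python
class SelectionModel:
    """Holds the current selection and tells every listener when it changes."""

    def __init__(self):
        self._selected = frozenset()
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def select(self, protein_ids):
        self._selected = frozenset(protein_ids)
        for callback in self._listeners:
            callback(self._selected)


class Viewer:
    """Stand-in for any of the views; it only records what to highlight."""

    def __init__(self, name, model):
        self.name = name
        self.highlighted = frozenset()
        self._model = model
        model.add_listener(self._on_selection)

    def _on_selection(self, selected):
        self.highlighted = selected

    def user_selects(self, protein_ids):
        # A selection made in this viewer goes through the shared model,
        # so every other viewer hears about it too.
        self._model.select(protein_ids)


model = SelectionModel()
viewers = [Viewer(name, model) for name in ("2D protein", "table", "alignment")]
viewers[0].user_selects({"P1", "P3"})  # hypothetical protein identifiers
for viewer in viewers:
    print(viewer.name, sorted(viewer.highlighted))
```

Routing every selection through one shared object means the viewers never need to know about each other, which makes it easy to add or remove a view.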
2D Protein View
This view shows two dimensions of the resulting N dimensions following ordination of sequence space. Usually, sequence space is reduced to six dimensions, which can be selected and viewed here. Each point on the plot represents one protein of the original multiple alignment. In this plot, three main clusters can be seen along three directions moving out from the origin (the blue cross). Selected proteins are shown in red, and these points are then highlighted in all views.
3D Protein View
This view shows three dimensions of the resulting N dimensions following ordination of sequence space. The viewer allows free or continuous rotation of the 3D space, which gives an improved view of the subfamily clustering. Each cross represents one protein from the original multiple alignment. The selected proteins from the 2D view are also highlighted here.
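Continuous rotation of a 3D point cloud can be sketched as a rotation about one axis followed by a projection onto the screen plane. The example below is a generic illustration that assumes rotation about the vertical axis and a simple orthographic projection; it is not taken from the viewer's code, and the function names are hypothetical.

```python
import math


def rotate_about_y(point, angle):
    """Rotate a 3D point about the vertical (y) axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def to_screen(point):
    """Orthographic projection: keep x and y, drop the depth coordinate."""
    x, y, _ = point
    return (x, y)


# One hypothetical protein point, drawn at twelve successive 30 degree steps.
frames = [
    to_screen(rotate_about_y((1.0, 0.5, 0.0), math.radians(step * 30)))
    for step in range(12)
]
for frame in frames:
    print(frame)
```

Applying the same rotation to every point in each frame keeps the clusters rigid while the viewpoint changes, which is what makes clusters that overlap in one 2D projection separable in another.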
2D Residue View
Each point on this plot represents a residue that has been projected into the same subspace as the 2D protein view. A correspondence between the 2D protein and 2D residue space can be seen in that the protein points and residue points are more concentrated along three directions from the origin (blue cross). Thus we can determine which residues are responsible for the clusters seen in the protein view. Additionally, the further a residue is from the origin, the more conserved it is.
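Reading the residue plot against a protein cluster amounts to two questions: which residues point in the same direction as the cluster, and how far from the origin are they? The sketch below assumes a simple cosine threshold for the first question and ranks by distance from the origin for the second. The residue labels, coordinates, threshold and function name are all hypothetical.

```python
import math


def residues_along(direction, residue_coords, min_cosine=0.9):
    """Residues pointing the same way as `direction`, farthest from the origin first."""
    direction_length = math.hypot(direction[0], direction[1])
    hits = []
    for label, (x, y) in residue_coords.items():
        radius = math.hypot(x, y)
        if radius == 0 or direction_length == 0:
            continue  # a point at the origin has no direction
        cosine = (x * direction[0] + y * direction[1]) / (radius * direction_length)
        if cosine >= min_cosine:
            hits.append((label, radius))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical 2D residue coordinates.
residue_points = {
    "L12": (2.0, 0.1),
    "K40": (1.0, 0.0),
    "V12": (-1.5, 1.5),
    "S77": (0.1, 0.2),
}
# Assumed direction of a protein cluster, e.g. the mean of the selected protein points.
cluster_direction = (1.0, 0.0)
ranked = residues_along(cluster_direction, residue_points)
for label, radius in ranked:
    print(label, round(radius, 3))
```

Residues at the top of such a ranking are the natural candidates for the subfamily-specific, conserved positions discussed above.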
3D Residue View
This view shows the residue data in three dimensions.
Table View
This simply shows each protein in the original data set with associated data. Selected proteins are highlighted in blue.
Multiple Alignment View
The full multiple alignment is shown here, with selected proteins in blue and selected residues in red. This view is useful for looking at the correspondence between selections made in the 2D or 3D protein views and similar selections made in the residue views.
In the accompanying image, the highlighted proteins were chosen by selecting a small cluster in the 2D protein view, and the highlighted residues were chosen from the corresponding area in the 2D residue view. There is a good correspondence between the selected proteins and the proteins in which the selected residues appear. If we had selected the most conserved residues in this cluster (i.e. those at the extremity of the 2D residue view), then we could perhaps predict that they are somehow involved in the specific function of that cluster.
Protein 3D Structure View
This viewer was written by Dirk Walther, and is otherwise known as WebMol. If structure data is available for the protein against which other sequences are aligned (as with HSSP files), then the protein is shown in this view. If residues are selected in other views, then they are highlighted in this view with thicker bonds.
Code Availability and Downloading
All executable code and the majority of the source code are freely available for academic and commercial use. The code, as well as a more detailed explanation and a user guide, can be found on the web at http://industry.ebi.ac.uk/SeqSpace/.
Acknowledgements
This project was undertaken as part of the BioStandards Program of the Industry Programme at the EBI. Thanks go to Georg Casari, Chris Sander and Alfonso Valencia for the original work on the algorithm used here, details of which are found in [1]; again to Chris Sander for suggesting and overseeing the work done during this project; and to Dirk Walther for kindly allowing me to use his Java 3D protein structure viewer (WebMol).
Article by: Chris Dodge
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
Sequence Space Viewer http://industry.ebi.ac.uk/SeqSpace/
Chris Dodge (formerly EMBL-EBI, now Synomics) chris@dodgies.demon.co.uk
Industry Programme http://industry.ebi.ac.uk/
Stanford University Genomic Resources web server http://genome-www.stanford.edu/
WebMol homepage http://genome-www.stanford.edu/structure/webmol/
Dirk Walther walther@cmpharm.ucsf.edu
Literature:
- [1] A method to predict functional residues in proteins. Georg Casari, Chris Sander and Alfonso Valencia. Nature Structural Biology, Vol. 2, no. 2, February 1995.
- [2] Sequence Ordinations: a multivariate analysis approach to analysing large data sets. Desmond Higgins. CABIOS Vol. 8, no. 1, 1992.
- [3] The HSSP database of protein structure-sequence alignments and family profiles. Chris Dodge, Reinhard Schneider and Chris Sander. Nucleic Acids Research, Vol. 26. No. 1, 1998.
- [4] A comprehensive set of sequence analysis programs for the VAX. J. Devereux, P. Haeberli and O. Smithies. Nucleic Acids Research, Vol. 12, 387-395, 1984.
External sites are not endorsed by EMBL-EBI |