CLUSTAL W 1.7
Introduction
CLUSTAL W is a program for multiple sequence alignment which includes several modifications to improve results for the alignment of divergent protein sequences. These improvements are:
- assignment of individual weights to each sequence in a partial alignment in order to downweight near-duplicate sequences and upweight the most divergent ones
- variation of amino acid substitution matrices at different alignment stages according to the divergence of the sequences to be aligned
- residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions, which encourage new gaps in potential loop regions rather than in regular secondary structure
- positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.
New since version 1.6
The static arrays used by clustalw for storing the alignment data have been replaced by dynamically allocated memory. There is now no limit on the number or length of sequences which can be input.
The alignment of DNA sequences now offers a new hard-coded matrix, as well as the identity matrix used previously. The new matrix is the default scoring matrix used by the BESTFIT program of the GCG package for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.0.
The transition weight option for aligning nucleotide sequences has been changed from an on/off toggle to a weight between 0 and 1. A weight of zero means that the transitions are scored as mismatches; a weight of 1 gives transitions the full match score. For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score.
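The effect of the transition weight can be sketched as follows. This is only an illustration, not the clustalw source: the function names are hypothetical, the linear scaling between the mismatch and match scores is an assumption that merely agrees with the two endpoints described above, and the match/mismatch values are borrowed from the DNA matrix paragraph. Only the four unambiguous bases are handled.

```python
# Illustrative sketch only; names and the linear scaling are assumptions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """Return True for a purine<->purine or pyrimidine<->pyrimidine change."""
    a, b = a.upper(), b.upper()
    if a == b:
        return False
    pair = {a, b}
    return pair <= PURINES or pair <= PYRIMIDINES


def pair_score(a, b, transition_weight, match=1.9, mismatch=0.0):
    """Score one nucleotide pair under a transition weight between 0 and 1."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition weight must lie between 0 and 1")
    if a.upper() == b.upper():
        return match
    if is_transition(a, b):
        # weight 0 -> scored as a mismatch; weight 1 -> full match score
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversions are always mismatches
```

With these assumptions, an A/G pair scores as a mismatch at weight 0 and as a full match at weight 1, while an A/C transversion is unaffected by the weight.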
The RSF sequence alignment file format used by GCG Version 9 can now be read.
The clustal sequence alignment file format has been changed to allow sequence names longer than 10 characters. The maximum length allowed is set in clustalw.h by the statement:
    #define MAXNAMES 10
For the fasta format, the name is taken as the first string after the '>' character, stopping at the first white space. (Previously, the first 10 characters were taken, replacing blanks by underscores.)
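The fasta naming rule can be sketched in a few lines. The function name is hypothetical, and skipping any blanks that directly follow the '>' is an assumption; the limit set by MAXNAMES is not applied here.

```python
def fasta_name(header_line):
    """Return the first whitespace-delimited string after '>' (sketch)."""
    if not header_line.startswith(">"):
        raise ValueError("not a fasta header line")
    fields = header_line[1:].split()
    # An empty header ('>' alone) yields an empty name in this sketch.
    return fields[0] if fields else ""
```

For example, a header such as `>sp|P12345|LONG_NAME_HERE some description` would give the name `sp|P12345|LONG_NAME_HERE`.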
The bootstrap values written in the phylip tree file format can be assigned either to branches or nodes. The default is to write the values on the nodes, as this can be read by several commonly-used tree display programs. Note, however, that this can lead to confusion if the tree is rooted, in which case the bootstraps may be better attached to the internal branches. Software developers should ensure that their programs can read the branch label format.
The sequence weighting used during sequence to profile alignments has been changed. The tree weight is now multiplied by the percent identity of the new sequence compared with the most closely related sequence in the profile.
The sequence weighting used during profile to profile alignments has been changed. A guide tree is now built for each profile separately and the sequence weights calculated from the two trees. The weights for each sequence are then multiplied by the percent identity of the sequence compared with the most closely related sequence in the opposite profile.
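The profile to profile reweighting can be sketched as below. This is a hedged illustration rather than the clustalw implementation: the function name and data layout are hypothetical, the percent identities are assumed to be precomputed on a 0-100 scale, and dividing by 100 to turn them into a multiplier is an assumption.

```python
def reweight_profile(tree_weights, other_names, identity):
    """Scale each tree weight by identity to the closest sequence
    in the opposite profile (illustrative sketch).

    tree_weights: dict mapping sequence name -> weight from the guide tree
    other_names:  names of the sequences in the opposite profile
    identity:     dict mapping (name, other_name) -> percent identity (0-100)
    """
    adjusted = {}
    for name, weight in tree_weights.items():
        closest = max(identity[(name, other)] for other in other_names)
        adjusted[name] = weight * closest / 100.0
    return adjusted
```

The same function would be called once per profile, each time passing the names of the sequences in the other profile.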
The adjustment of the Gap Opening and Gap Extension Penalties for sequences of unequal length has been improved.
The default order of the sequences in the output alignment file has been changed. Previously the default was to output the sequences in the same order as the input file. Now the default is to use the order in which the sequences were aligned (from the guide tree/dendrogram), thus automatically grouping closely related sequences.
The option to 'Reset Gaps between alignments' has been switched off by default.
The conservation line output in the clustal format alignment file has been changed. Three characters are now used:
- '*' indicates positions which have a single, fully conserved residue
- ':' indicates that one of the following 'strong' groups is fully conserved: STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW
- '.' indicates that one of the following 'weaker' groups is fully conserved: CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK NEQHRK FVLIM HFY
These are all the positively scoring groups that occur in the Gonnet Pam250 matrix. The strong and weak groups are defined as strong score > 0.5 and weak score <= 0.5 respectively.
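The rules for the conservation line can be sketched as follows. The group lists are copied from the text above; the function names are hypothetical, and treating any column that contains a gap ('-') as unmarked is an assumption, not something stated here.

```python
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (sketch)."""
    residues = set(column.upper())
    if not residues or "-" in residues:
        return " "  # assumption: columns with gaps get no symbol
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall within one weaker group
    return " "


def conservation_line(aligned):
    """Build the conservation line for equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*aligned))
```

For the two aligned sequences `ASCW` and `ATSA`, this sketch marks the first column with '*', the second with ':', the third with '.', and leaves the fourth blank.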
A bug in the modification of the Myers and Miller alignment algorithm for residue-specific gap penalties has been fixed. This occasionally caused new gaps to be opened a few residues away from the optimal position.
The GCG/MSF input format no longer needs the word PILEUP on the first line. Several versions can now be recognised:-
- The word PILEUP as the first word in the file
- The word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT as the first word in the file
- The characters MSF on the first line of the file, and the characters .. at the end of that line.
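The three recognition rules listed above can be sketched as a single check. The function name is hypothetical, and exact-case matching and ignoring trailing blanks before the '..' are assumptions; the actual clustalw parser may differ in such details.

```python
MSF_FIRST_WORDS = ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")


def looks_like_msf(text):
    """Guess whether file contents are in GCG/MSF format (sketch)."""
    words = text.split(None, 1)
    first_word = words[0] if words else ""
    if first_word in MSF_FIRST_WORDS:
        return True
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    # Third rule: 'MSF' on the first line, which must end with '..'
    return "MSF" in first_line and first_line.rstrip().endswith("..")
```

Under these assumptions a file starting with `PILEUP`, or one whose first line reads like ` test.msf  MSF: 120  Type: P  Check: 1234 ..`, would be accepted, while a fasta file would not.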
The standard command line separator for UNIX systems has been changed from '/' to '-', i.e. to give options on the command line, you now type
    clustalw input.aln -gapopen=8.0
instead of
    clustalw input.aln /gapopen=8.0
Availability
Clustalw 1.7 is freely available for anonymous download from the EBI FTP server for Unix, Mac, PC and VMS systems.
Information provided by: Toby Gibson
Resources and further information
European Bioinformatics Institute http://www.ebi.ac.uk/
Download Clustalw 1.7
for Unix: ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/clustalw.17.tar.Z
for Mac: ftp://ftp.ebi.ac.uk/pub/software/mac/clustalw/clustalw.17.sea.hqx
for PC: ftp://ftp.ebi.ac.uk/pub/software/dos/clustalw/clustalw17$.exe
for VMS: ftp://ftp.ebi.ac.uk/pub/software/vms/clustalw/clustalw_17.uue
European Molecular Biology Laboratory http://www.embl-heidelberg.de/
External sites are not endorsed by EMBL-EBI