Phylogenetic tree


A very good review/tutorial about phylogeny is:

  • Sandra L. Baldauf, "Phylogeny for the faint of heart: a tutorial", TRENDS in Genetics Vol.19 No.6, June 2003, 345-351.




Constructing a phylogenetic tree

This brief instruction will help you to construct a phylogenetic tree from your multiple sequence alignment using the phylip software. You can also compute neighbour joining trees within clustalx, which is a simple and quick way to explore how well your data is suited for phylogenetic analysis. The process is described here. However, phylip is superior for tree construction and should be used for your final tree. Check that your alignment is suitable, since only a reliable multiple sequence alignment will give a meaningful phylogeny. As with clustalx, all the phylip programs are on the DVD.


Phylip has a fairly simplistic text interface: it shows a list of options, each with a single-letter code (at the left). Type the letter corresponding to the option you want to change. Some options cycle through their values (for example, if letter 'a' sets option arbitrary to 1, 2, or 3, and it's set to 1, pressing 'a' will set it to 2, pressing 'a' again will set it to 3, and pressing 'a' once more sets it back to 1). Other options come up with specific instructions and you have to type in the number or letter you want (for example, if letter 'j' sets option 'jumble' you will be asked to type in an odd random number).

Also note that each phylip program expects an input file called infile and creates an output file called outfile. When you run a phylip program, it will say it can't find infile, but you can just type in the name of the actual file you are using (eg target_align.phy). Unfortunately, when you then run your next program, outfile is overwritten with the new results. So once a phylip program has run, immediately rename outfile to something useful before running the next phylip program (and before you forget which program the outfile came from).
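The renaming step is easy to forget, so it can help to script it. The following is a minimal Python sketch and not part of phylip: the function name keep_output and its arguments are made up for illustration, and it assumes the phylip program wrote its outfile into the directory you pass in.

```python
from pathlib import Path


def keep_output(new_name, workdir=".", default_name="outfile"):
    """Rename a phylip default output file so the next program cannot overwrite it.

    keep_output is a hypothetical helper, not something phylip provides.
    """
    src = Path(workdir) / default_name
    dst = Path(workdir) / new_name
    if not src.exists():
        raise FileNotFoundError(f"{src} not found: has the program run yet?")
    if dst.exists():
        # Refuse to clobber an earlier result with the same name.
        raise FileExistsError(f"{dst} already exists; choose another name")
    src.rename(dst)
    return dst
```

For example, after running protdist you might call keep_output("target_align.dis"), and after neighbor keep_output("target_align.ph", default_name="outtree").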



Step 1: Distance matrix calculation.

The first step in constructing a phylogenetic tree is to calculate a distance matrix for your multiple sequence alignment. As input to phylip you need the phylip-format output file from your clustal multiple sequence alignment, with the .phy extension. (Use the save as option in clustalx and select the phylip format. The file should start with a line with two numbers: the number of sequences and the length of the alignment. The length might be a bit longer than your target sequence if gaps have been inserted anywhere. Then there's a block of the first 60 characters in the alignment with the sequence names at the front, followed by the remaining blocks of the alignment without any names.) It's probably easiest if you create a new directory (eg Phylogeny), copy the phylip format alignment into it, then move into that directory and do the rest of the calculations there. Since you're working with protein sequences, use the "protdist" program. For protdist, use the distance method ('P') based on PAM.
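If you want to check the layout of your .phy file before feeding it to protdist, a short script can read the header line and the names in the first block. This is only a sketch: the function names are invented for illustration, the three toy sequences are made up, and it assumes names occupy the first 10 characters of each line in the first block.

```python
def read_phylip_header(text):
    """Return (number of sequences, alignment length) from the first line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty file")
    fields = lines[0].split()
    if len(fields) < 2:
        raise ValueError("first line should hold two numbers")
    return int(fields[0]), int(fields[1])


def first_block_names(text, name_width=10):
    """Return the sequence names found at the front of the first block.

    name_width=10 is an assumption about how the names are padded.
    """
    n_sequences, _ = read_phylip_header(text)
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [ln[:name_width].strip() for ln in lines[1:1 + n_sequences]]


# Toy alignment, invented for this example only.
example = """ 3 12
human     MKTAYIAKQR QI
gorilla   MKTAYIAKQR QI
pig       MKSAYLAKQR -I
"""
print(read_phylip_header(example))   # (3, 12)
print(first_block_names(example))    # ['human', 'gorilla', 'pig']
```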


Once the settings are correct, type "y" to run the distance program. Rename "outfile" (eg target_align.dis). Look at the distance matrix produced (with notepad or similar). First, try to determine if any distances are equal to 0. These represent sequences that are not different, based on this distance method. Note that the sequences might look different by eye, for example they may be different lengths, or have indels. However, there is currently no good way to incorporate indels into distance measurements.
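For larger matrices, looking for zero distances by eye gets tedious. The sketch below reads a square distance matrix and lists the pairs with distance 0. It assumes the file starts with the number of sequences, followed by one name and that many distances per sequence (possibly wrapped over several lines), and that names contain no spaces. The function names and the toy numbers are invented for illustration.

```python
def parse_distance_matrix(text):
    """Parse a square distance matrix into (names, rows).

    Assumes whitespace-separated fields and names without internal spaces.
    """
    tokens = text.split()
    n = int(tokens[0])
    names, rows = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        rows.append([float(t) for t in tokens[pos + 1:pos + 1 + n]])
        pos += n + 1
    return names, rows


def zero_distance_pairs(names, rows):
    """Return the pairs of sequence names whose distance is exactly 0."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if rows[i][j] == 0.0:
                pairs.append((names[i], names[j]))
    return pairs


# Toy matrix, invented for this example only.
example = """    3
human       0.000000  0.000000  0.152300
gorilla     0.000000  0.000000  0.152300
pig         0.152300  0.152300  0.000000
"""
names, rows = parse_distance_matrix(example)
print(zero_distance_pairs(names, rows))   # [('human', 'gorilla')]
```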


Step 2: Tree construction using the neighbour joining method.

Calculate the neighbor-joining tree from the distance matrix from Step 1 (use the PAM distance matrix). Run the phylip program 'neighbor' (note the American spelling) and enter the name of the distance file when asked (eg target_align.dis). Set the option 'J' to randomise the input order of species, and type 'Y' to run. Rename the outfile (eg target_align.nei), and also rename the treefile or outtree (eg to target_align.ph). Look at these two files and compare the information they contain.
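To get a feel for what neighbor joining does with the distance matrix, here is a sketch of the pair-selection step only, using the usual textbook criterion Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k). It is not phylip's code, the function name is invented, and the distances are a made-up toy example. Note that in this sketch ties are broken by input order, which gives some intuition for why randomising the input order can matter.

```python
def nj_pick_pair(d):
    """Return (Q, i, j) for the pair that neighbor joining would merge first.

    d is a symmetric list-of-lists distance matrix with zeros on the diagonal.
    Ties go to the pair that comes first in input order.
    """
    n = len(d)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best


# Toy distances for five sequences, invented for this example only.
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(toy))   # (-50, 0, 1)
```

Notice that the pair merged first (sequences 0 and 1, distance 5) is not the pair with the smallest raw distance (sequences 3 and 4, distance 3): the criterion also accounts for how far each sequence is from all the others.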

Critically inspect your tree by viewing the outtree file (eg target_align.ph) with the treeview program and, if necessary, remove certain sequences and redo the alignment and the phylogenetic tree. Once you are satisfied with your tree, you should calculate the confidence in your tree by determining bootstrap values.


Step 3: Bootstrapping the data.

In phylip, this is a multistep process, and some of the intermediate files are quite large - you will want to delete them after you have finished the analysis.

Step A:

Take your original input alignment (extension .phy) and run it through 'seqboot' to produce 100 bootstrap samples. Rename the outfile as usual (eg target_boot.aln). Look at the file, comparing the pseudoreplicates to your original alignment.
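When comparing the pseudoreplicates to your original alignment, it helps to know how a bootstrap sample is made: alignment columns are drawn at random with replacement until the sample is as long as the original. The sketch below illustrates that idea on a toy alignment. It is not seqboot's code, and the function name and sequences are invented.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Return one pseudoreplicate: columns sampled with replacement.

    seqs maps sequence name -> aligned sequence (all the same length).
    rng is a random.Random instance, so results can be reproduced.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    # The same column indices are used for every sequence, so columns stay intact.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in seqs.items()}


# Toy alignment, invented for this example only.
toy = {"human": "MKTAYI", "gorilla": "MKTAYI", "pig": "MKSAYL"}
replicate = bootstrap_alignment(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Each column in the pseudoreplicate is a complete column of the original alignment; some original columns appear several times and others not at all.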

Step B:

Make the bootstrap distance matrices. Run protdist with the same settings as before and also choose the option 'M' for multiple, then 'D' for multiple data sets and enter '100' (i.e. 100 replicates). Rename the outfile as before (eg target_boot.dis). Again, have a quick look at the distance matrices in this file, and compare them to the original distance matrix.

Step C:

Run 'neighbor' with the same settings as Step 2, and again, choose option "M" for multiple with 100 data sets. Rename the treefile (target_boot.ph). Note that the outfile from bootstrapping isn't all that useful unless the program is actually crashing, as it is the treefile that is used for the next step.

Step D:

Run 'consense' on the bootstrap treefile you've just created. Rename outfile and treefile as usual. Note that from consense it is now the outfile that you will need. The treefile from consense isn't actually very useful as it doesn't have branch lengths; instead, the 'branch lengths' are the bootstrap support values. This is probably the major shortcoming of phylip currently. Compare the topology of the tree produced by bootstrapping to the original tree (Step 2).

In your final results, highlight any branches that are different from the previous tree, and any branches in your original tree that have low bootstrap values (<75%). Do the bootstrap values you get match your expectations? For example, if you had an unexpected branching pattern in Step 2, does it have low or high bootstrap support?
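Picking out the low-support groups from the consense outfile can be scripted. The sketch below assumes that each group is reported on a line made of dots and stars followed by a number, and that you used 100 replicates so the number can be read as a percentage. The function names and the sample text are invented for illustration, and your outfile may be laid out differently.

```python
import re

# A run of '.' and '*' (possibly split into blocks by spaces), then a number.
SET_LINE = re.compile(r"^\s*([.*][.* ]*?)\s+(\d+(?:\.\d+)?)\s*$")


def parse_sets(outfile_text):
    """Return a list of (pattern, support) for every group line found."""
    sets = []
    for line in outfile_text.splitlines():
        m = SET_LINE.match(line)
        if m:
            pattern = m.group(1).replace(" ", "")
            sets.append((pattern, float(m.group(2))))
    return sets


def low_support(sets, threshold=75.0):
    """Return the groups whose support is below the threshold."""
    return [(p, s) for p, s in sets if s < threshold]


# Sample text, invented for this example only.
example = """Sets included in the consensus tree

**......    98.00
..**....    61.00
"""
print(low_support(parse_sets(example)))   # [('..**....', 61.0)]
```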


Step 4: Construct consensus tree with bootstrap values.

Copy the newick tree from Step 2 into a new file. You are now going to add the bootstrap values you found as "labels" to this tree, keeping the original branch lengths found by neighbor-joining in Step 2. You should use the consense outfile as it has reports on all branches, not just those included in the consensus tree (consense treefile). For example, if your neighbor joining tree has the grouping (human:0.0012, gorilla:0.0011):0.0234, and in the consense output it reports

**...... 98.0

(where the first two species are human and gorilla), then edit the New Hampshire/Newick tree format to

(human:0.0012, gorilla:0.0011)"98":0.0234

In other words, put the bootstrap value in quotes just after the closing brackets of the group that had that bootstrap value. The value after the colon is the branch length - leave that as it is. Note that you may have to read some groups 'backwards', for example if human was the first species listed and gorilla the last, the grouping in the consense outfile will probably look like

.******. 98.0 rather than

*......* 98.0 as you might have expected.

These two are in fact equivalent descriptions of the clustering of the first and last sequence. If you found that any of the branches in your neighbor joining tree didn't appear in the consensus tree (because another branching pattern got a higher bootstrap score), add in the value anyway, but add a star to the label:

(pig:0.052, (sheep:0.017, cow:0.015)"87":0.32)"*34":0.000012

(so the branch joining sheep and cow was in both the consensus tree and the original tree, while the branch joining pig to that group was only in the original neighbor joining tree, and not in the consensus tree). Use Treeview to display and print the tree with bootstrap value labels. For 'special effects' you can use any drawing program to edit your final tree.
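The two fiddly parts of this step are recognising that a pattern and its complement describe the same grouping, and writing the label in the right place. The sketch below shows both. The function names are invented, and the branch length is passed as text so it is copied unchanged. It only builds the labelled text for one group; you still paste the result into your tree by hand.

```python
def complement(pattern):
    """Swap '.' and '*' in a consense group pattern."""
    return "".join("." if c == "*" else "*" for c in pattern)


def canonical(pattern):
    """Return the form of the pattern in which the first species is starred.

    A pattern and its complement describe the same branch, so both map
    to the same canonical form.
    """
    return pattern if pattern[0] == "*" else complement(pattern)


def labelled_group(group, support, branch_length, in_consensus=True):
    """Attach a quoted bootstrap label between a group and its branch length.

    A star is added to the label when the branch is not in the consensus tree.
    """
    star = "" if in_consensus else "*"
    return f'{group}"{star}{support:.0f}":{branch_length}'


print(canonical(".******.") == canonical("*......*"))   # True
print(labelled_group("(human:0.0012, gorilla:0.0011)", 98.0, "0.0234"))
# (human:0.0012, gorilla:0.0011)"98":0.0234
```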



(These instructions are adapted from a tutorial written by Ingrid Jakobsen for the course MATH2210.) --ThomasHuber 08:29, 27 April 2007 (EST)