Title: | R We There Yet? Visualizing MCMC Convergence in Phylogenetics |
---|---|
Description: | Implements various tests, visualizations, and metrics for diagnosing convergence of MCMC chains in phylogenetics. It implements and automates many of the functions of the AWTY package in the R environment, as well as a host of other functions. Warren, Geneva, and Lanfear (2017), <doi:10.1093/molbev/msw279>. |
Authors: | Dan Warren [aut, cre], Anthony Geneva [aut], Rob Lanfear [aut], Luke Harmon [ctb], April Wright [ctb], Diego Mallo [ctb], Klaus Schliep [ctb], Kelly Luke [ctb], Kendall Michelle [ctb], Smith Martin [ctb] |
Maintainer: | Dan Warren <[email protected]> |
License: | GPL-2 |
Version: | 1.0.2 |
Built: | 2024-11-20 04:55:18 UTC |
Source: | https://github.com/danlwarren/rwty |
This is the main user interface to rwty. It allows users to conduct simple visualizations of MCMC chain performance with very few arguments.
analyze.rwty(chains, burnin = 0, window.size = 20, treespace.points = 100, n.clades = 20, min.freq = 0, fill.color = NA, filename = NA, overwrite = FALSE, facet = TRUE, free_y = FALSE, autocorr.intervals = 100, ess.reps = 20, treedist = "PD", params = NA, max.sampling.interval = NA, ...)
analyze.rwty(chains, burnin = 0, window.size = 20, treespace.points = 100, n.clades = 20, min.freq = 0, fill.color = NA, filename = NA, overwrite = FALSE, facet = TRUE, free_y = FALSE, autocorr.intervals = 100, ess.reps = 20, treedist = "PD", params = NA, max.sampling.interval = NA, ...)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin. Default value is zero. |
window.size |
The number of trees to include in each windows of sliding window plots |
treespace.points |
The number of trees to plot in the treespace plot. Default is 100 |
n.clades |
The number of clades to include in plots of split frequencies over the course of the MCMC |
min.freq |
The minimum frequency for a node to be used for calculating ASDSF. Default is 0.1 |
fill.color |
The name of a column in your log file that you would like to use as the fill colour of points in the treespace plots |
filename |
Name of an output file (e.g., "output.pdf"). If none is supplied, rwty will not save outputs to file. |
overwrite |
Boolean variable saying whether output file should be overwritten, if it exists. |
facet |
A Boolean expression indicating whether multiple chains should be plotted as facet plots (default TRUE). |
free_y |
TRUE/FALSE to turn free y scales on the facetted plots on or off (default FALSE). Only works if facet = TRUE. |
autocorr.intervals |
The maximum number of intervals to use for autocorrelation plots. |
ess.reps |
The number of replicate analyses to do when calculating the pseudo ESS. |
treedist |
the type of tree distance metric to use, can be 'PD' for path distance or 'RF' for Robinson Foulds distance. |
params |
A vector of parameters to use when making the parameter correlation plots. Defaults to the first two columns in the log table. |
max.sampling.interval |
The maximum sampling interval to use for generating autocorrelation plots |
... |
Extra arguments to be passed to plotting and analysis functions. |
output The output is a list containing the following plots:
Plots of likelihood, model parameters, and tree topologies as a function of chain length (the first two only when output from MCMC parameters has been loaded along with the tree topologies).
Plot of autocorrelation of tree topolgies at different sampling intervals along a chain
Plot of split frequencies calculated in sliding windows for the most variable clades
Plot of change in split frequencies between sliding windows for all clades
Plot of cumulative split frequencies as the MCMC progresses
Plot of change in cumulative split frequencies as the MCMC progresses
Heatmap and point depictions of chains in treespace.
Plot of the Average Standard Deviation of Split Frequencies (ASDSF) between chains as the MCMC progresses
Plot of pairwise correlations between split frequencies among chains
Plot of chains clustered by their pairwise ASDSF values
## Not run: data(fungus) p <- analyze.rwty(fungus, burnin = 50, window.num = 50) p ## End(Not run)
## Not run: data(fungus) p <- analyze.rwty(fungus, burnin = 50, window.num = 50) p ## End(Not run)
This function is automatically called by many other functions, but can be run manually as well. It performs a number of tests of chain format, labels, lengths, etc.
check.chains(chains)
check.chains(chains)
chains |
A list of rwty.chain objects. |
chains A list of rwty.chain objects
## Not run: data(fungus) check.chains(fungus) ## End(Not run)
## Not run: data(fungus) check.chains(fungus) ## End(Not run)
Uses ape functionality to get the frequencies and names of clades in an MCMC chain or subset thereof.
clade.freq(x, start, end, rooted = FALSE, ...)
clade.freq(x, start, end, rooted = FALSE, ...)
x |
A multiPhylo or rwty.chain object |
start |
The index of the first tree to consider in calcuating frequencies |
end |
The index of the last tree to consider in calculating frequencies |
rooted |
(TRUE/FALSE). Tells RWTY whether your trees are rooted or not. |
... |
Arguments to be passed to ape's prop.part function |
clade.df A data froma containing clade names and frequencies
## Not run: data(fungus) clade.freq(fungus$Fungus.Run1, start=10, end=100) ## End(Not run)
## Not run: data(fungus) clade.freq(fungus$Fungus.Run1, start=10, end=100) ## End(Not run)
This function is automatically called by some of the plot functions.
combine.ptables(chains, burnin)
combine.ptables(chains, burnin)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
ptable A data frame of likelihood values and model parameters for the supplied rwty.chain objects
## Not run: data(fungus) combine.ptables(fungus, burnin=20) ## End(Not run)
## Not run: data(fungus) combine.ptables(fungus, burnin=20) ## End(Not run)
This function calculates the cumulative mean split frequencies of clades as an MCMC progresses.
cumulative.freq(chains, burnin = 0, window.size = 20)
cumulative.freq(chains, burnin = 0, window.size = 20)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin. Defaults to zero. |
window.size |
The number of trees to include in each window (note, specified as a number of sampled trees, not a number of generations) |
A list of rwty.cumulative objects, one per chain in the input list of chains. Each rwty.cumulative object contains the cumulative mean split frequencies of clades at sp windows, and a translation table that converts clade groupings to factors.
## Not run: data(fungus) cumulative.data <- cumulative.freq(fungus, burnin=20) ## End(Not run)
## Not run: data(fungus) cumulative.data <- cumulative.freq(fungus, burnin=20) ## End(Not run)
This function uses an exponential semivariogram model to estimate the asymptotic topological distance, and uses that to estimate the sampling interval at which topological distances have reached some fixed proportion of that value (default 0.95). It expects as input a data table output by rwty's topological.autocorr function
estimate.autocorr.m(dat, ac.cutoff = 0.95)
estimate.autocorr.m(dat, ac.cutoff = 0.95)
dat |
A data frame output from topological.autocorr. |
ac.cutoff |
Default 0.95. The proportion of the asymptotic topological distance to use as a cutoff for determining sampling interval. For example, if ac.cutoff = 0.9, then the minimum sampling interval returned is the one that guarantees a topological distance at least 0.9 times the asymptotic value. |
A data frame consisting of the value matching the ac.cutoff proportion of the asymptotic topological distance for each chain. This sampling interval estimates the interval at which topological distances are no longer autocorrelated. If the value is larger than the largest sampling distance, the table records this as a value of -1
data(fungus) ## Not run: # To get a good estimate we need all sampling intervals autocorr.intervals = as.integer(length(fungus[[1]]$trees)/21) sampling.table <- topological.autocorr(fungus, burnin = 20, autocorr.intervals = autocorr.intervals) estimate.autocorr.m(sampling.table) ## End(Not run)
data(fungus) ## Not run: # To get a good estimate we need all sampling intervals autocorr.intervals = as.integer(length(fungus[[1]]$trees)/21) sampling.table <- topological.autocorr(fungus, burnin = 20, autocorr.intervals = autocorr.intervals) estimate.autocorr.m(sampling.table) ## End(Not run)
This is the output from a MrBayes run of 10,000,000 generations using the analysis settings from the original .nex file. Sampling is one tree per 40,000 generations. Four chains are included, each with its associated log file.
data(fungus)
data(fungus)
A data frame with four chains of 251 phylogenetic trees and associated likelihood and parameter values.
Study reference: Hibbett D., Pine E., Langer E., Langer G., & Donoghue M. 1997. Evolution of gilled mushrooms and puffballs inferred from ribosomal DNA sequences. Proceedings of the National Academy of Sciences of the United States of America, 94(22): 12002-12006.
http://treebase.org/treebase-web/search/study/summary.html?id=271
Finds trees and log files based on format definition, returns rwty.chain objects containing both
load.multi(path = ".", format = "mb", labels = NA, ...)
load.multi(path = ".", format = "mb", labels = NA, ...)
path |
The path to the directory containing tree and log files |
format |
File format, which is used to find tree and log files. Currently accepted values are "mb" for MrBayes, "beast" for BEAST, "*beast" for *BEAST, and "revbayes" for RevBayes. If you would like RWTY to understand additional formats, please contact the authors and send us some sample data. |
labels |
A vector of names to assign to chains as they are read in. |
... |
Further arguments to be passed to load.trees. |
output A list of rwty.chain objects containing the multiPhylos and the tables of values from the log files if available.
#load.multi(path = "~/my trees/", format = "*beast")
#load.multi(path = "~/my trees/", format = "*beast")
Loads trees, looks for a log file of tree likelihoods and parameter values, returns an rwty.chain object containing both
load.trees(file, type = NA, format = "mb", gens.per.tree = NA, trim = 1, logfile = NA, skip = NA)
load.trees(file, type = NA, format = "mb", gens.per.tree = NA, trim = 1, logfile = NA, skip = NA)
file |
A path to a tree file containing an MCMC chain of trees |
type |
An argument that designates the type of tree file. If "nexus",
trees are loaded using ape's |
format |
File format, which is used to find tree and log files. Currently accepted values are "mb" for MrBayes, "beast" for BEAST, "*beast" for *BEAST, and "revbayes" for RevBayes. If you would like RWTY to understand additional formats, please contact the authors and send us some sample data. |
gens.per.tree |
The number of generations separating trees. If not provided, RWTY will attempt to calculate it automatically. |
trim |
Used for thinning the chain. If a number N is provided, RWTY keeps every Nth tree. |
logfile |
A path to a file containing model parameters and likelihoods. If no path is provided but a "format" argument is supplied, RWTY will attempt to find the log file automatically based on the format definition. |
skip |
The number of lines that must be skipped to get to the header of the log file. MrBayes, for instance, prints a comment line at the top of the log file, so MrBayes files should be read in with a skip value of 1. If no "skip" value is provided but a "format" is supplied, RWTY will attempt to read logs using the skip value from the format definition. |
output An rwty.chain object containing the multiPhylo and the table of values from the log file if available.
#load.trees(file="mytrees.t", format = "mb")
#load.trees(file="mytrees.t", format = "mb")
This function takes one or more rwty.chain ojects and returns a plot of CSF within each chain as the MCMC progresses. The solid line with points shows the Average Change in Split Frequencies (ACSF; it is average across the changes in split frequencies from all clades in the analysis) between this window and the previous window The grey ribbon shows the upper and lower 95
makeplot.acsf.cumulative(chains, burnin = 0, window.size = 20, facet = TRUE)
makeplot.acsf.cumulative(chains, burnin = 0, window.size = 20, facet = TRUE)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin. Defaults to zero. |
window.size |
The number of trees to include in each window (note, specified as a number of sampled trees, not a number of generations) |
facet |
(TRUE/FALSE). TRUE: return a single plot with one facet per chain; FALSE: return a list of individual plots with one plot per chain |
output A plof of the CSF between sliding windows over all chains
acsf.plot A ggplot object, or list of ggplot objects
## Not run: data(fungus) makeplot.acsf.cumulative(fungus, burnin=20) ## End(Not run)
## Not run: data(fungus) makeplot.acsf.cumulative(fungus, burnin=20) ## End(Not run)
This function takes one or more rwty.chain ojects and returns a plot of CSF within each chain as the MCMC progresses. The solid line with points shows the Average Change in Split Frequencies (ACSF) between this window and the previous window The grey ribbon shows the upper and lower 95
makeplot.acsf.sliding(chains, burnin = 0, window.size = 20, facet = TRUE)
makeplot.acsf.sliding(chains, burnin = 0, window.size = 20, facet = TRUE)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin. Defaults to zero. |
window.size |
The number of trees to include in each window (note, specified as a number of sampled trees, not a number of generations) |
facet |
(TRUE/FALSE). TRUE: return a single plot with one facet per chain; FALSE: return a list of individual plots with one plot per chain |
output A plof of the CSF between sliding windows over all chains
acsf.plot A ggplot object, or list of ggplot objects
## Not run: data(fungus) makeplot.acsf.sliding(fungus, burnin=20) ## End(Not run)
## Not run: data(fungus) makeplot.acsf.sliding(fungus, burnin=20) ## End(Not run)
Plots all parameter values, including tree topologies (see makeplot.topology) over the length of the MCMC chain
makeplot.all.params(chains, burnin = 0, facet = TRUE, free_y = FALSE, strip = 1)
makeplot.all.params(chains, burnin = 0, facet = TRUE, free_y = FALSE, strip = 1)
chains |
A set of rwty.chain objects |
burnin |
The number of trees to omit as burnin. |
facet |
Boolean denoting whether to make a facet plot. |
free_y |
TRUE/FALSE to turn free y scales on the facetted plots on or off (default FALSE). Only works if facet = TRUE. |
strip |
Number indicating which column to strip off (i.e., strip=1 removes first column, which is necessary for most MCMC outputs in which the first column is just the generation). You can skip multiple columns by passing a vector of columns to skip, e.g., strip=c(1,4,6). |
param.plot Returns a list of ggplot objects.
## Not run: data(fungus) makeplot.all.params(fungus, burnin=20) ## End(Not run)
## Not run: data(fungus) makeplot.all.params(fungus, burnin=20) ## End(Not run)
This function takes two or more rwty.chain ojects and returns a plot of ASDSF as the run progresses. The solid line with points shows the Average Standard Deviation of Split Frequences at the current generation The grey ribbon shows the upper and lower 95
makeplot.asdsf(chains, burnin = 0, window.size = 20, min.freq = 0, log.y = TRUE)
makeplot.asdsf(chains, burnin = 0, window.size = 20, min.freq = 0, log.y = TRUE)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin. Defaults to zero. |
window.size |
The number of trees between each point at which the ASDSFs is calculated (note, specified as a number of sampled trees, not a number of generations) |
min.freq |
The minimum frequency for a node to be used for calculating ASDSF. |
log.y |
Controls whether they Y axis is plotted on a log scale or not. Which scale is more useful depends largely on the amount of disagreement between your chains. Attempting to make an asdsf plot with a log Y axis for chains that include standard deviations of zero will result in warning messages. |
output A cumulative plot of ASDSF across all chains
## Not run: data(fungus) p <- makeplot.asdsf(fungus, burnin = 20) p ## End(Not run)
## Not run: data(fungus) p <- makeplot.asdsf(fungus, burnin = 20) p ## End(Not run)
This function takes a list of rwty.chain objects, and makes an autocorrelation plot for each chain. Each plot shows the mean phylogenetic distance at a series of sampling intervals. In really well behaved MCMC analyses, the mean distance will stay constant as the sampling interval increases. If there is autocorrelation, the mean distance will increase as the sampling interval increases, and is expected to level off when the autocorrelation decreases to zero. The function calculates path distances, though other distances could also be employed.
makeplot.autocorr(chains, burnin = 0, max.sampling.interval = NA, autocorr.intervals = 40, squared = FALSE, facet = FALSE, free_y = FALSE, treedist = "PD", use.all.samples = FALSE)
makeplot.autocorr(chains, burnin = 0, max.sampling.interval = NA, autocorr.intervals = 40, squared = FALSE, facet = FALSE, free_y = FALSE, treedist = "PD", use.all.samples = FALSE)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin. |
max.sampling.interval |
The largest sampling interval for which you want to calculate the mean distance between pairs of trees (default is 10 percent of the length of the chain). |
autocorr.intervals |
The number of sampling intervals to use. These will be spaced evenly between 1 and the max.sampling.interval |
squared |
TRUE/FALSE use squared tree distances (necessary to calculate approximate ESS; default FALSE) |
facet |
TRUE/FALSE to turn facetting of the plot on or off (default FALSE) |
free_y |
TRUE/FALSE to turn free y scales on the facetted plots on or off (default FALSE). Only works if facet = TRUE. |
treedist |
the type of tree distance metric to use, can be 'PD' for path distance or 'RF' for Robinson Foulds distance |
use.all.samples |
(TRUE/FALSE). Whether to calculate autocorrelation from all possible pairs of trees in your chain. The default is FALSE, in which case 500 samples are taken at each sampling interval. This is sufficient to get reasonably accurate estimates of the approximate ESS. Setting this to TRUE will give you slightly more accurate ESS estimates, at the cost of potentially much longer execution times. |
A ggplot2 plot object, with one line (facetting off) or facet (facetting on) per rwty.chain object.
## Not run: data(fungus) makeplot.autocorr(fungus, burnin = 20) ## End(Not run)
## Not run: data(fungus) makeplot.autocorr(fungus, burnin = 20) ## End(Not run)
Makes a plot matrix of each parameter against each other (including the topology) in your analysis. The default behaviour is to just plot the first two columns of your parameter file (after removing the column for the generation number) as well as the topological distance. This usually means that you see a pairs plot with the likelihood, the tree length, and the tree toppology. We do this because some parameter files contain so many columns that the plot matrix becomes too busy. To include parameters of your choice, use the 'parameters' argument. In this function, the topological distance is calculate from the first tree in every chain.
makeplot.pairs(chains, burnin = 0, treedist = "PD", params = NA, strip = 1)
makeplot.pairs(chains, burnin = 0, treedist = "PD", params = NA, strip = 1)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to omit as burnin. |
treedist |
the type of tree distance metric to use, can be 'PD' for path distance or 'RF' for Robinson Foulds distance |
params |
'NA', 'all', or a vector of column names to include in the plot. 'NA' gives the default behaviour (see above). 'all' plots all columns (watch out!). Choose specific columns by name with a vector. |
strip |
Number indicating which column to strip off (i.e., strip=1 removes first column, which is necessary for most MCMC outputs in which the first column is just the generation). |
pairs.plot Returns a ggplot object.
## Not run: data(salamanders) makeplot.pairs(salamanders[1], burnin=20) # plot all the variables makeplot.pairs(salamanders[1], burnin=20, params = 'all') # plot specific the variables (note: you always get the topological distance) makeplot.pairs(salamanders[1], burnin=20, params = c('pi.A.', 'pi.C.', 'pi.G.', 'pi.T.')) ## End(Not run)
## Not run: data(salamanders) makeplot.pairs(salamanders[1], burnin=20) # plot all the variables makeplot.pairs(salamanders[1], burnin=20, params = 'all') # plot specific the variables (note: you always get the topological distance) makeplot.pairs(salamanders[1], burnin=20, params = c('pi.A.', 'pi.C.', 'pi.G.', 'pi.T.')) ## End(Not run)
Plots parameter values over the length of the MCMC chain
makeplot.param(chains, burnin = 0, parameter = "LnL", facet = TRUE, free_y = FALSE)
makeplot.param(chains, burnin = 0, parameter = "LnL", facet = TRUE, free_y = FALSE)
chains |
A set of rwty.chain objects. |
burnin |
The number of trees to omit as burnin. |
parameter |
The column name of the parameter to plot. |
facet |
Boolean denoting whether to make a facet plot. |
free_y |
TRUE/FALSE to turn free y scales on the facetted plots on or off (default FALSE). Only works if facet = TRUE. |
param.plot Returns a ggplot object.
## Not run: data(fungus) makeplot.param(fungus, burnin=20, parameter="pi.A.") ## End(Not run)
## Not run: data(fungus) makeplot.param(fungus, burnin=20, parameter="pi.A.") ## End(Not run)
This function takes a list of rwty.chain objects, and plots the pseudo ESS of the tree topologies from each chain, after removing burnin. Each caulcation is repeated n times, where in each replicate a random tree from the chain is chosen as a 'focal' tree. The calculation works by calculating the path distance of each tree in the chain from the focal tree, and calculating the ESS of the resulting vector of phylogenetic distances using the effectiveSize function from the coda package. NB this function requires the calculation of many tree distances, so can take some time.
makeplot.pseudo.ess(chains, burnin = 0, n = 20)
makeplot.pseudo.ess(chains, burnin = 0, n = 20)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
n |
The number of replicate analyses to do |
pseudo.ess.plot A ggplot2 plot object, in which each chain is represented by a point which represents the median pseudo ESS from the n replicates, and whiskers representing the upper and lower 95
## Not run: data(fungus) makeplot.pseudo.ess(fungus, burnin = 20, n = 10) ## End(Not run)
## Not run: data(fungus) makeplot.pseudo.ess(fungus, burnin = 20, n = 10) ## End(Not run)
This function takes list of rwty.chain objects, and returns a scatterplot matrix in which each plot shows the split frequencies of all clades that appear in one or both MCMC chains at least once. In the upper diagonal, we show the correlation between the split frequencies (Pearson's R), and the Average Standard Deviation of the split frequencies.
makeplot.splitfreq.matrix(chains, burnin = 0)
makeplot.splitfreq.matrix(chains, burnin = 0)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
output A list of two plots: the first is a matrix of scatterplots, where each point is a clade, and the values are the split frequencies of that clade in the post-burnin trees of each chain. The second plot is a tree of the chains clustered by their ASDSFs.
## Not run: data(salamanders) makeplot.splitfreq.matrix(salamanders[1:4], burnin = 20) ## End(Not run)
## Not run: data(salamanders) makeplot.splitfreq.matrix(salamanders[1:4], burnin = 20) ## End(Not run)
Takes a list of rwty.chain objects. Plots the cumulative split frequencies of clades over the course of the MCMC. Stationarity is indicated by split frequencies levelling out. Only plots the n.clades most variable clades, as measured by the standard deviation of the split frequencies of each clade across all windows. Each line in the plot represents a single clade. The colour of the line represents the standard deviation of the split frequencies of that clade across all sliding windows.
makeplot.splitfreqs.cumulative(chains, burnin = 0, n.clades = 20, window.size = 20, facet = TRUE, rank = "wcsf")
makeplot.splitfreqs.cumulative(chains, burnin = 0, n.clades = 20, window.size = 20, facet = TRUE, rank = "wcsf")
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
n.clades |
The number of clades to plot |
window.size |
The number of trees to include in each window (note, specified as a number of sampled trees, not a number of generations) |
facet |
(TRUE/FALSE). TRUE: return a single plot with one facet per chain; FALSE: return a list of individual plots with one plot per chain |
rank |
('wcsf', 'sd'). How to rank the clades? By default, we plot the 20 'worst' clades. This parameter sets the definition of 'worst'. The default is to rank the by the weighted change in split frequencies (rank = 'wcsf'). This works by looking at the change in the cumulative split frequency over the course of the MCMC, and ranks the worst chains as those that do not level off (i.e. those that have changes near the end). We do this because in a well-behaved chain, we expect the cumulative split frequencies to level off once the chain has been run for long enough. So, any cumulative split frequencies which are still changing towards the end of your run are likely to indicate problematic clades. Specifically, we multiply the absolute change in split frequencies for each clade by a set of weights that increase linearly towards the end of the chain (the first observation gets a weight of zero, the final observation gets a weight of one). The original AWTY ranked clades by their standard deviations (higher SD = worse), so we include this as an option too. To do this, just set rank = 'sd'. |
splitfreqs.plot Either a single ggplot2 object or a list of ggplot2 objects.
## Not run: data(fungus) makeplot.splitfreqs.cumulative(fungus, burnin = 20, n.clades=25) ## End(Not run)
## Not run: data(fungus) makeplot.splitfreqs.cumulative(fungus, burnin = 20, n.clades=25) ## End(Not run)
Takes a list of rwty.chain objects. Plots the split frequencies of clades over the course of the MCMC, calculated from windows of a specified size. Only plots the n.clades most variable clades, as measured by the standard deviation of the split frequencies of each clade across the MCMC. Each line in the plot represents a single clade. The colour of the line represents the standard deviation of the split frequencies of that clade across the MCMC.
makeplot.splitfreqs.sliding(chains, burnin = 0, n.clades = 20, window.size = 20, facet = TRUE, rank = "ess")
makeplot.splitfreqs.sliding(chains, burnin = 0, n.clades = 20, window.size = 20, facet = TRUE, rank = "ess")
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
n.clades |
The number of clades to plot |
window.size |
The number of trees to include in each window (note, specified as a number of sampled trees, not a number of generations) |
facet |
(TRUE/FALSE). TRUE: return a single plot with one facet per chain; FALSE: return a list of individual plots with one plot per chain |
rank |
('ess', 'sd'). How to rank the clades? By default, we plot the 20 'worst' clades. This parameter sets the definition of 'worst'. The default is to rank the clades by increasing Effective Sample Size (i.e. the 20 worst clades are those with the lowest ESS), since in a sliding window plot we expect well-sampled splits to have a high value (rank = "ess"). The original AWTY ranked clades by their standard deviations. To do this, just set rank = 'sd'. |
splitfreqs.plot Either a single ggplot2 object or a list of ggplot2 objects.
## Not run: data(fungus) makeplot.splitfreqs.sliding(fungus, burnin = 20, n.clades=25) ## End(Not run)
## Not run: data(fungus) makeplot.splitfreqs.sliding(fungus, burnin = 20, n.clades=25) ## End(Not run)
Plots a trace of topological distances of trees over the length of the MCMC chain. The plot shows the path distance of each tree in each chain from the last tree of the burnin of the first chain. If burnin is set to zero, then distances are calculated from the first tree of the first chain. If required, the behaviour can be changed to plot the path distance of each tree from the last tree of the burnin of each chain, using the independent.chains option. This is not recommended in most cases.
makeplot.topology(chains, burnin = 0, facet = TRUE, free_y = FALSE, independent.chains = FALSE, treedist = "PD", approx.ess = TRUE)
makeplot.topology(chains, burnin = 0, facet = TRUE, free_y = FALSE, independent.chains = FALSE, treedist = "PD", approx.ess = TRUE)
chains |
A set of rwty.chain objects. |
burnin |
The number of trees to omit as burnin. |
facet |
TRUE/FALSE denoting whether to make a facet plot (default TRUE) |
free_y |
TRUE/FALSE to turn free y scales on the facetted plots on or off (default FALSE). Only works if facet = TRUE. |
independent.chains |
TRUE/FALSE if FALSE (the default) then the plots show the distance of each tree from the last tree of the burnin of the first chain. If TRUE, the plots show the distance of each tree from the first tree of the chain in which that tree appears. The TRUE option should only be used in the case that different chains represent analyses of different genes or datasets. |
treedist |
the type of tree distance metric to use, can be 'PD' for path distance or 'RF' for Robinson Foulds distance |
approx.ess |
TRUE/FALSE do you want the approximate topological ess to be calculated and displayed for each chain? |
topology.trace.plot Returns a ggplot object.
## Not run: data(fungus) makeplot.topology(fungus, burnin=20) ## End(Not run)
## Not run: data(fungus) makeplot.topology(fungus, burnin=20) ## End(Not run)
This function will take list of rwty.chains objects and produce plots of chains in treespace.
makeplot.treespace(chains, burnin = 0, n.points = 100, fill.color = NA)
makeplot.treespace(chains, burnin = 0, n.points = 100, fill.color = NA)
chains |
A list of one or more rwty.chain objects |
burnin |
The number of samples to remove from the start of the chain as burnin |
n.points |
The number of points on each plot |
fill.color |
The name of any column in your parameter file that you would like to use as a fill colour for the points of the plot. |
A list of two ggplot objects: one plots the points in treespace, the other shows a heatmap of the same points
## Not run: data(fungus) p <- makeplot.treespace(fungus, burnin = 20, fill.color = 'LnL') # Treespace plot for all the fungus data # NB: these data indicate significant problems: the chains are sampling very # different parts of tree space. # # View the points plotted in treespace (these data indicate significant problems) p$treespace.points.plot # View the heatmap of the same data # Note that this data is so pathologically bad that the heatmap is not # very useful. It is more useful on better behaved datasets p$treespace.heatmap # we can also plot different parameters as the fill colour. # e.g. we can plot the first two fungus chains with likelihood as the fill makeplot.treespace(fungus[1:2], burnin = 100, fill.color = 'LnL') # or with tree length as the fill makeplot.treespace(fungus[1:2], burnin = 100, fill.color = 'TL') # you can colour the plot with any parameter in your ptable # to see which parameters you have you can simply do this: names(fungus[[1]]$ptable) ## End(Not run)
## Not run: data(fungus) p <- makeplot.treespace(fungus, burnin = 20, fill.color = 'LnL') # Treespace plot for all the fungus data # NB: these data indicate significant problems: the chains are sampling very # different parts of tree space. # # View the points plotted in treespace (these data indicate significant problems) p$treespace.points.plot # View the heatmap of the same data # Note that this data is so pathologically bad that the heatmap is not # very useful. It is more useful on better behaved datasets p$treespace.heatmap # we can also plot different parameters as the fill colour. # e.g. we can plot the first two fungus chains with likelihood as the fill makeplot.treespace(fungus[1:2], burnin = 100, fill.color = 'LnL') # or with tree length as the fill makeplot.treespace(fungus[1:2], burnin = 100, fill.color = 'TL') # you can colour the plot with any parameter in your ptable # to see which parameters you have you can simply do this: names(fungus[[1]]$ptable) ## End(Not run)
Converts a list of clades (e.g., "1 2 3 4" as a clade) and returns a list of parsed clades, converting numbers to names using a set of trees. Called internally by the slide and cumulative analyses, not user-facing.
parse.clades(clades, treelist)
parse.clades(clades, treelist)
clades |
A list of clades, as in the first column of a cladetable in an rwty.slide or rwty.cumulative object. |
treelist |
A list of trees, used for getting tip names. |
output A list of clades with parsed tip names
This function is automatically called when printing a rwty.chain object
## S3 method for class 'rwty.chain' print(x, ...)
## S3 method for class 'rwty.chain' print(x, ...)
x |
A rwty.chain object |
... |
Other arguments to be passed on to next function |
A summary of the contents of the chain
data(fungus) fungus$Fungus.Run1
data(fungus) fungus$Fungus.Run1
This is the output from a MrBayes run of 25,000,000 generations using the analysis settings from the original .nex files. Sampling is one tree per 100,000 generations. Data is from alignments of three separate sequences, two chains per alignment, each with its associated log file.
data(salamanders)
data(salamanders)
A data frame with six chains (two each from three separate alignments) of 251 phylogenetic trees and associated likelihood and parameter values.
Study reference: Williams JS, Niedzwiecki JH, Weisrock DW (2013) Species tree reconstruction of a poorly resolved clade of salamanders (Ambystomatidae) using multiple nuclear loci. Molecular Phylogenetics and Evolution 68(3): 671-682. http://dx.doi.org/10.1016/j.ympev.2013.04.013
Dryad reference: Williams JS, Niedzwiecki JH, Weisrock DW (2013) Data from: Species tree reconstruction of a poorly resolved clade of salamanders (Ambystomatidae) using multiple nuclear loci. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.2gq14
http://datadryad.org/resource/doi:10.5061/dryad.2gq14
This function takes sliding windows of a specified length over an MCMC chain and calculates the split frequency of clades within that window. It allows users to see whether the chain is visiting different areas of treespace.
slide.freq(chains, burnin = 0, window.size = 20)
slide.freq(chains, burnin = 0, window.size = 20)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin. Defaults to zero. |
window.size |
The number of trees to include in each window (note, specified as a number of sampled trees, not a number of generations) |
A list of rwty.slide objects, one per chain in the input list of chains. Each rwty.slide object contains the frequencies of clades in the sliding windows, and a translation table that converts clade groupings to factors.
## Not run: data(fungus) slide.data <- slide.freq(fungus, burnin=20)\ ## End(Not run)
## Not run: data(fungus) slide.data <- slide.freq(fungus, burnin=20)\ ## End(Not run)
This function takes a list of rwty.chain objects, and calculates the pseudo ESS of the trees from each chain, after removing burnin. The calculation uses the autocorrelation among squared topological distances between trees to calculate an approximate ESS of tree topologies for each chain. NB this function requires the calculation of many many tree distances, so can take some time.
topological.approx.ess(chains, burnin = 0, max.sampling.interval = 100, treedist = "PD", use.all.samples = FALSE)
topological.approx.ess(chains, burnin = 0, max.sampling.interval = 100, treedist = "PD", use.all.samples = FALSE)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
max.sampling.interval |
The largest sampling interval you want to use to calculate the ESS. Every sampling interval up to and including this number will be sampled. Higher is better, but also slower. In general, setting this number to 100 (the default) should be fine for most cases. However, if you get an upper bound on the ESS estimate (i.e. ESS<x) rather than a point estimate (i.e. ESS = x) then that indicates a higher max.sampling.interval would be better, because the algorithm could not find the asymptote on the autocorrelation plot with the current max.sampling.interval. |
treedist |
the type of tree distance metric to use, can be 'PD' for path distance or 'RF' for Robinson Foulds distance |
use.all.samples |
(TRUE/FALSE). Whether to calculate autocorrelation from all possible pairs of trees in your chain. The default is FALSE, in which case 500 samples are taken at each sampling interval. Setting this to TRUE will give you slightly more accurate ESS estimates, at the cost of potentially much longer execution times. |
A data frame with one row per chain, and columns describing the approximate ESS and the name of the chain.
## Not run: data(fungus) topological.approx.ess(fungus, burnin = 20) ## End(Not run)
## Not run: data(fungus) topological.approx.ess(fungus, burnin = 20) ## End(Not run)
This function takes a list of rwty.chain objects, and calculates the mean phylogenetic distance at a series of roughly even sampling intervals. In really well behaved MCMC analyses, the mean distance will stay constant as the sampling interval increases. If there is autocorrelation, it will increase as the sampling interval increases, and is expected to level off when the autocorrelation decreases to zero. The function calculates path distances, though other distances could also be employed.
topological.autocorr(chains, burnin = 0, max.sampling.interval = NA, autocorr.intervals = 100, squared = FALSE, treedist = "PD", use.all.samples = FALSE)
topological.autocorr(chains, burnin = 0, max.sampling.interval = NA, autocorr.intervals = 100, squared = FALSE, treedist = "PD", use.all.samples = FALSE)
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
max.sampling.interval |
The largest sampling interval for which you want to calculate the mean distance between pairs of trees (default is 10 percent of the length of the list of trees). |
autocorr.intervals |
The number of sampling intervals to use. These will be spaced evenly between 1 and the max.sampling.interval |
squared |
TRUE/FALSE use squared tree distances (necessary to calculate approximate ESS) |
treedist |
the type of tree distance metric to use, can be 'PD' for path distance or 'RF' for Robinson Foulds distance |
use.all.samples |
(TRUE/FALSE). Whether to calculate autocorrelation from all possible pairs of trees in your chain. The default is FALSE, in which case 500 samples are taken at each sampling interval. This is sufficient to get reasonably accurate estimates of the approximate ESS. Setting this to TRUE will give you slightly more accurate ESS estimates, at the cost of potentially much longer execution times. |
A data frame with one row per sampling interval, per chain. The first column is the sampling interval. The second column is the mean path distance between pairs of trees from that sampling interval. The third column is the chain ID.
## Not run: data(fungus) topological.autocorr(fungus, burnin = 20) ## End(Not run)
## Not run: data(fungus) topological.autocorr(fungus, burnin = 20) ## End(Not run)
This function takes a list of rwty.chain objects, and calculates the pseudo ESS of the trees from each chain, after removing burnin. Each caulcation is repeated n times, where in each replicate a random tree from the chain is chosen as a 'focal' tree. The calculation works by calculating the path distance of each tree in the chain from the focal tree, and calculating the ESS of the resulting vector of phylogenetic distances using the effectiveSize function from the coda package. NB this function requires the calculation of many many tree distances, so can take some time.
topological.pseudo.ess(chains, burnin = 0, n = 20, treedist = "PD")
topological.pseudo.ess(chains, burnin = 0, n = 20, treedist = "PD")
chains |
A list of rwty.chain objects. |
burnin |
The number of trees to eliminate as burnin |
n |
The number of replicate analyses to do |
treedist |
the type of tree distance metric to use, can be 'PD' for path distance or 'RF' for Robinson Foulds distance |
A data frame with one row per chain, and columns describing the median ESS, the upper and lower 95 replicates performed, and the name of the chain.
## Not run: data(fungus) topological.pseudo.ess(fungus, burnin = 20, n = 20) ## End(Not run)
## Not run: data(fungus) topological.pseudo.ess(fungus, burnin = 20, n = 20) ## End(Not run)
This function takes a list of trees and returns a distance matrix populated with distances between all trees in the list.
tree.dist.matrix(trees, treedist = "rf", ...)
tree.dist.matrix(trees, treedist = "rf", ...)
trees |
an object of class 'multiPhylo'. |
treedist |
acronym of distance method to employ: one of |
dots |
additional parameters sent to distance functions. |
a matrix of distances between each pair of trees
A suite of distance metrics are implemented, offering a trade-off between running time and suitability of metric. Ranked according to their running time with 251 85-tip trees on a low-spec desktop computer, recommended distance metrics are:
- rf
(0.4 seconds): Robinson-Foulds distance (Robinson & Foulds,
1981): although widely used, the RF metric has a series of theoretical
shortcomings that give rise to bias and artefacts, translating to poor
performance in a suite of practical applications. Its low resolution and
rapid saturation make it particularly unsuitable for operations in tree
space. Nevertheless, its speed is hard to match.
- pd
(1 s): The path (= cladistic / nodal / patristic / tip)
distance (Farris 1973) can also be calculated rapidly, but is heavily
influenced by the shape (e.g. balanced / unbalanced) of trees, meaning that
similar-looking trees that nevertheless denote very different sets of
relationships will have shorter distances than may be anticipated.
Consequently, the path metric does a poor job of identifying clusters of
similar trees.
- icrf
(1.4 s): Robinson-Foulds distance, corrected for split size
using information theory (Smith 2020). This measure adjusts the
Robinson-Foulds distance to account for the different significance of
different partitions: partitions that evenly divide taxa contain more
information, and thus should contribute more to a distance score if they
are not shared between trees. This adjustment improves the resolution and
sensitivity of the metric, but does not correct for a number of arguably
more significant biases.
- pid
(4 s); msid
(8 s); cid
(30 s):
Phylogenetic information distance, matching split information
distance and clustering information distance (Smith 2020).
These information-theoretic methods belong to the class of
Generalized Robinson-Foulds distances:
by recognizing similarities between non-identical splits, they overcome
many of the artefacts that affect the RF distance, providing a more
representative measure of tree distances; whereas their information-theoretic
basis affords them a natural unit (the bit), providing a measurable dimension
to tree space.
Whilst the CID performs the best against a suite of theoretical and
practical criteria, the MSID comes a very close second and is somewhat
quicker to calculate. The PID will provide unexpectedly large distances
in a subset of the cases that distort the RF metric, which may result in
undesirable distortions of an accompanying tree space.
Detailed analysis of the behaviour of these and other tree distance methods against a suite of criteria is available in Smith (2020); implementation details are provided in the R package 'TreeDist'.
A further set of methods that underperform methods with similar running time listed above are also implemented for comparative purposes:
- mast
, masti
(30 minutes): size / information content of
the maximum agreement forest, subtracted from its maximum possible value
to create a distance. Specify 'rooted = FALSE' if trees are unrooted.
- jrf
(1 min, k = 2; 25 min, conflict-ok; 4 h, k = 4, no-conflict):
Jaccard Robinson-Foulds metric (Böcker et al. 2013);
specify a value of k
and allowConflict
using ...
.
- ms
(5 s): Matching split distance (Bogdanowicz and Giaro 2012;
Lin et al. 2012.
- nye
(65 s): The generalized RF distance of Nye et al. (2006).
- nni
(65 s): Approximate Nearest Neighbour Interchance (rotation)
distance.
- spr
(0.4 s): Approximate Subtree Prune and Regraft distance.
Böcker S, Canzar S, Klau GW (2013). “The generalized Robinson-Foulds metric.” In Darling A, Stoye J (eds.), Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science, vol 8126, 156–169. Springer, Berlin, Heidelberg. doi: 10.1007/978-3-642-40453-5_13.
Bogdanowicz D, Giaro K (2012). “Matching split distance for unrooted binary phylogenetic trees.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1), 150–160. doi: 10.1109/TCBB.2011.48.
Farris JS (1973). “On comparing the shapes of taxonomic trees.” Systematic Zoology, 22(1), 50–54. doi: 10.2307/2412378.
Lin Y, Rajan V, Moret BME (2012). “A metric for phylogenetic trees based on matching.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(9), 1014–1022. doi: 10.1109/TCBB.2011.157.
Nye TMW, Liò P, Gilks WR (2006). “A novel algorithm and web-based tool for comparing two alternative phylogenetic trees.” Bioinformatics, 22(1), 117–119. doi: 10.1093/bioinformatics/bti720.
Robinson DF, Foulds LR (1981). “Comparison of phylogenetic trees.” Mathematical Biosciences, 53(1-2), 131–147. doi: 10.1016/0025-5564(81)90043-2.
Smith MR (2020). “Information theoretic Generalized Robinson-Foulds metrics for comparing phylogenetic trees.” Bioinformatics, in production. doi: 10.1093/bioinformatics/btaa614.
## Not run: data(fungus) tree.dist.matrix(fungus$Fungus.Run1$trees) ## End(Not run)
## Not run: data(fungus) tree.dist.matrix(fungus$Fungus.Run1$trees) ## End(Not run)
This function constructs a distance matrix from a list of trees and uses multi-dimensional scaling to collapse it to a two- dimensional tree space for plotting.
treespace(chains, n.points = 100, burnin = 0, fill.color = NA)
treespace(chains, n.points = 100, burnin = 0, fill.color = NA)
chains |
A list of 1 or more rwty.chain objects. |
n.points |
The minimum number of points you want in your plot. |
burnin |
The number of trees to eliminate as burnin. Default is zero. |
fill.color |
The name of the column from the log table that that you would like to use to colour the points in the plot. |
Returns a list containing the points and a plot.
## Not run: data(fungus) treespace(fungus, n.points=50, burnin=20, fill.color="LnL") ## End(Not run)
## Not run: data(fungus) treespace(fungus, n.points=50, burnin=20, fill.color="LnL") ## End(Not run)