The ggsmc
package uses ggplot2
to display
the results of importance sampling (IS), sequential Monte Carlo (SMC) or
ensemble-based algorithms. Each algorithm outputs a collection of
points, usually evolved through a sequence of target distributions,
which for IS and SMC are weighted.
To use this package the algorithm output must be in tidy format, where each
dimension of each parameter for each particle lies in a distinct row in
a data frame. The data sir_cwna_model
provides an example
of valid input to the plotting functions.
library(ggsmc)
data(sir_cwna_model)
head(sir_cwna_model)
#> ExternalIndex Target Time NormalisingConstant ISESS TargetParameters
#> 1 1 1 0.0025475 -3.07181 16.2299 i=0;dt=1;
#> 2 1 1 0.0025475 -3.07181 16.2299 i=0;dt=1;
#> 3 1 1 0.0025475 -3.07181 16.2299 i=0;dt=1;
#> 4 1 1 0.0025475 -3.07181 16.2299 i=0;dt=1;
#> 5 1 1 0.0025475 -3.07181 16.2299 i=0;dt=1;
#> 6 1 1 0.0025475 -3.07181 16.2299 i=0;dt=1;
#> Iteration Particle AncestorIndex LogWeight ParameterName Dimension Value
#> 1 1 1 1 -28.057199 x 1 6.5297
#> 2 1 1 1 -28.057199 x 2 -15.4264
#> 3 1 2 2 -2.453099 x 1 -0.5869
#> 4 1 2 2 -2.453099 x 2 -8.7790
#> 5 1 3 3 -11.568199 x 1 3.6434
#> 6 1 3 3 -11.568199 x 2 -14.1641
This data contains the output of a particle filter (PF) applied to a
target tracking problem. The state \(x=(x_1,x_2)^T\) tracked using the PF is
two-dimensional, consisting of the (one-dimensional) position and
velocity of the target. The value of the first particle for the first
target distribution is given by \((x_1,x_2)^T=(3.0163,4.6682)^T\). In tidy
format, this is stored on two rows of the data frame, where the
Dimension
column gives the index of the state
ParameterName
: e.g. in the row where
ParameterName=="x"
and Dimension==2
, the
Value
column gives the value of \(x_2\).
If you are unfamiliar with it, this way of storing data with its many
repeated values might seem wasteful. Its strength is that this format
can be used consistently across different situations, allowing the use
of general purpose packages for processing and, in our case, plotting
the data. If your data is in the more standard matrix format for Monte
Carlo output (i.e. one parameter vector per row), then you can use the
matrix2tidy
function included in the package to convert to
the required format your algorithm output for each target distribution.
For this function you need to supply your algorithm output
in matrix format, a name for the parameter
that is
represented in the output, an integer index for the target
to which the output corresponds and the log_weights
of each
particle if using an IS or SMC algorithm. If using an algorithm that
iterates over multiple targets, the matrix2tidy
function
should be called for each target, then the output of each call stacked
together using rbind
.
The sir_cwna_model
data contains more columns than are
required to use the plotting functions in this package. The columns
required by all functions are:
Target
, which indexes the target distribution,
taking a different integer values for each target.
Particle
, which indexes the particles, taking a
different integer value for each particle.
ParameterName
, which uses a string to name each
parameter.
Dimension
, which indexes the dimension of the
parameter, taking a different integer value for each dimension.
Value
, which stores the numerical value for the
particle, target, parameter and dimension specified by the other
columns.
For the output of an IS or SMC algorithm, we may additionally include
a LogWeight
column to store the log of the (normalised)
weight of the particle. If this column is not found in the data, each
particle will be assigned an equal weight.
The plot_histogram
and plot_density
functions may be used to plot, respectively, a histogram or density of
the marginal distribution of one dimension of one parameter. If the
LogWeight
column is present in the data a weighted
histogram/density will be used.
For these functions, a parameter
(string) and
dimension
(integer) argument need to be used. If the
target
variable is set, the function will plot the marginal
histogram/density for the specified target, parameter and dimension. If
no target is specified, the points for all points for the specified
parameter and dimension will be used for the plot. The
plot_histogram
function also takes a bins
argument, which may be used to specify the number of bins for the
histogram (if this argument is not specified, the default value in
ggplot2
is used).
For an example, we look at the 20th target of the
sir_cwna_model
data using both a histogram and a
density.
The plot_scatter
function can we used to creates a
scatter plot of the Monte Carlo representation of the joint distribution
between two parameters, or two dimensions of the same parameter. We
specify the parameter and dimension for the x-axis using the arguments
x_parameter
and x_dimension
respectively. The
target
variable plays the same role as for
plot_histogram
and plot_density
. If a
LogWeight
column is present in the data, the size of the
points in the scatter plot will be used to represent the particle
weights.
We illustrate this plot again on the 20th target of the
sir_cwna_model
data.
plot_scatter(sir_cwna_model,
x_parameter = "x",
x_dimension = 1,
y_parameter = "x",
y_dimension = 2,
target = 20,
alpha = 0.5,
max_size = 3)
Note that in this plot we have used two additional arguments,
alpha
and max_size
, to adjust the look of the
plot. In this example we have very few importance points, I found the
default value of alpha = 0.1
(the transparency of the
points from 0 to 1) to be too low, and the default
max_size = 1
(governing the size of the points) to be too
small.
To more clearly understand the output of an SMC or ensemble-based
algorithm it can be useful to visualise the evolution of the particles
over time. The function plot_genealogy
may be used for this
purpose, for one dimension of one parameter. We again use the
sir_cwna_model
data to illustrate this function, showing
the evolution of the PF’s estimate of the target position over time.
plot_genealogy(sir_cwna_model,
parameter = "x",
dimension = 1,
use_initial_points = FALSE,
vertical = FALSE,
alpha_lines = 0,
alpha_points = 0.05,
arrows = FALSE)
The first few arguments are the same as those used for the
plot_density
function. We have used several additional
arguments the alter the plot:
use_initial_points
(default TRUE) governs the
inclusion (or otherwise) of the initial unweighted points drawn from the
proposal used to initialise the algorithm. In this case we chose to omit
these points so that we show only the estimate of the filtering
distribution at each time.
vertical
(default TRUE) controls the orientation of
the figure.
alpha_lines
changes the transparency (from 0,
transparent, to 1, solid) of the lines connecting corresponding points
between successive targets. Here we choose to omit these lines by making
them invisible.
alpha_points
changes the transparency (from 0,
transparent, to 1, solid) of the points (whose size is given by the
LogWeights
column if included in the data.
arrows
(default TRUE) determines if arrows are
included on the lines in the plot (omitted in this plot).
To illustrate an alternative configuration of a genealogy plot, we use output from a different algorithm: an SMC sampler applied to a sequence of targets on parameter \(\theta\) that begins with a Gaussian distribution, and ends with a two-component mixture of Gaussians.
data(mixture_25_particles)
plot_genealogy(mixture_25_particles,
parameter = "θ",
dimension = 1,
alpha_lines = 0.2,
alpha_points = 0.4)
For an SMC algorithm, due to the resampling
step the index of the ancestor of particle \(i\) for each target is not likely to be
\(i\). To produce a plot with lines
that connect each particle with its ancestor, we need an additional
column in the data named AncestorIndex
.
For some algorithms each Monte Carlo point represents a time series,
where the index that thus far has been represented by the column
Dimension
can be thought of as indexing time. To illustrate
this we plot simulations from a stochastic Lotka-Volterra
model produced during a run of an approximate
Bayesian computation (ABC) algorithm. These are contained in the
data lv_output
.
In this plot the weight (given by LogWeight
) of each
time series is represented by the width the line. We make the lines more
visible by choosing alpha = 0.5
compared to the default
0.1, and change the limits of the y-axis by using
ylimits=c(0,1000)
. There is only one target, so the
target
argument does not need to be specified. The
max_line_width
argument (not used here) can be used to
scale the width of all lines in the plot.
All figures are created using ggplot2
, so can be
modified using other functions from this package. We now demonstrate
this by adding the true target position onto a plot of the
sir_cwna_model
data, and changing its style. We also
demonstrate the ability to automatically generate a title for the
figure.
data(cwna_data)
plot_genealogy(sir_cwna_model,
parameter = "x",
dimension = 1,
use_initial_points = FALSE,
vertical = FALSE,
alpha_lines = 0,
alpha_points = 0.05,
arrows = FALSE,
default_title = TRUE) +
ggplot2::geom_line(data=cwna_data,ggplot2::aes(x=Index,y=Position),colour="red",inherit.aes = FALSE) +
ggplot2::theme_minimal() +
ggplot2::theme(legend.position="none")
The histogram, density, scatter and time series plots can all be
animated, to show how the output changes over a sequence of target
distributions. Below we show the code for generating (looping) 10 second
animations (set by the duration
argument) showing how the
sir_cwna_model
data changes over the sequence of 20
targets.
animate_scatter(sir_cwna_model,
x_parameter = "x",
x_dimension = 1,
y_parameter = "x",
y_dimension = 2,
alpha = 0.5,
max_size = 3,
duration = 10)
For time series, there are two possible animations. One,
animate_time_series
, shows how the time series particles
evolve over a sequence of (non-time ordered) targets. The other,
animate_reveal_time_series
, illustrated below on
lv_output
, animates the time series to show how the
population of particles evolves over time.