synthesizer
Package version 0.4.0.
Use citation('synthesizer') to cite the package.
synthesizer is an R package for quickly and easily synthesizing data.
It also provides a few basic functions based on pMSE to measure the
utility of the synthesized data.
The package supports numerical, categorical/ordinal, and mixed data.
It synthesizes time series (ts) objects and correctly accounts for
missing values and mixed (or zero-inflated) distributions. A rankcor
parameter lets you gradually shift between realistic data with high
utility and less realistic data with decreased correlations between
original and synthesized data.
The latest CRAN release can be installed as follows.
install.packages("synthesizer")
Next, the package can be loaded. You can use packageVersion (from
base R) to check which version you have installed.
> library(synthesizer)
> # check the package version
> packageVersion("synthesizer")
[1] ‘0.4.0’
We will use the iris dataset, which is built into R.
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Creating a synthetic version of this dataset is easy.
> set.seed(1)
> synth_iris <- synthesize(iris)
To compare the datasets we can make some side-by-side scatterplots.
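One way to draw such a comparison is sketched below in base R graphics. The helper compare_scatter and its arguments are hypothetical names introduced here for illustration and are not part of the package. It assumes both data frames share the columns being plotted.

```r
# Hypothetical helper (not part of synthesizer): draw the same two
# variables from both datasets next to each other, coloured by a factor.
compare_scatter <- function(real, synth,
                            x = "Sepal.Length", y = "Petal.Length",
                            by = "Species") {
  op <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(op))             # restore the previous layout on exit
  plot(real[[x]], real[[y]], col = real[[by]], pch = 16,
       xlab = x, ylab = y, main = "Original")
  plot(synth[[x]], synth[[y]], col = synth[[by]], pch = 16,
       xlab = x, ylab = y, main = "Synthetic")
  invisible(NULL)
}

# Usage, with synth_iris created as above:
# compare_scatter(iris, synth_iris)
```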
By default synthesize will return a dataset of the same size as the
input dataset. However, it is possible to ask for any number of records.
> more_synth <- synthesize(iris, n=250)
> dim(more_synth)
[1] 250   5
The pMSE (propensity mean squared error) method is a popular way of measuring the quality of a synthetic dataset. The idea is to train a model to predict whether a record is synthetic or not. The worse the model performs at this task, the more closely the synthetic data resembles the original data. The value ranges from 0 to 0.25 (if the synthetic and original datasets have the same number of records). Smaller is better.
> pmse(synth=synth_iris, real=iris)
[1] 0.007844863
The package lets you choose between logistic regression (the default) and a random forest classifier as the predictive model.
> pmse(synth=synth_iris, real=iris, model="rf")
[1] 0.0921007
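To make the idea behind the measure concrete, here is a minimal base R sketch of a pMSE computation with logistic regression. The function pmse_sketch is a hypothetical illustration, not the package's pmse implementation, and it assumes the usual definition: the mean squared difference between each record's predicted probability of being synthetic and the share of synthetic records in the combined data. With equally sized datasets that share is 0.5, which is why the value cannot exceed 0.25.

```r
# Hypothetical illustration of the pMSE idea; not the package's pmse().
pmse_sketch <- function(synth, real) {
  stopifnot(identical(names(synth), names(real)))
  combined <- rbind(real, synth)
  # Label each record: 0 = original, 1 = synthetic
  combined$is_synth <- rep(c(0, 1), times = c(nrow(real), nrow(synth)))
  # Propensity model: can the records be told apart?
  fit <- glm(is_synth ~ ., data = combined, family = binomial())
  p <- fitted(fit)                        # predicted P(record is synthetic)
  share <- nrow(synth) / nrow(combined)   # 0.5 for equally sized datasets
  mean((p - share)^2)
}

set.seed(1)
real <- iris[, 1:4]
# A deliberately crude stand-in for synthetic data: every column is
# resampled independently, which destroys the correlations.
crude <- as.data.frame(lapply(real, sample, replace = TRUE))

pmse_sketch(crude, real)   # larger: the model can tell the sets apart
pmse_sketch(real, real)    # essentially zero: the sets are indistinguishable
```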
Synthetic data can be too realistic, in the sense that it might reveal actual properties of the original entities used to create the synthetic data. One way to mitigate this is to decrease the rank correlation between the original and the synthetic data.
When synthesizing data frames this can be controlled with the rankcor
parameter. This parameter varies from 0, representing the lowest
utility, to 1, the default and maximum utility. The rankcor refers to
the maximum rank correlation between original and synthesized
variables. If rankcor is a single (unnamed) value, all synthetic
variables are rank-decorrelated from the original data by random
permutations until the rank correlation between synthetic and original
data drops below the rankcor value.
It is also possible to lower the utility of a selection of variables.
Variables for which rankcor is not specified will default to perfect
rank correlation (rankcor=1).
> # decorrelate rank matching to 0.5
> s1 <- synthesize(iris, rankcor=0.5)
> # decorrelate only Species
> s2 <- synthesize(iris, rankcor=c("Species"=0.5))
In the left figure, we show the three variables of a synthesized iris
dataset, where all variables are decorrelated. Both the geometric
clustering and the species are now garbled. In the right figure we only
decorrelate the Species variable. Here, the spatial clustering is
retained while the correlation between color (Species) and location is
lost.
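The rank-decorrelation by random permutations can be pictured with a small base R sketch for a single numeric variable. The helper decorrelate_sketch is hypothetical, and the choice to swap random pairs of values is an assumption made here for illustration; the package's actual permutation scheme may differ. The sketch starts from a synthetic variable that is perfectly rank-matched to the original and keeps swapping until the Spearman correlation no longer exceeds the target.

```r
# Hypothetical illustration; the package's permutation scheme may differ.
decorrelate_sketch <- function(original, synthetic, rankcor,
                               max_swaps = 1e5) {
  for (k in seq_len(max_swaps)) {
    # Stop once the rank correlation no longer exceeds the target
    if (cor(original, synthetic, method = "spearman") <= rankcor) break
    i <- sample(length(synthetic), 2)   # pick two positions at random
    synthetic[i] <- synthetic[rev(i)]   # and swap their values
  }
  synthetic
}

set.seed(1)
x <- iris$Petal.Length
x_synth <- x   # stand-in for a synthetic variable rank-matched to x
x_low <- decorrelate_sketch(x, x_synth, rankcor = 0.5)

cor(x, x_low, method = "spearman")   # at most 0.5
```

Because values are only moved around, the marginal distribution of the variable is unchanged; only its relation to the original records is weakened.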
Synthesizing time series is as easy as synthesizing data frames, but there are a few differences.
rankcor
parameter for time series
data.
As a demonstration, we create a synthetic version of the UKDriverDeaths
dataset, included with base R.
> data(UKDriverDeaths)
> synth_udd <- synthesize(UKDriverDeaths)
Below is a plot of the original and synthetic dataset.
Synthetic data is generated in two steps:
These steps ensure a synthetic dataset that closely resembles the
original data. The rank order matching ensures a certain resilience to
the influence of outliers. If the rankcor argument has a value less
than the default 1, a third step is performed:
rankcor
value.
Except for the case of time series, it is possible to sample datasets that are larger or smaller than their originals. This is done by creating multiple synthetic datasets (if necessary) and sampling records uniformly without replacement from the combined dataset.
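As a rough illustration of rank order matching and of sampling an arbitrary number of records, consider the base R sketch below for numeric columns. All three helpers are hypothetical names, and drawing values by a simple bootstrap is an assumption made here; the package may generate values differently. The sketch only shows how reordering drawn values to follow the original ranks, then pooling several synthetic copies, can produce a dataset of any requested size.

```r
# Hypothetical helpers for illustration; not the package's internals.

# Draw values for one variable, then reorder them so that their rank
# order matches that of the original variable.
rank_match <- function(x) {
  draw <- sample(x, length(x), replace = TRUE)  # assumption: bootstrap draw
  sort(draw)[rank(x, ties.method = "first")]
}

# One synthetic data frame with as many records as the original.
synth_once <- function(df) as.data.frame(lapply(df, rank_match))

# Any number of records: stack enough synthetic copies, then sample
# n rows uniformly without replacement from the combined data.
synth_n <- function(df, n) {
  copies <- ceiling(n / nrow(df))
  pool <- do.call(rbind, replicate(copies, synth_once(df), simplify = FALSE))
  pool[sample(nrow(pool), n), , drop = FALSE]
}

set.seed(1)
num <- iris[, 1:4]   # numeric columns only, to keep the sketch short
s <- synth_once(num)

# The synthetic values follow the rank order of the original values
cor(num$Petal.Length, s$Petal.Length, method = "spearman")

dim(synth_n(num, 250))   # 250 rows, 4 columns
```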