The AuxSurvey R package provides statistical methods for improving survey inference using discretized auxiliary variables from administrative records. Such auxiliary data are often discretized for confidentiality reasons, which diminishes their utility; this package offers multiple estimators that handle discretized auxiliary variables effectively.
This vignette demonstrates the key functionalities of the AuxSurvey package, including the unweighted and weighted sample mean, raking, poststratification, MRP, GAMP, Bayesian linear regression, and BART estimators. These methods are designed for use with discretized auxiliary variables in survey data; below we walk through data generation and estimation examples.
The AuxSurvey package includes a function `simulate()` that generates the datasets used in the paper. These datasets comprise a population of 3000 units and a sample of about 600 cases, with two outcomes (`Y1`, continuous; `Y2`, binary).
The covariates in the dataset include:

- `Z1`, `Z2`, `Z3`: binary covariates
- `X`: continuous covariate
- Discretized versions of `X`: `auX_3`, `auX_5`, `auX_10`
- Propensity scores: `true_pi` (true), `logit_true_pi`, `estimated_pi` (estimated using BART)
library(AuxSurvey)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
#> Loading required package: rstanarm
#> Loading required package: Rcpp
#> This is rstanarm version 2.32.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores())
#>
#> Attaching package: 'AuxSurvey'
#> The following object is masked from 'package:stats':
#>
#> simulate
# Generate data
data = simulate(N = 3000, discretize = c(3, 5, 10), setting = 1, seed = 123)
#> *****Into main of pbart
#> *****Data:
#> data:n,p,np: 3000, 4, 0
#> y1,yn: 1, 0
#> x1,x[n*p]: 1.000000, -1.249241
#> *****Number of Trees: 50
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 100, 100
#> *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
#> *****binaryOffset: -0.830953
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,4,0
#> *****nkeeptrain,nkeeptest,nkeeptreedraws: 100,100,100
#> *****printevery: 100
#> *****skiptr,skipte,skiptreedraws: 1,1,1
#>
#> MCMC
#> done 0 (out of 200)
#> done 100 (out of 200)
#> time: 1s
#> check counts
#> trcnt,tecnt: 100,0
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 4
#> from x,np,p: 4, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 4
#> from x,np,p: 4, 609
#> ***using serial code
#> *****Into main of pbart
#> *****Data:
#> data:n,p,np: 3000, 5, 0
#> y1,yn: 1, 0
#> x1,x[n*p]: 1.000000, -1.152879
#> *****Number of Trees: 50
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 100, 100
#> *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
#> *****binaryOffset: -0.830953
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,5,0
#> *****nkeeptrain,nkeeptest,nkeeptreedraws: 100,100,100
#> *****printevery: 100
#> *****skiptr,skipte,skiptreedraws: 1,1,1
#>
#> MCMC
#> done 0 (out of 200)
#> done 100 (out of 200)
#> time: 1s
#> check counts
#> trcnt,tecnt: 100,0
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 5
#> from x,np,p: 5, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 5
#> from x,np,p: 5, 609
#> ***using serial code
population = data$population # Full population data (3000 cases)
samples = data$samples # Survey sample (~600 cases; 609 here)
ipw = 1 / samples$true_pi # True inverse probability weights
est_ipw = 1 / samples$estimated_pi # Estimated inverse probability weights
true_mean = mean(population$Y1) # True population mean of Y1
After generating the data, we can use the `auxsurvey()` function to apply various estimators. It supports multiple estimation methods, including the unweighted or weighted sample mean, raking, poststratification, MRP, GAMP, linear regression, and BART.

To estimate the sample mean of `Y1`, we can use the `auxsurvey()` function. For the unweighted sample mean:
# Unweighted sample mean
sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = NULL, samples = samples, population = NULL, method = "sample_mean", levels = 0.95)
For the inverse probability weighted (IPW) sample mean:
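A sketch of the weighted call, reusing the `ipw` weights computed above and otherwise the same `auxsurvey()` arguments as the unweighted example (swap in `est_ipw` to use the estimated weights instead):

```r
# IPW sample mean: pass the inverse probability weights via `weights`
# (argument usage mirrors the unweighted sample-mean call above)
ipw_sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = ipw,
                            samples = samples, population = NULL,
                            method = "sample_mean", levels = 0.95)
```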
The raking method adjusts the sample to match known marginal distributions of the auxiliary variables. You can apply raking for `auX_5`:
# Unweighted Raking for auX_5 with interaction with Z1
rake_5_Z1 = auxsurvey("~Y1", auxiliary = "Z2 + Z3 + auX_5 * Z1", weights = NULL, samples = samples, population = population, method = "rake", levels = 0.95)
For IPW raking:
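A sketch of the weighted version, identical to the unweighted raking call above except that the estimated inverse probability weights `est_ipw` are supplied:

```r
# IPW raking for auX_5 with the Z1 interaction, now with estimated weights
rake_5_Z1_ipw = auxsurvey("~Y1", auxiliary = "Z2 + Z3 + auX_5 * Z1",
                          weights = est_ipw, samples = samples,
                          population = population, method = "rake",
                          levels = 0.95)
```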
The MRP method models the outcome using both fixed and random effects. Here is an example of running MRP with `auX_3` as a random effect:
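A sketch of such a call. We assume the method name is `"MRP"` and that variables passed via `auxiliary` are modeled as random effects, while the formula holds the fixed effects; check `?auxsurvey` for the exact interface:

```r
# MRP: Z1-Z3 as fixed effects, auX_3 as a random effect
# (method name and role of `auxiliary` are assumptions)
mrp_3 = auxsurvey("Y1 ~ Z1 + Z2 + Z3", auxiliary = "auX_3",
                  samples = samples, population = population,
                  method = "MRP", levels = 0.95)
```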
The GAMP method can include smooth functions of the auxiliary variables. For example, here is how to use smooth functions of `logit_estimated_pi` and `auX_10`:
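A sketch of such a call, assuming the method name is `"GAMP"` and that smooth terms are written with mgcv-style `s()` notation in the `auxiliary` argument; consult `?auxsurvey` for the exact syntax:

```r
# GAMP: smooth terms for auX_10 and logit_estimated_pi
# (method name and smooth-term placement are assumptions)
gamp_10 = auxsurvey("Y1 ~ Z1 + Z2 + Z3",
                    auxiliary = "s(auX_10) + s(logit_estimated_pi)",
                    samples = samples, population = population,
                    method = "GAMP", levels = 0.95)
```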
The Bayesian linear regression method treats categorical variables as dummy variables. Here is an example for `Y1`:
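A sketch of such a call, using the same formula style as the BART example below; the method name `"linear"` is an assumption (see `?auxsurvey` for the exact value):

```r
# Bayesian linear regression; auX_3 is entered as dummy variables
# (method name "linear" is an assumption)
lm_1 = auxsurvey("Y1 ~ Z1 + Z2 + Z3 + auX_3", auxiliary = NULL,
                 samples = samples, population = population,
                 method = "linear", levels = 0.95)
```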
Finally, the BART method can be applied to estimate the relationship between the outcome and the covariates. Here is an example for estimating `Y1` using BART:
# BART for estimation
BART_1 = auxsurvey("Y1 ~ Z1 + Z2 + Z3 + auX_3 + logit_true_pi", auxiliary = NULL, samples = samples, population = population, method = "BART", levels = 0.95)
#> *****Into main of wbart
#> *****Data:
#> data:n,p,np: 609, 6, 0
#> y1,yn: 2.646420, -0.963718
#> x1,x[n*p]: 1.000000, -1.119597
#> *****Number of Trees: 200
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 1000, 1000
#> *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.737716,3.000000,5.078729
#> *****sigma: 5.106138
#> *****w (weights): 1.000000 ... 1.000000
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,6,0
#> *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
#> *****printevery: 100
#> *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
#>
#> MCMC
#> done 0 (out of 2000)
#> done 100 (out of 2000)
#> done 200 (out of 2000)
#> done 300 (out of 2000)
#> done 400 (out of 2000)
#> done 500 (out of 2000)
#> done 600 (out of 2000)
#> done 700 (out of 2000)
#> done 800 (out of 2000)
#> done 900 (out of 2000)
#> done 1000 (out of 2000)
#> done 1100 (out of 2000)
#> done 1200 (out of 2000)
#> done 1300 (out of 2000)
#> done 1400 (out of 2000)
#> done 1500 (out of 2000)
#> done 1600 (out of 2000)
#> done 1700 (out of 2000)
#> done 1800 (out of 2000)
#> done 1900 (out of 2000)
#> time: 10s
#> check counts
#> trcnt,tecnt,temecnt,treedrawscnt: 1000,0,0,1000
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 1000
#> number of trees in bart sum: 200
#> number of x columns: 6
#> from x,np,p: 6, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 1000
#> number of trees in bart sum: 200
#> number of x columns: 6
#> from x,np,p: 6, 609
#> ***using serial code
The AuxSurvey package provides a powerful set of tools for survey analysis with discretized auxiliary data. By combining Bayesian models with traditional survey methods such as raking and poststratification, users can improve inference without violating confidentiality.
For further details on the package's functionality, please refer to the documentation, which provides more examples and explanations of the various estimators.