Introduction for AuxSurvey: a package to improving survey inference using administrative records

Introduction

The AuxSurvey R package provides a set of statistical methods for improving survey inference by using discretized auxiliary variables from administrative records. The utility of such auxiliary data can often be diminished due to discretization for confidentiality reasons, but this package offers multiple estimators that handle such discretized auxiliary variables effectively.

This vignette demonstrates the key functionalities of the AuxSurvey package, including:

These methods are designed for use with discretized auxiliary variables in survey data, and we will walk through data generation and estimation examples.

Generate Simulated Data

The AuxSurvey package includes a function simulate() that generates the datasets used in the paper. These datasets include a population of 3000 samples and a sample of about 600 cases, with two outcomes (Y1 as a continuous variable, Y2 as a binary outcome).

The covariates in the dataset include: - Z1, Z2, Z3: Binary covariates - X: Continuous covariate - Discretized versions of X: auX_3, auX_5, auX_10 - Propensity scores: true_pi (true), logit_true_pi, estimated_pi (estimated using BART)

library(AuxSurvey)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
#> Loading required package: rstanarm
#> Loading required package: Rcpp
#> This is rstanarm version 2.32.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#>   options(mc.cores = parallel::detectCores())
#> 
#> Attaching package: 'AuxSurvey'
#> The following object is masked from 'package:stats':
#> 
#>     simulate

# Generate data
data = simulate(N = 3000, discretize = c(3, 5, 10), setting = 1, seed = 123)
#> *****Into main of pbart
#> *****Data:
#> data:n,p,np: 3000, 4, 0
#> y1,yn: 1, 0
#> x1,x[n*p]: 1.000000, -1.249241
#> *****Number of Trees: 50
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 100, 100
#> *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
#> *****binaryOffset: -0.830953
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,4,0
#> *****nkeeptrain,nkeeptest,nkeeptreedraws: 100,100,100
#> *****printevery: 100
#> *****skiptr,skipte,skiptreedraws: 1,1,1
#> 
#> MCMC
#> done 0 (out of 200)
#> done 100 (out of 200)
#> time: 1s
#> check counts
#> trcnt,tecnt: 100,0
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 4
#> from x,np,p: 4, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 4
#> from x,np,p: 4, 609
#> ***using serial code
#> *****Into main of pbart
#> *****Data:
#> data:n,p,np: 3000, 5, 0
#> y1,yn: 1, 0
#> x1,x[n*p]: 1.000000, -1.152879
#> *****Number of Trees: 50
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 100, 100
#> *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
#> *****binaryOffset: -0.830953
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,5,0
#> *****nkeeptrain,nkeeptest,nkeeptreedraws: 100,100,100
#> *****printevery: 100
#> *****skiptr,skipte,skiptreedraws: 1,1,1
#> 
#> MCMC
#> done 0 (out of 200)
#> done 100 (out of 200)
#> time: 1s
#> check counts
#> trcnt,tecnt: 100,0
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 5
#> from x,np,p: 5, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 5
#> from x,np,p: 5, 609
#> ***using serial code
population = data$population  # Full population data (3000 cases)
samples = data$samples  # Sample data (600 cases)
ipw = 1 / samples$true_pi  # True inverse probability weighting
est_ipw = 1 / samples$estimated_pi  # Estimated inverse probability weighting
true_mean = mean(population$Y1)  # True value of the estimator

Estimation Methods

After generating the data, we can use the auxsurvey() function to apply various estimators. The auxsurvey() function supports multiple estimation methods, including unweighted or weighted sample mean, raking, poststratification, MRP, GAMP, linear regression, and BART.

Example 1: Sample Mean

To estimate the sample mean for Y1, we can use the auxsurvey() function. For the unweighted sample mean:

# Unweighted sample mean
sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = NULL, samples = samples, population = NULL, method = "sample_mean", levels = 0.95)

For the inverse probability weighted (IPW) sample mean:

# IPW sample mean
IPW_sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = ipw, samples = samples, population = population, method = "sample_mean", levels = 0.95)
#> population parameter is specified, so the finite population correction will be calculated for sample mean.

Example 2: Raking

The raking method adjusts the sample to match known marginal distributions of the auxiliary variables. You can apply raking for auX_5:

# Unweighted Raking for auX_5 with interaction with Z1
rake_5_Z1 = auxsurvey("~Y1", auxiliary = "Z2 + Z3 + auX_5 * Z1", weights = NULL, samples = samples, population = population, method = "rake", levels = 0.95)

For IPW raking:

# IPW Raking for auX_10
rake_10 = auxsurvey("~Y1", auxiliary = "Z1 + Z2 + Z3 + auX_10", weights = ipw, samples = samples, population = population, method = "rake", levels = c(0.95, 0.8))

Example 3: MRP (Multilevel Regression with Poststratification)

The MRP method models the outcome using both fixed and random effects. Here is an example of running MRP with auX_3 as a random effect:

# MRP with auX_3
MRP_1 = auxsurvey("Y1 ~ Z1 + Z2", auxiliary = "Z3 + auX_3", samples = samples, population = population, method = "MRP", levels = 0.95)

Example 4: GAMP (Generalized Additive Model of Response Propensity)

The GAMP method can include smooth functions of the auxiliary variables. For example, here’s how to use smooth functions of logit_estimated_pi and auX_10:

# GAMP with smooth functions
GAMP_1 = auxsurvey("Y1 ~ 1 + Z1 + Z2 + Z3", auxiliary = "s(logit_estimated_pi) + s(auX_10)", samples = samples, population = population, method = "GAMP", levels = 0.95)

Example 5: Bayesian Linear Regression

The Bayesian Linear Regression method treats categorical variables as dummy variables. Here’s an example for Y1:

# Linear regression with Bayesian estimation
LR_1 = auxsurvey("Y1 ~ 1 + Z1 + Z2 + Z3", auxiliary = "auX_3", samples = samples, population = population, method = "linear", levels = 0.95)

Example 6: BART (Bayesian Additive Regression Trees)

Finally, the BART method can be applied to estimate the relationship between the outcome and the covariates. Here’s an example for estimating Y1 using BART:

# BART for estimation
BART_1 = auxsurvey("Y1 ~ Z1 + Z2 + Z3 + auX_3 + logit_true_pi", auxiliary = NULL, samples = samples, population = population, method = "BART", levels = 0.95)
#> *****Into main of wbart
#> *****Data:
#> data:n,p,np: 609, 6, 0
#> y1,yn: 2.646420, -0.963718
#> x1,x[n*p]: 1.000000, -1.119597
#> *****Number of Trees: 200
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 1000, 1000
#> *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.737716,3.000000,5.078729
#> *****sigma: 5.106138
#> *****w (weights): 1.000000 ... 1.000000
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,6,0
#> *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
#> *****printevery: 100
#> *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
#> 
#> MCMC
#> done 0 (out of 2000)
#> done 100 (out of 2000)
#> done 200 (out of 2000)
#> done 300 (out of 2000)
#> done 400 (out of 2000)
#> done 500 (out of 2000)
#> done 600 (out of 2000)
#> done 700 (out of 2000)
#> done 800 (out of 2000)
#> done 900 (out of 2000)
#> done 1000 (out of 2000)
#> done 1100 (out of 2000)
#> done 1200 (out of 2000)
#> done 1300 (out of 2000)
#> done 1400 (out of 2000)
#> done 1500 (out of 2000)
#> done 1600 (out of 2000)
#> done 1700 (out of 2000)
#> done 1800 (out of 2000)
#> done 1900 (out of 2000)
#> time: 10s
#> check counts
#> trcnt,tecnt,temecnt,treedrawscnt: 1000,0,0,1000
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 1000
#> number of trees in bart sum: 200
#> number of x columns: 6
#> from x,np,p: 6, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 1000
#> number of trees in bart sum: 200
#> number of x columns: 6
#> from x,np,p: 6, 609
#> ***using serial code

Conclusion

The AuxSurvey package provides a powerful set of tools for survey analysis when working with discretized auxiliary data. By leveraging various Bayesian models and traditional survey methods like raking and poststratification, users can enhance their inference without violating confidentiality.

For further details on the package’s functionality, please refer to the documentation, which provides more examples and explanation of the various estimators.