Introduction for AuxSurvey: a package to improving survey inference using administrative records

Generate Simulated Data

The AuxSurvey package includes a function simulate() that generates the datasets used in the paper. These datasets include a population of 3000 samples and a sample of about 600 cases, with two outcomes (Y1 as a continuous variable, Y2 as a binary outcome).

The covariates in the dataset include: - Z1, Z2, Z3: Binary covariates - X: Continuous covariate - Discretized versions of X: auX_3, auX_5, auX_10 - Propensity scores: true_pi (true), logit_true_pi, estimated_pi (estimated using BART)

library(AuxSurvey)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
#> Loading required package: rstanarm
#> Loading required package: Rcpp
#> This is rstanarm version 2.32.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#>   options(mc.cores = parallel::detectCores())
#> 
#> Attaching package: 'AuxSurvey'
#> The following object is masked from 'package:stats':
#> 
#>     simulate

# Generate data
data = simulate(N = 3000, discretize = c(3, 5, 10), setting = 1, seed = 123)
#> *****Into main of pbart
#> *****Data:
#> data:n,p,np: 3000, 4, 0
#> y1,yn: 1, 0
#> x1,x[n*p]: 1.000000, -1.249241
#> *****Number of Trees: 50
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 100, 100
#> *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
#> *****binaryOffset: -0.830953
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,4,0
#> *****nkeeptrain,nkeeptest,nkeeptreedraws: 100,100,100
#> *****printevery: 100
#> *****skiptr,skipte,skiptreedraws: 1,1,1
#> 
#> MCMC
#> done 0 (out of 200)
#> done 100 (out of 200)
#> time: 1s
#> check counts
#> trcnt,tecnt: 100,0
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 4
#> from x,np,p: 4, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 4
#> from x,np,p: 4, 609
#> ***using serial code
#> *****Into main of pbart
#> *****Data:
#> data:n,p,np: 3000, 5, 0
#> y1,yn: 1, 0
#> x1,x[n*p]: 1.000000, -1.152879
#> *****Number of Trees: 50
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 100, 100
#> *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
#> *****binaryOffset: -0.830953
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,5,0
#> *****nkeeptrain,nkeeptest,nkeeptreedraws: 100,100,100
#> *****printevery: 100
#> *****skiptr,skipte,skiptreedraws: 1,1,1
#> 
#> MCMC
#> done 0 (out of 200)
#> done 100 (out of 200)
#> time: 1s
#> check counts
#> trcnt,tecnt: 100,0
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 5
#> from x,np,p: 5, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 100
#> number of trees in bart sum: 50
#> number of x columns: 5
#> from x,np,p: 5, 609
#> ***using serial code
population = data$population  # Full population data (3000 cases)
samples = data$samples  # Sample data (600 cases)
ipw = 1 / samples$true_pi  # True inverse probability weighting
est_ipw = 1 / samples$estimated_pi  # Estimated inverse probability weighting
true_mean = mean(population$Y1)  # True value of the estimator

Estimation Methods

After generating the data, we can use the auxsurvey() function to apply various estimators. The auxsurvey() function supports multiple estimation methods, including unweighted or weighted sample mean, raking, poststratification, MRP, GAMP, linear regression, and BART.

Example 1: Sample Mean

To estimate the sample mean for Y1, we can use the auxsurvey() function. For the unweighted sample mean:

# Unweighted sample mean
sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = NULL, samples = samples, population = NULL, method = "sample_mean", levels = 0.95)

For the inverse probability weighted (IPW) sample mean:

# IPW sample mean
IPW_sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = ipw, samples = samples, population = population, method = "sample_mean", levels = 0.95)
#> population parameter is specified, so the finite population correction will be calculated for sample mean.

Example 2: Raking

The raking method adjusts the sample to match known marginal distributions of the auxiliary variables. You can apply raking for auX_5:

# Unweighted Raking for auX_5 with interaction with Z1
rake_5_Z1 = auxsurvey("~Y1", auxiliary = "Z2 + Z3 + auX_5 * Z1", weights = NULL, samples = samples, population = population, method = "rake", levels = 0.95)

For IPW raking:

# IPW Raking for auX_10
rake_10 = auxsurvey("~Y1", auxiliary = "Z1 + Z2 + Z3 + auX_10", weights = ipw, samples = samples, population = population, method = "rake", levels = c(0.95, 0.8))

Example 3: MRP (Multilevel Regression with Poststratification)

The MRP method models the outcome using both fixed and random effects. Here is an example of running MRP with auX_3 as a random effect:

# MRP with auX_3
MRP_1 = auxsurvey("Y1 ~ Z1 + Z2", auxiliary = "Z3 + auX_3", samples = samples, population = population, method = "MRP", levels = 0.95)

Example 4: GAMP (Generalized Additive Model of Response Propensity)

The GAMP method can include smooth functions of the auxiliary variables. For example, here’s how to use smooth functions of logit_estimated_pi and auX_10:

# GAMP with smooth functions
GAMP_1 = auxsurvey("Y1 ~ 1 + Z1 + Z2 + Z3", auxiliary = "s(logit_estimated_pi) + s(auX_10)", samples = samples, population = population, method = "GAMP", levels = 0.95)

Example 5: Bayesian Linear Regression

The Bayesian Linear Regression method treats categorical variables as dummy variables. Here’s an example for Y1:

# Linear regression with Bayesian estimation
LR_1 = auxsurvey("Y1 ~ 1 + Z1 + Z2 + Z3", auxiliary = "auX_3", samples = samples, population = population, method = "linear", levels = 0.95)

Example 6: BART (Bayesian Additive Regression Trees)

Finally, the BART method can be applied to estimate the relationship between the outcome and the covariates. Here’s an example for estimating Y1 using BART:

# BART for estimation
BART_1 = auxsurvey("Y1 ~ Z1 + Z2 + Z3 + auX_3 + logit_true_pi", auxiliary = NULL, samples = samples, population = population, method = "BART", levels = 0.95)
#> *****Into main of wbart
#> *****Data:
#> data:n,p,np: 609, 6, 0
#> y1,yn: 2.646420, -0.963718
#> x1,x[n*p]: 1.000000, -1.119597
#> *****Number of Trees: 200
#> *****Number of Cut Points: 1 ... 100
#> *****burn and ndpost: 1000, 1000
#> *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.737716,3.000000,5.078729
#> *****sigma: 5.106138
#> *****w (weights): 1.000000 ... 1.000000
#> *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,6,0
#> *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
#> *****printevery: 100
#> *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
#> 
#> MCMC
#> done 0 (out of 2000)
#> done 100 (out of 2000)
#> done 200 (out of 2000)
#> done 300 (out of 2000)
#> done 400 (out of 2000)
#> done 500 (out of 2000)
#> done 600 (out of 2000)
#> done 700 (out of 2000)
#> done 800 (out of 2000)
#> done 900 (out of 2000)
#> done 1000 (out of 2000)
#> done 1100 (out of 2000)
#> done 1200 (out of 2000)
#> done 1300 (out of 2000)
#> done 1400 (out of 2000)
#> done 1500 (out of 2000)
#> done 1600 (out of 2000)
#> done 1700 (out of 2000)
#> done 1800 (out of 2000)
#> done 1900 (out of 2000)
#> time: 10s
#> check counts
#> trcnt,tecnt,temecnt,treedrawscnt: 1000,0,0,1000
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 1000
#> number of trees in bart sum: 200
#> number of x columns: 6
#> from x,np,p: 6, 3000
#> ***using serial code
#> *****In main of C++ for bart prediction
#> tc (threadcount): 1
#> number of bart draws: 1000
#> number of trees in bart sum: 200
#> number of x columns: 6
#> from x,np,p: 6, 609
#> ***using serial code