This short vignette introduces the capabilities of smcfcs to accommodate classical covariate measurement error. We consider the cases where internal validation data and then internal replication data are available.
We will simulate a dataset with internal validation data where the true covariate (x) is observed for 10% of the sample, while every subject has an error-prone measurement (w) observed:
set.seed(1234)
n <- 1000
x <- rnorm(n)
w <- x + rnorm(n)
y <- x + rnorm(n)
x[(n * 0.1):n] <- NA
simData <- data.frame(x, w, y)
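To see why some adjustment is needed, it can help to look at what a naive analysis would give. The following base R sketch is not part of smcfcs. It re-simulates data from the same model under its own arbitrary seed (the names x_true, w_demo, y_demo and naive_fit are introduced here purely for illustration) and regresses y on w directly. Under classical error with the variance of x and the error variance both equal to 1, the naive slope would be expected to be attenuated towards roughly 1 / (1 + 1) = 0.5, rather than the true value of 1.

```r
# Illustrative only: naive regression that ignores measurement error.
set.seed(2024)                     # arbitrary seed, chosen for this sketch
n_demo <- 1000
x_true <- rnorm(n_demo)            # true covariate (kept fully observed here)
w_demo <- x_true + rnorm(n_demo)   # classical error: w = x + independent noise
y_demo <- x_true + rnorm(n_demo)   # outcome depends on x only

naive_fit <- lm(y_demo ~ w_demo)   # treats w as if it were x
naive_slope <- unname(coef(naive_fit)["w_demo"])

# Expected attenuation factor: var(x) / (var(x) + var(error))
reliability <- 1 / (1 + 1)
print(c(naive_slope = naive_slope, expected = reliability))
```

Because the sketch calls set.seed, it is best run in a separate session so that it does not alter the random number stream used by the imputations that follow.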
We have generated data in which the error-prone measurement w equals the true covariate x plus independent, normally distributed measurement error. Since x is observed for some of the subjects in the case of internal validation data, this is a regular missing data problem. The error-prone measurement w serves as an auxiliary variable for the purposes of imputing x. In particular, we will impute using smcfcs such that w is not in the substantive model. This encodes the so-called non-differential error assumption, which says that conditional on x, the error-prone measurement w provides no independent information about the outcome y. An initial attempt to do this is:
library(smcfcs)
imps <- smcfcs(simData,
smtype = "lm", smformula = "y~x",
method = c("norm", "", ""), m = 5
)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: "
## [1] "Imputation 1"
## [1] "Imputing: x using plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 16 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
We see from the output that smcfcs has not mentioned that it is using w anywhere. This is because w is fully observed and is not involved in the substantive model. To force w to be conditioned on when imputing x, we must pass an appropriate predictorMatrix to smcfcs:
predMat <- array(0, dim = c(3, 3))
predMat[1, 2] <- 1
We have specified that the first variable, x, be imputed using w. Note that we do not need to tell smcfcs to impute x using y, as this will occur automatically by virtue of y being the outcome variable in the substantive model. We can now impute again, passing predMat as the predictorMatrix:
imps <- smcfcs(simData,
smtype = "lm", smformula = "y~x",
method = c("norm", "", ""), m = 5,
predictorMatrix = predMat
)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: "
## [1] "Imputation 1"
## [1] "Imputing: x using w plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 3 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
Now we can fit the substantive model to each imputed dataset and use the mitools package to pool the estimates and standard errors using Rubin’s rules:
library(mitools)
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ x))
summary(MIcombine(models))
## Multiple imputation results:
## with(impobj, lm(y ~ x))
## MIcombine.default(models)
##                results         se     (lower    upper) missInfo
## (Intercept) 0.08850686 0.08171538 -0.1136864 0.2907001     87 %
## x           0.91596526 0.11612856  0.6100080 1.2219226     95 %
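For readers who want to see what the pooling step does, the point estimate and standard error from Rubin's rules can be sketched in a few lines of base R. The helper name pool_rubin is hypothetical and the inputs are toy numbers rather than values from the fits above. The sketch covers only the pooled estimate and its standard error, not the degrees of freedom or the missing information column that MIcombine also reports.

```r
# Illustrative helper (hypothetical name): pool one coefficient across
# m imputed-data fits using Rubin's rules.
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)                 # pooled point estimate
  w_bar <- mean(se^2)                # within-imputation variance
  b <- var(est)                      # between-imputation variance
  t_var <- w_bar + (1 + 1 / m) * b   # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

# Toy inputs, not taken from the vignette's fits
pooled <- pool_rubin(est = c(1, 2, 3), se = c(1, 1, 1))
print(pooled)
```

The between-imputation term is what drives the high fraction of missing information here: when most of x is imputed, the estimates vary considerably from one imputed dataset to the next.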
We note from the results that the fraction of missing information for the coefficient of x is high. This should not surprise us, given that x was missing for 90% of the sample and the error-prone measurement w is quite a noisy measure of x.
We will now demonstrate how smcfcs can be used to impute a covariate x that is not observed for any subject, but for which two or more error-prone replicate measurements are available on at least a subset of the sample. We first simulate the dataset:
x <- rnorm(n)
w1 <- x + rnorm(n)
w2 <- x + rnorm(n)
w2[(n * 0.1):n] <- NA
y <- x + rnorm(n)
x <- rep(NA, n)
simData <- data.frame(x, w1, w2, y)
Note that now x is missing for every subject. Every subject has an error-prone measurement w1 of x, and 10% of the sample have a replicated measurement w2.
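Replicates are what make the error variance estimable even though x is never observed: under the classical error model, the difference w1 - w2 contains only the two independent errors, so half its variance estimates the error variance. The base R sketch below illustrates this on freshly simulated data. The helper est_error_var and the _demo variable names are hypothetical, and the moment estimate is meant only to convey why replication data are informative, not to describe what smcfcs does internally.

```r
# Illustrative helper (hypothetical name): moment estimate of the classical
# error variance from two replicates, using subjects with both observed.
est_error_var <- function(w1, w2) {
  d <- (w1 - w2)[!is.na(w1) & !is.na(w2)]
  var(d) / 2
}

set.seed(99)                        # arbitrary seed, chosen for this sketch
n_demo <- 1000
x_demo <- rnorm(n_demo)
w1_demo <- x_demo + rnorm(n_demo)
w2_demo <- x_demo + rnorm(n_demo)
w2_demo[101:n_demo] <- NA           # second replicate on the first 100 subjects only

err_var_hat <- est_error_var(w1_demo, w2_demo)
print(err_var_hat)                  # should be in the vicinity of the true value, 1
```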
We will now impute x using smcfcs. To do this we specify that x be imputed using the latnorm method. In addition, we pass a matrix to the errorProneMatrix argument of smcfcs, whose role is to specify, for each latent normal variable to be imputed, which variables in the data frame are error-prone measurements. smcfcs then imputes the missing values in x, assuming a normal classical error model for the error-prone replicates.
errMat <- array(0, dim = c(4, 4))
errMat[1, c(2, 3)] <- 1
imps <- smcfcs(simData,
smtype = "lm", smformula = "y~x",
method = c("latnorm", "", "", ""), m = 5,
errorProneMatrix = errMat
)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: "
## [1] "Imputation 1"
## [1] "Imputing: x using w1,w2 plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
Analysing the imputed datasets, we obtain:
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ x))
summary(MIcombine(models))
## Multiple imputation results:
## with(impobj, lm(y ~ x))
## MIcombine.default(models)
##                 results         se     (lower       upper) missInfo
## (Intercept) -0.09181376 0.03709867 -0.1653430 -0.01828453     21 %
## x            0.86868846 0.15799531  0.4424182  1.29495872     97 %
If we summarise one of the imputed datasets (below), we will see that smcfcs has imputed not only the missing values in x, but also the ‘missing’ values in w2. We put ‘missing’ in quotes here because typically a study with replicate error-prone measurements will have intentionally planned to take a second error-prone measurement only on a random subset, so the remaining values were never intended to be measured.
summary(imps$impDatasets[[1]])
##        x                  w1                 w2                 y
##  Min.   :-3.03077   Min.   :-4.36681   Min.   :-4.78473   Min.   :-4.43675
##  1st Qu.:-0.82756   1st Qu.:-0.93609   1st Qu.:-1.01688   1st Qu.:-1.02676
##  Median :-0.02556   Median :-0.01339   Median :-0.02595   Median :-0.13283
##  Mean   :-0.02280   Mean   :-0.00018   Mean   :-0.06838   Mean   :-0.08455
##  3rd Qu.: 0.78154   3rd Qu.: 0.94560   3rd Qu.: 0.91115   3rd Qu.: 0.87570
##  Max.   : 4.12664   Max.   : 5.20210   Max.   : 4.38671   Max.   : 4.25531
One thing to be wary of when imputing covariates measured with error, particularly with replication data, is that convergence may take longer than in the regular missing data setting. To examine this, we re-impute one dataset using 100 iterations, and then plot the estimates against iteration number:
imps <- smcfcs(simData,
smtype = "lm", smformula = "y~x",
method = c("latnorm", "", "", ""), m = 1, numit = 100,
errorProneMatrix = errMat
)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: "
## [1] "Imputation 1"
## [1] "Imputing: x using w1,w2 plus outcome"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 6 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
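One way to draw such a trace is sketched below. It assumes, following the package documentation, that the fitted object stores the substantive model parameter draws in a three-dimensional array smCoefIter indexed by imputation, parameter and iteration, and that the second parameter is the coefficient of x in y ~ x. The helper coef_trace is a hypothetical name introduced for this sketch.

```r
# Illustrative helper (hypothetical name): pull out the per-iteration values
# of one substantive model parameter for one imputation.
# Assumes coef_iter is indexed [imputation, parameter, iteration].
coef_trace <- function(coef_iter, imp = 1, par = 1) {
  stopifnot(length(dim(coef_iter)) == 3)
  as.numeric(coef_iter[imp, par, ])
}

# If the smcfcs fit from above is in the workspace, plot the trace of the
# second parameter (assumed to be the coefficient of x) against iteration.
if (exists("imps") && is.list(imps) && !is.null(imps$smCoefIter)) {
  plot(coef_trace(imps$smCoefIter, imp = 1, par = 2),
       type = "l", xlab = "Iteration", ylab = "Coefficient of x")
}
```

A trace that is still drifting at the end of the default number of iterations would suggest increasing numit.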
This plot suggests it would probably be safer to impute using slightly more than the default of 10 iterations per imputation.
smcfcs can impute multiple covariates measured with error when internal replication data are available. It allows for a separate error variance for each such covariate. The following code adds a second covariate which is itself measured by two error-prone measurements, but this time with a smaller error variance. It then defines the errorProneMatrix, imputes and analyses the imputed datasets:
x <- rnorm(n)
x1 <- x + rnorm(n)
x2 <- x + rnorm(n)
# both replicates of x (x1, x2) and of z (z1, z2) are observed for every subject
z <- x + rnorm(n)
z1 <- z + 0.1 * rnorm(n)
z2 <- z + 0.1 * rnorm(n)
y <- x - z + rnorm(n)
x <- rep(NA, n)
z <- rep(NA, n)
simData <- data.frame(x, x1, x2, z, z1, z2, y)
errMat <- array(0, dim = c(7, 7))
errMat[1, c(2, 3)] <- 1
errMat[4, c(5, 6)] <- 1
imps <- smcfcs(simData,
smtype = "lm", smformula = "y~x+z",
method = c("latnorm", "", "", "latnorm", "", "", ""), m = 5,
errorProneMatrix = errMat
)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x,z"
## [1] "Fully obs. substantive model variables: "
## [1] "Imputation 1"
## [1] "Imputing: x using z,x1,x2 plus outcome"
## [1] "Imputing: z using x,z1,z2 plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 38 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
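With several error-prone covariates, filling the errorProneMatrix by column position becomes easy to get wrong. A small base R helper can build the same matrix from variable names instead. The name make_err_mat and its arguments are hypothetical and not part of smcfcs. The sketch reproduces the 7 by 7 matrix defined above, with rows indexing the latent covariates and columns indexing their error-prone measurements.

```r
# Illustrative helper (hypothetical name): build an error-prone indicator
# matrix from variable names instead of column positions.
make_err_mat <- function(var_names, replicates) {
  p <- length(var_names)
  mat <- array(0, dim = c(p, p))
  for (latent in names(replicates)) {
    r_idx <- match(latent, var_names)
    c_idx <- match(replicates[[latent]], var_names)
    stopifnot(!is.na(r_idx), !anyNA(c_idx))   # fail early on misspelt names
    mat[r_idx, c_idx] <- 1
  }
  mat
}

vars <- c("x", "x1", "x2", "z", "z1", "z2", "y")
err_mat_named <- make_err_mat(vars, list(x = c("x1", "x2"), z = c("z1", "z2")))
print(err_mat_named)
```

In practice var_names would be taken from names(simData), so that the matrix stays correct if the column order of the data frame changes.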
We now analyse the imputed datasets, remembering to add z into the substantive model:
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ x + z))
summary(MIcombine(models))
## Multiple imputation results:
## with(impobj, lm(y ~ x + z))
## MIcombine.default(models)
##                results         se     (lower      upper) missInfo
## (Intercept) -0.0557226 0.03730903 -0.1302476  0.01880238     27 %
## x            0.9561448 0.06151938  0.8252418  1.08704780     56 %
## z           -0.9266750 0.04508968 -1.0227239 -0.83062610     57 %
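Because we generated the error-prone measurements of z with a smaller error variance than those of x, we might expect the fraction of missing information to be lower for z than for x. In this run, however, the two estimates are almost identical (57% for z and 56% for x). One likely contributor is that z and x are strongly correlated by construction, so the coefficient of z is also affected by uncertainty in the imputed values of x. In addition, fractions of missing information estimated from only 5 imputations are themselves imprecise.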