Mixgb offers a scalable solution for imputing large datasets using XGBoost, subsampling and predictive mean matching. Our method utilizes the capabilities of XGBoost, a highly efficient implementation of gradient boosted trees, to capture interactions and non-linear relations automatically. Moreover, we have integrated subsampling and predictive mean matching to minimize bias and reflect appropriate imputation variability. Our package supports various types of variables and offers flexible settings for subsampling and predictive mean matching. We also include diagnostic tools for evaluating the quality of the imputed values.
mixgb

We first load the mixgb package and the nhanes3_newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.
library(mixgb)
str(nhanes3_newborn)
#> tibble [2,107 × 16] (S3: tbl_df/tbl/data.frame)
#> $ HSHSIZER: int [1:2107] 4 3 5 4 4 3 5 3 3 3 ...
#> $ HSAGEIR : int [1:2107] 2 5 10 10 8 3 10 7 2 7 ...
#> $ HSSEX : Factor w/ 2 levels "1","2": 2 1 2 2 1 1 2 2 2 1 ...
#> $ DMARACER: Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 2 ...
#> $ DMAETHNR: Factor w/ 3 levels "1","2","3": 3 1 3 3 3 3 3 3 3 3 ...
#> $ DMARETHN: Factor w/ 4 levels "1","2","3","4": 1 3 2 1 1 1 2 1 2 2 ...
#> $ BMPHEAD : num [1:2107] 39.3 45.4 43.9 45.8 44.9 42.2 45.8 NA 40.2 44.5 ...
#> ..- attr(*, "label")= chr "Head circumference (cm)"
#> $ BMPRECUM: num [1:2107] 59.5 69.2 69.8 73.8 69 61.7 74.8 NA 64.5 70.2 ...
#> ..- attr(*, "label")= chr "Recumbent length (cm)"
#> $ BMPSB1 : num [1:2107] 8.2 13 6 8 8.2 9.4 5.2 NA 7 5.9 ...
#> ..- attr(*, "label")= chr "First subscapular skinfold (mm)"
#> $ BMPSB2 : num [1:2107] 8 13 5.6 10 7.8 8.4 5.2 NA 7 5.4 ...
#> ..- attr(*, "label")= chr "Second subscapular skinfold (mm)"
#> $ BMPTR1 : num [1:2107] 9 15.6 7 16.4 9.8 9.6 5.8 NA 11 6.8 ...
#> ..- attr(*, "label")= chr "First triceps skinfold (mm)"
#> $ BMPTR2 : num [1:2107] 9.4 14 8.2 12 8.8 8.2 6.6 NA 10.9 7.6 ...
#> ..- attr(*, "label")= chr "Second triceps skinfold (mm)"
#> $ BMPWT : num [1:2107] 6.35 9.45 7.15 10.7 9.35 7.15 8.35 NA 7.35 8.65 ...
#> ..- attr(*, "label")= chr "Weight (kg)"
#> $ DMPPIR : num [1:2107] 3.186 1.269 0.416 2.063 1.464 ...
#> ..- attr(*, "label")= chr "Poverty income ratio"
#> $ HFF1 : Factor w/ 2 levels "1","2": 2 2 1 1 1 2 2 1 2 1 ...
#> $ HYD1 : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 3 1 1 1 1 1 1 2 1 ...
colSums(is.na(nhanes3_newborn))
#> HSHSIZER HSAGEIR HSSEX DMARACER DMAETHNR DMARETHN BMPHEAD BMPRECUM
#> 0 0 0 0 0 0 124 114
#> BMPSB1 BMPSB2 BMPTR1 BMPTR2 BMPWT DMPPIR HFF1 HYD1
#> 161 169 124 167 117 192 7 0
To impute this dataset, we can use the default settings. The default number of imputed datasets is m = 5. Note that we do not need to convert our data into dgCMatrix or one-hot encoding format; our package will automatically convert it for you. Variables should be of the following types: numeric, integer, factor or ordinal factor.
# use mixgb with default settings
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)
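The result is a list of m completed datasets, so individual imputations can be inspected by index. A minimal sketch, assuming the list return type documented for mixgb():

```r
library(mixgb)

# Impute with default settings: returns a list of m completed datasets
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

# Inspect the first imputed dataset; no missing values should remain
first.imputation <- imputed.data[[1]]
length(imputed.data)          # 5
sum(is.na(first.imputation))  # 0 if all incomplete variables were imputed
```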
We can also customize imputation settings:

- The number of imputed datasets m
- The number of imputation iterations maxit
- XGBoost hyperparameters and verbose settings: xgb.params, nrounds, early_stopping_rounds, print_every_n and verbose
- Subsampling ratio. By default, subsample = 0.7. Users can change this value under the xgb.params argument
- Predictive mean matching settings: pmm.type, pmm.k and pmm.link
- Whether ordinal factors should be converted to integers (the imputation process may be faster): ordinalAsInteger
- Whether or not to use bootstrapping: bootstrap
- Initial imputation methods for different types of variables: initial.num, initial.int and initial.fac
- Whether to save models for imputing newdata: save.models and save.vars
# Use mixgb with chosen settings
params <- list(
  max_depth = 3,
  gamma = 0,
  eta = 0.3,
  min_child_weight = 1,
  subsample = 0.7,
  colsample_bytree = 1,
  colsample_bylevel = 1,
  colsample_bynode = 1,
  nthread = 2,
  tree_method = "auto",
  gpu_id = 0,
  predictor = "auto"
)

imputed.data <- mixgb(
  data = nhanes3_newborn, m = 5, maxit = 1,
  ordinalAsInteger = FALSE, bootstrap = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 100, early_stopping_rounds = 10,
  print_every_n = 10L, verbose = 0
)
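The save.models and save.vars options matter when the same fitted imputation models must later be applied to unseen observations. The following sketch assumes the impute_new() interface described in the package documentation; the train/new split here is only for illustration:

```r
library(mixgb)

set.seed(2022)
n <- nrow(nhanes3_newborn)
idx <- sample(n, size = round(0.7 * n))
train.data <- nhanes3_newborn[idx, ]
new.data <- nhanes3_newborn[-idx, ]

# Train and save the imputation models
# (save.vars = NULL saves models for variables with missing values)
mixgb.obj <- mixgb(data = train.data, m = 5, save.models = TRUE, save.vars = NULL)

# Apply the saved models to impute the new data
imputed.new <- impute_new(object = mixgb.obj, newdata = new.data)
```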
Imputation performance can be affected by the hyperparameter settings. Although tuning a large set of hyperparameters may appear intimidating, it is often possible to narrow down the search space because many hyperparameters are correlated. In our package, the function mixgb_cv() can be used to tune the number of boosting rounds, nrounds. There is no default nrounds value in XGBoost, so users are required to specify this value themselves. The default nrounds in mixgb() is 100. However, we recommend using mixgb_cv() to find the optimal nrounds first.
params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = nhanes3_newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
#>     iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
#> 1: 1 5.3744890 0.014675460 5.3750613 0.07086852
#> 2: 2 3.8332703 0.010838631 3.8372278 0.07087161
#> 3: 3 2.7718411 0.006272944 2.7770853 0.07691548
#> 4: 4 2.0595957 0.007274550 2.0664484 0.07566479
#> 5: 5 1.5868738 0.008584994 1.6054670 0.07819587
#> 6: 6 1.2907648 0.014248482 1.3210073 0.07759826
#> 7: 7 1.1071555 0.015394644 1.1530031 0.07783167
#> 8: 8 1.0000161 0.017887745 1.0566439 0.07895215
#> 9: 9 0.9414638 0.018404997 1.0082380 0.07867945
#> 10: 10 0.9074870 0.018933432 0.9829059 0.08001215
#> 11: 11 0.8876951 0.018986953 0.9682910 0.07752943
#> 12: 12 0.8764532 0.018322576 0.9609722 0.07684140
#> 13: 13 0.8670131 0.018055405 0.9576967 0.07822358
#> 14: 14 0.8604551 0.017868182 0.9551112 0.07878126
#> 15: 15 0.8545978 0.017994667 0.9556937 0.07906311
#> 16: 16 0.8497766 0.017346718 0.9574317 0.07809297
#> 17: 17 0.8456010 0.017452824 0.9579252 0.07793869
#> 18: 18 0.8412693 0.017763551 0.9566392 0.07777018
#> 19: 19 0.8369451 0.017050940 0.9582819 0.07699266
#> 20: 20 0.8329889 0.017898987 0.9579895 0.07783339
#> 21: 21 0.8292042 0.018045147 0.9609148 0.07802547
#> 22: 22 0.8261493 0.018352210 0.9629216 0.07725943
#> 23: 23 0.8218315 0.018426677 0.9660504 0.07674019
#> 24: 24 0.8174190 0.018241518 0.9668980 0.07467530
#> iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
cv.results$response
#> [1] "BMPSB2"
cv.results$best.nrounds
#> [1] 14
By default, mixgb_cv() will randomly choose an incomplete variable as the response and build an XGBoost model with the other variables as explanatory variables, using the complete cases of the dataset. Therefore, each run of mixgb_cv() is likely to return different results. Users can also specify the response and covariates via the arguments response and select_features, respectively.
cv.results <- mixgb_cv(
  data = nhanes3_newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "BMPHEAD",
  select_features = c("HSAGEIR", "HSSEX", "DMARETHN", "BMPRECUM", "BMPSB1",
                      "BMPSB2", "BMPTR1", "BMPTR2", "BMPWT"),
  xgb.params = params, verbose = FALSE
)
cv.results$best.nrounds
#> [1] 18
Let’s just try setting nrounds = cv.results$best.nrounds in mixgb() to obtain 5 imputed datasets.

imputed.data <- mixgb(data = nhanes3_newborn, m = 5, nrounds = cv.results$best.nrounds)
The mixgb package provides the following visual diagnostics functions:

- Single variable: plot_hist(), plot_box(), plot_bar();
- Two variables: plot_2num(), plot_2fac(), plot_1num1fac();
- Three variables: plot_2num1fac(), plot_1num2fac().

Each function will return m + 1 panels to compare the observed data with the m sets of actual imputed values.

For more details, please check the vignette Visual diagnostics for multiply imputed values on GitHub.
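As a quick example of these diagnostics, an overlaid histogram comparing the observed and imputed values of a numeric variable can be drawn as follows. This is a sketch assuming the imputation.list, var.name and original.data argument names from the package documentation:

```r
library(mixgb)

imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

# Compare the observed distribution of BMPHEAD with its m sets of
# imputed values: one panel for the observed data plus m imputed panels
plot_hist(
  imputation.list = imputed.data,
  var.name = "BMPHEAD",
  original.data = nhanes3_newborn
)
```

Large discrepancies between the observed panel and the imputed panels can flag implausible imputations, although some difference is expected when data are not missing completely at random.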