In addition to model-based imputation methods (see vignette("modelImp")), the VIM package also provides donor-based imputation methods, namely hot-deck imputation, k-nearest neighbour imputation and fast matching/imputation based on categorical variables.
This vignette showcases the functions hotdeck() and kNN(), which can both be used to generate imputations for several variables in a dataset. Moreover, the function matchImpute() is presented, which is, in contrast, an imputation method based on categorical variables.
The following example demonstrates the functionality of hotdeck() and kNN() using a subset of sleep. The columns have been selected deliberately to include some interactions between the missing values.
library(VIM)
library(magrittr)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)
The plot indicates several missing values in Dream, NonD, and Span.
sapply(dataset, function(x)sum(is.na(x)))
#> Dream NonD BodyWgt Span
#> 12 14 0 4
Calling the functions is straightforward. We will start by imputing only NonD based on the other variables. Besides imputing missing values for a single variable, these functions also support imputation of multiple variables. For matchImpute(), suitable donors are found by matching on the categorical variables.
imp_hotdeck <- hotdeck(dataset, variable = "NonD") # hotdeck imputation
imp_knn <- kNN(dataset, variable = "NonD") # kNN imputation
imp_match <- matchImpute(dataset, variable = "NonD", match_var = c("BodyWgt","Span")) # match imputation
aggr(imp_knn, delimiter = "_imp")
aggr(imp_match, delimiter = "_imp")
We can see that kNN() imputed all missing values for NonD in our dataset. The same is true for the values imputed via hotdeck(). The variables specified in matchImpute() are used to find matching donors and enable imputation for NonD.
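The support for several variables mentioned above can be sketched as follows. This is a minimal example, assuming that the variable argument of kNN() accepts a character vector of column names; the object name imp_multi is introduced here only for illustration.

```r
library(VIM)

# Rebuild the subset of sleep used in this vignette
data(sleep, package = "VIM")
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)

# Impute two variables in one call (imp_multi is an illustrative name)
imp_multi <- kNN(dataset, variable = c("NonD", "Dream"))

# Each imputed variable gets a logical indicator column with the "_imp" suffix
sapply(imp_multi[, c("NonD", "Dream")], function(x) sum(is.na(x)))
colSums(imp_multi[, c("NonD_imp", "Dream_imp")])
```

Span is not listed in variable here, so its missing values are left untouched.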
As we can see in the next two plots, the original data structure of NonD and Span is preserved by hotdeck(). kNN() shows the typical behaviour of methods that are based on similar data points weighted by distance.
imp_hotdeck[, c("NonD", "Span", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
imp_knn[, c("NonD", "Span", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
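To make the idea of imputing from similar data points more concrete, here is a simplified base R sketch of nearest-neighbour imputation. It is not the implementation used by kNN(): the helper knn_impute_sketch() and the toy data are hypothetical, only one numeric predictor is used, the distance is a range-scaled absolute difference, and the donors are aggregated with an unweighted median rather than a distance-weighted one.

```r
# Toy data (hypothetical values): y has missing entries, x is fully observed
toy <- data.frame(
  x = c(1.0, 1.2, 1.9, 2.1, 3.0, 3.2, 4.1, 5.0),
  y = c(10, 11, NA, 14, 18, NA, 22, 25)
)

# Hypothetical helper: impute `target` by the median of the k nearest donors
knn_impute_sketch <- function(data, target, predictor, k = 3) {
  out <- data
  missing_rows <- which(is.na(data[[target]]))
  donor_rows <- which(!is.na(data[[target]]))
  # Scale distances by the range of the predictor
  rng <- diff(range(data[[predictor]]))
  for (i in missing_rows) {
    d <- abs(data[[predictor]][donor_rows] - data[[predictor]][i]) / rng
    nearest <- donor_rows[order(d)[seq_len(min(k, length(donor_rows)))]]
    out[[target]][i] <- median(data[[target]][nearest])
  }
  # Flag imputed cells, mirroring the "_imp" indicator columns
  out[[paste0(target, "_imp")]] <- is.na(data[[target]])
  out
}

toy_imp <- knn_impute_sketch(toy, target = "y", predictor = "x", k = 3)
toy_imp
```

Because each imputed value is an aggregate of several neighbours, imputations from this kind of method tend to lie inside the cloud of observed points.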
matchImpute() works by sampling values from the suitable donors and also provides reasonable results.
imp_match[, c("NonD", "Span", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
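The sampling from suitable donors described above can be illustrated with a small base R sketch. It is a simplified stand-in, not the code behind matchImpute(): the helper match_impute_sketch() and the toy data are hypothetical, and a donor is assumed to be any row with an observed target value and identical values in the matching variables.

```r
set.seed(42)

# Toy data (hypothetical values) with one categorical matching variable
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, NA, 10, NA, 12)
)

# Hypothetical helper: fill each missing value with a random donor from the same group
match_impute_sketch <- function(data, target, match_var) {
  out <- data
  # One key per row, built from the matching variables
  key <- interaction(data[match_var], drop = TRUE)
  for (i in which(is.na(data[[target]]))) {
    donors <- which(key == key[i] & !is.na(data[[target]]))
    if (length(donors) > 0) {
      # sample.int() avoids the special case of sample() for length-one input
      pick <- donors[sample.int(length(donors), 1)]
      out[[target]][i] <- data[[target]][pick]
    }
  }
  # Flag the cells that were actually imputed
  out[[paste0(target, "_imp")]] <- is.na(data[[target]]) & !is.na(out[[target]])
  out
}

toy_match <- match_impute_sketch(toy, target = "value", match_var = "group")
toy_match
```

Every imputed value is a copy of an observed value from the same group, which is why a donor-based method of this kind cannot produce values outside the observed data.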
To validate the performance of kNN() and to highlight its ability to impute different data types, the iris dataset is used. First, some values are randomly set to NA.
data(iris)
df <- iris
colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris), size = nbr_missing, replace = TRUE))
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA
aggr(df)
sapply(df, function(x) sum(is.na(x)))
#> S.Length S.Width P.Length P.Width Species
#> 10 9 8 10 12
We can see that there are missing values in all variables, and some observations have missing values in several variables.
imp_knn <- kNN(df)
aggr(imp_knn, delimiter = "imp")
The plot indicates that all missing values have been imputed by kNN(). The following table displays the rounded first five results of the imputation for all variables.