Multiple Imputation by Chained Equations is a robust, informative method of dealing with missing data in datasets. The procedure ‘fills in’ (imputes) missing data in a dataset through an iterative series of predictive models. In each iteration, each specified variable in the dataset is imputed using the other variables in the dataset. These iterations should be run until it appears that convergence has been met.
This process is continued until all specified variables have been imputed. Additional iterations can be run if it appears that the average imputed values have not converged, although no more than 5 iterations are usually necessary. The accuracy of the imputations will depend on the information density in the dataset. A dataset of completely independent variables with no correlation will not yield accurate imputations. There are diagnostic plots available in miceRanger which allow the user to determine how valid the imputations may be.
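The iterative procedure described above can be sketched in base R. This is a simplified illustration, not miceRanger's implementation: the function name chained_impute and the toy data are hypothetical, lm() stands in for the random forests miceRanger uses, only numeric columns of a plain data.frame are handled, and a single dataset is produced rather than multiple ones.

```r
# Hypothetical helper: a minimal chained-equations loop for a numeric data.frame.
# lm() is a stand-in for the random forest models miceRanger fits.
chained_impute <- function(df, iterations = 5) {
  miss <- lapply(df, is.na)  # remember which cells were originally missing

  # Crude starting point: fill each column with its observed mean
  for (v in names(df)) {
    df[[v]][miss[[v]]] <- mean(df[[v]], na.rm = TRUE)
  }

  # Track the average imputed value per variable and iteration
  mean_trace <- matrix(NA_real_, nrow = iterations, ncol = ncol(df),
                       dimnames = list(NULL, names(df)))

  for (i in seq_len(iterations)) {
    for (v in names(df)) {
      if (!any(miss[[v]])) next
      predictors <- setdiff(names(df), v)
      # Fit on rows where v was observed, using the current values of the other columns
      fit <- lm(reformulate(predictors, response = v),
                data = df[!miss[[v]], , drop = FALSE])
      # Overwrite only the originally missing cells with new predictions
      df[[v]][miss[[v]]] <- predict(fit, newdata = df[miss[[v]], , drop = FALSE])
      mean_trace[i, v] <- mean(df[[v]][miss[[v]]])
    }
  }
  list(data = df, mean_trace = mean_trace)
}

# Toy data with correlated columns and 20% of each column set to NA
set.seed(42)
n <- 300
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)
z <- x - y + rnorm(n, sd = 0.5)
toy <- data.frame(x, y, z)
for (v in names(toy)) toy[[v]][sample(n, 60)] <- NA

res <- chained_impute(toy, iterations = 5)
print(res$mean_trace)
```

Each row of mean_trace holds the average imputed value per variable for one iteration. If the rows stop changing noticeably, that is the kind of convergence the text refers to.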
miceRanger can make use of a procedure called predictive mean matching (PMM) to select which values are imputed. PMM involves selecting a datapoint from the original, nonmissing data which has a predicted value close to the predicted value of the missing sample. The closest N values (the meanMatchCandidates parameter in miceRanger()) are chosen as candidates, from which a value is chosen at random. Going into more detail from our example above, we see how this works in practice:
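The selection step can also be sketched in a few lines of base R. The helper pmm_select below is hypothetical and is not miceRanger's internal code. It assumes predictions are already available for both observed and missing rows, and the usage example gets them from lm() rather than a random forest. The argument k plays the role of meanMatchCandidates.

```r
# Hypothetical helper illustrating predictive mean matching.
# pred_missing:  model predictions for rows where the value is missing
# pred_observed: model predictions for rows where the value was observed
# y_observed:    the real observed values for those same rows
# k:             number of candidates (the role of meanMatchCandidates)
pmm_select <- function(pred_missing, pred_observed, y_observed, k = 5) {
  k <- min(k, length(pred_observed))
  vapply(pred_missing, function(p) {
    # Observed rows whose predictions are closest to this prediction
    nearest <- order(abs(pred_observed - p))[seq_len(k)]
    # Pick one candidate at random and return its real, observed value
    y_observed[nearest[sample.int(k, 1)]]
  }, numeric(1))
}

# Usage on an integer-valued target with 40 missing entries
set.seed(1)
x <- runif(200)
d <- data.frame(x = x, y = round(3 * x + runif(200)))
miss_rows <- sample(200, 40)
obs <- d[-miss_rows, ]

fit <- lm(y ~ x, data = obs)  # stand-in model for the sketch
imputed <- pmm_select(
  pred_missing  = predict(fit, newdata = d[miss_rows, ]),
  pred_observed = predict(fit, newdata = obs),
  y_observed    = obs$y,
  k = 5
)
print(table(imputed))
```

Because every imputed value is copied from an observed row, the imputations can only take values that already occur in the data. This is why mean matching preserves integer and multimodal variables.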
This method is very useful if the variable being imputed has characteristics such as a multimodal distribution, heavy skew, or integer values.
As an example, let’s construct a dataset with some of the above characteristics:
library(data.table)
library(miceRanger)
# random uniform variable
nrws <- 1000
dat <- data.table(Uniform_Variable = runif(nrws))
# slightly bimodal variable correlated with Uniform_Variable
dat$Close_Bimodal_Variable <- sapply(
    dat$Uniform_Variable
  , function(x) sample(c(rnorm(1,-2),rnorm(1,2)),prob=c(x,1-x),size=1)
) + dat$Uniform_Variable
# very bimodal variable correlated with Uniform_Variable
dat$Far_Bimodal_Variable <- sapply(
    dat$Uniform_Variable
  , function(x) sample(c(rnorm(1,-3),rnorm(1,3)),prob=c(x,1-x),size=1)
)
# Highly skewed variable correlated with Uniform_Variable
dat$Skewed_Variable <- exp((dat$Uniform_Variable*runif(nrws)*3)) + runif(nrws)*3
# Integer variable correlated with Close_Bimodal_Variable and Uniform_Variable
dat$Integer_Variable <- round(dat$Uniform_Variable + dat$Close_Bimodal_Variable/3 + runif(nrws)*2)
# Ampute the data.
ampDat <- amputeData(dat,0.2)
# Plot the original data
plot(dat)
We can see how our variables are distributed and correlated in the graph above. Now let’s run our imputation process twice, once using mean matching, and once using the model prediction.
mrMeanMatch <- miceRanger(ampDat,valueSelector = "meanMatch",verbose=FALSE)
mrModelOutput <- miceRanger(ampDat,valueSelector = "value",verbose=FALSE)
Let’s look at the effect on the different variables.
The effect of mean matching on our imputations is immediately apparent. If we were only looking at model error, we might be inclined to use the Prediction Value, since it has a higher OOB R-Squared. However, we are left with imputations that do not match our original distribution and therefore do not behave like our original data.
We see a similar occurrence in the skewed variable: the distribution of the values imputed with the Prediction Value is shifted towards the mean.
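The pull towards the mean can be reproduced without miceRanger. The sketch below rebuilds a variable with the same construction as Skewed_Variable and compares the spread of the real values with the spread of model predictions. As an assumption made for brevity, lm() is used in place of a random forest; the same narrowing applies to any model that predicts a conditional mean.

```r
set.seed(7)
n <- 1000
u <- runif(n)
# Same construction as Skewed_Variable in the example dataset
skewed <- exp(u * runif(n) * 3) + runif(n) * 3

# Stand-in model: predict the skewed variable from the uniform one
fit  <- lm(skewed ~ u)
pred <- fitted(fit)

# Predictions cover a narrower range than the real values
spread <- c(actual = sd(skewed), predicted = sd(pred))
upper_tail <- c(actual    = unname(quantile(skewed, 0.95)),
                predicted = unname(quantile(pred, 0.95)))
print(spread)
print(upper_tail)
```

Imputing the raw prediction therefore under-represents the tail of the distribution, whereas drawing a real observed value from the nearest candidates keeps it.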
The most obvious variable affected by mean matching was our integer variable: using valueSelector = 'value' allows interpolation in the numeric variables, so imputed values can fall between the integers present in the original data. Using mean matching has allowed us to keep the distribution and distinct values of the original data, without sacrificing accuracy.
MICE is particularly useful if missing values are associated with the target variable in a way that introduces leakage. For instance, let’s say you wanted to model customer retention at the time of sign up. A certain variable is collected at sign up or 1 month after sign up. The absence of that variable is a data leak, since it tells you that the customer did not retain for 1 month.
Information is often collected at different stages of a ‘funnel’. MICE can be used to make educated guesses about the characteristics of entities at different points in a funnel.
MICE can be used to impute missing values; however, it is important to keep in mind that these imputed values are a prediction. Creating multiple datasets with different imputed values allows you to do two types of inference: