miceRanger
performs Multiple Imputation by Chained
Equations (MICE) with random forests. It can impute categorical and
numeric data without much setup, and has an array of diagnostic plots
available. This vignette provides examples and syntax needed to impute a
dataset under different conditions. For a much more detailed walkthrough
of miceRanger, as well as benchmarks, please see the official GitHub page.
In these examples we will be looking at a simple example of multiple imputation. We need to load the packages, and define the data:
require(miceRanger)
set.seed(1)
# Load data
data(iris)
# Ampute the data. iris contains no missing values by default.
ampIris <- amputeData(iris,perc=0.25)
head(ampIris,10)
Running in parallel is usually not necessary. By default,
ranger
will use all available cores, and
data.table
s assignment by reference is already lightning
fast. However, in certain cases, we can still save some time by sending
each dataset imputation to a different R back end. To do this, we need
to set up a core cluster and use parallel = TRUE
. This
causes the dataset to be copied for each back end, which may eat up your
RAM. If the process is memory constrained, this can cause the parallel
implementation to actually take more time than the sequential
implementation.
library(doParallel)
# Set up back ends.
cl <- makeCluster(2)
registerDoParallel(cl)
# Perform mice
parTime <- system.time(
miceObjPar <- miceRanger(
ampIris
, m=6
, parallel = TRUE
, verbose = FALSE
)
)
stopCluster(cl)
registerDoSEQ()
Let’s take a look at the time we saved running in parallel:
perc <- round(1-parTime[[3]]/seqTime[[3]],2)*100
print(paste0("The parallel process ran ",perc,"% faster using 2 R back ends."))
We did not save that much time (if any) by running in parallel.
ranger
already makes full use of our CPU. Running in
parallel will save you time if you are using a high
meanMatchCandidates
, or if you are working with very large
data and using a low num.trees
.
If you plot your data and notice that you need to may need to run more iterations, or you would like more datasets for your analysis, you can use the following functions:
It is possible to customize our imputation procedure by variable. By
passing a named list to vars
, you can specify the
predictors for each variable to impute. You can also select which
variables should be imputed using mean matching, as well as the mean
matching candidates, by passing a named vector to
valueSelector
and meanMatchCandidates
,
respectively:
v <- list(
Sepal.Width = c("Sepal.Length","Petal.Width","Species")
, Sepal.Length = c("Sepal.Width","Petal.Width")
, Species = c("Sepal.Width")
)
pmm <- c(
Sepal.Width = "meanMatch"
, Sepal.Length = "value"
, Species = "meanMatch"
)
mmc <- c(
Sepal.Width = 4
, Species = 10
)
miceObjCustom <- miceRanger(
ampIris
, vars = v
, valueSelector = pmm
, meanMatchCandidates = mmc
, verbose=FALSE
)
Multiple Imputation can take a long time. If you wish to impute a
dataset using the MICE algorithm, but don’t have time to train new
models, it is possible to impute new datasets using a
miceDefs
object. The impute
function uses the
random forests returned by miceRanger
to perform multiple
imputation without updating the random forest at each iteration:
All of the imputation parameters (valueSelector, vars, etc) will be
carried over from the original miceDefs
object. When mean
matching, the candidate values are pulled from the original dataset. For
performance and timing benchmark information, please see the benchmarks
page.
To return the imputed data simply use the completeData
function:
We can see how the imputed data compares to the original data before it was amputed:
It looks like most of our variables were imputed with a high degree
of accuracy. Sepal.Width had a relatively poor Spearman correlation,
however we expected this when we saw the results from
plotModelError()
above.