--- title: "Imputing Missing Data with miceRanger" author: "Sam Wilson" date: "`r Sys.Date()`" output: html_document vignette: > %\VignetteIndexEntry{Filling in Missing Data with miceRanger} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} params: EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true") --- ```{r, SETTINGS-knitr, include=FALSE} stopifnot(require(knitr)) opts_chunk$set( eval = if (isTRUE(exists("params"))) params$EVAL else FALSE ) ``` ## Introduction ```miceRanger``` performs Multiple Imputation by Chained Equations (MICE) with random forests. It can impute categorical and numeric data without much setup, and has an array of diagnostic plots available. This vignette provides examples and syntax needed to impute a dataset under different conditions. For a much more detailed walkthrough of miceRanger, as well as benchmarks, please see the official [GitHub](https://github.com/farrellday/miceRanger) page. ## Using miceRanger In these examples we will be looking at a simple example of multiple imputation. We need to load the packages, and define the data: ```{r amputeData,message=FALSE} require(miceRanger) set.seed(1) # Load data data(iris) # Ampute the data. iris contains no missing values by default. ampIris <- amputeData(iris,perc=0.25) head(ampIris,10) ``` ### Simple example ```{r simpleMice,message=FALSE} # Perform mice, return 6 datasets. seqTime <- system.time( miceObj <- miceRanger( ampIris , m=6 , returnModels = TRUE , verbose=FALSE ) ) ``` ### Running in Parallel Running in parallel is usually not necessary. By default, ```ranger``` will use all available cores, and ```data.table```s assignment by reference is already lightning fast. However, in certain cases, we can still save some time by sending each dataset imputation to a different R back end. To do this, we need to set up a core cluster and use ```parallel = TRUE```. *This causes the dataset to be copied for each back end, which may eat up your RAM. If the process is memory constrained, this can cause the parallel implementation to actually take more time than the sequential implementation.* ```{r parMice,message=FALSE} library(doParallel) # Set up back ends. cl <- makeCluster(2) registerDoParallel(cl) # Perform mice parTime <- system.time( miceObjPar <- miceRanger( ampIris , m=6 , parallel = TRUE , verbose = FALSE ) ) stopCluster(cl) registerDoSEQ() ``` Let's take a look at the time we saved running in parallel: ```{r parFaster} perc <- round(1-parTime[[3]]/seqTime[[3]],2)*100 print(paste0("The parallel process ran ",perc,"% faster using 2 R back ends.")) ``` We did not save that much time (if any) by running in parallel. ```ranger``` already makes full use of our CPU. Running in parallel will save you time if you are using a high ```meanMatchCandidates```, or if you are working with very large data and using a low ```num.trees```. ### Adding More Iterations/Datasets If you plot your data and notice that you need to may need to run more iterations, or you would like more datasets for your analysis, you can use the following functions: ```{r addToMice} miceObj <- addIterations(miceObj,iters=2,verbose=FALSE) miceObj <- addDatasets(miceObj,datasets=1,verbose=FALSE) ``` ### Specifying Predictors, Value Selector, and Mean Matching Candidates by Variable It is possible to customize our imputation procedure by variable. By passing a named list to ```vars```, you can specify the predictors for each variable to impute. You can also select which variables should be imputed using mean matching, as well as the mean matching candidates, by passing a named vector to ```valueSelector``` and ```meanMatchCandidates```, respectively: ```{r customSetup} v <- list( Sepal.Width = c("Sepal.Length","Petal.Width","Species") , Sepal.Length = c("Sepal.Width","Petal.Width") , Species = c("Sepal.Width") ) pmm <- c( Sepal.Width = "meanMatch" , Sepal.Length = "value" , Species = "meanMatch" ) mmc <- c( Sepal.Width = 4 , Species = 10 ) miceObjCustom <- miceRanger( ampIris , vars = v , valueSelector = pmm , meanMatchCandidates = mmc , verbose=FALSE ) ``` ### Imputing New Data with Existing Models Multiple Imputation can take a long time. If you wish to impute a dataset using the MICE algorithm, but don't have time to train new models, it is possible to impute new datasets using a ```miceDefs``` object. The ```impute``` function uses the random forests returned by ```miceRanger``` to perform multiple imputation without updating the random forest at each iteration: ```{r} newDat <- amputeData(iris) newImputed <- impute(newDat,miceObj,verbose=FALSE) ``` All of the imputation parameters (valueSelector, vars, etc) will be carried over from the original ```miceDefs``` object. When mean matching, the candidate values are pulled from the original dataset. For performance and timing benchmark information, please see the [benchmarks](https://github.com/FarrellDay/miceRanger/tree/master/benchmarks) page. ## Using the Imputed Data To return the imputed data simply use the ```completeData``` function: ```{r completeData} dataList <- completeData(miceObj) head(dataList[[1]],10) ``` We can see how the imputed data compares to the original data before it was amputed: ```{r impAccuracy,message=FALSE,echo=FALSE,fig.height = 6,warning=FALSE} require(ggplot2) require(ggpubr) plotVars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") plotList <- lapply( plotVars , function(x) { missIndx <- is.na(ampIris[,get(x)]) impVsAmp <- data.table( originalData = iris[missIndx,x] , imputedData = dataList[[1]][missIndx,get(x)] , Species = iris[missIndx,]$Species ) return( ggscatter(impVsAmp,x="originalData",y="imputedData",add="reg.line",size = 0) + geom_point(data=impVsAmp,aes(x=originalData,y=imputedData,color=Species)) + stat_cor(label.x = min(impVsAmp$originalData), label.y = max(impVsAmp$imputedData)*0.9+0.1*min(impVsAmp$imputedData)) + xlab(paste0("Original ",x)) + ylab(paste0("Imputed ",x)) ) } ) arranged <- ggarrange( plotlist = plotList , common.legend = TRUE ) annotate_figure( arranged , top=text_grob( "Original Data Compared to Imputed Value" , face = "bold" , size = 14 ) ) ``` It looks like most of our variables were imputed with a high degree of accuracy. Sepal.Width had a relatively poor Spearman correlation, however we expected this when we saw the results from ```plotModelError()``` above.