Title: | Multiple Imputation by Chained Equations with Random Forests |
---|---|
Description: | Multiple Imputation has been shown to be a flexible method to impute missing values by Van Buuren (2007) <doi:10.1177/0962280206074463>. Expanding on this, random forests have been shown to be an accurate model by Stekhoven and Buhlmann <arXiv:1105.0828> to impute missing values in datasets. They have the added benefits of returning out of bag error and variable importance estimates, as well as being simple to run in parallel. |
Authors: | Sam Wilson [aut, cre] |
Maintainer: | Sam Wilson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.5.1 |
Built: | 2025-02-08 03:07:43 UTC |
Source: | https://github.com/farrellday/miceranger |
Add datasets to a current miceDefs object. Adds the same number of iterations as other datasets.
addDatasets(miceObj, datasets = 3, parallel = FALSE, verbose = TRUE, ...)
miceObj |
A miceDefs object created by |
datasets |
The number of datasets to add. |
parallel |
Should the process run in parallel? This process will take advantage of any cluster
set up when |
verbose |
should progress be printed? |
... |
other parameters passed to |
an updated miceDefs object with additional datasets.
data("sampleMiceDefs")
miceObj <- addDatasets(sampleMiceDefs, datasets = 1, verbose = FALSE, num.threads = 1, num.trees = 5)
Add iterations to a current miceDefs object. Adds iterations for all datasets.
addIterations(miceObj, iters = 5, parallel = FALSE, verbose = TRUE, ...)
miceObj |
A miceDefs object created by |
iters |
The number of iterations to add to each dataset. |
parallel |
Should the process run in parallel? This process will take advantage of any cluster
set up when |
verbose |
should progress be printed? |
... |
other parameters passed to |
an updated miceDefs object with additional iterations.
data("sampleMiceDefs")
miceObj <- addIterations(sampleMiceDefs, iters = 2, verbose = FALSE, num.threads = 1, num.trees = 5)
Randomly amputes data (MCAR).
amputeData(data, perc = 0.1, cols = names(data))
data |
The data to be amputed |
perc |
A scalar. The percentage (0-1) to be amputed. |
cols |
The columns to ampute. |
The same dataset with random values in cols set to NA.
data(iris)
head(iris, 10)
ampIris <- amputeData(iris)
head(ampIris, 10)
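To make the MCAR (missing completely at random) idea concrete, the following base R sketch sets a random fraction of each chosen column to NA. The helper name `ampute_mcar` is hypothetical, and the assumption that exactly `round(perc * nrow(data))` rows are dropped independently per column is ours; this is not the package's implementation of amputeData.

```r
# Hypothetical helper: a minimal MCAR amputation sketch in base R.
# Assumption: each column in `cols` independently loses round(perc * n) values.
ampute_mcar <- function(data, perc = 0.1, cols = names(data)) {
  stopifnot(perc >= 0, perc <= 1, all(cols %in% names(data)))
  out <- data
  for (col in cols) {
    # Rows are chosen uniformly at random, independent of any observed values
    drop <- sample(nrow(out), size = round(perc * nrow(out)))
    out[drop, col] <- NA
  }
  out
}

set.seed(1)
amp <- ampute_mcar(iris, perc = 0.2, cols = c("Sepal.Length", "Species"))
colSums(is.na(amp))
```

Because the rows are drawn without reference to the data, the missingness carries no information about the values themselves, which is what distinguishes MCAR from other missingness mechanisms.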
Return imputed datasets from a miceDefs object.
completeData(miceObj, datasets = 1:miceObj$callParams$m, verbose = TRUE)
miceObj |
an object of class miceDefs. |
datasets |
a vector of the datasets you want to return. |
verbose |
a warning is thrown if integers are converted to doubles.
To suppress this warning, set to |
A list of imputed datasets.
data("sampleMiceDefs")
imputedList <- completeData(sampleMiceDefs)
Returns imputations for the specified datasets and variable.
getVarImps(x, datasets, var)
x |
A |
datasets |
The datasets to return. Can be a number, or a numeric vector. |
var |
The variable to return the imputations for. |
These functions exist solely to get at the imputed data for a specific dataset and variable.
A matrix of imputations for a single variable. Each column represents a different dataset.
data("sampleMiceDefs")
getVarImps(sampleMiceDefs, var = "Petal.Width")
Impute data using the information from an existing miceDefs object.
impute( data, miceObj, datasets = 1:miceObj$callParams$m, iterations = miceObj$callParams$maxiter, verbose = TRUE )
data |
The data to be imputed. Must have all columns used in the imputation of miceDefs. |
miceObj |
A miceDefs object created by |
datasets |
A numeric vector specifying the datasets with which to impute |
iterations |
The number of iterations to run.
By default, the same as the number of iterations currently in |
verbose |
should progress be printed? |
This capability is experimental, but works well in benchmarking. The original data and random forests (if returnModels = TRUE) are returned when miceRanger is called. These models can be recycled to impute a new dataset in the same fashion as miceRanger, by imputing each variable over a series of iterations. Each dataset created in miceObj can be thought of as a different imputation mechanism, with different initialized values and different associated random forests. Therefore, it is necessary to choose the datasets which will be used to impute the data. When mean matching a numeric variable, the candidate values are drawn from the original data passed to miceRanger, not the data passed to this function.
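The point about candidate values can be illustrated with a small base R sketch of mean matching. It is only a sketch under stated assumptions: the helper `mean_match` is hypothetical, a linear model stands in for the random forest, and the rule "pick one of the k nearest predictions at random" is a common form of predictive mean matching rather than a confirmed description of the package's internals. What it shows is that every imputed value is an observed value from the original data, never from the new data.

```r
# Hypothetical helper: predictive mean matching for a numeric variable.
# pred_new  : model predictions for the rows to be imputed (new data)
# pred_orig : model predictions for the observed rows of the ORIGINAL data
# y_orig    : observed values of the ORIGINAL data (the candidate pool)
mean_match <- function(pred_new, pred_orig, y_orig, k = 5) {
  stopifnot(length(pred_orig) == length(y_orig), k >= 1)
  k <- min(k, length(y_orig))
  vapply(pred_new, function(p) {
    # Indices of the k original rows whose predictions are closest to p
    nearest <- order(abs(pred_orig - p))[seq_len(k)]
    # Draw one candidate at random and return its observed value
    y_orig[nearest[sample.int(k, 1)]]
  }, numeric(1), USE.NAMES = FALSE)
}

set.seed(42)
x_orig <- runif(100)
y_orig <- 2 * x_orig + rnorm(100, sd = 0.1)
fit <- lm(y_orig ~ x_orig)  # stand-in model; the package uses random forests

x_new <- c(0.1, 0.5, 0.9)   # rows of a new dataset needing imputation
pred_new <- predict(fit, newdata = data.frame(x_orig = x_new))
imputed <- mean_match(pred_new, fitted(fit), y_orig, k = 5)
imputed
```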
An object of class impDefs, which contains information about the imputation process.
callParams |
The parameters of the object. |
data |
The original data provided by the user. |
naWhere |
Logical index of missing data, having the same dimensions as |
missingCounts |
The number of missing values for each variable. |
imputedData |
A list of imputed datasets. |
ampDat <- amputeData(iris)
miceObj <- miceRanger(ampDat, 1, 1, returnModels = TRUE, verbose = FALSE)
newDat <- amputeData(iris)
newImps <- impute(newDat, miceObj)
Performs multiple imputation by chained random forests. Returns a miceDefs object, which contains information about the imputation process.
miceRanger( data, m = 5, maxiter = 5, vars, valueSelector = c("meanMatch", "value"), meanMatchCandidates = pmax(round(nrow(data) * 0.01), 5), returnModels = FALSE, parallel = FALSE, verbose = TRUE, ... )
data |
A data.frame or data.table to be imputed. |
m |
The number of datasets to produce. |
maxiter |
The number of iterations to run for each dataset. |
vars |
Specifies which and how variables should be imputed. Can be specified in 3 different ways:
|
valueSelector |
How to select the value to be imputed from the model predictions.
Can be "meanMatch", "value", or a named vector containing a mixture of those values.
If a named vector is passed, the names must equal the variables to be imputed specified in |
meanMatchCandidates |
Specifies the number of candidate values which are selected from in the
mean matching algorithm. Can be either specified as an integer or a named integer vector for different
values by variable. If a named integer vector is passed, the names of the vector must contain at a
minimum the names of the numeric variables imputed using |
returnModels |
Logical. Should the final model for each variable be returned? Set to |
parallel |
Should the process run in parallel? Usually not necessary. This process will
take advantage of any cluster set up when |
verbose |
should progress be printed? |
... |
other parameters passed to |
a miceDefs object, containing the following:
callParams |
The parameters of the object. |
data |
The original data provided by the user, cast to a data.table. |
naWhere |
Logical index of missing data, having the same dimensions as |
missingCounts |
The number of missing values for each variable |
rawClasses |
The original classes provided in |
newClasses |
The new classes of the returned data. |
allImps |
The imputations of all variables at each iteration, for each dataset. |
allImport |
The variable importance metrics at each iteration, for each dataset. |
allError |
The OOB model error for all variables at each iteration, for each dataset. |
finalImps |
The final imputations for each dataset. |
finalImport |
The final variable importance metrics for each dataset. |
finalError |
The final model error for each variable in every dataset. |
finalModels |
Only returned if |
imputationTime |
The total time in seconds taken to create the imputations for the specified datasets and iterations. Does not include any setup time. |
It is highly recommended to visit the GitHub README for a thorough walkthrough of miceRanger's capabilities, as well as performance benchmarks.
Several vignettes are also available on miceRanger's listing on the CRAN website.
#################
## Simple Example
data(iris)
ampIris <- amputeData(iris)
miceObj <- miceRanger(
  ampIris,
  m = 1,
  maxiter = 1,
  verbose = FALSE,
  num.threads = 1,
  num.trees = 5
)

##################
## Run in parallel
data(iris)
ampIris <- amputeData(iris)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

# Perform mice
miceObjPar <- miceRanger(
  ampIris,
  m = 2,
  maxiter = 2,
  parallel = TRUE,
  verbose = FALSE
)
stopCluster(cl)
registerDoSEQ()

############################
## Complex Imputation Schema
data(iris)
ampIris <- amputeData(iris)

# Define variables to impute, as well as their predictors
v <- list(
  Sepal.Width = c("Sepal.Length", "Petal.Width", "Species"),
  Sepal.Length = c("Sepal.Width", "Petal.Width"),
  Species = c("Sepal.Width")
)

# Specify mean matching for certain variables.
vs <- c(
  Sepal.Width = "meanMatch",
  Sepal.Length = "value",
  Species = "meanMatch"
)

# Different mean matching candidates per variable.
mmc <- c(
  Sepal.Width = 4,
  Species = 10
)

miceObjCustom <- miceRanger(
  ampIris,
  m = 1,
  maxiter = 1,
  vars = v,
  valueSelector = vs,
  meanMatchCandidates = mmc,
  verbose = FALSE
)
Plot the correlation of imputed values between every combination of datasets for each variable.
plotCorrelations( miceObj, vars = names(miceObj$callParams$vars), factCorrMetric = "CramerV", numbCorrMetric = "pearson", ... )
miceObj |
an object of class miceDefs, created by the miceRanger function. |
vars |
the variables you want to plot. Default is to plot all variables. Can be a vector of variable names, or one of 'allNumeric' or 'allCategorical' |
factCorrMetric |
The correlation metric for categorical variables. Can be one of:
|
numbCorrMetric |
The correlation metric for numeric variables. Can be one of:
|
... |
Other arguments to pass to ggarrange() |
an object of class ggarrange.
data("sampleMiceDefs")
plotCorrelations(sampleMiceDefs)
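As a rough illustration of what "correlation between every combination of datasets" means for a numeric variable, the base R sketch below computes the Pearson correlation for each pair of columns in a matrix of imputations. The matrix is simulated and the column names are hypothetical; with real output, a matrix with one column per dataset (such as the one getVarImps returns) would take its place. The plotting itself is not reproduced here.

```r
# Simulated imputations: 50 imputed values (rows) across 3 datasets (columns).
# Assumption: the datasets agree on an underlying signal plus independent noise.
set.seed(7)
signal <- rnorm(50)
imps <- sapply(1:3, function(i) signal + rnorm(50, sd = 0.3))
colnames(imps) <- paste0("Dataset_", 1:3)  # hypothetical column names

# Every unordered pair of datasets
pairs <- combn(colnames(imps), 2)

# Pearson correlation of the imputed values for each pair
corrs <- apply(pairs, 2, function(p) {
  cor(imps[, p[1]], imps[, p[2]], method = "pearson")
})
names(corrs) <- apply(pairs, 2, paste, collapse = " vs ")
corrs
```

High pairwise correlations suggest that the datasets largely agree on the imputed values; low correlations suggest that the imputations carry substantial uncertainty.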
Plots the distribution of the original data beside the imputed data.
plotDistributions( miceObj, vars = names(miceObj$callParams$vars), dotsize = 0.5, ... )
miceObj |
an object of class miceDefs, created by the miceRanger function. |
vars |
the variables you want to plot. Default is to plot all variables. Can be a vector of variable names, or one of 'allNumeric' or 'allCategorical' |
dotsize |
Passed to |
... |
additional parameters passed to |
an object of class ggarrange.
data("sampleMiceDefs")
plotDistributions(sampleMiceDefs)
Plots the distribution of the difference between datasets of the imputed values. For categorical variables, the distribution of the number of distinct levels imputed for each sample is shown next to the distribution of unique draws from that variable in the nonmissing data, given that the draws were completely random. For numeric variables, the density of the standard deviation (between datasets) of imputations is plotted. The shaded area represents the samples that had a standard deviation lower than the total nonmissing standard deviation for the original data.
plotImputationVariance( miceObj, vars = names(miceObj$callParams$vars), monteCarloSimulations = 10000, ... )
miceObj |
an object of class miceDefs, created by the miceRanger function. |
vars |
the variables you want to plot. Default is to plot all variables. Can be a vector of variable names, or one of 'allNumeric' or 'allCategorical' |
monteCarloSimulations |
The number of simulations to run to determine the distribution of unique categorical levels drawn if the draws were completely random. |
... |
additional parameters passed to |
an object of class ggarrange.
data("sampleMiceDefs")
plotImputationVariance(sampleMiceDefs, monteCarloSimulations = 100)
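The reference distribution for categorical variables can be approximated with a short Monte Carlo simulation. The base R sketch below is an assumption about how such a baseline might be computed, not the package's code: the helper `simulate_distinct_levels` is hypothetical. It draws m values at random (with replacement) from the nonmissing data, counts the distinct levels, and repeats this many times.

```r
# Hypothetical helper: distribution of the number of distinct levels seen
# when m values are drawn completely at random from the observed data.
simulate_distinct_levels <- function(observed, m, n_sims = 10000) {
  observed <- as.character(observed[!is.na(observed)])
  # One row per simulation, one column per dataset
  draws <- matrix(
    sample(observed, size = m * n_sims, replace = TRUE),
    nrow = n_sims
  )
  counts <- apply(draws, 1, function(row) length(unique(row)))
  # Proportion of simulations yielding 1, 2, ..., m distinct levels
  table(factor(counts, levels = seq_len(m))) / n_sims
}

set.seed(1991)
baseline <- simulate_distinct_levels(iris$Species, m = 3, n_sims = 1000)
baseline
```

If the imputed values for a sample show fewer distinct levels across datasets than this random baseline would suggest, the models are imputing that sample with more confidence than random draws would.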
Plot the Out Of Bag model error for specified variables over all iterations.
plotModelError( miceObj, vars = names(miceObj$callParams$vars), pointSize = 1, ... )
miceObj |
an object of class miceDefs, created by the miceRanger function. |
vars |
the variables you want to plot. Default is to plot all variables. Can be a vector of variable names, or one of 'allNumeric' or 'allCategorical' |
pointSize |
passed to |
... |
other arguments passed to |
an object of class ggarrange.
data("sampleMiceDefs")
plotModelError(sampleMiceDefs)
Plot the evolution of the dispersion and center of each variable. For numeric variables, the center is the mean, and the dispersion is the standard deviation. For categorical variables, the center is the mode, and the dispersion is the entropy of the distribution.
plotVarConvergence(miceObj, vars = names(miceObj$callParams$vars), ...)
miceObj |
an object of class |
vars |
the variables you want to plot. Default is to plot all variables. Can be a vector of variable names, or one of 'allNumeric' or 'allCategorical' |
... |
options passed to |
an object of class ggarrange.
data("sampleMiceDefs")
plotVarConvergence(sampleMiceDefs)
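The center and dispersion statistics named in the description of plotVarConvergence can be computed by hand as follows. This is a minimal base R sketch: the helper `center_and_dispersion` is hypothetical, and the use of the natural logarithm for entropy is an assumption, since the log base is not stated.

```r
# Hypothetical helper: center and dispersion of a single variable.
# Numeric: mean and standard deviation.
# Categorical: mode and Shannon entropy (natural log assumed).
center_and_dispersion <- function(x) {
  x <- x[!is.na(x)]
  if (is.numeric(x)) {
    list(center = mean(x), dispersion = sd(x))
  } else {
    p <- prop.table(table(x))
    p <- p[p > 0]  # drop empty levels so log(p) is finite
    list(
      center = names(p)[which.max(p)],  # first level wins ties
      dispersion = -sum(p * log(p))
    )
  }
}

num_stats <- center_and_dispersion(iris$Sepal.Length)
cat_stats <- center_and_dispersion(iris$Species)
num_stats
cat_stats
```

Tracking these two statistics for the imputed values at each iteration gives a simple convergence check: if they keep drifting, more iterations may be needed.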
Plot the variable importance for each imputed variable. The values represent the importance of the variables on the top axis in imputing the variables on the left axis.
plotVarImportance( miceObj, display = c("Relative", "Absolute"), dataset = 1, ... )
miceObj |
an object of class miceDefs, created by the miceRanger function. |
display |
How do you want to display variable importance?
|
dataset |
The dataset you want to plot the variable importance of. |
... |
Other arguments passed to |
data("sampleMiceDefs")
plotVarImportance(sampleMiceDefs)
Print a miceDefs object
## S3 method for class 'miceDefs'
print(x, ...)
x |
Object of class |
... |
required to use S3 method |
NULL
Sample miceDefs object built from the iris dataset. Included so that the examples run quickly.
sampleMiceDefs
A miceDefs object. See ?miceRanger for details.
set.seed(1991)
data(iris)
ampIris <- amputeData(iris, cols = c("Petal.Width", "Species"))
sampleMiceDefs <- miceRanger(ampIris, m = 3, maxiter = 3, vars = c("Petal.Width", "Species"))
## Not run:
sampleMiceDefs
## End(Not run)