Imputing Missing Data with miceRanger
Last compiled on 09/04/2023
miceRanger performs Multiple Imputation by Chained Equations (MICE) with random forests. It can impute categorical and numeric data with little setup and provides a range of diagnostic plots. A simple example and the full parameter list are available in the package documentation.
library(tidyverse)
library(readxl)
library(openxlsx)
library(ggplot2)
library(miceRanger)
##---Upload data-----------------------------------------------
Data <- as.data.frame(read.xlsx("./Data/3.Suicide_rate_values_female_final.xlsx"))
Following the correlation analysis, the variables ECON1, ECON5, and ECON8 are excluded from the analysis of the female data.
Data$ECON1 <- NULL
Data$ECON5 <- NULL
Data$ECON8 <- NULL
# Remove DMA and KNA, for which we have no suicide information
filtered_data2 <- Data[!(Data$Country %in% c("DMA", "KNA")), ]
#Reshape the file
reshaped_data1 <- filtered_data2%>%pivot_longer(!c(Country,Year), names_to = "Indicator", values_to ="Value")
reshaped_data2 <- reshaped_data1%>%pivot_wider(names_from=Year, values_from = Value)
reshaped_data2 <- reshaped_data2%>%select("Country", "Indicator", as.character(2000:2019))
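The two-step reshape above can be hard to picture. The following is a minimal sketch on a made-up table (the country codes, indicator names, and values are invented for illustration) showing how pivot_longer() followed by pivot_wider() yields one row per Country-Indicator pair and one column per year:

```r
library(tidyr)

# Toy data: invented values, not taken from the suicide-rate file
toy <- data.frame(
  Country = c("AAA", "AAA", "BBB", "BBB"),
  Year    = c(2000, 2001, 2000, 2001),
  IND1    = c(1.0, NA, 3.0, 4.0),
  IND2    = c(10, 20, 30, 40)
)

# Long format: one row per Country-Year-Indicator combination
toy_long <- pivot_longer(toy, !c(Country, Year),
                         names_to = "Indicator", values_to = "Value")

# Wide by year: one row per Country-Indicator pair, one column per year
toy_wide <- pivot_wider(toy_long, names_from = Year, values_from = Value)
print(toy_wide)
```

Missing values survive both steps as NA, which is what makes the row-wise availability check that follows possible.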
#Calculate the proportion of available (non-missing) yearly values in each Country-Indicator row
percent_available <-rowMeans(!is.na(reshaped_data2[,3:ncol(reshaped_data2)]))
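As a small illustration of this calculation, the sketch below applies the same rowMeans(!is.na(...)) idiom to an invented wide table (all names and values are hypothetical). Each result is the share of years observed for that row, so any value below 1 flags a row with gaps:

```r
# Toy wide table: rows are Country-Indicator pairs, columns are years (invented values)
toy_wide <- data.frame(
  Country   = c("AAA", "AAA", "BBB"),
  Indicator = c("IND1", "IND2", "IND1"),
  `2000`    = c(1.0, 10, NA),
  `2001`    = c(NA, 20, NA),
  `2002`    = c(2.0, 30, 5.0),
  `2003`    = c(2.5, 40, 6.0),
  check.names = FALSE
)

# !is.na() is TRUE for observed cells; the row mean of TRUE/FALSE is the observed share
share_observed <- rowMeans(!is.na(toy_wide[, 3:ncol(toy_wide)]))
print(share_observed)

# Rows with a share below 1 have at least one missing year
rows_with_gaps <- which(share_observed < 1)
print(toy_wide[rows_with_gaps, c("Country", "Indicator")])
```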
# Plot indicators with missing data
plot_missing_values <- function(Data) {
  ggplot(Data, aes(x = Year, y = Value, color = Country, group = Country)) +
    geom_line() +
    geom_point(size = 2) +
    labs(title = "Indicator Values Over Years",
         x = "Year",
         y = "Value",
         color = "Country") +
    theme_minimal() +
    facet_wrap(~ Indicator, ncol = 1, scales = "free_y")
}
# Identify and keep the rows (Country-Indicator pairs) with missing data
Indices_missing_values<-which(percent_available<1)
Data_with_missing_values <-reshaped_data2[Indices_missing_values,]
Data_with_missing_values2<-Data_with_missing_values%>%pivot_longer(!c("Country","Indicator"), names_to = "Year", values_to="Value")
plot_missing_values(Data_with_missing_values2)
# Save the plot with a specific size
#ggsave(paste("./ResultsMissingData/missing_data_",Name_data,".pdf",sep=""), plot = p, width = 10, height = 15)
https://github.com/farrellday/miceRanger
#Identify variables with missing data
columns_with_missing<- c(unique(Data_with_missing_values$Indicator))
print(columns_with_missing)
## [1] "AMB" "ECON6" "HEALTH9" "HEALTH5" "SOC9" "HEALTH11" "ECON7"
## [8] "ECON2" "ECON3"
#Countries with missing data
countries_with_missing_data <-unique(Data_with_missing_values$Country)
print(countries_with_missing_data)
## [1] "VEN" "GUY" "ARG" "HND" "HTI" "CHL" "SUR" "ECU" "BRB" "JAM" "COL" "CAN"
## [13] "NIC" "DOM" "ATG" "GRD" "PAN" "LCA" "BHS" "PRY" "GTM" "BOL" "PER" "MEX"
## [25] "BLZ" "CRI" "CUB" "SLV" "TTO" "USA" "VCT" "BRA" "URY"
#filter data for indicator with missing data in a particular country
#filter(Country==countries_with_missing_data[2])%>%
ampDat <- filtered_data2%>%pivot_longer(!c(Year,Country), names_to = "Indicator", values_to = "Value" )%>%pivot_wider(names_from = Indicator, values_from = Value)
# Create the imputation model using miceRanger
# (the variables to impute are passed through `vars`; miceRanger() has no `cols` argument)
mrModelOutput <- miceRanger(ampDat, m = 10, vars = columns_with_missing, valueSelector = "value", verbose = FALSE)
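For readers who want to try the call on something small first, here is a minimal sketch on the built-in iris data with values removed by hand. It assumes the miceRanger interface as I understand it (the `m`, `maxiter`, `vars`, `valueSelector`, and `verbose` arguments, plus completeData()). The object names are hypothetical and the settings are kept small only so that it runs quickly:

```r
library(miceRanger)

set.seed(1)

# Toy data: iris with values removed at random from two numeric columns
toy <- iris
toy$Sepal.Length[sample(nrow(toy), 30)] <- NA
toy$Petal.Width[sample(nrow(toy), 30)]  <- NA

# Impute only the two columns that contain gaps; the other columns act as predictors
toy_imp <- miceRanger(
  toy,
  m = 3,            # number of imputed datasets
  maxiter = 3,      # iterations per dataset
  vars = c("Sepal.Length", "Petal.Width"),
  valueSelector = "value",
  verbose = FALSE
)

# completeData() returns a list with one completed dataset per imputation
toy_completed <- completeData(toy_imp)
print(length(toy_completed))
print(anyNA(toy_completed[[1]]))
```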
We can take a look at the imputed distributions compared to the original distribution for each variable:
The red line is the density of the original, non-missing data. The smaller black lines are the densities of the imputed values in each of the datasets. A mismatch is not a problem in itself, but it may indicate that the data were not Missing Completely at Random (MCAR).
plotDistributions(mrModelOutput,vars='allNumeric')
We also want to know how the imputed values converged across datasets over the iterations. The plotCorrelations function shows a boxplot of the correlations between imputed values for every pair of datasets, at each iteration:
plotCorrelations(mrModelOutput,vars='allNumeric')
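The quantity behind this plot can be sketched in base R. Assume that, for one variable and one iteration, we have a matrix with one row per originally missing cell and one column per imputed dataset. The matrix below is simulated and is not how miceRanger stores its results. The boxplot then summarizes the correlations over every pair of columns:

```r
set.seed(42)

# Simulated imputations: 50 missing cells, m = 4 datasets.
# Each dataset's imputed value = shared signal + dataset-specific noise.
signal  <- rnorm(50)
imputed <- sapply(1:4, function(d) signal + rnorm(50, sd = 0.3))
colnames(imputed) <- paste0("Dataset_", 1:4)

# Correlation between every pair of datasets (upper triangle, diagonal excluded)
cor_mat  <- cor(imputed)
pairwise <- cor_mat[upper.tri(cor_mat)]

print(length(pairwise))   # number of dataset pairs, choose(4, 2)
print(summary(pairwise))
```

Correlations that rise and then level off from one iteration to the next suggest that the datasets are settling on similar imputations.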
Sometimes, if the missing data locations are correlated with higher or lower values, we need to run multiple iterations for the process to converge to the true theoretical mean (given the information that exists in the dataset). We can see if the imputed data converged, or if we need to run more iterations:
plotVarConvergence(mrModelOutput,vars='allNumeric')
Random forests give us a cheap way to estimate model error without cross-validation, because each tree can be evaluated on the out-of-bag (OOB) observations it was not trained on. Each model returns the OOB accuracy for classification and the OOB r-squared for regression. We can see how these converged as the iterations progress:
plotModelError(mrModelOutput,vars='allNumeric')
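miceRanger builds its forests with the ranger package. As a stand-alone illustration (fitted on iris, not on the data in this report), the sketch below shows that a ranger model reports out-of-bag error directly, with no held-out set. The object names are hypothetical:

```r
library(ranger)

# Regression forest: OOB R-squared and OOB mean squared error
reg_fit <- ranger(Sepal.Length ~ ., data = iris, num.trees = 200, seed = 1)
print(reg_fit$r.squared)
print(reg_fit$prediction.error)

# Classification forest: prediction.error is the OOB misclassification rate,
# so OOB accuracy is one minus that value
cls_fit <- ranger(Species ~ ., data = iris, num.trees = 200, seed = 1)
oob_accuracy <- 1 - cls_fit$prediction.error
print(oob_accuracy)
```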
Now let’s plot the variable importance for each imputed variable. The top axis contains the variable that was used to impute the variable on the left axis.
plotVarImportance(mrModelOutput)
We are probably also interested in how “certain” we were of our imputations. We can get a feel for how much each imputed value varies between the datasets by using the plotImputationVariance() function:
plotImputationVariance(mrModelOutput,ncol=1,widths=c(5,3))
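What variation between datasets means for a single cell can be sketched in base R. Assume a matrix of imputed values with one row per missing cell and one column per dataset. The matrix below is simulated, and the exact summary that plotImputationVariance() draws may differ. The spread across each row measures how much the imputations disagree:

```r
set.seed(7)

# Simulated imputations for 5 missing cells across m = 10 datasets.
# Cells 1-3 are imputed consistently; cells 4-5 vary a lot between datasets.
noise_sd <- c(0.1, 0.1, 0.1, 2, 2)
imputed  <- t(sapply(noise_sd, function(s) rnorm(10, mean = 5, sd = s)))

# Standard deviation of each cell's imputed values across the datasets
cell_sd <- apply(imputed, 1, sd)
print(round(cell_sd, 2))

# Ranking the cells by spread puts the least certain imputations first
print(order(cell_sd, decreasing = TRUE))
```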
To return the imputed data, use the completeData function:
dataList <- completeData(mrModelOutput)
Filled_data<-data.frame(dataList[[1]])
Data_with_missing_values2
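Filled_data keeps only the first of the 10 imputed datasets. One possible way to use all of them is to compute the statistic of interest in each dataset and then look at its mean and spread. The sketch below does this on a hand-made list of completed tables rather than on real completeData() output, and every name and value in it is invented:

```r
# Stand-in for the list returned by completeData(): three completed copies of the
# same table that differ only in the cell that was imputed (invented values)
completed <- list(
  Dataset_1 = data.frame(Country = c("AAA", "BBB"), IND1 = c(1.0, 2.9)),
  Dataset_2 = data.frame(Country = c("AAA", "BBB"), IND1 = c(1.0, 3.1)),
  Dataset_3 = data.frame(Country = c("AAA", "BBB"), IND1 = c(1.0, 3.0))
)

# Compute the same statistic in every completed dataset
means <- sapply(completed, function(d) mean(d$IND1))
print(means)

# Average across datasets, and the between-dataset spread of the estimate
print(mean(means))
print(sd(means))
```

A large between-dataset spread relative to the estimate itself is a sign that conclusions drawn from a single completed dataset should be treated with caution.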
Analysis created by:
Yury E. Garcia
Dr. Fabio Sanchez’s Research Team
CIMPA
Universidad de Costa Rica
Email: epimec.cr@gmail.com