Imputing Missing Data with miceRanger
Last compiled on 09/16/2023
miceRanger performs Multiple Imputation by Chained Equations (MICE) with random forests. It can impute categorical and numeric data without much setup, and has an array of diagnostic plots available. A simple example can be found here. The parameters can be found here
Methodology: MICE Ranger is an extension of the MICE method that uses decision trees (Random Forest) as imputation models. It is particularly effective for imputing missing values in large datasets with mixed data types. Use Case: MICE Ranger is designed for handling missing data in large and complex datasets, such as those commonly encountered in data science and machine learning tasks. It can handle both continuous and categorical data efficiently. Complexity: While it may require more computational resources due to decision tree modeling, it is still relatively easy to use, especially if you are familiar with the MICE framework.
All files are available for download in GitHub
library(tidyverse)
library(readxl)
library(openxlsx)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(imputeTS)
library(miceRanger)
library(downloadthis)
library(stats)
library(VIM)
library(lattice)
library(knitr)
library(DT)
name_file ="_All"
Data1 <- as.data.frame(read.xlsx("./Data/5.Data_with_all_indicators.xlsx"))
Data <-Data1%>%filter(Year%in%2000:2019)
#List of variables
final_variables<-as.data.frame(read.xlsx("./Data/List_of_final_vairables.xlsx",sheet="All"))
We preprocess the data using a filter that selectively keeps data points containing over 80% of the available information, with the additional requirement that the missing data are not consecutive data from the last few years in the series. For that:
we identify the indicators (Rows) that contain between 80% and 100% of the available information.
Among these selected indicators, we pinpoint those where the final year’s data is missing.
Data rows that fail to meet the specified criteria are then removed from the dataset.
Filter data by country.
Finally, we apply the miceRanger algorithm to perform the imputation process.
Note that we have removed data entries solely for countries with a significant amount of missing data. As a result, the final set of independent variables may differ from one country to another
#reshape the file
Original_Data<-Data%>%pivot_wider(names_from = Year,values_from = Values)
#Compute the percentage of data different from NA
per_non_na<-round(rowMeans(!is.na(Original_Data[,3:ncol(Original_Data)])),digits = 2)
#rows with less that 80% of data
data_less_80_per<-which(per_non_na<0.8)
missing_data <-Original_Data[-data_less_80_per,]
# Plot indicators with missing data
plot_missing_values<-function(Data){
ggplot(Data, aes(x = Year, y = Value, color = Country, group = Country)) +
geom_line() +
geom_point(size=2)+
labs(title = "Indicator over time",
x = "Year",
y = "Value",
color = "Country") +
theme_bw() +
facet_wrap(~ Indicator, ncol = 2,scales = "free_y")+
theme(legend.position = "top")
}
# Identify and filter with missing data
Indices_missing_values <-which(per_non_na<1&per_non_na>0.8)
plots_with_missing_values <-Original_Data[Indices_missing_values,]
Data_with_missing_values2<-plots_with_missing_values%>%pivot_longer(!c("Country","Indicator"), names_to = "Year", values_to="Value")
plot_missing_values(Data_with_missing_values2)
We aim to confirm the availability of data for the most recent years. In cases where data is missing for the last few years, we will exclude variables with such missing data. We possess a dataset spanning 20 years, ranging from 2000 to 2019. When missing data aligns with these recent years, the imputation process becomes more uncertain, as we rely more on estimation than direct imputation.
years <- 2000:2019
years <- as.character(years)
missing_data <- missing_data%>%dplyr::select(Country,Indicator,years)
#filter last 4 years
last_four_columns<-missing_data[c("2016","2017","2018","2019")]
#identify the indicator with missing data for more than one year, including the year 2019
aaa <- rowSums(is.na(last_four_columns))
remove_indices <- which(aaa>1&is.na(missing_data[,"2019"]))
print("Total missing years")
## [1] "Total missing years"
aaa[remove_indices]
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 4 3 3 2
#Remove rows
Data_with_more_80_per<-missing_data[-remove_indices,]
MiceRanger_imputation<-function(Data_by_country,columns_with_na){
Data_by_country$Year<-as.numeric(Data_by_country$Year)
# Create the imputation model using miceRanger
mrModelOutput <- miceRanger(Data_by_country[,2:ncol(Data_by_country)],
valueSelector ="value",
cols =columns_with_na,
verbose = FALSE, m=15)
dataList <- completeData(mrModelOutput)
Filled_data<- data.frame(dataList[[1]])
p<-plotDistributions(mrModelOutput,vars='allNumeric')
# Return a list with imputed data and diagnostic figures
result <- list(imputed_data = Filled_data, mice_fig = p)
return(result)
}
per_non_na2 <- rowMeans(!is.na(Data_with_more_80_per[, 3:ncol(Data_with_more_80_per)]))
countries_with_missing_Data<-unique(Data_with_more_80_per$Country[which(per_non_na2<1)])
print(countries_with_missing_Data)
## [1] "ARG" "BOL" "BRA" "CAN" "COL" "CRI" "ECU" "HND" "PAN" "PRY" "USA" "URY"
## [13] "CHL" "CUB"
#filter data without missing information
rows_with_missing_data<-which(per_non_na2<1)
Data_with_all_values<-Data_with_more_80_per[-rows_with_missing_data,]
Data_with_all_values<-Data_with_all_values%>%pivot_longer(!c("Country","Indicator"),names_to = "Year", values_to = "Value")
final_imputed_data<-NULL
for (Coutry_name in countries_with_missing_Data){
#filter data by country
Data_by_country<-Data_with_more_80_per%>%
filter(Country==Coutry_name)%>%
pivot_longer(!c("Country","Indicator"), names_to = "Year",
values_to="Value")%>%
pivot_wider(names_from= Indicator, values_from = Value)
Data_by_country<-as.data.frame(Data_by_country)
#Identify variables with missing data to create a plot
columns_with_na <-colnames(Data_by_country)[colMeans(is.na(Data_by_country))>0]
print("---------------------------------------------------")
print(paste("Country: ", Coutry_name,sep = ""))
for (col in columns_with_na){
print(paste(col," Missing data: ", length(which(is.na(Data_by_country[,col])))))
}
#Aply imputation method
MiceRanger_results <- MiceRanger_imputation(Data_by_country,columns_with_na)
#Access imputed data
MiceRanger_method <- MiceRanger_results$imputed_data
#FIGURES
#Data with missing data
subdata_miceR <- MiceRanger_method[, c("Year",columns_with_na)]
subdata_miceR <- subdata_miceR%>%pivot_longer(!c("Year"),names_to = "Indicator", values_to = "Value")
subdata_miceR$Country<-"Imputation"
subdata_miceR<-subdata_miceR%>%dplyr::select(Country,Year,Indicator,Value)
#Plot data with missing values
Data_plot_missing_data <- Data_by_country[, c("Country", "Year", columns_with_na)]
Data_plot_missing_data <- Data_plot_missing_data%>%pivot_longer(!c("Country","Year"),names_to = "Indicator",values_to = "Value")
#Data with missing and imputed values
Final_data <-rbind(subdata_miceR,Data_plot_missing_data)
#FIGURES
figs <-Plot(Final_data)
fig_dens<-Plot_density(Final_data)
combined_plot <- grid.arrange(figs,fig_dens, ncol = 2)
print(combined_plot)
#Final Imputed Data
final_imputed_file_per_country <-MiceRanger_method[,c("Year",columns_with_na)]
final_imputed_file_per_country$Country <-Data_by_country$Country
final_imputed_file_per_country<-final_imputed_file_per_country%>%pivot_longer(!c("Country", "Year"),names_to = "Indicator", values_to = "Value")
final_imputed_file_per_country <- final_imputed_file_per_country%>%dplyr::select(colnames(Data_with_all_values))
final_imputed_data<-rbind(final_imputed_file_per_country,final_imputed_data)
}
## [1] "---------------------------------------------------"
## [1] "Country: ARG"
## [1] "ECON2 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: BOL"
## [1] "ECON2 Missing data: 2"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: BRA"
## [1] "ECON2 Missing data: 2"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: CAN"
## [1] "ECON2 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: COL"
## [1] "ECON2 Missing data: 2"
## [1] "WHO_COV_1 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: CRI"
## [1] "ECON1 Missing data: 1"
## [1] "WHO_COV_8 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: ECU"
## [1] "ECON1 Missing data: 1"
## [1] "ECON2 Missing data: 2"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: HND"
## [1] "ECON2 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: PAN"
## [1] "ECON1 Missing data: 4"
## [1] "WHO_COV_2 Missing data: 2"
## [1] "WHO_COV_4 Missing data: 2"
## [1] "WHO_COV_9 Missing data: 2"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: PRY"
## [1] "ECON1 Missing data: 1"
## [1] "ECON2 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: USA"
## [1] "ECON8 Missing data: 2"
## [1] "WHO_COV_1 Missing data: 1"
## [1] "WHO_COV_2 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: URY"
## [1] "COV9 Missing data: 3"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: CHL"
## [1] "WHO_COV_2 Missing data: 2"
## [1] "WHO_COV_4 Missing data: 2"
## [1] "WHO_COV_5 Missing data: 2"
## [1] "WHO_COV_6 Missing data: 2"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: CUB"
## [1] "WHO_COV_4 Missing data: 1"
## [1] "WHO_COV_9 Missing data: 1"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
***
Beginning with a dataset containing complete information, random NA values are intentionally introduced to simulate missing data. Following this, the imputation process is applied as previously described.
Note that the closer the density function of the imputed data is to the original data, the better the imputation.
## [1] "---------------------------------------------------"
## [1] "Country: CAN"
## [1] "HEALTH1 Missing data: 3"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: USA"
## [1] "COV7 Missing data: 3"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: BHS"
## [1] "HEALTH3 Missing data: 3"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: ECU"
## [1] "SOC25 Missing data: 3"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## [1] "---------------------------------------------------"
## [1] "Country: LCA"
## [1] "HEALTH1 Missing data: 3"
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Analysis created by:
Dr. Fabio
Sanchez’s Research Team
CIMPA
Universidad de Costa Rica
Email: epimec.cr@gmail.com