1.ImputationData_Suicide

Imputing Missing Data with miceRanger

Last compiled on 09/04/2023

Introduction

miceRanger performs Multiple Imputation by Chained Equations (MICE) with random forests. It can impute categorical and numeric data without much setup, and it provides an array of diagnostic plots. A simple example and the full list of parameters are available in the package documentation.

library(tidyverse)
library(readxl)
library(openxlsx)
library(ggplot2)
library(miceRanger)

##---Upload data-----------------------------------------------
Data  <- as.data.frame(read.xlsx("./Data/3.Suicide_rate_values_female_final.xlsx"))

Data for Female

Following the correlation analysis, the analysis for the female data excludes the variables ECON1, ECON5, and ECON8.

Data$ECON1   <- NULL
Data$ECON5   <- NULL
Data$ECON8   <- NULL

#Remove DMA and KNA because we do not have suicide information for them
country_indices <- which(Data$Country %in% c("DMA", "KNA"))
if (length(country_indices) != 0) {
  filtered_data2 <- Data[-country_indices, ]
} else {
  filtered_data2 <- Data
}

#Reshape the file
reshaped_data1 <- filtered_data2%>%pivot_longer(!c(Country,Year), names_to = "Indicator", values_to ="Value")
reshaped_data2 <- reshaped_data1%>%pivot_wider(names_from=Year, values_from = Value)
reshaped_data2 <- reshaped_data2%>%select("Country", "Indicator", as.character(2000:2019))
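The reshape above goes from one column per indicator to one column per year. A minimal sketch on a hypothetical toy panel (the country codes, indicator names, and values below are invented for illustration) shows what each step produces:

```r
library(tidyr)
library(dplyr)

# Hypothetical toy panel: two countries, two years, two indicators
toy <- data.frame(
  Country = c("AAA", "AAA", "BBB", "BBB"),
  Year    = c(2000, 2001, 2000, 2001),
  IND1    = c(1.0, NA, 3.0, 4.0),
  IND2    = c(10, 20, 30, 40)
)

# Long format: one row per Country-Year-Indicator
toy_long <- toy %>%
  pivot_longer(!c(Country, Year), names_to = "Indicator", values_to = "Value")

# Wide by year: one row per Country-Indicator, one column per year
toy_wide <- toy_long %>%
  pivot_wider(names_from = Year, values_from = Value) %>%
  select(Country, Indicator, all_of(as.character(2000:2001)))

print(toy_wide)
```

In this layout each row is a single time series, which makes it straightforward to count gaps per Country-Indicator pair.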

#Calculate the proportion of non-missing yearly values in each row (one row per Country-Indicator pair)
percent_available <-rowMeans(!is.na(reshaped_data2[,3:ncol(reshaped_data2)]))
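As a small illustration of the row-wise availability calculation, the following base R sketch uses a hypothetical three-row table (the values are invented). A row is flagged as incomplete when its proportion of non-missing values is below 1:

```r
# Hypothetical yearly values for three Country-Indicator rows
years <- data.frame(
  `2000` = c(1, NA, 3),
  `2001` = c(NA, NA, 6),
  `2002` = c(7, 8, 9),
  check.names = FALSE
)

# TRUE/FALSE averages to the share of non-missing cells in each row
available <- rowMeans(!is.na(years))
print(available)

# Rows with at least one missing year
incomplete_rows <- which(available < 1)
print(incomplete_rows)
```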

# Plot indicators with missing data
plot_missing_values<-function(Data){

ggplot(Data, aes(x = Year, y = Value, color = Country, group = Country)) +
  geom_line() +
  geom_point(size=2)+
  labs(title = "Indicator Values Over Years",
       x     = "Year",
       y     = "Value",
       color = "Country") +
  theme_minimal() +
  facet_wrap(~ Indicator, ncol = 1,scales = "free_y")
  
}

# Identify and keep the rows (Country-Indicator pairs) with missing data
Indices_missing_values<-which(percent_available<1)

Data_with_missing_values  <-reshaped_data2[Indices_missing_values,]

Data_with_missing_values2<-Data_with_missing_values%>%pivot_longer(!c("Country","Indicator"), names_to = "Year", values_to="Value")

plot_missing_values(Data_with_missing_values2)

# Save the plot with a specific size
#ggsave(paste("./ResultsMissingData/missing_data_",Name_data,".pdf",sep=""), plot = p, width = 10, height = 15)

Filling missing data using MICE

https://github.com/farrellday/miceRanger

#Identify variables with missing data
columns_with_missing<- c(unique(Data_with_missing_values$Indicator))
print(columns_with_missing)

## [1] "AMB"      "ECON6"    "HEALTH9"  "HEALTH5"  "SOC9"     "HEALTH11" "ECON7"   
## [8] "ECON2"    "ECON3"

#Countries with missing data
countries_with_missing_data <-unique(Data_with_missing_values$Country)
print(countries_with_missing_data)

##  [1] "VEN" "GUY" "ARG" "HND" "HTI" "CHL" "SUR" "ECU" "BRB" "JAM" "COL" "CAN"
## [13] "NIC" "DOM" "ATG" "GRD" "PAN" "LCA" "BHS" "PRY" "GTM" "BOL" "PER" "MEX"
## [25] "BLZ" "CRI" "CUB" "SLV" "TTO" "USA" "VCT" "BRA" "URY"

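#Rebuild the wide table (one column per indicator) used as input for the imputation
#Optional: restrict the data to a single country by adding this step to the pipeline
#filter(Country==countries_with_missing_data[2])%>%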
ampDat <- filtered_data2%>%pivot_longer(!c(Year,Country), names_to = "Indicator", values_to = "Value" )%>%pivot_wider(names_from = Indicator, values_from = Value)

# Create the imputation model using miceRanger
mrModelOutput <- miceRanger(ampDat,valueSelector = "value", vars = columns_with_missing, verbose = FALSE, m = 10)
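Because the imputation is stochastic, results change from run to run unless a seed is set. The sketch below is not the analysis above: it uses the built-in iris data with values removed by miceRanger's amputeData() as a stand-in, and the seed, m, and maxiter values are arbitrary choices for illustration.

```r
library(miceRanger)

set.seed(2023)  # arbitrary seed so the toy run can be repeated

# Toy data: iris with 25% of the values removed at random
ampIris <- amputeData(iris, perc = 0.25)

toyModel <- miceRanger(
  ampIris,
  m             = 3,        # number of imputed datasets
  maxiter       = 5,        # iterations per dataset
  valueSelector = "value",  # use the model prediction directly
  verbose       = FALSE
)

# One completed dataset per imputation
toyFilled <- completeData(toyModel)
print(length(toyFilled))
print(anyNA(toyFilled[[1]]))
```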

Diagnostic Plotting

Distribution of Imputed Values

We can take a look at the imputed distributions compared to the original distribution for each variable:

The red line is the density of the original, nonmissing data. The smaller, black lines are the density of the imputed values in each of the datasets. If these do not match up, it is not necessarily a problem; however, it may indicate that your data were not Missing Completely at Random (MCAR).

plotDistributions(mrModelOutput,vars='allNumeric')

Convergence of Correlation

We are probably interested in knowing how our values between datasets converged over the iterations. The plotCorrelations function shows you a boxplot of the correlations between imputed values in every combination of datasets, at each iteration:

plotCorrelations(mrModelOutput,vars='allNumeric')

Center and Dispersion Convergence

Sometimes, if the missing data locations are correlated with higher or lower values, we need to run multiple iterations for the process to converge to the true theoretical mean (given the information that exists in the dataset). We can see if the imputed data converged, or if we need to run more iterations:

plotVarConvergence(mrModelOutput,vars='allNumeric')
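If the traces have not settled, more iterations can be run on an existing model rather than starting over. The following is a sketch on toy data (iris with amputated values, not the data above), assuming the addIterations() function from miceRanger; the iteration counts are arbitrary.

```r
library(miceRanger)

set.seed(2023)  # arbitrary seed for a repeatable toy run

# Toy data: iris with 25% of the values removed at random
ampIris <- amputeData(iris, perc = 0.25)

# Start with deliberately few iterations
toyModel <- miceRanger(ampIris, m = 2, maxiter = 2, verbose = FALSE)

# Extend the same model by three more iterations
toyModel <- addIterations(toyModel, iters = 3, verbose = FALSE)

toyCompleted <- completeData(toyModel)
print(anyNA(toyCompleted[[1]]))
```

Calling plotVarConvergence(toyModel, vars = 'allNumeric') afterwards shows the extended traces.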

Model OOB Error

Random Forests give us a cheap way to determine model error without cross validation. Each model returns the OOB accuracy for classification, and r-squared for regression. We can see how these converged as the iterations progress:

plotModelError(mrModelOutput,vars='allNumeric')
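To see where these out-of-bag (OOB) figures come from, the sketch below fits random forests directly with the ranger package on the built-in iris data. This is only an illustration of the OOB idea, not the models fitted inside miceRanger; the seed and number of trees are arbitrary.

```r
library(ranger)

set.seed(1)  # arbitrary seed

# Regression forest: ranger reports OOB R-squared and OOB mean squared error
fit_reg <- ranger(Sepal.Length ~ ., data = iris, num.trees = 200)
print(fit_reg$r.squared)
print(fit_reg$prediction.error)

# Classification forest: prediction.error is the OOB misclassification rate,
# so OOB accuracy is its complement
fit_cls <- ranger(Species ~ ., data = iris, num.trees = 200)
oob_accuracy <- 1 - fit_cls$prediction.error
print(oob_accuracy)
```

Each tree is evaluated only on the observations left out of its bootstrap sample, which is why no separate validation split is needed.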

Variable Importance

Now let’s plot the variable importance for each imputed variable. The top axis contains the variable that was used to impute the variable on the left axis.

plotVarImportance(mrModelOutput)

Imputed Variance Between Datasets

We are probably interested in how “certain” we were of our imputations. We can get a feel for the variance of each imputed value between the datasets by using the plotImputationVariance() function:

plotImputationVariance(mrModelOutput,ncol=1,widths=c(5,3))
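One simple way to quantify this by hand is to compute, for each originally missing cell, the standard deviation of its imputed values across the datasets. The base R sketch below uses invented numbers for a single variable and is not necessarily the exact quantity the package plots:

```r
# Hypothetical variable with two missing cells
original <- c(1.0, NA, 3.0, NA)

# Hypothetical completed versions from three imputed datasets
completed <- list(
  c(1.0, 2.1, 3.0, 4.0),
  c(1.0, 1.9, 3.0, 4.6),
  c(1.0, 2.0, 3.0, 4.3)
)

was_missing <- is.na(original)

# Matrix of imputed values: rows are imputed cells, columns are datasets
imputed <- sapply(completed, function(x) x[was_missing])

# Spread of each imputed cell across the datasets
cell_sd <- apply(imputed, 1, sd)
print(cell_sd)
```

A small spread relative to the variable's overall standard deviation suggests the imputations agree with each other.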

Using the Imputed Data

To return the imputed data, simply use the completeData function. It returns a list with one completed dataset per imputation; the first one is used here:

dataList   <- completeData(mrModelOutput)
Filled_data<-data.frame(dataList[[1]])

Plot filled variables

Data_with_missing_values2
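A possible way to plot the filled variables is to reshape a completed dataset to long format and mark which points were imputed. The sketch below uses a hypothetical toy panel with made-up filled values as a stand-in for one completed dataset; the column name Imputed and the helper to_long() are introduced only for this illustration.

```r
library(ggplot2)
library(tidyr)

# Hypothetical toy panel with two gaps in IND1
toy_original <- data.frame(
  Country = rep(c("AAA", "BBB"), each = 3),
  Year    = rep(2000:2002, times = 2),
  IND1    = c(1.0, NA, 3.0, 4.0, 5.0, NA)
)

# Stand-in for one completed dataset; the filled values are invented
toy_filled <- toy_original
toy_filled$IND1[is.na(toy_original$IND1)] <- c(2.1, 5.8)

# Helper: one row per Country-Year-Indicator
to_long <- function(d) {
  pivot_longer(d, !c(Country, Year), names_to = "Indicator", values_to = "Value")
}

toy_long <- to_long(toy_filled)
# A point is imputed if it was missing in the original data
toy_long$Imputed <- is.na(to_long(toy_original)$Value)

p <- ggplot(toy_long, aes(x = Year, y = Value, color = Country, group = Country)) +
  geom_line() +
  geom_point(aes(shape = Imputed), size = 2) +
  facet_wrap(~ Indicator, ncol = 1, scales = "free_y") +
  theme_minimal()
print(p)
```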

Analysis created by:
Yury E. Garcia
Dr. Fabio Sanchez’s Research Team
CIMPA
Universidad de Costa Rica
Email: epimec.cr@gmail.com