This is a short post on a function for visualizing missing data, which I came across in a Kaggle Kernel by user AiO.
The function produces an amazing plot that shows missing data in white and non-missing data in black. Its ability to handle huge amounts of data is quite impressive: I created a plot for a dataset of 500,000 observations within a few seconds using it.
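As a rough illustration of the idea behind the plot, the sketch below builds the 0/1 indicator matrix and the long, one-row-per-cell table that a tile plot needs. It uses base R only, and the data frame `toy` and its values are made up for this example; they are not part of the original kernel.

```r
# Toy data frame (hypothetical, for illustration only)
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# Encode every cell: 0 = missing, 1 = present (one column per variable)
indicator <- ifelse(is.na(toy), 0, 1)

# Number of missing cells per variable
missing_per_column <- colSums(indicator == 0)

# Long format with one row per cell. expand.grid() varies x fastest,
# which matches the column-major order of as.vector() on a matrix.
long <- expand.grid(x = seq_len(nrow(toy)), y = colnames(toy))
long$m <- as.vector(indicator)

print(missing_per_column)
print(head(long))
```

Each row of `long` corresponds to one tile in the plot, with `m` deciding its colour.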
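library(ggplot2)

plot_Missing <- function(data_in, title = NULL) {
  # Encode each cell: 0 = missing, 1 = present
  temp_df <- as.data.frame(ifelse(is.na(data_in), 0, 1))
  # Order columns by number of present values (most missing first);
  # drop = FALSE keeps a data frame even when there is a single column
  temp_df <- temp_df[, order(colSums(temp_df)), drop = FALSE]
  # One row per cell; x varies fastest, matching the column-major vector below
  data_temp <- expand.grid(list(x = seq_len(nrow(temp_df)), y = colnames(temp_df)))
  data_temp$m <- as.vector(as.matrix(temp_df))
  data_temp <- data.frame(
    x = unlist(data_temp$x),
    y = unlist(data_temp$y),
    m = unlist(data_temp$m)
  )
  # Named colours keep 0 = white and 1 = black even if one level is absent
  ggplot(data_temp) +
    geom_tile(aes(x = x, y = y, fill = factor(m))) +
    scale_fill_manual(values = c("0" = "white", "1" = "black"), name = "Missing\n(0=Yes, 1=No)") +
    theme_light() +
    ylab("") +
    xlab("") +
    ggtitle(title)
}

I’ve used the sleep dataset from the VIM package for the demo.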
load("sleep.rda")  # expected to provide the data frame test.df used below
# Count missing values per column and keep only the columns that have any
misscols <- sapply(test.df, function(x) sum(is.na(x)))
misscols <- names(misscols[misscols > 0])
plot_Missing(test.df[, misscols, drop = FALSE])

It is indeed a great visualization of missing data. One great use case I found for it is identifying patterns of missingness that are shared across different variables.
Observing the above plot, we can derive a few interesting insights.
From these insights, we can say that there might be some sort of correlation between the variables “Sleep” and “NonD”, and between “Dream” and “NonD”, which would explain why data is missing for a common set of observations.
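One possible way to back up such a visual impression with numbers is to count how often two variables are missing in the same row. The sketch below does this in base R on a small made-up data frame. The column names are borrowed from the post only for readability, and the values are hypothetical and not taken from the sleep dataset.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  NonD  = c(NA, 2.1, NA, 3.3, NA, 6.3),
  Dream = c(NA, 1.8, 0.7, 3.9, NA, 2.0),
  Sleep = c(3.3, 8.3, NA, 12.5, NA, 16.5),
  Span  = c(38.6, 4.5, 14, NA, 69, 27)
)

# Numeric indicator matrix: 1 where a value is missing, 0 otherwise
miss <- is.na(toy) * 1
# Keep only the columns that contain at least one missing value
miss <- miss[, colSums(miss) > 0, drop = FALSE]

# Entry [i, j] counts the rows where variables i and j are both missing;
# the diagonal holds each variable's own missing count
co_missing <- crossprod(miss)

# Correlation between the missingness indicators of each pair of variables
miss_cor <- cor(miss)

print(co_missing)
print(round(miss_cor, 2))
```

Large off-diagonal counts, or strongly positive correlations, point to variables that tend to be missing for the same observations, which is the kind of pattern the plot makes visible.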