1 Introduction

This is a short post on a function for visualizing missing data, which I came across in a Kaggle kernel by user AiO.

The function produces a plot that shows missing data in white and non-missing data in black. Its ability to handle large amounts of data is quite impressive: I created a plot for a dataset of 500000 observations within a few seconds using it.

2 Function Code

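library(ggplot2)

# Plot a missingness map: one tile per cell, white = missing, black = present.
plot_Missing <- function(data_in, title = NULL) {
  # Encode each cell as 0 (missing) or 1 (present)
  temp_df <- as.data.frame(ifelse(is.na(data_in), 0, 1))
  # Order columns by number of present values (most missing first);
  # drop = FALSE keeps a data frame even when only one column is passed
  temp_df <- temp_df[, order(colSums(temp_df)), drop = FALSE]
  # Long format: one row per (observation, variable) pair
  data_temp <-
    expand.grid(list(x = seq_len(nrow(temp_df)), y = colnames(temp_df)))
  data_temp$m <- as.vector(as.matrix(temp_df))
  # Named colours keep the mapping correct even if only one level occurs
  ggplot(data_temp) + geom_tile(aes(x = x, y = y, fill = factor(m))) +
    scale_fill_manual(values = c("0" = "white", "1" = "black"), name = "Missing\n(0=Yes, 1=No)") + theme_light() + ylab("") + xlab("") +
    ggtitle(title)
}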
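
To make the reshaping inside the function easier to follow, here is a minimal sketch that runs the same encoding and long-format steps on a tiny data frame, using base R only. The data frame `toy` and its values are made up for illustration and are not part of the original kernel.

```r
# Hypothetical toy data: 3 observations, 2 variables (made up for illustration)
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# Step 1: encode each cell as 0 (missing) or 1 (present)
encoded <- as.data.frame(ifelse(is.na(toy), 0, 1))

# Step 2: order columns by number of present values (most missing first)
encoded <- encoded[, order(colSums(encoded)), drop = FALSE]

# Step 3: build one row per (observation, variable) pair;
# x varies fastest, which matches the column-major order of as.matrix()
long <- expand.grid(x = seq_len(nrow(encoded)), y = colnames(encoded))
long$m <- as.vector(as.matrix(encoded))

print(long)
```

Each row of `long` corresponds to one tile in the plot: `x` is the observation index, `y` the variable name, and `m` the 0/1 flag that drives the fill colour.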

2.1 Demo

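I’ve used the sleep dataset from the VIM package for the demo.

# Load the sleep data shipped with the VIM package (assumes VIM is installed)
data(sleep, package = "VIM")
test.df <- sleep

# Keep only the columns that contain at least one missing value
misscols <- sapply(test.df, function(x) sum(is.na(x)))
misscols <- names(misscols[misscols > 0])

plot_Missing(test.df[, misscols])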

2.2 Interpretation

It is indeed a great visualization of missing data. One great use case I found for it is identifying variables whose data is missing for the same observations.

Observing the above plot, we can derive a few interesting insights.

  • The observations with missing data for the variable “Sleep” also have missing data for the variable “NonD”.
  • The observations with missing data for the variable “Dream” also have missing data for the variable “NonD”.

From these insights, we can say that there might be some relationship between the variables “Sleep” and “NonD”, and between “Dream” and “NonD”, that causes data to be missing for the same set of observations.
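
The visual impression of shared missingness can also be checked numerically. The sketch below counts, for every pair of columns, the rows in which both values are missing. The helper `co_missing` is a hypothetical name introduced here, not part of the original kernel, and the example data is made up.

```r
# Hypothetical helper (not from the original kernel): for every pair of
# columns, count the rows in which both values are missing.
co_missing <- function(df) {
  na_mat <- is.na(df) * 1   # logical matrix -> 0/1 numeric matrix
  crossprod(na_mat)         # t(na_mat) %*% na_mat
}

# Made-up example data, for illustration only
toy <- data.frame(
  p = c(NA, NA, 3, 4),
  q = c(NA, NA, NA, 4),
  r = c(1, 2, 3, NA)
)

counts <- co_missing(toy)
print(counts)
```

The diagonal holds the number of missing values per column. When an off-diagonal entry equals the diagonal entry of one of its columns, as for `p` and `q` here, every row missing that column is also missing the other. On the demo data this could be called as `co_missing(test.df[, misscols])`.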

3 Credits

Kaggle user: AiO.
The source code can be found in his kernel.
You can also find some great exploratory data analysis visualizations in it.