Something Missing?

A common task in data analysis is cleaning up the data set. Some practitioners estimate that this takes up 80% of the time in any analytical project. So it helps to get a clear view of the problem, starting with where values are missing.

Nicholas Tierney created a clever way to look at it. He came up with a “missing data map” to help analysts see what they might have overlooked.

And here’s the code!

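In this example, NAs are inserted at random positions to illustrate missing data: r_na() from the wakefield package is called with prob = 0.33, so roughly a third of the values in the affected columns go missing.

The first graphic, produced by missmap() from the Amelia package, is colorful but the formatting is a bit funky.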

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(wakefield)
## 
## Attaching package: 'wakefield'
## 
## The following object is masked from 'package:dplyr':
## 
##     id
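# Simulate 100 rows of demographic data, then blank out values at random
# (prob = 0.33: roughly a third of the values in the affected columns become NA)
df <- r_data_frame(
        n = 100,
        id, race, age, sex, hour, iq, height, died,
        Scoring = rnorm,
        Smoker = valid
) %>%
        r_na(prob = 0.33)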
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2016 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(df)
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
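
A numeric tally can complement the map. The following is a minimal sketch in base R; the small data frame `toy` is a hypothetical stand-in, not the wakefield data generated above, and the variable names are illustrative only.

```r
# Hypothetical toy data (an assumption for illustration, not the df above)
toy <- data.frame(
  id  = 1:5,
  age = c(23, NA, 31, NA, 40),
  sex = c("F", "M", NA, "M", "F")
)

# Number of missing values in each column
n_missing <- colSums(is.na(toy))

# Share of missing values in each column
pct_missing <- colMeans(is.na(toy))

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))

print(n_missing)
print(round(pct_missing, 2))
print(n_complete)
```

The same three calls should work on `df` as well; they give the counts that the map shows visually.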

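With ggplot2 it is possible to jazz things up and create a prettier picture, even if it is monochromatic. User friendliness is important. Note that the layout is more intuitive as well: the variables run along the x-axis in their original order, with the ID in the left-most column, and each row of the plot is one observation.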

library(reshape2)
library(ggplot2)
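# Draw a map of missing values: one tile per cell, shaded by whether it is NA
ggplot_missing <- function(x) {
        x %>%
                is.na() %>%     # logical matrix, TRUE where a value is missing
                melt() %>%      # long format: Var1 = row, Var2 = column, value = TRUE/FALSE
                ggplot(data = ., aes(x = Var2, y = Var1)) +
                geom_raster(aes(fill = value)) +
                scale_fill_grey(name = '', labels = c('Present', 'Missing')) +
                theme_minimal() +
                theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
                labs(x = 'Variables in Dataset', y = 'Rows / Observations')
}
ggplot_missing(df)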
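
To see what the function feeds to ggplot, it helps to look at the reshaping step on its own. The sketch below mimics the `is.na` plus `melt` step in base R, so it runs without reshape2. The data frame `toy` is a hypothetical example, and the hand-built `long` table is only an approximation of what `melt` returns (same column names `Var1`, `Var2`, `value`).

```r
# Hypothetical toy data (an assumption for illustration)
toy <- data.frame(
  id  = 1:3,
  age = c(23, NA, 31),
  sex = c("F", "M", NA)
)

# Logical matrix: TRUE where a value is missing
miss <- is.na(toy)

# Unpivot to one row per cell, column by column
long <- data.frame(
  Var1  = as.vector(row(miss)),                    # row index of the cell
  Var2  = factor(colnames(miss)[as.vector(col(miss))],
                 levels = colnames(miss)),         # column name, original order kept
  value = as.vector(miss)                          # TRUE if the cell is NA
)

print(long)
```

Each row of `long` becomes one tile in the plot: `Var2` picks the column on the x-axis, `Var1` the row on the y-axis, and `value` the fill.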

This code was adapted from “ggplot your missing data” by Nicholas Tierney.