A common task in data analysis is cleaning up the data set. Some practitioners estimate that it takes up 80% of an analytical project, so it helps to see the extent of the problem up front.
Nicholas Tierney created a clever way to look at it. He came up with a “missing data map” to help analysts see what they might have overlooked.
In this example, wakefield's r_na() randomly replaces about a third of the values (prob = 0.33) with NA to simulate missing data.
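To make the idea concrete, here is a minimal base R sketch of that kind of random NA injection. It is not wakefield's implementation: the data frame `toy`, the probability `p`, and the seed are assumptions made for illustration, and the ID column is left intact on purpose.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical toy data: 100 rows with an ID and two measured variables
toy <- data.frame(
  id  = 1:100,
  age = sample(20:60, 100, replace = TRUE),
  iq  = round(rnorm(100, mean = 100, sd = 15))
)

p <- 0.33  # chance that any single value is blanked out

# For every column except the ID, draw one uniform number per row and
# set the value to NA wherever the draw falls below p
for (col in setdiff(names(toy), "id")) {
  toy[[col]][runif(nrow(toy)) < p] <- NA
}

# Share of missing values per column; should land near p, not exactly on it
colMeans(is.na(toy))
```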
The first graphic is colorful but the formatting is a bit funky.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(wakefield)
##
## Attaching package: 'wakefield'
##
## The following object is masked from 'package:dplyr':
##
## id
df <- r_data_frame(n = 100, id, race, age, sex, hour, iq, height, died,
                   Scoring = rnorm, Smoker = valid) %>%
  r_na(prob = 0.33)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2016 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(df)
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
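The warning above comes from the `if (class(obj) == "amelia")` test quoted in the message: when an object carries more than one class, the comparison yields several values and `if()` only looks at the first. The base R sketch below reproduces that situation with a hand-built stand-in. The object `obj` is hypothetical, and it is an assumption that the wakefield data frame carries tibble-style classes. Note that R 4.2.0 and later treat a length > 1 condition in `if()` as an error rather than a warning.

```r
# Stand-in for the wakefield output: an object carrying several classes,
# as tibbles do. Built by hand so the sketch needs no extra packages.
obj <- structure(
  list(id = 1:3, iq = c(101, NA, 97)),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = 1:3
)

# class() returns all three names, so the comparison below is a
# length-3 logical vector rather than a single TRUE/FALSE
n_classes  <- length(class(obj))
comparison <- class(obj) == "amelia"

# inherits() always returns a single logical, whatever the class vector
is_amelia <- inherits(obj, "amelia")

# as.data.frame() drops the extra classes, leaving a plain data frame
plain       <- as.data.frame(obj)
plain_class <- class(plain)
```

If that assumption holds, calling `missmap(as.data.frame(df))` should sidestep the warning with this Amelia version, though that has not been verified here.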
It is possible to jazz things up and create a prettier picture, even if it is monochromatic; user-friendliness matters. The layout is also more intuitive, with the variables along the x-axis and the ID in the left-most column.
library(reshape2)
library(ggplot2)
ggplot_missing <- function(x) {
  # Melt the logical NA matrix to long form (Var1 = row, Var2 = variable), then plot it
  x %>%
    is.na %>%
    melt %>%
    ggplot(data = ., aes(x = Var2, y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = '', labels = c('Present', 'Missing')) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
    labs(x = 'Variables on Dataset', y = 'Rows / Observations')
}
ggplot_missing(df)
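A map shows where the gaps are; a per-column tally shows how large they are. As a complement, here is a small base R sketch that counts missing values per variable. It uses the built-in `airquality` data so that it runs on its own; `na_summary` and its column names are illustrative, and the same calls should work on `df`.

```r
# `airquality` ships with R and contains real missing values
data(airquality)

# Number of missing values in each column
na_count <- colSums(is.na(airquality))

# Share of missing values in each column, as a percentage
na_pct <- round(100 * colMeans(is.na(airquality)), 1)

# Combine both into one small table, most incomplete columns first
na_summary <- data.frame(
  variable  = names(na_count),
  missing   = na_count,
  percent   = na_pct,
  row.names = NULL
)
na_summary <- na_summary[order(-na_summary$missing), ]
print(na_summary)
```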