Updated on 11/6/17: Fixed a bug that prevented some users from using the heatmap function. Note: During the R Hangouts for Beginners October meeting on 10/3/17, a bug prevented many attendees from using the heatmap function. The cause was that newer versions of the tidyverse package (at least 1.1.1) and perhaps the plotly package were required. To resolve this, the package installation in the next step is no longer conditional. With this change, the heatmap function should work for everyone, provided all the code below is run in the order presented.
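If you want to confirm whether an already installed package meets a minimum version such as the 1.1.1 mentioned in the note, a check along these lines may help. This is only a minimal sketch: `meets_minimum()` is a hypothetical helper and is not part of the original code.

```r
# Hypothetical helper (not from the original post): returns TRUE or FALSE
# if `pkg` is installed and at or above `minimum`, and NA if `pkg` is not
# installed at all.
meets_minimum <- function(pkg, minimum) {
  # system.file() returns "" when the package cannot be found
  if (system.file(package = pkg) == "") {
    return(NA)
  }
  packageVersion(pkg) >= minimum
}

meets_minimum("tidyverse", "1.1.1")
```

An `NA` or `FALSE` result would suggest running the installation step below before continuing.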
# Install packages
install.packages("tidyverse", repos='http://cran.us.r-project.org')
install.packages("plotly", repos='http://cran.us.r-project.org')
# Load necessary packages
library(tidyverse)
library(plotly)
# Download data files from GitHub repository
download.file("https://github.com/haroldgil/SyS-tools/raw/master/data/syn_miss_fields.csv", "syn_miss_fields.csv")
syn_miss_fields = read_csv("syn_miss_fields.csv")
download.file("https://github.com/haroldgil/SyS-tools/raw/master/data/syn_miss_source.csv", "syn_miss_source.csv")
syn_miss_source = read_csv("syn_miss_source.csv")
dq_complete_plot <- function(data, by_var, title, miss = NULL, margin = NULL) {
  # Flag each value as missing (NA or listed in `miss`), then compute the
  # percent of non-missing values per field within each level of `by_var`
  data_miss <- data %>%
    as_tibble() %>%
    mutate_at(.vars = vars(-matches(by_var)), .funs = funs(replace(., . %in% miss, NA) %>% is.na)) %>%
    group_by_(by_var) %>%
    summarize_all(.funs = funs(((1 - mean(.)) * 100) %>% round(2)))

  # Reshape into a matrix with fields as rows and `by_var` levels as columns
  data_miss_mat <- data_miss %>% select_(paste("-", by_var, sep = "")) %>% as.matrix() %>% t()
  colnames(data_miss_mat) <- select(data_miss, by_var) %>% pull
  rownames(data_miss_mat) <- setdiff(names(data_miss), by_var)

  # Draw the heatmap on a fixed 0-100 color scale (red = low, green = high)
  out_dq_plot <- plot_ly(
    x = colnames(data_miss_mat),
    y = rownames(data_miss_mat),
    z = data_miss_mat,
    colors = colorRamp(c("red", "yellow", "green")),
    type = "heatmap"
  ) %>%
    colorbar(limits = c(0, 100)) %>%
    layout(title = title, xaxis = list(title = NULL), yaxis = list(title = NULL), autosize = TRUE, margin = margin)

  return(out_dq_plot)
}
The function options for dq_complete_plot() are:
data: the dataset.
by_var: the name of the variable to stratify by, given as a string.
title: the title for your plot.
miss [Optional]: an atomic vector listing which values to classify as missing (besides NA, which counts as missing by default). Example: miss = c(" ", "blank", "missing", "X").
margin [Optional]: a list with margin (l: left, r: right, b: bottom, t: top) and padding sizes specified in pixels. Example: margin = list(l = 200, r = 50, b = 100, t = 25, pad = 4).
Now let's use it.
dq_complete_plot(data = syn_miss_fields, by_var = "Source", title = "Percent Completeness of Fields by Source")
Notice that the Initial.Temperature and City fields are missing about 20-25% of their values for all data sources. It’s a little harder to tell, but the Age field is also missing about 10% of its values across data sources. You can see the numerical percent completeness values by hovering over the corresponding square of the heatmap.
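If you would rather read the percentages as a table than hover over each square, the same calculation can be sketched in base R. The helper `completeness_table()` and the `toy` data frame below are hypothetical illustrations, not part of the original code or data. The helper is meant to mirror what dq_complete_plot() computes before plotting, under the assumption that a value is missing when it is NA or listed in `miss`.

```r
# Hypothetical helper (not from the original post): percent of non-missing
# values per field within each level of `by_var`, with fields as rows and
# `by_var` levels as columns, matching the layout of the heatmap.
completeness_table <- function(data, by_var, miss = NULL) {
  data <- as.data.frame(data)
  fields <- setdiff(names(data), by_var)
  pct <- sapply(data[fields], function(x) {
    # TRUE where a value is NA or one of the values listed in `miss`
    is_missing <- is.na(x) | x %in% miss
    # Average the missingness flags within each group
    by_group <- split(is_missing, data[[by_var]])
    vapply(by_group, function(flag) round((1 - mean(flag)) * 100, 2), numeric(1))
  })
  t(pct)
}

# Made-up data for illustration only
toy <- data.frame(
  Source = c("A", "A", "B", "B"),
  Age    = c(34, NA, 51, 28),
  City   = c("Austin", "X", NA, "Dallas"),
  stringsAsFactors = FALSE
)

completeness_table(toy, "Source")
```

In this toy example, Age is 50% complete for source A and 100% complete for source B, while City is 100% complete for A and 50% complete for B. Assuming the downloaded data loaded as expected, a call such as completeness_table(syn_miss_fields, "Source") should give the numbers behind the heatmap.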
Let’s look at a different dataset. Let’s also specify the miss and margin options to see them at work.
dq_complete_plot(data = syn_miss_source, by_var = "Source", title = "Percent Completeness of Fields by Source", miss = c(" ", "X", "blank", "sorry, not sorry"), margin = list(l=150, r=50, b=100, t=25, pad=4))
In this dataset, you can see that something is up with data sources B and F. Initial.Temperature is also missing a bit of data, like before. Note that by increasing the left margin size, we can now see the full field names. Also note that any entries in the dataset that match a value listed in miss, such as “X” or “blank”, count as missing values.
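To see how the miss option treats such entries, here is a small sketch of the recoding step the function applies to each field. The `city` values are made up for illustration and do not come from syn_miss_source.csv.

```r
# Made-up values for illustration only
city <- c("Austin", "X", "blank", NA, "Dallas", " ")
miss <- c(" ", "X", "blank", "sorry, not sorry")

# Replace every value listed in `miss` with NA, as dq_complete_plot() does
recoded <- replace(city, city %in% miss, NA)

# Percent completeness after recoding: 2 of 6 values remain
round(mean(!is.na(recoded)) * 100, 2)  # 33.33
```

Because `%in%` matches exactly, a value such as "x" or "Blank" would not be treated as missing unless it is also listed in miss.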