The content of this blog post is based on examples, notes, and experiments related to the material presented in the “Building Data Visualization Tools” module of the “Mastering Software Development in R” Specialization (Coursera) created by Johns Hopkins University [1].
Packages used for running the examples…
# If a package is not installed yet, run
# install.packages("packageName")
# Load packages
library(ggplot2)
library(dplyr) # supporting data manipulation
library(gridExtra) # adding extra features for plotting
library(ggthemes) # extra themes (based on ggplot2)
Data used for the examples…
# install.packages("dlnm")
library(dlnm)
# Data used for this example is chicagoNMMAPS
# Daily Mortality Weather and Pollution Data for Chicago (dataset)
# ?chicagoNMMAPS #for more info about the data
data("chicagoNMMAPS")
chic <- chicagoNMMAPS
# selecting only data for July 1995
chic_july <- chic %>%
  filter(month == 7 & year == 1995)
# install.packages("faraway")
library(faraway)
data("worldcup")
# Data on players from the 2010 World Cup
# ?worldcup #for more info about the data
The main objective of a visualization is to tell an interesting and engaging story with data. Plots/graphs help to visualize what the data have to say.
Data is a representation of real life: the manifestation of specific behaviors/events. There are stories and meanings behind the data. Sometimes those stories are simple and straightforward; other times they are complex and difficult to understand.
Data can be used to tell amazing stories, and plots/graphics are one of the means of telling these stories, or part of them.
When exploring the data, there are two main things to look for: patterns and relationships. These are what good graphics/plots try to capture. Always question what you see and ask “Does it make sense?”. Data checking and verification are among the most important tasks when looking for stories in the data.
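Six simple guidelines that can be used to create good graphics/plots (based on the works of Edward Tufte, Howard Wainer, Stephen Few, Nathan Yau) are: aim for a high data density, use clear and meaningful labels, add useful references, highlight interesting aspects of the data, use small multiples, and make the order meaningful.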
Try to increase, as much as possible, the data-to-ink ratio of your graph: the ratio of “ink” providing information to all ink used in the figure. Increasing the data-to-ink ratio makes it easier for users to see the message in the data, as in the example below.
base_plot <- ggplot(data = chic_july, mapping = aes(x = date, y = death)) +
  scale_y_continuous(limits = c(0, 500))
# Lower Data Density
plot_1 <- base_plot +
  geom_area(fill = "black") + ggtitle("Lower Data Density")
# Higher Data Density
plot_2 <- base_plot +
  geom_line() + ggtitle("Higher Data Density")
grid.arrange(plot_1, plot_2, ncol = 2)
Themes can be used to manipulate the data density of a graphic/plot; selecting an appropriate theme can help increase the data-to-ink ratio, as in the examples below.
plot_1 <- base_plot +
  geom_point() + ggtitle("Default Theme")
plot_2 <- base_plot +
  geom_point() + ggtitle("theme_bw") + theme_bw()
plot_3 <- base_plot +
  geom_point() + ggtitle("theme_few") + theme_few()
plot_4 <- base_plot +
  geom_point() + ggtitle("theme_tufte") + theme_tufte()
plot_5 <- base_plot +
  geom_point() + ggtitle("theme_538") + theme_fivethirtyeight()
plot_6 <- base_plot +
  geom_point() + ggtitle("theme_solarized") + theme_solarized()
grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, plot_6, ncol = 2)
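Beyond switching whole themes, individual theme elements can be switched off to remove non-data ink. The following is a minimal sketch, not part of the original examples: it uses the economics dataset bundled with ggplot2 as a stand-in (an assumption, made so the block runs without dlnm) and drops the minor grid lines and the panel border from theme_bw.

```r
library(ggplot2)

# economics ships with ggplot2; it stands in for chic_july (assumption)
less_ink <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  theme_bw() +
  # Drop non-data ink: minor grid lines and the box around the panel
  theme(panel.grid.minor = element_blank(),
        panel.border = element_blank()) +
  ggtitle("theme_bw with less non-data ink")
```

Printing less_ink draws the plot. Which elements count as removable depends on the audience, so this is one possible choice rather than a rule.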
The default behavior of ggplot2 is to use the column names as labels for the x- and y-axis. This behavior is acceptable when performing EDA, but it is not adequate for graphics/plots used within reports, presentations, or papers. Labels should be clear and meaningful.
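As a minimal sketch of the difference, the block below builds the same plot with default and with explicit labels. It uses the mpg dataset bundled with ggplot2 as a stand-in (an assumption, so the block runs on its own); the label texts follow the units given in the dataset documentation.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the data used elsewhere in this post (assumption)
default_labels <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()   # axes are labelled "displ" and "hwy"

# Same plot with explicit axis labels that include units
clear_labels <- default_labels +
  xlab("Engine displacement (litres)") +
  ylab("Highway fuel economy (miles per gallon)")

# labs() sets axis labels and the title in a single call
clear_labels_alt <- default_labels +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel economy (miles per gallon)",
       title = "Larger engines use more fuel")
```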
Strategies that can be used to make labels clearer and more meaningful:
Use the xlab and ylab functions to customize your labels (alternatively, e.g., scale_x_continuous).
Data is easier to interpret when you add references.
base_plot <- ggplot(data = chic_july, mapping = aes(x = date, y = death))
# Without reference lines
plot_1 <- base_plot +
  geom_line() + theme_bw() + ggtitle("No Reference")
# With two horizontal reference lines, drawn before the data so they stay in the background
plot_2 <- base_plot +
  geom_hline(yintercept = 120, color = "gray") +
  geom_hline(yintercept = 90, color = "gray") +
  geom_line() + theme_bw() +
  ggtitle("Reference")
grid.arrange(plot_1, plot_2, ncol = 2)
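A reference does not have to be a fixed horizontal line. As a minimal sketch (using the mpg dataset bundled with ggplot2 as a stand-in, which is an assumption and not the data of the example above), a smoothed fit and a semi-transparent vertical line at the median can both serve as references:

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for chic_july (assumption)
median_displ <- median(mpg$displ)

with_reference <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Vertical reference line at the median displacement, made lighter with alpha
  geom_vline(xintercept = median_displ, color = "gray40", alpha = 0.5) +
  # Smoothed fit as a reference for the overall trend (loess, no confidence band)
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "gray60") +
  # Data layer added last so it is drawn on top of the references
  geom_point() +
  theme_bw() +
  ggtitle("Reference")
```

Adding the reference layers before the data layer keeps them in the background, the same ordering used in the example above.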
Strategies to add references to a graphic/plot:
geom_smooth, to add a smoothed fit line
geom_hline and geom_vline, to add horizontal or vertical lines
geom_abline, to add a line with an intercept and slope
geom_polygon, to add a filled polygon
geom_path, to add an unfilled polygon
alpha, to add transparency to the reference elements
Consider adding elements to highlight specific aspects of the data.
base_plot <- ggplot(data = chic_july, mapping = aes(x = date, y = death))
# Hypothesis: a heat wave was present during the period 14.07 - 17.07
# No highlight of this interesting aspect
plot_1 <- base_plot +
  geom_line() + theme_bw() + ggtitle("No Highlight")
# With highlight of this interesting aspect
# annotate() draws the segment once; constants inside aes() would redraw it for every row
plot_2 <- base_plot +
  annotate("segment",
           x = as.Date("1995-07-14"), xend = as.Date("1995-07-17"),
           y = max(chic_july$death) + 10, yend = max(chic_july$death) + 10,
           color = "red", size = 3) +
  geom_line() + theme_bw() +
  ggtitle("With Highlight of Heat Wave period")
grid.arrange(plot_1, plot_2, ncol = 2)
Geoms like geom_segment, geom_line, and geom_text are quite useful for highlighting interesting aspects of the graph.
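As a sketch of highlighting with geom_text, the block below marks and labels the highest value of a time series. It uses the economics dataset bundled with ggplot2 as a stand-in (an assumption, so the block runs without dlnm); the label format and offsets are illustrative choices.

```r
library(ggplot2)

# economics ships with ggplot2; it stands in for chic_july (assumption)
# Keep only the row with the highest unemployment count
peak <- economics[which.max(economics$unemploy), ]

highlighted <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  # Mark the peak with a red point
  geom_point(data = peak, color = "red", size = 3) +
  # Label the peak with its month and year, placed to the left of the point
  geom_text(data = peak, aes(label = format(date, "%b %Y")),
            hjust = 1.2, color = "red") +
  theme_bw() +
  ggtitle("With Highlight of the peak")
```

Passing a one-row data frame to the highlighting layers keeps the point and the label tied to the data instead of hard-coded coordinates.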
Small multiples are graphs that use many small plots to show different subsets of the data. All plots use the same x- and y-ranges, making it easier to compare across plots.
The facet_grid and facet_wrap functions offer a simple way to create small multiples for the data (see the facets section in [3]). Often, when faceting, it is necessary to rename or re-order the factor levels of categorical features to make the graphs easier to read and interpret.
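A minimal sketch of small multiples with renamed and re-ordered factor levels follows. It uses the mpg dataset bundled with ggplot2 as a stand-in (an assumption); the level order and label texts are illustrative choices based on the dataset documentation.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the data used elsewhere in this post (assumption)
mpg_labelled <- mpg
# Rename the terse drive-train codes and set the order in which panels appear
mpg_labelled$drv <- factor(mpg_labelled$drv,
                           levels = c("f", "r", "4"),
                           labels = c("Front-wheel drive",
                                      "Rear-wheel drive",
                                      "Four-wheel drive"))

# One panel per drive train; facet_wrap keeps the x- and y-ranges fixed by default
small_multiples <- ggplot(mpg_labelled, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv, ncol = 3) +
  theme_bw()
```

Panels follow the order of the factor levels, so changing the levels argument is enough to rearrange them.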
Adding order to plots can help highlight interesting findings/aspects in the data. When working with categorical features (factors), the default ordering (e.g. alphabetical order) is often not interesting and needs to be changed to something more meaningful, as in the example below.
worldcup_data <- worldcup %>%
  group_by(Team) %>%
  summarise(mean_time = mean(Time))
# Default ordering of the Team (categorical)
plot_1 <- ggplot(data = worldcup_data, mapping = aes(x = mean_time, y = Team)) +
  geom_point() +
  theme_bw() +
  xlab("Mean time per player (minutes)") +
  ylab("") +
  ggtitle("Alphabetical Order")
# With a more meaningful order - by mean_time
plot_2 <- worldcup_data %>%
  arrange(mean_time) %>%
  # reorganize the levels in Team (factor) before plotting
  mutate(Team = factor(Team, levels = Team)) %>%
  ggplot(mapping = aes(x = mean_time, y = Team)) +
  geom_point() +
  theme_bw() +
  xlab("Mean time per player (minutes)") +
  ylab("") +
  ggtitle("Meaningful Order - by mean_time")
grid.arrange(plot_1, plot_2, ncol = 2)
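An alternative to re-building the factor by hand is the base R function reorder, which orders the levels of a categorical variable by a summary of a second variable. A minimal sketch, using the mpg dataset bundled with ggplot2 as a stand-in (an assumption, so the block runs without faraway):

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the worldcup data (assumption)
# Mean highway fuel economy per manufacturer
mean_hwy <- aggregate(hwy ~ manufacturer, data = mpg, FUN = mean)

# reorder() sorts the manufacturer levels by their mean hwy value (ascending)
ordered_plot <- ggplot(mean_hwy, aes(x = hwy, y = reorder(manufacturer, hwy))) +
  geom_point() +
  theme_bw() +
  xlab("Mean highway fuel economy (miles per gallon)") +
  ylab("") +
  ggtitle("Meaningful Order - by mean hwy")
```

Because the ordering happens inside aes, the data frame itself is left unchanged, which can be convenient when the same data feeds several plots.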
[1] “Mastering Software Development in R” by Roger D. Peng, Sean Kross and Brooke Anderson, 2017
[2] “Building Data Visualization Tools (Part 1): basic plotting with R and ggplot2” by Pier Lorenzo Paracchini
[3] “Building Data Visualization Tools (Part 2): ‘ggplot2’, essential concepts” by Pier Lorenzo Paracchini