Base R Visualization. For an introduction to data visualization in R, see Visualization in Base R.
Introduction to ggplot2: Part I. See Visualization with ggplot2: Part I.
Introduction to ggplot2: Part II. See Visualization with ggplot2: Part II.
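This is the final introductory session on R package ggplot2. We’ll introduce the coordinates layer, axis limits and scales, faceting, labels, and overlaid plots.
In this session, we’ll use local public construction project records from the Hancock and Lakeview renovation projects.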
You can find these datasets and documentation in this GitHub repository. These are the same data gathered by Legal Services of Central New York (LSCNY) and scraped, analyzed, and visualized for Legal Services’ and Urban Jobs Task Force’s collaborative publication: “Building Equity in the Construction Trades: A Racial Equity Impact Statement”.
Run the following code in R to install and load packages required for this session.
readr is used for importing and exporting data
dplyr is built around a grammar of data manipulation, inspired by SQL
scales allows one to easily format variables, e.g. percents and dollar amounts
ggthemes, introduced last session, expands the ggplot2 library of preset themes

if(!require("readr")){install.packages("readr")}
if(!require("dplyr")){install.packages("dplyr")}
if(!require("scales")){install.packages("scales")}
if(!require("ggplot2")){install.packages("ggplot2")}
if(!require("ggthemes")){install.packages("ggthemes")}
library(readr)
library(dplyr)
library(scales)
library(ggplot2)
library(ggthemes)
Run the following code to read in (i.e. import) our practice data:
hc contains all scraped records for Hancock renovations
lv contains all scraped records for Lakeview renovations

url <- paste0("https://raw.githubusercontent.com/jamisoncrawford/",
"wealth/master/Tidy%20Data/hancock_lakeview_tidy.csv") # URL containing data
hc <- read_csv(file = url,
col_types = "ccDcccddddccliii") %>% # Abbreviations specifying data classes
filter(project == "Hancock") # Filters data for "Hancock" only
lv <- read_csv(file = url,
col_types = "ccDcccddddccliii") %>% # Abbreviations specifying data classes
filter(project == "Lakeview") # Filters data for "Lakeview" only
rm(url) # Removes object `url`
Let’s observe Hancock’s worker race versus individual gross incomes per working period.
We’re using the three essential layers for any plot, plus one additional layer for clarity:
ggplot() calls the data required for the data layer
aes() specifies which variables to map in the aesthetics layer
geom_jitter() specifies a jittered scatter plot for the geometries layer
theme_light() isn’t essential, but it applies a preset theme

ggplot(hc, # Call data layer
aes(x = race,
y = gross)) + # Map variables in aesthetics layer
geom_jitter(width = 0.2,
alpha = 0.3,
color = "tomato") + # Specify geometry layer, with some attributes
theme_light() # Unessential: A preset theme for clarity
In general, the Coordinates layer follows a few rules:
Function names begin with coord_; type this in-console to view functions via autocomplete
coord_cartesian() is the most common and modifies a plot’s Cartesian plane

No, we’re not politely cursing our coordinates.
Instead, we swap the x- and y-axes, which is a best practice when using categorical variables. Why? For one, category labels are easier to read when they run horizontally.
We can use function coord_flip() to achieve this. Note that the x = and y = mappings in aes() stay the same; coord_flip() only swaps the axes on which they are drawn.
ggplot(hc,
aes(x = race,
y = gross)) +
geom_jitter(width = 0.2,
alpha = 0.3,
color = "tomato") +
coord_flip() + # Swap x- and y-axes with `coord_flip()`
theme_light()
We can “zoom in” on a part of our plot with function coord_cartesian():
We’ll compare gross and net income
net includes income from other projects
We’ll add a trend line with geom_smooth()
Then we’ll zoom in with coord_cartesian()

ggplot(hc,
aes(x = net, # Note the substitution of `net` for `race`
y = gross)) +
geom_point() + # We don't need jitter, so we use `geom_point()`
geom_smooth() + # By default, this plots a Loess curve and standard error
theme_light()
Now, we’ll use coord_cartesian() to specify where we’d like to “zoom in”.
Specify each range as a vector with c()
Set the x-axis range with xlim = and y-axis range with ylim =

ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth() +
coord_cartesian(xlim = c(0, 1000), # We may specify limits for one or both axes
ylim = c(0, 2000)) + # The range of limits in `c()` can differ
theme_light()
While coord_cartesian() allows one to “zoom in”, limit functions will “filter” unseen data.
xlim() controls the x-axis limits
ylim() controls the y-axis limits
lims() accepts both x = and y = arguments
Each range is given as a vector with c()

ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth() +
lims(x = c(0, 1000),
y = c(0, 2000)) + # Alternatively: xlim(c(0, 1000)) + ylim(c(0, 2000))
theme_light()
Questions. Why has the Loess curve changed? Why has the standard error changed?
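One way to explore these questions is to inspect the values that each plot actually hands to geom_smooth(). The following is a minimal sketch, assuming a small made-up data frame (toy, not the Hancock records) and a linear fit so the result is reproducible; layer_data() returns the computed data for a given layer.

```r
library(ggplot2)

# Hypothetical toy data, not the Hancock records: y jumps sharply once x exceeds 5
toy <- data.frame(x = seq(1, 10, by = 1),
                  y = c(1, 2, 3, 4, 5, 60, 70, 80, 90, 100))

base <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

zoomed   <- base + coord_cartesian(xlim = c(0, 5))  # Zoom: all 10 rows still inform the fit
filtered <- base + lims(x = c(0, 5))                # Filter: rows with x > 5 are dropped first

fit_zoomed   <- layer_data(zoomed, 2)               # Layer 2 is geom_smooth()
fit_filtered <- layer_data(filtered, 2)             # Expect a warning about removed rows

range(fit_zoomed$x)    # The fitted line spans the full data
range(fit_filtered$x)  # The fitted line spans only the retained rows
```

In the filtered plot the model is estimated from fewer points, so the fitted line changes, and with the default se = TRUE its standard error band would typically change as well.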
Modifying the scales of the x- and y-axes requires you to follow a few rules, as well:
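Function names begin with scale_; explore with autocomplete
Next comes the axis: scale_x_ or scale_y_
Last comes the variable type, e.g. scale_x_discrete() or scale_y_continuous()
Axis titles are set with name = and tick labels with labels =

ggplot(hc,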
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth() +
scale_x_continuous(name = "Unique Net Payments") + # Specify continuous x-axis
scale_y_continuous(name = "Unique Gross Payments") + # Specify continuous y-axis
theme_light()
Labels. Above, we’ve simply modified the axis titles with argument name =.
We can use R package scales to easily format the data labels, which may:
Add thousands separators (labels = comma)
Format dollar amounts (labels = dollar) or percents (labels = percent)

library(scales)
ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth() +
scale_x_continuous(name = "Unique Net Payments",
labels = dollar) + # `scales` makes formatting easy!
scale_y_continuous(name = "Unique Gross Payments",
labels = dollar) +
theme_light()
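The labels = argument accepts a function that turns break values into text, which is why dollar can be passed without parentheses. As a quick illustration, calling the scales helpers directly on a few made-up numbers shows what they return:

```r
library(scales)

# Each helper takes numbers and returns formatted character labels
dollar(c(0, 500, 1500))   # "$0" "$500" "$1,500"
comma(1234567)            # "1,234,567"
percent(c(0.25, 0.5))     # "25%" "50%"
```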
Breaks. We can also format breaks along each axis, i.e. labels every $500, or every $1K.
Set break positions with breaks =
Pass the values as a vector with c()

library(scales)
ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth() +
scale_x_continuous(name = "Unique Net Payments",
labels = dollar,
breaks = c(0, 500, 1000, 1500, 2000, 2500)) + # Custom breaks
scale_y_continuous(name = "Unique Gross Payments",
labels = dollar,
breaks = c(0, 500, 1000, 1500, 2000, 2500, 3000, 3500)) +
theme_light()
Repeating Breaks. With package scales, we don’t need to type out each “break” value:
pretty_breaks() from scales allows easy, incremental break formatting
n = specifies the desired number of breaks; the result may differ slightly to keep the values round

library(scales)
ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth() +
scale_x_continuous(name = "Unique Net Payments",
labels = dollar,
breaks = pretty_breaks(n = 6)) + # Pretty breaks
scale_y_continuous(name = "Unique Gross Payments",
labels = dollar,
breaks = pretty_breaks(n = 8)) +
theme_light()
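It may help to see that pretty_breaks() returns a function rather than the break values themselves; ggplot2 later calls that function on the axis range. The sketch below applies it by hand to an assumed range of 0 to 2500 (the name make_breaks is only illustrative) and compares it with a base R seq() call, which is another way to avoid typing every value.

```r
library(scales)

make_breaks <- pretty_breaks(n = 6)   # A function, not a vector of breaks
make_breaks(c(0, 2500))               # Round break values covering the assumed range

seq(from = 0, to = 2500, by = 500)    # Base R alternative: fixed steps of 500
```

Either result can be passed to breaks =: the function itself (breaks = pretty_breaks(n = 6)) or the numeric vector from seq().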
Custom Labels. We don’t have to use package scales, though it helps. Instead, we can pass labels = a character vector with one label for each value in breaks =:
ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth() +
scale_x_continuous(name = "Unique Net Payments",
labels = c("$ 0", "0.5", "1.0", "1.5", "2.0", "2.5 K"),
breaks = c(0, 500, 1000, 1500, 2000, 2500)) +
scale_y_continuous(name = "Unique Gross Payments",
labels = c("$ 0", "0.5", "1.0", "1.5", "2.0", "2.5", "3 K"),
breaks = c(0, 500, 1000, 1500, 2000, 2500, 3000)) +
theme_light()
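Typing labels by hand only works while the number of labels matches the number of breaks. A hedged alternative is to pass labels = a small function of your own; the helper label_thousands() and the toy data below are hypothetical and only illustrate the idea.

```r
library(ggplot2)

# Hypothetical helper: express dollar amounts in thousands, e.g. 2500 becomes "$2.5 K"
label_thousands <- function(x) {
  paste0("$", x / 1000, " K")
}

label_thousands(c(0, 500, 1000, 2500))   # "$0 K" "$0.5 K" "$1 K" "$2.5 K"

# Toy data (made up) to show the helper in a scale
toy <- data.frame(net   = c(250, 900, 1400, 2100),
                  gross = c(400, 1300, 2000, 3100))

p <- ggplot(toy, aes(x = net, y = gross)) +
  geom_point() +
  scale_x_continuous(labels = label_thousands) +   # Labels follow whatever breaks are used
  scale_y_continuous(labels = label_thousands)
```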
Small multiples allow us to compare different groups of data using the same scales.
Function names begin with facet_; peruse faceting with autocomplete
facet_grid() and facet_wrap() are most common
facet_grid() aligns small multiples horizontally along one shared y-axis
facet_wrap() structures small multiples tabularly, in rows and columns
Faceting variables are given as a formula, e.g. class ~ race
In facet_grid(), a . in the formula means no faceting variable on that side

Let’s try facet_grid() first with formula ~ race, faceting on race alone:
ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth(method = "lm") + # Linear model, not Loess
scale_x_continuous(name = "Unique Net Payments",
labels = c("$0", "1", "2 K", "")) +
scale_y_continuous(name = "Unique Gross Payments",
labels = dollar) +
facet_grid( ~ race) + # Use a single y-axis scale
theme_light()
Now let’s give facet_wrap() a try with the same formula:
ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth(method = "lm") + # Linear model, not Loess
scale_x_continuous(name = "Unique Net Payments",
labels = c("$0", "1", "2 K", "")) +
scale_y_continuous(name = "Unique Gross Payments",
labels = dollar) +
facet_wrap( ~ race) + # Use a tabular format
theme_light()
There are several arguments for faceting functions, though usually you only need a formula.
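A two-sided formula places one variable in rows and another in columns. The sketch below uses made-up data with hypothetical columns group and shift (not columns from hc) and inspects the panel layout that ggplot2 builds.

```r
library(ggplot2)

# Hypothetical toy data with two grouping variables
toy <- data.frame(x     = c(1, 2, 3, 4, 5, 6, 7, 8),
                  y     = c(2, 4, 3, 6, 5, 8, 7, 9),
                  group = rep(c("A", "B"), each = 4),
                  shift = rep(c("Day", "Night"), times = 4))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(shift ~ group)               # Rows by `shift`, columns by `group`

panels <- ggplot_build(p)$layout$layout   # One row of this table per panel
panels[, c("PANEL", "ROW", "COL")]
```

With two levels in each variable, the grid has two rows and two columns, all sharing the same x- and y-axis scales by default.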
Labeling is also fairly straightforward; labeling functions usually contain lab in the function name:
labs() is a catch-all for multiple labels
Axis titles can also be set with name = in scale functions, e.g. scale_x_continuous()
ggtitle() modifies the plot title
xlab() modifies the x-axis title
ylab() modifies the y-axis title

Most conveniently, you can modify all labels in the labs() function using arguments:
title = for plot title
subtitle = for plot subtitle
x = for x-axis title
y = for y-axis title
caption = for listing sources
fill =, color =, or other aesthetics for legend titles

ggplot(hc,
aes(x = net,
y = gross)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(labels = dollar) +
scale_y_continuous(labels = dollar) +
facet_wrap( ~ race) +
labs(title = "Gross vs. Net payments by ethnicity", # All labels in one fell swoop
subtitle = "2018 Hancock Airport Renovations",
x = "Unique Net Payments",
y = "Unique Gross Payments",
caption = "Source: Syracuse Regional Airport Authority") +
theme_light()
Overlay plots by adding new geom_*() and aes() functions and arguments appropriately.
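We’ll start from the geom_jitter() plot with gross income and race
Each overlay is an additional geom_*() call
By default, every layer inherits the mappings set in aes() within ggplot()

ggplot(hc,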
aes(x = race,
y = gross)) +
geom_boxplot(width = 0.55,
color = "grey55") +
geom_jitter(width = 0.2,
alpha = 0.3,
color = "tomato") +
coord_flip() +
scale_y_continuous(labels = dollar) +
labs(title = "Gross payments by ethnicity",
subtitle = "2018 Hancock Airport Renovations",
x = "Ethnicity",
y = "Unique Gross Payments",
caption = "Source: Syracuse Regional Airport Authority") +
theme_light()
Instead of dataset hc, for Hancock data, replicate the above plots with Lakeview data, lv. Recommendations are provided below, but feel free to experiment. Make sure to:
Use lv in your data layer, ggplot()
Check the variable names with names(lv)

Suggested plots:
A geom_jitter() plot with race by gross
A geom_point() plot with geom_smooth() using gross by net
A geom_bar() plot with argument stat = "identity" in the geometry layer
Add fill = race in the aes() function call

Thanks for sticking it out with me throughout this introductory series to the R language.
With grit and determination, which you’ve already shown, these fundamentals will be a robust foundation, without pretense to architecture, on which to build your future data science skills. We’ve only scratched the surface - tooth and nail, of course - but such is the nature of any aspiring hacker. Keep it up.
In the words of Benjamin Franklin:
“Industry need not wish”.