Data Visualization Exercises
In this exercise, we will reproduce the graphics from the class slides. The purpose is to illustrate different types of plots and demonstrate various features of ggplot. As a reference, don’t forget to check out the ggplot cheat sheet.
In this exercise, we will work directly with R code. However, I encourage you to check out the esquisse package. It is a “point and click” interface that will allow you to adjust ggplot graphics quickly and provide you with the code.
General housekeeping items
Let’s begin by opening libraries and clearing the environment:
library(tidyverse)
library(lubridate)
library(scales)
library(plotly)
library(usmap)
rm(list = ls())
Pie and bar chart examples
The first visualizations represent pie and bar charts using country-level GDP data. The first step is to create a data set (or tibble) with GDP information:
gdp <- tibble(
  value = c(0.1501, 0.0328, 0.0326, 0.166, 0.0254, 0.2409, 0.0601, 0.032, 0.0457),
  country = c('China', 'India', 'UK', 'Rest of World', 'Brazil', 'US', 'Japan', 'France', 'Germany'))
When manually creating a dataset, we could alternatively use data.table or data.frame in lieu of tibble(). Tibbles are native to the tidyverse, so we are going with that!
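To make the comparison concrete, here is a minimal sketch that builds the same toy table with data.frame() and with tibble(). The object names and values are made up for illustration and are not the GDP figures.

```r
library(tibble)

# Same toy data built two ways (illustrative values only)
toy_df  <- data.frame(value = c(0.5, 0.3, 0.2), country = c('A', 'B', 'C'))
toy_tbl <- tibble(value = c(0.5, 0.3, 0.2), country = c('A', 'B', 'C'))

# Both are data frames; the tibble carries additional classes
class(toy_df)
class(toy_tbl)

# A base data frame can be converted to a tibble at any point
toy_converted <- as_tibble(toy_df)
class(toy_converted)
```

Either object can be passed to ggplot(), so the choice mostly affects printing and subsetting behavior.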
Now let’s create a pie chart using the base R plotting function:
pie(gdp$value, labels = gdp$country, main = '2017 GDP for Largest Global Economies')
Let’s try out a bar chart depicting the same information:
ggplot(gdp, aes(x = country, y = value)) +
  geom_bar(stat = 'identity', fill = 'dodgerblue', color = 'black') +
  labs(title = '2017 GDP for Largest Global Economies', x = 'Country', y = 'Percentage')
We may want the graphic above sorted in a particular order. Note that country is currently a character variable. We should instead treat country as a factor, which lets us assign an order to its values (the default order is alphabetical). Let’s impose an order and then recreate each of these plots.
gdp <- gdp %>%
  arrange(desc(value)) %>%
  mutate(country = fct_inorder(country))
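To see what fct_inorder() changes, here is a small sketch using a toy character vector (the vector x is made up for illustration). It compares the default level order from factor() with the order produced by fct_inorder().

```r
library(forcats)

# Toy character vector, deliberately not in alphabetical order
x <- c('US', 'China', 'Japan')

# factor() sorts the levels alphabetically by default
default_levels <- levels(factor(x))
default_levels   # "China" "Japan" "US"

# fct_inorder() keeps the order in which values first appear
inorder_levels <- levels(fct_inorder(x))
inorder_levels   # "US" "China" "Japan"
```

Because the GDP data were sorted with arrange(desc(value)) first, the order of first appearance is the order from largest to smallest value.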
# Recreate the pie chart; slices now follow the factor order
pie(gdp$value, labels = gdp$country, main = '2017 GDP for Largest Global Economies')
# Recreate the bar chart; bars now run from largest to smallest
ggplot(gdp, aes(x = country, y = value)) +
  geom_bar(stat = 'identity', fill = 'dodgerblue', color = 'black') +
  labs(title = '2017 GDP for Largest Global Economies', x = 'Country', y = 'Percentage')
Histogram examples
For the next set of examples, we are going to borrow the diamonds, mpg, Texas housing, and state population datasets from the tidyverse and usmap packages. Information about these datasets can be found at: diamonds, mpg, txhousing, and statepop. Let’s load each data set into the environment, keeping only diamonds under 1.5 carats.
diamonds <- diamonds %>%
  filter(carat < 1.5)
mpg <- mpg
txhousing <- txhousing
statepop <- statepop
Create a histogram using the diamonds data:
ggplot(diamonds, aes(x = price)) +
  geom_histogram(fill = 'dodgerblue', color = 'black', bins = 30) +
  labs(title = 'Diamond Prices', x = 'Price', y = 'Count') +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma)
Create a histogram using the mpg data:
ggplot(mpg, aes(x = cty)) +
  geom_histogram(bins = 10, fill = 'dodgerblue', color = 'black') +
  labs(title = 'City MPG', x = 'City MPG', y = 'Count')
Create a histogram using the txhousing data:
txhousing %>%
  group_by(city, year) %>%
  # na.rm = TRUE skips months with a missing median price
  summarize(sales_price = median(median, na.rm = TRUE), .groups = 'drop') %>%
  filter(year == 2014) %>%
  ggplot(aes(x = sales_price)) +
  geom_histogram(fill = 'dodgerblue', color = 'black', bins = 8) +
  labs(title = 'Median House Prices by City in Texas (2014)', x = 'Price', y = 'Count') +
  scale_x_continuous(labels = comma)
Box plot examples
Next, let’s create a boxplot using the mpg dataset:
ggplot(mpg, aes(x = class, y = cty)) +
  geom_boxplot(fill = 'dodgerblue', color = 'black') +
  labs(title = 'City MPG by Class', x = 'Class Type', y = 'City MPG')
Line chart examples
Let’s create some line charts using Alabama and Tennessee’s season records since 1996. First let’s create the data sets.
seasons <- tibble(
  ua_record = c(0.857,0.846,0.867,1,0.846,0.933,0.929,0.933,0.933,0.857,0.846,0.929,0.923,0.769,1,0.857,0.538,0.462,0.833,0.5,0.308,0.769,0.583,0.273,0.769,0.583,0.364,0.769),
  ut_record = c(0.692,0.846,0.538,0.300,0.615,0.417,0.333,0.692,0.692,0.538,0.417,0.417,0.417,0.462,0.538,0.417,0.714,0.692,0.455,0.769,0.769,0.615,0.846,0.667,0.75,1,0.846,0.833),
  year = seq(2023, 1996, -1))
seasons_long <- seasons %>%
  pivot_longer(cols = !year, names_to = 'school_record', values_to = 'value')
Notice that we reshaped the initial data set. Think through why we need to do that to create the chart below.
Hint: think through the principles of ‘tidy’ data. What do the columns and rows need to represent?
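As a small illustration of the hint, the sketch below reshapes a toy wide table into long form. The table, team names, and values are made up; only pivot_longer() comes from the exercise.

```r
library(tibble)
library(tidyr)

# Toy wide table: one row per year, one column per team (made-up values)
wide <- tibble(year = c(2001, 2002),
               team_a = c(0.5, 0.6),
               team_b = c(0.7, 0.4))

# Long form: one row per team-year, with the team stored in its own column
long <- pivot_longer(wide, cols = !year, names_to = 'team', values_to = 'record')

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
long
```

In the long table, team is a single variable, so it can be mapped to an aesthetic such as color or shape in aes().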
Let’s plot season records over time:
ggplot(seasons_long, aes(x = year, y = value, color = school_record, shape = school_record)) +
  geom_line(linewidth = 1.0) +
  geom_point(size = 3.0) +
  scale_color_manual(values = c('#9E1B32', '#FF8200')) +
  scale_x_continuous(limits = c(1996, 2023), breaks = seq(1996, 2023, 4)) +
  labs(title = 'Football Season Records for Alabama and Tennessee', x = 'Year', y = 'Record', color = 'Legend', shape = 'Legend')
How about plotly:
plot_ly(seasons, x = ~year, y = ~ua_record, type = 'scatter', mode = 'lines', name = 'UA Record', line = list(color = '#9E1B32')) %>%
  add_trace(y = ~ut_record, name = 'UT Record', line = list(color = '#FF8200')) %>%
  layout(title = 'Football Season Records for Alabama and Tennessee', xaxis = list(title = 'Year'), yaxis = list(title = 'Record'))
Scatter plot examples
Let’s recreate a scatter plot using the diamonds dataset. As before, we are going to visualize the relations between three variables (price, carat, and clarity).
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  labs(title = 'Scatter Plot of Price versus Carat', x = 'Carat', y = 'Price', color = 'Clarity') +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)
Notice that the variation in prices gets larger as the diamonds get larger (something we talked about in class earlier). In this case, a log transformation of the axes will adjust for this pattern. We can transform one (or more) of the axes by changing its scale:
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  labs(title = 'Scatter Plot of Price versus Carat (Log Scales)', x = 'Log Carat', y = 'Log Price', color = 'Clarity') +
  scale_y_log10(labels = comma) +
  scale_x_log10(labels = comma)
Scatter plots can suffer from overplotting. Here, let’s tinker with two common tools to alleviate it: jitter and transparency (alpha). Note that jitter is unlikely to help much in severe overplotting cases.
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(position = 'jitter', alpha = 0.1) +
  labs(title = 'Scatter Plot of Price versus Carat (Jitter and Alpha)', x = 'Carat', y = 'Price', color = 'Clarity') +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)
Another way to handle overplotting is to plot a random sample of the data. To do so, we will pipe the diamonds data set through a sampling function before plotting. When sampling, you can use set.seed() to make your randomized sample reproducible.
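As a quick check of the reproducibility point, the sketch below draws two samples from a toy data frame after setting the same seed each time. The data frame toy and the seed value are arbitrary choices for illustration.

```r
library(dplyr)

# Toy data frame standing in for a larger data set
toy <- data.frame(id = 1:100)

# Resetting the seed before each draw yields the same rows
set.seed(42)
first_draw <- slice_sample(toy, n = 5)
set.seed(42)
second_draw <- slice_sample(toy, n = 5)

identical(first_draw$id, second_draw$id)  # TRUE
```

Without the second set.seed() call, the two draws would generally differ, which is why the seed is set immediately before sampling.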
# Plot a reproducible random sample of 1,000 diamonds
set.seed(42)
diamonds %>%
  slice_sample(n = 1000) %>%
  ggplot(aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  labs(title = 'Scatter Plot of Price versus Carat (Sample)', x = 'Carat', y = 'Price', color = 'Clarity') +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)
Create the same plot using plotly:
set.seed(42)
diamonds %>%
  slice_sample(n = 1000) %>%
  plot_ly(x = ~carat, y = ~price, color = ~clarity, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Scatter Plot of Price versus Carat', xaxis = list(title = 'Carat'), yaxis = list(title = 'Price'))
Axis scaling example
The scale of the x or y axis can dramatically change the appearance of, and inferences from, a graphic. Always consider the scale of a data visualization. Below, I illustrate how Alabama’s year-over-year season performance since 2010 can look very different depending on how the y-axis is scaled.
“Zoomed” in:
seasons_long %>%
  filter(school_record == 'ua_record' & year > 2010) %>%
  ggplot(aes(x = year, y = value, color = school_record, shape = school_record)) +
  geom_line(linewidth = 1.0) +
  geom_point(size = 3.0) +
  scale_color_manual(values = c('#9E1B32')) +
  scale_x_continuous(limits = c(2011, 2023), breaks = seq(2011, 2023, 2)) +
  labs(title = 'Football Season Records for Alabama Since 2010', x = 'Year', y = 'Record', color = 'Legend', shape = 'Legend')
“Zoomed” out:
seasons_long %>%
  filter(school_record == 'ua_record' & year > 2010) %>%
  ggplot(aes(x = year, y = value, color = school_record, shape = school_record)) +
  geom_line(linewidth = 1.0) +
  geom_point(size = 3.0) +
  scale_y_continuous(limits = c(0, 1.1), breaks = seq(0, 1, 0.2)) +
  scale_x_continuous(limits = c(2011, 2023), breaks = seq(2011, 2023, 2)) +
  scale_color_manual(values = c('#9E1B32')) +
  labs(title = 'Football Season Records for Alabama Since 2010', x = 'Year', y = 'Record', color = 'Legend', shape = 'Legend')
Faceting example
In ggplot, faceting allows us to create separate plots by “groups”. Notice that we did this earlier in the Covid-19 exercise. Here, let’s add facet_wrap() to the football season records plot to create separate plots for each team:
ggplot(seasons_long, aes(x = year, y = value, color = school_record, shape = school_record)) +
  geom_line(linewidth = 1.0) +
  geom_point(size = 3.0) +
  scale_color_manual(values = c('#9E1B32', '#FF8200')) +
  labs(title = 'Football Season Records for Alabama and Tennessee', x = 'Year', y = 'Record', color = 'Legend', shape = 'Legend') +
  facet_wrap(~school_record, nrow = 2)
Note that you can also facet across two group variables using facet_grid(). Don’t forget to reference the ggplot cheat sheet for many more features.
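As a rough sketch of facet_grid(), the example below uses the mpg data, since the football data has only one grouping variable. The choice of drv for rows and cyl for columns is an assumption made for illustration, not something from the class slides.

```r
library(ggplot2)

# facet_grid() arranges panels with drv in rows and cyl in columns
facet_plot <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(color = 'dodgerblue') +
  facet_grid(drv ~ cyl) +
  labs(title = 'City MPG versus Engine Displacement',
       x = 'Displacement (liters)', y = 'City MPG')

facet_plot
```

Unlike facet_wrap(), facet_grid() draws a panel for every row and column combination, so some panels may be empty when a combination has no observations.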