The ggplot2 package for R offers a relatively easy way to make high-quality graphics. This tutorial will show you how to make a basic scatterplot, histogram, bar chart, and box plot. Then it will revisit each type of chart to explain some advanced features you can use with each. Finally, it will illustrate “faceting,” which involves producing two or more charts of the same type for different variables in a way that makes comparing the variables easier.
We’ll need some data to work with. This code will retrieve some county-level 2022 American Community Survey data for Tennessee, store the file on your computer, and open the file in your R workspace as a data frame called mydata. There’s also some code that creates a few categorical variables that will prove useful for showing some of ggplot2’s advanced features.
# Read the data from the web
FetchedData <-
read.csv("https://drkblake.com/wp-content/uploads/2023/11/DataWrangling.csv")
# Save the data on your computer
write.csv(FetchedData, "DataWrangling.csv", row.names = FALSE)
# Remove the data from the environment
rm(FetchedData)
# Install (if needed) and load the required packages
if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)
# Read the data
mydata <- read.csv("DataWrangling.csv")
# Create a continuous "Density" variable measuring
# households per square mile, then a two-level and
# a three-level categorical version
mydata <- mydata %>%
  mutate(Density = Households / Land_area) %>%
  mutate(Density_2 = cut_number(Density, n = 2)) %>%
  mutate(Density_3 = cut_number(Density, n = 3))
# Replace the interval labels produced by cut_number() with readable names
mydata <- mydata %>%
  mutate(
    Density_2 = case_when(
      Density_2 == "[7.35,28.6]" ~ "Low density",
      Density_2 == "(28.6,583]" ~ "High density",
      .default = "Error"
    )
  )
mydata <- mydata %>%
  mutate(
    Density_3 = case_when(
      Density_3 == "[7.35,21]" ~ "Low density",
      Density_3 == "(21,40.4]" ~ "Intermediate density",
      Density_3 == "(40.4,583]" ~ "High density",
      .default = "Error"
    )
  )
# Re-save the data on your computer
write.csv(mydata, "DataWrangling.csv", row.names = FALSE)
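The recoding above matches the exact interval labels that cut_number() produced for this particular dataset, so the labels would need updating if the data changed. One possible alternative, sketched here on the assumption that you only need the readable names, is to pass a labels argument, which cut_number() hands on to base R’s cut(). The demo data frame and its values are made up for illustration; they are not the Tennessee data.

```r
library(dplyr)
library(ggplot2)  # cut_number() comes from ggplot2

# Hypothetical stand-in for mydata: six made-up counties
demo <- data.frame(
  County     = c("A", "B", "C", "D", "E", "F"),
  Households = c(1000, 5000, 20000, 800, 60000, 12000),
  Land_area  = c(400, 500, 450, 300, 520, 600)
)

demo <- demo %>%
  mutate(
    Density   = Households / Land_area,
    # labels are assigned from the lowest group to the highest
    Density_2 = cut_number(Density, n = 2,
                           labels = c("Low density", "High density")),
    Density_3 = cut_number(Density, n = 3,
                           labels = c("Low density",
                                      "Intermediate density",
                                      "High density"))
  )

# Count how many counties landed in each group
table(demo$Density_2)
table(demo$Density_3)
```

Note that this version leaves Density_2 and Density_3 as factors ordered from low to high, whereas the case_when() approach above produces plain text.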
Here is a list of the variables in the data frame, with a quick description of each. All data are county-level variables for Tennessee from the U.S. Census Bureau’s 2021 five-year American Community Survey.
County: The name of the county. “Anderson,” “Bedford,” “Benton,” etc. In all, there are 95 counties in the data frame.
Region: The Tennessee region in which the county is located. There are three regions: “West,” “Middle,” and “East.”
Med_HH_Income: Each county’s median household income.
Households: The number of households in each county.
Pct_BB: The percentage of households in each county that have broadband internet access.
Pct_College: The percentage of residents in each county who have a four-year college degree or higher (like a master’s degree, a law degree, a Ph.D., or a medical degree).
Land_area: Each county’s land area, in square miles. Land area is area in the county that is not covered by a river, lake, or other body of water.
Density: Households / Land_area.
Density_2: Density, divided into two roughly equal groups.
Density_3: Density, divided into three roughly equal groups.
Let’s start with two plots often used to graph just one, single variable.
Use a histogram to graph a single continuous variable, like Pct_BB, the percentage of households with broadband internet access in each Tennessee county.
The code uses the ggplot() function, which has three fundamental parts: the one that indicates the dataset being used (here, mydata); the “aesthetic” part, which indicates which variable to place on the x axis (here, aes(x=Pct_BB)); and the “geometry” part, which indicates the type of graphic you want (here, geom_histogram()). Note the + symbol that connects the “aesthetic” and “geometry” portions of the code.
# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
geom_histogram()
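When you run a basic histogram, ggplot2 reports that it used 30 bins and suggests picking a better value with binwidth. Here is a minimal sketch of doing that. The demo data frame is a made-up stand-in for mydata, and the bin width of 5 percentage points is just an assumption; choose whatever suits your data.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: 95 made-up broadband percentages
set.seed(1)
demo <- data.frame(Pct_BB = runif(95, min = 60, max = 95))

# Same three parts as the basic histogram, plus an explicit bin width
p <- ggplot(demo, aes(x = Pct_BB)) +
  geom_histogram(binwidth = 5)
p
```

Setting binwidth inside geom_histogram() also silences the message about bins, because ggplot2 no longer has to guess.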
Use a bar chart to graph a single categorical variable, like Region, which shows how many counties are in each grand division of Tennessee.
Again, the code uses the ggplot() function, and the code has the same three basic parts that it did in the histogram code: the data frame specification, the “aesthetic” part, and the “geometry” part. But this time, the x-axis variable is Region, because that’s the variable we’re graphing, and geom_bar() tells R we want a bar chart.
# Basic bar chart
ggplot(mydata, aes(x=Region))+
geom_bar()
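If you want to check what the bars represent, note that geom_bar() simply counts the rows in each category. The sketch below compares those counts with base R’s table(), using a made-up demo data frame rather than the real Tennessee data.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: six made-up counties
demo <- data.frame(
  Region = c("West", "West", "Middle", "Middle", "Middle", "East")
)

# The counts geom_bar() will draw, computed directly
region_counts <- table(demo$Region)
region_counts

# The bar chart itself: one bar per region, bar height = number of rows
p <- ggplot(demo, aes(x = Region)) +
  geom_bar()
p
```

Both table() and ggplot2 list text categories alphabetically, which is why the bars appear in East, Middle, West order.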
Here are two plots used to graph two variables at the same time - often because you are looking for evidence of a relationship between the two.
Use a scatterplot to graph the relationship between two continuous variables, like the relationship between Med_HH_Income and Pct_BB. Again, the code has three basic parts - the data frame’s name, the “aesthetic” part, and the “geometry” part. But, as before, the “aesthetic” and “geometry” parts have to change a little, both to tell R which variables to use and to tell R what kind of graph to arrange them in.
You’ll also have to decide which variable depends upon the other - that is, which is the dependent variable, and which is the independent variable. Here, Pct_BB seems more likely to depend on Med_HH_Income than the other way around, so Pct_BB is the dependent variable, and Med_HH_Income is the independent variable. Conventionally, the dependent variable will go on the y axis, and the independent variable will go on the x axis.
Having decided all of that, you can write the code. The dataset is still mydata, but x = Med_HH_Income tells R to put Med_HH_Income on the x axis, and y = Pct_BB tells R to put Pct_BB on the y axis. Finally, the “geometry” portion changes to geom_point(), to tell R that you want a scatterplot this time.
ggplot(mydata, aes(x = Med_HH_Income,
y = Pct_BB))+
geom_point()
Looks like Pct_BB tends to rise as Med_HH_Income rises. In other words, the two appear positively related, although there is an obvious outlier (Williamson County), which has both high household income and high broadband access.
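Adding a regression line to your scatterplot requires using a + symbol to tack a geom_smooth() function onto the geom_point() function. Inside geom_smooth(), method = "lm" asks for a straight line fitted by a linear model, and se = FALSE hides the shaded confidence band that would otherwise surround the line: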
ggplot(mydata, aes(x = Med_HH_Income,
y = Pct_BB))+
geom_point()+
geom_smooth(method = "lm",
se = FALSE)
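If you would like to see the numbers behind the regression line, one option is to fit the same straight-line model yourself with base R’s lm(). This is only a sketch: the demo data frame below is simulated, so its intercept and slope say nothing about the real Tennessee data.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: simulated income and broadband values
set.seed(42)
demo <- data.frame(Med_HH_Income = runif(95, min = 35000, max = 110000))
demo$Pct_BB <- 55 + 0.0003 * demo$Med_HH_Income + rnorm(95, sd = 3)

# Fit the straight line that geom_smooth(method = "lm") draws
fit <- lm(Pct_BB ~ Med_HH_Income, data = demo)
coef(fit)  # intercept and slope of the line

# The scatterplot with the same line drawn on top
p <- ggplot(demo, aes(x = Med_HH_Income, y = Pct_BB)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```

A positive slope in coef(fit) corresponds to a line that rises from left to right in the plot.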
Use a stacked bar chart to graph the relationship between two categorical variables, like the relationship between Density_2 and Region. In this code’s “aesthetic” portion, the independent variable goes on the x axis, so that there is a bar for each category on the independent variable, while the dependent variable gets used to determine how each bar will be filled. Let’s go with the idea that Density_2 depends upon Region.
ggplot(mydata, aes(x = Region, fill = Density_2)) +
geom_bar()
Looks like high-density counties (represented by aqua) are more common in East Tennessee than in Middle Tennessee, and especially more common than in West Tennessee.
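Because the regions contain different numbers of counties, comparing raw counts across bars can be tricky. One possible variation, sketched here with a made-up demo data frame, is to add position = "fill" inside geom_bar(), which stretches every bar to the same height so that the colored segments show proportions instead of counts.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: nine made-up counties
demo <- data.frame(
  Region    = c("West", "West", "West",
                "Middle", "Middle",
                "East", "East", "East", "East"),
  Density_2 = c("Low density", "Low density", "High density",
                "Low density", "High density",
                "High density", "High density", "High density", "Low density")
)

# position = "fill" rescales each bar to run from 0 to 1
p <- ggplot(demo, aes(x = Region, fill = Density_2)) +
  geom_bar(position = "fill")
p
```

The y axis is still labeled “count” by default, even though it now shows proportions between 0 and 1.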
Use faceted histograms when you want to compare the distribution of a continuous variable across two or more levels of a categorical variable. For example, suppose you want to compare how Pct_BB, or broadband access, is distributed among low- and high-density counties, as measured by Density_2.
The “faceted” approach to displaying such data involves making one histogram of Pct_BB for low-density counties, a second histogram of Pct_BB for high-density counties, then stacking the two histograms so that you can easily compare both on the same horizontal scale.
Telling R to do so involves using a + symbol to add the facet_wrap() function to the “geometry” portion of the code for an ordinary histogram. The name of the categorical variable you want to facet by appears after the facet_wrap() function’s ~ symbol. The ncol=1 argument tells R to stack the facets vertically in a single column. Here’s the code:
### Faceting
ggplot(mydata, aes(x = Pct_BB))+
geom_histogram()+
facet_wrap(~Density_2,
ncol = 1)
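By default, every facet shares the same x and y scales. If one group has far fewer counties than the other, its bars can look tiny. A possible tweak, shown here as a sketch with simulated data rather than the Tennessee file, is facet_wrap()’s scales argument: scales = "free_y" lets each facet have its own vertical scale while keeping the shared horizontal scale that makes the comparison work.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: unequal group sizes on purpose
set.seed(7)
demo <- data.frame(
  Pct_BB    = c(runif(30, min = 60, max = 80), runif(65, min = 70, max = 95)),
  Density_2 = rep(c("Low density", "High density"), times = c(30, 65))
)

# One histogram per density group, stacked in a single column,
# each with its own y-axis range
p <- ggplot(demo, aes(x = Pct_BB)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~Density_2, ncol = 1, scales = "free_y")
p
```

The trade-off is that bar heights are no longer directly comparable between facets, so check the y-axis numbers before drawing conclusions.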
You can facet other types of charts, too. For example, here’s what you get if you facet bar charts of one categorical variable, Density_2, by another categorical variable, Region:
ggplot(mydata, aes(x = Density_2)) +
geom_bar()+
facet_wrap(~Region,
ncol = 1)
The results indicate that high-density counties are especially common in East Tennessee, and especially rare in West Tennessee, while Middle Tennessee has slightly fewer high-density counties than low-density counties.
Scatterplots can be faceted as well, including those that also show a regression line. For example, here is what the relationship between Pct_BB and Med_HH_Income looks like when faceted by Density_2:
ggplot(mydata, aes(x = Med_HH_Income,
y = Pct_BB))+
geom_point()+
geom_smooth(method = "lm",
se = FALSE)+
facet_wrap(~Density_2,
ncol = 1)
## `geom_smooth()` using formula = 'y ~ x'
The ggplot2 package offers all sorts of ways to make your graphics pretty. For now, let’s look at two: adding axis labels and a title, and adding color.
By default, ggplot2 uses variable names as axis labels and omits titles. To add custom axis labels, use the + symbol to add the labs() function after the “geometry” portion of the code, then, inside the function’s parentheses, specify the axis and the text you want to label it with. Here, labs(x = "Pct. HH with broadband") adds the label “Pct. HH with broadband” to the x axis of a histogram. Including y = "Number of counties" after a comma and within the same parentheses replaces the generic “count” label on the y axis.
# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
geom_histogram()+
labs(x = "Pct. HH with broadband",
y = "Number of counties")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To add a title, just include - again, after a comma - title = "Broadband access among TN counties". Like this:
# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
geom_histogram()+
labs(x = "Pct. HH with broadband",
y = "Number of counties",
title = "Broadband access among TN counties")
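The labs() function accepts a few other pieces of text as well. As a sketch, here is a histogram that also sets subtitle and caption, two further labs() arguments. The demo data and the wording of the subtitle and caption are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: 95 made-up broadband percentages
set.seed(1)
demo <- data.frame(Pct_BB = runif(95, min = 60, max = 95))

# Axis labels and title as before, plus a subtitle and a caption
p <- ggplot(demo, aes(x = Pct_BB)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Pct. HH with broadband",
       y = "Number of counties",
       title = "Broadband access among TN counties",
       subtitle = "Made-up values for illustration",
       caption = "Source: hypothetical demo data")
p
```

The subtitle appears just under the title, and the caption appears below the plot, which makes it a handy place for a data source note.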
The same approach works for a bar chart:
# Basic bar chart
ggplot(mydata, aes(x=Region))+
geom_bar()+
labs(x = "Tennessee grand division",
y = "Number of counties",
title = "TN counties, by division")
It also works for a scatterplot:
ggplot(mydata, aes(x = Med_HH_Income,
y = Pct_BB))+
geom_point()+
labs(x = "Median HH income",
y = "Pct. HH with broadband",
title = "Broadband access, by income")+
geom_smooth(method = "lm",
se = FALSE)
For consistency, I put the labs() function immediately after the geom_point() function. But it works fine if you instead put it after the geom_smooth() function that produces the regression line:
ggplot(mydata, aes(x = Med_HH_Income,
y = Pct_BB))+
geom_point()+
geom_smooth(method = "lm",
se = FALSE)+
labs(x = "Median HH income",
y = "Pct. HH with broadband",
title = "Broadband access, by income")
It works with faceted graphics, too.
ggplot(mydata, aes(x = Pct_BB))+
geom_histogram()+
labs(x = "Pct. HH with broadband",
y = "Number of counties",
title = "Broadband access, by density")+
facet_wrap(~Density_2,
ncol = 1)
ggplot(mydata, aes(x = Density_2))+
geom_bar()+
labs(x = "Households per sq. mile",
y = "Number of counties",
title = "Density, by division")+
facet_wrap(~Region,
ncol = 1)
ggplot(mydata, aes(x = Med_HH_Income,
y = Pct_BB))+
geom_point()+
labs(x = "Median HH income",
y = "Pct. HH with broadband",
title = "Broadband access, by income and density")+
geom_smooth(method = "lm",
se = FALSE)+
facet_wrap(~Density_2,
ncol = 1)
For a stacked bar chart, add fill = "", with your legend title between the quotation marks, to the labs() function to control the legend’s title:
ggplot(mydata, aes(x = Region, fill = Density_2)) +
geom_bar()+
labs(x = "Grand division",
y = "Number of counties",
fill = "HH / sq. mile")
As you can see above, ggplot2 sometimes adds color by default. You can add it on purpose, though - and control which colors get used. Here, adding color = "darkblue" inside the geom_histogram() function’s parentheses changes the outlines of the histogram bars to dark blue. Adding fill = "blue", after a comma, changes the area inside the bars to a medium blue. Quoted hex codes work as well; for example, try color = "#190482", fill = "#7752FE".
# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
geom_histogram(color = "darkblue",
fill = "blue")+
labs(x = "Pct. HH with broadband",
y = "Number of counties",
title = "Broadband access among TN counties")
Here is the same histogram with the colors given as hex codes, which you can get from online palette guides.
# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
geom_histogram(color = "#001B79",
fill = "#1640D6")+
labs(x = "Pct. HH with broadband",
y = "Number of counties",
title = "Broadband access among TN counties")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
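If you are not sure whether R will recognize a color name or hex code, you can check before plotting. This short sketch uses two base R functions: colors(), which lists every built-in color name, and col2rgb(), which converts a name or hex code to red, green, and blue values and raises an error if the color is not valid.

```r
# Is "darkblue" one of R's built-in color names?
is_named_color <- "darkblue" %in% colors()
is_named_color

# Convert a color name and two hex codes to red/green/blue values (0-255)
rgb_values <- col2rgb(c("darkblue", "#001B79", "#1640D6"))
rgb_values
```

Each column of the result describes one color, with rows named red, green, and blue.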
The same approach will add color to a bar chart:
# Basic bar chart
ggplot(mydata, aes(x=Region))+
geom_bar(color = "darkblue",
fill = "blue")+
labs(x = "Tennessee grand division",
y = "Number of counties",
title = "TN counties, by division")
… and to a scatterplot, although, with a scatterplot, you also might want to change the regression line’s default color to something else. Here, I changed it to dark gray.
ggplot(mydata, aes(x = Med_HH_Income,
y = Pct_BB))+
# Note: fill only affects point shapes 21-25; the default shape uses color alone
geom_point(color = "darkblue",
fill = "blue")+
labs(x = "Median HH income",
y = "Pct. HH with broadband",
title = "Broadband access, by income")+
geom_smooth(method = "lm",
se = FALSE,
color = "darkgray")
Coloring a stacked bar chart with something other than the default colors requires adding the scale_fill_manual() function to the code, after a + symbol. The color codes you want to use are listed within the function’s parentheses, using the format values=c('color1', 'color2', 'color3'), and so on, with one color code for each area you want to color:
ggplot(mydata, aes(x = Region, fill = Density_2)) +
geom_bar()+
labs(x = "Grand division",
y = "Number of counties",
fill = "HH / sq. mile")+
scale_fill_manual(values=c('lightblue', 'darkblue'))
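When the colors are listed without names, ggplot2 hands them out in the order the categories appear in the legend, which is alphabetical for text variables. If you would rather say exactly which category gets which color, one option is a named vector. The sketch below uses a made-up demo data frame, and the name density_colors is just a label I chose for the vector.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: nine made-up counties
demo <- data.frame(
  Region    = c("West", "West", "West",
                "Middle", "Middle",
                "East", "East", "East", "East"),
  Density_2 = c("Low density", "Low density", "High density",
                "Low density", "High density",
                "High density", "High density", "High density", "Low density")
)

# Each name must match a category in Density_2 exactly
density_colors <- c("Low density"  = "lightblue",
                    "High density" = "darkblue")

p <- ggplot(demo, aes(x = Region, fill = Density_2)) +
  geom_bar() +
  scale_fill_manual(values = density_colors)
p
```

With the named vector, low-density counties stay light blue and high-density counties stay dark blue no matter how the categories are sorted.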