Intro

The ggplot2 package for R offers a relatively easy way to make high-quality graphics. This tutorial will show you how to make a basic histogram, bar chart, scatterplot, and stacked bar chart. Then it will illustrate “faceting,” which involves producing two or more charts of the same type for different subsets of the data in a way that makes comparing the subsets easier. Finally, it will revisit each type of chart to explain some customizations you can use with each.

Getting some data

We’ll need some data to work with. This code will retrieve some county-level 2022 American Community Survey data for Tennessee, store the file on your computer, and open the file in your R workspace as a data frame called mydata. The code also creates a few categorical variables that will prove useful for showing some of ggplot2’s advanced features.

# Read the data from the web
FetchedData <-
  read.csv("https://drkblake.com/wp-content/uploads/2023/11/DataWrangling.csv")
# Save the data on your computer
write.csv(FetchedData, "DataWrangling.csv", row.names = FALSE)
# Remove the data from the environment
rm(FetchedData)

# Install the required packages (if needed) and load them
if (!require("tidyverse"))
  install.packages("tidyverse")
library(tidyverse)

# Read the data
mydata <- read.csv("DataWrangling.csv")

# Create a continuous "Density" variable measuring
# households per square mile, then a two-level and
# a three-level categorical version
mydata <- mydata %>%
  mutate(Density = Households / Land_area) %>%
  mutate(Density_2 = cut_number(Density, n = 2)) %>%
  mutate(Density_3 = cut_number(Density, n = 3))
mydata <- mydata %>%
  mutate(
    Density_2 = case_when(
      Density_2 == "[7.35,28.6]" ~ "Low density",
      Density_2 == "(28.6,583]" ~ "High density",
      .default = "Error"
    )
  )
mydata <- mydata %>%
  mutate(
    Density_3 = case_when(
      Density_3 == "[7.35,21]" ~ "Low density",
      Density_3 == "(21,40.4]" ~ "Intermediate density",
      Density_3 == "(40.4,583]" ~ "High density",
      .default = "Error"
    )
  )

# Re-save the data on your computer
write.csv(mydata, "DataWrangling.csv", row.names = FALSE)
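
The case_when() steps above match the exact interval labels that cut_number() produced for this dataset, so they would fall through to "Error" if the data changed. As a hedged alternative, here is a minimal sketch that names the groups directly through the labels argument, which cut_number() passes along to base R's cut(). The density values are made up for illustration and are not the tutorial's data.

```r
library(ggplot2)  # provides cut_number()

# Hypothetical density values, not the tutorial's data
density <- c(8, 15, 22, 30, 41, 583)

# cut_number() passes extra arguments, including labels, on to cut()
density_2 <- cut_number(density, n = 2,
                        labels = c("Low density", "High density"))
density_3 <- cut_number(density, n = 3,
                        labels = c("Low density",
                                   "Intermediate density",
                                   "High density"))

# Each group holds roughly the same number of values
table(density_2)
table(density_3)
```

Note that this approach yields factors whose levels run from low to high, while the case_when() approach yields plain text, which ggplot2 sorts alphabetically.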

Codebook for the data frame

Here is a list of the variables in the data frame, with a quick description of each. All data are county-level variables for Tennessee from the U.S. Census Bureau’s 2021 five-year American Community Survey.

County: The name of the county. “Anderson,” “Bedford,” “Benton,” etc. In all, there are 95 counties in the data frame.

Region: The Tennessee region in which the county is located. There are three regions: “West,” “Middle,” and “East.”

Med_HH_Income: Each county’s median household income.

Households: The number of households in each county.

Pct_BB: The percentage of households in each county that have broadband internet access.

Pct_College: The percentage of residents in each county who have a four-year college degree or higher (like a master’s degree, a law degree, a Ph.D., or a medical degree).

Land_area: Each county’s land area, in square miles. Land area is area in the county that is not covered by a river, lake, or other body of water.

Density: Households / Land_area.

Density_2: Density, divided into two roughly equal groups.

Density_3: Density, divided into three roughly equal groups.

Single-variable plots

Let’s start with two plots often used to graph just one, single variable.

A basic histogram

Use a histogram to graph a single continuous variable, like Pct_BB, the percentage of households with broadband internet access in each Tennessee county.

The code has three fundamental parts. The first, inside the ggplot() function, indicates the dataset being used (here, mydata). The second is the “aesthetic” part, which indicates which variable to place on the x axis (here, aes(x=Pct_BB)). The third is the “geometry” part, which indicates the type of graphic you want (here, geom_histogram()). Note the + symbol that connects the ggplot() call to the “geometry” portion of the code.

# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
  geom_histogram()
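
By default, geom_histogram() sorts the values into 30 bins and prints a message suggesting that you pick a binwidth yourself. Here is a minimal sketch of doing that. The toy data frame is hypothetical, and the five-point bin width is just an example value.

```r
library(ggplot2)

# Hypothetical broadband percentages, not the tutorial's data
toy <- data.frame(Pct_BB = c(62, 65, 68, 71, 73, 74, 78, 80, 85, 93))

# binwidth = 5 makes each bar span five percentage points
p <- ggplot(toy, aes(x = Pct_BB)) +
  geom_histogram(binwidth = 5)

# layer_data() returns the bins and counts that ggplot2 computed
bins <- layer_data(p)
bins[, c("xmin", "xmax", "count")]
```

Typing p at the console draws the chart. The counts across all bins add up to the number of rows in the data.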

A basic bar chart

Use a bar chart to graph a single categorical variable, like Region. The resulting chart shows how many counties are in each grand division of Tennessee.

Again, the code uses the ggplot() function, and the code has the same three basic parts that it did in the histogram code: the data frame specification, the “aesthetic” part, and the “geometry” part. But this time, the x-axis variable is Region, because that’s the variable we’re graphing, and geom_bar() tells R we want a bar chart.

# Basic bar chart
ggplot(mydata, aes(x=Region))+
  geom_bar()
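
It may help to see that geom_bar() simply counts the rows in each category. This small sketch, using a made-up data frame rather than the tutorial's data, compares the bar heights with a frequency table from base R's table() function.

```r
library(ggplot2)

# Hypothetical counties, not the tutorial's data
toy <- data.frame(
  Region = c("East", "East", "East", "Middle", "Middle", "West")
)

# geom_bar() counts the rows in each category of the x variable
p <- ggplot(toy, aes(x = Region)) +
  geom_bar()

# The bar heights ggplot2 computed ...
bar_heights <- layer_data(p)$count
bar_heights

# ... match a simple frequency table
table(toy$Region)
```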

Two-variable plots

Here are two plots used to graph two variables at the same time - often because you are looking for evidence of a relationship between the two.

A basic scatterplot

Use a scatterplot to graph the relationship between two continuous variables, like the relationship between Med_HH_Income and Pct_BB. Again, the code has three basic parts - the data frame’s name, the “aesthetic” part, and the “geometry” part. But, as before, the “aesthetic” and “geometry” parts have to change a little, both to tell R which variables to use and to tell R what kind of graph to arrange them in.

You’ll also have to decide which variable depends upon the other - that is, which is the dependent variable, and which is the independent variable. Here, Pct_BB seems more likely to depend on Med_HH_Income than the other way around, so Pct_BB is the dependent variable, and Med_HH_Income is the independent variable. Conventionally, the dependent variable will go on the y axis, and the independent variable will go on the x axis.

Having decided all of that, you can write the code. The dataset is still mydata, but x = Med_HH_Income tells R to put Med_HH_Income on the x axis, and y = Pct_BB tells R to put Pct_BB on the y axis. Finally, the “geometry” portion changes to geom_point(), to tell R that you want a scatterplot this time.

ggplot(mydata, aes(x = Med_HH_Income,
                   y = Pct_BB))+
  geom_point()

Looks like Pct_BB tends to rise as Med_HH_Income rises. In other words, the two appear positively related, although there is an obvious outlier (Williamson County), which has both high household income and high broadband access.

A scatterplot plus a regression line

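Adding a regression line to your scatterplot requires using a + symbol to tack a geom_smooth() function onto the geom_point() function. The method = "lm" argument asks for a straight line fitted by linear regression, and se = FALSE leaves out the shaded confidence band: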

ggplot(mydata, aes(x = Med_HH_Income,
                   y = Pct_BB))+
  geom_point()+
  geom_smooth(method = "lm",
              se = FALSE)
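
If you want the numbers behind the line, base R's lm() function fits the same kind of straight line. The sketch below uses a made-up data frame, not the tutorial's data, and checks that the line ggplot2 draws matches the lm() predictions.

```r
library(ggplot2)

# Hypothetical incomes and broadband percentages
toy <- data.frame(
  Med_HH_Income = c(40000, 45000, 50000, 60000, 75000, 90000),
  Pct_BB        = c(68, 72, 74, 80, 84, 91)
)

# Fit a straight line predicting Pct_BB from Med_HH_Income
fit <- lm(Pct_BB ~ Med_HH_Income, data = toy)
coef(fit)  # intercept and slope of the regression line

p <- ggplot(toy, aes(x = Med_HH_Income, y = Pct_BB)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Layer 2 holds the points ggplot2 computed along the line
line <- layer_data(p, 2)
predicted <- predict(fit, data.frame(Med_HH_Income = line$x))
all.equal(unname(predicted), line$y)
```

A positive slope in coef(fit) corresponds to a line that rises from left to right.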

A stacked bar chart

Use a stacked bar chart to graph the relationship between two categorical variables, like the relationship between Density_2 and Region. In this code’s “aesthetic” portion, the independent variable goes on the x axis, so that there is a bar for each category on the independent variable, while the dependent variable gets used to determine how each bar will be filled. Let’s go with the idea that Density_2 depends upon Region.

ggplot(mydata, aes(x = Region, fill = Density_2)) +
  geom_bar()

Looks like high-density counties (represented by aqua) are more common in East Tennessee than in Middle Tennessee, and especially more common than in West Tennessee.
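
To check a reading like this against actual numbers, you could cross-tabulate the two variables. Here is a minimal sketch with base R's table() and prop.table() functions. The data frame is hypothetical, so the counts will not match the chart above.

```r
# Hypothetical counties, not the tutorial's data
toy <- data.frame(
  Region    = c("East", "East", "East", "Middle", "Middle", "West"),
  Density_2 = c("High density", "High density", "Low density",
                "High density", "Low density", "Low density")
)

# Rows are the fill categories; columns are the bars
counts <- table(toy$Density_2, toy$Region)
counts

# Share of each region's counties that falls in each density group
shares <- prop.table(counts, margin = 2)
shares
```

Each column of shares adds up to 1, which makes regions with different numbers of counties easier to compare.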

Faceted histograms

Use faceted histograms when you want to compare the distribution of a continuous variable across two or more levels of a categorical variable. For example, suppose you want to compare how Pct_BB, or broadband access, is distributed among low- and high-density counties, as measured by Density_2.

The “faceted” approach to displaying such data involves making one histogram of Pct_BB for low-density counties, a second histogram of Pct_BB for high-density counties, then stacking the two histograms so that you can easily compare both on the same horizontal scale.

Telling R to do so involves using a + symbol to add the facet_wrap function to the “geometry” portion of the code for an ordinary histogram. The name of the categorical variable you want to facet by appears after the facet_wrap function’s ~ symbol. The ncol=1 argument tells R to stack the facets vertically in a single column. Here’s the code:

### Faceting
ggplot(mydata, aes(x = Pct_BB))+
  geom_histogram()+
  facet_wrap(~Density_2,
             ncol = 1)
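
When the faceting variable is plain text, ggplot2 arranges the facets alphabetically, so "High density" lands above "Low density". If you would rather control the order, one option is to convert the variable to a factor with the levels listed in the order you want. This is a sketch using a made-up data frame, not the tutorial's data.

```r
library(ggplot2)

# Hypothetical counties, not the tutorial's data
toy <- data.frame(
  Pct_BB    = c(62, 65, 70, 78, 85),
  Density_2 = c("Low density", "Low density", "Low density",
                "High density", "High density")
)

# List the levels in the order the facets should appear
toy$Density_2 <- factor(toy$Density_2,
                        levels = c("Low density", "High density"))

p <- ggplot(toy, aes(x = Pct_BB)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~Density_2, ncol = 1)

# Total count per facet, in top-to-bottom order
built <- layer_data(p)
panel_totals <- tapply(built$count, built$PANEL, sum)
panel_totals
```

Here the top facet holds the three low-density rows and the bottom facet holds the two high-density rows.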

Faceting other kinds of charts

You can facet other types of charts, too. For example, here’s what you get if you facet bar charts of one categorical variable, Density_2, by another categorical variable, Region:

ggplot(mydata, aes(x = Density_2)) +
  geom_bar()+
  facet_wrap(~Region,
             ncol = 1)

The results indicate that high-density counties are especially common in East Tennessee, and especially rare in West Tennessee, while Middle Tennessee has slightly fewer high-density counties than low-density counties.

Scatterplots can be faceted as well, including those that also show a regression line. For example, here is what the relationship between Pct_BB and Med_HH_Income looks like when faceted by Density_2:

ggplot(mydata, aes(x = Med_HH_Income,
                   y = Pct_BB))+
  geom_point()+
  geom_smooth(method = "lm",
              se = FALSE)+
  facet_wrap(~Density_2,
             ncol = 1)

## `geom_smooth()` using formula = 'y ~ x'

Customizations

The ggplot2 package offers all sorts of ways to make your graphics pretty. For now, let’s look at two: adding axis labels and a title, and adding color.

Adding axis labels and titles

By default, ggplot2 uses variable names as axis labels and omits titles.

To add custom axis labels, use the + symbol to add the labs() function after the “geometry” portion of the code, then, inside the function’s parentheses, specify the axis and the text you want to label it with. Here, labs(x = "Pct. HH with broadband") adds the label “Pct. HH with broadband” to the x axis of a histogram. Including y = "Number of counties" after a comma and within the same parentheses replaces the generic “count” label on the y axis.

# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
  geom_histogram()+
  labs(x = "Pct. HH with broadband",
       y = "Number of counties")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To add a title, just include - again, after a comma - title = "Broadband access among TN counties". Like this:

# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
  geom_histogram()+
  labs(x = "Pct. HH with broadband",
       y = "Number of counties",
       title = "Broadband access among TN counties")

The same approach works for a bar chart:

# Basic bar chart
ggplot(mydata, aes(x=Region))+
  geom_bar()+
  labs(x = "Tennessee grand division",
       y = "Number of counties",
       title = "TN counties, by division")

It also works for a scatterplot:

ggplot(mydata, aes(x = Med_HH_Income,
                   y = Pct_BB))+
  geom_point()+
  labs(x = "Median HH income",
       y = "Pct. HH with broadband",
       title = "Broadband access, by income")+
  geom_smooth(method = "lm",
              se = FALSE)

For consistency, I put the labs() function immediately after the geom_point() function. But it works fine if you instead put it after the geom_smooth() function that produces the regression line:

ggplot(mydata, aes(x = Med_HH_Income,
                   y = Pct_BB))+
  geom_point()+
  geom_smooth(method = "lm",
              se = FALSE)+
  labs(x = "Median HH income",
       y = "Pct. HH with broadband",
       title = "Broadband access, by income")
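
The order doesn't matter because each + just adds another piece to the same plot. One way to see this, sketched here with a made-up data frame and the hypothetical object names base and my_labs, is to store the labels once and add them at different points.

```r
library(ggplot2)

# Hypothetical incomes and broadband percentages
toy <- data.frame(
  Med_HH_Income = c(40000, 50000, 60000, 75000),
  Pct_BB        = c(68, 74, 80, 84)
)

# Store the shared pieces so they can be reused
base <- ggplot(toy, aes(x = Med_HH_Income, y = Pct_BB))
my_labs <- labs(x = "Median HH income",
                y = "Pct. HH with broadband",
                title = "Broadband access, by income")

# labs() before geom_smooth() ...
a <- base + geom_point() + my_labs +
  geom_smooth(method = "lm", se = FALSE)

# ... and labs() after it
b <- base + geom_point() +
  geom_smooth(method = "lm", se = FALSE) + my_labs

# Both plots carry the same title
identical(a$labels$title, b$labels$title)
```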

It works with faceted graphics, too.

ggplot(mydata, aes(x = Pct_BB))+
  geom_histogram()+
  labs(x = "Pct. HH with broadband",
       y = "Number of counties",
       title = "Broadband access, by density")+
  facet_wrap(~Density_2,
             ncol = 1)

ggplot(mydata, aes(x = Density_2))+
  geom_bar()+
  labs(x = "Households per sq. mile",
       y = "Number of counties",
       title = "Density, by division")+
  facet_wrap(~Region,
             ncol = 1)

ggplot(mydata, aes(x = Med_HH_Income,
                   y = Pct_BB))+
  geom_point()+
  labs(x = "Median HH income",
       y = "Pct. HH with broadband",
       title = "Broadband access, by income and density")+
  geom_smooth(method = "lm",
              se = FALSE)+
  facet_wrap(~Density_2,
             ncol = 1)

For a stacked bar chart, add a fill argument to the labs() function, such as fill = "HH / sq. mile", to control the legend’s title:

ggplot(mydata, aes(x = Region, fill = Density_2)) +
  geom_bar()+
  labs(x = "Grand division",
       y = "Number of counties",
       fill = "HH / sq. mile")

Adding color

As you can see above, ggplot2 sometimes adds color by default. You can add it on purpose, though - and control which colors get used. Here, adding color = "darkblue" inside the geom_histogram() function’s parentheses changes the outlines of the histogram bars to dark blue. Adding fill = "blue", after a comma, changes the area inside the bars to blue. You can try other values, too, such as color = "#190482", fill = "#7752FE", as long as each one is in quotation marks.

# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
  geom_histogram(color = "darkblue",
                 fill = "blue")+
  labs(x = "Pct. HH with broadband",
       y = "Number of counties",
       title = "Broadband access among TN counties")

Alternatively, you can specify colors using hex codes, which you can get from online palette guides.

# Basic histogram
ggplot(mydata, aes(x=Pct_BB))+
  geom_histogram(color = "#001B79",
                 fill = "#1640D6")+
  labs(x = "Pct. HH with broadband",
       y = "Number of counties",
       title = "Broadband access among TN counties")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
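
If you are curious how a color name relates to a hex code, base R can convert one to the other. In this small sketch, to_hex() is a hypothetical helper written for illustration; it is not part of ggplot2.

```r
# col2rgb() returns a color's red, green, and blue values (0 to 255)
col2rgb("darkblue")

# Hypothetical helper: rgb() turns those values back into a hex code
to_hex <- function(color_name) {
  rgb(t(col2rgb(color_name)), maxColorValue = 255)
}

to_hex("darkblue")  # "#00008B"
to_hex("blue")      # "#0000FF"
```

In a hex code, each pair of characters after the # gives the amount of red, green, and blue, respectively.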

The same approach will add color to a bar chart:

# Basic bar chart
ggplot(mydata, aes(x=Region))+
  geom_bar(color = "darkblue",
           fill = "blue")+
  labs(x = "Tennessee grand division",
       y = "Number of counties",
       title = "TN counties, by division")

… and to a scatterplot, although, with a scatterplot, you also might want to change the regression line’s default color to something else. Here, I changed it to dark gray.

ggplot(mydata, aes(x = Med_HH_Income,
                   y = Pct_BB))+
  # The default point shape has no separate fill, so color alone sets its look
  geom_point(color = "darkblue")+
  labs(x = "Median HH income",
       y = "Pct. HH with broadband",
       title = "Broadband access, by income")+
  geom_smooth(method = "lm",
              se = FALSE,
              color = "darkgray")

Coloring a stacked bar chart with something other than the default colors requires adding the scale_fill_manual() function to the code, after a + symbol. The color codes you want to use are listed within the function’s parentheses, using the format values=c('color1', 'color2', 'color3'), and so on, with one color code for each area you want to color:

ggplot(mydata, aes(x = Region, fill = Density_2)) +
  geom_bar()+
  labs(x = "Grand division",
       y = "Number of counties",
       fill = "HH / sq. mile")+
  scale_fill_manual(values=c('lightblue', 'darkblue'))
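
When the colors are listed without names, ggplot2 hands them out in the order of the categories, which is alphabetical for text. If you want to be sure a particular category gets a particular color, you can name each color. This sketch uses a made-up data frame, not the tutorial's data.

```r
library(ggplot2)

# Hypothetical counties, not the tutorial's data
toy <- data.frame(
  Region    = c("East", "East", "East"),
  Density_2 = c("Low density", "Low density", "High density")
)

# Naming each color ties it to a category, regardless of sort order
p <- ggplot(toy, aes(x = Region, fill = Density_2)) +
  geom_bar() +
  scale_fill_manual(values = c("Low density"  = "lightblue",
                               "High density" = "darkblue"))

# Each row is one bar segment, with the fill color ggplot2 assigned
segments <- layer_data(p)
segments[, c("fill", "count")]
```

Here the two low-density rows end up light blue and the single high-density row ends up dark blue.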

Basic data visualization in R

Dr. Ken Blake

2024-03-07