Exercises on ggplot2

Harold Nelson

2/5/2022

Setup

Load the tidyverse and the dataset “county_clean.Rdata”. Import the file “state_region.csv” into the dataframe state_region.

Solution

library(tidyverse)
load("county_clean.Rdata")
state_region <- read_csv("state_region.csv")

Data

Do a glimpse of county_clean and state_region.

Solution

glimpse(county_clean)
## Rows: 3,135
## Columns: 14
## $ name              <fct> Autauga County, Baldwin County, Barbour County, Bibb…
## $ state             <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama…
## $ pop2000           <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 11…
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ pop2017           <int> 55504, 212628, 25270, 22668, 58013, 10309, 19825, 11…
## $ pop_change        <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, …
## $ poverty           <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8…
## $ homeownership     <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4…
## $ multi_unit        <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3…
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 5.49, 4.93, 4.08…
## $ metro             <fct> yes, yes, no, yes, yes, no, no, yes, no, no, yes, no…
## $ median_edu        <fct> some_college, some_college, hs_diploma, hs_diploma, …
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income  <int> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 436…
glimpse(state_region)
## Rows: 51
## Columns: 4
## $ State        <chr> "Alaska", "Alabama", "Arkansas", "Arizona", "California",…
## $ `State Code` <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL…
## $ Region       <chr> "West", "South", "South", "West", "West", "West", "Northe…
## $ Division     <chr> "Pacific", "East South Central", "West South Central", "M…

Simple Scatterplot

Do a simple scatterplot of homeownership on the y-axis against per_capita_income on the x-axis.

Solution

county_clean %>% 
  ggplot(aes(x = per_capita_income, y = homeownership)) +
  geom_point()

Adjust

Use the alpha and size parameters of geom_point() to reduce the overplotting. Add a smoother with geom_smooth().

Solution

county_clean %>% 
  ggplot(aes(x = per_capita_income, y = homeownership)) +
  geom_point(size = .2, alpha = .2) +
  geom_smooth(color = "red")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
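
The message above reports the smoothing method and formula that geom_smooth() picked on its own. As a minimal sketch, the method and formula can also be stated explicitly. The data frame toy and its columns x and y are made up for illustration, and the linear fit is a swapped-in choice, not the smoother used in the solution.

```r
library(ggplot2)

# Hypothetical toy data standing in for county_clean
set.seed(1)
toy <- data.frame(x = runif(200, 10000, 60000))
toy$y <- 80 - toy$x / 5000 + rnorm(200, sd = 5)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = .2, alpha = .2) +
  # method and formula are named here instead of being left to the default
  geom_smooth(method = "lm", formula = y ~ x, color = "red")
p
```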

Examine Categorical / Quantitative Relationship

The variable median_edu is categorical and the variable per_capita_income is quantitative. Put the education variable on the x-axis and the income variable on the y-axis. Use geom_point().

Solution

county_clean %>% 
  ggplot(aes(x = median_edu, y = per_capita_income)) +
  geom_point()

Jitter

Repeat the previous exercise with geom_jitter(). Play with the size parameter to get a value you like.

Solution

county_clean %>% 
  ggplot(aes(x = median_edu, y = per_capita_income)) +
  geom_jitter(size = .5)
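
geom_jitter() also takes width and height arguments that control how far points are moved in each direction. The sketch below uses made-up data (toy and its values are assumptions) and sets height = 0 so that only the horizontal position within each category is jittered and the income values are plotted unchanged.

```r
library(ggplot2)

# Hypothetical toy data; the seed only fixes the made-up incomes
set.seed(42)
toy <- data.frame(
  median_edu = factor(rep(c("hs_diploma", "some_college", "bachelors"), each = 50)),
  per_capita_income = c(rnorm(50, 22000, 3000),
                        rnorm(50, 27000, 4000),
                        rnorm(50, 38000, 6000))
)

p <- ggplot(toy, aes(x = median_edu, y = per_capita_income)) +
  # width spreads points sideways within each category;
  # height = 0 leaves the y values where they are
  geom_jitter(size = .5, width = .25, height = 0)
p
```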

Add Labels

Make the graph look more professional by adding axis labels, a title, and a subtitle with labs().

Solution

county_clean %>% 
  ggplot(aes(x = median_edu, y = per_capita_income)) +
  geom_jitter(size = .5) +
  labs(x = "Median Educational Level",
       y = "Per Capita Income",
       title = "Per Capita Income by Education Level",
       subtitle = "US Counties 2017")

Add Region Data

Use left_join() to join state_region to county_clean, matching state in county_clean with State in state_region. Glimpse the result.

Solution

county_clean <- county_clean %>% 
  left_join(state_region, by = c("state" = "State"))
glimpse(county_clean)
## Rows: 3,135
## Columns: 17
## $ name              <fct> Autauga County, Baldwin County, Barbour County, Bibb…
## $ state             <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama…
## $ pop2000           <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 11…
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ pop2017           <int> 55504, 212628, 25270, 22668, 58013, 10309, 19825, 11…
## $ pop_change        <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, …
## $ poverty           <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8…
## $ homeownership     <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4…
## $ multi_unit        <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3…
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 5.49, 4.93, 4.08…
## $ metro             <fct> yes, yes, no, yes, yes, no, no, yes, no, no, yes, no…
## $ median_edu        <fct> some_college, some_college, hs_diploma, hs_diploma, …
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income  <int> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 436…
## $ `State Code`      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
## $ Region            <chr> "South", "South", "South", "South", "South", "South"…
## $ Division          <chr> "East South Central", "East South Central", "East So…
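
Note that state is now <chr> rather than <fct>, because it was joined to a character column. A left join keeps every row of the left table, so any state without a match would get NA in the new columns. One way to check for that, sketched here on made-up miniature tables (counties, regions, and their contents are hypothetical), is anti_join(), which returns the unmatched rows.

```r
library(dplyr)

# Hypothetical miniature versions of the two data frames
counties <- tibble(
  name  = c("County A", "County B", "County C"),
  state = factor(c("Alabama", "Alaska", "Puerto Rico"))
)
regions <- tibble(
  State  = c("Alaska", "Alabama"),
  Region = c("West", "South")
)

# Every row of counties is kept; Region is NA where no State matches
joined <- counties %>%
  left_join(regions, by = c("state" = "State"))

# Rows of counties with no matching State in regions
unmatched <- counties %>%
  anti_join(regions, by = c("state" = "State"))

joined
unmatched
```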

Two Categorical Variables

Use geom_bar() and the fill aesthetic to describe the relationship between Region and median_edu. Make Region a factor first. In the call to geom_bar(), set color = "white".

Solution

county_clean <- county_clean %>% 
  mutate(Region = factor(Region))

county_clean %>% 
  ggplot(aes(x = Region, fill = median_edu)) +
  geom_bar(color = "white")
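
Each bar segment is a count of rows for one combination of Region and median_edu. The same counts can be tabulated with count(), which may help when reading the chart. The sketch below uses a made-up five-row table (toy is hypothetical), not the county data.

```r
library(dplyr)

# Hypothetical toy data with the same two column names
toy <- tibble(
  Region     = factor(c("South", "South", "South", "West", "West")),
  median_edu = factor(c("hs_diploma", "some_college", "some_college",
                        "some_college", "bachelors"))
)

# One row per Region / median_edu combination, with the count in column n
edu_counts <- toy %>%
  count(Region, median_edu)
edu_counts
```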

Position

The default value of the position parameter of geom_bar() is "stack". Try setting position = "dodge" and then position = "dodge2".

Solution

county_clean %>% 
  ggplot(aes(x = Region, fill = median_edu)) +
  geom_bar(color = "white", position = "dodge") +
  ggtitle("position = 'dodge'")

county_clean %>% 
  ggplot(aes(x = Region, fill = median_edu)) +
  geom_bar(color = "white", position = "dodge2") +
  ggtitle("position = 'dodge2'")
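
The exercise does not ask for it, but one more value of position is worth trying: "fill" scales every bar to the same height, so the segments show proportions within each Region instead of counts. A minimal sketch on made-up data (toy is hypothetical):

```r
library(ggplot2)

# Hypothetical toy data with the same two column names
toy <- data.frame(
  Region     = factor(c("South", "South", "South", "West", "West")),
  median_edu = factor(c("hs_diploma", "some_college", "some_college",
                        "some_college", "bachelors"))
)

p <- ggplot(toy, aes(x = Region, fill = median_edu)) +
  # position = "fill" stacks the segments and rescales each bar to total 1
  geom_bar(color = "white", position = "fill") +
  ylab("proportion") +
  ggtitle("position = 'fill'")
p
```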