Harold Nelson
2/7/2022
Load the tidyverse and the dataset “county_clean.Rdata”. Import the file “state_region.csv” into the dataframe state_region. Use left_join to add the region data to county_clean. Glimpse county_clean to confirm that the join added the region columns.
library(tidyverse)
load("county_clean.Rdata")
state_region <- read_csv("state_region.csv")
county_clean <- county_clean %>%
  left_join(state_region, by = c("state" = "State"))
glimpse(county_clean)
## Rows: 3,135
## Columns: 17
## $ name <fct> Autauga County, Baldwin County, Barbour County, Bibb…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama…
## $ pop2000 <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 11…
## $ pop2010 <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ pop2017 <int> 55504, 212628, 25270, 22668, 58013, 10309, 19825, 11…
## $ pop_change <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, …
## $ poverty <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8…
## $ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4…
## $ multi_unit <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3…
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 5.49, 4.93, 4.08…
## $ metro <fct> yes, yes, no, yes, yes, no, no, yes, no, no, yes, no…
## $ median_edu <fct> some_college, some_college, hs_diploma, hs_diploma, …
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income <int> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 436…
## $ `State Code` <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
## $ Region <chr> "South", "South", "South", "South", "South", "South"…
## $ Division <chr> "East South Central", "East South Central", "East So…
An Alternative for Two Categorical Variables
Examine the relationship between median educational level and region. Put region on the horizontal axis and median education on the vertical axis. Use geom_jitter(). Try a few different values of the size parameter in geom_jitter().
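One possible solution is sketched below. So that it runs on its own, it uses a small made-up tibble, toy_counties (a hypothetical stand-in, not part of the exercise data), with the same Region and median_edu columns. With the real data, pass county_clean to ggplot() instead. The size value is only a starting point to experiment with.

```r
library(ggplot2)
library(tibble)

# Hypothetical stand-in for county_clean; the rows are made up.
toy_counties <- tibble(
  Region     = c("South", "South", "South", "West", "West",
                 "Midwest", "Northeast", "Northeast"),
  median_edu = c("hs_diploma", "some_college", "some_college", "some_college",
                 "bachelors", "hs_diploma", "some_college", "bachelors")
)

# geom_jitter() adds random noise to each point's position, so points that
# share the same (Region, median_edu) pair no longer sit on top of each other.
p_jitter <- ggplot(toy_counties, aes(x = Region, y = median_edu)) +
  geom_jitter(size = 0.5)

p_jitter
```

With roughly 3,000 counties, a small size usually makes the dense clusters easier to compare.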
Another Alternative
Repeat the last exercise but use geom_count() instead of geom_jitter().
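A minimal sketch of the geom_count() version follows, again on a made-up tibble (toy_counties is hypothetical) so that it runs without the data file. With the real data, use county_clean in its place.

```r
library(ggplot2)
library(tibble)

# Hypothetical stand-in for county_clean; the rows are made up.
toy_counties <- tibble(
  Region     = c("South", "South", "South", "West", "West", "Midwest"),
  median_edu = c("hs_diploma", "hs_diploma", "some_college",
                 "some_college", "some_college", "hs_diploma")
)

# geom_count() draws one point per (Region, median_edu) combination and
# maps the number of rows in that combination to the point size.
p_count <- ggplot(toy_counties, aes(x = Region, y = median_edu)) +
  geom_count()

p_count
```

Unlike jittering, this draws each combination exactly once, so the comparison rests on point size rather than on how crowded a cloud of points looks.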
William Cleveland popularized a style of plotting that is now often called the Cleveland dot plot. We’ll start with something that doesn’t really work and improve it in steps.
The Simple Version
Get a barplot of the number of counties in each state using geom_bar().
The graph is totally unreadable because the x-axis labels are on top of each other. A solution is to add coord_flip() as a layer. Do that.
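The two steps above can be sketched as follows. The tibble toy_counties is a made-up stand-in with one row per county (hypothetical data). With the real data, replace it with county_clean.

```r
library(ggplot2)
library(tibble)

# Hypothetical stand-in for county_clean: one row per county.
toy_counties <- tibble(
  state = c("Alabama", "Alabama", "Alabama", "Alaska", "Arizona", "Arizona")
)

# geom_bar() counts the rows for each state; coord_flip() swaps the axes
# so the state names run down the side instead of colliding along the bottom.
p_bar <- ggplot(toy_counties, aes(x = state)) +
  geom_bar() +
  coord_flip()

p_bar
```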
Use dplyr to create a dataframe state_counties with state name and the count of counties.
state_counties <- county_clean %>%
  group_by(state) %>%
  summarize(count = n()) %>%
  ungroup()
head(state_counties)
## # A tibble: 6 × 2
## state count
## <chr> <int>
## 1 Alabama 67
## 2 Alaska 25
## 3 Arizona 15
## 4 Arkansas 75
## 5 California 58
## 6 Colorado 63
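As an aside, dplyr's count() should give the same table in a single step. The sketch below checks this on a made-up tibble (toy_counties is hypothetical); the name argument sets the name of the count column.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for county_clean: one row per county.
toy_counties <- tibble(
  state = c("Alabama", "Alabama", "Alabama", "Alaska", "Arizona", "Arizona")
)

# The group_by() / summarize() approach used in the exercise.
by_summarize <- toy_counties %>%
  group_by(state) %>%
  summarize(count = n()) %>%
  ungroup()

# The shortcut: count() groups, tallies, and ungroups in one call.
by_count <- toy_counties %>%
  count(state, name = "count")

by_count
```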
Use state_counties as the data argument of ggplot. In the aes, map x to count and y to state. Use geom_col().
In the previous exercise, replace state with reorder(state, count) so that the states are sorted by their number of counties.
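A sketch of this step, using only the six states shown in the head() output above rather than the full table (the name state_counties_head is made up for this example):

```r
library(ggplot2)
library(tibble)

# The six rows printed by head(state_counties) above.
state_counties_head <- tibble(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  count = c(67L, 25L, 15L, 75L, 58L, 63L)
)

# reorder(state, count) turns state into a factor whose levels are sorted
# by count, so the bars appear in order of size instead of alphabetically.
p_col <- ggplot(state_counties_head, aes(x = count, y = reorder(state, count))) +
  geom_col()

p_col
```

The smallest count lands at the bottom of the y-axis and the largest at the top.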
Replace geom_col() in the previous exercise with geom_point().
Add a geom_col() to the previous exercise. Set width = .1.
state_counties %>%
  ggplot(aes(x = count, y = reorder(state, count))) +
  geom_point() +
  geom_col(width = .1)
Get a histogram of pop2017.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph is unreadable because of the extreme right skew.
Use a logarithmic scale. Add scale_x_log10() as a layer.
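A minimal sketch of the log-scale histogram follows. To stay runnable without the data file, it uses only the seven pop2017 values visible in the glimpse() output above (the name pop_sample is made up). With the real data, pass county_clean to ggplot(). Setting bins = 10 is an arbitrary choice that also silences the stat_bin() message.

```r
library(ggplot2)
library(tibble)

# The first seven pop2017 values shown in the glimpse() output.
pop_sample <- tibble(
  pop2017 = c(55504, 212628, 25270, 22668, 58013, 10309, 19825)
)

# scale_x_log10() transforms the x-axis before binning, so the bins are
# equally wide in log units and the long right tail is pulled in.
p_hist <- ggplot(pop_sample, aes(x = pop2017)) +
  geom_histogram(bins = 10) +
  scale_x_log10()

p_hist
```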