Harold Nelson
10/2/2021
Get the socviz package and tidyverse. Build the counties dataframe as in the previous set of exercises.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s reduce the number of variables in counties and add a continuous variable density.
Convert counties to counties_long by collapsing pop, density, and land_area into a variable measure.
counties_long = counties %>%
pivot_longer(cols = pop:density, names_to = "measure", values_to = "value")
head(counties_long)
## # A tibble: 6 x 6
## fips name census_region state measure value
## <dbl> <chr> <fct> <fct> <chr> <dbl>
## 1 1001 Autauga County South AL pop 55395
## 2 1001 Autauga County South AL land_area 594.
## 3 1001 Autauga County South AL density 93.2
## 4 1003 Baldwin County South AL pop 200111
## 5 1003 Baldwin County South AL land_area 1590.
## 6 1003 Baldwin County South AL density 126.
Use facet_grid() to show how the regions compare for these measures.
counties_long %>%
ggplot(aes(x = value)) +
geom_density() +
geom_rug() +
scale_x_log10() +
facet_grid(census_region ~ measure, scales = "free_x")
That is a bit busy and possibly hard to read. Use filter to compare just the West and the Northeast for land area.
counties_long %>%
filter(census_region %in% c("West","Northeast") &
measure == "land_area") %>%
ggplot(aes(x = value)) +
geom_density() +
geom_rug() +
scale_x_log10() +
facet_grid(census_region ~ measure)
Let’s use summarize to compare the land areas of counties in the regions. Capture min, median, and max values.
counties_long %>%
filter(measure == "land_area") %>%
group_by(census_region) %>%
summarize(min = min(value),
median = median(value),
max = max(value)) %>%
ungroup() %>%
arrange(median) %>%
ggplot(aes(y = census_region)) +
geom_point(aes(x = min),color = "blue") +
geom_point(aes(x = median),color = "black") +
geom_point(aes(x = max),color = "red") +
ggtitle("Comparison of Land Areas")
Use a log scale on the x-axis
counties_long %>%
filter(measure == "land_area") %>%
group_by(census_region) %>%
summarize(min = min(value),
median = median(value),
max = max(value)) %>%
ungroup() %>%
arrange(median) %>%
ggplot(aes(y = census_region)) +
geom_point(aes(x = min),color = "blue") +
geom_point(aes(x = median),color = "black") +
geom_point(aes(x = max),color = "red") +
ggtitle("Comparison of Land Areas") +
scale_x_log10()
The following code will create a dataframe containing the uninsured rate by county.
URL <- "https://healthinequality.org/dl/health_ineq_online_table_12.csv"
characteristicsData <- read.csv(URL)
uninsured_rate = characteristicsData %>%
select(cty,puninsured2010,county_name,statename)
str(uninsured_rate)
## 'data.frame': 3138 obs. of 4 variables:
## $ cty : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
## $ puninsured2010: num 13.6 19.1 18.5 17.7 19.3 ...
## $ county_name : chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ statename : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
Now I want to see the distribution of county uninsured rates by census_region.
Trim the counties dataframe down to just the identifying variables.
## fips census_region state name
## 1 1001 South AL Autauga County
## 2 1003 South AL Baldwin County
## 3 1005 South AL Barbour County
## 4 1007 South AL Bibb County
## 5 1009 South AL Blount County
## 6 1011 South AL Bullock County
## fips census_region state name puninsured2010 county_name statename
## 1 1001 South AL Autauga County 13.60128 Autauga Alabama
## 2 1003 South AL Baldwin County 19.08533 Baldwin Alabama
## 3 1005 South AL Barbour County 18.51381 Barbour Alabama
## 4 1007 South AL Bibb County 17.71883 Bibb Alabama
## 5 1009 South AL Blount County 19.28041 Blount Alabama
## 6 1011 South AL Bullock County 16.24321 Bullock Alabama
Use facet_wrap() to see the distribution of the uninsured rate by region.