knitr::opts_chunk$set(echo = F,
fig.align = "center")
## Load the tidyverse package
pacman::p_load(tidyverse)
## Change the default theme to theme_bw()
theme_set(theme_bw())
## Read in the "us counties.csv" data set and save it as counties
counties <-
read.csv("https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/us%20counties.csv") |>
# separating the county_state column into two separate columns
separate(
col = county_state,
sep = ",",
into = c("county", "state")
) |>
filter(state != "DC")
While we can use summarize()
to calculate summary stats
for the entire data, it is often used to calculate summaries for groups
in the data. How do we tell it to calculate summary stats for multiple
groups in the data?
By using the by = {column}
argument.
As an example, run the code below as is, then remove the # in front
of the .by
argument
## economic counties_n pop_med pov_over_20 nohs_mp_cor
## 1 0 1236 42783.5 0.19902913 0.7795650
## 2 5 333 25914.0 0.04819277 0.6113926
## 3 3 501 29345.0 0.14970060 0.6273403
## 4 4 406 34776.5 0.33251232 0.7087486
## 5 2 221 14814.0 0.21719457 0.5964220
## 6 1 444 5783.0 0.13963964 0.5482886
Unlike using summarize()
by itself, our result has more
than 1 row! In fact, we have 1 row per group of the column given to
.by
!
The number of columns also increased. We went from 1 column per
summary (4 columns) to 1 column per summary + 1 column for the variable
given to .by
for a total of 5 columns!
Use summarize()
with .by =
to calculate
the total population for each state.
Pipe the results into ggplot()
, map
x = state
and y = population
to create a bar
chart with the proportion on the y-axis. You’ll need to use
geom_col()
instead of geom_bar()
to create the
bar chart since you’ll be specifying the y-axis as well as the
x-axis
You can specify multiple columns that can form the groups with
.by =
and it will form a group for each combination of
columns listed.
For instance, if column1 has 3 groups and column2
has 5 groups, .by = c(column1, column2)
will form 15 unique
groups (3x5), assuming each group in column1 occurs with each
group in column2
For counties with ruc of 1, 6, or 7, calculate the same summaries, but for each ruc and economic code combination!
## # A tibble: 18 × 6
## economic ruc counties_n pop_med pov_over_20 nohs_mp_cor
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 3 6 158 24914 0.222 0.679
## 2 0 1 290 197084 0.0345 0.747
## 3 0 6 210 23134 0.310 0.791
## 4 3 7 66 21727 0.258 0.401
## 5 0 7 142 16454. 0.324 0.786
## 6 4 6 74 23218. 0.432 0.732
## 7 5 7 61 19637 0.0492 0.687
## 8 4 7 49 13737 0.469 0.687
## 9 2 7 64 16281 0.219 0.618
## 10 5 6 53 29298 0.0943 0.584
## 11 1 7 51 9615 0.196 0.631
## 12 1 6 53 11196 0.132 0.593
## 13 5 1 39 49563 0.0513 0.614
## 14 4 1 39 127446 0.128 0.783
## 15 3 1 43 68992 0 0.747
## 16 2 1 9 55820 0 0.783
## 17 2 6 45 14925 0.289 0.517
## 18 1 1 11 14960 0 0.803
Now that ruc has 3 groups and economic has 6 groups, our data set has 3x6 = 18 rows!
Use summarize()
and mutate()
to create and
save a data set that has:
The economic code
The rural urban continuum code (ruc)
The number of counties for each economic and ruc combination
income_avg = The average median home income per economic code and ruc combination
The standard deviation of median home income per economic code and ruc combination
lower = The average home income minus 2 standard deviation divided by the square root of the number of counties
upper = The average home income plus 2 standard deviation divided by the square root of the number of counties
Italics indicate what you should name that column. You can name the other columns whatever you wish
## # A tibble: 54 × 7
## economic ruc counties income_avg income_sd lower upper
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 2 213 57342. 11486. 55768. 58916.
## 2 5 3 30 57096. 10614. 53221. 60972.
## 3 3 6 158 48346. 8307. 47025. 49668.
## 4 0 1 290 71477. 18784. 69270. 73683.
## 5 0 6 210 45987. 9786. 44636. 47337.
## 6 4 3 75 52201. 9937. 49906. 54496.
## 7 3 9 29 48722. 13525. 43699. 53745.
## 8 3 7 66 48459. 9190. 46197. 50721.
## 9 0 8 53 43495. 8898. 41050. 45939.
## 10 0 4 86 48565. 8135. 46811. 50320.
## # ℹ 44 more rows
Use the resulting data set to create a cleveland
plot using ggplot()
and 2 different geoms:
geom_point()
x
, y
, and
color
geom_segment()
x
, xend
, y
,
yend
The graph can be found in brightspace!
## Warning: Removed 2 rows containing missing values (`geom_segment()`).
filter(.by = )
While it is commonly used with summarize()
, that isn’t
the only dplyr verb with a .by =
argument.
For example, let’s say I want to only keep the counties that have the largest median home_income for their state:
## # A tibble: 50 × 3
## county state home_income
## <chr> <chr> <int>
## 1 Shelby County AL 75663
## 2 North Slope Borough AK 89276
## 3 Greenlee County AZ 65818
## 4 Benton County AR 68978
## 5 Santa Clara County CA 125933
## 6 Douglas County CO 120670
## 7 Fairfield County CT 91183
## 8 New Castle County DE 69562
## 9 St. Johns County FL 81925
## 10 Forsyth County GA 105921
## # ℹ 40 more rows
slice(.by = )
We can also use .by =
with slice()
to pick
the rows of a data set by group as well.
Use these two functions to pick the first 3 counties (rows) listed for each state
## # A tibble: 150 × 14
## county_id county state metro ruc economic population no_hs high_school
## <int> <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 1 Autauga Co… AL Yes 2 0 55086 0.113 0.326
## 2 2 Baldwin Co… AL Yes 3 5 203727 0.097 0.276
## 3 3 Barbour Co… AL No 6 3 26054 0.27 0.357
## 4 68 Aleutians … AK No 9 3 3345 0.136 0.402
## 5 69 Aleutians … AK No 9 3 5706 0.112 0.405
## 6 70 Anchorage … AK Yes 2 4 295230 0.062 0.238
## 7 97 Apache Cou… AZ No 6 4 71497 0.204 0.331
## 8 98 Cochise Co… AZ Yes 3 4 127014 0.128 0.235
## 9 99 Coconino C… AZ Yes 3 4 139147 0.104 0.214
## 10 112 Arkansas C… AR No 6 3 18261 0.17 0.383
## # ℹ 140 more rows
## # ℹ 5 more variables: some_college <dbl>, bachelor <dbl>, poverty <dbl>,
## # minor_poverty <dbl>, home_income <int>
You can also use by =
in slice_min()
or
slice_max()
to pick the smallest or largest n rows of a
certain column for each group. Note: no .
before
by
using slice()
Using two of the above functions to find the 2 counties lowest median home_income degree percentage per state
## # A tibble: 100 × 3
## county state home_income
## <chr> <chr> <int>
## 1 Wilcox County AL 25385
## 2 Perry County AL 26814
## 3 Kusilvak Census Area AK 32728
## 4 Yukon-Koyukuk Census Area AK 38912
## 5 Apache County AZ 33652
## 6 La Paz County AZ 37350
## 7 Lee County AR 28080
## 8 Phillips County AR 29945
## 9 Trinity County CA 40680
## 10 Lake County CA 45086
## # ℹ 90 more rows
by
and mutate()
While it isn’t used often, mutate()
also has a
.by =
argument. It is used together if we want to calculate
a group summary and fill the entire column with the value per group.
For instance, the code below will will calculate the average median home income per state, then subtract median home income from the average of the state to see how that counties compares to other counties in the same state:
## # A tibble: 3,141 × 5
## county state home_income MHI_avg MHI_diff
## <chr> <chr> <int> <dbl> <dbl>
## 1 Autauga County AL 59338 43480. 15858.
## 2 Baldwin County AL 57588 43480. 14108.
## 3 Barbour County AL 34382 43480. -9098.
## 4 Bibb County AL 46064 43480. 2584.
## 5 Blount County AL 50412 43480. 6932.
## 6 Bullock County AL 29267 43480. -14213.
## 7 Butler County AL 37365 43480. -6115.
## 8 Calhoun County AL 45400 43480. 1920.
## 9 Chambers County AL 39917 43480. -3563.
## 10 Cherokee County AL 42132 43480. -1348.
## # ℹ 3,131 more rows