dplyr basics - by = argument

Using dplyr

Set Up Your Project and Load Libraries

knitr::opts_chunk$set(echo = F,
                      fig.align = "center") 

## Load the tidyverse package
pacman::p_load(tidyverse)


## Change the default theme to theme_bw()
theme_set(theme_bw())

## Read in the "us counties.csv" data set and save it as counties
counties <- 
  read.csv("https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/us%20counties.csv") |> 
  # separating the county_state column into two separate columns
  separate(
    col = county_state,
    sep = ",",
    into = c("county", "state")
  ) |> 
  filter(state != "DC")

.by = and summarize()

While we can use summarize() to calculate summary stats for the entire data, it is often used to calculate summaries for groups in the data. How do we tell it to calculate summary stats for multiple groups in the data?

By using the by = {column} argument.

As an example, run the code below as is, then remove the # in front of the .by argument

##   economic counties_n pop_med pov_over_20 nohs_mp_cor
## 1        0       1236 42783.5  0.19902913   0.7795650
## 2        5        333 25914.0  0.04819277   0.6113926
## 3        3        501 29345.0  0.14970060   0.6273403
## 4        4        406 34776.5  0.33251232   0.7087486
## 5        2        221 14814.0  0.21719457   0.5964220
## 6        1        444  5783.0  0.13963964   0.5482886

Unlike using summarize() by itself, our result has more than 1 row! In fact, we have 1 row per group of the column given to .by!

The number of columns also increased. We went from 1 column per summary (4 columns) to 1 column per summary + 1 column for the variable given to .by for a total of 5 columns!

Basic bar chart of state population

Use summarize() with .by = to calculate the total population for each state.
Pipe the results into ggplot(), map x = state and y = population to create a bar chart with the proportion on the y-axis. You’ll need to use geom_col() instead of geom_bar() to create the bar chart since you’ll be specifying the y-axis as well as the x-axis

Multiple columns in .by =

You can specify multiple columns that can form the groups with .by = and it will form a group for each combination of columns listed.

For instance, if column1 has 3 groups and column2 has 5 groups, .by = c(column1, column2) will form 15 unique groups (3x5), assuming each group in column1 occurs with each group in column2

For counties with ruc of 1, 6, or 7, calculate the same summaries, but for each ruc and economic code combination!

## # A tibble: 18 × 6
##    economic   ruc counties_n pop_med pov_over_20 nohs_mp_cor
##       <int> <int>      <int>   <dbl>       <dbl>       <dbl>
##  1        3     6        158  24914       0.222        0.679
##  2        0     1        290 197084       0.0345       0.747
##  3        0     6        210  23134       0.310        0.791
##  4        3     7         66  21727       0.258        0.401
##  5        0     7        142  16454.      0.324        0.786
##  6        4     6         74  23218.      0.432        0.732
##  7        5     7         61  19637       0.0492       0.687
##  8        4     7         49  13737       0.469        0.687
##  9        2     7         64  16281       0.219        0.618
## 10        5     6         53  29298       0.0943       0.584
## 11        1     7         51   9615       0.196        0.631
## 12        1     6         53  11196       0.132        0.593
## 13        5     1         39  49563       0.0513       0.614
## 14        4     1         39 127446       0.128        0.783
## 15        3     1         43  68992       0            0.747
## 16        2     1          9  55820       0            0.783
## 17        2     6         45  14925       0.289        0.517
## 18        1     1         11  14960       0            0.803

Now that ruc has 3 groups and economic has 6 groups, our data set has 3x6 = 18 rows!

Cleveland Plot

Use summarize() and mutate() to create and save a data set that has:

The economic code
The rural urban continuum code (ruc)
The number of counties for each economic and ruc combination
income_avg = The average median home income per economic code and ruc combination
The standard deviation of median home income per economic code and ruc combination
lower = The average home income minus 2 standard deviation divided by the square root of the number of counties
upper = The average home income plus 2 standard deviation divided by the square root of the number of counties

Italics indicate what you should name that column. You can name the other columns whatever you wish

## # A tibble: 54 × 7
##    economic   ruc counties income_avg income_sd  lower  upper
##       <int> <int>    <int>      <dbl>     <dbl>  <dbl>  <dbl>
##  1        0     2      213     57342.    11486. 55768. 58916.
##  2        5     3       30     57096.    10614. 53221. 60972.
##  3        3     6      158     48346.     8307. 47025. 49668.
##  4        0     1      290     71477.    18784. 69270. 73683.
##  5        0     6      210     45987.     9786. 44636. 47337.
##  6        4     3       75     52201.     9937. 49906. 54496.
##  7        3     9       29     48722.    13525. 43699. 53745.
##  8        3     7       66     48459.     9190. 46197. 50721.
##  9        0     8       53     43495.     8898. 41050. 45939.
## 10        0     4       86     48565.     8135. 46811. 50320.
## # ℹ 44 more rows

Use the resulting data set to create a cleveland plot using ggplot() and 2 different geoms:

geom_point()

aesthetics: x, y, and color

geom_segment()

aesthetics: x, xend, y, yend

The graph can be found in brightspace!

## Warning: Removed 2 rows containing missing values (`geom_segment()`).

.by = uses with other dplyr verbs

`filter(.by = )`

While it is commonly used with summarize(), that isn’t the only dplyr verb with a .by = argument.

For example, let’s say I want to only keep the counties that have the largest median home_income for their state:

## # A tibble: 50 × 3
##    county              state home_income
##    <chr>               <chr>       <int>
##  1 Shelby County       AL          75663
##  2 North Slope Borough AK          89276
##  3 Greenlee County     AZ          65818
##  4 Benton County       AR          68978
##  5 Santa Clara County  CA         125933
##  6 Douglas County      CO         120670
##  7 Fairfield County    CT          91183
##  8 New Castle County   DE          69562
##  9 St. Johns County    FL          81925
## 10 Forsyth County      GA         105921
## # ℹ 40 more rows

`slice(.by = )`

We can also use .by = with slice() to pick the rows of a data set by group as well.

Use these two functions to pick the first 3 counties (rows) listed for each state

## # A tibble: 150 × 14
##    county_id county      state metro   ruc economic population no_hs high_school
##        <int> <chr>       <chr> <chr> <int>    <int>      <int> <dbl>       <dbl>
##  1         1 Autauga Co… AL    Yes       2        0      55086 0.113       0.326
##  2         2 Baldwin Co… AL    Yes       3        5     203727 0.097       0.276
##  3         3 Barbour Co… AL    No        6        3      26054 0.27        0.357
##  4        68 Aleutians … AK    No        9        3       3345 0.136       0.402
##  5        69 Aleutians … AK    No        9        3       5706 0.112       0.405
##  6        70 Anchorage … AK    Yes       2        4     295230 0.062       0.238
##  7        97 Apache Cou… AZ    No        6        4      71497 0.204       0.331
##  8        98 Cochise Co… AZ    Yes       3        4     127014 0.128       0.235
##  9        99 Coconino C… AZ    Yes       3        4     139147 0.104       0.214
## 10       112 Arkansas C… AR    No        6        3      18261 0.17        0.383
## # ℹ 140 more rows
## # ℹ 5 more variables: some_college <dbl>, bachelor <dbl>, poverty <dbl>,
## #   minor_poverty <dbl>, home_income <int>

by with slice_min/slice_max

You can also use by = in slice_min() or slice_max() to pick the smallest or largest n rows of a certain column for each group. Note: no . before by using slice()

Using two of the above functions to find the 2 counties lowest median home_income degree percentage per state

## # A tibble: 100 × 3
##    county                    state home_income
##    <chr>                     <chr>       <int>
##  1 Wilcox County             AL          25385
##  2 Perry County              AL          26814
##  3 Kusilvak Census Area      AK          32728
##  4 Yukon-Koyukuk Census Area AK          38912
##  5 Apache County             AZ          33652
##  6 La Paz County             AZ          37350
##  7 Lee County                AR          28080
##  8 Phillips County           AR          29945
##  9 Trinity County            CA          40680
## 10 Lake County               CA          45086
## # ℹ 90 more rows

`by` and `mutate()`

While it isn’t used often, mutate() also has a .by = argument. It is used together if we want to calculate a group summary and fill the entire column with the value per group.

For instance, the code below will will calculate the average median home income per state, then subtract median home income from the average of the state to see how that counties compares to other counties in the same state:

## # A tibble: 3,141 × 5
##    county          state home_income MHI_avg MHI_diff
##    <chr>           <chr>       <int>   <dbl>    <dbl>
##  1 Autauga County  AL          59338  43480.   15858.
##  2 Baldwin County  AL          57588  43480.   14108.
##  3 Barbour County  AL          34382  43480.   -9098.
##  4 Bibb County     AL          46064  43480.    2584.
##  5 Blount County   AL          50412  43480.    6932.
##  6 Bullock County  AL          29267  43480.  -14213.
##  7 Butler County   AL          37365  43480.   -6115.
##  8 Calhoun County  AL          45400  43480.    1920.
##  9 Chambers County AL          39917  43480.   -3563.
## 10 Cherokee County AL          42132  43480.   -1348.
## # ℹ 3,131 more rows

dplyr basics - by = argument

your name

STAT 2870

Using dplyr

Set Up Your Project and Load Libraries

.by = and summarize()

Basic bar chart of state population

Multiple columns in .by =

Cleveland Plot

.by = uses with other dplyr verbs

`filter(.by = )`

`slice(.by = )`

by with slice_min/slice_max

`by` and `mutate()`

dplyr basics - by = argument

your name

STAT 2870

Using dplyr

Set Up Your Project and Load Libraries

.by = and summarize()

Basic bar chart of state population

Multiple columns in .by =

Cleveland Plot

.by = uses with other dplyr verbs

filter(.by = )

slice(.by = )

by with slice_min/slice_max

by and mutate()

`filter(.by = )`

`slice(.by = )`

`by` and `mutate()`