October 15, 2020

Before we get started

  • The student shared experience
  • Slowing down the class
  • Any issues running R at this point?
  • R Studio vs. R; .Rmd files vs. .R files

What we learned in the previous class

  • Getting up and running in RStudio
  • Core R concepts
    • Data types
    • Relationship between data and functions
  • How to import data
  • Regularly used data summary functions
  • Assignment and pipe operators
  • Troubleshooting basics

What we’ll learn today

  • Selecting variables
  • Creating variables
  • Updating values
  • Subsetting tibbles/data frames
  • Aggregation

Let’s begin by reading the Covid-19 and Seattle crime datasets

library(tidyverse)

covid <- read_csv('https://rb.gy/lzlylj')
crime <- read_csv('https://rb.gy/5zuayh') 

Select, create, and rename variables

Select, create, and rename variables

  • select() keeps the named variables from the dataset
  • You can also remove variables from the dataset with -
  • Let’s keep it simple for now and not nest functions within select()
    • Exception: everything()
  • Use '``' for multi-word variables and variable names that begin with numbers
covid %>% 
  select(location, deaths, confirmed_infections, est_infections, date) %>% 
  head(3)
## # A tibble: 3 x 5
##   location deaths confirmed_infections est_infections date      
##   <chr>     <dbl>                <dbl>          <dbl> <date>    
## 1 Global      103                 4966             NA 2020-02-04
## 2 Global      111                 4749             NA 2020-02-05
## 3 Global      110                 3919             NA 2020-02-06

Select, create, and rename variables

  • select() keeps the named variables from the dataset
  • You can also remove variables from the dataset with -
  • Let’s keep it simple for now and not nest functions within select()
    • Exception: everything()
  • Use '``' for multi-word variables and variable names that begin with numbers
covid %>% 
  select(location, deaths, confirmed_infections, est_infections, date) %>% 
  select(-est_infections) %>%
  head(3)
## # A tibble: 3 x 4
##   location deaths confirmed_infections date      
##   <chr>     <dbl>                <dbl> <date>    
## 1 Global      103                 4966 2020-02-04
## 2 Global      111                 4749 2020-02-05
## 3 Global      110                 3919 2020-02-06

Select, create, and rename variables

  • select() keeps the named variables from the dataset
  • You can also remove variables from the dataset with -
  • Let’s keep it simple for now and not nest functions within select()
    • Exception: everything()
  • Use '``' for multi-word variables and variable names that begin with numbers
covid %>% 
  select(location, deaths, confirmed_infections, est_infections, date) %>% 
  select(location, date, everything())  %>% 
  head(3)
## # A tibble: 3 x 5
##   location date       deaths confirmed_infections est_infections
##   <chr>    <date>      <dbl>                <dbl>          <dbl>
## 1 Global   2020-02-04    103                 4966             NA
## 2 Global   2020-02-05    111                 4749             NA
## 3 Global   2020-02-06    110                 3919             NA

Select, create, and rename variables

  • mutate() creates variables
  • Create multiple variables in one go
  • Use functions you already learned within mutate()
    • min(), max(), mean()
    • as.character(), as.factor(), as.numeric()
  • Use ifelse() within mutate()
covid %>% 
  mutate(
    cases_deaths_ratio = (deaths / confirmed_infections) %>% round(2)
    , tests_over_mean = (total_tests / mean(total_tests, na.rm = TRUE)) %>% round(2)
    ) 

Select, create, and rename variables

  • mutate() creates variables
  • Create multiple variables in one go
  • Use functions you already learned within mutate()
    • min(), max(), mean()
    • as.character(), as.factor(), as.numeric()
  • Use ifelse() within mutate()
x <- c(1, 2, 3)
y <- c(4, 5, 1)

ifelse(x > y, 'yes', 'no')

Select, create, and rename variables

  • mutate() creates variables
  • Create multiple variables in one go
  • Use functions you already learned within mutate()
    • min(), max(), mean()
    • as.character(), as.factor(), as.numeric()
  • Use ifelse() within mutate()
x <- c(1, 2, 3)
y <- c(4, 5, 1)

ifelse(x > y, 'yes', 'no')
## [1] "no"  "no"  "yes"

Select, create, and rename variables

  • mutate() creates variables
  • Create multiple variables in one go
  • Use functions you already learned within mutate()
    • min(), max(), mean()
    • as.character(), as.factor(), as.numeric()
  • Use ifelse() within mutate()
covid %>% 
  mutate(
    cases_deaths_ratio = (deaths / confirmed_infections) %>% round(2)
    , tests_over_mean = (total_tests / mean(total_tests, na.rm = TRUE)) %>% round(2)
    , global_infections = ifelse(location %in% 'Global', confirmed_infections, NA) 
    , global_mobility = ifelse(location %in% 'Global', mobility_composite, NA) 
    ) 

Select, create, and rename variables

covid %>% 
  mutate(
    cases_deaths_ratio = (deaths / confirmed_infections) %>% round(2)
    , tests_over_mean = (total_tests / mean(total_tests, na.rm = TRUE)) %>% round(2)
    , global_infections = ifelse(location %in% 'Global', confirmed_infections, NA) 
    , global_mobility = ifelse(location %in% 'Global', mobility_composite, NA) 
    ) %>%
  select(location, date, cases_deaths_ratio, tests_over_mean, global_infections, global_mobility) %>%
  head(5) 
## # A tibble: 5 x 6
##   location date       cases_deaths_ra… tests_over_mean global_infectio…
##   <chr>    <date>                <dbl>           <dbl>            <dbl>
## 1 Global   2020-02-04             0.02            1.54            4966 
## 2 Global   2020-02-05             0.02            1.68            4749 
## 3 Global   2020-02-06             0.03            1.82            3919 
## 4 Global   2020-02-07             0.03            1.97            4495.
## 5 Global   2020-02-08             0.04            2.11            3403.
## # … with 1 more variable: global_mobility <dbl>

Select, create, and rename variables

Exercise - 5 minutes
- With the crime dataset, create a variable called larceny
- In larceny, flag with a 1 all records that have a ‘LARCENY-THEFT’ value in offense_parent_group. All other records should have a 0
- Set neighborhood to NA when the new variable larceny is equal to 0
- Only report the following varilables in the output: neighborhood, offense_parent_group, larceny

crime %>% 
  mutate(
    larceny = ifelse(offense_parent_group %in% 'LARCENY-THEFT', 1, 0)
    , neighborhood = ifelse(larceny %in% 1, neighborhood, NA)
    ) %>%
  select(offense_parent_group, larceny, neighborhood) %>%
  tail(8)

Select, create, and rename variables

Exercise - 5 minutes
- With the crime dataset, create a variable called larceny
- In larceny, flag with a 1 all records that have a ‘LARCENY-THEFT’ value in offense_parent_group. All other records should have a 0
- Set neighborhood to NA when the new variable larceny is equal to 0
- Only report the following varilables in the output: neighborhood, offense_parent_group, larceny

crime %>% 
  mutate(
    larceny = ifelse(offense_parent_group %in% 'LARCENY-THEFT', 1, 0)
    , neighborhood = ifelse(larceny %in% 1, neighborhood, NA)
    ) %>%
  select(offense_parent_group, larceny, neighborhood) %>%
  tail(8)
## # A tibble: 8 x 3
##   offense_parent_group larceny neighborhood     
##   <chr>                  <dbl> <chr>            
## 1 LARCENY-THEFT              1 CAPITOL HILL     
## 2 ASSAULT OFFENSES           0 <NA>             
## 3 ASSAULT OFFENSES           0 <NA>             
## 4 LARCENY-THEFT              1 ROOSEVELT/RAVENNA
## 5 ASSAULT OFFENSES           0 <NA>             
## 6 MOTOR VEHICLE THEFT        0 <NA>             
## 7 LARCENY-THEFT              1 MOUNT BAKER      
## 8 ASSAULT OFFENSES           0 <NA>

Select, create, and rename variables

  • transmute() = mutate() + select()
covid %>% 
  mutate(
    cases_deaths_ratio = (deaths / confirmed_infections) %>% round(2)
    , tests_over_mean = (total_tests / mean(total_tests, na.rm = TRUE)) %>% round(2)
    , global_infections = ifelse(location %in% 'Global', confirmed_infections, NA) 
    , global_mobility = ifelse(location %in% 'Global', mobility_composite, NA) 
    ) %>%
  select(cases_deaths_ratio, tests_over_mean, global_infections, global_mobility)
## # A tibble: 103,366 x 4
##    cases_deaths_ratio tests_over_mean global_infections global_mobility
##                 <dbl>           <dbl>             <dbl>           <dbl>
##  1               0.02            1.54             4966             NA  
##  2               0.02            1.68             4749             NA  
##  3               0.03            1.82             3919             NA  
##  4               0.03            1.97             4495.            NA  
##  5               0.04            2.11             3403.           -23.7
##  6               0.04            2.25             3954            -23.9
##  7               0.05            2.39             3294.           -23.9
##  8               0.06            2.54             2591.           -23.4
##  9               0.01            2.68              134            -23.5
## 10               0.09            2.83             1728            -23.2
## # … with 103,356 more rows

Select, create, and rename variables

  • transmute() = mutate() + select()
covid %>% 
  transmute(
    cases_deaths_ratio = (deaths / confirmed_infections) %>% round(2)
    , tests_over_mean = (total_tests / mean(total_tests, na.rm = TRUE)) %>% round(2)
    , global_infections = ifelse(location %in% 'Global', confirmed_infections, NA) 
    , global_mobility = ifelse(location %in% 'Global', mobility_composite, NA) 
    )
## # A tibble: 103,366 x 4
##    cases_deaths_ratio tests_over_mean global_infections global_mobility
##                 <dbl>           <dbl>             <dbl>           <dbl>
##  1               0.02            1.54             4966             NA  
##  2               0.02            1.68             4749             NA  
##  3               0.03            1.82             3919             NA  
##  4               0.03            1.97             4495.            NA  
##  5               0.04            2.11             3403.           -23.7
##  6               0.04            2.25             3954            -23.9
##  7               0.05            2.39             3294.           -23.9
##  8               0.06            2.54             2591.           -23.4
##  9               0.01            2.68              134            -23.5
## 10               0.09            2.83             1728            -23.2
## # … with 103,356 more rows

Select, create, and rename variables

  • Keep unmanipulated variables too if you would like
covid %>% 
  transmute(
    location
    , date
    , cases_deaths_ratio = (deaths / confirmed_infections) %>% round(2)
    , tests_over_mean = (total_tests / mean(total_tests, na.rm = TRUE)) %>% round(2)
    , global_infections = ifelse(location %in% 'Global', confirmed_infections, NA) 
    , global_mobility = ifelse(location %in% 'Global', mobility_composite, NA) 
    )
## # A tibble: 103,366 x 6
##    location date       cases_deaths_ra… tests_over_mean global_infectio…
##    <chr>    <date>                <dbl>           <dbl>            <dbl>
##  1 Global   2020-02-04             0.02            1.54            4966 
##  2 Global   2020-02-05             0.02            1.68            4749 
##  3 Global   2020-02-06             0.03            1.82            3919 
##  4 Global   2020-02-07             0.03            1.97            4495.
##  5 Global   2020-02-08             0.04            2.11            3403.
##  6 Global   2020-02-09             0.04            2.25            3954 
##  7 Global   2020-02-10             0.05            2.39            3294.
##  8 Global   2020-02-11             0.06            2.54            2591.
##  9 Global   2020-02-12             0.01            2.68             134 
## 10 Global   2020-02-13             0.09            2.83            1728 
## # … with 103,356 more rows, and 1 more variable: global_mobility <dbl>

Select, create, and rename variables

  • rename() assigns a new name to an existing variable
  • Renaming is possible with transmute() and mutate()as well
    • rename(): renames selected variables, keeps all other variables
    • transmute(): renames selected variables, drops all other variables
    • mutate(): creates new variables (with new names) and keeps originals

Exercise - 3 minutes
- Rename confirmed_infections as cases, location as geography, and date as ymd with one of the methods

Select, create, and rename variables

  • rename() assigns a new name to an existing variable
  • Renaming is possible with transmute() and mutate()as well
    • rename(): renames selected variables, keeps all other variables
    • transmute(): renames selected variables, drops all other variables
    • mutate(): creates new variables (with new names) and keeps originals
covid %>% 
  rename(
    geography = location
    , ymd = date
    , cases = confirmed_infections
    ) %>%
  head(1)
## # A tibble: 1 x 14
##   geography ymd        deaths mobility_data_t… mobility_compos… total_tests_dat…
##   <chr>     <date>      <dbl> <chr>                       <dbl> <chr>           
## 1 Global    2020-02-04    103 <NA>                           NA observed        
## # … with 8 more variables: total_tests <dbl>, cases <dbl>,
## #   est_infections <dbl>, population <dbl>, deaths_p100k <dbl>,
## #   confirmed_infections_p100k <dbl>, est_infections_p100k <dbl>, month <dbl>

Select, create, and rename variables

  • rename() assigns a new name to an existing variable
  • Renaming is possible with transmute() and mutate()as well
    • rename(): renames selected variables, keeps all other variables
    • transmute(): renames selected variables, drops all other variables
    • mutate(): creates new variables (with new names) and keeps originals
covid %>% 
  transmute(
    geography = location
    , ymd = date
    , cases = confirmed_infections
    ) %>%
  head(1)
## # A tibble: 1 x 3
##   geography ymd        cases
##   <chr>     <date>     <dbl>
## 1 Global    2020-02-04  4966

Select, create, and rename variables

  • rename() assigns a new name to an existing variable
  • Renaming is possible with transmute() and mutate()as well
    • rename(): renames selected variables, keeps all other variables
    • transmute(): renames selected variables, drops all other variables
    • mutate(): creates new variables (with new names) and keeps originals
covid %>% 
  mutate(
    geography = location
    , ymd = date
    , cases = confirmed_infections
    ) %>%
  select(location, geography, ymd, date, cases, confirmed_infections) %>%
  head(1)
## # A tibble: 1 x 6
##   location geography ymd        date       cases confirmed_infections
##   <chr>    <chr>     <date>     <date>     <dbl>                <dbl>
## 1 Global   Global    2020-02-04 2020-02-04  4966                 4966

Exercise - 5 minutes

  • With crime, create a data frame that meets the following conditions

    • The data frame includes three variables: report_number, crime_against_category, offense_location
    • Make no changes to report_number
    • In crime_against_category, change NA values to UNKNOWN
    • Rename neighborhood as offense_location
crime %>% 
  transmute(
    report_number
    , crime_against_category = ifelse(crime_against_category %in% NA
                                , 'UNKNOWN', crime_against_category)
    , offense_location = neighborhood
    ) 

Exercise - 5 minutes

crime %>% 
  transmute(
    report_number
    , crime_against_category = ifelse(crime_against_category %in% NA
                                , 'UNKNOWN', crime_against_category)
    , offense_location = neighborhood
    ) %>%
  head()
## # A tibble: 6 x 3
##   report_number crime_against_category offense_location        
##   <chr>         <chr>                  <chr>                   
## 1 2015-323587   PROPERTY               ROOSEVELT/RAVENNA       
## 2 2015-267992   PROPERTY               FIRST HILL              
## 3 2013-900207   PROPERTY               <NA>                    
## 4 2008-334626   PROPERTY               ROOSEVELT/RAVENNA       
## 5 2014-363634   PROPERTY               CENTRAL AREA/SQUIRE PARK
## 6 2010-019552   UNKNOWN                GREENWOOD

Filter data frames

Filter data frames

  • Use filter() to subset your data
  • With filter(), you tell R to exclude/include data that meet certain conditions
  • Operators for conditions
    • !, %in%, >, >=, <, <=, ==
covid %>% 
  filter(est_infections_p100k > 60) %>% 
  select(location, date, est_infections_p100k) %>% 
  head(2)
## # A tibble: 2 x 3
##   location date       est_infections_p100k
##   <chr>    <date>                    <dbl>
## 1 Global   2020-12-15                 61.2
## 2 Global   2020-12-16                 62.4

Filter data frames

  • Use filter() to subset your data
  • With filter(), you tell R to exclude/include data that meet certain conditions
  • Operators for conditions
    • !, %in%, >, >=, <, <=, ==
covid %>% 
  filter(est_infections_p100k <= 60) %>% 
  select(location, date, est_infections_p100k) %>% 
  head(2)
## # A tibble: 2 x 3
##   location date       est_infections_p100k
##   <chr>    <date>                    <dbl>
## 1 Global   2020-08-13                 21.7
## 2 Global   2020-08-14                 21.6

Filter data frames

  • Use filter() to subset your data
  • With filter(), you tell R to exclude/include data that meet certain conditions
  • Operators for conditions
    • !, %in%, >, >=, <, <=, ==
covid %>% 
  filter(location %in% 'Peru') %>% 
  select(location, date, est_infections_p100k) %>% 
  tail(2)
## # A tibble: 2 x 3
##   location date       est_infections_p100k
##   <chr>    <date>                    <dbl>
## 1 Peru     2020-12-31                 2.67
## 2 Peru     2021-01-01                 2.63

Filter data frames

  • Use filter() to subset your data
  • With filter(), you tell R to exclude/include data that meet certain conditions
  • Operators for conditions
    • !, %in%, >, >=, <, <=, ==
covid %>% 
  filter(! location %in% 'Peru') %>% 
  select(location, date, est_infections_p100k) %>% 
  tail(2)
## # A tibble: 2 x 3
##   location date       est_infections_p100k
##   <chr>    <date>                    <dbl>
## 1 Zimbabwe 2020-12-31                0.581
## 2 Zimbabwe 2021-01-01                0.579

Filter data frames

  • You can stipulate multiple conditions with filter()
    • Use & for ‘and’ expressions
    • Use | for ‘or’ expressions
    • If there are many conditions, wrap conditions in () or use more than one filter()
    • If you want to filter on multiple vaues, use c()
covid %>% 
  filter(est_infections_p100k > 60 & location %in% 'Malta') %>% 
  select(location, est_infections_p100k, date) %>% 
  head(2)
## # A tibble: 2 x 3
##   location est_infections_p100k date      
##   <chr>                   <dbl> <date>    
## 1 Malta                    85.6 2020-09-29
## 2 Malta                    92.3 2020-09-30

Filter data frames

  • You can stipulate multiple conditions with filter()
    • Use & for ‘and’ expressions
    • Use | for ‘or’ expressions
    • If there are many conditions, wrap conditions in () or use more than one filter()
    • If you want to filter on multiple vaues, use c()
covid %>% 
  filter(est_infections_p100k > 434 | location %in% 'Malta') %>% 
  select(location, est_infections_p100k, date) %>% 
  head(2)
## # A tibble: 2 x 3
##   location    est_infections_p100k date      
##   <chr>                      <dbl> <date>    
## 1 Maharashtra                 435. 2020-11-30
## 2 Malta                        NA  2020-02-04

Filter data frames

  • You can stipulate multiple conditions with filter()
    • Use & for ‘and’ expressions
    • Use | for ‘or’ expressions
    • If there are many conditions, wrap conditions in () or use more than one filter()
    • If you want to filter on multiple vaues, use c()
covid %>% 
  filter(
    (est_infections_p100k > 60 & location %in% 'Malta') | 
    (est_infections_p100k <= 2.5 & location %in% 'Italy')
    ) %>% 
  select(location, est_infections_p100k, date) %>% 
  head(5)
## # A tibble: 5 x 3
##   location est_infections_p100k date      
##   <chr>                   <dbl> <date>    
## 1 Italy                    2.38 2020-09-29
## 2 Italy                    2.43 2020-09-30
## 3 Italy                    2.46 2020-10-01
## 4 Italy                    2.50 2020-10-02
## 5 Malta                   85.6  2020-09-29

Filter data frames

  • You can stipulate multiple conditions with filter()
    • Use & for ‘and’ expressions
    • Use | for ‘or’ expressions
    • If there are many conditions, wrap conditions in () or use more than one filter()
    • If you want to filter on multiple vaues, use c()
covid %>% 
  filter(location %in% c('Peru', 'Colombia') & est_infections_p100k > 50) %>% 
  select(location, date, est_infections_p100k) %>% 
  head(5)
## # A tibble: 5 x 3
##   location date       est_infections_p100k
##   <chr>    <date>                    <dbl>
## 1 Colombia 2020-09-29                 101.
## 2 Colombia 2020-09-30                 101.
## 3 Colombia 2020-10-01                 101.
## 4 Colombia 2020-10-02                 100.
## 5 Colombia 2020-10-03                 100.

Filter data frames

  • Other functions for subsetting your data
    • head()
    • tail()
    • sample_n()
covid %>% sample_n(3) %>% select(location, date, est_infections_p100k)
## # A tibble: 3 x 3
##   location date       est_infections_p100k
##   <chr>    <date>                    <dbl>
## 1 Romania  2020-10-29                 42.4
## 2 Oklahoma 2020-05-29                 NA  
## 3 Nagaland 2020-12-22                179.
covid %>% sample_n(3) %>% select(location, date, est_infections_p100k)
## # A tibble: 3 x 3
##   location     date       est_infections_p100k
##   <chr>        <date>                    <dbl>
## 1 Rhode Island 2020-06-29                NA   
## 2 Belgium      2020-02-25                NA   
## 3 Bulgaria     2020-11-13                 9.71

Exercise - 10 minutes

  • Produce a dataset that meets these conditions
    • Contains total_tests_data_type values not equal to NA
    • Contains location values equal to 'China' or 'Germany'
    • Contains no dates prior to July 4, 2020
    • Contains confirmed_infections greater than 130
    • Contains a variable called high_cases. This variable should flag with a 1 observations that have confirmed_infections greater than 400
    • Contains 5 variables in total in the following order: location, high_cases, date, confirmed_infections, total_tests_data_type
  • Lastly, determine how many records meet the above filtering conditions

Hint

covid %>% 
  filter(date >= '2020-07-04')

Exercise - 10 minutes

covid %>% 
  filter(
    ! total_tests_data_type %in% NA & 
    location %in% c('China', 'Germany') & 
    confirmed_infections > 130 & 
    date >= '2020-07-04'
    ) %>%  
  transmute(
    location
    , high_cases = ifelse(confirmed_infections > 400, 1, NA)
    , date
    , confirmed_infections
    , total_tests_data_type
  ) 
## # A tibble: 58 x 5
##    location high_cases date       confirmed_infections total_tests_data_type
##    <chr>         <dbl> <date>                    <dbl> <chr>                
##  1 China            NA 2020-07-22                 136  observed             
##  2 China            NA 2020-07-25                 137  observed             
##  3 China            NA 2020-07-27                 149  observed             
##  4 China            NA 2020-07-30                 152  observed             
##  5 China            NA 2020-08-01                 133. observed             
##  6 Germany           1 2020-07-04                 422. observed             
##  7 Germany          NA 2020-07-05                 239. observed             
##  8 Germany          NA 2020-07-06                 219. observed             
##  9 Germany          NA 2020-07-07                 390. observed             
## 10 Germany          NA 2020-07-08                 397. observed             
## # … with 48 more rows

Exercise - 10 minutes

covid %>% 
  filter(
    ! total_tests_data_type %in% NA & 
    location %in% c('China', 'Germany') & 
    confirmed_infections > 130 & 
    date >= '2020-07-04'
    ) %>%
  nrow()
## [1] 58

Aggregate data with group_by() and summarise()

What you can do with group_by() and summarise()

  • Count, sum, average, and identify max and min values
  • Example questions you could answer
    • In covid, which month reports the most total_tests?
    • In covid, which location reports the largest number of estimated infections in September, excluding 'Global' values?
    • In crime, of all 'ASSAULT OFFENSES' values in offense_parent_group which offense value has the fewest incidences?
    • In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?

How you use group_by() and summarise() to aggregate data

  • In group_by(), list the variables by which you want to aggregate data
  • In summarise(), create a variable and define the variable with an aggregation function
  • Use %>% to ‘link’ group_by() and summarise()
  • Aggregation functions
    • n(), n_distinct(), sum(), mean(), max(), min(), etc.
  • Perform multiple aggregations in a single summarise() function

Example
In covid, which month reports the most total_tests?

covid %>% 
  group_by(month) %>% 
  summarise(total_tests = sum(total_tests, na.rm = TRUE))

How you use group_by() and summarise() to aggregate data

Example
In covid, which month reports the most total_tests?

covid %>% 
  group_by(month) %>% 
  summarise(total_tests = sum(total_tests, na.rm = TRUE)) %>%
  arrange(desc(total_tests))
## # A tibble: 12 x 2
##    month total_tests
##    <dbl>       <dbl>
##  1     8    1719495.
##  2     7    1689056.
##  3     5    1210343.
##  4     6    1182845.
##  5     9    1152910.
##  6     4     958562.
##  7     3     312733.
##  8     2      44480.
##  9     1          0 
## 10    10          0 
## 11    11          0 
## 12    12          0

How you use group_by() and summarise() to aggregate data

Example
In covid, which location reports the largest number of estimated infections in September, excluding 'Global' values?

covid %>% 
  group_by(month, location) %>% 
  summarise(est_infections = sum(est_infections, na.rm = TRUE)) %>%
  filter(month %in% 9 & ! location %in% 'Global') %>%
  arrange(desc(est_infections)) %>% 
  head()
## # A tibble: 6 x 3
## # Groups:   month [1]
##   month location                    est_infections
##   <dbl> <chr>                                <dbl>
## 1     9 Latin America and Caribbean       7757636.
## 2     9 Europe and Central Asia           3550230.
## 3     9 South Asia                        2481264.
## 4     9 India                             2330349.
## 5     9 East Asia and Pacific             1596947.
## 6     9 Maharashtra                        812265.

How you use group_by() and summarise() to aggregate data

How can you rewrite the code below and get the same output?

In covid, which location reports the largest number of estimated infections in September, excluding 'Global' values?

covid %>% 
  group_by(month, location) %>% 
  summarise(est_infections = sum(est_infections, na.rm = TRUE)) %>%
  filter(month %in% 9 & ! location %in% 'Global') %>%
  arrange(desc(est_infections)) %>% 
  head()

How you use group_by() and summarise() to aggregate data

How can you rewrite the code below and get the same output?

In covid, which location reports the largest number of estimated infections in September, excluding 'Global' values?

covid %>% 
  filter(month %in% 9 & ! location %in% 'Global') %>%
  group_by(location) %>% 
  summarise(est_infections = sum(est_infections, na.rm = TRUE)) %>%
  arrange(desc(est_infections)) %>% 
  head()

How you use group_by() and summarise() to aggregate data

Exercise #1 - 5 minutes
In crime, of all 'ASSAULT OFFENSES' values in offense_parent_group which offense value has the fewest incidences? (Exclude NAs from the output you produce.)

Exercise #2 - 7 minutes
In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?

Exercise #1 - 5 minutes

In crime, of all 'ASSAULT OFFENSES' values in offense_parent_group which offense value has the fewest incidences? (Exclude NAs from the output you produce.)

crime %>% 
  group_by(offense_parent_group, offense) %>% 

Exercise #1 - 5 minutes

In crime, of all 'ASSAULT OFFENSES' values in offense_parent_group which offense value has the fewest incidences? (Exclude NAs from the output you produce.)

crime %>% 
  group_by(offense_parent_group, offense) %>% 
  summarise(incidence = n())

Exercise #1 - 5 minutes

In crime, of all 'ASSAULT OFFENSES' values in offense_parent_group which offense value has the fewest incidences? (Exclude NAs from the output you produce.)

crime %>% 
  group_by(offense_parent_group, offense) %>% 
  summarise(incidence = n()) %>%
  filter(offense_parent_group %in% 'ASSAULT OFFENSES' & ! offense %in% NA)
## # A tibble: 3 x 3
## # Groups:   offense_parent_group [1]
##   offense_parent_group offense            incidence
##   <chr>                <chr>                  <int>
## 1 ASSAULT OFFENSES     Aggravated Assault      3754
## 2 ASSAULT OFFENSES     Intimidation            3963
## 3 ASSAULT OFFENSES     Simple Assault          9527

Exercise #2 - 7 minutes

In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?

Hint You need to calculate the difference between the two year variables.

Exercise #2 - 7 minutes

In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?

Hint You need to calculate the difference between the two year variables.

crime %>% 
  mutate(dif = reported_year - occurred_year)

Exercise #2 - 7 minutes

In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?

Hint You need to calculate the difference between the two year variables.

crime %>% 
  mutate(dif = reported_year - occurred_year) %>% 
  filter(dif %in% c(1, 2)) 

Exercise #2 - 7 minutes

In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?

Hint You need to calculate the difference between the two year variables.

crime %>% 
  mutate(dif = reported_year - occurred_year) %>% 
  filter(dif %in% c(1, 2)) %>% 
  group_by(dif) 

Exercise #2 - 7 minutes

In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?

Hint You need to calculate the difference between the two year variables.

crime %>% 
  mutate(dif = reported_year - occurred_year) %>% 
  filter(dif %in% c(1, 2)) %>% 
  group_by(dif) %>% 
  summarise(n = n())
## # A tibble: 2 x 2
##     dif     n
##   <dbl> <int>
## 1     1  1808
## 2     2   181

Exercise - 15 minutes

  • From the crime dataset, report the mean difference in reported_date and occurred_date values by neighborhood.
    • The variable that tells us the difference between reported_date and occurred_date should be called date_dif
    • In your output of mean values by neighborhood only report rows where the number of neighborhood values in the dataset is greater than 1000 (n)
    • You’ll use transmute and filter as well as group_by and summarise to solve this problem
    • The dataframe should have three variables: neighborhood, date_dif, n
    • Lastly, export the data to a .csv file
 [...] %>%  
  write_csv('tidyverse_exercise_output.csv')  
  • Questions
    • What is the date_dif value for ‘HIGH POINTneighborhood?
    • What is the number of observations used to calculate the mean difference between reported_date and occurred_date for ‘BALLARD NORTHneighborhood?

Exercise - 15 minutes

  • From the crime dataset, report the mean difference in reported_date and occurred_date values by neighborhood.
    • The variable that tells us the difference between reported_date and occurred_date should be called date_dif
    • In your output of mean values by neighborhood only report rows where the number of neighborhood values in the dataset is greater than 1000 (n)
    • You’ll use transmute and filter as well as group_by and summarise to solve this problem
    • The dataframe should have three variables: neighborhood, date_dif, n
    • Lastly, export the data to a .csv file
crime %>% 
  transmute(
    neighborhood
    , date_dif = reported_date - occurred_date
  ) %>% 

Exercise - 15 minutes

  • From the crime dataset, report the mean difference in reported_date and occurred_date values by neighborhood.
    • The variable that tells us the difference between reported_date and occurred_date should be called date_dif
    • In your output of mean values by neighborhood only report rows where the number of neighborhood values in the dataset is greater than 1000 (n)
    • You’ll use transmute and filter as well as group_by and summarise to solve this problem
    • The dataframe should have three variables: neighborhood, date_dif, n
    • Lastly, export the data to a .csv file
crime %>% 
  transmute(
    neighborhood
    , date_dif = reported_date - occurred_date
  ) %>% 
  group_by(neighborhood) %>% 
  summarise(
    date_dif = mean(date_dif, na.rm = TRUE)
    , n = n()
  ) %>% 

Exercise - 15 minutes

  • From the crime dataset, report the mean difference in reported_year and occurred_year values by neighborhood.
    • The variable that tells us the difference between reported_date and occurred_date should be called date_dif
    • In your output of mean values by neighborhood only report rows where the number of neighborhood values in the dataset is greater than 1000 (n)
    • You’ll use transmute and filter as well as group_by and summarise to solve this problem
    • The dataframe should have three variables: neighborhood, date_dif, n
    • Lastly, export the data to a .csv file
crime %>% 
  transmute(
    neighborhood
    , date_dif = reported_date - occurred_date
  ) %>% 
  group_by(neighborhood) %>% 
  summarise(
    date_dif = mean(date_dif, na.rm = TRUE)
    , n = n()
  ) %>% 
  filter(n > 1000) %>% 
  write_csv('tidyverse_exercise_output.csv')

Exercise - 15 minutes

  • Questions
    • What is the mean difference in years between reported_date and occurred_date for ‘HIGH POINTneighborhood?
    • What is the number of observations used to calculate the mean difference between reported_date and occurred_date for ‘BALLARD NORTHneighborhood?

Exercise - 15 minutes

  • Questions
    • What is the mean difference in years between reported_date and occurred_date for ‘HIGH POINTneighborhood?
    • What is the number of observations used to calculate the mean difference between reported_date and occurred_date for ‘BALLARD NORTHneighborhood?
crime %>% 
  transmute(
    neighborhood
    , date_dif = reported_date - occurred_date
  ) %>% 
  group_by(neighborhood) %>% 
  summarise(
    date_dif = mean(date_dif, na.rm = TRUE)
    , n = n()
  ) %>% 
  filter(neighborhood %in% c('HIGH POINT', 'BALLARD NORTH')) 
## # A tibble: 2 x 3
##   neighborhood  date_dif          n
##   <chr>         <drtn>        <int>
## 1 BALLARD NORTH 13.35251 days  2746
## 2 HIGH POINT    16.88184 days  1151

Exercise - 10 minutes

  • From the covid dataset, determine the location-month combination with the highest number of estimated infections per capita. Exclude dates prior to July and 'Global' values in location. To do this:
    • Create a new variable called infections_per_capita that reports est_infections over population
    • Group by and aggregate the relevant variables
    • Filter your data to only include the observation with the largest number of infections_per_capita
covid %>% 
  filter(
    month >= 7 & 
    ! location %in% 'Global'
  ) 

Exercise - 10 minutes

  • From the covid dataset, determine the location-month combination with the highest number of estimated infections per capita. Exclude dates prior to July and 'Global' values in location. To do this:
    • Create a new variable called infections_per_capita that reports est_infections over population
    • Group by and aggregate the relevant variables
    • Filter your data to only include the observation with the largest number of infections_per_capita
covid %>% 
  filter(
    month >= 7 & 
    ! location %in% 'Global'
  ) %>% 
  mutate(infections_per_capita = est_infections / population) 

Exercise - 10 minutes

  • From the covid dataset, determine the location-month combination with the highest number of estimated infections per capita. Exclude dates prior to July and 'Global' values in location. To do this:
    • Create a new variable called infections_per_capita that reports est_infections over population
    • Group by and aggregate the relevant variables
    • Filter your data to only include the observation with the largest number of infections_per_capita
covid %>% 
  filter(
    month >= 7 & 
    ! location %in% 'Global'
  ) %>% 
  mutate(infections_per_capita = est_infections / population) %>% 
  group_by(location, month) %>% 
  summarise(infections_per_capita = sum(infections_per_capita, na.rm = TRUE)) 

Exercise - 10 minutes

covid %>% 
  filter(
    month >= 7 & 
    ! location %in% 'Global'
  ) %>% 
  mutate(infections_per_capita = est_infections / population) %>% 
  group_by(location, month) %>% 
  summarise(infections_per_capita = sum(infections_per_capita, na.rm = TRUE)) %>% 
  ungroup %>%
  filter(infections_per_capita %in% max(infections_per_capita, na.rm = TRUE))
## # A tibble: 1 x 3
##   location month infections_per_capita
##   <chr>    <dbl>                 <dbl>
## 1 Manipur     12                 0.167