[PRISM] dplyr:: summarize() and group_by()

dplyr’s five verbs (cont.)

summarize()

Last week, we covered filter() and select(). Today, let’s cover summarize() and group_by(). summarize() does something qualitatively different from the other dplyr verbs we’ve learned so far, in the sense that it generates something new. Often, this will be some sort of statistic that you derive from the data you have. It is therefore natural to give the result of a summarize() a name, so that it appears as a new variable. We will start with our play data.

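library(dplyr)    # dplyr verbs and the %>% pipe
library(magrittr) # equals(), not(), divide_by(), multiply_by() used below
# Our data on political advertisements
dt1 <- readRDS("/Users/christyoh/Dropbox/PRISM_2024/Antonio/example_dt.RDS")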

Let’s say that we want to see the total amount spent on advertising. This is a statistic created by summing the spending on every advertisement. Just to cement the intuition: you start with many data points, and with summarize() you condense that information into a single summary.

# What is the total amount spent? 

dt1 %>% 
  summarize(
    tot_spend = est_spending %>% 
      sum
  )

## # A tibble: 1 × 1
##   tot_spend
##       <dbl>
## 1 968244840

# How many unique affiliations were there?

dt1 %>% 
  summarize(
    affiliation %>% 
      unique %>% 
      length
  )

## # A tibble: 1 × 1
##   `affiliation %>% unique %>% length`
##                                 <int>
## 1                                   3
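Note that the output column above takes the whole expression as its name, because the result was not given a name inside summarize(). The sketch below, on made-up data (the `toy` tibble and its values are invented for illustration and are not taken from dt1), names the result so the column is easier to work with. It also uses dplyr’s n_distinct(), which should give the same count as unique followed by length.

```r
library(dplyr)

# Toy data invented for illustration (not the real ad data)
toy <- tibble(
  affiliation  = c("DEMOCRAT", "REPUBLICAN", "DEMOCRAT", "OTHER"),
  est_spending = c(100, 250, 50, 10)
)

res <- toy %>%
  summarize(
    # Naming the result gives a clean column name
    n_affil = affiliation %>%
      unique %>%
      length,
    # n_distinct() counts unique values in one step
    n_affil2 = n_distinct(affiliation)
  )

res
```

Both columns hold the same number here, so which form to use is mostly a matter of taste.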

summarize() and group_by() make a great pair

A logical next step for a political scientist is to compute the same kind of summary for specific groups in the data. For example, you may wonder how much was spent by each affiliation. group_by() makes summarize() carry out the statistical task you define separately for each group, where the groups are defined by the variable you pass to group_by(). In other words, you will get one statistic for each group.

# We have DEMOCRAT, OTHER, REPUBLICAN
dt1$affiliation %>% table

## .
##   DEMOCRAT      OTHER REPUBLICAN 
##     751596        243     478918

# You can use group_by() to look at total spending by affiliation
dt1 %>% 
  group_by(affiliation) %>% 
  summarize(
    tot_spend = est_spending %>% 
      sum
  )

## # A tibble: 3 × 2
##   affiliation tot_spend
##   <chr>           <dbl>
## 1 DEMOCRAT    558857940
## 2 OTHER           87570
## 3 REPUBLICAN  409299330

# You can derive more than one statistic inside summarize()
dt1 %>% 
  group_by(affiliation) %>% 
  summarize(
    tot_spend = est_spending %>% 
      sum,
    ads = n() 
  )

## # A tibble: 3 × 3
##   affiliation tot_spend    ads
##   <chr>           <dbl>  <int>
## 1 DEMOCRAT    558857940 751596
## 2 OTHER           87570    243
## 3 REPUBLICAN  409299330 478918

Exercise 1

Building on the last code chunk, add an extra statistic that shows the average amount spent per ad, by affiliation.

# Exercise 1

You can define groups by more than one variable in group_by(). For example, we can see at which time of day each political party ran its ads.

dt1 %>% 
  group_by(affiliation, daypart) %>% 
  summarize(
    tot_spend = est_spending %>% 
      sum, 
    ads = n()
  ) %>% 
  filter(
    affiliation %>% 
      equals("OTHER") %>% 
      not
  )

## `summarise()` has grouped output by 'affiliation'. You can override using the
## `.groups` argument.

## # A tibble: 16 × 4
## # Groups:   affiliation [2]
##    affiliation daypart       tot_spend    ads
##    <chr>       <chr>             <dbl>  <int>
##  1 DEMOCRAT    DAYTIME        90442270 197746
##  2 DEMOCRAT    EARLY FRINGE   74519960 104256
##  3 DEMOCRAT    EARLY MORNING  98470600 173379
##  4 DEMOCRAT    EARLY NEWS     45078880  47891
##  5 DEMOCRAT    LATE FRINGE    52588070  87493
##  6 DEMOCRAT    LATE NEWS      40284700  42283
##  7 DEMOCRAT    PRIME ACCESS   55331660  57216
##  8 DEMOCRAT    PRIME TIME    102141800  41332
##  9 REPUBLICAN  DAYTIME        51512260  95777
## 10 REPUBLICAN  EARLY FRINGE   55410960  64729
## 11 REPUBLICAN  EARLY MORNING  66368550 117236
## 12 REPUBLICAN  EARLY NEWS     30599170  32693
## 13 REPUBLICAN  LATE FRINGE    34513880  51875
## 14 REPUBLICAN  LATE NEWS      34978000  32920
## 15 REPUBLICAN  PRIME ACCESS   41314930  42547
## 16 REPUBLICAN  PRIME TIME     94601580  41141
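The `summarise()` message above is informational rather than an error: after summarizing over two grouping variables, the result is still grouped by the first one. As a sketch on made-up data (the `toy` tibble is hypothetical, not the real ad data), the `.groups` argument of summarize() lets you state that choice explicitly, which also silences the message.

```r
library(dplyr)

# Toy data invented for illustration
toy <- tibble(
  affiliation  = c("DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "REPUBLICAN"),
  daypart      = c("DAYTIME", "DAYTIME", "PRIME TIME", "DAYTIME", "PRIME TIME"),
  est_spending = c(100, 200, 500, 150, 400)
)

# The default behaviour, stated explicitly: drop only the last grouping variable
kept <- toy %>%
  group_by(affiliation, daypart) %>%
  summarize(tot_spend = sum(est_spending), .groups = "drop_last")

# Alternative: return a fully ungrouped tibble
dropped <- toy %>%
  group_by(affiliation, daypart) %>%
  summarize(tot_spend = sum(est_spending), .groups = "drop")

group_vars(kept)     # still grouped by affiliation
group_vars(dropped)  # no grouping left
```

Which option is appropriate depends on what you plan to do next: keep the remaining grouping if the next step is a within-party calculation, drop it if you want statistics over the whole table.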

So, how do we envision this being useful for us? We can use these summaries to build towards a data structure that is more useful for our purposes.

dt1 %>% 
  group_by(affiliation, daypart) %>% 
  summarize(
    tot_spend = est_spending %>% 
      sum(),
    ads = n()
  ) %>% 
  filter(
    affiliation %>% 
      equals("OTHER") %>% 
      not
  ) %>% 
  mutate(
    perc_ads = ads %>% 
      divide_by(
        ads %>% 
          sum
      ) %>% 
      multiply_by(100),
    perc_spend = tot_spend %>% 
      divide_by(
        tot_spend %>% 
          sum
      ) %>% 
      multiply_by(100)
  )

## `summarise()` has grouped output by 'affiliation'. You can override using the
## `.groups` argument.

## # A tibble: 16 × 6
## # Groups:   affiliation [2]
##    affiliation daypart       tot_spend    ads perc_ads perc_spend
##    <chr>       <chr>             <dbl>  <int>    <dbl>      <dbl>
##  1 DEMOCRAT    DAYTIME        90442270 197746    26.3       16.2 
##  2 DEMOCRAT    EARLY FRINGE   74519960 104256    13.9       13.3 
##  3 DEMOCRAT    EARLY MORNING  98470600 173379    23.1       17.6 
##  4 DEMOCRAT    EARLY NEWS     45078880  47891     6.37       8.07
##  5 DEMOCRAT    LATE FRINGE    52588070  87493    11.6        9.41
##  6 DEMOCRAT    LATE NEWS      40284700  42283     5.63       7.21
##  7 DEMOCRAT    PRIME ACCESS   55331660  57216     7.61       9.90
##  8 DEMOCRAT    PRIME TIME    102141800  41332     5.50      18.3 
##  9 REPUBLICAN  DAYTIME        51512260  95777    20.0       12.6 
## 10 REPUBLICAN  EARLY FRINGE   55410960  64729    13.5       13.5 
## 11 REPUBLICAN  EARLY MORNING  66368550 117236    24.5       16.2 
## 12 REPUBLICAN  EARLY NEWS     30599170  32693     6.83       7.48
## 13 REPUBLICAN  LATE FRINGE    34513880  51875    10.8        8.43
## 14 REPUBLICAN  LATE NEWS      34978000  32920     6.87       8.55
## 15 REPUBLICAN  PRIME ACCESS   41314930  42547     8.88      10.1 
## 16 REPUBLICAN  PRIME TIME     94601580  41141     8.59      23.1

A few things to note here. Notice that after group_by() with two variables, summarize() peels off the last grouping variable (daypart), so the result is still grouped by affiliation. The reason for this design choice becomes clear when we follow how the code above gives us useful information. Once we have the totals for each party and daypart, the natural next question is a within-party one: for each party, what share of its ads and of its spending fell in each daypart? Because the data are still grouped by affiliation, the sums inside mutate() are taken within each party, so perc_ads and perc_spend each add up to 100 for each party.
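To see why the remaining grouping matters, here is a minimal sketch on made-up data (the `toy` tibble and its values are hypothetical). With the affiliation grouping still in place, sum(ads) inside mutate() is computed within each party. After ungroup(), the same expression divides by the grand total instead.

```r
library(dplyr)

# Toy ad-level data invented for illustration: one row per ad
toy <- tibble(
  affiliation = c(rep("DEMOCRAT", 4), rep("REPUBLICAN", 2)),
  daypart     = c("DAYTIME", "DAYTIME", "DAYTIME", "PRIME TIME",
                  "DAYTIME", "PRIME TIME")
)

# Count ads per party and daypart; the result stays grouped by affiliation
counts <- toy %>%
  group_by(affiliation, daypart) %>%
  summarize(ads = n(), .groups = "drop_last")

# Percentages within each party (each party sums to 100)
by_party <- counts %>%
  mutate(perc_ads = ads / sum(ads) * 100)

# Percentages of all ads (the whole table sums to 100)
overall <- counts %>%
  ungroup() %>%
  mutate(perc_ads = ads / sum(ads) * 100)

by_party
overall
```

Comparing the two perc_ads columns side by side is a quick way to check which denominator your code is actually using.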

Another thing to note is that from the point where you run summarize(), you lose some of your data. You are trading some of the granularity in the data for a condensed understanding of it.
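A small sketch on made-up data illustrates this trade-off (the `toy` tibble and its `station` column are hypothetical). summarize() returns one row per group and keeps only the grouping variable and the new statistic, whereas mutate() keeps every original row and column and adds the group statistic alongside them.

```r
library(dplyr)

# Toy data invented for illustration; `station` is a hypothetical column
toy <- tibble(
  affiliation  = c("DEMOCRAT", "DEMOCRAT", "REPUBLICAN"),
  station      = c("A", "B", "A"),
  est_spending = c(100, 200, 50)
)

# summarize(): one row per affiliation, other columns are gone
condensed <- toy %>%
  group_by(affiliation) %>%
  summarize(tot_spend = sum(est_spending))

# mutate(): all rows and columns kept, group total repeated on each row
kept_rows <- toy %>%
  group_by(affiliation) %>%
  mutate(tot_spend = sum(est_spending)) %>%
  ungroup()

dim(condensed)
dim(kept_rows)
```

If you still need the ad-level rows later, one option is to keep the original object untouched and store the summary under a different name.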

Finally, notice how powerful the magrittr approach is: chaining each step with %>% lets us follow how we are progressing with the analysis of the data.

HW: Penguin exercise

# Load the penguins dataset
library(palmerpenguins)
dt2 <- palmerpenguins::penguins
# Check the data structure
dt2 %>% 
  str

## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

# What types of species are there?
dt2$species %>% table

## .
##    Adelie Chinstrap    Gentoo 
##       152        68       124

# On which islands were they observed?
dt2$island %>% table

## .
##    Biscoe     Dream Torgersen 
##       168       124        52

HW 1

What are the average body mass, average flipper length, and average bill length of the penguins, by species? (Tip: you will have to omit NAs in the dataset first.)
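To illustrate why the tip matters, here is a sketch on made-up data (the `toy` tibble, with its `group` and `value` columns, is hypothetical and is not the penguins data). mean() returns NA as soon as a group contains a missing value, so the NAs have to be handled either inside mean() or by filtering the rows beforehand.

```r
library(dplyr)

# Toy data invented for illustration, with one missing value
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, NA, 3, 5)
)

# Without handling NAs, group "a" gets NA as its average
with_na <- toy %>%
  group_by(group) %>%
  summarize(avg = mean(value))

# Option 1: ask mean() to ignore NAs
without_na <- toy %>%
  group_by(group) %>%
  summarize(avg = mean(value, na.rm = TRUE))

# Option 2: drop the rows with NAs first
filtered <- toy %>%
  filter(!is.na(value)) %>%
  group_by(group) %>%
  summarize(avg = mean(value))
```

The two options agree here. With several variables they can differ, because filtering removes the entire row while na.rm = TRUE only skips the missing value in that one variable.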

HW 2

What was the percentage of female penguins, by island?


Christy Oh

2025-03-04