# Setting up the environment
library(tidyverse)
library(magrittr)
Last week, we covered filter() and select(). Today let’s cover summarize() and group_by(). summarize() does something qualitatively different from the other dplyr verbs we’ve learned so far, in the sense that it generates something new. Often, this will be a statistic that you derive from the data you have. It is therefore natural to assign the result of summarize() to a new variable. We will start with our play data.
# Our data on political advertisements
dt1 <- readRDS("/Users/christyoh/Dropbox/PRISM_2024/Antonio/example_dt.RDS")
Let’s say that we want to see the total amount spent on advertising. This is a statistic created by summing the estimated spending across all advertisements. Just cementing the intuition here: you start with many data points, and with summarize() you condense that information into a single summary.
# What is the total amount spent?
dt1 %>%
  summarize(
    tot_spend = est_spending %>%
      sum
  )
## # A tibble: 1 × 1
## tot_spend
## <dbl>
## 1 968244840
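Since the advertising file lives on a local path, here is a minimal sketch of the same idea on a tiny made-up tibble. The `toy` data and its values are hypothetical and only meant to show that many rows go in and one row comes out.

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data: four ads with made-up spending figures
toy <- tibble(
  affiliation  = c("DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "OTHER"),
  est_spending = c(100, 250, 300, 50)
)

# Four rows are condensed into a single row holding the total
toy_total <- toy %>%
  summarize(
    tot_spend = est_spending %>%
      sum
  )

toy_total
```

The result is still a tibble, just with one row and one named column, so it can be piped into further steps.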
# How many unique affiliations were there?
dt1 %>%
  summarize(
    affiliation %>%
      unique %>%
      length
  )
## # A tibble: 1 × 1
## `affiliation %>% unique %>% length`
## <int>
## 1 3
A logical next step as a political scientist is to compute the same kind of summary for specific groups in the data. For example, you may wonder how much was spent by each affiliation. group_by() makes summarize() carry out the task you define separately for each group, where the groups are defined by the variable you pass to group_by(). In other words, you will get one statistic per group.
# We have DEMOCRAT, OTHER, REPUBLICAN
dt1$affiliation %>% table
## .
## DEMOCRAT OTHER REPUBLICAN
## 751596 243 478918
# You can use group_by() to look at total spending by affiliation
dt1 %>%
  group_by(affiliation) %>%
  summarize(
    tot_spend = est_spending %>%
      sum
  )
## # A tibble: 3 × 2
## affiliation tot_spend
## <chr> <dbl>
## 1 DEMOCRAT 558857940
## 2 OTHER 87570
## 3 REPUBLICAN 409299330
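To make "a statistic for each group" concrete, the sketch below uses a small made-up tibble (the `toy` data is hypothetical) and checks that the grouped summary for one party matches what you would get by filtering to that party by hand and summing.

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data with made-up spending figures
toy <- tibble(
  affiliation  = c("DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "OTHER"),
  est_spending = c(100, 250, 300, 50)
)

# One row per affiliation
toy_by_party <- toy %>%
  group_by(affiliation) %>%
  summarize(
    tot_spend = est_spending %>%
      sum
  )

# The same number for a single group, computed manually
dem_by_hand <- toy %>%
  filter(
    affiliation %>%
      equals("DEMOCRAT")
  ) %>%
  summarize(
    tot_spend = est_spending %>%
      sum
  )

toy_by_party
dem_by_hand
```

group_by() saves you from repeating the filter step once per group.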
# You can derive more than one statistic inside summarize()
dt1 %>%
  group_by(affiliation) %>%
  summarize(
    tot_spend = est_spending %>%
      sum,
    ads = n()
  )
## # A tibble: 3 × 3
## affiliation tot_spend ads
## <chr> <dbl> <int>
## 1 DEMOCRAT 558857940 751596
## 2 OTHER 87570 243
## 3 REPUBLICAN 409299330 478918
Building on the previous code, add an extra statistic that shows the average amount spent per ad, by affiliation.
# Exercise 1
You can define groups by more than one variable in group_by(). For example, we can see at which time of day each political party ran its ads.
dt1 %>%
  group_by(affiliation, daypart) %>%
  summarize(
    tot_spend = est_spending %>%
      sum,
    ads = n()
  ) %>%
  filter(
    affiliation %>%
      equals("OTHER") %>%
      not
  )
## `summarise()` has grouped output by 'affiliation'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 4
## # Groups: affiliation [2]
## affiliation daypart tot_spend ads
## <chr> <chr> <dbl> <int>
## 1 DEMOCRAT DAYTIME 90442270 197746
## 2 DEMOCRAT EARLY FRINGE 74519960 104256
## 3 DEMOCRAT EARLY MORNING 98470600 173379
## 4 DEMOCRAT EARLY NEWS 45078880 47891
## 5 DEMOCRAT LATE FRINGE 52588070 87493
## 6 DEMOCRAT LATE NEWS 40284700 42283
## 7 DEMOCRAT PRIME ACCESS 55331660 57216
## 8 DEMOCRAT PRIME TIME 102141800 41332
## 9 REPUBLICAN DAYTIME 51512260 95777
## 10 REPUBLICAN EARLY FRINGE 55410960 64729
## 11 REPUBLICAN EARLY MORNING 66368550 117236
## 12 REPUBLICAN EARLY NEWS 30599170 32693
## 13 REPUBLICAN LATE FRINGE 34513880 51875
## 14 REPUBLICAN LATE NEWS 34978000 32920
## 15 REPUBLICAN PRIME ACCESS 41314930 42547
## 16 REPUBLICAN PRIME TIME 94601580 41141
So, how do we envision this being useful for us? We can use these summaries to build towards a data structure that is more useful for our purposes.
dt1 %>%
  group_by(affiliation, daypart) %>%
  summarize(
    tot_spend = est_spending %>%
      sum(),
    ads = n()
  ) %>%
  filter(
    affiliation %>%
      equals("OTHER") %>%
      not
  ) %>%
  mutate(
    perc_ads = ads %>%
      divide_by(
        ads %>%
          sum
      ) %>%
      multiply_by(100),
    perc_spend = tot_spend %>%
      divide_by(
        tot_spend %>%
          sum
      ) %>%
      multiply_by(100)
  )
## `summarise()` has grouped output by 'affiliation'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 6
## # Groups: affiliation [2]
## affiliation daypart tot_spend ads perc_ads perc_spend
## <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 DEMOCRAT DAYTIME 90442270 197746 26.3 16.2
## 2 DEMOCRAT EARLY FRINGE 74519960 104256 13.9 13.3
## 3 DEMOCRAT EARLY MORNING 98470600 173379 23.1 17.6
## 4 DEMOCRAT EARLY NEWS 45078880 47891 6.37 8.07
## 5 DEMOCRAT LATE FRINGE 52588070 87493 11.6 9.41
## 6 DEMOCRAT LATE NEWS 40284700 42283 5.63 7.21
## 7 DEMOCRAT PRIME ACCESS 55331660 57216 7.61 9.90
## 8 DEMOCRAT PRIME TIME 102141800 41332 5.50 18.3
## 9 REPUBLICAN DAYTIME 51512260 95777 20.0 12.6
## 10 REPUBLICAN EARLY FRINGE 55410960 64729 13.5 13.5
## 11 REPUBLICAN EARLY MORNING 66368550 117236 24.5 16.2
## 12 REPUBLICAN EARLY NEWS 30599170 32693 6.83 7.48
## 13 REPUBLICAN LATE FRINGE 34513880 51875 10.8 8.43
## 14 REPUBLICAN LATE NEWS 34978000 32920 6.87 8.55
## 15 REPUBLICAN PRIME ACCESS 41314930 42547 8.88 10.1
## 16 REPUBLICAN PRIME TIME 94601580 41141 8.59 23.1
A few things to note here. First, when we run summarize() after a group_by() with two variables, the last grouping variable (here, daypart) is peeled off, and the result stays grouped by affiliation, as the `# Groups: affiliation [2]` line in the output shows. The reason for this design choice becomes clear when we follow how the code above gives us useful information. Once we saw that parties ran ads differently across dayparts, we became interested in deriving further quantities by party. Because the data is still grouped by affiliation, the sums inside mutate() are taken within each party, so perc_ads and perc_spend show how each party’s ads and spending are distributed across dayparts.
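The peeling-off behaviour can be inspected directly. The sketch below uses a made-up tibble (the `toy` data is hypothetical) and spells out the default with `.groups = "drop_last"`, which also silences the message seen in the output above. `group_vars()` reports which grouping variables remain.

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data: five ads across two parties and two dayparts
toy <- tibble(
  affiliation  = c("DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "REPUBLICAN"),
  daypart      = c("DAYTIME", "DAYTIME", "PRIME TIME", "DAYTIME", "PRIME TIME"),
  est_spending = c(10, 20, 70, 40, 60)
)

by_party_daypart <- toy %>%
  group_by(affiliation, daypart) %>%
  summarize(
    ads = n(),
    .groups = "drop_last"  # the default behaviour, written out explicitly
  )

# Only the first grouping variable is left
group_vars(by_party_daypart)

# Because the data is still grouped by affiliation,
# the sum in the denominator is taken within each party
shares <- by_party_daypart %>%
  mutate(
    perc_ads = ads %>%
      divide_by(
        ads %>%
          sum
      ) %>%
      multiply_by(100)
  )

shares
```

If you instead want percentages over the whole table, one option is to pass `.groups = "drop"` to summarize() (or call ungroup()) before the mutate() step.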
Another thing is that from the point you run summarize(), you are losing some of your data. You are trading off some of the granularity in the data to gain a condensed understanding of it.
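One way to see this trade-off is to compare summarize() with a grouped mutate(), which computes the same group total but keeps every original row. This is only a sketch on made-up data (the `toy` tibble is hypothetical).

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data with made-up spending figures
toy <- tibble(
  affiliation  = c("DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "OTHER"),
  est_spending = c(100, 250, 300, 50)
)

# summarize() collapses each group to one row; individual ads are gone
collapsed <- toy %>%
  group_by(affiliation) %>%
  summarize(
    tot_spend = est_spending %>%
      sum
  )

# mutate() attaches the group total to every ad and keeps all rows
kept <- toy %>%
  group_by(affiliation) %>%
  mutate(
    tot_spend = est_spending %>%
      sum
  ) %>%
  ungroup

nrow(collapsed)
nrow(kept)
```

Which one you want depends on whether later steps still need the ad-level rows.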
Finally, notice that the magrittr pipe approach is powerful because it lets us follow, step by step, how the analysis of the data progresses.
# Load the penguins dataset
library(palmerpenguins)
dt2 <- palmerpenguins::penguins
# Check the data structure
dt2 %>%
str
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# What types of species are there?
dt2$species %>% table
## .
## Adelie Chinstrap Gentoo
## 152 68 124
# On which islands were they observed?
dt2$island %>% table
## .
## Biscoe Dream Torgersen
## 168 124 52
What is the average body mass, average flipper length, and average bill length for penguins, by species? (Tip: you will have to omit NAs in the dataset first.)
What was the percentage of female penguins, by island?
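On the tip about NAs: the sketch below shows, on a made-up tibble rather than the penguins data (the `toy` data and its column names are hypothetical), why missing values matter for a summary and two common ways of handling them.

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data with one missing measurement
toy <- tibble(
  species = c("A", "A", "B", "B"),
  mass    = c(10, NA, 30, 50)
)

# A single NA makes the mean of that group NA
with_na <- toy %>%
  group_by(species) %>%
  summarize(
    avg_mass = mass %>%
      mean
  )

# Option 1: drop the rows with missing values first
dropped_first <- toy %>%
  filter(
    mass %>%
      is.na %>%
      not
  ) %>%
  group_by(species) %>%
  summarize(
    avg_mass = mass %>%
      mean
  )

# Option 2: ask mean() to ignore missing values
na_rm <- toy %>%
  group_by(species) %>%
  summarize(
    avg_mass = mass %>%
      mean(na.rm = TRUE)
  )

with_na
dropped_first
na_rm
```

Both options give the same averages here. With several measurement columns, they can differ, because dropping whole rows also discards the non-missing values in those rows.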