The principle advantage to the dplyr
approach is
cognitive–that a huge, potentially limitless set of challenges and
statistical tasks…
….are actually special cases of a more general problem.
Let me emphasize this: the reason you make an effort with
dplyr
is that you will eventually come to learn that each
seemingly unique analytics challenge, which begins is actually, a
special case of a more general statistical process.
You used to be verklempt! “gosh, how to proceed with this new challenge, given all the rich complexity inherent herein?!”
Armed with this approach, you instead see patterns and similarities.
Becoming adept at this pattern recognition is the goal of the
tidyverse
approach.
Just to ruin the suspense, our verbs are
filter
(to return a subset of
rows)select
(to return a subset of
columns)arrange
(to sort a table)mutate
(to add a new column)summarize
(to generate a new table,
usually with statistical result)and our one adverb
group_by
(to repeat some analysis
separately by each group)Ok let’s get started. filtering
to begin:
dplyr::filter
We filter
to select cases. We have a
data.frame
with n rows, and we’re actually only interested
in k rows which meet some other condition.
A rich and illustrative graphical example
We need some data—this is the CMAG1 table with all the television ads run in 2020 presidential election.
library(tidyverse)
library(magrittr)
d1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/cmag_2020_codelab.rds" %>%
url %>%
gzcon %>%
readRDS
We can do the all the following
d1 %>%
filter(
sponsor %>%
equals("TRUMP, DONALD")
)
d1 %>%
filter(
sponsor %>%
is_in(
c("BIDEN, JOE",
"AMERICA FIRST ACTION")
)
)
d1 %>%
filter(
est_spending %>%
is_greater_than(10000) &
air_date %>%
is_weakly_greater_than(
"2020-10-01" %>%
as.Date
)
)
Stringing together logical conditions is a hugely powerful technique
which you’ll use often. Imagine, for instance, you want to check the
spots run by President Trump in October or November, or the ads
run by then-Vice President Biden in June or July. You might use a nested
sets of intersection operators &
, or union operators
|
, and ()
to communicate the operators’
logical precedence, as follows:
d1 %>%
filter(
(
sponsor %>%
equals("TRUMP, DONALD") &
month %>%
is_in(
c("Oct",
"Nov")
)
) | (
sponsor %>%
equals("BIDEN, JOE") &
month %>%
is_in(
c("Jun",
"Jul")
)
)
)
Take a moment to work this chunk by removing the nesting or the logical operators to really understand what’s going on here.
dplyr::select
If filter
uses logical operations on rows,
select
is the command to do logical operations on columns
(or variables).
We get to select columns any number of ways. Really, the flexibility here is almost a pedagogical drawback.
d1 %>%
select(1:3)
d1 %>%
select(
ad_type, market, time
)
d1 %>%
select(
air_date:market
)
We can use a -
to make an selection exclusion
d1 %>%
select(-c(1:3))
d1 %>%
select(
-ad_type
)
d1 %>%
select(
-c(air_date:market)
)
We get a bunch of utility helpers
d1 %>%
select(starts_with("a"))
d1 %>%
select(ends_with("e"))
d1 %>%
select(contains("_"))
d1 %>%
select(
where(is.character), where(is.ordered)
)
dplyr::arrange
arrange
sorts tables by variables.
This is useful when you need to do something like
…actually come to think of it, arranging tables is pretty specific and mid. But it’s a public university, and the taxpayers
Anyway, here’s how it works
d1 %>%
arrange(
desc(est_spending)
)
d1 %>%
arrange(
air_date
)
dplyr::mutate
mutate
takes an existing table and adds a new column,
usually from a computation. It becomes super super super useful, when
used in conjunction with group_by
Let’s do an example–an indicator grouping weekday or weekend ads
d1 %>%
mutate(
day_type = wday %>%
str_detect("Sat|Sun") %>%
ifelse(
"weekend", "weekday"
)
)
or grouping together dayparts
d1 %>%
mutate(
part_type = daypart %>%
case_match(
"EARLY MORNING" ~ "morning",
c("DAYTIME",
"EARLY FRINGE",
"LATE FRINGE") ~ "daytime",
c("EARLY NEWS",
"PRIME ACCESS",
"PRIME TIME",
"LATE NEWS") ~ "evening"
)
)
or figuring out how many days before the election an ad was broadcast
library(lubridate)
d1 %>%
mutate(
days_before = "2020-11-03" %>%
as.Date %>%
subtract(
air_date
) %>%
as.numeric
)
dplyr::summarize
summarize
takes an existing table and generates
a new table of results. You will normally want to assign this
resuling table to a new variable
How much was spent on TV advertising in the 2020 presidential election?
d1 %>%
summarize(
tote_spend = est_spending %>% sum
)
Or how many different advertisers where there?
d1 %>%
summarize(
uniq_spons = sponsor %>% unique %>% length
)
Now, perhaps it’s already occurred to you–
summarize
is not too
useful, a single stat returned from the table?and I would agree! We’re statisticians, we like counting things, and most of all, we like counting things by group. Which brings us to…
dplyr::group_by
This adverb really makes the dplyr
approach powerful and
compact. group_by
repeats a statistical task for some
arbitrary number of groups (defined by a variable) and then either
returns a single value for each group (when used with
summarize
) or the underlying table (when used with
mutate
.)
Let’s do some examples. How many ads were run by each political team, and how much was spent?
d1 %>%
group_by(affiliation) %>%
summarize(
tote_spend = est_spending %>%
sum,
ads = n()
)
## # A tibble: 3 × 3
## affiliation tote_spend ads
## <chr> <dbl> <int>
## 1 DEMOCRAT 558857940 751596
## 2 OTHER 87570 243
## 3 REPUBLICAN 409299330 478918
Did the political parties run ads during the same dayparts?
d1 %>%
group_by(affiliation, daypart) %>%
summarize(
tote_spend = est_spending %>%
sum,
ads = n()
) %>%
filter(
affiliation %>%
equals("OTHER") %>%
not
)
## # A tibble: 16 × 4
## # Groups: affiliation [2]
## affiliation daypart tote_spend ads
## <chr> <chr> <dbl> <int>
## 1 DEMOCRAT DAYTIME 90442270 197746
## 2 DEMOCRAT EARLY FRINGE 74519960 104256
## 3 DEMOCRAT EARLY MORNING 98470600 173379
## 4 DEMOCRAT EARLY NEWS 45078880 47891
## 5 DEMOCRAT LATE FRINGE 52588070 87493
## 6 DEMOCRAT LATE NEWS 40284700 42283
## 7 DEMOCRAT PRIME ACCESS 55331660 57216
## 8 DEMOCRAT PRIME TIME 102141800 41332
## 9 REPUBLICAN DAYTIME 51512260 95777
## 10 REPUBLICAN EARLY FRINGE 55410960 64729
## 11 REPUBLICAN EARLY MORNING 66368550 117236
## 12 REPUBLICAN EARLY NEWS 30599170 32693
## 13 REPUBLICAN LATE FRINGE 34513880 51875
## 14 REPUBLICAN LATE NEWS 34978000 32920
## 15 REPUBLICAN PRIME ACCESS 41314930 42547
## 16 REPUBLICAN PRIME TIME 94601580 41141
Notice how we passed two variables to group_by
, but
there’s only 1 group specified when we printed the tibble. The last
group specified–d1$daypart
has been peeled
off by the summarize
call, so that we can do
additional analyses on these summary tables.
Pause here.
This is a very specific design choice that causes a ton of confusion online. In making this design, Hadley understood a very specific analytic loop:
So, that the parties run different numbers of ads is not so interesting–instead, given that there were different number of ads, what was the proportion of GOP and Democratic ads run on a specific daypart?
d1 %>%
group_by(affiliation, daypart) %>%
summarize(
tote_spend = est_spending %>%
sum,
ads = n()
) %>%
filter(
affiliation %>%
equals("OTHER") %>%
not
) %>%
mutate(
perc_ads = ads %>%
divide_by(
ads %>% sum
) %>%
multiply_by(100),
perc_spend = tote_spend %>%
divide_by(
tote_spend %>% sum
) %>%
multiply_by(100)
)
## # A tibble: 16 × 6
## # Groups: affiliation [2]
## affiliation daypart tote_spend ads perc_ads perc_spend
## <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 DEMOCRAT DAYTIME 90442270 197746 26.3 16.2
## 2 DEMOCRAT EARLY FRINGE 74519960 104256 13.9 13.3
## 3 DEMOCRAT EARLY MORNING 98470600 173379 23.1 17.6
## 4 DEMOCRAT EARLY NEWS 45078880 47891 6.37 8.07
## 5 DEMOCRAT LATE FRINGE 52588070 87493 11.6 9.41
## 6 DEMOCRAT LATE NEWS 40284700 42283 5.63 7.21
## 7 DEMOCRAT PRIME ACCESS 55331660 57216 7.61 9.90
## 8 DEMOCRAT PRIME TIME 102141800 41332 5.50 18.3
## 9 REPUBLICAN DAYTIME 51512260 95777 20.0 12.6
## 10 REPUBLICAN EARLY FRINGE 55410960 64729 13.5 13.5
## 11 REPUBLICAN EARLY MORNING 66368550 117236 24.5 16.2
## 12 REPUBLICAN EARLY NEWS 30599170 32693 6.83 7.48
## 13 REPUBLICAN LATE FRINGE 34513880 51875 10.8 8.43
## 14 REPUBLICAN LATE NEWS 34978000 32920 6.87 8.55
## 15 REPUBLICAN PRIME ACCESS 41314930 42547 8.88 10.1
## 16 REPUBLICAN PRIME TIME 94601580 41141 8.59 23.1
For these reasons, pay a ton of attention to the ordering of
the variables passed to group_by
Ok–a few more little pieces of useful dplyr
stuff:
slice()
returns a number of rowstop_n()
is like slice
, but allows you to
define an additional variable on which you’d like to make a specific
selectionrename
lets your rename variablestally
reports group counts, normally after
youcount
reports group counts, but lets you define the
group inside the functionlet’s download the Billboard 200
album sales data to do a couple of exercises and try out the
dplyr
approach.
d2 <- "https://github.com/thomasjwood/code_lab/raw/main/data/billboard_200_68-23.rds" %>%
url %>%
gzcon %>%
readRDS
No artist better encapsulates Gen-Z’s mid tastes than T-Swift. How many times has Taylor had 5 or more separate albums on a single Billboard 200 chart?
By decade, report the top 5 artists who’ve spent the most weeks with at least 1 top 10 album. Please exclude soundtracks and albums by ‘various artists.’
By decade, report the top 5 artists who’ve had the most separate albums which spent at least 1 week in the top 10. Please again exclude soundtracks and albums by ‘various artists.’
By decade, report the top 5 albums who spent the most weeks in the top 100. Bonus By decade, report the mean number of weeks the start and end of each album’s period in the top 100.
CMAG has a pretty wide use among academic papers–it can be ordered here http://mediaproject.wesleyan.edu/dataaccess/↩︎