Oh, it’s nice to be back. I had a very memorable experience at Melbourne’s Center for Behavior Change last semester. The terms of academic employment differ quite a bit between countries, and seeing that first-hand was illuminating.
We’ve covered a lot of ground to date, and at least as far as my own ambitions go, we’ve covered the basic toolkit for going from a malformed dataset to publishable figures, models, and tables for an academic paper. The topics I think remain outstanding are:
Maps
More elaborate tables, of the kind that epidemiologists traditionally refer to as Table 1.
SQL
Competing data frame implementations (for instance, polars, DuckDB, collapse, data.table, or datafusion) which might go on to displace dplyr as a data manipulation tool
but most importantly, I’d like to gauge the interest of the room!
Initially I had thought I’d be sharing the convening responsibilities with students, but we ended up with me leading these sessions. Should we make any fundamental changes to that arrangement?
We’re going to look again at a dataset we’ve used multiple times: US box office data.
library(tidyverse)
library(magrittr)
library(rvest)

# daily per-film revenues, published as a release asset of the boxofficemojo-scraper repo
t0 <- "https://github.com/tjwaterman99/boxofficemojo-scraper/releases/latest/download/revenues_per_day.csv.gz" %>%
  read_csv
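Before going further, it’s worth a quick glance at the structure; the columns we lean on below are date, title, and revenue. (This check isn’t part of the pipeline, just a habit.)

# quick look at the columns we'll use below: date, title, and revenue
t0 %>%
  glimpse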
I’m interested in how we’d model some recent flops.
Just eyeballing the data, cumulative revenue looks pretty close to an exponential decay functional form:
\[ C(t) = a \cdot (1 - e^{-b \cdot t}) \]
where \(a\) is the asymptotic maximum and \(b\) is the decay rate.
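The “decay” here is in the daily takings rather than the running total: differentiating the curve gives
\[ C'(t) = a \cdot b \cdot e^{-b \cdot t} \]
so daily revenue falls away exponentially at rate \(b\) while the cumulative total levels off at \(a\).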
We’ll use the elegant little minpack.lm package, which gives us a simple way to fit a nonlinear least squares model.
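As a minimal sketch of the interface before we touch the real data (the parameter values here are made up), nlsLM takes a formula, starting values, and a data frame, much like nls:

library(minpack.lm)

# simulate a saturating curve with known a and b, then try to recover them
set.seed(1)
toy <- tibble(
  t = 1:60,
  cr = 5e7 * (1 - exp(-0.15 * t)) + rnorm(60, sd = 5e5)
)

toy_fit <- nlsLM(
  cr ~ a * (1 - exp(-b * t)),
  start = list(a = max(toy$cr), b = 0.1),
  data = toy
)

coef(toy_fit)

With reasonable starting values, the coefficients should come out close to the a = 5e7 and b = 0.15 used to simulate the data.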
We need to do some manipulation to get the data into the right shape for modelling.
t1 <- t0 %>%
  # keep only the recent flops we want to model
  filter(
    title %>%
      is_in(
        c("The Marvels",
          "Morbius",
          "X-Men: Dark Phoenix",
          "Madame Web",
          "Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn",
          "Blue Beetle",
          "Shazam! Fury of the Gods",
          "Joker: Folie à Deux",
          "The Flash")
      )
  ) %>%
  mutate(
    # labelled day of the week for each date
    dow = date %>%
      wday(label = TRUE)
  ) %>%
  arrange(
    title, date
  ) %>%
  group_by(
    title
  ) %>%
  mutate(
    # dir: running count of days within each film's run
    dir = 1:n()
  ) %>%
  group_by(
    title, dow
  ) %>%
  mutate(
    # dow_n: how many times that weekday has occurred for that film
    dow_n = 1:n()
  )
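A quick spot check (again, not part of the pipeline) that the helper columns behave as intended, using Morbius as the example:

# dir should count up day by day; dow_n should count occurrences of each weekday
t1 %>%
  ungroup %>%
  filter(title == "Morbius") %>%
  select(date, dow, dir, dow_n) %>%
  head(10)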
Then we need to do something about the films that don’t get released on a Friday (which makes their box office accumulation harder to compare): we truncate each film’s run so it starts from its first Friday.
t2 <- t1 %>%
  # find each film's first Friday in release
  filter(
    dow == "Fri" &
      dow_n == 1
  ) %>%
  ungroup %>%
  select(
    date, title
  ) %>%
  # drop the column names so pmap matches the lambda's arguments by position
  unname %>%
  pmap(
    \(i, j)
    # i: the film's first Friday, j: its title
    # (e.g. i <- "2022-04-01" %>% as.Date; j <- "Morbius")
    t1 %>%
      ungroup %>%
      filter(
        title == j &
          date %>%
          is_weakly_greater_than(i)
      )
  ) %>%
  list_rbind %>%
  arrange(
    title, date
  ) %>%
  group_by(
    title
  ) %>%
  mutate(
    # cr: cumulative revenue from the first Friday onwards
    cr = revenue %>% cumsum
  )
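And a quick check that the truncation worked: the first observation for every film should now fall on a Friday.

# earliest remaining date per film should be a Friday
t2 %>%
  slice_min(date, n = 1) %>%
  select(title, date, dow)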
If we wanted to apply a separate exponential decay model to each film, we’d do this:
library(minpack.lm)

# one nested data frame per film
t3 <- t2 %>%
  na.omit %>%
  group_by(
    title
  ) %>%
  nest

# fit an exponential decay model to each film,
# starting a at the observed maximum and b at a small decay rate
t3$mods <- t3$data %>%
  map(
    \(i){
      nlsLM(
        cr ~ a * (1 - exp(-b * dir)),
        start = list(a = max(i$cr), b = 0.1),
        data = i
      )
    }
  )
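The fitted coefficients are worth a glance: a is each film’s implied final total and b is how quickly it gets there. One way to lay them out (a small sketch, not part of the original pipeline):

# implied asymptotic total (a) and decay rate (b) for each film
tibble(
  title = t3$title,
  a = map_dbl(t3$mods, \(m) coef(m)[["a"]]),
  b = map_dbl(t3$mods, \(m) coef(m)[["b"]])
)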
# append each model's fitted values to its film's data
t3$data <- t3$mods %>%
  map2(
    t3$data,
    \(i, j)
    j %>%
      mutate(
        fitted = i %>%
          predict(newdata = j)
      )
  )
p1 <- t3$data %>%
  # put the title back onto each nested data frame
  map2(
    t3$title,
    \(i, j)
    i %>%
      mutate(
        title = j
      )
  ) %>%
  list_rbind %>%
  # long format: observed (cr) and fitted values in one column
  pivot_longer(
    names_to = "type",
    values_to = "val",
    cols = cr:fitted
  ) %>%
  mutate(
    # order the facets by each film's peak value
    title = title %>%
      fct_reorder(
        val, .fun = "max", .desc = TRUE
      )
  ) %>%
  ggplot(
    aes(
      dir, val, color = type
    )
  ) +
  geom_line() +
  facet_wrap(~title) +
  scale_y_continuous(
    labels = scales::label_dollar()
  ) +
  labs(
    x = "",
    y = ""
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    panel.background = element_rect(fill = "grey97",
                                    color = "grey97"),
    strip.background = element_rect(fill = "grey97",
                                    color = "grey97")
  )
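Since the plot is assigned to p1 rather than printed, rendering it is just a matter of calling it; each facet compares observed cumulative revenue (cr) with the fitted curve (fitted) for one film.

p1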
It seemed to work pretty well!
But it’s not very meaningful to show that we can fit a model to a decay process after the whole process has been observed.
So now my task: how well do these models do when we fit them to only a limited number of days of box office returns? And how does their accuracy improve as more of the run is observed?
Accordingly: