How lovely to be back!

Oh it’s nice to be back. I had a very memorable experience at Melbourne’s Center for Behavior Change last semester. Terms of academic employment are very distinct in different countries, and this was illuminating.

What topics should we cover this semester?

We’ve covered a lot of ground to date, and at least so far as my ambition, we’ve covered the basic toolkit for going from a malformed dataset to publishable figures, models, and tables for an academic paper. The kinds of topics which I think remain outstanding are:

  • Maps

  • More elaborate tables, of the kind that epidemiologists traditionally refer to as Table 1.

  • SQL

  • Competing data frame implementations (for instance, polars, DuckDB, collapse, data.table, or datafusion) which might go on to displace dplyr as a data manipulation tool

    but most importantly, I’d like to gauge the interest of the room!

Should we talk about format?

Initially I had thought to be sharing the convening responsibilities with students, but we ended up in an arrangement of me leading these sessions. Should we do any fundamental reform to our arrangements?

A quick data exercise

We’re going to look again at a dataset we’ve used multiple times–US box office data

library(tidyverse)
library(magrittr)
library(rvest)

t0 <- "https://github.com/tjwaterman99/boxofficemojo-scraper/releases/latest/download/revenues_per_day.csv.gz" %>% 
  read_csv

I’m interested in how we’d model some recent flops

Just eyeballing it, it appears pretty close to an exponential decay functional form.

\[ C(t) = a \cdot (1 - e^{-b \cdot t}) \]

where a is the asymptotic maximum, and b is the decay rate.

We’ll use the elegant little minpack.lm package for a simple way to model a nonlinear least squares equation.

We need to do some manipulation to get to the form for the modelling

t1 <- t0 %>% 
  filter(
    title %>% 
      is_in(
        c("The Marvels",
          "Morbius",
          "X-Men: Dark Phoenix",
          "Madame Web",
          "Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn",
          "Blue Beetle",
          "Shazam! Fury of the Gods",
          "Joker: Folie u00e0 Deux",
          "The Flash")
      )
  ) %>% 
  mutate(
    dow = date %>% 
      wday(label = T)
  ) %>% 
  arrange(
    title, date
  ) %>% 
  group_by(
    title
  ) %>% 
  mutate(
    dir = 1:n()
  ) %>% 
  group_by(
    title, dow
  ) %>% 
  mutate(
    dow_n = 1:n()
  )

Then we need to something for the films that don’t get released on a Friday (making their box office accumulation harder to compare)

t2 <- t1 %>% 
  filter(
    dow == "Fri" &
      dow_n == 1
  ) %>% 
  ungroup %>% 
  select(
    date, title
  ) %>% 
  unname %>% 
  pmap(
    \(i, j)
    
    # i <- "2022-04-01" %>% as.Date
    # j  <-  "Morbius"
    
    t1 %>% 
      ungroup %>% 
      filter(
        title == j &
          date %>% 
          is_weakly_greater_than(i)
      )
  ) %>% 
  list_rbind %>% 
  arrange(
    title, date
  ) %>% 
  group_by(
    title
  ) %>% 
  mutate(
    cr = revenue %>% cumsum
  )

If we wanted to apply a separate exponential decay model to each film separately, we’d do this

library(minpack.lm)
t3 <- t2 %>% 
  na.omit %>% 
  group_by(
    title
  ) %>% 
  nest

t3$mods <- t3$data %>% 
  map(
    \(i){
     
      nlsLM(
        cr ~ a * (1 - exp(-b * dir)),
        start = list(a = max(i$cr), b = 0.1),
        data = i
      )
      
    }
  )

t3$data <- t3$mods %>% 
  map2(
    t3$data,
    \(i, j)
    j %>% 
      mutate(
        fitted = i %>% 
          predict(newdata = j)
      )
  )

p1 <- t3$data %>% 
  map2(
    t3$title,
    \(i, j)
    i %>% 
      mutate(
        title = j
      )
  ) %>% 
  list_rbind %>% 
  pivot_longer(
    names_to = "type",
    values_to = "val",
    cols = cr:fitted
  ) %>% 
  mutate(
    title = title %>% 
      fct_reorder(
        val, .fun = "max", .desc = T
      )
  ) %>% 
  ggplot(
    aes(
      dir, val, color = type
    )
  ) +
  geom_line() +
  facet_wrap(~title) +
  scale_y_continuous(
    labels = scales::label_dollar()
  ) +
  labs(
    x = "",
    y = ""
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    panel.background = element_rect(fill = "grey97",
                                    color = "grey97"),
    strip.background = element_rect(fill = "grey97",
                                    color = "grey97")
    )

It seemed to work pretty well!

But, it’s not very meaningful to show that we can fit a model to a decay process after the process has been observed.

So now my task: how well do these models do when we fit the model to only a limited amount of days box office? What’s the process of accuracy improvement over time?

Accordingly:

  1. Fit a separate model for each movie for some set of periods (either measured in days, or weeks)
  2. Visualize the difference in model performance as function of period