Our modeling goal is to estimate how the characteristics of Super Bowl commercials have changed over time. There aren’t a lot of observations in this dataset, and bootstrap resampling is an approach that can be used for robust estimates in such situations. Let’s start by reading in the data.
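The import step isn’t shown in this excerpt; here is a minimal sketch, assuming the data come from the 2021-03-02 TidyTuesday youtube.csv file (the URL below is that assumption, so point it at wherever your copy of the data lives).

library(tidyverse)

# Assumed source: the 2021-03-02 TidyTuesday dataset
youtube <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv")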
skimr::skim(youtube)
| Data summary | |
|---|---|
| Name | youtube |
| Number of rows | 247 |
| Number of columns | 25 |
| Column type frequency: | |
| character | 10 |
| logical | 7 |
| numeric | 7 |
| POSIXct | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| brand | 0 | 1.00 | 3 | 9 | 0 | 10 | 0 |
| superbowl_ads_dot_com_url | 0 | 1.00 | 34 | 120 | 0 | 244 | 0 |
| youtube_url | 11 | 0.96 | 43 | 43 | 0 | 233 | 0 |
| id | 11 | 0.96 | 11 | 11 | 0 | 233 | 0 |
| kind | 16 | 0.94 | 13 | 13 | 0 | 1 | 0 |
| etag | 16 | 0.94 | 27 | 27 | 0 | 228 | 0 |
| title | 16 | 0.94 | 6 | 99 | 0 | 228 | 0 |
| description | 50 | 0.80 | 3 | 3527 | 0 | 194 | 0 |
| thumbnail | 129 | 0.48 | 48 | 48 | 0 | 118 | 0 |
| channel_title | 16 | 0.94 | 3 | 37 | 0 | 185 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| funny | 0 | 1 | 0.69 | TRU: 171, FAL: 76 |
| show_product_quickly | 0 | 1 | 0.68 | TRU: 169, FAL: 78 |
| patriotic | 0 | 1 | 0.17 | FAL: 206, TRU: 41 |
| celebrity | 0 | 1 | 0.29 | FAL: 176, TRU: 71 |
| danger | 0 | 1 | 0.30 | FAL: 172, TRU: 75 |
| animals | 0 | 1 | 0.37 | FAL: 155, TRU: 92 |
| use_sex | 0 | 1 | 0.27 | FAL: 181, TRU: 66 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2010.19 | 5.86 | 2000 | 2005 | 2010 | 2015.00 | 2020 | ▇▇▇▇▆ |
| view_count | 16 | 0.94 | 1407556.46 | 11971111.01 | 10 | 6431 | 41379 | 170015.50 | 176373378 | ▇▁▁▁▁ |
| like_count | 22 | 0.91 | 4146.03 | 23920.40 | 0 | 19 | 130 | 527.00 | 275362 | ▇▁▁▁▁ |
| dislike_count | 22 | 0.91 | 833.54 | 6948.52 | 0 | 1 | 7 | 24.00 | 92990 | ▇▁▁▁▁ |
| favorite_count | 16 | 0.94 | 0.00 | 0.00 | 0 | 0 | 0 | 0.00 | 0 | ▁▁▇▁▁ |
| comment_count | 25 | 0.90 | 188.64 | 986.46 | 0 | 1 | 10 | 50.75 | 9190 | ▇▁▁▁▁ |
| category_id | 16 | 0.94 | 19.32 | 8.00 | 1 | 17 | 23 | 24.00 | 29 | ▃▁▂▆▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| published_at | 16 | 0.94 | 2006-02-06 10:02:36 | 2021-01-27 13:11:29 | 2013-01-31 09:13:55 | 227 |
Let’s make one exploratory plot to see how the characteristics of the commercials change over time.
theme_set(theme_light())
youtube %>%
select(year, funny:use_sex) %>%
pivot_longer(funny:use_sex) %>%
group_by(year, name) %>%
summarise(prop = mean(value)) %>%
ungroup() %>%
ggplot(aes(year, prop, color = name)) +
geom_line(size = 1.2, show.legend = FALSE) +
facet_wrap(vars(name)) +
scale_y_continuous(labels = scales::percent) +
labs(x = NULL, y = '% of commercials')
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
Although those relationships don’t look perfectly linear, we can use a linear model to estimate if and how much these characteristics are changing with time.
simple_mod <- lm(year ~ funny + show_product_quickly + patriotic + celebrity + danger + animals + use_sex,
data = youtube)
summary(simple_mod)
##
## Call:
## lm(formula = year ~ funny + show_product_quickly + patriotic +
## celebrity + danger + animals + use_sex, data = youtube)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5254 -4.1023 0.1456 3.9662 10.1727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2011.0838 0.9312 2159.748 < 2e-16 ***
## funnyTRUE -2.8979 0.8593 -3.372 0.00087 ***
## show_product_quicklyTRUE 0.7706 0.7443 1.035 0.30160
## patrioticTRUE 2.0455 1.0140 2.017 0.04480 *
## celebrityTRUE 2.4416 0.7767 3.144 0.00188 **
## dangerTRUE 0.4814 0.7846 0.614 0.54007
## animalsTRUE 0.1082 0.7330 0.148 0.88274
## use_sexTRUE -2.4041 0.8175 -2.941 0.00359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.391 on 239 degrees of freedom
## Multiple R-squared: 0.178, Adjusted R-squared: 0.1539
## F-statistic: 7.393 on 7 and 239 DF, p-value: 4.824e-08
We get statistical properties from this linear model, but we can use bootstrap resampling to get an estimate of the variance in our quantities. In tidymodels, the rsample package has functions to create resamples such as bootstraps().
library(rsample)
## Warning: package 'rsample' was built under R version 4.0.3
bootstraps(youtube, times = 1e3)
## # Bootstrap sampling
## # A tibble: 1,000 x 2
## splits id
## <list> <chr>
## 1 <rsplit [247/91]> Bootstrap0001
## 2 <rsplit [247/85]> Bootstrap0002
## 3 <rsplit [247/89]> Bootstrap0003
## 4 <rsplit [247/96]> Bootstrap0004
## 5 <rsplit [247/92]> Bootstrap0005
## 6 <rsplit [247/88]> Bootstrap0006
## 7 <rsplit [247/93]> Bootstrap0007
## 8 <rsplit [247/86]> Bootstrap0008
## 9 <rsplit [247/91]> Bootstrap0009
## 10 <rsplit [247/92]> Bootstrap0010
## # ... with 990 more rows
These resampling functions let you carry out flexible bootstrapping or permutation steps. However, that’s a lot of steps to get to confidence intervals, especially if you are using a really simple model! In a recent release of rsample, we added a new function reg_intervals() that finds confidence intervals for models like lm() and glm() (as well as models from the survival package).
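As a rough sketch (assuming a version of rsample recent enough to include reg_intervals(); the object name youtube_intervals_shortcut is just for illustration), that shortcut would look something like the following, while the chunk after it builds the same kind of percentile intervals step by step so we can see each piece.

# A sketch of the shortcut, not the workflow used below:
# reg_intervals() wraps the bootstrap + fit + interval steps for simple parametric models
youtube_intervals_shortcut <- reg_intervals(
  year ~ funny + show_product_quickly + patriotic + celebrity + danger + animals + use_sex,
  data = youtube,
  type = "percentile",  # percentile bootstrap intervals, like int_pctl() further down
  keep_reps = TRUE      # also keep the individual bootstrap estimates
)

By default reg_intervals() drops the intercept from its results, which is also what we do by hand below.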
library(broom)

set.seed(123)
df <- youtube %>% select(year, funny:use_sex)
df_bootstrap <- bootstraps(df, times = 1e3)

# Fit the linear model to the analysis (bootstrap) portion of each resample
lm_function <- function(split) {
  lm(year ~ funny + show_product_quickly + patriotic + celebrity + danger + animals + use_sex,
     data = analysis(split))
}

boot_models <- df_bootstrap %>%
  mutate(model = map(splits, lm_function),
         coef = map(model, tidy))

boot_models %>% unnest(coef)
## # A tibble: 8,000 x 8
## splits id model term estimate std.error statistic p.value
## <list> <chr> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <rsplit [2~ Bootstra~ <lm> (Intercept) 2012. 0.900 2236. 0.
## 2 <rsplit [2~ Bootstra~ <lm> funnyTRUE -3.69 0.821 -4.49 1.09e-5
## 3 <rsplit [2~ Bootstra~ <lm> show_produc~ 1.40 0.731 1.91 5.68e-2
## 4 <rsplit [2~ Bootstra~ <lm> patrioticTR~ 0.922 0.936 0.985 3.26e-1
## 5 <rsplit [2~ Bootstra~ <lm> celebrityTR~ 3.32 0.792 4.20 3.83e-5
## 6 <rsplit [2~ Bootstra~ <lm> dangerTRUE -1.05 0.722 -1.45 1.47e-1
## 7 <rsplit [2~ Bootstra~ <lm> animalsTRUE 0.457 0.694 0.658 5.11e-1
## 8 <rsplit [2~ Bootstra~ <lm> use_sexTRUE -2.20 0.788 -2.80 5.60e-3
## 9 <rsplit [2~ Bootstra~ <lm> (Intercept) 2013. 0.879 2291. 0.
## 10 <rsplit [2~ Bootstra~ <lm> funnyTRUE -3.30 0.796 -4.15 4.71e-5
## # ... with 7,990 more rows
# Percentile intervals for each model term; drop the intercept, which we don't need for the plots
percentiles <- int_pctl(boot_models, coef)
youtube_intervals <- percentiles %>% filter(term != "(Intercept)")
All done!
We can use visualization to explore these results. If we had not kept the individual bootstrap fits (the equivalent of keep_reps = FALSE in reg_intervals()), we would only have the intervals themselves and could make a plot such as this one.
youtube_intervals %>%
mutate(
term = str_remove(term, "TRUE"),
term = fct_reorder(term, .estimate)
) %>%
ggplot(aes(.estimate, term)) +
geom_vline(xintercept = 0, size = 1.5, lty = 2, color = 'gray80') +
geom_errorbarh(aes(xmin = .lower, xmax = .upper),
size = 1.5, alpha = 0.5, color = 'midnightblue') +
geom_point(size = 3, alpha = 0.5, color = 'midnightblue') +
labs(
x = 'Increase in year for each commercial characteristic',
y = NULL
)
Since we did keep the individual replicates, we can look at the distributions.
boot_models %>% unnest(coef) %>%
filter(term != '(Intercept)') %>%
mutate(
term = str_remove(term, "TRUE"),
term = fct_reorder(term, estimate)
) %>%
ggplot(aes(estimate, fill = term)) +
geom_vline(xintercept = 0, size = 1.5, lty = 2, color = 'gray50')+
geom_histogram(alpha = 0.8, show.legend = FALSE) +
facet_wrap(vars(term))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.