Explore the data

Our modeling goal is to estimate how the characteristics of Super Bowl commercials have changed over time. There aren't a lot of observations in this dataset, and bootstrap resampling is an approach that can be used for robust estimates in such situations. Let's start by reading in the data.
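
This dataset comes from TidyTuesday; a minimal read-in could look like the following (a sketch, assuming the 2021-03-02 TidyTuesday file for this dataset):

library(tidyverse)

# assumes the TidyTuesday 2021-03-02 release of the Super Bowl ads data
youtube <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv")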

skimr::skim(youtube)
Data summary
Name youtube
Number of rows 247
Number of columns 25
_______________________
Column type frequency:
character 10
logical 7
numeric 7
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
brand 0 1.00 3 9 0 10 0
superbowl_ads_dot_com_url 0 1.00 34 120 0 244 0
youtube_url 11 0.96 43 43 0 233 0
id 11 0.96 11 11 0 233 0
kind 16 0.94 13 13 0 1 0
etag 16 0.94 27 27 0 228 0
title 16 0.94 6 99 0 228 0
description 50 0.80 3 3527 0 194 0
thumbnail 129 0.48 48 48 0 118 0
channel_title 16 0.94 3 37 0 185 0

Variable type: logical

skim_variable n_missing complete_rate mean count
funny 0 1 0.69 TRU: 171, FAL: 76
show_product_quickly 0 1 0.68 TRU: 169, FAL: 78
patriotic 0 1 0.17 FAL: 206, TRU: 41
celebrity 0 1 0.29 FAL: 176, TRU: 71
danger 0 1 0.30 FAL: 172, TRU: 75
animals 0 1 0.37 FAL: 155, TRU: 92
use_sex 0 1 0.27 FAL: 181, TRU: 66

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2010.19 5.86 2000 2005 2010 2015.00 2020 ▇▇▇▇▆
view_count 16 0.94 1407556.46 11971111.01 10 6431 41379 170015.50 176373378 ▇▁▁▁▁
like_count 22 0.91 4146.03 23920.40 0 19 130 527.00 275362 ▇▁▁▁▁
dislike_count 22 0.91 833.54 6948.52 0 1 7 24.00 92990 ▇▁▁▁▁
favorite_count 16 0.94 0.00 0.00 0 0 0 0.00 0 ▁▁▇▁▁
comment_count 25 0.90 188.64 986.46 0 1 10 50.75 9190 ▇▁▁▁▁
category_id 16 0.94 19.32 8.00 1 17 23 24.00 29 ▃▁▂▆▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
published_at 16 0.94 2006-02-06 10:02:36 2021-01-27 13:11:29 2013-01-31 09:13:55 227

Let’s make one exploratory plot to see how the characteristics of the commercials change over time.

theme_set(theme_light())

youtube %>%
  select(year, funny:use_sex) %>%
  pivot_longer(funny:use_sex) %>%
  group_by(year, name) %>%
  summarise(prop = mean(value)) %>%
  ungroup() %>%
  ggplot(aes(year, prop, color = name)) +
  geom_line(size = 1.2, show.legend = FALSE) +
  facet_wrap(vars(name)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "% of commercials")
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Fit a simple model

Although those relationships don’t look perfectly linear, we can use a linear model to estimate whether and by how much these characteristics are changing over time.

simple_mod <- lm(year ~ funny + show_product_quickly + patriotic + celebrity + danger + animals + use_sex, 
                 data = youtube)

summary(simple_mod)
## 
## Call:
## lm(formula = year ~ funny + show_product_quickly + patriotic + 
##     celebrity + danger + animals + use_sex, data = youtube)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5254  -4.1023   0.1456   3.9662  10.1727 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              2011.0838     0.9312 2159.748  < 2e-16 ***
## funnyTRUE                  -2.8979     0.8593   -3.372  0.00087 ***
## show_product_quicklyTRUE    0.7706     0.7443    1.035  0.30160    
## patrioticTRUE               2.0455     1.0140    2.017  0.04480 *  
## celebrityTRUE               2.4416     0.7767    3.144  0.00188 ** 
## dangerTRUE                  0.4814     0.7846    0.614  0.54007    
## animalsTRUE                 0.1082     0.7330    0.148  0.88274    
## use_sexTRUE                -2.4041     0.8175   -2.941  0.00359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.391 on 239 degrees of freedom
## Multiple R-squared:  0.178,  Adjusted R-squared:  0.1539 
## F-statistic: 7.393 on 7 and 239 DF,  p-value: 4.824e-08

We get statistical properties from this linear model, but we can use bootstrap resampling to estimate the variance in these quantities. In tidymodels, the rsample package has functions to create resamples, such as bootstraps().

library(rsample)
## Warning: package 'rsample' was built under R version 4.0.3
bootstraps(youtube, times = 1e3)
## # Bootstrap sampling 
## # A tibble: 1,000 x 2
##    splits            id           
##    <list>            <chr>        
##  1 <rsplit [247/91]> Bootstrap0001
##  2 <rsplit [247/85]> Bootstrap0002
##  3 <rsplit [247/89]> Bootstrap0003
##  4 <rsplit [247/96]> Bootstrap0004
##  5 <rsplit [247/92]> Bootstrap0005
##  6 <rsplit [247/88]> Bootstrap0006
##  7 <rsplit [247/93]> Bootstrap0007
##  8 <rsplit [247/86]> Bootstrap0008
##  9 <rsplit [247/91]> Bootstrap0009
## 10 <rsplit [247/92]> Bootstrap0010
## # ... with 990 more rows
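
Each rsplit above holds the whole dataset plus resampling indices: <rsplit [247/91]> means 247 rows in the bootstrap (analysis) sample and, for that resample, 91 out-of-bag rows in the assessment set. A quick way to check this on a fresh resample (the first_split name here is just illustrative):

first_split <- bootstraps(youtube, times = 1e3)$splits[[1]]
nrow(analysis(first_split))    # 247: the bootstrap sample is drawn with replacement
nrow(assessment(first_split))  # the out-of-bag rows that were never drawn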

This lets you carry out flexible bootstrapping or permutation steps. However, that’s a lot of steps to get to confidence intervals, especially if you are using a really simple model! In a recent release of rsample, we added a new function reg_intervals() that finds confidence intervals for models like lm() and glm() (as well as models from the survival package).
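
For reference, a single reg_intervals() call could stand in for the whole pipeline below; this is a sketch assuming its formula interface with the type and keep_reps arguments:

reg_intervals(
  year ~ funny + show_product_quickly + patriotic +
    celebrity + danger + animals + use_sex,
  data = youtube,
  type = "percentile",  # percentile bootstrap intervals, like int_pctl() below
  keep_reps = TRUE      # keep the per-resample estimates as well as the intervals
)

Below, we build up the same percentile intervals step by step instead.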

set.seed(123)

df <- youtube %>% select(year, funny:use_sex)

df_bootstrap <- bootstraps(df, times = 1e3)

lm_function <- function(split) {
  # fit the model on the bootstrap (analysis) set of each resample
  lm(
    year ~ funny + show_product_quickly + patriotic + celebrity +
      danger + animals + use_sex,
    data = analysis(split)
  )
}

boot_models <- df_bootstrap %>%
  mutate(
    model = map(splits, lm_function),
    coef = map(model, broom::tidy)
  )

boot_models %>% unnest(coef)
## # A tibble: 8,000 x 8
##    splits      id        model term         estimate std.error statistic p.value
##    <list>      <chr>     <lis> <chr>           <dbl>     <dbl>     <dbl>   <dbl>
##  1 <rsplit [2~ Bootstra~ <lm>  (Intercept)  2012.        0.900  2236.    0.     
##  2 <rsplit [2~ Bootstra~ <lm>  funnyTRUE      -3.69      0.821    -4.49  1.09e-5
##  3 <rsplit [2~ Bootstra~ <lm>  show_produc~    1.40      0.731     1.91  5.68e-2
##  4 <rsplit [2~ Bootstra~ <lm>  patrioticTR~    0.922     0.936     0.985 3.26e-1
##  5 <rsplit [2~ Bootstra~ <lm>  celebrityTR~    3.32      0.792     4.20  3.83e-5
##  6 <rsplit [2~ Bootstra~ <lm>  dangerTRUE     -1.05      0.722    -1.45  1.47e-1
##  7 <rsplit [2~ Bootstra~ <lm>  animalsTRUE     0.457     0.694     0.658 5.11e-1
##  8 <rsplit [2~ Bootstra~ <lm>  use_sexTRUE    -2.20      0.788    -2.80  5.60e-3
##  9 <rsplit [2~ Bootstra~ <lm>  (Intercept)  2013.        0.879  2291.    0.     
## 10 <rsplit [2~ Bootstra~ <lm>  funnyTRUE      -3.30      0.796    -4.15  4.71e-5
## # ... with 7,990 more rows
percentiles <- int_pctl(boot_models, coef)

youtube_intervals <- percentiles %>%
  filter(term != "(Intercept)")  # drop the intercept, keep the seven characteristics

All done!

Explore bootstrap results

We can use visualization to explore these results. The intervals themselves are enough to make a plot such as this one.

youtube_intervals %>%
  mutate(
    term = str_remove(term, "TRUE"),
    term = fct_reorder(term, .estimate)
  ) %>%
  ggplot(aes(.estimate, term)) +
  geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray80") +
  geom_errorbarh(aes(xmin = .lower, xmax = .upper),
    size = 1.5, alpha = 0.5, color = "midnightblue"
  ) +
  geom_point(size = 3, alpha = 0.5, color = "midnightblue") +
  labs(
    x = "Increase in year for each commercial characteristic",
    y = NULL
  )

Since boot_models keeps the individual replicates, we can also look at their distributions.

boot_models %>%
  unnest(coef) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = str_remove(term, "TRUE"),
    term = fct_reorder(term, estimate)
  ) %>%
  ggplot(aes(estimate, fill = term)) +
  geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray50") +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(vars(term))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.