Explore the data

Our modeling goal is to estimate how the characteristics of Super Bowl commercials have changed over time. There aren't a lot of observations in this dataset, and bootstrap resampling is an approach that can be used for robust estimates in such situations. Let's start by reading in the data.
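
This dataset comes from TidyTuesday; a minimal read-in could look like the following (a sketch, assuming the 2021-03-02 TidyTuesday file for this dataset):

library(tidyverse)

# assumes the TidyTuesday 2021-03-02 release of the Super Bowl ads data
youtube <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv")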

skimr::skim(youtube)
Data summary
Name youtube
Number of rows 247
Number of columns 25
_______________________
Column type frequency:
character 10
logical 7
numeric 7
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
brand 0 1.00 3 9 0 10 0
superbowl_ads_dot_com_url 0 1.00 34 120 0 244 0
youtube_url 11 0.96 43 43 0 233 0
id 11 0.96 11 11 0 233 0
kind 16 0.94 13 13 0 1 0
etag 16 0.94 27 27 0 228 0
title 16 0.94 6 99 0 228 0
description 50 0.80 3 3527 0 194 0
thumbnail 129 0.48 48 48 0 118 0
channel_title 16 0.94 3 37 0 185 0

Variable type: logical

skim_variable n_missing complete_rate mean count
funny 0 1 0.69 TRU: 171, FAL: 76
show_product_quickly 0 1 0.68 TRU: 169, FAL: 78
patriotic 0 1 0.17 FAL: 206, TRU: 41
celebrity 0 1 0.29 FAL: 176, TRU: 71
danger 0 1 0.30 FAL: 172, TRU: 75
animals 0 1 0.37 FAL: 155, TRU: 92
use_sex 0 1 0.27 FAL: 181, TRU: 66

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2010.19 5.86 2000 2005 2010 2015.00 2020 ▇▇▇▇▆
view_count 16 0.94 1407556.46 11971111.01 10 6431 41379 170015.50 176373378 ▇▁▁▁▁
like_count 22 0.91 4146.03 23920.40 0 19 130 527.00 275362 ▇▁▁▁▁
dislike_count 22 0.91 833.54 6948.52 0 1 7 24.00 92990 ▇▁▁▁▁
favorite_count 16 0.94 0.00 0.00 0 0 0 0.00 0 ▁▁▇▁▁
comment_count 25 0.90 188.64 986.46 0 1 10 50.75 9190 ▇▁▁▁▁
category_id 16 0.94 19.32 8.00 1 17 23 24.00 29 ▃▁▂▆▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
published_at 16 0.94 2006-02-06 10:02:36 2021-01-27 13:11:29 2013-01-31 09:13:55 227

Let’s make one exploratory plot to see how the characteristics of the commercials change over time.

theme_set(theme_light())

youtube %>%
  select(year, funny:use_sex) %>%
  pivot_longer(funny:use_sex) %>%
  group_by(year, name) %>%
  summarise(prop = mean(value)) %>%
  ungroup() %>%
  ggplot(aes(year, prop, color = name)) +
  geom_line(size = 1.2, show.legend = FALSE) +
  facet_wrap(vars(name)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "% of commercials")
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Fit a simple model

Although those relationships don’t look perfectly linear, we can use a linear model to estimate whether and by how much these characteristics are changing over time.

simple_mod <- lm(year ~ funny + show_product_quickly + patriotic + celebrity + danger + animals + use_sex, 
                 data = youtube)

summary(simple_mod)
## 
## Call:
## lm(formula = year ~ funny + show_product_quickly + patriotic + 
##     celebrity + danger + animals + use_sex, data = youtube)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5254  -4.1023   0.1456   3.9662  10.1727 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              2011.0838     0.9312 2159.748  < 2e-16 ***
## funnyTRUE                  -2.8979     0.8593   -3.372  0.00087 ***
## show_product_quicklyTRUE    0.7706     0.7443    1.035  0.30160    
## patrioticTRUE               2.0455     1.0140    2.017  0.04480 *  
## celebrityTRUE               2.4416     0.7767    3.144  0.00188 ** 
## dangerTRUE                  0.4814     0.7846    0.614  0.54007    
## animalsTRUE                 0.1082     0.7330    0.148  0.88274    
## use_sexTRUE                -2.4041     0.8175   -2.941  0.00359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.391 on 239 degrees of freedom
## Multiple R-squared:  0.178,  Adjusted R-squared:  0.1539 
## F-statistic: 7.393 on 7 and 239 DF,  p-value: 4.824e-08

We get statistical properties from this linear model, but we can use bootstrap resampling to estimate the variance in these quantities. In tidymodels, the rsample package has functions to create resamples, such as bootstraps().

library(rsample)
## Warning: package 'rsample' was built under R version 4.0.3
bootstraps(youtube, times = 1e3)
## # Bootstrap sampling 
## # A tibble: 1,000 x 2
##    splits            id           
##    <list>            <chr>        
##  1 <rsplit [247/91]> Bootstrap0001
##  2 <rsplit [247/85]> Bootstrap0002
##  3 <rsplit [247/89]> Bootstrap0003
##  4 <rsplit [247/96]> Bootstrap0004
##  5 <rsplit [247/92]> Bootstrap0005
##  6 <rsplit [247/88]> Bootstrap0006
##  7 <rsplit [247/93]> Bootstrap0007
##  8 <rsplit [247/86]> Bootstrap0008
##  9 <rsplit [247/91]> Bootstrap0009
## 10 <rsplit [247/92]> Bootstrap0010
## # ... with 990 more rows
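
Each rsplit above holds the whole dataset plus resampling indices: <rsplit [247/91]> means 247 rows in the bootstrap (analysis) sample and, for that resample, 91 out-of-bag rows in the assessment set. A quick way to check this on a fresh resample (the first_split name here is just illustrative):

first_split <- bootstraps(youtube, times = 1e3)$splits[[1]]
nrow(analysis(first_split))    # 247: the bootstrap sample is drawn with replacement
nrow(assessment(first_split))  # the out-of-bag rows that were never drawn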

This lets you carry out flexible bootstrapping or permutation steps. However, that’s a lot of steps to get to confidence intervals, especially if you are using a really simple model! In a recent release of rsample, we added a new function reg_intervals() that finds confidence intervals for models like lm() and glm() (as well as models from the survival package).
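
For reference, a single reg_intervals() call could stand in for the whole pipeline below; this is a sketch assuming its formula interface with the type and keep_reps arguments:

reg_intervals(
  year ~ funny + show_product_quickly + patriotic +
    celebrity + danger + animals + use_sex,
  data = youtube,
  type = "percentile",  # percentile bootstrap intervals, like int_pctl() below
  keep_reps = TRUE      # keep the per-resample estimates as well as the intervals
)

Below, we build up the same percentile intervals step by step instead.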

set.seed(123)

df <- youtube %>% select(year, funny:use_sex)

df_bootstrap <- bootstraps(df, times = 1e3)

lm_function <- function(split) {
  # fit the model on the bootstrap (analysis) set of each resample
  lm(
    year ~ funny + show_product_quickly + patriotic + celebrity +
      danger + animals + use_sex,
    data = analysis(split)
  )
}

boot_models <- df_bootstrap %>%
  mutate(
    model = map(splits, lm_function),
    coef = map(model, broom::tidy)
  )

boot_models %>% unnest(coef)
## # A tibble: 8,000 x 8
##    splits      id        model term         estimate std.error statistic p.value
##    <list>      <chr>     <lis> <chr>           <dbl>     <dbl>     <dbl>   <dbl>
##  1 <rsplit [2~ Bootstra~ <lm>  (Intercept)  2012.        0.900  2236.    0.     
##  2 <rsplit [2~ Bootstra~ <lm>  funnyTRUE      -3.69      0.821    -4.49  1.09e-5
##  3 <rsplit [2~ Bootstra~ <lm>  show_produc~    1.40      0.731     1.91  5.68e-2
##  4 <rsplit [2~ Bootstra~ <lm>  patrioticTR~    0.922     0.936     0.985 3.26e-1
##  5 <rsplit [2~ Bootstra~ <lm>  celebrityTR~    3.32      0.792     4.20  3.83e-5
##  6 <rsplit [2~ Bootstra~ <lm>  dangerTRUE     -1.05      0.722    -1.45  1.47e-1
##  7 <rsplit [2~ Bootstra~ <lm>  animalsTRUE     0.457     0.694     0.658 5.11e-1
##  8 <rsplit [2~ Bootstra~ <lm>  use_sexTRUE    -2.20      0.788    -2.80  5.60e-3
##  9 <rsplit [2~ Bootstra~ <lm>  (Intercept)  2013.        0.879  2291.    0.     
## 10 <rsplit [2~ Bootstra~ <lm>  funnyTRUE      -3.30      0.796    -4.15  4.71e-5
## # ... with 7,990 more rows
percentiles <- int_pctl(boot_models, coef)

youtube_intervals <- percentiles %>%
  filter(term != "(Intercept)")  # drop the intercept, keep the seven characteristics

All done!

Explore bootstrap results

We can use visualization to explore these results. The intervals themselves are enough to make a plot such as this one.

youtube_intervals %>%
  mutate(
    term = str_remove(term, "TRUE"),
    term = fct_reorder(term, .estimate)
  ) %>%
  ggplot(aes(.estimate, term)) +
  geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray80") +
  geom_errorbarh(aes(xmin = .lower, xmax = .upper),
    size = 1.5, alpha = 0.5, color = "midnightblue"
  ) +
  geom_point(size = 3, alpha = 0.5, color = "midnightblue") +
  labs(
    x = "Increase in year for each commercial characteristic",
    y = NULL
  )

Since boot_models keeps the individual replicates, we can also look at their distributions.

boot_models %>%
  unnest(coef) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = str_remove(term, "TRUE"),
    term = fct_reorder(term, estimate)
  ) %>%
  ggplot(aes(estimate, fill = term)) +
  geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray50") +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(vars(term))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.