Notes Mar 12

Harold Nelson

3/11/2021

This is a collection of questions which can be answered using the superbowl commercials dataset.

Setup

Load the data and get the libraries we need.

youtube <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv')

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   year = col_double(),
##   funny = col_logical(),
##   show_product_quickly = col_logical(),
##   patriotic = col_logical(),
##   celebrity = col_logical(),
##   danger = col_logical(),
##   animals = col_logical(),
##   use_sex = col_logical(),
##   view_count = col_double(),
##   like_count = col_double(),
##   dislike_count = col_double(),
##   favorite_count = col_double(),
##   comment_count = col_double(),
##   published_at = col_datetime(format = ""),
##   category_id = col_double()
## )
## ℹ Use `spec()` for the full column specifications.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

glimpse(youtube)

## Rows: 247
## Columns: 25
## $ year                      <dbl> 2018, 2020, 2006, 2018, 2003, 2020, 2020, 2…
## $ brand                     <chr> "Toyota", "Bud Light", "Bud Light", "Hynuda…
## $ superbowl_ads_dot_com_url <chr> "https://superbowl-ads.com/good-odds-toyota…
## $ youtube_url               <chr> "https://www.youtube.com/watch?v=zeBZvwYQ-h…
## $ funny                     <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,…
## $ show_product_quickly      <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE…
## $ patriotic                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ celebrity                 <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRU…
## $ danger                    <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE…
## $ animals                   <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
## $ use_sex                   <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FA…
## $ id                        <chr> "zeBZvwYQ-hA", "nbbp0VW7z8w", "yk0MQD5YgV8"…
## $ kind                      <chr> "youtube#video", "youtube#video", "youtube#…
## $ etag                      <chr> "rn-ggKNly38Cl0C3CNjNnUH9xUw", "1roDoK-SYqS…
## $ view_count                <dbl> 173929, 47752, 142310, 198, 13741, 23636, 3…
## $ like_count                <dbl> 1233, 485, 129, 2, 20, 115, 1470, 78, 342, …
## $ dislike_count             <dbl> 38, 14, 15, 0, 3, 11, 384, 6, 7, 0, 14, 0, …
## $ favorite_count            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ comment_count             <dbl> NA, 14, 9, 0, 2, 13, 227, 6, 30, 0, 8, 1, 1…
## $ published_at              <dttm> 2018-02-03 11:29:14, 2020-01-31 21:04:13, …
## $ title                     <chr> "Toyota Super Bowl Commercial 2018 Good Odd…
## $ description               <chr> "Toyota Super Bowl Commercial 2018 Good Odd…
## $ thumbnail                 <chr> "https://i.ytimg.com/vi/zeBZvwYQ-hA/sddefau…
## $ channel_title             <chr> "Funny Commercials", "VCU Brandcenter", "Jo…
## $ category_id               <dbl> 1, 27, 17, 22, 24, 1, 24, 2, 24, 24, 24, 24…

Success

For our purposes, we will consider popularity as the criterion of success.

There are a few possible ways to measure popularity from this data.

The variable view_count is obvious but may be influenced by the length of time the ad has been available on youtube.

Is view_count correlated with year?

Answer

youtube %>% 
  ggplot(aes(year,view_count)) +
  geom_point() + 
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 16 rows containing non-finite values (stat_smooth).

## Warning: Removed 16 rows containing missing values (geom_point).

The one outlying observation makes the graph unreadable. To solve this problem use a log scale on the y-axis.

Answer

youtube %>% 
  ggplot(aes(year,view_count)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  scale_y_log10()

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 16 rows containing non-finite values (stat_smooth).

## Warning: Removed 16 rows containing missing values (geom_point).

There does not seem to be a bias in favor of older ads.

Another possibility is the ratio of likes to the sum of likes and dislikes. Create this variable and see if it is correlatd with view count

Answer

youtube = youtube %>% 
  mutate (fav = like_count/(like_count + dislike_count))

youtube %>% ggplot(aes(fav,view_count)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  scale_y_log10()

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 31 rows containing non-finite values (stat_smooth).

## Warning: Removed 31 rows containing missing values (geom_point).

There is a large cluster of ads with perfect fav scores and very low view counts. Lets try to run this but limit the data to require a minimal number of likes and dislikes.

Answer

youtube %>% 
  filter( like_count + dislike_count > 100) %>% 
  ggplot(aes(fav,view_count)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  scale_y_log10()

## `geom_smooth()` using formula 'y ~ x'

There is a clear negative relationship, so the two measures don’t agree. Let’s proceed with view count as our selected measure.

The obvious question is which characteristics lead to a higher count. We can look at each of the following in turn using side-by-side boxplots.

funny
patriotic
celebrity
danger
animals
use_sex

youtube %>% 
  ggplot(aes(x=funny,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for funny")

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=patriotic,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for patriotic")

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=celebrity,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for celebrity")

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=animals,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for animals")

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=danger,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for danger")

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=use_sex,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for use_sex")

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

The only one I see that looks promising is danger. We can put all of these logical variables into a standrd multivariate linear model, where there values will be treated as 0/1.

Answer

char_model = lm(log10(view_count) ~  funny  + patriotic  + celebrity  + danger + animals + use_sex,data = youtube  )
summary(char_model)

## 
## Call:
## lm(formula = log10(view_count) ~ funny + patriotic + celebrity + 
##     danger + animals + use_sex, data = youtube)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4011 -0.6361  0.0435  0.7035  3.7254 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.29191    0.18734  22.910   <2e-16 ***
## funnyTRUE      0.22908    0.19667   1.165    0.245    
## patrioticTRUE  0.22872    0.23119   0.989    0.324    
## celebrityTRUE  0.03781    0.17828   0.212    0.832    
## dangerTRUE     0.27211    0.18109   1.503    0.134    
## animalsTRUE   -0.11957    0.16923  -0.707    0.481    
## use_sexTRUE   -0.19142    0.19074  -1.004    0.317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.208 on 224 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.02414,    Adjusted R-squared:  -0.002002 
## F-statistic: 0.9234 on 6 and 224 DF,  p-value: 0.4788

The results are weak. Normally we look for a t value of 2. Danger is the strongest variable, but doesn’t meet the usual standart. Animals and sex appear to be a turnoff, but insignificant.

Consolidated Type

Combine all of the individual logical variables into a single string.

Answer

youtube = youtube %>% 
  mutate (contype = "",
          contype = ifelse(funny,paste(contype,"funny"),contype),
          contype = ifelse(patriotic,paste(contype,"patriotic"),contype),
          contype = ifelse(celebrity,paste(contype,"celebrity"),contype),
          contype = ifelse(danger,paste(contype,"danger"),contype),
          contype = ifelse(animals,paste(contype,"animals"),contype),
          contype = ifelse(use_sex,paste(contype,"use_sex"),contype),
          contype = ifelse(contype=="","None",contype))

youtube %>% 
  select(contype) %>% 
  head(10)

## # A tibble: 10 x 1
##    contype                          
##    <chr>                            
##  1 "None"                           
##  2 " funny celebrity danger"        
##  3 " funny danger animals"          
##  4 "None"                           
##  5 " funny danger animals use_sex"  
##  6 " funny celebrity danger animals"
##  7 " funny celebrity animals"       
##  8 " celebrity"                     
##  9 " funny celebrity animals"       
## 10 " patriotic celebrity danger"

Most Common contype?

Use group_by() and summarize() to get a count of contype values and display the top 10 in descending order.

Answer

youtube %>% 
  group_by(contype) %>% 
  summarize(count = n() ) %>% 
  ungroup() %>% 
  arrange(desc(count))

## # A tibble: 39 x 2
##    contype                    count
##    <chr>                      <int>
##  1 " funny"                      32
##  2 " funny danger"               19
##  3 " funny danger animals"       16
##  4 "None"                        15
##  5 " celebrity"                  14
##  6 " funny animals"              14
##  7 " funny use_sex"              12
##  8 " patriotic animals"          12
##  9 " funny celebrity"            10
## 10 " funny celebrity use_sex"    10
## # … with 29 more rows

Brands

Create a barplot of ads by brand. Which brands have the most ads?

Answer

youtube %>% 
  ggplot(aes(x = brand)) +
  geom_bar()