Notes Mar 12

Harold Nelson

3/11/2021

This is a collection of questions which can be answered using the superbowl commercials dataset.

Setup

Load the data and get the libraries we need.

youtube <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   year = col_double(),
##   funny = col_logical(),
##   show_product_quickly = col_logical(),
##   patriotic = col_logical(),
##   celebrity = col_logical(),
##   danger = col_logical(),
##   animals = col_logical(),
##   use_sex = col_logical(),
##   view_count = col_double(),
##   like_count = col_double(),
##   dislike_count = col_double(),
##   favorite_count = col_double(),
##   comment_count = col_double(),
##   published_at = col_datetime(format = ""),
##   category_id = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
glimpse(youtube)
## Rows: 247
## Columns: 25
## $ year                      <dbl> 2018, 2020, 2006, 2018, 2003, 2020, 2020, 2…
## $ brand                     <chr> "Toyota", "Bud Light", "Bud Light", "Hynuda…
## $ superbowl_ads_dot_com_url <chr> "https://superbowl-ads.com/good-odds-toyota…
## $ youtube_url               <chr> "https://www.youtube.com/watch?v=zeBZvwYQ-h…
## $ funny                     <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,…
## $ show_product_quickly      <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE…
## $ patriotic                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ celebrity                 <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRU…
## $ danger                    <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE…
## $ animals                   <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
## $ use_sex                   <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FA…
## $ id                        <chr> "zeBZvwYQ-hA", "nbbp0VW7z8w", "yk0MQD5YgV8"…
## $ kind                      <chr> "youtube#video", "youtube#video", "youtube#…
## $ etag                      <chr> "rn-ggKNly38Cl0C3CNjNnUH9xUw", "1roDoK-SYqS…
## $ view_count                <dbl> 173929, 47752, 142310, 198, 13741, 23636, 3…
## $ like_count                <dbl> 1233, 485, 129, 2, 20, 115, 1470, 78, 342, …
## $ dislike_count             <dbl> 38, 14, 15, 0, 3, 11, 384, 6, 7, 0, 14, 0, …
## $ favorite_count            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ comment_count             <dbl> NA, 14, 9, 0, 2, 13, 227, 6, 30, 0, 8, 1, 1…
## $ published_at              <dttm> 2018-02-03 11:29:14, 2020-01-31 21:04:13, …
## $ title                     <chr> "Toyota Super Bowl Commercial 2018 Good Odd…
## $ description               <chr> "Toyota Super Bowl Commercial 2018 Good Odd…
## $ thumbnail                 <chr> "https://i.ytimg.com/vi/zeBZvwYQ-hA/sddefau…
## $ channel_title             <chr> "Funny Commercials", "VCU Brandcenter", "Jo…
## $ category_id               <dbl> 1, 27, 17, 22, 24, 1, 24, 2, 24, 24, 24, 24…

Success

For our purposes, we will consider popularity as the criterion of success.

There are a few possible ways to measure popularity from this data.

  1. The variable view_count is obvious but may be influenced by the length of time the ad has been available on youtube.

Is view_count correlated with year?

Answer

youtube %>% 
  ggplot(aes(year,view_count)) +
  geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).

The one outlying observation makes the graph unreadable. To solve this problem use a log scale on the y-axis.

Answer

youtube %>% 
  ggplot(aes(year,view_count)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  scale_y_log10()
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).

There does not seem to be a bias in favor of older ads.

Another possibility is the ratio of likes to the sum of likes and dislikes. Create this variable and see if it is correlatd with view count

Answer

youtube = youtube %>% 
  mutate (fav = like_count/(like_count + dislike_count))

youtube %>% ggplot(aes(fav,view_count)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  scale_y_log10()
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 31 rows containing non-finite values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).

There is a large cluster of ads with perfect fav scores and very low view counts. Lets try to run this but limit the data to require a minimal number of likes and dislikes.

Answer

youtube %>% 
  filter( like_count + dislike_count > 100) %>% 
  ggplot(aes(fav,view_count)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  scale_y_log10()
## `geom_smooth()` using formula 'y ~ x'

There is a clear negative relationship, so the two measures don’t agree. Let’s proceed with view count as our selected measure.

The obvious question is which characteristics lead to a higher count. We can look at each of the following in turn using side-by-side boxplots.

youtube %>% 
  ggplot(aes(x=funny,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for funny")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=patriotic,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for patriotic")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=celebrity,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for celebrity")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

youtube %>% 
  ggplot(aes(x=animals,y=view_count)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Boxplot for animals")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).