Harold Nelson
3/11/2021
This is a collection of questions which can be answered using the superbowl commercials dataset.
Load the data and get the libraries we need.
youtube <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## year = col_double(),
## funny = col_logical(),
## show_product_quickly = col_logical(),
## patriotic = col_logical(),
## celebrity = col_logical(),
## danger = col_logical(),
## animals = col_logical(),
## use_sex = col_logical(),
## view_count = col_double(),
## like_count = col_double(),
## dislike_count = col_double(),
## favorite_count = col_double(),
## comment_count = col_double(),
## published_at = col_datetime(format = ""),
## category_id = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Rows: 247
## Columns: 25
## $ year <dbl> 2018, 2020, 2006, 2018, 2003, 2020, 2020, 2…
## $ brand <chr> "Toyota", "Bud Light", "Bud Light", "Hynuda…
## $ superbowl_ads_dot_com_url <chr> "https://superbowl-ads.com/good-odds-toyota…
## $ youtube_url <chr> "https://www.youtube.com/watch?v=zeBZvwYQ-h…
## $ funny <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,…
## $ show_product_quickly <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE…
## $ patriotic <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ celebrity <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRU…
## $ danger <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE…
## $ animals <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
## $ use_sex <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FA…
## $ id <chr> "zeBZvwYQ-hA", "nbbp0VW7z8w", "yk0MQD5YgV8"…
## $ kind <chr> "youtube#video", "youtube#video", "youtube#…
## $ etag <chr> "rn-ggKNly38Cl0C3CNjNnUH9xUw", "1roDoK-SYqS…
## $ view_count <dbl> 173929, 47752, 142310, 198, 13741, 23636, 3…
## $ like_count <dbl> 1233, 485, 129, 2, 20, 115, 1470, 78, 342, …
## $ dislike_count <dbl> 38, 14, 15, 0, 3, 11, 384, 6, 7, 0, 14, 0, …
## $ favorite_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ comment_count <dbl> NA, 14, 9, 0, 2, 13, 227, 6, 30, 0, 8, 1, 1…
## $ published_at <dttm> 2018-02-03 11:29:14, 2020-01-31 21:04:13, …
## $ title <chr> "Toyota Super Bowl Commercial 2018 Good Odd…
## $ description <chr> "Toyota Super Bowl Commercial 2018 Good Odd…
## $ thumbnail <chr> "https://i.ytimg.com/vi/zeBZvwYQ-hA/sddefau…
## $ channel_title <chr> "Funny Commercials", "VCU Brandcenter", "Jo…
## $ category_id <dbl> 1, 27, 17, 22, 24, 1, 24, 2, 24, 24, 24, 24…
For our purposes, we will consider popularity as the criterion of success.
There are a few possible ways to measure popularity from this data.
Is view_count correlated with year?
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
The one outlying observation makes the graph unreadable. To solve this problem use a log scale on the y-axis.
youtube %>%
ggplot(aes(year,view_count)) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_log10()
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
There does not seem to be a bias in favor of older ads.
Another possibility is the ratio of likes to the sum of likes and dislikes. Create this variable and see if it is correlatd with view count
youtube = youtube %>%
mutate (fav = like_count/(like_count + dislike_count))
youtube %>% ggplot(aes(fav,view_count)) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_log10()
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 31 rows containing non-finite values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).
There is a large cluster of ads with perfect fav scores and very low view counts. Lets try to run this but limit the data to require a minimal number of likes and dislikes.
youtube %>%
filter( like_count + dislike_count > 100) %>%
ggplot(aes(fav,view_count)) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_log10()
## `geom_smooth()` using formula 'y ~ x'
There is a clear negative relationship, so the two measures don’t agree. Let’s proceed with view count as our selected measure.
The obvious question is which characteristics lead to a higher count. We can look at each of the following in turn using side-by-side boxplots.
youtube %>%
ggplot(aes(x=funny,y=view_count)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("Boxplot for funny")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
youtube %>%
ggplot(aes(x=patriotic,y=view_count)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("Boxplot for patriotic")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
youtube %>%
ggplot(aes(x=celebrity,y=view_count)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("Boxplot for celebrity")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
youtube %>%
ggplot(aes(x=animals,y=view_count)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("Boxplot for animals")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
youtube %>%
ggplot(aes(x=danger,y=view_count)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("Boxplot for danger")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
youtube %>%
ggplot(aes(x=use_sex,y=view_count)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("Boxplot for use_sex")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
The only one I see that looks promising is danger. We can put all of these logical variables into a standrd multivariate linear model, where there values will be treated as 0/1.
char_model = lm(log10(view_count) ~ funny + patriotic + celebrity + danger + animals + use_sex,data = youtube )
summary(char_model)
##
## Call:
## lm(formula = log10(view_count) ~ funny + patriotic + celebrity +
## danger + animals + use_sex, data = youtube)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4011 -0.6361 0.0435 0.7035 3.7254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.29191 0.18734 22.910 <2e-16 ***
## funnyTRUE 0.22908 0.19667 1.165 0.245
## patrioticTRUE 0.22872 0.23119 0.989 0.324
## celebrityTRUE 0.03781 0.17828 0.212 0.832
## dangerTRUE 0.27211 0.18109 1.503 0.134
## animalsTRUE -0.11957 0.16923 -0.707 0.481
## use_sexTRUE -0.19142 0.19074 -1.004 0.317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.208 on 224 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.02414, Adjusted R-squared: -0.002002
## F-statistic: 0.9234 on 6 and 224 DF, p-value: 0.4788
The results are weak. Normally we look for a t value of 2. Danger is the strongest variable, but doesn’t meet the usual standart. Animals and sex appear to be a turnoff, but insignificant.
Combine all of the individual logical variables into a single string.
youtube = youtube %>%
mutate (contype = "",
contype = ifelse(funny,paste(contype,"funny"),contype),
contype = ifelse(patriotic,paste(contype,"patriotic"),contype),
contype = ifelse(celebrity,paste(contype,"celebrity"),contype),
contype = ifelse(danger,paste(contype,"danger"),contype),
contype = ifelse(animals,paste(contype,"animals"),contype),
contype = ifelse(use_sex,paste(contype,"use_sex"),contype),
contype = ifelse(contype=="","None",contype))
youtube %>%
select(contype) %>%
head(10)
## # A tibble: 10 x 1
## contype
## <chr>
## 1 "None"
## 2 " funny celebrity danger"
## 3 " funny danger animals"
## 4 "None"
## 5 " funny danger animals use_sex"
## 6 " funny celebrity danger animals"
## 7 " funny celebrity animals"
## 8 " celebrity"
## 9 " funny celebrity animals"
## 10 " patriotic celebrity danger"
Use group_by() and summarize() to get a count of contype values and display the top 10 in descending order.
## # A tibble: 39 x 2
## contype count
## <chr> <int>
## 1 " funny" 32
## 2 " funny danger" 19
## 3 " funny danger animals" 16
## 4 "None" 15
## 5 " celebrity" 14
## 6 " funny animals" 14
## 7 " funny use_sex" 12
## 8 " patriotic animals" 12
## 9 " funny celebrity" 10
## 10 " funny celebrity use_sex" 10
## # … with 29 more rows
Create a barplot of ads by brand. Which brands have the most ads?