library(tidyverse)
library(scales)
## Warning: package 'scales' was built under R version 4.0.4
library(broom)
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.4
## load the data set
youtube <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   year = col_double(),
##   funny = col_logical(),
##   show_product_quickly = col_logical(),
##   patriotic = col_logical(),
##   celebrity = col_logical(),
##   danger = col_logical(),
##   animals = col_logical(),
##   use_sex = col_logical(),
##   view_count = col_double(),
##   like_count = col_double(),
##   dislike_count = col_double(),
##   favorite_count = col_double(),
##   comment_count = col_double(),
##   published_at = col_datetime(format = ""),
##   category_id = col_double()
## )
## i Use `spec()` for the full column specifications.

Drop the test columns.

youtube <- youtube %>% select(-c(superbowl_ads_dot_com_url, youtube_url, id:etag, description:category_id))

Plot for varies variable

Bud Light and Budweiser are the companies showed most number of ads during the Super Bowl.

youtube %>% 
  count(brand, sort = T) %>% 
  mutate(brand = fct_reorder(brand, n),
         count_percentage = prop.table(n)) %>% 
  ggplot(aes(x = n, y = brand, fill = brand)) +
  geom_col() +
  geom_text(aes(label = percent(count_percentage, 0.1)),
            size = 3,
            hjust = - 0.3) +
  xlim(c(0,70)) +
  labs(title = "Total number of ads per company",
       x = "Count",
       y = "") +
  theme(legend.position = "none")

If we see the numbers over time, Bud Light is actually showing less ads per year, while the car manufacturers are showing more, such as Hynudai, Kia and Toyota.

youtube %>% 
  ggplot(aes(x = year, fill = brand)) +
  geom_bar() +
  facet_wrap(~ brand) +
  theme(legend.position = "none") +
  labs(title = "Number of ads by year",
       x = "")

From the dataset, we see the view_count has some large values, so we can show it on the log scale, which makes it more symmetric.

youtube %>% 
  ggplot(aes(view_count)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_log10(labels = scales::comma) +
  labs(title = "Distribution of view counts",
       x = "View Count")
## Warning: Removed 16 rows containing non-finite values (stat_bin).

There are multiple variables with _count in the name, we can group them and do a facet plot.

youtube %>% 
  pivot_longer(contains("_count"), names_to = "metric", values_to = "value") %>% 
  ggplot(aes(x = value)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_log10(labels = scales::comma) +
  labs(title = "Ads view metric", 
         x = "# of views") +
  facet_wrap(~ metric)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 429 rows containing non-finite values (stat_bin).

## drop favorite_count
youtube <- youtube %>% select(-favorite_count)

If we see the distribution for the view_count for each brand, NFL and Doritos has has higher number of views on average for their ads. Car company, Hynudai and Kia as the lowest number. On the other hand we notice Budweiser has lower median number of views per ads, even though they have more ads showing.

youtube %>% 
  filter(!is.na(view_count)) %>% 
  mutate(brand = fct_reorder(brand, view_count)) %>% 
  ggplot(aes(x = view_count, y = brand)) +
  geom_boxplot() +
  scale_x_log10(labels = comma) +
  labs(title = "View number distribution by company",
       x = "View Count")

We do similar plot for the year.

youtube %>% 
  ggplot(aes(x = year, y = view_count, group = year)) +
  geom_boxplot() +
  scale_y_log10(labels = comma) +
  labs(title = "View number distribution by year",
       x = "",
       y = "View Count")
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

To compensate the outlier, we will just plot the median value for the number of views. 2007 and 2008 has a popular showing of the ads, with their median value substantially higher than the other years. Although 2018 has the highest value, it is skewed due to small number of ads shown (small dot size).

youtube %>% 
  filter(!is.na(view_count)) %>% 
  group_by(year) %>% 
  summarise(n = n(),
            median_view = median(view_count)) %>% 
  ggplot(aes(x = year, y = median_view)) +
  geom_point(aes(size = n)) +
  geom_line() +
  scale_y_continuous(labels = comma) +
  labs(y = "Median # of views") +
  theme(legend.position = "none")
## `summarise()` ungrouping output (override with `.groups` argument)

There are varies indicator variables about the type of the ads. We can see the median is close between each type, Danger and Patriotic seems to increase the number of views by a small amount.

## group into a single variable
youtube_logic <- youtube %>% 
  pivot_longer(funny:use_sex, names_to = "type", values_to = "value") %>% 
  mutate(type = str_to_title(str_replace_all(type, "_", " ")))

youtube_logic %>% 
  ggplot(aes(type, view_count, fill = value)) +
  geom_boxplot() +
  scale_y_log10(labels = comma) +
  labs(title = "View count difference by type",
       x = "",
       y = "View Count")
## Warning: Removed 112 rows containing non-finite values (stat_boxplot).

A zoomed in view of the previous plot.

youtube_logic %>% 
  filter(!is.na(view_count)) %>% 
  group_by(type , value) %>% 
  summarise(count = n(),
            median_view = median(view_count)) %>% 
  ggplot(aes(x = type, y = median_view, fill = value)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Median value of view count by type",
       x = "",
       y = "Median View Count")
## `summarise()` regrouping output by 'type' (override with `.groups` argument)

We can see the change of style for the ads shown. Celebrity is gaining strongly, showing 100% in the latest year. Patriotic gained until 2017, but it has declined since then. Funny become much less popular, while ads type Use Sex has disappeared.

youtube_logic %>% 
  group_by(type, year) %>% 
  summarise(percentage = mean(value),
            n = n()) %>% 
  ggplot(aes(x = year, y = percentage, color = type)) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  labs(title = "Popularity of ads type by year",
       x = "",
       y = "Number of ads included type") +
  facet_wrap(~ type) +
  theme(legend.position = "none")
## `summarise()` regrouping output by 'type' (override with `.groups` argument)

Running a logistic model for each type, from the p-value, we can see it concur with the above observations. There is no significant change for the type of Show Product Quickly, Animals and Danger.

youtube_logic %>% 
  group_by(type) %>% 
  summarise(model = list(glm(value ~ year,
                             family = "binomial"))) %>% 
  mutate(tidied = map(model, tidy)) %>% 
  unnest(tidied) %>% 
  filter(term != "(Intercept)") %>% 
  arrange(desc(estimate))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 7
##   type                 model  term   estimate std.error statistic    p.value
##   <chr>                <list> <chr>     <dbl>     <dbl>     <dbl>      <dbl>
## 1 Patriotic            <glm>  year   0.119       0.0331    3.61   0.000309  
## 2 Celebrity            <glm>  year   0.0643      0.0249    2.58   0.00992   
## 3 Show Product Quickly <glm>  year   0.00923     0.0234    0.394  0.694     
## 4 Animals              <glm>  year   0.000756    0.0225    0.0336 0.973     
## 5 Danger               <glm>  year  -0.0248      0.0238   -1.05   0.296     
## 6 Use Sex              <glm>  year  -0.0971      0.0264   -3.68   0.000232  
## 7 Funny                <glm>  year  -0.129       0.0269   -4.80   0.00000158
## variable for future use
youtube_brand_type <- youtube_logic %>% 
  group_by(brand, type) %>% 
  summarise(percentage = mean(value)) %>% 
  ungroup()
## `summarise()` regrouping output by 'brand' (override with `.groups` argument)

Separate into each ads type, we can see how different company aim for different feel of their brand.

youtube_brand_type %>% 
  mutate(brand = reorder_within(brand, percentage, type)) %>% 
  ggplot(aes(x = percentage, y = brand)) +
  geom_col() +
    scale_x_continuous(labels = percent) +
    scale_y_reordered() +
  facet_wrap(~ type, scales = "free_y") +
  labs(x = "Usage",
       y = "",
       title = "Brand behaviour in different type")

We can also group by companies, which is easier to compare between similar brand, such as Coco Cola vs Pepsi, Bud Light vs Budweiser, car companies etc.

youtube_brand_type %>% 
  mutate(type = reorder_within(type, percentage, brand)) %>% 
  ggplot(aes(x = percentage, y = type)) +
  geom_col() +
    scale_x_continuous(labels = percent) +
    scale_y_reordered() +
  facet_wrap(~ brand, scales = "free_y") +
  labs(x = "Usage",
       title = "Brand characteristic")

Heatmap (can be dropped)

youtube_brand_type %>% 
  ggplot(aes(x = type, y = brand, fill = percentage)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", midpoint = 0.5)

After filter some ads without like/dislike count, we see Bud Light ads (2019) is the only ads has dislikes than likes, though it has a small view count (36683). Other notable ads are Coca-Cola (2012), Doritos (2002, 2012), Budweiser (2017), these are the ads has more than one million views and a dislike ratio of greater than 20%, so they have more split audience opinion.

youtube %>% 
  filter(like_count + dislike_count > 100) %>% 
  mutate(dislike_ratio = dislike_count/(like_count + dislike_count)) %>% 
  ggplot(aes(x = view_count, y = dislike_ratio)) +
  geom_point(aes(colour = brand)) +
  geom_text(aes(label = paste0(brand, " ", year)),
            check_overlap = TRUE,
            hjust = 1.1,
            hjust = 1.1,
            size = 3) +
  scale_x_log10(labels = comma) +
  scale_y_continuous(labels = percent) +
  labs(title = "Like/Dislike trend with view count",
       x = "View Count",
       y = "Dislike Ratio")
## Warning: Duplicated aesthetics after name standardisation: hjust

We see for the more popular ads (more than one million views), generally they are well liked. Every Super Bowl ads has celebrity in them not, probably also explains the decline in use sexy theme in them (also change of social norm).