Project Background

Practically, I wanted to set up the basics of a data set for a predictor that links the year, the content categories, and the third-party view counts on YouTube. I wanted to practice Selenium scraping in R, and this seemed like a perfect example since the source data already includes the links. This project made me realize how much more limited RSelenium is compared to Selenium in its officially supported languages.

This stack requires the following pieces (a minimal connection sketch follows the list):

  1. Docker
  2. Jupyter
  3. Selenium Container
  4. RSelenium
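
For reference, here is a minimal sketch of how these pieces might be wired together; the container image, host, port, and browser below are assumptions to adapt to your own setup, not a prescription.

# Minimal connection sketch (assumed setup): start a Selenium container, then
# attach RSelenium to it. Adjust the image, port, and browser to your environment.
#
#   docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-chrome

library(RSelenium)

remDr <- remoteDriver(
  remoteServerAddr = "localhost",  # host running the Selenium container
  port = 4444L,                    # port published by the container
  browserName = "chrome"
)
remDr$open()  # opens a browser session inside the container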

Article Overview

  In the source article, FiveThirtyEight analyzed and categorized Super Bowl commercials. Super Bowl ads are remarkably expensive to run, and a lot of thought goes into targeting specific audiences with each commercial. I wanted to look deeper into the data and determine whether one could predict the YouTube view count after the fact, to see if there was any long-term impact. Practically, I wanted to set up the basics of a data set for a predictor that links the year, the content categories, and the third-party view counts on YouTube.
  I also wanted to practice Selenium scraping in R, and this seemed like a perfect example since the data already includes the links. Through scraping, I realized that not all of the videos were hosted by the original brands, and the view counts seemed artificially low or high, so I decided not to attempt a predictor given the small amount of data and the sheer variance in where the view counts come from.

Sample Data

This is just the head of the raw data from FiveThirtyEight.

url.data <- "https://raw.githubusercontent.com/fivethirtyeight/superbowl-ads/main/superbowl-ads.csv"
raw <- read.csv(url(url.data), header = TRUE)
knitr::kable(head(raw), "simple")
year | brand     | superbowl_ads_dot_com_url | youtube_url | funny | show_product_quickly | patriotic | celebrity | danger | animals | use_sex
2018 | Toyota    | https://superbowl-ads.com/good-odds-toyota/ | https://www.youtube.com/watch?v=zeBZvwYQ-hA | False | False | False | False | False | False | False
2020 | Bud Light | https://superbowl-ads.com/2020-bud-light-seltzer-inside-posts-brain/ | https://www.youtube.com/watch?v=nbbp0VW7z8w | True | True | False | True | True | False | False
2006 | Bud Light | https://superbowl-ads.com/2006-bud-light-bear-attack/ | https://www.youtube.com/watch?v=yk0MQD5YgV8 | True | False | False | False | True | True | False
2018 | Hynudai   | https://superbowl-ads.com/hope-detector-nfl-super-bowl-lii-hyundai/ | https://www.youtube.com/watch?v=lNPccrGk77A | False | True | False | False | False | False | False
2003 | Bud Light | https://superbowl-ads.com/2003-bud-light-hermit-crab/ | https://www.youtube.com/watch?v=ovQYgnXHooY | True | True | False | False | True | True | True
2020 | Toyota    | https://superbowl-ads.com/2020-toyota-go-places-with-cobie-smulders/ | https://www.youtube.com/watch?v=f34Ji70u3nk | True | True | False | True | True | True | False

Here is where we augment the data with the Selenium instance.

# remDr is the RSelenium remote driver connected to the Selenium container (see the setup sketch above)
# Dev_Checks is a debugging flag assumed to be defined earlier (e.g. Dev_Checks <- FALSE)
for (row_number in 1:nrow(raw)) {
    remDr$navigate(as.character(raw$youtube_url[row_number]))
    Sys.sleep(5.0) # Could not find a way to wait for a complete page load, so a fixed wait ensures the page has rendered
    webElem <- tryCatch(
        remDr$findElement(using = "css", "[class='view-count style-scope ytd-video-view-count-renderer']"),
        error = function(e) { print("Could Not Find Video"); NULL })
    view_number_text <- ""
    if (!is.null(webElem)) {
        view_number_text <- tryCatch(
            unlist(webElem$getElementText()),
            error = function(e) { print("Could Not Get View Count Value"); "" })
    } else {
        print("I could not find the target element")
    }
    view_number_text <- gsub("views", "", view_number_text)
    view_number_text <- trimws(gsub(",", "", view_number_text))
    if (Dev_Checks) {
        print(raw[row_number, ]['youtube_url'])
        print(view_number_text)
    }
    raw$Views[row_number] <- view_number_text
}
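
After the loop finishes, the augmented frame is presumably written out so the scrape does not have to be repeated; the file name below matches the cached CSV read back in the next chunk, but treat this as an assumed step rather than the original code.

# Assumed caching step: save the augmented data so later chunks can skip the scrape
write.csv(raw, "Superbowl_adds_count.csv", row.names = FALSE)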

Here is where we remove the superbowl-ads.com URL as well as the YouTube URL, because they are not needed for the predictor.

url.data <- "https://raw.githubusercontent.com/Amantux/Data607_Assignment1/main/Superbowl_adds_count.csv"
raw <- read.csv(url(url.data), header = TRUE)
c_dat <- subset(raw, select = -c(superbowl_ads_dot_com_url, youtube_url))
knitr::kable(head(c_dat), "simple")
year | brand     | funny | show_product_quickly | patriotic | celebrity | danger | animals | use_sex | Views
2018 | Toyota    | False | False | False | False | False | False | False | 185328
2020 | Bud Light | True  | True  | False | True  | True  | False | False | 78717
2006 | Bud Light | True  | False | False | False | True  | True  | False | 142558
2018 | Hynudai   | False | True  | False | False | False | False | False | 240
2003 | Bud Light | True  | True  | False | False | True  | True  | True  | 13860
2020 | Toyota    | True  | True  | False | True  | True  | True  | False | 28043

Let’s graph the view count vs. year to see if there are any interesting trends.

See how the clusters change when you start limiting the number of views.
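
Here is a sketch of how these plots could be produced with ggplot2, assuming c_dat from the chunk above; the 50,000-view cutoff for the second plot is an illustrative threshold taken from the observation below.

library(ggplot2)

c_dat$Views <- as.numeric(c_dat$Views)  # scraped counts arrive as text

# View count vs. year across all ads
ggplot(c_dat, aes(x = year, y = Views)) +
  geom_point() +
  labs(title = "YouTube view count by Super Bowl year", x = "Year", y = "Views")

# Same plot restricted to ads under 50,000 views to expose the lower cluster
ggplot(subset(c_dat, Views < 50000), aes(x = year, y = Views)) +
  geom_point() +
  labs(title = "Ads with fewer than 50,000 views", x = "Year", y = "Views")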

  As you can see, there is a wide range in the data, with a large portion of advertisements having under 50,000 views. I believe this can be attributed to the variability in the source of the view count.

Findings and Recommendations

  Practically, I wanted to develop the data backing a predictor, and I think I could do a better job of augmenting it. This approach relies mainly on scraping YouTube data, and not every ad even has a presence there. It may be better to look at viewership numbers from the Super Bowl broadcast itself. In terms of the origin data set, I would love to see a breakdown by time or by the percentage of each ad, as I think that would be interesting!
  In addition, this data set pulls straight from FiveThirtyEight’s data set, so as time passes, if Fresh_Data and Fresh_Computes are both set to TRUE, with a proper container set up, the whole project will update itself with new data.
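
As a rough illustration of how those flags might gate the pipeline (the flag names come from the text above, but the conditional structure is an assumption, not the original code):

# Illustrative only: refresh switches described in the text above
Fresh_Data     <- FALSE  # TRUE: re-download the FiveThirtyEight CSV
Fresh_Computes <- FALSE  # TRUE: re-run the RSelenium scrape against YouTube

if (Fresh_Data && Fresh_Computes) {
  raw <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/superbowl-ads/main/superbowl-ads.csv")
  # ...re-run the scraping loop shown earlier and overwrite the cached Views CSV...
}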

References

RSelenium: https://cran.r-project.org/web/packages/RSelenium/index.html
Source Data Article: https://projects.fivethirtyeight.com/super-bowl-ads/
Source Data Set: https://raw.githubusercontent.com/fivethirtyeight/superbowl-ads/main/superbowl-ads.csv
rmarkdown: https://rmarkdown.rstudio.com