Practically, I wanted to set up the basics of a data set for a predictor that links the year, the content, and the third-party views on YouTube. I also wanted to practice Selenium scraping in R, and this seemed a perfect example because the data already includes links. This project made me realize how much more limited RSelenium is than the Selenium bindings for the officially supported languages.
This stack requires the RSelenium and knitr packages, plus rmarkdown to render the document.
This is just the head of the raw data from FiveThirtyEight.
url.data <- "https://raw.githubusercontent.com/fivethirtyeight/superbowl-ads/main/superbowl-ads.csv"
raw <- read.csv(url(url.data), header = TRUE)
knitr::kable(head(raw), "simple")
Here is where we augment the data with view counts scraped through the Selenium instance.
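# Assumes remDr (an open RSelenium remote driver) and Dev_Checks (a logical debug flag) are already defined
raw$Views <- NA_character_ # one slot per ad, filled in by the loop
for (row_number in seq_len(nrow(raw))) {
  remDr$navigate(as.character(raw$youtube_url[row_number]))
  Sys.sleep(5.0) # Could not find a way to wait for a complete page load, so a fixed wait gives the page time to load
  # Return NULL on failure; print() would return its message and defeat the is.null() check below
  webElem <- tryCatch({remDr$findElement(using = "css selector", "[class='view-count style-scope ytd-video-view-count-renderer']")},
    error = function(e) { message("Could not find video"); NULL })
  view_number_text <- NA_character_
  if (!is.null(webElem)) {
    # getElementText() returns a list, so take its first element
    view_number_text <- tryCatch({webElem$getElementText()[[1]]},
      error = function(e) { message("Could not get view count value"); NA_character_ })
  } else {
    print("I could not find target")
  }
  # Strip the "views" label, thousands separators, and surrounding whitespace
  view_number_text <- gsub("views", "", view_number_text)
  view_number_text <- trimws(gsub(",", "", view_number_text))
  if (Dev_Checks) {
    print(raw$youtube_url[row_number])
    print(view_number_text)
  }
  raw$Views[row_number] <- view_number_text
}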
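The fixed five-second sleep above could be replaced by polling: retry the lookup at short intervals until it succeeds or a deadline passes. The following is a minimal sketch of that idea, not part of RSelenium. `wait_for`, `fake_lookup`, and `state` are hypothetical names, and the fake lookup stands in for a real call to `remDr$findElement()` so that the example runs without a browser.

```r
# Poll `attempt` until it returns a non-NULL value or `timeout` seconds pass.
# `wait_for` is a hypothetical helper, not an RSelenium function.
wait_for <- function(attempt, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat any error (for example, element not yet present) as "not ready"
    result <- tryCatch(attempt(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (Sys.time() >= deadline) return(NULL)
    Sys.sleep(interval)
  }
}

# Stand-in for a page whose element only appears on the third check
state <- new.env()
state$calls <- 0
fake_lookup <- function() {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("element not found")
  "185,328 views"
}

found <- wait_for(fake_lookup, timeout = 5, interval = 0.1)
print(found)
```

In the scraping loop, the same helper could presumably wrap the element lookup, for example `wait_for(function() remDr$findElement(using = "css selector", "..."))`, so that fast pages do not pay the full wait and slow pages get more time.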
Here we remove the superbowl-ads.com URL and the YouTube URL columns, because the predictor does not need them. The data is read back from a saved CSV that already includes the scraped view counts.
url.data <- "https://raw.githubusercontent.com/Amantux/Data607_Assignment1/main/Superbowl_adds_count.csv"
raw <- read.csv(url(url.data), header = TRUE)
c_dat <- subset(raw, select = -c(superbowl_ads_dot_com_url, youtube_url))
knitr::kable(head(c_dat), "simple")
| year | brand | funny | show_product_quickly | patriotic | celebrity | danger | animals | use_sex | Views |
|---|---|---|---|---|---|---|---|---|---|
| 2018 | Toyota | False | False | False | False | False | False | False | 185328 |
| 2020 | Bud Light | True | True | False | True | True | False | False | 78717 |
| 2006 | Bud Light | True | False | False | False | True | True | False | 142558 |
| 2018 | Hynudai | False | True | False | False | False | False | False | 240 |
| 2003 | Bud Light | True | True | False | False | True | True | True | 13860 |
| 2020 | Toyota | True | True | False | True | True | True | False | 28043 |
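A predictor needs Views as a number, while the scrape produces text such as "185,328 views". The sketch below shows one way to convert it, assuming YouTube reports the full count rather than an abbreviated form such as "1.2M views", which this approach would misread. `parse_views` is a hypothetical helper name, and the sample strings are made up for illustration.

```r
# `parse_views` is a hypothetical helper; it assumes counts look like "185,328 views"
parse_views <- function(x) {
  digits <- gsub("[^0-9]", "", x)              # drop commas, spaces, and the word "views"
  digits[!is.na(digits) & digits == ""] <- NA  # nothing numeric found, so treat as missing
  as.numeric(digits)
}

# Illustrative inputs: two valid counts, one failed scrape message, one missing value
views_text <- c("185,328 views", "240 views", "Could Not Find Video", NA)
parsed <- parse_views(views_text)
print(parsed)
```

Mapping failed scrapes to `NA` rather than to an empty string or 0 keeps them from being counted as real zero-view ads when the predictor is fitted.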
RSelenium: https://cran.r-project.org/web/packages/RSelenium/index.html
Source data article: https://projects.fivethirtyeight.com/super-bowl-ads/
Source data set: https://raw.githubusercontent.com/fivethirtyeight/superbowl-ads/main/superbowl-ads.csv
rmarkdown: https://rmarkdown.rstudio.com