The high-level goal of this project was to determine whether certain factors influence TV viewership for major championships on the PGA Tour. Prospective advertisers could use this information to select the events most likely to draw high viewership, based on the factors found to be influential. Specifically, this project wanted to examine:
Major findings include:
The British Open is underrepresented in this data set, as the only available TV viewership information was for a U.S. audience. The significant time zone difference also hinders U.S. viewership numbers.
Here is another graph depicting the annual TV viewership numbers of each major and the trend line throughout the years.
The British Open:
Year | Birdie % | Bogey % |
---|---|---|
2006 | 18% | 18% |
2009 | 15% | 22% |
2018 | 13% | 20% |
The Masters:
The PGA Championship:
The U.S. Open:
Conclusion on Hole Volatility:
Are there certain tournaments that continually produce a close finish?
Here is the margin of victory data for each major dating back to 2005.
In the graph below, viewership is shown on top of the margin of victory data. To more easily identify trends, points have been coloured for when the margin of victory is 2 or less:
The most viewed U.S. Open in the dataset is from 2008, when Rocco Mediate and Tiger Woods went to a playoff to determine the winner. Tiger ended up besting Rocco while playing on a broken leg. The high viewership here reflects both a compelling narrative and a low margin of victory.
Now points have been coloured to identify if the tournament was won by a ‘first-time major winner’:
The overall correlation between viewership and a first-time winner is 0.07.
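A correlation between a numeric variable and a yes/no flag, like the one quoted above, can be computed by coding the flag as 0/1 and taking Pearson's correlation (the point-biserial correlation). The sketch below is only an illustration of that calculation: the data frame `majors`, its column names, and every value in it are made-up placeholders, not the project's data.

```r
# Hypothetical placeholder data, NOT the project's viewership numbers
majors <- data.frame(
  viewers_millions  = c(5.1, 7.4, 4.9, 8.2, 6.0, 5.5),
  first_time_winner = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Code the logical flag as 0/1; Pearson's r against a 0/1 variable
# is the point-biserial correlation
r_first_time <- cor(majors$viewers_millions,
                    as.numeric(majors$first_time_winner))

# Values near 0 indicate almost no linear relationship
print(round(r_first_time, 2))
```

With the real data, a value as close to zero as 0.07 suggests that a first-time winner has almost no linear association with viewership.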
Conclusion on Margin of Victory:
In this section, I will outline a portion of the R code that was written for web scraping and some of the abnormal error-catching code that was required.
The outer `for` loop runs through each tournament in the list. It gathers each player's name, rank, and associated ESPN URL and stores them for later. Not all players have ESPN profiles, however, so an error-catcher was required. Here it is below:
```r
# First pass: record the position of every player whose ESPN page cannot be read
store_reference <- numeric()
for (i in seq_len(num_made_cut)) {
  error_catcher <- try(
    {
      webpage <- read_html(url_player_info_final[i])
    },
    silent = TRUE
  )
  # inherits() is safer than class(x) == "try-error", which misbehaves
  # when class() returns more than one value
  if (inherits(error_catcher, "try-error")) {
    store_reference <- c(store_reference, i)
  }
}

# Second pass: remove the flagged players from every vector built so far.
# num_made_cut is also updated to reflect the number of players
# evaluated in the next for loop.
if (length(store_reference) > 0) {
  num_made_cut <- num_made_cut - length(store_reference)
  # Go from the last flagged position to the first so that earlier
  # removals do not shift the remaining indices
  for (j in length(store_reference):1) {
    rank_data <- rank_data[-store_reference[j]]
    player_name_data <- player_name_data[-store_reference[j]]
    url_player_info_final <- url_player_info_final[-store_reference[j]]
    index_num_espn <- index_num_espn[-store_reference[j]]
  }
}
```
The first `for` loop assesses each player's URL to see whether it returns an error (i.e. does the URL exist?). If a URL returns an error in this `for` loop, I want to remove that player from the data set (I assumed these situations would be few and far between and thus would not have much impact on the overall analysis). The `store_reference` vector stores the positions of the players who need to be removed, and the second block then removes them from all the vectors that have been created thus far in the code.

Now the main data gathering on hole-by-hole scores can be done without these players in the data set.
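The same check can also be written with `tryCatch()`, which avoids inspecting the class of the result. This is a minimal sketch rather than the code used in the project: the helper `find_failed_urls()` and its `fetch` argument are hypothetical names introduced here, the URLs and player names are placeholders, and the stub fetcher stands in for `read_html()` so the example runs without network access.

```r
# Return the positions of URLs whose fetch raises an error.
# `fetch` is any function that takes a URL; in the scraper it would be
# rvest::read_html (an assumption for this sketch).
find_failed_urls <- function(urls, fetch) {
  failed <- vapply(urls, function(u) {
    tryCatch({
      fetch(u)
      FALSE                       # fetch succeeded
    }, error = function(e) TRUE)  # fetch failed
  }, logical(1), USE.NAMES = FALSE)
  which(failed)
}

# Stub fetcher for illustration: errors on any URL containing "missing"
stub_fetch <- function(u) {
  if (grepl("missing", u)) stop("HTTP error 404")
  u
}

urls <- c("https://example.com/player/1",
          "https://example.com/player/missing",
          "https://example.com/player/3")
player_names <- c("Player A", "Player B", "Player C")

bad <- find_failed_urls(urls, stub_fetch)

# Drop the failed positions from every parallel vector in one step.
# The length check matters: x[-integer(0)] would return an empty vector.
if (length(bad) > 0) {
  urls <- urls[-bad]
  player_names <- player_names[-bad]
}
```

Because negative indexing accepts a whole vector of positions, this version does not need to loop over the flagged positions in reverse order.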
Here is the nested `for` loop for acquiring player scores for each major.
```r
# full_hole_by_hole_data is assumed to be initialised earlier in the script
for (i in seq_len(num_made_cut)) {
  # The player's URL does not depend on the round, so read the page once per player
  webpage <- read_html(url_player_info_final[i])
  for (j in 1:4) {
    round <- j
    # Build the CSS selector for this player and this round
    css_selector <- paste0(css_first_input, round, css_second_input, index_num_espn[i], css_third_input)
    round_data_html <- html_nodes(webpage, css_selector)
    round_data <- as.numeric(html_text(round_data_html))
    # Keep the first 19 values and drop the 10th, leaving 18 hole scores
    round_data <- round_data[1:19]
    round_data <- round_data[-10]
    if (round == 1) {
      r1_data <- round_data
      # Course information is scraped once per tournament, from the first player
      if (i == 1) {
        css_par_input <- ' .colhead .textcenter+ .textcenter'
        css_selector <- paste0(css_first_input, round, css_second_input, index_num_espn[i], css_par_input)
        par_order_html <- html_nodes(webpage, css_selector)
        par_order <- html_text(par_order_html)
        par_order <- par_order[1:20]
        par_order <- par_order[c(-10, -11)]
      }
    } else if (round == 2) {
      r2_data <- round_data
    } else if (round == 3) {
      r3_data <- round_data
    } else if (round == 4) {
      r4_data <- round_data
      # After the final round, combine all four rounds into one row for this player
      player_hole_by_hole_data <- c(r1_data, r2_data, r3_data, r4_data)
      full_hole_by_hole_data <- rbind(full_hole_by_hole_data, player_hole_by_hole_data)
    }
  }
}
```
This nested `for` loop runs through each player who has made the cut and then through each round, as the hole data on ESPN is split by round.

The `css_selector` variable tells the code where to scrape the data from on the site. It is dynamic for each player and each round.

The scraped scores are stored in the `round_data` variable and then transferred to `r1_data` if it is round 1, `r2_data` if it is round 2, etc.

After round 4, the four rounds are combined into the `player_hole_by_hole_data` variable and added to the main data frame `full_hole_by_hole_data`. This process is done for every player in the tournament, and then every tournament in the list.

The other portion of this nested `for` loop is the `par_order` variable. For one player in each tournament, I scrape the course information, meaning the yardage, par and hole info.
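The trimming and stacking steps can be illustrated without any scraping. The sketch below assumes, based on the indexing in the loop above, that each round's scraped row holds holes 1 to 9, a front-nine subtotal in position 10, and holes 10 to 18 after it. The helper name `trim_round()`, the player names, and the sample scores are hypothetical. It also collects each player's row in a list and binds once at the end, which is an alternative to growing the data frame with `rbind()` inside the loop, not the approach used in the project.

```r
# Keep the 18 hole scores from one scraped round: take the first 19 values
# and drop position 10 (assumed here to be the front-nine subtotal).
trim_round <- function(raw_scores) {
  scores <- as.numeric(raw_scores)[1:19]
  scores[-10]
}

# Made-up raw row: nine 4s, the subtotal 36, then nine more 4s
raw_round <- c(rep("4", 9), "36", rep("4", 9))

# Two hypothetical players, each given the same raw row for all four rounds
players <- c("Player A", "Player B")
rows <- vector("list", length(players))
for (i in seq_along(players)) {
  rounds <- lapply(1:4, function(j) trim_round(raw_round))
  rows[[i]] <- unlist(rounds)   # 4 rounds x 18 holes = 72 scores
}

# Bind all player rows in one call instead of calling rbind() on every iteration
hole_matrix <- do.call(rbind, rows)
rownames(hole_matrix) <- players
dim(hole_matrix)
```

The 72 score columns per player correspond to the 72 columns (`3:74`) that are reshaped in the cleaning step.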
The last step is to clean the data into tidy format. The original data frame had the holes as columns, and each row was a unique golfer. I used the `gather` function (shown below) to make each row a unique golfer and hole combination.
```r
cleaned_hole_by_hole_df <- gather(full_hole_by_hole_data, "Hole", "Score", 3:74)
```

```
## 'data.frame': 310176 obs. of 12 variables:
##  $ Hole      : num 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rank      : num 1 2 2 2 5 5 5 5 9 9 ...
##  $ Player    : Factor w/ 684 levels "Aaron Wise","Adam Scott",..: 57 17 64 9 26 63 59 20 28 46 ...
##  $ round     : num 1 1 1 1 1 1 1 1 1 1 ...
##  $ hole_num  : num 1 1 1 1 1 1 1 1 1 1 ...
##  $ Par       : num 4 4 4 4 4 4 4 4 4 4 ...
##  $ Yardage   : num 445 445 445 445 445 445 445 445 445 445 ...
##  $ Score     : num 4 4 3 4 5 4 4 4 4 4 ...
##  $ rel_to_par: num 0 0 -1 0 1 0 0 0 0 0 ...
##  $ Tournament: Factor w/ 4 levels "The Masters",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Course    : Factor w/ 39 levels "Augusta National Golf Club - Augusta, GA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year      : num 2019 2019 2019 2019 2019 ...
```
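In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. The sketch below shows the equivalent reshape on a small made-up data frame. The column names `h1` to `h3` and all values are assumptions for illustration, since the layout of the original wide data frame is not shown here.

```r
library(tidyr)

# Made-up wide data: one row per golfer, one column per hole
wide_scores <- data.frame(
  Rank   = c(1, 2),
  Player = c("Player A", "Player B"),
  h1 = c(4, 5),
  h2 = c(3, 3),
  h3 = c(5, 4)
)

# Reshape to one row per golfer-and-hole combination
long_scores <- pivot_longer(
  wide_scores,
  cols      = h1:h3,     # the hole columns (3:74 in the original call)
  names_to  = "Hole",
  values_to = "Score"
)

nrow(long_scores)  # 2 golfers x 3 holes
```

Naming the hole columns explicitly (or with a helper such as `starts_with()`) is less fragile than positional indexes like `3:74` if columns are later added or reordered.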
```r
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
```
If you are interested in learning more about the work that was done, please contact Eric Mercer at emercer54@gmail.com.