Data Analytics for PGA Viewership

Factors Influencing Major Championship Viewership

Executive Summary

The high-level goal of this project was to determine whether certain factors influence TV viewership for major championships on the PGA Tour. Prospective advertisers can use this information to select the events most likely to draw high viewership, based on the factors found to be influential. Specifically, this project examined:

What impact does high hole volatility (more birdies/bogeys, fewer pars) have on viewership?
Does a lower margin of victory increase viewership?

Major findings include:

With respect to hole volatility:
- High volatility (more birdies and bogeys) does not correlate positively with high viewership numbers, except for the U.S. Open
- The Masters is the only major that has more birdies than bogeys. The Masters has the highest viewership year-over-year
- The U.S. Open (second highest viewership) has the most bogeys of any major championship
- Narrative-driven storylines have a noticeable impact on viewership, specifically when examining outliers
With respect to margin of victory:
- Major championships concluding with a margin of victory of 2 strokes or less have higher-than-average viewership
- The Masters is unaffected by margin of victory fluctuations

Data

Player and course data was web scraped from ESPN
Tournaments scraped were from 2005 to 2019
TV viewership info was gathered from Nielsen. This data covers the final round of each tournament and encompasses U.S. audiences only
In-depth review of R-code and process at the bottom

The Business Case

Viewership:

Below is a boxplot graph of the TV Viewership numbers from a U.S. audience for each major since 2002

The Masters attracts, on average, nearly double the viewers of the other majors
The U.S. Open has reached The Masters' viewership levels on two occasions, in 2002 and 2008
- No other major has reached The Masters' viewership figures in this time span
The British Open is underrepresented in this data set as the only available TV viewership info was for a U.S. audience. The significant time zone difference hinders viewership numbers
Here is another graph depicting the annual TV viewership numbers of each major and the trend line throughout the years

There are multiple outliers (2008 US Open, 2008 PGA Championship, etc…) that we will explore in-depth below

Hole Volatility:

For the purposes of this project, hole volatility has been defined as a non-par result, primarily looking at the number of birdies and bogeys
The below graph depicts the hole outcomes for the back nine of the final round during each major:
- The back nine is isolated because this is when most people will tune in (the final 2.5 - 4 hours of the competition)
- Additionally, it helps control for the ‘finish’ of the golf tournament
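
As a rough illustration of this definition, the sketch below computes outcome rates for the back nine of the final round. It assumes a tidy data frame with the columns Tournament, round, hole_num and rel_to_par, as in the data frame structure shown at the end of this report. The `holes` data frame and its values are made up for the example and are not the project's data.

```r
# Toy stand-in for the tidy hole-by-hole data (values are invented).
holes <- data.frame(
  Tournament = rep(c("The Masters", "U.S. Open"), each = 6),
  round      = 4,
  hole_num   = rep(c(9, 10, 12, 13, 15, 18), times = 2),
  rel_to_par = c(0, -1, 0, -1, 1, 0,   0, 1, 0, 1, -1, 2)
)

# Keep only the back nine of the final round.
back_nine <- subset(holes, round == 4 & hole_num >= 10)

# Label each hole result; anything other than birdie/par/bogey is "other".
back_nine$outcome <- ifelse(back_nine$rel_to_par == -1, "birdie",
                     ifelse(back_nine$rel_to_par == 0,  "par",
                     ifelse(back_nine$rel_to_par == 1,  "bogey", "other")))

# Share of each outcome within each tournament (each row sums to 1).
outcome_rates <- prop.table(table(back_nine$Tournament, back_nine$outcome), margin = 1)
print(round(outcome_rates, 2))
```

Grouping by Year as well would give the annual rates used in the later sections.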

The Masters is the only major that has more birdies than bogeys
The Masters has substantially more birdies than any other major. The U.S. Open, known for its difficulty, has the fewest birdies
- In comparison, 21.5% of the holes at The Masters result in a birdie on the back nine on Sunday, while the U.S. Open has only 14.1%
- This may not seem like a significant difference, but it means that viewers will see 53% more birdies while watching The Masters than watching the U.S. Open
The British Open and the PGA Championship have a par-outcome over 60% of the time. These are the least viewed major championships by U.S. audiences
Viewers will tune into majors prior to the tournament’s completion, meaning that hole outcomes for that year do not draw viewers in real time. There is either a lag factor or a reliability/predictive factor at work. If a viewer knows a certain tournament has produced consistent results in the past, they will tune in with the knowledge that those results are likely to occur that year as well. Throughout the analysis, consistency in hole outcomes will be examined
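
The 53% figure quoted above is the relative difference between the two birdie rates. A quick check in R, using only the two percentages stated in this section (the variable names are illustrative):

```r
# Birdie rates on the back nine on Sunday, as quoted in the text.
masters_birdie_rate <- 0.215
us_open_birdie_rate <- 0.141

# Relative increase of The Masters over the U.S. Open.
relative_increase <- masters_birdie_rate / us_open_birdie_rate - 1
print(round(100 * relative_increase, 1))  # about 52.5, quoted in the text as 53%
```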

Hole Volatility Overlaid with Viewership:

Here is an examination of each major on an annual basis to determine if hole volatility impacts viewership
To summarize the plot below:
- The bars signify the hole outcomes (restricted to the final round), with each colour representing birdie, par or bogey. This corresponds with the y-axis on the left. The numbers within the bars show the percentage occurrence of that hole outcome
- The black points within the bars represent the number of viewers (in millions) who tuned in that year, which corresponds with the y-axis on the right. The line is the trend for viewership throughout the years

The British Open:

The British Open has a tight standard deviation of pars (SD = 0.02), ranging between 59% and 66% of holes resulting in a par
Nonetheless, it is worth assessing the outliers in greater detail. In 2010, The British Open had the highest proportion of pars. The tournament also had the lowest final round viewership in the data set at 3 million
This does not establish that a high par rate predicts low viewership. In 2017, The British Open also had 66% of holes conclude in a par, yet this tournament had one of the highest viewer counts, at 4.9 million
When viewing the three most watched British Opens (2006, 2009 and 2018), hole volatility does not reveal any trends. Par rates are 59%, 60% and 64% respectively.
- Birdie/bogey rates seem normal as well:
Year	Birdie %	Bogey %
2006	18%	18%
2009	15%	22%
2018	13%	20%
- A common theme for highly watched British Opens does appear to be that there are more bogeys than birdies, but not at a statistically significant rate.
- The high viewership numbers for these tournaments could be attributed to non-quantifiable information. In 2006, Tiger Woods won the tournament, his first major victory since his father passed away
- In 2009, 8-time major winner Tom Watson, then 59 years old, was leading the tournament going into the final round. He ended up losing in a playoff
- In 2018, Tiger Woods had a share of the lead in the final round of a major for the first time in 5 years
- These are all narrative-driven reasons for increased viewership for the British Open

The Masters:

The Masters is exceptionally consistent year-over-year. Every measurable statistic within this graph has little variation. The standard deviation for pars is 0.02, for birdies 0.03 and for bogeys 0.03.
Not much will be gleaned from this data, especially since viewership is extremely consistent as well, never dropping below 10 million.
In 9 of 15 years, The Masters has more birdies than bogeys and in 2 of the other years it is a tie. This is extremely abnormal for a major.

The PGA Championship:

There is frequent variation in the PGA Championship data. The par rate ranges from 50% to 71%, while viewership ranges from 4.0M to 10.1M
There appears to be little correlation between high viewership and high volatility. In 2017, when the par rate was 50%, fewer than 5 million viewers tuned in. This implies viewership is not dependent on hole volatility. In fact, viewership seems slightly positively correlated with a high par rate (cor = 0.37)
From 2012 to 2015 and again in 2017 and 2018, the PGA Championship had high birdie rates (>20%). During these years, there is an equal mixture of low, average and high viewership. This suggests that a high birdie rate does not drive viewership in either direction.
Again, there are narrative-driven reasons for high viewership
- In 2006 (10.1M viewers), Tiger Woods won the tournament
- In 2009 (10.1M viewers), Y.E. Yang beat Tiger Woods on the final day, marking the only time in history that Tiger lost a major entering the final round as the leader
- In 2014 (8.2M viewers), Rory McIlroy won the PGA. He seems to be the only other golfer in the data set who has a similar effect to Tiger on viewership figures, meaning that viewers will tune in to watch McIlroy if he has a chance to win a major championship

The U.S. Open:

The U.S. Open is the only major that shows any correlation between high volatility and high viewership, and it is a weak one. The correlation between viewership and par rate is -0.28. The negative correlation means there are more viewers when there are fewer pars
The major outlier is the 2010 U.S. Open hosted at Pebble Beach. The final day produced a par rate of 40%. In addition, the bogey rate was a whopping 30%. This tournament had above-average viewership. Golf fans seem to enjoy watching players struggle during the U.S. Open, but this trend does not hold for any other major
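
The per-major correlations quoted in these sections can be computed along the following lines. This is a minimal sketch, assuming one row per tournament-year with a par rate and a viewership figure. The `season_summary` data frame, its column names and all of its values are invented for illustration, so the printed numbers will not match the report's figures.

```r
# Hypothetical tournament-year summary (values are made up).
season_summary <- data.frame(
  Tournament = rep(c("U.S. Open", "PGA Championship"), each = 4),
  par_rate   = c(0.40, 0.55, 0.58, 0.60,  0.50, 0.60, 0.65, 0.71),
  viewers_m  = c(7.0, 6.0, 5.5, 5.0,      4.5, 6.0, 7.0, 8.0)
)

# Pearson correlation between par rate and viewership, one value per major.
cor_by_major <- sapply(
  split(season_summary, season_summary$Tournament),
  function(d) cor(d$par_rate, d$viewers_m)
)
print(round(cor_by_major, 2))
```

A negative value means more viewers in years with fewer pars. A positive value means the opposite.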

Conclusion on Hole Volatility:

There is no noticeable correlation between the volatility of a course and the viewership of the tournament, except for the U.S. Open
Despite The Masters having the most birdies and the most viewers, the other tournaments did not reflect a similar relationship
A main driver in viewership among tournaments might be attributed to narrative related reasons
A potential advertiser does not need to concern themselves with a tournament’s hole volatility when deciding which major to advertise in, unless they are electing to advertise during the U.S. Open, in which case they should identify a course that is likely to produce a significant number of bogeys

Margin of Victory:

Does a closer finish lead to more viewership?
Are there certain tournaments that continually produce a close finish?
Here is the margin of victory data for each major dating back to 2005

Most majors are decided by 2 strokes or less, but a significant number fall outside this condition
There are many years where the margin of victory is equal to zero. This simply means the tournament was decided in a playoff
Margin of Victory less than or equal to 2 strokes:
- British Open: 7/15
- The Masters: 10/15
- PGA Championship: 11/15
- U.S. Open: 10/15
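
The margin of victory and the counts above can be derived from final leaderboard totals. The sketch below shows one way to do it, assuming a data frame of final stroke totals per player and event. The `totals` data frame, the helper `margin_of_victory` and all values are hypothetical.

```r
# Hypothetical final totals (strokes) for the top finishers of three events.
totals <- data.frame(
  event  = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  player = c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9"),
  total  = c(275, 276, 280,  270, 270, 274,  268, 273, 275)
)

# Margin of victory: gap between the two lowest totals.
# A tie at the top (decided by a playoff) gives a margin of zero.
margin_of_victory <- function(scores) {
  ordered <- sort(scores)
  ordered[2] - ordered[1]
}

margins <- tapply(totals$total, totals$event, margin_of_victory)
close_finish <- margins <= 2

print(margins)
print(sum(close_finish))  # number of events decided by 2 strokes or less
```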

Margin of Victory Overlaid with Viewership:

In the graph below, viewership is shown on top of the margin of victory data. To more easily identify trends, points have been coloured for when the margin of victory is 2 or less:
The U.S. Open shows a moderate negative correlation between viewership and margin of victory (cor = -0.54)
To a lesser degree, The British Open echoes this relationship (cor = -0.25)
This supports the theory that if the tournament is close coming down the stretch, viewership will increase.
It may seem obvious that a close finish increases viewership, but it is interesting to note that some tournaments (The Masters) appear to be resilient to this effect. For prospective advertisers who want to capitalize on a large market, The Masters might be a good place to look, as it appears to have an inelastic audience
The most viewed U.S. Open in the dataset is from 2008, when Rocco Mediate and Tiger Woods went to a playoff to determine the winner. Tiger ended up beating Rocco while playing on a broken leg. This combines a narrative-driven reason for high viewership with a low margin of victory

Margin of Victory Overlaid with First-Time Major Winner:

Now points have been coloured to identify if the tournament was won by a ‘first-time major winner’:
There is no correlation between being a first-time winner and viewership. It appears that this is not a compelling factor in interest level
The overall correlation between viewership and a first-time winner is 0.07

Conclusion on Margin of Victory:

Tournaments with a lower margin of victory tend to have higher viewership
The Masters shows a resilience to this trend, however, and seemingly has an inelastic audience. It is possible that this is because The Masters has the lowest variance of margin of victory (SD = 1.4)
Viewership is not affected by whether the winner is a first-time major champion
For a potential advertiser, it is difficult to predict with certainty that a major will have a close finish (small margin of victory), as most tournaments are hosted at a different course year-to-year. The Masters is the exception, being played at Augusta National every year, and the statistics show it routinely has a close finish. A prudent advertiser might hedge their bets by selecting The Masters.

Conclusion

The project set out to find the factors that can influence viewership in major tournaments. These factors could then be predicted by potential advertisers or tournament organizers to try and cultivate a tournament that has high viewership
Hole volatility (more birdies/bogeys) had no impact on viewership, except for the U.S. Open, where viewership rose with the number of bogeys
Margin of victory does factor into viewership levels. A lower margin of victory, specifically 2 strokes or less, is associated with higher viewership
- Statistically, The Masters is unaffected by this trend and traditionally, the tournament has a lower margin of victory than other majors
First-time major winners do not impact viewership positively or negatively
Narratives are a noticeable driver of viewership. From an advertiser’s perspective, it would be worthwhile to identify which golfers drive viewership. Through this project, Tiger Woods and Rory McIlroy have been identified as main drivers. Additionally, abnormal performances (Tom Watson, Rocco Mediate, Y.E. Yang) have proven compelling to audiences

Data Gathering & R Code

In this section, I will outline a portion of the R code that was written for web scraping and some of the abnormal error-catching code that was required.
First, I had to scrape the required tournaments from ESPN’s site. Once the list of tournaments was gathered and cleaned to include only majors from 2005 to 2019, I could continue with the main data scraping
I wanted to gather scores for every hole that every golfer played, restricted to those golfers who made the cut. With this information, I could put it into a Tidy format and analyze it in conjunction with viewership data.
The beginning portion of the main for loop runs through each tournament in the list. It gathers player information, their rank, their associated ESPN website URL and stores it for later.
Not all players have ESPN profiles, however, so an error-catcher was required. Here it is below:

  # Positions of players whose ESPN profile page could not be loaded
  store_reference <- integer()
  
  for (i in seq_len(num_made_cut)) {
    # try() captures the error instead of stopping the loop
    error_catcher <- try(
      {
        webpage <- read_html(url_player_info_final[i])
      }
      , silent = TRUE)
    # inherits() is safe when the result has more than one class, as a parsed page does
    if (inherits(error_catcher, "try-error")) {
      store_reference <- c(store_reference, i)
    }
  }
  
  # Remove the players whose pages failed to load from every vector built so far.
  # Also reset num_made_cut to the number of players evaluated in the next for loop.
  if (length(store_reference) > 0) {
    num_made_cut <- num_made_cut - length(store_reference)
    # Work from the last stored position backwards so earlier positions stay valid
    for (j in length(store_reference):1) {
      rank_data <- rank_data[-store_reference[j]]
      player_name_data <- player_name_data[-store_reference[j]]
      url_player_info_final <- url_player_info_final[-store_reference[j]]
      index_num_espn <- index_num_espn[-store_reference[j]]
    }
  }

The first for loop assesses each player’s URL to see if it would return an error (i.e. does the URL exist?)
If there is an error, it stores their location in the list of all players
In the second for loop, I want to remove this player from the data set (I assumed these types of situations would be few and far between and thus would not make much of an impact on overall analysis). The store_reference vector has stored the positions of the players who need to be removed and then removes them from all the lists that have been created thus far in the code.
Now the main data gathering on hole-by-hole scores can be done without these players in the data set.
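
The same idea can also be written with tryCatch and a logical mask, which avoids removing elements one position at a time. This is only a sketch of an alternative, not the code used in the project. `fetch_page` is a hypothetical stand-in for `read_html` so that the example runs without a network connection, and the URLs and player names are invented.

```r
# Hypothetical stand-in for read_html(): fails for any URL containing "missing".
fetch_page <- function(url) {
  if (grepl("missing", url, fixed = TRUE)) stop("HTTP error 404")
  paste("page for", url)
}

player_urls  <- c("espn/player/1", "espn/player/missing", "espn/player/3")
player_names <- c("Player A", "Player B", "Player C")

# Return NULL instead of raising an error when a page cannot be loaded.
pages <- lapply(player_urls, function(u) {
  tryCatch(fetch_page(u), error = function(e) NULL)
})

# TRUE for players whose page loaded; use it to filter every parallel vector.
page_loaded  <- !vapply(pages, is.null, logical(1))
player_urls  <- player_urls[page_loaded]
player_names <- player_names[page_loaded]
pages        <- pages[page_loaded]

print(player_names)
```

Keeping the fetched pages also means they would not need to be downloaded a second time in the scraping loop.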
Here is the nested for loop for acquiring player scores for each major.

  for (i in seq_len(num_made_cut)) {
    # Load the player's page once; all four rounds are read from the same page
    webpage <- read_html(url_player_info_final[i])
    for (j in 1:4) {
      # Named round_num to avoid masking base R's round() function
      round_num <- j
      # Build the selector for this player and this round
      css_selector <- paste0(css_first_input, round_num, css_second_input, index_num_espn[i], css_third_input)
      round_data_html <- html_nodes(webpage, css_selector)
      round_data <- html_text(round_data_html)
      round_data <- as.numeric(round_data)
      # Keep the first 19 values, then drop the 10th so 18 hole scores remain
      round_data <- round_data[1:19]
      round_data <- round_data[-10]
      if (round_num == 1) {
        r1_data <- round_data
        if (i == 1) {
          # Course information is scraped once per tournament, from the first player
          css_par_input <-  ' .colhead .textcenter+ .textcenter'
          css_selector <- paste0(css_first_input, round_num, css_second_input, index_num_espn[i], css_par_input)
          par_order_html <- html_nodes(webpage, css_selector)
          par_order <- html_text(par_order_html)
          par_order <- par_order[1:20]
          par_order <- par_order[c(-10, -11)]
        }
      }
      else if (round_num == 2) {
        r2_data <- round_data
      }
      else if (round_num == 3) {
        r3_data <- round_data
      }
      else if (round_num == 4) {
        r4_data <- round_data
        # Combine the four rounds into one row and append it to the main data
        player_hole_by_hole_data <- c(r1_data, r2_data, r3_data, r4_data)
        full_hole_by_hole_data <- rbind(full_hole_by_hole_data, player_hole_by_hole_data)
      }
    }
  }

The nested for loop runs through each player who has made the cut and then runs through each round, as the location to acquire the hole data on ESPN is split into the different rounds
The css_selector variable tells the code where to scrape the data from on the site. It is dynamic for each player and each round.
The data is initially stored in the round_data variable and then transferred to the r1_data if it is round 1, r2_data if it is round 2, etc…
Once all rounds have been scraped for the player, the data is combined into the player_hole_by_hole_data variable and added to the main data frame full_hole_by_hole_data. This process is done for every player in the tournament, and then every tournament in the list.
The other portion of this nested for loop is the par_order variable. For one player in each tournament, I scrape the course information, meaning the yardage, par and hole info.
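
The two indexing steps applied to round_data in the loop (keep the first 19 values, then drop the 10th) can be illustrated on a toy scorecard row. The layout below is an assumption: it supposes the scraped cells hold holes 1 to 9, a front-nine subtotal, holes 10 to 18, and then further totals. The values in `scraped_row` are invented.

```r
# Assumed layout of one scraped scorecard row: holes 1-9, front-nine subtotal,
# holes 10-18, back-nine subtotal, overall total.
scraped_row <- c("4", "5", "3", "4", "4", "3", "5", "4", "4", "36",
                 "4", "3", "5", "4", "4", "5", "3", "4", "4", "36", "72")

scores <- as.numeric(scraped_row)
scores <- scores[1:19]  # holes 1-9, the subtotal, holes 10-18
scores <- scores[-10]   # drop the subtotal in position 10

print(length(scores))   # 18 hole scores remain
print(sum(scores))      # equals the overall total in the last cell
```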
The last step is to clean the data into the Tidy format. I have used the gather function for this. The original data frame had the holes as columns and each row was a unique golfer. I used the gather function (shown below) to make each row a unique golfer and hole combination.

cleaned_hole_by_hole_df <- gather(full_hole_by_hole_data, "Hole", "Score", 3:74)

This made the data significantly easier to manipulate. The annual hole volatility and margin of victory figures above were both calculated from this format.
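
To make the reshaping step concrete, here is a small sketch of gather on a toy wide table. The `wide_scores` data frame and its column names are invented. In the project the range 3:74 presumably covers the 72 hole columns (4 rounds of 18 holes) that follow the first two columns. tidyr also provides pivot_longer, which does the same job and is the function its documentation now recommends over gather.

```r
library(tidyr)

# Toy wide table: one row per golfer, one column per hole (values are made up).
wide_scores <- data.frame(
  Rank   = c(1, 2),
  Player = c("Player A", "Player B"),
  R1_H1  = c(4, 5),
  R1_H2  = c(3, 4),
  R1_H3  = c(5, 5)
)

# Columns 3 to 5 hold the hole scores; each becomes a (Hole, Score) pair.
long_scores <- gather(wide_scores, "Hole", "Score", 3:5)
print(long_scores)

# Equivalent reshaping with pivot_longer (row order differs).
long_scores_alt <- pivot_longer(wide_scores, cols = 3:5,
                                names_to = "Hole", values_to = "Score")
```

Each row of the result is a unique golfer and hole combination, which is what the grouped summaries in this report rely on.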
The structure of the data frame can be seen here:

## 'data.frame':    310176 obs. of  12 variables:
##  $ Hole      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Rank      : num  1 2 2 2 5 5 5 5 9 9 ...
##  $ Player    : Factor w/ 684 levels "Aaron Wise","Adam Scott",..: 57 17 64 9 26 63 59 20 28 46 ...
##  $ round     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ hole_num  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Par       : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ Yardage   : num  445 445 445 445 445 445 445 445 445 445 ...
##  $ Score     : num  4 4 3 4 5 4 4 4 4 4 ...
##  $ rel_to_par: num  0 0 -1 0 1 0 0 0 0 0 ...
##  $ Tournament: Factor w/ 4 levels "The Masters",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Course    : Factor w/ 39 levels "Augusta National Golf Club - Augusta, GA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year      : num  2019 2019 2019 2019 2019 ...

Lastly, the packages that were used in this code were as follows:

library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

If you are interested in learning more about the work that was done, please contact Eric Mercer at emercer54@gmail.com.