Business Problem:

What metric is the best predictor of success on the modern PGA Tour? If I were a PGA Tour professional, what area of my game should I focus on in order to make the most money possible, while shooting the lowest scores possible?

Metrics:

The success metrics are defined as:

Earnings Per Event ($USD), a measure of how much money a player made over the timeline of the analysis per tournament, calculated by dividing the sum of their earnings by the number of events they played (the higher, the better). Missed cuts and withdrawals, both medical and non-medical, automatically translate to $0 of earnings for the event.

Scoring Average, a measure of how many strokes it takes a player to complete a round, calculated by the total number of strokes a player makes divided by the total number of rounds they have played (the lower, the better).

The input metrics are:

Driving Accuracy (%), a measure of how accurate a player’s tee shots are, calculated by dividing the total number of times a player’s tee shot finished in the fairway by the total number of tee shots they hit (the higher, the better);

Driving Distance (yards), a measure of how far a player’s average tee shot travels (the higher, the better);

Greens in Regulation (%), a measure of how accurate a player’s approach shots are, calculated by dividing the total number of times a player’s approach shot finishes on the putting green by the total number of approach shots they hit (the higher, the better);

Putts Per Round, a measure of how many strokes a player takes to put the ball in the hole once reaching the putting green, calculated by dividing the total number of putts a player had by the total number of holes the player played (the lower, the better);

Scrambling (%), a measure of how often a player gets “up and down” (when a player puts the ball in the hole in two or fewer strokes after their approach shot missed the putting green), calculated by dividing the total instances of successfully getting up and down by the total number of times a player missed the putting green on their approach shot (the higher, the better);

Data Source:

The Official PGA Tour website keeps raw data of all of the above metrics, from 1980 to the present. However, I am only interested in the best predictor of success on the “modern” PGA Tour; due to two reasons, tournaments conducted prior to 2010 are not considered modern enough to be consistent with those conducted in the 12 years since. Firstly, recent advances in technology have allowed drivers to go further, irons to be more accurate, and putters to be more consistent, which has effectively made the game easier, and therefore the stats would heavily favor the post-2010 data. Secondly, the sizes of the purses from 1980 to 2010 dramatically increased over the years due to more sponsors and more fan attendance; while the rate of increase of modern-day purses since 2010 has been negligible in comparison, offering more consistency. As a result, the data for this analysis includes only the years 2010-2021.
Note also that the 2021-2022 season is only halfway complete, so data from 2022 will not be used.

Processing the Data:


Step 1: Obtained the appropriate data from the official PGA Tour website, and used Excel to clean the data so that it is uniform and properly formatted for the next step.
* Found and deleted duplicates across all tables;
* Used business sense to find and replace null values with the correct values (i.e. in the “victories” field of the “money_made” table, if no record was found for that field, it meant that the player did not win any tournaments that year);
* Appropriately renamed field headers across all tables, to make them consistent and more descriptive (i.e. renaming the “gir” field to “greens_in_regulation” to make it easier to understand);
* Created two matching keys across all data tables for the future joining of as many tables as necessary (the “player” and “year” fields);
* Assigned corresponding year values to each record across all tables;
* Exported each metric as an individual data table in respective .csv files.

Step 2: Imported the data from the respective .csv files into Google Big Query and used the sandbox SQL platform to aggregate, filter, and sort the data for the next step.
* Created a new project titled “lw-capstone” inside of the SQL workspace;
* Created seven data tables by importing the seven .csv files created earlier;
* Inside a nested query, joined all seven tables together using LEFT JOIN’s on a joint primary key created by “player” and “year” – the reason for joining is to combine the data from all tables into one table, the reason for using two keys jointly is to make sure all records match player names as well as the corresponding years they played, and the reason for using a LEFT JOIN rather than an INNER JOIN is that not all players were on the PGA Tour for all 12 years, but I still want to return their stats for the years in which they were active;
* Within the same nested query, returned and renamed only the fields that I need;
* In the outer query, aggregated the data by created and returning summarized values for all of the targeted metrics, grouped by player, filtered by players who have averaged more 10 or more events per year over the course of the 12 years (to ensure large enough of a sample size), and sorted the data from most earnings per event to least earnings per event;
* Exported the data into a .csv file called “pga_aggregate_stats.csv” for further analysis.

Here’s the SQL query in its full form:


You can also access the code on GitHub HERE

SQL Query

Analyzing the Data:


Step 1: Launched R-Studio, installed and loaded the “tidyverse”, “readr”, “ggplot2”, and “dplyr” packages used for data analysis, imported the data from the previously generated .csv file using the read.csv function, and saved the data as a dataframe called “pga_tour_stats”:

setwd("~/Desktop/Learning/Google Data Analytics Certificate/Capstone Project")
pga_stats <- read.csv("pga_aggregate_stats.csv")
as_tibble(pga_stats)
## # A tibble: 502 × 12
##    player            tot_events_played tot_earnings years_active events_per_year
##    <chr>                         <int>        <int>        <int>           <int>
##  1 Rory McIlroy                    182     54764656           11              17
##  2 Jon Rahm                        100     28771788            5              20
##  3 Dustin Johnson                  239     67826516           12              20
##  4 Collin Morikawa                  53     14065666            3              18
##  5 Justin Thomas                   167     43915588            7              24
##  6 Brooks Koepka                   139     35612106            7              20
##  7 Tiger Woods                     114     27989167           10              11
##  8 Jordan Spieth                   207     47750892            9              23
##  9 Patrick Cantlay                  99     21979822            7              14
## 10 Bryson DeChambeau               117     25522685            5              23
## # … with 492 more rows, and 7 more variables: earnings_per_event <dbl>,
## #   driving_accuracy_percent <dbl>, avg_driving_distance <dbl>,
## #   gir_percent <dbl>, putts_per_round <dbl>, scrambling_percent <dbl>,
## #   scoring_average <dbl>


Step 2: Calculated the correlation coefficients of all five input metrics (Driving Accuracy, Driving Distance, Greens in Regulation, Putts Per Round, and Scrambling) against the success metrics (Earnings Per Event and Scoring Average).

The official definition of the correlation coefficient is: “The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. A calculated number greater than 1.0 or less than -1.0 means that there was an error in the correlation measurement.” -Investopedia

Correlation Coefficients of Earnings Per Event vs. the Input Metrics:

as_tibble(earnings_correlations)
## # A tibble: 1 × 5
##   driving_accuracy_earnings driving_distance_earn… gir_earnings putting_earnings
##                       <dbl>                  <dbl>        <dbl>            <dbl>
## 1                   0.00955                  0.361        0.299           -0.295
## # … with 1 more variable: scrambling_earnings <dbl>

As expected, all of the correlation coefficients were positive except for Putting, because in each of the other cases, a higher number is seen as better, but Putting is the only metric where less is better. It is also nice to see that the coefficients were all between -1.0 and +1.0, which means that there were no errors in calculation.

Based on these results, Driving Distance (0.3612) and Scrambling (0.3344) are the metrics most closely correlated with earning more money per event, while Greens in Regulation (0.2990) and Putting (-0.2949) still had an impact but not as much, and Driving Accuracy (0.0096) did not make much of a difference at all. Moving onto the other success metric, Scoring Average, we get the following results:

Correlation Coefficients of Scoring Average vs. the Input Metrics:

as_tibble(scoring_correlations)
## # A tibble: 1 × 5
##   driving_accurac… driving_distanc… gir_scoring putting_scoring scrambling_scor…
##              <dbl>            <dbl>       <dbl>           <dbl>            <dbl>
## 1           -0.248           -0.279      -0.621           0.284           -0.639

Just like above, at first glance, the results were as expected; all of the correlation coefficients were negative except for Putting, because the success metric of Scoring Average is now defined as less-is-better. In addition, the coefficients were again all between -1.0 and +1.0, meaning there were no errors in calculation.

Based on these results, Scrambling (-0.6388) and Greens in Regulation (-0.6215) are the metrics most closely correlated with shooting lower scores, while Putting (0.2842) and Driving Distance (-0.2786) still had an impact but not as much. Once again, Driving Accuracy (-0.2479) made the least difference, although its effect on Scoring Average was much more significant than on Earnings Per Event.


Step 3: Calculated two correlation coefficients between metrics themselves (Driving Accuracy vs. Driving Distance, and Earnings Per Event vs. Scoring Average):

as_tibble(other_correlations)
## # A tibble: 1 × 2
##   driving_correlations earnings_scoring
##                  <dbl>            <dbl>
## 1               -0.580           -0.651


Both correlation coefficients showed statistically significant relationships. In the case of Driving Accuracy vs. Driving Distance (-0.5803), there was a noticeable negative correlation, which makes sense because as a player hits the ball further, they are naturally less accurate. Likewise, in the case of Earnings Per Event vs. Scoring Average (-0.6509), there was also a noticeable negative correlation, which also makes sense, because as a player scores lower in tournaments, they will naturally finish higher in the events and make more money. But with both, the correlation was not close enough to -1.0 that they could be considered strictly correlated with each other, because there are other variables that impact them (namely, the other input metrics).

Data Visualizations:

Visually, the relationship between these variables can be plotted against each other like this:

Correlation Coefficients of Earnings Per Event vs. the Input Metrics:







Correlation Coefficients of Scoring Average vs. the Input Metrics:







Correlation Coefficients Between Metrics Themselves:



Conclusions and Recommendations

Before tying these numbers back to the business problem at hand (what is the best predictor of success on the PGA Tour?), let’s first recap the findings of the analysis:

* Driving Distance (cor: 0.3612) and Scrambling (cor: 0.3344) are the metrics most closely correlated with earning more money per event, while Greens in Regulation (cor: 0.2990) and Putting (cor: -0.2949) still had an impact but not as much, and Driving Accuracy (cor: 0.0096) does not make much of a difference at all;
* Scrambling (cor: -0.6388) and Greens in Regulation (cor: -0.6215) are the metrics most closely correlated with shooting lower scores, while Putting (0.2842) and Driving Distance (cor: -0.2786) still had an impact but not as much. Once again, Driving Accuracy (cor: -0.2479) made the least difference;
* As Driving Distance increases, Driving Accuracy decreases (cor: -0.5803), and as Scoring Average decreases, Earnings Per Event increases (cor: -0.6509).

Based on the above, I gave each of the input metrics a score of 1-5 for each category, with 5 being the best:
* Earnings Per Event: (5) Driving Distance, (4) Scrambling, (3) Greens in Regulation, (2) Putting, (1) Driving Accuracy;
* Scoring Average: (5) Scrambling, (4) Greens in Regulation, (3) Putting, (2) Driving Distance, (1) Driving Accuracy;

It can be assumed that the average PGA Tour player cares far more about money than about his average score, because at the end of the day, they are competing for their livelihoods. However, we should not disregard the significant correlation between lower average scores and higher amounts of money earned, which means that Scoring Average contributes to overall success in other ways, and should still be factored into the end decision. Therefore, I will weight these scores in a 75%-25% split, in favor of Earnings Per Event.

Taking into account the above weighting system, and combining it with the assigned input metric scores, I calculated the total scores for each input metric:

earnings_coeff <- 0.75
scoring_coeff <- 0.25

input_scores <- data.frame(input_metrics = c("driving_distance", "scrambling", "greens in reg",
                             "putting", "driving_accuracy"),
           earnings_scores = c(5, 4, 3, 2, 1),
           scoring_scores = c(2, 5, 4, 3, 1))

input_scores <- input_scores %>% 
  mutate(combined_score = (earnings_coeff * earnings_scores) +
           (scoring_coeff * scoring_scores))

arrange(input_scores, desc(combined_score))
##      input_metrics earnings_scores scoring_scores combined_score
## 1 driving_distance               5              2           4.25
## 2       scrambling               4              5           4.25
## 3    greens in reg               3              4           3.25
## 4          putting               2              3           2.25
## 5 driving_accuracy               1              1           1.00


Once the input metrics have all been assigned scores, two of the metrics have distanced themselves from the others and are tied for most important: Driving Distance and Scrambling. Business intuition confirms that this indeed makes sense; the “modern” form of golf is widely regarded as favoring “bomb-and-gouge” players, or players that drive the ball a long ways (the “bomb”), and because distance is inversely correlated with accuracy, these same players will ultimately need to play more of their approach shots from the rough (the “gouge”), thus placing a high emphasis on their short games and their ability to scramble!

In conclusion, while golf is a multi-faceted sport with many variables at play, at the highest levels of golf on the modern PGA Tour, in order to have greater chance of success, a player should focus on increasing their Driving Distance, as well as their Scrambling Percentage, for the highest return on the investment of their training time.

Future Considerations


In the future, I would love to expand on this current project, and have the following modifications / improvements in mind:

* I would conduct a separate analysis on PGA metrics from the years 1980 to 2010, to see how the game has evolved over time;
* I would dive deeper into Driving Distance and Scrambling to see if there are particular aspects of these metrics that matter more than others (i.e. “number of drives per season over 300 yards,” or “bunker scrambling percentage” vs. “rough scrambling percentage”);
* I would look at other success metrics, such as the number of tournament victories a player has, although I hypothesize that the results would be similar, since the players with making the most money and having the lowest scoring averages are probably also the ones winning the most;
* I would take a handful of the most successful golfers in the modern era (i.e. Tiger Woods, Phil Mickelson, Rory McIlroy, Dustin Johnson, etc.) and dive deeper into what made them so successful;
* I would be interested in seeing how these results are similar or different on other major worldwide professional golf tours such as the DP World Tour (formerly the European Tour); the PGA Tour competes primarily in the United States, and maybe the courses that players play on elsewhere would not reward the “bomb-and-gouge” playing style in the same way;
* I would be interested in diving deeper into any outliers found by this study (i.e. players who are low on Driving Distance, but have still made a lot of money per event and shoot low scores) to find out how and why;
* I would be interested in seeing which input metrics correlate the most strongly with Scoring Average for the average, 15-handicap golfer.