Round One / First Day of Tournament Play: Area of Interest
Background
After Working On Another Project, I Became Really Interested In Looking at Professional Golfers’ Statistics. I Analyzed ESPN’s Statistics Before, But For This Project I Took Some Of The Statistics From The PGA Tour’s Website. In This Project, I Want To Assess Different Aspects Of The Game (Long-Game vs. Short-Game) And Their Impact / Importance To The Players’ Standings / Results.
Dataset
The Dataset Includes The Statistics I Scraped From The PGA Tour’s Website. I Gathered The Statistics From The Tables Highlighted On Their Home Page And Combined Them Into One Dataset. These Tables Included Statistics On The Players’ Driving Accuracy, Driving Distance, FedExCup Points, Greens In Regulation, Putting, Scoring Averages, Scoring Totals, Scrambling, Tee-to-Green Performance, And Top Finishes.
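As A Rough Illustration Of The Combining Step, Here Is A Minimal Python Sketch That Merges Per-Statistic Tables On The Player’s Name. The Sample Rows, The Player Names, And The `combine_tables` Helper Are All Hypothetical; The Original Scraping And Merging Code Is Not Shown Here And May Well Have Worked Differently.

```python
# Hypothetical per-statistic tables, each a list of rows keyed by player name.
# The names and numbers are illustrative placeholders, not scraped data.
driving_accuracy = [
    {"PLAYER NAME": "Player A", "DRIVING ACC": 65.2},
    {"PLAYER NAME": "Player B", "DRIVING ACC": 58.9},
]
driving_distance = [
    {"PLAYER NAME": "Player A", "AVG DRIVING DISTANCE": 295.1},
    {"PLAYER NAME": "Player B", "AVG DRIVING DISTANCE": 310.4},
]


def combine_tables(*tables, key="PLAYER NAME"):
    """Merge rows from several tables into one row per player."""
    combined = {}
    for table in tables:
        for row in table:
            # Create the player's row on first sight, then add this table's columns.
            combined.setdefault(row[key], {key: row[key]}).update(row)
    return list(combined.values())


dataset = combine_tables(driving_accuracy, driving_distance)
for row in dataset:
    print(row)
```

Because Every Source Table Has Its Own Rank Columns, Prefixes Such As DA And DD In The Data Dictionary Keep Those Columns From Overwriting Each Other After A Merge Like This.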
Data Dictionary
| Column Name | Explanation |
|---|---|
| PLAYER NAME | Player Name |
| ROUNDS | Number of Rounds Played |
| DA RANK THIS WEEK | Driving Accuracy Rank This Week |
| DA RANK LAST WEEK | Driving Accuracy Rank Last Week |
| DRIVING ACC | Driving Accuracy |
| FAIRWAYS HIT | Number of Fairways Hit |
| POSSIBLE FAIRWAYS | Number of Possible Fairways |
| DD RANK THIS WEEK | Driving Distance Rank This Week |
| DD RANK LAST WEEK | Driving Distance Rank Last Week |
| AVG DRIVING DISTANCE | Average Driving Distance |
| TOTAL DRIVING YARDS | Total Driving Yards |
| TOTAL DRIVES | Total Number of Drives |
| FC RANK THIS WEEK | FedExCup Rank This Week |
| FC RANK LAST WEEK | FedExCup Rank Last Week |
| EVENTS | Number of Events Played For FedExCup Points |
| FEDEXCUP POINTS | Number of FedExCup Points |
| WINS | Number of Wins |
| POINTS BEHIND LEAD | Number of Points Behind Leader |
| GR RANK THIS WEEK | Greens in Regulation Rank This Week |
| GR RANK LAST WEEK | Greens in Regulation Rank Last Week |
| GREENS HIT IN REGULATION (%) | Percent of Time Player Hit the Green in Regulation |
| # GREENS HIT | Number of Greens Hit |
| # OF HOLES | Number of Holes Played |
| RELATIVE/PAR | Score Relative to Par When Green Hit in Regulation |
| P RANK THIS WEEK | Putting Rank This Week |
| P RANK LAST WEEK | Putting Rank Last Week |
| AVG PUTTS | Average Putts |
| TOTAL STROKES GAINED: PUTTING | Total Strokes Gained from Putting* |
| P ROUNDS MEASURED | Putting Rounds Measured |
| SA RANK THIS WEEK | Scoring Average Rank This Week |
| SA RANK LAST WEEK | Scoring Average Rank Last Week |
| SCORING AVG | Weighted Scoring Average** |
| TOTAL STROKES | Total Strokes |
| TOTAL ADJUSTMENT | Adjustment for Scoring Average*** |
| SA ROUNDS MEASURED | Number of Rounds Measured for Scoring Average |
| ST RANK THIS WEEK | Scoring Total Rank This Week |
| ST RANK LAST WEEK | Scoring Total Rank Last Week |
| ST AVG | Average Number of Strokes Better/Worse Than Field**** |
| TOTAL SG:T | Total Strokes Gained: Total |
| TOTAL SG:T2G | Total Strokes Gained: Tee-to-Green |
| TOTAL SG:P | Total Strokes Gained: Putting |
| ST ROUNDS MEASURED | Number of Rounds Measured for Scoring Total |
| S RANK THIS WEEK | Scrambling Rank This Week |
| S RANK LAST WEEK | Scrambling Rank Last Week |
| SCRAMBLING (%) | Scrambling***** |
| S PAR OR BETTER | Number of Times Made Par or Better |
| S MISSED GIR | Number of Times Missed a Green in Regulation |
| TG RANK THIS WEEK | Tee-to-Green Rank This Week |
| TG RANK LAST WEEK | Tee-to-Green Rank Last Week |
| TG AVG | Average Number of Strokes Better/Worse Than Field****** |
| SG:OTT | Strokes Gained: Off The Tee |
| SG:APR | Strokes Gained: Approach |
| SG:ARG | Strokes Gained: Around The Green |
| TG ROUNDS MEASURED | Number of Rounds Measured for Tee-to-Green |
| T10 RANK THIS WEEK | Top 10 Rank This Week |
| T10 RANK LAST WEEK | Top 10 Rank Last Week |
| TOP10 | Number of Top 10 Finishes |
| 1ST PLACE FINISHES | Number of First Place Finishes |
| 2ND PLACE FINISHES | Number of Second Place Finishes |
| 3RD PLACE FINISHES | Number of Third Place Finishes |
*The number of putts a player takes from a specific distance is measured against a statistical baseline to determine the player’s strokes gained or lost on a hole. The sum of the values for all holes played in a round minus the field average strokes gained/lost for the round is the player’s strokes gained/lost for that round. The sum of strokes gained for each round is divided by total rounds played. The Strokes Gained concept is a by-product of the PGA TOUR’s ShotLink Intelligence Program, which encourages academics to perform research against ShotLink statistical data.
**The weighted scoring average takes the stroke average of the field into account. It is computed by adding a player’s total strokes to an adjustment and dividing by the total rounds played.
***The adjustment is computed by determining the stroke average of the field for each round played. This average is subtracted from par to create an adjustment for each round. A player accumulates these adjustments for each round played.
****The per round average of the number of Strokes the player was better or worse than the field average on the same course & event.
*****The percent of time a player misses the green in regulation, but still makes par or better.
******The per round average of the number of strokes the player was better or worse than the field average on the same course & event minus the player’s Strokes Gained: Putting value.
^ All These Explanations Are From The PGA Tour’s Website.
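To Make Two Of The Footnoted Definitions Concrete, Here Is A Small Python Sketch Of The Weighted Scoring Average (Total Strokes Plus Adjustment, Divided By Rounds) And The Scrambling Percentage (Pars Or Better Divided By Missed Greens). The Function Names And The Season Totals Are Hypothetical And Only Illustrate The Arithmetic; In The Dataset The Inputs Would Presumably Come From TOTAL STROKES, TOTAL ADJUSTMENT, SA ROUNDS MEASURED, S PAR OR BETTER, And S MISSED GIR.

```python
def weighted_scoring_average(total_strokes, total_adjustment, rounds):
    """Footnote **: (total strokes + adjustment) / rounds played."""
    return (total_strokes + total_adjustment) / rounds


def scrambling_pct(par_or_better, missed_gir):
    """Footnote *****: share of missed greens that still ended in par or better."""
    return 100 * par_or_better / missed_gir


# Hypothetical season totals, chosen only to illustrate the arithmetic.
print(weighted_scoring_average(total_strokes=2800, total_adjustment=10, rounds=40))  # 70.25
print(scrambling_pct(par_or_better=90, missed_gir=150))  # 60.0
```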
Table of Summary Statistics
These Are The Main Statistics For The Dataset, Arranged Alphabetically. I Added The Players’ FedExCup Standings, So You Can Compare The Players’ Statistics And Also See Where They Stand In Terms Of FedExCup Points. Obviously, There Is A Correlation Between Their FedExCup Points And Their Top 10 Finishes, Since The Points Are Awarded For Top Finishes. As For The Other Statistics, There Is Not A Strong Correlation Between Any Of Those And The Players’ FedExCup Points. It Will Be Interesting To See If Any Of The Variables Are Predictive Of FedExCup Points.
The Players With The Most Accurate Drives Are Not High In The FedExCup Standings. Bryson DeChambeau, Who Has The Most FedExCup Points, Does Have The Farthest Drives, On Average. Cameron Champ, The Player With The Second Farthest Drives, On Average, Is 122nd In FedExCup Points. Furthermore, Of The Top Twenty Players In Driving Distance, Only Two Are Ranked In The Top Ten In FedExCup Standings. Similarly, Of The Twenty Players Who Hit Greens In Regulation Most Frequently, Only One Is Ranked In The Top Ten In FedExCup Points: Stewart Cink (5), Who Leads The Group In This Metric.
Russell Henley Is 40th In Terms Of FedExCup Points, But Has The Lowest Scoring Average. That Is Quite Interesting! The Two Players With The Most FedExCup Points, Bryson DeChambeau And Justin Thomas, Were Not Far Behind Russell, Though, In Scoring Averages. Besides Top 10 Finishes, Scoring Average Also Seems To Be Somewhat Correlated With FedExCup Points, Which Makes Lots Of Sense.
Lastly, The Players With The Best Scrambling Percentages Are Not Highly Ranked In FedExCup Points. My First Thought Was That The Top Players Do Not Miss The Green In Regulation That Often, So Each Time They Do Not Score Par Or Better Has A More Drastic Impact On Their Scrambling Percentage. However, As Stated Before, Most Of The Top Players In FedExCup Points Are Not Ranked High In The Number Of Times They Hit The Green In Regulation, So That Explanation Does Not Fully Hold.
Analyzing These Summary Statistics Created Some Questions And I Am Excited To Dive Deeper Into My Analysis :)
Round Two: Descriptive Analysis
Visualization One
I Wanted To Go A Step Further Than The Summary Statistics And Look At How The Players’ Ranks Compare Across The Different Categories.
Bryson DeChambeau, The Player With The Most FedExCup Points, Ranks Quite High In Many Of The Other Categories. Similarly, Many Of The Top Ranked Players In FedExCup Points Have A Strong Skill Set In One Area, Which Shows In A Top Ranking In That Category. However, Bryson And Many Of The Other Highly Ranked Players In FedExCup Points Have Low Rankings In Driving Accuracy. That Is Very Interesting And I Wonder Why That Is. It Is Also Interesting That The Top Players In FedExCup Points Are Not Necessarily The Best Players In Terms Of Scoring Averages And Scoring Totals.
On The Other End, Many Of The Players With A Low FedExCup Standing Rank Low Across Many Of The Other Categories. J.J. Spaun, Second To Last In FedExCup Points, Has Low Rankings Across The Board, But Is Tied For 17th In Driving Accuracy. That Is Kind Of The Opposite Of Bryson DeChambeau! Just Like The Highest Ranked Players In FedExCup Points, The Players On The Other End Have A Strength, Even If They Are Not Ranked As High In That Specific Category. I Suppose, Though, That If Someone Were The Best At Everything They Would Be A Superstar And Would Win Everything.
Visualization Two
Which Is More Impactful On A Player’s Game: Putting Strokes Or Strokes From Tee-to-Green?
I Wanted To Assess Whether Putting Or The Shots From The Tee To The Green Make A Larger Impact On A Player’s Game. Thus, I Grouped Players By Their FedExCup Points, Since Those Are Based On Players’ Finishes. Ideally, Players Want To Be In The Top Right Corner, Gaining Strokes From Putting And From The Tee To The Green. Those Who Are Gaining The Most Strokes From Putting And From The Tee To The Green Are Those In The Top Two Quartiles / Half (Blue And Green). The Purple Dot, The Leader In FedExCup Points, Is Also In The Mix, Which Makes Sense. There Are A Large Number Of Dots Huddled Around The (0,0) Area, Which Indicates The Run-of-the-Mill Or Average Golfers, Who Are Not Exceptionally Better Or Worse Than Others In The Field. To The Far Left, Though, There Are Two Green Dots. These Players Are In The Top Half Of The FedExCup Standings; However, They Are The Worst Players In This Dataset From Tee-to-Green. Meanwhile, The Majority Of The Players Who Lose Twenty-Five Or More Strokes On Putting Are In The Bottom Half Of Players. Thus, I Would Say That Putting Is Important: Players Who Are Good From The Tee To The Green But Lose Strokes Putting Are Not Necessarily Good Overall, Whereas Players Who Are Not Good From The Tee To The Green But Are Good At Putting Can Still Be In Good Standing Overall.
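The Quartile Grouping Used To Color The Dots Could Be Done Roughly As Follows. This Is A Minimal Python Sketch Using The Standard Library’s `statistics.quantiles`; The `quartile_labels` Helper And The Point Totals Are Hypothetical, And The Original Plot May Have Used Different Cut Points.

```python
from statistics import quantiles


def quartile_labels(points):
    """Return 1 (lowest) to 4 (highest) for each value, using quartile cut points."""
    q1, q2, q3 = quantiles(points, n=4)
    # Each comparison adds 1 when the value lies above that cut point.
    return [1 + (p > q1) + (p > q2) + (p > q3) for p in points]


# Hypothetical FedExCup point totals, not the real standings.
points = [120, 250, 400, 610, 800, 950, 1200, 1577]
print(quartile_labels(points))  # [1, 1, 2, 2, 3, 3, 4, 4]
```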
Visualization Three
Hypothesis: Players With Higher Scrambling Percentages Gain More Strokes In Putting.
The Scrambling Percentage Indicates How Often A Player Misses The Green In Regulation, But Still Scores Par Or Better. That Being Said, I Expected Those Who Gain The Most Strokes From Putting To Have Higher Scrambling Percentages. Based On These Boxplots, Though, There Does Not Seem To Be A Large Difference Between Those Who Have High Strokes Gained (Quartiles Three And Four) And Those Who Have Low Strokes Gained (Quartiles One And Two). However, The Upper Whisker Of The Fourth Quartile’s Boxplot Covers A Larger And Higher Range Of Scrambling Percentages Than Those Of The Other Quartiles. Overall, My Hypothesis Was Wrong And I Wonder How The Frequency Of Players Having To Scramble Affects These Results. Those Who Have To Scramble Probably Have Lower Driving Accuracy And / Or Are Not As Good / Precise Around The Green. Putting Is Another Skill Set Of The Game, And While I Thought That Players Who Could Miss The Green In Regulation And Still Score A Par Or Better Would Have To Be Good Putters, There Are Other Factors I Did Not Consider.
Visualization Four
Follow-Up Hypothesis: Those With Lower Scrambling Percentages Have Lower Driving Accuracy.
So Maybe I Was On To Something With My Second Hypothesis. The Players With Lower Driving Accuracy And Scrambling Percentages Are Those With The Higher Scoring Averages (Blue And Purple). On The Other Hand, Those With Higher Driving Accuracy And Scrambling Percentages Are Those With Lower Scoring Averages (Red And Gold), On Average. While This Hypothesis Is Not 100% Accurate And Does Not Encompass All Players, It Does Look Like There Is A Correlation For The Players At Both Extremes (Low Driving Accuracy And Scrambling Percentages, And High Driving Accuracy And Scrambling Percentages). Adding Their Scoring Average Took The Graph To Another Level, I Think, Because We Could See That Those Who Have Low Driving Accuracy And A Low Scrambling Percentage Score Higher, On Average, Than Those Who Have High Driving Accuracy And A High Scrambling Percentage. Furthermore, Players On Both Ends Of The Spectrum In Scoring Averages Are Spread Across The Board In Driving Accuracy, But There Is A Clearer Divide Within The Scrambling Percentages: Those With Lower Scoring Averages Are Better At Scrambling Than Those With High Scoring Averages.
Visualization Five
Does Playing More Rounds of Golf Correlate to More FedExCup Points?
While The Bottom Quarter Of The Graph Looks To Possibly Indicate A Positive Correlation, Playing More Rounds Does Not Mean You Are Going To Get More Points. Starting At About 32 Rounds, There Is A Spike: Some Players Who Have Played Comparatively Few Rounds Have A Lot Of FedExCup Points. Players Who Have Played 60+ Rounds Do Not Have More FedExCup Points Than Players Who Have Played Fewer Than 40 Rounds.
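One Way To Put A Number On This Visual Impression Would Be A Pearson Correlation Coefficient Between Rounds Played And FedExCup Points. The Python Sketch Below Computes It From Scratch; The `pearson_r` Helper And The Sample Values Are Hypothetical And Are Not Taken From The Dataset, So No Result Is Claimed For The Real Data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical rounds played and FedExCup points for six players.
rounds = [20, 32, 40, 55, 60, 70]
points = [300, 1500, 700, 500, 650, 400]
print(pearson_r(rounds, points))
```

A Value Near Zero Would Support The Reading That More Rounds Do Not Simply Translate Into More Points, While A Value Near One Would Contradict It.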
Visualization Six
Is There A Correlation Between FedExCup Points And Strokes Gained (Off The Tee, Approach, And Around The Green)? Does Gaining Strokes In One Area Have A More Significant Impact On FedExCup Points Than The Others?
The Majority Of The Strokes Lost Are On The Left Side Of The Graph: The Players With Low FedExCup Points. This Checks Out. Interestingly, There Do Not Seem To Be As Many Strokes Gained Around The Green As In The Other Two Areas. Moreover, The Highest Strokes Gained Around The Green Are Actually Towards The Left Side Of The Graph. Those Players Must Not Be As Good Off The Tee And / Or In Their Approach As The Other Players. It Does Look Like The Players’ Strokes Gained From Their Approach And Off The Tee Follow A Similar Pattern.
Round Three: Secondary Data Source
I Want To Assess The Sentiments Of Tweets Regarding The Top Three Players In FedExCup Points.
| PLAYER NAME | FEDEXCUP POINTS |
|---|---|
| Bryson DeChambeau | 1577 |
| Justin Thomas | 1552 |
| Cameron Smith | 1381 |
Bryson DeChambeau, Justin Thomas, And Cameron Smith Are The Leaders In FedExCup Points Currently. I Collected The Tweets Regarding Each Of Those Players. There Were 83 Tweets About Bryson, 496 About Cameron, And 500 About Justin. However, When I Attached Sentiments To Tweets, There Ended Up Being 62, 479, And 309 Sentiments, Respectively. These Were All Combined Into One Dataset For Analysis Purposes.
How Do The Average Sentiments Compare Across The Three Players?
| player | avg anticipation | avg anger | avg disgust | avg fear | avg joy | avg negative | avg positive | avg sadness | avg surprise | avg trust | avg positivity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bryson | 0.6451613 | 0.3709677 | 0.2096774 | 0.4838710 | 0.6129032 | 0.5806452 | 1.306452 | 0.2419355 | 0.2419355 | 0.8225806 | 0.7258065 |
| Cameron | 0.4363257 | 0.2526096 | 0.1628392 | 0.1983299 | 0.3674322 | 0.4572025 | 1.106472 | 0.2609603 | 0.1649269 | 1.5929019 | 0.6492693 |
| Justin | 0.8252427 | 0.2750809 | 0.1715210 | 0.3203883 | 0.5016181 | 0.5598706 | 1.245955 | 0.2847896 | 0.3236246 | 0.6860841 | 0.6860841 |
Keep In Mind These Are Averages From Different Sized Datasets, But They Are Quite Telling. Cameron Has The Lowest Averages In All The Categories, Besides Sadness, Where He Is In The Middle, And Trust, Where He Actually Has The Highest Average. Justin Is In The Middle For The Majority Of The Categories, And His Averages In Anger And Disgust Are Close To Cameron’s Averages. He Is The Highest In Anticipation, Sadness, And Surprise, Which Is An Amusing Combination. Lastly, Bryson Is All Over The Board. His Averages For Anger, Disgust, Fear, Joy, Negative, Positive, And Positivity Are The Highest, Which, Again, Is An Amusing Combination.
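The Table Does Not Define Avg Positivity, But The Numbers Are Consistent With Avg Positive Minus Avg Negative. That Is An Inference From The Values, Not A Stated Definition, And The Short Python Check Below Simply Compares The Two Using The Averages Copied From The Table Above.

```python
# Averages copied from the table: (avg positive, avg negative, avg positivity).
averages = {
    "Bryson": (1.306452, 0.5806452, 0.7258065),
    "Cameron": (1.106472, 0.4572025, 0.6492693),
    "Justin": (1.245955, 0.5598706, 0.6860841),
}

for player, (positive, negative, positivity) in averages.items():
    # If the inference holds, the two printed numbers agree for every player.
    print(player, round(positive - negative, 4), round(positivity, 4))
```

For All Three Players The Difference Matches The Reported Positivity To Four Decimal Places.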
In Addition To Looking At The Averages, I Wanted To Look At The Maximum Values That Tweets About Each Player Received In The Different Sentiment Categories.
Right Off The Bat, The Highest Value / Column Is Cameron’s Negative Value. I Wonder What That Person Had To Say About Him (Yikes)! Looking At The Graph, Generally Speaking, Many Of The Higher Values Are Cameron’s Values. Besides Anticipation, Cameron Does Have The Highest Values In All The Categories, And He Is Only One Behind Justin In Anticipation. Maybe Cameron’s Fans / Haters Are Very Passionate, Resulting In High Sentiment Values. As For Justin And Bryson, They Are Equal In Many Categories, And For The Sentiments They Are Not Equal In, Justin Is Higher Than Bryson. Bryson’s Fans / Haters Must Not Be As Emotional.
Round Four / The Final Round: Predictive or Prescriptive Analysis
Regression Results
Dependent Variable: FEDEXCUP POINTS
| Variable | Coefficient (Std. Error) |
|---|---|
| DRIVING ACC | 2.204 (5.919) |
| AVG DRIVING DISTANCE | 7.160* (4.211) |
| GREENS HIT IN REGULATION (%) | 13.202 (10.343) |
| AVG PUTTS | -7,405.685 (37,521.110) |
| SCORING AVG | -144.609* (81.289) |
| ST AVG | 7,396.714 (37,520.650) |
| TOTAL SG:T | 42,109.180*** (14,687.850) |
| TOTAL SG:T2G | -42,107.520*** (14,687.360) |
| TOTAL SG:P | -42,108.430*** (14,687.460) |
| SCRAMBLING (%) | 1.245 (6.021) |
| TG AVG | -7,437.331 (37,511.860) |
| SG:OTT | -154.097 (135.627) |
| SG:ARG | 49.308 (128.478) |
| SG:APR | -16.376 (129.434) |
| TOP 10 | 99.381*** (17.205) |
| Constant | 7,447.992 (6,067.746) |
| Observations | 143 |
| R2 | 0.679 |
| Adjusted R2 | 0.641 |
| Residual Std. Error | 201.679 (df = 127) |
| F Statistic | 17.888*** (df = 15; 127) |
Note: *p<0.1; **p<0.05; ***p<0.01
I Thought It Would Be Interesting To See If Any Of The Variables Can Be Used To Build A Predictive Model Of FedExCup Points. Therefore, I Included All The Major Variables From The Categories (Driving, Putting, Scoring, Strokes Gained, Scrambling, And Finishes). My Understanding Is That The Independent Variables Will Have An Impact On The Dependent Variable (FedExCup Points).
Based On The Regression Analysis, Not All Of The Variables Are Statistically Significant, Though. At The 90% Confidence Level, The Players’ Average Driving Distance And Weighted Scoring Average Are Determinants Of FedExCup Points. The Weighted Scoring Average Has A Negative Coefficient, Which Makes Sense: Golfers Want A Lower Score, So A Lower Scoring Average Goes With More FedExCup Points. At The 99% Confidence Level, TOTAL SG:T, TOTAL SG:T2G, TOTAL SG:P, And Top Ten Finishes Are Significant In Determining FedExCup Points. Top Ten Finishes And TOTAL SG:T Have Positive Coefficients, While TOTAL SG:T2G And TOTAL SG:P Have Negative Coefficients. Those Three Strokes Gained Coefficients Are Nearly Identical In Size With Opposite Signs, Which Suggests The Variables Overlap Heavily, So Their Individual Signs Should Be Interpreted With Caution.
The Fact That The F Statistic Is Significant At The 99% Confidence Level Means That The Variables Are Jointly Significant. The R Squared Value Is Decently High And Means That About 68% Of The Variability In The Dependent Variable (FedExCup Points) Can Be Explained By The Independent Variables.
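As A Sanity Check On The Fit Statistics, The Adjusted R Squared And The F Statistic Can Be Recomputed From The Reported R Squared, The 143 Observations, And The 15 Predictors. The Python Sketch Below Uses The Standard Textbook Formulas; Because It Starts From The Rounded R Squared Of 0.679, The F Value Comes Out Slightly Different From The Reported 17.888.

```python
n, k = 143, 15          # observations and predictors, from the regression output
r2 = 0.679              # reported R2, already rounded to three decimals

df_resid = n - k - 1    # 127, matching the reported residual degrees of freedom

# Adjusted R2 penalizes R2 for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# Overall F statistic for the joint significance of all predictors.
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print(round(adj_r2, 3))  # 0.641
print(round(f_stat, 1))  # 17.9
```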
FEDEXCUP POINTS = 7,447.992 + 7.160(AVG DRIVING DISTANCE) - 144.609(SCORING AVG) + 42,109.180(TOTAL SG:T) - 42,107.520(TOTAL SG:T2G) - 42,108.430(TOTAL SG:P) + 99.381(TOP 10)
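The Equation Above Can Be Wrapped In A Small Function For Experimenting With Different Inputs. This Is Only A Python Sketch: The `predict_points` Helper Is Hypothetical, It Uses Just The Terms Kept In The Equation (Not The Full Fitted Model), And, Given How Heavily The Strokes Gained Terms Offset Each Other, Its Output Should Be Treated As Illustrative Rather Than As A Reliable Forecast.

```python
# Intercept and coefficients copied from the equation above.
INTERCEPT = 7447.992
COEFFICIENTS = {
    "AVG DRIVING DISTANCE": 7.160,
    "SCORING AVG": -144.609,
    "TOTAL SG:T": 42109.180,
    "TOTAL SG:T2G": -42107.520,
    "TOTAL SG:P": -42108.430,
    "TOP 10": 99.381,
}


def predict_points(stats):
    """Apply the equation to one player's statistics (missing keys count as 0)."""
    return INTERCEPT + sum(
        coef * stats.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


# Holding everything else fixed, one more top ten finish adds about 99.4 points.
gain = predict_points({"TOP 10": 1}) - predict_points({"TOP 10": 0})
print(round(gain, 3))  # 99.381
```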
Hope I Receive The Green Jacket :)