Factors Influencing Golf Earnings

class: center, middle, inverse, title-slide

.title[
# Factors Influencing Golf Earnings
]
.subtitle[
## Linear Regression
]
.author[
### Tyler Battaglini & Ryan Lebo
]
.date[
### 2025-02-20
]

---

## Table of Contents

<ul>
<li>Introduction</li>
<li>Variables</li>
<li>Research Question</li>
<li>Exploratory Data Analysis</li>
<li>Linear Model</li>
<li>Box-Cox Transformation</li>
<li>Log Model</li>
<li>Bootstrapping</li>
<li>Model Selection</li>
<li>Conclusion</li>
</ul>

---

## Introduction

<ul>
<li>PGA 2004 data set (196 participants)</li>
<li>What is the PGA?</li>
<li>The data set provides earnings and player statistics</li>
</ul>

---
## Variables 
<ul style="font-size: 1.2em; line-height: 1.6;">

.pull-left[
- Name
- Age
- Avg Drive
- Driving Accuracy
- Greens in Regulation
- Avg Number of Putts

]

.pull-right[
- Save Percentage
- Money Rank
- Number of Events
- Total Winnings
- Average Winnings

]

</ul>

---
## Research Question 
<ul style="font-size: 1.2em; line-height: 1.6;">

<li>Which variables affect players' winnings in the 2004 season? </li>
<li>Looking at average drive vs earnings </li>
<li>Short game vs Long game </li>

</ul>
---
##

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
  Exploratory Data Analysis
</div>

---
## Exploratory Data Analysis
<ul style="font-size: 1.2em; line-height: 1.6;">

<li>Check for high correlation </li>
<li>Take out missing observations </li>
<li>Remove some variables </li>

</ul>
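
A minimal sketch of the correlation check, assuming the Pearson coefficient is the measure used (the slides do not say). The numbers and the names `total_winnings` and `average_winnings` are hypothetical stand-ins, not values from the PGA data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for five players (not from the PGA data set).
total_winnings = [100.0, 250.0, 400.0, 900.0, 1500.0]
events = [20, 25, 20, 30, 25]
average_winnings = [t / e for t, e in zip(total_winnings, events)]

r = pearson(total_winnings, average_winnings)
print(f"r(total, average) = {r:.3f}")
```

A pair of variables with a correlation close to 1 or -1 carries nearly the same information, which is one reason to drop one of them before fitting.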

---
##

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
  Linear Model
</div>

---

## Linear Model

Table: Statistics of Regression Coefficients

|                 | Estimate| Std. Error| t value| Pr(>&#124;t&#124;)|
|:----------------|--------:|----------:|-------:|------------------:|
|(Intercept)      |  -15.185|      2.640|  -5.751|              0.000|
|Average_Drive    |    0.021|      0.008|   2.750|              0.007|
|Greens_on_reg    |    0.130|      0.022|   6.039|              0.000|
|Save_Percent     |    0.044|      0.011|   3.962|              0.000|
|Number_events    |   -0.063|      0.013|  -4.851|              0.000|
|Age_Above_30TRUE |    0.063|      0.154|   0.410|              0.682|
---
## Linear Model

<ul>
<li>Non-normal distribution</li>
<li>Several outliers</li>
<li>Non-constant variance</li>
</ul>

---

## Linear Model

<ul>
<li>All variance inflation factors (VIFs) are below 5</li>
<li>Little to no multicollinearity</li>
</ul>
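
For reference, a short sketch of how a variance inflation factor relates to the threshold of 5. It assumes the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The R² values below are hypothetical, not computed from the golf data.

```python
def vif(r_squared: float) -> float:
    """VIF for a predictor whose regression on the other predictors has this R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

# Hypothetical R^2 values, chosen only to show the scale.
for r2 in (0.0, 0.5, 0.8):
    print(f"R^2 = {r2:.1f} -> VIF = {vif(r2):.2f}")
```

Under this definition, a VIF of 5 corresponds to an R² of 0.8 between one predictor and the rest.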

---
##

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
  Box-Cox Transformation
</div>

---
## Box-Cox Transformation

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>All the lambda values are close to 0</li>
  <li>Proceed with log transformation</li>
</ul>
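
A brief sketch of why a lambda near 0 points to the log transformation, assuming the usual one-parameter Box-Cox family. The winnings value is hypothetical.

```python
import math

def box_cox(y: float, lam: float) -> float:
    """One-parameter Box-Cox transform; defined for y > 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

y = 50_000.0  # hypothetical average winnings, not from the data set
for lam in (1.0, 0.5, 0.1, 0.001, 0.0):
    print(f"lambda = {lam:<5} -> {box_cox(y, lam):.4f}")
print(f"log(y)         -> {math.log(y):.4f}")
```

As lambda shrinks toward 0, the transformed value approaches log(y), so the log model is the limiting case of the family.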

---

## Log Transformation Model

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>Response variable – log(Average Winnings)</li>
  <li>Explanatory variables – Average Drive, Greens in Regulation, Save Percentage, Number of Events, and Age Above 30</li>
</ul>

|                 | Estimate| Std. Error| t value| Pr(>&#124;t&#124;)|
|:----------------|--------:|----------:|-------:|------------------:|
|(Intercept)      |   -5.451|      2.383|  -2.287|              0.023|
|Average_Drive    |    0.006|      0.007|   0.819|              0.414|
|Greens_on_reg    |    0.192|      0.019|   9.875|              0.000|
|Save_Percent     |    0.056|      0.010|   5.562|              0.000|
|Number_events    |   -0.045|      0.012|  -3.857|              0.000|
|Age_Above_30TRUE |    0.033|      0.139|   0.240|              0.811|

---

## Goodness of Fit Measures

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>Improvement in constant variance</li>
  <li>Improvement in Normality</li>
</ul>

---

## Comparison of Models

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>Log model outperforms the linear model</li>
  <li>Lower SSE indicates a better fit</li>
  <li>R-squared and adjusted R-squared are higher in the log model</li>
</ul>

|             |     SSE|  R.sq| R.adj| Cp|      AIC|     SBC|   PRESS|
|:------------|-------:|-----:|-----:|--:|--------:|-------:|-------:|
|full.model   | 123.038| 0.335| 0.316|  6|  -64.865| -45.510| 135.110|
|log.winnings | 100.246| 0.455| 0.440|  6| -102.970| -83.615| 107.534|
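
The slides do not state how the AIC and SBC columns were computed. The sketch below uses one common SSE-based form, AIC = n·ln(SSE/n) + 2p and SBC = n·ln(SSE/n) + p·ln(n). The sample size `N_OBS = 186` is an assumption: the slides report 196 participants before missing observations were removed, and 186 is the value that appears to reproduce the table under these formulas.

```python
import math

def aic(sse: float, n: int, p: int) -> float:
    """SSE-based AIC for a least-squares fit with p estimated coefficients."""
    return n * math.log(sse / n) + 2 * p

def sbc(sse: float, n: int, p: int) -> float:
    """SSE-based Schwarz Bayesian criterion (BIC) for the same fit."""
    return n * math.log(sse / n) + p * math.log(n)

N_OBS = 186     # assumption: rows left after dropping missing observations
N_PARAMS = 6    # intercept plus five predictors, as in the coefficient tables

# SSE values taken from the comparison table.
models = {"full.model": 123.038, "log.winnings": 100.246}

for name, sse in models.items():
    print(f"{name:13s} AIC = {aic(sse, N_OBS, N_PARAMS):9.3f}  "
          f"SBC = {sbc(sse, N_OBS, N_PARAMS):9.3f}")
```

Lower values are better for both criteria, which matches the ranking in the table.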

---

## Comparison of Models Cont.

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>Vast improvement in Log transformation model</li>
  <li>Normality?</li>
</ul>

---

## Comparison of Models Cont.

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>Residuals vs. Fitted improvement in log transformation</li>
  <li>Can assume constant variance</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-15-1.png" width="100%" />
---

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
  Bootstrapping
</div>
---
## Bootstrapping

<ul>
  <li>Normality could not be assumed from the QQ plot.</li>
  <li>Bootstrapping provides a nonparametric comparison.</li>
  <li>It is used here to estimate confidence intervals.</li>
</ul>

---
## Bootstrapping Cont. with Cases

<ul style="font-size: 1.1em; line-height: 1.2;">
  <li>The red line represents the normal distribution curve based on standard errors from the original model</li>
  <li>The blue curve represents the density of bootstrap estimates for each coefficient</li>
  <li>Since we are resampling data points, our bootstrap estimates should reflect variability due to sampling uncertainty</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-16-1.png" width="100%" />
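
A minimal sketch of case resampling for a single-predictor regression, assuming a percentile interval. The slides do not show the actual bootstrap code, and the data here are synthetic: `xs` and `ys` are hypothetical stand-ins for one predictor and log(Average Winnings).

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def case_bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the slope from resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample whole cases
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    lower = slopes[int((alpha / 2) * n_boot)]
    upper = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic data (hypothetical, not the PGA data set).
data_rng = random.Random(42)
xs = [data_rng.uniform(55, 75) for _ in range(100)]
ys = [-5 + 0.2 * x + data_rng.gauss(0, 0.7) for x in xs]

slope = ols_slope(xs, ys)
lo, hi = case_bootstrap_ci(xs, ys)
print(f"slope = {slope:.4f}, 95% percentile CI = [{lo:.4f}, {hi:.4f}]")
```

Because each replicate redraws whole observations, the spread of the replicate slopes reflects sampling variability without relying on normal errors.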
---
## Bootstrapping Cont. with Residuals

<ul style="font-size: 1.2em; line-height: 1.5;">
  <li>The red and blue curves are interpreted the same way as in the case resampling method</li>
  <li>The bootstrap estimates are again approximately normal, meaning the residual resampling does not introduce significant bias</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-18-1.png" width="100%" />
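
For comparison, a sketch of residual resampling under the same simplifying assumptions (one predictor, synthetic data, hypothetical names). Here the predictor values stay fixed and only the residuals are redrawn.

```python
import random

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residual_bootstrap_slopes(xs, ys, n_boot=2000, seed=0):
    """Slopes refitted on fitted values plus residuals drawn with replacement."""
    rng = random.Random(seed)
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    slopes = []
    for _ in range(n_boot):
        y_star = [f + rng.choice(residuals) for f in fitted]  # x stays fixed
        slopes.append(fit_line(xs, y_star)[1])
    return slopes

# Synthetic data (hypothetical, not the PGA data set).
data_rng = random.Random(42)
xs = [data_rng.uniform(55, 75) for _ in range(100)]
ys = [-5 + 0.2 * x + data_rng.gauss(0, 0.7) for x in xs]

b1 = fit_line(xs, ys)[1]
slopes = residual_bootstrap_slopes(xs, ys)
mean_slope = sum(slopes) / len(slopes)
print(f"original slope = {b1:.4f}, bootstrap mean = {mean_slope:.4f}")
```

A bootstrap mean close to the original estimate is what "no significant bias" looks like in this setup.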
---
## Bootstrapping Cont. CIs

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>Bootstrap CIs are consistent with the p-values</li>
  <li>The CIs for two variables (Average Drive and Age Above 30) contain zero</li>
</ul>

Table: Final Combined Inferential Statistics: Coefficients, p-values, and Bootstrap CIs

|                 |Coefficients |95% CI (Bootstrap t)   |95% CI (Bootstrap r)   |p-values |
|:----------------|:------------|:----------------------|:----------------------|:--------|
|(Intercept)      |-5.4513      |[ -9.8419 ,  -0.4481 ] |[ -10.164 ,  -0.7775 ] |0.0233   |
|Average_Drive    |0.0058       |[ -0.0064 ,  0.0185 ]  |[ -0.0075 ,  0.0193 ]  |0.4139   |
|Greens_on_reg    |0.1920       |[ 0.1437 ,  0.2333 ]   |[ 0.1567 ,  0.2294 ]   |0.0000   |
|Save_Percent     |0.0563       |[ 0.0384 ,  0.075 ]    |[ 0.0368 ,  0.0756 ]   |0.0000   |
|Number_events    |-0.0450      |[ -0.0682 ,  -0.0192 ] |[ -0.0664 ,  -0.0226 ] |0.0002   |
|Age_Above_30TRUE |0.0334       |[ -0.2875 ,  0.3338 ]  |[ -0.2435 ,  0.3138 ]  |0.8108   |
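
The agreement between the intervals and the p-values can be checked directly from the table. This sketch assumes a 0.05 significance level, matching the 95% intervals. The numbers are copied from the table, and the helper name `excludes_zero` is ours.

```python
# (bootstrap t CI, bootstrap r CI, p-value), copied from the table above.
rows = {
    "(Intercept)":      ((-9.8419, -0.4481), (-10.164, -0.7775), 0.0233),
    "Average_Drive":    ((-0.0064, 0.0185),  (-0.0075, 0.0193),  0.4139),
    "Greens_on_reg":    ((0.1437, 0.2333),   (0.1567, 0.2294),   0.0000),
    "Save_Percent":     ((0.0384, 0.075),    (0.0368, 0.0756),   0.0000),
    "Number_events":    ((-0.0682, -0.0192), (-0.0664, -0.0226), 0.0002),
    "Age_Above_30TRUE": ((-0.2875, 0.3338),  (-0.2435, 0.3138),  0.8108),
}

def excludes_zero(ci):
    """True when the interval lies entirely above or entirely below zero."""
    lower, upper = ci
    return lower > 0 or upper < 0

agreement = {}
for name, (ci_t, ci_r, p_value) in rows.items():
    significant = p_value < 0.05
    agreement[name] = (excludes_zero(ci_t) == significant
                       and excludes_zero(ci_r) == significant)
    print(f"{name:17s} significant={significant!s:5s} CIs agree={agreement[name]}")
```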
---
## Model Selection

<ul>
  <li><u>Log Transformation</u>
      <br> - Positives: Better QQ plot and Residuals vs. Fitted plot; highest R-squared and adjusted R-squared values.
      <br> - Negatives: Age and Average Drive are not statistically significant.</li>

  <li><u>Linear Model</u>
      <br> - Positives: Only Age is not statistically significant, and the model is simple.
      <br> - Negatives: Cannot assume constant variance or normality.</li>
</ul>

---
## Final Model

<ul style="font-size: 1.2em; line-height: 1.6;">
  <li>log(Average Winnings) = −5.4512847 + 0.0057567 × Average_Drive + 0.1920075 × Greens_on_reg + 0.0563106 × Save_Percent − 0.0450036 × Number_events + 0.0333772 × Age_Above_30TRUE</li>
  <li>Convert the log model back to the original scale by exponentiation</li>
  <li>Percent change in Average Winnings per one-unit increase, holding all other variables constant
    <br> - Average Drive = 0.58%
    <br> - Save Percent = 5.8%
    <br> - Number of Events = -4.4%
    <br> - Age Above 30 = 3.4%
    <br> - Greens in Regulation = 21.18%</li>

</ul>
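
The conversion by exponentiation can be sketched as follows: for a coefficient b in a log-response model, a one-unit increase in the predictor multiplies the response by exp(b), a change of (exp(b) − 1) × 100 percent. The coefficients are those of the final model above; the helper name `pct_change` is ours, and the last digit may differ slightly from the slide values because of rounding.

```python
import math

# Coefficients of the final log model, as reported on this slide.
coefficients = {
    "Average_Drive":    0.0057567,
    "Greens_on_reg":    0.1920075,
    "Save_Percent":     0.0563106,
    "Number_events":   -0.0450036,
    "Age_Above_30TRUE": 0.0333772,
}

def pct_change(beta: float) -> float:
    """Percent change in the response for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1.0) * 100.0

effects = {name: pct_change(beta) for name, beta in coefficients.items()}
for name, effect in effects.items():
    print(f"{name:17s} {effect:+.2f}%")
```

For small coefficients the percent change is close to 100 × b, which is why Average Drive (b ≈ 0.0058) maps to roughly 0.58%.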

---
##

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
  Conclusion
</div>

---
## Conclusion

<ul style="margin-top: 15px; margin-bottom: 15px;">
  
  <li>Greens in Regulation is the strongest predictor: a one-unit increase is associated with a 21.18% increase in Average Winnings.</li>

  <li>Holding all other variables constant, a one-unit increase in Average Drive is associated with a 0.58% increase in Average Winnings.</li>

  <li>Short game (e.g., Greens in Regulation or Save Percentage) has a larger impact on winnings than the long game. <br>
      - Short game impact: Greens in Regulation: 21.18% and Save Percentage: 5.8% <br>
      - Long game (Average Drive) impact: 0.58%</li>

  <li>Each additional event is associated with a 4.4% decrease in Average Winnings.</li>

</ul>

---
## Limitations

<ul>
  <li>Average Drive and Age are not statistically significant predictors.</li>

  <li>Log transformation is sensitive to outliers and could amplify small values.</li>

  <li>The model assumes a linear relationship.</li>

  <li>Factors such as injury, weather, start time, and mental focus are not included in this dataset.</li>
</ul>

---
##

<div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;">
  Questions?
</div>

---
## Works Cited

<ul>
<li>Datasets. (n.d.). <a href="https://users.stat.ufl.edu/~winner/datasets.html">https://users.stat.ufl.edu/~winner/datasets.html</a></li>
</ul>

---
##

<div style="display: flex; justify-content: center; align-items: center; height: 100%; font-size: 50px;">
  Thank You!
</div>

---
## Slide Contributors

<div style="font-size: 40px; line-height: 2;">
  <ul>
    <li>Ryan Lebo prepared the slides from the Introduction through the Linear Model</li>
    <li>Tyler Battaglini prepared the slides from the Box-Cox Transformation through the Conclusion</li>
  </ul>
</div>