class: center, middle, inverse, title-slide

.title[
# Factors Influencing Golf Earnings
]
.subtitle[
## Linear Regression
]
.author[
### Tyler Battaglini & Ryan Lebo
]
.date[
### 2025-02-20
]

---

<!-- every new slide is created under three dashes (---) -->
<!-- (<h1) makes the title for the slide -->

## Table of Contents

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Introduction</li>
<li>Variables</li>
<li>Research Question</li>
<li>Exploratory Data Analysis</li>
<li>Linear Model</li>
<li>Box-Cox Transformation</li>
<li>Log Model</li>
<li>Bootstrapping</li>
<li>Model Selection</li>
<li>Conclusion</li>
</ul>

---

## Introduction

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>PGA 2004 data set (196 participants)</li>
<li>What is the PGA?</li>
<li>Data set provides earnings and player statistics</li>
</ul>

---

## Variables

<ul style="font-size: 1.2em; line-height: 1.6;">

.pull-left[
- Name
- Age
- Average Drive
- Driving Accuracy
- Greens in Regulation
- Average Number of Putts
]

.pull-right[
- Save Percentage
- Money Rank
- Number of Events
- Total Winnings
- Average Winnings
]

</ul>

---

## Research Question

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Which variables affect players' winnings in a given season?
</li>
<li>Average drive vs. earnings</li>
<li>Short game vs. long game</li>
</ul>

---

## <div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;"> Exploratory Data Analysis </div>

---

## Exploratory Data Analysis

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Check for high correlation</li>
<li>Remove observations with missing values</li>
<li>Remove some variables</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-2-1.png" width="80%" />

---

## <div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;"> Linear Model </div>

---

## Linear Model

Table: Statistics of Regression Coefficients

|                 | Estimate| Std. Error| t value| Pr(>&#124;t&#124;)|
|:----------------|--------:|----------:|-------:|------------------:|
|(Intercept)      |  -15.185|      2.640|  -5.751|              0.000|
|Average_Drive    |    0.021|      0.008|   2.750|              0.007|
|Greens_on_reg    |    0.130|      0.022|   6.039|              0.000|
|Save_Percent     |    0.044|      0.011|   3.962|              0.000|
|Number_events    |   -0.063|      0.013|  -4.851|              0.000|
|Age_Above_30TRUE |    0.063|      0.154|   0.410|              0.682|

---

## Linear Model

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Residuals are not normally distributed</li>
<li>Several outliers</li>
<li>Non-constant variance</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-5-1.png" width="100%" /><img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-5-2.png" width="100%" />

---

## Linear Model

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>All VIF values are below 5</li>
<li>Little to no multicollinearity</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-7-1.png" width="80%" />

---

## <div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;"> Box-Cox Transformation </div>

---

## Box-Cox Transformation

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>All the lambda values are close to
0</li>
<li>Proceed with a log transformation</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-9-1.png" width="100%" />

---

## Log Transformation Model

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Response variable: log(Average Winnings)</li>
<li>Explanatory variables: Average Drive, Greens in Regulation, Save Percentage, Number of Events, and Age Above 30</li>
</ul>

|                 | Estimate| Std. Error| t value| Pr(>&#124;t&#124;)|
|:----------------|--------:|----------:|-------:|------------------:|
|(Intercept)      |   -5.451|      2.383|  -2.287|              0.023|
|Average_Drive    |    0.006|      0.007|   0.819|              0.414|
|Greens_on_reg    |    0.192|      0.019|   9.875|              0.000|
|Save_Percent     |    0.056|      0.010|   5.562|              0.000|
|Number_events    |   -0.045|      0.012|  -3.857|              0.000|
|Age_Above_30TRUE |    0.033|      0.139|   0.240|              0.811|

---

## Goodness of Fit Measures

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Improvement in constant variance</li>
<li>Improvement in normality</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-11-1.png" width="100%" /><img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-11-2.png" width="100%" />

---

## Comparison of Models

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Log model outperforms the linear model</li>
<li>Lower SSE indicates a better fit</li>
<li>R-squared and adjusted R-squared are higher in the log model</li>
</ul>

|             |     SSE|  R.sq| R.adj| Cp|      AIC|     SBC|   PRESS|
|:------------|-------:|-----:|-----:|--:|--------:|-------:|-------:|
|full.model   | 123.038| 0.335| 0.316|  6|  -64.865| -45.510| 135.110|
|log.winnings | 100.246| 0.455| 0.440|  6| -102.970| -83.615| 107.534|

---

## Comparison of Models Cont.

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Vast improvement in the log transformation model</li>
<li>Normality?</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-14-1.png" width="100%" />

---

## Comparison of Models Cont.
<ul style="font-size: 1.2em; line-height: 1.6;">
<li>Residuals vs. Fitted plot improves under the log transformation</li>
<li>Can assume constant variance</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-15-1.png" width="100%" />

---

## <div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;"> Bootstrapping </div>

---

## Bootstrapping

<div style="font-size: 32px; line-height: 3;">
<ul style="margin-top: 15px; margin-bottom: 15px;">
<li>We did not assume normality based on the QQ plot.</li>
<li>Bootstrapping provides a nonparametric comparison.</li>
<li>Used to estimate confidence intervals.</li>
</ul>
</div>

---

## Bootstrapping Cont. with Cases

<ul style="font-size: 1.1em; line-height: 1.2;">
<li>The red line represents the normal distribution curve based on standard errors from the original model</li>
<li>The blue curve represents the density of bootstrap estimates for each coefficient</li>
<li>Since we are resampling data points, our bootstrap estimates should reflect variability due to sampling uncertainty</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-16-1.png" width="100%" />

---

## Bootstrapping Cont. with Residuals

<ul style="font-size: 1.2em; line-height: 1.5;">
<li>The red and blue curves are interpreted the same way as in the case resampling method</li>
<li>The bootstrap estimates are again approximately normal, meaning the residual resampling does not introduce significant bias</li>
</ul>

<img src="Presentation-1-Capstone-First-Draft_files/figure-html/unnamed-chunk-18-1.png" width="100%" />

---

## Bootstrapping Cont. CIs

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>CIs are consistent with the p-values</li>
<li>CIs for two variables (Average Drive and Age Above 30) contain 0</li>
</ul>

Table: Final Combined Inferential Statistics: Coefficients, p-values, and Bootstrap CIs

|                 |Coefficients |95% CI (Bootstrap t)   |95% CI (Bootstrap r)   |p-values |
|:----------------|:------------|:----------------------|:----------------------|:--------|
|(Intercept)      |-5.4513      |[ -9.8419 , -0.4481 ]  |[ -10.164 , -0.7775 ]  |0.0233   |
|Average_Drive    |0.0058       |[ -0.0064 , 0.0185 ]   |[ -0.0075 , 0.0193 ]   |0.4139   |
|Greens_on_reg    |0.1920       |[ 0.1437 , 0.2333 ]    |[ 0.1567 , 0.2294 ]    |0.0000   |
|Save_Percent     |0.0563       |[ 0.0384 , 0.075 ]     |[ 0.0368 , 0.0756 ]    |0.0000   |
|Number_events    |-0.0450      |[ -0.0682 , -0.0192 ]  |[ -0.0664 , -0.0226 ]  |0.0002   |
|Age_Above_30TRUE |0.0334       |[ -0.2875 , 0.3338 ]   |[ -0.2435 , 0.3138 ]   |0.8108   |

---

## Model Selection

<div style="font-size: 26px; line-height: 1.5;">
<ul style="margin-top: 15px; margin-bottom: 15px;">
<li><u>Log Transformation</u> <br> - Positives: Better QQ plot and Residuals vs. Fitted plot; higher R-squared and adjusted R-squared. <br> - Negatives: Age and Average Drive are not significant.</li>
<li><u>Linear Model</u> <br> - Positives: Only Age is not significant, and the model is simple.
<br> - Negatives: Cannot assume constant variance or normality.</li>
</ul>
</div>

---

## Final Model

<ul style="font-size: 1.2em; line-height: 1.6;">
<li>log(Average Winnings) = −5.4512847 + 0.0057567 × Average_Drive + 0.1920075 × Greens_on_reg + 0.0563106 × Save_Percent − 0.0450036 × Number_events + 0.0333772 × Age_Above_30TRUE</li>
<li>Convert the log model back to the original scale by exponentiation</li>
<li>Percent change in Average Winnings per one-unit increase, holding all other variables constant <br> - Average Drive = 0.58% <br> - Save Percent = 5.8% <br> - Number of Events = -4.4% <br> - Age Above 30 = 3.4% <br> - Greens in Regulation = 21.17%</li>
</ul>

---

## <div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;"> Conclusion </div>

---

## Conclusion

<div style="font-size: 26px; line-height: 2;">
<ul style="margin-top: 15px; margin-bottom: 15px;">
<li>Greens in Regulation has the largest effect on winnings: a one-unit increase is associated with a 21.17% increase in Average Winnings.</li>
<li>Holding all other variables constant, a one-unit increase in Average Drive is associated with a 0.58% increase in Average Winnings.</li>
<li>Short game (e.g., greens in regulation or save percent) has a larger impact on winnings than the long game. <br> - Short game impact: Greens in Regulation: 21.17% and Save Percentage: 5.8% <br> - Long game (Average Drive) impact: 0.58%</li>
<li>Each additional event is associated with a 4.4% decrease in Average Winnings.</li>
</ul>
</div>

---

## Limitations

<div style="font-size: 26px; line-height: 2;">
<ul style="margin-top: 15px; margin-bottom: 15px;">
<li>Average Drive and Age are not statistically significant.</li>
<li>The log transformation is sensitive to outliers and could amplify small values.</li>
<li>The model assumes a linear relationship.</li>
<li>Factors such as injury, weather, start time, and mental focus are not included in this dataset.</li>
</ul>
</div>

---

## <div style="display: flex; justify-content: center; align-items: center; height: 80vh; font-size: 50px;"> Questions?
</div>

---

## Works Cited

<div style="font-size: 35px; line-height: 1.5;">
<ul style="margin-top: 15px; margin-bottom: 15px;">
<li>Datasets. (n.d.). <a href="https://users.stat.ufl.edu/~winner/datasets.html">https://users.stat.ufl.edu/~winner/datasets.html</a></li>
</ul>
</div>

---

## <style> section { background-color: #A9D1D6; height: 100%; } </style>
<div style="display: flex; justify-content: center; align-items: center; height: 100%; font-size: 50px;"> Thank You! </div>

---

## Slide Contributors

<style> section { background-color: #D1E2FF; height: 100%; } </style>

<div style="font-size: 40px; line-height: 2;">
<ul>
<li>Ryan Lebo prepared the slides from the Introduction through the Linear Model</li>
<li>Tyler Battaglini prepared the slides from the Box-Cox Transformation through the Conclusion</li>
</ul>
</div>
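
---

## Appendix: Percent Effects from Log Coefficients

The percent effects on the Final Model slide come from exponentiating each coefficient of the log model. The sketch below shows that conversion in Python, assuming the coefficients printed on that slide. The names `LOG_MODEL_COEFS`, `percent_change`, and `percent_effects` are illustrative and not part of the original analysis.

```python
import math

# Coefficients of the log(Average Winnings) model, copied from the Final Model slide.
# The dictionary name and the helper below are illustrative assumptions.
LOG_MODEL_COEFS = {
    "Average_Drive": 0.0057567,
    "Greens_on_reg": 0.1920075,
    "Save_Percent": 0.0563106,
    "Number_events": -0.0450036,
    "Age_Above_30TRUE": 0.0333772,
}


def percent_change(beta: float) -> float:
    """Percent change in the response per one-unit increase in a predictor
    when the response is log-transformed: 100 * (exp(beta) - 1)."""
    return 100.0 * (math.exp(beta) - 1.0)


# Percent effect of each predictor, holding the others constant.
percent_effects = {name: percent_change(beta) for name, beta in LOG_MODEL_COEFS.items()}

for name, pct in percent_effects.items():
    print(f"{name}: {pct:+.2f}%")
```

For small coefficients, exp(β) − 1 is close to β, so the percent effect is roughly 100 × β. The gap widens for larger coefficients such as `Greens_on_reg`.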