Introduction
In this assignment we look at a data set from the Ladies Professional
Golf Association (LPGA), the women's professional golf tour. Greens in
regulation is the response variable. We want to see how the other golf
statistics in the data set are associated with greens in regulation,
and whether the initial model needs any transformations. We fit a
linear model and run the necessary diagnostics to find a good model.
Data Description
The data in this project were taken from
https://users.stat.ufl.edu/~winner/datasets.html. The variables are:
- Golfer- Name of the Golfer
- Nation- Where the golfer is from
- Region- What region the golfer is from
- fairways- How many fairways the golfer hit in regulation
- fairAtt- How many attempts the golfer took to get to the
fairway
- fairPct- The percent of fairways hit in regulation
- totPutts- Total number of putts the golfer had
- totRounds- Total number of rounds played by the golfer
- avePutts- Average number of putts per hole after reaching the
green
- greenReg- How many greens were hit in regulation
- totPrize- Amount of money won
- events- How many events the golfer went to
- driveDist- The average distance that the golfer hit with their
drive
- sandSaves- The number of sand saves the golfer had
- sandAtt- The number of shots taken from the sand
- sandPct- The percentage of shots that made it out of the sand
Practical Question
The goal of this study is to determine how greens in regulation is
associated with the predictor variables available in this data set.
Exploratory Data Analysis
We first look at how each predictor variable relates to the response
variable, greens in regulation.





In the first scatter plot, greens in regulation shows a positive linear
trend with fairways hit: golfers who hit more fairways tend to hit more
greens in regulation.
In the scatter plot for fairway percentage, the points are concentrated
at the higher percentages, so the distribution of fairway percentage
looks left-skewed. Most golfers in this data set hit a high percentage
of fairways.
The scatter plot for drive distance shows a positive linear trend as
well.
In the scatter plot for sand attempts, the points are concentrated at
the lower values, so the distribution looks slightly right-skewed, with
two possible outliers beyond 120 sand attempts.
The scatter plot for sand percentage also shows a positive linear
trend.
Full model and diagnostics
We next fit a linear model with the remaining predictor variables.
Based on prior knowledge of golf, we removed Golfer, Nation, Region,
totPrize, totRounds, and events: a golfer's name and origin, the number
of events and rounds played, and total prize money cannot influence
whether a green is hit in regulation. We also dropped totPutts,
avePutts, sandSaves, and fairAtt, because they either are not directly
related to reaching the green in regulation or describe what happens
after the golfer is already on the green.
Regression Coefficients
|             | Estimate   | Std. Error | t value   | Pr(>\|t\|) |
|:------------|-----------:|-----------:|----------:|-----------:|
| (Intercept) | -9.9257920 | 8.8445752  | -1.122246 | 0.2635273  |
| fairway     | 0.0191607  | 0.0018383  | 10.423042 | 0.0000000  |
| fairPct     | 0.1406976  | 0.0467571  | 3.009119  | 0.0030680  |
| driveDist   | 0.2407452  | 0.0253429  | 9.499498  | 0.0000000  |
| sandAtt     | -0.1105811 | 0.0172524  | -6.409602 | 0.0000000  |
| sandPct     | 0.0525978  | 0.0249874  | 2.104974  | 0.0369384  |
Now we should look at our residual diagnostic analysis to check how
reliable our model is.

In these residual plots, the points in the normal Q-Q plot follow the
reference line, which suggests that the residuals are approximately
normally distributed. The residuals vs fitted plot shows no cone
(funnel) shape, which suggests that the variance is roughly constant.
Goodness-of-fit Measures
Now we look at the goodness-of-fit measures for the model.
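Goodness-of-fit Measures of Full Model
|            | SSE      | R.sq      | R.adj     | Cp | AIC      | SBC      | PRESS    |
|:-----------|---------:|----------:|----------:|---:|---------:|---------:|---------:|
| full.model | 787.5965 | 0.6780546 | 0.6674643 | 6  | 265.8098 | 284.1853 | 862.4798 |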
With 158 golfers, the sample is reasonably large. From the table above,
R-squared is 0.678 (adjusted R-squared 0.667), so the model explains
about 68% of the variation in greens in regulation.
Final Model
These are the statistics of the chosen model.
Stats of Final Model
|             | Estimate   | Std. Error | t value   | Pr(>\|t\|) |
|:------------|-----------:|-----------:|----------:|-----------:|
| (Intercept) | -9.9257920 | 8.8445752  | -1.122246 | 0.2635273  |
| fairway     | 0.0191607  | 0.0018383  | 10.423042 | 0.0000000  |
| fairPct     | 0.1406976  | 0.0467571  | 3.009119  | 0.0030680  |
| driveDist   | 0.2407452  | 0.0253429  | 9.499498  | 0.0000000  |
| sandAtt     | -0.1105811 | 0.0172524  | -6.409602 | 0.0000000  |
| sandPct     | 0.0525978  | 0.0249874  | 2.104974  | 0.0369384  |
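With a sample size of 158, the Central Limit Theorem supports the
validity of the p-values even if the errors are not exactly normal. The
p-values of all five predictors are below 0.05 (the largest is 0.037,
for sandPct), so every slope coefficient is statistically significant
at the 5% level. Only the intercept is not significant (p = 0.26).
Because every predictor is significant, there is no need to perform
variable selection; the full model is the final model.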
Conclusion/Discussion
We did not need a transformation such as Box-Cox on the response
variable, because the residual plots indicated that the
constant-variance assumption was met. Based on prior knowledge of golf,
we removed the variables that could not plausibly be associated with
reaching the green in regulation.
We used the residual plots and the goodness-of-fit measures to assess
the model. We concluded that the model needs no transformation, that
all five predictors are significant, and that it explains about 68% of
the variation in greens in regulation.
---
title: "What helps a golfer get to greens in regulation?"
author: "Ryan Lebo"
date: "2024-10-22"
output: 
  html_document:
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document:
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document:
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options:
  chunk_output_type: inline
always_allow_html: true
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
#
# Default options for all code chunks
knitr::opts_chunk$set(echo = FALSE,  # hide the code itself in the output file
                   warning = FALSE,  # do not show warnings produced by the code
                   message = FALSE,  # do not show messages (e.g., from loading packages)
                   results = TRUE,   # show the results of the code in the output file
                   comment = FALSE   # Suppress hash-tags in the output results.
                      )   
```

# Introduction
In this assignment we look at a data set from the Ladies Professional Golf Association (LPGA), the women's professional golf tour. Greens in regulation is the response variable. We want to see how the other golf statistics in the data set are associated with greens in regulation, and whether the initial model needs any transformations. We fit a linear model and run the necessary diagnostics to find a good model.



## Data Description
The data in this project were taken from <https://users.stat.ufl.edu/~winner/datasets.html>. The variables are:

* Golfer- Name of the Golfer
* Nation- Where the golfer is from
* Region- What region the golfer is from
* fairways- How many fairways the golfer hit in regulation
* fairAtt- How many attempts the golfer took to get to the fairway
* fairPct- The percent of fairways hit in regulation
* totPutts- Total number of putts the golfer had
* totRounds- Total number of rounds played by the golfer
* avePutts- Average number of putts per hole after reaching the green
* greenReg- How many greens were hit in regulation
* totPrize- Amount of money won
* events- How many events the golfer went to
* driveDist- The average distance that the golfer hit with their drive
* sandSaves- The number of sand saves the golfer had
* sandAtt- The number of shots taken from the sand
* sandPct- The percentage of shots that made it out of the sand

## Practical Question
The goal of this study is to determine how greens in regulation is associated with the predictor variables available in this data set.

# Exploratory Data Analysis

We first look at how each predictor variable relates to the response variable, greens in regulation.

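```{r fig.align='center'}
# Read the LPGA data and drop the first column
lpga0 <- read.csv("https://raw.githubusercontent.com/RyanLebo/STA-321/refs/heads/main/lpga2022.csv", header = TRUE)
lpga <- lpga0[, -1]

# Short names for the response and the predictors plotted below
fairway <- lpga$fairways
greens <- lpga$greenReg
fpct <- lpga$fairPct
driver_distance <- lpga$driveDist
s_ATT <- lpga$sandAtt
s_PCT <- lpga$sandPct

# Each plot gets dashed red reference lines at the sample means of its own
# x and y variables (the original reused one fixed pair of constants).
plot(fairway, greens, main = "Greens in regulation vs Fairways hit")
abline(v = mean(fairway, na.rm = TRUE), h = mean(greens, na.rm = TRUE), col = "red", lty = 2)

plot(fpct, greens, main = "Greens in regulation vs Fairway percentage")
abline(v = mean(fpct, na.rm = TRUE), h = mean(greens, na.rm = TRUE), col = "red", lty = 2)

plot(driver_distance, greens, main = "Greens in regulation vs Drive distance")
abline(v = mean(driver_distance, na.rm = TRUE), h = mean(greens, na.rm = TRUE), col = "red", lty = 2)

plot(s_ATT, greens, main = "Greens in regulation vs Sand attempts")
abline(v = mean(s_ATT, na.rm = TRUE), h = mean(greens, na.rm = TRUE), col = "red", lty = 2)

plot(s_PCT, greens, main = "Greens in regulation vs Sand percentage")
abline(v = mean(s_PCT, na.rm = TRUE), h = mean(greens, na.rm = TRUE), col = "red", lty = 2)
```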

In the first scatter plot, greens in regulation shows a positive linear trend with fairways hit: golfers who hit more fairways tend to hit more greens in regulation.

In the scatter plot for fairway percentage, the points are concentrated at the higher percentages, so the distribution of fairway percentage looks left-skewed. Most golfers in this data set hit a high percentage of fairways.

The scatter plot for drive distance shows a positive linear trend as well.

In the scatter plot for sand attempts, the points are concentrated at the lower values, so the distribution looks slightly right-skewed, with two possible outliers beyond 120 sand attempts.

The scatter plot for sand percentage also shows a positive linear trend.
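
The visual impressions above can be backed up with a numeric summary. The following is a minimal sketch, not part of the original analysis: it uses a simulated stand-in data frame (`sim`, with hypothetical values that only borrow the column names `fairPct`, `driveDist`, `sandPct`, and `greenReg`) and computes the Pearson correlation of each predictor with the response.

```r
# Simulated stand-in for the LPGA data (hypothetical values, NOT the real data set)
set.seed(321)
n <- 158
sim <- data.frame(
  fairPct   = rnorm(n, mean = 72, sd = 6),
  driveDist = rnorm(n, mean = 255, sd = 9),
  sandPct   = rnorm(n, mean = 40, sd = 10)
)
# Response built to depend positively on fairPct and driveDist only
sim$greenReg <- 0.15 * sim$fairPct + 0.25 * sim$driveDist + rnorm(n, sd = 2.5)

# Pearson correlation of each predictor with the response
predictors <- setdiff(names(sim), "greenReg")
cors <- sapply(sim[predictors], cor, y = sim$greenReg)

# Strongest positive association first
round(sort(cors, decreasing = TRUE), 3)
```

With the real data, the same `sapply()` call could be applied to the predictor columns of `lpga`. A correlation only measures linear association, so it complements the scatter plots rather than replacing them.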



## Full model and diagnostics
We next fit a linear model with the remaining predictor variables. Based on prior knowledge of golf, we removed Golfer, Nation, Region, totPrize, totRounds, and events: a golfer's name and origin, the number of events and rounds played, and total prize money cannot influence whether a green is hit in regulation. We also dropped totPutts, avePutts, sandSaves, and fairAtt, because they either are not directly related to reaching the green in regulation or describe what happens after the golfer is already on the green.

```{r}
full.model = lm(greens ~ fairway+ fairPct+ driveDist+ sandAtt+ sandPct, data = lpga)
kable(summary(full.model)$coef, caption ="Regression Coefficients")

```




Now we should look at our residual diagnostic analysis to check how reliable our model is.

```{r}
par(mfrow=c(2,2))
plot(full.model)

```

In these residual plots, the points in the normal Q-Q plot follow the reference line, which suggests that the residuals are approximately normally distributed. The residuals vs fitted plot shows no cone (funnel) shape, which suggests that the variance is roughly constant.
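
Reading residual plots is subjective, so formal tests can serve as a cross-check. The sketch below is an illustration on simulated data (the variables `x1`, `x2`, and `y` are hypothetical), not the analysis performed in this report. It runs a Shapiro-Wilk test for normality of the residuals and computes a Breusch-Pagan statistic (Koenker's studentized form) by hand for constant variance, using only base R.

```r
# Simulated data with normal, constant-variance errors by construction
set.seed(321)
n  <- 158
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Normality: Shapiro-Wilk test (H0: residuals are normally distributed)
sw <- shapiro.test(resid(fit))

# Constant variance: regress squared residuals on the predictors;
# n * R^2 of this auxiliary regression is approximately chi-square
# with df = number of predictors under H0 (homoscedasticity).
aux  <- lm(resid(fit)^2 ~ x1 + x2)
bp   <- n * summary(aux)$r.squared
bp_p <- pchisq(bp, df = 2, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp, bp_p = bp_p)
```

Large p-values in both tests would be consistent with the assumptions read off the plots; small p-values would suggest looking at a transformation.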


## Goodness-of-fit Measures

Now we look at the goodness-of-fit measures for the model.

```{r}
# Goodness-of-fit measures for a fitted lm object `m`
select=function(m){ 
 e = resid(m)                          # residuals
 n0 = length(e)                        # sample size
 p = n0 - m$df.residual                # number of estimated coefficients
 MSE=(summary(m)$sigma)^2              # mean squared error
 SSE=(m$df.residual)*MSE               # error sum of squares
 R.sq=summary(m)$r.squared             # R-squared
 R.adj=summary(m)$adj.r.squared        # adjusted R-squared
 Cp=(SSE/MSE)-(n0-2*p)                 # Mallows' Cp (equals p when MSE is the model's own)
 AIC=n0*log(SSE)-n0*log(n0)+2*p        # SSE-based AIC
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*p  # Schwarz Bayesian criterion
 d=e/(1-hatvalues(m))                  # deleted (leave-one-out) residuals
 PRESS=sum(d^2)                        # prediction sum of squares
 tbl = data.frame(SSE=SSE, R.sq=R.sq, R.adj=R.adj, Cp=Cp, AIC=AIC, SBC=SBC, PRESS=PRESS)
 tbl
 }

```
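
The PRESS statistic in the function above relies on the identity that the leave-one-out prediction error for observation $i$ equals $e_i/(1-h_{ii})$, where $h_{ii}$ is the $i$-th diagonal element of the hat matrix. As a hedged sanity check, the sketch below compares that shortcut with an explicit refit that leaves out one row at a time. The data frame `d` is simulated and its columns are hypothetical.

```r
# Simulated data (hypothetical), small enough to refit n times quickly
set.seed(321)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 + 1.5 * d$x1 - 0.5 * d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

# Shortcut: deleted residual = e_i / (1 - h_ii), no refitting needed
press_shortcut <- sum((resid(fit) / (1 - hatvalues(fit)))^2)

# Brute force: drop row i, refit, predict row i, accumulate squared error
press_loo <- sum(sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = d[-i, ])
  (d$y[i] - predict(fit_i, newdata = d[i, ]))^2
}))

# The two values should agree up to floating-point error
all.equal(press_shortcut, press_loo)
```

Because $0 < h_{ii} < 1$, PRESS is always larger than SSE; a PRESS far above SSE would hint that a few observations have strong influence on the fit.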

```{r}
output.sum = rbind(select(full.model))
row.names(output.sum) = c("full.model")
kable(output.sum, caption = "Goodness-of-fit Measures of Full Model")
```

With 158 golfers, the sample is reasonably large. From the table above, $R^2$ is 0.678 (adjusted $R^2$ 0.667), so the model explains about 68% of the variation in greens in regulation.
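
For reference, the measures computed by `select()` can be written out as follows, where $n$ is the sample size, $p$ is the number of estimated coefficients (including the intercept), $e_i$ are the residuals, and $h_{ii}$ are the diagonal elements of the hat matrix. The notation is introduced here for illustration and follows the code rather than any particular textbook.

$$
\begin{aligned}
\text{SSE} &= \sum_{i=1}^{n} e_i^2, \qquad \text{MSE} = \frac{\text{SSE}}{n-p},\\
C_p &= \frac{\text{SSE}}{\text{MSE}} - (n - 2p),\\
\text{AIC} &= n\ln(\text{SSE}) - n\ln(n) + 2p,\\
\text{SBC} &= n\ln(\text{SSE}) - n\ln(n) + p\ln(n),\\
\text{PRESS} &= \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^2 .
\end{aligned}
$$

Because the function uses the model's own MSE, $C_p = (n-p) - (n-2p) = p$ for any single model, which is why the table reports $C_p = 6$. As a check on the table with $n = 158$, $p = 6$, and $\text{SSE} = 787.5965$:

$$
\text{AIC} = 158\,\ln\!\left(\frac{787.5965}{158}\right) + 2(6) \approx 253.81 + 12 = 265.81,
\qquad
\text{SBC} \approx 253.81 + 6\ln(158) \approx 284.19 .
$$

Note that this SSE-based AIC differs from the value returned by R's `AIC()` by an additive constant, so the two should not be mixed when comparing models.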



# Final Model

These are the statistics of the chosen model.

```{r}
kable(summary(full.model)$coef, caption = "Stats of Final Model")
```

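With a sample size of 158, the Central Limit Theorem supports the validity of the p-values even if the errors are not exactly normal. The p-values of all five predictors are below 0.05 (the largest is 0.037, for sandPct), so every slope coefficient is statistically significant at the 5% level. Only the intercept is not significant (p = 0.26).

Because every predictor is significant, there is no need to perform variable selection; the full model is the final model.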
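
Confidence intervals give the same significance decision as the p-values while also showing the plausible size of each effect. The sketch below is illustrative only: it fits a model to simulated data (hypothetical `x1`, `x2`, `y`) and lists each estimate with its 95% interval and a 5%-level significance flag.

```r
# Simulated data: x1 has a real effect, x2 has none (hypothetical example)
set.seed(321)
n  <- 158
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

coefs <- summary(fit)$coefficients       # estimate, std. error, t value, p-value
pvals <- coefs[, "Pr(>|t|)"]
ci    <- confint(fit, level = 0.95)      # t-based 95% confidence intervals

# A coefficient is significant at the 5% level exactly when
# its 95% interval excludes zero.
cbind(estimate = coefs[, "Estimate"], ci, significant = pvals < 0.05)
```

Applied to `full.model`, the same `confint()` call would show how precisely each slope is estimated.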


# Conclusion/Discussion

We did not need a transformation such as Box-Cox on the response variable, because the residual plots indicated that the constant-variance assumption was met. Based on prior knowledge of golf, we removed the variables that could not plausibly be associated with reaching the green in regulation.

We used the residual plots and the goodness-of-fit measures to assess the model. We concluded that the model needs no transformation, that all five predictors are significant, and that it explains about 68% of the variation in greens in regulation.
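
Had the diagnostics suggested a problem, a Box-Cox analysis would be one way to check whether a transformation is called for. The following is a minimal sketch on simulated data (hypothetical `x` and `y`, with a strictly positive response as Box-Cox requires), using `boxcox()` from the MASS package. If the approximate 95% interval for $\lambda$ contains 1, the data do not call for transforming the response.

```r
library(MASS)  # provides boxcox()

# Simulated data: linear model with a strictly positive response (hypothetical)
set.seed(321)
n <- 158
x <- runif(n, min = 1, max = 10)
y <- 20 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Profile log-likelihood over a grid of lambda values (no plot drawn)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Grid value with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood is within
# qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[in_ci])

c(lambda_hat = lambda_hat, lower = ci[1], upper = ci[2])
```

With the real model, `fit` would be replaced by `full.model`, provided every value of the response is positive.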







