1 Introduction

This project addresses the research question: is there an association between greens in regulation and the predictor variables available in the data set? We look for the best model, bootstrap it, and examine the bootstrap confidence intervals of its coefficients.

1.1 Data Description

The data I am using are statistics on LPGA golfers and how they performed in the 2024 season. I got this data set from the website https://users.stat.ufl.edu/~winner/datasets.html.

The variables in this data set are

  • Golfer- Name of the Golfer
  • Nation- Where the golfer is from
  • Region- What region the golfer is from
  • fairways- How many fairways the golfer hit in regulation
  • fairAtt- How many attempts the golfer took to get to the fairway
  • fairPct- The percent of fairways hit in regulation
  • totPutts- Total amount of putts the golfer had
  • totRounds- Total number of rounds played by the golfer
  • avePutts- Average amount of putts when you reached the green per hole
  • greenReg- How many greens were hit in regulation
  • totPrize- Amount of money won
  • events- How many events the golfer went to
  • driveDist- The average distance that the golfer hit with their drive
  • sandSaves- The amount of sand saves the golfer had
  • sandAtt- The amount of shots taken from the sand
  • sandPct- The percentage of shots that made it out of the sand

In this data set, we have sufficient information to address my research question.

1.2 Research Question

The point of this study is to figure out the association between greens in regulation and the predictor values available in this data set.

1.3 Data Preparation

We are going to take some variables out of the model. We start with Golfer, Nation, Region, totPrize, totRounds, and events, because these variables are either categorical or not relevant to the response. The number of events, the number of rounds, and the total prize money cannot influence whether a golfer gets on the green in regulation. We also drop totPutts, avePutts, sandSaves, and fairAtt, because they cannot affect a person's ability to get on the green in regulation either: they are not directly related to it, or they happen after the golfer is already on the green.

2 Model Building

Now we fit the full model and check whether the response needs a Box-Cox transformation.

Regression Coefficients
               Estimate  Std. Error    t value   Pr(>|t|)
(Intercept)  -9.9257920   8.8445752  -1.122246  0.2635273
fairway       0.0191607   0.0018383  10.423042  0.0000000
fairPct       0.1406976   0.0467571   3.009119  0.0030680
driveDist     0.2407452   0.0253429   9.499498  0.0000000
sandAtt      -0.1105811   0.0172524  -6.409602  0.0000000
sandPct       0.0525978   0.0249874   2.104974  0.0369384

In the residual plots, the points in the Q-Q plot fall close to a straight line, so the residuals are approximately normal. The residuals vs. fitted plot shows no cone shape, so the constant-variance assumption looks reasonable. Because of this, we do not need a Box-Cox transformation.

2.1 Goodness-of-Fit

Now we should look at the goodness of fit measures to try and help find the final model.

Goodness-of-fit Measures of Full Model
                 SSE       R.sq      R.adj  Cp       AIC       SBC     PRESS
full.model  787.5965  0.6780546  0.6674643   6  265.8098  284.1853  862.4798

We have a sample size of 158, which is large. The table above shows an R.sq of 0.678 (R.adj of 0.667), so the model explains about 68% of the variation in greens in regulation.

3 Bootstrap

Here we will use the bootstrap method to get a confidence interval of the coefficients in our selected model.

We now plot a histogram of the bootstrap estimates for each regression coefficient in the final model.

In each histogram the normal density curve (red) and the kernel density estimate (blue) are close together, so the bootstrap sampling distributions look approximately normal, and we would expect the bootstrap confidence intervals to be consistent with the significance tests.

The code below will get a 95% bootstrap confidence interval for the final model.

Bootstrap CI
             Estimate  Std. Error  t value  Pr(>|t|)  boot_conf.95
(Intercept)   -9.9258      8.8446  -1.1222    0.2635  [ 38.5803 , 81.9638 ]
fairway        0.0192      0.0018  10.4230    0.0000  [ 0.0121 , 0.0128 ]
fairPct        0.1407      0.0468   3.0091    0.0031  [ -0.1037 , 0.0992 ]
driveDist      0.2407      0.0253   9.4995    0.0000  [ -0.0651 , 0.0633 ]
sandAtt       -0.1106      0.0173  -6.4096    0.0000  [ -0.0266 , 0.0263 ]
sandPct        0.0526      0.0250   2.1050    0.0369  [ -0.0666 , 0.0635 ]

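Four of the five slope intervals contain 0 even though every slope has a p-value below 0.05, and the intercept interval does not contain the point estimate of -9.9258, so these intervals are not consistent with the significance tests.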

3.1 Residual Bootstrap

Below is a histogram of the residuals from the fitted model, which are the values the residual bootstrap resamples.

The histogram is slightly left-skewed, with one outlier on the far left.

Next we plot histograms of the residual bootstrap estimates of each coefficient.

In these histograms the normal density curve and the kernel density estimate are close together, so we expect the residual bootstrap confidence intervals to be consistent with the significance tests.

The 95% residual bootstrap confidence interval is shown below.

Regression Matrix with a 95% Residual Bootstrap CI
             Estimate  Std. Error  t value  Pr(>|t|)  boot_conf.95
(Intercept)   -9.9258      8.8446  -1.1222    0.2635  [ -26.799 , 7.4838 ]
fairway        0.0192      0.0018  10.4230    0.0000  [ 0.0155 , 0.0228 ]
fairPct        0.1407      0.0468   3.0091    0.0031  [ 0.0479 , 0.2326 ]
driveDist      0.2407      0.0253   9.4995    0.0000  [ 0.1908 , 0.2887 ]
sandAtt       -0.1106      0.0173  -6.4096    0.0000  [ -0.1434 , -0.0759 ]
sandPct        0.0526      0.0250   2.1050    0.0369  [ 0.0059 , 0.0996 ]

The residual bootstrap confidence intervals look better than the case-resampling (non-residual) intervals: every interval except the intercept's excludes 0, which agrees with the p-values. This is expected because the sample size is large enough for the sampling distributions of the estimated coefficients to be well approximated by normal distributions.

3.2 Combining Results

Finally, we put all inferential statistics in a single table so we can compare these results.

Final Combined Inferential Statistics
             Estimate  Std. Error  Pr(>|t|)  btc.ci.95              btr.ci.95
(Intercept)   -9.9258      8.8446    0.2635  [ 38.5803 , 81.9638 ]  [ -26.799 , 7.4838 ]
fairway        0.0192      0.0018    0.0000  [ 0.0121 , 0.0128 ]    [ 0.0155 , 0.0228 ]
fairPct        0.1407      0.0468    0.0031  [ -0.1037 , 0.0992 ]   [ 0.0479 , 0.2326 ]
driveDist      0.2407      0.0253    0.0000  [ -0.0651 , 0.0633 ]   [ 0.1908 , 0.2887 ]
sandAtt       -0.1106      0.0173    0.0000  [ -0.0266 , 0.0263 ]   [ -0.1434 , -0.0759 ]
sandPct        0.0526      0.0250    0.0369  [ -0.0666 , 0.0635 ]   [ 0.0059 , 0.0996 ]

This table shows the two sets of bootstrap confidence intervals side by side. Every residual bootstrap interval (btr.ci.95) contains its point estimate, while five of the six case-resampling intervals (btc.ci.95) do not, so the residual bootstrap intervals are clearly the better of the two.

width of the two bootstrap confidence intervals
             boot_wd  boot_wd2
(Intercept)  43.3835    0.1301
fairway       0.0008    0.1301
fairPct       0.2029    0.1301
driveDist     0.1284    0.1301
sandAtt       0.0529    0.1301
sandPct       0.1301    0.1301

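Every entry in the boot_wd2 column is 0.1301, which is the width of the last case-resampling interval (sandPct), so this column does not measure the residual bootstrap intervals. The widths implied by the residual bootstrap table are 34.2828 for the intercept and between 0.0073 and 0.1847 for the slopes.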

4 Summary and Discussion

We did not need a transformation such as Box-Cox for the response variable, because the constant-variance assumption was met. Based on my past experience with golf, we removed the variables that could not be associated with getting on the green in regulation.

Looking at the combined inferential statistics, the case-resampling intervals for every predictor except fairways hit contain 0, while the residual bootstrap intervals for all five predictors exclude 0. Fairways hit also has the largest t value (10.42), so it is the most statistically significant predictor of greens in regulation.

I cannot think of any drawbacks of this analysis or improvements to it.

In the future I will use total prize as my response variable, because that would show which golf statistic matters most for how much a golfer wins.

---
title: "What helps a golfer get to greens in regulation?"
author: "Ryan Lebo"
date: "2024-10-22"
output: 
  html_document:
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document:
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document:
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options:
  chunk_output_type: inline
always_allow_html: true
---



```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
#
# Default options for the output of every code chunk
knitr::opts_chunk$set(echo = FALSE,  # hide the code itself in the output file
                   warning = FALSE,  # hide warnings produced by the code
                   message = FALSE,  # hide messages produced by the code
                   results = TRUE,   # show text results in the output file
                   comment = NA      # no "##" prefix on printed output
                      )   
```

# Introduction
This project addresses the research question: is there an association between greens in regulation and the predictor variables available in the data set? We look for the best model, bootstrap it, and examine the bootstrap confidence intervals of its coefficients.


## Data Description
The data I am using are statistics on LPGA golfers and how they performed in the 2024 season. I got this data set from the website https://users.stat.ufl.edu/~winner/datasets.html.

The variables in this data set are 

* Golfer- Name of the Golfer
* Nation- Where the golfer is from
* Region- What region the golfer is from
* fairways- How many fairways the golfer hit in regulation
* fairAtt- How many attempts the golfer took to get to the fairway
* fairPct- The percent of fairways hit in regulation
* totPutts- Total amount of putts the golfer had 
* totRounds- Total number of rounds played by the golfer
* avePutts- Average amount of putts when you reached the green per hole
* greenReg- How many greens were hit in regulation
* totPrize- Amount of money won
* events- How many events the golfer went to
* driveDist- The average distance that the golfer hit with their drive
* sandSaves- The amount of sand saves the golfer had
* sandAtt- The amount of shots taken from the sand
* sandPct- The percentage of shots that made it out of the sand

In this data set, we have sufficient information to address my research question.

## Research Question
The point of this study is to figure out the association between greens in regulation and the predictor values available in this data set. 

## Data Preparation

We are going to take some variables out of the model. We start with Golfer, Nation, Region, totPrize, totRounds, and events, because these variables are either categorical or not relevant to the response. The number of events, the number of rounds, and the total prize money cannot influence whether a golfer gets on the green in regulation. We also drop totPutts, avePutts, sandSaves, and fairAtt, because they cannot affect a person's ability to get on the green in regulation either: they are not directly related to it, or they happen after the golfer is already on the green.
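
Dropping variables by name can be sketched as follows. This is only a minimal illustration on a made-up data frame: `toy`, `drop_vars`, and `toy_model` are hypothetical names, the values are invented, and the list of dropped columns is shortened.

```r
# Toy data frame standing in for the LPGA data (all values are made up).
toy <- data.frame(
  Golfer    = c("A", "B", "C"),
  Nation    = c("USA", "KOR", "JPN"),
  greenReg  = c(70.1, 68.4, 72.3),
  fairPct   = c(75.2, 71.0, 80.5),
  driveDist = c(255.1, 262.7, 249.9),
  totPutts  = c(1800, 1750, 1900)
)

# Columns to leave out of the model (a shortened, hypothetical list).
drop_vars <- c("Golfer", "Nation", "totPutts")

# Keep every column whose name is not in drop_vars.
toy_model <- toy[, !(names(toy) %in% drop_vars)]
names(toy_model)
```

Selecting by name rather than by position keeps the code correct even if the column order of the file changes.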

```{r fig.align='center'}
lpga0 <- read.csv("https://raw.githubusercontent.com/RyanLebo/STA-321/refs/heads/main/lpga2022.csv", header = TRUE)
lpga <- lpga0[, -1]

fairway<- lpga$fairways
greens <- lpga$greenReg 


```


# Model Building

Now we fit the full model and check whether the response needs a Box-Cox transformation.


```{r}
full.model = lm(greens ~ fairway+ fairPct+ driveDist+ sandAtt+ sandPct, data = lpga)
kable(summary(full.model)$coef, caption ="Regression Coefficients")

```


```{r}
par(mfrow=c(2,2))
plot(full.model)

```

In the residual plots, the points in the Q-Q plot fall close to a straight line, so the residuals are approximately normal. The residuals vs. fitted plot shows no cone shape, so the constant-variance assumption looks reasonable. Because of this, we do not need a Box-Cox transformation.
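
A numeric check can complement the visual one. The sketch below uses simulated data (the variables `x` and `y` and the seed are assumptions, not the LPGA data) and `boxcox()` from the MASS package to find the transformation parameter lambda with the highest profile log-likelihood, together with an approximate 95% interval for it. If that interval contains 1, the data give no strong reason to transform the response.

```r
library(MASS)  # provides boxcox()

set.seed(321)
n <- 158
x <- runif(n, 200, 280)                 # simulated predictor
y <- 10 + 0.25 * x + rnorm(n, sd = 2)   # simulated, strictly positive response

# y = TRUE stores the response in the fit, which boxcox() needs.
fit <- lm(y ~ x, y = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# Lambda with the highest profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])
```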

## Goodness-of-Fit

Now we should look at the goodness of fit measures to try and help find the final model.

```{r}
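# Collect goodness-of-fit measures for a fitted lm object m.
select=function(m){ 
 e = m$residuals                       # residuals
 n0 = length(e)                        # sample size
 df.e = m$df.residual                  # error degrees of freedom (n0 - p)
 p = n0 - df.e                         # number of estimated coefficients
 MSE=(summary(m)$sigma)^2              # mean squared error
 SSE=df.e*MSE                          # error sum of squares
 R.sq=summary(m)$r.squared             # R-squared
 R.adj=summary(m)$adj.r.squared        # adjusted R-squared
 Cp=(SSE/MSE)-(n0-2*p)                 # Mallows' Cp (equals p when MSE comes from m itself)
 AIC=n0*log(SSE)-n0*log(n0)+2*p        # AIC up to an additive constant
 SBC=n0*log(SSE)-n0*log(n0)+log(n0)*p  # Schwarz Bayesian criterion
 X=model.matrix(m)                     # design matrix
 H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
 d=e/(1-diag(H))                       # leave-one-out (deleted) residuals
 PRESS=sum(d^2)                        # prediction sum of squares
 tbl = data.frame(SSE=SSE, R.sq=R.sq, R.adj=R.adj, Cp=Cp, AIC=AIC, SBC=SBC, PRESS=PRESS)
 tbl
 }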

```

```{r}
output.sum = rbind(select(full.model))
row.names(output.sum) = c("full.model")
kable(output.sum, caption = "Goodness-of-fit Measures of Full Model")
```

We have a sample size of 158, which is large. The table above shows an R.sq of 0.678 (R.adj of 0.667), so the model explains about 68% of the variation in greens in regulation.
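
To make the meaning of these measures concrete, the following sketch recomputes R-squared, adjusted R-squared, and PRESS directly from their definitions. It runs on simulated data (`x1`, `x2`, `y`, and the seed are assumptions), not on the LPGA data.

```r
set.seed(321)
n  <- 158
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # simulated response

fit <- lm(y ~ x1 + x2)

e <- resid(fit)            # residuals
h <- hatvalues(fit)        # leverages (diagonal of the hat matrix)
p <- length(coef(fit))     # number of estimated coefficients

SSE   <- sum(e^2)                                 # unexplained variation
SST   <- sum((y - mean(y))^2)                     # total variation
R.sq  <- 1 - SSE / SST                            # share of variation explained
R.adj <- 1 - (SSE / (n - p)) / (SST / (n - 1))    # penalised for model size
PRESS <- sum((e / (1 - h))^2)                     # leave-one-out prediction error
```

R-squared is the share of the variation in the response that the model explains in the sample, while PRESS measures out-of-sample error, which is why PRESS is never smaller than SSE.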


# Bootstrap

Here we will use the bootstrap method to get a confidence interval of the coefficients in our selected model. 
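
As a minimal sketch of the idea, assuming simulated data in place of the LPGA data (`dat`, `boot_coef`, and `ci` are hypothetical names), the case-resampling bootstrap draws whole rows with replacement, refits the model on each resample, and takes percentiles of the refitted coefficients. The response and all predictors must live in the same data frame so that each resampled row keeps its own values together.

```r
set.seed(321)
n <- 158
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)   # simulated response

fit <- lm(y ~ x1 + x2, data = dat)

B <- 500   # number of bootstrap replicates
boot_coef <- matrix(NA_real_, nrow = B, ncol = length(coef(fit)),
                    dimnames = list(NULL, names(coef(fit))))

for (b in seq_len(B)) {
  idx <- sample(n, n, replace = TRUE)                # resample row indices
  boot_coef[b, ] <- coef(lm(y ~ x1 + x2, data = dat[idx, ]))
}

# 95% percentile interval for each coefficient (one row per coefficient).
ci <- t(apply(boot_coef, 2, quantile, probs = c(0.025, 0.975)))
```

A quick sanity check on any bootstrap run is that each interval contains the point estimate from the original fit.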

```{r}
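full = lm(greens ~ fairway+ fairPct+ driveDist+ sandAtt+ sandPct, data = lpga)

B = 1000                               # number of bootstrap replicates

para_full = length(coef(full))         # number of regression coefficients
samp_full = nrow(model.frame(full))    # number of observations in the fit
coef.mtrx = matrix(rep(0, B*para_full), ncol = para_full)       

for (i in 1:B){
  bootc.id = sample(1:samp_full, samp_full, replace = TRUE) 
  # greens and fairway are standalone copies of lpga$greenReg and lpga$fairways,
  # so indexing lpga alone would not resample them. Using the data-frame columns
  # makes sure the response and every predictor are resampled together.
  bt_full = lm(greenReg ~ fairways+ fairPct+ driveDist+ sandAtt+ sandPct, data = lpga[bootc.id,])
  
  coef.mtrx[i,] = coef(bt_full)    
}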
```


```{r}
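# Histogram of the bootstrap estimates in column var.id, with a normal
# density curve (red) and a kernel density estimate (blue) overlaid.
boot_hist = function(bt.coef.mtrx, var.id, var.nm){
 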
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 

}
```

We now plot a histogram of the bootstrap estimates for each regression coefficient in the final model.

```{r fig.align='center', fig.width=7, fig.height=5}
par(mfrow=c(2,3))  
# Columns follow the coefficient order of the model:
# intercept, fairway, fairPct, driveDist, sandAtt, sandPct.
boot_hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot_hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Fairway" )
boot_hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Fairway Percent" )
boot_hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Drive Distance" )
boot_hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Sand Attempts" )
boot_hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Sand Percent" )

```

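In each histogram the normal density curve (red) and the kernel density estimate (blue) are close together, so the bootstrap sampling distributions look approximately normal, and we would expect the bootstrap confidence intervals to be consistent with the significance tests.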

The code below will get a 95% bootstrap confidence interval for the final model.

```{r}
cmtrx <- summary(full)$coef
para_full2 = dim(coef.mtrx)[2]  
boot_conf = NULL
boot_wd = NULL
for (i in 1:para_full2){
  low_conf = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)
  up_conf = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)
  boot_wd[i] =  up_conf - low_conf
  boot_conf[i] = paste("[", round(low_conf,4),", ", round(up_conf,4),"]")
 }

kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), boot_conf.95=boot_conf)), 
      caption = "Bootstrap CI")
```

Any interval in this table that contains 0 while its coefficient has a p-value below 0.05 is not consistent with the significance tests.

## Residual Bootstrap

Below is a histogram of the residuals from the fitted model, which are the values the residual bootstrap resamples.

```{r fig.align='center', fig.width=7, fig.height=4}
hist(full$residuals, breaks = 40,
     xlab="Residuals",
     col = "lightblue",
     border="red",
     main = "Histogram of Model Residuals")
```

The histogram is slightly left-skewed, with one outlier on the far left.
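
The residual bootstrap used in this section can be sketched on simulated data as follows (`dat`, `boot_coef`, and `ci` are hypothetical names, and the data are not the LPGA data). The predictors stay fixed; each replicate builds a new response by adding residuals resampled with replacement to the fitted values, then refits the model.

```r
set.seed(321)
n <- 158
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)   # simulated response

fit      <- lm(y ~ x1 + x2, data = dat)
fitted_y <- fitted(fit)   # fitted values stay fixed across replicates
res      <- resid(fit)    # residuals to be resampled

B <- 500   # number of bootstrap replicates
boot_coef <- matrix(NA_real_, nrow = B, ncol = length(coef(fit)),
                    dimnames = list(NULL, names(coef(fit))))

for (b in seq_len(B)) {
  boot_dat   <- dat
  boot_dat$y <- fitted_y + sample(res, n, replace = TRUE)   # new response
  boot_coef[b, ] <- coef(lm(y ~ x1 + x2, data = boot_dat))
}

# 95% percentile interval for each coefficient (one row per coefficient).
ci <- t(apply(boot_coef, 2, quantile, probs = c(0.025, 0.975)))
```

Because the residuals are treated as exchangeable, this approach relies on the constant-variance assumption checked earlier.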


```{r}
full2 = lm(greens ~ fairway+ fairPct+ driveDist+ sandAtt+ sandPct, data = lpga)

model_resid = full2$residuals

B = 1000      

para_full3 = dim(model.matrix(full2))[2]  
samp_full2 = dim(model.matrix(full2))[1] 
btr.mtrx = matrix(rep(0, para_full3 * B), ncol = para_full3)

for (i in 1:B){
   bt.full.green = full2$fitted.values +
        sample(full2$residuals, samp_full2, replace = TRUE) 
   btr.model = lm(bt.full.green ~fairway+fairPct+ driveDist+ sandAtt+ sandPct, data = lpga) 
  btr.mtrx[i,]=btr.model$coefficients 
}
```


Next we plot histograms of the residual bootstrap estimates of each coefficient.

```{r}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
 
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 

}
```



```{r fig.align='center', fig.width=7, fig.height=5}
par(mfrow=c(2,3))  
# Columns follow the coefficient order of the model:
# intercept, fairway, fairPct, driveDist, sandAtt, sandPct.
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Fairway" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Fairway Percent" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Drive Distance" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Sand Attempts" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Sand Percent" )


```

In these histograms the normal density curve and the kernel density estimate are close together, so we expect the residual bootstrap confidence intervals to be consistent with the significance tests.

The 95% residual bootstrap confidence interval is shown below.

```{r}

para_full4 = dim(btr.mtrx)[2]  
boot_conf2 = NULL
boot_wd2 = NULL
for (i in 1:para_full4){
  low_conf2 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
  up_conf2 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
  boot_wd2[i] = up_conf2 - low_conf2   # width of the residual bootstrap interval
  boot_conf2[i] = paste("[", round(low_conf2,4),", ", round(up_conf2,4),"]")
}

kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), boot_conf.95=boot_conf2)), 
      caption = "Regression Matrix with a 95% Residual Bootstrap CI")
```

The residual bootstrap confidence intervals agree with the significance tests: every interval except the intercept's excludes 0, matching the p-values. This is expected because the sample size is large enough for the sampling distributions of the estimated coefficients to be well approximated by normal distributions.


## Combining Results

Finally, we put all inferential statistics in a single table so we can compare these results.

```{r}
kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=boot_conf,btr.ci.95=boot_conf2)), 
      caption="Final Combined Inferential Statistics")
```

This table shows the two sets of bootstrap confidence intervals side by side, so the case-resampling intervals (btc.ci.95) and the residual bootstrap intervals (btr.ci.95) can be compared directly for each coefficient.


```{r}
kable(round(cbind(boot_wd, boot_wd2),4), caption="width of the two bootstrap confidence intervals")
```

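This table lists the width of each case-resampling interval (boot_wd) next to the width of the corresponding residual bootstrap interval (boot_wd2), in the same order as the coefficients above. A narrower interval indicates a more precise estimate.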
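
The width calculation can also be written without a loop. The sketch below is an illustration on a made-up matrix of bootstrap estimates (`bt`, `lower`, `upper`, and `width` are hypothetical names); each column is one coefficient and each row is one bootstrap replicate.

```r
set.seed(321)

# Hypothetical bootstrap estimates: 1000 replicates of two coefficients.
bt <- cbind(a = rnorm(1000, mean = 0.24,  sd = 0.025),
            b = rnorm(1000, mean = -0.11, sd = 0.017))

# 2.5% and 97.5% percentiles of every column.
lower <- apply(bt, 2, quantile, probs = 0.025, type = 2)
upper <- apply(bt, 2, quantile, probs = 0.975, type = 2)

# Interval width per coefficient, always computed from the same matrix.
width <- upper - lower
round(cbind(lower, upper, width), 4)
```

Computing both limits and the width from one matrix in one place avoids mixing the limits of one bootstrap method with the widths of another.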


# Summary and Discussion

We did not need a transformation such as Box-Cox for the response variable, because the constant-variance assumption was met. Based on my past experience with golf, we removed the variables that could not be associated with getting on the green in regulation.

In the regression output, fairways hit has the largest t value of all the predictors, so it is the most statistically significant predictor of greens in regulation. The combined inferential statistics show, for each predictor, whether its bootstrap confidence intervals also exclude 0.

I cannot think of any drawbacks of this analysis or improvements to it.

In the future I will use total prize as my response variable, because that would show which golf statistic matters most for how much a golfer wins.




