1 Introduction

In this project we investigate the research question: is there an association between greens in regulation and the predictor variables available in the data set? We will look for the best model, carry out bootstrapping, and examine the bootstrap confidence intervals.

1.1 Data Description

The data I am using contains statistics on LPGA golfers and how they performed in the 2024 season. I got this data set from the website https://users.stat.ufl.edu/~winner/datasets.html.

The variables in this data set are

  • Golfer- Name of the Golfer
  • Nation- Where the golfer is from
  • Region- What region the golfer is from
  • fairways- How many fairways the golfer hit in regulation
  • fairAtt- How many attempts the golfer took to get to the fairway
  • fairPct- The percent of fairways hit in regulation
  • totPutts- Total number of putts the golfer had
  • totRounds- Total number of rounds played by the golfer
  • avePutts- Average number of putts per hole once the golfer reached the green
  • greenReg- How many greens were hit in regulation
  • totPrize- Amount of money won
  • events- How many events the golfer went to
  • driveDist- The average distance that the golfer hit with their drive
  • sandSaves- The amount of sand saves the golfer had
  • sandAtt- The amount of shots taken from the sand
  • sandPct- The percentage of shots that made it out of the sand

In this data set, we have sufficient information to address my research question.

1.2 Research Question

The point of this study is to determine whether greens in regulation is associated with the predictor variables available in this data set.

1.3 Data Preparation

Before fitting the model we remove several variables. Golfer, Nation, and Region are dropped because they are categorical identifiers. totPrize, totRounds, and events are dropped because the total prize money, number of rounds, and number of events are not expected to influence whether a golfer gets on the green in regulation. We also drop totPutts, avePutts, sandSaves, and fairAtt, because they either describe what happens after the golfer is already on the green or are not directly related to reaching it in regulation.

2 Model Building

Now we need to make the full model of the data and look to see if we need to use a Box-Cox transformation on it.

Regression Coefficients
               Estimate  Std. Error    t value   Pr(>|t|)
(Intercept)  -9.9257920   8.8445752  -1.122246  0.2635273
fairway       0.0191607   0.0018383  10.423042  0.0000000
fairPct       0.1406976   0.0467571   3.009119  0.0030680
driveDist     0.2407452   0.0253429   9.499498  0.0000000
sandAtt      -0.1105811   0.0172524  -6.409602  0.0000000
sandPct       0.0525978   0.0249874   2.104974  0.0369384
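
The coefficient table above has the form produced by an ordinary least-squares fit. As a minimal sketch of how the Estimate, Std. Error, and t value columns can be computed, the following Python/NumPy code fits a linear model to synthetic data. The data, the function name `ols_table`, and the choice of Python are assumptions made for illustration; this is not the code used for the report, and the p-value column is left out because it needs a t-distribution function from outside NumPy.

```python
import numpy as np

def ols_table(X, y):
    """Return OLS estimates, standard errors, and t values.

    X is an (n, k) array of predictors WITHOUT an intercept column;
    y is a length-n response vector. (Hypothetical helper for illustration.)
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])      # prepend intercept column
    p = design.shape[1]                            # number of coefficients
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - p)               # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Synthetic stand-in data (NOT the LPGA data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(158, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=158)

beta, se, t_values = ols_table(X, y)
for name, b, s, t in zip(["(Intercept)", "x1", "x2"], beta, se, t_values):
    print(f"{name:12s} {b:10.4f} {s:10.4f} {t:10.3f}")
```

Each t value is simply the estimate divided by its standard error, which is the relationship visible in every row of the table.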

In the residual plots, the points in the normal Q-Q plot fall close to a straight line, which suggests that the residuals are approximately normal. The residuals vs. fitted plot shows no cone (funnel) shape, which suggests that the variance is roughly constant. Because of this, we do not need a Box-Cox transformation.
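
The decision above is based on reading the residual plots. A more formal check is to look at the Box-Cox profile log-likelihood and see whether a 95% interval for the power parameter contains 1 (no transformation). The sketch below is an illustration under stated assumptions: the data are synthetic and deliberately log-normal so that a transformation is clearly needed, the helper name `boxcox_profile` is hypothetical, and 3.841459 is the 0.95 quantile of the chi-square distribution with one degree of freedom.

```python
import numpy as np

def boxcox_profile(X, y, lambdas):
    """Profile log-likelihood of the Box-Cox parameter for a linear model.

    Requires y > 0. (Hypothetical helper for illustration.)
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    log_y = np.log(y)
    loglik = []
    for lam in lambdas:
        # Box-Cox transform; the lam -> 0 limit is log(y)
        z = log_y if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
        beta, *_ = np.linalg.lstsq(design, z, rcond=None)
        rss = np.sum((z - design @ beta) ** 2)
        # Gaussian log-likelihood plus the Jacobian of the transformation
        loglik.append(-0.5 * n * np.log(rss / n) + (lam - 1.0) * log_y.sum())
    return np.array(loglik)

# Synthetic data where log(y) is linear in x (NOT the LPGA data set)
rng = np.random.default_rng(1)
x = rng.normal(size=158)
y = np.exp(1.0 + 1.0 * x + rng.normal(scale=0.2, size=158))

lambdas = np.linspace(-2, 2, 81)
ll = boxcox_profile(x.reshape(-1, 1), y, lambdas)
lam_hat = lambdas[np.argmax(ll)]

# Approximate 95% interval: lambdas within chi2(1, 0.95) / 2 of the maximum
inside = lambdas[ll >= ll.max() - 0.5 * 3.841459]
lam_lo, lam_hi = inside.min(), inside.max()
needs_transform = not (lam_lo <= 1.0 <= lam_hi)
print(lam_hat, (lam_lo, lam_hi), needs_transform)
```

For the golf model, an interval that contains 1 would support the conclusion that no transformation is needed.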

2.1 Goodness-of-Fit

Now we should look at the goodness of fit measures to try and help find the final model.

Goodness-of-fit Measures of Full Model
                  SSE       R.sq      R.adj  Cp       AIC       SBC     PRESS
full.model   787.5965  0.6780546  0.6674643   6  265.8098  284.1853  862.4798

We have a sample size of 158, which is reasonably large. From the table above, R.sq is 0.678 and R.adj is 0.667, so the full model explains about 67% of the variation in greens in regulation. Cp equals 6, the number of coefficients in the model, which is always the case for the full model.
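
Several entries in the goodness-of-fit table can be recomputed from SSE, R.sq, the sample size, and the number of coefficients. The sketch below uses the common textbook forms AIC = n log(SSE/n) + 2p and SBC = n log(SSE/n) + p log(n). That these are the definitions behind the table is an assumption, although they do reproduce the reported values. PRESS is not included because it needs the leave-one-out residuals, not just the summary numbers.

```python
import math

n, p = 158, 6                      # sample size; coefficients incl. intercept
sse, r_sq = 787.5965, 0.6780546    # values reported in the table

r_adj = 1 - (1 - r_sq) * (n - 1) / (n - p)
aic = n * math.log(sse / n) + 2 * p
sbc = n * math.log(sse / n) + p * math.log(n)

# Mallows' Cp with the full model's own MSE as the variance estimate;
# for the full model this reduces algebraically to p.
mse_full = sse / (n - p)
cp = sse / mse_full + 2 * p - n

print(f"R.adj={r_adj:.7f}  AIC={aic:.4f}  SBC={sbc:.4f}  Cp={cp:.1f}")
```

Because Cp always equals p for the full model, it only becomes informative when the full model is compared with smaller candidate models.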

3 Bootstrap

Here we will use the bootstrap method to get confidence intervals for the coefficients in our selected model.

We will now make histograms of the bootstrap estimates for each of the regression coefficients in the final model.

Since the two density curves in each histogram are close together, we would expect the bootstrap confidence intervals to be consistent with the significance tests.

The table below gives a 95% bootstrap confidence interval for each coefficient in the final model.

Bootstrap CI
             Estimate  Std. Error  t value  Pr(>|t|)  boot_conf.95
(Intercept)   -9.9258      8.8446  -1.1222    0.2635  [ 38.5803 , 81.9638 ]
fairway        0.0192      0.0018  10.4230    0.0000  [ 0.0121 , 0.0128 ]
fairPct        0.1407      0.0468   3.0091    0.0031  [ -0.1037 , 0.0992 ]
driveDist      0.2407      0.0253   9.4995    0.0000  [ -0.0651 , 0.0633 ]
sandAtt       -0.1106      0.0173  -6.4096    0.0000  [ -0.0266 , 0.0263 ]
sandPct        0.0526      0.0250   2.1050    0.0369  [ -0.0666 , 0.0635 ]

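Four of these intervals (fairPct, driveDist, sandAtt, and sandPct) contain 0 even though the corresponding t-tests are significant, so the intervals are not consistent with the significance tests. In addition, five of the six intervals do not contain the point estimate they are meant to surround. For example, the intercept estimate of -9.9258 lies far outside [ 38.5803 , 81.9638 ]. This suggests that the model fit inside the bootstrap may differ from the full model and should be rechecked.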
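
For reference, a percentile bootstrap interval for regression coefficients can be sketched as follows. This assumes the bootstrap in this section resamples whole rows of the data (often called the case or pairs bootstrap); the report does not show its code, so that is an assumption, as are the synthetic data and the helper names `fit_ols` and `case_bootstrap_ci`.

```python
import numpy as np

def fit_ols(design, y):
    """Least-squares coefficients for a design matrix that includes the intercept."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def case_bootstrap_ci(design, y, n_boot=1000, level=0.95, seed=0):
    """Percentile CIs from resampling whole rows (cases) with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    betas = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # the SAME row indices for X and y
        betas[b] = fit_ols(design[idx], y[idx])
    alpha = 1 - level
    lower = np.quantile(betas, alpha / 2, axis=0)
    upper = np.quantile(betas, 1 - alpha / 2, axis=0)
    return lower, upper

# Synthetic stand-in data (NOT the LPGA data set)
rng = np.random.default_rng(42)
n = 158
x = rng.normal(size=n)
design = np.column_stack([np.ones(n), x])
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

beta_hat = fit_ols(design, y)
lower, upper = case_bootstrap_ci(design, y)
for b, lo, hi in zip(beta_hat, lower, upper):
    print(f"{b:8.4f}  [ {lo:.4f} , {hi:.4f} ]")
```

When the same model and the same response are used inside the loop as in the original fit, each percentile interval should bracket its point estimate, which is a quick sanity check on any bootstrap table.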

3.1 Residual Bootstrap

Below is a histogram that shows the distribution of the bootstrap residuals.

Looking at the histogram you can see that it is slightly left skewed and there is one outlier on the far left.

Next we make histograms of the residual bootstrap estimates.

Since the density curves in the histograms are close together, we can conclude that the bootstrap confidence intervals will be consistent with the significance tests.

The 95% residual bootstrap confidence interval is shown below.

Regression Matrix with a 95% Residual Bootstrap CI
             Estimate  Std. Error  t value  Pr(>|t|)  boot_conf.95
(Intercept)   -9.9258      8.8446  -1.1222    0.2635  [ -26.799 , 7.4838 ]
fairway        0.0192      0.0018  10.4230    0.0000  [ 0.0155 , 0.0228 ]
fairPct        0.1407      0.0468   3.0091    0.0031  [ 0.0479 , 0.2326 ]
driveDist      0.2407      0.0253   9.4995    0.0000  [ 0.1908 , 0.2887 ]
sandAtt       -0.1106      0.0173  -6.4096    0.0000  [ -0.1434 , -0.0759 ]
sandPct        0.0526      0.0250   2.1050    0.0369  [ 0.0059 , 0.0996 ]

The residual bootstrap confidence intervals look better than the non-residual bootstrap confidence intervals. Every interval except the one for the intercept excludes 0, which agrees with the p-values in the table. This agreement is expected when the sample size is large enough that the sampling distributions of the estimated coefficients are well approximated by normal distributions.
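
The residual bootstrap keeps the predictors fixed, resamples the residuals of the fitted model, and refits the model to the rebuilt responses. A minimal sketch of that procedure on synthetic data is given below; the helper name `residual_bootstrap_ci`, the number of replicates, and the data are assumptions for illustration and are not taken from the report.

```python
import numpy as np

def residual_bootstrap_ci(design, y, n_boot=1000, level=0.95, seed=0):
    """Percentile CIs from resampling residuals around the fitted values."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    fitted = design @ beta_hat
    resid = y - fitted
    betas = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        # New response: original fitted values plus resampled residuals
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        betas[b] = np.linalg.lstsq(design, y_star, rcond=None)[0]
    alpha = 1 - level
    lower = np.quantile(betas, alpha / 2, axis=0)
    upper = np.quantile(betas, 1 - alpha / 2, axis=0)
    return beta_hat, lower, upper

# Synthetic stand-in data (NOT the LPGA data set)
rng = np.random.default_rng(7)
n = 158
x = rng.normal(size=n)
design = np.column_stack([np.ones(n), x])
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

beta_hat, lower, upper = residual_bootstrap_ci(design, y)
for b, lo, hi in zip(beta_hat, lower, upper):
    print(f"{b:8.4f}  [ {lo:.4f} , {hi:.4f} ]")
```

Because the design matrix never changes between replicates, this version relies on the model being correctly specified with constant error variance, which matches the assumption checked earlier with the residual plots.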

3.2 Combining Results

Finally, we put all inferential statistics in a single table so we can compare these results.

Final Combined Inferential Statistics
             Estimate  Std. Error  Pr(>|t|)  btc.ci.95              btr.ci.95
(Intercept)   -9.9258      8.8446    0.2635  [ 38.5803 , 81.9638 ]  [ -26.799 , 7.4838 ]
fairway        0.0192      0.0018    0.0000  [ 0.0121 , 0.0128 ]    [ 0.0155 , 0.0228 ]
fairPct        0.1407      0.0468    0.0031  [ -0.1037 , 0.0992 ]   [ 0.0479 , 0.2326 ]
driveDist      0.2407      0.0253    0.0000  [ -0.0651 , 0.0633 ]   [ 0.1908 , 0.2887 ]
sandAtt       -0.1106      0.0173    0.0000  [ -0.0266 , 0.0263 ]   [ -0.1434 , -0.0759 ]
sandPct        0.0526      0.0250    0.0369  [ -0.0666 , 0.0635 ]   [ 0.0059 , 0.0996 ]

This table shows the two sets of bootstrap confidence intervals side by side. Comparing them, the residual bootstrap intervals are clearly better: each one contains its point estimate and agrees with the p-value, which is not true of the regular bootstrap intervals.

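Width of the two bootstrap confidence intervals
             boot_wd  boot_wd2
(Intercept)  43.3835   34.2828
fairway       0.0008    0.0073
fairPct       0.2029    0.1847
driveDist     0.1284    0.0979
sandAtt       0.0529    0.0675
sandPct       0.1301    0.0937

Here boot_wd is the width of the regular bootstrap interval and boot_wd2 is the width of the residual bootstrap interval, computed from the interval endpoints in the combined table. Looking at this table you can see that the widths of the residual bootstrap intervals are more consistent than those of the regular bootstrap. Each residual bootstrap width is close to four standard errors of its coefficient, while the regular bootstrap widths range from less than half a standard error (fairway) to about five standard errors (sandPct).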
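
The interval widths can be recomputed directly from the endpoints reported in the combined table. The sketch below does this and also expresses each residual bootstrap width in units of the coefficient's standard error. The dictionary and function names are hypothetical, and small last-digit differences from reported widths can arise because the endpoints are rounded to four decimals.

```python
# Standard errors and (lower, upper) endpoints copied from the combined table
std_err = {"(Intercept)": 8.8446, "fairway": 0.0018, "fairPct": 0.0468,
           "driveDist": 0.0253, "sandAtt": 0.0173, "sandPct": 0.0250}
case_ci = {"(Intercept)": (38.5803, 81.9638), "fairway": (0.0121, 0.0128),
           "fairPct": (-0.1037, 0.0992), "driveDist": (-0.0651, 0.0633),
           "sandAtt": (-0.0266, 0.0263), "sandPct": (-0.0666, 0.0635)}
resid_ci = {"(Intercept)": (-26.799, 7.4838), "fairway": (0.0155, 0.0228),
            "fairPct": (0.0479, 0.2326), "driveDist": (0.1908, 0.2887),
            "sandAtt": (-0.1434, -0.0759), "sandPct": (0.0059, 0.0996)}

def widths(ci):
    """Width (upper minus lower) of each interval."""
    return {name: upper - lower for name, (lower, upper) in ci.items()}

case_wd = widths(case_ci)
resid_wd = widths(resid_ci)

# Columns: name, regular width, residual width, residual width / std. error
for name in std_err:
    ratio = resid_wd[name] / std_err[name]
    print(f"{name:12s} {case_wd[name]:8.4f} {resid_wd[name]:8.4f} {ratio:6.2f}")
```

A width of roughly four standard errors is what a normal-theory 95% interval (estimate plus or minus about 1.96 standard errors) would give, so the last column is a simple way to judge whether a bootstrap interval is plausible.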

4 Summary and Discussion

We did not need a Box-Cox transformation of the response variable, because the assumption of constant variance appeared to be met. Based on my past experience with golf, we removed all the variables that would not be expected to have any association with getting on the green in regulation.

Looking at the combined inferential statistics, most of the regular bootstrap confidence intervals contain 0, and fairways hit is the only predictor whose interval excludes 0 under both bootstrap methods. This suggests that fairways hit is the predictor most consistently associated with greens in regulation.

I cannot think of any other drawbacks or improvements.

In the future I would use total prize money as my response variable, to see which golf statistic is the most important factor in how much a golfer wins.

