Introduction
This project addresses the research question: is there an association
between greens in regulation and the predictor variables available in
the data set? We will look for the best model, apply bootstrapping, and
examine the bootstrap confidence intervals for its coefficients.
Data Description
The data I am using are statistics on LPGA golfers and how they
performed in the 2024 season. I obtained this data set from the website (https://users.stat.ufl.edu/~winner/datasets.html).
The variables in this data set are:
- Golfer: Name of the golfer
- Nation: Where the golfer is from
- Region: What region the golfer is from
- fairways: How many fairways the golfer hit in regulation
- fairAtt: How many attempts the golfer had at hitting the fairway
- fairPct: The percentage of fairways hit in regulation
- totPutts: Total number of putts the golfer had
- totRounds: Total number of rounds played by the golfer
- avePutts: Average number of putts per hole once on the green
- greenReg: How many greens were hit in regulation
- totPrize: Amount of money won
- events: How many events the golfer played in
- driveDist: The average distance of the golfer's drives
- sandSaves: The number of sand saves the golfer had
- sandAtt: The number of shots taken from the sand
- sandPct: The percentage of shots that made it out of the sand
In this data set, we have sufficient information to address my
research question.
Research Question
The goal of this study is to determine whether greens in regulation is
associated with the predictor variables available in this data set.
Data Preparation
We first remove some variables before modeling. Golfer, Nation, and
Region are dropped because they are categorical. totPrize, totRounds,
and events are dropped because total prize money, the number of rounds,
and the number of events should not influence whether a golfer gets on
the green in regulation. We also drop totPutts, avePutts, sandSaves, and
fairAtt, because they cannot affect a golfer's ability to get on the
green in regulation: they are either not directly related to it or they
are recorded after the golfer is already on the green.
Model Building
Next, we fit the full model with the remaining predictors and check
whether a Box-Cox transformation of the response is needed.
Regression Coefficients
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -9.9257920 | 8.8445752 | -1.122246 | 0.2635273 |
| fairway | 0.0191607 | 0.0018383 | 10.423042 | 0.0000000 |
| fairPct | 0.1406976 | 0.0467571 | 3.009119 | 0.0030680 |
| driveDist | 0.2407452 | 0.0253429 | 9.499498 | 0.0000000 |
| sandAtt | -0.1105811 | 0.0172524 | -6.409602 | 0.0000000 |
| sandPct | 0.0525978 | 0.0249874 | 2.104974 | 0.0369384 |

In the residual plots, the points in the Q-Q plot follow a roughly
straight line with positive slope, which supports the normality
assumption. In the residuals vs. fitted plot, the points do not fan out
in a cone shape, which indicates that the variance is constant. Because
of this, we do not need a Box-Cox transformation.
Goodness-of-Fit
Now we should look at the goodness of fit measures to try and help
find the final model.
Goodness-of-fit Measures of Full Model
| Model | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 787.5965 | 0.6780546 | 0.6674643 | 6 | 265.8098 | 284.1853 | 862.4798 |
We have a sample size of 158, which is large. From the table above, the
full model has an R-squared of 0.678 (adjusted R-squared 0.667), so it
explains about 67% of the variation in greens in regulation.
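As a quick check on the goodness-of-fit table, the adjusted R-squared, AIC, and SBC can be recomputed from the reported SSE, R-squared, and sample size. The sketch below assumes the least-squares forms AIC = n ln(SSE/n) + 2p and SBC = n ln(SSE/n) + p ln(n), with p = 6 estimated coefficients including the intercept; the report does not state which formulas were used, so treat these as an assumption.

```python
import math

# Values copied from the text and the goodness-of-fit table.
n = 158            # sample size
p = 6              # estimated coefficients, intercept included (assumed)
sse = 787.5965     # error sum of squares
r_sq = 0.6780546   # R-squared

# Adjusted R-squared penalizes R-squared for the number of coefficients.
r_adj = 1 - (1 - r_sq) * (n - 1) / (n - p)

# Least-squares forms of AIC and SBC (assumed, up to an additive constant).
aic = n * math.log(sse / n) + 2 * p
sbc = n * math.log(sse / n) + p * math.log(n)

print(f"R.adj = {r_adj:.7f}, AIC = {aic:.4f}, SBC = {sbc:.4f}")
```

Under these assumed formulas the recomputed values agree with the table to rounding, which supports reading the unlabeled columns as SSE, R-squared, adjusted R-squared, Cp, AIC, SBC, and PRESS.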
Bootstrap
Here we will use the bootstrap method to get a confidence interval of
the coefficients in our selected model.
We now plot histograms of the bootstrap estimates of each regression
coefficient in the final model.

Since the two density curves overlaid on each histogram are close
together, we expect the bootstrap confidence intervals to be consistent
with the significance tests.
The table below gives a 95% bootstrap confidence interval for each
coefficient of the final model.
Bootstrap CI
| Term | Estimate | Std. Error | t value | p-value | 95% Bootstrap CI |
|---|---|---|---|---|---|
| (Intercept) | -9.9258 | 8.8446 | -1.1222 | 0.2635 | [ 38.5803 , 81.9638 ] |
| fairway | 0.0192 | 0.0018 | 10.4230 | 0.0000 | [ 0.0121 , 0.0128 ] |
| fairPct | 0.1407 | 0.0468 | 3.0091 | 0.0031 | [ -0.1037 , 0.0992 ] |
| driveDist | 0.2407 | 0.0253 | 9.4995 | 0.0000 | [ -0.0651 , 0.0633 ] |
| sandAtt | -0.1106 | 0.0173 | -6.4096 | 0.0000 | [ -0.0266 , 0.0263 ] |
| sandPct | 0.0526 | 0.0250 | 2.1050 | 0.0369 | [ -0.0666 , 0.0635 ] |
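Four of the six intervals (fairPct, driveDist, sandAtt, and sandPct)
contain 0 even though every predictor has a p-value below 0.05, so these
intervals are not consistent with the significance tests. Five of the six
intervals also fail to contain their own point estimate, which suggests
the bootstrap computation should be rechecked before these intervals are
interpreted.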
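The bootstrap idea used in this section can be sketched as follows: resample the golfers (rows) with replacement, refit the model on each resample, and take the 2.5% and 97.5% quantiles of the refitted coefficients. This is a minimal sketch, not the code used for the report: it assumes case resampling, uses a single predictor so that the least-squares fit has a closed form, and runs on synthetic stand-in data rather than the LPGA data. The names `fit_line`, `percentile_ci`, and `case_bootstrap_slopes` are hypothetical helpers.

```python
import random
import statistics


def fit_line(x, y):
    """Return the least-squares (intercept, slope) for one predictor."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def percentile_ci(values):
    """95% percentile interval: the 2.5% and 97.5% sample quantiles."""
    cuts = statistics.quantiles(values, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]


def case_bootstrap_slopes(x, y, n_boot=1000, seed=1):
    """Resample (x, y) rows with replacement and refit the line each time."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_line([x[i] for i in idx], [y[i] for i in idx])[1])
    return slopes


# Synthetic stand-in data (NOT the LPGA data): 158 golfers, one predictor.
rng = random.Random(0)
drive_dist = [rng.uniform(230, 280) for _ in range(158)]
green_reg = [0.24 * d + rng.gauss(0, 2.2) for d in drive_dist]

slope_hat = fit_line(drive_dist, green_reg)[1]
ci_low, ci_high = percentile_ci(case_bootstrap_slopes(drive_dist, green_reg))
print(f"slope = {slope_hat:.4f}, 95% bootstrap CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

A percentile interval built this way is centered near the original estimate, so an interval that does not contain its own estimate is a sign that the resampled fits and the original fit are not using the same model or data.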
Residual Bootstrap
Below is a histogram that shows the distribution of the bootstrap
residuals.

Looking at the histogram, you can see that it is slightly left-skewed
and that there is one outlier on the far left.
Next, we plot histograms of the residual bootstrap estimates of each
regression coefficient.

Because the density curves in these histograms are close together, we
expect the residual bootstrap confidence intervals to be consistent with
the significance tests.
The 95% residual bootstrap confidence interval is shown below.
Regression Matrix with a 95% Residual Bootstrap CI
| Term | Estimate | Std. Error | t value | p-value | 95% Residual Bootstrap CI |
|---|---|---|---|---|---|
| (Intercept) | -9.9258 | 8.8446 | -1.1222 | 0.2635 | [ -26.799 , 7.4838 ] |
| fairway | 0.0192 | 0.0018 | 10.4230 | 0.0000 | [ 0.0155 , 0.0228 ] |
| fairPct | 0.1407 | 0.0468 | 3.0091 | 0.0031 | [ 0.0479 , 0.2326 ] |
| driveDist | 0.2407 | 0.0253 | 9.4995 | 0.0000 | [ 0.1908 , 0.2887 ] |
| sandAtt | -0.1106 | 0.0173 | -6.4096 | 0.0000 | [ -0.1434 , -0.0759 ] |
| sandPct | 0.0526 | 0.0250 | 2.1050 | 0.0369 | [ 0.0059 , 0.0996 ] |
The residual bootstrap confidence intervals look better than the regular
bootstrap confidence intervals: every interval except the intercept's
excludes 0, which agrees with the p-values. This agreement is expected
because the sample size is large enough for the sampling distributions
of the estimated coefficients to be well approximated by normal
distributions.
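The residual bootstrap differs from resampling rows: the predictors stay fixed, the model is fitted once, and new responses are built by adding resampled residuals to the fitted values before refitting. The sketch below illustrates that procedure under the same simplifying assumptions as before (one predictor, synthetic stand-in data, hypothetical helper names); it is not the code behind the table above.

```python
import random
import statistics


def fit_line(x, y):
    """Return the least-squares (intercept, slope) for one predictor."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_bootstrap_slopes(x, y, n_boot=1000, seed=1):
    """Keep x fixed; rebuild y from fitted values plus resampled residuals."""
    rng = random.Random(seed)
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        resampled = rng.choices(residuals, k=len(residuals))  # with replacement
        y_star = [fi + ei for fi, ei in zip(fitted, resampled)]
        slopes.append(fit_line(x, y_star)[1])
    return slopes


# Synthetic stand-in data (NOT the LPGA data): 158 golfers, one predictor.
rng = random.Random(0)
drive_dist = [rng.uniform(230, 280) for _ in range(158)]
green_reg = [0.24 * d + rng.gauss(0, 2.2) for d in drive_dist]

slope_hat = fit_line(drive_dist, green_reg)[1]
cuts = statistics.quantiles(residual_bootstrap_slopes(drive_dist, green_reg), n=40)
res_low, res_high = cuts[0], cuts[-1]  # 2.5% and 97.5% quantiles
print(f"slope = {slope_hat:.4f}, 95% residual bootstrap CI = [{res_low:.4f}, {res_high:.4f}]")
```

Because the residuals are treated as exchangeable, this version relies on the constant-variance assumption that the residual plots were used to check.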
Combining Results
Finally, we put all inferential statistics in a single table so we
can compare these results.
Final Combined Inferential Statistics
| Term | Estimate | Std. Error | p-value | 95% Bootstrap CI | 95% Residual Bootstrap CI |
|---|---|---|---|---|---|
| (Intercept) | -9.9258 | 8.8446 | 0.2635 | [ 38.5803 , 81.9638 ] | [ -26.799 , 7.4838 ] |
| fairway | 0.0192 | 0.0018 | 0.0000 | [ 0.0121 , 0.0128 ] | [ 0.0155 , 0.0228 ] |
| fairPct | 0.1407 | 0.0468 | 0.0031 | [ -0.1037 , 0.0992 ] | [ 0.0479 , 0.2326 ] |
| driveDist | 0.2407 | 0.0253 | 0.0000 | [ -0.0651 , 0.0633 ] | [ 0.1908 , 0.2887 ] |
| sandAtt | -0.1106 | 0.0173 | 0.0000 | [ -0.0266 , 0.0263 ] | [ -0.1434 , -0.0759 ] |
| sandPct | 0.0526 | 0.0250 | 0.0369 | [ -0.0666 , 0.0635 ] | [ 0.0059 , 0.0996 ] |
This table shows the two sets of bootstrap confidence intervals side by
side. The residual bootstrap intervals agree with the p-values for every
coefficient, while the regular bootstrap intervals do not, so the
residual bootstrap intervals are clearly the better of the two.
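One way to make this comparison concrete is to compute normal-approximation intervals (estimate ± z × standard error) from the table and to check whether each bootstrap interval contains its own point estimate. The sketch below copies the tabulated values; the normal approximation is an assumption introduced here as a cross-check, not something the report computes.

```python
from statistics import NormalDist

# name: (estimate, std. error, regular bootstrap CI, residual bootstrap CI),
# copied from the combined table above.
rows = {
    "(Intercept)": (-9.9258, 8.8446, (38.5803, 81.9638), (-26.799, 7.4838)),
    "fairway": (0.0192, 0.0018, (0.0121, 0.0128), (0.0155, 0.0228)),
    "fairPct": (0.1407, 0.0468, (-0.1037, 0.0992), (0.0479, 0.2326)),
    "driveDist": (0.2407, 0.0253, (-0.0651, 0.0633), (0.1908, 0.2887)),
    "sandAtt": (-0.1106, 0.0173, (-0.0266, 0.0263), (-0.1434, -0.0759)),
    "sandPct": (0.0526, 0.0250, (-0.0666, 0.0635), (0.0059, 0.0996)),
}

z = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile, about 1.96


def contains(interval, value):
    """True when value lies inside the closed interval."""
    return interval[0] <= value <= interval[1]


# Normal-approximation (Wald) interval for each coefficient.
wald = {name: (est - z * se, est + z * se) for name, (est, se, _, _) in rows.items()}

# Which bootstrap intervals contain their own point estimate?
regular_covers = [name for name, (est, _, reg, _) in rows.items() if contains(reg, est)]
residual_covers = [name for name, (est, _, _, res) in rows.items() if contains(res, est)]

for name, (low, high) in wald.items():
    print(f"{name:12s} Wald 95% CI = [{low:.4f}, {high:.4f}]")
print("regular bootstrap CIs containing the estimate: ", regular_covers)
print("residual bootstrap CIs containing the estimate:", residual_covers)
```

With the tabulated values, all six residual bootstrap intervals contain their estimate and sit close to the normal-approximation intervals, while only the sandPct interval from the regular bootstrap contains its estimate.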
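Width of the Two Bootstrap Confidence Intervals
| Term | Bootstrap CI width | Residual Bootstrap CI width |
|---|---|---|
| (Intercept) | 43.3835 | 0.1301 |
| fairway | 0.0008 | 0.1301 |
| fairPct | 0.2029 | 0.1301 |
| driveDist | 0.1284 | 0.1301 |
| sandAtt | 0.0529 | 0.1301 |
| sandPct | 0.1301 | 0.1301 |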
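Looking at this table, the first column matches the widths of the
regular bootstrap intervals, but the second column shows 0.1301 for
every coefficient. The residual bootstrap intervals reported earlier
have widths ranging from 0.0073 (fairway) to 34.2828 (intercept), so the
second column does not reflect them and the width comparison should be
read with caution.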
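The widths can be recomputed directly from the interval endpoints reported in the two bootstrap tables. The sketch below does this with the rounded endpoints as printed, so a recomputed width can differ from the tabulated one in the last digit (for example, fairway under the regular bootstrap).

```python
# Interval endpoints copied from the two bootstrap tables.
regular_ci = {
    "(Intercept)": (38.5803, 81.9638),
    "fairway": (0.0121, 0.0128),
    "fairPct": (-0.1037, 0.0992),
    "driveDist": (-0.0651, 0.0633),
    "sandAtt": (-0.0266, 0.0263),
    "sandPct": (-0.0666, 0.0635),
}
residual_ci = {
    "(Intercept)": (-26.799, 7.4838),
    "fairway": (0.0155, 0.0228),
    "fairPct": (0.0479, 0.2326),
    "driveDist": (0.1908, 0.2887),
    "sandAtt": (-0.1434, -0.0759),
    "sandPct": (0.0059, 0.0996),
}


def widths(intervals):
    """Width (upper minus lower) of each interval, rounded to 4 decimals."""
    return {name: round(high - low, 4) for name, (low, high) in intervals.items()}


regular_widths = widths(regular_ci)
residual_widths = widths(residual_ci)

for name in regular_ci:
    print(f"{name:12s} regular = {regular_widths[name]:8.4f}  "
          f"residual = {residual_widths[name]:8.4f}")
```

Computing both columns from the same helper avoids the situation where one column repeats a single value for every coefficient.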
Summary and Discussion
We did not need techniques such as a Box-Cox transformation of the
response variable, because the assumption of constant variance was met.
Based on my past experience with golf, we removed all the variables that
would not be significant or have any association with getting on the
green in regulation.
Looking at the combined inferential statistics, the regular bootstrap
confidence interval of every predictor except fairways hit contains 0.
Fairways hit also has the largest t statistic (10.42) and excludes 0
under both bootstrap methods, so it is the most statistically
significant predictor of greens in regulation.
I cannot think of any drawbacks or improvements at this time.
In the future, I will use total prize money as my response variable,
because that would show which golf statistic is the most important
factor in how much a golfer wins.
