Introduction
In this assignment we look at a data set about the LPGA (Ladies
Professional Golf Association), the women's professional golf tour. We
use greens in regulation as our response variable and examine how the
other golf statistics in the data set are associated with it. We fit an
initial model, check whether it needs any transformations, and perform
the necessary tests to find a good model.
Data Description
The data in this project were taken from
https://users.stat.ufl.edu/~winner/datasets.html. The variables are
- Golfer- Name of the Golfer
- Nation- Where the golfer is from
- Region- What region the golfer is from
- fairways- How many fairways the golfer hit in regulation
- fairAtt- How many attempts the golfer took to get to the
fairway
- fairPct- The percent of fairways hit in regulation
- totPutts- Total number of putts the golfer had
- totRounds- Total number of rounds played by the golfer
- avePutts- Average number of putts per hole once the golfer reached
the green
- greenReg- How many greens were hit in regulation
- totPrize- Amount of money won
- events- How many events the golfer went to
- driveDist- The average distance that the golfer hit with their
drive
- sandSaves- The number of sand saves the golfer had
- sandAtt- The number of shots taken from the sand
- sandPct- The percentage of shots that made it out of the sand
Practical Question
The goal of this study is to determine the association between greens
in regulation and the predictor variables available in this data set.
Exploratory Data Analysis
We first want to look at how the predictor variables relate to our
response variable, greens hit in regulation.





Looking at the first scatter plot, for the variable fairway, we can see
a positive linear trend: golfers who hit more fairways in regulation
also tend to hit more greens in regulation.
In the scatter plot for fairway percentage, the points are concentrated
at the higher percentages, with a tail toward the lower ones. This
shows that most of the golfers in this data set hit a high percentage
of fairways.
The scatter plot for drive distance also shows a positive linear
trend.
In the scatter plot for sand attempts, most points sit at the lower
values, with two possible outliers past 120 sand attempts.
The scatter plot for sand percentage also has a positive linear
trend.
Full model and diagnostics
We need to fit a linear model using the predictor variables. Based on
previous experience with golf, we have taken out Golfer, Nation,
Region, events, totPrize, and totRounds: a golfer's identity and
origin, the number of events, the number of rounds, and the total prize
money do not influence getting on the green in regulation. We also drop
totPutts, avePutts, sandSaves, and fairAtt because they do not affect a
person's ability to get on the green in regulation either. They are
either not directly related to it or happen after the golfer is already
on the green.
Regression Coefficients
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -9.9257920 | 8.8445752 | -1.122246 | 0.2635273 |
| fairway | 0.0191607 | 0.0018383 | 10.423042 | 0.0000000 |
| fairPct | 0.1406976 | 0.0467571 | 3.009119 | 0.0030680 |
| driveDist | 0.2407452 | 0.0253429 | 9.499498 | 0.0000000 |
| sandAtt | -0.1105811 | 0.0172524 | -6.409602 | 0.0000000 |
| sandPct | 0.0525978 | 0.0249874 | 2.104974 | 0.0369384 |
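To make the coefficient estimates concrete, the fitted equation can be evaluated for a single golfer. The following is a minimal sketch that hard-codes the estimates reported above. The function name `predict_green_reg` and the example input values are hypothetical, chosen only for illustration and not taken from the data set.

```python
# Coefficient estimates copied from the regression output above.
COEFFICIENTS = {
    "(Intercept)": -9.9257920,
    "fairway": 0.0191607,
    "fairPct": 0.1406976,
    "driveDist": 0.2407452,
    "sandAtt": -0.1105811,
    "sandPct": 0.0525978,
}


def predict_green_reg(golfer):
    """Return the fitted greenReg value for a dict of predictor values."""
    prediction = COEFFICIENTS["(Intercept)"]
    for name, value in golfer.items():
        prediction += COEFFICIENTS[name] * value
    return prediction


# Hypothetical golfer: illustrative values, not a row from the data set.
example_golfer = {
    "fairway": 600,
    "fairPct": 70.0,
    "driveDist": 250.0,
    "sandAtt": 80,
    "sandPct": 40.0,
}

print(round(predict_green_reg(example_golfer), 2))
```

Holding the other predictors fixed, each additional unit of driveDist raises the fitted value by about 0.24, while each additional sand attempt lowers it by about 0.11.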
Now we should look at our residual diagnostic analysis to check how
reliable our model is.

In these residual plots, the Q-Q plot indicates that the residuals are
approximately normally distributed. In the residuals vs. fitted plot
the points do not form a cone shape, which suggests that the variance
is constant.
Goodness-of-fit Measures
Now, we look at the goodness of fit measures for the models.
Goodness-of-fit Measures of Full Model
| Model | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 787.5965 | 0.6780546 | 0.6674643 | 6 | 265.8098 | 284.1853 | 862.4798 |
We have a sample size of 158, which is large. From the table above, the
R-squared is about 0.68 and the adjusted R-squared is about 0.67, so
the model explains roughly 67% of the variation in greens in
regulation.
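Several of the goodness-of-fit values can be recomputed from the others as a consistency check. The sketch below treats the first reported value, 787.5965, as the error sum of squares (SSE) and the second, 0.6780546, as R-squared. It assumes the usual least-squares definitions AIC = n ln(SSE/n) + 2p and SBC = n ln(SSE/n) + p ln(n), with n = 158 golfers and p = 6 estimated coefficients (intercept included).

```python
import math

n = 158            # sample size stated in the text
p = 6              # estimated coefficients, intercept included
sse = 787.5965     # first reported value, assumed to be SSE
r_sq = 0.6780546   # second reported value, assumed to be R-squared

# Adjusted R-squared penalizes R-squared for the number of coefficients.
adj_r_sq = 1 - (1 - r_sq) * (n - 1) / (n - p)

# Information criteria in their SSE-based least-squares form (assumed).
aic = n * math.log(sse / n) + 2 * p
sbc = n * math.log(sse / n) + p * math.log(n)

print(f"adjusted R-squared: {adj_r_sq:.4f}")
print(f"AIC: {aic:.2f}")
print(f"SBC: {sbc:.2f}")
```

Under these assumptions the results agree with the reported 0.6674643, 265.8098, and 284.1853 up to rounding, which supports reading those entries as the adjusted R-squared, AIC, and SBC.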
Final Model
These are the statistics of the chosen model.
Stats of Final Model
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -9.9257920 | 8.8445752 | -1.122246 | 0.2635273 |
| fairway | 0.0191607 | 0.0018383 | 10.423042 | 0.0000000 |
| fairPct | 0.1406976 | 0.0467571 | 3.009119 | 0.0030680 |
| driveDist | 0.2407452 | 0.0253429 | 9.499498 | 0.0000000 |
| sandAtt | -0.1105811 | 0.0172524 | -6.409602 | 0.0000000 |
| sandPct | 0.0525978 | 0.0249874 | 2.104974 | 0.0369384 |
Since the sample size of 158 is large, the Central Limit Theorem
justifies the p-values even if the residuals are not exactly normal.
The p-values of all five predictors are below 0.05, so every predictor
coefficient is significant. Only the intercept (p = 0.26) is not.
Because every predictor is significant, there is no need to perform
variable selection, and the full model is kept as the final model.
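The significance of each coefficient can be checked directly against the reported numbers. The sketch below recomputes each t statistic as the estimate divided by its standard error and flags the terms whose p-value falls below 0.05. The 0.05 threshold is an assumption, since the report does not state its significance level.

```python
# (estimate, standard error, reported p-value) copied from the final model output.
FINAL_MODEL = {
    "(Intercept)": (-9.9257920, 8.8445752, 0.2635273),
    "fairway": (0.0191607, 0.0018383, 0.0000000),
    "fairPct": (0.1406976, 0.0467571, 0.0030680),
    "driveDist": (0.2407452, 0.0253429, 0.0000000),
    "sandAtt": (-0.1105811, 0.0172524, 0.0000000),
    "sandPct": (0.0525978, 0.0249874, 0.0369384),
}

ALPHA = 0.05  # assumed significance level


def t_statistic(estimate, std_error):
    """t statistic for testing that a coefficient equals zero."""
    return estimate / std_error


# Terms whose reported p-value is below the assumed threshold.
significant = [name for name, (_, _, p) in FINAL_MODEL.items() if p < ALPHA]

for name, (estimate, std_error, p_value) in FINAL_MODEL.items():
    t = t_statistic(estimate, std_error)
    print(f"{name:12s} t = {t:8.3f}  p = {p_value:.4f}")
print("significant at 0.05:", significant)
```

Under that threshold all five predictors are flagged and the intercept is not, which does not change the decision to keep every predictor in the model.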
Conclusion/Discussion
We did not need to use techniques such as Box-Cox to transform the
response variable, because the assumption of constant variance was met.
Based on past experience with golf, we removed all the variables that
would not have any association with getting on the green in
regulation.
We looked at the residual plots and the goodness-of-fit measures to
assess the model. From these we concluded that the model did not need
any transformations and that the model was significant.
