1 Introduction

In this assignment we look at a data set about the LPGA, the women’s professional golf tour. Our response variable is greens in regulation, and we want to see how the other golf statistics in the data set relate to it. We fit an initial model, check whether it needs any transformations, and perform the necessary tests to find a good model.

1.1 Data Description

The data for this project were taken from (https://users.stat.ufl.edu/~winner/datasets.html). The variables are:

  • Golfer- Name of the Golfer
  • Nation- Where the golfer is from
  • Region- What region the golfer is from
  • fairways- How many fairways the golfer hit in regulation
  • fairAtt- How many attempts the golfer took to get to the fairway
  • fairPct- The percent of fairways hit in regulation
  • totPutts- Total number of putts the golfer had
  • totRounds- Total number of rounds played by the golfer
  • avePutts- Average number of putts per hole after reaching the green
  • greenReg- How many greens were hit in regulation
  • totPrize- Amount of money won
  • events- How many events the golfer went to
  • driveDist- The average distance that the golfer hit with their drive
  • sandSaves- The number of sand saves the golfer had
  • sandAtt- The number of shots taken from the sand
  • sandPct- The percentage of sand attempts that resulted in a sand save

1.2 Practical Question

The goal of this study is to determine how greens in regulation is associated with the predictor variables available in this data set.

2 Exploratory Data Analysis

We first look at how each predictor variable relates to our response variable, greens hit in regulation.

In the first scatter plot, the variable fairway shows a positive linear trend. Golfers who hit more fairways tend to hit more greens in regulation.

In the scatter plot for fairway percentage, the predictor values appear left skewed. Most of the golfers in this data set hit a high percentage of fairways.

The scatter plot for drive distance seems to have a stronger positive linear trend.

In the scatter plot for sand attempts, the predictor values appear slightly right skewed, with two possible outliers beyond 120 sand attempts.

The scatter plot for sand percentage also has a positive linear trend.

2.1 Full model and diagnostics

We need to fit a linear model with the predictor variables. Based on prior knowledge of golf, we have taken out Golfer, Nation, Region, totPrize, totRounds, and events. The golfer's name and origin are identifiers, and the number of events, number of rounds, and total prize cannot influence getting on the green in regulation. We also drop the variables totPutts, avePutts, sandSaves, and fairAtt because they do not affect a person’s ability to get on the green in regulation. They are either not directly related to reaching the green or describe what happens after the ball is on the green.

Regression Coefficients
Term          Estimate     Std. Error   t value     Pr(>|t|)
(Intercept)   -9.9257920   8.8445752    -1.122246   0.2635273
fairway        0.0191607   0.0018383    10.423042   0.0000000
fairPct        0.1406976   0.0467571     3.009119   0.0030680
driveDist      0.2407452   0.0253429     9.499498   0.0000000
sandAtt       -0.1105811   0.0172524    -6.409602   0.0000000
sandPct        0.0525978   0.0249874     2.104974   0.0369384
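As a quick consistency check on the coefficient table, each t value should equal the estimate divided by its standard error. The short Python sketch below recomputes them from the rounded numbers in the table. The dictionary name `coefficients` and the variable `t_values` are only illustrative, and small differences in the last digits are expected because the inputs are rounded.

```python
# Estimates and standard errors copied from the coefficient table.
coefficients = {
    "(Intercept)": (-9.9257920, 8.8445752),
    "fairway": (0.0191607, 0.0018383),
    "fairPct": (0.1406976, 0.0467571),
    "driveDist": (0.2407452, 0.0253429),
    "sandAtt": (-0.1105811, 0.0172524),
    "sandPct": (0.0525978, 0.0249874),
}

# For each term, t value = estimate / standard error.
t_values = {term: est / se for term, (est, se) in coefficients.items()}

for term, t in t_values.items():
    print(f"{term:12s} t = {t:8.3f}")
```

A larger absolute t value corresponds to a smaller p-value, which is why fairway and driveDist have the smallest p-values in the table.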

Next we look at the residual diagnostics to check how reliable our model is.

In these residual plots, the normal Q-Q plot indicates that the residuals are approximately normally distributed. The residuals vs. fitted plot does not show a cone shape, which suggests that the variance is approximately constant.

2.2 Goodness-of-fit Measures

Now, we look at the goodness-of-fit measures for the model.

Goodness-of-fit Measures of Full Model
Model        SSE        R.sq        R.adj       Cp   AIC        SBC        PRESS
full.model   787.5965   0.6780546   0.6674643   6    265.8098   284.1853   862.4798

We have a sample size of 158, which is large. From the table above, R.sq is 0.678 and R.adj is 0.667, so the model explains about 67% of the variation in greens in regulation.
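Several of the measures in the goodness-of-fit table can be recovered from SSE, R.sq, and the sample size alone. The Python sketch below does this under the assumption that the table uses the common SSE-based definitions, AIC = n ln(SSE/n) + 2p and SBC = n ln(SSE/n) + p ln(n), with p = 6 parameters (the intercept plus five predictors). These definitions are an assumption, not something stated in the report, but the results agree with the table to the digits shown.

```python
import math

n = 158           # sample size stated in the text
p = 6             # assumed parameter count: intercept + 5 predictors
sse = 787.5965    # SSE from the table
r_sq = 0.6780546  # R.sq from the table

# Adjusted R-squared penalizes R-squared for the number of parameters.
r_adj = 1 - (1 - r_sq) * (n - 1) / (n - p)

# Mallows' Cp, using the full model's own MSE as the error variance estimate.
mse_full = sse / (n - p)
cp = sse / mse_full - (n - 2 * p)

# SSE-based information criteria (assumed definitions, see lead-in).
aic = n * math.log(sse / n) + 2 * p
sbc = n * math.log(sse / n) + p * math.log(n)

print(f"R.adj = {r_adj:.4f}, Cp = {cp:.1f}, AIC = {aic:.2f}, SBC = {sbc:.2f}")
```

Note that Cp equals p for the full model by construction, so the value of 6 in the table is not evidence for or against the model on its own. It becomes informative only when compared with the Cp of smaller candidate models.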

3 Final Model

These are the statistics of the chosen model.

Stats of Final Model
Term          Estimate     Std. Error   t value     Pr(>|t|)
(Intercept)   -9.9257920   8.8445752    -1.122246   0.2635273
fairway        0.0191607   0.0018383    10.423042   0.0000000
fairPct        0.1406976   0.0467571     3.009119   0.0030680
driveDist      0.2407452   0.0253429     9.499498   0.0000000
sandAtt       -0.1105811   0.0172524    -6.409602   0.0000000
sandPct        0.0525978   0.0249874     2.104974   0.0369384

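Since the sample size of 158 is large, the Central Limit Theorem supports the validity of the p-values even if the residuals are not exactly normal. The p-values of all five predictors are below 0.05 (the largest is 0.037 for sandPct), so every slope coefficient is significant at the 5% level. The intercept is not significant (p = 0.264).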

Because every predictor is significant, there is no need to perform variable selection. The full model is the final model.
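To illustrate how the final model is used, the Python sketch below plugs a set of predictor values into the fitted equation. The golfer's values are hypothetical numbers chosen only for illustration and do not come from the data set. The function name `predict_green_reg` is likewise made up for this example.

```python
# Coefficient estimates copied from the final model table.
coef = {
    "(Intercept)": -9.9257920,
    "fairway": 0.0191607,
    "fairPct": 0.1406976,
    "driveDist": 0.2407452,
    "sandAtt": -0.1105811,
    "sandPct": 0.0525978,
}

def predict_green_reg(golfer):
    """Return the fitted greenReg value for a dict of predictor values."""
    return coef["(Intercept)"] + sum(coef[name] * value for name, value in golfer.items())

# Hypothetical golfer (illustrative values, not taken from the data set).
hypothetical_golfer = {
    "fairway": 600,
    "fairPct": 70,
    "driveDist": 250,
    "sandAtt": 60,
    "sandPct": 40,
}

prediction = predict_green_reg(hypothetical_golfer)

# Same golfer with driveDist increased by 10, all else held fixed.
longer_hitter = dict(hypothetical_golfer, driveDist=hypothetical_golfer["driveDist"] + 10)
difference = predict_green_reg(longer_hitter) - prediction

print(f"predicted greenReg: {prediction:.2f}")
print(f"change for +10 driveDist: {difference:.2f}")
```

The second result shows how a slope is read: holding the other predictors fixed, a 10-unit increase in driveDist changes the fitted greenReg by 10 times the driveDist coefficient.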

4 Conclusion/Discussion

We didn’t have to use a technique such as Box-Cox to transform the response variable, because the residual diagnostics supported the assumptions of normality and constant variance. Based on our prior experience with golf, we removed all the variables that would not have any meaningful association with getting on the green in regulation before fitting the model.

We looked at the residual plots and the goodness-of-fit measures to assess the model. From these we concluded that the model did not need any transformations and that all of its predictors were significant.

