Abstract

A group of researchers from Northeastern University has developed a new method for ranking the total driving performance of golfers on the Professional Golfers' Association (PGA) Tour (Sport Journal, Winter 2007). The method requires knowing a golfer’s average driving distance (yards) and driving accuracy (percent of drives that land in the fairway). A professional golfer is practicing a new swing to increase his average driving distance and wants to know whether this will affect his driving accuracy. To find out, we look at data for the top 40 PGA golfers (as ranked by the new method) and use simple linear regression to analyze the validity of his concern.

Introduction

We are given a data set of the top 40 golfers, which includes their driving distance and driving accuracy. We first hypothesize a straight-line model for the data:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

\[\begin{align*} y &= \text{the dependent variable, driving accuracy (percent)}\\ x &= \text{the independent variable, driving distance (yards)}\\ E(y) &= \beta_0 + \beta_1 x = \text{Deterministic component}\\ \epsilon &= \text{Random error component}\\ \beta_0 &= \text{(beta zero)} = y\text{-intercept}\\ \beta_1 &= \text{(beta one)} = \text{Slope of the line} \end{align*}\]

Next, we start to analyze the given data by looking at a scatterplot to get a sense of the approximate values of our parameters.

We can clearly see from the plot that there appears to be a linear relationship between our two variables, distance and accuracy.
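A scatterplot of this kind can be drawn with base R's `plot()`. The sketch below is only illustrative: the data frame `demo` is simulated stand-in data (its range, slope, and noise level are assumptions chosen to loosely resemble a distance/accuracy relationship), not the `pgadriver` data analyzed in this report.

```r
# Simulated stand-in data: NOT the PGA measurements used in this report.
set.seed(1)
demo <- data.frame(distance = runif(40, min = 280, max = 320))
demo$accuracy <- 250 - 0.63 * demo$distance + rnorm(40, sd = 2)

# Independent variable (distance) on the horizontal axis,
# dependent variable (accuracy) on the vertical axis.
plot(accuracy ~ distance, data = demo,
     xlab = "Average driving distance (yards)",
     ylab = "Driving accuracy (% of fairways hit)",
     main = "Accuracy vs. distance (simulated data)")
```

With the real data, the same call would presumably use `data = pgadriver` in place of `demo`.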

Now that the scatterplot suggests a linear relationship, it’s time to build our model. We do this by regressing accuracy on distance, using the given PGA tour data, as follows:

accuracyMod <- lm(accuracy ~ distance, data = pgadriver)


Now that our model is built, we can analyze the results:

## 
## Call:
## lm(formula = accuracy ~ distance, data = pgadriver)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7923 -1.4000 -0.2936  1.0767  5.0493 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 250.14203   14.23101   17.58  < 2e-16 ***
## distance     -0.62944    0.04759  -13.23 8.48e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.236 on 38 degrees of freedom
## Multiple R-squared:  0.8216, Adjusted R-squared:  0.8169 
## F-statistic: 174.9 on 1 and 38 DF,  p-value: 8.478e-16



Plotting the least squares line on our original scatterplot yields
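One way to overlay a least squares line on a scatterplot is `abline()`, which accepts an `lm` object directly. The following is a minimal sketch on simulated stand-in data (`demo` is an assumption for illustration, not the `pgadriver` data).

```r
# Simulated stand-in data: NOT the PGA measurements used in this report.
set.seed(1)
demo <- data.frame(distance = runif(40, min = 280, max = 320))
demo$accuracy <- 250 - 0.63 * demo$distance + rnorm(40, sd = 2)

# Fit the straight-line model, then draw the points and the fitted line.
fit <- lm(accuracy ~ distance, data = demo)
plot(accuracy ~ distance, data = demo,
     xlab = "Average driving distance (yards)",
     ylab = "Driving accuracy (% of fairways hit)")
abline(fit, col = "red", lwd = 2)  # uses the intercept and slope stored in fit
```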



Based on our linear model, the estimated \(y\)-intercept \(\hat{\beta_0}\) of our fitted model is

## [1] 250.142

and the estimated slope \(\hat{\beta_1}\) of our fitted model is

## [1] -0.6294431
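For simple linear regression the least squares estimates have closed forms, \(\hat{\beta_1} = SS_{xy}/SS_{xx}\) and \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\). The sketch below computes them by hand and checks them against `coef()`. It runs on simulated stand-in data (`demo` is an assumption for illustration), so its numbers will not match the estimates reported here.

```r
# Simulated stand-in data: NOT the PGA measurements used in this report.
set.seed(1)
demo <- data.frame(distance = runif(40, min = 280, max = 320))
demo$accuracy <- 250 - 0.63 * demo$distance + rnorm(40, sd = 2)

x <- demo$distance
y <- demo$accuracy

# Sums of squares used by the closed-form least squares estimates.
ss_xy <- sum((x - mean(x)) * (y - mean(y)))
ss_xx <- sum((x - mean(x))^2)

b1_hand <- ss_xy / ss_xx                # slope
b0_hand <- mean(y) - b1_hand * mean(x)  # intercept

# lm() should agree with the hand computation up to rounding error.
fit <- lm(accuracy ~ distance, data = demo)
all.equal(unname(coef(fit)), c(b0_hand, b1_hand))
```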


This gives us a predicted model as follows,

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]

Or,

\[ \hat{y} = 250.1420271 - 0.6294431 x \]
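To make the fitted equation concrete, it can be evaluated at a chosen distance. The sketch below uses the coefficients reported in this section. The helper name `predict_accuracy` and the 300-yard input are illustrative assumptions, and a prediction is only meaningful for distances inside the range covered by the data.

```r
# Least squares estimates reported in this section.
b0 <- 250.1420271
b1 <- -0.6294431

# Hypothetical helper: predicted accuracy (%) for an average distance in yards.
predict_accuracy <- function(distance) b0 + b1 * distance

pred_300 <- predict_accuracy(300)  # illustrative input, about 61.3 percent
pred_300

# Predicted change in accuracy for 10 extra yards is 10 * b1, about -6.3 points.
predict_accuracy(310) - predict_accuracy(300)
```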

Summary

Our analysis rests on the following model, fitted line, and assumptions about the error term:

  1. We assumed that the probabilistic model relating the driving accuracy \(y\) to the driving distance \(x\) is

\[ y = \beta_0 + \beta_1 x + \epsilon. \]

  2. The least squares estimate of the deterministic component of the model \(\beta_0 + \beta_1 x\) is

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x = 250.1420271 - 0.6294431 x. \]

  3. The mean of the probability distribution of \(\epsilon\) is \(0\).

  4. The variance of the probability distribution of \(\epsilon\) is constant for all settings of the independent variable \(x\).

  5. The probability distribution of \(\epsilon\) is normal.
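The assumptions about \(\epsilon\) listed above are usually checked informally through the residuals. The following is a minimal sketch on simulated stand-in data (`demo` is an assumption for illustration, not the `pgadriver` data); it shows the kind of plots one might inspect rather than a result for this report.

```r
# Simulated stand-in data: NOT the PGA measurements used in this report.
set.seed(1)
demo <- data.frame(distance = runif(40, min = 280, max = 320))
demo$accuracy <- 250 - 0.63 * demo$distance + rnorm(40, sd = 2)

fit <- lm(accuracy ~ distance, data = demo)
res <- residuals(fit)

# With an intercept in the model, least squares residuals average to zero
# by construction, so this is a sanity check rather than a test of the
# mean-zero assumption.
mean(res)

old_par <- par(mfrow = c(1, 2))
# Constant variance: look for an even band around zero with no funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normality: points should fall close to the reference line.
qqnorm(res)
qqline(res)
par(old_par)
```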

Now, to assess the utility of the model, we use a significance level of \(\alpha = .05\), which corresponds to a confidence level of \(95\%\).

## 
## Call:
## lm(formula = accuracy ~ distance, data = pgadriver)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7923 -1.4000 -0.2936  1.0767  5.0493 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 250.14203   14.23101   17.58  < 2e-16 ***
## distance     -0.62944    0.04759  -13.23 8.48e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.236 on 38 degrees of freedom
## Multiple R-squared:  0.8216, Adjusted R-squared:  0.8169 
## F-statistic: 174.9 on 1 and 38 DF,  p-value: 8.478e-16



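Since the \(p\)-value for the slope, \(Pr(>|t|) = 8.48 \times 10^{-16}\), is less than \(\alpha = .05\), we reject \(H_0: \beta_1 = 0\) and conclude that \(\beta_1 \neq 0\). In other words, driving distance contributes information for predicting driving accuracy, so our model is useful.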
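The test on the slope can be reproduced from the estimate and standard error in the `distance` row of the summary output. The sketch below assumes the usual two-sided \(t\) test with \(n - 2 = 38\) degrees of freedom. Because the inputs are the rounded values printed in the output, the results agree with the table only up to rounding.

```r
# Values copied from the "distance" row of the summary output.
est <- -0.62944  # estimated slope
se  <- 0.04759   # standard error of the slope
df  <- 38        # residual degrees of freedom (n - 2)

# Test statistic for H0: beta_1 = 0, and its two-sided p-value.
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the slope.
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se

t_stat   # about -13.23, matching the "t value" column
p_value  # far below alpha = 0.05
ci       # both endpoints are negative
```

The interval runs from roughly -0.73 to -0.53, so it excludes zero, which is consistent with rejecting \(H_0: \beta_1 = 0\).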

Now turning to the correlation,

cor(pgadriver$distance, pgadriver$accuracy)
## [1] -0.906395



Since the correlation \(r = -0.906395\) is close to \(-1\), there is a strong negative linear correlation between a golfer’s driving distance and driving accuracy. In other words, accuracy tends to decrease as distance goes up. This was evident in the scatterplots above.
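In simple linear regression the squared correlation equals the coefficient of determination, so the correlation can be cross-checked against the `Multiple R-squared` line of the summary output. A short sketch using the values reported in this section:

```r
# Correlation reported by cor(pgadriver$distance, pgadriver$accuracy).
r <- -0.906395

# Squared correlation, rounded to four places as in the summary output.
r_squared <- round(r^2, 4)
r_squared  # 0.8216, the reported Multiple R-squared

# The sign of r matches the sign of the fitted slope.
slope <- -0.6294431
sign(r) == sign(slope)
```

About 82% of the sample variation in accuracy is therefore explained by the straight-line relationship with distance.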

Conclusion

In conclusion, we can see that we were able to hypothesize a model

\[ y = \beta_0 + \beta_1 x + \epsilon \]

and fit the data to come up with a prediction model

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x = 250.1420271 - 0.6294431 x. \]

Based on this we were able to observe the correlation between a golfer’s driving distance and driving accuracy

cor(pgadriver$distance, pgadriver$accuracy)
## [1] -0.906395



and conclude that the golfer’s concern is indeed valid. If he were to change his swing to increase his driving distance, the fitted model predicts that his accuracy would decrease, by about \(0.63\) percentage points for each additional yard of average distance.