A group of researchers from Northeastern University has developed a new method for ranking the total driving performance of golfers on the Professional Golf Association (PGA) tour (Sport Journal, Winter 2007). The method requires knowing a golfer’s average driving distance (yards) and driving accuracy (percent of drives that land in the fairway). A professional golfer is practicing a new swing to increase his average driving distance and wants to know if this will affect his driving accuracy. To answer this, we look at data for the top 40 PGA golfers (as ranked by the new method) and use simple linear regression to analyze the validity of his concern.
We are given a data set for the top 40 golfers that includes their driving distance and driving accuracy. We must first propose a hypothesized model for the data.
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Where:
\[\begin{align*} y &= \text{the dependent variable, driving accuracy (percent)}\\ x &= \text{the independent variable, driving distance (yards)}\\ E(y) &= \beta_0 + \beta_1 x = \text{Deterministic component}\\ \epsilon &= \text{Random error component}\\ \beta_0 &= \text{(beta zero)} = y\text{-intercept}\\ \beta_1 &= \text{(beta one)} = \text{Slope of the line} \end{align*}\]
Next, we begin analyzing the given data by looking at a scatterplot to gain information about the approximate values of our parameters.
The plot shows an approximately linear relationship between our two variables, distance and accuracy.
Now that we have established that the relationship looks linear, it’s time to build our model. We do this by regressing accuracy on distance using our given PGA tour data, as follows:
accuracyMod <- lm(accuracy ~ distance, data = pgadriver)
Now that our model is built, we can analyze the results:
##
## Call:
## lm(formula = accuracy ~ distance, data = pgadriver)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7923 -1.4000 -0.2936 1.0767 5.0493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 250.14203 14.23101 17.58 < 2e-16 ***
## distance -0.62944 0.04759 -13.23 8.48e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.236 on 38 degrees of freedom
## Multiple R-squared: 0.8216, Adjusted R-squared: 0.8169
## F-statistic: 174.9 on 1 and 38 DF, p-value: 8.478e-16
Plotting the least squares line on our original scatterplot yields
Based on our linear model, we can see that \(\hat{\beta_0}\), the \(y\)-intercept of our predicted model, is
## [1] 250.142
and \(\hat{\beta_1}\), the slope of our predicted model, is
## [1] -0.6294431
This gives us the following predicted model:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x \]
or, with the estimates substituted,
\[ \hat{y} = 250.1420271 - 0.6294431 x \]
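To make the fitted line concrete, the following minimal sketch evaluates it at a single driving distance. The coefficients are copied from the output above; the helper name `predict_accuracy` and the 300-yard input are illustrative assumptions, not values taken from the data set.

```r
# Coefficients copied from the fitted model output above
b0 <- 250.1420271   # estimated y-intercept
b1 <- -0.6294431    # estimated slope (accuracy percentage points per yard)

# Hypothetical helper: predicted driving accuracy (%) at a given distance (yards)
predict_accuracy <- function(distance) b0 + b1 * distance

# 300 yards is an illustrative input, not an observation from pgadriver
pred_300 <- predict_accuracy(300)
print(pred_300)

# Change in predicted accuracy for a 10-yard gain in distance
gain_10 <- predict_accuracy(310) - predict_accuracy(300)
print(gain_10)
```

Under the fitted line, each additional yard of driving distance corresponds to a drop of about 0.63 percentage points in predicted accuracy. With the model object available, `predict(accuracyMod, newdata = data.frame(distance = 300))` should give the same kind of prediction.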
Working with this model requires some assumptions about the random error \(\epsilon\) in the hypothesized model
\[ y = \beta_0 + \beta_1 x + \epsilon, \]
whose fitted line is
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x = 250.1420271 - 0.6294431 x. \]
The assumptions are as follows:
The mean of the probability distribution of \(\epsilon\) is \(0\).
The variance of the probability distribution of \(\epsilon\) is constant for all settings of the independent variable \(x\).
The probability distribution of \(\epsilon\) is normal.
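The assumptions above are usually checked through the residuals of the fitted model. The sketch below shows one common way to do that. Because the `pgadriver` data are not reproduced here, it runs on simulated stand-in data (the data frame `sim` and its parameters are assumptions made purely for illustration); with the real data, `accuracyMod` would take the place of `fit`.

```r
# Simulated stand-in data: NOT the PGA measurements, only a placeholder
set.seed(1)
sim <- data.frame(distance = runif(40, 280, 320))
sim$accuracy <- 250 - 0.63 * sim$distance + rnorm(40, sd = 2)

fit <- lm(accuracy ~ distance, data = sim)
res <- resid(fit)

# Mean-zero errors: least squares residuals average to (numerically) zero
mean_res <- mean(res)

# Constant variance: the spread of residuals should not change with the fit
plot(fitted(fit), res, xlab = "Fitted accuracy", ylab = "Residual")
abline(h = 0, lty = 2)

# Normality: points should fall near the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test as a numerical companion to the Q-Q plot
sw <- shapiro.test(res)
print(sw$p.value)
```

A funnel shape in the first plot would suggest non-constant variance, and strong curvature in the Q-Q plot would suggest non-normal errors.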
Now, to assess the utility of the model, we will use a significance level of \(\alpha = .05\), that is, a confidence level of \(95\%\).
##
## Call:
## lm(formula = accuracy ~ distance, data = pgadriver)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7923 -1.4000 -0.2936 1.0767 5.0493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 250.14203 14.23101 17.58 < 2e-16 ***
## distance -0.62944 0.04759 -13.23 8.48e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.236 on 38 degrees of freedom
## Multiple R-squared: 0.8216, Adjusted R-squared: 0.8169
## F-statistic: 174.9 on 1 and 38 DF, p-value: 8.478e-16
Since the p-value for the slope, \(Pr(>|t|) = 8.48 \times 10^{-16}\), is less than \(.05\), we reject the null hypothesis \(\beta_1 = 0\) and conclude that \(\beta_1 \neq 0\); in other words, our model is useful.
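The slope test can be reproduced by hand from the numbers in the summary output. The following sketch recomputes the t statistic, the two-sided p-value, and a 95% confidence interval for the slope; the variable names are illustrative, and the inputs are the rounded values printed above, so the results agree with the output only up to rounding.

```r
est <- -0.62944   # slope estimate from the summary output
se  <- 0.04759    # standard error of the slope
df  <- 38         # residual degrees of freedom (n - 2 = 40 - 2)

# t statistic for H0: beta_1 = 0
t_stat <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(t_stat)
print(p_val)
print(ci)
```

Both ends of the interval are negative, which agrees with rejecting \(\beta_1 = 0\) at \(\alpha = .05\). With the model object available, `confint(accuracyMod)` should give the corresponding interval directly.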
Now turning to the correlation,
cor(pgadriver$distance, pgadriver$accuracy)
## [1] -0.906395
Since the correlation \(= -0.906395\) is close to \(-1\), we can say that there is a strong negative linear correlation between a golfer’s driving distance and accuracy. In other words, a golfer’s accuracy tends to decrease as his distance goes up. This was evident in the scatterplots above.
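In simple linear regression the correlation ties directly to the summary output: the squared correlation equals the multiple R-squared, and the squared t statistic for the slope equals the F statistic. A short consistency check using the printed (rounded) values, with illustrative variable names:

```r
r <- -0.906395                  # correlation reported by cor()
t_slope <- -0.62944 / 0.04759   # slope estimate divided by its standard error

# Squared correlation: compare with Multiple R-squared (0.8216)
r_squared <- r^2

# Squared t statistic: compare with the F-statistic (174.9)
f_stat <- t_slope^2

print(r_squared)
print(f_stat)

# The correlation and the slope always share the same sign
print(sign(r) == sign(t_slope))
```

So roughly 82% of the sample variation in driving accuracy is explained by the linear relationship with driving distance.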
In conclusion, we hypothesized a model
\[ y = \beta_0 + \beta_1 x + \epsilon \]
and fit it to the data to obtain a prediction model
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x = 250.1420271 - 0.6294431 x. \]
Based on this we were able to observe the correlation between a golfer’s driving distance and driving accuracy
cor(pgadriver$distance, pgadriver$accuracy)
## [1] -0.906395
and conclude that the golfer’s concern was indeed valid. The data suggest that if he changes his swing to increase his driving distance, his accuracy is likely to decrease.