LAST NAME: RISER
FIRST NAME: SAMANTHA
M#: M05023930
First, set the working directory
setwd("~/OneNote Notebooks/7038-DA Methods/Homework 3")
# Packages required
library(tidyverse)
library(dplyr)
Data set PGA.csv contains statistics of performance and winnings for 196 PGA tour participants during 2004 season [https://www.pgatour.com/]. Here is the list of variable: Name, Age, Average Drive (Yards), Driving accuracy (percent), Greens on regulation (%), Average # of putts, Save Percent, Money Rank, # Events, Total Winnings ($), Average winnings.
Please read PGA data (PGA.csv) into R and visualize it. Note that Average winnings = Total Winnings/# Events. For this data set, fit a multiple linear regression to the data. Use Average winnings as the response variable and use Age, Average Drive (Yards), Driving Accuracy (percent), Greens on Regulation (%), Average # of Putts, Save Percent, and # Events as covariates.
pga <- read.csv("PGA.csv", header = T, stringsAsFactors = FALSE)
head(pga, 5)
## Name Age AverageDrive DrivingAccuracy GreensonRegulation
## 1 Aaron Baddeley 23 288.0 53.1 58.2
## 2 Adam Scott 24 295.4 57.7 65.6
## 3 Alex Cejka 34 285.8 64.2 63.8
## 4 Andre Stolz 34 297.9 59.0 63.0
## 5 Arjun Atwal 31 289.4 60.5 62.5
## AverageNumofPutts SavePercent MoneyRank NumEvents TotalWinnings
## 1 1.767 50.9 123 27 632878
## 2 1.757 59.3 7 16 3724984
## 3 1.795 50.7 54 24 1313484
## 4 1.787 47.7 101 20 808373
## 5 1.766 43.5 146 30 486053
## AverageWinnings
## 1 23440
## 2 232812
## 3 54729
## 4 40419
## 5 16202
Some brief visualizations:
Let’s see the spread of the players’ ages, average drives, and driving accuracy.
Observations from the Histograms:
Most players seem to be in their 30’s, but ages range from low 20’s to mid 50’s.
The players’ average drives center around ~285 yards, and the distribution of average drive across all players is roughly normal.
Do we have any players who win signficantly more money than the majority of players? We can look for outliers.
Observations from these boxplots:
The distribution of winnings is heavily skewed to the right. There are many players who can be considered outliers. This means there are players that have won dramatiaclly more money or are dramatically more successful on average than the majority of players.
There is one outlier even farther away than the other outliers in the “Total Winnings” boxplot. We do not see a similar dramatic outlier in the “Average Winnings” boxplot. This may indicate that one player won an important tournament with a large cash prize, but lost enough other tournaments that he is not as successful on average.
pga_fit <- lm(AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, data = pga)
summary(pga_fit)
##
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy +
## GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents,
## data = pga)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71690 -22176 -6735 17147 247928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 945579.88 305886.59 3.091 0.00230 **
## Age -587.13 519.32 -1.131 0.25968
## AverageDrive -94.76 567.42 -0.167 0.86755
## DrivingAccuracy -2360.57 854.02 -2.764 0.00628 **
## GreensonRegulation 8466.04 1303.87 6.493 7.30e-10 ***
## AverageNumofPutts -694226.49 138155.99 -5.025 1.17e-06 ***
## SavePercent 1395.67 587.54 2.375 0.01853 *
## NumEvents -3159.22 644.24 -4.904 2.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41430 on 188 degrees of freedom
## Multiple R-squared: 0.4527, Adjusted R-squared: 0.4323
## F-statistic: 22.21 on 7 and 188 DF, p-value: < 2.2e-16
Using the fitted regression in previous question, obtain the prediction of the response variable for a new player with Age = 35, AverageDrive = 287, DrivingAccuracy = 64, GreensonRegulation = 64.9, AverageNumofPutts = 1.778, SavePercent = 48, NumEvents = 26. Provide a prediction interval and explain what it means.
& 3.
# create player 2
pred_data2 <- data.frame(Age=35, AverageDrive=287, DrivingAccuracy=64, GreensonRegulation=64.9, AverageNumofPutts=1.778, SavePercent=48, NumEvents=26)
# apply player's stats and predict their Average Winnings
pred_val2 <- predict(pga_fit, pred_data2, interval = "prediction", level = 0.95, type = "response")
two <- cbind(pred_data2, pred_val2) # will be used later
pred_val2
## fit lwr upr
## 1 46720.76 -35236.56 128678.1
Similarly, make another prediction for the case of Age = 42, AverageDrive = 295, DrivingAccuracy = 69, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30.
& 3.
# Create player 3
pred_data3 <- data.frame(Age=42, AverageDrive=295, DrivingAccuracy=69, GreensonRegulation=67.7, AverageNumofPutts=1.80, SavePercent=54, NumEvents=30)
# apply player's stats and predict their Average Winnings
pred_val3 <- predict(pga_fit, pred_data3, interval = "prediction", level = 0.95, type = "response")
three <- cbind(pred_data3, pred_val3) # will be used later
pred_val3
## fit lwr upr
## 1 34218.97 -49843.5 118281.4
Make a third prediction for the case of Age = 45, AverageDrive = 320, DrivingAccuracy = 70, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30.
# create player 4
pred_data4 <- data.frame(Age=45, AverageDrive=320, DrivingAccuracy=70, GreensonRegulation=67.7, AverageNumofPutts=1.80, SavePercent=54, NumEvents=30)
# apply player's stats and predict their Average Winnings
pred_val4 <- predict(pga_fit, pred_data4, interval = "prediction", level = 0.95, type = "response")
four <- cbind(pred_data4, pred_val4) # will be used later
pred_val4
## fit lwr upr
## 1 27727.99 -65911.49 121367.5
Compare the predictions from previous questions, what do you observe? For example, which prediction interval is wider? which prediction dosen’t make sense? Among these predictions, which one is extrapolation? And why?
Compare the Prediction Intervals of all three players (Q2-4) and note observations.
& 3.
pred_table <- rbind(two, three, four)
pred_table <- as.data.frame(pred_table)
pred_table <- mutate(pred_table, upr-lwr)
pred_table
## Age AverageDrive DrivingAccuracy GreensonRegulation AverageNumofPutts
## 1 35 287 64 64.9 1.778
## 2 42 295 69 67.7 1.800
## 3 45 320 70 67.7 1.800
## SavePercent NumEvents fit lwr upr upr - lwr
## 1 48 26 46720.76 -35236.56 128678.1 163914.6
## 2 54 30 34218.97 -49843.50 118281.4 168125.0
## 3 54 30 27727.99 -65911.49 121367.5 187279.0
A few observations:
Use Leverage to determine which of the 3 observations are considered to be an Extrapolation.
& 3.
# the maximum leverage in the given data set is calculated by the following:
max(cbind(pga, leverage = hatvalues(pga_fit))$leverage)
## [1] 0.1568547
two_new <- c(1,35,287,64,64.9,1.778,48,26)
three_new <- c(1,42,295,69,67.7,1.800,54,30)
four_new <- c(1,45,320,70,67.7,1.800,54,30)
X = model.matrix(pga_fit)
leverage2 <- t(two_new) %*% (solve(t(X) %*% X)) %*% two_new
leverage3 <- t(three_new) %*% (solve(t(X) %*% X)) %*% three_new
leverage4 <- t(four_new) %*% (solve(t(X) %*% X)) %*% four_new
rbind(leverage2, leverage3, leverage4)
## [,1]
## [1,] 0.005502717
## [2,] 0.057820788
## [3,] 0.312579929
Using leverage as our method of identifying extrapolations from our data, we can confirm that the third player used for prediction is extrapolated. The leverage of the player is 0.3125799 which is larger than the maximum leverage value in the dataset, given to be 0.1568547. This intuitively makes sense, as the prediction interval was the largest for this predicted observation.
Obtain the standardized regression coefficients and compare the influence of all variables.
& 3.
# transforming the data
library(dplyr)
pga_red <- select(pga, AverageWinnings, Age, AverageDrive, DrivingAccuracy, GreensonRegulation, AverageNumofPutts, SavePercent, NumEvents)
pga_unit_normal <- as.data.frame(apply(pga_red, 2, function(x){(x-mean(x))/sd(x)}))
# redoing the regression
pga_fit_unit_normal <- lm(AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, data = pga_unit_normal)
# obtain the standardized lm coefficients
pga_fit_unit_normal$coefficients
## (Intercept) Age AverageDrive
## 7.713999e-16 -6.831285e-02 -1.425906e-02
## DrivingAccuracy GreensonRegulation AverageNumofPutts
## -2.279022e-01 4.388314e-01 -2.970658e-01
## SavePercent NumEvents
## 1.396050e-01 -2.716132e-01
It appears as though Greens on Regulation has the biggest influence on Average Winnings, since the absolute value of the covariate’s coefficient is the largest of all coefficients.
Obtain the residuals of the fitted regression. Perform the residual analysis. What does the residual analysis tell you? For example, are all four assumptions satisfied? If not, which ones are not satisfied. Suppose you take a logarithm of the response variable and fit the regression, does that help with the residuals?
# obtain the residuals
MSRes <- summary(pga_fit)$sigma^2
head(MSRes)
## [1] 1716669098
# obtain the standardized residuals
std_res <- pga_fit$residuals / summary(pga_fit)$sigma
head(std_res)
## 1 2 3 4 5 6
## -0.1873876 2.3577753 0.4539229 -0.3345661 -0.2412748 -0.2443369
We can use the qqplot to determine if the normality condition is met. The condition is met (meaning the residuals are normally distributed) if the Q-Q Plot is around the 45 degree line. The plot below shows that the data has a positive skew (non-normal errors). It’s not ideal but I don’t think it’s extreme enough to cause a problem.
qqnorm(pga_fit$residuals)
qqline(pga_fit$residuals)
Here we will plot all the residuals against the fitted values of each of the regressors. If the linearity and equal variance assumptions are satisfied, we would expect the dots to be evenly distributed around zero with a constant variance across the x-axis. (Referencing the slides for this explanation). No obvious patterns appear in the graphs below, except: The graph of residuals versus fitted values appears to have a pattern. This indicates nonnormal errors.
# Generate a residual plots
par(mfrow = c(3,3))
plot(pga$Age, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageDrive, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$DrivingAccuracy, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$GreensonRegulation, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageNumofPutts, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$SavePercent, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$NumEvents, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga_fit$fitted.values, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
There’s no time order in this data and therefore this is not applicable.
# refit the model with log of the repsonse variable
pga_fit_log <- lm(log(AverageWinnings) ~ Age + AverageDrive + DrivingAccuracy + GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, data = pga)
summary(pga_fit_log)
##
## Call:
## lm(formula = log(AverageWinnings) ~ Age + AverageDrive + DrivingAccuracy +
## GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents,
## data = pga)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77513 -0.41331 -0.01228 0.43022 1.73912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.831322 4.482002 7.548 1.85e-12 ***
## Age -0.008215 0.007609 -1.080 0.28169
## AverageDrive -0.006110 0.008314 -0.735 0.46335
## DrivingAccuracy -0.020495 0.012514 -1.638 0.10313
## GreensonRegulation 0.190382 0.019105 9.965 < 2e-16 ***
## AverageNumofPutts -18.457365 2.024330 -9.118 < 2e-16 ***
## SavePercent 0.027000 0.008609 3.136 0.00199 **
## NumEvents -0.039754 0.009440 -4.211 3.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6071 on 188 degrees of freedom
## Multiple R-squared: 0.6358, Adjusted R-squared: 0.6223
## F-statistic: 46.89 on 7 and 188 DF, p-value: < 2.2e-16
# Generate a residual plots
par(mfrow = c(3,3))
plot(pga$Age, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageDrive, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$DrivingAccuracy, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$GreensonRegulation, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageNumofPutts, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$SavePercent, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$NumEvents, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga_fit$fitted.values, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")