7038 Homework 3

LAST NAME: RISER

FIRST NAME: SAMANTHA

M#: M05023930

First, set the working directory

setwd("~/OneNote Notebooks/7038-DA Methods/Homework 3")

# Packages required
library(tidyverse)
library(dplyr)

Data set PGA.csv contains statistics of performance and winnings for 196 PGA tour participants during 2004 season [https://www.pgatour.com/]. Here is the list of variable: Name, Age, Average Drive (Yards), Driving accuracy (percent), Greens on regulation (%), Average # of putts, Save Percent, Money Rank, # Events, Total Winnings ($), Average winnings.

Question 1

Please read PGA data (PGA.csv) into R and visualize it. Note that Average winnings = Total Winnings/# Events. For this data set, fit a multiple linear regression to the data. Use Average winnings as the response variable and use Age, Average Drive (Yards), Driving Accuracy (percent), Greens on Regulation (%), Average # of Putts, Save Percent, and # Events as covariates.

Part I

Read PGA data into R and visualize it.
& 3.

pga <- read.csv("PGA.csv", header = T, stringsAsFactors = FALSE)
head(pga, 5)

##             Name Age AverageDrive DrivingAccuracy GreensonRegulation
## 1 Aaron Baddeley  23        288.0            53.1               58.2
## 2     Adam Scott  24        295.4            57.7               65.6
## 3     Alex Cejka  34        285.8            64.2               63.8
## 4    Andre Stolz  34        297.9            59.0               63.0
## 5    Arjun Atwal  31        289.4            60.5               62.5
##   AverageNumofPutts SavePercent MoneyRank NumEvents TotalWinnings
## 1             1.767        50.9       123        27        632878
## 2             1.757        59.3         7        16       3724984
## 3             1.795        50.7        54        24       1313484
## 4             1.787        47.7       101        20        808373
## 5             1.766        43.5       146        30        486053
##   AverageWinnings
## 1           23440
## 2          232812
## 3           54729
## 4           40419
## 5           16202

Some brief visualizations:

Let’s see the spread of the players’ ages, average drives, and driving accuracy.

Observations from the Histograms:

Most players seem to be in their 30’s, but ages range from low 20’s to mid 50’s.
The players’ average drives center around ~285 yards, and the distribution of average drive across all players is roughly normal.

Do we have any players who win signficantly more money than the majority of players? We can look for outliers.

Observations from these boxplots:

The distribution of winnings is heavily skewed to the right. There are many players who can be considered outliers. This means there are players that have won dramatiaclly more money or are dramatically more successful on average than the majority of players.
There is one outlier even farther away than the other outliers in the “Total Winnings” boxplot. We do not see a similar dramatic outlier in the “Average Winnings” boxplot. This may indicate that one player won an important tournament with a large cash prize, but lost enough other tournaments that he is not as successful on average.

Part II

For this data set, fit a multiple linear regression to the data.
& 3.

pga_fit <- lm(AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, data = pga)
summary(pga_fit)

## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + 
##     GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, 
##     data = pga)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71690 -22176  -6735  17147 247928 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         945579.88  305886.59   3.091  0.00230 ** 
## Age                   -587.13     519.32  -1.131  0.25968    
## AverageDrive           -94.76     567.42  -0.167  0.86755    
## DrivingAccuracy      -2360.57     854.02  -2.764  0.00628 ** 
## GreensonRegulation    8466.04    1303.87   6.493 7.30e-10 ***
## AverageNumofPutts  -694226.49  138155.99  -5.025 1.17e-06 ***
## SavePercent           1395.67     587.54   2.375  0.01853 *  
## NumEvents            -3159.22     644.24  -4.904 2.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41430 on 188 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4323 
## F-statistic: 22.21 on 7 and 188 DF,  p-value: < 2.2e-16

A brief look at the summary output of the requested linear model indicates that the model is signficant (suing the F test, P-value < 2.2e-16 which is significant at the 1% level.)

Question 2

Using the fitted regression in previous question, obtain the prediction of the response variable for a new player with Age = 35, AverageDrive = 287, DrivingAccuracy = 64, GreensonRegulation = 64.9, AverageNumofPutts = 1.778, SavePercent = 48, NumEvents = 26. Provide a prediction interval and explain what it means.
& 3.

# create player 2
pred_data2 <- data.frame(Age=35, AverageDrive=287, DrivingAccuracy=64, GreensonRegulation=64.9, AverageNumofPutts=1.778, SavePercent=48, NumEvents=26)

# apply player's stats and predict their Average Winnings
pred_val2 <- predict(pga_fit, pred_data2, interval = "prediction", level = 0.95, type = "response")
two <- cbind(pred_data2, pred_val2)  # will be used later
pred_val2

##        fit       lwr      upr
## 1 46720.76 -35236.56 128678.1

Given the specific values for the covariates (above), the predicted value of AverageWinnings for this player is $46,721. We can say with 95% confidence that the true average winnings given someone with the stats represented in the covariates above is between zero dollars (because you can’t win negative dollars) and $128,678.

Question 3

Similarly, make another prediction for the case of Age = 42, AverageDrive = 295, DrivingAccuracy = 69, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30.
& 3.

# Create player 3
pred_data3 <- data.frame(Age=42, AverageDrive=295, DrivingAccuracy=69, GreensonRegulation=67.7, AverageNumofPutts=1.80, SavePercent=54, NumEvents=30)

# apply player's stats and predict their Average Winnings
pred_val3 <- predict(pga_fit, pred_data3, interval = "prediction", level = 0.95, type = "response")
three <- cbind(pred_data3, pred_val3)  # will be used later
pred_val3

##        fit      lwr      upr
## 1 34218.97 -49843.5 118281.4

Given the specific values for the covariates (above), the predicted value of AverageWinnings for this player is $34,219. We can say with 95% confidence that the true average winnings given someone with the stats represented in the covariates above is between zero dollars (because you can’t win negative dollars) and $118,281.

Question 4

Make a third prediction for the case of Age = 45, AverageDrive = 320, DrivingAccuracy = 70, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30.

& 3.

# create player 4
pred_data4 <- data.frame(Age=45, AverageDrive=320, DrivingAccuracy=70, GreensonRegulation=67.7, AverageNumofPutts=1.80, SavePercent=54, NumEvents=30)

# apply player's stats and predict their Average Winnings
pred_val4 <- predict(pga_fit, pred_data4, interval = "prediction", level = 0.95, type = "response")
four <- cbind(pred_data4, pred_val4)  # will be used later
pred_val4

##        fit       lwr      upr
## 1 27727.99 -65911.49 121367.5

Given the specific values for the covariates (above), the predicted value of AverageWinnings for this player is $27,728. We can say with 95% confidence that the true average winnings given someone with the stats represented in the covariates above is between zero dollars (because you can’t win negative dollars) and $121,367.

Question 5

Compare the predictions from previous questions, what do you observe? For example, which prediction interval is wider? which prediction dosen’t make sense? Among these predictions, which one is extrapolation? And why?

Part I

Compare the Prediction Intervals of all three players (Q2-4) and note observations.
& 3.

pred_table <- rbind(two, three, four)
pred_table <- as.data.frame(pred_table)
pred_table <- mutate(pred_table, upr-lwr)
pred_table

##   Age AverageDrive DrivingAccuracy GreensonRegulation AverageNumofPutts
## 1  35          287              64               64.9             1.778
## 2  42          295              69               67.7             1.800
## 3  45          320              70               67.7             1.800
##   SavePercent NumEvents      fit       lwr      upr upr - lwr
## 1          48        26 46720.76 -35236.56 128678.1  163914.6
## 2          54        30 34218.97 -49843.50 118281.4  168125.0
## 3          54        30 27727.99 -65911.49 121367.5  187279.0

A few observations:

The predictions with negative winnings do not make sense. The amount of Zero Dollars should be imputted in the lwr bound of the prediction interval.
The player with the longest drive, highest driving accuracy, and tied for highest on all other covariates has the highest prediction interval.

Part II:

Use Leverage to determine which of the 3 observations are considered to be an Extrapolation.
& 3.

# the maximum leverage in the given data set is calculated by the following:
max(cbind(pga, leverage = hatvalues(pga_fit))$leverage)

## [1] 0.1568547

two_new <- c(1,35,287,64,64.9,1.778,48,26)
three_new <- c(1,42,295,69,67.7,1.800,54,30)
four_new <- c(1,45,320,70,67.7,1.800,54,30)
X = model.matrix(pga_fit)
leverage2 <- t(two_new) %*% (solve(t(X) %*% X)) %*% two_new
leverage3 <- t(three_new) %*% (solve(t(X) %*% X)) %*% three_new
leverage4 <- t(four_new) %*% (solve(t(X) %*% X)) %*% four_new
rbind(leverage2, leverage3, leverage4)

##             [,1]
## [1,] 0.005502717
## [2,] 0.057820788
## [3,] 0.312579929

Using leverage as our method of identifying extrapolations from our data, we can confirm that the third player used for prediction is extrapolated. The leverage of the player is 0.3125799 which is larger than the maximum leverage value in the dataset, given to be 0.1568547. This intuitively makes sense, as the prediction interval was the largest for this predicted observation.

Question 6

Obtain the standardized regression coefficients and compare the influence of all variables.
& 3.

# transforming the data
library(dplyr)
pga_red <- select(pga, AverageWinnings, Age, AverageDrive, DrivingAccuracy, GreensonRegulation, AverageNumofPutts, SavePercent, NumEvents)
pga_unit_normal <- as.data.frame(apply(pga_red, 2, function(x){(x-mean(x))/sd(x)}))

# redoing the regression
pga_fit_unit_normal <- lm(AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, data = pga_unit_normal)

# obtain the standardized lm coefficients
pga_fit_unit_normal$coefficients

##        (Intercept)                Age       AverageDrive 
##       7.713999e-16      -6.831285e-02      -1.425906e-02 
##    DrivingAccuracy GreensonRegulation  AverageNumofPutts 
##      -2.279022e-01       4.388314e-01      -2.970658e-01 
##        SavePercent          NumEvents 
##       1.396050e-01      -2.716132e-01

These coefficients have been standardized. The slopes of the covariates are now interpreted as such: For every 1 standard deviation increase of the covariate, the response variable will increase or decrease by the covariate’s coeffient * 1 standard deviation (all other covariates held constant). For example, for every 1 standard deviation increase in Age, the player’s Average Winnings will decrease by 0.068 standard deviations (all other covariates held constant).

It appears as though Greens on Regulation has the biggest influence on Average Winnings, since the absolute value of the covariate’s coefficient is the largest of all coefficients.

Question 7

Obtain the residuals of the fitted regression. Perform the residual analysis. What does the residual analysis tell you? For example, are all four assumptions satisfied? If not, which ones are not satisfied. Suppose you take a logarithm of the response variable and fit the regression, does that help with the residuals?

& 3.

# obtain the residuals
MSRes <- summary(pga_fit)$sigma^2
head(MSRes)

## [1] 1716669098

# obtain the standardized residuals
std_res <- pga_fit$residuals / summary(pga_fit)$sigma
head(std_res)

##          1          2          3          4          5          6 
## -0.1873876  2.3577753  0.4539229 -0.3345661 -0.2412748 -0.2443369

Check the Normality Assumption

We can use the qqplot to determine if the normality condition is met. The condition is met (meaning the residuals are normally distributed) if the Q-Q Plot is around the 45 degree line. The plot below shows that the data has a positive skew (non-normal errors). It’s not ideal but I don’t think it’s extreme enough to cause a problem.

qqnorm(pga_fit$residuals)
qqline(pga_fit$residuals)

Check for Equal Variance & Linearity

Here we will plot all the residuals against the fitted values of each of the regressors. If the linearity and equal variance assumptions are satisfied, we would expect the dots to be evenly distributed around zero with a constant variance across the x-axis. (Referencing the slides for this explanation). No obvious patterns appear in the graphs below, except: The graph of residuals versus fitted values appears to have a pattern. This indicates nonnormal errors.

# Generate a residual plots
par(mfrow = c(3,3))
plot(pga$Age, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageDrive, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$DrivingAccuracy, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$GreensonRegulation, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageNumofPutts, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$SavePercent, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$NumEvents, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga_fit$fitted.values, pga_fit$residuals, pch=20)
abline(h = 0, col = "grey")

Check for Independent Errors

There’s no time order in this data and therefore this is not applicable.

Log the response variable

& 3.

# refit the model with log of the repsonse variable
pga_fit_log <- lm(log(AverageWinnings) ~ Age + AverageDrive + DrivingAccuracy + GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, data = pga)
summary(pga_fit_log)

## 
## Call:
## lm(formula = log(AverageWinnings) ~ Age + AverageDrive + DrivingAccuracy + 
##     GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents, 
##     data = pga)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77513 -0.41331 -0.01228  0.43022  1.73912 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         33.831322   4.482002   7.548 1.85e-12 ***
## Age                 -0.008215   0.007609  -1.080  0.28169    
## AverageDrive        -0.006110   0.008314  -0.735  0.46335    
## DrivingAccuracy     -0.020495   0.012514  -1.638  0.10313    
## GreensonRegulation   0.190382   0.019105   9.965  < 2e-16 ***
## AverageNumofPutts  -18.457365   2.024330  -9.118  < 2e-16 ***
## SavePercent          0.027000   0.008609   3.136  0.00199 ** 
## NumEvents           -0.039754   0.009440  -4.211 3.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6071 on 188 degrees of freedom
## Multiple R-squared:  0.6358, Adjusted R-squared:  0.6223 
## F-statistic: 46.89 on 7 and 188 DF,  p-value: < 2.2e-16

# Generate a residual plots
par(mfrow = c(3,3))
plot(pga$Age, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageDrive, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$DrivingAccuracy, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$GreensonRegulation, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$AverageNumofPutts, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$SavePercent, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga$NumEvents, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")
plot(pga_fit$fitted.values, pga_fit_log$residuals, pch=20)
abline(h = 0, col = "grey")

Taking the log of the response variable completely eliminates any pattern and scatters the datapoints randomly! Which is what we are looking for.