library("psych")
## Warning: package 'psych' was built under R version 3.6.2
library("caTools")
## Warning: package 'caTools' was built under R version 3.6.2
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages ------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.1
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.1
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.1
## Warning: package 'stringr' was built under R version 3.6.1
## Warning: package 'forcats' was built under R version 3.6.1
## -- Conflicts ---------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%()   masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
  1. Data set PGA.csv contains statistics of performance and winnings for 196 PGA tour participants during 2004 season [https://www.pgatour.com/]. Here is the list of variable: Name, Age, Average Drive (Yards), Driving accuracy (percent), Greens on regulation (%), Average # of putts, Save Percent, Money Rank, # Events, Total Winnings ($), Average winnings ($). Please read PGA data (PGA.csv) into R.
pga <- read.csv('PGA.csv')
names(pga)
##  [1] "Name"               "Age"                "AverageDrive"      
##  [4] "DrivingAccuracy"    "GreensonRegulation" "AverageNumofPutts" 
##  [7] "SavePercent"        "MoneyRank"          "NumEvents"         
## [10] "TotalWinnings"      "AverageWinnings"
summary(pga)
##                Name          Age         AverageDrive   DrivingAccuracy
##  Aaron Baddeley  :  1   Min.   :21.00   Min.   :268.2   Min.   :49.90  
##  Adam Scott      :  1   1st Qu.:31.00   1st Qu.:281.3   1st Qu.:60.95  
##  Alex Cejka      :  1   Median :35.50   Median :287.2   Median :64.30  
##  Andre Stolz     :  1   Mean   :35.96   Mean   :287.2   Mean   :64.08  
##  Arjun Atwal     :  1   3rd Qu.:40.25   3rd Qu.:292.1   3rd Qu.:67.80  
##  Arron Oberholser:  1   Max.   :51.00   Max.   :314.4   Max.   :77.20  
##  (Other)         :190                                                  
##  GreensonRegulation AverageNumofPutts  SavePercent      MoneyRank     
##  Min.   :54.70      Min.   :1.723     Min.   :31.80   Min.   :  1.00  
##  1st Qu.:63.00      1st Qu.:1.762     1st Qu.:45.67   1st Qu.: 49.75  
##  Median :64.90      Median :1.776     Median :49.00   Median : 99.50  
##  Mean   :64.90      Mean   :1.778     Mean   :48.97   Mean   :101.80  
##  3rd Qu.:66.83      3rd Qu.:1.796     3rd Qu.:52.42   3rd Qu.:151.25  
##  Max.   :73.30      Max.   :1.847     Max.   :62.30   Max.   :245.00  
##                                                                       
##    NumEvents     TotalWinnings      AverageWinnings 
##  Min.   :15.00   Min.   :   21250   Min.   :   850  
##  1st Qu.:23.00   1st Qu.:  436617   1st Qu.: 15749  
##  Median :27.00   Median :  814989   Median : 30849  
##  Mean   :26.19   Mean   : 1134632   Mean   : 46549  
##  3rd Qu.:30.00   3rd Qu.: 1407922   3rd Qu.: 56209  
##  Max.   :36.00   Max.   :10905167   Max.   :376040  
## 
  1. Visualize the data concisely using scatter plots and histograms. For example, you can use R functions such as pairs(), hist(), and pairs.panels() in the R package psych. Briefly describe the visualization results, e.g., outliers, skewness, strong correlations, clusters, and so on.

Related Pairs

First step is to find interrelated variables and variables influencing the response variable

par(mfrow=c(1,3))
hist(pga$Age, main = 'Distribution of Age', col = 'light green', xlab = 'Age')
hist(pga$AverageDrive, main = 'Distribution of Average Drive', col = 'light green', xlab = 'Average Drive')
hist(pga$DrivingAccuracy, main = 'Distribution of Driving Accuracy', col = 'light green', xlab = 'Driving Accuracy')

par(mfrow=c(1,3))
plot(pga$Age~pga$AverageDrive, main = 'Age vs Average Drive', col = 'blue', ylab = 'Age', xlab = 'Average Drive')
plot(pga$Age~pga$DrivingAccuracy, main = 'Age vs Driving Accuracy', col = 'blue', ylab = 'Age', xlab = 'Driving Accuracy')
plot(pga$AverageDrive~pga$DrivingAccuracy, main = 'Average Drive vs Driving Accuracy', col = 'blue', ylab = 'Average Drive', xlab = 'Driving Accuracy')

This dataset consists 196 golfers with Age ranging from 20 to 55 with most golfers around the age of 30 to 40. As can be seen from scatter plot, Age is related to Average Drive and Driving Accuracy. Average Drive decreases with Age, Driving Accuracy increases with Age. This could be because with time, golfers become better more accurate with time and age but the drive goes down. However, looking at 3rd plot, we can also say that with less distances of shots, accuracy is genrally high and Age is not a factor here.

par(mfrow=c(1,2))
plot(pga$AverageWinnings~pga$MoneyRank, main = 'Average Winnings vs Money Rank', col = 'blue', ylab = 'Average Winnings', xlab = 'Money Rank')
plot(pga$TotalWinnings~pga$MoneyRank, main = 'Total Winnings vs Money Rank', col = 'blue', ylab = 'Total Winnings', xlab = 'Money Rank')

I think it’s quite obvious Ranking is highly related to Winnings, meaning higher the ranking is higher the Winnings should be. In this plot, we can see that Ranking is exponentially increasing with Winnings and golfers in the top 50 rank have high Winnings amount too.

Due to it’s scale, difficult to relate Winnings with other variables but we can find patterns between Money Rank and other variables.

par(mfrow=c(1,3))
plot(pga$DrivingAccuracy~pga$MoneyRank, main = 'Driving Accuracy vs MoneyRank', col = 'blue', ylab = 'Driving Accuracy', xlab = 'Greens on Regulation')
plot(pga$GreensonRegulation~pga$MoneyRank, main = 'Greens on Regulation vs Money Rank', col = 'blue', ylab = 'Greens on Regulation', xlab = 'Money Rank')
plot(pga$AverageNumofPutts~pga$MoneyRank, main = 'Avg # of Putts vs Money Rank', col = 'blue', ylab = 'Avg # of Putts', xlab = 'Money Rank')

1, Golfers with higher Rank have accuracy in the range of 60 to 70 but golfers within the ranking of 200 have this range of accuracy

2, Greens on Regulation is higher on the range og 65+ for higher ranked Golfers

3, Average # of Putts is lower in the range of less than 1.78 for higher ranked Golfers.

These can be factors differentiating better golfers from the whole list and can be the factors influecing the Winnings.

Outliers Box Plots of each variables are used to check for any outliers. There are a few outliers as we can see from above and below plots but they are very few when compared to total # of records

par(mfrow=c(3,2))
boxplot(pga$AverageDrive, main = 'Average Drive', col = 'gray', horizontal = TRUE)
boxplot(pga$DrivingAccuracy, main = 'Driving Accuracy', col = 'gray', horizontal = TRUE)
boxplot(pga$GreensonRegulation, main = 'Greens on Regulation', col = 'gray', horizontal = TRUE)
boxplot(pga$SavePercent, main = 'Save Percent', col = 'gray', horizontal = TRUE)
boxplot(pga$TotalWinnings, main = 'Total Winnings', col = 'gray', horizontal = TRUE)
boxplot(pga$AverageWinnings, main = 'Average Winnings', col = 'gray', horizontal = TRUE)

pairs.panels(pga[2:11], lm = TRUE, method = 'pearson', hist.col = 'light green', smoother = TRUE,
             pch=20,main="PGA Data Pairs Panel plot")

Correlation Response Variable (Average Winnings) is highly correlated to Total Winnings (+.95), Money Rank(-.70) and moderately correlted to Average # of Putts (-.44), Greens on Regulation(-.40)

  1. Fit a multiple linear regression to the data. Use Average winnings as the response variable and use Age, Average Drive (Yards), Driving Accuracy (percent), Greens on Regulation (%), Average # of Putts, Save Percent, and # Events as covariates. Based on the regression results, which covariate has the largest estimated slope (in absolute value)? Which covariates are important in terms of the association with the response variable?
Model_All <- lm(AverageWinnings~Age+AverageDrive+DrivingAccuracy+NumEvents+AverageNumofPutts+SavePercent+GreensonRegulation, data = pga)
summary(Model_All)
## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + 
##     NumEvents + AverageNumofPutts + SavePercent + GreensonRegulation, 
##     data = pga)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71690 -22176  -6735  17147 247928 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         945579.88  305886.59   3.091  0.00230 ** 
## Age                   -587.13     519.32  -1.131  0.25968    
## AverageDrive           -94.76     567.42  -0.167  0.86755    
## DrivingAccuracy      -2360.57     854.02  -2.764  0.00628 ** 
## NumEvents            -3159.22     644.24  -4.904 2.03e-06 ***
## AverageNumofPutts  -694226.49  138155.99  -5.025 1.17e-06 ***
## SavePercent           1395.67     587.54   2.375  0.01853 *  
## GreensonRegulation    8466.04    1303.87   6.493 7.30e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41430 on 188 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4323 
## F-statistic: 22.21 on 7 and 188 DF,  p-value: < 2.2e-16

Average # of Putts has the highest absolute Estimate value at (694226.49)

Based on the p-value, we can say below Covariates are significant: 1, Driving Accuracy 2, Number of Events 3, Average Number of Putts 4, Save Percent 5, Greens of Regulation

  1. Based on the fitted regression, is AverageDrive a covariate with statistically significant nonzero slope? State your conclusion and reasons.

No, since p-value at 0.86755 is greater than the significance level of 0.05, we cannot say that Average Drive is statistically significant nonzero slope.

  1. If you run a simple linear regression of lm(AverageWinnings ~ AverageDrive, data=d), is AverageDrive a covariate with statistically significant nonzero slope in this case? State your conclusion and reasons.
Model_Simple <- lm(AverageWinnings ~ AverageDrive, data=pga)
summary(Model_Simple)
## 
## Call:
## lm(formula = AverageWinnings ~ AverageDrive, data = pga)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62001 -28630 -13234  10847 311571 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -331133.4   134365.6  -2.464  0.01459 * 
## AverageDrive    1315.2      467.7   2.812  0.00543 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54040 on 194 degrees of freedom
## Multiple R-squared:  0.03916,    Adjusted R-squared:  0.03421 
## F-statistic: 7.907 on 1 and 194 DF,  p-value: 0.005429

Yes, with this model, Average Drive is moderately significant nonzero slope becase with p-value at 0.00543, it is less than significance level of 0.05.

This could be because with only one variable used, all variations in response variable is contributed by “Average Drive” but if we introduce other variables, then other variables are far more significant than Average Drive.

  1. If you run a multiple linear regression similar to the one in question 3 except you remove Driving Accuracy as a covariate, is AverageDrive a significant covariate with nonzero slope in this case? Are these results consistent or contradicted with the results in questions 4 and 5? State your conclusion and reasons.
Model_woDA <- lm(AverageWinnings~Age+AverageDrive+NumEvents+AverageNumofPutts+SavePercent+GreensonRegulation, data = pga)
summary(Model_woDA)
## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + NumEvents + 
##     AverageNumofPutts + SavePercent + GreensonRegulation, data = pga)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72734 -22280  -6769  15026 256538 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         733547.2   301268.4   2.435   0.0158 *  
## Age                   -466.2      526.5  -0.886   0.3770    
## AverageDrive           999.1      413.7   2.415   0.0167 *  
## NumEvents            -3351.0      651.7  -5.142 6.75e-07 ***
## AverageNumofPutts  -761413.2   138369.3  -5.503 1.21e-07 ***
## SavePercent           1324.0      597.2   2.217   0.0278 *  
## GreensonRegulation    6467.5     1103.9   5.859 2.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42150 on 189 degrees of freedom
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4124 
## F-statistic: 23.81 on 6 and 189 DF,  p-value: < 2.2e-16

With “Driving Accuracy” removed, two things happened 1, “Average Drive” became statistically slightly significant. This could be because the variations captured by “Driving Accuracy” is now being captured by “Average Drive”. With “Driving Accuracy” in the model, “Average Drive” was getting camouflaged because both these variables are interrelated and are trying to capture same variations in the data.

2, Adjusted R-squared went down which means, “Driving Accuracy” was a useful variable and was contributing to the Model by capturing variations in Response Variable. The drop in Adjusted R-squared means it was not random noise.

  1. Based on regression results, is the fitted regression model in question 3 a significant model (at least one nonzero slope)? What about the fitted regression models in question 5 and 6? State your conclusion and reasons.

Yes, Regression Model in Q3 is Significant because 1, Atleast there is one nonzero slope which is statistically significant. Infact there are 5 variables. 2, F-Test p-value is very low which means the model statistically significant.

8.1 Using the fitted regression in question 3, obtain the prediction of the response variable for a new player with Age = 35, AverageDrive = 287, DrivingAccuracy = 64, GreensonRegulation = 64.9, AverageNumofPutts = 1.778, SavePercent = 48, NumEvents = 26. Provide a prediction interval and explain what it means.

y_test1 <- data.frame(Age = 35, AverageDrive = 287, DrivingAccuracy = 64, GreensonRegulation = 64.9, AverageNumofPutts = 1.778, SavePercent = 48, NumEvents = 26)

y_pred1 <- predict(Model_All, newdata = y_test1, interval = 'confidence', level = 0.95, type = 'response')

y_pred1
##        fit     lwr      upr
## 1 46720.76 40657.8 52783.72

Prediction Interval = 40657.8 and 52783.72 which means response to this input values is going to be within ths interval with a confidence of 95%

8.2 Similarly, make another prediction for the case of Age = 42, AverageDrive = 295, DrivingAccuracy = 69, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30.

y_test2 <- data.frame(Age = 42, AverageDrive = 295, DrivingAccuracy = 69, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30)

y_pred2 <- predict(Model_All, newdata = y_test2, interval = 'confidence', level = 0.95, type = 'response')

y_pred2
##        fit      lwr      upr
## 1 34218.97 14565.55 53872.39

8.3 Make a third prediction for the case of Age = 45, AverageDrive = 320, DrivingAccuracy = 70, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30.

y_test3 <- data.frame(Age = 45, AverageDrive = 320, DrivingAccuracy = 70, GreensonRegulation = 67.7, AverageNumofPutts = 1.80, SavePercent = 54, NumEvents = 30)

y_pred3 <- predict(Model_All, newdata = y_test3, interval = 'confidence', level = 0.95, type = 'response')

y_pred3
##        fit       lwr      upr
## 1 27727.99 -17967.85 73423.83
  1. Compare the predictions from previous questions, what do you observe? For example, which prediction interval is wider? which prediction dosen’t make sense? Among these predictions, which one is extrapolation? State your conclusion and reasons.
prediction <- data.frame(observation = 1, Prediction = y_pred1[1], LowerBound = y_pred1[2], UpperBound = y_pred1[3])
prediction <- rbind(prediction,list(2, y_pred2[1], y_pred2[2], y_pred2[3]))
prediction <- rbind(prediction,list(3, y_pred3[1], y_pred3[2], y_pred3[3]))
prediction
##   observation Prediction LowerBound UpperBound
## 1           1   46720.76   40657.80   52783.72
## 2           2   34218.97   14565.55   53872.39
## 3           3   27727.99  -17967.85   73423.83

Prediction for Observation # 3 is very wide with lower bound at -17967.85 and upper bound at 73423.83 which doesn’t mase sense because negative prediction for Average Winnings isn’t possible.

pga_data <- pga[,2:11]
pga_data_normal <- as.data.frame(apply(pga_data,2,function(x){(x - mean(x))/sd(x)}))

Model_All_normalized <- lm(AverageWinnings~Age+AverageDrive+DrivingAccuracy+NumEvents+AverageNumofPutts+SavePercent+GreensonRegulation, data = pga_data_normal)
summary(Model_All_normalized)
## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + 
##     NumEvents + AverageNumofPutts + SavePercent + GreensonRegulation, 
##     data = pga_data_normal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3037 -0.4033 -0.1225  0.3118  4.5086 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.714e-16  5.382e-02   0.000  1.00000    
## Age                -6.831e-02  6.042e-02  -1.131  0.25968    
## AverageDrive       -1.426e-02  8.538e-02  -0.167  0.86755    
## DrivingAccuracy    -2.279e-01  8.245e-02  -2.764  0.00628 ** 
## NumEvents          -2.716e-01  5.539e-02  -4.904 2.03e-06 ***
## AverageNumofPutts  -2.971e-01  5.912e-02  -5.025 1.17e-06 ***
## SavePercent         1.396e-01  5.877e-02   2.375  0.01853 *  
## GreensonRegulation  4.388e-01  6.759e-02   6.493 7.30e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7535 on 188 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4323 
## F-statistic: 22.21 on 7 and 188 DF,  p-value: < 2.2e-16
hmax <- max(hatvalues(Model_All_normalized))

X <- model.matrix(Model_All)

x_new1 <- c(1,35,287,64,26,1.778,48,64.9)
x_new2 <- c(1,42,295,69,30,1.80,54,67.7)
x_new3 <- c(1,45, 320, 70, 30, 1.80, 54,67.7)

t_new1 <- t(x_new1) %*% (solve(t(X) %*% X)) %*% x_new1
t_new2 <- t(x_new2)%*%solve(t(X)%*%X)%*%x_new2
t_new3 <- t(x_new3)%*%solve(t(X)%*%X)%*%x_new3

xpl <- data.frame(obs = 1, t_new = t_new1, hmax = hmax)
xpl <- rbind(xpl, list(obs = 2, t_new = t_new2, hmax = hmax))
xpl <- rbind(xpl, list(obs = 3, t_new = t_new3, hmax = hmax))
xpl$expl <- xpl$t_new > xpl$hmax
xpl
##   obs       t_new      hmax  expl
## 1   1 0.005502717 0.1568547 FALSE
## 2   2 0.057820788 0.1568547 FALSE
## 3   3 0.312579929 0.1568547  TRUE

As we can observe, when compared to hmax, Observation 3 is a case of extrapolation

  1. Obtain the standardized regression coefficients and compare the influence of all variables.
Model_All_normalized$coefficients
##        (Intercept)                Age       AverageDrive    DrivingAccuracy 
##       7.713999e-16      -6.831285e-02      -1.425906e-02      -2.279022e-01 
##          NumEvents  AverageNumofPutts        SavePercent GreensonRegulation 
##      -2.716132e-01      -2.970658e-01       1.396050e-01       4.388314e-01

Comparing the estimates of Standardized Model, 1, “Greens on Regulation” has the highest +ve estimate value at 0.4388

  1. Obtain the residuals of the fitted regression. Perform the residual analysis. What does the residual analysis tell you? For example, are all four assumptions satisfied? If not, which ones are not satisfied. Suppose you take a logarithm of the response variable and fit the regression, does that help with the residuals?
qqnorm(Model_All$residuals,main='Model with All Variables')
qqline(Model_All$residuals)

par(mfrow = c(3,3))
plot(pga$Age ,Model_All$residuals,pch=20)
abline(h = 0, col='red')
plot(pga$AverageDrive ,Model_All$residuals,pch=20)
abline(h = 0, col='red')
plot(pga$DrivingAccuracy ,Model_All$residuals,pch=20)
abline(h = 0, col='red')
plot(pga$GreensonRegulation ,Model_All$residuals,pch=20)
abline(h = 0, col='red')
plot(pga$AverageNumofPutts ,Model_All$residuals,pch=20)
abline(h = 0, col='red')
plot(pga$SavePercent ,Model_All$residuals,pch=20)
abline(h = 0, col='red')
plot(pga$NumEvents ,Model_All$residuals,pch=20)
abline(h = 0, col='red')
plot(Model_All$fitted.values,Model_All$residuals,pch=20)
abline(h = 0, col='red')

Normality Assumption: QQ-Plot is around 45 degree for Model with All Variables

Linearity & Equal Variance: Scatter plots show similar distribution on both side on Zero-Line

Independent: There is no time-series variable, hence not checking for Independent