Problem 1

Remove the factor variables from the Hitters data in the ISLR package and remove all records in which the Salary data is missing. To remove the heavy right skew, replace Salary with log(Salary) and call it logSalary. Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets. The response variable is logSalary

#Loads packages
library(caret)
library(ISLR)
library(car)
library(leaps)
library(earth)

Let us look at the data set to see which variables we must remove.

hitters <- Hitters
str(hitters)
## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

We make the following modifications:

#Removes factor variables League, Division, and NewLeague
hitters$League <- NULL    
hitters$Division <- NULL  
hitters$NewLeague <- NULL 

#Removes entries with "NA" in Salary column, replaces Salary with log(Salary), and renames Salary to logSalary
hitters <- hitters[!is.na(hitters$Salary), ]
hitters$Salary <- log(hitters$Salary)        
colnames(hitters)[colnames(hitters)=="Salary"] <- "logSalary"

We now partition our data:

#Sets seed and partitions data
set.seed(12345)
trainingIndices <- createDataPartition(hitters$logSalary, p=0.7, list=FALSE)
training <- hitters[trainingIndices, ]
testing <- hitters[-trainingIndices, ]

Problem 2

First, let’s build a linear model called LM1 by looking at coeffcient P-values.

(a)

Using the training data, construct a linear model (LM1) for logSalary using all of the other variables as predictors.

LM1 <- train(logSalary~., data=training, method="lm")
summary(LM1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27147 -0.44874  0.01607  0.40936  2.79909 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.677e+00  1.959e-01  23.872   <2e-16 ***
## AtBat       -3.931e-03  1.523e-03  -2.582   0.0107 *  
## Hits         1.445e-02  5.813e-03   2.485   0.0139 *  
## HmRun        9.406e-03  1.480e-02   0.636   0.5258    
## Runs        -3.868e-03  6.987e-03  -0.554   0.5805    
## RBI          3.493e-03  6.463e-03   0.540   0.5896    
## Walks        1.066e-02  4.492e-03   2.373   0.0188 *  
## Years        4.177e-02  3.128e-02   1.335   0.1836    
## CAtBat      -2.807e-05  3.299e-04  -0.085   0.9323    
## CHits        1.097e-03  1.745e-03   0.629   0.5302    
## CHmRun       5.071e-04  3.912e-03   0.130   0.8970    
## CRuns        8.409e-04  1.764e-03   0.477   0.6343    
## CRBI        -1.265e-03  1.822e-03  -0.694   0.4886    
## CWalks      -9.993e-04  8.293e-04  -1.205   0.2299    
## PutOuts      4.154e-04  1.765e-04   2.355   0.0197 *  
## Assists      1.052e-03  5.131e-04   2.051   0.0418 *  
## Errors      -1.647e-02  1.017e-02  -1.619   0.1074    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6188 on 168 degrees of freedom
## Multiple R-squared:  0.5713, Adjusted R-squared:  0.5305 
## F-statistic: 13.99 on 16 and 168 DF,  p-value: < 2.2e-16

(b)

Rebuild LM1 removing all of the predictors with P-values > 0.05.

The variables in LM1 from part (a) with a P-values > 0.05 are HmRun, Runs, RBI, Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks, and Errors.

LM1 <- train(logSalary~.-HmRun-Runs-RBI-Years-CAtBat-CHits-CHmRun-CRuns-CRBI-CWalks-Errors, data=training, method="lm")
summary(LM1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5468 -0.5977  0.1007  0.5593  2.6496 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.048e+00  1.799e-01  28.069  < 2e-16 ***
## AtBat       -4.240e-03  1.648e-03  -2.573 0.010895 *  
## Hits         1.892e-02  5.039e-03   3.753 0.000236 ***
## Walks        1.191e-02  3.544e-03   3.361 0.000950 ***
## PutOuts      1.423e-04  2.081e-04   0.684 0.495085    
## Assists      8.827e-05  4.392e-04   0.201 0.840930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7804 on 179 degrees of freedom
## Multiple R-squared:  0.2734, Adjusted R-squared:  0.2531 
## F-statistic: 13.47 on 5 and 179 DF,  p-value: 3.739e-11

(c)

Rebuild LM1 removing all of the remaining predictors with P-values > 0.05.

The variables in LM1 from part (b) with P-values > 0.05 are PutOuts and Assists.

LM1 <- train(logSalary~.-HmRun-Runs-RBI-Years-CAtBat-CHits-CHmRun-CRuns-CRBI-CWalks-Errors-PutOuts-Assists, data=training, method="lm")

(d)

Examine the variance inflation factors and remove appropriate predictor(s) to avoid collinearity. Be sure to explain your choice(s). Rebuild LM1 using the remaining predictors.

LM1 <- lm(logSalary~.-HmRun-Runs-RBI-Years-CAtBat-CHits-CHmRun-CRuns-CRBI-CWalks-Errors-PutOuts-Assists, data=training)
vif(LM1)
##     AtBat      Hits     Walks 
## 16.374806 15.387746  1.551907

Looking at the VIF’s, we see that AtBat is highly correlated with the rest of the variables. We decide to remove AtBat from the model.

LM1 <- train(logSalary~.-HmRun-Runs-RBI-Years-CAtBat-CHits-CHmRun-CRuns-CRBI-CWalks-Errors-PutOuts-Assists-AtBat, data=training, method="lm")

(e)

Evaluate LM1. (Are there patterns in the residuals? Plot predicted versus observed logSalary. Compute \(R^2\). Examine VIFs for collinearity.)

#Creates residual plot
predicted_LM1 <- predict(LM1, data=training)
residuals_LM1 <- training$logSalary - predicted_LM1
plot(predicted_LM1, residuals_LM1,
     main = "Residuals vs. Predicted",
     xlab="Predicted",
     ylab="Residuals",
     col="blue"
     )
abline(0, 0, col = "red")

plot(training$logSalary, predicted_LM1,
     main = "Predicted vs. Observed: LM1 Model, Training Data",
     xlab="logSalary",
     ylab="Predicted", 
     col="blue"
     )
abline(0, 1, col = "red")

rSquare_LM1_training <- summary(LM1)$r.squared

LM1 <- lm(logSalary~.-HmRun-Runs-RBI-Years-CAtBat-CHits-CHmRun-CRuns-CRBI-CWalks-Errors-PutOuts-Assists-AtBat, data=training)
vif(LM1)
##    Hits   Walks 
## 1.44903 1.44903
rSquare_LM1_training
## [1] 0.2427942

There is no apparent pattern in the residuals, and the VIFs for this model are relatively low; however, we see in the predicted vs. observed logSalary plot that the data points are not close to the identity line. Furthermore, the \(R^2\) for this model is very low (around 24%). Based on this, we are inclined to conclude that LM1 is not a good model.

(f)

Evaluate the predictive capacity of LM1 by computing \(R^2\) for the testing data. How does it compare to the \(R^2\) for the training data?

predicted_testing <- predict(LM1, newdata=testing)
rSquare_LM1_testing <- cor(testing$logSalary, predicted_testing)^2

plot(testing$logSalary, predicted_testing,
     main = "Predicted vs. Observed: LM1 Model, Testing Data",
     xlab="logSalary",
     ylab="Predicted", 
     pch=21, 
     col="blue"
     )
abline(0, 1, col = "red")

rSquare_LM1_testing
## [1] 0.2522379

We see that the \(R^2\) for the testing data (25%) is not much of an improvement from the \(R^2\) for the training data (24%). We conclude that LM1 has a low predictive capacity.

Problem 3

Now let’s use the leaps package and step() to build linear models using stepwise regression. Stepwise regression is a traditional method of linear model building that is not based on P-values.

full<-lm(logSalary~.,data=training) # Use for "Backward"
null<-lm(logSalary~1,data=training) # Use for "Forward"

(a)

Use forward stepwise regression to build a linear model (LM2) for logSalary using the original train-ing data, and evaluate the predictive capacity of LM2 by computing \(R^2\) for the testing data.

LM2 <- step(null,direction = "forward",trace=1,scope= logSalary~ AtBat + Hits+HmRun+Runs+RBI+Walks+Years+CAtBat+CHits+CHmRun+CRBI+CWalks+PutOuts+Assists+Errors)
predicted_LM2<-predict(LM2,newdata=testing)
rSquare_LM2 <- cor(testing$logSalary, predicted_LM2)^2

plot(testing$logSalary, predicted_LM2,
     main = "Predicted vs. Observed: LM2 Model",
     xlab="logSalary",
     ylab="Predicted", 
     col="blue"
     )
abline(0, 1, col = "red")

rSquare_LM2
## [1] 0.4212

Based on this \(R^2\) value, we conclude that LM2 has a low predictive capacity.

(b)

Use backward stepwise regression to build a linear model (LM3) for logSalary using the original training data, and evaluate the predictive capacity of LM3 by computing \(R^2\) for the testing data.

LM3 <- step(full,direction = "backward")
predicted_LM3 <- predict(LM3,newdata=testing)
rSquare_LM3 <- cor(testing$logSalary, predicted_LM3)^2

plot(testing$logSalary, predicted_LM3,
     main = "Predicted vs. Observed: LM3 Model",
     xlab="logSalary",
     ylab="Predicted", 
     col="blue"
     )
abline(0, 1, col = "red")

rSquare_LM3
## [1] 0.4549475

Based on this \(R^2\) value, we conclude that LM3 has a low predictive capacity.

Problem 4

Now let’s build data analytic models using 10-fold cross-validation by including the optional argument trControl=trainControl(method=“cv”, number=10) in the train command.

(a)

Use the training data to build a linear model (LM4) on the original training data using all of the original numeric predictors and evaluate the predictive capacity of LM4 by computing \(R^2\) for the testing data.

LM4 <- train(logSalary~., data=training, method ="lm", trControl=trainControl(method = "cv", number = 10))
predicted_LM4 <- predict(LM4, newdata =testing)
rSquare_LM4 <- cor(testing$logSalary, predicted_LM4)^2

# Graphing to make sure that the predicted model follow the abline
plot(testing$logSalary, predicted_LM4,
     main = "Predicted vs. Observed: LM4 Model",
     xlab="logSalary",
     ylab="Predicted",
     col="blue"
     )
abline(0,1, col = "red")

rSquare_LM4
## [1] 0.4518025

Based on this \(R^2\) value, we conclude that LM4 has a low predictive capacity.

(b)

Use the training data to build a k nearest neighbor model (kNN1) for logSalary and evaluate the predictive capacity of kNN1 by computing \(R^2\) for the testing data.

kNN1 <- train(logSalary~., data=training, method ="knn", trControl=trainControl(method = "cv", number = 10))
predicted_kNN1 <- predict(kNN1, newdata = testing)
rSquare_kNN1 <- cor(predicted_kNN1, testing$logSalary)^2

# Graphing to make sure that the predicted model follow the abline
plot(testing$logSalary, predicted_kNN1,
     main = "Predicted vs. Observed: kNN1 Model",
     xlab="logSalary",
     ylab="Predicted", 
     col="blue"
     )
abline(0,1, col = "red")

rSquare_kNN1
## [1] 0.6818339

The \(R^2\) value for the kNN1 model is pretty high, and so we can conclude that it has a good predictive capacity.

(c)

Use the training data to build a MARS model (MARS1) for logSalary and evaluate the predictive capacity of MARS1 by computing \(R^2\) for the testing data.

MARS1 <- train(logSalary~., data=training, method ="earth", trControl=trainControl(method = "cv", number = 10))
predicted_MARS1 <- predict(MARS1, newdata=testing)[ , 1]
rSquare_MARS1 <- cor(predicted_MARS1, testing$logSalary)^2

plot(testing$logSalary, predicted_MARS1,
     main = "Predicted vs. Observed: MARS1 Model",
     xlab="logSalary",
     ylab="Predicted", 
     col="blue"
     )
abline(0,1, col = "red")

rSquare_MARS1
## [1] 0.6125382

The \(R^2\) value for the MARS model is pretty high, and so we can conclude that it has a good predictive capacity.

Problem 5

Let’s review.

(a)

Summarize the differences between the four linear models LM1, LM2, LM3, and LM4. For example, what variables are present? How accurate are the models? How easily can you interpret the model?

summary(LM1)
## 
## Call:
## lm(formula = logSalary ~ . - HmRun - Runs - RBI - Years - CAtBat - 
##     CHits - CHmRun - CRuns - CRBI - CWalks - Errors - PutOuts - 
##     Assists - AtBat, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5822 -0.6320  0.1774  0.5589  2.8355 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.820657   0.156824  30.739  < 2e-16 ***
## Hits        0.006499   0.001533   4.239 3.56e-05 ***
## Walks       0.009822   0.003365   2.919  0.00395 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7901 on 182 degrees of freedom
## Multiple R-squared:  0.2428, Adjusted R-squared:  0.2345 
## F-statistic: 29.18 on 2 and 182 DF,  p-value: 1.02e-11
rSquare_LM1_testing
## [1] 0.2522379
summary(LM2)
## 
## Call:
## lm(formula = logSalary ~ CHits + Hits + PutOuts + AtBat + Walks + 
##     CRBI + RBI, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.28740 -0.45454  0.03411  0.40698  2.92931 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.7550960  0.1446934  32.863  < 2e-16 ***
## CHits        0.0014553  0.0003088   4.713 4.93e-06 ***
## Hits         0.0116276  0.0040778   2.851  0.00487 ** 
## PutOuts      0.0003602  0.0001654   2.178  0.03070 *  
## AtBat       -0.0032732  0.0012653  -2.587  0.01048 *  
## Walks        0.0060920  0.0028786   2.116  0.03572 *  
## CRBI        -0.0013054  0.0006118  -2.134  0.03425 *  
## RBI          0.0055898  0.0038989   1.434  0.15342    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6164 on 177 degrees of freedom
## Multiple R-squared:  0.5518, Adjusted R-squared:  0.5341 
## F-statistic: 31.13 on 7 and 177 DF,  p-value: < 2.2e-16
rSquare_LM2
## [1] 0.4212
summary(LM3)
## 
## Call:
## lm(formula = logSalary ~ AtBat + Hits + HmRun + Walks + Years + 
##     CHits + CRBI + CWalks + PutOuts + Assists + Errors, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22360 -0.45321  0.02173  0.41530  2.81056 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.6676237  0.1919104  24.322  < 2e-16 ***
## AtBat       -0.0037984  0.0013439  -2.826  0.00526 ** 
## Hits         0.0133657  0.0040658   3.287  0.00122 ** 
## HmRun        0.0138052  0.0081603   1.692  0.09250 .  
## Walks        0.0102815  0.0038651   2.660  0.00855 ** 
## Years        0.0386732  0.0270888   1.428  0.15520    
## CHits        0.0012717  0.0003730   3.409  0.00081 ***
## CRBI        -0.0008533  0.0005866  -1.455  0.14760    
## CWalks      -0.0009053  0.0005951  -1.521  0.13004    
## PutOuts      0.0004071  0.0001672   2.434  0.01593 *  
## Assists      0.0010533  0.0005011   2.102  0.03700 *  
## Errors      -0.0159414  0.0098651  -1.616  0.10793    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6116 on 173 degrees of freedom
## Multiple R-squared:  0.5688, Adjusted R-squared:  0.5413 
## F-statistic: 20.74 on 11 and 173 DF,  p-value: < 2.2e-16
rSquare_LM3
## [1] 0.4549475
summary(LM4)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27147 -0.44874  0.01607  0.40936  2.79909 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.677e+00  1.959e-01  23.872   <2e-16 ***
## AtBat       -3.931e-03  1.523e-03  -2.582   0.0107 *  
## Hits         1.445e-02  5.813e-03   2.485   0.0139 *  
## HmRun        9.406e-03  1.480e-02   0.636   0.5258    
## Runs        -3.868e-03  6.987e-03  -0.554   0.5805    
## RBI          3.493e-03  6.463e-03   0.540   0.5896    
## Walks        1.066e-02  4.492e-03   2.373   0.0188 *  
## Years        4.177e-02  3.128e-02   1.335   0.1836    
## CAtBat      -2.807e-05  3.299e-04  -0.085   0.9323    
## CHits        1.097e-03  1.745e-03   0.629   0.5302    
## CHmRun       5.071e-04  3.912e-03   0.130   0.8970    
## CRuns        8.409e-04  1.764e-03   0.477   0.6343    
## CRBI        -1.265e-03  1.822e-03  -0.694   0.4886    
## CWalks      -9.993e-04  8.293e-04  -1.205   0.2299    
## PutOuts      4.154e-04  1.765e-04   2.355   0.0197 *  
## Assists      1.052e-03  5.131e-04   2.051   0.0418 *  
## Errors      -1.647e-02  1.017e-02  -1.619   0.1074    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6188 on 168 degrees of freedom
## Multiple R-squared:  0.5713, Adjusted R-squared:  0.5305 
## F-statistic: 13.99 on 16 and 168 DF,  p-value: < 2.2e-16
rSquare_LM4
## [1] 0.4518025

Out of these four models, LM1 contains the least number of variables (only Hits and Walks are present), while LM4 containts the most number of variables (all variables are present). The \(R^2\) values (based on the testing data) are given below each of their respective models. The largest of these is that of LM3, which indicates that it is has the highest predictive capacity and is the most accurate model. All four of these models have high interpretability. Each of the coefficients describes how logSalary changes with each unit increase in their associated variables, assuming all other variables are held constant.

(b)

Of the six models, LM1, LM2, LM3, LM4, kNN1, and MARS1, which model do you think is best? Explain carefully.

We give the \(R^2\) values (based on the testing data) for each of the models below:

rSquare_LM1_testing
## [1] 0.2522379
rSquare_LM2
## [1] 0.4212
rSquare_LM3
## [1] 0.4549475
rSquare_LM4
## [1] 0.4518025
rSquare_kNN1
## [1] 0.6818339
rSquare_MARS1
## [1] 0.6125382

The largest \(R^2\) value is that of the kNN model, which means that it has the highest predictive capacity when given new data (the testing data). We therefore conclude that this is our best model.