Remove the factor variables from the Hitters data in the ISLR package and remove all records in which the Salary data is missing. To remove the heavy right skew, replace Salary with log(Salary) and call it logSalary. Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets. The response variable is logSalary.
library(ISLR) #package that contains Hitters dataset
library(dplyr) #package used to drop columns easily
library(caret) #package provides the createDataPartition function
Hitters1=Hitters
#found the factor variables (League, Division, NewLeague) using str()
Hitters1 = select(Hitters1, -c(League, Division, NewLeague)) #removes factor variables
Hitters2 = Hitters1[complete.cases(Hitters1$Salary), ] #removes records where Salary is missing
#removing right skew and replacing salary with logSalary
logSalary = log(Hitters2$Salary)
Hitters2$Salary = logSalary
colnames(Hitters2)[colnames(Hitters2)=="Salary"]="logSalary"
#splitting the data into training and testing sets
set.seed(12345)
TrainingIndex = createDataPartition(Hitters2$logSalary, p=.7, list=FALSE)
TrainingData = Hitters2[TrainingIndex[,1], ]
TestingData = Hitters2[-TrainingIndex[,1], ]
(2a-Windows) Using the training data, construct a linear model (LM1) for logSalary using all of the other variables as predictors.
LM1 = train(logSalary~. , data=TrainingData, method="lm")
summary(LM1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28299 -0.43609 0.02173 0.43902 1.13982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4086816 0.1965854 22.426 <2e-16 ***
## AtBat -0.0027278 0.0015398 -1.772 0.0783 .
## Hits 0.0131981 0.0056473 2.337 0.0206 *
## HmRun 0.0137859 0.0142506 0.967 0.3347
## Runs -0.0034222 0.0069924 -0.489 0.6252
## RBI 0.0007142 0.0061828 0.116 0.9082
## Walks 0.0113405 0.0042913 2.643 0.0090 **
## Years 0.0475593 0.0272016 1.748 0.0822 .
## CAtBat 0.0001933 0.0003241 0.596 0.5518
## CHits -0.0002160 0.0016771 -0.129 0.8977
## CHmRun -0.0001548 0.0037416 -0.041 0.9670
## CRuns 0.0015382 0.0017284 0.890 0.3748
## CRBI -0.0009776 0.0016929 -0.577 0.5644
## CWalks -0.0010780 0.0007940 -1.358 0.1764
## PutOuts 0.0002500 0.0001740 1.437 0.1527
## Assists 0.0002214 0.0005273 0.420 0.6751
## Errors -0.0094860 0.0110696 -0.857 0.3927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6018 on 168 degrees of freedom
## Multiple R-squared: 0.5899, Adjusted R-squared: 0.5508
## F-statistic: 15.1 on 16 and 168 DF, p-value: < 2.2e-16
(2b-Windows) Rebuild LM1 removing all of the predictors with P-values > 0.05.
LM1b = train(logSalary~Hits+Walks, data=TrainingData, method="lm")
summary(LM1b)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3019 -0.6136 0.1631 0.5244 1.6602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.683377 0.152132 30.785 < 2e-16 ***
## Hits 0.007004 0.001536 4.560 9.39e-06 ***
## Walks 0.010974 0.003089 3.553 0.000486 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.758 on 182 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2875
## F-statistic: 38.11 on 2 and 182 DF, p-value: 1.493e-14
(2c-Windows) Rebuild LM1 removing all of the remaining predictors with P-values > 0.05.
There are no remaining predictors with p-values greater than 0.05.
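As a quick check (not required by the assignment), the remaining p-values can be pulled directly from the underlying lm fit stored in the train object; a sketch, output not shown:
coefs = summary(LM1b$finalModel)$coefficients #coefficient table of the fitted lm
coefs[-1, "Pr(>|t|)"] #p-values for Hits and Walks; both are below 0.05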
(2d-Windows) Examine the variance inflation factors and remove appropriate predictor(s) to avoid collinearity. Be sure to explain your choice(s). Rebuild LM1 using the remaining predictors.
library(car)
LM1VIF = lm(logSalary~Hits+Walks, data=TrainingData)
vif(LM1VIF)
## Hits Walks
## 1.472601 1.472601
library(caret)
LM1d = train(logSalary~ Hits+Walks, data=TrainingData, method ="lm")
summary(LM1d)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3019 -0.6136 0.1631 0.5244 1.6602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.683377 0.152132 30.785 < 2e-16 ***
## Hits 0.007004 0.001536 4.560 9.39e-06 ***
## Walks 0.010974 0.003089 3.553 0.000486 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.758 on 182 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2875
## F-statistic: 38.11 on 2 and 182 DF, p-value: 1.493e-14
Because Hits and Walks have identical VIFs (with only two predictors, each VIF equals 1/(1 - r^2), where r is the correlation between them, so the two values are necessarily the same), it does not make sense to remove either from the model. The check would be more informative with more predictors: if one or more variables had a high VIF, removing that variable would help avoid collinearity. So in this case, the final model keeps both Hits and Walks. Note that the multiple R-squared of this model is quite low, at 0.2952.
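As an illustration only (a sketch; output not shown), the same collinearity check could be applied to a model containing all of the numeric predictors, where the career totals would be expected to show much larger VIFs:
vif(lm(logSalary~., data=TrainingData)) #VIFs for the full training model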
(2e-Windows) Evaluate LM1. (Are there patterns in the residuals? Plot predicted versus observed logSalary. Compute R2. Examine VIFs for collinearity.)
library(caret)
predictedvalues = predict(LM1d, newdata=TrainingData)
LM1Residuals = TrainingData$logSalary - predictedvalues #residuals = observed - predicted
plot(predictedvalues, LM1Residuals, col="green", xlab="Predicted", ylab="Residuals")
abline(0,0)
There is no pattern to our residuals.
plot(predictedvalues,TrainingData$logSalary, col="green", xlab="Predicted log Salary", ylab="Observed log Salary")
abline(0, 1) #reference line where predicted = observed
With the y = x reference line (predicted = observed), points above the line are under-predicted and points below are over-predicted (residuals = observed - predicted). The residual plot above already shows residuals on both sides of zero, so the predictions are not systematically below the observed values; there does appear to be a slight upward, linear pattern in the plot.
summary(LM1d) #compute R-squared
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3019 -0.6136 0.1631 0.5244 1.6602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.683377 0.152132 30.785 < 2e-16 ***
## Hits 0.007004 0.001536 4.560 9.39e-06 ***
## Walks 0.010974 0.003089 3.553 0.000486 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.758 on 182 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2875
## F-statistic: 38.11 on 2 and 182 DF, p-value: 1.493e-14
vif(lm(TrainingData$logSalary~TrainingData$Hits +TrainingData$Walks)) #Examine collinearity
## TrainingData$Hits TrainingData$Walks
## 1.472601 1.472601
The multiple R-squared is 0.2952 (adjusted R-squared 0.2875), and there does not appear to be collinearity in the model: the VIFs for Hits and Walks are both 1.472601, which is fairly low.
(2f-Windows) Evaluate the predictive capacity of LM1 by computing R2 for the testing data. How does it compare to the R2 for the training data?
TestPredictions = predict(LM1d, newdata=TestingData) #predictions on the testing data from the final model
plot(TestPredictions, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0, 1) #reference line where predicted = observed
cor(TestPredictions, TestingData$logSalary)^2
## [1] 0.1537869
There does seem to be a slight positive, linear pattern in the plot of predicted versus observed log salary for the testing data, but the fit is weaker: the R-squared for the testing data is 0.1537869, compared with a multiple R-squared of 0.2952 (adjusted 0.2875) on the training data, so the model loses predictive capacity out of sample.
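For an apples-to-apples comparison, both values can be computed the same way, as the squared correlation between the model's predictions and observed logSalary (a sketch; output not shown):
cor(predict(LM1d, newdata=TrainingData), TrainingData$logSalary)^2 #training R-squared
cor(predict(LM1d, newdata=TestingData), TestingData$logSalary)^2 #testing R-squared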
nullModel = lm(logSalary~1, data=TrainingData)
fullModel = lm(logSalary~., data=TrainingData)
LM2 = step(nullModel, direction="forward", scope = list(lower=nullModel, upper=fullModel))
summary(LM2)
##
## Call:
## lm(formula = logSalary ~ CRuns + Hits + Years + Walks + CWalks +
## AtBat, data = TrainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30148 -0.49901 -0.01691 0.47964 1.21765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3320479 0.1844936 23.481 < 2e-16 ***
## CRuns 0.0013248 0.0005129 2.583 0.01060 *
## Hits 0.0123288 0.0038309 3.218 0.00153 **
## Years 0.0615065 0.0219424 2.803 0.00562 **
## Walks 0.0113305 0.0035898 3.156 0.00188 **
## CWalks -0.0010602 0.0005746 -1.845 0.06668 .
## AtBat -0.0022484 0.0012457 -1.805 0.07277 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5963 on 178 degrees of freedom
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.559
## F-statistic: 39.87 on 6 and 178 DF, p-value: < 2.2e-16
TestingData$prediction = predict(LM2, TestingData)
plot(TestingData$prediction, TestingData$logSalary, xlab="Predicted Log Salary", ylab = "Observed Log Salary")
abline(0, 1)
cor(TestingData$prediction, TestingData$logSalary)^2
## [1] 0.4384241
R^2 = 0.4384241
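R-squared alone does not show the size of the prediction errors; a complementary check (a sketch; output not shown) is the RMSE of LM2 on the testing data, on the log-salary scale, using the prediction column created above:
sqrt(mean((TestingData$logSalary - TestingData$prediction)^2)) #test RMSE for LM2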
LM3 = step(fullModel, direction = "backward", scope = list(lower = nullModel, upper = fullModel))
summary(LM3)
##
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks + Years + CRuns +
## CWalks, data = TrainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30148 -0.49901 -0.01691 0.47964 1.21765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3320479 0.1844936 23.481 < 2e-16 ***
## AtBat -0.0022484 0.0012457 -1.805 0.07277 .
## Hits 0.0123288 0.0038309 3.218 0.00153 **
## Walks 0.0113305 0.0035898 3.156 0.00188 **
## Years 0.0615065 0.0219424 2.803 0.00562 **
## CRuns 0.0013248 0.0005129 2.583 0.01060 *
## CWalks -0.0010602 0.0005746 -1.845 0.06668 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5963 on 178 degrees of freedom
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.559
## F-statistic: 39.87 on 6 and 178 DF, p-value: < 2.2e-16
TestingData$prediction2 = predict(LM3, TestingData)
plot(TestingData$prediction2, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0, 1)
cor(TestingData$prediction2, TestingData$logSalary)^2
## [1] 0.4384241
R^2 = 0.4384241
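Forward and backward selection arrived at the same six predictors (and the same fitted coefficients), which can be confirmed directly (a sketch; output not shown):
identical(sort(names(coef(LM2))), sort(names(coef(LM3)))) #TRUE if both kept the same predictors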
Now let's build data analytic models using 10-fold cross-validation by including the optional argument trControl=trainControl(method="cv", number=10) in the train command.
LM4 = train(logSalary~.,data = TrainingData, method="lm", trControl=trainControl(method="cv", number=10))
TestingData$prediction3 = predict(LM4, TestingData)
plot(TestingData$prediction3, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0,1)
cor(TestingData$prediction3, TestingData$logSalary)^2
## [1] 0.4251033
R^2 = 0.4251033
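The cross-validated performance estimates computed on the training folds are stored in the train object and can be compared against the testing R-squared above (a sketch; output not shown):
LM4$results #RMSE, Rsquared and MAE averaged over the 10 folds
LM4$resample #the same metrics for each individual fold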
kNN1 = train(logSalary~., data=TrainingData, method='knn', trControl=trainControl(method = "cv", number=10))
TestingData$prediction4 = predict(kNN1, TestingData)
plot(TestingData$prediction4, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0,1)
cor(TestingData$prediction4, TestingData$logSalary)^2
## [1] 0.5343042
R^2 = 0.5343042
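caret tunes k over a small default grid; the value of k chosen by cross-validation, and the metrics for each candidate k, can be inspected from the train object (a sketch; output not shown):
kNN1$bestTune #the value of k selected by 10-fold cross-validation
kNN1$results #RMSE, Rsquared and MAE for each candidate k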
MARS1 = train(logSalary~., data=TrainingData, method='earth', trControl=trainControl(method="cv", number=10))
TestingData$prediction5 = predict(MARS1, TestingData)
plot(TestingData$prediction5, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0, 1)
cat(cor(TestingData$prediction5, TestingData$logSalary)^2)
## 0.6007319
R^2 = 0.6007319
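The MARS fit can be examined through the underlying earth model, which lists the hinge-function terms it kept, and through the tuning parameters chosen by cross-validation (a sketch; output not shown):
summary(MARS1$finalModel) #the selected hinge-function terms and their coefficients
MARS1$bestTune #the nprune and degree values chosen by cross-validation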
Summary:
LM1 - simple linear regression
R-squared on testing data = 0.1537869
LM2 - forward stepwise regression
R-squared on testing data = 0.4384241
LM3 - backward stepwise regression
R-squared on testing data = 0.4384241
LM4 - linear regression with 10-fold cross-validation
R-squared on testing data = 0.4251033
All four of these models are linear regressions. LM1 is the simplest to interpret because of how it was built: predictors with non-significant p-values were simply removed. LM2 through LM4 use more systematic procedures to choose a model, and they also achieved much higher R-squared values on the testing data than LM1. LM1 ended up keeping only two predictors, which likely contributed to its lower R-squared, while the stepwise regressions kept six. The stepwise models had the highest R-squared of the four, probably because of how they select variables: R's step() function adds (or drops) one variable at a time, keeping the change that most improves the model's AIC. LM1 used a much simpler rule, dropping every predictor whose p-value exceeded 0.05.
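The order in which forward selection added terms, and the AIC at each step, are recorded in the object returned by step(); a sketch, assuming LM2 is still in the workspace (output not shown):
LM2$anova #one row per step: the term added and the resulting AIC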
Summary:
LM1 - simple linear regression
R-squared on testing data = 0.1537869
LM2 - forward stepwise regression
R-squared on testing data = 0.4384241
LM3 - backward stepwise regression
R-squared on testing data = 0.4384241
LM4 - linear regression with 10-fold cross-validation
R-squared on testing data = 0.4251033
kNN1 - k-nearest neighbors
R-squared on testing data = 0.5343042
MARS1 - MARS (multivariate adaptive regression splines)
R-squared on testing data = 0.6007319
If we compare all six models solely on R-squared for the testing data, the MARS model is the best, and the k-nearest-neighbors model still has a noticeably higher R-squared than any of the linear regression models. kNN is also very different in character from the others: it predicts a player's log salary from the most similar players in the training data, rather than from a fitted equation, as the regression-based models do.
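Because kNN is distance-based, predictors measured on large scales (such as the career totals) can dominate the distance calculation. A common variant, sketched below under the hypothetical name kNN2 (results not shown), centers and scales the predictors inside train():
kNN2 = train(logSalary~., data=TrainingData, method="knn",
             preProcess=c("center", "scale"), #standardize predictors before computing distances
             trControl=trainControl(method="cv", number=10))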
(2a-Linux)
(2b-Linux) Note that in part (a) we get different p-values depending on whether the code is run on Windows or Linux. Here we remove the variables with p-values greater than 0.05 from the Linux output and get the following:
LM1b = train(logSalary~AtBat+Hits+Walks+PutOuts+Assists, data=TrainingData, method="lm")
summary(LM1b)
(2c-Linux) When part (b) is run in Linux, some predictors remain in the model with p-values greater than 0.05 (Assists and PutOuts), so we remove them here.
LM1c = train(logSalary~AtBat+Hits+Walks, data=TrainingData, method="lm")
summary(LM1c)
(2d-Linux)
LM1VIF = lm(logSalary~Hits+Walks+AtBat, data=TrainingData)
vif(LM1VIF)
Here, AtBat has a high VIF so I will remove it from the model in order to avoid collinearity.
LM1VIF2 = lm(logSalary~Walks+Hits, data=TrainingData)
vif(LM1VIF2)
Both Walks and Hits have low VIFs, so both remain in the model.
LM1d = train(logSalary~Hits+Walks, data=TrainingData, method="lm")
summary(LM1d)
Note that this model also has a very low R-squared at 0.2345.
(2e-Linux)
(2f-Linux)
Here the R-squared is 0.2475234.