Remove the factor variables from the Hitters data in the ISLR package and remove all records in which the Salary data is missing. To remove the heavy right skew, replace Salary with log(Salary) and call it logSalary. Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets. The response variable is logSalary.
library(ISLR) #package that contains Hitters dataset
library(dplyr) #package used to drop columns easily
library(caret) #package provides the createDataPartition function
Hitters1=Hitters
#found the factor variables (League, Division, NewLeague) using str()
Hitters1 = select(Hitters1, -c(League, Division, NewLeague)) #removes factor variables
Hitters2 = Hitters1[complete.cases(Hitters1$Salary), ] #removes records where Salary is missing
#removing right skew and replacing salary with logSalary
logSalary = log(Hitters2$Salary)
Hitters2$Salary = logSalary
colnames(Hitters2)[colnames(Hitters2)=="Salary"]="logSalary"
#splitting the data into training and testing sets
set.seed(12345)
TrainingIndex = createDataPartition(Hitters2$logSalary, p=.7, list=FALSE)
TrainingData = Hitters2[TrainingIndex[,1], ]
TestingData = Hitters2[-TrainingIndex[,1], ]
(2a-Windows) Using the training data, construct a linear model (LM1) for logSalary using all of the other variables as predictors.
LM1 = train(logSalary~. , data=TrainingData, method="lm")
summary(LM1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28299 -0.43609 0.02173 0.43902 1.13982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4086816 0.1965854 22.426 <2e-16 ***
## AtBat -0.0027278 0.0015398 -1.772 0.0783 .
## Hits 0.0131981 0.0056473 2.337 0.0206 *
## HmRun 0.0137859 0.0142506 0.967 0.3347
## Runs -0.0034222 0.0069924 -0.489 0.6252
## RBI 0.0007142 0.0061828 0.116 0.9082
## Walks 0.0113405 0.0042913 2.643 0.0090 **
## Years 0.0475593 0.0272016 1.748 0.0822 .
## CAtBat 0.0001933 0.0003241 0.596 0.5518
## CHits -0.0002160 0.0016771 -0.129 0.8977
## CHmRun -0.0001548 0.0037416 -0.041 0.9670
## CRuns 0.0015382 0.0017284 0.890 0.3748
## CRBI -0.0009776 0.0016929 -0.577 0.5644
## CWalks -0.0010780 0.0007940 -1.358 0.1764
## PutOuts 0.0002500 0.0001740 1.437 0.1527
## Assists 0.0002214 0.0005273 0.420 0.6751
## Errors -0.0094860 0.0110696 -0.857 0.3927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6018 on 168 degrees of freedom
## Multiple R-squared: 0.5899, Adjusted R-squared: 0.5508
## F-statistic: 15.1 on 16 and 168 DF, p-value: < 2.2e-16
(2b-Windows) Rebuild LM1 removing all of the predictors with P-values > 0.05.
LM1b = train(logSalary~Hits+Walks, data=TrainingData, method="lm")
summary(LM1b)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3019 -0.6136 0.1631 0.5244 1.6602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.683377 0.152132 30.785 < 2e-16 ***
## Hits 0.007004 0.001536 4.560 9.39e-06 ***
## Walks 0.010974 0.003089 3.553 0.000486 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.758 on 182 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2875
## F-statistic: 38.11 on 2 and 182 DF, p-value: 1.493e-14
(2c-Windows) Rebuild LM1 removing all of the remaining predictors with P-values > 0.05.
There are no remaining predictors with p-values greater than 0.05.
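As a quick check (not required by the assignment), the remaining p-values can be pulled directly from the underlying lm fit stored in the train object; a sketch, output not shown:
coefs = summary(LM1b$finalModel)$coefficients #coefficient table of the fitted lm
coefs[-1, "Pr(>|t|)"] #p-values for Hits and Walks; both are below 0.05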
(2d-Windows) Examine the variance inflation factors and remove appropriate predictor(s) to avoid collinearity. Be sure to explain your choice(s). Rebuild LM1 using the remaining predictors.
library(car)
LM1VIF = lm(logSalary~Hits+Walks, data=TrainingData)
vif(LM1VIF)
## Hits Walks
## 1.472601 1.472601
library(caret)
LM1d = train(logSalary~ Hits+Walks, data=TrainingData, method ="lm")
summary(LM1d)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3019 -0.6136 0.1631 0.5244 1.6602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.683377 0.152132 30.785 < 2e-16 ***
## Hits 0.007004 0.001536 4.560 9.39e-06 ***
## Walks 0.010974 0.003089 3.553 0.000486 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.758 on 182 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2875
## F-statistic: 38.11 on 2 and 182 DF, p-value: 1.493e-14
Because Hits and Walks have identical VIFs (with only two predictors, each VIF equals 1/(1 - r^2), where r is the correlation between them, so the two values are necessarily the same), it does not make sense to remove either from the model. The check would be more informative with more predictors: if one or more variables had a high VIF, removing that variable would help avoid collinearity. So in this case, the final model keeps both Hits and Walks. Note that the multiple R-squared of this model is quite low, at 0.2952.
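As an illustration only (a sketch; output not shown), the same collinearity check could be applied to a model containing all of the numeric predictors, where the career totals would be expected to show much larger VIFs:
vif(lm(logSalary~., data=TrainingData)) #VIFs for the full training model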
(2e-Windows) Evaluate LM1. (Are there patterns in the residuals? Plot predicted versus observed logSalary. Compute R2. Examine VIFs for collinearity.)
library(caret)
predictedvalues = predict(LM1d, newdata=TrainingData)
LM1Residuals = TrainingData$logSalary - predictedvalues #residuals = observed - predicted
plot(predictedvalues, LM1Residuals, col="green", xlab="Predicted", ylab="Residuals")
abline(0,0)
There is no pattern to our residuals.
plot(predictedvalues,TrainingData$logSalary, col="green", xlab="Predicted log Salary", ylab="Observed log Salary")
abline(0, 1) #reference line where predicted = observed
With the y = x reference line (predicted = observed), points above the line are under-predicted and points below are over-predicted (residuals = observed - predicted). The residual plot above already shows residuals on both sides of zero, so the predictions are not systematically below the observed values; there does appear to be a slight upward, linear pattern in the plot.
summary(LM1d) #compute R-squared
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3019 -0.6136 0.1631 0.5244 1.6602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.683377 0.152132 30.785 < 2e-16 ***
## Hits 0.007004 0.001536 4.560 9.39e-06 ***
## Walks 0.010974 0.003089 3.553 0.000486 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.758 on 182 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2875
## F-statistic: 38.11 on 2 and 182 DF, p-value: 1.493e-14
vif(lm(TrainingData$logSalary~TrainingData$Hits +TrainingData$Walks)) #Examine collinearity
## TrainingData$Hits TrainingData$Walks
## 1.472601 1.472601
The multiple R-squared is 0.2952 (adjusted R-squared 0.2875), and there does not appear to be collinearity in the model: the VIFs for Hits and Walks are both 1.472601, which is fairly low.
(2f-Windows) Evaluate the predictive capacity of LM1 by computing R2 for the testing data. How does it compare to the R2 for the training data?
TestPredictions = predict(LM1d, newdata=TestingData) #predictions on the testing data from the final model
plot(TestPredictions, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0, 1) #reference line where predicted = observed
cor(TestPredictions, TestingData$logSalary)^2
## [1] 0.1537869
There does seem to be a slight positive, linear pattern in the plot of predicted versus observed log salary for the testing data, but the fit is weaker: the R-squared for the testing data is 0.1537869, compared with a multiple R-squared of 0.2952 (adjusted 0.2875) on the training data, so the model loses predictive capacity out of sample.
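For an apples-to-apples comparison, both values can be computed the same way, as the squared correlation between the model's predictions and observed logSalary (a sketch; output not shown):
cor(predict(LM1d, newdata=TrainingData), TrainingData$logSalary)^2 #training R-squared
cor(predict(LM1d, newdata=TestingData), TestingData$logSalary)^2 #testing R-squared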
nullModel = lm(logSalary~1, data=TrainingData)
fullModel = lm(logSalary~., data=TrainingData)
LM2 = step(nullModel, direction="forward", scope = list(lower=nullModel, upper=fullModel))
summary(LM2)
##
## Call:
## lm(formula = logSalary ~ CRuns + Hits + Years + Walks + CWalks +
## AtBat, data = TrainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30148 -0.49901 -0.01691 0.47964 1.21765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3320479 0.1844936 23.481 < 2e-16 ***
## CRuns 0.0013248 0.0005129 2.583 0.01060 *
## Hits 0.0123288 0.0038309 3.218 0.00153 **
## Years 0.0615065 0.0219424 2.803 0.00562 **
## Walks 0.0113305 0.0035898 3.156 0.00188 **
## CWalks -0.0010602 0.0005746 -1.845 0.06668 .
## AtBat -0.0022484 0.0012457 -1.805 0.07277 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5963 on 178 degrees of freedom
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.559
## F-statistic: 39.87 on 6 and 178 DF, p-value: < 2.2e-16
TestingData$prediction = predict(LM2, TestingData)
plot(TestingData$prediction, TestingData$logSalary, xlab="Predicted Log Salary", ylab = "Observed Log Salary")
abline(0, 1)
cor(TestingData$prediction, TestingData$logSalary)^2
## [1] 0.4384241
R^2 = 0.4384241
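R-squared alone does not show the size of the prediction errors; a complementary check (a sketch; output not shown) is the RMSE of LM2 on the testing data, on the log-salary scale, using the prediction column created above:
sqrt(mean((TestingData$logSalary - TestingData$prediction)^2)) #test RMSE for LM2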
LM3 = step(fullModel, direction = "backward", scope = list(lower = nullModel, upper = fullModel))
summary(LM3)
##
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks + Years + CRuns +
## CWalks, data = TrainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30148 -0.49901 -0.01691 0.47964 1.21765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3320479 0.1844936 23.481 < 2e-16 ***
## AtBat -0.0022484 0.0012457 -1.805 0.07277 .
## Hits 0.0123288 0.0038309 3.218 0.00153 **
## Walks 0.0113305 0.0035898 3.156 0.00188 **
## Years 0.0615065 0.0219424 2.803 0.00562 **
## CRuns 0.0013248 0.0005129 2.583 0.01060 *
## CWalks -0.0010602 0.0005746 -1.845 0.06668 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5963 on 178 degrees of freedom
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.559
## F-statistic: 39.87 on 6 and 178 DF, p-value: < 2.2e-16
TestingData$prediction2 = predict(LM3, TestingData)
plot(TestingData$prediction2, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0, 1)
cor(TestingData$prediction2, TestingData$logSalary)^2
## [1] 0.4384241
R^2 = 0.4384241
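Forward and backward selection arrived at the same six predictors (and the same fitted coefficients), which can be confirmed directly (a sketch; output not shown):
identical(sort(names(coef(LM2))), sort(names(coef(LM3)))) #TRUE if both kept the same predictors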
Now let's build data analytic models using 10-fold cross-validation by including the optional argument trControl=trainControl(method="cv", number=10) in the train command.
LM4 = train(logSalary~.,data = TrainingData, method="lm", trControl=trainControl(method="cv", number=10))
TestingData$prediction3 = predict(LM4, TestingData)
plot(TestingData$prediction3, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0,1)
cor(TestingData$prediction3, TestingData$logSalary)^2
## [1] 0.4251033
R^2 = 0.4251033
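The cross-validated performance estimates computed on the training folds are stored in the train object and can be compared against the testing R-squared above (a sketch; output not shown):
LM4$results #RMSE, Rsquared and MAE averaged over the 10 folds
LM4$resample #the same metrics for each individual fold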
kNN1 = train(logSalary~., data=TrainingData, method='knn', trControl=trainControl(method = "cv", number=10))
TestingData$prediction4 = predict(kNN1, TestingData)
plot(TestingData$prediction4, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0,1)
cor(TestingData$prediction4, TestingData$logSalary)^2
## [1] 0.5343042
R^2 = 0.5343042
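caret tunes k over a small default grid; the value of k chosen by cross-validation, and the metrics for each candidate k, can be inspected from the train object (a sketch; output not shown):
kNN1$bestTune #the value of k selected by 10-fold cross-validation
kNN1$results #RMSE, Rsquared and MAE for each candidate k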
MARS1 = train(logSalary~., data=TrainingData, method='earth', trControl=trainControl(method="cv", number=10))
TestingData$prediction5 = predict(MARS1, TestingData)
plot(TestingData$prediction5, TestingData$logSalary, xlab="Predicted Log Salary", ylab="Observed Log Salary")
abline(0, 1)
cat(cor(TestingData$prediction5, TestingData$logSalary)^2)
## 0.6007319
R^2 = 0.6007319
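The MARS fit can be examined through the underlying earth model, which lists the hinge-function terms it kept, and through the tuning parameters chosen by cross-validation (a sketch; output not shown):
summary(MARS1$finalModel) #the selected hinge-function terms and their coefficients
MARS1$bestTune #the nprune and degree values chosen by cross-validation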
Summary:
LM1 - simple linear regression
R-squared on testing data = 0.1537869
LM2 - forward stepwise regression
R-squared on testing data = 0.4384241
LM3 - backward stepwise regression
R-squared on testing data = 0.4384241
LM4 - linear regression with 10-fold cross-validation
R-squared on testing data = 0.4251033
All four of these models are linear regressions. LM1 is the simplest to interpret because of how it was built: predictors with non-significant p-values were simply removed. LM2 through LM4 use more systematic procedures to choose a model, and they also achieved much higher R-squared values on the testing data than LM1. LM1 ended up keeping only two predictors, which likely contributed to its lower R-squared, while the stepwise regressions kept six. The stepwise models had the highest R-squared of the four, probably because of how they select variables: R's step() function adds (or drops) one variable at a time, keeping the change that most improves the model's AIC. LM1 used a much simpler rule, dropping every predictor whose p-value exceeded 0.05.
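The order in which forward selection added terms, and the AIC at each step, are recorded in the object returned by step(); a sketch, assuming LM2 is still in the workspace (output not shown):
LM2$anova #one row per step: the term added and the resulting AIC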
Summary:
LM1 - simple linear regression
R-squared on testing data = 0.1537869
LM2 - forward stepwise regression
R-squared on testing data = 0.4384241
LM3 - backward stepwise regression
R-squared on testing data = 0.4384241
LM4 - linear regression with 10-fold cross-validation
R-squared on testing data = 0.4251033
kNN1 - k-nearest neighbors
R-squared on testing data = 0.5343042
MARS1 - MARS (multivariate adaptive regression splines)
R-squared on testing data = 0.6007319
If we compare all six models solely on R-squared for the testing data, the MARS model is the best, and the k-nearest-neighbors model still has a noticeably higher R-squared than any of the linear regression models. kNN is also very different in character from the others: it predicts a player's log salary from the most similar players in the training data, rather than from a fitted equation, as the regression-based models do.
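Because kNN is distance-based, predictors measured on large scales (such as the career totals) can dominate the distance calculation. A common variant, sketched below under the hypothetical name kNN2 (results not shown), centers and scales the predictors inside train():
kNN2 = train(logSalary~., data=TrainingData, method="knn",
             preProcess=c("center", "scale"), #standardize predictors before computing distances
             trControl=trainControl(method="cv", number=10))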
(2a-Linux)
(2b-Linux) Note that in part (a) we get different p-values depending on whether the code is run on Windows or Linux. Here we remove the variables with p-values greater than 0.05 from the Linux output and get the following:
LM1b = train(logSalary~AtBat+Hits+Walks+PutOuts+Assists, data=TrainingData, method="lm")
summary(LM1b)
(2c-Linux) When part (b) is run in Linux, some predictors remain in the model with p-values greater than 0.05 (Assists and PutOuts), so we remove them here.
LM1c = train(logSalary~AtBat+Hits+Walks, data=TrainingData, method="lm")
summary(LM1c)
(2d-Linux)
LM1VIF = lm(logSalary~Hits+Walks+AtBat, data=TrainingData)
vif(LM1VIF)
Here, AtBat has a high VIF so I will remove it from the model in order to avoid collinearity.
LM1VIF2 = lm(logSalary~Walks+Hits, data=TrainingData)
vif(LM1VIF2)
Both Walks and Hits have low VIFs, so both remain in the model.
LM1d = train(logSalary~Hits+Walks, data=TrainingData, method="lm")
summary(LM1d)
Note that this model also has a very low R-squared at 0.2345.
(2e-Linux)
(2f-Linux)
Here the R-squared is 0.2475234.