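library(ISLR)   # assumed source of the Hitters data
library(caret)  # createDataPartition(), train(), trainControl()
library(car)    # assumed source of vif()
# Drop the three factor columns (League, Division, NewLeague)
Hitters1 <- Hitters[c(-14, -15, -20)]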
Hitters2 <- na.omit(Hitters1)
logSalary <- log(Hitters2$Salary)
Hitters2$Salary <- logSalary
colnames(Hitters2)[17] <- "logSalary"
set.seed(12345)
dp <- createDataPartition(Hitters2$logSalary, p=0.7, list=FALSE)
training <- Hitters2[dp,]
testing <- Hitters2[-dp,]
LM1 <- lm(logSalary~., training)
summary(LM1)
##
## Call:
## lm(formula = logSalary ~ ., data = training)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.27147 -0.44874  0.01607  0.40936  2.79909
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.677e+00  1.959e-01  23.872   <2e-16 ***
## AtBat       -3.931e-03  1.523e-03  -2.582   0.0107 *
## Hits         1.445e-02  5.813e-03   2.485   0.0139 *
## HmRun        9.406e-03  1.480e-02   0.636   0.5258
## Runs        -3.868e-03  6.987e-03  -0.554   0.5805
## RBI          3.493e-03  6.463e-03   0.540   0.5896
## Walks        1.066e-02  4.492e-03   2.373   0.0188 *
## Years        4.177e-02  3.128e-02   1.335   0.1836
## CAtBat      -2.807e-05  3.299e-04  -0.085   0.9323
## CHits        1.097e-03  1.745e-03   0.629   0.5302
## CHmRun       5.071e-04  3.912e-03   0.130   0.8970
## CRuns        8.409e-04  1.764e-03   0.477   0.6343
## CRBI        -1.265e-03  1.822e-03  -0.694   0.4886
## CWalks      -9.993e-04  8.293e-04  -1.205   0.2299
## PutOuts      4.154e-04  1.765e-04   2.355   0.0197 *
## Assists      1.052e-03  5.131e-04   2.051   0.0418 *
## Errors      -1.647e-02  1.017e-02  -1.619   0.1074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6188 on 168 degrees of freedom
## Multiple R-squared: 0.5713, Adjusted R-squared: 0.5305
## F-statistic: 13.99 on 16 and 168 DF, p-value: < 2.2e-16
Eleven of the 16 predictors have p-values above 0.05, so I drop them and refit with the remaining five: AtBat, Hits, Walks, PutOuts, and Assists.
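The predictors above the 0.05 cutoff can also be listed programmatically rather than read off the table. The following is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for the Hitters training set; the names `fit`, `pvals`, and `high_p` are illustrative only.

```r
# Stand-in data: mtcars ships with base R; the report itself uses Hitters.
fit <- lm(mpg ~ ., data = mtcars)

# Coefficient table: one row per term, p-values in the "Pr(>|t|)" column.
coefs <- summary(fit)$coefficients
pvals <- coefs[-1, "Pr(>|t|)"]          # drop the intercept row

# Predictors whose p-value exceeds the 0.05 cutoff used in the text.
high_p <- names(pvals)[pvals > 0.05]
print(high_p)
```

P-values depend on which other predictors are in the model, so removing several terms at once can change the picture for the ones that remain. That happens here: PutOuts goes from p = 0.0197 in the full model to p = 0.495 in the five-predictor refit.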
LM1 <- lm(logSalary~AtBat+Hits+Walks+PutOuts+Assists, training)
summary(LM1)
##
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks + PutOuts + Assists,
## data = training)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5468 -0.5977  0.1007  0.5593  2.6496
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.048e+00  1.799e-01  28.069  < 2e-16 ***
## AtBat       -4.240e-03  1.648e-03  -2.573 0.010895 *
## Hits         1.892e-02  5.039e-03   3.753 0.000236 ***
## Walks        1.191e-02  3.544e-03   3.361 0.000950 ***
## PutOuts      1.423e-04  2.081e-04   0.684 0.495085
## Assists      8.827e-05  4.392e-04   0.201 0.840930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7804 on 179 degrees of freedom
## Multiple R-squared: 0.2734, Adjusted R-squared: 0.2531
## F-statistic: 13.47 on 5 and 179 DF, p-value: 3.739e-11
PutOuts and Assists now have high p-values (0.495 and 0.841), so I drop them as well.
LM1 <- lm(logSalary~AtBat+Hits+Walks, training)
summary(LM1)
##
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks, data = training)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.4733 -0.6045  0.1040  0.5608  2.6699
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.057128   0.177881  28.430  < 2e-16 ***
## AtBat       -0.004171   0.001563  -2.669 0.008294 **
## Hits         0.018981   0.004913   3.863 0.000156 ***
## Walks        0.012176   0.003425   3.555 0.000482 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7771 on 181 degrees of freedom
## Multiple R-squared: 0.2715, Adjusted R-squared: 0.2594
## F-statistic: 22.48 on 3 and 181 DF, p-value: 2.028e-12
vif(LM1)
##     AtBat      Hits     Walks
## 16.374806 15.387746  1.551907
LM1 <- lm(logSalary~Hits+Walks, training)
vif(LM1)
##    Hits   Walks
## 1.44903 1.44903
After examining the VIFs, I noticed that both AtBat and Hits had very high values (16.375 and 15.388). I removed AtBat, which had the higher of the two, and re-evaluated. The VIFs for the remaining two predictors of LM1 (Hits and Walks) were both 1.44903, which is satisfactory.
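For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below computes it by hand, assuming the built-in `mtcars` data as a stand-in and three of its columns as the "correlated predictors"; `manual_vif` is an illustrative name, not something from the report.

```r
# Stand-in data: mtcars (base R). hp, disp and wt play the role of
# correlated predictors; the report uses AtBat, Hits and Walks.
predictors <- c("hp", "disp", "wt")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
print(round(manual_vif, 3))
```

With only two predictors, each auxiliary regression has the same R^2 (the squared correlation between the pair), which is why Hits and Walks report an identical VIF of 1.44903.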
LM1.predicted <- predict(LM1)
plot(training$logSalary, LM1.predicted, ylab = "Predicted", xlab = "Actual", main = "Predicted vs. Actual")
LM1.res <- resid(LM1)
plot(LM1.predicted, LM1.res, ylab = "Residuals of LM1", main = "Residuals")
vif(LM1)
##    Hits   Walks
## 1.44903 1.44903
The “Residuals” plot above shows no patterns, which is good. Most of the residuals are between -1 and 1, which means the predicted logSalary values were decently close to the actual ones.
In the “Predicted vs. Actual” plot, the points show a slight positive linear trend.
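Because the response is on the log scale, a residual translates into a ratio of actual to predicted salary rather than a difference. The short calculation below is only an illustration of that conversion; the residual values are made-up round numbers, not taken from the model.

```r
# Residual on the log scale = log(actual) - log(predicted),
# so exp(residual) = actual / predicted.
log_resid <- c(-1, -0.5, 0, 0.5, 1)   # illustrative values only
ratio <- exp(log_resid)
print(round(ratio, 2))
# 0.37 0.61 1.00 1.65 2.72
```

A residual of 1 therefore means the actual salary was about 2.7 times the predicted one, and a residual of -1 means it was about 0.37 times the prediction.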
The VIFs still look good with both at 1.44903.
The R2 of LM1 is rather low at 0.2427942.
On the testing data, the R2 of LM1 is also rather low at 0.2522379.
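The report does not show how the testing R2 was computed, so the following is a sketch of two common definitions, assuming the built-in `mtcars` data and an arbitrary 22/10 split as stand-ins. The names `r2_cor` and `r2_sse` are illustrative.

```r
set.seed(1)                                  # arbitrary seed for the illustration
idx   <- sample(nrow(mtcars), 22)            # roughly 70% of rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)         # predictions on held-out rows

# Definition 1: squared correlation between predictions and observed values.
r2_cor <- cor(pred, test$mpg)^2

# Definition 2: 1 - SSE/SST, using the test-set mean as the baseline.
r2_sse <- 1 - sum((test$mpg - pred)^2) / sum((test$mpg - mean(test$mpg))^2)

print(c(r2_cor = r2_cor, r2_sse = r2_sse))
```

The two definitions coincide on the training data of a least-squares fit with an intercept, but they can differ on held-out data, so it helps to state which one a reported figure uses.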
null <- lm(logSalary~1, data = training)
full <- lm(logSalary~., data = training)
LM2 <- step(null, direction = "forward", scope = list(lower=null, upper=full), trace = FALSE)
At 0.4212, the R2 of LM2 on the testing data is higher than that of LM1.
LM3 <- step(full, direction = "backward", trace = FALSE)
At 0.4549475, the R2 of LM3 on the testing data is higher than that of LM2.
LM4 <- train(logSalary~., method = "lm", data = training, trControl = trainControl(method = "cv", number = 10))
At 0.4518025, the R2 of LM4 on the testing data is slightly lower than that of LM3.
kNN1 <- train(logSalary~., method = "knn", data = training, trControl = trainControl(method = "cv", number = 10))
At 0.6757635, the R2 of kNN1 on the testing data is the highest of any of the models.
MARS1 <- train(logSalary~., method = "earth", data = training, trControl = trainControl(method = "cv", number = 10))
At 0.6125382, the R2 of MARS1 on the testing data is the second highest, behind the kNN1 model.
LM1 is a linear model for logSalary that uses only two predictors, Hits and Walks.
LM2 is the result of forward step-wise selection, which starts with an intercept-only model for logSalary and adds one predictor at a time until no addition improves the AIC (the criterion step() uses by default), giving the ‘best’ model.
LM3 is the result of backward step-wise selection, which starts with a full model containing all possible predictors and removes one at a time until no removal improves the AIC, giving the ‘best’ model.
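The two searches can be compared side by side. This is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for the training set; `null_fit`, `full_fit`, `fwd`, and `bwd` are illustrative names.

```r
# Stand-in data: mtcars (base R); the report runs the same calls on its training set.
null_fit <- lm(mpg ~ 1, data = mtcars)   # intercept only
full_fit <- lm(mpg ~ ., data = mtcars)   # every available predictor

fwd <- step(null_fit, direction = "forward",
            scope = list(lower = formula(null_fit), upper = formula(full_fit)),
            trace = FALSE)
bwd <- step(full_fit, direction = "backward", trace = FALSE)

# The two searches can end at different models; compare what each kept and its AIC.
print(formula(fwd)); print(AIC(fwd))
print(formula(bwd)); print(AIC(bwd))
```

Setting `trace = TRUE` (or leaving it at its default) prints the AIC of every candidate at each step, which shows why a given predictor was added or dropped.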
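LM4 is the full linear model fitted through caret with 10-fold cross-validation. The training data is split into 10 subsets; each subset is held out in turn while the model is fitted on the other nine, and the 10 held-out errors are averaged into a CV error estimate. The final model is then refitted on all of the training data.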
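Ten-fold cross-validation, as requested through trainControl above, can be sketched by hand in base R. The example assumes the built-in `mtcars` data and a two-predictor model as stand-ins; it is not caret's internal implementation, and the names `folds`, `fold_rmse`, and `cv_rmse` are illustrative.

```r
set.seed(1)                                       # arbitrary seed for the illustration
k     <- 10
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))  # random fold label per row

# For each fold: fit on the other nine, predict the held-out rows, record RMSE.
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
})

cv_rmse   <- mean(fold_rmse)                      # averaged estimate of out-of-sample error
final_fit <- lm(mpg ~ wt + hp, data = mtcars)     # final model uses every row
print(cv_rmse)
```

For a plain linear model there is nothing to tune, so cross-validation only estimates how well the model generalizes; it does not change which coefficients end up in the final fit.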
I believe that LM3, the backward step-wise model, is the best. It starts with the full model and eliminates variables one at a time based on how little the model (as measured by AIC) changes without them. The 10-fold cross-validation model can lead to high variance, and the nearest-neighbors method doesn’t produce an explicit model. LM1 is a linear model that uses only 2 predictors, which seems low. The forward step-wise model has a lower R2 on the testing data than the backward one, which leads me to believe the backward model is slightly better.