Project 2

Problem 1

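library(ISLR)   # Hitters data (assumed source; ISLR2 ships the same data set)
library(caret)  # createDataPartition(), train()
library(car)    # vif() (assumed source of vif)

# Drop the factor columns League (14), Division (15), and NewLeague (20)
Hitters1 <- Hitters[c(-14, -15, -20)]
# Drop rows with a missing Salary
Hitters2 <- na.omit(Hitters1)

# Replace Salary (now column 17) with its natural log and rename it
logSalary <- log(Hitters2$Salary)
Hitters2$Salary <- logSalary
colnames(Hitters2)[17] <- "logSalary"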

set.seed(12345)
dp <- createDataPartition(Hitters2$logSalary, p=0.7, list=FALSE)
training <- Hitters2[dp,]
testing <- Hitters2[-dp,]

Problem 2

A.

LM1 <- lm(logSalary~., training)
summary(LM1)

## 
## Call:
## lm(formula = logSalary ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27147 -0.44874  0.01607  0.40936  2.79909 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.677e+00  1.959e-01  23.872   <2e-16 ***
## AtBat       -3.931e-03  1.523e-03  -2.582   0.0107 *  
## Hits         1.445e-02  5.813e-03   2.485   0.0139 *  
## HmRun        9.406e-03  1.480e-02   0.636   0.5258    
## Runs        -3.868e-03  6.987e-03  -0.554   0.5805    
## RBI          3.493e-03  6.463e-03   0.540   0.5896    
## Walks        1.066e-02  4.492e-03   2.373   0.0188 *  
## Years        4.177e-02  3.128e-02   1.335   0.1836    
## CAtBat      -2.807e-05  3.299e-04  -0.085   0.9323    
## CHits        1.097e-03  1.745e-03   0.629   0.5302    
## CHmRun       5.071e-04  3.912e-03   0.130   0.8970    
## CRuns        8.409e-04  1.764e-03   0.477   0.6343    
## CRBI        -1.265e-03  1.822e-03  -0.694   0.4886    
## CWalks      -9.993e-04  8.293e-04  -1.205   0.2299    
## PutOuts      4.154e-04  1.765e-04   2.355   0.0197 *  
## Assists      1.052e-03  5.131e-04   2.051   0.0418 *  
## Errors      -1.647e-02  1.017e-02  -1.619   0.1074    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6188 on 168 degrees of freedom
## Multiple R-squared:  0.5713, Adjusted R-squared:  0.5305 
## F-statistic: 13.99 on 16 and 168 DF,  p-value: < 2.2e-16

Eleven of the 16 predictors have p-values above 0.05 and need to be eliminated, leaving AtBat, Hits, Walks, PutOuts, and Assists.
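The same screening can be done programmatically instead of by reading the table. This is a minimal sketch, not the code used for this report: it uses the built-in mtcars data as a stand-in for training (so it runs without extra packages), and alpha, pvals, keep, and reduced are illustrative names.

```r
# Stand-in data (assumption): mtcars replaces `training`, mpg replaces logSalary.
fit <- lm(mpg ~ ., data = mtcars)

# Coefficient table: one row per term, including a "Pr(>|t|)" column.
coefs <- summary(fit)$coefficients

alpha <- 0.05
pvals <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]

# Predictors whose p-value is below the cutoff.
# All columns are numeric here, so term names match column names.
keep <- names(pvals)[pvals < alpha]

# Refit with the retained predictors; fall back to intercept-only if none pass.
rhs <- if (length(keep) > 0) keep else "1"
reduced <- lm(reformulate(rhs, response = "mpg"), data = mtcars)
print(keep)
```

A single pass like this is only a first cut: p-values change once correlated predictors are removed, so the reduced model has to be refit and checked again.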

B.

LM1 <- lm(logSalary~AtBat+Hits+Walks+PutOuts+Assists, training)
summary(LM1)

## 
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks + PutOuts + Assists, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5468 -0.5977  0.1007  0.5593  2.6496 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.048e+00  1.799e-01  28.069  < 2e-16 ***
## AtBat       -4.240e-03  1.648e-03  -2.573 0.010895 *  
## Hits         1.892e-02  5.039e-03   3.753 0.000236 ***
## Walks        1.191e-02  3.544e-03   3.361 0.000950 ***
## PutOuts      1.423e-04  2.081e-04   0.684 0.495085    
## Assists      8.827e-05  4.392e-04   0.201 0.840930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7804 on 179 degrees of freedom
## Multiple R-squared:  0.2734, Adjusted R-squared:  0.2531 
## F-statistic: 13.47 on 5 and 179 DF,  p-value: 3.739e-11

PutOuts (p = 0.495) and Assists (p = 0.841) are no longer significant in the reduced model, so both need to be eliminated.

C.

LM1 <- lm(logSalary~AtBat+Hits+Walks, training)
summary(LM1)

## 
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4733 -0.6045  0.1040  0.5608  2.6699 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.057128   0.177881  28.430  < 2e-16 ***
## AtBat       -0.004171   0.001563  -2.669 0.008294 ** 
## Hits         0.018981   0.004913   3.863 0.000156 ***
## Walks        0.012176   0.003425   3.555 0.000482 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7771 on 181 degrees of freedom
## Multiple R-squared:  0.2715, Adjusted R-squared:  0.2594 
## F-statistic: 22.48 on 3 and 181 DF,  p-value: 2.028e-12

D.

vif(LM1)

##     AtBat      Hits     Walks 
## 16.374806 15.387746  1.551907

LM1 <- lm(logSalary~Hits+Walks, training)
vif(LM1)

##    Hits   Walks 
## 1.44903 1.44903

After examining the VIFs, I noticed that both AtBat (16.375) and Hits (15.388) had very high values. I removed AtBat, since it was the higher of the two, and re-evaluated. The VIFs for the remaining two predictors of LM1 (Hits and Walks) were both 1.44903, which was satisfactory.
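For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on the built-in mtcars data (a stand-in for training, chosen so the block needs no packages); predictors, manual_vif, and two_vif are illustrative names.

```r
# Stand-in data (assumption): mtcars replaces `training`.
predictors <- c("wt", "hp", "disp")

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j
# on all the other predictors.
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(manual_vif)

# With exactly two predictors, both VIFs reduce to 1 / (1 - r^2),
# where r is the correlation between the two.
r <- cor(mtcars$wt, mtcars$hp)
two_vif <- 1 / (1 - r^2)
print(two_vif)
```

The two-predictor case explains why Hits and Walks report the same VIF: each is regressed on the other alone, so both share one R².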

E.

LM1.predicted <- predict(LM1)
plot(training$logSalary, LM1.predicted, ylab = "Predicted", xlab = "Actual", main = "Predicted vs. Actual")

LM1.res <- resid(LM1)
plot(LM1.predicted, LM1.res, ylab = "Residuals of LM1", main = "Residuals")

vif(LM1)

##    Hits   Walks 
## 1.44903 1.44903

The “Residuals” plot above shows no patterns, which is good. Most of the residuals are between -1 and 1, which means the predicted logSalary values were decently close to the actual values.

In the “Predicted vs. Actual” plot, the points follow a slight positive linear trend.

The VIFs still look good with both at 1.44903.

R² of LM1 on the training data is rather low at 0.2427942.

F.

The test-set R² for LM1 is also rather low at 0.2522379.
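The code behind the test-set R² values is not shown in this report. As a hedged sketch of how such a number can be obtained, the block below computes two common definitions on a held-out set. It uses the built-in mtcars data and its own illustrative split in place of Hitters2, training, and testing; the names train, test, r2_cor, and r2_sse are assumptions.

```r
# Stand-in data (assumption): mtcars replaces Hitters2; the split is illustrative.
set.seed(1)
idx   <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Definition 1: squared correlation between predicted and observed values
# (the quantity caret::postResample reports as Rsquared).
r2_cor <- cor(pred, test$mpg)^2

# Definition 2: 1 - SSE/SST, which can be negative for a poor model.
r2_sse <- 1 - sum((test$mpg - pred)^2) / sum((test$mpg - mean(test$mpg))^2)

print(c(r2_cor = r2_cor, r2_sse = r2_sse))
```

The two definitions agree on the data a linear model was fit to but can differ on held-out data, so comparisons between models should use the same one throughout.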

Problem 3

A.

null <- lm(logSalary~1, data = training)
full <- lm(logSalary~., data = training)

LM2 <- step(null, direction = "forward", scope = list(lower=null, upper=full), trace = FALSE)

The test-set R² for LM2 is 0.4212, higher than LM1's 0.2522379.

B.

LM3 <- step(full, direction = "backward", trace = FALSE)

The test-set R² for LM3 is 0.4549475, higher than LM2's 0.4212.

Problem 4

A.

LM4 <- train(logSalary~., method = "lm", data = training, trControl = trainControl(method = "cv", number = 10))

The test-set R² for LM4 is 0.4518025, slightly lower than LM3's 0.4549475.

B.

kNN1 <- train(logSalary~., method = "knn", data = training, trControl = trainControl(method = "cv", number = 10))

The test-set R² for kNN1 is 0.6757635, the highest of any of the models.

C.

MARS1 <- train(logSalary~., method = "earth", data = training, trControl = trainControl(method = "cv", number = 10))

The test-set R² for MARS1 is 0.6125382, the second highest behind the kNN1 model.

Problem 5

A.

LM1 is a linear model that predicts logSalary from only two variables, Hits and Walks.

LM2 is the model chosen by forward stepwise selection, which starts with an intercept-only model for logSalary and adds one predictor at a time until it finds the ‘best’ model (by default, step() judges this by AIC).

LM3 is the model chosen by backward stepwise selection, which starts with a full model with all possible predictors included and then takes one out at a time until it finds the ‘best’ model.

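LM4 is a linear model on all predictors, evaluated with 10-fold cross-validation: the training data is split into 10 subsets, the model is fit on nine of them and scored on the held-out one, and the 10 errors are averaged. Because method = "lm" has no tuning parameter, the cross-validation only estimates performance; the final model is refit on the full training set.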
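For reference, the procedure that trainControl(method = "cv", number = 10) requests can be sketched by hand in base R. This is an illustration of the general technique, not caret's internal code; it uses the built-in mtcars data as a stand-in for training, and k, folds, fold_rmse, and cv_rmse are illustrative names.

```r
# Stand-in data (assumption): mtcars replaces `training`.
set.seed(1)
k <- 10
n <- nrow(mtcars)

# Assign each row to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, score on the held-out fold, repeat for every fold.
fold_rmse <- sapply(seq_len(k), function(i) {
  held_out <- mtcars[folds == i, ]
  fit_on   <- mtcars[folds != i, ]
  fit  <- lm(mpg ~ wt + hp, data = fit_on)
  pred <- predict(fit, newdata = held_out)
  sqrt(mean((held_out$mpg - pred)^2))
})

# The CV estimate is the average over folds, not the best single fold.
cv_rmse <- mean(fold_rmse)

# The model that is kept is refit on all of the data.
final_fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cv_rmse)
```

Every row is held out exactly once, so the averaged error estimates how the model would do on data it has not seen.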

B.

I believe that LM3, the backward stepwise model, is the best. It starts with a full model and eliminates variables based on how statistically insignificant the change to the model is without them. The 10-fold cross-validation model can lead to high variance, and the nearest-neighbors method doesn’t produce an explicit model equation. LM1 is a linear model that uses only 2 predictors, which seems low. The forward stepwise model has a lower test-set R² than the backward one (0.4212 versus 0.4549475), which leads me to believe the backward model is slightly better.