title: “Project 2”
author: “Nina Cicero”
date: “February 9, 2017”
output: html_document

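library(ISLR)   # assumed source of the Hitters data
library(caret)  # createDataPartition(), train()
library(earth)  # backend for method = "earth"
library(car)    # assumed source of vif(); the original does not show which package was loaded
## Loading required package: lattice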
## Loading required package: ggplot2
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos

Question 1

newHitters<- Hitters[complete.cases(Hitters$Salary),]
newHitters$Salary <- log(newHitters$Salary)
names(newHitters)[names(newHitters)== "Salary"]<- "logSalary"
set.seed(12345)
trainingIndex<- createDataPartition(y= newHitters$logSalary, p=.7, list= FALSE)
trainingSet<- newHitters[trainingIndex, ]
testingSet<- newHitters[-trainingIndex, ]

Question 2

A)

LM1<- lm(formula = logSalary~., data=trainingSet)
summary(LM1)
## 
## Call:
## lm(formula = logSalary ~ ., data = trainingSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18928 -0.41465  0.05261  0.39266  2.74912 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.656e+00  2.077e-01  22.418   <2e-16 ***
## AtBat       -3.325e-03  1.520e-03  -2.187   0.0301 *  
## Hits         1.304e-02  5.762e-03   2.262   0.0250 *  
## HmRun        1.185e-02  1.469e-02   0.807   0.4209    
## Runs        -3.596e-03  6.905e-03  -0.521   0.6032    
## RBI          2.534e-03  6.422e-03   0.395   0.6936    
## Walks        1.085e-02  4.453e-03   2.438   0.0158 *  
## Years        3.869e-02  3.089e-02   1.253   0.2121    
## CAtBat      -2.623e-05  3.262e-04  -0.080   0.9360    
## CHits        1.366e-03  1.735e-03   0.787   0.4322    
## CHmRun       7.679e-04  3.865e-03   0.199   0.8428    
## CRuns        3.456e-04  1.767e-03   0.196   0.8452    
## CRBI        -1.352e-03  1.803e-03  -0.750   0.4545    
## CWalks      -9.906e-04  8.181e-04  -1.211   0.2277    
## LeagueN      3.578e-01  1.654e-01   2.164   0.0319 *  
## DivisionW   -1.579e-01  9.382e-02  -1.683   0.0943 .  
## PutOuts      4.035e-04  1.741e-04   2.318   0.0217 *  
## Assists      1.024e-03  5.096e-04   2.009   0.0461 *  
## Errors      -1.858e-02  1.012e-02  -1.835   0.0683 .  
## NewLeagueN  -2.710e-01  1.655e-01  -1.638   0.1034    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6102 on 165 degrees of freedom
## Multiple R-squared:  0.5906, Adjusted R-squared:  0.5434 
## F-statistic: 12.53 on 19 and 165 DF,  p-value: < 2.2e-16

B)

LM1<- lm(formula = logSalary~ AtBat+Hits+Walks+League+PutOuts+Assists, data=trainingSet)
summary(LM1)
## 
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks + League + PutOuts + 
##     Assists, data = trainingSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4973 -0.6347  0.1237  0.5666  2.6246 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.9606725  0.2003227  24.763  < 2e-16 ***
## AtBat       -0.0041296  0.0016516  -2.500 0.013310 *  
## Hits         0.0189053  0.0050396   3.751 0.000238 ***
## Walks        0.0118197  0.0035456   3.334 0.001043 ** 
## LeagueN      0.1174263  0.1180117   0.995 0.321068    
## PutOuts      0.0001288  0.0002086   0.617 0.537731    
## Assists      0.0000619  0.0004400   0.141 0.888276    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7804 on 178 degrees of freedom
## Multiple R-squared:  0.2774, Adjusted R-squared:  0.2531 
## F-statistic: 11.39 on 6 and 178 DF,  p-value: 9.205e-11

C)

LM1<-lm(formula = logSalary~ AtBat+Hits+Walks, data=trainingSet)
summary(LM1)
## 
## Call:
## lm(formula = logSalary ~ AtBat + Hits + Walks, data = trainingSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4733 -0.6045  0.1040  0.5608  2.6699 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.057128   0.177881  28.430  < 2e-16 ***
## AtBat       -0.004171   0.001563  -2.669 0.008294 ** 
## Hits         0.018981   0.004913   3.863 0.000156 ***
## Walks        0.012176   0.003425   3.555 0.000482 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7771 on 181 degrees of freedom
## Multiple R-squared:  0.2715, Adjusted R-squared:  0.2594 
## F-statistic: 22.48 on 3 and 181 DF,  p-value: 2.028e-12

D)

vif(LM1)
##     AtBat      Hits     Walks 
## 16.374806 15.387746  1.551907

“Walks” is the only predictor with a variance inflation factor below 10. “AtBat” and “Hits” both exceed 15, which indicates that they are strongly collinear with each other. We remove “AtBat” because it has the highest variance inflation factor, and we keep “Hits” so that the model remains a multiple regression.
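As a rough illustration of what a variance inflation factor measures, it can be computed by hand as 1 / (1 − R²), where R² comes from regressing one predictor on all the others. The sketch below uses simulated data rather than the Hitters data, and `manual_vif` and the variable names are hypothetical.

```r
# Minimal sketch on simulated data (not the Hitters data).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.25)   # built to be nearly collinear with x1
x3 <- rnorm(n)                   # generated independently of x1 and x2
dat <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j is the r-squared from regressing
# predictor j on all remaining predictors.
manual_vif <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- manual_vif(dat)
print(round(vifs, 2))
```

In this toy example the two collinear predictors should show large values while the independent one stays near 1, which mirrors the pattern seen for “AtBat”, “Hits”, and “Walks”.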

LM1 <-lm(formula = logSalary~ Hits + Walks, data=trainingSet)
summary(LM1)
## 
## Call:
## lm(formula = logSalary ~ Hits + Walks, data = trainingSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5822 -0.6320  0.1774  0.5589  2.8355 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.820657   0.156824  30.739  < 2e-16 ***
## Hits        0.006499   0.001533   4.239 3.56e-05 ***
## Walks       0.009822   0.003365   2.919  0.00395 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7901 on 182 degrees of freedom
## Multiple R-squared:  0.2428, Adjusted R-squared:  0.2345 
## F-statistic: 29.18 on 2 and 182 DF,  p-value: 1.02e-11

E)

residual<-resid(LM1)
plot(LM1$fitted.values, residual, ylab="Residuals", xlab="Fitted Log of Salary")

There do not appear to be any patterns in the residual plot.

vif(LM1)
##    Hits   Walks 
## 1.44903 1.44903

The VIFs are low (about 1.45 each), which means there is little collinearity between the two remaining predictors.

summary(LM1)$r.squared
## [1] 0.2427942

Our r-squared value is very low (about 0.243): the model explains only about 24% of the variance in log salary on the training set, so it is not an accurate predictor.

F)

LM1Test<- predict(LM1, newdata = testingSet)
cor(LM1Test, testingSet$logSalary)^2
## [1] 0.2522379

The test-set r-squared value (about 0.252) is also very low, reaffirming that our linear model is a poor predictor.
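The squared correlation used above measures how closely the predictions track the observed values linearly, but it does not penalize predictions that are systematically shifted. As a hedged sketch, a hypothetical helper `holdout_metrics` (not part of the original analysis) could report it next to 1 − SSE/SST and the RMSE. The data here are simulated for illustration only.

```r
# Hypothetical helper: three hold-out metrics for numeric predictions.
holdout_metrics <- function(pred, obs) {
  sse <- sum((obs - pred)^2)
  sst <- sum((obs - mean(obs))^2)
  c(cor_sq   = cor(pred, obs)^2,            # the quantity computed in this report
    r2_resid = 1 - sse / sst,               # drops when predictions are biased
    rmse     = sqrt(mean((obs - pred)^2)))  # typical error on the response scale
}

# Simulated illustration: predictions track the truth but are shifted by +1.
set.seed(2)
obs  <- rnorm(100, mean = 6, sd = 1)
pred <- obs + rnorm(100, sd = 0.5) + 1

m <- holdout_metrics(pred, obs)
print(round(m, 3))
```

With the objects from this report, the same helper could be called as `holdout_metrics(LM1Test, testingSet$logSalary)`.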

Question 3

A)

null<- lm(formula= logSalary~1, data=trainingSet)
full<-lm(formula=logSalary~., data=trainingSet)
LM2<-step(null, direction="forward", data=trainingSet, scope=list(lower=null, upper=full))
## Start:  AIC=-36.75
## logSalary ~ 1
## 
##             Df Sum of Sq     RSS      AIC
## + CHits      1    62.279  87.758 -133.968
## + CRuns      1    61.152  88.885 -131.607
## + CAtBat     1    58.656  91.381 -126.484
## + CRBI       1    52.014  98.023 -113.503
## + CWalks     1    45.190 104.848 -101.052
## + Years      1    42.795 107.243  -96.873
## + CHmRun     1    35.394 114.643  -84.528
## + Hits       1    31.109 118.929  -77.739
## + RBI        1    28.142 121.895  -73.181
## + Runs       1    25.595 124.442  -69.355
## + AtBat      1    25.314 124.723  -68.938
## + Walks      1    25.209 124.828  -68.782
## + HmRun      1    16.054 133.983  -55.689
## + PutOuts    1     6.397 143.640  -42.813
## + Division   1     4.106 145.931  -39.886
## <none>                   150.037  -36.752
## + Assists    1     0.680 149.357  -35.593
## + NewLeague  1     0.107 149.930  -34.884
## + League     1     0.044 149.993  -34.807
## + Errors     1     0.009 150.028  -34.763
## 
## Step:  AIC=-133.97
## logSalary ~ CHits
## 
##             Df Sum of Sq    RSS     AIC
## + Hits       1   13.7220 74.036 -163.42
## + Runs       1   11.9989 75.759 -159.17
## + AtBat      1   10.6337 77.124 -155.86
## + RBI        1    9.2378 78.520 -152.54
## + Walks      1    7.9072 79.851 -149.44
## + PutOuts    1    6.5531 81.205 -146.32
## + HmRun      1    4.2733 83.485 -141.20
## + CAtBat     1    3.3886 84.370 -139.25
## + Division   1    1.9463 85.812 -136.12
## + Years      1    1.6035 86.155 -135.38
## + CRBI       1    1.1586 86.599 -134.43
## + Assists    1    1.0612 86.697 -134.22
## <none>                   87.758 -133.97
## + CHmRun     1    0.8151 86.943 -133.69
## + CWalks     1    0.5362 87.222 -133.10
## + Errors     1    0.1170 87.641 -132.21
## + CRuns      1    0.0740 87.684 -132.12
## + League     1    0.0491 87.709 -132.07
## + NewLeague  1    0.0355 87.723 -132.04
## 
## Step:  AIC=-163.42
## logSalary ~ CHits + Hits
## 
##             Df Sum of Sq    RSS     AIC
## + PutOuts    1   2.30371 71.732 -167.27
## + Division   1   1.64994 72.386 -165.59
## + AtBat      1   1.41545 72.621 -165.00
## + Walks      1   1.06455 72.972 -164.10
## + League     1   0.94249 73.094 -163.79
## <none>                   74.036 -163.42
## + Errors     1   0.57543 73.461 -162.87
## + Years      1   0.42476 73.611 -162.49
## + CHmRun     1   0.35216 73.684 -162.31
## + CRBI       1   0.34062 73.696 -162.28
## + CAtBat     1   0.27880 73.757 -162.12
## + NewLeague  1   0.07020 73.966 -161.60
## + RBI        1   0.06229 73.974 -161.58
## + HmRun      1   0.05493 73.981 -161.56
## + Runs       1   0.05245 73.984 -161.56
## + CWalks     1   0.01684 74.019 -161.47
## + Assists    1   0.01205 74.024 -161.45
## + CRuns      1   0.00068 74.035 -161.43
## 
## Step:  AIC=-167.27
## logSalary ~ CHits + Hits + PutOuts
## 
##             Df Sum of Sq    RSS     AIC
## + Division   1   1.82283 69.910 -170.03
## + AtBat      1   1.45545 70.277 -169.06
## + League     1   0.77815 70.954 -167.29
## <none>                   71.732 -167.27
## + CRBI       1   0.69234 71.040 -167.07
## + Walks      1   0.58028 71.152 -166.77
## + Years      1   0.56417 71.168 -166.73
## + CHmRun     1   0.53506 71.197 -166.66
## + Errors     1   0.51310 71.219 -166.60
## + CAtBat     1   0.12139 71.611 -165.59
## + Runs       1   0.08114 71.651 -165.48
## + NewLeague  1   0.02938 71.703 -165.35
## + Assists    1   0.02542 71.707 -165.34
## + CRuns      1   0.01330 71.719 -165.31
## + HmRun      1   0.00192 71.731 -165.28
## + RBI        1   0.00161 71.731 -165.28
## + CWalks     1   0.00103 71.731 -165.27
## 
## Step:  AIC=-170.03
## logSalary ~ CHits + Hits + PutOuts + Division
## 
##             Df Sum of Sq    RSS     AIC
## + AtBat      1   1.13467 68.775 -171.06
## <none>                   69.910 -170.03
## + League     1   0.63947 69.270 -169.73
## + Walks      1   0.60943 69.300 -169.65
## + Years      1   0.59595 69.314 -169.62
## + CRBI       1   0.53736 69.372 -169.46
## + Errors     1   0.52909 69.381 -169.44
## + CHmRun     1   0.43005 69.480 -169.18
## + Runs       1   0.05418 69.855 -168.18
## + CAtBat     1   0.04974 69.860 -168.16
## + HmRun      1   0.02318 69.886 -168.09
## + NewLeague  1   0.00922 69.900 -168.06
## + RBI        1   0.00819 69.901 -168.06
## + Assists    1   0.00582 69.904 -168.05
## + CRuns      1   0.00085 69.909 -168.03
## + CWalks     1   0.00031 69.909 -168.03
## 
## Step:  AIC=-171.06
## logSalary ~ CHits + Hits + PutOuts + Division + AtBat
## 
##             Df Sum of Sq    RSS     AIC
## + Walks      1   1.23712 67.538 -172.42
## <none>                   68.775 -171.06
## + Years      1   0.57231 68.203 -170.61
## + League     1   0.57054 68.204 -170.60
## + CRBI       1   0.46059 68.314 -170.30
## + CHmRun     1   0.27210 68.503 -169.79
## + Errors     1   0.22554 68.549 -169.67
## + Runs       1   0.17946 68.595 -169.54
## + Assists    1   0.14739 68.628 -169.46
## + HmRun      1   0.08331 68.692 -169.28
## + RBI        1   0.08122 68.694 -169.28
## + CAtBat     1   0.04339 68.732 -169.18
## + CRuns      1   0.03445 68.740 -169.15
## + CWalks     1   0.01994 68.755 -169.11
## + NewLeague  1   0.00902 68.766 -169.09
## 
## Step:  AIC=-172.42
## logSalary ~ CHits + Hits + PutOuts + Division + AtBat + Walks
## 
##             Df Sum of Sq    RSS     AIC
## + CRBI       1   0.81009 66.728 -172.65
## <none>                   67.538 -172.42
## + Years      1   0.66741 66.870 -172.26
## + CWalks     1   0.61129 66.927 -172.10
## + CHmRun     1   0.57617 66.962 -172.00
## + League     1   0.54568 66.992 -171.92
## + Assists    1   0.31821 67.220 -171.29
## + CRuns      1   0.13927 67.399 -170.80
## + Errors     1   0.09725 67.441 -170.69
## + HmRun      1   0.04439 67.493 -170.54
## + RBI        1   0.03144 67.506 -170.50
## + Runs       1   0.02690 67.511 -170.49
## + CAtBat     1   0.01208 67.526 -170.45
## + NewLeague  1   0.00796 67.530 -170.44
## 
## Step:  AIC=-172.65
## logSalary ~ CHits + Hits + PutOuts + Division + AtBat + Walks + 
##     CRBI
## 
##             Df Sum of Sq    RSS     AIC
## + RBI        1   0.72657 66.001 -172.68
## <none>                   66.728 -172.65
## + Years      1   0.57954 66.148 -172.26
## + HmRun      1   0.57433 66.153 -172.25
## + League     1   0.47603 66.252 -171.98
## + CWalks     1   0.36609 66.362 -171.67
## + Errors     1   0.14847 66.579 -171.06
## + Assists    1   0.14562 66.582 -171.06
## + CRuns      1   0.05531 66.672 -170.80
## + CAtBat     1   0.03974 66.688 -170.76
## + CHmRun     1   0.02749 66.700 -170.73
## + Runs       1   0.01666 66.711 -170.70
## + NewLeague  1   0.00000 66.728 -170.65
## 
## Step:  AIC=-172.68
## logSalary ~ CHits + Hits + PutOuts + Division + AtBat + Walks + 
##     CRBI + RBI
## 
##             Df Sum of Sq    RSS     AIC
## <none>                   66.001 -172.68
## + Years      1   0.52052 65.481 -172.14
## + League     1   0.49750 65.504 -172.08
## + CWalks     1   0.25218 65.749 -171.38
## + Assists    1   0.22916 65.772 -171.32
## + Errors     1   0.18532 65.816 -171.20
## + HmRun      1   0.05016 65.951 -170.82
## + CHmRun     1   0.04426 65.957 -170.80
## + Runs       1   0.04194 65.959 -170.79
## + CAtBat     1   0.04133 65.960 -170.79
## + CRuns      1   0.02363 65.978 -170.74
## + NewLeague  1   0.00002 66.001 -170.68
LM2Test<- predict(LM2, newdata = testingSet)
cor(LM2Test, testingSet$logSalary)^2
## [1] 0.4333624

B)

LM3<-step(full, direction="backward", data=trainingSet)
## Start:  AIC=-163.96
## logSalary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years + 
##     CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + League + 
##     Division + PutOuts + Assists + Errors + NewLeague
## 
##             Df Sum of Sq    RSS     AIC
## - CAtBat     1   0.00241 61.431 -165.95
## - CRuns      1   0.01423 61.443 -165.92
## - CHmRun     1   0.01470 61.443 -165.91
## - RBI        1   0.05799 61.486 -165.78
## - Runs       1   0.10096 61.529 -165.66
## - CRBI       1   0.20925 61.638 -165.33
## - CHits      1   0.23076 61.659 -165.27
## - HmRun      1   0.24237 61.671 -165.23
## - CWalks     1   0.54584 61.974 -164.32
## - Years      1   0.58421 62.013 -164.21
## <none>                   61.428 -163.96
## - NewLeague  1   0.99855 62.427 -162.98
## - Division   1   1.05460 62.483 -162.81
## - Errors     1   1.25390 62.682 -162.22
## - Assists    1   1.50304 62.931 -161.49
## - League     1   1.74300 63.171 -160.78
## - AtBat      1   1.78079 63.209 -160.67
## - Hits       1   1.90536 63.334 -160.31
## - PutOuts    1   2.00120 63.430 -160.03
## - Walks      1   2.21208 63.640 -159.41
## 
## Step:  AIC=-165.95
## logSalary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years + 
##     CHits + CHmRun + CRuns + CRBI + CWalks + League + Division + 
##     PutOuts + Assists + Errors + NewLeague
## 
##             Df Sum of Sq    RSS     AIC
## - CHmRun     1   0.01240 61.443 -167.91
## - CRuns      1   0.02067 61.451 -167.89
## - RBI        1   0.05642 61.487 -167.78
## - Runs       1   0.10839 61.539 -167.63
## - CRBI       1   0.21388 61.645 -167.31
## - HmRun      1   0.24775 61.679 -167.21
## - CHits      1   0.49390 61.925 -166.47
## - Years      1   0.66610 62.097 -165.96
## <none>                   61.431 -165.95
## - CWalks     1   0.75102 62.182 -165.70
## - NewLeague  1   0.99615 62.427 -164.98
## - Division   1   1.06226 62.493 -164.78
## - Errors     1   1.25169 62.682 -164.22
## - Assists    1   1.51768 62.948 -163.44
## - League     1   1.74064 63.171 -162.78
## - PutOuts    1   2.04208 63.473 -161.90
## - AtBat      1   2.16495 63.596 -161.54
## - Walks      1   2.33249 63.763 -161.06
## - Hits       1   2.37442 63.805 -160.94
## 
## Step:  AIC=-167.92
## logSalary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years + 
##     CHits + CRuns + CRBI + CWalks + League + Division + PutOuts + 
##     Assists + Errors + NewLeague
## 
##             Df Sum of Sq    RSS     AIC
## - RBI        1   0.04446 61.488 -169.78
## - CRuns      1   0.06632 61.509 -169.72
## - Runs       1   0.14562 61.589 -169.48
## - HmRun      1   0.39722 61.840 -168.72
## <none>                   61.443 -167.91
## - Years      1   0.67515 62.118 -167.89
## - CWalks     1   0.88217 62.325 -167.28
## - CHits      1   0.90690 62.350 -167.20
## - CRBI       1   0.92757 62.371 -167.14
## - NewLeague  1   0.99128 62.434 -166.95
## - Division   1   1.06645 62.510 -166.73
## - Errors     1   1.24530 62.688 -166.20
## - Assists    1   1.51006 62.953 -165.42
## - League     1   1.73113 63.174 -164.78
## - PutOuts    1   2.03314 63.476 -163.89
## - AtBat      1   2.16952 63.613 -163.50
## - Walks      1   2.40890 63.852 -162.80
## - Hits       1   2.44518 63.888 -162.69
## 
## Step:  AIC=-169.78
## logSalary ~ AtBat + Hits + HmRun + Runs + Walks + Years + CHits + 
##     CRuns + CRBI + CWalks + League + Division + PutOuts + Assists + 
##     Errors + NewLeague
## 
##             Df Sum of Sq    RSS     AIC
## - CRuns      1   0.05454 61.542 -171.62
## - Runs       1   0.17443 61.662 -171.26
## <none>                   61.488 -169.78
## - Years      1   0.70475 62.192 -169.67
## - CHits      1   0.89988 62.388 -169.09
## - CRBI       1   0.90886 62.396 -169.07
## - CWalks     1   0.95585 62.443 -168.93
## - NewLeague  1   0.98611 62.474 -168.84
## - Division   1   1.13388 62.621 -168.40
## - Errors     1   1.21067 62.698 -168.17
## - HmRun      1   1.32358 62.811 -167.84
## - Assists    1   1.52746 63.015 -167.24
## - League     1   1.72694 63.215 -166.66
## - PutOuts    1   2.00778 63.495 -165.84
## - AtBat      1   2.13533 63.623 -165.47
## - Walks      1   2.73009 64.218 -163.74
## - Hits       1   2.90052 64.388 -163.25
## 
## Step:  AIC=-171.62
## logSalary ~ AtBat + Hits + HmRun + Runs + Walks + Years + CHits + 
##     CRBI + CWalks + League + Division + PutOuts + Assists + Errors + 
##     NewLeague
## 
##             Df Sum of Sq    RSS     AIC
## - Runs       1    0.1222 61.664 -173.25
## - Years      1    0.6512 62.193 -171.67
## <none>                   61.542 -171.62
## - CRBI       1    0.8988 62.441 -170.94
## - CWalks     1    0.9717 62.514 -170.72
## - NewLeague  1    1.0690 62.611 -170.43
## - Division   1    1.2444 62.787 -169.91
## - Errors     1    1.2587 62.801 -169.87
## - HmRun      1    1.3142 62.856 -169.71
## - Assists    1    1.5157 63.058 -169.12
## - League     1    1.7543 63.296 -168.42
## - PutOuts    1    1.9543 63.496 -167.83
## - AtBat      1    2.0809 63.623 -167.47
## - Walks      1    2.6782 64.220 -165.74
## - Hits       1    3.0323 64.574 -164.72
## - CHits      1    4.6312 66.173 -160.19
## 
## Step:  AIC=-173.25
## logSalary ~ AtBat + Hits + HmRun + Walks + Years + CHits + CRBI + 
##     CWalks + League + Division + PutOuts + Assists + Errors + 
##     NewLeague
## 
##             Df Sum of Sq    RSS     AIC
## <none>                   61.664 -173.25
## - Years      1    0.7243 62.389 -173.09
## - CRBI       1    0.8050 62.469 -172.85
## - NewLeague  1    1.0146 62.679 -172.23
## - CWalks     1    1.0377 62.702 -172.16
## - Division   1    1.1855 62.850 -171.73
## - Errors     1    1.1906 62.855 -171.71
## - HmRun      1    1.1943 62.859 -171.70
## - Assists    1    1.5217 63.186 -170.74
## - League     1    1.7542 63.419 -170.06
## - AtBat      1    2.1294 63.794 -168.97
## - PutOuts    1    2.1977 63.862 -168.77
## - Walks      1    2.6402 64.305 -167.49
## - Hits       1    3.1666 64.831 -165.99
## - CHits      1    4.5243 66.189 -162.15
LM3Test<- predict(LM3, newdata = testingSet)
cor(LM3Test, testingSet$logSalary)^2
## [1] 0.4669315

Question 4

A)

LM4 <- train(logSalary~., data=trainingSet, method="lm",
  trControl=trainControl(method="cv", number=10)
  )
LM4Test<- predict(LM4, newdata = testingSet)
cor(LM4Test, testingSet$logSalary)^2
## [1] 0.4617796

With a test r-squared value of about 0.462, this model explains less than half of the variance in log salary, so it is a weak predictor.

B)

kNN1 <- train(logSalary~., data=trainingSet, method="knn",
  trControl=trainControl(method="cv", number=10)
  )
kNN1Test<- predict(kNN1, newdata = testingSet)
cor(kNN1Test, testingSet$logSalary)^2
## [1] 0.6818339

C)

MARS1 <- train(logSalary~., data=trainingSet, method="earth",
  trControl=trainControl(method="cv", number=10)
  )
MARS1Test<- predict(MARS1, newdata = testingSet)
cor(MARS1Test, testingSet$logSalary)^2
##       [,1]
## y 0.633359

Question 5

A)

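LM4 is the simplest model to specify, since it uses every variable as a predictor without evaluating their importance; the 10-fold cross-validation only estimates its performance and does not change the fitted coefficients. LM4’s test r-squared value is about 0.462. For LM1, we started from the same full model and manually removed variables, first by p-value and then by variance inflation factor, until every remaining predictor was significant and collinearity was low. This resulted in a model with a test r-squared value of 0.252. Our manual selection actually reduced the accuracy of the model.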

LM2 and LM3 were both generated using a stepwise process driven by AIC. LM2 used forward stepwise regression: we started with a linear model without any predictors and added variables one at a time until no further addition lowered the AIC. LM2 gave us a test r-squared value of 0.433. We created LM3 using backward stepwise regression: we began with a model using all of the predictors, then removed variables one at a time until no further removal lowered the AIC. LM3 had a test r-squared value of 0.467, which is the highest of all the linear models.
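The AIC values in the `step()` traces of Question 3 can be checked by hand. For a linear model, `step()` reports n·log(RSS/n) + 2p, where p is the number of estimated coefficients including the intercept. The sketch below plugs in RSS values copied from the forward trace; `aic_from_rss` is a hypothetical helper name, and n = 185 is inferred from the residual degrees of freedom in the model summaries.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * p,
# where p counts estimated coefficients (including the intercept).
aic_from_rss <- function(rss, n, p) n * log(rss / n) + 2 * p

# Training rows: 165 residual df + 20 coefficients in the full model.
n <- 185

aic_null  <- aic_from_rss(150.037, n, 1)  # logSalary ~ 1
aic_chits <- aic_from_rss(87.758,  n, 2)  # logSalary ~ CHits

print(round(c(null = aic_null, CHits = aic_chits), 2))
```

These should agree, up to rounding of the printed RSS, with the starting AIC of -36.75 and the AIC of -133.97 after adding CHits. Each step keeps the single addition or removal that yields the lowest such value and stops when `<none>` is at the top of the table.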

B)

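Our best model is kNN1 since it has the highest test r-squared value, 0.682, ahead of MARS1 (0.633) and the best linear model, LM3 (0.467). This type of model (k Nearest Neighbors) is nonparametric: it does not estimate any coefficients, and thus does not have an equation. Instead, it predicts a player’s log salary from the training players most similar to him, with the number of neighbors k chosen by cross-validation. This model does not have much interpretability, but its r-squared value tells us that it is the most accurate predictor of our models.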
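To make concrete why a nearest-neighbors model has no equation, here is a minimal base-R sketch of k-nearest-neighbors regression. It is illustrative only and is not the code that caret runs; `knn_predict` and the toy data are hypothetical.

```r
# Minimal k-nearest-neighbours regression in base R (illustrative only;
# caret's "knn" method additionally tunes k by resampling).
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x   <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    # Euclidean distance from this point to every training row
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    # Prediction = mean response of the k closest training rows
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data: one predictor, response equal to the predictor.
train_x <- matrix(1:10, ncol = 1)
train_y <- 1:10
pred <- knn_predict(train_x, train_y, matrix(c(2, 9), ncol = 1), k = 3)
print(pred)
```

The prediction for each new point is simply an average over stored training responses, so there are no coefficients to report. Because the method relies on distances, predictors measured on larger scales can dominate the neighbor search; this report does not examine whether that affects the result.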