MATH 230 - Homework #5 - Spring 2014

MATH 230 - Homework #5 - Spring 2014 - Due Friday, March 7

Johannes Ferstad

require(mosaic)
options(digits = 3)
trellis.par.set(theme = col.mosaic())  # get a better color scheme for lattice
plotAddedVar = function(fm, var.test) {
    formula = as.formula(fm)
    fm1 = update(fm, formula. = as.formula(paste("~. -", var.test)))
    fm2 = update(fm, formula. = as.formula(paste(var.test, "~. -", var.test)))
    print(plotPoints(fm1$residuals ~ fm2$residuals, type = c("p", "r"), alpha = 0.3, 
        pch = 19, cex = 2, lwd = 3, main = paste("Added Variable Plot:", var.test), 
        ylab = paste(fm1$call), xlab = paste(fm2$call)))
}

PROBLEMS TO TURN IN: #4.2, #4.3, #4.11, #4.16, #4.19

Exercise 4.2

HighPeaks = read.csv("http://www.amherst.edu/~awagaman/math230/HighPeaks.csv")

part a:

SOLUTION:

xyplot(Time ~ Elevation, data = HighPeaks)

plot of chunk unnamed-chunk-3

There doesn't seem to be a clear linear relationship between elevation and time.

part b:

SOLUTION:

fm2b = lm(Time ~ Elevation + Length, data = HighPeaks)
summary(fm2b)

## 
## Call:
## lm(formula = Time ~ Elevation + Length, data = HighPeaks)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.592 -0.805 -0.196  0.638  3.843 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.075379   2.532713    3.19   0.0027 ** 
## Elevation   -0.001448   0.000581   -2.49   0.0165 *  
## Length       0.712334   0.059333   12.01  2.5e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.37 on 43 degrees of freedom
## Multiple R-squared:  0.77,   Adjusted R-squared:  0.76 
## F-statistic: 72.1 on 2 and 43 DF,  p-value: 1.84e-14

Both Elevation and Length are significant predictors in the multiple regression model. The low p-values (0.0165 and 2.5e-15) from the t-tests of the coefficients for the two predictors suggest that this model does significantly better than a model with either Elevation or Length alone.

part c:

SOLUTION:

plotAddedVar(fm2b, "Elevation")

plot of chunk unnamed-chunk-5

The pattern in the added variable plot shows us that there is information in the Elevation predictor that is not in the Length variable and that can help predict Length.

Exercise 4.3

Note: Allowed predictors are all variables in the data set after WinPct. Also, do not worry about finding four predictor models. Just report the final model R is choosing. If you want to isolate four predictor models, consult your R manuals for other code. stepAIC is a little harder to work with for that.

Standings = read.csv("http://www.amherst.edu/~awagaman/math230/MLB2007Standings.csv")

part a:

SOLUTION:

modsmall = lm(WinPct ~ 1, data = Standings)
summary(modsmall)

## 
## Call:
## lm(formula = WinPct ~ 1, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09293 -0.05418  0.00907  0.04532  0.09307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.4999     0.0104    47.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0572 on 29 degrees of freedom

fmfull = lm(WinPct ~ . - Wins - Losses - Team, data = Standings)
summary(fmfull)

## 
## Call:
## lm(formula = WinPct ~ . - Wins - Losses - Team, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02463 -0.00833 -0.00038  0.00904  0.03425 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -2.55e-01   5.61e-01   -0.46    0.657  
## LeagueNL    -6.69e-03   2.23e-02   -0.30    0.769  
## BattingAvg  -7.03e+00   2.34e+01   -0.30    0.769  
## Runs         3.31e-04   8.60e-04    0.39    0.707  
## Hits         3.19e-05   1.49e-03    0.02    0.983  
## HR          -5.18e-03   8.66e-03   -0.60    0.561  
## Doubles     -1.79e-03   2.84e-03   -0.63    0.541  
## Triples     -4.29e-03   5.99e-03   -0.72    0.488  
## RBI          8.53e-05   9.39e-04    0.09    0.929  
## SB           4.54e-05   2.46e-04    0.18    0.856  
## OBP         -4.49e-01   1.02e+00   -0.44    0.669  
## SLG          9.70e+00   1.61e+01    0.60    0.558  
## ERA         -2.41e-02   4.75e-02   -0.51    0.621  
## HitsAllowed  1.04e-03   9.39e-04    1.11    0.288  
## Walks        1.08e-03   1.01e-03    1.08    0.303  
## StrikeOuts  -5.23e-06   9.06e-05   -0.06    0.955  
## Saves        3.45e-03   1.32e-03    2.61    0.023 *
## WHIP        -1.64e+00   1.39e+00   -1.18    0.260  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0204 on 12 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.873 
## F-statistic: 12.7 on 17 and 12 DF,  p-value: 3.43e-05

stepAIC(modsmall, scope = list(upper = fmfull, lower = ~1), direction = "both", 
    trace = FALSE)$anova

## Error: could not find function "stepAIC"

result = lm(WinPct ~ Saves + RBI + ERA + BattingAvg, data = Standings)
summary(result)

## 
## Call:
## lm(formula = WinPct ~ Saves + RBI + ERA + BattingAvg, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03035 -0.01442  0.00137  0.01547  0.04092 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.51e-02   1.27e-01    0.20   0.8455    
## Saves        3.37e-03   8.24e-04    4.09   0.0004 ***
## RBI          3.15e-04   8.37e-05    3.77   0.0009 ***
## ERA         -6.83e-02   1.08e-02   -6.30  1.4e-06 ***
## BattingAvg   1.54e+00   4.97e-01    3.09   0.0049 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0195 on 25 degrees of freedom
## Multiple R-squared:  0.899,  Adjusted R-squared:  0.883 
## F-statistic: 55.9 on 4 and 25 DF,  p-value: 4.18e-12

Four-predictor model: WinPct ~ Saves + RBI + ERA + BattingAvg R-squared = 0.899.

part b:

SOLUTION:

step(fmfull, k = 2)

## Start:  AIC=-225
## WinPct ~ (Team + League + Wins + Losses + BattingAvg + Runs + 
##     Hits + HR + Doubles + Triples + RBI + SB + OBP + SLG + ERA + 
##     HitsAllowed + Walks + StrikeOuts + Saves + WHIP) - Wins - 
##     Losses - Team
## 
##               Df Sum of Sq     RSS  AIC
## - Hits         1  0.000000 0.00499 -227
## - StrikeOuts   1  0.000001 0.00499 -227
## - RBI          1  0.000003 0.00500 -227
## - SB           1  0.000014 0.00501 -227
## - BattingAvg   1  0.000037 0.00503 -227
## - League       1  0.000038 0.00503 -227
## - Runs         1  0.000062 0.00505 -227
## - OBP          1  0.000080 0.00507 -227
## - ERA          1  0.000107 0.00510 -226
## - HR           1  0.000149 0.00514 -226
## - SLG          1  0.000151 0.00514 -226
## - Doubles      1  0.000165 0.00516 -226
## - Triples      1  0.000213 0.00521 -226
## <none>                     0.00499 -225
## - Walks        1  0.000481 0.00547 -224
## - HitsAllowed  1  0.000514 0.00551 -224
## - WHIP         1  0.000581 0.00557 -224
## - Saves        1  0.002830 0.00782 -214
## 
## Step:  AIC=-227
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples + 
##     RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks + StrikeOuts + 
##     Saves + WHIP
## 
##               Df Sum of Sq     RSS  AIC
## - StrikeOuts   1  0.000001 0.00499 -229
## - RBI          1  0.000004 0.00500 -229
## - SB           1  0.000015 0.00501 -229
## - League       1  0.000054 0.00505 -229
## - Runs         1  0.000068 0.00506 -229
## - OBP          1  0.000090 0.00508 -228
## - ERA          1  0.000148 0.00514 -228
## <none>                     0.00499 -227
## - Walks        1  0.000481 0.00547 -226
## - BattingAvg   1  0.000497 0.00549 -226
## - HitsAllowed  1  0.000514 0.00551 -226
## - SLG          1  0.000574 0.00557 -226
## - WHIP         1  0.000581 0.00557 -226
## - Doubles      1  0.000679 0.00567 -225
## - HR           1  0.000681 0.00567 -225
## - Triples      1  0.001168 0.00616 -223
## - Saves        1  0.002866 0.00786 -215
## 
## Step:  AIC=-229
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples + 
##     RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks + Saves + 
##     WHIP
## 
##               Df Sum of Sq     RSS  AIC
## - RBI          1   0.00000 0.00500 -231
## - SB           1   0.00001 0.00501 -231
## - League       1   0.00006 0.00506 -231
## - Runs         1   0.00007 0.00507 -231
## - OBP          1   0.00009 0.00508 -230
## - ERA          1   0.00015 0.00514 -230
## <none>                     0.00499 -229
## - BattingAvg   1   0.00051 0.00550 -228
## - Walks        1   0.00051 0.00551 -228
## - HitsAllowed  1   0.00055 0.00555 -228
## - SLG          1   0.00061 0.00561 -228
## - WHIP         1   0.00062 0.00562 -228
## - Doubles      1   0.00069 0.00569 -227
## - HR           1   0.00070 0.00570 -227
## - Triples      1   0.00120 0.00619 -225
## - Saves        1   0.00343 0.00843 -215
## 
## Step:  AIC=-231
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples + 
##     SB + OBP + SLG + ERA + HitsAllowed + Walks + Saves + WHIP
## 
##               Df Sum of Sq     RSS  AIC
## - SB           1   0.00001 0.00501 -233
## - League       1   0.00009 0.00508 -232
## - OBP          1   0.00009 0.00508 -232
## - ERA          1   0.00016 0.00516 -232
## <none>                     0.00500 -231
## - BattingAvg   1   0.00061 0.00561 -230
## - Runs         1   0.00071 0.00571 -229
## - Walks        1   0.00075 0.00575 -229
## - SLG          1   0.00078 0.00578 -229
## - HitsAllowed  1   0.00079 0.00579 -229
## - HR           1   0.00086 0.00586 -228
## - Doubles      1   0.00089 0.00589 -228
## - WHIP         1   0.00096 0.00596 -228
## - Triples      1   0.00138 0.00638 -226
## - Saves        1   0.00356 0.00856 -217
## 
## Step:  AIC=-233
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples + 
##     OBP + SLG + ERA + HitsAllowed + Walks + Saves + WHIP
## 
##               Df Sum of Sq     RSS  AIC
## - League       1   0.00009 0.00510 -234
## - OBP          1   0.00009 0.00510 -234
## - ERA          1   0.00015 0.00516 -234
## <none>                     0.00501 -233
## - Runs         1   0.00071 0.00572 -231
## - BattingAvg   1   0.00087 0.00588 -230
## - HitsAllowed  1   0.00091 0.00592 -230
## - Walks        1   0.00092 0.00593 -230
## - SLG          1   0.00112 0.00613 -229
## - WHIP         1   0.00116 0.00617 -229
## - HR           1   0.00122 0.00623 -228
## - Doubles      1   0.00144 0.00645 -227
## - Triples      1   0.00168 0.00669 -226
## - Saves        1   0.00368 0.00869 -218
## 
## Step:  AIC=-234
## WinPct ~ BattingAvg + Runs + HR + Doubles + Triples + OBP + SLG + 
##     ERA + HitsAllowed + Walks + Saves + WHIP
## 
##               Df Sum of Sq     RSS  AIC
## - OBP          1   0.00010 0.00520 -236
## - ERA          1   0.00011 0.00520 -236
## <none>                     0.00510 -234
## - BattingAvg   1   0.00078 0.00588 -232
## - SLG          1   0.00111 0.00621 -230
## - HitsAllowed  1   0.00116 0.00625 -230
## - Walks        1   0.00118 0.00628 -230
## - HR           1   0.00118 0.00628 -230
## - Doubles      1   0.00145 0.00655 -229
## - WHIP         1   0.00150 0.00659 -229
## - Triples      1   0.00165 0.00674 -228
## - Runs         1   0.00225 0.00735 -225
## - Saves        1   0.00451 0.00961 -217
## 
## Step:  AIC=-236
## WinPct ~ BattingAvg + Runs + HR + Doubles + Triples + SLG + ERA + 
##     HitsAllowed + Walks + Saves + WHIP
## 
##               Df Sum of Sq     RSS  AIC
## - ERA          1   0.00016 0.00536 -237
## <none>                     0.00520 -236
## - BattingAvg   1   0.00069 0.00589 -234
## - SLG          1   0.00101 0.00621 -232
## - HitsAllowed  1   0.00106 0.00626 -232
## - Walks        1   0.00108 0.00628 -232
## - HR           1   0.00108 0.00628 -232
## - Doubles      1   0.00136 0.00656 -231
## - WHIP         1   0.00143 0.00663 -230
## - Triples      1   0.00157 0.00677 -230
## - Runs         1   0.00271 0.00791 -225
## - Saves        1   0.00500 0.01020 -218
## 
## Step:  AIC=-237
## WinPct ~ BattingAvg + Runs + HR + Doubles + Triples + SLG + HitsAllowed + 
##     Walks + Saves + WHIP
## 
##               Df Sum of Sq     RSS  AIC
## <none>                     0.00536 -237
## - BattingAvg   1   0.00073 0.00609 -235
## - SLG          1   0.00102 0.00638 -234
## - HR           1   0.00113 0.00649 -233
## - HitsAllowed  1   0.00116 0.00651 -233
## - Walks        1   0.00130 0.00666 -232
## - Doubles      1   0.00139 0.00675 -232
## - Triples      1   0.00158 0.00694 -231
## - WHIP         1   0.00199 0.00735 -229
## - Runs         1   0.00333 0.00868 -224
## - Saves        1   0.00761 0.01297 -212

## 
## Call:
## lm(formula = WinPct ~ BattingAvg + Runs + HR + Doubles + Triples + 
##     SLG + HitsAllowed + Walks + Saves + WHIP, data = Standings)
## 
## Coefficients:
## (Intercept)   BattingAvg         Runs           HR      Doubles  
##   -0.196647    -5.715103     0.000473    -0.004330    -0.001518  
##     Triples          SLG  HitsAllowed        Walks        Saves  
##   -0.003601     7.822958     0.000796     0.000862     0.004059  
##        WHIP  
##   -1.399618

result2 = lm(WinPct ~ Saves + WHIP + Runs + Triples, data = Standings)
summary(result2)

## 
## Call:
## lm(formula = WinPct ~ Saves + WHIP + Runs + Triples, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03448 -0.00910  0.00133  0.01089  0.04562 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.12e-01   1.04e-01    3.97  0.00054 ***
## Saves        3.92e-03   7.81e-04    5.02  3.5e-05 ***
## WHIP        -3.21e-01   5.61e-02   -5.72  5.9e-06 ***
## Runs         5.25e-04   5.61e-05    9.36  1.2e-09 ***
## Triples     -8.25e-04   4.95e-04   -1.67  0.10811    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0203 on 25 degrees of freedom
## Multiple R-squared:  0.891,  Adjusted R-squared:  0.874 
## F-statistic: 51.3 on 4 and 25 DF,  p-value: 1.07e-11

Four-predictor model: WinPct ~ Saves + WHIP + Runs + Triples R2 = 0.891

part c:

SOLUTION:

library(leaps)
summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 1, nvmax = 4, 
    data = Standings))

## Subset selection object
## Call: regsubsets.formula(WinPct ~ . - Wins - Losses - Team, nbest = 1, 
##     nvmax = 4, data = Standings)
## 17 Variables  (and intercept)
##             Forced in Forced out
## LeagueNL        FALSE      FALSE
## BattingAvg      FALSE      FALSE
## Runs            FALSE      FALSE
## Hits            FALSE      FALSE
## HR              FALSE      FALSE
## Doubles         FALSE      FALSE
## Triples         FALSE      FALSE
## RBI             FALSE      FALSE
## SB              FALSE      FALSE
## OBP             FALSE      FALSE
## SLG             FALSE      FALSE
## ERA             FALSE      FALSE
## HitsAllowed     FALSE      FALSE
## Walks           FALSE      FALSE
## StrikeOuts      FALSE      FALSE
## Saves           FALSE      FALSE
## WHIP            FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          LeagueNL BattingAvg Runs Hits HR  Doubles Triples RBI SB  OBP SLG
## 1  ( 1 ) " "      " "        " "  " "  " " " "     " "     " " " " " " " "
## 2  ( 1 ) " "      " "        " "  " "  " " " "     " "     "*" " " " " " "
## 3  ( 1 ) " "      " "        "*"  " "  " " " "     " "     " " " " " " " "
## 4  ( 1 ) " "      "*"        "*"  " "  " " " "     " "     " " " " " " " "
##          ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 ) "*" " "         " "   " "        " "   " " 
## 2  ( 1 ) "*" " "         " "   " "        " "   " " 
## 3  ( 1 ) " " " "         " "   " "        "*"   "*" 
## 4  ( 1 ) " " " "         " "   " "        "*"   "*"

result3 = lm(WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
summary(result3)

## 
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02847 -0.01178  0.00051  0.01311  0.04762 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.72e-01   1.31e-01    1.31  0.20192    
## Saves        4.17e-03   7.06e-04    5.91  3.6e-06 ***
## WHIP        -3.36e-01   4.85e-02   -6.92  3.0e-07 ***
## BattingAvg   1.44e+00   4.78e-01    3.01  0.00587 ** 
## Runs         3.19e-04   7.89e-05    4.04  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0183 on 25 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.897 
## F-statistic: 64.4 on 4 and 25 DF,  p-value: 8.51e-13

Four-predictor model: WinPct ~ Saves + WHIP + BattingAvg + Runs R2 = 0.912

part d:

SOLUTION:

summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 1, nvmax = 4, 
    data = Standings))$summary(regsubsets(WinPct ~ . - Wins - Losses - Team, 
    nbest = 1, nvmax = 4, data = Standings))$cp

## Error: attempt to apply non-function

summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 15, nvmax = 4, 
    data = Standings))

## Subset selection object
## Call: regsubsets.formula(WinPct ~ . - Wins - Losses - Team, nbest = 15, 
##     nvmax = 4, data = Standings)
## 17 Variables  (and intercept)
##             Forced in Forced out
## LeagueNL        FALSE      FALSE
## BattingAvg      FALSE      FALSE
## Runs            FALSE      FALSE
## Hits            FALSE      FALSE
## HR              FALSE      FALSE
## Doubles         FALSE      FALSE
## Triples         FALSE      FALSE
## RBI             FALSE      FALSE
## SB              FALSE      FALSE
## OBP             FALSE      FALSE
## SLG             FALSE      FALSE
## ERA             FALSE      FALSE
## HitsAllowed     FALSE      FALSE
## Walks           FALSE      FALSE
## StrikeOuts      FALSE      FALSE
## Saves           FALSE      FALSE
## WHIP            FALSE      FALSE
## 15 subsets of each size up to 4
## Selection Algorithm: exhaustive
##           LeagueNL BattingAvg Runs Hits HR  Doubles Triples RBI SB  OBP
## 1  ( 1 )  " "      " "        " "  " "  " " " "     " "     " " " " " "
## 1  ( 2 )  " "      " "        " "  " "  " " " "     " "     " " " " " "
## 1  ( 3 )  " "      " "        " "  " "  " " " "     " "     "*" " " " "
## 1  ( 4 )  " "      " "        "*"  " "  " " " "     " "     " " " " " "
## 1  ( 5 )  " "      " "        " "  " "  " " " "     " "     " " " " "*"
## 1  ( 6 )  " "      " "        " "  " "  " " " "     " "     " " " " " "
## 1  ( 7 )  " "      " "        " "  " "  " " " "     " "     " " " " " "
## 1  ( 8 )  " "      "*"        " "  " "  " " " "     " "     " " " " " "
## 1  ( 9 )  " "      " "        " "  "*"  " " " "     " "     " " " " " "
## 1  ( 10 ) " "      " "        " "  " "  " " " "     " "     " " " " " "
## 1  ( 11 ) " "      " "        " "  " "  " " "*"     " "     " " " " " "
## 1  ( 12 ) " "      " "        " "  " "  " " " "     " "     " " " " " "
## 1  ( 13 ) " "      " "        " "  " "  "*" " "     " "     " " " " " "
## 1  ( 14 ) " "      " "        " "  " "  " " " "     " "     " " " " " "
## 1  ( 15 ) " "      " "        " "  " "  " " " "     " "     " " "*" " "
## 2  ( 1 )  " "      " "        " "  " "  " " " "     " "     "*" " " " "
## 2  ( 2 )  " "      " "        "*"  " "  " " " "     " "     " " " " " "
## 2  ( 3 )  " "      " "        "*"  " "  " " " "     " "     " " " " " "
## 2  ( 4 )  " "      " "        " "  " "  " " " "     " "     "*" " " " "
## 2  ( 5 )  " "      " "        " "  " "  " " " "     " "     " " " " "*"
## 2  ( 6 )  " "      " "        " "  " "  " " " "     " "     " " " " "*"
## 2  ( 7 )  " "      " "        " "  " "  " " " "     " "     " " " " " "
## 2  ( 8 )  " "      "*"        " "  " "  " " " "     " "     " " " " " "
## 2  ( 9 )  " "      " "        " "  "*"  " " " "     " "     " " " " " "
## 2  ( 10 ) " "      "*"        " "  " "  " " " "     " "     " " " " " "
## 2  ( 11 ) " "      " "        " "  " "  " " " "     " "     " " " " " "
## 2  ( 12 ) " "      " "        "*"  " "  " " " "     " "     " " " " " "
## 2  ( 13 ) " "      " "        " "  "*"  " " " "     " "     " " " " " "
## 2  ( 14 ) " "      " "        " "  " "  " " " "     " "     "*" " " " "
## 2  ( 15 ) " "      " "        " "  " "  " " " "     " "     " " " " "*"
## 3  ( 1 )  " "      " "        "*"  " "  " " " "     " "     " " " " " "
## 3  ( 2 )  " "      " "        " "  " "  " " " "     " "     "*" " " " "
## 3  ( 3 )  " "      " "        " "  " "  " " " "     " "     " " " " "*"
## 3  ( 4 )  " "      " "        " "  " "  " " " "     " "     " " " " "*"
## 3  ( 5 )  " "      " "        " "  " "  " " " "     " "     "*" " " " "
## 3  ( 6 )  " "      " "        "*"  " "  " " " "     " "     " " " " " "
## 3  ( 7 )  " "      "*"        " "  " "  " " " "     " "     " " " " " "
## 3  ( 8 )  " "      " "        " "  "*"  " " " "     " "     " " " " " "
## 3  ( 9 )  " "      "*"        " "  " "  " " " "     " "     " " " " " "
## 3  ( 10 ) " "      " "        " "  " "  " " " "     " "     " " " " "*"
## 3  ( 11 ) " "      " "        " "  "*"  " " " "     " "     " " " " " "
## 3  ( 12 ) " "      "*"        " "  " "  " " " "     " "     "*" " " " "
## 3  ( 13 ) " "      " "        " "  " "  " " "*"     " "     "*" " " " "
## 3  ( 14 ) " "      " "        " "  " "  " " " "     "*"     "*" " " " "
## 3  ( 15 ) " "      " "        " "  " "  " " " "     " "     "*" " " " "
## 4  ( 1 )  " "      "*"        "*"  " "  " " " "     " "     " " " " " "
## 4  ( 2 )  " "      "*"        " "  " "  " " " "     " "     "*" " " " "
## 4  ( 3 )  " "      " "        "*"  "*"  " " " "     " "     " " " " " "
## 4  ( 4 )  " "      "*"        " "  " "  " " " "     " "     "*" " " " "
## 4  ( 5 )  " "      " "        " "  "*"  " " " "     " "     "*" " " " "
## 4  ( 6 )  " "      " "        "*"  " "  "*" " "     " "     " " " " " "
## 4  ( 7 )  " "      "*"        "*"  " "  " " " "     " "     " " " " " "
## 4  ( 8 )  " "      " "        " "  " "  "*" " "     " "     "*" " " " "
## 4  ( 9 )  " "      " "        "*"  " "  " " " "     " "     " " " " " "
## 4  ( 10 ) " "      "*"        " "  " "  " " " "     " "     " " " " "*"
## 4  ( 11 ) " "      " "        "*"  " "  " " " "     "*"     " " " " " "
## 4  ( 12 ) " "      " "        "*"  " "  " " " "     " "     " " " " "*"
## 4  ( 13 ) " "      "*"        " "  " "  " " " "     " "     " " " " "*"
## 4  ( 14 ) " "      " "        " "  "*"  " " " "     " "     " " " " "*"
## 4  ( 15 ) " "      " "        "*"  " "  " " "*"     " "     " " " " " "
##           SLG ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 )  " " "*" " "         " "   " "        " "   " " 
## 1  ( 2 )  " " " " " "         " "   " "        " "   "*" 
## 1  ( 3 )  " " " " " "         " "   " "        " "   " " 
## 1  ( 4 )  " " " " " "         " "   " "        " "   " " 
## 1  ( 5 )  " " " " " "         " "   " "        " "   " " 
## 1  ( 6 )  " " " " " "         " "   " "        "*"   " " 
## 1  ( 7 )  " " " " "*"         " "   " "        " "   " " 
## 1  ( 8 )  " " " " " "         " "   " "        " "   " " 
## 1  ( 9 )  " " " " " "         " "   " "        " "   " " 
## 1  ( 10 ) "*" " " " "         " "   " "        " "   " " 
## 1  ( 11 ) " " " " " "         " "   " "        " "   " " 
## 1  ( 12 ) " " " " " "         "*"   " "        " "   " " 
## 1  ( 13 ) " " " " " "         " "   " "        " "   " " 
## 1  ( 14 ) " " " " " "         " "   "*"        " "   " " 
## 1  ( 15 ) " " " " " "         " "   " "        " "   " " 
## 2  ( 1 )  " " "*" " "         " "   " "        " "   " " 
## 2  ( 2 )  " " "*" " "         " "   " "        " "   " " 
## 2  ( 3 )  " " " " " "         " "   " "        " "   "*" 
## 2  ( 4 )  " " " " " "         " "   " "        " "   "*" 
## 2  ( 5 )  " " " " " "         " "   " "        "*"   " " 
## 2  ( 6 )  " " "*" " "         " "   " "        " "   " " 
## 2  ( 7 )  "*" "*" " "         " "   " "        " "   " " 
## 2  ( 8 )  " " "*" " "         " "   " "        " "   " " 
## 2  ( 9 )  " " "*" " "         " "   " "        " "   " " 
## 2  ( 10 ) " " " " " "         " "   " "        "*"   " " 
## 2  ( 11 ) "*" " " " "         " "   " "        " "   "*" 
## 2  ( 12 ) " " " " " "         " "   " "        "*"   " " 
## 2  ( 13 ) " " " " " "         " "   " "        "*"   " " 
## 2  ( 14 ) " " " " " "         " "   " "        "*"   " " 
## 2  ( 15 ) " " " " " "         " "   " "        " "   "*" 
## 3  ( 1 )  " " " " " "         " "   " "        "*"   "*" 
## 3  ( 2 )  " " " " " "         " "   " "        "*"   "*" 
## 3  ( 3 )  " " "*" " "         " "   " "        "*"   " " 
## 3  ( 4 )  " " " " " "         " "   " "        "*"   "*" 
## 3  ( 5 )  " " "*" " "         " "   " "        "*"   " " 
## 3  ( 6 )  " " "*" " "         " "   " "        "*"   " " 
## 3  ( 7 )  " " " " " "         " "   " "        "*"   "*" 
## 3  ( 8 )  " " " " " "         " "   " "        "*"   "*" 
## 3  ( 9 )  " " "*" " "         " "   " "        "*"   " " 
## 3  ( 10 ) " " " " "*"         " "   " "        "*"   " " 
## 3  ( 11 ) " " "*" " "         " "   " "        "*"   " " 
## 3  ( 12 ) " " "*" " "         " "   " "        " "   " " 
## 3  ( 13 ) " " "*" " "         " "   " "        " "   " " 
## 3  ( 14 ) " " "*" " "         " "   " "        " "   " " 
## 3  ( 15 ) " " "*" " "         "*"   " "        " "   " " 
## 4  ( 1 )  " " " " " "         " "   " "        "*"   "*" 
## 4  ( 2 )  " " " " " "         " "   " "        "*"   "*" 
## 4  ( 3 )  " " " " " "         " "   " "        "*"   "*" 
## 4  ( 4 )  " " "*" " "         " "   " "        "*"   " " 
## 4  ( 5 )  " " " " " "         " "   " "        "*"   "*" 
## 4  ( 6 )  " " " " " "         " "   " "        "*"   "*" 
## 4  ( 7 )  " " "*" " "         " "   " "        "*"   " " 
## 4  ( 8 )  " " " " " "         " "   " "        "*"   "*" 
## 4  ( 9 )  "*" " " " "         " "   " "        "*"   "*" 
## 4  ( 10 ) " " " " " "         " "   " "        "*"   "*" 
## 4  ( 11 ) " " " " " "         " "   " "        "*"   "*" 
## 4  ( 12 ) " " " " " "         " "   " "        "*"   "*" 
## 4  ( 13 ) " " "*" " "         " "   " "        "*"   " " 
## 4  ( 14 ) " " " " " "         " "   " "        "*"   "*" 
## 4  ( 15 ) " " " " " "         " "   " "        "*"   "*"

summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 15, nvmax = 4, 
    data = Standings))$cp

##  [1] 104.880 112.369 114.211 115.661 115.919 124.011 137.033 145.678
##  [9] 149.604 152.856 165.930 170.518 194.320 196.113 197.087  17.723
## [17]  20.004  25.793  26.067  27.620  32.799  36.300  38.418  42.088
## [25]  44.419  45.213  45.471  45.917  47.823  47.981   5.509   7.723
## [33]   8.676   9.287   9.708   9.894  11.373  12.502  13.983  16.090
## [41]  16.224  16.271  16.683  17.441  17.607   0.186   0.865   2.249
## [49]   2.951   3.191   3.690   3.715   4.234   4.497   4.631   4.759
## [57]   4.881   5.027   5.520   5.547

CP-value of model fromm “best subsets” procedure = 0.186 CP-value of model from step backwards procedure = 4.759 CP-value from stepwise regression = 2.951

—-TESTING—-

explanatory = with(Standings, cbind(BattingAvg, Runs, Hits, HR, Doubles, Triples, 
    RBI, SB, OBP, SLG, ERA, HitsAllowed, Walks, StrikeOuts, Saves, WHIP))
head(explanatory, 1)

##      BattingAvg Runs Hits  HR Doubles Triples RBI  SB   OBP   SLG  ERA
## [1,]       0.25  712 1350 171     286      40 687 109 0.321 0.413 4.13
##      HitsAllowed Walks StrikeOuts Saves WHIP
## [1,]        1446   546       1088    51 1.38

results = with(Standings, leaps(explanatory, WinPct, method = "Cp"))  #obtain best subsets results with Cp method
xyplot(Cp ~ size, ylim = c(0, 20), data = results)
ladd(panel.abline(a = 0, b = 1))

plot of chunk unnamed-chunk-11

Cpmod = lm(cesd ~ i1 + mcs + pcs + pss_fr + sexrisk, data = HELPrct)
summary(Cpmod)

## 
## Call:
## lm(formula = cesd ~ i1 + mcs + pcs + pss_fr + sexrisk, data = HELPrct)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26.75  -5.94   0.04   5.35  25.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  66.0192     2.3357   28.27  < 2e-16 ***
## i1            0.0505     0.0209    2.42    0.016 *  
## mcs          -0.6324     0.0323  -19.56  < 2e-16 ***
## pcs          -0.2291     0.0387   -5.91  6.6e-09 ***
## pss_fr       -0.2526     0.1036   -2.44    0.015 *  
## sexrisk      -0.2887     0.1474   -1.96    0.051 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.64 on 447 degrees of freedom
## Multiple R-squared:  0.528,  Adjusted R-squared:  0.523 
## F-statistic:  100 on 5 and 447 DF,  p-value: <2e-16

## 
favstats(Cp ~ size, data = results)

##    .group    min     Q1 median     Q3    max   mean      sd  n missing
## 1       2 114.73 125.15 130.95 156.27 166.32 138.75 18.8152 10       0
## 2       3  20.86  29.61  34.29  42.55  49.57  35.27  9.8110 10       0
## 3       4   7.58  11.15  12.19  14.79  18.96  12.92  3.3346 10       0
## 4       5   1.71   4.11   5.20   5.92   6.48   4.75  1.6221 10       0
## 5       6   1.67   2.18   2.36   2.58   2.86   2.32  0.3962 10       0
## 6       7   1.62   1.76   2.23   2.40   2.43   2.10  0.3395 10       0
## 7       8   2.03   2.39   2.64   2.76   2.94   2.55  0.2963 10       0
## 8       9   3.32   3.47   3.71   3.79   3.86   3.64  0.2013 10       0
## 9      10   4.97   5.06   5.09   5.15   5.23   5.09  0.0864 10       0
## 10     11   5.85   6.23   6.40   6.64   6.75   6.40  0.2866 10       0
## 11     12   7.44   7.70   7.79   7.81   7.94   7.73  0.1646 10       0
## 12     13   9.17   9.38   9.39   9.44   9.51   9.38  0.1061 10       0
## 13     14  11.05  11.14  11.21  11.22  11.35  11.20  0.0952 10       0
## 14     15  13.04  13.05  13.10  13.20  13.22  13.12  0.0768 10       0
## 15     16  15.00  15.04  15.21  15.30  15.52  15.21  0.1861 10       0
## 16     17  17.00  17.00  17.00  17.00  17.00  17.00      NA  1       0

minimum = which.min(results$Cp)
minimum  #determine which model has lowest Cp value

## [1] 51

results$Cp[minimum]  #determine value of lowest Cp

## [1] 1.62

results$which[minimum, ]  #pull out model that has minimum Cp

##     1     2     3     4     5     6     7     8     9     A     B     C 
## FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE 
##     D     E     F     G 
## FALSE FALSE  TRUE FALSE

##

part e:

SOLUTION:

vif(result3)

## Error: could not find function "vif"

summary(result3)

## 
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02847 -0.01178  0.00051  0.01311  0.04762 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.72e-01   1.31e-01    1.31  0.20192    
## Saves        4.17e-03   7.06e-04    5.91  3.6e-06 ***
## WHIP        -3.36e-01   4.85e-02   -6.92  3.0e-07 ***
## BattingAvg   1.44e+00   4.78e-01    3.01  0.00587 ** 
## Runs         3.19e-04   7.89e-05    4.04  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0183 on 25 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.897 
## F-statistic: 64.4 on 4 and 25 DF,  p-value: 8.51e-13

I prefer the model from the best subsets procedure because it has the highest R2, all highly significant predictors, and low VIFs.

Exercise 4.11

part a:

SOLUTION:

summary(regsubsets(Time ~ Elevation + Difficulty + Ascent + Length, nbest = 1, 
    data = HighPeaks))

## Subset selection object
## Call: regsubsets.formula(Time ~ Elevation + Difficulty + Ascent + Length, 
##     nbest = 1, data = HighPeaks)
## 4 Variables  (and intercept)
##            Forced in Forced out
## Elevation      FALSE      FALSE
## Difficulty     FALSE      FALSE
## Ascent         FALSE      FALSE
## Length         FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          Elevation Difficulty Ascent Length
## 1  ( 1 ) " "       " "        " "    "*"   
## 2  ( 1 ) " "       "*"        " "    "*"   
## 3  ( 1 ) "*"       "*"        " "    "*"   
## 4  ( 1 ) "*"       "*"        "*"    "*"

result3 = lm(WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
summary(result3)

## 
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02847 -0.01178  0.00051  0.01311  0.04762 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.72e-01   1.31e-01    1.31  0.20192    
## Saves        4.17e-03   7.06e-04    5.91  3.6e-06 ***
## WHIP        -3.36e-01   4.85e-02   -6.92  3.0e-07 ***
## BattingAvg   1.44e+00   4.78e-01    3.01  0.00587 ** 
## Runs         3.19e-04   7.89e-05    4.04  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0183 on 25 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.897 
## F-statistic: 64.4 on 4 and 25 DF,  p-value: 8.51e-13


modsmall2 = lm(Time ~ 1, data = HighPeaks)
summary(modsmall)

## 
## Call:
## lm(formula = WinPct ~ 1, data = Standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09293 -0.05418  0.00907  0.04532  0.09307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.4999     0.0104    47.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0572 on 29 degrees of freedom

fmfull2 = lm(Time ~ Elevation + Difficulty + Ascent + Length, data = HighPeaks)
summary(fmfull2)

## 
## Call:
## lm(formula = Time ~ Elevation + Difficulty + Ascent + Length, 
##     data = HighPeaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7794 -0.8122 -0.0865  0.6896  3.0674 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.956786   2.230763    2.67  0.01082 *  
## Elevation   -0.001670   0.000518   -3.22  0.00249 ** 
## Difficulty   0.865453   0.228528    3.79  0.00049 ***
## Ascent       0.000601   0.000331    1.82  0.07669 .  
## Length       0.444008   0.081252    5.46  2.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.17 on 41 degrees of freedom
## Multiple R-squared:  0.84,   Adjusted R-squared:  0.824 
## F-statistic: 53.8 on 4 and 41 DF,  p-value: 8.74e-16

stepAIC(modsmall2, scope = list(upper = fmfull2, lower = ~1), direction = "both", 
    trace = FALSE)$anova

## Error: could not find function "stepAIC"

mypick = lm(Time ~ Elevation + Difficulty + Length, data = HighPeaks)
summary(mypick)

## 
## Call:
## lm(formula = Time ~ Elevation + Difficulty + Length, data = HighPeaks)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.126 -0.767 -0.146  0.692  3.149 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.01308    2.29075    2.62  0.01204 *  
## Elevation   -0.00140    0.00051   -2.74  0.00888 ** 
## Difficulty   0.87281    0.23466    3.72  0.00059 ***
## Length       0.48923    0.07943    6.16  2.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 42 degrees of freedom
## Multiple R-squared:  0.827,  Adjusted R-squared:  0.815 
## F-statistic:   67 on 3 and 42 DF,  p-value: 4.68e-16

vif(mypick)

## Error: could not find function "vif"

I would use the model Time ~ Elevation + Difficulty + Length because the VIFs are low and all the predictors are significant. The R2 value is also very close to that of the model with Ascent.

part b:

SOLUTION:

plot(mypick, which = 1)

plot of chunk unnamed-chunk-14

plot(mypick, which = 2)

plot of chunk unnamed-chunk-14

plot(mypick, which = 3)

plot of chunk unnamed-chunk-14

There are some obervations (24,33,40) that are causing problems. There seems to be increasing variance with higher predicted values. And the observations are causing problems with the upper tail in the QQ plot.

part c:

SOLUTION:

influencestat = ls.diag(mypick)
StuRes <- round(influencestat$stud.res, 2)
Hats <- round(influencestat$hat, 3)  #rounds to 3 decimals
Cooks <- round(influencestat$cooks, 3)
which(StuRes > 2)

## [1] 24 33 40

Obeservations 24,33,40 have St Res greater than 2 and seem to stand out.

part d:

SOLUTION:

plot(mypick, which = 4)

plot of chunk unnamed-chunk-16

Cooks[c(24, 40, 44)]

## [1] 0.137 0.193 0.125

plot(mypick, which = 5)

plot of chunk unnamed-chunk-16

Hats[c(24, 40)]

## [1] 0.069 0.117

24,40,44 have abnormal Cook's Ds (0.137,0.193,0.125) 24,40 have high leverage (0.069, 0.117)

Exercise 4.16

BaseballTimes = read.csv("http://www.amherst.edu/~awagaman/math230/BaseballTimes.csv")

SOLUTION:

t = do(1000) * (lm(WinPct ~ Saves + WHIP + BattingAvg + shuffle(Runs), data = Standings))
hist(t$Runs)

plot of chunk unnamed-chunk-18

coef(result3)

## (Intercept)       Saves        WHIP  BattingAvg        Runs 
##    0.171693    0.004173   -0.335661    1.440692    0.000319

pdata(0.000319, Runs, data = t, lower.tail = FALSE)

## [1] 0

qdata(c(0.025, 0.975), Runs, data = t)

##        quantile     p
## 2.5%  -0.000132 0.025
## 97.5%  0.000130 0.975

The earlier coefficient is not in the interval, so it does seem that correlation between Runs and Time is greater than zero.

Exercise 4.19

part a:

SOLUTION:

fm19a = lm(Length ~ Time, data = HighPeaks)
summary(fm19a)

## 
## Call:
## lm(formula = Length ~ Time, data = HighPeaks)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.411 -1.164 -0.041  1.051  3.774 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.100      1.067    1.03     0.31    
## Time           1.077      0.097   11.11  2.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.82 on 44 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.731 
## F-statistic:  123 on 1 and 44 DF,  p-value: 2.39e-14

error19a = qt(0.95, df = 45) * 0.097
1.077 - error19a

## [1] 0.914

1.077 + error19a

## [1] 1.24

CI = [.914,1.24] We are 90% confident that all hikers' average speed is in the interval 1.05-1.1 mph.

part b:

SOLUTION:

bootstrap = do(5000) * lm(Length ~ Time, data = HighPeaks[resample(1:nrow(HighPeaks)), 
    ])$coefficients
histogram(~Time, data = bootstrap)

plot of chunk unnamed-chunk-20

The histrogram shows a normal distribution that is skewed right. Centered around 1.1

part c:

SOLUTION:

favstats(bootstrap$Time)

##    min   Q1 median   Q3  max mean    sd    n missing
##  0.776 1.01   1.08 1.17 1.65 1.09 0.123 5000       0

Mean = 1.09, SD = 0.123 Mean is very similar to coefficient in original model (1.077). The SD is slightly higher than the original value (0.097).

part d:

SOLUTION:

error19d = 0.123 * qt(0.95, df = 45)
1.09 - error19d

## [1] 0.883

1.09 + error19d

## [1] 1.3

New CI = [.883,1.3]

part e:

SOLUTION:

quantile(bootstrap$Time, c(0.05, 0.95))

##    5%   95% 
## 0.917 1.321

CI = [0.915,1.322]

part f:

SOLUTION:

1.077 - 0.915

## [1] 0.162

1.322 - 1.077

## [1] 0.245

1.077 - 0.245

## [1] 0.832

1.077 + 0.162

## [1] 1.24

Orig - LB = 0.162 UB - Orig = 0.245

New LB = 0.832 New UB = 1.24

New Interval = [.832,1.24]

part g:

SOLUTION:

Intervals: [.914,1.24] [.883,1.3] [0.915,1.322] [.832,1.24]

The intervals vary, but are very similar overall.