require(mosaic)
options(digits = 3)
trellis.par.set(theme = col.mosaic()) # get a better color scheme for lattice
plotAddedVar = function(fm, var.test) {
formula = as.formula(fm)
fm1 = update(fm, formula. = as.formula(paste("~. -", var.test)))
fm2 = update(fm, formula. = as.formula(paste(var.test, "~. -", var.test)))
print(plotPoints(fm1$residuals ~ fm2$residuals, type = c("p", "r"), alpha = 0.3,
pch = 19, cex = 2, lwd = 3, main = paste("Added Variable Plot:", var.test),
ylab = paste(fm1$call), xlab = paste(fm2$call)))
}
HighPeaks = read.csv("http://www.amherst.edu/~awagaman/math230/HighPeaks.csv")
part a:
SOLUTION:
xyplot(Time ~ Elevation, data = HighPeaks)
There doesn't seem to be a clear linear relationship between elevation and time.
part b:
SOLUTION:
fm2b = lm(Time ~ Elevation + Length, data = HighPeaks)
summary(fm2b)
##
## Call:
## lm(formula = Time ~ Elevation + Length, data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.592 -0.805 -0.196 0.638 3.843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.075379 2.532713 3.19 0.0027 **
## Elevation -0.001448 0.000581 -2.49 0.0165 *
## Length 0.712334 0.059333 12.01 2.5e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.37 on 43 degrees of freedom
## Multiple R-squared: 0.77, Adjusted R-squared: 0.76
## F-statistic: 72.1 on 2 and 43 DF, p-value: 1.84e-14
Both Elevation and Length are significant predictors in the multiple regression model. The low p-values (0.0165 and 2.5e-15) from the t-tests of the coefficients for the two predictors suggest that this model does significantly better than a model with either Elevation or Length alone.
part c:
SOLUTION:
plotAddedVar(fm2b, "Elevation")
The pattern in the added variable plot shows us that there is information in the Elevation predictor that is not in the Length variable and that can help predict Length.
Note: Allowed predictors are all variables in the data set after WinPct. Also, do not worry about finding four predictor models. Just report the final model R is choosing. If you want to isolate four predictor models, consult your R manuals for other code. stepAIC is a little harder to work with for that.
Standings = read.csv("http://www.amherst.edu/~awagaman/math230/MLB2007Standings.csv")
part a:
SOLUTION:
modsmall = lm(WinPct ~ 1, data = Standings)
summary(modsmall)
##
## Call:
## lm(formula = WinPct ~ 1, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09293 -0.05418 0.00907 0.04532 0.09307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4999 0.0104 47.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0572 on 29 degrees of freedom
fmfull = lm(WinPct ~ . - Wins - Losses - Team, data = Standings)
summary(fmfull)
##
## Call:
## lm(formula = WinPct ~ . - Wins - Losses - Team, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02463 -0.00833 -0.00038 0.00904 0.03425
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.55e-01 5.61e-01 -0.46 0.657
## LeagueNL -6.69e-03 2.23e-02 -0.30 0.769
## BattingAvg -7.03e+00 2.34e+01 -0.30 0.769
## Runs 3.31e-04 8.60e-04 0.39 0.707
## Hits 3.19e-05 1.49e-03 0.02 0.983
## HR -5.18e-03 8.66e-03 -0.60 0.561
## Doubles -1.79e-03 2.84e-03 -0.63 0.541
## Triples -4.29e-03 5.99e-03 -0.72 0.488
## RBI 8.53e-05 9.39e-04 0.09 0.929
## SB 4.54e-05 2.46e-04 0.18 0.856
## OBP -4.49e-01 1.02e+00 -0.44 0.669
## SLG 9.70e+00 1.61e+01 0.60 0.558
## ERA -2.41e-02 4.75e-02 -0.51 0.621
## HitsAllowed 1.04e-03 9.39e-04 1.11 0.288
## Walks 1.08e-03 1.01e-03 1.08 0.303
## StrikeOuts -5.23e-06 9.06e-05 -0.06 0.955
## Saves 3.45e-03 1.32e-03 2.61 0.023 *
## WHIP -1.64e+00 1.39e+00 -1.18 0.260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0204 on 12 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.873
## F-statistic: 12.7 on 17 and 12 DF, p-value: 3.43e-05
stepAIC(modsmall, scope = list(upper = fmfull, lower = ~1), direction = "both",
trace = FALSE)$anova
## Error: could not find function "stepAIC"
result = lm(WinPct ~ Saves + RBI + ERA + BattingAvg, data = Standings)
summary(result)
##
## Call:
## lm(formula = WinPct ~ Saves + RBI + ERA + BattingAvg, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03035 -0.01442 0.00137 0.01547 0.04092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.51e-02 1.27e-01 0.20 0.8455
## Saves 3.37e-03 8.24e-04 4.09 0.0004 ***
## RBI 3.15e-04 8.37e-05 3.77 0.0009 ***
## ERA -6.83e-02 1.08e-02 -6.30 1.4e-06 ***
## BattingAvg 1.54e+00 4.97e-01 3.09 0.0049 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0195 on 25 degrees of freedom
## Multiple R-squared: 0.899, Adjusted R-squared: 0.883
## F-statistic: 55.9 on 4 and 25 DF, p-value: 4.18e-12
Four-predictor model: WinPct ~ Saves + RBI + ERA + BattingAvg R-squared = 0.899.
part b:
SOLUTION:
step(fmfull, k = 2)
## Start: AIC=-225
## WinPct ~ (Team + League + Wins + Losses + BattingAvg + Runs +
## Hits + HR + Doubles + Triples + RBI + SB + OBP + SLG + ERA +
## HitsAllowed + Walks + StrikeOuts + Saves + WHIP) - Wins -
## Losses - Team
##
## Df Sum of Sq RSS AIC
## - Hits 1 0.000000 0.00499 -227
## - StrikeOuts 1 0.000001 0.00499 -227
## - RBI 1 0.000003 0.00500 -227
## - SB 1 0.000014 0.00501 -227
## - BattingAvg 1 0.000037 0.00503 -227
## - League 1 0.000038 0.00503 -227
## - Runs 1 0.000062 0.00505 -227
## - OBP 1 0.000080 0.00507 -227
## - ERA 1 0.000107 0.00510 -226
## - HR 1 0.000149 0.00514 -226
## - SLG 1 0.000151 0.00514 -226
## - Doubles 1 0.000165 0.00516 -226
## - Triples 1 0.000213 0.00521 -226
## <none> 0.00499 -225
## - Walks 1 0.000481 0.00547 -224
## - HitsAllowed 1 0.000514 0.00551 -224
## - WHIP 1 0.000581 0.00557 -224
## - Saves 1 0.002830 0.00782 -214
##
## Step: AIC=-227
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples +
## RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks + StrikeOuts +
## Saves + WHIP
##
## Df Sum of Sq RSS AIC
## - StrikeOuts 1 0.000001 0.00499 -229
## - RBI 1 0.000004 0.00500 -229
## - SB 1 0.000015 0.00501 -229
## - League 1 0.000054 0.00505 -229
## - Runs 1 0.000068 0.00506 -229
## - OBP 1 0.000090 0.00508 -228
## - ERA 1 0.000148 0.00514 -228
## <none> 0.00499 -227
## - Walks 1 0.000481 0.00547 -226
## - BattingAvg 1 0.000497 0.00549 -226
## - HitsAllowed 1 0.000514 0.00551 -226
## - SLG 1 0.000574 0.00557 -226
## - WHIP 1 0.000581 0.00557 -226
## - Doubles 1 0.000679 0.00567 -225
## - HR 1 0.000681 0.00567 -225
## - Triples 1 0.001168 0.00616 -223
## - Saves 1 0.002866 0.00786 -215
##
## Step: AIC=-229
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples +
## RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks + Saves +
## WHIP
##
## Df Sum of Sq RSS AIC
## - RBI 1 0.00000 0.00500 -231
## - SB 1 0.00001 0.00501 -231
## - League 1 0.00006 0.00506 -231
## - Runs 1 0.00007 0.00507 -231
## - OBP 1 0.00009 0.00508 -230
## - ERA 1 0.00015 0.00514 -230
## <none> 0.00499 -229
## - BattingAvg 1 0.00051 0.00550 -228
## - Walks 1 0.00051 0.00551 -228
## - HitsAllowed 1 0.00055 0.00555 -228
## - SLG 1 0.00061 0.00561 -228
## - WHIP 1 0.00062 0.00562 -228
## - Doubles 1 0.00069 0.00569 -227
## - HR 1 0.00070 0.00570 -227
## - Triples 1 0.00120 0.00619 -225
## - Saves 1 0.00343 0.00843 -215
##
## Step: AIC=-231
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples +
## SB + OBP + SLG + ERA + HitsAllowed + Walks + Saves + WHIP
##
## Df Sum of Sq RSS AIC
## - SB 1 0.00001 0.00501 -233
## - League 1 0.00009 0.00508 -232
## - OBP 1 0.00009 0.00508 -232
## - ERA 1 0.00016 0.00516 -232
## <none> 0.00500 -231
## - BattingAvg 1 0.00061 0.00561 -230
## - Runs 1 0.00071 0.00571 -229
## - Walks 1 0.00075 0.00575 -229
## - SLG 1 0.00078 0.00578 -229
## - HitsAllowed 1 0.00079 0.00579 -229
## - HR 1 0.00086 0.00586 -228
## - Doubles 1 0.00089 0.00589 -228
## - WHIP 1 0.00096 0.00596 -228
## - Triples 1 0.00138 0.00638 -226
## - Saves 1 0.00356 0.00856 -217
##
## Step: AIC=-233
## WinPct ~ League + BattingAvg + Runs + HR + Doubles + Triples +
## OBP + SLG + ERA + HitsAllowed + Walks + Saves + WHIP
##
## Df Sum of Sq RSS AIC
## - League 1 0.00009 0.00510 -234
## - OBP 1 0.00009 0.00510 -234
## - ERA 1 0.00015 0.00516 -234
## <none> 0.00501 -233
## - Runs 1 0.00071 0.00572 -231
## - BattingAvg 1 0.00087 0.00588 -230
## - HitsAllowed 1 0.00091 0.00592 -230
## - Walks 1 0.00092 0.00593 -230
## - SLG 1 0.00112 0.00613 -229
## - WHIP 1 0.00116 0.00617 -229
## - HR 1 0.00122 0.00623 -228
## - Doubles 1 0.00144 0.00645 -227
## - Triples 1 0.00168 0.00669 -226
## - Saves 1 0.00368 0.00869 -218
##
## Step: AIC=-234
## WinPct ~ BattingAvg + Runs + HR + Doubles + Triples + OBP + SLG +
## ERA + HitsAllowed + Walks + Saves + WHIP
##
## Df Sum of Sq RSS AIC
## - OBP 1 0.00010 0.00520 -236
## - ERA 1 0.00011 0.00520 -236
## <none> 0.00510 -234
## - BattingAvg 1 0.00078 0.00588 -232
## - SLG 1 0.00111 0.00621 -230
## - HitsAllowed 1 0.00116 0.00625 -230
## - Walks 1 0.00118 0.00628 -230
## - HR 1 0.00118 0.00628 -230
## - Doubles 1 0.00145 0.00655 -229
## - WHIP 1 0.00150 0.00659 -229
## - Triples 1 0.00165 0.00674 -228
## - Runs 1 0.00225 0.00735 -225
## - Saves 1 0.00451 0.00961 -217
##
## Step: AIC=-236
## WinPct ~ BattingAvg + Runs + HR + Doubles + Triples + SLG + ERA +
## HitsAllowed + Walks + Saves + WHIP
##
## Df Sum of Sq RSS AIC
## - ERA 1 0.00016 0.00536 -237
## <none> 0.00520 -236
## - BattingAvg 1 0.00069 0.00589 -234
## - SLG 1 0.00101 0.00621 -232
## - HitsAllowed 1 0.00106 0.00626 -232
## - Walks 1 0.00108 0.00628 -232
## - HR 1 0.00108 0.00628 -232
## - Doubles 1 0.00136 0.00656 -231
## - WHIP 1 0.00143 0.00663 -230
## - Triples 1 0.00157 0.00677 -230
## - Runs 1 0.00271 0.00791 -225
## - Saves 1 0.00500 0.01020 -218
##
## Step: AIC=-237
## WinPct ~ BattingAvg + Runs + HR + Doubles + Triples + SLG + HitsAllowed +
## Walks + Saves + WHIP
##
## Df Sum of Sq RSS AIC
## <none> 0.00536 -237
## - BattingAvg 1 0.00073 0.00609 -235
## - SLG 1 0.00102 0.00638 -234
## - HR 1 0.00113 0.00649 -233
## - HitsAllowed 1 0.00116 0.00651 -233
## - Walks 1 0.00130 0.00666 -232
## - Doubles 1 0.00139 0.00675 -232
## - Triples 1 0.00158 0.00694 -231
## - WHIP 1 0.00199 0.00735 -229
## - Runs 1 0.00333 0.00868 -224
## - Saves 1 0.00761 0.01297 -212
##
## Call:
## lm(formula = WinPct ~ BattingAvg + Runs + HR + Doubles + Triples +
## SLG + HitsAllowed + Walks + Saves + WHIP, data = Standings)
##
## Coefficients:
## (Intercept) BattingAvg Runs HR Doubles
## -0.196647 -5.715103 0.000473 -0.004330 -0.001518
## Triples SLG HitsAllowed Walks Saves
## -0.003601 7.822958 0.000796 0.000862 0.004059
## WHIP
## -1.399618
result2 = lm(WinPct ~ Saves + WHIP + Runs + Triples, data = Standings)
summary(result2)
##
## Call:
## lm(formula = WinPct ~ Saves + WHIP + Runs + Triples, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03448 -0.00910 0.00133 0.01089 0.04562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.12e-01 1.04e-01 3.97 0.00054 ***
## Saves 3.92e-03 7.81e-04 5.02 3.5e-05 ***
## WHIP -3.21e-01 5.61e-02 -5.72 5.9e-06 ***
## Runs 5.25e-04 5.61e-05 9.36 1.2e-09 ***
## Triples -8.25e-04 4.95e-04 -1.67 0.10811
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0203 on 25 degrees of freedom
## Multiple R-squared: 0.891, Adjusted R-squared: 0.874
## F-statistic: 51.3 on 4 and 25 DF, p-value: 1.07e-11
Four-predictor model: WinPct ~ Saves + WHIP + Runs + Triples R2 = 0.891
part c:
SOLUTION:
library(leaps)
summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 1, nvmax = 4,
data = Standings))
## Subset selection object
## Call: regsubsets.formula(WinPct ~ . - Wins - Losses - Team, nbest = 1,
## nvmax = 4, data = Standings)
## 17 Variables (and intercept)
## Forced in Forced out
## LeagueNL FALSE FALSE
## BattingAvg FALSE FALSE
## Runs FALSE FALSE
## Hits FALSE FALSE
## HR FALSE FALSE
## Doubles FALSE FALSE
## Triples FALSE FALSE
## RBI FALSE FALSE
## SB FALSE FALSE
## OBP FALSE FALSE
## SLG FALSE FALSE
## ERA FALSE FALSE
## HitsAllowed FALSE FALSE
## Walks FALSE FALSE
## StrikeOuts FALSE FALSE
## Saves FALSE FALSE
## WHIP FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## LeagueNL BattingAvg Runs Hits HR Doubles Triples RBI SB OBP SLG
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" "*" " " " " " " " " " " " " " " " "
## ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 ) "*" " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" "*"
## 4 ( 1 ) " " " " " " " " "*" "*"
result3 = lm(WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
summary(result3)
##
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02847 -0.01178 0.00051 0.01311 0.04762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.72e-01 1.31e-01 1.31 0.20192
## Saves 4.17e-03 7.06e-04 5.91 3.6e-06 ***
## WHIP -3.36e-01 4.85e-02 -6.92 3.0e-07 ***
## BattingAvg 1.44e+00 4.78e-01 3.01 0.00587 **
## Runs 3.19e-04 7.89e-05 4.04 0.00045 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0183 on 25 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.897
## F-statistic: 64.4 on 4 and 25 DF, p-value: 8.51e-13
Four-predictor model: WinPct ~ Saves + WHIP + BattingAvg + Runs R2 = 0.912
part d:
SOLUTION:
summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 1, nvmax = 4,
data = Standings))$summary(regsubsets(WinPct ~ . - Wins - Losses - Team,
nbest = 1, nvmax = 4, data = Standings))$cp
## Error: attempt to apply non-function
summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 15, nvmax = 4,
data = Standings))
## Subset selection object
## Call: regsubsets.formula(WinPct ~ . - Wins - Losses - Team, nbest = 15,
## nvmax = 4, data = Standings)
## 17 Variables (and intercept)
## Forced in Forced out
## LeagueNL FALSE FALSE
## BattingAvg FALSE FALSE
## Runs FALSE FALSE
## Hits FALSE FALSE
## HR FALSE FALSE
## Doubles FALSE FALSE
## Triples FALSE FALSE
## RBI FALSE FALSE
## SB FALSE FALSE
## OBP FALSE FALSE
## SLG FALSE FALSE
## ERA FALSE FALSE
## HitsAllowed FALSE FALSE
## Walks FALSE FALSE
## StrikeOuts FALSE FALSE
## Saves FALSE FALSE
## WHIP FALSE FALSE
## 15 subsets of each size up to 4
## Selection Algorithm: exhaustive
## LeagueNL BattingAvg Runs Hits HR Doubles Triples RBI SB OBP
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " " " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " " " " " " "*" " " " "
## 1 ( 4 ) " " " " "*" " " " " " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " " " " " " " " " " "*"
## 1 ( 6 ) " " " " " " " " " " " " " " " " " " " "
## 1 ( 7 ) " " " " " " " " " " " " " " " " " " " "
## 1 ( 8 ) " " "*" " " " " " " " " " " " " " " " "
## 1 ( 9 ) " " " " " " "*" " " " " " " " " " " " "
## 1 ( 10 ) " " " " " " " " " " " " " " " " " " " "
## 1 ( 11 ) " " " " " " " " " " "*" " " " " " " " "
## 1 ( 12 ) " " " " " " " " " " " " " " " " " " " "
## 1 ( 13 ) " " " " " " " " "*" " " " " " " " " " "
## 1 ( 14 ) " " " " " " " " " " " " " " " " " " " "
## 1 ( 15 ) " " " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " "
## 2 ( 2 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 3 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 4 ) " " " " " " " " " " " " " " "*" " " " "
## 2 ( 5 ) " " " " " " " " " " " " " " " " " " "*"
## 2 ( 6 ) " " " " " " " " " " " " " " " " " " "*"
## 2 ( 7 ) " " " " " " " " " " " " " " " " " " " "
## 2 ( 8 ) " " "*" " " " " " " " " " " " " " " " "
## 2 ( 9 ) " " " " " " "*" " " " " " " " " " " " "
## 2 ( 10 ) " " "*" " " " " " " " " " " " " " " " "
## 2 ( 11 ) " " " " " " " " " " " " " " " " " " " "
## 2 ( 12 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 13 ) " " " " " " "*" " " " " " " " " " " " "
## 2 ( 14 ) " " " " " " " " " " " " " " "*" " " " "
## 2 ( 15 ) " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
## 3 ( 2 ) " " " " " " " " " " " " " " "*" " " " "
## 3 ( 3 ) " " " " " " " " " " " " " " " " " " "*"
## 3 ( 4 ) " " " " " " " " " " " " " " " " " " "*"
## 3 ( 5 ) " " " " " " " " " " " " " " "*" " " " "
## 3 ( 6 ) " " " " "*" " " " " " " " " " " " " " "
## 3 ( 7 ) " " "*" " " " " " " " " " " " " " " " "
## 3 ( 8 ) " " " " " " "*" " " " " " " " " " " " "
## 3 ( 9 ) " " "*" " " " " " " " " " " " " " " " "
## 3 ( 10 ) " " " " " " " " " " " " " " " " " " "*"
## 3 ( 11 ) " " " " " " "*" " " " " " " " " " " " "
## 3 ( 12 ) " " "*" " " " " " " " " " " "*" " " " "
## 3 ( 13 ) " " " " " " " " " " "*" " " "*" " " " "
## 3 ( 14 ) " " " " " " " " " " " " "*" "*" " " " "
## 3 ( 15 ) " " " " " " " " " " " " " " "*" " " " "
## 4 ( 1 ) " " "*" "*" " " " " " " " " " " " " " "
## 4 ( 2 ) " " "*" " " " " " " " " " " "*" " " " "
## 4 ( 3 ) " " " " "*" "*" " " " " " " " " " " " "
## 4 ( 4 ) " " "*" " " " " " " " " " " "*" " " " "
## 4 ( 5 ) " " " " " " "*" " " " " " " "*" " " " "
## 4 ( 6 ) " " " " "*" " " "*" " " " " " " " " " "
## 4 ( 7 ) " " "*" "*" " " " " " " " " " " " " " "
## 4 ( 8 ) " " " " " " " " "*" " " " " "*" " " " "
## 4 ( 9 ) " " " " "*" " " " " " " " " " " " " " "
## 4 ( 10 ) " " "*" " " " " " " " " " " " " " " "*"
## 4 ( 11 ) " " " " "*" " " " " " " "*" " " " " " "
## 4 ( 12 ) " " " " "*" " " " " " " " " " " " " "*"
## 4 ( 13 ) " " "*" " " " " " " " " " " " " " " "*"
## 4 ( 14 ) " " " " " " "*" " " " " " " " " " " "*"
## 4 ( 15 ) " " " " "*" " " " " "*" " " " " " " " "
## SLG ERA HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " " " " "*"
## 1 ( 3 ) " " " " " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " " " " " "
## 1 ( 6 ) " " " " " " " " " " "*" " "
## 1 ( 7 ) " " " " "*" " " " " " " " "
## 1 ( 8 ) " " " " " " " " " " " " " "
## 1 ( 9 ) " " " " " " " " " " " " " "
## 1 ( 10 ) "*" " " " " " " " " " " " "
## 1 ( 11 ) " " " " " " " " " " " " " "
## 1 ( 12 ) " " " " " " "*" " " " " " "
## 1 ( 13 ) " " " " " " " " " " " " " "
## 1 ( 14 ) " " " " " " " " "*" " " " "
## 1 ( 15 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 2 ) " " "*" " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " " " " "*"
## 2 ( 4 ) " " " " " " " " " " " " "*"
## 2 ( 5 ) " " " " " " " " " " "*" " "
## 2 ( 6 ) " " "*" " " " " " " " " " "
## 2 ( 7 ) "*" "*" " " " " " " " " " "
## 2 ( 8 ) " " "*" " " " " " " " " " "
## 2 ( 9 ) " " "*" " " " " " " " " " "
## 2 ( 10 ) " " " " " " " " " " "*" " "
## 2 ( 11 ) "*" " " " " " " " " " " "*"
## 2 ( 12 ) " " " " " " " " " " "*" " "
## 2 ( 13 ) " " " " " " " " " " "*" " "
## 2 ( 14 ) " " " " " " " " " " "*" " "
## 2 ( 15 ) " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" "*"
## 3 ( 2 ) " " " " " " " " " " "*" "*"
## 3 ( 3 ) " " "*" " " " " " " "*" " "
## 3 ( 4 ) " " " " " " " " " " "*" "*"
## 3 ( 5 ) " " "*" " " " " " " "*" " "
## 3 ( 6 ) " " "*" " " " " " " "*" " "
## 3 ( 7 ) " " " " " " " " " " "*" "*"
## 3 ( 8 ) " " " " " " " " " " "*" "*"
## 3 ( 9 ) " " "*" " " " " " " "*" " "
## 3 ( 10 ) " " " " "*" " " " " "*" " "
## 3 ( 11 ) " " "*" " " " " " " "*" " "
## 3 ( 12 ) " " "*" " " " " " " " " " "
## 3 ( 13 ) " " "*" " " " " " " " " " "
## 3 ( 14 ) " " "*" " " " " " " " " " "
## 3 ( 15 ) " " "*" " " "*" " " " " " "
## 4 ( 1 ) " " " " " " " " " " "*" "*"
## 4 ( 2 ) " " " " " " " " " " "*" "*"
## 4 ( 3 ) " " " " " " " " " " "*" "*"
## 4 ( 4 ) " " "*" " " " " " " "*" " "
## 4 ( 5 ) " " " " " " " " " " "*" "*"
## 4 ( 6 ) " " " " " " " " " " "*" "*"
## 4 ( 7 ) " " "*" " " " " " " "*" " "
## 4 ( 8 ) " " " " " " " " " " "*" "*"
## 4 ( 9 ) "*" " " " " " " " " "*" "*"
## 4 ( 10 ) " " " " " " " " " " "*" "*"
## 4 ( 11 ) " " " " " " " " " " "*" "*"
## 4 ( 12 ) " " " " " " " " " " "*" "*"
## 4 ( 13 ) " " "*" " " " " " " "*" " "
## 4 ( 14 ) " " " " " " " " " " "*" "*"
## 4 ( 15 ) " " " " " " " " " " "*" "*"
summary(regsubsets(WinPct ~ . - Wins - Losses - Team, nbest = 15, nvmax = 4,
data = Standings))$cp
## [1] 104.880 112.369 114.211 115.661 115.919 124.011 137.033 145.678
## [9] 149.604 152.856 165.930 170.518 194.320 196.113 197.087 17.723
## [17] 20.004 25.793 26.067 27.620 32.799 36.300 38.418 42.088
## [25] 44.419 45.213 45.471 45.917 47.823 47.981 5.509 7.723
## [33] 8.676 9.287 9.708 9.894 11.373 12.502 13.983 16.090
## [41] 16.224 16.271 16.683 17.441 17.607 0.186 0.865 2.249
## [49] 2.951 3.191 3.690 3.715 4.234 4.497 4.631 4.759
## [57] 4.881 5.027 5.520 5.547
CP-value of model fromm “best subsets” procedure = 0.186 CP-value of model from step backwards procedure = 4.759 CP-value from stepwise regression = 2.951
—-TESTING—-
explanatory = with(Standings, cbind(BattingAvg, Runs, Hits, HR, Doubles, Triples,
RBI, SB, OBP, SLG, ERA, HitsAllowed, Walks, StrikeOuts, Saves, WHIP))
head(explanatory, 1)
## BattingAvg Runs Hits HR Doubles Triples RBI SB OBP SLG ERA
## [1,] 0.25 712 1350 171 286 40 687 109 0.321 0.413 4.13
## HitsAllowed Walks StrikeOuts Saves WHIP
## [1,] 1446 546 1088 51 1.38
results = with(Standings, leaps(explanatory, WinPct, method = "Cp")) #obtain best subsets results with Cp method
xyplot(Cp ~ size, ylim = c(0, 20), data = results)
ladd(panel.abline(a = 0, b = 1))
Cpmod = lm(cesd ~ i1 + mcs + pcs + pss_fr + sexrisk, data = HELPrct)
summary(Cpmod)
##
## Call:
## lm(formula = cesd ~ i1 + mcs + pcs + pss_fr + sexrisk, data = HELPrct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.75 -5.94 0.04 5.35 25.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0192 2.3357 28.27 < 2e-16 ***
## i1 0.0505 0.0209 2.42 0.016 *
## mcs -0.6324 0.0323 -19.56 < 2e-16 ***
## pcs -0.2291 0.0387 -5.91 6.6e-09 ***
## pss_fr -0.2526 0.1036 -2.44 0.015 *
## sexrisk -0.2887 0.1474 -1.96 0.051 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.64 on 447 degrees of freedom
## Multiple R-squared: 0.528, Adjusted R-squared: 0.523
## F-statistic: 100 on 5 and 447 DF, p-value: <2e-16
##
favstats(Cp ~ size, data = results)
## .group min Q1 median Q3 max mean sd n missing
## 1 2 114.73 125.15 130.95 156.27 166.32 138.75 18.8152 10 0
## 2 3 20.86 29.61 34.29 42.55 49.57 35.27 9.8110 10 0
## 3 4 7.58 11.15 12.19 14.79 18.96 12.92 3.3346 10 0
## 4 5 1.71 4.11 5.20 5.92 6.48 4.75 1.6221 10 0
## 5 6 1.67 2.18 2.36 2.58 2.86 2.32 0.3962 10 0
## 6 7 1.62 1.76 2.23 2.40 2.43 2.10 0.3395 10 0
## 7 8 2.03 2.39 2.64 2.76 2.94 2.55 0.2963 10 0
## 8 9 3.32 3.47 3.71 3.79 3.86 3.64 0.2013 10 0
## 9 10 4.97 5.06 5.09 5.15 5.23 5.09 0.0864 10 0
## 10 11 5.85 6.23 6.40 6.64 6.75 6.40 0.2866 10 0
## 11 12 7.44 7.70 7.79 7.81 7.94 7.73 0.1646 10 0
## 12 13 9.17 9.38 9.39 9.44 9.51 9.38 0.1061 10 0
## 13 14 11.05 11.14 11.21 11.22 11.35 11.20 0.0952 10 0
## 14 15 13.04 13.05 13.10 13.20 13.22 13.12 0.0768 10 0
## 15 16 15.00 15.04 15.21 15.30 15.52 15.21 0.1861 10 0
## 16 17 17.00 17.00 17.00 17.00 17.00 17.00 NA 1 0
minimum = which.min(results$Cp)
minimum #determine which model has lowest Cp value
## [1] 51
results$Cp[minimum] #determine value of lowest Cp
## [1] 1.62
results$which[minimum, ] #pull out model that has minimum Cp
## 1 2 3 4 5 6 7 8 9 A B C
## FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## D E F G
## FALSE FALSE TRUE FALSE
##
part e:
SOLUTION:
vif(result3)
## Error: could not find function "vif"
summary(result3)
##
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02847 -0.01178 0.00051 0.01311 0.04762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.72e-01 1.31e-01 1.31 0.20192
## Saves 4.17e-03 7.06e-04 5.91 3.6e-06 ***
## WHIP -3.36e-01 4.85e-02 -6.92 3.0e-07 ***
## BattingAvg 1.44e+00 4.78e-01 3.01 0.00587 **
## Runs 3.19e-04 7.89e-05 4.04 0.00045 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0183 on 25 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.897
## F-statistic: 64.4 on 4 and 25 DF, p-value: 8.51e-13
I prefer the model from the best subsets procedure because it has the highest R2, all highly significant predictors, and low VIFs.
part a:
SOLUTION:
summary(regsubsets(Time ~ Elevation + Difficulty + Ascent + Length, nbest = 1,
data = HighPeaks))
## Subset selection object
## Call: regsubsets.formula(Time ~ Elevation + Difficulty + Ascent + Length,
## nbest = 1, data = HighPeaks)
## 4 Variables (and intercept)
## Forced in Forced out
## Elevation FALSE FALSE
## Difficulty FALSE FALSE
## Ascent FALSE FALSE
## Length FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## Elevation Difficulty Ascent Length
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " "*" " " "*"
## 3 ( 1 ) "*" "*" " " "*"
## 4 ( 1 ) "*" "*" "*" "*"
result3 = lm(WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
summary(result3)
##
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAvg + Runs, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02847 -0.01178 0.00051 0.01311 0.04762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.72e-01 1.31e-01 1.31 0.20192
## Saves 4.17e-03 7.06e-04 5.91 3.6e-06 ***
## WHIP -3.36e-01 4.85e-02 -6.92 3.0e-07 ***
## BattingAvg 1.44e+00 4.78e-01 3.01 0.00587 **
## Runs 3.19e-04 7.89e-05 4.04 0.00045 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0183 on 25 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.897
## F-statistic: 64.4 on 4 and 25 DF, p-value: 8.51e-13
modsmall2 = lm(Time ~ 1, data = HighPeaks)
summary(modsmall)
##
## Call:
## lm(formula = WinPct ~ 1, data = Standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09293 -0.05418 0.00907 0.04532 0.09307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4999 0.0104 47.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0572 on 29 degrees of freedom
fmfull2 = lm(Time ~ Elevation + Difficulty + Ascent + Length, data = HighPeaks)
summary(fmfull2)
##
## Call:
## lm(formula = Time ~ Elevation + Difficulty + Ascent + Length,
## data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7794 -0.8122 -0.0865 0.6896 3.0674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.956786 2.230763 2.67 0.01082 *
## Elevation -0.001670 0.000518 -3.22 0.00249 **
## Difficulty 0.865453 0.228528 3.79 0.00049 ***
## Ascent 0.000601 0.000331 1.82 0.07669 .
## Length 0.444008 0.081252 5.46 2.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.17 on 41 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.824
## F-statistic: 53.8 on 4 and 41 DF, p-value: 8.74e-16
stepAIC(modsmall2, scope = list(upper = fmfull2, lower = ~1), direction = "both",
trace = FALSE)$anova
## Error: could not find function "stepAIC"
mypick = lm(Time ~ Elevation + Difficulty + Length, data = HighPeaks)
summary(mypick)
##
## Call:
## lm(formula = Time ~ Elevation + Difficulty + Length, data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.126 -0.767 -0.146 0.692 3.149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.01308 2.29075 2.62 0.01204 *
## Elevation -0.00140 0.00051 -2.74 0.00888 **
## Difficulty 0.87281 0.23466 3.72 0.00059 ***
## Length 0.48923 0.07943 6.16 2.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 42 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.815
## F-statistic: 67 on 3 and 42 DF, p-value: 4.68e-16
vif(mypick)
## Error: could not find function "vif"
I would use the model Time ~ Elevation + Difficulty + Length because the VIFs are low and all the predictors are significant. The R2 value is also very close to that of the model with Ascent.
part b:
SOLUTION:
plot(mypick, which = 1)
plot(mypick, which = 2)
plot(mypick, which = 3)
There are some obervations (24,33,40) that are causing problems. There seems to be increasing variance with higher predicted values. And the observations are causing problems with the upper tail in the QQ plot.
part c:
SOLUTION:
influencestat = ls.diag(mypick)
StuRes <- round(influencestat$stud.res, 2)
Hats <- round(influencestat$hat, 3) #rounds to 3 decimals
Cooks <- round(influencestat$cooks, 3)
which(StuRes > 2)
## [1] 24 33 40
Obeservations 24,33,40 have St Res greater than 2 and seem to stand out.
part d:
SOLUTION:
plot(mypick, which = 4)
Cooks[c(24, 40, 44)]
## [1] 0.137 0.193 0.125
plot(mypick, which = 5)
Hats[c(24, 40)]
## [1] 0.069 0.117
24,40,44 have abnormal Cook's Ds (0.137,0.193,0.125) 24,40 have high leverage (0.069, 0.117)
BaseballTimes = read.csv("http://www.amherst.edu/~awagaman/math230/BaseballTimes.csv")
SOLUTION:
t = do(1000) * (lm(WinPct ~ Saves + WHIP + BattingAvg + shuffle(Runs), data = Standings))
hist(t$Runs)
coef(result3)
## (Intercept) Saves WHIP BattingAvg Runs
## 0.171693 0.004173 -0.335661 1.440692 0.000319
pdata(0.000319, Runs, data = t, lower.tail = FALSE)
## [1] 0
qdata(c(0.025, 0.975), Runs, data = t)
## quantile p
## 2.5% -0.000132 0.025
## 97.5% 0.000130 0.975
The earlier coefficient is not in the interval, so it does seem that correlation between Runs and Time is greater than zero.
part a:
SOLUTION:
fm19a = lm(Length ~ Time, data = HighPeaks)
summary(fm19a)
##
## Call:
## lm(formula = Length ~ Time, data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.411 -1.164 -0.041 1.051 3.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.100 1.067 1.03 0.31
## Time 1.077 0.097 11.11 2.4e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.82 on 44 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.731
## F-statistic: 123 on 1 and 44 DF, p-value: 2.39e-14
error19a = qt(0.95, df = 45) * 0.097
1.077 - error19a
## [1] 0.914
1.077 + error19a
## [1] 1.24
CI = [.914,1.24] We are 90% confident that all hikers' average speed is in the interval 1.05-1.1 mph.
part b:
SOLUTION:
bootstrap = do(5000) * lm(Length ~ Time, data = HighPeaks[resample(1:nrow(HighPeaks)),
])$coefficients
histogram(~Time, data = bootstrap)
The histrogram shows a normal distribution that is skewed right. Centered around 1.1
part c:
SOLUTION:
favstats(bootstrap$Time)
## min Q1 median Q3 max mean sd n missing
## 0.776 1.01 1.08 1.17 1.65 1.09 0.123 5000 0
Mean = 1.09, SD = 0.123 Mean is very similar to coefficient in original model (1.077). The SD is slightly higher than the original value (0.097).
part d:
SOLUTION:
error19d = 0.123 * qt(0.95, df = 45)
1.09 - error19d
## [1] 0.883
1.09 + error19d
## [1] 1.3
New CI = [.883,1.3]
part e:
SOLUTION:
quantile(bootstrap$Time, c(0.05, 0.95))
## 5% 95%
## 0.917 1.321
CI = [0.915,1.322]
part f:
SOLUTION:
1.077 - 0.915
## [1] 0.162
1.322 - 1.077
## [1] 0.245
1.077 - 0.245
## [1] 0.832
1.077 + 0.162
## [1] 1.24
Orig - LB = 0.162 UB - Orig = 0.245
New LB = 0.832 New UB = 1.24
New Interval = [.832,1.24]
part g:
SOLUTION:
Intervals: [.914,1.24] [.883,1.3] [0.915,1.322] [.832,1.24]
The intervals vary, but are very similar overall.