This lecture was all about creating the best model possible with either all, or certain variables provided using F tests, t tests and ANOVA. First, install the alr3 package.
Use the wblake data. Create a model for the age of the fish dependent on the other two variables in the data set. Does age depend on length and/or scale radius?
#install.packages("alr3")
library(alr3)
## Loading required package: car
data(wblake)
head(wblake)
## Age Length Scale
## 1 1 71 1.90606
## 2 1 64 1.87707
## 3 1 57 1.09736
## 4 1 68 1.33108
## 5 1 72 1.59283
## 6 1 80 1.91602
mod <- lm(Age ~ Length + Scale, data = wblake )
summary(mod)
##
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68036 -0.52766 0.03982 0.54636 2.81994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.008884 0.139800 -7.217 2.38e-12 ***
## Length 0.027344 0.001773 15.427 < 2e-16 ***
## Scale -0.011078 0.044012 -0.252 0.801
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8545 on 436 degrees of freedom
## Multiple R-squared: 0.8165, Adjusted R-squared: 0.8157
## F-statistic: 970 on 2 and 436 DF, p-value: < 2.2e-16
Since the F stat is 2.2 * 10^-16 we know that some predictors are a good predictors, so then we look at the t tests. We can also say that at least one of the betas in not equal to zero.
Length t stat: 15.427 p value: < 2e-16 Scale t stat: -0.252 p value: 0.801
Because of the small p value that corresponds to length, we can say that length is a significant predictor of age. Scale’s p value is too high to be a signifigant predictor.
data(water)
#head(water)
mymod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mymod)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
F value: 2.2 * 10^-16 this means that at least one of our variables is a significant predictor for response.
Can we get by with just a subset of the predictors? Let’s look to see if we can get rid of the APSAB term. (Partial F test) Now we look to see if we can drop APSAB from our model.
mymod1 <- lm(BSAAM ~ APMAM + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mymod1)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSLAKE + OPBPC + OPRC + OPSLAKE,
## data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12854 -4980 -1442 4265 18538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15709.07 4019.36 3.908 0.000382 ***
## APMAM -114.28 662.25 -0.173 0.863934
## APSLAKE 1839.61 897.11 2.051 0.047440 *
## OPBPC 55.75 455.51 0.122 0.903248
## OPRC 1818.85 594.46 3.060 0.004108 **
## OPSLAKE 2312.43 708.44 3.264 0.002368 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7474 on 37 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9142
## F-statistic: 90.53 on 5 and 37 DF, p-value: < 2.2e-16
anova(mymod, mymod1)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APMAM + APSLAKE + OPBPC + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 36 2055830733
## 2 37 2066700504 -1 -10869771 0.1903 0.6652
Based on this, we can see that APSAB is not significant and can be dropped from our model because the F stat (0.1903) is too low and the p value (0.6652) is too high signifying that this is not a good predictor.
mymod2 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
summary(mymod2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
anova(mymod, mymod2)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 36 2055830733
## 2 39 2068947585 -3 -13116852 0.0766 0.9722
In my opinion this is the best version of the model. The p value is very high and has removed the three least important variables.
mymod3 <- lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mymod3)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE,
## data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12696 -4933 -1396 4187 18550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15930.84 3972.50 4.010 0.000283 ***
## APSAB -673.42 1418.96 -0.475 0.637873
## APSLAKE 2263.86 1269.35 1.783 0.082714 .
## OPBPC 68.94 453.50 0.152 0.879996
## OPRC 1915.75 631.46 3.034 0.004399 **
## OPSLAKE 2212.62 740.28 2.989 0.004952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9147
## F-statistic: 91.05 on 5 and 37 DF, p-value: < 2.2e-16
anova(mymod, mymod3)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 36 2055830733
## 2 37 2055849271 -1 -18538 3e-04 0.9857