This lecture was all about creating the best model possible with either all, or certain variables provided using F tests, t tests and ANOVA. First, install the alr3 package.

Fish

Use the wblake data. Create a model for the age of the fish dependent on the other two variables in the data set. Does age depend on length and/or scale radius?

#install.packages("alr3")
library(alr3)
## Loading required package: car
data(wblake)
head(wblake)
##   Age Length   Scale
## 1   1     71 1.90606
## 2   1     64 1.87707
## 3   1     57 1.09736
## 4   1     68 1.33108
## 5   1     72 1.59283
## 6   1     80 1.91602
mod <-  lm(Age ~ Length + Scale, data = wblake )
summary(mod)
## 
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68036 -0.52766  0.03982  0.54636  2.81994 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.008884   0.139800  -7.217 2.38e-12 ***
## Length       0.027344   0.001773  15.427  < 2e-16 ***
## Scale       -0.011078   0.044012  -0.252    0.801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8545 on 436 degrees of freedom
## Multiple R-squared:  0.8165, Adjusted R-squared:  0.8157 
## F-statistic:   970 on 2 and 436 DF,  p-value: < 2.2e-16

Since the F stat is 2.2 * 10^-16 we know that some predictors are a good predictors, so then we look at the t tests. We can also say that at least one of the betas in not equal to zero.

Length t stat: 15.427 p value: < 2e-16 Scale t stat: -0.252 p value: 0.801

Because of the small p value that corresponds to length, we can say that length is a significant predictor of age. Scale’s p value is too high to be a signifigant predictor.

Runoff

data(water)
#head(water)
mymod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mymod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

F value: 2.2 * 10^-16 this means that at least one of our variables is a significant predictor for response.

Can we get by with just a subset of the predictors? Let’s look to see if we can get rid of the APSAB term. (Partial F test) Now we look to see if we can drop APSAB from our model.

mymod1 <- lm(BSAAM ~ APMAM + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mymod1)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSLAKE + OPBPC + OPRC + OPSLAKE, 
##     data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12854  -4980  -1442   4265  18538 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15709.07    4019.36   3.908 0.000382 ***
## APMAM        -114.28     662.25  -0.173 0.863934    
## APSLAKE      1839.61     897.11   2.051 0.047440 *  
## OPBPC          55.75     455.51   0.122 0.903248    
## OPRC         1818.85     594.46   3.060 0.004108 ** 
## OPSLAKE      2312.43     708.44   3.264 0.002368 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7474 on 37 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9142 
## F-statistic: 90.53 on 5 and 37 DF,  p-value: < 2.2e-16
anova(mymod, mymod1)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APMAM + APSLAKE + OPBPC + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     36 2055830733                           
## 2     37 2066700504 -1 -10869771 0.1903 0.6652

Based on this, we can see that APSAB is not significant and can be dropped from our model because the F stat (0.1903) is too low and the p value (0.6652) is too high signifying that this is not a good predictor.

mymod2 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
summary(mymod2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16
anova(mymod, mymod2)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     36 2055830733                           
## 2     39 2068947585 -3 -13116852 0.0766 0.9722

In my opinion this is the best version of the model. The p value is very high and has removed the three least important variables.

mymod3 <- lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(mymod3)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, 
##     data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12696  -4933  -1396   4187  18550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15930.84    3972.50   4.010 0.000283 ***
## APSAB        -673.42    1418.96  -0.475 0.637873    
## APSLAKE      2263.86    1269.35   1.783 0.082714 .  
## OPBPC          68.94     453.50   0.152 0.879996    
## OPRC         1915.75     631.46   3.034 0.004399 ** 
## OPSLAKE      2212.62     740.28   2.989 0.004952 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9147 
## F-statistic: 91.05 on 5 and 37 DF,  p-value: < 2.2e-16
anova(mymod, mymod3)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq     F Pr(>F)
## 1     36 2055830733                          
## 2     37 2055849271 -1    -18538 3e-04 0.9857