First, install the alr3 package.

Fish

Use the wblake data. Create a model for the age of the fish dependent on the other two variables in the data set. Does age depend on length and/or scale radius?

library(alr3)
## Loading required package: car
data(wblake)
head(wblake)
##   Age Length   Scale
## 1   1     71 1.90606
## 2   1     64 1.87707
## 3   1     57 1.09736
## 4   1     68 1.33108
## 5   1     72 1.59283
## 6   1     80 1.91602
attach(wblake)
fmod<-lm(Age~Length+Scale,data=wblake)
fmod
## 
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
## 
## Coefficients:
## (Intercept)       Length        Scale  
##    -1.00888      0.02734     -0.01108
summary(fmod)
## 
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68036 -0.52766  0.03982  0.54636  2.81994 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.008884   0.139800  -7.217 2.38e-12 ***
## Length       0.027344   0.001773  15.427  < 2e-16 ***
## Scale       -0.011078   0.044012  -0.252    0.801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8545 on 436 degrees of freedom
## Multiple R-squared:  0.8165, Adjusted R-squared:  0.8157 
## F-statistic:   970 on 2 and 436 DF,  p-value: < 2.2e-16

Our null hypothesis is that age doesn’t depend on scale or length. Based on our test statistics, age have a significant relationship with the length and or scale of the fish. F-statistic: can we drop all of the predictors while the individual values tells us if we can drop each individual predictor.

Runoff

Use the water data. Create a model with runoff as the response and all the snowfall measurements as the predictor. Can we drop APSAB from the model if we keep all the other predictors? More generally, does snowfall at any of the sites help predict runoff?

data(water)
head(water)
##   Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
## 1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
## 2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
## 3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
## 4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
## 5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
## 6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
attach(water)
wmod<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPSLAKE)
omod<-lm(BSAAM~APMAM+APSAB+APSLAKE)
wmod
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE)
## 
## Coefficients:
## (Intercept)        APMAM        APSAB      APSLAKE        OPBPC  
##    18117.85       116.12       922.84       878.94       -17.12  
##     OPSLAKE  
##     3725.81
summary(wmod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13698.2  -5551.1     71.3   4636.3  19580.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18117.85    4446.12   4.075 0.000234 ***
## APMAM         116.12     779.71   0.149 0.882425    
## APSAB         922.84    1572.76   0.587 0.560927    
## APSLAKE       878.94    1386.06   0.634 0.529900    
## OPBPC         -17.12     507.75  -0.034 0.973279    
## OPSLAKE      3725.81     613.29   6.075 4.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8327 on 37 degrees of freedom
## Multiple R-squared:  0.9062, Adjusted R-squared:  0.8935 
## F-statistic: 71.49 on 5 and 37 DF,  p-value: < 2.2e-16
anova(wmod,omod)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE
## Model 2: BSAAM ~ APMAM + APSAB + APSLAKE
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1     37 2.5657e+09                                   
## 2     39 2.5116e+10 -2 -2.255e+10 162.59 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After taking account of other measurements of snowfall, APSAB doesn’t matter based on the p-value If F-test is less than .05, we can determine that at least one of the variables have a significant relationship with the runoff. As long as we keep all the other predictors, we can drop APSAB because it doesn’t add any value to the runoff.

wmod is the complete model and the omod is the reduced model. We want to see what happens when we drop some of the predictors. anova will look at both models. The order in which we enter doesn’t matter. We want to keep the variables because our null hypothesis states that we can drop the variables which we reject based on our p-value. We would want to include at least one of those variables.

babymod<-lm(BSAAM~APMAM,data = water)
wmod<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPSLAKE)
wmod1<-lm(BSAAM~APSAB+APSLAKE+OPBPC+OPSLAKE)
anova(wmod,wmod1)#same as an F-test
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     37 2565725132                           
## 2     38 2567262994 -1  -1537862 0.0222 0.8824

we can get the same p-value doing this and taking the variable that we’re interested in looking at either through partial model or the complet model

additional practice

heights<-read.csv("http://cknudson.com/data/Galton.csv")
attach(heights)
head(heights)
##   FamilyID FatherHeight MotherHeight Gender Height NumKids
## 1        1         78.5         67.0      M   73.2       4
## 2        1         78.5         67.0      F   69.2       4
## 3        1         78.5         67.0      F   69.0       4
## 4        1         78.5         67.0      F   69.0       4
## 5        2         75.5         66.5      M   73.5       4
## 6        2         75.5         66.5      M   72.5       4

Use partial F-test. Should we have different slpes for men and women

dmod<-lm(Height~FatherHeight+Gender)#reduced
summary(dmod)
## 
## Call:
## lm(formula = Height ~ FatherHeight + Gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3708 -1.4808  0.0192  1.5616  9.4153 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.46113    2.13628   16.13   <2e-16 ***
## FatherHeight  0.42782    0.03079   13.90   <2e-16 ***
## GenderM       5.17604    0.15211   34.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.277 on 895 degrees of freedom
## Multiple R-squared:  0.5971, Adjusted R-squared:  0.5962 
## F-statistic: 663.2 on 2 and 895 DF,  p-value: < 2.2e-16
dmod1<-lm(Height~FatherHeight*Gender)#complete
summary(dmod1)
## 
## Call:
## lm(formula = Height ~ FatherHeight * Gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3774 -1.4857  0.0088  1.6017  9.3987 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          35.58523    2.87662  12.370   <2e-16 ***
## FatherHeight          0.41160    0.04148   9.923   <2e-16 ***
## GenderM               2.67368    4.28926   0.623    0.533    
## FatherHeight:GenderM  0.03615    0.06192   0.584    0.560    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.278 on 894 degrees of freedom
## Multiple R-squared:  0.5973, Adjusted R-squared:  0.5959 
## F-statistic: 441.9 on 3 and 894 DF,  p-value: < 2.2e-16
anova(dmod,dmod1)
## Analysis of Variance Table
## 
## Model 1: Height ~ FatherHeight + Gender
## Model 2: Height ~ FatherHeight * Gender
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    895 4639.4                           
## 2    894 4637.6  1    1.7678 0.3408 0.5595

p-value greater than .05 we accept the null so we drop the extra term so the lines are parallel. We don’t have evidence that we need different slopes so we don’t need the interaction model.