First, install the alr3 package.
Use the wblake data. Create a model for the age of the fish dependent on the other two variables in the data set. Does age depend on length and/or scale radius?
library(alr3)
## Loading required package: car
data(wblake)
head(wblake)
## Age Length Scale
## 1 1 71 1.90606
## 2 1 64 1.87707
## 3 1 57 1.09736
## 4 1 68 1.33108
## 5 1 72 1.59283
## 6 1 80 1.91602
attach(wblake)
fmod<-lm(Age~Length+Scale,data=wblake)
fmod
##
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
##
## Coefficients:
## (Intercept) Length Scale
## -1.00888 0.02734 -0.01108
summary(fmod)
##
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68036 -0.52766 0.03982 0.54636 2.81994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.008884 0.139800 -7.217 2.38e-12 ***
## Length 0.027344 0.001773 15.427 < 2e-16 ***
## Scale -0.011078 0.044012 -0.252 0.801
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8545 on 436 degrees of freedom
## Multiple R-squared: 0.8165, Adjusted R-squared: 0.8157
## F-statistic: 970 on 2 and 436 DF, p-value: < 2.2e-16
Our null hypothesis is that age doesn’t depend on scale or length. Based on our test statistics, age have a significant relationship with the length and or scale of the fish. F-statistic: can we drop all of the predictors while the individual values tells us if we can drop each individual predictor.
Use the water data. Create a model with runoff as the response and all the snowfall measurements as the predictor. Can we drop APSAB from the model if we keep all the other predictors? More generally, does snowfall at any of the sites help predict runoff?
data(water)
head(water)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
attach(water)
wmod<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPSLAKE)
omod<-lm(BSAAM~APMAM+APSAB+APSLAKE)
wmod
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE)
##
## Coefficients:
## (Intercept) APMAM APSAB APSLAKE OPBPC
## 18117.85 116.12 922.84 878.94 -17.12
## OPSLAKE
## 3725.81
summary(wmod)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13698.2 -5551.1 71.3 4636.3 19580.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18117.85 4446.12 4.075 0.000234 ***
## APMAM 116.12 779.71 0.149 0.882425
## APSAB 922.84 1572.76 0.587 0.560927
## APSLAKE 878.94 1386.06 0.634 0.529900
## OPBPC -17.12 507.75 -0.034 0.973279
## OPSLAKE 3725.81 613.29 6.075 4.97e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8327 on 37 degrees of freedom
## Multiple R-squared: 0.9062, Adjusted R-squared: 0.8935
## F-statistic: 71.49 on 5 and 37 DF, p-value: < 2.2e-16
anova(wmod,omod)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE
## Model 2: BSAAM ~ APMAM + APSAB + APSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 37 2.5657e+09
## 2 39 2.5116e+10 -2 -2.255e+10 162.59 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After taking account of other measurements of snowfall, APSAB doesn’t matter based on the p-value If F-test is less than .05, we can determine that at least one of the variables have a significant relationship with the runoff. As long as we keep all the other predictors, we can drop APSAB because it doesn’t add any value to the runoff.
wmod is the complete model and the omod is the reduced model. We want to see what happens when we drop some of the predictors. anova will look at both models. The order in which we enter doesn’t matter. We want to keep the variables because our null hypothesis states that we can drop the variables which we reject based on our p-value. We would want to include at least one of those variables.
babymod<-lm(BSAAM~APMAM,data = water)
wmod<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPSLAKE)
wmod1<-lm(BSAAM~APSAB+APSLAKE+OPBPC+OPSLAKE)
anova(wmod,wmod1)#same as an F-test
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPSLAKE
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 37 2565725132
## 2 38 2567262994 -1 -1537862 0.0222 0.8824
we can get the same p-value doing this and taking the variable that we’re interested in looking at either through partial model or the complet model
heights<-read.csv("http://cknudson.com/data/Galton.csv")
attach(heights)
head(heights)
## FamilyID FatherHeight MotherHeight Gender Height NumKids
## 1 1 78.5 67.0 M 73.2 4
## 2 1 78.5 67.0 F 69.2 4
## 3 1 78.5 67.0 F 69.0 4
## 4 1 78.5 67.0 F 69.0 4
## 5 2 75.5 66.5 M 73.5 4
## 6 2 75.5 66.5 M 72.5 4
Use partial F-test. Should we have different slpes for men and women
dmod<-lm(Height~FatherHeight+Gender)#reduced
summary(dmod)
##
## Call:
## lm(formula = Height ~ FatherHeight + Gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3708 -1.4808 0.0192 1.5616 9.4153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.46113 2.13628 16.13 <2e-16 ***
## FatherHeight 0.42782 0.03079 13.90 <2e-16 ***
## GenderM 5.17604 0.15211 34.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.277 on 895 degrees of freedom
## Multiple R-squared: 0.5971, Adjusted R-squared: 0.5962
## F-statistic: 663.2 on 2 and 895 DF, p-value: < 2.2e-16
dmod1<-lm(Height~FatherHeight*Gender)#complete
summary(dmod1)
##
## Call:
## lm(formula = Height ~ FatherHeight * Gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3774 -1.4857 0.0088 1.6017 9.3987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.58523 2.87662 12.370 <2e-16 ***
## FatherHeight 0.41160 0.04148 9.923 <2e-16 ***
## GenderM 2.67368 4.28926 0.623 0.533
## FatherHeight:GenderM 0.03615 0.06192 0.584 0.560
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.278 on 894 degrees of freedom
## Multiple R-squared: 0.5973, Adjusted R-squared: 0.5959
## F-statistic: 441.9 on 3 and 894 DF, p-value: < 2.2e-16
anova(dmod,dmod1)
## Analysis of Variance Table
##
## Model 1: Height ~ FatherHeight + Gender
## Model 2: Height ~ FatherHeight * Gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 895 4639.4
## 2 894 4637.6 1 1.7678 0.3408 0.5595
p-value greater than .05 we accept the null so we drop the extra term so the lines are parallel. We don’t have evidence that we need different slopes so we don’t need the interaction model.