Fitting a model for the age of the fish as a function of the other two variables in the data set, length and scale radius:
library(alr4) # assumption: the wblake and water data sets come from alr4 (they are also in alr3)
data("wblake")
fishMod <- lm(Age~Length+Scale, wblake)
summary(fishMod)
##
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.68036 -0.52766  0.03982  0.54636  2.81994
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.008884   0.139800  -7.217 2.38e-12 ***
## Length       0.027344   0.001773  15.427  < 2e-16 ***
## Scale       -0.011078   0.044012  -0.252    0.801
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8545 on 436 degrees of freedom
## Multiple R-squared: 0.8165, Adjusted R-squared: 0.8157
## F-statistic: 970 on 2 and 436 DF, p-value: < 2.2e-16
It looks like age depends on length but, once length is in the model, not on scale radius. The F-test has a p-value of less than 2.2x10^-16, so at least one of the two predictors has a significant relationship with Age. The t-tests for the individual betas show that Length is a significant predictor (p-value less than 2x10^-16), but Scale is not (p-value of 0.801, which is not less than 0.05).
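As a quick sanity check on those t-tests, here is a minimal sketch that rebuilds the t statistics and two-sided p-values for Length and Scale by hand. It only uses the estimates, standard errors, and residual degrees of freedom printed in the summary above, so it should agree with the table up to rounding of the printed numbers. The variable names are made up for this illustration.

```r
# Values copied from the printed summary(fishMod) output (rounded as printed)
est <- c(Length = 0.027344, Scale = -0.011078)
se  <- c(Length = 0.001773, Scale = 0.044012)
df_resid <- 436  # residual degrees of freedom from the summary

# t statistic = estimate / standard error
t_val <- est / se

# two-sided p-value from the t distribution with df_resid degrees of freedom
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)

print(round(t_val, 3))
print(signif(p_val, 3))
```

The Scale p-value should land near 0.801 and the Length p-value far below 0.05, matching the conclusion above.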
Creating a model with runoff as the response and all the snowfall measurements as the predictors:
data(water)
runoffMod <- lm(BSAAM~OPSLAKE+OPRC+OPBPC+APSLAKE+APSAB+APMAM, water)
summary(runoffMod)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB +
##     APMAM, data = water)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12690  -4936  -1424   4173  18542
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## OPSLAKE      2211.58     752.69   2.938 0.005729 **
## OPRC         1916.45     641.36   2.988 0.005031 **
## OPBPC          69.70     461.69   0.151 0.880839
## APSLAKE      2270.68    1341.29   1.693 0.099112 .
## APSAB        -664.41    1522.89  -0.436 0.665237
## APMAM         -12.77     708.89  -0.018 0.985725
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
Generally speaking, at least one of the site snowfall measures helps predict runoff, since the F-test has a significant p-value (less than 2.2x10^-16, which is far below 0.05). If we keep all the other predictors, we can (and should) drop APSAB, because the t-test of its slope has a p-value of 0.67, which is not less than 0.05 and thus not significant.
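The overall F-statistic can also be recovered from R-squared. The sketch below plugs the printed Multiple R-squared, the number of predictors, and the residual degrees of freedom into F = (R^2 / p) / ((1 - R^2) / df). Because R-squared is rounded to four digits in the output, the result is only approximately the 73.82 that R reports. The variable names are made up for this illustration.

```r
# Values copied from the printed summary(runoffMod) output
r2       <- 0.9248  # Multiple R-squared (rounded as printed)
p        <- 6       # number of predictors in the model
df_resid <- 36      # residual degrees of freedom

# overall F statistic written in terms of R-squared
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

# upper-tail p-value from the F distribution with (p, df_resid) degrees of freedom
p_val <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)

print(f_stat)
print(p_val)
```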
Now let’s check if we could remove OPSLAKE, OPRC, and OPBPC from the model using a partial F-test:
fullMod <- lm(BSAAM~OPSLAKE+OPRC+OPBPC+APSLAKE+APSAB+APMAM, water)
reducedMod <- lm(BSAAM~APSLAKE+APSAB+APMAM, water)
anova(fullMod, reducedMod) # order only flips the signs of Df and Sum of Sq; F and the p-value are the same
## Analysis of Variance Table
##
## Model 1: BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM
## Model 2: BSAAM ~ APSLAKE + APSAB + APMAM
##   Res.Df        RSS Df  Sum of Sq     F    Pr(>F)
## 1     36 2.0558e+09
## 2     39 2.5116e+10 -3 -2.306e+10 134.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is very small (less than 2.2x10^-16), so we reject the null hypothesis that the coefficients of OPSLAKE, OPRC, and OPBPC are all zero. We should keep at least one of the predictors we just tested (the ones that start with O).
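To see where that F comes from, here is a minimal sketch that computes the partial F-statistic by hand from the residual sums of squares and residual degrees of freedom printed in the ANOVA table. The formula is F = ((RSS_reduced - RSS_full) / (df_reduced - df_full)) / (RSS_full / df_full). The variable names are made up for this illustration, and the inputs are the rounded values from the table.

```r
# Values copied from the printed anova() table
rss_full <- 2.0558e9;  df_full <- 36  # Model 1: all six predictors
rss_red  <- 2.5116e10; df_red  <- 39  # Model 2: APSLAKE, APSAB, APMAM only

# number of coefficients being tested (the three O predictors)
q <- df_red - df_full

# partial F: extra sum of squares per tested coefficient over the full model's MSE
f_stat <- ((rss_red - rss_full) / q) / (rss_full / df_full)

# upper-tail p-value from the F distribution with (q, df_full) degrees of freedom
p_val <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

print(f_stat)
print(p_val)
```

This should come out at about 134.6, the F value in the table.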
Let’s use a partial F-test to check whether the best model for predicting children’s heights from their father’s height and their gender is an additive multiple linear regression model or an interaction model:
heights <- read.csv("https://cknudson.com/data/Galton.csv")
reducedMod <- lm(Height~FatherHeight+Gender, heights)
completeMod <- lm(Height~FatherHeight*Gender, heights)
anova(completeMod, reducedMod)
## Analysis of Variance Table
##
## Model 1: Height ~ FatherHeight * Gender
## Model 2: Height ~ FatherHeight + Gender
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    894 4637.6
## 2    895 4639.4 -1   -1.7678 0.3408 0.5595
Since the p-value of this partial F-test is 0.56, which is not less than 0.05, we fail to reject the null hypothesis that the interaction coefficient is zero. We should stick with the MLR model; we don’t need the interaction term.
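When the two models differ by a single coefficient, as here, the partial F-test is equivalent to the t-test on that coefficient in the larger model: F equals t squared, and the p-values match. The sketch below illustrates this with mtcars, a data set that ships with R. It is only a stand-in, not the Galton data used above, and the choice of mpg, wt, and am is arbitrary.

```r
# Stand-in data: mtcars ships with R; this is NOT the Galton heights data
reduced  <- lm(mpg ~ wt + am, data = mtcars)  # additive model
complete <- lm(mpg ~ wt * am, data = mtcars)  # adds the wt:am interaction

# partial F-test comparing the two models (second row holds the test)
tab    <- anova(reduced, complete)
f_stat <- tab$F[2]
f_p    <- tab$`Pr(>F)`[2]

# t-test on the interaction coefficient in the larger model
coefs <- coef(summary(complete))
t_int <- coefs["wt:am", "t value"]
t_p   <- coefs["wt:am", "Pr(>|t|)"]

print(all.equal(f_stat, t_int^2))  # F equals t squared
print(all.equal(f_p, t_p))         # and the p-values agree
```

By the same reasoning, the interaction row of summary(completeMod) would be expected to carry the same p-value as the partial F-test above.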