Multiple linear regression

Creating a model for the age of the fish with the other two variables in the data set, length and scale radius, as predictors:

data("wblake")

fishMod <- lm(Age~Length+Scale, wblake)
summary(fishMod)
## 
## Call:
## lm(formula = Age ~ Length + Scale, data = wblake)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68036 -0.52766  0.03982  0.54636  2.81994 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.008884   0.139800  -7.217 2.38e-12 ***
## Length       0.027344   0.001773  15.427  < 2e-16 ***
## Scale       -0.011078   0.044012  -0.252    0.801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8545 on 436 degrees of freedom
## Multiple R-squared:  0.8165, Adjusted R-squared:  0.8157 
## F-statistic:   970 on 2 and 436 DF,  p-value: < 2.2e-16

It looks like age depends on length but, once length is in the model, not on scale radius. The F-test has a p-value of less than 2.2x10^-16, so at least one variable has a significant relationship with Age. The t-test results for each beta show that Length is a significant predictor (with a p-value of less than 2x10^-16) but Scale is not (with a p-value of 0.801, which is not less than 0.05).
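As a quick check on the overall F-test, the F-statistic can be recovered from R-squared. This is a minimal sketch that uses only the numbers printed in the summary above. The variable names (r2, k, df_resid) are just for illustration, and because the inputs are rounded, the result should match the printed F-statistic only approximately.

```r
# Values copied from the summary(fishMod) output above
r2       <- 0.8165  # Multiple R-squared
k        <- 2       # number of predictors (Length, Scale)
df_resid <- 436     # residual degrees of freedom

# Overall F-statistic: explained variation per predictor divided by
# unexplained variation per residual degree of freedom
f_overall <- (r2 / k) / ((1 - r2) / df_resid)
f_overall

# Upper-tail p-value from the F distribution on k and df_resid df
p_overall <- pf(f_overall, k, df_resid, lower.tail = FALSE)
p_overall
```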

Partial F-test

Creating a model with runoff as the response and all the snowfall measurements as the predictors:

data(water)

runoffMod <- lm(BSAAM~OPSLAKE+OPRC+OPBPC+APSLAKE+APSAB+APMAM, water)
summary(runoffMod)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + 
##     APMAM, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPBPC          69.70     461.69   0.151 0.880839    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APMAM         -12.77     708.89  -0.018 0.985725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Generally speaking, at least one of the site snowfall measures helps predict runoff, since the F-test has a significant p-value (less than 2.2*10^-16, which is also less than 0.05). If we keep all the other predictors, we can drop APSAB: the t-test of its slope has a p-value of 0.67, which is not less than 0.05, so APSAB is not a significant predictor once the other five are in the model.

Now let’s check if we could remove OPSLAKE, OPRC, and OPBPC from the model using a partial F-test:

fullMod <- lm(BSAAM~OPSLAKE+OPRC+OPBPC+APSLAKE+APSAB+APMAM, water)
reducedMod <- lm(BSAAM~APSLAKE+APSAB+APMAM, water)

anova(fullMod, reducedMod) # order doesn't change F or the p-value, only the signs of Df and Sum of Sq
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM
## Model 2: BSAAM ~ APSLAKE + APSAB + APMAM
##   Res.Df        RSS Df  Sum of Sq     F    Pr(>F)    
## 1     36 2.0558e+09                                  
## 2     39 2.5116e+10 -3 -2.306e+10 134.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value’s super small, so we reject the null hypothesis that the coefficients of OPSLAKE, OPRC, and OPBPC are all zero. We should keep at least one of the predictors we just tested (the ones that start with O).
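To see where the F value in the anova table comes from, here is a minimal sketch that rebuilds the partial F-statistic by hand from the residual sums of squares and degrees of freedom printed above. The variable names are made up for illustration, and since the table values are rounded, the result matches the printed F only to about one decimal place.

```r
# Values copied from the anova(fullMod, reducedMod) table above
rss_full    <- 2.0558e+09  # RSS of the full model (6 predictors)
df_full     <- 36          # residual df of the full model
rss_reduced <- 2.5116e+10  # RSS of the reduced model (3 predictors)
df_reduced  <- 39          # residual df of the reduced model

# Number of coefficients being tested (the three O* predictors)
df_diff <- df_reduced - df_full

# Partial F: extra RSS per dropped predictor, relative to the
# full model's residual mean square
f_partial <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
f_partial

# Upper-tail p-value on df_diff and df_full degrees of freedom
p_partial <- pf(f_partial, df_diff, df_full, lower.tail = FALSE)
p_partial
```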

Another example, with heights

Let’s use a partial F-test to check whether the best model for predicting children’s height from their father’s height and gender is a multiple linear regression model without an interaction or an interaction model:

heights <- read.csv("https://cknudson.com/data/Galton.csv")

reducedMod <- lm(Height~FatherHeight+Gender, heights)
completeMod <- lm(Height~FatherHeight*Gender, heights)

anova(completeMod, reducedMod)
## Analysis of Variance Table
## 
## Model 1: Height ~ FatherHeight * Gender
## Model 2: Height ~ FatherHeight + Gender
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    894 4637.6                           
## 2    895 4639.4 -1   -1.7678 0.3408 0.5595

Since the p-value of this partial F-test is 0.56, which is not less than 0.05, we fail to reject the null hypothesis that the interaction coefficient is zero. That means we should stick with the MLR model and we don’t need the interaction term.
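Because only one coefficient is being tested here (Df is 1 in the table), this partial F-test is equivalent to the two-sided t-test for that coefficient: the F-statistic is the square of the t-statistic, and the p-values agree. The sketch below illustrates this using only the F value and residual df from the table above. The t value is derived from F rather than read off a model summary (so its sign is unknown), and the variable names are just for illustration.

```r
# Values copied from the anova(completeMod, reducedMod) table above
f_stat  <- 0.3408  # partial F for the interaction term
df_full <- 894     # residual df of the interaction model

# p-value from the F distribution with 1 numerator df
p_from_f <- pf(f_stat, 1, df_full, lower.tail = FALSE)

# Matching t-statistic in absolute value, and its two-sided p-value
t_abs    <- sqrt(f_stat)
p_from_t <- 2 * pt(-t_abs, df_full)

# The two p-values should agree up to floating-point error
c(p_from_f = p_from_f, p_from_t = p_from_t)
```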