12/1/2021

Why use more than one explanatory variable?

  • Often more than one thing affects an outcome
  • Naively, you might think it best to ignore the other variables, as they complicate things
  • However, you might then miss a real relationship between the outcome and the variable you care about, or even get its direction the wrong way around
  • Two examples

Leaping to the wrong conclusion

Are taller children better at maths?

AMA = HGT

anova(lm(AMA~HGT, data = maths_kids))
## Analysis of Variance Table
## 
## Response: AMA
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## HGT        1 412.77  412.77  726.87 < 2.2e-16 ***
## Residuals 30  17.04    0.57                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Weirdly, yes. This seems odd. But hang on: the children were all from different classes, so the taller ones also tend to be older.

Leaping to the wrong conclusion

Let's include age.

AMA = YEARS + HGT

anova(lm(AMA~YEARS+HGT, data = maths_kids))
## Analysis of Variance Table
## 
## Response: AMA
##           Df Sum Sq Mean Sq   F value Pr(>F)    
## YEARS      1 422.60  422.60 1702.4271 <2e-16 ***
## HGT        1   0.01    0.01    0.0317 0.8599    
## Residuals 29   7.20    0.25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The YEARS line asks “Does age influence mathematical ability?” R's anova() adds terms in the order they are written, so YEARS, listed first, is tested before height has been taken into account.

The HGT line asks “Does height influence mathematical ability once any difference due to age has been taken into account?”
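The effect of term order can be seen in a minimal sketch on simulated data. Everything here is an assumption made for illustration: the data frame `sim_kids`, the sample size and the coefficients are invented, and this is not the lecture's `maths_kids` data. Age is made to drive both height and maths score, while height has no direct effect.

```r
# Simulated data (NOT the lecture's maths_kids): age drives both height and score.
set.seed(1)
n     <- 32
YEARS <- runif(n, min = 5, max = 12)         # age in years (assumed range)
HGT   <- 80 + 6 * YEARS + rnorm(n, sd = 5)   # height depends on age
AMA   <- YEARS + rnorm(n, sd = 0.5)          # maths score depends on age only
sim_kids <- data.frame(AMA, YEARS, HGT)

# On its own, height looks like a strong predictor of maths score.
a_hgt <- anova(lm(AMA ~ HGT, data = sim_kids))

# anova() adds terms in the order written, so here HGT is tested only on
# the variation left over after YEARS has been fitted.
a_age_first <- anova(lm(AMA ~ YEARS + HGT, data = sim_kids))

# Swapping the order changes the question each line answers.
a_hgt_first <- anova(lm(AMA ~ HGT + YEARS, data = sim_kids))

print(a_hgt)
print(a_age_first)
print(a_hgt_first)
```

In the last table the HGT line is identical in sum of squares to the height-only model, because HGT is fitted first there; only the YEARS line is adjusted for the other variable.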

Missing a significant relationship

library(ggplot2)  # provides ggplot(), aes() and geom_point()
ggplot(saplings, aes(x = WATER, y = FINALHT)) +
  geom_point()

Missing a significant relationship

saplings$WATER <- as.factor(saplings$WATER)
anova(lm(FINALHT~WATER, data = saplings))
## Analysis of Variance Table
## 
## Response: FINALHT
##           Df Sum Sq Mean Sq F value Pr(>F)
## WATER      3 12.895  4.2982  1.9721 0.1356
## Residuals 36 78.461  2.1795

Missing a significant relationship

But maybe FINALHT (final height) is also affected by INITHT (initial height).

FINALHT = WATER + INITHT

saplings$WATER <- as.factor(saplings$WATER)
anova(lm(FINALHT~WATER+INITHT, data = saplings))
## Analysis of Variance Table
## 
## Response: FINALHT
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## WATER      3 12.895   4.298   793.55 < 2.2e-16 ***
## INITHT     1 78.272  78.272 14450.93 < 2.2e-16 ***
## Residuals 35  0.190   0.005                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

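Sweet: final height depends on initial height once the effect of watering regime is taken into account, AND the effect of watering regime becomes clearly significant once the variation due to initial height is removed from the residuals. Note that the WATER sum of squares (12.895) is unchanged; it is the residual sum of squares that has dropped, from 78.461 to 0.190.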
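The same pattern can be reproduced in a rough sketch on simulated data. The data frame `sim_saplings`, the watering effects and the noise level are all assumptions made for illustration, not the lecture's `saplings` data. The sketch also uses `drop1()` from base R's stats package, which tests each term after all the others regardless of the order they were written in.

```r
# Simulated data (NOT the lecture's saplings): 4 watering regimes, 10 saplings each.
set.seed(2)
WATER   <- factor(rep(1:4, each = 10))
INITHT  <- runif(40, min = 1, max = 6)               # initial height (assumed range)
effect  <- c(0, 0.3, 0.6, 0.9)[as.integer(WATER)]    # assumed watering effects
FINALHT <- INITHT + effect + rnorm(40, sd = 0.1)
sim_saplings <- data.frame(FINALHT, WATER, INITHT)

fit_water <- lm(FINALHT ~ WATER, data = sim_saplings)
fit_both  <- lm(FINALHT ~ WATER + INITHT, data = sim_saplings)

# Sequential tables: WATER keeps the same sum of squares in both,
# but the residual it is compared against shrinks once INITHT is included.
a_water <- anova(fit_water)
a_both  <- anova(fit_both)
print(a_water)
print(a_both)

# Each term tested with the other already in the model.
adj <- drop1(fit_both, test = "F")
print(adj)
```

If the adjusted test for WATER is the one of interest, `drop1()` (or writing INITHT before WATER in the formula) gives it directly.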

Statistical elimination

  • So those two examples are fairly obvious (when you think about it)
  • Not always so clear
  • The inclusion of a third variable allows its influence to be eliminated, so we call the process statistical elimination
  • More on this in the next lecture
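One way to see what "eliminating" a variable means is the sketch below, again on simulated data (all names and numbers are assumptions for illustration). It first removes the part of maths score and of height that age can explain, then relates what is left. The resulting slope equals the HGT coefficient from the model containing both variables, a standard property of least-squares regression.

```r
# Simulated data (NOT the lecture's maths_kids): age drives both height and score.
set.seed(3)
n     <- 32
YEARS <- runif(n, min = 5, max = 12)
HGT   <- 80 + 6 * YEARS + rnorm(n, sd = 5)
AMA   <- YEARS + rnorm(n, sd = 0.5)

# Remove the part of each variable that age can explain...
ama_resid <- resid(lm(AMA ~ YEARS))
hgt_resid <- resid(lm(HGT ~ YEARS))

# ...then relate what is left over.
slope_eliminated <- unname(coef(lm(ama_resid ~ hgt_resid))["hgt_resid"])

# The HGT coefficient from the model with both variables is the same number.
slope_full <- unname(coef(lm(AMA ~ YEARS + HGT))["HGT"])

print(c(eliminated = slope_eliminated, full = slope_full))
```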