Often when building linear models, we make big changes to a model. When we do, it's usually not in question whether the new model's performance will differ significantly from the old one's. But if we make a small change, how do we know whether we've made any difference at all? Let's load the boys dataset from the mice package, fit a few short models on the boys' physical attributes, and run ANOVA on them to find out.
This dataset contains information on boys as they grow: age, weight, height, head circumference, and BMI.
I'll remove the rows with NAs first, so that every model is fit to the same complete cases (ANOVA model comparisons are only valid when the models are fit to the same data).
library(mice)   # for the boys dataset
library(psych)  # for describe()

data(boys)
daBoys = boys[, c('age', 'hgt', 'wgt', 'bmi', 'hc')]
daBoys = daBoys[complete.cases(daBoys), ]
describe(daBoys)
## vars n mean sd median trimmed mad min max range skew
## age 1 684 9.11 6.83 10.41 8.98 9.91 0.04 20.81 20.78 -0.03
## hgt 2 684 130.90 46.34 144.85 132.81 55.75 50.00 198.00 148.00 -0.30
## wgt 3 684 36.91 25.90 34.15 35.19 34.46 3.14 117.40 114.26 0.40
## bmi 4 684 17.99 3.05 17.37 17.64 2.57 11.77 31.74 19.97 1.22
## hc 5 684 51.60 5.94 53.20 52.29 5.04 33.70 65.00 31.30 -0.91
## kurtosis se
## age -1.55 0.26
## hgt -1.45 1.77
## wgt -0.96 0.99
## bmi 2.02 0.12
## hc 0.08 0.23
m1 = lm(age ~ hgt + wgt + bmi, data = daBoys)
m2 = lm(age ~ wgt + bmi, data = daBoys)
m3 = lm(age ~ hgt + wgt + bmi + hc, data = daBoys)
Three models, all quite similar: m2 drops hgt from m1, and m3 adds hc to m1.
anova(m1, m2, m3)
## Analysis of Variance Table
##
## Model 1: age ~ hgt + wgt + bmi
## Model 2: age ~ wgt + bmi
## Model 3: age ~ hgt + wgt + bmi + hc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 680 1235.5
## 2 681 1759.8 -1 -524.26 305.74 < 2.2e-16 ***
## 3 679 1164.3 2 595.51 173.65 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table compares each model against the one listed just before it, so the order you pass the models to anova() matters: row 2 tests m2 against m1, and row 3 tests m3 against m2. Df means of course 'degrees of freedom', and it's the change from the previous model: -1 means m2 has one fewer parameter than m1, and 2 means m3 has two more than m2. Sum of Sq is likewise the change in residual sum of squares (negative in row 2 because dropping hgt made the fit worse). Then we have our two important figures: the F-statistic and the p-value for the F-statistic. The F-statistic tells you how much the fits differ, and the p-value (Pr(>F)) tells you whether we should care. A low p-value means yes, we should care: the F-statistic is significant. Together, they tell us whether a model is a significant change from the model before it.
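If you're curious where those F values come from, here's a quick sketch recomputing the m3-vs-m2 row by hand. One detail worth knowing (and worth checking against the table): when anova() is given several models, it scales every row by the residual mean square of the largest model, here m3.

# Recompute the F statistic for the m3 vs m2 row by hand
rss2 = sum(residuals(m2)^2)   # 1759.8, RSS of m2
rss3 = sum(residuals(m3)^2)   # 1164.3, RSS of m3
s2 = rss3 / df.residual(m3)   # scale: residual mean square of the biggest model
((rss2 - rss3) / 2) / s2      # (595.51 / 2) / s2 = 173.65, matching the table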
In our case, both comparisons are significant: dropping hgt makes m2 fit significantly worse than m1, and adding it back along with hc makes m3 fit significantly better than m2. (You can just read the significance codes if you don't want to think about the p-values.)
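One caveat: because the table only compares adjacent models, m3 was never tested against m1 directly. Since m1 is nested inside m3 (m3 just adds hc), you can get that comparison by handing anova() only those two models:

anova(m1, m3)   # does adding hc significantly improve on m1?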