Checking out what intercepts actually represent model1 :
data("warpbreaks")
mod1<-lm(breaks ~ wool+tension, data=warpbreaks)
summary(mod1)
Call:
lm(formula = breaks ~ wool + tension, data = warpbreaks)
Residuals:
Min 1Q Median 3Q Max
-19.500 -8.083 -2.139 6.472 30.722
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.278 3.162 12.423 < 2e-16 ***
woolB -5.778 3.162 -1.827 0.073614 .
tensionM -10.000 3.872 -2.582 0.012787 *
tensionH -14.722 3.872 -3.802 0.000391 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.62 on 50 degrees of freedom
Multiple R-squared: 0.2691, Adjusted R-squared: 0.2253
F-statistic: 6.138 on 3 and 50 DF, p-value: 0.00123
Taking a look at some group means:
aggregate(breaks ~ wool, warpbreaks,mean)
aggregate(breaks ~ tension, warpbreaks,mean)
aggregate(breaks ~ wool*tension, warpbreaks,mean)
If what link 2 said is correct, these two quantities should be equal, and they are
all.equal(aggregate(breaks ~ wool, warpbreaks,mean)[1,2]+aggregate(breaks ~ tension, warpbreaks,mean)[1,2]-mean(warpbreaks$breaks),as.numeric(mod1$coefficients[1]
)) #This value is not equal to the intercept
[1] TRUE
Again, following what link two said (and my intuition), the following quantities between each level and the reference level should also match makes sense
identical((aggregate(breaks ~ wool, warpbreaks,mean)[2,2] - aggregate(breaks ~ wool, warpbreaks,mean)[1,2]),as.numeric(mod1$coefficients[2]
)) #woolB - woolA
[1] FALSE
identical((aggregate(breaks ~ tension, warpbreaks,mean)[3,2] - aggregate(breaks ~ tension, warpbreaks,mean)[1,2]),as.numeric(mod1$coefficients[4]
)) #tensionM - tensionL
[1] FALSE
identical((aggregate(breaks ~ tension, warpbreaks,mean)[4,2] - aggregate(breaks ~ tension, warpbreaks,mean)[1,2]),as.numeric(mod1$coefficients[5]
)) #tensionM -tensionH
[1] TRUE
Following a recommendation from post 1, we can see if an interaction is significant. In their words indicating additivity may not hold
anova(lm(breaks ~ wool*tension, data=warpbreaks))
Analysis of Variance Table
Response: breaks
Df Sum Sq Mean Sq F value Pr(>F)
wool 1 450.7 450.67 3.7653 0.0582130 .
tension 2 2034.3 1017.13 8.4980 0.0006926 ***
wool:tension 2 1002.8 501.39 4.1891 0.0210442 *
Residuals 48 5745.1 119.69
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
At this point, everything matches up with what I have been reading online, but now let’s try another example:
data("mtcars")
mod2<-lm(mpg ~ as.factor(cyl)+as.factor(vs), data=mtcars)
summary(mod2)
Call:
lm(formula = mpg ~ as.factor(cyl) + as.factor(vs), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.201 -1.686 0.000 1.463 7.299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.2901 2.0856 13.085 1.88e-13 ***
as.factor(cyl)6 -7.1535 1.7235 -4.151 0.00028 ***
as.factor(cyl)8 -12.1901 2.2616 -5.390 9.56e-06 ***
as.factor(vs)1 -0.6891 2.0210 -0.341 0.73567
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.273 on 28 degrees of freedom
Multiple R-squared: 0.7336, Adjusted R-squared: 0.705
F-statistic: 25.7 on 3 and 28 DF, p-value: 3.412e-08
Group means:
aggregate(mpg ~ cyl, mtcars,mean)
aggregate(mpg ~ vs, mtcars,mean)
aggregate(mpg ~ vs*cyl, mtcars,mean)
Let’s try and calculate the intercept
#aggregate(mpg ~ cyl, mtcars,mean)[1,2]+aggregate(mpg ~ vs, mtcars,mean)[1,2]-mean(mtcars$mpg)
identical(aggregate(mpg ~ cyl, mtcars,mean)[1,2]+aggregate(mpg ~ vs, mtcars,mean)[1,2]-mean(mtcars$mpg),as.numeric(mod2$coefficients[1]
)) #This value is indeed equal to the intercept
[1] FALSE
For some reason the formula we used last doesn’t seem to be working
#mean cyl4 + mean vs0 - mean(y)
26.66364+16.61667-mean(mtcars$mpg)
[1] 23.18969
Even if the intercept is off, do the rest of the beta values line up with what we would expect?
identical((aggregate(mpg ~ cyl, mtcars,mean)[2,2] - aggregate(mpg ~ cyl, mtcars,mean)[1,2]),as.numeric(mod2$coefficients[2]
)) #cyl 6 - cyl 4
[1] FALSE
identical((aggregate(mpg ~ cyl, mtcars,mean)[3,2] - aggregate(mpg ~ cyl, mtcars,mean)[1,2]),as.numeric(mod2$coefficients[3]
)) #cyl 8 - cyl 4
[1] FALSE
identical((aggregate(mpg ~ vs, mtcars,mean)[2,2] - aggregate(mpg ~ vs, mtcars,mean)[1,2]),as.numeric(mod2$coefficients[4]
)) #vs1 - vs0
[1] FALSE
anova(lm(mpg ~ as.factor(vs)*as.factor(cyl), data=mtcars))
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(vs) 1 496.53 496.53 45.1062 3.283e-07 ***
as.factor(cyl) 2 329.50 164.75 14.9665 4.227e-05 ***
as.factor(vs):as.factor(cyl) 1 2.80 2.80 0.2545 0.618
Residuals 27 297.22 11.01
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In sum, it’s easy for me to interpret the regression coefficients when I include an interaction term with the two dummy coded predictors. However, what totally confused me is how these coefficients are calculated absent an interaction term. This isn’t something I’ve every really thought about because I’m usually working with continuous measurses and interested in interactions. I’m not entirely certain what is going on here.