R is sort of a sneaky bastard when it comes to specifying contrasts and makes it easy to confuse contrast parameterization with contrast transformations. I’ll start with how it would seem one would specify contrasts.
data("warpbreaks")
head(warpbreaks[1:5,])
apply(warpbreaks[,-1],2, function (x) length(unique(x))) # we can see wool has 2 levels and tension 3
wool tension
2 3
aggregate(breaks ~ tension, data= warpbreaks, mean)
lets specify some contrasts to look at for the tension variable:
levels(warpbreaks$tension)
[1] "L" "M" "H"
lVSmh<-c(1,-.5,-.5)
mVSh<-c(0,1,-1)
contrasts(warpbreaks$tension)<-cbind(lVSmh,mVSh)
print(contrasts(warpbreaks$tension)) # looks good
lVSmh mVSh
L 1.0 0
M -0.5 1
H -0.5 -1
sum(apply(contrasts(warpbreaks$tension),1,function (x) prod(x) )) #check to see if contrasts are orthogonal
[1] 0
m1<-lm(breaks ~ tension, data = warpbreaks)
summary(m1)
Call:
lm(formula = breaks ~ tension, data = warpbreaks)
Residuals:
Min 1Q Median 3Q Max
-22.389 -8.139 -2.667 6.333 33.611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.148 1.617 17.410 < 2e-16 ***
tensionlVSmh 8.241 2.286 3.604 0.000711 ***
tensionmVSh 2.361 1.980 1.192 0.238614
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-squared: 0.2203, Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753
#the intercept should equal the grand mean
cat("mean breaks:",mean(warpbreaks$breaks),"calculated intercept:" ,as.numeric(m1$coefficients[1]))
mean breaks: 28.14815 calculated intercept: 28.14815
#the first beta coefficient should equal the difference between mean(l) and (m+h)/2 which is equal to 36.38889- 24.02778 = 12.36111
cat("mean(l)-(m+h)/2:",aggregate(breaks ~ tension, data = warpbreaks, mean)[1,2]-((aggregate(breaks ~ tension, data = warpbreaks, mean)[2,2]+aggregate(breaks ~ tension, data = warpbreaks, mean)[3,2])/2),"calculated beta:",as.numeric(m1$coefficients[2]) ) # off by about four
mean(l)-(m+h)/2: 12.36111 calculated beta: 8.240741
#the second beta coef should equal m - h , 26.38889- 21.66667 = 4.7222
cat("mean(m)-mean(h):",aggregate(breaks ~ tension, data = warpbreaks, mean)[2,2] - aggregate(breaks ~ tension, data = warpbreaks, mean)[3,2],"calculated beta:",as.numeric(m1$coefficients[3]) ) ##off by two or so
mean(m)-mean(h): 4.722222 calculated beta: 2.361111
Well something is clearly off from the output above, turns out with missed a critical step which is intverting the contrast matrix:
library(MASS)
mat<-rbind(lVSmh,mVSh)
cmat<-ginv(mat) #take the inverse
contrasts(warpbreaks$tension)<-cmat
m2<-lm(breaks ~ tension, data = warpbreaks)
summary(m2)
Call:
lm(formula = breaks ~ tension, data = warpbreaks)
Residuals:
Min 1Q Median 3Q Max
-22.389 -8.139 -2.667 6.333 33.611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.148 1.617 17.410 < 2e-16 ***
tension1 12.361 3.430 3.604 0.000711 ***
tension2 4.722 3.960 1.192 0.238614
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-squared: 0.2203, Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753
cat("mean breaks:",mean(warpbreaks$breaks),"calculated intercept:" , as.numeric(m2$coefficients[1])) #TRUE
mean breaks: 28.14815 calculated intercept: 28.14815
#the first beta coefficient should equal the difference between mean(l) and (m+h)/2
cat("mean(l)-(m+h)/2:",aggregate(breaks ~ tension, data = warpbreaks, mean)[1,2]-((aggregate(breaks ~ tension, data = warpbreaks, mean)[2,2]+aggregate(breaks ~ tension, data = warpbreaks, mean)[3,2])/2),"calculated beta:",as.numeric(m2$coefficients[2]) ) # TRUE
mean(l)-(m+h)/2: 12.36111 calculated beta: 12.36111
#the second beta coef should equal m - h
cat("mean(m)-mean(h):",aggregate(breaks ~ tension, data = warpbreaks, mean)[2,2] - aggregate(breaks ~ tension, data = warpbreaks, mean)[3,2],"calculated beta:",as.numeric(m2$coefficients[3]) )
mean(m)-mean(h): 4.722222 calculated beta: 4.722222
A similar approach that also does the trick with base R assuming you specify a square matrix (i.e., k-1 columns plus a column for the intercept…in this case, 3x3)
mat<-rbind(intercept=c(1,1,1),lVSmh,mVSh)
inv.mat<-solve(mat) #take the inverse, need a square matrix
contrasts(warpbreaks$tension)<-inv.mat[,-1] #drop the intercept
m3<-lm(breaks ~ tension, data = warpbreaks)
summary(m3)
Call:
lm(formula = breaks ~ tension, data = warpbreaks)
Residuals:
Min 1Q Median 3Q Max
-22.389 -8.139 -2.667 6.333 33.611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.148 1.617 17.410 < 2e-16 ***
tensionlVSmh 12.361 3.430 3.604 0.000711 ***
tensionmVSh 4.722 3.960 1.192 0.238614
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-squared: 0.2203, Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753
all.equal(as.numeric(m3$coefficients[1]),as.numeric(m2$coefficients[1]))
[1] TRUE
all.equal(as.numeric(m3$coefficients[2]),as.numeric(m2$coefficients[2]))
[1] TRUE
all.equal(as.numeric(m3$coefficients[3]),as.numeric(m2$coefficients[3]))
[1] TRUE
However, it is important to note that while the betas could be off by some multiplicative factor the resulting se, t and p-values will be correct. As long are you’re specifying orthogonal contrasts your results sans the betas will be valid. However, I think it is good practice to get in the habit of inverting the contrast matrix because, as we will see shortly, it becomes much more important when specifying non-orthogonal contrasts.
easy and correct way to do deviation or sum coding within r. Here, we will grand the overall mean to a level of the ind var
hsb <- read_csv("C:/Users/dasil/Desktop/hsb2_cat.csv")
Parsed with column specification:
cols(
id = col_integer(),
gender = col_character(),
race = col_character(),
ses = col_character(),
schtyp = col_character(),
prog = col_character(),
read = col_integer(),
write = col_integer(),
math = col_integer(),
science = col_integer(),
socst = col_integer()
)
library(MASS)
hsb$race<-as.factor(hsb$race)
hsb$race<-relevel(hsb$race,ref="hispanic")
contrasts(hsb$race)<-contr.sum(4)
m4<-lm(write ~ race, data=hsb)
summary(m4)
Call:
lm(formula = write ~ race, data = hsb)
Residuals:
Min 1Q Median 3Q Max
-23.0552 -5.4583 0.9724 7.0000 18.8000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.619 < 2e-16 ***
race1 -5.2200 1.6314 -3.200 0.00160 **
race2 -3.4784 1.7323 -2.008 0.04602 *
race3 6.3216 2.1603 2.926 0.00384 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 9.025 on 196 degrees of freedom
Multiple R-squared: 0.1071, Adjusted R-squared: 0.0934
F-statistic: 7.833 on 3 and 196 DF, p-value: 5.785e-05
cat("grandmean:",mean(aggregate(write ~ race, hsb, mean)$write),"calculated intercept",as.numeric(m4$coefficients[1]))
grandmean: 51.67838 calculated intercept 51.67838
##all checks out!
cat(aggregate(write ~ race, hsb,mean)[1,2]-as.numeric(m4$coefficients[1]) ,as.numeric(m4$coefficients[2]))
-5.220043 -5.220043
cat(aggregate(write ~ race, hsb,mean)[2,2]-as.numeric(m4$coefficients[1]) ,as.numeric(m4$coefficients[3]))
-3.478376 -3.478376
cat(aggregate(write ~ race, hsb,mean)[3,2]-as.numeric(m4$coefficients[1]) ,as.numeric(m4$coefficients[4]))
6.321624 6.321624
Now, we can try and specify the contrast how they would appear and fit a model
intercept<-c( 0.25, 0.25, 0.25, 0.25)
con1<-c( 0.75, -0.25, -0.25, -0.25)
con2<-c(-0.25, 0.75, -0.25, -0.25)
con3<-c(-0.25, -0.25, 0.75, -0.25)
contrasts(hsb$race)<-cbind(con1,con2,con3)
m5<-lm(write ~ race, data=hsb)
summary(m5)
Call:
lm(formula = write ~ race, data = hsb)
Residuals:
Min 1Q Median 3Q Max
-23.0552 -5.4583 0.9724 7.0000 18.8000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.619 < 2e-16 ***
racecon1 -7.5968 1.9889 -3.820 0.000179 ***
racecon2 -5.8552 2.1528 -2.720 0.007118 **
racecon3 3.9448 2.8225 1.398 0.163803
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 9.025 on 196 degrees of freedom
Multiple R-squared: 0.1071, Adjusted R-squared: 0.0934
F-statistic: 7.833 on 3 and 196 DF, p-value: 5.785e-05
cat("Calculated:", m4$coefficients[2],"R coded:", m5$coefficients[2])
Calculated: -5.220043 R coded: -7.596839
cat("Calculated:", m4$coefficients[3],"R coded:", m5$coefficients[3])
Calculated: -3.478376 R coded: -5.855172
cat("Calculated:", m4$coefficients[4],"R coded:", m5$coefficients[4])
Calculated: 6.321624 R coded: 3.944828
we can see the betas are off, but what about the SE, t-values and p’s?
summary(m4)[4]
$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.678376 0.982122 52.619098 1.431984e-117
race1 -5.220043 1.631408 -3.199716 1.604633e-03
race2 -3.478376 1.732305 -2.007947 4.602191e-02
race3 6.321624 2.160314 2.926252 3.835918e-03
summary(m5)[4]
$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.678376 0.982122 52.619098 1.431984e-117
racecon1 -7.596839 1.988870 -3.819677 1.792682e-04
racecon2 -5.855172 2.152760 -2.719845 7.118082e-03
racecon3 3.944828 2.822504 1.397634 1.638029e-01
Ok, so we saw the entire output is off. Now, we should be able to invert the contrast matrix and see if our output matches that from base R for deviation coding:
mat<-ginv(rbind(con1,con2,con3))
contrasts(hsb$race)<-mat
m6<-lm(write ~ race, data=hsb)
summary(m6)
Call:
lm(formula = write ~ race, data = hsb)
Residuals:
Min 1Q Median 3Q Max
-23.0552 -5.4583 0.9724 7.0000 18.8000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.619 < 2e-16 ***
race1 -5.2200 1.6314 -3.200 0.00160 **
race2 -3.4784 1.7323 -2.008 0.04602 *
race3 6.3216 2.1603 2.926 0.00384 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 9.025 on 196 degrees of freedom
Multiple R-squared: 0.1071, Adjusted R-squared: 0.0934
F-statistic: 7.833 on 3 and 196 DF, p-value: 5.785e-05
cat("coded by hand:", m6$coefficients[2],"R coded:", m4$coefficients[2])
coded by hand: -5.220043 R coded: -5.220043
cat("coded by hand:", m6$coefficients[3],"R coded:", m4$coefficients[3])
coded by hand: -3.478376 R coded: -3.478376
cat("coded by hand:", m6$coefficients[4],"R coded:", m4$coefficients[4])
coded by hand: 6.321624 R coded: 6.321624
So, we can see that our own coded contrasts match the output of r. R is doing the same thing behind the scences when we ask for deviation coded contrasts. Finally, we can also perform the same sort of operation in base R. Here, we will need a vector corresponding to intercepts to complete the square contrasts matrix.
mat<-solve(rbind(intercept,con1,con2,con3))
contrasts(hsb$race)<-mat[,-1] #leave off intercept
m7<-lm(write ~ race, data=hsb)
summary(m7)
Call:
lm(formula = write ~ race, data = hsb)
Residuals:
Min 1Q Median 3Q Max
-23.0552 -5.4583 0.9724 7.0000 18.8000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.619 < 2e-16 ***
racecon1 -5.2200 1.6314 -3.200 0.00160 **
racecon2 -3.4784 1.7323 -2.008 0.04602 *
racecon3 6.3216 2.1603 2.926 0.00384 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 9.025 on 196 degrees of freedom
Multiple R-squared: 0.1071, Adjusted R-squared: 0.0934
F-statistic: 7.833 on 3 and 196 DF, p-value: 5.785e-05
cat("coded by hand:", m7$coefficients[2],"R coded:", m4$coefficients[2])
coded by hand: -5.220043 R coded: -5.220043
cat("coded by hand:", m7$coefficients[3],"R coded:", m4$coefficients[3])
coded by hand: -3.478376 R coded: -3.478376
cat("coded by hand:", m7$coefficients[4],"R coded:", m4$coefficients[4])
coded by hand: 6.321624 R coded: 6.321624