Here I will explore interpretation details of two particular contrast coding systems for categorical variables (CVs) in generalized linear models (glms): Dummy coding and Deviation coding. I will focus on the interpretation of the parameters derived from the glm under both coding systems, as well as on how interaction terms work. A good resource can be seen here. Coding systems are necessary because, in order to include categorical variables in a glm, we need to convert (or encode) them into numerical variables, since a glm is formulated in terms of numerical variables.
We will work through these systems on a simulated data set inspired by a biological problem.
As said earlier, in order to include a CV in a glm we need to encode it first using numbers. The Dummy coding system is pretty simple. Assume we have a CV \(\textbf{C}\) with \(k\) distinct values \(C_1,C_2,\dots,C_k\); then the Dummy coding system encodes this variable as follows:
\[I_{C_i}(x)= \begin{cases} 1 & x = C_i \\ 0 & x \neq C_i \end{cases} \] for every \(C_i\) other than the reference category (here \(C_1\); this matches R's default of taking the first level of a factor as the reference).
Thus, to encode a CV with \(k\) distinct categories we need \(k-1\) indicator (Dummy) variables. Let us show how this encoding works for the beetles data set. For the case of sex we have
sex | \(X_1\) |
---|---|
F | 0 |
M | 1 |
And for the case of growth status we have
status | \(X_1\) | \(X_2\) |
---|---|---|
A | 0 | 0 |
B | 1 | 0 |
C | 0 | 1 |
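These indicator variables are exactly R's default "treatment" contrasts. A minimal way to inspect them (a sketch; contrasts(z0$sex) and contrasts(z0$status) would show the same matrices, assuming those columns of z0 are factors):
contr.treatment(2)   # dummy coding for a two-level CV such as sex
contr.treatment(3)   # dummy coding for a three-level CV such as growth status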
Now let us show how these encoded CVs can be included in a glm.
In our first example we are going to build a linear model for weight on sex. This glm will specify the expected value of weight conditional on a given sex, that is, \(E(W|Sex=s)\) where \(s \in \{F,M\}\). Since we encoded sex using Dummy coding, we can rewrite this expected value as \(E(W|X_1=x_1)\) where
\[x_1= \begin{cases} 0 & s = F \\ 1 & s = M \end{cases} \]
Since \(X_1\) is now a numerical variable, we can specify a glm as follows:
\[\begin{equation} E(W|Sex=s) = E(W|X_1=x_1)=\beta_0+\beta_1x_1\tag{2.1} \end{equation}\]
Let us explore how this model behaves for each possible sex. First for the reference category females:
\[\begin{equation} E(W|Sex=F) = E(W|X_1=0)=\beta_0+\beta_1\times 0=\beta_0\tag{2.2} \end{equation}\]
that means that the usual intercept is now the conditional expected value of weight for female beetles. Let us now explore the model (2.1) for males:
\[\begin{equation} E(W|Sex=M) = E(W|X_1=1)=\beta_0+\beta_1\times 1=\beta_0+\beta_1\tag{2.3} \end{equation}\]
equation (2.3) implies that
\[\begin{equation} \beta_1= E(W|Sex=M)-\beta_0=E(W|Sex=M)-E(W|Sex=F)\tag{2.4} \end{equation}\]
Therefore \(\beta_1\) (the usual “slope” term) is the mean difference of weight between males and females.
Let us do the same now for the growth status variable. We will now work with a model for \(E(W|SG=sg)\), where \(SG\) denotes growth status and \(sg \in \{A,B,C\}\). Following table 2.3, we now need two dummy variables, \(x_1\) and \(x_2\), where
\[x_1= \begin{cases}1 & sg = B \\ 0 & sg \neq B \end{cases} \]
and \[x_2= \begin{cases}1 & sg = C \\ 0 & sg \neq C \end{cases} \]
Therefore
\[\begin{equation} E(W|SG=sg) = E(W|X_1=x_1,X_2=x_2)=\beta_0+\beta_1x_1+\beta_2x_2\tag{2.5} \end{equation}\]
As before, let us explore the model behavior for each of the possible growth statuses. First for the reference category A:
\[\begin{equation} E(W|SG=A) = E(W|X_1=0,X_2=0)=\beta_0+\beta_1\times 0+\beta_2\times 0=\beta_0\tag{2.6} \end{equation}\]
As before, the intercept is just the expected value of weight for the reference category A. Now for category B:
\[\begin{equation} E(W|SG=B) = E(W|X_1=1,X_2=0)=\beta_0+\beta_1\times 1+\beta_2\times 0=\beta_0+\beta_1\tag{2.7} \end{equation}\]
Which implies \[\begin{equation} \beta_1= E(W|SG=B)-\beta_0 = E(W|SG=B)-E(W|SG=A)\tag{2.8} \end{equation}\]
That is the mean weight difference between groups B and A. Similarly, for group C:
\[\begin{equation} E(W|SG=C) = E(W|X_1=0,X_2=1)=\beta_0+\beta_1\times 0+\beta_2\times 1=\beta_0+\beta_2\tag{2.9} \end{equation}\]
Which implies \[\begin{equation} \beta_2= E(W|SG=C)-\beta_0 = E(W|SG=C)-E(W|SG=A)\tag{2.10} \end{equation}\]
That is the mean weight difference between groups C and A.
Let us now fit these models in R. Our first model will be weight on sex:
mws<-lm(weight~sex,data=z0)
summary(mws)
##
## Call:
## lm(formula = weight ~ sex, data = z0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.913 -18.577 -3.502 16.134 65.057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.747 1.486 42.216 < 2e-16 ***
## sexM 6.413 2.115 3.033 0.00255 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.64 on 498 degrees of freedom
## Multiple R-squared: 0.01813, Adjusted R-squared: 0.01616
## F-statistic: 9.198 on 1 and 498 DF, p-value: 0.00255
You can check that (Intercept) is just the mean weight for the reference sex F, as stated in equation (2.2), while the estimate for sexM is just the mean difference in weight between males and females, as stated in equation (2.4). You should verify this by computing the means and making the appropriate subtractions from the data itself.
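For instance, a minimal way to do that check (assuming, as in the model call above, that z0 contains the weight and sex columns):
mu_sex <- with(z0, tapply(weight, sex, mean))   # group means E(W|Sex=F) and E(W|Sex=M)
mu_sex["M"] - mu_sex["F"]                       # should reproduce the sexM estimate, cf. (2.4)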
The same applies for growth status:
mwsg<-lm(weight~status,data=z0)
summary(mwsg)
##
## Call:
## lm(formula = weight ~ status, data = z0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.536 -7.455 -0.360 7.316 42.057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.2446 0.9779 43.20 <2e-16 ***
## statusB 22.2481 1.3709 16.23 <2e-16 ***
## statusC 49.9156 1.3980 35.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.64 on 497 degrees of freedom
## Multiple R-squared: 0.72, Adjusted R-squared: 0.7189
## F-statistic: 639.1 on 2 and 497 DF, p-value: < 2.2e-16
Compare these results with equations (2.6), (2.8) and (2.10) respectively.
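It is also instructive to look at the design matrix that lm() built behind the scenes; its statusB and statusC columns are exactly the dummy variables of table 2.3 (a sketch reusing the mwsg fit from above):
head(model.matrix(mwsg))                        # columns: (Intercept), statusB, statusC
# equivalently: head(model.matrix(~ status, data = z0))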
For a given dependent variable (DV) \(Y\) and a CV \(C\) with \(k\) categories \(C_1,C_2,\dots,C_k\) we will have the following linear model:
\[\begin{equation} E(Y|C=c) = \\ E(Y|X_1=x_1,X_2=x_2,\dots,X_{k-1}=x_{k-1})=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_{k-1}x_{k-1}\tag{2.11} \end{equation}\]
where \(C_k\) is the reference category (the labelling of the reference is arbitrary; in the earlier examples it was the first category) and \[x_i= \begin{cases}1 & c = C_i \\ 0 & c \neq C_i \end{cases} \] for all \(i \neq k\). For this model it can be shown that
\[\begin{equation} E(Y|C=C_k) = \beta_0 \\ E(Y|C=C_i) = \beta_0+\beta_i \; for \; i\neq k \\ \beta_i =E(Y|C=C_i)-E(Y|C=C_k) \; for \; i\neq k \\ \tag{2.12} \end{equation}\]
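As a small generic check of (2.12), here is a sketch with a toy four-level CV (purely hypothetical data, not the beetles set). Note that lm() takes the first factor level as the reference, so C1 below plays the role that (2.12) labels \(C_k\):
set.seed(1)
g <- factor(sample(paste0("C", 1:4), 200, replace = TRUE))
y <- rnorm(200, mean = c(C1 = 10, C2 = 12, C3 = 9, C4 = 15)[as.character(g)])
coef(lm(y ~ g))      # intercept = mean of C1; gC2, gC3, gC4 = mean differences vs C1
tapply(y, g, mean)   # compare by hand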
The general linear model formulation using linear algebra goes as follows:
\[\begin{equation} E(\textbf{Y}|\textbf{X}=X) = X\beta \tag{2.13} \end{equation}\]
where \(\textbf{Y}_{n \times 1}\) is the vector of responses of the dependent variable, \(X_{n \times p}\) is known as the design matrix with \(p-1\) covariates, and \(\beta_{p \times 1}=(\beta_0,\beta_1,\dots,\beta_{p-1})^T\) is the parameter vector. We are working with \(p-1\) covariates since the first column of \(X_{n \times p}\) is a vector of constant ones (associated with the intercept parameter \(\beta_0\)).
We are going to show how equation (2.12) is derived. Assume we have \(E(\textbf{Y}|\textbf{C}=C)=(E(Y|C=C_k),E(Y|C=C_1),\dots,E(Y|C=C_{k-1}))^T\). Then the design matrix \(X\) has the following form:
\[\begin{equation} X_{k \times k}=\begin{pmatrix} 1 & 0 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 1 & 0 & \cdots & 0 \\ \vdots &\cdots & \cdots &\ddots & \vdots & \vdots\\ \vdots &\cdots & \cdots &\cdots & \ddots & \vdots\\ 1 & 0 & 0 & \cdots & 0 & 1 \\ \end{pmatrix} \tag{2.14} \end{equation}\] It can be shown that
\[\begin{equation} X_{k \times k}^{-1}=\begin{pmatrix} 1 & 0 & 0 & 0 & \cdots & 0 \\ -1 & 1 & 0 & 0 & \cdots & 0 \\ -1 & 0 & 1 & 0 & \cdots & 0 \\ \vdots &\cdots & \cdots &\ddots & \vdots & \vdots\\ \vdots &\cdots & \cdots &\cdots & \ddots & \vdots\\ -1 & 0 & 0 & \cdots & 0 & 1 \\ \end{pmatrix} \tag{2.15} \end{equation}\]
so
\[\begin{equation} \beta = X_{k \times k}^{-1}E(\textbf{Y}|\textbf{C}=C)= \\ = (E(Y|C=C_k),E(Y|C=C_1)-E(Y|C=C_k),\dots,E(Y|C=C_{k-1})-E(Y|C=C_k))^T \tag{2.16} \end{equation}\]
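Here is a minimal numerical check of (2.14)-(2.16) for the growth status model (a sketch assuming only base R and the z0 data frame used above):
mu_sg <- with(z0, tapply(weight, status, mean))   # (E(W|SG=A), E(W|SG=B), E(W|SG=C))
X3    <- cbind(1, contr.treatment(3))             # the design matrix of (2.14) for k = 3
solve(X3)                                         # matches (2.15)
drop(solve(X3) %*% mu_sg)                         # beta as in (2.16); compare with coef(mwsg)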
In this section we will explore the behavior of a model with two CVs as covariates and how its parameters can be interpreted in the case of an additive model as well as in the case of interactions. First we will explore an additive model.
We will keep working with the coding system presented in tables 2.1 and 2.3. We now want to address the following expected values: \(E(W|Sex=s,SG=sg)\). Using the dummy encodings referenced above, we will have the following linear model:
\[\begin{equation} E(W|Sex=s,SG=sg)= \\ E(W|I_{sex-M}=x_1,I_{sg-B}=x_2,I_{sg-C}=x_3)=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3 \tag{2.17} \end{equation}\]
As before, we will evaluate the model (2.17) in all the possible combinations of \(Sex\) and \(SG\).
\[\begin{equation} E(W|Sex=F,SG=A)= \\ E(W|I_{sex-M}=0,I_{sg-B}=0,I_{sg-C}=0)=\beta_0 \tag{2.18} \end{equation}\]
Therefore \(\beta_0\) is just the expected value of weight for females and growth status \(A\).
\[\begin{equation} E(W|Sex=M,SG=A)= \\ E(W|I_{sex-M}=1,I_{sg-B}=0,I_{sg-C}=0)=\beta_0+\beta_1 \tag{2.19} \end{equation}\]
therefore
\[\begin{equation} \beta_1 = E(W|Sex=M,SG=A) - E(W|Sex=F,SG=A) \\\tag{2.20} \end{equation}\]
\[\begin{equation} E(W|Sex=F,SG=B)= \\ E(W|I_{sex-M}=0,I_{sg-B}=1,I_{sg-C}=0)=\beta_0+\beta_2 \tag{2.21} \end{equation}\]
therefore
\[\begin{equation} \beta_2 = E(W|Sex=F,SG=B) - E(W|Sex=F,SG=A) \\\tag{2.22} \end{equation}\]
\[\begin{equation} E(W|Sex=M,SG=B)= \\ E(W|I_{sex-M}=1,I_{sg-B}=1,I_{sg-C}=0)=\beta_0+\beta_1+\beta_2 \tag{2.23} \end{equation}\]
\[\begin{equation} E(W|Sex=F,SG=C)= \\ E(W|I_{sex-M}=0,I_{sg-B}=0,I_{sg-C}=1)=\beta_0+\beta_3 \tag{2.24} \end{equation}\]
therefore
\[\begin{equation} \beta_3 = E(W|Sex=F,SG=C) - E(W|Sex=F,SG=A) \\\tag{2.25} \end{equation}\]
\[\begin{equation} E(W|Sex=M,SG=C)= \\ E(W|I_{sex-M}=1,I_{sg-B}=0,I_{sg-C}=1)=\beta_0+\beta_1+\beta_3 \end{equation}\]
Note how equations (2.20), (2.22) and (2.25) give meaning to the corresponding parameters: they are mean differences across sex or growth status with respect to the corresponding reference categories (sex female and growth status \(A\)). Let us take a closer look at, for example, result (2.20): \(\beta_1\) is the mean difference of males vs. females among beetles with \(A\) as their growth status (the reference category for \(SG\)). Something similar is happening for \(\beta_2\) and \(\beta_3\): they compare \(SG\) levels \(B\) and \(C\) against \(A\) for the reference sex, female.
However, what happens if we ask for the following contrast (not encoded as a parameter in this model):
\[E(W|Sex=M,SG=B) - E(W|Sex=F,SG=B)\] Using model equations (2.23) and (2.21) it can be shown that \[\begin{equation} E(W|Sex=M,SG=B) - E(W|Sex=F,SG=B) = \beta_0+\beta_1+\beta_2-(\beta_0+\beta_2)=\beta_1 \tag{2.26} \end{equation}\]
Studying similar contrasts, it can be shown that:
\[\begin{equation} \beta_1=E(W|Sex=M,SG=A) - E(W|Sex=F,SG=A) \\ =E(W|Sex=M,SG=B) - E(W|Sex=F,SG=B) \\ =E(W|Sex=M,SG=C) - E(W|Sex=F,SG=C) \tag{2.27} \end{equation}\]
and
\[\begin{equation} \beta_2=E(W|Sex=F,SG=B) - E(W|Sex=F,SG=A) \\ =E(W|Sex=M,SG=B) - E(W|Sex=M,SG=A) \\ \tag{2.28} \end{equation}\]
and
\[\begin{equation} \beta_3=E(W|Sex=F,SG=C) - E(W|Sex=F,SG=A)\\ =E(W|Sex=M,SG=C) - E(W|Sex=M,SG=A)\\ \tag{2.29} \end{equation}\]
In short, equations (2.27), (2.28) and (2.29) show that the effects of \(Sex\) and \(SG\) are independent of each other, allowing us to rewrite them as
\[\begin{equation} \beta_1=E(W|Sex=M) - E(W|Sex=F)\\ \beta_2=E(W|SG=B) - E(W|SG=A)\\ \beta_3=E(W|SG=C) - E(W|SG=A)\\ \tag{2.30} \end{equation}\]
Let us fit this model to our data set:
mwsexsg<-lm(weight~sex+status,data=z0)
summary(mwsexsg)
##
## Call:
## lm(formula = weight ~ sex + status, data = z0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.233 -7.908 -0.237 7.943 38.670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.879 1.092 35.614 < 2e-16 ***
## sexM 6.691 1.091 6.133 1.76e-09 ***
## statusB 22.365 1.323 16.903 < 2e-16 ***
## statusC 49.977 1.349 37.043 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 496 degrees of freedom
## Multiple R-squared: 0.7398, Adjusted R-squared: 0.7382
## F-statistic: 470 on 3 and 496 DF, p-value: < 2.2e-16
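As a quick illustration of (2.27): under this additive fit, the predicted male-female difference is the same within every growth status group. A sketch using predict() (it assumes the sex and status levels are coded F/M and A/B/C, as in the tables above):
newd  <- expand.grid(sex = c("F", "M"), status = c("A", "B", "C"))
preds <- matrix(predict(mwsexsg, newdata = newd), nrow = 2,
                dimnames = list(sex = c("F", "M"), status = c("A", "B", "C")))
preds["M", ] - preds["F", ]                     # each entry equals coef(mwsexsg)["sexM"]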
Let us also compare these estimates with the previous ones obtained in models mws and mwsg:
model_list<-list(mwsexsg=mwsexsg,mws=mws,mwsg=mwsg)
model_list %>% map(coef)
## $mwsexsg
## (Intercept) sexM statusB statusC
## 38.879016 6.691093 22.364843 49.977436
##
## $mws
## (Intercept) sexM
## 62.747171 6.413481
##
## $mwsg
## (Intercept) statusB statusC
## 42.24460 22.24812 49.91558
Some say that model mwsexsg estimates effects of sex adjusted by growth status and, vice versa, effects of growth status adjusted by sex. Those same people point out that the differences in parameter estimates seen when comparing the three models built so far are the result of this correction or adjustment. The truth is that model mwsexsg assumes the effects of \(Sex\) and \(SG\) are independent of each other. Next we will see how this assumption can be dropped, allowing the effects to depend on each other.
Let \[\begin{equation} E(\textbf{Y}|\textbf{X}=X)= \\ \left(E(Y|Sex=F,SG=A),E(Y|Sex=M,SG=A),E(Y|Sex=F,SG=B),\\ E(Y|Sex=M,SG=B),E(Y|Sex=F,SG=C),E(Y|Sex=M,SG=C)\right)^T \tag{2.31} \end{equation}\]
As before \(E(\textbf{Y}|\textbf{X}=X)=X\beta\) where
\[\begin{equation} X=\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ \end{pmatrix} \tag{2.32} \end{equation}\]
To solve for \(\beta\) we get: \[\begin{equation} \beta=\left[ \left( X^TX\right)^{-1}X^T\right]E(\textbf{Y}|\textbf{X}=X) \tag{2.33} \end{equation}\]
in particular \[\begin{equation} \left( X^TX \right)^{-1}X^T=\begin{pmatrix} 2/3 & 1/3 & 1/6 & -1/6 & 1/6 & - 1/6 \\ -1/3 & 1/3 & -1/3 & 1/3 & -1/3 & 1/3 \\ -1/2 & -1/2 & 1/2 & 1/2 & 0 & 0 \\ -1/2 & -1/2 & 0 & 0 & 1/2 & 1/2 \\ \end{pmatrix} \tag{2.34} \end{equation}\]
Therefore it can be seen, for example from the second row of (2.34), that solving for \(\beta_1\) gives: \[\begin{equation} \beta_1= \frac{E(Y|Sex=M,SG=A)-E(Y|Sex=F,SG=A)}{3}+ \\ \frac{E(Y|Sex=M,SG=B)-E(Y|Sex=F,SG=B)}{3}+ \\ \frac{E(Y|Sex=M,SG=C)-E(Y|Sex=F,SG=C)}{3} \tag{2.35} \end{equation}\]
That is, \(\beta_1\) is the average of the mean differences of males vs. females across all growth status groups. A similar result can be shown for \(\beta_2\) and \(\beta_3\). This, however, assumes that \(E(\textbf{Y}|\textbf{X}=X)\) can take arbitrary values, which is not the case if it follows equation (2.17).
It is also worth noting that if we exclude the rows of (2.32) with more than two ones, we recover results (2.20), (2.22) and (2.25) in a similar way as in (2.16), since removing those rows leaves a square invertible matrix of the form (2.14), whose inverse has the form (2.15).
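For completeness, here is a quick numerical check of (2.34), writing the design matrix of (2.32) out by hand (a sketch; MASS::fractions is used only to pretty-print the result as fractions and is an extra assumption):
X6 <- cbind(1,
            sexM    = c(0, 1, 0, 1, 0, 1),
            statusB = c(0, 0, 1, 1, 0, 0),
            statusC = c(0, 0, 0, 0, 1, 1))
MASS::fractions(solve(t(X6) %*% X6) %*% t(X6))  # reproduces the matrix in (2.34)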
We are now going to explore a model with interaction terms between \(Sex\) and \(SG\):
\[\begin{equation} E(W|Sex=s,SG=sg)= E(W|I_{sex-M}=x_1,I_{sg-B}=x_2,I_{sg-C}=x_3)=\\ \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4(x_1x_2)+\beta_5(x_1x_3) \tag{2.36} \end{equation}\]
Parameters \(\beta_4\) and \(\beta_5\), associated with the products \(x_1x_2\) and \(x_1x_3\) respectively, are the new interaction terms. Let us evaluate the model at each of the six possible combinations as before:
\[\begin{equation} E(W|Sex=F,SG=A)= \\ E(W|I_{sex-M}=0,I_{sg-B}=0,I_{sg-C}=0)=\beta_0 \tag{2.37} \end{equation}\]
Therefore \(\beta_0\) is just the expected value of weight for females and growth status \(A\) (the same result as in the additive model).
\[\begin{equation} E(W|Sex=M,SG=A)= \\ E(W|I_{sex-M}=1,I_{sg-B}=0,I_{sg-C}=0)=\beta_0+\beta_1 \tag{2.38} \end{equation}\]
therefore
\[\begin{equation} \beta_1 = E(W|Sex=M,SG=A) - E(W|Sex=F,SG=A) \\\tag{2.39} \end{equation}\]
\[\begin{equation} E(W|Sex=F,SG=B)= \\ E(W|I_{sex-M}=0,I_{sg-B}=1,I_{sg-C}=0)=\beta_0+\beta_2 \tag{2.40} \end{equation}\]
therefore
\[\begin{equation} \beta_2 = E(W|Sex=F,SG=B) - E(W|Sex=F,SG=A) \\\tag{2.41} \end{equation}\]
\[\begin{equation} E(W|Sex=M,SG=B)= \\ E(W|I_{sex-M}=1,I_{sg-B}=1,I_{sg-C}=0)=\beta_0+\beta_1+\beta_2+\beta_4 \tag{2.42} \end{equation}\]
\[\begin{equation} E(W|Sex=F,SG=C)= \\ E(W|I_{sex-M}=0,I_{sg-B}=0,I_{sg-C}=1)=\beta_0+\beta_3 \tag{2.43} \end{equation}\]
therefore
\[\begin{equation} \beta_3 = E(W|Sex=F,SG=C) - E(W|Sex=F,SG=A) \\\tag{2.44} \end{equation}\]
\[\begin{equation} E(W|Sex=M,SG=C)= \\ E(W|I_{sex-M}=1,I_{sg-B}=0,I_{sg-C}=1)=\beta_0+\beta_1+\beta_3+\beta_5 \tag{2.45} \end{equation}\]
We need to work a little more to get a meaning for the interaction parameters \(\beta_4\) and \(\beta_5\):
\[\beta_4=E(W|Sex=M,SG=B)-\beta_0-\beta_1-\beta_2 \\ \Leftrightarrow \\ \beta_4=E(W|Sex=M,SG=B) -E(W|Sex=F,SG=A) \\ - (E(W|Sex=M,SG=A)-E(W|Sex=F,SG=A)) \\ - (E(W|Sex=F,SG=B)-E(W|Sex=F,SG=A)) \\ \Leftrightarrow \\ \beta_4=E(W|Sex=M,SG=B) +E(W|Sex=F,SG=A) \\ -E(W|Sex=M,SG=A)-E(W|Sex=F,SG=B)\] which can be rewritten as:
\[\begin{equation} \beta_4=E(W|Sex=M,SG=B)-E(W|Sex=M,SG=A)\\ -(E(W|Sex=F,SG=B)-E(W|Sex=F,SG=A)) \\ OR \\ \beta_4= E(W|Sex=M,SG=B)-E(W|Sex=F,SG=B) \\ -(E(W|Sex=M,SG=A)-E(W|Sex=F,SG=A)) \tag{2.46} \end{equation}\]
We can say that \(\beta_4\) is a measure of the difference in the effect of sex when comparing \(SG\) groups B vs. A, or, vice versa, a measure of the difference in the effect of \(SG\) (B vs. A) when comparing males vs. females.
A similar result can be obtained for \(\beta_5\); can you derive it?
\[\begin{equation} \beta_5=E(W|Sex=M,SG=C)-E(W|Sex=M,SG=A)\\ -(E(W|Sex=F,SG=C)-E(W|Sex=F,SG=A)) \\ OR \\ \beta_5= E(W|Sex=M,SG=C)-E(W|Sex=F,SG=C) \\ -(E(W|Sex=M,SG=A)-E(W|Sex=F,SG=A)) \tag{2.47} \end{equation}\]
Let us fit this model in R:
mwsexsgint<-lm(weight~sex*status, data=z0)
summary(mwsexsgint)
##
## Call:
## lm(formula = weight ~ sex * status, data = z0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.1624 -6.1178 0.2209 6.3394 28.3450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.583 1.094 40.760 <2e-16 ***
## sexM -4.650 1.542 -3.015 0.0027 **
## statusB 20.506 1.521 13.485 <2e-16 ***
## statusC 34.203 1.556 21.976 <2e-16 ***
## sexM:statusB 3.421 2.163 1.582 0.1143
## sexM:statusC 31.736 2.205 14.393 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.965 on 494 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.8252
## F-statistic: 472.2 on 5 and 494 DF, p-value: < 2.2e-16
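Finally, we can check interpretations (2.46) and (2.47) directly against the raw cell means (a minimal sketch reusing z0; since this model is saturated in sex and status, the differences of differences should match the interaction estimates exactly, up to rounding):
cell <- with(z0, tapply(weight, list(sex, status), mean))
cell                                            # 2 x 3 table of cell means E(W|Sex,SG)
(cell["M", "B"] - cell["F", "B"]) - (cell["M", "A"] - cell["F", "A"])   # cf. sexM:statusB
(cell["M", "C"] - cell["F", "C"]) - (cell["M", "A"] - cell["F", "A"])   # cf. sexM:statusC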