the categorical variables are recoded into a set of separate binary variables. This recoding is called “dummy coding” and produces a table called a contrast matrix.

load packages

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

We’ll use the Salaries data set [car package], which contains the 2008-09 nine-month academic salaries of Assistant Professors, Associate Professors and Professors at a college in the U.S.

The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members.
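# Load the data. stringsAsFactors = TRUE keeps sex, rank and discipline as factors,
# which contrasts() and relevel() need below (read.csv() stopped doing this by default in R 4.0.0).
Salaries1 <- read.csv("Salaries.csv", stringsAsFactors = TRUE)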
sample_n(Salaries1,3)
##   rank discipline yrs.since.phd yrs.service  sex salary
## 1 Prof          B            46          45 Male  67559
## 2 Prof          A            14           9 Male 108100
## 3 Prof          B            27          14 Male 147349

Categorical Variable with two levels

Based on the sex variable, we can create a new dummy variable that takes the value:

1 if a person is male

0 if a person is female

and use this variable as a predictor in the regression equation, leading to the following model:

salary = b0 + b1 if the person is male

salary = b0 if the person is female

The coefficients can be interpreted as follows: b0 is the average salary among females, b0 + b1 is the average salary among males, and b1 is the average difference in salary between males and females (males minus females).
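As a quick check of this interpretation, here is a minimal sketch on a small made-up data set (the numbers and the names toy, male, fit and means are illustrative assumptions, not part of the Salaries data). It builds the 0/1 dummy variable by hand and compares the fitted coefficients with the group means.

```r
# Toy data: made-up salaries, NOT the Salaries data set
toy <- data.frame(
  sex    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  salary = c(90, 100, 110, 105, 120, 135)
)

# Hand-made dummy variable: 1 for males, 0 for females
toy$male <- ifelse(toy$sex == "Male", 1, 0)

# Regress salary on the dummy variable
fit <- lm(salary ~ male, data = toy)
b <- coef(fit)

# Group means computed directly from the data
means <- tapply(toy$salary, toy$sex, mean)

b[["(Intercept)"]]                # b0: matches means[["Female"]]
b[["(Intercept)"]] + b[["male"]]  # b0 + b1: matches means[["Male"]]
b[["male"]]                       # b1: male mean minus female mean
```

With a single two-level predictor the model is saturated, so the fitted values are exactly the two group means.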

R creates dummy variables automatically:

# Compute the model
model <- lm(salary~sex,data=Salaries1)
summary(model)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 101002.41   4809.386 21.001103 2.683482e-66
## sexMale      14088.01   5064.579  2.781674 5.667107e-03

From the output above, the average salary for females is estimated to be 101002, whereas the average for males is estimated to be 101002 + 14088 = 115090. The p-value for the dummy variable sexMale is small (about 0.0057), suggesting that there is statistical evidence of a difference in average salary between the genders.

The contrasts() function returns the coding that R has used to create the dummy variables:

contrasts(Salaries1$sex)
##        Male
## Female    0
## Male      1

You can use the function relevel() to set the baseline category to males as follows:

Salaries1 <- Salaries1 %>% mutate( sex = relevel( sex, ref = "Male"))

The output of the regression fit becomes:

model <- lm( salary ~ sex, data = Salaries1)
summary(model)$coef
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 115090.42   1587.378 72.503463 2.459122e-230
## sexFemale   -14088.01   5064.579 -2.781674  5.667107e-03

Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable coded -1 (male) / 1 (female). This results in the model:

salary = b0 - b1 if the person is male

salary = b0 + b1 if the person is female

So, if the categorical variable is coded as -1 and 1 and the regression coefficient is positive, the group coded as 1 lies b1 above b0 and the group coded as -1 lies b1 below it. If the regression coefficient is negative, the directions are reversed. Under this coding, b0 is the midpoint of the two group means and b1 is half the difference between them.
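The -1/1 coding can be tried on a small made-up data set. This is only a sketch: the numbers and the names toy, sex_pm, fit_pm and means are illustrative assumptions, not taken from the Salaries data.

```r
# Toy data: made-up salaries, NOT the Salaries data set
toy <- data.frame(
  sex    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  salary = c(90, 100, 110, 105, 120, 135)
)

# -1 (male) / 1 (female) coding, built by hand
toy$sex_pm <- ifelse(toy$sex == "Female", 1, -1)

fit_pm <- lm(salary ~ sex_pm, data = toy)
b_pm <- coef(fit_pm)

# Group means computed directly from the data
means <- tapply(toy$salary, toy$sex, mean)

b_pm[["(Intercept)"]]                     # b0: midpoint of the two group means
b_pm[["sex_pm"]]                          # b1: half of (female mean - male mean)
b_pm[["(Intercept)"]] - b_pm[["sex_pm"]]  # b0 - b1: matches means[["Male"]]
b_pm[["(Intercept)"]] + b_pm[["sex_pm"]]  # b0 + b1: matches means[["Female"]]
```

In this toy example females earn less on average, so b1 is negative and the female group falls below b0.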

Categorical Variables with more than two levels

Generally, a categorical variable with n levels will be transformed into n-1 variables, each with two levels. These n-1 new variables contain the same information as the single variable. This recoding creates a table called a contrast matrix.

For example, rank in the Salaries data has three levels: “AsstProf”, “AssocProf” and “Prof”. This variable could be dummy coded into two variables, one called AssocProf and one called Prof:

If rank = AssocProf, then the column AssocProf would be coded with a 1 and Prof with a 0.

If rank = Prof, then the column AssocProf would be coded with a 0 and Prof would be coded with a 1.

If rank = AsstProf, then both columns “AssocProf” and “Prof” would be coded with a 0.
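The three cases above can be reproduced by hand. The sketch below uses a short made-up vector of ranks (not the Salaries data) and sets the level order explicitly so that AsstProf is the reference level. The names rank_chr, rank_f and dummies are only illustrative.

```r
# Made-up ranks, NOT the Salaries data set
rank_chr <- c("Prof", "AsstProf", "AssocProf", "Prof", "AsstProf")

# Put AsstProf first so that it becomes the reference level
rank_f <- factor(rank_chr, levels = c("AsstProf", "AssocProf", "Prof"))

# model.matrix() builds the dummy columns; drop the intercept column
dummies <- model.matrix(~ rank_f)[, -1]

# Show each rank next to its two dummy variables
cbind(data.frame(rank = rank_chr), dummies)
```

Rows with rank AsstProf have 0 in both columns, which is what makes AsstProf the baseline.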

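This dummy coding is automatically performed by R. For demonstration purposes, you can use the function model.matrix() to see the dummy variables created for a factor variable. Because the levels of rank are sorted alphabetically here, R takes AssocProf (not AsstProf) as the reference level, so the two columns are rankAsstProf and rankProf: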

res <- model.matrix(~rank,data=Salaries1)
head(res[,-1])
##   rankAsstProf rankProf
## 1            0        1
## 2            0        1
## 3            1        0
## 4            0        1
## 5            0        1
## 6            0        0

When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R (treatment contrasts) is to use the first level of the factor as a reference and interpret the remaining levels relative to this level.
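To make the idea of different coding systems concrete, the sketch below prints three of the contrast matrices that ship with base R and then switches a small illustrative factor from the default coding to sum-to-zero coding. The factor rank_f and the names default_coding and sum_coding are assumptions made for this example.

```r
# Three built-in coding schemes for a factor with 3 levels
contr.treatment(3)  # reference (dummy) coding: first level is all zeros
contr.sum(3)        # sum-to-zero coding: last level is coded -1 in every column
contr.helmert(3)    # Helmert coding: each level vs. the mean of the preceding levels

# An illustrative factor with an explicit level order
rank_f <- factor(c("AsstProf", "AssocProf", "Prof"),
                 levels = c("AsstProf", "AssocProf", "Prof"))

# Default coding for an unordered factor (treatment contrasts)
default_coding <- contrasts(rank_f)

# Switch this factor to sum-to-zero coding
contrasts(rank_f) <- contr.sum(3)
sum_coding <- contrasts(rank_f)

default_coding
sum_coding
```

The choice of coding changes how the coefficients are interpreted, not the fitted values of the model.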

Note that ANOVA (analysis of variance) is just a special case of the linear model where the predictors are categorical variables. Because R treats ANOVA and regression as both being linear models, it lets you extract the classic ANOVA table from your regression model using the base R anova() function or the Anova() function [in car package]. We generally recommend the Anova() function because its default Type II tests do not depend on the order of the terms in the model, which matters for unbalanced designs.
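The equivalence between ANOVA and regression can be checked with base R alone. The sketch below uses a made-up balanced data set (the values and the names toy, fit_lm, fit_aov, f_lm, f_aov and f_sum are illustrative assumptions) and compares the F statistic from three routes.

```r
# Made-up data with one categorical predictor, NOT the Salaries data set
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  y     = c(10, 12, 11, 13, 15, 14, 16, 17, 20, 19, 22, 21)
)

# The same model fitted as a regression and as a one-way ANOVA
fit_lm  <- lm(y ~ group, data = toy)
fit_aov <- aov(y ~ group, data = toy)

# F statistic from the ANOVA table of the regression model
f_lm <- anova(fit_lm)[["F value"]][1]

# F statistic from the classic one-way ANOVA
f_aov <- summary(fit_aov)[[1]][["F value"]][1]

# Overall F statistic reported by summary() of the regression model
f_sum <- summary(fit_lm)$fstatistic[["value"]]

c(anova_lm = f_lm, aov = f_aov, summary_lm = f_sum)
```

All three values agree because, with a single categorical predictor, the overall regression F test and the one-way ANOVA F test are the same test.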

The results of predicting salary from yrs.service, rank, discipline and sex using multiple regression are presented below.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
model2 <- lm(salary~yrs.service+rank+discipline+sex,data=Salaries1)
Anova(model2)
## Anova Table (Type II tests)
## 
## Response: salary
##                 Sum Sq  Df  F value    Pr(>F)    
## yrs.service 3.2448e+08   1   0.6324    0.4270    
## rank        1.0288e+11   2 100.2572 < 2.2e-16 ***
## discipline  1.7373e+10   1  33.8582 1.235e-08 ***
## sex         7.7669e+08   1   1.5137    0.2193    
## Residuals   2.0062e+11 391                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Taking other variables (yrs.service, rank and discipline) into account, it can be seen that the categorical variable sex is no longer significantly associated with the variation in salary between individuals (p = 0.22). The significant variables are rank and discipline.

If you want to interpret the contrasts of the categorical variable, type this:

summary(model2)
## 
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex, 
##     data = Salaries1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64202 -14255  -1533  10571  99163 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   87683.32    3566.80  24.583  < 2e-16 ***
## yrs.service     -88.78     111.64  -0.795 0.426958    
## rankAsstProf -14560.40    4098.32  -3.553 0.000428 ***
## rankProf      34599.24    3382.52  10.229  < 2e-16 ***
## disciplineB   13473.38    2315.50   5.819 1.24e-08 ***
## sexFemale     -4771.25    3878.00  -1.230 0.219311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22650 on 391 degrees of freedom
## Multiple R-squared:  0.4478, Adjusted R-squared:  0.4407 
## F-statistic: 63.41 on 5 and 391 DF,  p-value: < 2.2e-16

For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of 13473.38 in salary compared to discipline A (theoretical departments), holding the other predictors constant.

Discussion

Some categorical variables have levels that are ordered. They can be converted to numerical values and used as is. For example, if the order of the professor grades (“AsstProf”, “AssocProf” and “Prof”) is meaningful, you can convert them into numerical values ordered from low to high, with higher values corresponding to higher-grade professors.
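One possible way to do this conversion is sketched below on a short made-up vector of ranks (rank_chr, rank_ord and rank_num are illustrative names). Note the assumption built into this approach: the numeric scores 1, 2, 3 treat the step from AsstProf to AssocProf as equal to the step from AssocProf to Prof.

```r
# Made-up ranks, NOT the Salaries data set
rank_chr <- c("Prof", "AsstProf", "AssocProf", "Prof")

# Declare the order explicitly, from the lowest to the highest grade
rank_ord <- factor(rank_chr,
                   levels  = c("AsstProf", "AssocProf", "Prof"),
                   ordered = TRUE)

# Convert to integer scores: AsstProf = 1, AssocProf = 2, Prof = 3
rank_num <- as.integer(rank_ord)

data.frame(rank = rank_chr, score = rank_num)
```

The resulting numeric column can then be used as an ordinary numeric predictor in lm().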