Lesson 3.5 - Categorical Predictors

Robbie Beane

Categorical Predictors

A variable is categorical, or qualitative, if it takes on values from a finite set of categories or classes. The values might be encoded using numerical labels, but these labels typically have no numerical or quantitative significance. In R terminology, a categorical variable is represented by a variable with the factor data type. The possible values that a factor can assume are referred to as its levels.
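As a quick illustration, the following sketch creates a factor from a small hypothetical character vector and inspects its levels:

# Convert a character vector to a factor; the levels are the
# distinct categories that the variable can assume.
z <- factor(c("A", "B", "B", "A", "B"))
levels(z)   # "A" "B"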

Simple linear regression assumes that the response variable is a quantitative variable that lies on a continuous scale. However, it makes no such assumptions about the predictors. We can use categorical variables as predictors, as long as we are careful to encode them properly.

Encoding Binary Categorical Variables

Assume that we are constructing a regression model that includes a categorical predictor \(Z\). Suppose that \(Z\) has two possible values, \(A\) and \(B\). To use this predictor in a model, we must numerically encode the information it contains. We do this by first selecting one of the two values to be the base value. This choice is arbitrary, and has no substantive effect on the final model (but it will affect how the model is represented).

Assume that we have selected \(A\) to be our base value. Then we will create a new dummy variable \(Z_B\) defined as follows:

\[Z_B = \begin{cases} 0 & \text{if } Z = A \\ 1 & \text{if } Z = B \end{cases}\]

In a sense, the dummy variable \(Z_B\) indicates whether or not an observation falls in a level other than the base level.
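In R, we could construct such a dummy variable by hand, although, as we will see below, lm() creates dummy variables automatically for factor predictors. A minimal sketch, using a small hypothetical vector:

# The base level "A" is listed first; the dummy is 1 exactly when Z = B.
z <- factor(c("A", "B", "A", "B", "B"), levels = c("A", "B"))
z_B <- as.numeric(z == "B")
z_B   # 0 1 0 1 1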

We can use \(Z_B\) in a regression model. Suppose that we have two other predictors, \(X^{(1)}\) and \(X^{(2)}\), and that our fitted model has the form:

\[\hat Y = \hat{\beta_0} + \hat{\beta_1} X^{(1)} + \hat{\beta_2} X^{(2)} + \hat{\beta_3} Z_B\]

In a sense, this creates two models. Observations with \(Z = A\) would use the model:

\[\hat Y = \hat{\beta_0} + \hat{\beta_1} X^{(1)} + \hat{\beta_2} X^{(2)}\]

while observations with \(Z=B\) would use the model: \[\hat Y = \hat{\beta_0} + \hat{\beta_1} X^{(1)} + \hat{\beta_2} X^{(2)} + \hat{\beta_3}\]

The parameter estimate \(\hat{\beta_3}\) is an estimate of the average difference in \(Y\) between an observation with \(Z = B\) and an observation with \(Z = A\), assuming the two observations have the same values for all of the other predictors.

Encoding Non-Binary Categorical Variables

It is quite common to have categorical variables with more than two categories. Encoding a non-binary categorical variable requires a bit more work than encoding a binary one. We will still select a base level for the variable, and we will then introduce a dummy variable for each category other than the base level. A particular dummy variable is equal to one for observations that fall in the related category, and is zero otherwise.

For example, assume that we have a categorical variable \(Z\) with four levels: \(A\), \(B\), \(C\), and \(D\). We will select \(A\) to be the base level, and will introduce dummy variables \(Z_B\), \(Z_C\), and \(Z_D\). The following table shows the value of these three dummy variables for each possible value of \(Z\).

\[\begin{array}{c|ccc}
Z & Z_B & Z_C & Z_D \\ \hline
A & 0 & 0 & 0 \\
B & 1 & 0 & 0 \\
C & 0 & 1 & 0 \\
D & 0 & 0 & 1
\end{array}\]

When using these dummy variables in a regression model, their fitted coefficients will be estimates of the average difference between observations within the relevant category and the base category, assuming that all other predictors are equal.
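We can see this encoding directly with R's model.matrix() function. A minimal sketch, using a small hypothetical factor with levels A through D:

# model.matrix() builds the design matrix that R would use in a regression:
# an intercept column plus one dummy column for each non-base level.
z <- factor(c("A", "B", "C", "D"))
model.matrix(~ z)   # columns: (Intercept), zB, zC, zD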

Interaction Terms

An interaction term in a regression model is a predictor that is created as a product of two other predictors. Interaction terms are often important to consider when dealing with categorical predictors, as the value of a categorical variable \(Z\) might have an impact on the effect of a different predictor variable \(X\) within the model.
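For example, suppose that we have a single quantitative predictor \(X\) and a binary categorical predictor \(Z\) with dummy variable \(Z_B\). A model that includes the interaction between \(X\) and \(Z\) has the form:

\[\hat Y = \hat{\beta_0} + \hat{\beta_1} X + \hat{\beta_2} Z_B + \hat{\beta_3} X \cdot Z_B\]

Observations with \(Z = A\) have slope \(\hat{\beta_1}\) with respect to \(X\), while observations with \(Z = B\) have slope \(\hat{\beta_1} + \hat{\beta_3}\). The interaction term thus allows the effect of \(X\) on the response to differ across the levels of \(Z\).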

When considering interactions between a quantitative predictor \(X\) and a categorical predictor \(Z\), we will have to include one interaction term for every dummy variable created for \(Z\).
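As a quick sketch of this expansion, again using small hypothetical vectors, model.matrix() shows that the interaction between a numeric x and a three-level factor z produces one interaction column per dummy variable:

# The formula x * z expands to x + z + x:z; the x:z term contributes
# one product column for each dummy variable of z.
x <- c(1, 2, 3, 4, 5, 6)
z <- factor(c("A", "B", "C", "A", "B", "C"))
colnames(model.matrix(~ x * z))   # "(Intercept)" "x" "zB" "zC" "x:zB" "x:zC"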

We will see how to implement interaction terms in R later in this notebook.

Example: Salary Survey

In this example, we will study the effects that experience, education, and being in a management position have on the salary of individuals at a specific company. The data used in this study has been simulated.

Our dataset consists of 198 observations with 4 features. The features in the dataset are:

  1. Exp - the number of years of work experience.
  2. Educ - the highest level of education attained, with levels HS, B, and M.
  3. Mgmt - equal to 1 if the individual holds a management position, and 0 otherwise.
  4. Salary - the individual's salary. This is the response variable.

Load Packages

We will begin by loading two packages that we will use for graphical analysis of the data.

library(ggplot2)
library(gridExtra)

Load and Analyze the Data

The training data for this example is stored in the file data/salaries_tr.csv. We will load this dataset into a data frame and will then look at a summary of the data.

df <- read.table("data/salaries_tr.csv", header=TRUE, sep=",")
summary(df)
##       Exp         Educ         Mgmt            Salary     
##  Min.   : 2.000   B :97   Min.   :0.0000   Min.   :28116  
##  1st Qu.: 6.250   HS:66   1st Qu.:0.0000   1st Qu.:39651  
##  Median : 9.000   M :35   Median :0.0000   Median :43685  
##  Mean   : 9.783           Mean   :0.2121   Mean   :45590  
##  3rd Qu.:13.000           3rd Qu.:0.0000   3rd Qu.:50969  
##  Max.   :22.000           Max.   :1.0000   Max.   :74553

We will use the str() function to determine the data types for each of the columns in the data frame.

str(df)
## 'data.frame':    198 obs. of  4 variables:
##  $ Exp   : int  4 7 14 8 9 5 4 7 8 15 ...
##  $ Educ  : Factor w/ 3 levels "B","HS","M": 1 1 2 3 1 3 3 2 2 1 ...
##  $ Mgmt  : int  1 0 0 1 0 0 1 0 0 0 ...
##  $ Salary: int  46994 46972 42503 56029 39452 38689 63040 40300 46443 45893 ...

Load and Analyze the Data

The variable Educ is stored as a factor, which is correct. However, the base level for Educ is listed as B. We will relevel this factor variable so that HS is the base level. Additionally, we notice that Mgmt is currently stored as an integer. We will convert this to a factor.

df$Educ <- factor(df$Educ, levels=c("HS", "B", "M"))
df$Mgmt <- factor(df$Mgmt, levels=c(0, 1))
str(df)
## 'data.frame':    198 obs. of  4 variables:
##  $ Exp   : int  4 7 14 8 9 5 4 7 8 15 ...
##  $ Educ  : Factor w/ 3 levels "HS","B","M": 2 2 1 3 2 3 3 1 1 2 ...
##  $ Mgmt  : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
##  $ Salary: int  46994 46972 42503 56029 39452 38689 63040 40300 46443 45893 ...
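As an aside, we could also have releveled Educ with the relevel() function, which moves the chosen reference level to the front and leaves the remaining levels in their original order:

# Equivalent to the factor() call above: makes "HS" the base level.
df$Educ <- relevel(df$Educ, ref = "HS")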

Let’s take another look at the summary of the data.

summary(df)
##       Exp         Educ    Mgmt        Salary     
##  Min.   : 2.000   HS:66   0:156   Min.   :28116  
##  1st Qu.: 6.250   B :97   1: 42   1st Qu.:39651  
##  Median : 9.000   M :35           Median :43685  
##  Mean   : 9.783                   Mean   :45590  
##  3rd Qu.:13.000                   3rd Qu.:50969  
##  Max.   :22.000                   Max.   :74553

Exploratory Plots

We will generate plots to explore the distribution of the Educ and Mgmt variables.

p1 <- ggplot(df, aes(x=Educ, fill=Educ)) + geom_bar()  
p2 <- ggplot(df, aes(x=Mgmt, fill=Mgmt)) + geom_bar()  
p3 <- ggplot(df, aes(x=Educ, fill=Mgmt)) + geom_bar(position="fill") +
    ylab("Proportion")

grid.arrange(p1, p2, p3, ncol=3)

Exploratory Plots

We will now create boxplots to explore the relationship between each of the qualitative variables and Salary.

p1 <- ggplot(df, aes(x=Educ, y=Salary, fill=Educ)) + geom_boxplot()
p2 <- ggplot(df, aes(x=Mgmt, y=Salary, fill=Mgmt)) + geom_boxplot()
grid.arrange(p1, p2, ncol=2)

Exploratory Plots

We can use a scatterplot to visualize the relationship between Salary and Exp.

ggplot(df, aes(x=Exp, y=Salary, color=Educ, shape=Mgmt)) + 
  geom_point(size=2, alpha=0.8) 

Exploratory Plots

We can use faceting to create separate scatter plots for each of the six groups defined by the values of Educ and Mgmt.

ggplot(df, aes(x=Exp, y=Salary, color=Educ, shape=Mgmt)) + 
  geom_point(size=2, alpha=0.8) + 
  geom_smooth(method='lm') + 
  facet_grid(Educ ~ Mgmt)

Create Full Model, without Interactions

We can create a regression model using the same syntax as we would for any other multiple regression problem. R will take care of creating dummy variables for any predictors that are stored as factors.

mod1 <- lm(Salary ~ Exp + Educ + Mgmt, df)
summary(mod1)
## 
## Call:
## lm(formula = Salary ~ Exp + Educ + Mgmt, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13792.9  -3844.1     44.7   3334.3  15197.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30327.10    1041.65  29.114  < 2e-16 ***
## Exp           929.69      84.67  10.980  < 2e-16 ***
## EducB        4657.64     836.79   5.566 8.63e-08 ***
## EducM        8618.16    1220.85   7.059 2.94e-11 ***
## Mgmt1       11138.77    1015.17  10.972  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5101 on 193 degrees of freedom
## Multiple R-squared:  0.6997, Adjusted R-squared:  0.6935 
## F-statistic: 112.4 on 4 and 193 DF,  p-value: < 2.2e-16

Consider Models with Interaction Terms

We will now explore the possibility that the model contains relevant interaction terms. We begin by creating a model that includes an interaction between Exp and Mgmt. Our fitted model will have the following form:

\[\hat{Salary} = \hat{\beta_0} + \hat{\beta_1}\cdot Educ_B + \hat{\beta_2}\cdot Educ_M + \hat{\beta_3}\cdot Mgmt_1 + \hat{\beta_4}\cdot Exp + \hat{\beta_5}\cdot Exp \cdot Mgmt_1\]

mod2 <- lm(Salary ~ Educ + Exp*Mgmt, df)
summary(mod2)
## 
## Call:
## lm(formula = Salary ~ Educ + Exp * Mgmt, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12609.4  -3607.2    157.8   3256.0  14920.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31395.46    1101.37  28.506  < 2e-16 ***
## EducB        4723.22     824.27   5.730 3.83e-08 ***
## EducM        8812.70    1204.28   7.318 6.70e-12 ***
## Exp           815.00      93.84   8.685 1.64e-15 ***
## Mgmt1        5681.27    2281.21   2.490  0.01361 *  
## Exp:Mgmt1     544.85     204.72   2.661  0.00844 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5023 on 192 degrees of freedom
## Multiple R-squared:  0.7104, Adjusted R-squared:  0.7029 
## F-statistic:  94.2 on 5 and 192 DF,  p-value: < 2.2e-16

Consider Models with Interaction Terms

Let’s now include interactions between Exp and Educ, as well as Exp and Mgmt. Our model will have the form:

\[\hat{Salary} = \hat{\beta_0} + \hat{\beta_1}\cdot Educ_B + \hat{\beta_2}\cdot Educ_M + \hat{\beta_3}\cdot Mgmt_1+ \hat{\beta_4}\cdot Exp\] \[ + \hat{\beta_5}\cdot Exp \cdot Mgmt_1 + \hat{\beta_6}\cdot Exp \cdot Educ_B + \hat{\beta_7}\cdot Exp \cdot Educ_M\]

mod3 <- lm(Salary ~ Exp*Educ + Exp*Mgmt, df)
summary(mod3)
## 
## Call:
## lm(formula = Salary ~ Exp * Educ + Exp * Mgmt, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12314.6  -3619.0    -38.2   3459.8  14000.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31367.00    1553.61  20.190  < 2e-16 ***
## Exp           818.62     145.28   5.635 6.24e-08 ***
## EducB        5122.24    2057.56   2.489   0.0137 *  
## EducM        7007.72    2998.63   2.337   0.0205 *  
## Mgmt1        6432.87    2527.36   2.545   0.0117 *  
## Exp:EducB     -42.41     191.16  -0.222   0.8247    
## Exp:EducM     180.58     274.09   0.659   0.5108    
## Exp:Mgmt1     473.08     229.25   2.064   0.0404 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5038 on 190 degrees of freedom
## Multiple R-squared:  0.7116, Adjusted R-squared:  0.701 
## F-statistic: 66.99 on 7 and 190 DF,  p-value: < 2.2e-16

Consider Models with Interaction Terms

Finally, let us consider a model including interactions between only Exp and Educ. This model will have the form:

\[\hat{Salary} = \hat{\beta_0} + \hat{\beta_1}\cdot Educ_B + \hat{\beta_2}\cdot Educ_M + \hat{\beta_3}\cdot Mgmt_1 +\] \[\hat{\beta_4}\cdot Exp + \hat{\beta_5}\cdot Exp \cdot Educ_B + \hat{\beta_6}\cdot Exp \cdot Educ_M\]

mod4 <- lm(Salary ~ Mgmt + Exp*Educ, df)
summary(mod4)
## 
## Call:
## lm(formula = Salary ~ Mgmt + Exp * Educ, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12894  -3724    178   3390  14737 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31226.57    1565.30  19.949  < 2e-16 ***
## Mgmt1       11219.41    1012.28  11.083  < 2e-16 ***
## Exp           837.95     146.21   5.731 3.83e-08 ***
## EducB        4342.14    2039.72   2.129   0.0346 *  
## EducM        4362.29    2733.83   1.596   0.1122    
## Exp:EducB      30.28     189.49   0.160   0.8732    
## Exp:EducM     431.27     247.79   1.740   0.0834 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5081 on 191 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.6959 
## F-statistic: 76.14 on 6 and 191 DF,  p-value: < 2.2e-16

Residual Analysis

Of the models considered, Model 2 appears to be the best: it has the highest adjusted \(r^2\) of the four models, and all of its terms are statistically significant, whereas the interactions between Exp and Educ in Models 3 and 4 are not significant at the 0.05 level. We will now perform a residual analysis for this model.

res <- mod2$residuals

plot(res ~ mod2$fitted.values, pch=21, col="black", bg="salmon",
     xlab="Fitted Value", ylab="Residuals", main="Residual Plot")

abline(h=0, col="darkred", lwd=2)

Residual Analysis

We will apply a Shapiro-Wilk test for normality to the residuals.

shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.99551, p-value = 0.8274

The large p-value gives us no reason to doubt the normality of the residuals. We will also create a histogram and a normal Q-Q plot of the residuals.

par(mfrow=c(1,2))
hist(res, col='orchid', breaks=10)
qqnorm(res)
qqline(res)

par(mfrow=c(1,1))

Comparing Models with a Validation Set

Assume that we have access to a validation set to use when selecting between our models. We can compare the models by calculating \(r^2\) scores for each model on the validation set. Note that since \(SST\) does not depend on the model, selecting the model with the highest \(r^2\) score is equivalent to selecting the model with the lowest \(SSE\) score.
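The validation \(r^2\) score is computed from the validation-set prediction errors as follows:

\[r^2 = 1 - \frac{SSE}{SST}, \qquad SSE = \sum_i \left(y_i - \hat y_i\right)^2, \qquad SST = \sum_i \left(y_i - \bar y\right)^2\]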

We will begin by loading and processing the validation set.

vdf <- read.table("data/salaries_va.csv", header=TRUE, sep=",")
vdf$Educ <- factor(vdf$Educ, levels=c("HS", "B", "M"))
vdf$Mgmt <- factor(vdf$Mgmt, levels=c(0, 1))
summary(vdf)
##       Exp        Educ    Mgmt       Salary     
##  Min.   : 4.00   HS:21   0:48   Min.   :29960  
##  1st Qu.: 6.75   B :30   1:12   1st Qu.:39187  
##  Median :10.00   M : 9          Median :42967  
##  Mean   :10.57                  Mean   :44365  
##  3rd Qu.:13.25                  3rd Qu.:48877  
##  Max.   :22.00                  Max.   :71263

Comparing Models with a Validation Set

In the next cell, we will use each of the models to generate predictions for the validation set. We will then calculate \(SSE\) and \(r^2\) for each of the models.

val_pred_1 <- predict(mod1, vdf)
val_pred_2 <- predict(mod2, vdf)
val_pred_3 <- predict(mod3, vdf)
val_pred_4 <- predict(mod4, vdf)

SSE_1 <- sum((vdf$Salary - val_pred_1)^2)
SSE_2 <- sum((vdf$Salary - val_pred_2)^2)
SSE_3 <- sum((vdf$Salary - val_pred_3)^2)
SSE_4 <- sum((vdf$Salary - val_pred_4)^2)

# SST is the total sum of squares of the validation responses; it is the
# same for every model: var(y) * (n - 1) = sum((y - mean(y))^2).
SST <- var(vdf$Salary) * (nrow(vdf) - 1)

r2_1 <- 1 - SSE_1 / SST
r2_2 <- 1 - SSE_2 / SST
r2_3 <- 1 - SSE_3 / SST
r2_4 <- 1 - SSE_4 / SST

round(c(r2_1, r2_2, r2_3, r2_4),3)
## [1] 0.635 0.660 0.651 0.636

Model 2 has the best performance on the validation set.

Generating Predictions

We will finish this lesson by using Model 2 to generate predictions for individuals with the following characteristics:

  1. Exp = 10, Educ = 'HS', Mgmt = '0'
  2. Exp = 5, Educ = 'B', Mgmt = '0'
  3. Exp = 12, Educ = 'B', Mgmt = '1'
  4. Exp = 8, Educ = 'M', Mgmt = '1'

nd <- data.frame(
    Exp = c(10, 5, 12, 8), 
    Educ = c('HS', 'B', 'B', 'M'),
    Mgmt = c('0', '0', '1', '1')
)
 
predict(mod2, newdata = nd, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 39545.47 29563.68 49527.27
## 2 40193.69 30188.86 50198.52
## 3 58118.14 48013.60 68222.69
## 4 56768.22 46670.06 66866.38
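As a quick sanity check, we can reproduce the first fitted value by hand from the Model 2 coefficient estimates. The first individual has Educ = 'HS' and Mgmt = '0', both base levels, so only the intercept and the Exp term contribute:

# Intercept plus the Exp coefficient times 10 years of experience.
31395.46 + 815.00 * 10   # 39545.46, matching the fitted value above up to rounding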