Robbie Beane
A variable is categorical, or qualitative, if it takes on values from a finite set of categories or classes. These values might be encoded using numerical labels, but the labels typically carry no quantitative significance. In R terminology, a categorical variable is represented by a variable with the factor data type. The possible values that a factor can assume are referred to as its levels.
Simple linear regression assumes that the response variable is a quantitative variable that lies on a continuous scale. However, it makes no such assumption about the predictors. We can use categorical variables as predictors, as long as we are careful to encode them properly.
Assume that we are constructing a regression model that includes a categorical predictor \(Z\), and suppose that \(Z\) has two possible values, \(A\) and \(B\). To use this predictor in a model, we must numerically encode the information it contains. We do this by first selecting one of the two values to be the base value. This choice is arbitrary and has no substantive effect on the final model (but it will affect how the model is represented).
Assume that we have selected \(A\) to be our base value. We then create a new dummy variable \(Z_B\), defined as follows:
\[Z_B = 0 \text{ if } Z = A\]
\[Z_B = 1 \text{ if } Z = B\]
In a sense, the dummy variable \(Z_B\) indicates whether or not an observation falls in a level other than the base level.
We can use \(Z_B\) in a regression model. Suppose that we have two other predictors, \(X^{(1)}\) and \(X^{(2)}\), and that our fitted model has the form:
\[\hat Y = \hat{\beta_0} + \hat{\beta_1} X^{(1)} + \hat{\beta_2} X^{(2)} + \hat{\beta_3} Z_B\]
In a sense, this creates two models. Observations with \(Z = A\) would use the model:
\[\hat Y = \hat{\beta_0} + \hat{\beta_1} X^{(1)} + \hat{\beta_2} X^{(2)}\]
while observations with \(Z=B\) would use the model: \[\hat Y = \hat{\beta_0} + \hat{\beta_1} X^{(1)} + \hat{\beta_2} X^{(2)} + \hat{\beta_3}\]
The parameter estimate \(\hat{\beta_3}\) estimates the average difference in \(Y\) between an observation with \(Z = B\) and one with \(Z = A\), assuming both have the same values for the other predictors.
It is quite common to have categorical variables with more than two categories. Encoding a non-binary categorical variable requires a bit more work than encoding a binary one. We still select a base level for the variable, and we then introduce a dummy variable for each category other than the base level. A particular dummy variable is equal to one for observations that fall in the related category, and is zero otherwise.
For example, assume that we have a categorical variable \(Z\) with four levels: \(A\), \(B\), \(C\), and \(D\). We will select \(A\) to be the base level, and will introduce dummy variables \(Z_B\), \(Z_C\), and \(Z_D\). The following table shows the value of these three dummy variables for each possible value of \(Z\).
| \(Z\) | \(Z_B\) | \(Z_C\) | \(Z_D\) |
|-------|---------|---------|---------|
| A     | 0       | 0       | 0       |
| B     | 1       | 0       | 0       |
| C     | 0       | 1       | 0       |
| D     | 0       | 0       | 1       |
When using these dummy variables in a regression model, their fitted coefficients will be estimates of the average difference in the response between observations in the relevant category and observations in the base category, assuming that all other predictors are equal.
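R constructs this encoding automatically for factor variables. As a quick illustration, the model.matrix() function displays the dummy columns that lm() would build; the factor z below is a hypothetical example, not part of our dataset.

# A hypothetical factor with four levels; "A" is the base (first) level
z <- factor(c("A", "B", "C", "D"))

# model.matrix() shows the dummy columns zB, zC, and zD that lm() would use
model.matrix(~ z)

Each row of the resulting matrix has a one in the column matching its own level, while the base level A gets all zeros, reproducing the table above.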
An interaction term in a regression model is a predictor that is created as a product of two other predictors. Interaction terms are often important to consider when dealing with categorical predictors, as the value of a categorical variable \(Z\) might have an impact on the effect of a different predictor variable \(X\) within the model.
When considering interactions between a quantitative predictor \(X\) and a categorical predictor \(Z\), we will have to include one interaction term for every dummy variable created for \(Z\).
We will see how to implement interaction terms in R later in this notebook.
In this example, we will study the effects that experience, education, and being in a management position have on the salary of individuals at a specific company. The data used in this study has been simulated.
Our dataset consists of 198 observations with 4 features. The features in the dataset are:

- Exp: the individual's years of experience.
- Educ: the individual's level of education, recorded as HS, B, or M.
- Mgmt: an indicator for whether the individual holds a management position (1) or not (0).
- Salary: the individual's salary.
We will begin by loading two packages that we will use for graphical analysis of the data.
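The library calls themselves are not shown here, but based on the plotting functions used later in this notebook, the two packages are presumably ggplot2 and gridExtra:

library(ggplot2)    # provides ggplot() for our plots
library(gridExtra)  # provides grid.arrange() for arranging plots side by side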
The training data for this example is stored in the file data/salaries_tr.csv. We will load this dataset into a data frame and will then look at a summary of the data.
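The loading code is omitted here; mirroring the code used later to load the validation set, it was presumably:

df <- read.table("data/salaries_tr.csv", header=TRUE, sep=",")
summary(df)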
##       Exp        Educ        Mgmt           Salary
##  Min.   : 2.000   B :97   Min.   :0.0000   Min.   :28116
##  1st Qu.: 6.250   HS:66   1st Qu.:0.0000   1st Qu.:39651
##  Median : 9.000   M :35   Median :0.0000   Median :43685
##  Mean   : 9.783           Mean   :0.2121   Mean   :45590
##  3rd Qu.:13.000           3rd Qu.:0.0000   3rd Qu.:50969
##  Max.   :22.000           Max.   :1.0000   Max.   :74553
We will use the str() function to determine the data types for each of the columns in the data frame.
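The call producing the output below is simply:

str(df)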
## 'data.frame': 198 obs. of 4 variables:
## $ Exp : int 4 7 14 8 9 5 4 7 8 15 ...
## $ Educ : Factor w/ 3 levels "B","HS","M": 1 1 2 3 1 3 3 2 2 1 ...
## $ Mgmt : int 1 0 0 1 0 0 1 0 0 0 ...
## $ Salary: int 46994 46972 42503 56029 39452 38689 63040 40300 46443 45893 ...
The variable Educ is stored as a factor, which is correct. However, the base level for Educ is listed as B. We will relevel this factor variable so that HS is the base level. Additionally, we notice that Mgmt is currently stored as an integer. We will convert this to a factor.
df$Educ <- factor(df$Educ, levels=c("HS", "B", "M"))
df$Mgmt <- factor(df$Mgmt, levels=c(0, 1))
str(df)

## 'data.frame':    198 obs. of  4 variables:
## $ Exp : int 4 7 14 8 9 5 4 7 8 15 ...
## $ Educ : Factor w/ 3 levels "HS","B","M": 2 2 1 3 2 3 3 1 1 2 ...
## $ Mgmt : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
## $ Salary: int 46994 46972 42503 56029 39452 38689 63040 40300 46443 45893 ...
Let’s take another look at the summary of the data.
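As before, the omitted call is:

summary(df)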
##       Exp        Educ    Mgmt        Salary
##  Min.   : 2.000   HS:66   0:156   Min.   :28116
##  1st Qu.: 6.250   B :97   1: 42   1st Qu.:39651
##  Median : 9.000   M :35           Median :43685
##  Mean   : 9.783                   Mean   :45590
##  3rd Qu.:13.000                   3rd Qu.:50969
##  Max.   :22.000                   Max.   :74553
We will generate plots to explore the distribution of the Educ and Mgmt variables.
p1 <- ggplot(df, aes(x=Educ, fill=Educ)) + geom_bar()
p2 <- ggplot(df, aes(x=Mgmt, fill=Mgmt)) + geom_bar()
p3 <- ggplot(df, aes(x=Educ, fill=Mgmt)) + geom_bar(position="fill") +
ylab("Proportion")
grid.arrange(p1, p2, p3, ncol=3)

We will now create boxplots to explore the relationship between each of the qualitative variables and Salary.
p1 <- ggplot(df, aes(x=Educ, y=Salary, fill=Educ)) + geom_boxplot()
p2 <- ggplot(df, aes(x=Mgmt, y=Salary, fill=Mgmt)) + geom_boxplot()
grid.arrange(p1, p2, ncol=2)

We can use a scatterplot to visualize the relationship between Salary and Exp.
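The plotting cell itself is not shown; a minimal sketch consistent with the description would be:

ggplot(df, aes(x=Exp, y=Salary)) +
  geom_point(size=2, alpha=0.8)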
We can use faceting to create separate scatterplots for each of the six groups defined by the values of Educ and Mgmt.
ggplot(df, aes(x=Exp, y=Salary, color=Educ, shape=Mgmt)) +
geom_point(size=2, alpha=0.8) +
geom_smooth(method='lm') +
facet_grid(Educ ~ Mgmt)

We can create a regression model using the same syntax as we would for any other multiple regression problem. R will take care of creating dummy variables for any predictors that are stored as factors.
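The fitting code is omitted in the source; the Call line in the output below, together with the later use of the name mod1, indicates that it was:

mod1 <- lm(Salary ~ Exp + Educ + Mgmt, data = df)
summary(mod1)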
##
## Call:
## lm(formula = Salary ~ Exp + Educ + Mgmt, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13792.9 -3844.1 44.7 3334.3 15197.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30327.10 1041.65 29.114 < 2e-16 ***
## Exp 929.69 84.67 10.980 < 2e-16 ***
## EducB 4657.64 836.79 5.566 8.63e-08 ***
## EducM 8618.16 1220.85 7.059 2.94e-11 ***
## Mgmt1 11138.77 1015.17 10.972 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5101 on 193 degrees of freedom
## Multiple R-squared: 0.6997, Adjusted R-squared: 0.6935
## F-statistic: 112.4 on 4 and 193 DF, p-value: < 2.2e-16
We will now explore whether any interaction terms would be relevant to include in the model. We begin by creating a model that includes an interaction between Exp and Mgmt. Our fitted model will have the following form:
\[\hat{Salary} = \hat{\beta_0} + \hat{\beta_1}\cdot Educ_B + \hat{\beta_2}\cdot Educ_M + \hat{\beta_3}\cdot Mgmt_1 + \hat{\beta_4}\cdot Exp + \hat{\beta_5}\cdot Exp \cdot Mgmt_1\]
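Reconstructing the omitted cell from the Call line in the output, the fit is:

mod2 <- lm(Salary ~ Educ + Exp * Mgmt, data = df)
summary(mod2)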
##
## Call:
## lm(formula = Salary ~ Educ + Exp * Mgmt, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12609.4 -3607.2 157.8 3256.0 14920.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31395.46 1101.37 28.506 < 2e-16 ***
## EducB 4723.22 824.27 5.730 3.83e-08 ***
## EducM 8812.70 1204.28 7.318 6.70e-12 ***
## Exp 815.00 93.84 8.685 1.64e-15 ***
## Mgmt1 5681.27 2281.21 2.490 0.01361 *
## Exp:Mgmt1 544.85 204.72 2.661 0.00844 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5023 on 192 degrees of freedom
## Multiple R-squared: 0.7104, Adjusted R-squared: 0.7029
## F-statistic: 94.2 on 5 and 192 DF, p-value: < 2.2e-16
Let’s now include interactions between Exp and Educ, as well as between Exp and Mgmt. Our model will have the form:
\[\hat{Salary} = \hat{\beta_0} + \hat{\beta_1}\cdot Educ_B + \hat{\beta_2}\cdot Educ_M + \hat{\beta_3}\cdot Mgmt_1+ \hat{\beta_4}\cdot Exp\] \[ + \hat{\beta_5}\cdot Exp \cdot Mgmt_1 + \hat{\beta_6}\cdot Exp \cdot Educ_B + \hat{\beta_7}\cdot Exp \cdot Educ_M\]
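Again matching the Call line in the output below:

mod3 <- lm(Salary ~ Exp * Educ + Exp * Mgmt, data = df)
summary(mod3)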
##
## Call:
## lm(formula = Salary ~ Exp * Educ + Exp * Mgmt, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12314.6 -3619.0 -38.2 3459.8 14000.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31367.00 1553.61 20.190 < 2e-16 ***
## Exp 818.62 145.28 5.635 6.24e-08 ***
## EducB 5122.24 2057.56 2.489 0.0137 *
## EducM 7007.72 2998.63 2.337 0.0205 *
## Mgmt1 6432.87 2527.36 2.545 0.0117 *
## Exp:EducB -42.41 191.16 -0.222 0.8247
## Exp:EducM 180.58 274.09 0.659 0.5108
## Exp:Mgmt1 473.08 229.25 2.064 0.0404 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5038 on 190 degrees of freedom
## Multiple R-squared: 0.7116, Adjusted R-squared: 0.701
## F-statistic: 66.99 on 7 and 190 DF, p-value: < 2.2e-16
Finally, let's consider a model that includes interactions only between Exp and Educ. This model will have the form:
\[\hat{Salary} = \hat{\beta_0} + \hat{\beta_1}\cdot Educ_B + \hat{\beta_2}\cdot Educ_M + \hat{\beta_3}\cdot Mgmt_1 +\] \[\hat{\beta_4}\cdot Exp + \hat{\beta_5}\cdot Exp \cdot Educ_B + \hat{\beta_6}\cdot Exp \cdot Educ_M\]
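Per the Call line below:

mod4 <- lm(Salary ~ Mgmt + Exp * Educ, data = df)
summary(mod4)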
##
## Call:
## lm(formula = Salary ~ Mgmt + Exp * Educ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12894 -3724 178 3390 14737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31226.57 1565.30 19.949 < 2e-16 ***
## Mgmt1 11219.41 1012.28 11.083 < 2e-16 ***
## Exp 837.95 146.21 5.731 3.83e-08 ***
## EducB 4342.14 2039.72 2.129 0.0346 *
## EducM 4362.29 2733.83 1.596 0.1122
## Exp:EducB 30.28 189.49 0.160 0.8732
## Exp:EducM 431.27 247.79 1.740 0.0834 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5081 on 191 degrees of freedom
## Multiple R-squared: 0.7052, Adjusted R-squared: 0.6959
## F-statistic: 76.14 on 6 and 191 DF, p-value: < 2.2e-16
Of the models considered, Model 2 appears to be the best: it has the highest adjusted \(r^2\), and all of its terms are statistically significant, which is not true of Models 3 and 4. We will now perform a residual analysis for this model.
res <- mod2$residuals
plot(res ~ mod2$fitted.values, pch=21, col="black", bg="salmon",
xlab="Fitted Value", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)

We will apply a Shapiro-Wilk test for normality of the residuals.
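The test call, whose output names the data as res, is:

shapiro.test(res)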
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.99551, p-value = 0.8274
The large p-value gives us no reason to doubt that the residuals are normally distributed.

Assume that we have access to a validation set to use when selecting between our models. We can compare the different models by calculating \(r^2\) scores for each of them on the validation set. Note that selecting the model with the highest \(r^2\) score is equivalent to selecting the model with the lowest \(SSE\) score, since each \(r^2\) is computed as \(1 - SSE/SST\) with the same \(SST\).
We will begin by loading and processing the validation set.
vdf <- read.table("data/salaries_va.csv", header=TRUE, sep=",")
vdf$Educ <- factor(vdf$Educ, levels=c("HS", "B", "M"))
vdf$Mgmt <- factor(vdf$Mgmt, levels=c(0, 1))
summary(vdf)

##       Exp        Educ    Mgmt       Salary
##  Min.   : 4.00   HS:21   0:48   Min.   :29960
##  1st Qu.: 6.75   B :30   1:12   1st Qu.:39187
##  Median :10.00   M : 9          Median :42967
##  Mean   :10.57                  Mean   :44365
##  3rd Qu.:13.25                  3rd Qu.:48877
##  Max.   :22.00                  Max.   :71263
In the next cell, we will use each of the models to generate predictions for the validation set. We will then calculate \(SSE\) and \(r^2\) for each of the models.
val_pred_1 <- predict(mod1, vdf)
val_pred_2 <- predict(mod2, vdf)
val_pred_3 <- predict(mod3, vdf)
val_pred_4 <- predict(mod4, vdf)
SSE_1 <- sum((vdf$Salary - val_pred_1)^2)
SSE_2 <- sum((vdf$Salary - val_pred_2)^2)
SSE_3 <- sum((vdf$Salary - val_pred_3)^2)
SSE_4 <- sum((vdf$Salary - val_pred_4)^2)
SST <- var(vdf$Salary) * (nrow(vdf) - 1)
r2_1 <- 1 - SSE_1 / SST
r2_2 <- 1 - SSE_2 / SST
r2_3 <- 1 - SSE_3 / SST
r2_4 <- 1 - SSE_4 / SST
round(c(r2_1, r2_2, r2_3, r2_4), 3)

## [1] 0.635 0.660 0.651 0.636
Model 2 has the best performance on the validation set.
We will finish this lesson by using Model 2 to generate predictions for individuals with the following characteristics:
- Exp = 10, Educ = 'HS', Mgmt = '0'
- Exp = 5, Educ = 'B', Mgmt = '0'
- Exp = 12, Educ = 'B', Mgmt = '1'
- Exp = 8, Educ = 'M', Mgmt = '1'

nd <- data.frame(
Exp = c(10, 5, 12, 8),
Educ = c('HS', 'B', 'B', 'M'),
Mgmt = c('0', '0', '1', '1')
)
predict(mod2, newdata = nd, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 39545.47 29563.68 49527.27
## 2 40193.69 30188.86 50198.52
## 3 58118.14 48013.60 68222.69
## 4 56768.22 46670.06 66866.38