Categorical Variables

The ideas behind linear regression do not change when we use categorical variables \(x_j\) to predict a quantitative response \(y\). What changes is how we set up the data.

We are still trying to create a model to capture a relationship of the form \[ y = \beta_0 + \beta_1x + \varepsilon,\] where \(y\) represents the response variable, \(\beta_0\) represents the \(y\)-intercept, \(\beta_1\) is the coefficient representing the slope of the regression line and \(\varepsilon\) represents a random error term with mean 0. We still do so by fitting a model \[\hat{y} = b_0 + b_1x,\] where \(b_0\) estimates \(\beta_0\), \(b_1\) estimates \(\beta_1\), and \(\hat{y}\) estimates \(E(y|x)\). And we still do so by minimizing the sum of the squared residuals. The difference is that our predictor variables can now only take the values 0 or 1. We also note that this analysis extends to a multiple regression model by making the appropriate changes.
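Concretely, "minimizing the sum of the squared residuals" means choosing \(b_0\) and \(b_1\) to minimize \[\sum_{i=1}^{n}\left(y_i - b_0 - b_1x_i\right)^2,\] the sum of the squared vertical distances between the observed values and the regression line.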

Transforming Categories

In some cases, your categorical variables may have more than two categories. For example, we might ask survey subjects for their marital status, with possible answers of Married, Single, Widowed, Separated/Divorced, and Other.

Here, we might assign each answer a numerical value from 1 to 5; however, that coding is not appropriate in linear regression. If Married is 1 and Widowed is 3, does that mean that Widowed is three times as much as Married? That is the mathematical interpretation once a regression is run: whatever the value of \(b_1\) that comes from that variable, being widowed adds three times as much as being married to the value of \(\hat{y}\).

To address this issue, we create a set of indicator (dummy) variables, one for each category except a baseline. In this example, we let \(x_1, x_2, x_3, x_4\) indicate Married, Single, Widowed, and Separated/Divorced respectively, with Other as the baseline.

Our model becomes \[y = \beta_0 + \beta_1x_1 + \beta_2x_2 +\beta_3x_3 +\beta_4x_4 +\varepsilon,\] and we create a predictive model \[\hat{y} = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4.\] Each \(x_j\) is assigned the value 1 if a person selects the corresponding answer and 0 otherwise. Therefore, the table below shows the possible options:

Classification       \(x_1\)  \(x_2\)  \(x_3\)  \(x_4\)
Married                 1        0        0        0
Single                  0        1        0        0
Widowed                 0        0        1        0
Separated/Divorced      0        0        0        1
Other                   0        0        0        0

From this model, we can see that we have actually created a series of linear equations that predict \(y\) based on the answer to the question.

Classification       Equation
Married              \(\hat{y} = b_0 + b_1\)
Single               \(\hat{y} = b_0 + b_2\)
Widowed              \(\hat{y} = b_0 + b_3\)
Separated/Divorced   \(\hat{y} = b_0 + b_4\)
Other                \(\hat{y} = b_0\)

Thus, we can see exactly the extent to which each answer impacts \(y\) in our model.
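R performs this dummy coding automatically when a factor variable is passed to lm, and we can inspect the indicator columns it would build with model.matrix. A minimal sketch with hypothetical data, using Other as the baseline level:

# Hypothetical data: R builds one 0/1 indicator column per non-baseline level
status <- factor(c("Married", "Single", "Widowed", "Separated/Divorced", "Other"),
                 levels = c("Other", "Married", "Single", "Widowed", "Separated/Divorced"))
model.matrix(~ status)   # an intercept column plus four 0/1 indicator columns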

The Example

We would like to predict the birth weight of a newborn baby based on a number of characteristics of the mother, including her age, weight, smoking status, and race. To do this, we will use the birthwt data set from the MASS library. The data contain 10 columns and 189 rows and were collected at Baystate Medical Center in Springfield, Massachusetts, in 1986.

This data frame contains the following columns:

Variable   Description
low        indicator of birth weight less than 2.5 kg.
age        mother’s age in years.
lwt        mother’s weight in pounds at last menstrual period.
race       mother’s race (1 = white, 2 = black, 3 = other).
smoke      smoking status during pregnancy (1 = yes, 0 = no).
ptl        number of previous premature labours.
ht         history of hypertension (1 = yes, 0 = no).
ui         presence of uterine irritability (1 = yes, 0 = no).
ftv        number of physician visits during the first trimester.
bwt        birth weight in grams.

library(MASS)
dat <- birthwt
head(dat)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

Data Preparation

We begin by taking the race variable and converting it into two indicator variables based on its three possible values. With three categories, we need only two indicators; “Other” serves as the baseline.

Mother’s Race   wht   blk
White            1     0
Black            0     1
Other            0     0

Using the mutate function from the dplyr library, we create two additional columns in the data based on the race of the baby’s mother.

library(dplyr)   # provides the %>% pipe and mutate()

dat <- dat %>% mutate(wht = ifelse(race == 1, 1, 0))
dat <- dat %>% mutate(blk = ifelse(race == 2, 1, 0))
head(dat)
##   low age lwt race smoke ptl ht ui ftv  bwt wht blk
## 1   0  19 182    2     0   0  0  1   0 2523   0   1
## 2   0  33 155    3     0   0  0  0   3 2551   0   0
## 3   0  20 105    1     1   0  0  0   1 2557   1   0
## 4   0  21 108    1     1   0  0  1   2 2594   1   0
## 5   0  18 107    1     1   0  0  1   0 2600   1   0
## 6   0  21 124    3     0   0  0  0   0 2622   0   0

Eventually, we will include the smoke variable, which is also categorical, in the regression. However, it has only two categories and is already coded 0/1, so it is not necessary to create additional columns for that variable.
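As a quick sanity check (output omitted here), we can confirm that smoke takes only the values 0 and 1:

table(dat$smoke)   # counts of non-smokers (0) and smokers (1)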

Next, we create the training and testing sets. Again, we use a 60% / 40% split: we train our model on roughly 60% of the data and then test the model’s performance on the roughly 40% that is withheld.

set.seed(123456)
sample <- sample(c(TRUE, FALSE), nrow(dat), replace = TRUE, prob = c(0.6, 0.4))
train <- dat[sample, ]
test <- dat[!sample, ]
train
##     low age lwt race smoke ptl ht ui ftv  bwt wht blk
## 3     0  20 105    1     1   0  0  0   1 2557   1   0
## 4     0  21 108    1     1   0  0  1   2 2594   1   0
## 5     0  18 107    1     1   0  0  1   0 2600   1   0
## 6     0  21 124    3     0   0  0  0   0 2622   0   0
## 7     0  22 118    1     0   0  0  0   1 2637   1   0
## 8     0  17 103    3     0   0  0  0   1 2637   0   0
## 10    0  26 113    1     1   0  0  0   0 2665   1   0
## 12    0  19 150    3     0   0  0  0   1 2733   0   0
## 18    0  25 118    1     1   0  0  0   3 2782   1   0
## 19    0  20 120    3     0   0  0  1   0 2807   0   0
## 21    0  32 121    3     0   0  0  0   2 2835   0   0
## 22    0  31 100    1     0   0  0  1   3 2835   1   0
## 23    0  36 202    1     0   0  0  0   1 2836   1   0
## 24    0  28 120    3     0   0  0  0   0 2863   0   0
## 25    0  25 120    3     0   0  0  1   2 2877   0   0
## 30    0  17 113    2     0   0  0  0   1 2920   0   1
## 31    0  17 113    2     0   0  0  0   1 2920   0   1
## 33    0  35 121    2     1   1  0  0   1 2948   0   1
## 34    0  25 155    1     0   0  0  0   1 2977   1   0
## 36    0  29 140    1     1   0  0  0   2 2977   1   0
## 41    0  21 185    2     1   0  0  0   2 3042   0   1
## 44    0  21 160    1     0   0  0  0   0 3062   1   0
## 45    0  18  90    1     1   0  0  1   0 3062   1   0
## 46    0  18  90    1     1   0  0  1   0 3062   1   0
## 47    0  32 132    1     0   0  0  0   4 3080   1   0
## 50    0  22  85    3     1   0  0  0   0 3090   0   0
## 51    0  22 120    1     0   0  1  0   1 3100   1   0
## 53    0  22 130    1     1   0  0  0   0 3132   1   0
## 59    0  20 103    3     0   0  0  0   0 3203   0   0
## 60    0  17 119    3     0   0  0  0   0 3225   0   0
## 61    0  17 119    3     0   0  0  0   0 3225   0   0
## 62    0  23 119    3     0   0  0  0   2 3232   0   0
## 63    0  24 110    3     0   0  0  0   0 3232   0   0
## 64    0  28 140    1     0   0  0  0   0 3234   1   0
## 65    0  26 133    3     1   2  0  0   0 3260   0   0
## 66    0  20 169    3     0   1  0  1   1 3274   0   0
## 67    0  24 115    3     0   0  0  0   2 3274   0   0
## 68    0  28 250    3     1   0  0  0   6 3303   0   0
## 69    0  20 141    1     0   2  0  1   1 3317   1   0
## 70    0  22 158    2     0   1  0  0   2 3317   0   1
## 71    0  22 112    1     1   2  0  0   0 3317   1   0
## 74    0  16 112    2     0   0  0  0   0 3374   0   1
## 76    0  18 229    2     0   0  0  0   0 3402   0   1
## 78    0  32 134    1     1   1  0  0   4 3430   1   0
## 79    0  20 121    2     1   0  0  0   0 3444   0   1
## 81    0  22 131    1     0   0  0  0   1 3460   1   0
## 82    0  32 170    1     0   0  0  0   0 3473   1   0
## 85    0  23 123    3     0   0  0  0   0 3544   0   0
## 86    0  17 120    3     1   0  0  0   0 3572   0   0
## 87    0  19 105    3     0   0  0  0   0 3572   0   0
## 88    0  23 130    1     0   0  0  0   0 3586   1   0
## 89    0  36 175    1     0   0  0  0   0 3600   1   0
## 90    0  22 125    1     0   0  0  0   1 3614   1   0
## 91    0  24 133    1     0   0  0  0   0 3614   1   0
## 92    0  21 134    3     0   0  0  0   2 3629   0   0
## 93    0  19 235    1     1   0  1  0   0 3629   1   0
## 94    0  25  95    1     1   3  0  1   0 3637   1   0
## 95    0  16 135    1     1   0  0  0   0 3643   1   0
## 96    0  29 135    1     0   0  0  0   1 3651   1   0
## 99    0  19 147    1     1   0  0  0   0 3651   1   0
## 100   0  30 137    1     0   0  0  0   1 3699   1   0
## 101   0  24 110    1     0   0  0  0   1 3728   1   0
## 103   0  24 110    3     0   1  0  0   0 3770   0   0
## 104   0  23 110    1     0   0  0  0   1 3770   1   0
## 106   0  25 241    2     0   0  1  0   0 3790   0   1
## 108   0  22 169    1     0   0  0  0   0 3827   1   0
## 111   0  32 186    1     0   0  0  0   2 3860   1   0
## 112   0  18 120    3     0   0  0  0   1 3884   0   0
## 113   0  29 130    1     1   0  0  0   2 3884   1   0
## 114   0  33 117    1     0   0  0  1   1 3912   1   0
## 115   0  20 170    1     1   0  0  0   0 3940   1   0
## 116   0  28 134    3     0   0  0  0   1 3941   0   0
## 117   0  14 135    1     0   0  0  0   0 3941   1   0
## 118   0  28 130    3     0   0  0  0   0 3969   0   0
## 119   0  25 120    1     0   0  0  0   2 3983   1   0
## 121   0  20 158    1     0   0  0  0   1 3997   1   0
## 122   0  26 160    3     0   0  0  0   0 4054   0   0
## 125   0  25 130    1     0   0  0  0   2 4153   1   0
## 126   0  31 120    1     0   0  0  0   2 4167   1   0
## 127   0  35 170    1     0   1  0  0   1 4174   1   0
## 128   0  19 120    1     1   0  0  0   0 4238   1   0
## 130   0  45 123    1     0   0  0  0   1 4990   1   0
## 131   1  28 120    3     1   1  0  1   0  709   0   0
## 132   1  29 130    1     0   0  0  1   2 1021   1   0
## 133   1  34 187    2     1   0  1  0   0 1135   0   1
## 135   1  25  85    3     0   0  0  1   0 1474   0   0
## 136   1  27 150    3     0   0  0  0   0 1588   0   0
## 137   1  23  97    3     0   0  0  1   1 1588   0   0
## 139   1  24 132    3     0   0  1  0   0 1729   0   0
## 140   1  21 165    1     1   0  1  0   1 1790   1   0
## 145   1  25  92    1     1   0  0  0   0 1928   1   0
## 146   1  20 150    1     1   0  0  0   2 1928   1   0
## 149   1  21 103    3     0   0  0  0   0 1970   0   0
## 150   1  20 125    3     0   0  0  1   0 2055   0   0
## 153   1  19 112    1     1   0  0  1   0 2084   1   0
## 155   1  24 138    1     0   0  0  0   0 2100   1   0
## 156   1  17 130    3     1   1  0  1   0 2125   0   0
## 157   1  20 120    2     1   0  0  0   3 2126   0   1
## 159   1  27 130    2     0   0  0  1   0 2187   0   1
## 160   1  20  80    3     1   0  0  1   0 2211   0   0
## 162   1  25 105    3     0   1  0  0   1 2240   0   0
## 163   1  20 109    3     0   0  0  0   0 2240   0   0
## 165   1  18 110    2     1   1  0  0   0 2296   0   1
## 166   1  20 121    1     1   1  0  1   0 2296   1   0
## 169   1  31 102    1     1   1  0  0   1 2353   1   0
## 170   1  15 110    1     0   0  0  0   0 2353   1   0
## 171   1  23 187    2     1   0  0  0   1 2367   0   1
## 172   1  20 122    2     1   0  0  0   0 2381   0   1
## 178   1  17 120    1     1   0  0  0   3 2414   1   0
## 179   1  23 110    1     1   1  0  0   0 2424   1   0
## 180   1  17 120    2     0   0  0  0   2 2438   0   1
## 181   1  26 154    3     0   1  1  0   1 2442   0   0
## 187   1  23  94    3     1   0  0  0   0 2495   0   0
## 188   1  17 142    2     0   0  1  0   0 2495   0   1

Our Model

Using the lm function, we construct a linear model that attempts to predict the baby’s weight from the mother’s age, weight, smoking status, and race. Note that the smoke variable and the race indicators are both categorical.

LinearModel <- lm(bwt ~ age + lwt + smoke + wht + blk, data = train)
summary(LinearModel)
## 
## Call:
## lm(formula = bwt ~ age + lwt + smoke + wht + blk, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2363.4  -408.2   114.7   472.1  1618.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2341.517    388.700   6.024 2.39e-08 ***
## age            1.163     13.092   0.089  0.92939    
## lwt            4.476      2.275   1.968  0.05164 .  
## smoke       -417.779    146.279  -2.856  0.00515 ** 
## wht          427.281    156.721   2.726  0.00747 ** 
## blk          -91.064    221.095  -0.412  0.68125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 717.4 on 108 degrees of freedom
## Multiple R-squared:  0.1762, Adjusted R-squared:  0.1381 
## F-statistic: 4.622 on 5 and 108 DF,  p-value: 0.0007346

Using \(\alpha = 0.1\), and based on the results of the regression, there is insufficient evidence to conclude that the coefficient on age differs from 0. Thus, we drop that variable and rerun the regression. While the same can be said for the blk variable, we leave it in for now since the wht variable is statistically significant and both indicators encode the same underlying race variable.

LinearModel <- lm(bwt ~ lwt + smoke + wht + blk, data = train)
summary(LinearModel)
## 
## Call:
## lm(formula = bwt ~ lwt + smoke + wht + blk, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2359.1  -408.6   112.0   473.1  1641.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2363.261    300.530   7.864 2.89e-12 ***
## lwt            4.517      2.217   2.038  0.04400 *  
## smoke       -419.683    144.039  -2.914  0.00433 ** 
## wht          429.532    153.953   2.790  0.00622 ** 
## blk          -92.885    219.138  -0.424  0.67250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 714.2 on 109 degrees of freedom
## Multiple R-squared:  0.1762, Adjusted R-squared:  0.146 
## F-statistic: 5.828 on 4 and 109 DF,  p-value: 0.0002742

Again, we see that very little has changed: the blk variable remains insignificant. If we leave it in the model, we can read off the following equations, where \(x\) is the mother’s weight (lwt).

Race    Smoke   Model
White   Yes     \(\hat{y} = 2373.1 + 4.52x\)
White   No      \(\hat{y} = 2792.8 + 4.52x\)
Black   Yes     \(\hat{y} = 1850.7 + 4.52x\)
Black   No      \(\hat{y} = 2270.4 + 4.52x\)
Other   Yes     \(\hat{y} = 1943.6 + 4.52x\)
Other   No      \(\hat{y} = 2363.3 + 4.52x\)

We can see that based on the combination of race and smoking habits, the baby’s weight is going to fluctuate. However, the rate at which the baby’s weight will change based on the mother’s weight will remain constant.
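Each of these intercepts is simply a sum of fitted coefficients. As a quick sketch, the white-smoker intercept can be recovered directly from the model:

b <- coef(LinearModel)
# intercept for a white mother who smokes: b0 + b_wht + b_smoke
unname(b["(Intercept)"] + b["wht"] + b["smoke"])   # approximately 2373.1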

If we remove the blk variable, we effectively combine mothers who identified as black with those who identified as other. Thus, removing that variable from the regression results in a white/non-white binary variable.

LinearModel <- lm(bwt ~ lwt + smoke + wht , data = train)
summary(LinearModel)
## 
## Call:
## lm(formula = bwt ~ lwt + smoke + wht, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2365.25  -439.46    66.46   466.47  1633.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2371.746    298.742   7.939 1.87e-12 ***
## lwt            4.264      2.127   2.005 0.047421 *  
## smoke       -432.152    140.476  -3.076 0.002644 ** 
## wht          460.232    135.344   3.400 0.000938 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 711.5 on 110 degrees of freedom
## Multiple R-squared:  0.1748, Adjusted R-squared:  0.1523 
## F-statistic: 7.769 on 3 and 110 DF,  p-value: 9.419e-05

Once we remove the blk variable, we end up with the following equations.

Race        Smoke   Model
White       Yes     \(\hat{y} = 2399.8 + 4.26x\)
White       No      \(\hat{y} = 2832.0 + 4.26x\)
non-white   Yes     \(\hat{y} = 1939.6 + 4.26x\)
non-white   No      \(\hat{y} = 2371.7 + 4.26x\)
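We can check any of these equations with predict. A sketch for a hypothetical white, non-smoking mother weighing 130 pounds:

# hypothetical new observation: white non-smoker with lwt = 130
predict(LinearModel, newdata = data.frame(lwt = 130, smoke = 0, wht = 1))
# roughly 2832.0 + 4.26 * 130, i.e. about 3386 grams, up to rounding of the printed coefficients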

Analysis of the Model

In our analysis, we end up with the following equations.

Race        Smoke   Model
White       Yes     \(\hat{y} = 2399.8 + 4.26x\)
White       No      \(\hat{y} = 2832.0 + 4.26x\)
non-white   Yes     \(\hat{y} = 1939.6 + 4.26x\)
non-white   No      \(\hat{y} = 2371.7 + 4.26x\)

summary(LinearModel)
## 
## Call:
## lm(formula = bwt ~ lwt + smoke + wht, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2365.25  -439.46    66.46   466.47  1633.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2371.746    298.742   7.939 1.87e-12 ***
## lwt            4.264      2.127   2.005 0.047421 *  
## smoke       -432.152    140.476  -3.076 0.002644 ** 
## wht          460.232    135.344   3.400 0.000938 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 711.5 on 110 degrees of freedom
## Multiple R-squared:  0.1748, Adjusted R-squared:  0.1523 
## F-statistic: 7.769 on 3 and 110 DF,  p-value: 9.419e-05

We note that the \(p\)-value of the \(F\)-statistic is quite low, indicating that at least one coefficient is non-zero. That makes sense, since at \(\alpha = 0.05\) all of our coefficients are individually significant. The \(R^2\) value of 0.1748 indicates that our model accounts for only about 17.5% of the variability in the baby’s weight.
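If you want these quantities programmatically rather than reading them off the printout, they are stored in the summary object:

s <- summary(LinearModel)
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared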

Standard Errors & Hypothesis Testing

Based on the summary statistics and taking \(\alpha = 0.05\), we reject, for each coefficient, the null hypothesis that it equals 0 and conclude that all of them are non-zero. Consistent with this, none of the 95% confidence intervals for the coefficients include 0.

confint(LinearModel, level=0.95)
##                     2.5 %     97.5 %
## (Intercept) 1779.70983607 2963.78236
## lwt            0.04936889    8.47796
## smoke       -710.54197366 -153.76185
## wht          192.01268448  728.45184

Visual Assessment of the Model

library(ggplot2)   # plotting

ggplot(data = dat, aes(x = lwt, y = bwt)) +
  facet_wrap(~ race) +
  geom_point(aes(color = as.factor(smoke))) +
  geom_smooth(method = "lm") +
  labs(x = "Mother's Weight", y = "Baby's Weight", color = "Mother Smokes")

From the plot above, we can see that there appears to be a positive correlation between the mother’s weight (lwt) and the baby’s weight (bwt). We can also quickly see that a larger proportion of the points for mothers who smoked (smoke = 1) fall below the regression line, indicating that babies whose mothers smoked tend to weigh less.

ggplot(data = dat, aes(x = lwt, y = bwt)) +
  facet_wrap( ~ smoke) +
  geom_point(aes(color=as.factor(race))) +
  geom_smooth(method = "lm") +
  labs(x="Mother's Weight", y="Baby's Weight", color="Race")

The graph above illustrates the relationship between the mother’s weight and the baby’s weight, split by whether or not the mother smoked. One thing to notice about these two graphs is that the slope of the regression line differs across all five panels, suggesting that there may be some interaction between the mother’s weight, race, and smoking variables. However, upon inspection, it does not appear that any interaction between the mother’s weight and either the race or smoking variables is statistically significant.
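One way to carry out that inspection (a sketch that was not part of the original output; InteractionModel is a hypothetical name) is to fit the interaction terms explicitly and examine their \(p\)-values:

# fit interactions between mother's weight and the categorical variables
InteractionModel <- lm(bwt ~ lwt * smoke + lwt * wht, data = train)
summary(InteractionModel)   # inspect the p-values on the lwt:smoke and lwt:wht terms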

To verify our model satisfies our regression assumptions, we create our residual plots.

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(LinearModel)

In the residuals vs. fitted plot, there does not seem to be any pattern relating the fitted values to the residuals. The Q-Q plot lies very close to the 45-degree line. There are no discernible trends in the scale-location plot. Finally, there do not appear to be any high-residual, high-leverage data points. Thus, our model appears to satisfy all of the underlying assumptions.

Testing the Model

Now that we are reasonably happy with our model, we can see how well it fits the testing data. Since the goal of linear regression is often to make predictions on new data, evaluating the model on the withheld testing data helps us assess that.

library(modelr)
test <- test %>%
  add_predictions(LinearModel)
head(test)
##    low age lwt race smoke ptl ht ui ftv  bwt wht blk     pred
## 1    0  19 182    2     0   0  0  1   0 2523   0   1 3147.733
## 2    0  33 155    3     0   0  0  0   3 2551   0   0 3032.614
## 9    0  29 123    1     1   0  0  0   1 2663   1   0 2924.257
## 11   0  19  95    3     0   0  0  0   0 2722   0   0 2776.794
## 13   0  22  95    3     0   0  1  0   0 2751   0   0 2776.794
## 14   0  30 107    3     0   1  0  1   2 2750   0   0 2827.958

We compare the test-sample MSE to the training-sample MSE below and note that the difference is small. This comparison becomes more powerful when you are evaluating multiple models, as it helps avoid overfitting.

test %>% 
  add_predictions(LinearModel) %>%
  summarise(MSE = mean((bwt - pred)^2))
##        MSE
## 1 402560.1
train %>% 
  add_predictions(LinearModel) %>%
  summarise(MSE = mean((bwt - pred)^2))
##        MSE
## 1 488471.9

These MSE values are similar (in fact, the MSE for the test data is smaller), so we can be reasonably sure we are not overfitting the model.
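Since MSE is measured in squared grams, it can be hard to interpret directly; taking square roots of the values above converts the errors to grams:

sqrt(c(test = 402560.1, train = 488471.9))   # about 634 and 699 grams, respectively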

We also see that there is a nearly 29% correlation between the actual and predicted birth weights.

cor(cbind(test$bwt,test$pred))
##           [,1]      [,2]
## [1,] 1.0000000 0.2882381
## [2,] 0.2882381 1.0000000

The min-max accuracy (the mean, across observations, of the ratio min(actual, predicted) / max(actual, predicted), so that values closer to 1 are better) is 0.840611, which is reasonably good for our test data.

mean(apply(cbind(test$bwt,test$pred), 1, min) / apply(cbind(test$bwt,test$pred), 1, max))  
## [1] 0.840611

Finally, we calculate the mean absolute percentage error (MAPE).

mean(abs((test$bwt - test$pred))/test$bwt)  
## [1] 0.203628

Here, we have a MAPE of about 20.4%. Thus, on average, the error between the predicted and actual values is around 20.4% of the actual birth weight.

Citations

Dalpiaz, David. (2016–2020). Applied Statistics with R. https://daviddalpiaz.github.io/appliedstats/

Prabhakaran, Selva. (2016–2017). r-statistics.co. http://r-statistics.co/Linear-Regression.html

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.