Linear Regression Using Dummy Variables

Introduction

Linear regression is not strictly limited to quantitative data. A linear regression model can include qualitative data as well. In this tutorial, we will show how to use indicator, or “dummy”, variables to represent qualitative data and how to incorporate them into a linear regression model.

1. Dummy Variables

An indicator variable, or dummy variable, is an input variable that represents qualitative data, such as gender, race, etc. Dummy variables are sometimes referred to as binary variables because they usually take just two values, 1 or 0, with 1 generally representing the presence of a characteristic and 0 representing its absence. For example, if we had a dummy variable called male, 1 would indicate that the individual is male and 0 would indicate that the individual is female (or non-male, to be more precise). Using 1 and 0 to represent the presence or absence of a particular characteristic is arbitrary, but as we will see, it is quite convenient. Generally, we can define a dummy variable $D$ as

\[ \begin{aligned} D = \begin{cases} 1 & \text{if characteristic is present} \\ 0 & \text{if characteristic is not present} \end{cases} \end{aligned} \]

Dummy variables can be used to capture changes in the model’s intercept, slope, or both, as we shall see later on.
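As a minimal sketch of the definition above, the following R snippet builds a 0/1 dummy from a qualitative variable. The vector `degree_status` and its values are hypothetical and are not part of the tutorial's data.

```r
# Hypothetical qualitative variable (illustrative values only)
degree_status <- c("degree", "no degree", "degree", "no degree", "no degree")

# D = 1 if the characteristic (holding a degree) is present, 0 otherwise
D <- as.integer(degree_status == "degree")
D
```

The comparison returns a logical vector, and `as.integer()` maps `TRUE` to 1 and `FALSE` to 0, which matches the coding convention described above.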

1.1 Intercept Dummy Variables

Perhaps the most common use of dummy variables is to modify the regression model’s intercept parameter. Adding an indicator variable $D$ to the model, along with a new parameter $\delta$, gives us

\[ \begin{aligned} y = \beta_{0} + \delta D + \beta_{1}x + e \end{aligned} \]

Suppose we had a regression model where the variable $\text{DEGREE}$ represented whether or not an individual obtained a college degree, with 1 indicating that he or she did, and 0 otherwise. In this model, our dependent variable is $\text{INCOME}$, and we include another independent variable, $\text{EXPERIENCE}$, indicating how many years of work experience that individual has. Such a model can be represented as follows

\[ \begin{aligned} \mathbf{E}(\text{INCOME}) = \begin{cases} (\beta_{0} + \delta) + \beta_{1}\text{EXPERIENCE}, & \text{DEGREE}=1 \\ \beta_{0} + \beta_{1}\text{EXPERIENCE}, & \text{DEGREE} = 0 \end{cases} \end{aligned} \]

For those with a college degree, the intercept of the model is $(\beta_{0} + \delta)$, while the intercept of the model for those without a college degree is simply $\beta_{0}$. The difference between the two groups, or models, can be depicted by the graph below.

Figure 1: An intercept dummy variable

The slopes of the two lines are the same, but their intercepts, as noted earlier, are different. We can interpret Figure 1 as, “For two individuals with the same level of work experience, the individual who holds a college degree has an expected income that is $\delta$ higher than that of the individual without one.”
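To make the intercept shift concrete, here is a small simulation sketch in R. Everything in it is an assumption made for illustration: the sample size, the "true" parameter values (an intercept of 20, a shift of 15, a slope of 1.5), and the noise level are arbitrary and do not come from any real income data.

```r
set.seed(1)  # reproducibility of this illustration only

n <- 500
experience <- runif(n, min = 0, max = 30)      # years of work experience
degree     <- rbinom(n, size = 1, prob = 0.5)  # 1 = holds a degree, 0 = does not

# Assumed parameters for the simulation: beta0 = 20, delta = 15, beta1 = 1.5
income <- 20 + 15 * degree + 1.5 * experience + rnorm(n, sd = 5)

# Intercept dummy model: income = beta0 + delta * degree + beta1 * experience + e
fit_intercept <- lm(income ~ degree + experience)
b <- coef(fit_intercept)

# Estimated intercepts for the two groups; both share the slope b["experience"]
intercept_no_degree <- unname(b["(Intercept)"])
intercept_degree    <- unname(b["(Intercept)"] + b["degree"])

c(no_degree = intercept_no_degree, degree = intercept_degree)
```

The coefficient on `degree` is the estimate of $\delta$, that is, the vertical gap between the two parallel lines.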

1.2 Choosing the Reference Group

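The group for which the dummy variable $D = 0$ is referred to as the reference group, or base group. In our example, the reference group consists of individuals who do not hold a college degree, and $\delta$ measures the difference in expected income between individuals with a college degree and this reference group. Which group is coded as 0 is the analyst's choice; reversing the coding changes the sign of $\delta$ but not the fit of the model.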
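In R, a qualitative variable stored as a factor is converted into dummy variables automatically, with the first factor level serving as the reference group. The sketch below uses made-up data (the vectors `educ` and `income` are hypothetical) to show how `relevel()` changes which group acts as the reference.

```r
# Hypothetical data for illustration only
educ   <- factor(c("none", "college", "college", "none", "college", "none"))
income <- c(30, 52, 48, 33, 55, 29)

# Factor levels are sorted alphabetically by default, so "college" is the
# reference group and the reported coefficient is for "none"
levels(educ)
fit_default <- lm(income ~ educ)

# Make "none" the reference group, so the coefficient compares
# college-degree holders against those without a degree
educ2 <- relevel(educ, ref = "none")
fit_releveled <- lm(income ~ educ2)

coef(fit_default)
coef(fit_releveled)
```

Both fits describe the same two group means. Only the baseline changes, so the dummy coefficient keeps its magnitude and flips its sign.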

1.3 Slope Dummy Variables

Dummy variables can also be introduced to affect the slope of a regression model. Rather than assuming that having a college degree changes the intercept, as in our previous model, let us assume that it changes the slope of the relationship. We can allow for a change in slope by including an additional explanatory variable that is equal to the product of a dummy variable and a continuous variable. In such a model, the slope of the relationship is the change in expected income from a one-unit increase in work experience. We can specify a generic form of such a model as

\[ \begin{aligned} y = \beta_{0} + \beta_{1}x + \gamma(x \times D) + e \end{aligned} \]

The new variable $(x \times D)$ is the product of $x$ and the dummy variable $D$. This is what is called an interaction variable, as it captures the interaction effect of the continuous variable $x$ and the dummy variable $D$. Alternatively, it is also referred to as a slope-indicator variable, or a slope dummy variable, because it allows for a change in the slope of the relationship. Using our current example, the regression model that describes the interaction effects of work experience and education can be written as

\[ \begin{aligned} \text{INCOME} = \beta_{0} + \beta_{1}\text{EXPERIENCE} + \gamma(\text{EXPERIENCE} \times \text{DEGREE}) + e \end{aligned} \]

Despite its unusual nature, the slope dummy variable is treated like any other explanatory variable in a regression model. Analyzing the regression model for the two different individual groups best illustrates the effect of the inclusion of the slope dummy variable into the economic model.

For individuals who hold a college degree, the expected change in income per additional year of work experience is $(\beta_{1} + \gamma)$, while for those who do not hold a college degree, it is simply $\beta_{1}$.

\[ \begin{aligned} \mathbf{E}(\text{INCOME}) = \beta_{0} + \beta_{1}\text{EXPERIENCE} + \gamma(\text{EXPERIENCE} \times \text{DEGREE}) = \begin{cases} \beta_{0} + (\beta_{1} + \gamma)\text{EXPERIENCE}, & \text{DEGREE} = 1\\ \beta_{0} + \beta_{1}\text{EXPERIENCE}, & \text{DEGREE} = 0 \end{cases} \end{aligned} \]

Another way to see the effect of including a slope dummy variable is to use a little bit of calculus. The partial derivative of expected income with respect to work experience gives us the slope of the relationship. That is,

\[ \begin{aligned} \frac{\partial \mathbf{E}(\text{INCOME})}{\partial \text{EXPERIENCE}} = \begin{cases} \beta_{1} + \gamma & \text{when } \text{DEGREE} = 1 \\ \beta_{1} & \text{when } \text{DEGREE} = 0 \end{cases} \end{aligned} \]

We’ve seen what an intercept dummy variable looks like graphically; the figure below shows the effect of a slope dummy variable, where the two lines share an intercept but have different slopes.

Figure 2: Slope-Dummy Variable
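The slope dummy model can be sketched with simulated data in the same way. As before, all names and numbers here are assumptions made for illustration: the simulation uses an intercept of 20, a base slope of 1.0, and a slope shift of 0.8.

```r
set.seed(2)  # reproducibility of this illustration only

n <- 500
experience <- runif(n, min = 0, max = 30)
degree     <- rbinom(n, size = 1, prob = 0.5)

# Assumed parameters for the simulation: beta0 = 20, beta1 = 1.0, gamma = 0.8
income <- 20 + 1.0 * experience + 0.8 * experience * degree + rnorm(n, sd = 5)

# The slope dummy is simply the product of the continuous variable and the dummy
exp_x_degree <- experience * degree

fit_slope <- lm(income ~ experience + exp_x_degree)
b <- coef(fit_slope)

# Estimated slopes for the two groups; both share the intercept b["(Intercept)"]
slope_no_degree <- unname(b["experience"])
slope_degree    <- unname(b["experience"] + b["exp_x_degree"])

c(no_degree = slope_no_degree, degree = slope_degree)
```

The coefficient on `exp_x_degree` is the estimate of $\gamma$, the extra income per year of experience for degree holders.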

1.4 Dummy Variables Example: Effects on House Prices

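Now that we have a basic understanding of how dummy variables work, let’s take a look at an example of how one might use dummy variables in a regression model. In this example, we will be examining various qualitative and quantitative factors that might affect house prices. Here’s a brief look at our data set, which consists of 1,000 observations and 6 variables. The excerpt below omits the utown column. Judging by the dollar figures used to interpret the coefficients later on, price is measured in thousands of dollars and sqft in hundreds of square feet.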

price	sqft	age	pool	fplace
205.452	23.46	6	0	1
185.328	20.03	5	0	1
248.422	27.77	6	0	0
154.690	20.17	1	0	0
221.801	26.45	0	0	1
199.119	21.56	6	0	1
272.134	29.91	9	0	1
250.631	27.98	0	0	1
197.240	24.80	0	1	0
235.755	27.50	0	0	0

Let us suppose that a potential home buyer wants to know the expected price of a house based on several features, such as square footage, age, whether it is near a university, whether it has a pool, and whether it has a fireplace. We can represent this with the following model

\[ \begin{aligned} \text{PRICE} = \beta_{0} + \delta_{1}\text{UTOWN} + \beta_{1}\text{SQFT} + \gamma (\text{SQFT} \times \text{UTOWN}) + \beta_{2}\text{AGE} + \delta_{2}\text{POOL} + \delta_{3}\text{FPLACE} + e \end{aligned} \]

$\text{UTOWN} = 1$ if the house is within close proximity to a university, $\text{POOL} = 1$ if the house has a pool, and $\text{FPLACE} = 1$ if the house has a fireplace. If the house lacks one of these qualitative features, the corresponding variable is $0$. The variables $\text{SQFT}$ and $\text{AGE}$ are both quantitative variables.

Let’s estimate the parameters of our model.

# Regress price on the dummies, the quantitative variables, and the sqft-by-utown interaction
housing.fit <- lm(price ~ utown + sqft + I((sqft * utown)) + age + pool + fplace, data = housing)
summary(housing.fit)

## 
## Call:
## lm(formula = price ~ utown + sqft + I((sqft * utown)) + age + 
##     pool + fplace, data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.289 -10.141   0.148  10.565  44.783 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        24.5000     6.1917   3.957 8.13e-05 ***
## utown              27.4530     8.4226   3.259 0.001154 ** 
## sqft                7.6122     0.2452  31.048  < 2e-16 ***
## I((sqft * utown))   1.2994     0.3321   3.913 9.72e-05 ***
## age                -0.1901     0.0512  -3.712 0.000217 ***
## pool                4.3772     1.1967   3.658 0.000268 ***
## fplace              1.6492     0.9720   1.697 0.090056 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.23 on 993 degrees of freedom
## Multiple R-squared:  0.8706, Adjusted R-squared:  0.8698 
## F-statistic:  1113 on 6 and 993 DF,  p-value: < 2.2e-16
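As an aside on the formula used in the call above: with a numeric 0/1 dummy, R's formula operators `:` and `*` describe the same interaction as the explicit product wrapped in `I()`. The sketch below checks this on simulated data. The data frame `sim` and the parameter values used to generate it are hypothetical and are not the tutorial's housing data.

```r
set.seed(3)  # reproducibility of this illustration only

n <- 200
sqft  <- runif(n, min = 20, max = 30)
utown <- rbinom(n, size = 1, prob = 0.5)

# Arbitrary parameters chosen only to generate plausible-looking data
price <- 25 + 27 * utown + 7.5 * sqft + 1.3 * sqft * utown + rnorm(n, sd = 15)
sim <- data.frame(price, sqft, utown)

fit_a <- lm(price ~ utown + sqft + I(sqft * utown), data = sim)  # explicit product
fit_b <- lm(price ~ utown + sqft + utown:sqft, data = sim)       # interaction only
fit_c <- lm(price ~ utown * sqft, data = sim)                    # main effects + interaction

# The three specifications give the same estimates (only the labels differ)
all.equal(unname(coef(fit_a)), unname(coef(fit_b)))
all.equal(unname(coef(fit_a)), unname(coef(fit_c)))
```

Which spelling to use is a matter of taste. The `I()` form makes the constructed variable explicit, while `utown * sqft` is shorter and always includes both main effects.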

Our estimated model is

\[ \begin{aligned} \widehat{\text{PRICE}} = 24.5 + 27.453\,\text{UTOWN} + 7.6122\,\text{SQFT} + 1.2994\,(\text{SQFT} \times \text{UTOWN}) - 0.1901\,\text{AGE} + 4.3772\,\text{POOL} + 1.6492\,\text{FPLACE} \end{aligned} \]
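Substituting $\text{UTOWN} = 1$ and $\text{UTOWN} = 0$ into the estimates reported in the regression output separates the fitted model into one equation for each location, which makes the intercept and slope shifts explicit:

\[ \begin{aligned} \widehat{\text{PRICE}} = \begin{cases} 51.953 + 8.9116\,\text{SQFT} - 0.1901\,\text{AGE} + 4.3772\,\text{POOL} + 1.6492\,\text{FPLACE}, & \text{UTOWN} = 1 \\ 24.5 + 7.6122\,\text{SQFT} - 0.1901\,\text{AGE} + 4.3772\,\text{POOL} + 1.6492\,\text{FPLACE}, & \text{UTOWN} = 0 \end{cases} \end{aligned} \]

Here $51.953 = 24.5 + 27.453$ and $8.9116 = 7.6122 + 1.2994$.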

So, how can we interpret our fitted model? Holding the other variables constant, the estimates can be read as follows.

The location premium for a house near a university is $27,453 (the coefficient of $\text{UTOWN}$).
The change in $\mathbf{E}(\text{PRICE})$ per additional square foot is $89.12 for houses near the university and $76.12 for houses elsewhere.
As a house ages each year, its value decreases by $190.10.
The presence of a pool increases the value of a house by $4,377.20.
The presence of a fireplace increases the value of a house by $1,649.20, although this estimate is statistically significant only at the 10% level.
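The arithmetic behind these interpretations can be reproduced with a short R sketch. The coefficients are copied by hand from the regression output above. The helper `predict_price()` and the example house are hypothetical, and the unit conversions assume that price is in thousands of dollars and sqft is in hundreds of square feet.

```r
# Coefficient estimates copied from the summary() output
# (assumed units: price in thousands of dollars, sqft in hundreds of square feet)
b <- c(intercept = 24.5000, utown = 27.4530, sqft = 7.6122,
       sqft_utown = 1.2994, age = -0.1901, pool = 4.3772, fplace = 1.6492)

# Dollar change in expected price per additional square foot:
# multiply by 1000 (thousands of dollars) and divide by 100 (hundreds of sq ft)
per_sqft_elsewhere <- unname(b["sqft"]) * 1000 / 100
per_sqft_utown     <- unname(b["sqft"] + b["sqft_utown"]) * 1000 / 100

# Hypothetical helper: fitted price (in thousands of dollars) for one house
predict_price <- function(sqft, age, utown, pool, fplace) {
  unname(b["intercept"] + b["utown"] * utown + b["sqft"] * sqft +
           b["sqft_utown"] * sqft * utown + b["age"] * age +
           b["pool"] * pool + b["fplace"] * fplace)
}

# Example: 2,500 sq ft, 10 years old, near the university, no pool, with a fireplace
predict_price(sqft = 25, age = 10, utown = 1, pool = 0, fplace = 1)
```

For this example house the fitted value is about 274.49, that is, roughly 274,491 dollars. With the fitted model object available, `predict(housing.fit, newdata = ...)` would give the same kind of prediction without copying coefficients by hand.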