Recall that our broad business question is, “How are quarterly sales affected by quarter of the year, region, and product category (parent name)?” Up to this point we have analyzed the ability of quantitative predictor variables to improve predictions. In this video we will discuss how qualitative predictor variables can be included in regression models by creating what are known as dummy variables. This will allow us to incorporate quarter of the year into a multiple regression model to predict and explain quarterly revenue.
If you haven’t already done so, install the tidyverse collection of packages.
You only need to install these packages once on the machine that you’re using. If you haven’t installed them yet, uncomment the next code chunk and run it; if you have, skip it.
# install.packages('tidyverse')
Load the tidyverse collection of packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Make sure that you have also downloaded the tecaRegressionData.rds file into the same folder in which this file is saved. Use the next code chunk to read in the data and load it as a dataframe object.
trd <- readRDS('tecaRegressionData.rds')
Qualitative variables are those that do not have a numeric value associated with them, such as gender or country of origin. These types of variables can provide an important source of predictive and explanatory power; however, machine learning algorithms, including regression, rely on numeric values, so we have to somehow convert qualitative variables into numeric ones.
Some qualitative variables lend themselves well to a numeric conversion because they have a natural order. For example, quarter of the year can be converted to values of 1, 2, 3, and 4. Similarly, gold, silver, and bronze medals in the Olympics could be converted to numeric values of 1, 2, and 3, respectively. These are known as ordinal variables. This type of ordinal encoding is not possible for nominal variables, like country or gender.
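For example, here is a minimal sketch of ordinal encoding (the medals vector is made up for illustration and is not part of our data):
medals <- factor(c('gold', 'silver', 'bronze'),
                 levels = c('gold', 'silver', 'bronze'), ordered = TRUE)
as.integer(medals) # the integer codes follow the level order
## [1] 1 2 3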
Sometimes, even when it is possible to convert a qualitative variable to numeric values, it does not make sense to do so, because a linear model would assume that the change between each consecutive value is constant. This is especially true for quarter of the year. For example, if quarters 1 and 3 are the busiest seasons of the year for some industries, while sales are lower in quarters 2 and 4, then it wouldn’t make sense to force a single constant positive or negative coefficient on the variable that represents quarter of the year.
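To make that assumption concrete, here is a hedged sketch (lmOrdinal is a hypothetical name; trd and quarterNoYear are the objects loaded above):
# Encoding quarter as the integers 1 through 4 and fitting one slope
# forces every one-quarter step to change predicted revenue by the
# same fixed amount, which the data may not support
lmOrdinal <- lm(totalRevenue ~ as.integer(quarterNoYear), data = trd)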
Instead, what is often done is that a series of binary variables is used to capture the different levels of the qualitative variable. Specifically, we would replace the quarter-of-the-year variable, quarterNoYear, with three variables: Second, Third, and Fourth. The values in these columns take on a value of 1 if the observation fits into that category, and a value of 0 otherwise. We only need three columns because if all three have a value of 0, then the observation must fit into the first quarter.
Here’s a dataframe to illustrate that idea with a bit more detail:
data.frame('quarterNoYear' = c('First', 'Second', 'Third', 'Fourth')
, 'quarterNoYearSecond' = c(0,1,0,0)
, 'quarterNoYearThird' = c(0,0,1,0)
, 'quarterNoYearFourth' = c(0,0,0,1))
Because R was made for analytics, it has a factor class that can be very helpful. A factor displays its data like a character string so that it makes sense to humans; however, under the hood each value is stored as an integer code, which is what gets used in analytics tasks like visualizations, column summaries, and some machine learning algorithms, including regression. The lm() function in R knows that factor variables should be converted to dummy variables, and it does that automatically.
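Here is a small sketch of what a factor stores (qtr is a made-up vector for illustration):
qtr <- factor(c('First', 'Second', 'Second', 'Third'))
levels(qtr) # the character labels that humans see
## [1] "First"  "Second" "Third"
as.integer(qtr) # the integer codes stored underneath
## [1] 1 2 2 3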
Let’s see what happens when we run a simple regression of totalRevenue on the quarterNoYear column, which has a data type of factor.
lm6 <- lm(totalRevenue ~ quarterNoYear, data = trd)
summary(lm6)
##
## Call:
## lm(formula = totalRevenue ~ quarterNoYear, data = trd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9395.2 -3523.6 -544.7 2519.9 30376.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11538.1 467.6 24.673 <2e-16 ***
## quarterNoYearSecond -1043.7 661.3 -1.578 0.115
## quarterNoYearThird 1070.0 661.3 1.618 0.106
## quarterNoYearFourth 823.4 661.3 1.245 0.214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5553 on 560 degrees of freedom
## Multiple R-squared: 0.02182, Adjusted R-squared: 0.01658
## F-statistic: 4.164 on 3 and 560 DF, p-value: 0.006232
Notice that there is a coefficient estimate for the second through fourth quarters, but not for the first quarter. In this case, the intercept represents the estimate of totalRevenue for the first quarter, and the coefficient estimates for the other variables represent the difference between each of those quarters and the first quarter.
Let’s create a manual comparison by calculating the mean value of totalRevenue for each quarter.
trd %>%
group_by(quarterNoYear) %>%
summarize(meanRevenue = mean(totalRevenue)) %>%
ungroup()
Notice that the value of meanRevenue for the first quarter, 11,538.13, is the same as the intercept in the regression model.
The value of meanRevenue for the second quarter, 10,494.42, is less than the value for the first quarter by 1,043.71, which is exactly the coefficient estimate on quarterNoYearSecond.
There’s a similar relationship for the other two quarters: each coefficient estimate equals the difference between that quarter’s mean and the first quarter’s mean.
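As a quick check (assuming lm6 from above), we can recover each quarter’s implied mean revenue directly from the fitted coefficients:
coefs <- coef(lm6)
coefs[1] + c(0, coefs[-1]) # intercept plus each quarter's offset gives the four quarterly means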
However, notice that none of these coefficient estimates are statistically significant at the .05 level, meaning that these differences could really just be a result of random fluctuations.
Quarter of the year may still have a significant effect on quarterly revenue after controlling for the percentage of sales that come from other products. Let’s test this by including it with the other variables that we have already investigated.
lm7 <- lm(totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1 + quarterNoYear, data = trd)
summary(lm7)
##
## Call:
## lm(formula = totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1 +
## quarterNoYear, data = trd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11118.3 -2647.3 -292.1 1821.7 25342.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10917.2 2391.7 -4.565 6.16e-06 ***
## Fuel_py1 34054.0 2610.0 13.047 < 2e-16 ***
## Juicetonics_py1 82618.1 40553.8 2.037 0.04210 *
## ColdDispensedBeverage_py1 -83594.0 29323.1 -2.851 0.00452 **
## quarterNoYearSecond -1318.2 504.5 -2.613 0.00922 **
## quarterNoYearThird 461.9 563.8 0.819 0.41301
## quarterNoYearFourth 213.7 558.3 0.383 0.70206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4223 on 557 degrees of freedom
## Multiple R-squared: 0.4373, Adjusted R-squared: 0.4312
## F-statistic: 72.14 on 6 and 557 DF, p-value: < 2.2e-16
It appears that after considering the impact of those other parent categories, total revenue during the second quarter of the year is significantly lower than total revenue during the first quarter of the year.
This process of converting a single column of values into multiple columns of binary values, or dummy variables, is also known as one-hot encoding. Not all machine learning algorithms natively make that conversion when factor variables are encountered, so you may need to learn how to one-hot encode qualitative variables using other methods.
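One base-R option is model.matrix(); here is a minimal sketch, assuming trd and quarterNoYear as above:
# An intercept column plus dummies for all but the first level
# (this is the coding that lm() builds internally)
head(model.matrix(~ quarterNoYear, data = trd))
# Dropping the intercept yields one indicator column per level
head(model.matrix(~ quarterNoYear - 1, data = trd))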
Creating dummy variables, or one-hot encoding, is a powerful way of capturing the effect of qualitative variables in machine learning models. Just remember that the interpretation of their coefficient estimates is different from that of coefficient estimates for quantitative variables.