Recall that our broad business question is, “How are quarterly sales affected by quarter of the year, region, and product category (parent name)?” Up to this point we have analyzed the ability of predictor variables (also known as explanatory variables) to forecast quarterly revenue independently of each other. In other words, we have used one predictor variable at a time in a regression model, which is known as simple regression. In this lesson we will investigate the benefits of using multiple variables in the same linear model to create those forecasts, which is known as multiple regression.
If you haven’t already done so, install the tidyverse collection of packages and the jtools package. Functions from the jtools package depend on at least two other packages, ggstance and huxtable, so you will also need to install those. You should also install the corrplot package if you have not already done so.
You only need to install these packages once on the machine that you’re using. If you still need to install any of them, uncomment the relevant lines in the code chunk below and run it. If you have already installed them all, do not run the next code chunk.
#install.packages('tidyverse')
#install.packages('jtools')
#install.packages('ggstance')
#install.packages('huxtable')
#install.packages('corrplot')
Load the tidyverse collection of packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jtools) # For tabulating and visualizing results from multiple regression models
library(corrplot)
## corrplot 0.95 loaded
Make sure that you have also downloaded the tecaRegressionData.rds file into the same folder in which this file is saved. Use the next code chunk to read in the data and load it as a dataframe object.
trd <- readRDS('tecaRegressionData.rds')
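If you want a quick look at what was just loaded, dplyr’s glimpse() (or base R’s str()) prints the column names and types. This is just an optional sketch for orientation; its output is not shown here.
glimpse(trd) # Peek at the columns and types of the regression data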
Creating a linear model that uses more than one explanatory variable is easy to do in R, but we also want to illustrate that the result is not simply a combination of the coefficients from several simple regressions that each include only one predictor variable.
As a benchmark, let’s first estimate simple regression models of totalRevenue on Fuel_py1 and on Juicetonics_py1, and then lay out their coefficients and R-squared values side by side for comparison.
lm1 <- lm(totalRevenue ~ Fuel_py1, data = trd)
lm2 <- lm(totalRevenue ~ Juicetonics_py1, data = trd)
export_summs(lm1, lm2) # Create a nice looking table for comparing the regression results
|  | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -11509.68 *** | 14295.79 *** |
|  | (1216.79) | (683.78) |
| Fuel_py1 | 35096.84 *** |  |
|  | (1815.14) |  |
| Juicetonics_py1 |  | -150275.47 *** |
|  |  | (37960.10) |
| N | 564 | 564 |
| R2 | 0.40 | 0.03 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |  |
The table highlights the key takeaways from both linear models (a short sketch after this list shows how to pull these values directly from the model objects):

* Coefficient estimates
    * For model 1, the coefficient estimate is -11,509.68 for the intercept and 35,096.84 for Fuel_py1.
    * For model 2, the coefficient estimate is 14,295.79 for the intercept and -150,275.47 for Juicetonics_py1.
    * The standard errors are in parentheses below these coefficient estimates.
* N stands for the number of observations; both models are based on 564 observations.
* R2 is the R-squared, which is much larger for model 1 (40%) than for model 2 (3%). This means that model 1 explains more of the variation in totalRevenue than model 2 and is better for making predictions.
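If you would rather extract these numbers from the fitted model objects than read them off the table, base R’s accessor functions work; here is a minimal sketch (the comments describe what each call returns):
coef(lm1) # Intercept and Fuel_py1 estimate for model 1
coef(lm2) # Intercept and Juicetonics_py1 estimate for model 2
summary(lm1)$r.squared # R-squared for model 1
summary(lm2)$r.squared # R-squared for model 2
nobs(lm1) # Number of observations (N) used in model 1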
Let’s use the jtools package to plot the coefficients and standard errors to help visualize the results.
plot_summs(lm1, lm2)
I think this visualization is excellent. It clearly shows that the coefficient on Fuel_py1 from model 1 is positive and has a much smaller standard error than the negative coefficient on Juicetonics_py1 from model 2. As you may have guessed, the long orange whiskers and the short, barely visible blue whiskers represent the uncertainty around each estimate, that is, the range in which the true value plausibly lies.
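If you want those ranges as numbers rather than whiskers, base R’s confint() reports a 95% confidence interval for each coefficient; a quick sketch:
confint(lm1) # Narrow interval for Fuel_py1
confint(lm2) # Much wider interval for Juicetonics_py1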
Now, let’s run a multiple regression (so called because it includes multiple predictor variables) that contains both predictor variables in the same model, and compare all three models.
lm3 <- lm(totalRevenue ~ Fuel_py1 + Juicetonics_py1, data = trd)
export_summs(lm1, lm2, lm3)
|  | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | -11509.68 *** | 14295.79 *** | -16855.87 *** |
|  | (1216.79) | (683.78) | (1680.95) |
| Fuel_py1 | 35096.84 *** |  | 39333.35 *** |
|  | (1815.14) |  | (2014.95) |
| Juicetonics_py1 |  | -150275.47 *** | 149875.80 *** |
|  |  | (37960.10) | (33106.72) |
| N | 564 | 564 | 564 |
| R2 | 0.40 | 0.03 | 0.42 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. |  |  |  |
First, notice how the R-squared for model 3 is .42, which is higher than the R-squared of .40 in model 1 and .03 in model 2. This means that we can have a little more confidence in the accuracy of these predictions than in the predictions from model 1, and a lot more confidence in the predictions compared to model 2.
Also, the R-squared of model 3 is not simply the sum of the R-squared values from models 1 and 2. This is because Fuel_py1 and Juicetonics_py1 are correlated with each other. Let’s review the correlation matrix below; it shows that they have a fairly strong correlation of -.46 with each other. This also has implications for helping us understand the relationships among these variables.
ctrd <- cor(trd %>% select(where(is.numeric)))
corrplot(ctrd
, method = 'color' # I also like pie and ellipse
, order = 'hclust' # Orders the variables so that ones that behave similarly are placed next to each other
, addCoef.col = 'black'
, number.cex = .6 # Lower values decrease the size of the numbers in the cells
)
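If you only need that single pairwise correlation rather than the full matrix, you can compute it directly; a quick check:
cor(trd$Fuel_py1, trd$Juicetonics_py1) # Should be about -0.46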
Notice that the coefficient estimates in model 3 are different from those in the other two models, and all of them are statistically significant.
The coefficient estimates for the intercept and Fuel_py1 are similar to, but more extreme than, those in model 1. However, the coefficient estimate for Juicetonics_py1 is drastically different: it changed from a value of about -150,000 to a positive value of about 150,000. Once again, this is because Fuel_py1 and Juicetonics_py1 are correlated with each other, and in a multiple regression each coefficient estimate represents the unique effect of that predictor variable after accounting for the effects of the other predictor variables.
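One way to see what “unique effect” means is the Frisch-Waugh-Lovell idea: strip out of Juicetonics_py1 the part that Fuel_py1 explains, and then regress totalRevenue on what is left over. This is not part of the lesson’s workflow, just a sketch to build intuition (it assumes no missing values in these columns); the slope it produces should match the Juicetonics_py1 coefficient in model 3.
juice_resid <- resid(lm(Juicetonics_py1 ~ Fuel_py1, data = trd)) # Part of Juicetonics_py1 not explained by Fuel_py1
coef(lm(trd$totalRevenue ~ juice_resid)) # Slope on juice_resid is the 'unique effect'
coef(lm3)['Juicetonics_py1'] # Same value as in the multiple regression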
This change is effectively communicated by visualizing the coefficients from all three models.
plot_summs(lm1, lm2, lm3)
Conceptually, this change in the Juicetonics_py1 coefficient makes sense as well. Specifically, the negative correlation between Fuel_py1 and Juicetonics_py1 means that stores with a higher percentage of fuel sales have a smaller percentage of juice and tonics sales. People spend a lot more on fuel than on juice and tonics, so stores that depend more on sales of juice and tonics are likely not to generate as much total revenue.
So, the really great thing for explanatory purposes is that we can evaluate the effect of an increase in the percentage of sales from juice and tonics after controlling for the amount of sales from fuel. In other words, stores with the same share of fuel sales can expect higher total revenue if they increase their sales from juice and tonics.
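We can make that comparison concrete with model 3. The two store profiles below are hypothetical, chosen only to illustrate the point; they have the same fuel share and differ only in their juice and tonics share.
same_fuel <- data.frame(Fuel_py1 = c(0.70, 0.70), Juicetonics_py1 = c(0.02, 0.04)) # Two hypothetical stores
predict(lm3, newdata = same_fuel) # The second forecast should be higher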
In the absence of being able to run an experiment, this is the next best thing. In some ways it’s even preferable, because running a true experiment across stores would be really hard.
Making predictions using a multiple regression model is a fairly straightforward extension of making predictions from a simple regression model. In our case, if we want to estimate the total revenue for a store in which, during the previous year, 70% of sales came from fuel and 3% came from juice and tonics, then we would take the intercept of -16,856, add 0.7 times 39,333, and finally add 0.03 times 149,876, which gives us \(15{,}173.86 = -16{,}856 + (0.7 \times 39{,}333) + (0.03 \times 149{,}876)\). Of course, we can simply use the predict() function to calculate those predictions.
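As a minimal sketch of that predict() call, the new store below uses the hypothetical 70% fuel and 3% juice and tonics shares from the calculation above.
new_store <- data.frame(Fuel_py1 = 0.70, Juicetonics_py1 = 0.03) # One row of predictor values
predict(lm3, newdata = new_store) # Forecasted total revenue for this store profile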
You may have noticed the adjusted R-squared in the output, and that it’s always lower than the R-squared. The purpose of the adjusted R-squared is to penalize regression models for including predictor variables that add little explanatory power or that are statistically insignificant.
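Under the hood, the penalty is a simple function of the R-squared, the number of observations \(n\), and the number of predictors \(p\): \(\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\). Here is a quick sketch verifying that formula against model 3, which has two predictors:
r2 <- summary(lm3)$r.squared
n <- nobs(lm3) # 564 observations
p <- 2 # Fuel_py1 and Juicetonics_py1
1 - (1 - r2) * (n - 1) / (n - p - 1) # Hand-computed adjusted R-squared
summary(lm3)$adj.r.squared # Should match the value above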
Let’s create a larger multiple regression model by adding ColdDispensedBeverage_py1 and Lottery_py1 and look at the summary.
lm4 <- lm(totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1 + Lottery_py1, data = trd)
summary(lm4)
##
## Call:
## lm(formula = totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1 +
## Lottery_py1, data = trd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11128.2 -2566.3 -312.4 1855.2 25791.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14330 2434 -5.886 6.82e-09 ***
## Fuel_py1 36903 2610 14.140 < 2e-16 ***
## Juicetonics_py1 140171 37908 3.698 0.000239 ***
## ColdDispensedBeverage_py1 -60632 29829 -2.033 0.042558 *
## Lottery_py1 1582 5963 0.265 0.790877
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4261 on 559 degrees of freedom
## Multiple R-squared: 0.4251, Adjusted R-squared: 0.421
## F-statistic: 103.4 on 4 and 559 DF, p-value: < 2.2e-16
Notice that the Lottery_py1 variable in this model has a statistically insignificant coefficient, which means that we cannot be confident the true coefficient is different from zero; it could plausibly be zero or even negative. Carrying this variable contributes to the adjusted R-squared value being less than the R-squared value.
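A quick way to see this is the confidence interval for the Lottery_py1 coefficient, which should span zero; a short sketch:
confint(lm4)['Lottery_py1', ] # 95% confidence interval; includes zero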
We don’t want to make things more difficult than they need to be, so to keep things simple, the final model typically includes only the variables that are statistically significant; if the dropped variables truly added little, the adjusted R-squared should remain about the same as when they were left in the model. Let’s try it out.
lm5 <- lm(totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1, data = trd)
summary(lm5)
##
## Call:
## lm(formula = totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1,
## data = trd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11085.6 -2541.9 -333.1 1844.3 25796.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14033 2159 -6.499 1.79e-10 ***
## Fuel_py1 36629 2395 15.293 < 2e-16 ***
## Juicetonics_py1 135591 33721 4.021 6.59e-05 ***
## ColdDispensedBeverage_py1 -57848 27898 -2.074 0.0386 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4257 on 560 degrees of freedom
## Multiple R-squared: 0.4251, Adjusted R-squared: 0.422
## F-statistic: 138 on 3 and 560 DF, p-value: < 2.2e-16
In this case, the adjusted R-squared even increased slightly.
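As a more formal check, an F-test comparing the two nested models should agree that adding Lottery_py1 does not significantly improve the fit; a quick sketch:
anova(lm5, lm4) # Nested-model F-test; a large p-value means Lottery_py1 adds little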
plot_summs(lm1, lm2, lm3, lm4, lm5)
In conclusion, multiple regression is a very powerful tool for both predictive and explanatory purposes. It improves predictive power by combining the effects of multiple variables, and it improves explanatory power by reporting the unique effect of each predictor variable after accounting for the others.
The adjusted R-squared should be the main version of R-squared that you focus on when using multiple regressions because it is a more conservative estimate of the expected explanatory power of a model.