Simple and Multiple Regression Analysis
Lab Overview
- Simple and multiple linear regression
- Categorical variables in linear regression
- Interaction terms in linear regression
- Predicted values and residuals
Our company’s quest to understand customer spending continues. Building upon the foundation laid in our previous lab, we’re ready to take our data analysis journey to the next level. In this lab, we will delve into the realm of regression analysis, leveraging the insights we’ve gained from our exploratory analysis in Lab 3. Through this process, we’ll gain a comprehensive understanding of the factors at play and their respective impacts.
As a data analyst, you will go through the following steps:
Data Pre-processing: Clean and prepare the dataset for analysis, handle missing values if any, and ensure data is in the appropriate format.
Exploratory Data Analysis: Visualize and explore the data to gain insights into customer demographics and behavior.
Regression Analysis: Build a regression model to explain the amount spent by customers based on the provided attributes.
Interpretation and Insights: Interpret the regression results, identify significant factors driving customer spending, and draw conclusions.
1. Getting Started
Download Lab 4’s materials from Moodle:
- Save the provided R script in your code folder in the BRM-Labs project folder.
Open the provided Lab 4 R script.
Set up your R environment.
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
- Load the data.
# Load data
load("data/ecommerce.RData")
- Pre-process the data by running the code from Lab 3.
# Use the function source() to run Lab 3's code in the background
source("code/Lab3_LearningNotebook-code-Fall2025.R")
2. Simple Regression Analysis
The next step in our analysis is to build a good regression model to predict the amount spent by customers.
2.1 Continuous Dependent Variable and Numerical Independent Variable
Recall from our correlation analysis that the variables “timespent” and “visits” had the highest correlations with the outcome “amountspent”. Both variables are numeric, with \(timespent\) being continuous and \(visits\) discrete. The interpretation of their estimated coefficients will be similar: on average, one unit increase in \(x\) is associated with a \(\hat{\beta_1}\) units increase in \(y\).
# Impact of timespent on amount spent
lm1 <- lm(amountspent ~ timespent, data = tb.ecommerce)
# Directly output results using summary()
summary(lm1)
Call:
lm(formula = amountspent ~ timespent, data = tb.ecommerce)
Residuals:
Min 1Q Median 3Q Max
-189.3 -31.0 -21.8 21.5 514.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.980 3.533 3.39 0.00072 ***
timespent 3.449 0.199 17.30 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 80 on 998 degrees of freedom
Multiple R-squared: 0.231, Adjusted R-squared: 0.23
F-statistic: 299 on 1 and 998 DF, p-value: <2e-16
# Output results using stargazer()
stargazer(lm1, type = "text", no.space = TRUE
, title = "Impact of Time Spent on Amount Spent")
Impact of Time Spent on Amount Spent
===============================================
Dependent variable:
---------------------------
amountspent
-----------------------------------------------
timespent 3.449***
(0.199)
Constant 11.980***
(3.533)
-----------------------------------------------
Observations 1,000
R2 0.231
Adjusted R2 0.230
Residual Std. Error 80.003 (df = 998)
F Statistic 299.330*** (df = 1; 998)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
\(\hat{\beta}_0=11.980\) is the intercept of the model. It gives us the expected amount spent for a customer with an average time spent per visit of 0 minutes. The value is nonsensical since it is not possible to make a purchase while not visiting the website. Note that, in regression, we should limit our interpretation of coefficients to the range of data that we observe. In this case, the lowest time spent observed is \(1.7\) minutes, thus \(timespent = 0\) is outside the observed range of values. The coefficient is significantly different from zero (p<0.01).
\(\hat{\beta}_1=3.449\) is the slope of the model. It tells us that, on average, for each minute increase in the customer’s timespent, we can expect an increase of 3.449 dollars in the amount spent by the customer. It is significant at the 1% level, indicated by the three stars next to this coefficient.
The \(R^2\) of the model is 0.231. We can thus say that the variable timespent explains about 23% of the variation of the variable amountspent around its mean.
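To make the slope interpretation concrete, here is a minimal sketch that plugs the rounded coefficients from the table above into the fitted line. The helper expected_spend() and the values of 10 and 11 minutes are illustrative choices made here, not part of the lab script.

```r
# Coefficients copied (rounded) from the lm1 output above
b0 <- 11.980  # intercept
b1 <- 3.449   # slope for timespent

# Hypothetical helper: expected amount spent for a given time spent (minutes)
expected_spend <- function(timespent) b0 + b1 * timespent

expected_spend(10)                       # 11.980 + 3.449 * 10 = 46.47
expected_spend(11) - expected_spend(10)  # one extra minute adds the slope, 3.449
```

Whatever starting point we pick, moving one minute to the right along the fitted line raises the expected amount spent by the same 3.449 dollars.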
# Impact of visits on amount spent
lm2 <- lm(amountspent ~ visits, data = tb.ecommerce)
stargazer(lm2, type = "text", no.space = TRUE
, title = "Impact of Visits on Amount Spent")
Impact of Visits on Amount Spent
===============================================
Dependent variable:
---------------------------
amountspent
-----------------------------------------------
visits 21.630***
(1.513)
Constant -2.878
(4.805)
-----------------------------------------------
Observations 1,000
R2 0.170
Adjusted R2 0.169
Residual Std. Error 83.102 (df = 998)
F Statistic 204.380*** (df = 1; 998)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
\(\hat{\beta}_0=-2.878\) is the intercept of the model. It gives us the expected amount spent for customers with \(visits = 0\). The value is not meaningful, as the amount spent can never be negative. The coefficient is not significantly different from zero, hence the absence of stars next to this coefficient.
\(\hat{\beta}_1=21.630\) is the slope of the model. It tells us that, on average, for each additional visit to the website, we can expect an increase of 21.63 dollars in the amount spent by the customer. It is significant at the 1% level, indicated by the three stars next to this coefficient.
The \(R^2\) of the model is 0.170. We can thus say that the variable visits explains about 17% of the variation of the variable amountspent around its mean.
2.2 Continuous Dependent Variable and Categorical (2 categories) Independent Variable
We will now build a simple regression model with a single binary explanatory variable - the indicator for “Location = Far”. This model allows us to statistically compare the equality of two group means, similarly to the t-test procedure covered in Lab 3. \(\hat{\beta_1}\) is the estimated difference between the two group means.
# Impact of location on amount spent
lm3 <- lm(amountspent ~ location, data = tb.ecommerce)
stargazer(lm3, type = "text", no.space = TRUE
, title = "Impact of Location on Amount Spent")
Impact of Location on Amount Spent
===============================================
Dependent variable:
---------------------------
amountspent
-----------------------------------------------
locationFar 32.640***
(6.272)
Constant 45.171***
(3.378)
-----------------------------------------------
Observations 1,000
R2 0.026
Adjusted R2 0.025
Residual Std. Error 90.002 (df = 998)
F Statistic 27.080*** (df = 1; 998)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
\(\hat{\beta}_0=45.171\) is the average amount spent by customers who are “close” (where “close” is the omitted category of the variable location).
\(\hat{\beta}_1=32.640\) indicates that customers who are “far” spend, on average, $32.64 more on the website than customers who are “close”. The difference is statistically significant at the 1% level.
\(\hat{\beta}_0+\hat{\beta}_1 = 45.171 + 32.640 = 77.811\) is the average amount spent by customers who are “far”.
You can confirm these interpretations by calculating the mean of each group directly from the sample data:
# Mean amount spent by location
tb.ecommerce %>%
group_by(location) %>%
summarise(mean_amountspent = mean(amountspent))
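The link between a regression on a binary variable, the group means, and the t-test can be sketched on simulated data. The data frame sim below is hypothetical (it is not the lab's ecommerce data), and the comparison assumes the equal-variance version of the two-sample t-test.

```r
# Simulated data (hypothetical; not the lab's ecommerce data)
set.seed(1)
sim <- data.frame(
  location = factor(rep(c("Close", "Far"), each = 50)),
  amountspent = c(rnorm(50, mean = 45, sd = 20), rnorm(50, mean = 78, sd = 20))
)

fit <- lm(amountspent ~ location, data = sim)
group_means <- tapply(sim$amountspent, sim$location, mean)

# Intercept = mean of the omitted group; slope = difference in group means
coef(fit)
group_means["Close"]
group_means["Far"] - group_means["Close"]

# The slope's p-value matches a two-sample t-test with equal variances
tt <- t.test(amountspent ~ location, data = sim, var.equal = TRUE)
summary(fit)$coefficients["locationFar", "Pr(>|t|)"]
tt$p.value
```

Note that t.test() uses the Welch (unequal variances) version by default, so its p-value will generally differ slightly from the regression's unless var.equal = TRUE is set.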
3. Multiple Regression Analysis
We will now venture into multiple regression, but how do we know which model we should build?
In model building and selection, we can draw insights from multiple sources.
The initial descriptive analysis offers valuable hints on which variables to include.
Economic theory allows us to deduce plausible relationships between predictors and the outcome.
We can learn from the successes and lessons of existing models, particularly those that align with our purpose.
To discern the strength of our model, we turn to measures of goodness of fit.
The F-statistic gives us the overall significance of predictors.
The standard error indicates the precision of our coefficient estimates, a critical factor in determining the model’s reliability.
The \(R^2\) reveals the proportion of variance in the outcome variable captured by our predictors, signifying the model’s explanatory power.
The adjusted-\(R^2\) allows us to compare models built on the same dataset, with the same outcome variable but a different number of predictors.
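As a quick check on how the adjusted-\(R^2\) penalizes additional predictors, the sketch below recomputes it from the \(R^2\), the sample size \(n\), and the number of predictors \(k\), using the standard formula \(1 - (1 - R^2)\frac{n-1}{n-k-1}\). The helper adj_r2() is a hypothetical name introduced here; the inputs are the rounded values from the stargazer tables for models 1 and 2.

```r
# Hypothetical helper: adjusted R-squared from R-squared, n, and k predictors
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Inputs copied (rounded) from the stargazer tables above
round(adj_r2(r2 = 0.231, n = 1000, k = 1), 3)  # model 1: matches the reported 0.230
round(adj_r2(r2 = 0.170, n = 1000, k = 1), 3)  # model 2: matches the reported 0.169
```

Because the inputs are rounded to three decimals, the recomputed value can occasionally differ from the table in the last digit.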
To streamline our model selection process, you can start with a model encompassing all variables, creating a comprehensive foundation. You can then methodically prune the model by excluding non-relevant variables (but note that this decision is not based on the p-value alone!)
3.1 Continuous Dependent Variable and Categorical (3 categories) Independent Variable
We start by building a model with two indicator variables. This is an extension of the model we previously built with a single dummy variable. Interpretations are also similar - coefficients represent the estimated difference in means between each respective category or level and the omitted category.
Note that in the regression model, we must include the factor variable.
# Impact of customer purchase history on amount spent
lm4 <- lm(amountspent ~ history, data = tb.ecommerce)
stargazer(lm4 , type = "text", no.space = TRUE
, title = "Impact of Purchase History on Amount Spent")
Impact of Purchase History on Amount Spent
===============================================
Dependent variable:
---------------------------
amountspent
-----------------------------------------------
historyLow -84.129***
(7.891)
historyMedium -64.491***
(8.065)
Constant 101.700***
(5.434)
-----------------------------------------------
Observations 697
R2 0.153
Adjusted R2 0.151
Residual Std. Error 86.777 (df = 694)
F Statistic 62.755*** (df = 2; 694)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Note that because history has three levels, R
automatically included two indicator (dummy) variables in the model
(historyLow and historyMedium).
Interpretation:
\(\hat{\beta}_0 = 101.700\) is the average amount spent by customers who have a “high” purchase history (where “high” is the omitted category of the variable history).
\(\hat{\beta}_1 = - 84.129\) indicates that customers with a “low” purchase history spent, on average, $84.129 less than customers with a “high” purchase history. The difference is statistically significant at the 1% level.
\(\hat{\beta}_0+\hat{\beta}_1 = 101.700 - 84.129 = 17.571\) is the average amount spent by customers who have a “low” purchase history.
\(\hat{\beta}_2 = - 64.491\) indicates that customers with a “medium” purchase history spent, on average, $64.491 less than customers with a “high” purchase history. The difference is statistically significant at the 1% level.
\(\hat{\beta}_0+\hat{\beta}_2 = 101.700 - 64.491 = 37.209\) is the average amount spent by customers who have a “medium” purchase history.
You can confirm these interpretations by calculating the mean of each group directly from the sample data:
# Mean amount spent by purchase history
tb.ecommerce %>%
group_by(history) %>%
summarise(mean_amountspent = mean(amountspent))
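The role of the omitted category can be sketched on simulated data. With R's default contrasts, the first level of a factor is the omitted (reference) category, and relevel() changes it. The data frame sim and the column history_low below are hypothetical and only serve the illustration.

```r
# Simulated data (hypothetical; not the lab's ecommerce data)
set.seed(2)
sim <- data.frame(
  history = factor(rep(c("High", "Low", "Medium"), each = 40)),
  amountspent = c(rnorm(40, 100, 25), rnorm(40, 20, 25), rnorm(40, 40, 25))
)

# The first factor level ("High" here) is the omitted category
fit_high <- lm(amountspent ~ history, data = sim)
coef(fit_high)

# Each dummy coefficient equals a difference in group means
means <- tapply(sim$amountspent, sim$history, mean)
means["Low"] - means["High"]

# Changing the reference level changes what the coefficients compare against
sim$history_low <- relevel(sim$history, ref = "Low")
fit_low <- lm(amountspent ~ history_low, data = sim)
coef(fit_low)
```

Both models produce the same fitted values; only the baseline against which the differences are expressed changes.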
3.2 Continuous Dependent Variable and Numeric and Categorical (2 categories) Independent Variables
We will now build a multiple regression model with a numeric independent variable and a categorical independent variable.
# Impact of timespent and location on amount spent
lm5 <- lm(amountspent ~ timespent + location, data = tb.ecommerce)
stargazer(lm5, type = "text", no.space = TRUE
, title = "Impact of timespent and Location on Amount Spent")
Impact of timespent and Location on Amount Spent
===============================================
Dependent variable:
---------------------------
amountspent
-----------------------------------------------
timespent 3.378***
(0.198)
locationFar 25.239***
(5.538)
Constant 5.535
(3.773)
-----------------------------------------------
Observations 1,000
R2 0.246
Adjusted R2 0.245
Residual Std. Error 79.222 (df = 997)
F Statistic 163.010*** (df = 2; 997)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
\(\hat{\beta}_0=5.535\) is the intercept of the model. It gives us the expected amount spent for customers who are “close” (omitted category) with \(timespent = 0\). As before, the value is not meaningful. It is not significantly different from 0.
\(\hat{\beta}_1=3.378\) tells us that, on average, holding location fixed, for each minute increase in the customer’s timespent, we can expect an increase of 3.378 dollars in the amount spent by the customer. It is significant at the 1% level, indicated by the three stars next to this coefficient.
\(\hat{\beta}_2=25.239\) tells us that, holding timespent fixed, customers who are “far” spend, on average, $25.239 more than customers who are “close”. It is significant at the 1% level.
The \(R^2\) of the model is 0.246. We can thus say that the variables timespent and location explain about 24.6% of the variation of the variable amountspent around its mean.
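Without an interaction term, model 5 describes two parallel lines, one per location. A minimal sketch using the rounded coefficients from the table above (the helper predict_lm5() and the values of 20 and 40 minutes are illustrative choices made here):

```r
# Coefficients copied (rounded) from the lm5 output above
b0 <- 5.535; b_time <- 3.378; b_far <- 25.239

# Hypothetical helper: far is 1 for "Far" customers and 0 for "Close" customers
predict_lm5 <- function(timespent, far) b0 + b_time * timespent + b_far * far

predict_lm5(timespent = 20, far = 0)  # Close: 5.535 + 3.378 * 20 = 73.095
predict_lm5(timespent = 20, far = 1)  # Far:   73.095 + 25.239 = 98.334

# The Far-Close gap is the same at any value of timespent
predict_lm5(40, far = 1) - predict_lm5(40, far = 0)
```

This constant gap is exactly what “holding timespent fixed” means in the interpretation of the location coefficient.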
3.3 Continuous Dependent Variable and Continuous and Categorical (2 categories) Independent Variables with Interaction
Finally, we will introduce an interaction term between a numeric and a categorical variable.
# Impact of timespent and location on amount spent, with interaction
lm6 <- lm(amountspent ~ timespent * location, data = tb.ecommerce)
stargazer(lm6, type = "text", no.space = TRUE
, title = "Impact of timespent and Location on Amount Spent with Interaction")
Impact of timespent and Location on Amount Spent with Interaction
=================================================
Dependent variable:
---------------------------
amountspent
-------------------------------------------------
timespent 2.487***
(0.242)
locationFar -7.884
(7.625)
timespent:locationFar 2.519***
(0.407)
Constant 15.992***
(4.071)
-------------------------------------------------
Observations 1,000
R2 0.274
Adjusted R2 0.272
Residual Std. Error 77.777 (df = 996)
F Statistic 125.550*** (df = 3; 996)
=================================================
Note: *p<0.1; **p<0.05; ***p<0.01
If the customer is close:
\(\widehat{amountspent} = 15.992 + 2.487 \times timespent\)
If the customer is far:
\(\widehat{amountspent} = 8.108 + 5.006 \times timespent\)
Interpretation:
\(\hat{\beta}_0=15.992\) is the intercept of the model. It gives us the expected amount spent for “close” customers with \(timespent = 0\). The value is not meaningful. The coefficient is significantly different from 0 at the 1% level.
\(\hat{\beta}_1=2.487\) indicates that, on average, for customers located close to a physical store selling similar products, a one-minute increase in the average time spent on the website per visit is associated with a $2.487 increase in the amount spent on the website. The coefficient is significant at the 1% level.
\(\hat{\beta}_2=-7.884\) indicates that, on average, a customer with timespent of 0 who is located far from a physical store is expected to spend $7.884 less than a customer with timespent of 0 located close to a physical store. Since customers who do not visit the website (timespent = 0) cannot make purchases, this coefficient has no meaningful interpretation. The coefficient is not statistically significant.
\(\hat{\beta}_3=2.519\) is the difference in slopes between the two groups: on average, a one-minute increase in the average time spent on the website per visit is associated with an additional $2.519 increase in the amount spent for customers located far from a physical store, compared with customers located close to one. For “far” customers, the total increase per additional minute is therefore $2.487 + $2.519 = $5.006. In other words, the impact of an additional minute spent on the website is significantly greater for customers located far from a physical store than for customers located close to a physical store. The difference is significant at the 1% level.
The \(R^2\) of the model is 0.274. The model explains 27.4% of the variation of the variable amountspent around its mean.
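The two fitted lines implied by the interaction model can be recovered from the four coefficients in the table. The sketch below uses the rounded values; the object names are illustrative choices made here.

```r
# Coefficients copied (rounded) from the lm6 output above
b0 <- 15.992; b_time <- 2.487; b_far <- -7.884; b_inter <- 2.519

# Intercept and slope of the fitted line for each location
close_line <- c(intercept = b0, slope = b_time)
far_line   <- c(intercept = b0 + b_far, slope = b_time + b_inter)
close_line  # 15.992 and 2.487
far_line    # 8.108 and 5.006

# timespent at which the two fitted lines cross (about 3.13 minutes)
-b_far / b_inter
```

Beyond the crossing point, the fitted line for “far” customers lies above the line for “close” customers, and the gap widens with each additional minute.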
4. Predicted Values and Residuals
4.1 Predicted Values
Once we have settled on a model, we can use it to make predictions. Suppose we are keeping model 5 as our final model for predicting the amount spent by customers.
Making predictions for our sample
We can now use our model to predict the amount spent by each customer in our data. For this purpose, we use the function predict().
# Create new column with predicted values
tb.ecommerce <- tb.ecommerce %>% mutate(
preds = predict(lm5))
# See predictions
tb.ecommerce %>%
select(preds) %>%
slice_tail(n=5)
Making predictions for new observations
We can also use our model to predict the amount spent by a new customer:
# Create tibble with new client information
new.client <- tibble( timespent = 53
, location = "Close")
# Predict amount spent for the new client
my.pred <- predict(lm5, newdata = new.client
, level = .95, interval = "confidence")
my.pred
fit lwr upr
1 184.58 167.51 201.64
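The predicted value is $184.58. Because we requested a confidence interval, the bounds describe the average amount spent rather than the spending of a single customer: model 5 tells us that we can be 95% confident that the average amount spent by customers who spend an average of 53 minutes per visit on the website and live “close” to a physical store selling similar products lies between $167.51 and $201.64. An interval for one individual new customer would require interval = "prediction" and would be wider.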
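The difference between the two interval types offered by predict() can be sketched on simulated data. The data frame sim, its coefficients, and the noise level below are hypothetical choices that loosely mimic model 5; they are not the lab's data.

```r
# Simulated data (hypothetical; not the lab's ecommerce data)
set.seed(3)
n <- 200
sim <- data.frame(
  timespent = runif(n, 2, 60),
  location = factor(sample(c("Close", "Far"), n, replace = TRUE))
)
sim$amountspent <- 5 + 3.4 * sim$timespent + 25 * (sim$location == "Far") +
  rnorm(n, sd = 80)

fit <- lm(amountspent ~ timespent + location, data = sim)
new_client <- data.frame(timespent = 53, location = "Close")

# Interval for the mean amount spent by all such customers
conf_int <- predict(fit, newdata = new_client, interval = "confidence", level = 0.95)
# Interval for the amount spent by one new customer
pred_int <- predict(fit, newdata = new_client, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return the same point prediction; the prediction interval is wider because it also accounts for the variation of individual customers around the mean.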
4.2 Regression Residuals
We can use the function residuals() for obtaining the
regression residuals. Residuals will be useful for running regression
diagnostics later on.
# Create new column with regression residuals
tb.ecommerce <- tb.ecommerce %>% mutate(
res = residuals(lm5))
# See residuals
tb.ecommerce %>%
select(res) %>%
slice_tail(n=5)
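A residual is simply the observed value minus the value fitted by the model. A minimal sketch on simulated data (the data frame sim is hypothetical, not the lab's data) confirms that residuals() returns exactly this difference:

```r
# Simulated data (hypothetical; not the lab's ecommerce data)
set.seed(4)
sim <- data.frame(timespent = runif(100, 2, 60))
sim$amountspent <- 12 + 3.4 * sim$timespent + rnorm(100, sd = 30)

fit <- lm(amountspent ~ timespent, data = sim)

# Residual = observed value minus fitted value
manual_res <- sim$amountspent - fitted(fit)
all.equal(unname(manual_res), unname(residuals(fit)))

# With an intercept in the model, the residuals average to (numerically) zero
mean(residuals(fit))
```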
4.3 Residual Plots
Before trusting our regression model, it’s important to check whether the assumptions underlying linear regression are reasonably satisfied. A key tool for this is the residual plot, which helps us assess whether the model is appropriate for the data.
We can use the built-in plot() function on a fitted
model object to quickly generate a set of diagnostic plots.
# Base R diagnostic plots for model 5
plot(lm5)
This command will produce four plots:
Residuals vs Fitted
- Checks for non-linearity or heteroscedasticity (unequal spread of residuals).
- The residuals should be randomly scattered around the horizontal line at 0. A clear curve or fan shape suggests problems with linearity or constant variance.
Normal Q-Q Plot
- Checks whether the residuals are normally distributed, an assumption required for inference (e.g., p-values, confidence intervals).
- If the residuals fall approximately along the reference line, the normality assumption is reasonable.
Scale-Location Plot (also called “Spread-Location”)
- Another check for homoscedasticity (constant variance).
- The red line should be flat, and the points should be spread evenly.
- An increasing or decreasing trend indicates changing variance, again suggesting heteroscedasticity.
Residuals vs Leverage
- Identifies influential observations that may unduly affect the regression results.
- Points with high leverage (far to the right) and large residuals (far from 0) can be problematic.
- Look out for points that fall beyond the dashed Cook’s distance contour lines: these may warrant closer inspection.
⚠️ Reminder: These plots are not pass/fail tests, but visual tools to help us detect potential violations of the model assumptions. If issues are detected, consider transforming variables, adding interaction terms, or exploring alternative models.
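For convenience, the four plots can be displayed on a single page, and Cook's distance can be inspected numerically alongside the Residuals vs Leverage plot. The sketch below runs on simulated data (the data frame sim is hypothetical); with the lab's data, fit would be replaced by lm5.

```r
# Simulated data (hypothetical; not the lab's ecommerce data)
set.seed(5)
sim <- data.frame(timespent = runif(150, 2, 60))
sim$amountspent <- 12 + 3.4 * sim$timespent + rnorm(150, sd = 30)
fit <- lm(amountspent ~ timespent, data = sim)

# Show the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous plotting layout

# Numeric companion to the Residuals vs Leverage plot
cooks <- cooks.distance(fit)
head(sort(cooks, decreasing = TRUE), 3)  # three most influential observations
```

Since the simulated data satisfy the linear model assumptions by construction, these plots give a sense of what “no obvious problem” looks like.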