E-commerce Regression Analysis
Start by loading the libraries, then read and preprocess the data.
library(tidyverse)
library(stargazer)
library(broom)
load("data/ecommerce.RData")
tb.ecommerce <- rename_with(tb.ecommerce, tolower)
# Factor variable age
tb.ecommerce <- tb.ecommerce %>%
  mutate(fact_age = factor(x = age,
                           levels = c("Young", "Middle", "Old")))
# Factor variable history
tb.ecommerce <- tb.ecommerce %>%
  mutate(fact_history = factor(x = history,
                               levels = c("Low", "Medium", "High")))
# Factor variable new_hist (missing history is recoded as "New Customer")
tb.ecommerce <- tb.ecommerce %>%
  mutate(new_hist = if_else(is.na(history), "New Customer", as.character(history)))
tb.ecommerce <- tb.ecommerce %>%
  mutate(fact_new_hist = factor(x = new_hist,
                                levels = c("New Customer", "Low", "Medium", "High")))
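The order given in `levels` matters because, under R's default treatment contrasts, `lm()` uses the first level of a factor as the reference category. A minimal sketch on a made-up vector (the values are illustrative and not taken from the data set):

```r
# Hypothetical toy vector, not the course data
age <- c("Young", "Old", "Middle", "Young", "Middle", "Old")

# Default: levels are sorted alphabetically, so "Middle" comes first
default_levels <- levels(factor(age))

# Explicit ordering, as in the preprocessing above
fact_age <- factor(age, levels = c("Young", "Middle", "Old"))
explicit_levels <- levels(fact_age)

default_levels[1]   # "Middle": reference category under the default ordering
explicit_levels[1]  # "Young": reference category under the explicit ordering
```

A regression on the raw `age` variable would therefore omit Middle, while one on `fact_age` would omit Young.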
1 Consider model lm1 on the relationship between amountspent and timespent.
lm1 <- lm(amountspent ~ timespent, data = tb.ecommerce)
1.1 Write down the relevant population model.
\(amountspent_i = \beta_0 + \beta_1 \, timespent_i + \epsilon_i\)
1.2 Write down the estimated model.
stargazer(lm1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## amountspent
## -----------------------------------------------
## timespent 3.449***
## (0.199)
##
## Constant 11.980***
## (3.533)
##
## -----------------------------------------------
## Observations 1,000
## R2 0.231
## Adjusted R2 0.230
## Residual Std. Error 80.003 (df = 998)
## F Statistic 299.331*** (df = 1; 998)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
\(\widehat{amountspent} = 11.980 + 3.449 \, timespent\)
1.3 How would \(\hat{\beta_1}\) change if the variable timespent was measured in hours instead of minutes?
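Because 1 hour corresponds to 60 minutes, a one-unit increase in timespent would then represent 60 times as much time, so \(\hat{\beta_1}\) and its standard error would both be multiplied by 60: 3.449 × 60 = 206.94 per hour. The intercept, the t-statistic, the p-value and the R-square would not change.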
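The rescaling argument can be checked numerically. The sketch below uses simulated data (the variable names and the data-generating values are assumptions for illustration, not the e-commerce data set) and fits the same regression with time measured in minutes and in hours:

```r
# Simulated data for illustration only (not the e-commerce data set)
set.seed(1)
minutes <- runif(200, min = 1, max = 40)
spent   <- 12 + 3.4 * minutes + rnorm(200, sd = 20)

fit_minutes <- lm(spent ~ minutes)   # time measured in minutes
hours       <- minutes / 60
fit_hours   <- lm(spent ~ hours)     # same data, time measured in hours

slope_ratio    <- unname(coef(fit_hours)["hours"] / coef(fit_minutes)["minutes"])
intercept_diff <- unname(coef(fit_hours)[1] - coef(fit_minutes)[1])

slope_ratio      # 60, up to floating-point error
intercept_diff   # 0, up to floating-point error

# Applied to the reported slope of lm1:
slope_per_hour <- 3.449 * 60   # 206.94
```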
1.4 Estimate the 95% confidence interval for the slope of this model.
tidy(lm1, conf.int = TRUE) %>%
  filter(term == "timespent") %>%
  select(conf.low, conf.high)
## # A tibble: 1 × 2
## conf.low conf.high
## <dbl> <dbl>
## 1 3.06 3.84
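# Alternatively, manual calculation of the confidence interval
# (normal approximation with critical value 1.96, using the rounded estimate and standard error):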
3.449-1.96*0.199
## [1] 3.05896
3.449+1.96*0.199
## [1] 3.83904
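The manual calculation above relies on the normal critical value 1.96. A slightly more exact sketch uses the t distribution with the residual degrees of freedom reported for lm1 (998); the inputs are the rounded values from the stargazer table, so the result is only as precise as those:

```r
# Rounded values read from the stargazer table for lm1
b1 <- 3.449   # estimated slope
se <- 0.199   # standard error of the slope
df <- 998     # residual degrees of freedom (1,000 observations - 2 parameters)

t_crit <- qt(0.975, df)   # two-sided 95% critical value, slightly above 1.96
ci <- c(lower = b1 - t_crit * se, upper = b1 + t_crit * se)
round(ci, 2)              # rounds to 3.06 and 3.84, in line with tidy()
```

With 998 degrees of freedom the t and normal critical values are nearly identical, which is why both approaches agree to two decimals.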
1.5 Compute the p-value associated with timespent.
tidy(lm1) %>%
  filter(term == "timespent") %>%
  select(p.value)
## # A tibble: 1 × 1
## p.value
## <dbl>
## 1 7.47e-59
# Alternatively, manual calculation of test statistic:
3.449/0.199
## [1] 17.33166
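# Manual two-sided p-value calculation (uses the rounded estimate and standard error,
# so it differs slightly from the tidy() result above):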
2 * pt(-abs(3.449/0.199), df = nrow(tb.ecommerce) - 2)
## [1] 4.968958e-59
# Equivalently, using the upper tail:
2 * pt(abs(3.449/0.199), df = nrow(tb.ecommerce) - 2, lower.tail = FALSE)
## [1] 4.968958e-59
2 Consider model lm5. Improve upon this model by adding whichever additional predictors you deem relevant. Interpret all the estimated coefficients, their statistical significance, and the R-square of the model.
lm7 <- lm(amountspent ~ age + gender + ownhome + married + location +
            children + visits + timespent + new_hist,
          data = tb.ecommerce)
stargazer(lm7, type = "text", title = "Extended Model with Additional Predictors")
##
## Extended Model with Additional Predictors
## ================================================
## Dependent variable:
## ---------------------------
## amountspent
## ------------------------------------------------
## ageOld -2.684
## (6.682)
##
## ageYoung -17.072***
## (6.298)
##
## genderMale 0.418
## (4.729)
##
## ownhomeRent -5.433
## (5.242)
##
## marriedSingle -15.006***
## (5.121)
##
## locationFar 17.101***
## (5.176)
##
## children -1.537
## (2.580)
##
## visits 15.292***
## (1.340)
##
## timespent 2.652***
## (0.184)
##
## new_histLow -35.257***
## (8.624)
##
## new_histMedium -27.908***
## (7.183)
##
## new_histNew Customer -6.729
## (7.017)
##
## Constant 9.055
## (8.359)
##
## ------------------------------------------------
## Observations 1,000
## R2 0.403
## Adjusted R2 0.395
## Residual Std. Error 70.890 (df = 987)
## F Statistic 55.443*** (df = 12; 987)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation of coefficients (each holds the other predictors constant; the omitted reference categories are Middle for age, Married for married, Close for location and High for new_hist):
timespent = 2.652 (each additional minute on the website is associated with about $2.65 more spent)
visits = 15.292 (each additional visit is associated with about $15.29 more spent)
ageYoung = -17.072 (young customers spend about $17 less than middle-aged customers)
marriedSingle = -15.006 (single customers spend about $15 less than married customers)
locationFar = 17.101 (customers located far spend about $17 more than customers located close)
new_histLow = -35.257 and new_histMedium = -27.908 (customers with a low or medium purchase history spend about $35 and $28 less than customers with a high history)
Significance:
The coefficients listed above are statistically significant at the 1% level. ageOld, genderMale, ownhomeRent, children, new_histNew Customer and the constant are not significant even at the 10% level.
R-square:
The model explains about 40% of the variation in amountspent (R2 = 0.403, adjusted R2 = 0.395), compared with about 23% for lm1.
3 Consider model lm6 on the relationship between amountspent, timespent and location with the interaction between timespent and location.
lm6 <- lm(amountspent ~ timespent * location, data = tb.ecommerce)
3.1 In one single graph, illustrate the two regression lines generated by this model.
ggplot(tb.ecommerce, aes(x = timespent, y = amountspent, colour = location)) +
  geom_point(alpha = 0.6) +
  stat_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Interaction of Time Spent and Location",
    x = "Time Spent (minutes)",
    y = "Amount Spent ($)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
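The plot draws one least-squares line per colour group rather than using lm6 directly. This gives the same two lines, because a model with a full interaction between a numeric predictor and a two-level factor is equivalent to fitting a separate simple regression in each group. A sketch on simulated data (all names and values below are assumptions for illustration, not the e-commerce data):

```r
# Simulated data for illustration only
set.seed(7)
n <- 400
sim <- data.frame(
  timespent = runif(n, 1, 40),
  location  = factor(sample(c("Close", "Far"), n, replace = TRUE))
)
sim$amountspent <- 15 + 2 * sim$timespent +
  (sim$location == "Far") * (5 + 1.5 * sim$timespent) +
  rnorm(n, sd = 25)

fit_int <- lm(amountspent ~ timespent * location, data = sim)
b <- coef(fit_int)

# Lines implied by the interaction model, stored as c(intercept, slope)
line_close <- unname(c(b["(Intercept)"], b["timespent"]))
line_far   <- unname(c(b["(Intercept)"] + b["locationFar"],
                       b["timespent"] + b["timespent:locationFar"]))

# Separate simple regressions, one per location group
sep_close <- unname(coef(lm(amountspent ~ timespent,
                            data = subset(sim, location == "Close"))))
sep_far   <- unname(coef(lm(amountspent ~ timespent,
                            data = subset(sim, location == "Far"))))

all.equal(line_close, sep_close)  # TRUE
all.equal(line_far, sep_far)      # TRUE
```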
3.2 Use the appropriate function to find the predicted outcome for a customer who is located “far” and spends an average of 15 minutes per visit on the website.
new.client <- tibble(timespent = 15, location = "Far")
predict(lm6, newdata = new.client, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 83.19726 74.20824 92.18628
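The call above uses `interval = "confidence"`, which describes the uncertainty about the average amount spent by all customers with these characteristics. If the range for one individual customer is of interest, `interval = "prediction"` may be the more relevant choice; it has the same point prediction but a wider interval. A sketch on simulated data (names and values are assumptions for illustration only):

```r
# Simulated data for illustration only (not the e-commerce data set)
set.seed(42)
n <- 300
sim <- data.frame(
  timespent = runif(n, 1, 40),
  location  = factor(sample(c("Close", "Far"), n, replace = TRUE))
)
sim$amountspent <- 10 + 2 * sim$timespent +
  (sim$location == "Far") * (5 + 1.5 * sim$timespent) +
  rnorm(n, sd = 30)

fit <- lm(amountspent ~ timespent * location, data = sim)
new_client <- data.frame(timespent = 15, location = "Far")

conf_int <- predict(fit, newdata = new_client, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_client, interval = "prediction", level = 0.95)

# Same fitted value, different interval widths
width_conf <- conf_int[, "upr"] - conf_int[, "lwr"]
width_pred <- pred_int[, "upr"] - pred_int[, "lwr"]
width_pred > width_conf   # TRUE: the prediction interval also reflects residual noise
```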
4 Please consider the following:
4.1 Create an indicator variable “d_married” which takes on the value 1 if the individual is married and 0 otherwise.
tb.ecommerce <- tb.ecommerce %>%
  mutate(d_married = as.double(married == "Married"))
4.2 Estimate the following multiple regression model and interpret all estimated coefficients and their statistical significance:
\(amountspent_i = \beta_0 + \beta_1 \, timespent_i + \beta_2 \, d_{married,i} + \beta_3 \, (timespent_i \times d_{married,i}) + \epsilon_i\)
lm8 <- lm(amountspent ~ timespent * d_married, data = tb.ecommerce)
stargazer(lm8, type = "text", title = "Interaction: Time Spent × Married")
##
## Interaction: Time Spent × Married
## ===============================================
## Dependent variable:
## ---------------------------
## amountspent
## -----------------------------------------------
## timespent 1.529***
## (0.308)
##
## d_married -2.836
## (6.791)
##
## timespent:d_married 2.893***
## (0.394)
##
## Constant 17.025***
## (4.789)
##
## -----------------------------------------------
## Observations 1,000
## R2 0.299
## Adjusted R2 0.297
## Residual Std. Error 76.445 (df = 996)
## F Statistic 141.633*** (df = 3; 996)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation of coefficients:
β0 = 17.025 (predicted amount spent by a single customer with zero minutes on the website; an extrapolation with no economic meaning)
β1 = 1.529 (for single customers, each additional minute on the website is associated with about $1.53 more spent)
β2 = -2.836 (difference in predicted spending between married and single customers at zero minutes on the website)
β3 = 2.893 (interaction effect: additional change per minute for married customers, so their slope is 1.529 + 2.893 = 4.422)
Significance:
All coefficients are statistically significant at the 1% level (p < 0.01) except d_married, which is not significant even at the 10% level.
The significant interaction term indicates that time on the website affects spending differently for married and single customers.
Business insight: more time on the website is associated with higher spending, especially for married customers.
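The two regression lines implied by lm8 can be written out from the reported coefficients. The sketch below uses the rounded values from the stargazer table; the 15-minute visit is a hypothetical input chosen only to compare the two groups:

```r
# Rounded coefficients read from the stargazer table for lm8
b0 <- 17.025   # Constant
b1 <- 1.529    # timespent
b2 <- -2.836   # d_married
b3 <- 2.893    # timespent:d_married

# Implied regression line for each group
single  <- c(intercept = b0,      slope = b1)        # 17.025 and 1.529
married <- c(intercept = b0 + b2, slope = b1 + b3)   # 14.189 and 4.422

# Predicted amount spent at a hypothetical 15 minutes on the website
minutes <- 15
pred_single  <- unname(single["intercept"]  + single["slope"]  * minutes)   # 39.96
pred_married <- unname(married["intercept"] + married["slope"] * minutes)   # 80.519
```

The small, insignificant difference in intercepts is quickly outweighed by the steeper slope for married customers.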
5 Estimate the following multiple regression model and interpret all estimated coefficients and their statistical significance:
\(amountspent_i = \beta_0 + \beta_1 \, d_{married,i} + \beta_2 \, location_i + \beta_3 \, (d_{married,i} \times location_i) + \epsilon_i\)
lm9 <- lm(amountspent ~ d_married * location, data = tb.ecommerce)
stargazer(lm9, type = "text", title = "Interaction: Married × Location")
##
## Interaction: Married × Location
## =================================================
## Dependent variable:
## ---------------------------
## amountspent
## -------------------------------------------------
## d_married 31.085***
## (6.543)
##
## locationFar 14.095
## (8.581)
##
## d_married:locationFar 37.827***
## (12.150)
##
## Constant 29.497***
## (4.646)
##
## -------------------------------------------------
## Observations 1,000
## R2 0.089
## Adjusted R2 0.086
## Residual Std. Error 87.171 (df = 996)
## F Statistic 32.248*** (df = 3; 996)
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# Summarize group means to support interpretation
tb.ecommerce %>%
  group_by(married, location) %>%
  summarise(mean_amountspent = mean(amountspent), .groups = "drop")
## # A tibble: 4 × 3
## married location mean_amountspent
## <fct> <fct> <dbl>
## 1 Married Close 60.6
## 2 Married Far 113.
## 3 Single Close 29.5
## 4 Single Far 43.6
Interpretation of coefficients:
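β0 = 29.497 (mean amount spent by single customers located close)
β1 = 31.085 (married vs single, among customers located close)
β2 = 14.095 (far vs close, among single customers)
β3 = 37.827 (interaction effect: how much larger the far vs close difference is for married customers than for single customers)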
Group interpretations:
Single & Close (baseline group): Mean amount spent = 29.497 → single customers close to a store spend on average about $29.50.
Married & Close: Mean amount spent = 29.497 + 31.085 = 60.582 → married customers close to a store spend on average about $31 more than single customers close to a store.
Single & Far: Mean amount spent = 29.497 + 14.095 = 43.592 → single customers far from a store spend on average about $14 more than single customers close to a store.
Married & Far: Mean amount spent = 29.497 + 31.085 + 14.095 + 37.827 = 112.504 → married customers far from a store spend on average about $52 more than married customers close to a store, and about $69 more than single customers far from a store.
Significance:
All coefficients are statistically significant at the 1% level (p < 0.01) except locationFar, which is not significant even at the 10% level.
The significant interaction term indicates that location affects spending differently for married and single customers.
Business insight: married customers far from physical stores may rely more on e-commerce and spend substantially more.
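Because lm9 contains two binary predictors and their interaction, it has one parameter per group, so its fitted values should coincide with the four group means. The sketch below checks this using only the rounded numbers printed above (the stargazer coefficients and the `summarise()` means); the group labels are introduced here for readability:

```r
# Rounded coefficients read from the stargazer table for lm9
b0 <- 29.497   # Constant
b1 <- 31.085   # d_married
b2 <- 14.095   # locationFar
b3 <- 37.827   # d_married:locationFar

fitted_means <- c(
  "Single & Close"  = b0,
  "Married & Close" = b0 + b1,
  "Single & Far"    = b0 + b2,
  "Married & Far"   = b0 + b1 + b2 + b3
)

# Group means as printed by summarise() above (three significant digits)
reported_means <- c(
  "Single & Close"  = 29.5,
  "Married & Close" = 60.6,
  "Single & Far"    = 43.6,
  "Married & Far"   = 113
)

max_gap <- max(abs(fitted_means - reported_means))
max_gap   # below 0.5, consistent with rounding in the printed means

# The interaction coefficient as a difference in differences
did <- unname((fitted_means["Married & Far"] - fitted_means["Married & Close"]) -
              (fitted_means["Single & Far"]  - fitted_means["Single & Close"]))
did       # 37.827
```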