E-commerce Regression Analysis

1 Consider model lm1 on the relationship between amountspent and timespent.

1.1 Write down the relevant population model.

1.2 Write down the estimated model.

1.3 How would \(\hat{\beta_1}\) change if the variable timespent was measured in hours instead of minutes?

1.4 Estimate the 95% confidence interval for the slope of this model.

1.5 Compute the p-value associated with timespent.

2 Consider model lm5. Improve upon this model by adding whichever additional predictors you deem relevant. Interpret all the estimated coefficients, their statistical significance, and the R-square of the model.

3 Consider model lm6 on the relationship between amountspent, timespent and location with the interaction between timespent and location.

3.1 In one single graph, illustrate the two regression lines generated by this model.

3.2 Use the appropriate function to find the predicted outcome for a customer who is located “far” and spends an average of 15 minutes per visit on the website.

4 Please consider the following:

4.1 Create an indicator variable “d_married” which takes on the value 1 if the individual is married and 0 otherwise.

4.2 Estimate the following multiple regression model and interpret all estimated coefficients and their statistical significance:

\(amountspent = \beta_0 + \beta_1 timespent + \beta_2 d_{married} + \beta_3 timespent*d_{married} + \epsilon_i\)

5 Estimate the following multiple regression model and interpret all estimated coefficients and their statistical significance:

\(amountspent = \beta_0 + \beta_1 d_{married} + \beta_2 location + \beta_3 d_{married}*location + \epsilon_i\)


Get started by loading libraries, reading and preprocessing data.

library(tidyverse)
library(stargazer)
library(broom)

load("data/ecommerce.RData")

tb.ecommerce <- rename_with(tb.ecommerce, tolower)

# Factor variable age
tb.ecommerce <- tb.ecommerce %>% 
  mutate(fact_age  = factor(x = age, 
                            levels = c("Young", "Middle", "Old")))

# Factor variable history
tb.ecommerce <- tb.ecommerce %>%
  mutate(fact_history = factor(x = history,
                               levels = c("Low", "Medium", "High")))

# Factor variable new_hist
tb.ecommerce <- tb.ecommerce %>%
  mutate(new_hist = if_else(is.na(history), "New Customer", as.character(history)))
tb.ecommerce <- tb.ecommerce  %>%
  mutate(fact_new_hist = factor(x = new_hist,
                                levels = c("New Customer", "Low", "Medium", "High")))


1 Consider model lm1 on the relationship between amountspent and timespent.

lm1 <- lm(amountspent ~ timespent, data = tb.ecommerce)


1.1 Write down the relevant population model.

\(amountspent = \beta_0 + \beta_1 timespent + \epsilon_i\)


1.2 Write down the estimated model.

stargazer(lm1, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             amountspent        
## -----------------------------------------------
## timespent                    3.449***          
##                               (0.199)          
##                                                
## Constant                     11.980***         
##                               (3.533)          
##                                                
## -----------------------------------------------
## Observations                   1,000           
## R2                             0.231           
## Adjusted R2                    0.230           
## Residual Std. Error      80.003 (df = 998)     
## F Statistic          299.331*** (df = 1; 998)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

\(\hat{amountspent} = 11.980 + 3.449 timespent\)


1.3 How would \(\hat{\beta_1}\) change if the variable timespent was measured in hours instead of minutes?

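Since 1 hour = 60 minutes, a one-unit increase in timespent would then represent 60 times as much time, so the estimated slope would be multiplied by 60: 3.449 × 60 = 206.94 per hour. Its standard error would also be multiplied by 60, leaving the t-statistic, the p-value, the intercept and the R2 unchanged.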
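
The rescaling can be checked numerically. The sketch below does not use tb.ecommerce: it simulates a small data set (the names `minutes`, `hours`, `spent` and all parameter values are made up for illustration) and fits the same regression with time measured in minutes and in hours.

```r
# Simulated data for illustration only (not the ecommerce data set)
set.seed(1)
minutes <- runif(200, min = 1, max = 40)
spent   <- 12 + 3.4 * minutes + rnorm(200, sd = 20)

# Same predictor, expressed in hours
hours <- minutes / 60

fit_min  <- lm(spent ~ minutes)
fit_hour <- lm(spent ~ hours)

slope_min  <- unname(coef(fit_min)["minutes"])
slope_hour <- unname(coef(fit_hour)["hours"])

# Ratio of the two slopes; the intercepts of the two fits coincide
ratio <- slope_hour / slope_min
ratio
```

The ratio equals 60 up to floating-point error, whatever the simulated values are, because the fitted line itself does not change, only the unit on the x-axis.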


1.4 Estimate the 95% confidence interval for the slope of this model.

tidy(lm1, conf.int = TRUE)  %>% 
  filter(term == "timespent") %>%
  select(conf.low, conf.high)
## # A tibble: 1 × 2
##   conf.low conf.high
##      <dbl>     <dbl>
## 1     3.06      3.84
# Alternatively, approximate the interval by hand from the rounded estimate and standard error, using the normal critical value 1.96:
3.449-1.96*0.199
## [1] 3.05896
3.449+1.96*0.199
## [1] 3.83904
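
The manual calculation uses the normal critical value 1.96. As a minimal sketch, the same interval can be built with the exact t critical value for the 998 residual degrees of freedom reported by stargazer. The inputs are the rounded estimate and standard error from the table, so the result is itself only approximate; the variable names are illustrative.

```r
# Rounded estimate and standard error as reported by stargazer for lm1
estimate  <- 3.449
std_error <- 0.199
df_resid  <- 998   # residual degrees of freedom from the regression output

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
round(ci, 2)
```

With this many observations the t critical value is only slightly larger than 1.96, which is why both approaches agree with the tidy() interval to two decimals.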


1.5 Compute the p-value associated with timespent.

tidy(lm1, conf.int = TRUE)  %>% 
  filter(term == "timespent") %>%
  select(p.value)
## # A tibble: 1 × 1
##    p.value
##      <dbl>
## 1 7.47e-59
# Alternatively, manual calculation of test statistic:
3.449/0.199
## [1] 17.33166
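# Manual p-value calculation (differs slightly from tidy() because the estimate and standard error used here are rounded to three decimals):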
2 * pt(-abs(3.449/0.199), df = nrow(tb.ecommerce) - 2)
## [1] 4.968958e-59
# or:
2 * pt(abs(3.449/0.199), df = nrow(tb.ecommerce) - 2, lower.tail = FALSE)
## [1] 4.968958e-59
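
The manual p-value (4.97e-59) and the one from tidy() (7.47e-59) differ even though the formula is the same. A likely explanation is rounding: this far out in the tail, the p-value is very sensitive to the third decimal of the estimate and of the standard error. The sketch below, with a helper `p_two_sided` that is defined here purely for illustration, evaluates the p-value at the extremes that are consistent with the rounded values.

```r
df_resid <- 998

# Two-sided p-value for a t statistic built from an estimate and its standard error
p_two_sided <- function(estimate, std_error, df) {
  2 * pt(abs(estimate / std_error), df = df, lower.tail = FALSE)
}

# Value obtained from the rounded table entries
p_reported <- p_two_sided(3.449, 0.199, df_resid)

# Extremes consistent with rounding both inputs to three decimals
p_smallest <- p_two_sided(3.4495, 0.1985, df_resid)  # largest possible t
p_largest  <- p_two_sided(3.4485, 0.1995, df_resid)  # smallest possible t

c(smallest = p_smallest, reported = p_reported, largest = p_largest)
```

Either way the conclusion is unchanged: the p-value is far below any conventional significance level.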


2 Consider model lm5. Improve upon this model by adding whichever additional predictors you deem relevant. Interpret all the estimated coefficients, their statistical significance, and the R-square of the model.

lm7 <- lm(amountspent ~ age + gender + ownhome + married + location +
            children + visits + timespent + new_hist,
          data = tb.ecommerce)

stargazer(lm7, type = "text", title = "Extended Model with Additional Predictors")
## 
## Extended Model with Additional Predictors
## ================================================
##                          Dependent variable:    
##                      ---------------------------
##                              amountspent        
## ------------------------------------------------
## ageOld                         -2.684           
##                                (6.682)          
##                                                 
## ageYoung                     -17.072***         
##                                (6.298)          
##                                                 
## genderMale                      0.418           
##                                (4.729)          
##                                                 
## ownhomeRent                    -5.433           
##                                (5.242)          
##                                                 
## marriedSingle                -15.006***         
##                                (5.121)          
##                                                 
## locationFar                   17.101***         
##                                (5.176)          
##                                                 
## children                       -1.537           
##                                (2.580)          
##                                                 
## visits                        15.292***         
##                                (1.340)          
##                                                 
## timespent                     2.652***          
##                                (0.184)          
##                                                 
## new_histLow                  -35.257***         
##                                (8.624)          
##                                                 
## new_histMedium               -27.908***         
##                                (7.183)          
##                                                 
## new_histNew Customer           -6.729           
##                                (7.017)          
##                                                 
## Constant                        9.055           
##                                (8.359)          
##                                                 
## ------------------------------------------------
## Observations                    1,000           
## R2                              0.403           
## Adjusted R2                     0.395           
## Residual Std. Error       70.890 (df = 987)     
## F Statistic           55.443*** (df = 12; 987)  
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

Interpretation (each coefficient holds the other predictors fixed; for each categorical predictor the category not shown in the table is the baseline):

R2 = 0.403: the model explains about 40% of the variation in amountspent, compared with 23% for lm1. The adjusted R2 is 0.395.

Significant at the 1% level: ageYoung (-17.072), marriedSingle (-15.006), locationFar (17.101), visits (15.292), timespent (2.652), new_histLow (-35.257) and new_histMedium (-27.908). For example, each additional minute of timespent is associated with about $2.65 more spent, each additional visit with about $15.29 more, and customers located far spend about $17.10 more than customers located close.

Not significant at the 10% level: ageOld, genderMale, ownhomeRent, children, new_histNew Customer and the constant.


3 Consider model lm6 on the relationship between amountspent, timespent and location with the interaction between timespent and location.

lm6 <- lm(amountspent ~ timespent * location, data = tb.ecommerce)


3.1 In one single graph, illustrate the two regression lines generated by this model.

ggplot(tb.ecommerce, aes(x = timespent, y = amountspent, colour = location)) +
  geom_point(alpha = 0.6) +
  stat_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Interaction of Time Spent and Location",
    x = "Time Spent (minutes)",
    y = "Amount Spent ($)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
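
stat_smooth(method = "lm") fits a separate least-squares line within each colour group. For a model with a full interaction between a numeric predictor and a two-level factor, these per-group lines coincide with the two lines implied by the model's coefficients. The sketch below illustrates this on simulated data (all names and parameter values are assumptions made for the example, not taken from tb.ecommerce).

```r
# Simulated data for illustration only
set.seed(7)
n <- 200
sim <- data.frame(
  timespent = runif(n, min = 1, max = 40),
  location  = factor(rep(c("Close", "Far"), each = n / 2))
)
sim$amountspent <- 15 + 2 * sim$timespent +
  (sim$location == "Far") * (5 + 1.5 * sim$timespent) +
  rnorm(n, sd = 25)

# Interaction model, analogous in form to lm6
fit_int <- lm(amountspent ~ timespent * location, data = sim)
b <- coef(fit_int)

# Intercept and slope of each line implied by the interaction model
line_close <- c(intercept = b[["(Intercept)"]],
                slope     = b[["timespent"]])
line_far   <- c(intercept = b[["(Intercept)"]] + b[["locationFar"]],
                slope     = b[["timespent"]] + b[["timespent:locationFar"]])

# Separate simple regressions within each group
fit_close <- lm(amountspent ~ timespent, data = subset(sim, location == "Close"))
fit_far   <- lm(amountspent ~ timespent, data = subset(sim, location == "Far"))

rbind(line_close, coef(fit_close), line_far, coef(fit_far))
```

The baseline line uses the intercept and the timespent slope directly; the line for the other group adds the factor coefficient to the intercept and the interaction coefficient to the slope.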


3.2 Use the appropriate function to find the predicted outcome for a customer who is located “far” and spends an average of 15 minutes per visit on the website.

new.client <- tibble(timespent = 15, location = "Far")
predict(lm6, newdata = new.client, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 83.19726 74.20824 92.18628
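
The interval above is a confidence interval for the mean amount spent by customers with these characteristics. If the goal is a range for a single customer, predict() also accepts interval = "prediction", which additionally accounts for the residual variation around the regression line. The sketch below contrasts the two on simulated data (names and parameter values are illustrative assumptions).

```r
# Simulated data for illustration only
set.seed(42)
n <- 300
sim <- data.frame(
  timespent = runif(n, min = 1, max = 40),
  location  = factor(sample(c("Close", "Far"), n, replace = TRUE),
                     levels = c("Close", "Far"))
)
sim$amountspent <- 10 + 2 * sim$timespent +
  (sim$location == "Far") * (5 + 1.5 * sim$timespent) +
  rnorm(n, sd = 30)

fit <- lm(amountspent ~ timespent * location, data = sim)

new_client <- data.frame(
  timespent = 15,
  location  = factor("Far", levels = levels(sim$location))
)

# Interval for the mean outcome at these predictor values
conf_int <- predict(fit, newdata = new_client, interval = "confidence", level = 0.95)
# Interval for one new observation at these predictor values
pred_int <- predict(fit, newdata = new_client, interval = "prediction", level = 0.95)

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

Both calls return the same fitted value; only the width of the interval differs, with the prediction interval always the wider of the two.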


4 Please consider the following:

4.1 Create an indicator variable “d_married” which takes on the value 1 if the individual is married and 0 otherwise.

tb.ecommerce <- tb.ecommerce %>%
  mutate(d_married = as.double(married == "Married"))


4.2 Estimate the following multiple regression model and interpret all estimated coefficients and their statistical significance:

\(amountspent = \beta_0 + \beta_1 timespent + \beta_2 d_{married} + \beta_3 timespent*d_{married} + \epsilon_i\)

lm8 <- lm(amountspent ~ timespent * d_married, data = tb.ecommerce)
stargazer(lm8, type = "text", title = "Interaction: Time Spent × Married")
## 
## Interaction: Time Spent × Married
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             amountspent        
## -----------------------------------------------
## timespent                    1.529***          
##                               (0.308)          
##                                                
## d_married                     -2.836           
##                               (6.791)          
##                                                
## timespent:d_married          2.893***          
##                               (0.394)          
##                                                
## Constant                     17.025***         
##                               (4.789)          
##                                                
## -----------------------------------------------
## Observations                   1,000           
## R2                             0.299           
## Adjusted R2                    0.297           
## Residual Std. Error      76.445 (df = 996)     
## F Statistic          141.633*** (df = 3; 996)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01


Interpretation of coefficients:

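\(\hat{\beta}_0\) = 17.025: predicted amount spent by a single customer with timespent = 0. This is an extrapolation with no economic meaning.

\(\hat{\beta}_1\) = 1.529: for single customers, each additional minute of timespent is associated with about $1.53 more spent.

\(\hat{\beta}_2\) = -2.836: difference in predicted amount spent between married and single customers at timespent = 0.

\(\hat{\beta}_3\) = 2.893: interaction effect. Each additional minute is associated with about $2.89 more spent for married customers than for single customers, so the slope for married customers is 1.529 + 2.893 = 4.422.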
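
To make the interaction concrete, the two regression lines implied by the estimates in the table can be written out by substituting d_married = 0 and d_married = 1 into the estimated model. This is a worked restatement of the reported coefficients, not an additional result.

```latex
\[
\begin{aligned}
\text{Single } (d_{married} = 0):\quad
  & \widehat{amountspent} = 17.025 + 1.529 \, timespent \\
\text{Married } (d_{married} = 1):\quad
  & \widehat{amountspent} = (17.025 - 2.836) + (1.529 + 2.893) \, timespent \\
  & \phantom{\widehat{amountspent}} = 14.189 + 4.422 \, timespent
\end{aligned}
\]
```

The two lines cross at about 2.836 / 2.893 ≈ 0.98 minutes, so married customers are predicted to spend more for essentially any visit longer than one minute. At timespent = 15, for example, the predictions are 39.96 for single and 80.52 for married customers.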


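Significance:

timespent, timespent:d_married and the constant are significant at the 1% level (*** in the table). d_married is not significant even at the 10% level (-2.836 with a standard error of 6.791, so |t| ≈ 0.42): there is no evidence that married and single customers differ in spending at timespent = 0. The difference between the two groups shows up in the slope, not in the intercept.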


5 Estimate the following multiple regression model and interpret all estimated coefficients and their statistical significance:

\(amountspent = \beta_0 + \beta_1 d_{married} + \beta_2 location + \beta_3 d_{married}*location + \epsilon_i\)

lm9 <- lm(amountspent ~ d_married * location, data = tb.ecommerce)
stargazer(lm9, type = "text", title = "Interaction: Married × Location")
## 
## Interaction: Married × Location
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                               amountspent        
## -------------------------------------------------
## d_married                      31.085***         
##                                 (6.543)          
##                                                  
## locationFar                     14.095           
##                                 (8.581)          
##                                                  
## d_married:locationFar          37.827***         
##                                (12.150)          
##                                                  
## Constant                       29.497***         
##                                 (4.646)          
##                                                  
## -------------------------------------------------
## Observations                     1,000           
## R2                               0.089           
## Adjusted R2                      0.086           
## Residual Std. Error        87.171 (df = 996)     
## F Statistic             32.248*** (df = 3; 996)  
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
# Summarize group means to support interpretation
tb.ecommerce %>%
  group_by(married, location) %>%
  summarise(mean_amountspent = mean(amountspent), .groups = "drop")
## # A tibble: 4 × 3
##   married location mean_amountspent
##   <fct>   <fct>               <dbl>
## 1 Married Close                60.6
## 2 Married Far                 113. 
## 3 Single  Close                29.5
## 4 Single  Far                  43.6


Interpretation of coefficients:

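\(\hat{\beta}_0\) = 29.497: mean amount spent by the baseline group (Single & Close).

\(\hat{\beta}_1\) = 31.085: difference between Married and Single customers when location is Close.

\(\hat{\beta}_2\) = 14.095: difference between Far and Close customers when Single.

\(\hat{\beta}_3\) = 37.827: interaction effect. The Far vs Close difference is $37.83 larger for Married than for Single customers (equivalently, the Married vs Single difference is $37.83 larger when Far than when Close).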


Group interpretations:

  1. Single & Close (baseline group): Mean amount spent = 29.497 → single customers close to a store spend about $29 on average.

  2. Married & Close: Mean amount spent = 29.497 + 31.085 = 60.582 → married customers close to a store spend on average $31 more than single customers close to a store.

  3. Single & Far: Mean amount spent = 29.497 + 14.095 = 43.592 → single customers far from a store spend on average $14 more than single customers close to one.

  4. Married & Far: Mean amount spent = 29.497 + 31.085 + 14.095 + 37.827 = 112.504 → married customers far from a store spend on average $52 more than married customers close to one, and $69 more than single customers far from one.
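
Because the model contains only two indicator variables and their interaction, the interaction coefficient can be recovered from the four group means as a difference in differences. The arithmetic below uses only the group means implied by the coefficients reported in the table.

```latex
\[
\begin{aligned}
\hat{\beta}_3
  &= (\bar{y}_{Married,Far} - \bar{y}_{Married,Close})
   - (\bar{y}_{Single,Far} - \bar{y}_{Single,Close}) \\
  &= (112.504 - 60.582) - (43.592 - 29.497) \\
  &= 51.922 - 14.095 = 37.827
\end{aligned}
\]
```

The same four means appear, rounded, in the group_by() summary, which is a quick way to check the regression output.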


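Significance:

d_married, d_married:locationFar and the constant are significant at the 1% level. locationFar is not significant at the 10% level (14.095 with a standard error of 8.581, t ≈ 1.64), so for single customers the data do not show a clear difference between Far and Close. The significant interaction indicates that the Far vs Close gap is larger for married customers than for single customers.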