msin0041 coursework 2

Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

^Q1:

Marginal costs (mc) = 0, p = price, Demand function D(p)

Log-linear demand model

\(log_q=\alpha+\beta log_p+\epsilon\), which implied elasticity \(\eta=\beta\).

Profit function: \(\pi(p)=D(p)(p-c) = D(p)p\)

Price elasticity of demand \(\eta\):

\[\eta = \frac{\partial D(p)/D(p)}{\partial p/p} = \frac{\partial \log D(p)}{\partial \log p} =D'(p) (\frac{p}{D(p)}) (1)\]

First order condition of profit function:

\(\frac{d(\pi)}{d(p)}=D(p)+pD'(p)=0\)

which means: \(p^*=-\frac{D(p)}{D'(p)}\) and \(D(p) = -p D'(p)\) (2)

Dividing (2) by \(D(p)\): \(1= -p\frac{D'(p)}{D(p)}\). From (1) and (2) \(1=-\eta\)

Therefore at \(p^*\), \(\eta\) is -1, which means when price increases by 1%, demand decreases by 1%.

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

q1data<-read.csv("store_weekly.csv")
head(q1data)

  week month price quantity
1    1     1 51.49      925
2    2     1 49.59     1009
3    3     1 51.94     1128
4    4     1 54.57      914
5    5     1 49.30      999
6    6     2 49.30      967

colnames(q1data)

[1] "week"     "month"    "price"    "quantity"

model_1b <- lm(log(quantity)~log(price), data = q1data)
summary(model_1b)


Call:
lm(formula = log(quantity) ~ log(price), data = q1data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.34626 -0.09236  0.00555  0.09578  0.34484 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.9148     0.3976   42.54   <2e-16 ***
log(price)   -2.5642     0.1033  -24.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1578 on 102 degrees of freedom
Multiple R-squared:  0.858, Adjusted R-squared:  0.8566 
F-statistic: 616.2 on 1 and 102 DF,  p-value: < 2.2e-16

elasticity_1b<- coef(model_1b)["log(price)"]

profit_ratio <- 0.7^(1+elasticity_1b)
profit_ratio

log(price) 
  1.747006

The estimated price elasticity \(\eta\)=-2.5642. A 1% price increase leads to a 2.56% product’s demand reduction. Demand is elastic since \(|\eta|\)>1. Model intercept is \(\alpha\)=16.91.

Log-linear demand model:

\(log(q)=16.91-2.56log(p) + \epsilon\)

\(q=e^{16.91}p^{-2.56}\)

\(\pi(p)=e^{16.91}p^{-2.56} (p)=e^{16.91}p^{-1.56}\). When p = 0.7p: \(\pi(0.7p)=e^{16.91}0.7p^{-1.56}\)

Profit ratio:

\(\frac{\pi_{new}}{\pi_{old}}=\frac{(e^{16.91}(0.7p)^{-1.56})}{(e^{16.91}p^{-1.56})}=0.7^{-1.56}=1.747\)

A 30% discount is predicted to increase profit by 74.7%. With elastic demand, the quantity increase from the discount outweighs the percentage loss in price per unit.

#q1c identify potential bias of model 1b 

library(readr)
library(dplyr)

# Load the dataset
q1data <- read_csv("store_weekly.csv")

Rows: 104 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): week, month, price, quantity

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# weekly profit with mc = 0 

q1data <- q1data %>%
  mutate(profit = price * quantity)

# Identify discount period
discount_p <- filter(q1data, month %in% c(11, 12))
ndiscount_p <- filter(q1data, !(month %in% c(11, 12)))

# Calculate average weekly profit in each period
discount_pavg <- mean(discount_p$profit) #profit during discount period 
ndiscount_pavg <- mean(ndiscount_p$profit) #profit during non discount period 

print(discount_pavg)

[1] 100828.3

print(ndiscount_pavg)

[1] 48578.58

q1data |>
  dplyr::group_by(month) |>
  dplyr::summarise(mean_quantity = mean(quantity))

# A tibble: 12 × 2
   month mean_quantity
   <dbl>         <dbl>
 1     1          986.
 2     2          960.
 3     3         1048.
 4     4          974.
 5     5         1007.
 6     6          996.
 7     7          959.
 8     8          939.
 9     9          977 
10    10          944.
11    11         3058.
12    12         2901.

#dif_dis_ndis<-(discount_pavg-ndiscount_pavg)/ndiscount_pavg

Prices and quantities are endogenous variables determined in equilibrium through supply and demand curves’ intersection. The log-linear model suffers from omitted variable bias. The discount occurs during November-December, when both supply side (deliberate price cuts) and demand side (baseline demand rises during holiday) changes occur, therefore the observed low prices and high quantities may either reflect elastic demand, shift in demand curve, or both. The model omits demand shifters such as seasonality (holiday season baseline demand changes), attributing all quantity changes to discount. Holiday baseline demand rises means positive demand shocks. Low prices coincide with positive demand shocks violate exogeneity.

Proposed model: fixed effects panel regression which controls for unobserved time-invariant and unit-variant factors (store-specific events, fixed seasonal trends) through dummy variables. lt is suitable since the data-set spans two years of weekly sales data and isolates discount effects from seasonality.

library(tidyverse)

# q1d
q1data <- read.csv("store_weekly.csv")
head(q1data)

  week month price quantity
1    1     1 51.49      925
2    2     1 49.59     1009
3    3     1 51.94     1128
4    4     1 54.57      914
5    5     1 49.30      999
6    6     2 49.30      967

# regression: month fixed effects to control for seasonal demand shifts
model_1d <- lm(log(quantity) ~ log(price) + factor(month), data = q1data)
summary(model_1d)


Call:
lm(formula = log(quantity) ~ log(price) + factor(month), data = q1data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.207108 -0.070076 -0.003057  0.053265  0.254379 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      9.32511    0.65930  14.144  < 2e-16 ***
log(price)      -0.61981    0.16755  -3.699  0.00037 ***
factor(month)2  -0.03184    0.04446  -0.716  0.47574    
factor(month)3   0.03980    0.04481   0.888  0.37678    
factor(month)4  -0.03929    0.04498  -0.874  0.38467    
factor(month)5   0.01540    0.04194   0.367  0.71430    
factor(month)6  -0.02155    0.04524  -0.476  0.63488    
factor(month)7  -0.04660    0.04343  -1.073  0.28614    
factor(month)8  -0.05862    0.04319  -1.357  0.17810    
factor(month)9  -0.04119    0.04524  -0.910  0.36502    
factor(month)10 -0.05685    0.04216  -1.348  0.18085    
factor(month)11  0.89280    0.07922  11.270  < 2e-16 ***
factor(month)12  0.81628    0.08419   9.695 1.11e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09372 on 91 degrees of freedom
Multiple R-squared:  0.9553,    Adjusted R-squared:  0.9494 
F-statistic:   162 on 12 and 91 DF,  p-value: < 2.2e-16

elasticity_1d <- coef(model_1d)["log(price)"]
cat("model 1d price elasticity of demand is ",round(elasticity_1d,4), sep ="", "\n")

model 1d price elasticity of demand is -0.6198

#ped difference 
ped_dif_q1<- elasticity_1d- elasticity_1b
cat("difference between 1d_elasticity to 1b_elasticity is ",round(ped_dif_q1,4), sep ="", "\n")

difference between 1d_elasticity to 1b_elasticity is 1.9444

# profit ratios
profit_ratio_1b <- 0.7^(1 + elasticity_1b)
profit_ratio_1d <- 0.7^(1 + elasticity_1d) 
cat("model 1d profit ratio is ",round(profit_ratio_1d,4), sep ="", "\n")

model 1d profit ratio is 0.8732

#F-statistics test 
model_restricted <- lm(log(quantity) ~ log(price), data = q1data)
anova_test <- anova(model_restricted, model_1d)
cat("\n=== F-TEST FOR MONTH EFFECTS ===\n")


=== F-TEST FOR MONTH EFFECTS ===

print(anova_test)

Analysis of Variance Table

Model 1: log(quantity) ~ log(price)
Model 2: log(quantity) ~ log(price) + factor(month)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    102 2.53887                                  
2     91 0.79937 11    1.7395 18.002 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

if (anova_test$`Pr(>F)`[2] < 0.05) {
  cat("\n✓ Month effects are statistically significant (p < 0.05)\n")
  cat("  This confirms seasonal demand shifts are present and should be controlled for.\n")
} else {
  cat("\n✗ Month effects are not statistically significant (p > 0.05)\n")
}


✓ Month effects are statistically significant (p < 0.05)
  This confirms seasonal demand shifts are present and should be controlled for.

Fixed effects regression equation:

\(log_{q}=\alpha+\beta_1log(p)+\sum_{n=1}^{12}\gamma_{m}1[month_{t}=m]+\epsilon\)

From model_1d, \(\eta_{1d}\)=-0.6198 (p=0.0037) and \(\frac{\pi_{new}}{\pi_{old}}\)=0.8732, which means relative to old profit, at 0.7p (seasonality controlled), new profit decreases by 12.68%. The discount is therefore not justified.

Comparison of fixed-effects model to log-linear model:

Log-linear model \(R^2\)=0.857, fixed-effects model \(R^2\)=0.955, this means month effects substantially accounts for previously unexplained variance. This confirms that linear-log model exhibits optimistic bias, and shows that true demand is inelastic.

F-statistics measures if significant differences exist between the mean quantity sold each month.

\(F=\frac{between-group \ variability}{within-group\ variability}=18\ (p<0.001)\). There is enough evidence to reject the null hypothesis that all month effects are zero.

Q2:

Equation: \(retention_{i}=\beta_{0}+\beta_1activation_{i}+\epsilon_i\), in which \(retention_i=1, activation_i=1\), subsequently represents retained customer and activated customer, and 0 otherwise.

#q2a

library(tidyverse)
q2data<-read.csv("customer_data.csv")
head(q2data)

  id treated activation retention
1  1       1          0         1
2  2       1          0         0
3  3       1          1         1
4  4       1          1         1
5  5       1          1         0
6  6       1          1         1

#number of customers in treatment group 
q2a_treated <-sum(q2data$treated==1)

#overall activation rate of whole dataset. 
q2a_activation<-mean(q2data$activation)

#overall retention rate of whole dataset. 
q2a_retention <-mean(q2data$retention) 

#retention by activation 
retention_by_activation <- q2data %>%
  group_by(activation) %>%
  summarise(
    n = n(),
    retention_rate = mean(retention)
  )

#regression model 
model_2a <- lm(retention ~ activation, data = q2data)
summary(model_2a)


Call:
lm(formula = retention ~ activation, data = q2data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7031 -0.4398  0.2969  0.5602  0.5602 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.43983    0.01386   31.72   <2e-16 ***
activation   0.26331    0.02199   11.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4813 on 1998 degrees of freedom
Multiple R-squared:  0.06695,   Adjusted R-squared:  0.06648 
F-statistic: 143.4 on 1 and 1998 DF,  p-value: < 2.2e-16

#coefficient of activation 
beta_activation <- coef(model_2a)["activation"]
cat("activation coefficient:", round(beta_activation, 4), "\n")

activation coefficient: 0.2633

intercept_b0<- coef(model_2a)["(Intercept)"]

#given info
cost_per_act <- 5 
ltv_per_retained <- 50 

#profit calculation

expected_value <- beta_activation*ltv_per_retained

cat("expected value per activation = £",round(expected_value,2), sep ="", "\n")

expected value per activation = £13.17

profit_per_act <-expected_value - cost_per_act

cat("profit per activation = £",round(profit_per_act,2), sep ="", "\n")

profit per activation = £8.17

#total expected profit for 2000 customers 
n_customers <-nrow(q2data)
t_expected_value <- profit_per_act*n_customers
cat("total expected profit when offered to 2000 customers = £",round(t_expected_value,2), sep ="", "\n")

total expected profit when offered to 2000 customers = £16331.06

From model_2a:

\(retention_i=0.4398+0.2633activation_i\)

The baseline retention rate (\(\beta_0\) activation = 0) is 43.98% (p<0.001). Activation coefficient (\(\beta_1)\) is 0.2633 (p<0.001), meaning that activation increases retention by 26.33%.

Expected profit per activation is: \(\pi = \beta_1 *(LTV - CPA)\).

Extra LTV per activation is \(0.2633*50=13.17\). Expected profit per activation is therefore £8.17. It is profitable for the firm to provide the new feature.

The initial treatment assignment is a plausible but not perfect instrument for measuring actual activation status.

A good instrument variable (IV) \(Z\) for an endogenous regressor \(X_1\) has to satisfy the following conditions:

Exogeneity: No correlation between IV and the error term \(Cov(Z,\epsilon)=0\)
Instrument relevance: IV must be correlated with endogenous variable \(Cov(Z, X_1) \neq0\).
Strength: Weak instruments (those that weakly correlate to endogenous variable) can cause severe biases in IV estimates.
Exclusion restriction: IV \(Z\) affects the dependent variable (\(Y\)) only through its effects on \(X_1\).

#q2b
library(tidyverse)
q2data<-read.csv("customer_data.csv")

#activation rate among treatment group 
activation_rate_treated<-mean(q2data$activation[q2data$treated == 1]) 

#regression activation on treated 
model_2b<- lm(activation ~ treated, data = q2data)
summary(model_2b)


Call:
lm(formula = activation ~ treated, data = q2data)

Residuals:
   Min     1Q Median     3Q    Max 
-0.795  0.000  0.000  0.205  0.205 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.747e-14  9.032e-03    0.00        1    
treated     7.950e-01  1.277e-02   62.24   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 1998 degrees of freedom
Multiple R-squared:  0.6598,    Adjusted R-squared:  0.6596 
F-statistic:  3874 on 1 and 1998 DF,  p-value: < 2.2e-16

f_stat <-summary(model_2b)$fstatistic[1]
cat("f-statistic = ",round(f_stat, 2), sep ="", "\n")

f-statistic = 3874.17

coef_treated <- coef(model_2b)["treated"]
cat("coefficient of treated = ",coef_treated, sep ="", "\n")

coefficient of treated = 0.795

From the analysis:

Relevance and strength assumption holds: first stage regression \(activation_i=\gamma+\delta treated_i+u_i\). The treated coefficient \(\gamma\)=0.795 and F-statistics of 3874 exceeds the conventional threshold (10), indicating strong empirical evidence for relevance and instrument strength.
Exclusion restriction and exogeneity: Given the study design (50% treated, 50% control), it can be assumed that the firm intended to randomise assignment, making exogeneity and exclusion plausible design assumptions. Realistically, this may not hold if staff selectively chose more engaged clients. If true, this would mean a direct correlation of assignment to retention, violating the exclusion restriction. The 50/50 design supports but doesn’t guarantee exogeneity since its validity rests on unobservable procedure details.

c) d)

#q2c
#q2c iv regression with initial treatment assignment 

library(tidyverse)
#install.packages("AER")
library(AER)

Warning: package 'AER' was built under R version 4.3.3

Loading required package: car

Warning: package 'car' was built under R version 4.3.3

Loading required package: carData

Warning: package 'carData' was built under R version 4.3.3


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: lmtest

Warning: package 'lmtest' was built under R version 4.3.3

Loading required package: zoo

Warning: package 'zoo' was built under R version 4.3.3


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Loading required package: sandwich

Warning: package 'sandwich' was built under R version 4.3.3

Loading required package: survival

q2data<-read.csv("customer_data.csv")

#first stage: activation treated 
fs_2c<-lm(activation ~ treated, data = q2data)
summary(fs_2c)


Call:
lm(formula = activation ~ treated, data = q2data)

Residuals:
   Min     1Q Median     3Q    Max 
-0.795  0.000  0.000  0.205  0.205 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.747e-14  9.032e-03    0.00        1    
treated     7.950e-01  1.277e-02   62.24   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 1998 degrees of freedom
Multiple R-squared:  0.6598,    Adjusted R-squared:  0.6596 
F-statistic:  3874 on 1 and 1998 DF,  p-value: < 2.2e-16

q2data$activation_hat <- fitted(fs_2c)

#second stage: retention to instrumented activation 
model_2c<-lm(retention~activation_hat, data=q2data)
summary(model_2c)


Call:
lm(formula = retention ~ activation_hat, data = q2data)

Residuals:
   Min     1Q Median     3Q    Max 
-0.574 -0.515  0.426  0.485  0.485 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.51500    0.01573  32.742  < 2e-16 ***
activation_hat  0.07421    0.02798   2.652  0.00806 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4974 on 1998 degrees of freedom
Multiple R-squared:  0.003509,  Adjusted R-squared:  0.00301 
F-statistic: 7.035 on 1 and 1998 DF,  p-value: 0.008055

#alternative methods 
alternative_2c<-ivreg(retention~activation|treated, data = q2data)
summary(alternative_2c)


Call:
ivreg(formula = retention ~ activation | treated, data = q2data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5892 -0.5150  0.4108  0.4850  0.4850 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.51500    0.01550  33.228  < 2e-16 ***
activation   0.07421    0.02757   2.692  0.00717 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4901 on 1998 degrees of freedom
Multiple R-Squared: 0.03242,    Adjusted R-squared: 0.03194 
Wald test: 7.245 on 1 and 1998 DF,  p-value: 0.007167

#iv coef
iv_coef <- coef(model_2c)["activation_hat"]
cat("iv activation coefficient = ",round(iv_coef, 4), sep ="", "\n")

iv activation coefficient = 0.0742

#profitability 
value_per_cus<- 50 
cpa <-5 
profit_per_iv<-iv_coef*value_per_cus - cpa
cat("profit per instrumented activation = £",round(profit_per_iv, 2), sep ="", "\n")

profit per instrumented activation = £-1.29

Using initial treatment assignment as IV for actual activation, the coefficient of instrumented activation on retention is 0.0742 (p<0.001), which means among treatment group, feature activation increases retention probability by 7.42%. Each activation yields an LTV of £3.71 and expected profit per activation is therefore (-£1.29). Providing the feature is therefore not profitable.

\(ATT= E(Y_i(1)-Y_i(0)|W_i=1)=\frac{E(Y_i|Z_i=1)-E(Y_i|Z_i=0)}{Pr(W_i=1|Z_i=1)}\) represents the average effect of the treatment among those treated, \(Pr(W_i=1|Z_i=1)\) is activation rate among treatment group. \(Z \in \{0,1\}\) - assignment status, \(Y \in \{0,1\}\) - retention status, \(W \in \{0,1\}\) - activation status.

Local average treatment effect (LATE) estimates the treatment effect among those who complied with the assignment, if and only if activation among control \(E(W_i|Z_i=0)\) then ATT = LATE.

\(LATE = \frac{E(Y_i|Z_i=1)-E(Y_i|Z_i=0)}{E(W_i|Z_i=1)-E(W_i|Z_i=0)}=\frac{0.059}{0.795}=0.0742\)

ITT measures the retention effect based on initial assignment, regardless of actual activation status. Being assigned the treatment increases retention rate by 5.9%. Activation rate among treatment \(Pr(W_i=1|Z_i=1)\) is 79.5%. ATT is around 7.42%, which means in treat=1, activators have a 7.42% higher retention rate. Activation rate among control \(Pr(W_i=1|Z_i=0)=E(W_i|Z_i=0) = 0\) , ATT and LATE both yields 7.42%.

\(ITT = E(Y_i|Z_i=1)-E(Y_i|Z_i=0)=0.574-0.515=0.059\)

\(ATT=\frac{0.059}{0.795}=0.0742\)

#q2d
library(tidyverse)

q2data<- read.csv("customer_data.csv")

#retention rate among treatment and control (y|z)

retention_among_treated<- mean(q2data$retention[q2data$treated == 1])
cat("retention among treated is ", retention_among_treated, sep = "", "\n")

retention among treated is 0.574

retention_among_control <-mean(q2data$retention[q2data$treated == 0])  
cat("retention_among_control is ", retention_among_control, sep = "", "\n")

retention_among_control is 0.515

itt <- retention_among_treated- retention_among_control
cat("itt is ",itt, sep ="", "\n")

itt is 0.059

itt_reg<- lm(retention~treated, data=q2data)
summary(itt_reg)


Call:
lm(formula = retention ~ treated, data = q2data)

Residuals:
   Min     1Q Median     3Q    Max 
-0.574 -0.515  0.426  0.485  0.485 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.51500    0.01573  32.742  < 2e-16 ***
treated      0.05900    0.02224   2.652  0.00806 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4974 on 1998 degrees of freedom
Multiple R-squared:  0.003509,  Adjusted R-squared:  0.00301 
F-statistic: 7.035 on 1 and 1998 DF,  p-value: 0.008055

beta_ITT <- coef(itt_reg)["treated"]
cat("itt (from regression model) is ", beta_ITT, sep = "", "\n")

itt (from regression model) is 0.059

#ATT 

activation_among_treated<- mean(q2data$activation[q2data$treated == 1])
att<- itt/activation_among_treated
cat("att is ", round(att, 4), sep = "", "\n")

att is 0.0742

#LATE 
activation_among_control<-mean(q2data$activation[q2data$treated == 0])
late_denominator<- activation_among_treated - activation_among_control
late <- beta_ITT/ late_denominator
cat("late is ", round(late, 4), sep = "", "\n")

late is 0.0742

#profitability 
cpa<-5 
ltv_per_retention<- 50 
net_benefit <- ltv_per_retention*att - cpa 

cat("net profit per customers is £", round(net_benefit,2), sep = "", "\n")

net profit per customers is £-1.29

n_customers<-nrow(q2data)
expected_act<- n_customers * mean(q2data$activation)
total_expected_benefit <-expected_act * net_benefit
cat("total expected profit per 2000 customers is £", total_expected_benefit, sep = "", "\n")

total expected profit per 2000 customers is £-1025

Model 2a and 2c yields coefficient of activation of 0.2633 and 0.0742, respectively. Profit per activation is £8.17 and (-£1.29), subsequently for 2a and 2c. 2c and 2d’s results are identical.

Results from 2c and 2d model is more reliable since 2a suffers positive selection bias (more engaged customers are more likely to activate the feature and/or more likely to retain subscription regardless of activation) leading to overestimation.

The firm is therefore recommended to not implement the feature. Once activation endogeneity is addressed, causal effect is modest (7.24%). Additionally, the firm is expected to make a net loss of -£1.29 per activation. The firm should either redesign the feature to reduce costs below £3.71 per activation or target customers with a higher LTV.

Q4:

#q4a
library(tidyverse)
q4data <- read.csv("fitness.csv")
head(q4data)

  id treat past_workouts app_opens renewed
1  1     1             9        12       1
2  2     1             7        11       1
3  3     1             5         9       1
4  4     1             4         5       0
5  5     0             5         6       0
6  6     1             4         4       0

#logistic regression renewal on treatment status

model_4a<- glm(renewed~treat, data = q4data, family= binomial )
summary(model_4a)


Call:
glm(formula = renewed ~ treat, family = binomial, data = q4data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.44288    0.05168  -47.27   <2e-16 ***
treat        1.69360    0.06004   28.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9960.7  on 9999  degrees of freedom
Residual deviance: 9000.0  on 9998  degrees of freedom
AIC: 9004

Number of Fisher Scoring iterations: 5

coef_treat <- coef(model_4a)["treat"]
coef_treat_odds<- exp(coef_treat)


intercept_4a<- coef(model_4a)["(Intercept)"]
control_renewal_rate <- (exp(intercept_4a)/(1+exp(intercept_4a)))
cat("renewal probability of control group (scenario 2) is ", round(control_renewal_rate,3), sep = "", "\n")

renewal probability of control group (scenario 2) is 0.08

intercept_4a_odds<-exp(intercept_4a)
cat("renewal odds for control group is ", round(intercept_4a_odds,3), sep = "", "\n")

renewal odds for control group is 0.087

#renewal probability of treatment group  
treated_renewal_rate<-(exp(coef_treat+intercept_4a)/(1+exp(coef_treat+intercept_4a)))  
cat("renewal probability of treatment group (scenario 1) is ", round(treated_renewal_rate,3), sep = "", "\n")

renewal probability of treatment group (scenario 1) is 0.321

#treatment effect on renewal rate 
treatment_effect_on_renewal<- treated_renewal_rate - control_renewal_rate
cat("treatment effect on renewal rate is ", round(treatment_effect_on_renewal,3), sep = "", "\n")

treatment effect on renewal rate is 0.241

#predict renewal treated = everyone  
alltreated_4a<- q4data%>% mutate(treat = 1)
pred_alltreated <- predict.glm(model_4a, newdata=alltreated_4a, type = "response")
summary(pred_alltreated)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.321   0.321   0.321   0.321   0.321   0.321

alltreated_avg<- mean(pred_alltreated)
cat("renewal rate when everyone is treated is ",round( alltreated_avg,3), sep = "", "\n")

renewal rate when everyone is treated is 0.321

#predict renewal treated = no one. 

nonetreated_4a<- q4data %>% mutate(treat=0)
pred_nonetreated <- predict.glm(model_4a, newdata=nonetreated_4a, type = "response")
nonetreated_avg<- mean(pred_nonetreated) 
cat("renewal rate when no one is treated is ", round(nonetreated_avg,3), sep = "", "\n")

renewal rate when no one is treated is 0.08

#comparison 

effect_4a<- alltreated_avg - nonetreated_avg
cat("treatment effect on renewal rate is ", round(effect_4a,3), sep = "", "\n")

treatment effect on renewal rate is 0.241

\(P(renew_i=1)=\lambda(\beta_0+\beta_1treat_i)\) in which \(\lambda(z)=\frac{e^z}{1+e^z}\)

The treatment coefficient (log-odds) is \(\beta_1=1.694\) (p<0.001), \(e^{\beta_0}\approx5.44\), treatment increases renewal odds by 444%. The intercept (log-odds) \(\beta_0=-2.443\) ( p<0.001), \(e^{\beta_0}=0.087\) is the odds of renewal for treat = 0.

Scenario 1: Everyone treated \(\hat{P}(renewed=1|treated=1)=\lambda(\beta_0+\beta_1)=(-2.443+1.694)\lambda=-0.749\lambda\)

Expected renewal rate (scenario 1) \(p_1=\frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}=0.321\)

Scenario 2: No one treated \(\hat{P}(renewed=1|treated=0)=\lambda(\beta_0)=(-2.443)\lambda\)

\(p_0=\frac{e^{\beta_0}}{1+e^{\beta_0}}\approx0.080\), expected renewal rate for scenario 2 is 8%.

Treatment effect on overall renewal rate is \(p_1-p_0=0.241\). Treatment increases overall retention rate by 24.1%, exceeding the threshold (10%).

#q4b 
#install.packages("pscl")
#library(pscl)
library(tidyverse)
q4data<- read.csv("fitness.csv") 

#logistic regression with treatment & past workouts 
model_4b <- glm(renewed ~ treat + past_workouts, data = q4data, family = binomial)
summary(model_4b)


Call:
glm(formula = renewed ~ treat + past_workouts, family = binomial, 
    data = q4data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -4.47606    0.09438  -47.42   <2e-16 ***
treat          0.97308    0.06574   14.80   <2e-16 ***
past_workouts  0.43771    0.01497   29.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9960.7  on 9999  degrees of freedom
Residual deviance: 7964.9  on 9997  degrees of freedom
AIC: 7970.9

Number of Fisher Scoring iterations: 5

treat4b_coef<-coef(model_4b)["treat"]
pworkout<-coef(model_4b)["past_workouts"]
intercept_4b<-coef(model_4b)["(Intercept)"]

#renewal prob for user of treat = 0 and past workouts = 0
baseline_renew_prob <- (exp(intercept_4b)/(1+exp(intercept_4b)))
cat("renewal rate for user with treat = 0 and past_workouts = 0 is ",round(baseline_renew_prob,3), sep = "", "\n")

renewal rate for user with treat = 0 and past_workouts = 0 is 0.011

# scenario 1 - everyone treated 
all_treated <- q4data %>% mutate(treat = 1)
pred_treatment <- predict.glm(model_4b, newdata = all_treated, type = "response")
avg_ptreated <- mean(pred_treatment)
cat("renewal rate when everyone is treated (pastworkout constant) is ",round(avg_ptreated,4), sep = "", "\n")

renewal rate when everyone is treated (pastworkout constant) is 0.2471

# scenario 2 - no one treate
all_control <- q4data %>% mutate(treat = 0)
pred_control <- predict.glm(model_4b, newdata = all_control, type = "response")
avg_pcontrol <- mean(pred_control)
cat("renewal rate when no one is treated (pastworkout constant) is ",round(avg_pcontrol,4), sep = "", "\n")

renewal rate when no one is treated (pastworkout constant) is 0.1245

effect_4b<- avg_ptreated - avg_pcontrol
cat("treatment effect on renewal rate (pastworkouts constant) is ", round(effect_4b,3), sep = "", "\n")

treatment effect on renewal rate (pastworkouts constant) is 0.123

#past workouts control vs treatment 

avg_control_pwrk <- mean(q4data$past_workouts[q4data$treat == 0])
avg_treat_pwrk   <- mean(q4data$past_workouts[q4data$treat == 1]) 

#pearson correlation coefficient treatment and pastworkouts model4a validity 
cor(q4data$treat, q4data$past_workouts)

[1] 0.4401311

The treatment coefficient \(\beta_1=0.973\) (p<0.001), odds ratio \(e^{\beta_1}=2.65\), treatment increases renewal odds by 165% (past_workout controlled).

The past_workout coefficient \(\beta_2=0.438\) (p<0.001), \(e^{\beta_2}=1.55\), which means each additional past workout raises the renewal odds by 55% (treatment controlled).

The intercept (\(\beta_0\)) (p<0.001) is -4.48 represents the log-odds of renewal for a user with treat = 0 and past_workouts = 0.

Scenario 1: everyone treated (holding past_workouts constant)

\(P(\text{renewed}_i = 1)\approx 0.2471 \ (24.71\%)\)

Scenario 2: no one treated (holding past_workouts constant)

\(P(\text{renewed}_i = 1)\approx 0.1245 \ (12.45\%)\)

Baseline activity controlled, the model predicts that treatment increases renewal by 12.26%, exceeding the threshold (10%). The company should therefore proceed with new feature.

The \(r\) for treatment and past_workout is 0.44 demonstrating moderate positive correlation, this indicates selection bias, more active users (those with higher past workouts) were more likely to receive treatment. This creates positive confounding, violating the random assignment assumption. Model 4a hence suffers from omitted variable bias.

#q4c 
library(tidyverse)
q4data <- read.csv("fitness.csv")

# Logistic regression renewal on treatment, pworkout, and app freq 

model_4c <- glm(renewed ~ treat + past_workouts + app_opens, data = q4data, family = binomial)
summary(model_4c)


Call:
glm(formula = renewed ~ treat + past_workouts + app_opens, family = binomial, 
    data = q4data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -4.50806    0.09635 -46.789  < 2e-16 ***
treat          0.08070    0.07772   1.038    0.299    
past_workouts  0.08360    0.02112   3.959 7.54e-05 ***
app_opens      0.35601    0.01578  22.556  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9960.7  on 9999  degrees of freedom
Residual deviance: 7395.1  on 9996  degrees of freedom
AIC: 7403.1

Number of Fisher Scoring iterations: 5

treat_coef4c<- coef(model_4c)["treat"]
pwrk_coef4c<- coef(model_4c)["past_workouts"]
app_coef4c<- coef(model_4c)["app_opens"]
intercept_4c<- coef(model_4c)["(Intercept)"] 
p0<- (exp(intercept_4c)/(1+exp(intercept_4c))) 

#app open is an in-trial parameter (not pre or post)

# scenario 1: everyone treated
all_treated4c <- q4data %>% mutate(treat = 1)
pred_treated4c <- predict.glm(model_4c, newdata = all_treated4c, type = "response")
ptreated_4c <- mean(pred_treated4c) 
cat("renewal rate when everyone is treated (pastworkout and app usage constant) is ",round(ptreated_4c,4), sep = "", "\n")

renewal rate when everyone is treated (pastworkout and app usage constant) is 0.2012

# scenario 2: no one treated 
all_control4c <- q4data %>% mutate(treat = 0)
pred_control4c <- predict.glm(model_4c, newdata = all_control4c, type = "response")
pcontrol_4c <- mean(pred_control4c)
cat("renewal rate when no one is treated (pastworkout and app usage constant) is ",round(pcontrol_4c,4), sep = "", "\n")

renewal rate when no one is treated (pastworkout and app usage constant) is 0.1918

# Treatment effect on overall retention rate
effect_4c <- ptreated_4c- pcontrol_4c
cat("treatment effect on renewal rate (pastworkouts & app usage constant) is ", round(effect_4c,4), sep = "", "\n")

treatment effect on renewal rate (pastworkouts & app usage constant) is 0.0093

The intercept \(\beta_0=-4.51\) (\(p<0.001\)), \(p_0=0.0109\), which means renewal rate for baseline users (treat, post_workouts and app = 0) is 1.09%.

The treatment coefficient \(\beta_1=0.081\) (statistically insignificant \(p=0.299\)), which means the treatment has negligible effect (controlled past_workout app_opens). The past_workouts coefficient \(\beta_2=0.084\) (\(p<0.001\)), (\(e^{\beta_2}\)) of 1.09 which means each additional past workout increases renewal odds by 9%. The app usage coefficient \(\beta_3=0.356\) (\(p<0.001\)), \(e^{\beta_3}=1.43\). Each open increases renewal odds by 43%.

Scenario 1: everyone treated (controlled past_workout app_opens).

\(P(\text{renewed}_i = 1)= 0.2012\ (20.12\%)\)

Scenario 2: no one treated (controlled past_workout app_opens).

\(P(\text{renewed}_i = 1)\approx 0.1918\ (19.18\%)\)

Controlled for past workouts and app usage, treatment affect on renewal rate is 0.93%, significantly smaller than threshold. Model_4c recommends the company to not implement the feature.

Treatment affect renewal through two mechanisms:

Indirect pathway: Treatment -> App Opens -> Renewal

Direct pathway: Treatment -> Renewal

Post_workout is a pre-treatment confounders, affecting both the probability of being assigned treatment, and renewal. App open is a mediator in the indirect pathway.

Model comparison and conclusion:

Treatment affect on renewal rates across 4a, 4b and 4c are 24.1%, 12.26% and 0.93%, subsequently. Model 4a suffers from confounding due to positive selection bias because it omits the variable past_workouts, causing overestimation of treatment effect. Model 4c incorporates app usage (besides past workouts), which is a mediator (indirect pathway of how treatment affects outcome), and controlling for app opens would therefore block indirect treatment effect, resulting in underestimation. Model 4b controls for pre-treatment confounder and allows for complete estimation of treatment effect on outcome. Therefore based on model 4b, the new feature should be implemented because it raises renewal by 12.26%, exceeding the threshold (10%).

Q3:

\(V_V=0.4, V_W=0.2, V_T=0.2, V_0 (outside \ option)=0\)

The choice probability for each option:

\(P_j=\frac{exp(V_j)}{exp(V_0)+exp(V_V)+exp(V_W)+exp(V_T)}\), \(j \in \{V,W,T\}\)

\(P_0=\frac{exp(V_0)}{exp(V_0)+exp(V_V)+exp(V_W)+exp(V_T)}\)

#q3a
v0<- 0 
vv<- 0.4 
vw<-0.2 
vt<- 0.2 

denominator<- exp(v0)+exp(vv)+exp(vw)+exp(vt)

#choice probability for brand v 
prob_v<- exp(vv)/denominator 
cat("choice probability for brand v is ",round(prob_v,4), sep = "", "\n")

choice probability for brand v is 0.3023

#choice probability for brand w 
prob_w<- exp(vw)/denominator 
cat("choice probability for brand w is ",round(prob_w,4), sep = "", "\n")

choice probability for brand w is 0.2475

#choice probability for brand t  
prob_t<- exp(vt)/denominator 
cat("choice probability for brand t is ",round(prob_t,4), sep = "", "\n")

choice probability for brand t is 0.2475

#choice probability for outside option 
prob_0<- exp(v0)/denominator 
cat("choice probability for outside option (0) is ",round(prob_0,4), sep = "", "\n")

choice probability for outside option (0) is 0.2026

#q3b 

library(tidyverse)

V_V <- 0.4
V_W <- 0.2
V_T <- 0.2
V_0 <- 0

n_consumers <- 50000

set.seed(123)

# Generate Gumbel errors
# If X ~ Exp(1), then -ln(X) ~ Gumbel(0,1)
generate_gumbel <- function(n) {
  -log(rexp(n, rate = 1))
}

#error term
epsilon_0 <- generate_gumbel(n_consumers)
epsilon_V <- generate_gumbel(n_consumers)
epsilon_W <- generate_gumbel(n_consumers)
epsilon_T <- generate_gumbel(n_consumers)

# Calculate utilities
U_0 <- V_0 + epsilon_0
U_V <- V_V + epsilon_V
U_W <- V_W + epsilon_W
U_T <- V_T + epsilon_T

# maximum utility for each consumer
utilities_matrix <- cbind(U_0, U_V, U_W, U_T)
max_utilities <- apply(utilities_matrix, 1, max)

# Expected consumer surplus

expected_CS <- mean(max_utilities) 
cat("expected consumer surplus is ",round(expected_CS,4), sep = "", "\n")

expected consumer surplus is 2.1737

Consumer surplus represents the expected utility consumers receive from having access to these brands. Each consumer’s expected surplus is 2.17 from choosing their preferred among 4 options (V,W,T,0), benefiting from product variety (2.17>0).

#q3c merger of W and T 

library(tidyverse)
V_V <- 0.4
V_W <- 0.2
V_T <- 0.2 
V_0 <- 0

n_consumers <- 50000

set.seed(123)

#Gumbel errors
generate_gumbel <- function(n) {
  -log(rexp(n, rate = 1))
}

# error term for each 
eps_0 <- generate_gumbel(n_consumers)
eps_V <- generate_gumbel(n_consumers)
eps_W <- generate_gumbel(n_consumers)
eps_T<- generate_gumbel(n_consumers)

# utilities
U_0 <- V_0 + eps_0
U_V <- V_V + eps_V
U_W <- V_W + eps_W
U_T <- V_T + eps_T

#pre-merger utilities 
util_pre<- cbind(U_0, U_V, U_W, U_T)
max_pre <- apply(util_pre, 1, max) 
cs_pre <- mean(max_pre) 
cat("pre-merger consumer surplus is ",round(cs_pre,4), sep = "", "\n")

pre-merger consumer surplus is 2.1737

#post merger 
util_post <- cbind(U_0, U_V, U_W)
max_post<- apply (util_post, 1, max) 
cs_post<- mean (max_post) 
cat("post-merger consumer surplus is ",round(cs_post,4), sep = "", "\n")

post-merger consumer surplus is 1.8942

#consumer surplus change 
cs_dif<- cs_post - cs_pre
cat("difference between post-merger and pre-merger consumer surplus is ",round(cs_dif,4), sep = "", "\n")

difference between post-merger and pre-merger consumer surplus is -0.2795

cs_change <- (cs_post-cs_pre)/cs_pre
cat("post-merger consumer surplus changes by (rate) ",round(cs_change,4), sep = "", "\n")

post-merger consumer surplus changes by (rate) -0.1286

#market share pre and post 

#util_post <- cbind(U_0, U_V, U_W)
post_choice<- max.col(util_post)
post_shares<- prop.table(table(post_choice))

#util_pre<- cbind(U_0, U_V, U_W, U_T)
pre_choice<- max.col(util_pre)
pre_shares <- prop.table(table(pre_choice))

The merger reduces aggregate consumer surplus by 12.86%. Reduced product variety (brand T eliminated) accounts for this. Additionally, despite V and W remains, losing T means those prefer T, must now switch to a less-preferred option. In addition, the IIA (independence of irrelevant alternatives) of \(mlogit\) means the T’s probability mass is reallocated equally across all remaining options, instead of W only. The model assumes W&T as perfect substitutes (\(V_W=V_T=0.2)\).

The analysis reveals that merger would lead to a 12.86% reduction in consumer surplus, meaning a harm to consumer welfare.

Validity of policy implications based on model:

The model imposes IIA which might mean unrealistic substitution pattern. Perfect substitutes assumptions may not hold in practice.
The model is static since it assumes fixed product attributes, price, market dynamics, innovation post merger.

The model therefore provides a useful starting point but is insufficient. Real policy choices should be supplemented with potential price effects, competitive dynamics, empirical demand and efficiency claim analysis.

Market dynamics, innovation and barrier-to-entry - static analysis misses dynamic considerations. Considering competitive dynamics, the merger would likely result in reduced competition, which likely means complacency, reduced innovation and price rises. Authorities should also consider barrier to entry changes post-merger, if barriers to entry increase, consumer harm may increase in the long term.
Efficiencies and synergies - The merger might benefit from cost synergies given their shared supply chain (Eden Valley), economies of scale in bottling and distribution, or duplicative SG&A costs’ reduction. Loss in consumer surplus may decrease if efficiency gains are passed on.
Product and portfolio mix - Post-merger, the combined firm might modify their product portfolio mix, which can alter substitution patterns and welfare outcomes beyond the scenario assumption.
Distribution and fairness - authorities also consider distributional impacts and fairness perceptions. Specifically, the merger might disproportionately harming certain groups (e.g: small cafes reliant on brand T) more.