Question 1:
1a)
For \(I=X_1X_2X_3X_4=X_1\), this means:
\(X_1 = X_1 \cdot I = X_1 \cdot X_1X_2X_3X_4 = X_1^2 \cdot X_2X_3X_4 = X_2X_3X_4\) (since \(X_1^2 = I\))
So: \(X_1 = X_2X_3X_4 = I\)
1b)
For \(I=X_1=X_2X_3X_4\), we have:
\(X_1X_2X_3X_4 = I \cdot X_2X_3X_4 = X_2X_3X_4\)
Since \(X_2X_3X_4 = I\),
therefore \(X_1X_2X_3X_4=I\).
1c)
We have \(I=X_1X_2X_3X_4=X_3X_4\)
\(X_3 \cdot I = X_3 \cdot X_3X_4\)
\(X_3 = X_3^2 X_4 = I \cdot X_4 = X_4\)
Therefore \(X_3=X_4\), which means:
\(X_1X_2X_3X_3=I\)
\(X_1X_2I=I\)
\(X_1X_2=I\)
This shows that \(X_1\) and \(X_2\) are perfectly confounded.
Collinearity: \(X_1X_2=I\) and \(X_2^2=I\), so multiplying both sides by \(X_2\) gives \(X_1=X_2\); the two columns are identical, which means \(X_1\) and \(X_2\) are perfectly collinear.
Because \(X_1\) and \(X_2\) are perfectly confounded and collinear, the design suffers from perfect multicollinearity and the effects of \(X_1\) and \(X_2\) cannot be separated. Therefore, we cannot:
Uniquely identify \(\beta_1\) and \(\beta_2\).
We can only estimate the confounded effect \(\beta_1+\beta_2\).
Conclusion: It is not possible to estimate this regression because \(X_1\) and \(X_2\) are completely confounded in this fractional factorial design.
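To see why such a regression cannot be estimated, here is a minimal sketch (with simulated ±1 coded factors, not the coursework data) showing that lm() cannot identify two perfectly collinear regressors and reports NA for one of them:
# Minimal illustration with simulated +/-1 coded factors (not the coursework data).
set.seed(1)
x1 <- rep(c(-1, 1), each = 4)
x2 <- x1                          # identical to x1, i.e. perfectly collinear
y  <- 2 + 0.5 * x1 + rnorm(8, sd = 0.1)
summary(lm(y ~ x1 + x2))          # the x2 coefficient is NA ("not defined because of singularities")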
Question 2:
2a)
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
cd<- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/customer_data.csv")
#q2a
head(cd)
  id treated activation retention
1 1 1 1 1
2 2 1 1 1
3 3 1 0 1
4 4 1 1 1
5 5 1 1 1
6 6 1 1 1
str(cd)
'data.frame': 2000 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ treated : int 1 1 1 1 1 1 1 1 1 1 ...
$ activation: int 1 1 0 1 1 1 1 0 1 0 ...
$ retention : int 1 1 1 1 1 1 1 0 1 0 ...
#equation: retention = b0 + b1activation + error
reg2a = lm(retention ~ activation, data = cd)
summary(reg2a)
Call:
lm(formula = retention ~ activation, data = cd)
Residuals:
Min 1Q Median 3Q Max
-0.7197 -0.4296 0.2803 0.5704 0.5704
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.42964 0.01378 31.18 <2e-16 ***
activation 0.29001 0.02180 13.30 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4775 on 1998 degrees of freedom
Multiple R-squared: 0.08138, Adjusted R-squared: 0.08092
F-statistic: 177 on 1 and 1998 DF, p-value: < 2.2e-16
coef(reg2a)
(Intercept)  activation
0.4296420 0.2900076
Interpretation:
\(\beta_0\) (intercept) = 0.43, p < 0.001: among subscribers who did not activate the new feature, the retention rate is about 43%. This is the baseline retention rate for subscribers who did not use the one-to-one fitness assessment feature, and the estimate is statistically significant (p < 0.001).
\(\beta_1\) (activation) = 0.29, p < 0.001: subscribers who activated the feature had a retention rate 29 percentage points higher than those who did not. The estimate is statistically significant (p < 0.001), which is strong evidence that activation is associated with retention.
\(R^2 = 0.081\): the model accounts for 8.1% of the variation in retention.
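As a quick check (a minimal sketch, assuming the cd data frame loaded above), the coefficients of a regression on a single binary regressor equal the group means: the intercept is mean retention among non-activators and the slope is the activator minus non-activator difference.
# Quick check (assumes cd from above): with one binary regressor, the intercept is
# the mean outcome in the 0 group and the slope is the difference in group means.
mean(cd$retention[cd$activation == 0])                                            # should match the intercept (~0.43)
mean(cd$retention[cd$activation == 1]) - mean(cd$retention[cd$activation == 0])   # should match the slope (~0.29)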
2b)
#q2b
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
cd<- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/customer_data.csv")
#equation: retention = b0 + b1treated + error
reg2b = lm(retention ~ treated, data = cd)
summary(reg2b)
Call:
lm(formula = retention ~ treated, data = cd)
Residuals:
Min 1Q Median 3Q Max
-0.600 -0.491 0.400 0.400 0.509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.49100 0.01566 31.356 < 2e-16 ***
treated 0.10900 0.02215 4.922 9.27e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4952 on 1998 degrees of freedom
Multiple R-squared: 0.01198, Adjusted R-squared: 0.01149
F-statistic: 24.23 on 1 and 1998 DF, p-value: 9.266e-07
coef(reg2b)
(Intercept)     treated
0.491 0.109
Interpretation:
\(\beta_0 = 0.491\): among subscribers in the control group (those not assigned to the treatment), the retention rate is 49.1%. With p < 0.001, this is statistically significant.
\(\beta_1 = 0.109\): subscribers assigned to the treatment group had a retention rate 10.9 percentage points higher than the control group (p < 0.001, statistically significant). This is the intention-to-treat (ITT) effect: the effect of offering the new feature, regardless of whether subscribers actually activated it. Because it relies only on random assignment, the ITT estimate preserves randomisation and reflects realistic implementation, including non-compliance, so it is more realistic than the estimate in 2a.
\(R^2 = 0.012\): treatment assignment accounts for 1.2% of the variation in retention.
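The ITT estimate can also be recovered directly from the assignment-group means (a minimal sketch, assuming cd from above):
# Quick check (assumes cd from above): the ITT estimate is the difference in mean
# retention between assignment groups; t.test reports both group means.
t.test(retention ~ treated, data = cd)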
2c)
# 2c
cd<- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/customer_data.csv")
# retention rate analysis
# Control group
control_retention <- mean(cd$retention[cd$treated == 0])
control_n <- sum(cd$treated == 0)
# Treated group who activated
treated_activated_retention <- mean(cd$retention[cd$treated == 1 & cd$activation == 1])
treated_activated_n <- sum(cd$treated == 1 & cd$activation == 1)
# Treated group who did NOT activate (key evidence!)
treated_not_activated_retention <- mean(cd$retention[cd$treated == 1 & cd$activation == 0])
treated_not_activated_n <- sum(cd$treated == 1 & cd$activation == 0)
# Calculate the critical difference
diff <- treated_not_activated_retention - control_retention
# Statistical test
test_data <- cd[(cd$treated == 1 & cd$activation == 0) | cd$treated == 0, ]
t_test <- t.test(retention ~ treated, data = test_data)
cat("t-test comparing treated non-activators vs control:\n")t-test comparing treated non-activators vs control:
# Compliance rate
compliance_rate <- mean(cd$activation[cd$treated == 1])
# Create visualization data
viz_data <- data.frame(
Group = c("Control", "Treated - No Activation", "Treated - Activated"),
Retention = c(control_retention, treated_not_activated_retention, treated_activated_retention),
N = c(control_n, treated_not_activated_n, treated_activated_n)
)
print(viz_data)
                    Group Retention    N
1 Control 0.4910000 1000
2 Treated - No Activation 0.1243781 201
3 Treated - Activated 0.7196496 799
The two regressions imply different business recommendations.
Model 2a's \(\beta_1\) is the activation effect: activators had a retention rate 29 percentage points higher than non-activators. Model 2b's \(\beta_1\) is the intention-to-treat effect: subscribers assigned to the treatment group (regardless of whether they activated) had a retention rate 10.9 percentage points higher than the control group. The effect in model 2a is around 2.7 times larger than in model 2b.
Retention rate analysis: control group retention = 49.1%, treated and activated = 72.0%, treated but not activated = 12.4%. The t-test comparing treated non-activators with the control group is statistically significant (p < 0.001), indicating a real difference between the two groups.
The intention-to-treat analysis in 2b is preferable to 2a because model 2a suffers from severe self-selection bias. The fact that treated non-activators retain far less than the control group shows that the decision to activate is not random: it is strongly correlated with subscriber characteristics that also predict retention, so the randomisation of the experiment is not preserved in model 2a. Model 2a therefore largely reflects pre-existing differences between activators and non-activators rather than the causal effect of the feature, while model 2b gives the more realistic estimate of the treatment's impact.
Question 3:
3a)
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
library(data.table)
library(tidyverse)
sd <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/sales_data.csv")
#3a
# Convert to long format
sd <- as.data.table(sd)
sd_long <- melt(
data = sd,
id.vars = c("store_id", "city"), # Columns to keep
measure.vars = patterns("^day_"), # Columns to pivot
variable.name = "time", # Name for time column
value.name = "sales" # Name for sales column
)
sd_long[, time := as.integer(gsub("day_", "", time))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
head(sd_long, 10)
    id       city time sales
1: 1 Manchester 1 2678
2: 2 Manchester 1 2605
3: 3 Manchester 1 2774
4: 4 Manchester 1 2550
5: 5 Manchester 1 2617
6: 6 Manchester 1 2896
7: 7 Manchester 1 2650
8: 8 Manchester 1 2668
9: 9 Manchester 1 2115
10: 10 Manchester 1 2691
3b)
manchester_avg <- sd_long[city == "Manchester",.(avg_sales = mean(sales)), by = time]
# Create the line plot
library(ggplot2)
ggplot(manchester_avg, aes(x = time, y = avg_sales)) +
geom_line(color = "blue", linewidth = 1) +
geom_vline(xintercept = 30.5, linetype = "dashed", color = "red", linewidth = 1) +
labs(
title = "Average Daily Sales in Manchester",
subtitle = "Vertical line marks the start of free shipping trial (Day 31)",
x = "Day",
y = "Average Daily Sales (£)"
) +
theme_bw() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11),
axis.title = element_text(face = "bold")
)
Based on the graph, there is clear preliminary evidence of a treatment effect.
Visible jump at the intervention point: there is a dramatic and immediate increase in average daily sales on day 31, the day free shipping begins.
Treatment effect: the sharp jump starting on day 31 suggests the intervention had a substantial positive impact on sales. The effect appears to be both immediate and persistent, and the magnitude of the change is large relative to normal day-to-day variation.
The pre-trial average was £2,658 and the post-trial average was £3,461, which is clear positive preliminary evidence of a treatment effect.
man_data <- sd_long[city == "Manchester"]
pre_trial_avg <- man_data[time <= 30, mean(sales)]
post_trial_avg <- man_data[time >= 31, mean(sales)]
q3b_dif <- post_trial_avg - pre_trial_avg
pct_incr <- (q3b_dif / pre_trial_avg) * 100
print(pre_trial_avg)
[1] 2657.648
print(post_trial_avg)
[1] 3461.465
print(q3b_dif)
[1] 803.8167
print(pct_incr)
[1] 30.24541
3c)
#q3c:
man_data3c <- sd_long[city =="Manchester"]
print(man_data3c[1:10])
    id       city time sales
1: 1 Manchester 1 2678
2: 2 Manchester 1 2605
3: 3 Manchester 1 2774
4: 4 Manchester 1 2550
5: 5 Manchester 1 2617
6: 6 Manchester 1 2896
7: 7 Manchester 1 2650
8: 8 Manchester 1 2668
9: 9 Manchester 1 2115
10: 10 Manchester 1 2691
man_data3c[, treatment := ifelse(time >= 31, 1, 0)]
print(table(man_data3c$treatment))
0 1
600 600
#regression: sales = b0 + b1treatment + e
q3c_reg <- lm(sales ~ treatment, data = man_data3c)
summary(q3c_reg)
Call:
lm(formula = sales ~ treatment, data = man_data3c)
Residuals:
Min 1Q Median 3Q Max
-903.46 -166.90 -0.65 177.79 899.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2657.65 10.61 250.55 <2e-16 ***
treatment 803.82 15.00 53.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 259.8 on 1198 degrees of freedom
Multiple R-squared: 0.7056, Adjusted R-squared: 0.7054
F-statistic: 2871 on 1 and 1198 DF, p-value: < 2.2e-16
coef(q3c_reg)
(Intercept)   treatment
2657.6483 803.8167
Name of research design: Pre-post treatment design
Interpretation:
\(\beta_1\) (treatment) = 803.82 indicates the free shipping offer is associated with an average increase of £803.82 in daily sales per store. The coefficient is statistically significant (p < 0.001), which is strong evidence that the free shipping offer is associated with higher sales.
The intercept \(\beta_0\) = 2657.65 represents the average daily sales (in £) per Manchester store during the pre-trial period (days 1-30).
\(R^2 = 0.706\): the model explains 70.6% of the variation in sales.
Relative to that pre-trial baseline, introducing free shipping corresponds to a 30.2% increase in daily sales during the trial period.
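The 30.2% figure follows directly from the regression coefficients (a minimal sketch, assuming q3c_reg from above):
# Percentage increase implied by the pre-post regression: treatment effect relative
# to the pre-trial baseline (the intercept).
b3c <- coef(q3c_reg)
unname(b3c["treatment"] / b3c["(Intercept)"] * 100)   # roughly 30.2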
3d)
#3d
library(data.table)
library(ggplot2)
# prep data
sd_long <- melt(
data = sd,
id.vars = c("store_id", "city"),
measure.vars = patterns("^day_"),
variable.name = "time",
value.name = "sales"
)
sd_long[, time := as.integer(gsub("day_", "", time))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
# Create treatment indicator
sd_long[, treatment := ifelse(time >= 31, 1, 0)]
# Issue 1: Pre-existing time trend in Manchester
man_pre <- sd_long[city == "Manchester" & treatment == 0]
#formula: sales = b0+b1*time+ e
trend_model <- lm(sales ~ time, data = man_pre)
summary(trend_model)
Call:
lm(formula = sales ~ time, data = man_pre)
Residuals:
Min 1Q Median 3Q Max
-729.93 -183.12 9.13 174.98 766.77
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2509.862 21.124 118.816 < 2e-16 ***
time 9.535 1.190 8.013 5.88e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 252.3 on 598 degrees of freedom
Multiple R-squared: 0.09696, Adjusted R-squared: 0.09545
F-statistic: 64.21 on 1 and 598 DF, p-value: 5.883e-15
# Issue 2: Both cities show increase
avg_by_city_time <- sd_long[, .(avg_sales = mean(sales)), by = .(city, time)]
print(avg_by_city_time[1:10])
          city time avg_sales
1: Manchester 1 2482.45
2: Birmingham 1 1932.45
3: Manchester 2 2470.95
4: Birmingham 2 1927.20
5: Manchester 3 2512.60
6: Birmingham 3 2031.30
7: Manchester 4 2561.25
8: Birmingham 4 1940.60
9: Manchester 5 2594.35
10: Birmingham 5 2018.35
# Plot both cities
ggplot(avg_by_city_time, aes(x = time, y = avg_sales, color = city)) +
geom_line(linewidth = 1) +
geom_vline(xintercept = 30.5, linetype = "dashed", color = "red", linewidth = 1) +
labs(
title = "Average Daily Sales by City Over Time",
subtitle = "Vertical line marks start of free shipping trial in Manchester (Day 31)",
x = "Day",
y = "Average Daily Sales (£)",
color = "City"
) +
theme_minimal() +
theme(legend.position = "bottom") +
  scale_x_continuous(breaks = seq(0, 60, by = 10))
ggsave("sales_trends_both_cities.png", width = 10, height = 6)
# Issue 3: Birmingham
birm_data <- sd_long[city == "Birmingham"]
birm_model <- lm(sales ~ treatment, data = birm_data)
summary(birm_model)
Call:
lm(formula = sales ~ treatment, data = birm_data)
Residuals:
Min 1Q Median 3Q Max
-936.85 -199.49 4.89 191.40 745.15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2152.37 10.92 197.19 <2e-16 ***
treatment 310.47 15.44 20.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 267.4 on 1198 degrees of freedom
Multiple R-squared: 0.2524, Adjusted R-squared: 0.2518
F-statistic: 404.5 on 1 and 1198 DF, p-value: < 2.2e-16
# Summary statistics
summary_stats <- sd_long[, .(
mean_sales = mean(sales),
sd_sales = sd(sales)
), by = .(city, treatment)]
print(summary_stats)
         city treatment mean_sales sd_sales
1: Manchester 0 2657.648 265.2481
2: Birmingham 0 2152.373 266.4162
3: Manchester 1 3461.465 254.2783
4: Birmingham 1 2462.847 268.3195
# Calculate changes in both cities
changes <- sd_long[, .(avg_sales = mean(sales)), by = .(city, treatment)]
changes_wide <- dcast(changes, city ~ treatment, value.var = "avg_sales")
setnames(changes_wide, c("0", "1"), c("pre_avg", "post_avg"))
changes_wide[, change := post_avg - pre_avg]
print(changes_wide)
         city  pre_avg post_avg   change
1: Birmingham 2152.373 2462.847 310.4733
2: Manchester 2657.648 3461.465 803.8167
Three issues with the pre-post treatment analysis are identified.
Issue 1: Pre-existing time trend. The regression above tests whether Manchester already had a time trend before the free shipping offer was initiated. The coefficient on time, \(\beta_1\) = 9.54, means daily sales in Manchester were already increasing by an average of £9.54 per day during the pre-treatment period. This trend is statistically significant (p < 0.001), so sales were rising even before the treatment.
Issue 2: No counterfactual without a control group. We cannot observe what Manchester's daily sales would have been during days 31-60 without free shipping. Without a comparison to Birmingham (the control group), the treatment effect cannot be separated from background trends: the plot shows both cities rising before the trial, and Birmingham also rises during the trial period despite receiving no treatment.
Issue 3: More fundamentally, a pre-post design cannot separate the effect of the treatment (free shipping) from other concurrent factors (seasonality, promotions, general market trends, ...) that may also influence sales.
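One rough way to gauge how much of the £803.82 jump the pre-existing trend alone could explain (a sketch assuming trend_model and sd_long from above) is to extrapolate the Manchester pre-period trend into days 31-60 and compare it with the actual post-period average:
# Extrapolate the Manchester pre-period trend into days 31-60 and compare with
# actual post-period sales (assumes trend_model and sd_long from above).
trend_pred  <- mean(predict(trend_model, newdata = data.frame(time = 31:60)))
actual_post <- sd_long[city == "Manchester" & time >= 31, mean(sales)]
actual_post - trend_pred   # the part of the jump not accounted for by the pre-trend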
3e and f)
library(data.table)
library(ggplot2)
sd <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/sales_data.csv")
setDT(sd)
sd_long <- melt(
sd,
id.vars = c("store_id", "city"),
measure.vars = patterns("^day_"),
variable.name = "time",
value.name = "sales"
)
sd_long[, time := as.integer(gsub("day_", "", time))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
sd_long[, treatment := ifelse(time >= 31, 1, 0)]
sd_long[, manchester := ifelse(city == "Manchester", 1, 0)]
pre_data <- sd_long[treatment == 0]
#formula: sales = b0 + b1*time + b2*manchester + b3(time*manchester) + error
parallel_test <- lm(sales ~ time + manchester + time:manchester, data = pre_data)
summary(parallel_test)
Call:
lm(formula = sales ~ time + manchester + time:manchester, data = pre_data)
Residuals:
Min 1Q Median 3Q Max
-729.93 -175.23 2.85 172.30 871.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1955.262 20.730 94.320 <2e-16 ***
time 12.717 1.168 10.891 <2e-16 ***
manchester 554.600 29.317 18.917 <2e-16 ***
time:manchester -3.182 1.651 -1.927 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 247.6 on 1196 degrees of freedom
Multiple R-squared: 0.5454, Adjusted R-squared: 0.5443
F-statistic: 478.3 on 3 and 1196 DF, p-value: < 2.2e-16
pre_avg <- sd_long[treatment == 0, .(avg_sales = mean(sales)), by = .(city, time)]
ggplot(pre_avg, aes(x = time, y = avg_sales, color = city)) +
  geom_line(linewidth = 1) +
geom_smooth(method = "lm", se = TRUE, alpha = 0.2) +
labs(
title = "Pre-Treatment Trends (Days 1-30)",
x = "Day",
y = "Average Daily Sales (£)",
color = "City"
) +
theme_minimal() +
theme(legend.position = "bottom") +
  scale_x_continuous(breaks = seq(0, 30, by = 5))
`geom_smooth()` using formula = 'y ~ x'
ggsave("parallel_trends_check.png", width = 10, height = 6)
`geom_smooth()` using formula = 'y ~ x'
avg_full <- sd_long[, .(avg_sales = mean(sales)), by = .(city, time)]
ggplot(avg_full, aes(x = time, y = avg_sales, color = city)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = 30.5, linetype = "dashed", color = "red", linewidth = 1) +
labs(
title = "Average Daily Sales: Manchester vs Birmingham",
x = "Day",
y = "Average Daily Sales (£)",
color = "City"
) +
theme_minimal() +
theme(legend.position = "bottom") +
  scale_x_continuous(breaks = seq(0, 60, by = 10))
ggsave("did_design_visualization.png", width = 10, height = 6)
did_summary <- sd_long[, .(avg_sales = mean(sales)), by = .(city, treatment)]
did_wide <- dcast(did_summary, city ~ treatment, value.var = "avg_sales")
setnames(did_wide, c("0", "1"), c("pre_avg", "post_avg"))
did_wide[, within_diff := post_avg - pre_avg]
print(did_wide)
         city  pre_avg post_avg within_diff
1: Birmingham 2152.373 2462.847 310.4733
2: Manchester 2657.648 3461.465 803.8167
manchester_change <- did_wide[city == "Manchester", within_diff]
birmingham_change <- did_wide[city == "Birmingham", within_diff]
did_estimate <- manchester_change - birmingham_change
print(did_estimate)
[1] 493.3433
Interpretation:
Parallel trends test: \(\beta_3 = -3.182\), p = 0.054. The time × Manchester interaction is not significant at the 5% level, so we cannot reject the hypothesis that Manchester and Birmingham had parallel sales trends during the pre-treatment period (days 1-30).
Pre-treatment trends plot: the plot visually confirms that Manchester and Birmingham followed roughly parallel trends before the trial.
Full period plot: in the post-treatment period, Manchester shows a sharp jump at day 31 and then stabilises at a higher level, while Birmingham (days 31-60) simply continues its gradual upward trend. Birmingham therefore provides the counterfactual: it shows what would have happened to Manchester without the treatment.
Proposed design to overcome the issues in 3d: Difference-in-Differences (DiD) with time fixed effects.
DiD estimates the causal effect of free shipping on sales by comparing the change in outcomes in Manchester (treatment) with the change in Birmingham (control), which addresses the confounding issues identified in 3d.
Formula: \(Sales_{it} = \beta_0 + \beta_1 Manchester_i + \delta_t + \beta_{DiD}(Manchester_i \times Trial_t) + \varepsilon_{it}\)
library(data.table)
library(ggplot2)
library(tidyverse)
# Load the data
sd <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/sales_data.csv")
# Convert to long
sd_long <- sd %>%
pivot_longer(
cols = starts_with("day_"),
names_to = "time_str",
values_to = "sales"
) %>%
# Convert to data.table structure for subsequent data.table operations
setDT()
# Convert time column to integer and rename columns
sd_long[, time := as.integer(gsub("day_", "", time_str))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
sd_long[, time_str := NULL]
# Create treatment indicator
sd_long[, treatment := ifelse(time >= 31, 1, 0)]
# Create DiD variables
# 1. City dummy (Group Fixed Effect)
sd_long[, manchester := ifelse(city == "Manchester", 1, 0)]
# 2. Interaction term (DiD estimator)
sd_long[, man_trial_interact := manchester * treatment]
# Execute the proposed DiD model with Day Fixed Effects (as.factor(time))
did_tfe_model <- lm(sales ~ manchester + as.factor(time) + man_trial_interact,
data = sd_long)
# Print the summary of the robust DiD regression
print(summary(did_tfe_model))
Call:
lm(formula = sales ~ manchester + as.factor(time) + man_trial_interact,
data = sd_long)
Residuals:
Min 1Q Median 3Q Max
-830.48 -165.76 -1.76 163.02 905.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1954.812 39.701 49.239 < 2e-16 ***
manchester 505.275 14.261 35.431 < 2e-16 ***
as.factor(time)2 -8.375 55.232 -0.152 0.879489
as.factor(time)3 64.500 55.232 1.168 0.243006
as.factor(time)4 43.475 55.232 0.787 0.431283
as.factor(time)5 98.900 55.232 1.791 0.073482 .
as.factor(time)6 105.625 55.232 1.912 0.055949 .
as.factor(time)7 93.725 55.232 1.697 0.089843 .
as.factor(time)8 92.300 55.232 1.671 0.094829 .
as.factor(time)9 175.250 55.232 3.173 0.001528 **
as.factor(time)10 184.900 55.232 3.348 0.000828 ***
as.factor(time)11 163.775 55.232 2.965 0.003055 **
as.factor(time)12 38.950 55.232 0.705 0.480752
as.factor(time)13 143.025 55.232 2.590 0.009670 **
as.factor(time)14 232.500 55.232 4.210 2.66e-05 ***
as.factor(time)15 228.350 55.232 4.134 3.68e-05 ***
as.factor(time)16 259.500 55.232 4.698 2.77e-06 ***
as.factor(time)17 218.475 55.232 3.956 7.86e-05 ***
as.factor(time)18 258.475 55.232 4.680 3.04e-06 ***
as.factor(time)19 260.375 55.232 4.714 2.57e-06 ***
as.factor(time)20 248.250 55.232 4.495 7.31e-06 ***
as.factor(time)21 253.875 55.232 4.597 4.52e-06 ***
as.factor(time)22 311.775 55.232 5.645 1.85e-08 ***
as.factor(time)23 304.275 55.232 5.509 4.00e-08 ***
as.factor(time)24 252.500 55.232 4.572 5.09e-06 ***
as.factor(time)25 346.325 55.232 6.270 4.28e-10 ***
as.factor(time)26 351.175 55.232 6.358 2.45e-10 ***
as.factor(time)27 319.250 55.232 5.780 8.46e-09 ***
as.factor(time)28 290.775 55.232 5.265 1.53e-07 ***
as.factor(time)29 313.825 55.232 5.682 1.50e-08 ***
as.factor(time)30 281.075 55.232 5.089 3.89e-07 ***
as.factor(time)31 332.728 56.145 5.926 3.56e-09 ***
as.factor(time)32 386.153 56.145 6.878 7.77e-12 ***
as.factor(time)33 382.278 56.145 6.809 1.25e-11 ***
as.factor(time)34 348.878 56.145 6.214 6.10e-10 ***
as.factor(time)35 342.103 56.145 6.093 1.29e-09 ***
as.factor(time)36 393.378 56.145 7.006 3.18e-12 ***
as.factor(time)37 464.928 56.145 8.281 < 2e-16 ***
as.factor(time)38 441.253 56.145 7.859 5.85e-15 ***
as.factor(time)39 408.853 56.145 7.282 4.47e-13 ***
as.factor(time)40 560.078 56.145 9.976 < 2e-16 ***
as.factor(time)41 456.653 56.145 8.133 6.70e-16 ***
as.factor(time)42 457.153 56.145 8.142 6.23e-16 ***
as.factor(time)43 521.253 56.145 9.284 < 2e-16 ***
as.factor(time)44 586.053 56.145 10.438 < 2e-16 ***
as.factor(time)45 536.878 56.145 9.562 < 2e-16 ***
as.factor(time)46 534.078 56.145 9.512 < 2e-16 ***
as.factor(time)47 559.128 56.145 9.959 < 2e-16 ***
as.factor(time)48 548.728 56.145 9.773 < 2e-16 ***
as.factor(time)49 435.053 56.145 7.749 1.37e-14 ***
as.factor(time)50 600.278 56.145 10.692 < 2e-16 ***
as.factor(time)51 579.378 56.145 10.319 < 2e-16 ***
as.factor(time)52 553.378 56.145 9.856 < 2e-16 ***
as.factor(time)53 568.753 56.145 10.130 < 2e-16 ***
as.factor(time)54 595.278 56.145 10.603 < 2e-16 ***
as.factor(time)55 610.128 56.145 10.867 < 2e-16 ***
as.factor(time)56 610.278 56.145 10.870 < 2e-16 ***
as.factor(time)57 661.453 56.145 11.781 < 2e-16 ***
as.factor(time)58 621.803 56.145 11.075 < 2e-16 ***
as.factor(time)59 520.353 56.145 9.268 < 2e-16 ***
as.factor(time)60 624.328 56.145 11.120 < 2e-16 ***
man_trial_interact 493.343 20.168 24.462 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 247 on 2338 degrees of freedom
Multiple R-squared: 0.8042, Adjusted R-squared: 0.7991
F-statistic: 157.4 on 61 and 2338 DF, p-value: < 2.2e-16
Coefficient Interpretations:
\(\beta_{DiD}\) (man_trial_interact) = 493.34, p < 0.001: controlling for city and day fixed effects, free shipping raised Manchester's average daily sales by about £493 per store relative to the Birmingham counterfactual. This matches the simple difference of within-city changes computed above (803.82 - 310.47 = 493.34).
manchester = 505.28, p < 0.001: Manchester stores sell on average about £505 more per day than Birmingham stores, independent of the trial.
The day dummies \(\delta_t\) (as.factor(time)) absorb shocks common to both cities on each day, which addresses the background time-trend issue identified in 3d.
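For comparison, the simple 2×2 DiD regression (a sketch assuming sd_long with the manchester and treatment dummies defined above) reproduces the same point estimate as the difference of within-city changes computed earlier:
# Classic 2x2 DiD: city dummy, post-period dummy and their interaction.
# With this balanced panel the interaction coefficient equals the 493.34 estimate.
did_simple <- lm(sales ~ manchester * treatment, data = sd_long)
coef(did_simple)["manchester:treatment"]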
Question 4:
4a)
#q4a
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
library(mlogit)
library(tidyverse)
camera_df <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/camera.csv")
head(camera_df)
  chid consumer alt canon sony nikon panasonic pixels zoom video swivel wifi
1 1 1 1 0 0 1 0 0 1 0 1 0
2 1 1 2 1 0 0 0 1 1 0 1 1
3 1 1 3 0 0 0 1 0 0 0 0 1
4 1 1 4 0 1 0 0 0 1 0 1 0
5 1 1 5 0 0 0 0 0 0 0 0 0
6 2 1 1 0 0 0 1 1 0 0 1 0
price choice
1 0.79 1
2 2.29 0
3 1.29 0
4 2.79 0
5 0.00 0
6 2.79 0
str(camera_df)
'data.frame': 26560 obs. of 14 variables:
$ chid : int 1 1 1 1 1 2 2 2 2 2 ...
$ consumer : int 1 1 1 1 1 1 1 1 1 1 ...
$ alt : int 1 2 3 4 5 1 2 3 4 5 ...
$ canon : int 0 1 0 0 0 0 1 0 0 0 ...
$ sony : int 0 0 0 1 0 0 0 1 0 0 ...
$ nikon : int 1 0 0 0 0 0 0 0 1 0 ...
$ panasonic: int 0 0 1 0 0 1 0 0 0 0 ...
$ pixels : int 0 1 0 0 0 1 1 1 1 0 ...
$ zoom : int 1 1 0 1 0 0 0 0 0 0 ...
$ video : int 0 0 0 0 0 0 1 1 1 0 ...
$ swivel : int 1 1 0 1 0 1 0 0 1 0 ...
$ wifi : int 0 1 1 0 0 0 1 1 1 0 ...
$ price : num 0.79 2.29 1.29 2.79 0 2.79 0.79 1.79 1.29 0 ...
$ choice : int 1 0 0 0 0 0 1 0 0 0 ...
# Convert to mlogit format
camera_mlogit <- dfidx(camera_df,
idx = list(c("chid", "consumer"), "alt"),
choice = "choice")
# Define attribute names
attribute_names <- c("canon", "sony", "nikon", "panasonic",
"pixels", "zoom", "video", "swivel", "wifi", "price")
# the multinomial logit model
logit_formula <- as.formula(paste("choice ~",
paste(attribute_names, collapse = " + "),
"+ 0"))
logit_model <- mlogit(logit_formula,
data = camera_mlogit,
method = "nr")
summary(logit_model)
Call:
mlogit(formula = choice ~ canon + sony + nikon + panasonic +
pixels + zoom + video + swivel + wifi + price + 0, data = camera_mlogit,
method = "nr")
Frequencies of alternatives:choice
1 2 3 4 5
0.20708 0.17620 0.19484 0.16905 0.25282
nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.0099
successive function values within tolerance limits
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
canon 0.465026 0.075967 6.1214 9.273e-10 ***
sony 0.238372 0.076693 3.1081 0.001883 **
nikon 0.311653 0.076591 4.0691 4.720e-05 ***
panasonic 0.022662 0.077852 0.2911 0.770986
pixels 0.758258 0.042194 17.9706 < 2.2e-16 ***
zoom 0.819351 0.041940 19.5363 < 2.2e-16 ***
video 0.627884 0.040647 15.4472 < 2.2e-16 ***
swivel 0.367104 0.040211 9.1294 < 2.2e-16 ***
wifi 0.577804 0.041658 13.8703 < 2.2e-16 ***
price -1.485549 0.032467 -45.7551 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -6503.7
4b)
# Question 4b: Price coefficient interpretation
price_coef <- coef(logit_model)["price"]
print(price_coef)
    price
-1.485549
The price coefficient is \(\beta_{price} = -1.4855\), where price is measured in units of $100.
Log Odds Interpretation:
In the multinomial logit model, the log odds ratio between any two alternatives j and k in the same choice set is:
\(\log\left[\Pr(Y_i=j\mid\beta)/\Pr(Y_i=k\mid\beta)\right]=\beta^{\top}(x_j-x_k)\)
When alternative j's price changes by \(\Delta p\) while alternative k is held constant:
\(\Delta \log\left[\Pr(Y_i=j)/\Pr(Y_i=k)\right]=\beta_{price}\,\Delta p\)
Therefore when \(\Delta p = 1\) (a $100 increase), the log odds of choosing alternative j versus k decrease by 1.4855:
\(\Delta \log\left[\Pr(Y_i=j)/\Pr(Y_i=k)\right]=-1.4855 \times 1 = -1.4855\)
Odds interpretation:
When alternative j's price increases by \(\Delta p = 1\) (i.e. $100):
\(\exp(\beta_{price}\,\Delta p) = \exp(-1.4855) = 0.2264\)
This means that for every $100 price increase, the odds of j being chosen (relative to any other alternative) are multiplied by 0.226, i.e. they fall by about 77.4%.
The negative price coefficient (\(\beta_{price} = -1.4855\)) demonstrates that consumers are less likely to choose alternatives with higher prices, all other factors being constant.
The large magnitude of the coefficient shows substantial price sensitivity in camera purchases: the odds of a camera being chosen over the alternatives fall by about 77% for every $100 price increase. This means:
Pricing strategy is crucial in the camera market
Small price changes can affect market share significantly
Consumers in this market are highly responsive to price differences when choosing between cameras with similar features.
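Note that the 77% figure applies to the odds, not directly to the choice probability: the implied probability change depends on the baseline share. A minimal sketch (assuming logit_model from 4a; the 30% baseline is an arbitrary illustration, not a value from the data):
# Converting the odds multiplier into a probability change for a hypothetical camera
# with a 30% baseline choice share (assumes logit_model from 4a).
odds_mult <- exp(coef(logit_model)["price"] * 1)   # ~0.226 for a $100 price increase
p0    <- 0.30                                      # hypothetical baseline choice probability
odds1 <- odds_mult * p0 / (1 - p0)                 # odds of being chosen after the increase
odds1 / (1 + odds1)                                # new choice probability, roughly 0.09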
4c)
# Question 4c:
# Extract coefficients
sony_coef <- coef(logit_model)["sony"]
print(sony_coef)
     sony
0.2383719
nikon_coef <- coef(logit_model)["nikon"]
print(nikon_coef)
    nikon
0.3116529
price_coef <- coef(logit_model)["price"]
# Calculate brand premium
brand_premium_sony_vs_nikon <- -(sony_coef - nikon_coef) / price_coef * 100
print(brand_premium_sony_vs_nikon)
     sony
-4.932926
The brand premium of Sony relative to Nikon is -$4.93. This indicates that:
Sony has a lower brand premium than Nikon - if all factors are constant, consumers prefer Nikon to Sony.
Monetary equivalence: a Sony camera would need to be priced about $4.93 lower than an otherwise identical Nikon camera for consumers to be indifferent between the two brands, i.e. for the two cameras to give consumers the same utility.
4d)
# Question 4d
# Extract coefficients
wifi_coef <- coef(logit_model)["wifi"]
price_coef <- coef(logit_model)["price"]
# Calculate valuation
wifi_valuation <- -wifi_coef / price_coef * 100
wifi_valuation
    wifi
38.89496
The estimated consumer valuation of WiFi connectivity is $38.89 (computed as \(-\beta_{wifi}/\beta_{price} \times 100\), since price is in $100 units). This means consumers are, on average, willing to pay about $38.89 more for a camera with WiFi, holding all other attributes constant.
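The same willingness-to-pay calculation extends to every attribute (a minimal sketch assuming logit_model from 4a; the brand values are relative to the unbranded baseline alternative):
# Willingness to pay for each attribute: -(attribute coefficient / price coefficient) * 100.
# Assumes logit_model from 4a; price is measured in units of $100.
b4 <- coef(logit_model)
wtp <- -b4[setdiff(names(b4), "price")] / b4["price"] * 100
round(wtp, 2)   # e.g. wifi comes out around 38.9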
4e)
# Question 4e:
# beta
get_probs <- function(X, beta) {
# Calculate utility for each alternative: X %*% beta
utilities <- X %*% beta
# Apply softmax trans
exp_utilities <- exp(utilities)
probs <- exp_utilities / sum(exp_utilities)
# Return as vector
return(as.vector(probs))
}
# Extract coefficients
beta <- coef(logit_model)
# Check the names
names(beta)
 [1] "canon"     "sony"      "nikon"     "panasonic" "pixels"    "zoom"
[7] "video" "swivel" "wifi" "price"
# choice task with chid 7
choice_task_7 <- camera_df %>%
filter(chid == 7) %>%
select(all_of(names(beta)))
# Convert to matrix
X_task_7 <- as.matrix(choice_task_7)
dim(X_task_7)
[1]  5 10
length(beta)
[1] 10
# Calculate choice probabilities
probs_task_7 <- get_probs(X_task_7, beta)
probs_task_7
[1] 0.50542669 0.07523878 0.04238305 0.08599906 0.29095242
# alternative labels for clarity
data.frame(
alternative = 1:5,
probability = round(probs_task_7, 4)
)
  alternative probability
1 1 0.5054
2 2 0.0752
3 3 0.0424
4 4 0.0860
5 5 0.2910
4f)
# Question 4f:
# Function to get Sony's profit
# price: Sony camera's price
# X: matrix characterising the choice task
# beta: vector of estimated coefficients
# mc: marginal cost of Sony's camera
get_profit <- function(price, X, beta, mc) {
# Update Sony's price in the choice task matrix
# Identify which row is Sony's camera
sony_col_index <- which(colnames(X) == "sony")
sony_row_index <- which(X[, sony_col_index] == 1)
# Update the price for Sony's camera
X_updated <- X
price_col_index <- which(colnames(X) == "price")
X_updated[sony_row_index, price_col_index] <- price
# Get choice probabilities with updated price
probs <- get_probs(X_updated, beta)
# Sony's market share= probability of choosing Sony
sony_prob <- probs[sony_row_index]
# Calculate profit: market_share * (price - marginal_cost)
# Convert price
profit <- sony_prob * (price * 100 - mc)
return(profit)
}
# Get the choice task with chid == 7
choice_task_7 <- camera_df %>%
filter(chid == 7) %>%
select(all_of(names(beta)))
X_task_7 <- as.matrix(choice_task_7)
# Set marginal cost
mc <- 70 # $70
# Get Sony's current price in choice task 7
sony_row <- which(X_task_7[, "sony"] == 1)
current_price <- X_task_7[sony_row, "price"]
# Calculate Sony's predicted profit at current price
sony_profit <- get_profit(current_price, X_task_7, beta, mc)
#results
cat("Sony's predicted profit in choice task 7:\n")Sony's predicted profit in choice task 7:
cat("Current price: $", current_price * 100, "\n")Current price: $ 229
cat("Marginal cost: $", mc, "\n")Marginal cost: $ 70
cat("Predicted profit: $", round(sony_profit, 2), "\n")Predicted profit: $ 13.67
sony_profit
   price
13.67385
4g)
# Question 4g:
# choice task with chid == 7
choice_task_7 <- camera_df %>%
filter(chid == 7) %>%
select(all_of(names(beta)))
X_task_7 <- as.matrix(choice_task_7)
# Set marginal cost
mc <- 70 # £70
price_range <- seq(0.50, 3.50, by = 0.01)
# Calculate profit
profits <- sapply(price_range, function(p) {
get_profit(p, X_task_7, beta, mc)
})
# Create a data frame for plotting
profit_data <- data.frame(
price = price_range * 100, # Convert to dollars for plotting
profit = profits
)
# optimal price
optimal_index <- which.max(profits)
optimal_price <- price_range[optimal_index]
optimal_profit <- profits[optimal_index]
# Plotting
library(ggplot2)
ggplot(profit_data, aes(x = price, y = profit)) +
geom_line(color = "blue", linewidth = 1) +
geom_vline(xintercept = optimal_price * 100,
linetype = "dashed", color = "red", linewidth = 0.8) +
geom_point(aes(x = optimal_price * 100, y = optimal_profit),
color = "red", size = 3) +
annotate("text",
x = optimal_price * 100 + 20,
y = optimal_profit,
label = paste0("Optimal: $", round(optimal_price * 100, 2),
"\nProfit: $", round(optimal_profit, 4)),
hjust = 0, color = "red", size = 4) +
labs(title = "Sony's Predicted Profit vs Price (Choice Task 7)",
subtitle = paste0("Marginal Cost = $", mc),
x = "Price ($)",
y = "Predicted Profit ($)") +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
        axis.title = element_text(face = "bold"))
# Print the optimal price and profit
cat("\n=== OPTIMAL PRICING RESULTS ===\n")
=== OPTIMAL PRICING RESULTS ===
cat("Optimal price for Sony: $", round(optimal_price * 100, 2), "\n")Optimal price for Sony: $ 156
cat("Expected profit at optimal price: $", round(optimal_profit, 4), "\n")Expected profit at optimal price: $ 18.7233
cat("\nCurrent price in dataset: $", X_task_7[which(X_task_7[, "sony"] == 1), "price"] * 100, "\n")
Current price in dataset: $ 229
cat("Profit at current price: $",
    round(get_profit(X_task_7[which(X_task_7[, "sony"] == 1), "price"], X_task_7, beta, mc), 4), "\n")
Profit at current price: $ 13.6739
optimal_price * 100 # Return in dollars
[1] 156
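Instead of a grid search, the same optimum can be found with R's built-in one-dimensional optimiser (a sketch assuming get_profit(), X_task_7, beta and mc from above):
# One-dimensional optimisation of the profit function over the same price range
# (assumes get_profit(), X_task_7, beta and mc defined above).
opt <- optimize(function(p) get_profit(p, X_task_7, beta, mc),
                interval = c(0.50, 3.50), maximum = TRUE)
opt$maximum * 100   # optimal price in dollars, close to the grid-search result
opt$objective       # expected profit at that price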