Question 1:
1a)
For \(I=X_1X_2X_3X_4=X_1\), this means:
\(X_1 = X_1 \cdot I = X_1 \cdot X_1X_2X_3X_4 = X_1^2 \cdot X_2X_3X_4 = X_2X_3X_4\) (since \(X_1^2 = I\))
So: \(X_1 = X_2X_3X_4 = I\)
1b)
For \(I=X_1=X_2X_3X_4\), we have:
\(X_1X_2X_3X_4 = I \cdot X_2X_3X_4 = X_2X_3X_4\)
Since \(X_2X_3X_4 = I\),
therefore \(X_1X_2X_3X_4=I\).
1c)
We have \(I=X_1X_2X_3X_4=X_3X_4\)
\(X_3 \cdot I = X_3 \cdot X_3X_4\)
\(X_3 = X_3^2 X_4 = I \cdot X_4 = X_4\)
Therefore \(X_3=X_4\), which means:
\(X_1X_2X_3X_3=I\)
\(X_1X_2I=I\)
\(X_1X_2=I\)
This shows that \(X_1\) and \(X_2\) are perfectly confounded.
Collinearity: \(X_1X_2=I\) and \(X_2^2=I\), so multiplying both sides by \(X_2\) gives \(X_1=X_2\); the two columns are identical, which means \(X_1\) and \(X_2\) are perfectly collinear.
Because \(X_1\) and \(X_2\) are perfectly confounded and collinear, the design suffers from perfect multicollinearity and the effects of \(X_1\) and \(X_2\) cannot be separated. Therefore, we cannot:
Uniquely identify \(\beta_1\) and \(\beta_2\).
We can only estimate the confounded effect \(\beta_1+\beta_2\).
Conclusion: It is not possible to estimate this regression because \(X_1\) and \(X_2\) are completely confounded in this fractional factorial design.
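To see why such a regression cannot be estimated, here is a minimal sketch (with simulated ±1 coded factors, not the coursework data) showing that lm() cannot identify two perfectly collinear regressors and reports NA for one of them:
# Minimal illustration with simulated +/-1 coded factors (not the coursework data).
set.seed(1)
x1 <- rep(c(-1, 1), each = 4)
x2 <- x1                          # identical to x1, i.e. perfectly collinear
y  <- 2 + 0.5 * x1 + rnorm(8, sd = 0.1)
summary(lm(y ~ x1 + x2))          # the x2 coefficient is NA ("not defined because of singularities")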
Question 2:
2a)
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
cd<- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/customer_data.csv")
#q2a
head(cd)
  id treated activation retention
1 1 1 1 1
2 2 1 1 1
3 3 1 0 1
4 4 1 1 1
5 5 1 1 1
6 6 1 1 1
str(cd)
'data.frame': 2000 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ treated : int 1 1 1 1 1 1 1 1 1 1 ...
$ activation: int 1 1 0 1 1 1 1 0 1 0 ...
$ retention : int 1 1 1 1 1 1 1 0 1 0 ...
#equation: retention = b0 + b1activation + error
reg2a = lm(retention ~ activation, data = cd)
summary(reg2a)
Call:
lm(formula = retention ~ activation, data = cd)
Residuals:
Min 1Q Median 3Q Max
-0.7197 -0.4296 0.2803 0.5704 0.5704
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.42964 0.01378 31.18 <2e-16 ***
activation 0.29001 0.02180 13.30 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4775 on 1998 degrees of freedom
Multiple R-squared: 0.08138, Adjusted R-squared: 0.08092
F-statistic: 177 on 1 and 1998 DF, p-value: < 2.2e-16
coef(reg2a)
(Intercept)  activation
0.4296420 0.2900076
Interpretation:
\(\beta_0\) (intercept) = 0.43, p < 0.001: among subscribers who did not activate the new feature, the retention rate is about 43%. This is the baseline retention rate for subscribers who did not use the one-to-one fitness assessment feature, and the estimate is statistically significant (p < 0.001).
\(\beta_1\) (activation) = 0.29, p < 0.001: subscribers who activated the feature had a retention rate 29 percentage points higher than those who did not. The estimate is statistically significant (p < 0.001), which is strong evidence that activation is associated with retention.
\(R^2 = 0.081\): the model accounts for 8.1% of the variation in retention.
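As a quick check (a minimal sketch, assuming the cd data frame loaded above), the coefficients of a regression on a single binary regressor equal the group means: the intercept is mean retention among non-activators and the slope is the activator minus non-activator difference.
# Quick check (assumes cd from above): with one binary regressor, the intercept is
# the mean outcome in the 0 group and the slope is the difference in group means.
mean(cd$retention[cd$activation == 0])                                            # should match the intercept (~0.43)
mean(cd$retention[cd$activation == 1]) - mean(cd$retention[cd$activation == 0])   # should match the slope (~0.29)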
2b)
#q2b
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
cd<- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/customer_data.csv")
#equation: retention = b0 + b1treated + error
reg2b = lm(retention ~ treated, data = cd)
summary(reg2b)
Call:
lm(formula = retention ~ treated, data = cd)
Residuals:
Min 1Q Median 3Q Max
-0.600 -0.491 0.400 0.400 0.509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.49100 0.01566 31.356 < 2e-16 ***
treated 0.10900 0.02215 4.922 9.27e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4952 on 1998 degrees of freedom
Multiple R-squared: 0.01198, Adjusted R-squared: 0.01149
F-statistic: 24.23 on 1 and 1998 DF, p-value: 9.266e-07
coef(reg2b)
(Intercept)     treated
0.491 0.109
Interpretation:
\(\beta_0 = 0.491\): among subscribers in the control group (those not assigned to the treatment), the retention rate is 49.1%. With p < 0.001, this is statistically significant.
\(\beta_1 = 0.109\): subscribers assigned to the treatment group had a retention rate 10.9 percentage points higher than the control group (p < 0.001, statistically significant). This is the intention-to-treat (ITT) effect: the effect of offering the new feature, regardless of whether subscribers actually activated it. Because it relies only on random assignment, the ITT estimate preserves randomisation and reflects realistic implementation, including non-compliance, so it is more realistic than the estimate in 2a.
\(R^2 = 0.012\): treatment assignment accounts for 1.2% of the variation in retention.
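The ITT estimate can also be recovered directly from the assignment-group means (a minimal sketch, assuming cd from above):
# Quick check (assumes cd from above): the ITT estimate is the difference in mean
# retention between assignment groups; t.test reports both group means.
t.test(retention ~ treated, data = cd)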
2c)
# 2c
cd<- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/customer_data.csv")
# retention rate analysis
# Control group
control_retention <- mean(cd$retention[cd$treated == 0])
control_n <- sum(cd$treated == 0)
# Treated group who activated
treated_activated_retention <- mean(cd$retention[cd$treated == 1 & cd$activation == 1])
treated_activated_n <- sum(cd$treated == 1 & cd$activation == 1)
# Treated group who did NOT activate (key evidence!)
treated_not_activated_retention <- mean(cd$retention[cd$treated == 1 & cd$activation == 0])
treated_not_activated_n <- sum(cd$treated == 1 & cd$activation == 0)
# Calculate the critical difference
diff <- treated_not_activated_retention - control_retention
# Statistical test
test_data <- cd[(cd$treated == 1 & cd$activation == 0) | cd$treated == 0, ]
t_test <- t.test(retention ~ treated, data = test_data)
cat("t-test comparing treated non-activators vs control:\n")t-test comparing treated non-activators vs control:
# Compliance rate
compliance_rate <- mean(cd$activation[cd$treated == 1])
# Create visualization data
viz_data <- data.frame(
Group = c("Control", "Treated - No Activation", "Treated - Activated"),
Retention = c(control_retention, treated_not_activated_retention, treated_activated_retention),
N = c(control_n, treated_not_activated_n, treated_activated_n)
)
print(viz_data)
                    Group Retention    N
1 Control 0.4910000 1000
2 Treated - No Activation 0.1243781 201
3 Treated - Activated 0.7196496 799
The two regressions imply different business recommendations.
Model 2a's \(\beta_1\) is the activation effect: activators had a retention rate 29 percentage points higher than non-activators. Model 2b's \(\beta_1\) is the intention-to-treat effect: subscribers assigned to the treatment group (regardless of whether they activated) had a retention rate 10.9 percentage points higher than the control group. The effect in model 2a is around 2.7 times larger than in model 2b.
Retention rate analysis: control group retention = 49.1%, treated and activated = 72.0%, treated but not activated = 12.4%. The t-test comparing treated non-activators with the control group is statistically significant (p < 0.001), indicating a real difference between the two groups.
The intention-to-treat analysis in 2b is preferable to 2a because model 2a suffers from severe self-selection bias. The fact that treated non-activators retain far less than the control group shows that the decision to activate is not random: it is strongly correlated with subscriber characteristics that also predict retention, so the randomisation of the experiment is not preserved in model 2a. Model 2a therefore largely reflects pre-existing differences between activators and non-activators rather than the causal effect of the feature, while model 2b gives the more realistic estimate of the treatment's impact.
Question 3:
3a)
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
library(data.table)
library(tidyverse)
sd <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/sales_data.csv")
#3a
# Convert to long format
sd <- as.data.table(sd)
sd_long <- melt(
data = sd,
id.vars = c("store_id", "city"), # Columns to keep
measure.vars = patterns("^day_"), # Columns to pivot
variable.name = "time", # Name for time column
value.name = "sales" # Name for sales column
)
sd_long[, time := as.integer(gsub("day_", "", time))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
head(sd_long, 10)
    id       city time sales
1: 1 Manchester 1 2678
2: 2 Manchester 1 2605
3: 3 Manchester 1 2774
4: 4 Manchester 1 2550
5: 5 Manchester 1 2617
6: 6 Manchester 1 2896
7: 7 Manchester 1 2650
8: 8 Manchester 1 2668
9: 9 Manchester 1 2115
10: 10 Manchester 1 2691
3b)
manchester_avg <- sd_long[city == "Manchester",.(avg_sales = mean(sales)), by = time]
# Create the line plot
library(ggplot2)
ggplot(manchester_avg, aes(x = time, y = avg_sales)) +
geom_line(color = "blue", linewidth = 1) +
geom_vline(xintercept = 30.5, linetype = "dashed", color = "red", linewidth = 1) +
labs(
title = "Average Daily Sales in Manchester",
subtitle = "Vertical line marks the start of free shipping trial (Day 31)",
x = "Day",
y = "Average Daily Sales (£)"
) +
theme_bw() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11),
axis.title = element_text(face = "bold")
)
Based on the graph, there is clear preliminary evidence of a treatment effect.
Visible jump at the intervention point: there is a dramatic and immediate increase in average daily sales on day 31, the day free shipping begins.
Treatment effect: the sharp jump starting on day 31 suggests the intervention had a substantial positive impact on sales. The effect appears to be both immediate and persistent, and the magnitude of the change is large relative to normal day-to-day variation.
The pre-trial average was £2,658 and the post-trial average was £3,461, which is clear positive preliminary evidence of a treatment effect.
man_data <- sd_long[city == "Manchester"]
pre_trial_avg <- man_data[time <= 30, mean(sales)]
post_trial_avg <- man_data[time >= 31, mean(sales)]
q3b_dif <- post_trial_avg - pre_trial_avg
pct_incr <- (q3b_dif / pre_trial_avg) * 100
print(pre_trial_avg)
[1] 2657.648
print(post_trial_avg)
[1] 3461.465
print(q3b_dif)
[1] 803.8167
print(pct_incr)
[1] 30.24541
3c)
#q3c:
man_data3c <- sd_long[city =="Manchester"]
print(man_data3c[1:10])
    id       city time sales
1: 1 Manchester 1 2678
2: 2 Manchester 1 2605
3: 3 Manchester 1 2774
4: 4 Manchester 1 2550
5: 5 Manchester 1 2617
6: 6 Manchester 1 2896
7: 7 Manchester 1 2650
8: 8 Manchester 1 2668
9: 9 Manchester 1 2115
10: 10 Manchester 1 2691
man_data3c[, treatment := ifelse(time >= 31, 1, 0)]
print(table(man_data3c$treatment))
0 1
600 600
#regression: sales = b0 + b1treatment + e
q3c_reg <- lm(sales ~ treatment, data = man_data3c)
summary(q3c_reg)
Call:
lm(formula = sales ~ treatment, data = man_data3c)
Residuals:
Min 1Q Median 3Q Max
-903.46 -166.90 -0.65 177.79 899.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2657.65 10.61 250.55 <2e-16 ***
treatment 803.82 15.00 53.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 259.8 on 1198 degrees of freedom
Multiple R-squared: 0.7056, Adjusted R-squared: 0.7054
F-statistic: 2871 on 1 and 1198 DF, p-value: < 2.2e-16
coef(q3c_reg)
(Intercept)   treatment
2657.6483 803.8167
Name of research design: Pre-post treatment design
Interpretation:
\(\beta_1\) (treatment) = 803.82 indicates the free shipping offer is associated with an average increase of £803.82 in daily sales per store. The coefficient is statistically significant (p < 0.001), which is strong evidence that the free shipping offer is associated with higher sales.
The intercept \(\beta_0\) = 2657.65 represents the average daily sales (in £) per Manchester store during the pre-trial period (days 1-30).
\(R^2 = 0.706\): the model explains 70.6% of the variation in sales.
Relative to that pre-trial baseline, introducing free shipping corresponds to a 30.2% increase in daily sales during the trial period.
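The 30.2% figure follows directly from the regression coefficients (a minimal sketch, assuming q3c_reg from above):
# Percentage increase implied by the pre-post regression: treatment effect relative
# to the pre-trial baseline (the intercept).
b3c <- coef(q3c_reg)
unname(b3c["treatment"] / b3c["(Intercept)"] * 100)   # roughly 30.2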
3d)
#3d
library(data.table)
library(ggplot2)
# prep data
sd_long <- melt(
data = sd,
id.vars = c("store_id", "city"),
measure.vars = patterns("^day_"),
variable.name = "time",
value.name = "sales"
)
sd_long[, time := as.integer(gsub("day_", "", time))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
# Create treatment indicator
sd_long[, treatment := ifelse(time >= 31, 1, 0)]
# Issue 1: Pre-existing time trend in Manchester
man_pre <- sd_long[city == "Manchester" & treatment == 0]
#formula: sales = b0+b1*time+ e
trend_model <- lm(sales ~ time, data = man_pre)
summary(trend_model)
Call:
lm(formula = sales ~ time, data = man_pre)
Residuals:
Min 1Q Median 3Q Max
-729.93 -183.12 9.13 174.98 766.77
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2509.862 21.124 118.816 < 2e-16 ***
time 9.535 1.190 8.013 5.88e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 252.3 on 598 degrees of freedom
Multiple R-squared: 0.09696, Adjusted R-squared: 0.09545
F-statistic: 64.21 on 1 and 598 DF, p-value: 5.883e-15
# Issue 2: Both cities show increase
avg_by_city_time <- sd_long[, .(avg_sales = mean(sales)), by = .(city, time)]
print(avg_by_city_time[1:10])
          city time avg_sales
1: Manchester 1 2482.45
2: Birmingham 1 1932.45
3: Manchester 2 2470.95
4: Birmingham 2 1927.20
5: Manchester 3 2512.60
6: Birmingham 3 2031.30
7: Manchester 4 2561.25
8: Birmingham 4 1940.60
9: Manchester 5 2594.35
10: Birmingham 5 2018.35
# Plot both cities
ggplot(avg_by_city_time, aes(x = time, y = avg_sales, color = city)) +
geom_line(linewidth = 1) +
geom_vline(xintercept = 30.5, linetype = "dashed", color = "red", linewidth = 1) +
labs(
title = "Average Daily Sales by City Over Time",
subtitle = "Vertical line marks start of free shipping trial in Manchester (Day 31)",
x = "Day",
y = "Average Daily Sales (£)",
color = "City"
) +
theme_minimal() +
theme(legend.position = "bottom") +
  scale_x_continuous(breaks = seq(0, 60, by = 10))
ggsave("sales_trends_both_cities.png", width = 10, height = 6)
# Issue 3: Birmingham
birm_data <- sd_long[city == "Birmingham"]
birm_model <- lm(sales ~ treatment, data = birm_data)
summary(birm_model)
Call:
lm(formula = sales ~ treatment, data = birm_data)
Residuals:
Min 1Q Median 3Q Max
-936.85 -199.49 4.89 191.40 745.15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2152.37 10.92 197.19 <2e-16 ***
treatment 310.47 15.44 20.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 267.4 on 1198 degrees of freedom
Multiple R-squared: 0.2524, Adjusted R-squared: 0.2518
F-statistic: 404.5 on 1 and 1198 DF, p-value: < 2.2e-16
# Summary statistics
summary_stats <- sd_long[, .(
mean_sales = mean(sales),
sd_sales = sd(sales)
), by = .(city, treatment)]
print(summary_stats)
         city treatment mean_sales sd_sales
1: Manchester 0 2657.648 265.2481
2: Birmingham 0 2152.373 266.4162
3: Manchester 1 3461.465 254.2783
4: Birmingham 1 2462.847 268.3195
# Calculate changes in both cities
changes <- sd_long[, .(avg_sales = mean(sales)), by = .(city, treatment)]
changes_wide <- dcast(changes, city ~ treatment, value.var = "avg_sales")
setnames(changes_wide, c("0", "1"), c("pre_avg", "post_avg"))
changes_wide[, change := post_avg - pre_avg]
print(changes_wide)
         city  pre_avg post_avg   change
1: Birmingham 2152.373 2462.847 310.4733
2: Manchester 2657.648 3461.465 803.8167
Three issues with the pre-post treatment analysis are identified.
Issue 1: Pre-existing time trend. The regression above tests whether Manchester already had a time trend before the free shipping offer was initiated. The coefficient on time, \(\beta_1\) = 9.54, means daily sales in Manchester were already increasing by an average of £9.54 per day during the pre-treatment period. This trend is statistically significant (p < 0.001), so sales were rising even before the treatment.
Issue 2: No counterfactual without a control group. We cannot observe what Manchester's daily sales would have been during days 31-60 without free shipping. Without a comparison to Birmingham (the control group), the treatment effect cannot be separated from background trends: the plot shows both cities rising before the trial, and Birmingham also rises during the trial period despite receiving no treatment.
Issue 3: More fundamentally, a pre-post design cannot separate the effect of the treatment (free shipping) from other concurrent factors (seasonality, promotions, general market trends, ...) that may also influence sales.
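One rough way to gauge how much of the £803.82 jump the pre-existing trend alone could explain (a sketch assuming trend_model and sd_long from above) is to extrapolate the Manchester pre-period trend into days 31-60 and compare it with the actual post-period average:
# Extrapolate the Manchester pre-period trend into days 31-60 and compare with
# actual post-period sales (assumes trend_model and sd_long from above).
trend_pred  <- mean(predict(trend_model, newdata = data.frame(time = 31:60)))
actual_post <- sd_long[city == "Manchester" & time >= 31, mean(sales)]
actual_post - trend_pred   # the part of the jump not accounted for by the pre-trend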
3e and f)
library(data.table)
library(ggplot2)
sd <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/sales_data.csv")
setDT(sd)
sd_long <- melt(
sd,
id.vars = c("store_id", "city"),
measure.vars = patterns("^day_"),
variable.name = "time",
value.name = "sales"
)
sd_long[, time := as.integer(gsub("day_", "", time))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
sd_long[, treatment := ifelse(time >= 31, 1, 0)]
sd_long[, manchester := ifelse(city == "Manchester", 1, 0)]
pre_data <- sd_long[treatment == 0]
#formula: sales = b0 + b1*time + b2*manchester + b3(time*manchester) + error
parallel_test <- lm(sales ~ time + manchester + time:manchester, data = pre_data)
summary(parallel_test)
Call:
lm(formula = sales ~ time + manchester + time:manchester, data = pre_data)
Residuals:
Min 1Q Median 3Q Max
-729.93 -175.23 2.85 172.30 871.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1955.262 20.730 94.320 <2e-16 ***
time 12.717 1.168 10.891 <2e-16 ***
manchester 554.600 29.317 18.917 <2e-16 ***
time:manchester -3.182 1.651 -1.927 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 247.6 on 1196 degrees of freedom
Multiple R-squared: 0.5454, Adjusted R-squared: 0.5443
F-statistic: 478.3 on 3 and 1196 DF, p-value: < 2.2e-16
pre_avg <- sd_long[treatment == 0, .(avg_sales = mean(sales)), by = .(city, time)]
ggplot(pre_avg, aes(x = time, y = avg_sales, color = city)) +
  geom_line(linewidth = 1) +
geom_smooth(method = "lm", se = TRUE, alpha = 0.2) +
labs(
title = "Pre-Treatment Trends (Days 1-30)",
x = "Day",
y = "Average Daily Sales (£)",
color = "City"
) +
theme_minimal() +
theme(legend.position = "bottom") +
  scale_x_continuous(breaks = seq(0, 30, by = 5))
`geom_smooth()` using formula = 'y ~ x'
ggsave("parallel_trends_check.png", width = 10, height = 6)
`geom_smooth()` using formula = 'y ~ x'
avg_full <- sd_long[, .(avg_sales = mean(sales)), by = .(city, time)]
ggplot(avg_full, aes(x = time, y = avg_sales, color = city)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = 30.5, linetype = "dashed", color = "red", linewidth = 1) +
labs(
title = "Average Daily Sales: Manchester vs Birmingham",
x = "Day",
y = "Average Daily Sales (£)",
color = "City"
) +
theme_minimal() +
theme(legend.position = "bottom") +
  scale_x_continuous(breaks = seq(0, 60, by = 10))
ggsave("did_design_visualization.png", width = 10, height = 6)
did_summary <- sd_long[, .(avg_sales = mean(sales)), by = .(city, treatment)]
did_wide <- dcast(did_summary, city ~ treatment, value.var = "avg_sales")
setnames(did_wide, c("0", "1"), c("pre_avg", "post_avg"))
did_wide[, within_diff := post_avg - pre_avg]
print(did_wide)
         city  pre_avg post_avg within_diff
1: Birmingham 2152.373 2462.847 310.4733
2: Manchester 2657.648 3461.465 803.8167
manchester_change <- did_wide[city == "Manchester", within_diff]
birmingham_change <- did_wide[city == "Birmingham", within_diff]
did_estimate <- manchester_change - birmingham_change
print(did_estimate)
[1] 493.3433
Interpretation:
Parallel trends test: \(\beta_3 = -3.182\), p = 0.054. The time × Manchester interaction is not significant at the 5% level, so we cannot reject the hypothesis that Manchester and Birmingham had parallel sales trends during the pre-treatment period (days 1-30).
Pre-treatment trends plot: the plot visually confirms that Manchester and Birmingham followed roughly parallel trends before the trial.
Full period plot: in the post-treatment period, Manchester shows a sharp jump at day 31 and then stabilises at a higher level, while Birmingham (days 31-60) simply continues its gradual upward trend. Birmingham therefore provides the counterfactual: it shows what would have happened to Manchester without the treatment.
Proposed design to overcome the issues in 3d: Difference-in-Differences (DiD) with time fixed effects.
DiD estimates the causal effect of free shipping on sales by comparing the change in outcomes in Manchester (treatment) with the change in Birmingham (control), which addresses the confounding issues identified in 3d.
Formula: \(Sales_{it} = \beta_0 + \beta_1 Manchester_i + \delta_t + \beta_{DiD}(Manchester_i \times Trial_t) + \varepsilon_{it}\)
library(data.table)
library(ggplot2)
library(tidyverse)
# Load the data
sd <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/sales_data.csv")
# Convert to long
sd_long <- sd %>%
pivot_longer(
cols = starts_with("day_"),
names_to = "time_str",
values_to = "sales"
) %>%
# Convert to data.table structure for subsequent data.table operations
setDT()
# Convert time column to integer and rename columns
sd_long[, time := as.integer(gsub("day_", "", time_str))]
setnames(sd_long, "store_id", "id")
setcolorder(sd_long, c("id", "city", "time", "sales"))
sd_long[, time_str := NULL]
# Create treatment indicator
sd_long[, treatment := ifelse(time >= 31, 1, 0)]
# Create DiD variables
# 1. City dummy (Group Fixed Effect)
sd_long[, manchester := ifelse(city == "Manchester", 1, 0)]
# 2. Interaction term (DiD estimator)
sd_long[, man_trial_interact := manchester * treatment]
# Execute the proposed DiD model with Day Fixed Effects (as.factor(time))
did_tfe_model <- lm(sales ~ manchester + as.factor(time) + man_trial_interact,
data = sd_long)
# Print the summary of the robust DiD regression
print(summary(did_tfe_model))
Call:
lm(formula = sales ~ manchester + as.factor(time) + man_trial_interact,
data = sd_long)
Residuals:
Min 1Q Median 3Q Max
-830.48 -165.76 -1.76 163.02 905.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1954.812 39.701 49.239 < 2e-16 ***
manchester 505.275 14.261 35.431 < 2e-16 ***
as.factor(time)2 -8.375 55.232 -0.152 0.879489
as.factor(time)3 64.500 55.232 1.168 0.243006
as.factor(time)4 43.475 55.232 0.787 0.431283
as.factor(time)5 98.900 55.232 1.791 0.073482 .
as.factor(time)6 105.625 55.232 1.912 0.055949 .
as.factor(time)7 93.725 55.232 1.697 0.089843 .
as.factor(time)8 92.300 55.232 1.671 0.094829 .
as.factor(time)9 175.250 55.232 3.173 0.001528 **
as.factor(time)10 184.900 55.232 3.348 0.000828 ***
as.factor(time)11 163.775 55.232 2.965 0.003055 **
as.factor(time)12 38.950 55.232 0.705 0.480752
as.factor(time)13 143.025 55.232 2.590 0.009670 **
as.factor(time)14 232.500 55.232 4.210 2.66e-05 ***
as.factor(time)15 228.350 55.232 4.134 3.68e-05 ***
as.factor(time)16 259.500 55.232 4.698 2.77e-06 ***
as.factor(time)17 218.475 55.232 3.956 7.86e-05 ***
as.factor(time)18 258.475 55.232 4.680 3.04e-06 ***
as.factor(time)19 260.375 55.232 4.714 2.57e-06 ***
as.factor(time)20 248.250 55.232 4.495 7.31e-06 ***
as.factor(time)21 253.875 55.232 4.597 4.52e-06 ***
as.factor(time)22 311.775 55.232 5.645 1.85e-08 ***
as.factor(time)23 304.275 55.232 5.509 4.00e-08 ***
as.factor(time)24 252.500 55.232 4.572 5.09e-06 ***
as.factor(time)25 346.325 55.232 6.270 4.28e-10 ***
as.factor(time)26 351.175 55.232 6.358 2.45e-10 ***
as.factor(time)27 319.250 55.232 5.780 8.46e-09 ***
as.factor(time)28 290.775 55.232 5.265 1.53e-07 ***
as.factor(time)29 313.825 55.232 5.682 1.50e-08 ***
as.factor(time)30 281.075 55.232 5.089 3.89e-07 ***
as.factor(time)31 332.728 56.145 5.926 3.56e-09 ***
as.factor(time)32 386.153 56.145 6.878 7.77e-12 ***
as.factor(time)33 382.278 56.145 6.809 1.25e-11 ***
as.factor(time)34 348.878 56.145 6.214 6.10e-10 ***
as.factor(time)35 342.103 56.145 6.093 1.29e-09 ***
as.factor(time)36 393.378 56.145 7.006 3.18e-12 ***
as.factor(time)37 464.928 56.145 8.281 < 2e-16 ***
as.factor(time)38 441.253 56.145 7.859 5.85e-15 ***
as.factor(time)39 408.853 56.145 7.282 4.47e-13 ***
as.factor(time)40 560.078 56.145 9.976 < 2e-16 ***
as.factor(time)41 456.653 56.145 8.133 6.70e-16 ***
as.factor(time)42 457.153 56.145 8.142 6.23e-16 ***
as.factor(time)43 521.253 56.145 9.284 < 2e-16 ***
as.factor(time)44 586.053 56.145 10.438 < 2e-16 ***
as.factor(time)45 536.878 56.145 9.562 < 2e-16 ***
as.factor(time)46 534.078 56.145 9.512 < 2e-16 ***
as.factor(time)47 559.128 56.145 9.959 < 2e-16 ***
as.factor(time)48 548.728 56.145 9.773 < 2e-16 ***
as.factor(time)49 435.053 56.145 7.749 1.37e-14 ***
as.factor(time)50 600.278 56.145 10.692 < 2e-16 ***
as.factor(time)51 579.378 56.145 10.319 < 2e-16 ***
as.factor(time)52 553.378 56.145 9.856 < 2e-16 ***
as.factor(time)53 568.753 56.145 10.130 < 2e-16 ***
as.factor(time)54 595.278 56.145 10.603 < 2e-16 ***
as.factor(time)55 610.128 56.145 10.867 < 2e-16 ***
as.factor(time)56 610.278 56.145 10.870 < 2e-16 ***
as.factor(time)57 661.453 56.145 11.781 < 2e-16 ***
as.factor(time)58 621.803 56.145 11.075 < 2e-16 ***
as.factor(time)59 520.353 56.145 9.268 < 2e-16 ***
as.factor(time)60 624.328 56.145 11.120 < 2e-16 ***
man_trial_interact 493.343 20.168 24.462 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 247 on 2338 degrees of freedom
Multiple R-squared: 0.8042, Adjusted R-squared: 0.7991
F-statistic: 157.4 on 61 and 2338 DF, p-value: < 2.2e-16
Coefficient Interpretations:
\(\beta_{DiD}\) (man_trial_interact) = 493.34, p < 0.001: controlling for city and day fixed effects, free shipping raised Manchester's average daily sales by about £493 per store relative to the Birmingham counterfactual. This matches the simple difference of within-city changes computed above (803.82 - 310.47 = 493.34).
manchester = 505.28, p < 0.001: Manchester stores sell on average about £505 more per day than Birmingham stores, independent of the trial.
The day dummies \(\delta_t\) (as.factor(time)) absorb shocks common to both cities on each day, which addresses the background time-trend issue identified in 3d.
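For comparison, the simple 2×2 DiD regression (a sketch assuming sd_long with the manchester and treatment dummies defined above) reproduces the same point estimate as the difference of within-city changes computed earlier:
# Classic 2x2 DiD: city dummy, post-period dummy and their interaction.
# With this balanced panel the interaction coefficient equals the 493.34 estimate.
did_simple <- lm(sales ~ manchester * treatment, data = sd_long)
coef(did_simple)["manchester:treatment"]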
Question 4:
4a)
#q4a
library(tidyverse)
library(knitr)
library(pwrss)
library(readr)
library(mlogit)
library(tidyverse)
camera_df <- read.csv("/Users/linhkieunguyen/Library/Mobile Documents/com~apple~CloudDocs/msin0041/cw1 data/camera.csv")
head(camera_df)
  chid consumer alt canon sony nikon panasonic pixels zoom video swivel wifi
1 1 1 1 0 0 1 0 0 1 0 1 0
2 1 1 2 1 0 0 0 1 1 0 1 1
3 1 1 3 0 0 0 1 0 0 0 0 1
4 1 1 4 0 1 0 0 0 1 0 1 0
5 1 1 5 0 0 0 0 0 0 0 0 0
6 2 1 1 0 0 0 1 1 0 0 1 0
price choice
1 0.79 1
2 2.29 0
3 1.29 0
4 2.79 0
5 0.00 0
6 2.79 0
str(camera_df)
'data.frame': 26560 obs. of 14 variables:
$ chid : int 1 1 1 1 1 2 2 2 2 2 ...
$ consumer : int 1 1 1 1 1 1 1 1 1 1 ...
$ alt : int 1 2 3 4 5 1 2 3 4 5 ...
$ canon : int 0 1 0 0 0 0 1 0 0 0 ...
$ sony : int 0 0 0 1 0 0 0 1 0 0 ...
$ nikon : int 1 0 0 0 0 0 0 0 1 0 ...
$ panasonic: int 0 0 1 0 0 1 0 0 0 0 ...
$ pixels : int 0 1 0 0 0 1 1 1 1 0 ...
$ zoom : int 1 1 0 1 0 0 0 0 0 0 ...
$ video : int 0 0 0 0 0 0 1 1 1 0 ...
$ swivel : int 1 1 0 1 0 1 0 0 1 0 ...
$ wifi : int 0 1 1 0 0 0 1 1 1 0 ...
$ price : num 0.79 2.29 1.29 2.79 0 2.79 0.79 1.79 1.29 0 ...
$ choice : int 1 0 0 0 0 0 1 0 0 0 ...
# Convert to mlogit format
camera_mlogit <- dfidx(camera_df,
idx = list(c("chid", "consumer"), "alt"),
choice = "choice")
# Define attribute names
attribute_names <- c("canon", "sony", "nikon", "panasonic",
"pixels", "zoom", "video", "swivel", "wifi", "price")
# the multinomial logit model
logit_formula <- as.formula(paste("choice ~",
paste(attribute_names, collapse = " + "),
"+ 0"))
logit_model <- mlogit(logit_formula,
data = camera_mlogit,
method = "nr")
summary(logit_model)
Call:
mlogit(formula = choice ~ canon + sony + nikon + panasonic +
pixels + zoom + video + swivel + wifi + price + 0, data = camera_mlogit,
method = "nr")
Frequencies of alternatives:choice
1 2 3 4 5
0.20708 0.17620 0.19484 0.16905 0.25282
nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.0099
successive function values within tolerance limits
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
canon 0.465026 0.075967 6.1214 9.273e-10 ***
sony 0.238372 0.076693 3.1081 0.001883 **
nikon 0.311653 0.076591 4.0691 4.720e-05 ***
panasonic 0.022662 0.077852 0.2911 0.770986
pixels 0.758258 0.042194 17.9706 < 2.2e-16 ***
zoom 0.819351 0.041940 19.5363 < 2.2e-16 ***
video 0.627884 0.040647 15.4472 < 2.2e-16 ***
swivel 0.367104 0.040211 9.1294 < 2.2e-16 ***
wifi 0.577804 0.041658 13.8703 < 2.2e-16 ***
price -1.485549 0.032467 -45.7551 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -6503.7
4b)
# Question 4b: Price coefficient interpretation
price_coef <- coef(logit_model)["price"]
print(price_coef)
    price
-1.485549
The price coefficient is \(\beta_{price} = -1.4855\), where price is measured in units of $100.
Log Odds Interpretation:
In the multinomial logit model, the log odds ratio between any two alternatives j and k in the same choice set is:
\(\log\left[\Pr(Y_i=j\mid\beta)/\Pr(Y_i=k\mid\beta)\right]=\beta^{\top}(x_j-x_k)\)
When alternative j's price changes by \(\Delta p\) while alternative k is held constant:
\(\Delta \log\left[\Pr(Y_i=j)/\Pr(Y_i=k)\right]=\beta_{price}\,\Delta p\)
Therefore when \(\Delta p = 1\) (a $100 increase), the log odds of choosing alternative j versus k decrease by 1.4855:
\(\Delta \log\left[\Pr(Y_i=j)/\Pr(Y_i=k)\right]=-1.4855 \times 1 = -1.4855\)
Odds interpretation:
When alternative j's price increases by \(\Delta p = 1\) (i.e. $100):
\(\exp(\beta_{price}\,\Delta p) = \exp(-1.4855) = 0.2264\)
This means that for every $100 price increase, the odds of j being chosen (relative to any other alternative) are multiplied by 0.226, i.e. they fall by about 77.4%.
The negative price coefficient (\(\beta_{price} = -1.4855\)) demonstrates that consumers are less likely to choose alternatives with higher prices, all other factors being constant.
The large magnitude of the coefficient shows substantial price sensitivity in camera purchases: the odds of a camera being chosen over the alternatives fall by about 77% for every $100 price increase. This means:
Pricing strategy is crucial in the camera market
Small price changes can affect market share significantly
Consumers in this market are highly responsive to price differences when choosing between cameras with similar features.
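Note that the 77% figure applies to the odds, not directly to the choice probability: the implied probability change depends on the baseline share. A minimal sketch (assuming logit_model from 4a; the 30% baseline is an arbitrary illustration, not a value from the data):
# Converting the odds multiplier into a probability change for a hypothetical camera
# with a 30% baseline choice share (assumes logit_model from 4a).
odds_mult <- exp(coef(logit_model)["price"] * 1)   # ~0.226 for a $100 price increase
p0    <- 0.30                                      # hypothetical baseline choice probability
odds1 <- odds_mult * p0 / (1 - p0)                 # odds of being chosen after the increase
odds1 / (1 + odds1)                                # new choice probability, roughly 0.09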
4c)
# Question 4c:
# Extract coefficients
sony_coef <- coef(logit_model)["sony"]
print(sony_coef)
     sony
0.2383719
nikon_coef <- coef(logit_model)["nikon"]
print(nikon_coef)
    nikon
0.3116529
price_coef <- coef(logit_model)["price"]
# Calculate brand premium
brand_premium_sony_vs_nikon <- -(sony_coef - nikon_coef) / price_coef * 100
print(brand_premium_sony_vs_nikon)
     sony
-4.932926
The brand premium of Sony relative to Nikon is -$4.93. This indicates that:
Sony has a lower brand premium than Nikon - if all factors are constant, consumers prefer Nikon to Sony.
Monetary equivalence: a Sony camera would need to be priced about $4.93 lower than an otherwise identical Nikon camera for consumers to be indifferent between the two brands, i.e. for the two cameras to give consumers the same utility.
4d)
# Question 4d
# Extract coefficients
wifi_coef <- coef(logit_model)["wifi"]
price_coef <- coef(logit_model)["price"]
# Calculate valuation
wifi_valuation <- -wifi_coef / price_coef * 100
wifi_valuation
    wifi
38.89496
The estimated consumer valuation of WiFi connectivity is $38.89 (computed as \(-\beta_{wifi}/\beta_{price} \times 100\), since price is in $100 units). This means consumers are, on average, willing to pay about $38.89 more for a camera with WiFi, holding all other attributes constant.
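The same willingness-to-pay calculation extends to every attribute (a minimal sketch assuming logit_model from 4a; the brand values are relative to the unbranded baseline alternative):
# Willingness to pay for each attribute: -(attribute coefficient / price coefficient) * 100.
# Assumes logit_model from 4a; price is measured in units of $100.
b4 <- coef(logit_model)
wtp <- -b4[setdiff(names(b4), "price")] / b4["price"] * 100
round(wtp, 2)   # e.g. wifi comes out around 38.9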
4e)
# Question 4e:
# beta
get_probs <- function(X, beta) {
# Calculate utility for each alternative: X %*% beta
utilities <- X %*% beta
# Apply softmax trans
exp_utilities <- exp(utilities)
probs <- exp_utilities / sum(exp_utilities)
# Return as vector
return(as.vector(probs))
}
# Extract coefficients
beta <- coef(logit_model)
# Check the names
names(beta)
 [1] "canon"     "sony"      "nikon"     "panasonic" "pixels"    "zoom"
[7] "video" "swivel" "wifi" "price"
# choice task with chid 7
choice_task_7 <- camera_df %>%
filter(chid == 7) %>%
select(all_of(names(beta)))
# Convert to matrix
X_task_7 <- as.matrix(choice_task_7)
dim(X_task_7)
[1]  5 10
length(beta)
[1] 10
# Calculate choice probabilities
probs_task_7 <- get_probs(X_task_7, beta)
probs_task_7
[1] 0.50542669 0.07523878 0.04238305 0.08599906 0.29095242
# alternative labels for clarity
data.frame(
alternative = 1:5,
probability = round(probs_task_7, 4)
)
  alternative probability
1 1 0.5054
2 2 0.0752
3 3 0.0424
4 4 0.0860
5 5 0.2910
4f)
# Question 4f:
# Function to get Sony's profit
# price: Sony camera's price
# X: matrix characterising the choice task
# beta: vector of estimated coefficients
# mc: marginal cost of Sony's camera
get_profit <- function(price, X, beta, mc) {
# Update Sony's price in the choice task matrix
# Identify which row is Sony's camera
sony_col_index <- which(colnames(X) == "sony")
sony_row_index <- which(X[, sony_col_index] == 1)
# Update the price for Sony's camera
X_updated <- X
price_col_index <- which(colnames(X) == "price")
X_updated[sony_row_index, price_col_index] <- price
# Get choice probabilities with updated price
probs <- get_probs(X_updated, beta)
# Sony's market share= probability of choosing Sony
sony_prob <- probs[sony_row_index]
# Calculate profit: market_share * (price - marginal_cost)
# Convert price
profit <- sony_prob * (price * 100 - mc)
return(profit)
}
# Get the choice task with chid == 7
choice_task_7 <- camera_df %>%
filter(chid == 7) %>%
select(all_of(names(beta)))
X_task_7 <- as.matrix(choice_task_7)
# Set marginal cost
mc <- 70 # $70
# Get Sony's current price in choice task 7
sony_row <- which(X_task_7[, "sony"] == 1)
current_price <- X_task_7[sony_row, "price"]
# Calculate Sony's predicted profit at current price
sony_profit <- get_profit(current_price, X_task_7, beta, mc)
#results
cat("Sony's predicted profit in choice task 7:\n")Sony's predicted profit in choice task 7:
cat("Current price: $", current_price * 100, "\n")Current price: $ 229
cat("Marginal cost: $", mc, "\n")Marginal cost: $ 70
cat("Predicted profit: $", round(sony_profit, 2), "\n")Predicted profit: $ 13.67
sony_profit
   price
13.67385
4g)
# Question 4g:
# choice task with chid == 7
choice_task_7 <- camera_df %>%
filter(chid == 7) %>%
select(all_of(names(beta)))
X_task_7 <- as.matrix(choice_task_7)
# Set marginal cost
mc <- 70 # £70
price_range <- seq(0.50, 3.50, by = 0.01)
# Calculate profit
profits <- sapply(price_range, function(p) {
get_profit(p, X_task_7, beta, mc)
})
# Create a data frame for plotting
profit_data <- data.frame(
price = price_range * 100, # Convert to dollars for plotting
profit = profits
)
# optimal price
optimal_index <- which.max(profits)
optimal_price <- price_range[optimal_index]
optimal_profit <- profits[optimal_index]
# Plotting
library(ggplot2)
ggplot(profit_data, aes(x = price, y = profit)) +
geom_line(color = "blue", linewidth = 1) +
geom_vline(xintercept = optimal_price * 100,
linetype = "dashed", color = "red", linewidth = 0.8) +
geom_point(aes(x = optimal_price * 100, y = optimal_profit),
color = "red", size = 3) +
annotate("text",
x = optimal_price * 100 + 20,
y = optimal_profit,
label = paste0("Optimal: $", round(optimal_price * 100, 2),
"\nProfit: $", round(optimal_profit, 4)),
hjust = 0, color = "red", size = 4) +
labs(title = "Sony's Predicted Profit vs Price (Choice Task 7)",
subtitle = paste0("Marginal Cost = $", mc),
x = "Price ($)",
y = "Predicted Profit ($)") +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
        axis.title = element_text(face = "bold"))
# Print the optimal price and profit
cat("\n=== OPTIMAL PRICING RESULTS ===\n")
=== OPTIMAL PRICING RESULTS ===
cat("Optimal price for Sony: $", round(optimal_price * 100, 2), "\n")Optimal price for Sony: $ 156
cat("Expected profit at optimal price: $", round(optimal_profit, 4), "\n")Expected profit at optimal price: $ 18.7233
cat("\nCurrent price in dataset: $", X_task_7[which(X_task_7[, "sony"] == 1), "price"] * 100, "\n")
Current price in dataset: $ 229
cat("Profit at current price: $",
    round(get_profit(X_task_7[which(X_task_7[, "sony"] == 1), "price"], X_task_7, beta, mc), 4), "\n")
Profit at current price: $ 13.6739
optimal_price * 100 # Return in dollars
[1] 156
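Instead of a grid search, the same optimum can be found with R's built-in one-dimensional optimiser (a sketch assuming get_profit(), X_task_7, beta and mc from above):
# One-dimensional optimisation of the profit function over the same price range
# (assumes get_profit(), X_task_7, beta and mc defined above).
opt <- optimize(function(p) get_profit(p, X_task_7, beta, mc),
                interval = c(0.50, 3.50), maximum = TRUE)
opt$maximum * 100   # optimal price in dollars, close to the grid-search result
opt$objective       # expected profit at that price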