data_sw <- read.csv("/Users/sendibai/Desktop/Statistical Learning /Syndicate Assignment 1/StreamWave.csv",
header = TRUE, sep = ",")
summary(data_sw)## Churn MonthlyFee Usage Length
## Min. :0.0000 Min. : 9.00 Min. : 2.991 Min. : 2.00
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.:25.307 1st Qu.: 9.00
## Median :0.0000 Median :14.00 Median :31.180 Median :11.00
## Mean :0.2365 Mean :15.04 Mean :31.020 Mean :11.08
## 3rd Qu.:0.0000 3rd Qu.:19.00 3rd Qu.:36.851 3rd Qu.:13.00
## Max. :1.0000 Max. :25.00 Max. :58.448 Max. :24.00
## Discount Age Region Complaints
## Min. :0.000 Min. :18.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:31.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :38.00 Median :0.0000 Median :0.0000
## Mean :0.299 Mean :38.14 Mean :0.3045 Mean :0.4395
## 3rd Qu.:1.000 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :71.00 Max. :1.0000 Max. :4.0000
## AutoPay
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4095
## 3rd Qu.:1.0000
## Max. :1.0000
##
## Call:
## lm(formula = Usage ~ MonthlyFee + Length + Age, data = data_sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.8845 -5.5653 0.0551 5.6808 26.7831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.77845 1.13627 34.128 < 2e-16 ***
## MonthlyFee -0.27228 0.03874 -7.029 2.84e-12 ***
## Length 0.07763 0.06009 1.292 0.197
## Age -0.11860 0.01837 -6.456 1.35e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.257 on 1996 degrees of freedom
## Multiple R-squared: 0.04405, Adjusted R-squared: 0.04262
## F-statistic: 30.66 on 3 and 1996 DF, p-value: < 2.2e-16
The estimated model is:
\[ \widehat{Usage} = 38.78 - 0.27\, MonthlyFee + 0.08\, Length - 0.12\, Age \]
Interpretation of coefficients:
The log-likelihood function under the normality assumption is:
\[ \ell(\beta, \sigma) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 \]
library(stats4)
LL_sw <- function(b0, b1, b2, b3, sig) {
R <- suppressWarnings(
dnorm(data_sw$Usage,
b0 + b1 * data_sw$MonthlyFee + b2 * data_sw$Length + b3 * data_sw$Age,
sig, log = TRUE)
)
-sum(R)
}
mle_sw <- mle(minuslogl = LL_sw,
start = list(b0 = 0, b1 = 0, b2 = 0, b3 = 0, sig = 1),
method = "BFGS")
summary(mle_sw)## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = LL_sw, start = list(b0 = 0, b1 = 0, b2 = 0, b3 = 0,
## sig = 1), method = "BFGS")
##
## Coefficients:
## Estimate Std. Error
## b0 38.8255070 1.13539531
## b1 -0.2731360 0.03870755
## b2 0.0761398 0.06003919
## b3 -0.1190414 0.01835771
## sig 8.2509749 0.13050562
##
## -2 log L: 14116.14
The MLE estimates are virtually identical to the OLS estimates. This is expected: under the normality assumption, MLE and OLS are equivalent, as both minimise the same objective function.
##
## Call:
## lm(formula = Churn ~ Usage + Complaints + Discount, data = data_sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9239 -0.2615 -0.1156 0.1386 1.1229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.673165 0.034748 19.373 < 2e-16 ***
## Usage -0.017473 0.001015 -17.219 < 2e-16 ***
## Complaints 0.141992 0.012970 10.948 < 2e-16 ***
## Discount 0.143588 0.018633 7.706 2.03e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3778 on 1996 degrees of freedom
## Multiple R-squared: 0.2112, Adjusted R-squared: 0.21
## F-statistic: 178.1 on 3 and 1996 DF, p-value: < 2.2e-16
The estimated linear probability model is:
\[ \widehat{Churn} = 0.673 - 0.017\, Usage + 0.142\, Complaints + 0.144\, Discount \]
Interpretation of coefficients:
# Low-usage, high-complaints customer
x1 <- data.frame(Usage = 5, Complaints = 4, Discount = 0)
p_lumc_sw <- predict(ols_model, newdata = x1)
p_lumc_sw## 1
## 1.153768
# High-usage, no complaints customer
x2 <- data.frame(Usage = 55, Complaints = 0, Discount = 0)
p_hunc_sw <- predict(ols_model, newdata = x2)
p_hunc_sw## 1
## -0.2878357
| Customer Profile | Usage | Complaints | Discount | LPM Predicted Probability |
|---|---|---|---|---|
| Low-usage, high-complaints | 5 | 4 | 0 | 1.15 (impossible: > 1) |
| High-usage, no complaints | 55 | 0 | 0 | −0.29 (impossible: < 0) |
Both predictions fall outside the valid [0, 1] range. This illustrates a fundamental limitation of the LPM: it places no constraint on predicted values. If the firm relied on this model for retention targeting, it could systematically misclassify customers at both extremes — wasting resources on customers already certain to stay, while missing genuinely at-risk customers.
##
## Call:
## glm(formula = Churn ~ Usage, family = binomial, data = data_sw)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.491171 0.227524 10.95 <2e-16 ***
## Usage -0.126367 0.008021 -15.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2188.0 on 1999 degrees of freedom
## Residual deviance: 1868.4 on 1998 degrees of freedom
## AIC: 1872.4
##
## Number of Fisher Scoring iterations: 5
The estimated logistic regression model is:
\[ \ln\left(\frac{P}{1-P}\right) = 2.491 - 0.126\, Usage \]
## (Intercept) Usage
## 12.0754037 0.8812915
Odds ratio interpretation:
The odds ratio for Usage is \(e^{-0.126}
\approx 0.881\). Each additional hour of usage reduces the odds
of churn by approximately 11.9% (i.e., \(1 - 0.881 = 0.119\)), holding other
variables constant.
usage_seq <- seq(0, 60, 1)
prob <- predict(logit_model,
newdata = data.frame(Usage = usage_seq),
type = "response")
plot(usage_seq, prob,
type = "l", col = "blue", lwd = 2,
xlab = "Usage Hours",
ylab = "Predicted Probability of Churn",
main = "Logistic Regression: Churn vs Usage")# Logit predictions
p1 <- predict(logit_model, newdata = data.frame(Usage = 5), type = "response")
p2 <- predict(logit_model, newdata = data.frame(Usage = 50), type = "response")
p1 # Usage = 5## 1
## 0.8652196
## 1
## 0.02130721
| Model | Usage = 5 | Usage = 50 |
|---|---|---|
| LPM | 1.15 ❌ | −0.29 ❌ |
| Logistic | 0.865 ✅ | 0.021 ✅ |
Unlike the LPM, the logistic model constrains all predicted probabilities to the [0, 1] range.
Marginal Effects (\(ME = \beta \times \hat{P}(1 - \hat{P})\), where \(\beta = -0.126\)):
The marginal effect decreases as usage increases, reflecting the diminishing sensitivity of the logistic curve at extreme values.
fit <- glm(Churn ~ Usage + Length + Discount + Complaints + AutoPay,
family = binomial(link = "logit"),
data = data_sw)
summary(fit)##
## Call:
## glm(formula = Churn ~ Usage + Length + Discount + Complaints +
## AutoPay, family = binomial(link = "logit"), data = data_sw)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.232275 0.352348 9.174 < 2e-16 ***
## Usage -0.124489 0.008814 -14.124 < 2e-16 ***
## Length -0.110609 0.021364 -5.177 2.25e-07 ***
## Discount 0.907302 0.134356 6.753 1.45e-11 ***
## Complaints 0.950577 0.092746 10.249 < 2e-16 ***
## AutoPay -1.277957 0.145267 -8.797 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2188.0 on 1999 degrees of freedom
## Residual deviance: 1605.2 on 1994 degrees of freedom
## AIC: 1617.2
##
## Number of Fisher Scoring iterations: 5
The estimated model is:
\[ \ln\left(\frac{P}{1-P}\right) = 3.232 - 0.124\, Usage - 0.111\, Length + 0.907\, Discount + 0.951\, Complaints - 1.278\, AutoPay \]
## (Intercept) Usage Length Discount Complaints AutoPay
## 25.3372443 0.8829482 0.8952890 2.4776282 2.5872030 0.2786059
All predictors are statistically significant (p < 0.001). The odds ratios for key variables are summarised below:
| Variable | Coefficient | Odds Ratio | Interpretation |
|---|---|---|---|
| Discount | 0.907 | 2.477 | Discounted customers are ~2.5× more likely to churn |
| Complaints | 0.951 | 2.588 | Each additional complaint nearly doubles the odds of churn |
| AutoPay | −1.278 | 0.278 | AutoPay customers have 72% lower odds of churning |
Complaints and Discount have the largest positive effects on churn risk, while AutoPay is the strongest protective factor.
fit2 <- glm(Churn ~ Usage + Length + Discount + Complaints + AutoPay +
Complaints:Discount,
family = binomial(link = "logit"),
data = data_sw)
summary(fit2)##
## Call:
## glm(formula = Churn ~ Usage + Length + Discount + Complaints +
## AutoPay + Complaints:Discount, family = binomial(link = "logit"),
## data = data_sw)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.254976 0.353511 9.208 < 2e-16 ***
## Usage -0.124506 0.008813 -14.127 < 2e-16 ***
## Length -0.109953 0.021394 -5.139 2.76e-07 ***
## Discount 0.827898 0.162509 5.094 3.50e-07 ***
## Complaints 0.905942 0.105533 8.584 < 2e-16 ***
## AutoPay -1.274324 0.145291 -8.771 < 2e-16 ***
## Discount:Complaints 0.188103 0.216891 0.867 0.386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2188.0 on 1999 degrees of freedom
## Residual deviance: 1604.5 on 1993 degrees of freedom
## AIC: 1618.5
##
## Number of Fisher Scoring iterations: 5
The estimated model is:
\[ \ln\left(\frac{P}{1-P}\right) = 3.255 - 0.125\, Usage - 0.110\, Length + 0.828\, Discount + 0.906\, Complaints - 1.274\, AutoPay + 0.188\,(Complaints \times Discount) \]
The interaction term Complaints × Discount is not statistically significant (p = 0.386 > 0.05), suggesting that the effect of complaints on churn does not significantly differ between discounted and non-discounted customers.
Marginal Effect of Complaints with interaction:
\[ ME_{\text{Complaints}} = (\beta_{\text{Complaints}} + \beta_{\text{interaction}} \times Discount) \times \hat{P}(1 - \hat{P}) \]
P_hat <- mean(data_sw$Churn)
b_complaints <- coef(fit2)["Complaints"]
b_interaction <- coef(fit2)["Discount:Complaints"]
ME_no_discount <- (b_complaints + b_interaction * 0) * P_hat * (1 - P_hat)
ME_discount <- (b_complaints + b_interaction * 1) * P_hat * (1 - P_hat)
cat("ME (Discount = 0):", round(ME_no_discount, 4), "\n")## ME (Discount = 0): 0.1636
## ME (Discount = 1): 0.1975
| Customer Group | Marginal Effect of Complaints |
|---|---|
| Non-discounted (Discount = 0) | 0.163 (16.3 pp increase in churn probability) |
| Discounted (Discount = 1) | 0.197 (19.7 pp increase in churn probability) |
Managerial Implications:
Although the interaction term is not statistically significant,
discounted customers who also lodge complaints represent a particularly
high-risk segment — they are already price-sensitive and are
simultaneously signalling dissatisfaction. StreamWave should prioritise
complaint resolution for customers on promotional discounts.
Practically, this could take the form of a dedicated support tier or
proactive outreach when a discounted customer files their first
complaint.