data_sw <- read.csv("/Users/sendibai/Desktop/Statistical Learning /Syndicate Assignment 1/StreamWave.csv",
                    header = TRUE, sep = ",")
summary(data_sw)
##      Churn          MonthlyFee        Usage            Length     
##  Min.   :0.0000   Min.   : 9.00   Min.   : 2.991   Min.   : 2.00  
##  1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.:25.307   1st Qu.: 9.00  
##  Median :0.0000   Median :14.00   Median :31.180   Median :11.00  
##  Mean   :0.2365   Mean   :15.04   Mean   :31.020   Mean   :11.08  
##  3rd Qu.:0.0000   3rd Qu.:19.00   3rd Qu.:36.851   3rd Qu.:13.00  
##  Max.   :1.0000   Max.   :25.00   Max.   :58.448   Max.   :24.00  
##     Discount          Age            Region         Complaints    
##  Min.   :0.000   Min.   :18.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:31.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :38.00   Median :0.0000   Median :0.0000  
##  Mean   :0.299   Mean   :38.14   Mean   :0.3045   Mean   :0.4395  
##  3rd Qu.:1.000   3rd Qu.:45.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :71.00   Max.   :1.0000   Max.   :4.0000  
##     AutoPay      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4095  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Part 1

Q1 — OLS: Usage ~ MonthlyFee + Length + Age

model1 <- lm(Usage ~ MonthlyFee + Length + Age, data = data_sw)
summary(model1)
## 
## Call:
## lm(formula = Usage ~ MonthlyFee + Length + Age, data = data_sw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.8845  -5.5653   0.0551   5.6808  26.7831 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.77845    1.13627  34.128  < 2e-16 ***
## MonthlyFee  -0.27228    0.03874  -7.029 2.84e-12 ***
## Length       0.07763    0.06009   1.292    0.197    
## Age         -0.11860    0.01837  -6.456 1.35e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.257 on 1996 degrees of freedom
## Multiple R-squared:  0.04405,    Adjusted R-squared:  0.04262 
## F-statistic: 30.66 on 3 and 1996 DF,  p-value: < 2.2e-16

The estimated model is:

\[ \widehat{Usage} = 38.78 - 0.27\, MonthlyFee + 0.08\, Length - 0.12\, Age \]

Interpretation of coefficients:

  • MonthlyFee (−0.27): A one-dollar increase in the monthly fee is associated with a decrease of approximately 0.27 hours of usage, holding other variables constant.
  • Length (0.08): Each additional month of subscription is associated with an increase of approximately 0.08 hours of usage, holding other variables constant. Note that this coefficient is not statistically significant (p = 0.197).
  • Age (−0.12): Each additional year of age is associated with a decrease of approximately 0.12 hours of usage, holding other variables constant.

Q2 — MLE: Log-likelihood Estimation

The log-likelihood function under the normality assumption is:

\[ \ell(\beta, \sigma) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 \]

library(stats4)

LL_sw <- function(b0, b1, b2, b3, sig) {
  R <- suppressWarnings(
    dnorm(data_sw$Usage,
          b0 + b1 * data_sw$MonthlyFee + b2 * data_sw$Length + b3 * data_sw$Age,
          sig, log = TRUE)
  )
  -sum(R)
}

mle_sw <- mle(minuslogl = LL_sw,
              start  = list(b0 = 0, b1 = 0, b2 = 0, b3 = 0, sig = 1),
              method = "BFGS")
summary(mle_sw)
## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = LL_sw, start = list(b0 = 0, b1 = 0, b2 = 0, b3 = 0, 
##     sig = 1), method = "BFGS")
## 
## Coefficients:
##       Estimate Std. Error
## b0  38.8255070 1.13539531
## b1  -0.2731360 0.03870755
## b2   0.0761398 0.06003919
## b3  -0.1190414 0.01835771
## sig  8.2509749 0.13050562
## 
## -2 log L: 14116.14

The MLE estimates are virtually identical to the OLS estimates. This is expected: under the normality assumption, MLE and OLS are equivalent, as both minimise the same objective function.


Q3 — Linear Probability Model: Churn ~ Usage + Complaints + Discount

ols_model <- lm(Churn ~ Usage + Complaints + Discount, data = data_sw)
summary(ols_model)
## 
## Call:
## lm(formula = Churn ~ Usage + Complaints + Discount, data = data_sw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9239 -0.2615 -0.1156  0.1386  1.1229 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.673165   0.034748  19.373  < 2e-16 ***
## Usage       -0.017473   0.001015 -17.219  < 2e-16 ***
## Complaints   0.141992   0.012970  10.948  < 2e-16 ***
## Discount     0.143588   0.018633   7.706 2.03e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3778 on 1996 degrees of freedom
## Multiple R-squared:  0.2112, Adjusted R-squared:   0.21 
## F-statistic: 178.1 on 3 and 1996 DF,  p-value: < 2.2e-16

The estimated linear probability model is:

\[ \widehat{Churn} = 0.673 - 0.017\, Usage + 0.142\, Complaints + 0.144\, Discount \]

Interpretation of coefficients:

  • Usage (−0.017): Each additional hour of streaming is associated with a 1.7 percentage point decrease in the probability of churn, holding other variables constant.
  • Complaints (0.142): Each additional complaint is associated with a 14.2 percentage point increase in the probability of churn, holding other variables constant.
  • Discount (0.144): Customers receiving a discount have a 14.4 percentage point higher probability of churn compared to those who do not, holding other variables constant. This is counterintuitive and likely reflects a selection effect — discounts tend to attract price-sensitive customers who are more prone to churn.

Q4 — Predicted Probabilities and LPM Limitations

# Low-usage, high-complaints customer
x1 <- data.frame(Usage = 5, Complaints = 4, Discount = 0)
p_lumc_sw <- predict(ols_model, newdata = x1)
p_lumc_sw
##        1 
## 1.153768
# High-usage, no complaints customer
x2 <- data.frame(Usage = 55, Complaints = 0, Discount = 0)
p_hunc_sw <- predict(ols_model, newdata = x2)
p_hunc_sw
##          1 
## -0.2878357
Customer Profile Usage Complaints Discount LPM Predicted Probability
Low-usage, high-complaints 5 4 0 1.15 (impossible: > 1)
High-usage, no complaints 55 0 0 −0.29 (impossible: < 0)

Both predictions fall outside the valid [0, 1] range. This illustrates a fundamental limitation of the LPM: it places no constraint on predicted values. If the firm relied on this model for retention targeting, it could systematically misclassify customers at both extremes — wasting resources on customers already certain to stay, while missing genuinely at-risk customers.


Part 2

Q5 — Simple Logistic Regression: Churn ~ Usage

logit_model <- glm(Churn ~ Usage,
                   data   = data_sw,
                   family = binomial)
summary(logit_model)
## 
## Call:
## glm(formula = Churn ~ Usage, family = binomial, data = data_sw)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.491171   0.227524   10.95   <2e-16 ***
## Usage       -0.126367   0.008021  -15.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2188.0  on 1999  degrees of freedom
## Residual deviance: 1868.4  on 1998  degrees of freedom
## AIC: 1872.4
## 
## Number of Fisher Scoring iterations: 5

The estimated logistic regression model is:

\[ \ln\left(\frac{P}{1-P}\right) = 2.491 - 0.126\, Usage \]

exp(coef(logit_model))
## (Intercept)       Usage 
##  12.0754037   0.8812915

Odds ratio interpretation:
The odds ratio for Usage is \(e^{-0.126} \approx 0.881\). Each additional hour of usage reduces the odds of churn by approximately 11.9% (i.e., \(1 - 0.881 = 0.119\)), holding other variables constant.

usage_seq <- seq(0, 60, 1)
prob <- predict(logit_model,
                newdata = data.frame(Usage = usage_seq),
                type    = "response")

plot(usage_seq, prob,
     type = "l", col = "blue", lwd = 2,
     xlab = "Usage Hours",
     ylab = "Predicted Probability of Churn",
     main = "Logistic Regression: Churn vs Usage")


Q6 — Predicted Probabilities: Logit vs LPM

# Logit predictions
p1 <- predict(logit_model, newdata = data.frame(Usage = 5),  type = "response")
p2 <- predict(logit_model, newdata = data.frame(Usage = 50), type = "response")

p1  # Usage = 5
##         1 
## 0.8652196
p2  # Usage = 50
##          1 
## 0.02130721
Model Usage = 5 Usage = 50
LPM 1.15 ❌ −0.29 ❌
Logistic 0.865 0.021

Unlike the LPM, the logistic model constrains all predicted probabilities to the [0, 1] range.

Marginal Effects (\(ME = \beta \times \hat{P}(1 - \hat{P})\), where \(\beta = -0.126\)):

  • At Usage = 5: \(ME = -0.126 \times 0.865 \times (1 - 0.865) \approx -0.015\). A one-hour increase in usage decreases churn probability by 1.5 percentage points.
  • At Usage = 50: \(ME = -0.126 \times 0.021 \times (1 - 0.021) \approx -0.003\). A one-hour increase in usage decreases churn probability by 0.3 percentage points.

The marginal effect decreases as usage increases, reflecting the diminishing sensitivity of the logistic curve at extreme values.


Q7 — Multiple Logistic Regression

fit <- glm(Churn ~ Usage + Length + Discount + Complaints + AutoPay,
           family = binomial(link = "logit"),
           data   = data_sw)
summary(fit)
## 
## Call:
## glm(formula = Churn ~ Usage + Length + Discount + Complaints + 
##     AutoPay, family = binomial(link = "logit"), data = data_sw)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.232275   0.352348   9.174  < 2e-16 ***
## Usage       -0.124489   0.008814 -14.124  < 2e-16 ***
## Length      -0.110609   0.021364  -5.177 2.25e-07 ***
## Discount     0.907302   0.134356   6.753 1.45e-11 ***
## Complaints   0.950577   0.092746  10.249  < 2e-16 ***
## AutoPay     -1.277957   0.145267  -8.797  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2188.0  on 1999  degrees of freedom
## Residual deviance: 1605.2  on 1994  degrees of freedom
## AIC: 1617.2
## 
## Number of Fisher Scoring iterations: 5

The estimated model is:

\[ \ln\left(\frac{P}{1-P}\right) = 3.232 - 0.124\, Usage - 0.111\, Length + 0.907\, Discount + 0.951\, Complaints - 1.278\, AutoPay \]

exp(coef(fit))
## (Intercept)       Usage      Length    Discount  Complaints     AutoPay 
##  25.3372443   0.8829482   0.8952890   2.4776282   2.5872030   0.2786059

All predictors are statistically significant (p < 0.001). The odds ratios for key variables are summarised below:

Variable Coefficient Odds Ratio Interpretation
Discount 0.907 2.477 Discounted customers are ~2.5× more likely to churn
Complaints 0.951 2.588 Each additional complaint nearly doubles the odds of churn
AutoPay −1.278 0.278 AutoPay customers have 72% lower odds of churning

Complaints and Discount have the largest positive effects on churn risk, while AutoPay is the strongest protective factor.


Q8 — Interaction: Complaints × Discount

fit2 <- glm(Churn ~ Usage + Length + Discount + Complaints + AutoPay +
              Complaints:Discount,
            family = binomial(link = "logit"),
            data   = data_sw)
summary(fit2)
## 
## Call:
## glm(formula = Churn ~ Usage + Length + Discount + Complaints + 
##     AutoPay + Complaints:Discount, family = binomial(link = "logit"), 
##     data = data_sw)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.254976   0.353511   9.208  < 2e-16 ***
## Usage               -0.124506   0.008813 -14.127  < 2e-16 ***
## Length              -0.109953   0.021394  -5.139 2.76e-07 ***
## Discount             0.827898   0.162509   5.094 3.50e-07 ***
## Complaints           0.905942   0.105533   8.584  < 2e-16 ***
## AutoPay             -1.274324   0.145291  -8.771  < 2e-16 ***
## Discount:Complaints  0.188103   0.216891   0.867    0.386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2188.0  on 1999  degrees of freedom
## Residual deviance: 1604.5  on 1993  degrees of freedom
## AIC: 1618.5
## 
## Number of Fisher Scoring iterations: 5

The estimated model is:

\[ \ln\left(\frac{P}{1-P}\right) = 3.255 - 0.125\, Usage - 0.110\, Length + 0.828\, Discount + 0.906\, Complaints - 1.274\, AutoPay + 0.188\,(Complaints \times Discount) \]

The interaction term Complaints × Discount is not statistically significant (p = 0.386 > 0.05), suggesting that the effect of complaints on churn does not significantly differ between discounted and non-discounted customers.

Marginal Effect of Complaints with interaction:

\[ ME_{\text{Complaints}} = (\beta_{\text{Complaints}} + \beta_{\text{interaction}} \times Discount) \times \hat{P}(1 - \hat{P}) \]

P_hat         <- mean(data_sw$Churn)
b_complaints  <- coef(fit2)["Complaints"]
b_interaction <- coef(fit2)["Discount:Complaints"]

ME_no_discount <- (b_complaints + b_interaction * 0) * P_hat * (1 - P_hat)
ME_discount    <- (b_complaints + b_interaction * 1) * P_hat * (1 - P_hat)

cat("ME (Discount = 0):", round(ME_no_discount, 4), "\n")
## ME (Discount = 0): 0.1636
cat("ME (Discount = 1):", round(ME_discount, 4), "\n")
## ME (Discount = 1): 0.1975
Customer Group Marginal Effect of Complaints
Non-discounted (Discount = 0) 0.163 (16.3 pp increase in churn probability)
Discounted (Discount = 1) 0.197 (19.7 pp increase in churn probability)

Managerial Implications:
Although the interaction term is not statistically significant, discounted customers who also lodge complaints represent a particularly high-risk segment — they are already price-sensitive and are simultaneously signalling dissatisfaction. StreamWave should prioritise complaint resolution for customers on promotional discounts. Practically, this could take the form of a dedicated support tier or proactive outreach when a discounted customer files their first complaint.