For this assignment, I am using the Telco Customer Churn dataset, which contains detailed information about telecom customers and whether they have churned (left the service). The dataset includes demographic information, contract details, payment method types, and monthly service charges. Since customer retention is a critical metric in the telecom industry, analyzing churn trends helps companies develop better retention strategies.
# Load the Customer Churn dataset
telco_data <- read_csv("telco-data.csv.csv") %>%
mutate(Churn = if_else(Churn == "Yes", 1, 0), # Convert Churn to binary (1 = churn, 0 = no churn)
Contract = as.numeric(factor(Contract)), # Convert contract type to numeric
PaymentMethod = as.numeric(factor(PaymentMethod))) # Convert payment method to numeric
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the structure of the dataset
glimpse(telco_data)
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2, 1, 1, 3, 2, 3,…
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <dbl> 3, 4, 4, 1, 3, 3, 2, 4, 3, 1, 4, 2, 2, 1, 3, 2, 4, 1,…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <dbl> 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
Before diving into modeling, it is essential to explore the data to identify patterns and relationships.
# Summary statistics
summary(telco_data)
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Min. :1.00
## Class :character Class :character Class :character 1st Qu.:1.00
## Mode :character Mode :character Mode :character Median :1.00
## Mean :1.69
## 3rd Qu.:2.00
## Max. :3.00
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Min. :1.000 Min. : 18.25 Min. : 18.8
## Class :character 1st Qu.:2.000 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Median :3.000 Median : 70.35 Median :1397.5
## Mean :2.574 Mean : 64.76 Mean :2283.3
## 3rd Qu.:3.000 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :4.000 Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2654
## 3rd Qu.:1.0000
## Max. :1.0000
##
# Visualizing Churn by Contract Type
ggplot(telco_data, aes(x = as.factor(Contract), fill = as.factor(Churn))) +
geom_bar(position = "fill") +
labs(title = "Churn Rate by Contract Type", x = "Contract Type", y = "Proportion")
# Visualizing Churn by Monthly Charges
ggplot(telco_data, aes(x = MonthlyCharges, fill = as.factor(Churn))) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of Monthly Charges by Churn Status", x = "Monthly Charges", y = "Density")
Logistic regression is commonly used when the dependent variable is
binary. Since our target variable (Churn
) is either
1 (churned) or 0 (not churned),
logistic regression is appropriate for this analysis. The logit model
estimates the probability of churn as a function of multiple predictors,
transforming the linear combination of features into a probability via
the logistic function.
This baseline model evaluates the impact of tenure (how long a customer has been with the service) and monthly charges on churn probability.
baseline_model <- glm(Churn ~ tenure + MonthlyCharges, data = telco_data, family = binomial)
summary(baseline_model)
##
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges, family = binomial,
## data = telco_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.802436 0.086557 -20.82 <2e-16 ***
## tenure -0.054850 0.001689 -32.47 <2e-16 ***
## MonthlyCharges 0.032954 0.001299 25.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8150.1 on 7042 degrees of freedom
## Residual deviance: 6394.4 on 7040 degrees of freedom
## AIC: 6400.4
##
## Number of Fisher Scoring iterations: 5
Contract type is an important variable, as longer contracts
(one-year, two-year) might reduce churn probability. This model extends
the baseline model by incorporating Contract
.
model2 <- glm(Churn ~ tenure + MonthlyCharges + Contract, data = telco_data, family = binomial)
summary(model2)
##
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges + Contract, family = binomial,
## data = telco_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.573358 0.118770 -4.827 1.38e-06 ***
## tenure -0.035787 0.002037 -17.569 < 2e-16 ***
## MonthlyCharges 0.028621 0.001349 21.212 < 2e-16 ***
## Contract -1.030966 0.071445 -14.430 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8150.1 on 7042 degrees of freedom
## Residual deviance: 6146.1 on 7039 degrees of freedom
## AIC: 6154.1
##
## Number of Fisher Scoring iterations: 6
To explore whether the effect of MonthlyCharges
on churn
depends on Contract
type, we introduce an
interaction term.
model3 <- glm(Churn ~ tenure + MonthlyCharges * Contract, data = telco_data, family = binomial)
summary(model3)
##
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges * Contract, family = binomial,
## data = telco_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.196469 0.256383 0.766 0.443493
## tenure -0.036163 0.002056 -17.592 < 2e-16 ***
## MonthlyCharges 0.018817 0.003118 6.036 1.58e-09 ***
## Contract -1.660611 0.205560 -8.078 6.56e-16 ***
## MonthlyCharges:Contract 0.007896 0.002312 3.415 0.000638 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8150.1 on 7042 degrees of freedom
## Residual deviance: 6133.5 on 7038 degrees of freedom
## AIC: 6143.5
##
## Number of Fisher Scoring iterations: 7
Model comparison helps determine the best-fit model while balancing complexity and performance. We use: - Likelihood Ratio Test (LRT): Compares nested models to test if adding predictors significantly improves model fit. - AIC (Akaike Information Criterion): Evaluates model fit by penalizing complexity. - BIC (Bayesian Information Criterion): Similar to AIC but more conservative in penalizing complexity.
# Compare models using Likelihood Ratio Test
anova(baseline_model, model2, model3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Churn ~ tenure + MonthlyCharges
## Model 2: Churn ~ tenure + MonthlyCharges + Contract
## Model 3: Churn ~ tenure + MonthlyCharges * Contract
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 7040 6394.4
## 2 7039 6146.1 1 248.273 < 2.2e-16 ***
## 3 7038 6133.5 1 12.552 0.0003957 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare models using AIC and BIC
model_comparison <- tibble(
Model = c("Baseline", "Model 2", "Model 3"),
AIC = c(AIC(baseline_model), AIC(model2), AIC(model3)),
BIC = c(BIC(baseline_model), BIC(model2), BIC(model3))
)
print(model_comparison)
## # A tibble: 3 × 3
## Model AIC BIC
## <chr> <dbl> <dbl>
## 1 Baseline 6400. 6421.
## 2 Model 2 6154. 6182.
## 3 Model 3 6144. 6178.
MonthlyCharges
and Contract
is important for
predicting churn.Contract
and the interaction term significantly improve the
model.This analysis provides actionable insights for customer retention strategies: - Tenure matters: Customers with shorter tenure are more likely to churn, suggesting that early retention strategies (e.g., discounts or loyalty rewards) could be beneficial. - Pricing strategy impact: Higher monthly charges correlate with higher churn rates. The company may explore flexible pricing options or value-added services to retain high-paying customers. - Contract importance: Customers with longer contracts have lower churn probability, indicating that incentivizing annual or two-year contracts may reduce churn rates.
Using logistic regression with MLE and model selection techniques, we identified key churn predictors and determined the best-fit model. This framework can be extended to include additional variables, refine segmentation, and optimize retention strategies.