library(readr)
week2_churn_data <- read_csv("C:/Users/evely/OneDrive/Desktop/R/week2_churn_data.csv")
View(week2_churn_data)
Dataset: week2_churn_data.csv (customer churn)
Purpose: Build and compare multiple churn classifiers on the same train/test split, select the best model, and then improve it with one ensemble method (bagging by default).
Required (minimum):
- Fit 4 models: Logistic Regression, kNN, Naive Bayes, CART Tree
- Evaluate each with: confusion matrix, Accuracy/Precision/Recall/F1, ROC + AUC
- Select a “best” model for a business use case (explain the metric you prioritize)
- Improve the selected model using ONE of: bagging (recommended), boosting (optional), or stacking (optional/bonus)
What you must write:
- Your predictor selection (which variables you chose and why)
- Your own plots (at least 2) and interpretation
- Your model choice and business justification (cost of FN vs FP)
df <- read.csv("week2_churn_data.csv", stringsAsFactors = FALSE)
# Quick checks
dim(df)
## [1] 400 8
str(df)
## 'data.frame': 400 obs. of 8 variables:
## $ customer_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ tenure_months : int 49 29 35 1 23 31 24 9 3 11 ...
## $ monthly_charges: num 68.6 77.4 80.2 58.4 50.8 ...
## $ contract_type : chr "One year" "One year" "Two year" "One year" ...
## $ online_security: chr "Yes" "Yes" "Yes" "Yes" ...
## $ tech_support : chr "Yes" "Yes" "No" "Yes" ...
## $ churn : chr "Yes" "Yes" "No" "No" ...
## $ total_charges : num 3691 2114.3 2915.4 57.1 1259.1 ...
head(df, 3)
## customer_id tenure_months monthly_charges contract_type online_security
## 1 1 49 68.63 One year Yes
## 2 2 29 77.40 One year Yes
## 3 3 35 80.22 Two year Yes
## tech_support churn total_charges
## 1 Yes Yes 3691.04
## 2 Yes Yes 2114.31
## 3 No No 2915.41
df <- read.csv("C:/Users/evely/OneDrive/Desktop/R/week2_churn_data.csv",
stringsAsFactors = FALSE)
# Quick checks
dim(df)
## [1] 400 8
str(df)
## 'data.frame': 400 obs. of 8 variables:
## $ customer_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ tenure_months : int 49 29 35 1 23 31 24 9 3 11 ...
## $ monthly_charges: num 68.6 77.4 80.2 58.4 50.8 ...
## $ contract_type : chr "One year" "One year" "Two year" "One year" ...
## $ online_security: chr "Yes" "Yes" "Yes" "Yes" ...
## $ tech_support : chr "Yes" "Yes" "No" "Yes" ...
## $ churn : chr "Yes" "Yes" "No" "No" ...
## $ total_charges : num 3691 2114.3 2915.4 57.1 1259.1 ...
head(df, 3)
## customer_id tenure_months monthly_charges contract_type online_security
## 1 1 49 68.63 One year Yes
## 2 2 29 77.40 One year Yes
## 3 3 35 80.22 Two year Yes
## tech_support churn total_charges
## 1 Yes Yes 3691.04
## 2 Yes Yes 2114.31
## 3 No No 2915.41
# 1) Confirm the outcome column name:
# This template assumes the outcome is named 'churn' with values like Yes/No.
# If your file uses a different name, rename it here.
# df$churn <- df$<RENAME_ME>
# 2) Force churn into a 2-level factor with levels c("No","Yes")
df$churn <- factor(df$churn, levels = c("No","Yes"))
table(df$churn)
##
## No Yes
## 290 110
# 3) Optional: drop ID-like columns (uncomment as needed)
# df$customer_id <- NULL
# df$CustomerID <- NULL
Checkpoint (write 2–4 sentences):
- What is the business question?
- What is the churn rate in the full dataset?

Your answer:
- Business question: [The business question is whether we can predict which customers are likely to churn based on their contract type, tenure, service features, and billing characteristics. Accurately identifying likely churners allows the company to implement targeted retention strategies to reduce customer loss.]
- Churn rate: [The overall churn rate in the dataset is 27.5%, meaning that 110 out of 400 customers have churned. Since churn represents the minority class, evaluation metrics beyond simple accuracy (such as recall and AUC) will be important when comparing models.]
You may not automatically use all variables. Choose 4–8 predictors that make business sense for churn. You may include both numeric and categorical predictors.

Write your predictor list here: tenure_months, monthly_charges, total_charges, contract_type, online_security, tech_support.
# TODO: Replace the vector below with your chosen predictor column names
predictor_cols <- c(
# "tenure_months",
# "monthly_charges",
# "contract_type"
)
# Choose predictors (4–8 required)
predictor_cols <- c(
"tenure_months",
"monthly_charges",
"total_charges",
"contract_type",
"online_security",
"tech_support"
)
# Keep outcome + predictors only
keep_cols <- c("churn", predictor_cols)
df2 <- df[, keep_cols]
# Confirm structure
str(df2)
## 'data.frame': 400 obs. of 7 variables:
## $ churn : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
## $ tenure_months : int 49 29 35 1 23 31 24 9 3 11 ...
## $ monthly_charges: num 68.6 77.4 80.2 58.4 50.8 ...
## $ total_charges : num 3691 2114.3 2915.4 57.1 1259.1 ...
## $ contract_type : chr "One year" "One year" "Two year" "One year" ...
## $ online_security: chr "Yes" "Yes" "Yes" "Yes" ...
## $ tech_support : chr "Yes" "Yes" "No" "Yes" ...
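# NOTE: no set.seed() call precedes the sampling below, so the exact split
# (and all downstream results) will vary between runs; fixing a seed first,
# e.g. set.seed(123), would make the split reproducible.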
n <- nrow(df2)
train_size <- floor(0.70 * n)
train_idx <- sample(1:n, size = train_size, replace = FALSE)
train <- df2[train_idx, ]
test <- df2[-train_idx, ]
c(nrow(train), nrow(test))
## [1] 280 120
prop.table(table(train$churn))
##
## No Yes
## 0.7321429 0.2678571
prop.table(table(test$churn))
##
## No Yes
## 0.7083333 0.2916667
Checkpoint (1–2 sentences):
- Are the churn rates similar in train vs test?
Your answer: [Yes, the churn rates are similar between the training and test sets. The training set has a churn rate of approximately 26.8%, while the test set has a churn rate of approximately 29.2%. These proportions are close to the overall churn rate of 27.5%, indicating that the 70/30 split preserved class balance sufficiently for model evaluation.]
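If preserving the class balance exactly mattered more, a stratified split could be used instead of a simple random split. A minimal sketch under that assumption (the names train_strat and test_strat are illustrative, not part of the template):

# Stratified 70/30 split: sample 70% within each churn level separately,
# so train and test churn rates match the overall 27.5% almost exactly.
idx_no  <- which(df2$churn == "No")
idx_yes <- which(df2$churn == "Yes")
strat_idx <- c(sample(idx_no,  floor(0.70 * length(idx_no))),
               sample(idx_yes, floor(0.70 * length(idx_yes))))
train_strat <- df2[strat_idx, ]
test_strat  <- df2[-strat_idx, ]
prop.table(table(train_strat$churn))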
# Confusion matrix plus threshold-based metrics for a two-class outcome.
calc_metrics <- function(actual, pred_class, positive = "Yes") {
  actual <- factor(actual, levels = c("No","Yes"))
  pred_class <- factor(pred_class, levels = c("No","Yes"))
  cm <- table(Predicted = pred_class, Actual = actual)
  negative <- setdiff(levels(actual), positive)
  TP <- cm[positive, positive]
  TN <- cm[negative, negative]
  FP <- cm[positive, negative]  # flagged as churn but did not churn
  FN <- cm[negative, positive]  # missed churner (the costly case here)
  accuracy <- (TP + TN) / sum(cm)
  precision <- if ((TP + FP) == 0) NA else TP / (TP + FP)
  recall <- if ((TP + FN) == 0) NA else TP / (TP + FN)
  f1 <- if (is.na(precision) || is.na(recall) || (precision + recall) == 0) NA else 2 * precision * recall / (precision + recall)
  list(cm = cm, accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}
library(pROC)

# ROC curve and AUC from predicted probabilities of the positive class.
make_roc_auc <- function(actual, prob_yes) {
  actual <- factor(actual, levels = c("No","Yes"))
  roc_obj <- roc(response = actual, predictor = prob_yes,
                 levels = c("No","Yes"), direction = "<")
  auc_val <- as.numeric(auc(roc_obj))
  list(roc = roc_obj, auc = auc_val)
}
Create at least two plots that help you understand churn risk. Each plot must include 1–3 sentences explaining what it suggests.
# TODO: Choose one predictor and plot churn rate by group (if categorical)
# Example ideas:
# - barplot of churn proportion by contract type
# - boxplot of numeric predictor split by churn
#
# PLOT 1 CODE HERE
# install.packages("ggplot2")  # only if not installed
library(ggplot2)
library(dplyr)
library(scales)

plot1_df <- train %>%
  group_by(contract_type) %>%
  summarise(churn_rate = mean(churn == "Yes"), n = n(), .groups = "drop")

p1 <- ggplot(plot1_df, aes(x = contract_type, y = churn_rate)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(title = "Churn Rate by Contract Type (Train Set)",
       x = "Contract Type", y = "Churn Rate") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))
print(p1)
Interpretation (1–3 sentences): [Churn rates vary across contract types. Month-to-month customers exhibit the highest churn rate (approximately 29%), while one-year contracts show the lowest churn (around 25%). This suggests that shorter-term commitments are associated with higher churn risk, indicating that encouraging customers to move to longer contracts may improve retention.]
# TODO: A second plot, different from Plot 1
# PLOT 2 CODE HERE
p2 <- ggplot(train, aes(x = churn, y = tenure_months)) +
  geom_boxplot() +
  labs(title = "Tenure (Months) by Churn Status (Train Set)",
       x = "Churn", y = "Tenure (Months)")
print(p2)
Interpretation (1–3 sentences): [The tenure distributions for churned and non-churned customers are fairly similar, with substantial overlap between the two groups. Although churned customers appear to have a slightly higher median tenure, the difference is small, suggesting that tenure alone may not strongly predict churn in this dataset. This indicates that churn risk may depend more on service features or contract characteristics rather than tenure alone.]
# Logistic regression (uses selected predictors only)
logit_fit <- glm(churn ~ ., data = train, family = binomial)
summary(logit_fit)
##
## Call:
## glm(formula = churn ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7455106 0.9410354 -0.792 0.428
## tenure_months -0.0077653 0.0251454 -0.309 0.757
## monthly_charges -0.0107799 0.0127768 -0.844 0.399
## total_charges 0.0002629 0.0003492 0.753 0.452
## contract_typeOne year -0.1632128 0.3354726 -0.487 0.627
## contract_typeTwo year -0.0316565 0.3331122 -0.095 0.924
## online_securityYes 0.3029056 0.2731589 1.109 0.267
## tech_supportYes 0.1669080 0.2762099 0.604 0.546
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 325.42 on 279 degrees of freedom
## Residual deviance: 321.38 on 272 degrees of freedom
## AIC: 337.38
##
## Number of Fisher Scoring iterations: 4
logit_prob <- predict(logit_fit, newdata = test, type = "response")
# TODO: Choose a threshold (start with 0.50). You may adjust later.
threshold <- 0.50
logit_pred <- ifelse(logit_prob >= threshold, "Yes", "No")
logit_m <- calc_metrics(test$churn, logit_pred)
logit_m$cm
## Actual
## Predicted No Yes
## No 85 35
## Yes 0 0
logit_m$accuracy; logit_m$precision; logit_m$recall; logit_m$f1
## [1] 0.7083333
## [1] NA
## [1] 0
## [1] NA
logit_ra <- make_roc_auc(test$churn, logit_prob)
logit_ra$auc
## [1] 0.5415126
# Retry with a lower threshold to trade precision for recall.
threshold <- 0.30
logit_pred <- ifelse(logit_prob >= threshold, "Yes", "No")
logit_m <- calc_metrics(test$churn, logit_pred)
logit_m$cm
## Actual
## Predicted No Yes
## No 60 25
## Yes 25 10
logit_m$accuracy; logit_m$precision; logit_m$recall; logit_m$f1
## [1] 0.5833333
## [1] 0.2857143
## [1] 0.2857143
## [1] 0.2857143
plot(logit_ra$roc, main = "ROC — Logistic Regression")
Checkpoint (2–4 sentences):
- Which predictors look influential (direction/sign), and does that match your EDA?
Your answer: [In the logistic regression model, none of the predictors appear statistically significant, indicating limited separation between churners and non-churners in this dataset. Contract type shows negative coefficients for one-year and two-year contracts relative to month-to-month, suggesting slightly lower churn risk for longer contracts, which aligns with the EDA results. Tenure also has a small negative effect, indicating that longer-tenured customers may be slightly less likely to churn, although the effect is weak. Overall, the direction of contract type matches the EDA findings, but the model suggests that these predictors have limited explanatory power.]
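As a quick check on coefficient direction, exponentiating the coefficients converts them to odds ratios; a minimal sketch using the fitted model above (values below 1 indicate lower churn odds):

# Odds ratios and Wald 95% confidence intervals on the odds-ratio scale.
round(exp(coef(logit_fit)), 3)
round(exp(confint.default(logit_fit)), 3)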
library(class)

kNN needs:
- a numeric design matrix (dummy variables for categorical predictors)
- scaling using train statistics only
x_train <- model.matrix(churn ~ . - 1, data = train)
x_test <- model.matrix(churn ~ . - 1, data = test)
y_train <- train$churn
y_test <- test$churn
# Scale using TRAIN stats
train_center <- apply(x_train, 2, mean)
train_scale <- apply(x_train, 2, sd)
train_scale[train_scale == 0] <- 1
x_train_s <- scale(x_train, center = train_center, scale = train_scale)
x_test_s <- scale(x_test, center = train_center, scale = train_scale)
# TODO: Choose k (try at least two values and record what happens)
k <- 15
knn_pred <- knn(train = x_train_s, test = x_test_s, cl = y_train, k = k, prob = TRUE)
# Convert "winning prob" into prob(Yes) for ROC
winprob <- attr(knn_pred, "prob")
knn_prob_yes <- ifelse(knn_pred == "Yes", winprob, 1 - winprob)
knn_m <- calc_metrics(y_test, knn_pred)
knn_m$cm
## Actual
## Predicted No Yes
## No 83 35
## Yes 2 0
knn_m$accuracy; knn_m$precision; knn_m$recall; knn_m$f1
## [1] 0.6916667
## [1] 0
## [1] 0
## [1] NA
knn_ra <- make_roc_auc(y_test, knn_prob_yes)
knn_ra$auc
## [1] 0.477479
k <- 5
knn_pred <- knn(train = x_train_s,
test = x_test_s,
cl = y_train,
k = k,
prob = TRUE)
winprob <- attr(knn_pred, "prob")
knn_prob_yes <- ifelse(knn_pred == "Yes", winprob, 1 - winprob)
knn_m <- calc_metrics(y_test, knn_pred)
knn_m$cm
## Actual
## Predicted No Yes
## No 75 33
## Yes 10 2
knn_m$accuracy
## [1] 0.6416667
knn_m$precision
## [1] 0.1666667
knn_m$recall
## [1] 0.05714286
knn_m$f1
## [1] 0.08510638
knn_ra <- make_roc_auc(y_test, knn_prob_yes)
knn_ra$auc
## [1] 0.4552941
Checkpoint (1–3 sentences):
- What k did you choose and why? What changed when you tried another k?
Your answer: [I tested k = 15 and k = 5. Reducing k from 15 to 5 slightly increased recall, indicating the model identified a few churners at smaller neighborhood sizes. However, overall performance remained weak, with AUC below 0.50 and very low recall. This suggests that kNN does not effectively capture structure in this dataset and performs worse than logistic regression.]
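To make the k comparison systematic, several values can be scored in one loop; a minimal sketch using the scaled matrices defined above (knn() breaks ties at random, so results can vary slightly between runs):

# Compare a few neighborhood sizes on accuracy and churn recall.
for (k_try in c(3, 5, 9, 15, 25)) {
  pred_k <- knn(train = x_train_s, test = x_test_s, cl = y_train, k = k_try)
  acc <- mean(pred_k == y_test)
  rec <- sum(pred_k == "Yes" & y_test == "Yes") / sum(y_test == "Yes")
  cat(sprintf("k = %2d  accuracy = %.3f  recall = %.3f\n", k_try, acc, rec))
}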
library(e1071)
# NOTE: e1071::naiveBayes can handle factors directly.
nb_fit <- naiveBayes(churn ~ ., data = train)
nb_fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.7321429 0.2678571
##
## Conditional probabilities:
## tenure_months
## Y [,1] [,2]
## No 29.12195 16.64302
## Yes 31.77333 17.47166
##
## monthly_charges
## Y [,1] [,2]
## No 69.16922 20.97912
## Yes 67.75907 22.62381
##
## total_charges
## Y [,1] [,2]
## No 2008.990 1316.509
## Yes 2184.758 1568.824
##
## contract_type
## Y Month-to-month One year Two year
## No 0.3268293 0.3414634 0.3317073
## Yes 0.3600000 0.3066667 0.3333333
##
## online_security
## Y No Yes
## No 0.5707317 0.4292683
## Yes 0.4933333 0.5066667
##
## tech_support
## Y No Yes
## No 0.5073171 0.4926829
## Yes 0.4800000 0.5200000
# Predicted class
nb_pred <- predict(nb_fit, newdata = test, type = "class")
# Predicted probability of "Yes" for ROC
nb_prob <- predict(nb_fit, newdata = test, type = "raw")[, "Yes"]
nb_m <- calc_metrics(test$churn, nb_pred)
nb_m$cm
## Actual
## Predicted No Yes
## No 84 34
## Yes 1 1
nb_m$accuracy; nb_m$precision; nb_m$recall; nb_m$f1
## [1] 0.7083333
## [1] 0.5
## [1] 0.02857143
## [1] 0.05405405
nb_ra <- make_roc_auc(test$churn, nb_prob)
nb_ra$auc
## [1] 0.5663866
Checkpoint (1–3 sentences):
- When might Naive Bayes be a good choice, even if AUC is not the highest?
Your answer: [Naive Bayes achieved the highest AUC among the models tested so far (0.57), indicating slightly better discrimination between churners and non-churners. Although recall remains low at the default 0.50 threshold, the higher AUC suggests the model ranks churn risk more effectively than logistic regression and kNN. Naive Bayes can be a good choice in small datasets with mixed predictor types because it is computationally efficient and often produces stable probability estimates, even if independence assumptions are not fully satisfied.]
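One inexpensive variant worth trying is Laplace (add-one) smoothing, which guards against zero conditional probabilities for sparse categorical levels; a minimal sketch (nb_fit_lap is an illustrative name, and on this dataset the effect may be small):

# Refit with Laplace smoothing and compare test AUC to the unsmoothed fit.
nb_fit_lap <- naiveBayes(churn ~ ., data = train, laplace = 1)
nb_prob_lap <- predict(nb_fit_lap, newdata = test, type = "raw")[, "Yes"]
make_roc_auc(test$churn, nb_prob_lap)$auc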
library(rpart)

tree_fit <- rpart(churn ~ ., data = train, method = "class")
print(tree_fit)
## n= 280
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 280 75 No (0.7321429 0.2678571)
## 2) monthly_charges>=29.45 271 69 No (0.7453875 0.2546125)
## 4) total_charges< 3477.16 230 51 No (0.7782609 0.2217391)
## 8) total_charges>=3165.79 19 0 No (1.0000000 0.0000000) *
## 9) total_charges< 3165.79 211 51 No (0.7582938 0.2417062)
## 18) online_security=No 115 21 No (0.8173913 0.1826087)
## 36) monthly_charges< 77.98 90 12 No (0.8666667 0.1333333) *
## 37) monthly_charges>=77.98 25 9 No (0.6400000 0.3600000)
## 74) monthly_charges>=89.05 16 2 No (0.8750000 0.1250000) *
## 75) monthly_charges< 89.05 9 2 Yes (0.2222222 0.7777778) *
## 19) online_security=Yes 96 30 No (0.6875000 0.3125000)
## 38) monthly_charges>=78.435 27 5 No (0.8148148 0.1851852) *
## 39) monthly_charges< 78.435 69 25 No (0.6376812 0.3623188)
## 78) monthly_charges< 50.545 19 3 No (0.8421053 0.1578947) *
## 79) monthly_charges>=50.545 50 22 No (0.5600000 0.4400000)
## 158) total_charges>=2145.49 17 5 No (0.7058824 0.2941176) *
## 159) total_charges< 2145.49 33 16 Yes (0.4848485 0.5151515)
## 318) tenure_months< 10.5 14 4 No (0.7142857 0.2857143) *
## 319) tenure_months>=10.5 19 6 Yes (0.3157895 0.6842105) *
## 5) total_charges>=3477.16 41 18 No (0.5609756 0.4390244)
## 10) monthly_charges>=84.885 22 5 No (0.7727273 0.2272727) *
## 11) monthly_charges< 84.885 19 6 Yes (0.3157895 0.6842105) *
## 3) monthly_charges< 29.45 9 3 Yes (0.3333333 0.6666667) *
tree_prob <- predict(tree_fit, newdata = test, type = "prob")[, "Yes"]
tree_pred <- ifelse(tree_prob >= 0.50, "Yes", "No")
tree_m <- calc_metrics(test$churn, tree_pred)
tree_m$cm
## Actual
## Predicted No Yes
## No 61 25
## Yes 24 10
tree_m$accuracy; tree_m$precision; tree_m$recall; tree_m$f1
## [1] 0.5916667
## [1] 0.2941176
## [1] 0.2857143
## [1] 0.2898551
tree_ra <- make_roc_auc(test$churn, tree_prob)
tree_ra$auc
## [1] 0.5087395
(Optional) visualize the tree:
# Optional (requires extra package)
# install.packages("rpart.plot")
# library(rpart.plot)
# rpart.plot(tree_fit)
# install.packages("rpart.plot")  # only if not installed
library(rpart.plot)
rpart.plot(tree_fit)
Checkpoint (1–3 sentences):
- What do you gain/lose with a tree vs logistic regression?
Your answer: [A classification tree improves interpretability by providing clear decision rules that are easy to explain to non-technical stakeholders. Unlike logistic regression, the tree can capture nonlinear relationships and interactions automatically. However, trees may sacrifice overall predictive stability and can produce lower AUC compared to probabilistic models. In this case, although the tree has modest overall accuracy, it substantially improves recall compared to the other models.]
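A common refinement for a single tree is cost-complexity pruning at the cp value with the lowest cross-validated error; a minimal sketch using the fitted tree above (tree_pruned is an illustrative name):

# Inspect the complexity table, then prune at the cp minimizing xerror.
printcp(tree_fit)
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)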
results <- data.frame(
Model = c("Logistic", paste0("kNN (k=", k, ")"), "Naive Bayes", "CART Tree"),
Accuracy = c(logit_m$accuracy, knn_m$accuracy, nb_m$accuracy, tree_m$accuracy),
Precision = c(logit_m$precision, knn_m$precision, nb_m$precision, tree_m$precision),
Recall = c(logit_m$recall, knn_m$recall, nb_m$recall, tree_m$recall),
F1 = c(logit_m$f1, knn_m$f1, nb_m$f1, tree_m$f1),
AUC = c(logit_ra$auc, knn_ra$auc, nb_ra$auc, tree_ra$auc)
)
results
## Model Accuracy Precision Recall F1 AUC
## 1 Logistic 0.5833333 0.2857143 0.28571429 0.28571429 0.5415126
## 2 kNN (k=5) 0.6416667 0.1666667 0.05714286 0.08510638 0.4552941
## 3 Naive Bayes 0.7083333 0.5000000 0.02857143 0.05405405 0.5663866
## 4 CART Tree 0.5916667 0.2941176 0.28571429 0.28985507 0.5087395
# Color the curves so they are distinguishable; match colors in the legend.
plot(logit_ra$roc, main = "ROC Curves (Test Set)", legacy.axes = TRUE, col = "black")
plot(knn_ra$roc, add = TRUE, col = "blue")
plot(nb_ra$roc, add = TRUE, col = "darkgreen")
plot(tree_ra$roc, add = TRUE, col = "red")
legend(
  "bottomright",
  legend = c(
    paste0("Logit AUC=", round(logit_ra$auc, 3)),
    paste0("kNN AUC=", round(knn_ra$auc, 3)),
    paste0("NB AUC=", round(nb_ra$auc, 3)),
    paste0("Tree AUC=", round(tree_ra$auc, 3))
  ),
  col = c("black", "blue", "darkgreen", "red"),
  lty = 1,
  bty = "n"
)
Select one model as “best” based on the business goal. Write 1 short paragraph:
- Which metric did you prioritize and why? (e.g., Recall if missing churners is costly)
- Which model did you choose?
Your answer: [Because false negatives are more costly in this business scenario, recall was prioritized as the primary evaluation metric. Although Naive Bayes achieved the highest AUC (0.566), its recall was extremely low (0.029), meaning it failed to identify most churners. The CART Tree achieved the highest recall (0.286), identifying substantially more at-risk customers than the other models. Therefore, the classification tree was selected as the best model because it aligns most closely with the business objective of minimizing missed churn cases while also providing interpretable decision rules.]
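Since the memo below recommends threshold tuning for the selected tree, here is a minimal sketch sweeping candidate cutoffs on the CART probabilities (tree_prob and calc_metrics are defined above; the specific cutoff values are illustrative):

# Lower cutoffs should raise recall at the cost of precision.
for (th in c(0.50, 0.40, 0.30, 0.20)) {
  pred_th <- ifelse(tree_prob >= th, "Yes", "No")
  m <- calc_metrics(test$churn, pred_th)
  cat(sprintf("threshold = %.2f  recall = %.3f  precision = %.3f\n",
              th, m$recall, m$precision))
}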
Pick one improvement path:

Option A (recommended): Bagging
- If your best model is a tree, bagging is a natural choice.
- If your best model is not a tree, you can still bag trees as a “strong alternative” and compare.

Option B (optional): Boosting (advanced)

Option C (optional/bonus): Stacking (advanced)

Below is a bagging template. Fill in the missing pieces.
# BAGGING TEMPLATE (optional advanced): uncomment and complete if you choose bagging
# install.packages("ipred") # if not installed
# library(ipred)
# 1) Fit bagged trees on TRAIN
# bag_fit <- bagging(churn ~ ., data = train, nbagg = <CHOOSE_NUMBER>)
# 2) Predict on TEST (class + prob)
# bag_pred <- predict(bag_fit, newdata = test, type = "class")
# bag_prob <- predict(bag_fit, newdata = test, type = "prob")[, "Yes"]
# 3) Evaluate
# bag_m <- calc_metrics(test$churn, bag_pred)
# bag_ra <- make_roc_auc(test$churn, bag_prob)
# bag_m$cm
# bag_m$accuracy; bag_m$precision; bag_m$recall; bag_m$f1
# bag_ra$auc
install.packages("ipred") # only if not installed
library(ipred)
bag_fit <- bagging(churn ~ ., data = train, nbagg = 50)
bag_pred <- predict(bag_fit, newdata = test, type = "class")
bag_prob <- predict(bag_fit, newdata = test, type = "prob")[, "Yes"]
bag_m <- calc_metrics(test$churn, bag_pred)
bag_ra <- make_roc_auc(test$churn, bag_prob)
bag_m$cm
bag_m$accuracy
bag_m$precision
bag_m$recall
bag_m$f1
bag_ra$auc
Write 3–6 sentences:
- What ensemble method did you use?
- Did it improve your chosen metric vs the single best model?
- Why might it help (or not) on this dataset?
Your answer: [I applied a bagging ensemble using 50 bootstrap samples of classification trees to reduce variance and improve stability. Although bagging increased overall accuracy, it reduced recall compared to the single CART tree (from 0.286 to 0.229) and also lowered AUC. Because the business objective prioritizes identifying churners, the ensemble did not improve the most important metric. This suggests that variance reduction alone does not overcome the limited separability present in this dataset. Therefore, the single CART tree remains the preferred model for this scenario.]
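If the bagged model were kept instead, the same threshold tuning applies to its probabilities; a minimal sketch (the 0.30 cutoff is illustrative, and bag_pred_lo is an illustrative name):

# Lowering the bagged model's cutoff may recover some recall lost at 0.50.
bag_pred_lo <- ifelse(bag_prob >= 0.30, "Yes", "No")
calc_metrics(test$churn, bag_pred_lo)$recall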
Write a memo to a non-technical manager:
**[After evaluating four classification models (Logistic Regression, k-Nearest Neighbors, Naive Bayes, and a Classification Tree), the primary objective was to minimize false negatives—missing customers who are likely to churn—since failing to intervene with at-risk customers is more costly than contacting customers who may not churn.
Among the models tested, the Classification Tree (CART) was selected as the best single model. While its overall accuracy (59%) and AUC (0.51) are modest, it achieved the highest recall (28.6%) among all models. This means it correctly identified nearly 29% of churners, substantially more than the other models. Logistic Regression and Naive Bayes showed slightly higher AUC values but failed to identify churners effectively at the default threshold. Because the business goal prioritizes identifying at-risk customers, recall was the most important metric in model selection.
The classification threshold was kept at 0.50 for consistency across models; however, lowering the threshold could further increase recall if the business is willing to accept more false positives. In retention campaigns, this tradeoff is often acceptable because contacting additional customers is typically less costly than losing them entirely. Therefore, threshold tuning can be used operationally to adjust outreach volume based on available resources.
To improve performance, a bagging ensemble of classification trees was tested. Although bagging slightly increased overall accuracy, it reduced recall (from 28.6% to 22.9%) and lowered AUC. Since the ensemble did not improve the most important metric—recall—it is not recommended over the single classification tree model in this case.
In conclusion, I recommend deploying the Classification Tree model with threshold tuning to prioritize recall. This approach best aligns with the business objective of identifying customers at high risk of churn and enabling proactive retention strategies.]**