library(readr)
week2_churn_data <- read_csv("C:/Users/evely/OneDrive/Desktop/R/week2_churn_data.csv")
View(week2_churn_data)
Dataset: week2_churn_data.csv (customer churn)
Purpose: Build and compare multiple churn classifiers on the same train/test split, select the best model, and then improve it with one ensemble method (bagging by default).
Required (minimum):
- Fit 4 models: Logistic Regression, kNN, Naive Bayes, CART Tree
- Evaluate each with: confusion matrix, Accuracy/Precision/Recall/F1, ROC + AUC
- Select a “best” model for a business use case (explain the metric you prioritize)
- Improve the selected model using ONE of: bagging (recommended), boosting (optional), or stacking (optional/bonus)
What you must write:
- Your predictor selection (which variables you chose and why)
- Your own plots (at least 2) and interpretation
- Your model choice and business justification (cost of FN vs FP)
df <- read.csv("week2_churn_data.csv", stringsAsFactors = FALSE)
# Quick checks
dim(df)
## [1] 400 8
str(df)
## 'data.frame': 400 obs. of 8 variables:
## $ customer_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ tenure_months : int 49 29 35 1 23 31 24 9 3 11 ...
## $ monthly_charges: num 68.6 77.4 80.2 58.4 50.8 ...
## $ contract_type : chr "One year" "One year" "Two year" "One year" ...
## $ online_security: chr "Yes" "Yes" "Yes" "Yes" ...
## $ tech_support : chr "Yes" "Yes" "No" "Yes" ...
## $ churn : chr "Yes" "Yes" "No" "No" ...
## $ total_charges : num 3691 2114.3 2915.4 57.1 1259.1 ...
head(df, 3)
## customer_id tenure_months monthly_charges contract_type online_security
## 1 1 49 68.63 One year Yes
## 2 2 29 77.40 One year Yes
## 3 3 35 80.22 Two year Yes
## tech_support churn total_charges
## 1 Yes Yes 3691.04
## 2 Yes Yes 2114.31
## 3 No No 2915.41
df <- read.csv("C:/Users/evely/OneDrive/Desktop/R/week2_churn_data.csv",
stringsAsFactors = FALSE)
# Quick checks
dim(df)
## [1] 400 8
str(df)
## 'data.frame': 400 obs. of 8 variables:
## $ customer_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ tenure_months : int 49 29 35 1 23 31 24 9 3 11 ...
## $ monthly_charges: num 68.6 77.4 80.2 58.4 50.8 ...
## $ contract_type : chr "One year" "One year" "Two year" "One year" ...
## $ online_security: chr "Yes" "Yes" "Yes" "Yes" ...
## $ tech_support : chr "Yes" "Yes" "No" "Yes" ...
## $ churn : chr "Yes" "Yes" "No" "No" ...
## $ total_charges : num 3691 2114.3 2915.4 57.1 1259.1 ...
head(df, 3)
## customer_id tenure_months monthly_charges contract_type online_security
## 1 1 49 68.63 One year Yes
## 2 2 29 77.40 One year Yes
## 3 3 35 80.22 Two year Yes
## tech_support churn total_charges
## 1 Yes Yes 3691.04
## 2 Yes Yes 2114.31
## 3 No No 2915.41
# 1) Confirm the outcome column name:
# This template assumes the outcome is named 'churn' with values like Yes/No.
# If your file uses a different name, rename it here.
# df$churn <- df$<RENAME_ME>
# 2) Force churn into a 2-level factor with levels c("No","Yes")
df$churn <- factor(df$churn, levels = c("No","Yes"))
table(df$churn)
##
## No Yes
## 290 110
# 3) Optional: drop ID-like columns (uncomment as needed)
# df$customer_id <- NULL
# df$CustomerID <- NULL
Checkpoint (write 2–4 sentences):
- What is the business question?
- What is the churn rate in the full dataset?

Your answer:
- Business question: [The business question is whether we can predict which customers are likely to churn based on their contract type, tenure, service features, and billing characteristics. Accurately identifying likely churners allows the company to implement targeted retention strategies to reduce customer loss.]
- Churn rate: [The overall churn rate in the dataset is 27.5%, meaning that 110 out of 400 customers have churned. Since churn represents the minority class, evaluation metrics beyond simple accuracy (such as recall and AUC) will be important when comparing models.]
You may not automatically use all variables. Choose 4–8 predictors that make business sense for churn. You may include both numeric and categorical predictors.

Write your predictor list here: tenure_months, monthly_charges, total_charges, contract_type, online_security, tech_support.
# TODO: Replace the vector below with your chosen predictor column names
predictor_cols <- c(
# "tenure_months",
# "monthly_charges",
# "contract_type"
)
# Choose predictors (4–8 required)
predictor_cols <- c(
"tenure_months",
"monthly_charges",
"total_charges",
"contract_type",
"online_security",
"tech_support"
)
# Keep outcome + predictors only
keep_cols <- c("churn", predictor_cols)
df2 <- df[, keep_cols]
# Confirm structure
str(df2)
## 'data.frame': 400 obs. of 7 variables:
## $ churn : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
## $ tenure_months : int 49 29 35 1 23 31 24 9 3 11 ...
## $ monthly_charges: num 68.6 77.4 80.2 58.4 50.8 ...
## $ total_charges : num 3691 2114.3 2915.4 57.1 1259.1 ...
## $ contract_type : chr "One year" "One year" "Two year" "One year" ...
## $ online_security: chr "Yes" "Yes" "Yes" "Yes" ...
## $ tech_support : chr "Yes" "Yes" "No" "Yes" ...
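# NOTE: no set.seed() call precedes the sampling below, so the exact split
# (and all downstream results) will vary between runs; fixing a seed first,
# e.g. set.seed(123), would make the split reproducible.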
n <- nrow(df2)
train_size <- floor(0.70 * n)
train_idx <- sample(1:n, size = train_size, replace = FALSE)
train <- df2[train_idx, ]
test <- df2[-train_idx, ]
c(nrow(train), nrow(test))
## [1] 280 120
prop.table(table(train$churn))
##
## No Yes
## 0.7321429 0.2678571
prop.table(table(test$churn))
##
## No Yes
## 0.7083333 0.2916667
Checkpoint (1–2 sentences):
- Are the churn rates similar in train vs test?
Your answer: [Yes, the churn rates are similar between the training and test sets. The training set has a churn rate of approximately 26.8%, while the test set has a churn rate of approximately 29.2%. These proportions are close to the overall churn rate of 27.5%, indicating that the 70/30 split preserved class balance sufficiently for model evaluation.]
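If preserving the class balance exactly mattered more, a stratified split could be used instead of a simple random split. A minimal sketch under that assumption (the names train_strat and test_strat are illustrative, not part of the template):

# Stratified 70/30 split: sample 70% within each churn level separately,
# so train and test churn rates match the overall 27.5% almost exactly.
idx_no  <- which(df2$churn == "No")
idx_yes <- which(df2$churn == "Yes")
strat_idx <- c(sample(idx_no,  floor(0.70 * length(idx_no))),
               sample(idx_yes, floor(0.70 * length(idx_yes))))
train_strat <- df2[strat_idx, ]
test_strat  <- df2[-strat_idx, ]
prop.table(table(train_strat$churn))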
# Confusion matrix plus threshold-based metrics for a two-class outcome.
calc_metrics <- function(actual, pred_class, positive = "Yes") {
  actual <- factor(actual, levels = c("No","Yes"))
  pred_class <- factor(pred_class, levels = c("No","Yes"))
  cm <- table(Predicted = pred_class, Actual = actual)
  negative <- setdiff(levels(actual), positive)
  TP <- cm[positive, positive]
  TN <- cm[negative, negative]
  FP <- cm[positive, negative]  # flagged as churn but did not churn
  FN <- cm[negative, positive]  # missed churner (the costly case here)
  accuracy <- (TP + TN) / sum(cm)
  precision <- if ((TP + FP) == 0) NA else TP / (TP + FP)
  recall <- if ((TP + FN) == 0) NA else TP / (TP + FN)
  f1 <- if (is.na(precision) || is.na(recall) || (precision + recall) == 0) NA else 2 * precision * recall / (precision + recall)
  list(cm = cm, accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}
library(pROC)

# ROC curve and AUC from predicted probabilities of the positive class.
make_roc_auc <- function(actual, prob_yes) {
  actual <- factor(actual, levels = c("No","Yes"))
  roc_obj <- roc(response = actual, predictor = prob_yes,
                 levels = c("No","Yes"), direction = "<")
  auc_val <- as.numeric(auc(roc_obj))
  list(roc = roc_obj, auc = auc_val)
}
Create at least two plots that help you understand churn risk. Each plot must include 1–3 sentences explaining what it suggests.
# TODO: Choose one predictor and plot churn rate by group (if categorical)
# Example ideas:
# - barplot of churn proportion by contract type
# - boxplot of numeric predictor split by churn
#
# PLOT 1 CODE HERE
# install.packages("ggplot2")  # only if not installed
library(ggplot2)
library(dplyr)
library(scales)

plot1_df <- train %>%
  group_by(contract_type) %>%
  summarise(churn_rate = mean(churn == "Yes"), n = n(), .groups = "drop")

p1 <- ggplot(plot1_df, aes(x = contract_type, y = churn_rate)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(title = "Churn Rate by Contract Type (Train Set)",
       x = "Contract Type", y = "Churn Rate") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))
print(p1)
Interpretation (1–3 sentences): [Churn rates vary across contract types. Month-to-month customers exhibit the highest churn rate (approximately 29%), while one-year contracts show the lowest churn (around 25%). This suggests that shorter-term commitments are associated with higher churn risk, indicating that encouraging customers to move to longer contracts may improve retention.]
# TODO: A second plot, different from Plot 1
# PLOT 2 CODE HERE
p2 <- ggplot(train, aes(x = churn, y = tenure_months)) +
  geom_boxplot() +
  labs(title = "Tenure (Months) by Churn Status (Train Set)",
       x = "Churn", y = "Tenure (Months)")
print(p2)
Interpretation (1–3 sentences): [The tenure distributions for churned and non-churned customers are fairly similar, with substantial overlap between the two groups. Although churned customers appear to have a slightly higher median tenure, the difference is small, suggesting that tenure alone may not strongly predict churn in this dataset. This indicates that churn risk may depend more on service features or contract characteristics rather than tenure alone.]
# Logistic regression (uses selected predictors only)
logit_fit <- glm(churn ~ ., data = train, family = binomial)
summary(logit_fit)
##
## Call:
## glm(formula = churn ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7455106 0.9410354 -0.792 0.428
## tenure_months -0.0077653 0.0251454 -0.309 0.757
## monthly_charges -0.0107799 0.0127768 -0.844 0.399
## total_charges 0.0002629 0.0003492 0.753 0.452
## contract_typeOne year -0.1632128 0.3354726 -0.487 0.627
## contract_typeTwo year -0.0316565 0.3331122 -0.095 0.924
## online_securityYes 0.3029056 0.2731589 1.109 0.267
## tech_supportYes 0.1669080 0.2762099 0.604 0.546
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 325.42 on 279 degrees of freedom
## Residual deviance: 321.38 on 272 degrees of freedom
## AIC: 337.38
##
## Number of Fisher Scoring iterations: 4
logit_prob <- predict(logit_fit, newdata = test, type = "response")
# TODO: Choose a threshold (start with 0.50). You may adjust later.
threshold <- 0.50
logit_pred <- ifelse(logit_prob >= threshold, "Yes", "No")
logit_m <- calc_metrics(test$churn, logit_pred)
logit_m$cm
## Actual
## Predicted No Yes
## No 85 35
## Yes 0 0
logit_m$accuracy; logit_m$precision; logit_m$recall; logit_m$f1
## [1] 0.7083333
## [1] NA
## [1] 0
## [1] NA
logit_ra <- make_roc_auc(test$churn, logit_prob)
logit_ra$auc
## [1] 0.5415126
# Retry with a lower threshold to trade precision for recall.
threshold <- 0.30
logit_pred <- ifelse(logit_prob >= threshold, "Yes", "No")
logit_m <- calc_metrics(test$churn, logit_pred)
logit_m$cm
## Actual
## Predicted No Yes
## No 60 25
## Yes 25 10
logit_m$accuracy; logit_m$precision; logit_m$recall; logit_m$f1
## [1] 0.5833333
## [1] 0.2857143
## [1] 0.2857143
## [1] 0.2857143
plot(logit_ra$roc, main = "ROC — Logistic Regression")
Checkpoint (2–4 sentences):
- Which predictors look influential (direction/sign), and does that match your EDA?
Your answer: [In the logistic regression model, none of the predictors appear statistically significant, indicating limited separation between churners and non-churners in this dataset. Contract type shows negative coefficients for one-year and two-year contracts relative to month-to-month, suggesting slightly lower churn risk for longer contracts, which aligns with the EDA results. Tenure also has a small negative effect, indicating that longer-tenured customers may be slightly less likely to churn, although the effect is weak. Overall, the direction of contract type matches the EDA findings, but the model suggests that these predictors have limited explanatory power.]
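As a quick check on coefficient direction, exponentiating the coefficients converts them to odds ratios; a minimal sketch using the fitted model above (values below 1 indicate lower churn odds):

# Odds ratios and Wald 95% confidence intervals on the odds-ratio scale.
round(exp(coef(logit_fit)), 3)
round(exp(confint.default(logit_fit)), 3)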
library(class)

kNN needs:
- a numeric design matrix (dummy variables for categorical predictors)
- scaling using train statistics only
x_train <- model.matrix(churn ~ . - 1, data = train)
x_test <- model.matrix(churn ~ . - 1, data = test)
y_train <- train$churn
y_test <- test$churn
# Scale using TRAIN stats
train_center <- apply(x_train, 2, mean)
train_scale <- apply(x_train, 2, sd)
train_scale[train_scale == 0] <- 1
x_train_s <- scale(x_train, center = train_center, scale = train_scale)
x_test_s <- scale(x_test, center = train_center, scale = train_scale)
# TODO: Choose k (try at least two values and record what happens)
k <- 15
knn_pred <- knn(train = x_train_s, test = x_test_s, cl = y_train, k = k, prob = TRUE)
# Convert "winning prob" into prob(Yes) for ROC
winprob <- attr(knn_pred, "prob")
knn_prob_yes <- ifelse(knn_pred == "Yes", winprob, 1 - winprob)
knn_m <- calc_metrics(y_test, knn_pred)
knn_m$cm
## Actual
## Predicted No Yes
## No 83 35
## Yes 2 0
knn_m$accuracy; knn_m$precision; knn_m$recall; knn_m$f1
## [1] 0.6916667
## [1] 0
## [1] 0
## [1] NA
knn_ra <- make_roc_auc(y_test, knn_prob_yes)
knn_ra$auc
## [1] 0.477479
k <- 5
knn_pred <- knn(train = x_train_s,
test = x_test_s,
cl = y_train,
k = k,
prob = TRUE)
winprob <- attr(knn_pred, "prob")
knn_prob_yes <- ifelse(knn_pred == "Yes", winprob, 1 - winprob)
knn_m <- calc_metrics(y_test, knn_pred)
knn_m$cm
## Actual
## Predicted No Yes
## No 75 33
## Yes 10 2
knn_m$accuracy
## [1] 0.6416667
knn_m$precision
## [1] 0.1666667
knn_m$recall
## [1] 0.05714286
knn_m$f1
## [1] 0.08510638
knn_ra <- make_roc_auc(y_test, knn_prob_yes)
knn_ra$auc
## [1] 0.4552941
Checkpoint (1–3 sentences):
- What k did you choose and why? What changed when you tried another k?
Your answer: [I tested k = 15 and k = 5. Reducing k from 15 to 5 slightly increased recall, indicating the model identified a few churners at smaller neighborhood sizes. However, overall performance remained weak, with AUC below 0.50 and very low recall. This suggests that kNN does not effectively capture structure in this dataset and performs worse than logistic regression.]
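To make the k comparison systematic, several values can be scored in one loop; a minimal sketch using the scaled matrices defined above (knn() breaks ties at random, so results can vary slightly between runs):

# Compare a few neighborhood sizes on accuracy and churn recall.
for (k_try in c(3, 5, 9, 15, 25)) {
  pred_k <- knn(train = x_train_s, test = x_test_s, cl = y_train, k = k_try)
  acc <- mean(pred_k == y_test)
  rec <- sum(pred_k == "Yes" & y_test == "Yes") / sum(y_test == "Yes")
  cat(sprintf("k = %2d  accuracy = %.3f  recall = %.3f\n", k_try, acc, rec))
}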
library(e1071)
# NOTE: e1071::naiveBayes can handle factors directly.
nb_fit <- naiveBayes(churn ~ ., data = train)
nb_fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.7321429 0.2678571
##
## Conditional probabilities:
## tenure_months
## Y [,1] [,2]
## No 29.12195 16.64302
## Yes 31.77333 17.47166
##
## monthly_charges
## Y [,1] [,2]
## No 69.16922 20.97912
## Yes 67.75907 22.62381
##
## total_charges
## Y [,1] [,2]
## No 2008.990 1316.509
## Yes 2184.758 1568.824
##
## contract_type
## Y Month-to-month One year Two year
## No 0.3268293 0.3414634 0.3317073
## Yes 0.3600000 0.3066667 0.3333333
##
## online_security
## Y No Yes
## No 0.5707317 0.4292683
## Yes 0.4933333 0.5066667
##
## tech_support
## Y No Yes
## No 0.5073171 0.4926829
## Yes 0.4800000 0.5200000
# Predicted class
nb_pred <- predict(nb_fit, newdata = test, type = "class")
# Predicted probability of "Yes" for ROC
nb_prob <- predict(nb_fit, newdata = test, type = "raw")[, "Yes"]
nb_m <- calc_metrics(test$churn, nb_pred)
nb_m$cm
## Actual
## Predicted No Yes
## No 84 34
## Yes 1 1
nb_m$accuracy; nb_m$precision; nb_m$recall; nb_m$f1
## [1] 0.7083333
## [1] 0.5
## [1] 0.02857143
## [1] 0.05405405
nb_ra <- make_roc_auc(test$churn, nb_prob)
nb_ra$auc
## [1] 0.5663866
Checkpoint (1–3 sentences):
- When might Naive Bayes be a good choice, even if AUC is not the highest?
Your answer: [Naive Bayes achieved the highest AUC among the models tested so far (0.57), indicating slightly better discrimination between churners and non-churners. Although recall remains low at the default 0.50 threshold, the higher AUC suggests the model ranks churn risk more effectively than logistic regression and kNN. Naive Bayes can be a good choice in small datasets with mixed predictor types because it is computationally efficient and often produces stable probability estimates, even if independence assumptions are not fully satisfied.]
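One inexpensive variant worth trying is Laplace (add-one) smoothing, which guards against zero conditional probabilities for sparse categorical levels; a minimal sketch (nb_fit_lap is an illustrative name, and on this dataset the effect may be small):

# Refit with Laplace smoothing and compare test AUC to the unsmoothed fit.
nb_fit_lap <- naiveBayes(churn ~ ., data = train, laplace = 1)
nb_prob_lap <- predict(nb_fit_lap, newdata = test, type = "raw")[, "Yes"]
make_roc_auc(test$churn, nb_prob_lap)$auc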
library(rpart)

tree_fit <- rpart(churn ~ ., data = train, method = "class")
print(tree_fit)
## n= 280
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 280 75 No (0.7321429 0.2678571)
## 2) monthly_charges>=29.45 271 69 No (0.7453875 0.2546125)
## 4) total_charges< 3477.16 230 51 No (0.7782609 0.2217391)
## 8) total_charges>=3165.79 19 0 No (1.0000000 0.0000000) *
## 9) total_charges< 3165.79 211 51 No (0.7582938 0.2417062)
## 18) online_security=No 115 21 No (0.8173913 0.1826087)
## 36) monthly_charges< 77.98 90 12 No (0.8666667 0.1333333) *
## 37) monthly_charges>=77.98 25 9 No (0.6400000 0.3600000)
## 74) monthly_charges>=89.05 16 2 No (0.8750000 0.1250000) *
## 75) monthly_charges< 89.05 9 2 Yes (0.2222222 0.7777778) *
## 19) online_security=Yes 96 30 No (0.6875000 0.3125000)
## 38) monthly_charges>=78.435 27 5 No (0.8148148 0.1851852) *
## 39) monthly_charges< 78.435 69 25 No (0.6376812 0.3623188)
## 78) monthly_charges< 50.545 19 3 No (0.8421053 0.1578947) *
## 79) monthly_charges>=50.545 50 22 No (0.5600000 0.4400000)
## 158) total_charges>=2145.49 17 5 No (0.7058824 0.2941176) *
## 159) total_charges< 2145.49 33 16 Yes (0.4848485 0.5151515)
## 318) tenure_months< 10.5 14 4 No (0.7142857 0.2857143) *
## 319) tenure_months>=10.5 19 6 Yes (0.3157895 0.6842105) *
## 5) total_charges>=3477.16 41 18 No (0.5609756 0.4390244)
## 10) monthly_charges>=84.885 22 5 No (0.7727273 0.2272727) *
## 11) monthly_charges< 84.885 19 6 Yes (0.3157895 0.6842105) *
## 3) monthly_charges< 29.45 9 3 Yes (0.3333333 0.6666667) *
tree_prob <- predict(tree_fit, newdata = test, type = "prob")[, "Yes"]
tree_pred <- ifelse(tree_prob >= 0.50, "Yes", "No")
tree_m <- calc_metrics(test$churn, tree_pred)
tree_m$cm
## Actual
## Predicted No Yes
## No 61 25
## Yes 24 10
tree_m$accuracy; tree_m$precision; tree_m$recall; tree_m$f1
## [1] 0.5916667
## [1] 0.2941176
## [1] 0.2857143
## [1] 0.2898551
tree_ra <- make_roc_auc(test$churn, tree_prob)
tree_ra$auc
## [1] 0.5087395
(Optional) visualize the tree:
# Optional (requires extra package)
# install.packages("rpart.plot")
# library(rpart.plot)
# rpart.plot(tree_fit)
# install.packages("rpart.plot")  # only if not installed
library(rpart.plot)
rpart.plot(tree_fit)
Checkpoint (1–3 sentences):
- What do you gain/lose with a tree vs logistic regression?
Your answer: [A classification tree improves interpretability by providing clear decision rules that are easy to explain to non-technical stakeholders. Unlike logistic regression, the tree can capture nonlinear relationships and interactions automatically. However, trees may sacrifice overall predictive stability and can produce lower AUC compared to probabilistic models. In this case, although the tree has modest overall accuracy, it substantially improves recall compared to the other models.]
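A common refinement for a single tree is cost-complexity pruning at the cp value with the lowest cross-validated error; a minimal sketch using the fitted tree above (tree_pruned is an illustrative name):

# Inspect the complexity table, then prune at the cp minimizing xerror.
printcp(tree_fit)
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)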
results <- data.frame(
Model = c("Logistic", paste0("kNN (k=", k, ")"), "Naive Bayes", "CART Tree"),
Accuracy = c(logit_m$accuracy, knn_m$accuracy, nb_m$accuracy, tree_m$accuracy),
Precision = c(logit_m$precision, knn_m$precision, nb_m$precision, tree_m$precision),
Recall = c(logit_m$recall, knn_m$recall, nb_m$recall, tree_m$recall),
F1 = c(logit_m$f1, knn_m$f1, nb_m$f1, tree_m$f1),
AUC = c(logit_ra$auc, knn_ra$auc, nb_ra$auc, tree_ra$auc)
)
results
## Model Accuracy Precision Recall F1 AUC
## 1 Logistic 0.5833333 0.2857143 0.28571429 0.28571429 0.5415126
## 2 kNN (k=5) 0.6416667 0.1666667 0.05714286 0.08510638 0.4552941
## 3 Naive Bayes 0.7083333 0.5000000 0.02857143 0.05405405 0.5663866
## 4 CART Tree 0.5916667 0.2941176 0.28571429 0.28985507 0.5087395
# Color the curves so they are distinguishable; match colors in the legend.
plot(logit_ra$roc, main = "ROC Curves (Test Set)", legacy.axes = TRUE, col = "black")
plot(knn_ra$roc, add = TRUE, col = "blue")
plot(nb_ra$roc, add = TRUE, col = "darkgreen")
plot(tree_ra$roc, add = TRUE, col = "red")
legend(
  "bottomright",
  legend = c(
    paste0("Logit AUC=", round(logit_ra$auc, 3)),
    paste0("kNN AUC=", round(knn_ra$auc, 3)),
    paste0("NB AUC=", round(nb_ra$auc, 3)),
    paste0("Tree AUC=", round(tree_ra$auc, 3))
  ),
  col = c("black", "blue", "darkgreen", "red"),
  lty = 1,
  bty = "n"
)
Select one model as “best” based on the business goal. Write 1 short paragraph:
- Which metric did you prioritize and why? (e.g., Recall if missing churners is costly)
- Which model did you choose?
Your answer: [Because false negatives are more costly in this business scenario, recall was prioritized as the primary evaluation metric. Although Naive Bayes achieved the highest AUC (0.566), its recall was extremely low (0.029), meaning it failed to identify most churners. The CART Tree achieved the highest recall (0.286), identifying substantially more at-risk customers than the other models. Therefore, the classification tree was selected as the best model because it aligns most closely with the business objective of minimizing missed churn cases while also providing interpretable decision rules.]
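Since the memo below recommends threshold tuning for the selected tree, here is a minimal sketch sweeping candidate cutoffs on the CART probabilities (tree_prob and calc_metrics are defined above; the specific cutoff values are illustrative):

# Lower cutoffs should raise recall at the cost of precision.
for (th in c(0.50, 0.40, 0.30, 0.20)) {
  pred_th <- ifelse(tree_prob >= th, "Yes", "No")
  m <- calc_metrics(test$churn, pred_th)
  cat(sprintf("threshold = %.2f  recall = %.3f  precision = %.3f\n",
              th, m$recall, m$precision))
}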
Pick one improvement path:

Option A (recommended): Bagging
- If your best model is a tree, bagging is a natural choice.
- If your best model is not a tree, you can still bag trees as a “strong alternative” and compare.

Option B (optional): Boosting (advanced)

Option C (optional/bonus): Stacking (advanced)

Below is a bagging template. Fill in the missing pieces.
# BAGGING TEMPLATE (optional advanced): uncomment and complete if you choose bagging
# install.packages("ipred") # if not installed
# library(ipred)
# 1) Fit bagged trees on TRAIN
# bag_fit <- bagging(churn ~ ., data = train, nbagg = <CHOOSE_NUMBER>)
# 2) Predict on TEST (class + prob)
# bag_pred <- predict(bag_fit, newdata = test, type = "class")
# bag_prob <- predict(bag_fit, newdata = test, type = "prob")[, "Yes"]
# 3) Evaluate
# bag_m <- calc_metrics(test$churn, bag_pred)
# bag_ra <- make_roc_auc(test$churn, bag_prob)
# bag_m$cm
# bag_m$accuracy; bag_m$precision; bag_m$recall; bag_m$f1
# bag_ra$auc
install.packages("ipred") # only if not installed
library(ipred)
bag_fit <- bagging(churn ~ ., data = train, nbagg = 50)
bag_pred <- predict(bag_fit, newdata = test, type = "class")
bag_prob <- predict(bag_fit, newdata = test, type = "prob")[, "Yes"]
bag_m <- calc_metrics(test$churn, bag_pred)
bag_ra <- make_roc_auc(test$churn, bag_prob)
bag_m$cm
bag_m$accuracy
bag_m$precision
bag_m$recall
bag_m$f1
bag_ra$auc
Write 3–6 sentences:
- What ensemble method did you use?
- Did it improve your chosen metric vs the single best model?
- Why might it help (or not) on this dataset?
Your answer: [I applied a bagging ensemble using 50 bootstrap samples of classification trees to reduce variance and improve stability. Although bagging increased overall accuracy, it reduced recall compared to the single CART tree (from 0.286 to 0.229) and also lowered AUC. Because the business objective prioritizes identifying churners, the ensemble did not improve the most important metric. This suggests that variance reduction alone does not overcome the limited separability present in this dataset. Therefore, the single CART tree remains the preferred model for this scenario.]
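If the bagged model were kept instead, the same threshold tuning applies to its probabilities; a minimal sketch (the 0.30 cutoff is illustrative, and bag_pred_lo is an illustrative name):

# Lowering the bagged model's cutoff may recover some recall lost at 0.50.
bag_pred_lo <- ifelse(bag_prob >= 0.30, "Yes", "No")
calc_metrics(test$churn, bag_pred_lo)$recall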
Write a memo to a non-technical manager:
**[After evaluating four classification models (Logistic Regression, k-Nearest Neighbors, Naive Bayes, and a Classification Tree), the primary objective was to minimize false negatives—missing customers who are likely to churn—since failing to intervene with at-risk customers is more costly than contacting customers who may not churn.
Among the models tested, the Classification Tree (CART) was selected as the best single model. While its overall accuracy (59%) and AUC (0.51) are modest, it achieved the highest recall (28.6%) among all models. This means it correctly identified nearly 29% of churners, substantially more than the other models. Logistic Regression and Naive Bayes showed slightly higher AUC values but failed to identify churners effectively at the default threshold. Because the business goal prioritizes identifying at-risk customers, recall was the most important metric in model selection.
The classification threshold was kept at 0.50 for consistency across models; however, lowering the threshold could further increase recall if the business is willing to accept more false positives. In retention campaigns, this tradeoff is often acceptable because contacting additional customers is typically less costly than losing them entirely. Therefore, threshold tuning can be used operationally to adjust outreach volume based on available resources.
To improve performance, a bagging ensemble of classification trees was tested. Although bagging slightly increased overall accuracy, it reduced recall (from 28.6% to 22.9%) and lowered AUC. Since the ensemble did not improve the most important metric—recall—it is not recommended over the single classification tree model in this case.
In conclusion, I recommend deploying the Classification Tree model with threshold tuning to prioritize recall. This approach best aligns with the business objective of identifying customers at high risk of churn and enabling proactive retention strategies.]**