library(tidyverse)
library(caret)
library(ROCR)
library(pROC)
library(knitr)
library(kableExtra)
Assignment 3 - ROC/AUC and Threshold Optimization with kNN
MBAN 5560 - Due March 9, 2026 (Sunday) 11:59pm
Background
Each row in the dataset represents one customer. The data tracks services each customer has signed up for, account details, and demographic information. The key variables include:
- Services: phone, internet, online security, online backup, device protection, tech support, streaming TV and movies
- Account information: tenure (months as a customer), contract type, payment method, paperless billing, monthly charges, total charges
- Demographics: gender, age range, whether they have partners and dependents
The business objective is to predict customer behavior in order to retain customers. By identifying which customers are likely to churn before they leave, the telecom company can develop focused, data-driven retention programs — proactively reaching out with targeted offers, service upgrades, or incentives to the customers most at risk.
Your Task
In this assignment, you will explore why accuracy is a misleading metric for imbalanced classification problems, and how ROC/AUC and threshold optimization provide a more complete picture of model performance. You will use the Telco Customer Churn dataset, which has a natural class imbalance (~73% non-churners, ~27% churners).
Important Notes:
- You can team up with two classmates for this assignment (maximum 3 students per team). Submit one assignment per team.
- Use R and Quarto for your analysis. Submit the rendered HTML file along with the QMD source file.
- Make sure your code runs without errors and produces the expected outputs.
- DO NOT use train() for hyperparameter tuning; implement your own grid search with bootstrap validation.
- Provide interpretations and explanations for your results, not just code outputs.
- Using LLM assistance is allowed, but you must disclose which tool you used and how it helped.
Dataset:
The Telco Customer Churn dataset (WA_Fn-UseC_-Telco-Customer-Churn.csv) is available in the Week4/Assignment/ folder of the course repository.
⚠️ Runtime Note: The bootstrap tuning loop in Part 2 can take 5–10 minutes to complete. The cache=TRUE option means subsequent renders will be fast; only the first run is slow.
Setup: Data Preparation
Run the code below to load and preprocess the data. This section is provided for you — no answers required here.
churn_raw <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn1(1).csv")
churn_raw <- churn_raw[, -1]
churn_raw$TotalCharges <- as.numeric(as.character(churn_raw$TotalCharges))
churn_raw <- churn_raw[complete.cases(churn_raw), ]
telco_vars <- c(
"Churn","tenure","MonthlyCharges","TotalCharges",
"Contract","InternetService","PaymentMethod"
)
churn_data <- churn_raw[, telco_vars]
churn_data$Churn <- factor(churn_data$Churn, levels=c("No","Yes"))
churn_data$Contract <- as.factor(churn_data$Contract)
churn_data$InternetService <- as.factor(churn_data$InternetService)
churn_data$PaymentMethod <- as.factor(churn_data$PaymentMethod)
num_preds <- c("tenure","MonthlyCharges","TotalCharges")
pre <- preProcess(churn_data[,num_preds], method=c("center","scale"))
churn_scaled <- predict(pre, churn_data)
set.seed(42)
train_idx <- createDataPartition(churn_scaled$Churn, p=.8, list=FALSE)
train_data <- churn_scaled[train_idx,]
test_data <- churn_scaled[-train_idx,]
round(prop.table(table(train_data$Churn)),3)
No Yes
0.734 0.266
Part 1: Why Accuracy Fails — Motivating AUC (25 points)
1.1 The Accuracy Trap (10 points)
Fit a kNN classifier with k = 15 on the training set and predict class labels on the test set using the default 0.5 threshold (type = "class"). Report accuracy, sensitivity (recall for “Yes”/churn), and specificity.
Then, calculate the accuracy a naïve classifier would achieve if it simply predicted “No” for every observation.
knn_k15 <- knn3(Churn~., data=train_data, k=15)
pred_class_15 <- predict(knn_k15, newdata=test_data, type="class")
cm_k15 <- confusionMatrix(pred_class_15, test_data$Churn, positive="Yes")
acc_k15 <- cm_k15$overall["Accuracy"]
sens_k15 <- cm_k15$byClass["Sensitivity"]
spec_k15 <- cm_k15$byClass["Specificity"]
naive_pred <- factor(rep("No", nrow(test_data)), levels=c("No","Yes"))
cm_naive <- confusionMatrix(naive_pred, test_data$Churn, positive="Yes")
naive_acc <- cm_naive$overall["Accuracy"]
acc_k15
Accuracy
0.7814947
sens_k15
Sensitivity
0.5013405
spec_k15
Specificity
0.8827519
naive_acc
Accuracy
0.7345196
Question 1 (10 points): Report the kNN accuracy, sensitivity, and specificity. Also report the naïve classifier accuracy. What does this comparison reveal about using accuracy as a performance metric for imbalanced classification problems?
Your Answer:
The kNN model with k = 15 achieved an accuracy of 0.781, sensitivity of 0.501, and specificity of 0.883. By comparison, the naïve classifier that predicts “No” for every customer still achieved an accuracy of 0.735.
This comparison shows why accuracy is a weak standalone metric for imbalanced classification. Because most customers in this dataset do not churn, a model can obtain a fairly high accuracy simply by favoring the majority class. The naïve classifier illustrates this directly: it can look acceptable by accuracy alone while completely failing to identify any actual churners. In contrast, the kNN model improves discrimination by identifying some true churners, which is reflected in its nonzero sensitivity. For churn prediction, missing true churners is costly, so model evaluation should go beyond accuracy and consider sensitivity, specificity, and especially threshold-independent measures such as AUC.
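The comparison can also be checked by hand from confusion-matrix counts. The sketch below uses base R only. The helper name `rates_from_counts` is hypothetical, and the kNN counts are back-calculated from the rates reported above (1,405 test rows, 373 of them churners), so treat them as illustrative rather than as output from the model.

```r
# Hypothetical helper (not part of caret): rates from confusion-matrix counts
rates_from_counts <- function(tp, fn, fp, tn) {
  c(accuracy    = (tp + tn) / (tp + fn + fp + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# kNN (k = 15): counts back-calculated from the reported rates (assumption)
knn_rates <- rates_from_counts(tp = 187, fn = 186, fp = 121, tn = 911)

# Naive classifier: predicts "No" for every customer, so it never finds a churner
naive_rates <- rates_from_counts(tp = 0, fn = 373, fp = 0, tn = 1032)

print(round(rbind(kNN = knn_rates, naive = naive_rates), 3))
```

The naive row shows the trap directly: accuracy stays near the majority-class share while sensitivity is exactly zero.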
1.2 ROC Curve and AUC (15 points)
Now use knn3 with type = "prob" to obtain predicted probabilities for the “Yes” (churn) class. Use the ROCR package to plot the ROC curve and compute the AUC.
phat_15 <- predict(knn_k15, newdata=test_data, type="prob")
pred_rocr_15 <- prediction(phat_15[,2], test_data$Churn=="Yes")
perf_roc_15 <- performance(pred_rocr_15,"tpr","fpr")
perf_auc_15 <- performance(pred_rocr_15,"auc")
auc_15 <- as.numeric(perf_auc_15@y.values[[1]])
plot(perf_roc_15,
     main="ROC Curve for kNN (k=15)",
     col="steelblue",
     lwd=2)
abline(a=0,b=1,lty=2,col="gray")
auc_15
[1] 0.8183334
Question 2 (15 points): Report the AUC value. Interpret it in plain language — what does an AUC of this magnitude tell you about the model’s ability to separate churners from non-churners? How does the ROC curve compare to the no-discrimination diagonal? What would a perfect ROC curve look like?
Your Answer:
The AUC for the k = 15 model is 0.818. In plain language, this means the model has a reasonably strong ability to separate churners from non-churners. Equivalently, if we randomly choose one churner and one non-churner, the model will assign a higher predicted churn probability to the churner about 81.8% of the time.
The ROC curve lies meaningfully above the no-discrimination diagonal, which indicates that the model performs better than random guessing across a range of thresholds. A perfect ROC curve would move vertically from the origin to the point (0,1), then horizontally across the top of the plot to (1,1). That shape would imply a classifier with 100% sensitivity and 100% specificity.
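The pairwise reading of AUC used above can be verified numerically. This is a minimal base R sketch on synthetic scores (not the churn data): it computes AUC as the share of positive-negative pairs ranked correctly, then again with the rank-based (Mann-Whitney) formula, and the two should agree.

```r
set.seed(1)
# Synthetic scores for illustration only (assumption: normal scores, shifted for positives)
pos <- rnorm(200, mean = 1)   # scores for the positive class
neg <- rnorm(500, mean = 0)   # scores for the negative class

# Pairwise definition: P(positive score > negative score), ties count one half
cmp <- outer(pos, neg, "-")
auc_pairs <- mean((cmp > 0) + 0.5 * (cmp == 0))

# Rank-based (Mann-Whitney U) form of the same quantity
r <- rank(c(pos, neg))
n_pos <- length(pos)
n_neg <- length(neg)
auc_rank <- (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(pairs = auc_pairs, rank = auc_rank))
```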
Part 2: Tuning k with AUC (40 points)
2.1 Bootstrap Tuning Using AUC (20 points)
Tune k using bootstrap validation with AUC as the performance criterion — not accuracy. This is critical when data is imbalanced, because maximizing accuracy can lead to a k that simply favors the majority class.
Requirements:
- Use knn3() from caret (not train())
- Grid: k from 30 to 120 (step 2)
- 20 bootstrap samples per k
- Criterion: mean AUC across bootstrap samples (use ROCR)
set.seed(42)
k_grid <- seq(30,120,2)
n_boot <- 20
mean_auc <- numeric(length(k_grid))
for(i in seq_along(k_grid)){
  k_val <- k_grid[i]
  # Start from NA so a skipped resample is dropped by na.rm=TRUE rather than counted as AUC = 0
  auc_vec <- rep(NA_real_, n_boot)
  for(b in 1:n_boot){
    # Bootstrap sample (with replacement); rows never drawn form the out-of-bag (OOB) set
    boot_idx <- sample(1:nrow(train_data),
                       size=nrow(train_data),
                       replace=TRUE)
    oob_idx <- setdiff(1:nrow(train_data), unique(boot_idx))
    if(length(oob_idx)<10) next
    boot_train <- train_data[boot_idx,]
    boot_oob <- train_data[oob_idx,]
    # Fit on the bootstrap sample, score the OOB rows, and record the OOB AUC
    fit_b <- knn3(Churn~., data=boot_train, k=k_val)
    phat_b <- predict(fit_b,newdata=boot_oob,type="prob")
    pred_b <- prediction(phat_b[,2], boot_oob$Churn=="Yes")
    auc_b <- performance(pred_b,"auc")
    auc_vec[b] <- as.numeric(auc_b@y.values[[1]])
  }
  mean_auc[i] <- mean(auc_vec,na.rm=TRUE)
}
optimal_k <- k_grid[which.max(mean_auc)]
optimal_auc_boot <- max(mean_auc,na.rm=TRUE)
optimal_k
[1] 114
optimal_auc_boot
[1] 0.8368718
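The tuning loop scores each bootstrap model on the rows that were not drawn (the out-of-bag set). As a rough check on how much validation data that leaves, the base R sketch below simulates the out-of-bag share. The sample size `n` is a hypothetical stand-in, not the actual number of training rows.

```r
set.seed(42)
n <- 5000       # hypothetical training-set size, for illustration only
n_boot <- 200   # number of simulated bootstrap resamples

# Share of rows that never appear in each bootstrap sample
oob_frac <- replicate(n_boot, {
  boot_idx <- sample.int(n, size = n, replace = TRUE)
  1 - length(unique(boot_idx)) / n
})

# Simulated mean vs. the exact expectation (1 - 1/n)^n and its limit exp(-1)
print(c(simulated = mean(oob_frac), exact = (1 - 1 / n)^n, limit = exp(-1)))
```

Roughly 37% of rows end up out-of-bag in each resample, so every AUC in the loop is estimated on a sizeable held-out subset.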
Question 3 (20 points): Plot mean AUC vs k. Add a vertical dashed line at the optimal k and annotate it. What is the optimal k? What is the corresponding bootstrap AUC? Describe the shape of the AUC-vs-k curve — what does it tell you about the bias-variance tradeoff?
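tuning_results <- tibble(k=k_grid, mean_auc=mean_auc)
ggplot(tuning_results, aes(k,mean_auc))+
  geom_line()+
  geom_point()+
  geom_vline(xintercept=optimal_k, linetype="dashed", col="red")+
  # Annotate the dashed line with the selected k, as the question requires
  annotate("text", x=optimal_k, y=min(mean_auc), label=paste0("k = ", optimal_k), hjust=1.1, colour="red")+
  labs(title="Bootstrap AUC vs k", x="k", y="Mean bootstrap AUC")
Your Answer: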
The bootstrap tuning process selected k = 114 as the optimal number of neighbors, with a corresponding mean bootstrap AUC of 0.837.
For kNN, the AUC-vs-k curve typically rises at lower values of k, then levels off or fluctuates within a narrow range as k becomes larger. This pattern reflects the bias-variance tradeoff. Small k values are more flexible and can capture local structure, but they are also more sensitive to noise, which increases variance. Larger k values smooth the decision boundary and reduce variance, but if k becomes too large the classifier may become overly rigid and biased. Here the selected k = 114 sits near the upper end of the grid (120), which suggests the curve is still rising slowly or has flattened at large k rather than peaking sharply, so neighboring values of k likely perform almost as well. The selected k is the grid value where this tradeoff yields the strongest out-of-sample ranking performance.
2.2 Test Set Evaluation (10 points)
Fit the final model using the optimal k on the full training set and evaluate it on the held-out test set.
final_knn <- knn3(Churn~., data=train_data, k=optimal_k)
phat_final <- predict(final_knn,newdata=test_data,type="prob")
pred_final <- prediction(phat_final[,2], test_data$Churn=="Yes")
perf_final <- performance(pred_final,"auc")
test_auc <- as.numeric(perf_final@y.values[[1]])
test_auc
[1] 0.8351375
Question 4 (10 points): Report the test AUC. Compare it to the bootstrap AUC from tuning. Is there a notable gap? What would a large gap suggest?
The final model achieved a test AUC of 0.835 on the held-out test set, compared with a bootstrap tuning AUC of 0.837.
The gap between the two is 0.002. A difference this small suggests that the tuning procedure generalized well and that the selected value of k was not heavily overfit to the resampled training data. A large gap would suggest that the bootstrap validation estimates were overly optimistic, indicating potential overfitting or instability in the tuning process.
2.3 Conceptual: AUC vs Accuracy as a Tuning Criterion (10 points)
Question 5 (10 points): Why is AUC a better tuning criterion than accuracy for this dataset? What information does AUC capture that accuracy misses? In what situations might accuracy still be an appropriate criterion?
Your Answer:
AUC is a better tuning criterion for this dataset because churn is imbalanced: most customers do not churn, so a model can achieve high accuracy simply by predicting the majority class. Accuracy evaluates performance at a single threshold, usually 0.5, and therefore depends heavily on that arbitrary cutoff.
AUC captures the model’s ranking ability across all possible thresholds. It measures how well the classifier assigns higher churn probabilities to true churners than to non-churners. That threshold-independent perspective is especially valuable when the business may later choose a threshold based on operational costs rather than the default 0.5 rule.
Accuracy can still be appropriate when classes are relatively balanced, when misclassification costs are similar across the two classes, and when the decision threshold is fixed in advance and aligned with the actual business objective.
Part 3: Optimal Threshold (35 points)
3.1 Youden’s J Statistic (15 points)
The model outputs probabilities — to classify an observation as “Yes” (churn), you need a discrimination threshold. The default is 0.5, but this is often suboptimal for imbalanced data.
Youden’s J statistic finds the threshold that maximizes the sum of sensitivity and specificity:
\[J = \text{Sensitivity} + \text{Specificity} - 1 = \text{TPR} - \text{FPR}\]
The threshold corresponding to the maximum J is the “optimal” operating point on the ROC curve.
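To make the definition concrete, here is a minimal base R sketch that scans candidate thresholds and picks the one with the largest J. The labels and scores are synthetic (an assumption for illustration, not the churn data), and a score at or above the threshold is treated as a positive prediction.

```r
set.seed(7)
# Synthetic labels and scores, for illustration only
y <- rbinom(400, 1, 0.3)
score <- plogis(-1 + 1.5 * y + rnorm(400))

# Candidate thresholds: every distinct score; score >= threshold means "predict positive"
thr <- sort(unique(score))
tpr <- sapply(thr, function(t) mean(score[y == 1] >= t))   # sensitivity
fpr <- sapply(thr, function(t) mean(score[y == 0] >= t))   # 1 - specificity
j <- tpr - fpr                                             # Youden's J at each threshold

best <- which.max(j)
print(c(threshold = thr[best], TPR = tpr[best], FPR = fpr[best], J = j[best]))
```

Because J = TPR - FPR, the selected threshold is the ROC point with the greatest vertical distance above the diagonal.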
pred_rocr <- prediction(phat_final[,2], test_data$Churn=="Yes")
perf_ss <- performance(pred_rocr,"sens","spec")
sens <- perf_ss@y.values[[1]]
spec <- perf_ss@x.values[[1]]
thr <- perf_ss@alpha.values[[1]]
j <- sens + spec - 1
best <- which.max(j)
optimal_threshold <- thr[best]
opt_tpr <- sens[best]
opt_fpr <- 1-spec[best]
opt_j <- j[best]  # Youden's J at the optimal threshold (reported in the answer to Question 6)
optimal_threshold
[1] 0.210084
Question 6 (15 points): Report the optimal threshold (Youden’s J), TPR, FPR, and J value. Then plot the ROC curve and mark the optimal operating point (red dot, annotated with threshold, TPR, FPR). Briefly describe what this point represents geometrically on the ROC curve.
perf_roc_final <- performance(pred_rocr, "tpr", "fpr")
plot(
perf_roc_final,
main = "ROC Curve with Youden-Optimal Threshold",
col = "steelblue",
lwd = 2
)
abline(a = 0, b = 1, lty = 2, col = "gray50")
points(opt_fpr, opt_tpr, col = "red", pch = 19, cex = 1.3)
text(
opt_fpr,
opt_tpr,
labels = paste0(
"thr = ", round(optimal_threshold, 3),
"\nTPR = ", round(opt_tpr, 3),
"\nFPR = ", round(opt_fpr, 3)
),
pos = 4,
cex = 0.8,
col = "red"
)
The Youden-optimal threshold is 0.21, with a true positive rate of `r round(opt_tpr, 3)`, a false positive rate of `r round(opt_fpr, 3)`, and a Youden’s J value of `r round(opt_j, 3)`.
Geometrically, this point is the location on the ROC curve that maximizes the vertical distance above the no-discrimination diagonal. It represents the operating threshold that best balances sensitivity and specificity under equal weighting of the two objectives.
3.2 Default vs. Optimal Threshold Comparison (10 points)
Compare model performance at the default 0.5 threshold vs the Youden-optimal threshold.
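# Classify at a given threshold and collect metrics plus confusion-matrix counts
get_metrics <- function(prob_yes, truth, threshold) {
  # Note: ROCR counts scores >= cutoff as positive; the strict > used here
  # classifies rows whose probability equals the threshold exactly as "No"
  predicted <- ifelse(prob_yes > threshold, "Yes", "No")
  predicted <- factor(predicted, levels = c("No", "Yes"))
  cm <- confusionMatrix(predicted, truth, positive = "Yes")
  # cm$table has predictions in rows and reference classes in columns
  tibble(
    threshold = threshold,
    accuracy = unname(cm$overall["Accuracy"]),
    sensitivity = unname(cm$byClass["Sensitivity"]),
    specificity = unname(cm$byClass["Specificity"]),
    f1 = unname(cm$byClass["F1"]),
    tn = cm$table[1, 1],
    fn = cm$table[1, 2],
    fp = cm$table[2, 1],
    tp = cm$table[2, 2]
  )
}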
res_default <- get_metrics(phat_final[, "Yes"], test_data$Churn, 0.5)
res_optimal <- get_metrics(phat_final[, "Yes"], test_data$Churn, optimal_threshold)
comparison_tbl <- bind_rows(
mutate(res_default, label = "Default 0.5"),
mutate(res_optimal, label = "Youden-optimal")
) |>
select(label, threshold, accuracy, sensitivity, specificity, f1, tn, fp, fn, tp)
comparison_tbl |>
mutate(across(where(is.numeric), ~ round(., 3))) |>
kbl(caption = "Threshold Comparison on the Test Set") |>
  kable_styling(full_width = FALSE)

| label | threshold | accuracy | sensitivity | specificity | f1 | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|---|
| Default 0.5 | 0.50 | 0.792 | 0.501 | 0.897 | 0.562 | 926 | 106 | 186 | 187 |
| Youden-optimal | 0.21 | 0.708 | 0.877 | 0.647 | 0.615 | 668 | 364 | 46 | 327 |
Question 7 (10 points): Present the comparison table. Which threshold gives higher sensitivity? Which gives higher accuracy? Are the differences meaningful? Explain what is being traded off when you lower the threshold from 0.5 to the optimal value.
Your Answer:
The comparison table shows that the Youden-optimal threshold (0.21) gives much higher sensitivity (0.877 vs 0.501), while the default 0.5 threshold gives higher specificity (0.897 vs 0.647) and higher accuracy (0.792 vs 0.708). These differences are meaningful: sensitivity rises by about 38 percentage points while accuracy falls by about 8, and in churn prediction the gain in sensitivity is usually more operationally important than the loss in accuracy.
When the threshold is lowered from 0.5 to the Youden-optimal value, the classifier becomes more willing to label customers as likely churners. This increases the number of true churners detected (from 187 to 327), but it also increases false positives (from 106 to 364). The tradeoff is therefore between catching more at-risk customers and contacting more customers who would not have churned.
3.3 Business Recommendation (15 points)
Question 8 (15 points): You are advising the telecom company on their customer retention strategy. They plan to contact customers predicted as likely churners with a retention offer (a discount or upgrade).
Consider:
- False Negative = a churner predicted as “No” (missed churner — no retention offer sent)
- False Positive = a non-churner predicted as “Yes” (loyal customer unnecessarily contacted)
Based on your results, which threshold (0.5 or Youden-optimal, or something else entirely) would you recommend, and why? Quantify the difference in false negatives between the two thresholds using your confusion matrices. What is the business cost of each type of error?
fn_fp_tbl <- bind_rows(
tibble(
Threshold = "Default 0.5",
`False Negatives` = res_default$fn,
`False Positives` = res_default$fp,
`True Positives` = res_default$tp,
`True Negatives` = res_default$tn
),
tibble(
Threshold = "Youden-optimal",
`False Negatives` = res_optimal$fn,
`False Positives` = res_optimal$fp,
`True Positives` = res_optimal$tp,
`True Negatives` = res_optimal$tn
)
)
fn_fp_tbl |>
kbl(caption = "Confusion Matrix Error Tradeoffs by Threshold") |>
  kable_styling(full_width = FALSE)

| Threshold | False Negatives | False Positives | True Positives | True Negatives |
|---|---|---|---|---|
| Default 0.5 | 186 | 106 | 187 | 926 |
| Youden-optimal | 46 | 364 | 327 | 668 |
I would recommend the Youden-optimal threshold as the better starting point for the telecom company’s churn-retention strategy, because the business objective is not simply to maximize accuracy but to identify likely churners early enough to intervene.
Under the default threshold, the model is more conservative and therefore misses more true churners. Under the Youden-optimal threshold, the number of false negatives falls from 186 to 46, a reduction of 140 missed churners. That means the company can target substantially more genuinely at-risk customers with retention offers.
The business cost of a false negative is potentially high: the firm loses a customer without attempting intervention, which can imply lost subscription revenue, lost customer lifetime value, and wasted acquisition or servicing costs. The business cost of a false positive is usually lower: the company may send an unnecessary offer or outreach message to a customer who was not actually going to churn. That still creates a marketing or discount cost, but it is often less damaging than losing a real churner.
That said, the truly best threshold is a policy choice rather than a purely statistical one. If retention offers are very expensive, management may want a stricter threshold. If the cost of churn is much greater than the cost of outreach, the company may prefer the Youden-optimal threshold or even a lower threshold that further reduces false negatives.
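One way to make this policy choice explicit is to attach a unit cost to each error type and compare total cost at the two thresholds. The base R sketch below uses the false-negative and false-positive counts from the confusion-matrix table above. The two unit costs are hypothetical placeholders, and the calculation ignores offer acceptance rates and the value of retained customers.

```r
# Error counts from the test-set confusion matrices reported above
errors <- data.frame(
  threshold = c("Default 0.5", "Youden-optimal"),
  fn = c(186, 46),
  fp = c(106, 364)
)

# Hypothetical unit costs (assumptions, not from the assignment)
cost_fn <- 500   # value lost when a churner is missed
cost_fp <- 50    # cost of an unnecessary retention offer

errors$total_cost <- errors$fn * cost_fn + errors$fp * cost_fp

# FN-to-FP cost ratio at which both thresholds cost the same:
# 186*c_fn + 106*c_fp = 46*c_fn + 364*c_fp  =>  c_fn / c_fp = 258 / 140
break_even <- (364 - 106) / (186 - 46)

print(errors)
print(round(break_even, 2))
```

Under this simplified accounting, the Youden-optimal threshold is cheaper whenever a missed churner costs more than about 1.84 times an unnecessary offer; below that ratio the default threshold wins.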