# Core
library(tidyverse)
library(caret)
library(knitr)
library(kableExtra)
library(patchwork)
# Classification extras (AUC/ROC + threshold)
library(pROC)
library(yardstick)
set.seed(5560)Assignment 2: k-Nearest Neighbors for Regression and Classification
Introduction:
This assignment applies k-nearest neighbours (kNN) to two structured datasets:
• Ames Housing** (2,930 observations, 7 variables) for regression
• Telco Customer Churn** (7,043 observations, 7 variables) for classification
In both cases, the hyperparameter *k* was tuned using a manual grid search combined with 20 bootstrap resamples inside each of 20 outer 80/20 splits. Scaling was performed strictly within each training split to avoid information leakage.
Model performance was evaluated on held-out test sets. Results are interpreted using error curves, variability across splits, confusion matrices, ROC analysis, and threshold tuning. The bias–variance tradeoff is discussed in the context of the observed tuning curves rather than abstract theory.
Section 1 - Back with Ames Housing (kNN Regression)
# Load Ames Housing Dataset (AmesHousing.csv)
ames_raw <- read.csv("AmesHousing.csv")
# Subset to required variables for kNN regression
ames <- ames_raw %>%
select(
SalePrice,
Gr.Liv.Area,
Total.Bsmt.SF,
Garage.Area,
Year.Built,
Overall.Qual,
Overall.Cond
)
# Quick check
dim(ames)[1] 2930 7
glimpse(ames)Rows: 2,930
Columns: 7
$ SalePrice <int> 215000, 105000, 172000, 244000, 189900, 195500, 213500, …
$ Gr.Liv.Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 180…
$ Total.Bsmt.SF <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994, …
$ Garage.Area <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 440, 4…
$ Year.Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, 19…
$ Overall.Qual <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, 4,…
$ Overall.Cond <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, 5,…
Missing Values & chosen Standardization
Scaling parameters (center/scale) are fit on training data only per split. Inside each bootstrap validation, scaling is fit on bootstrap train only, then applied to out of bag.
# Median imputation (deterministic) ames <- ames %>% mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) stopifnot(sum(is.na(ames)) == 0)
Observations& Sale Price
n_obs_ames <- nrow(ames)
sale_summary <- ames %>%
summarise(
n = n(),
min = min(SalePrice),
q1 = quantile(SalePrice, 0.25),
median = median(SalePrice),
mean = mean(SalePrice),
q3 = quantile(SalePrice, 0.75),
max = max(SalePrice)
)
sale_summary %>%
kable(digits = 2, caption = "Ames Housing: Summary of SalePrice (Target)") %>%
kable_styling(full_width = FALSE)| n | min | q1 | median | mean | q3 | max |
|---|---|---|---|---|---|---|
| 2930 | 12789 | 129500 | 160000 | 180796.1 | 213500 | 755000 |
ggplot(ames, aes(x = SalePrice)) +
geom_histogram(bins = 35, color = "white") +
labs(
title = "SalePrice Distribution (AmesHousing.csv)",
x = "SalePrice",
y = "Count"
) +
theme_minimal()Question 1: How many Obeservations?
The Ames dataset contains 2,930 observations and 7 variables (1 target and 6 predictors).
After the 80/20 split, each training set contains approximately 2,344 observations and each test set contains about 586 observations. With k evaluated up to 30, even the largest k uses only about 1.3% of the training data for prediction (30 / 2344 ≈ 0.013). This ensures that the model remains local rather than collapsing toward a global average. The dataset size is therefore sufficient to support stable neighborhood estimation and repeated bootstrap tuning.
The SalePrice summary shows: Minimum: $12,789 | Median: $160,000 | Mean: $180,796 | Maximum: $755,000
The histogram indicates right-skewness. Most homes are clustered between roughly $120,000 and $250,000, while a small number of high-value properties extend far into the upper tail.
This skewness explains why prediction error increases for expensive homes in the scatter plot later. kNN averages nearby observations, so extreme values are partially smoothed.
Question 2: Why standardize kNN?
kNN relies on Euclidean distance (Straight line distance between two points in space). The predictors include square footage (e.g., Gr.Liv.Area ≈ 800–2,500), garage area (hundreds), and quality scores (1–10).
Without scaling, square footage variables would dominate distance calculations purely due to magnitude. Centering and scaling ensures each predictor contributes proportionally to neighborhood selection.
Scaling was fit on training data only within each split and bootstrap iteration to prevent leakage from test data.
Section 1.2 kNN Regression: Manual BTS Tuning & Eval
rmspe <- function(actual, predicted) sqrt(mean((actual - predicted)^2))Helper:
This helper function fits on the preProcess() training preditors only. Only towards scaling to help train and test.
prep_ames_split <- function(train_df, test_df) {
# Separate X/Y
y_train <- train_df$SalePrice
y_test <- test_df$SalePrice
x_train <- train_df %>% select(-SalePrice)
x_test <- test_df %>% select(-SalePrice)
# Fit scaler on TRAIN ONLY (purist)
pp <- preProcess(x_train, method = c("center", "scale"))
x_train_s <- predict(pp, x_train)
x_test_s <- predict(pp, x_test)
list(
x_train = as.matrix(x_train_s),
y_train = y_train,
x_test = as.matrix(x_test_s),
y_test = y_test
)
}This is the core loop (tuning curves). It holds:
optimal_k_reg[i] |test_RMSPE_reg[i]| rmspe_curve_reg[i, k] - mean bootstrap RMSPE for each k in the split/
set.seed(5560)
k_grid_reg <- 1:30
optimal_k_reg <- numeric(20)
test_RMSPE_reg <- numeric(20)
rmspe_curve_reg <- matrix(NA_real_, nrow = 20, ncol = length(k_grid_reg))
for (i in 1:20) {
# Outer split: 80/20
idx_train <- createDataPartition(ames$SalePrice, p = 0.8, list = FALSE)
train_data <- ames[idx_train, ]
test_data <- ames[-idx_train, ]
# Purist scaling for this outer split (train only)
split_obj <- prep_ames_split(train_data, test_data)
x_train <- split_obj$x_train
y_train <- split_obj$y_train
x_test <- split_obj$x_test
y_test <- split_obj$y_test
# For bootstrap tuning we need access to the training data rows
# We'll keep a data.frame version for resampling indexes, but do scaling inside each bootstrap (purist)
train_df <- train_data
mean_rmspe_k <- numeric(length(k_grid_reg))
for (k_idx in seq_along(k_grid_reg)) {
k_val <- k_grid_reg[k_idx]
boot_rmspe <- numeric(20)
for (j in 1:20) {
boot_index <- sample(seq_len(nrow(train_df)), replace = TRUE)
oob_index <- setdiff(seq_len(nrow(train_df)), unique(boot_index))
if (length(oob_index) == 0) next
boot_train <- train_df[boot_index, ]
boot_val <- train_df[oob_index, ]
# Purist scaling INSIDE bootstrap: fit on boot_train only
pp_boot <- preProcess(boot_train %>% select(-SalePrice),
method = c("center", "scale"))
x_boot_train <- as.matrix(predict(pp_boot, boot_train %>% select(-SalePrice)))
y_boot_train <- boot_train$SalePrice
x_boot_val <- as.matrix(predict(pp_boot, boot_val %>% select(-SalePrice)))
y_boot_val <- boot_val$SalePrice
model <- knnreg(x = x_boot_train, y = y_boot_train, k = k_val)
pred <- predict(model, x_boot_val)
boot_rmspe[j] <- rmspe(y_boot_val, pred)
}
mean_rmspe_k[k_idx] <- mean(boot_rmspe, na.rm = TRUE)
}
rmspe_curve_reg[i, ] <- mean_rmspe_k
best_k <- k_grid_reg[which.min(mean_rmspe_k)]
optimal_k_reg[i] <- best_k
# Fit final on OUTER TRAIN (already scaled with outer train-only pp)
final_model <- knnreg(x = x_train, y = y_train, k = best_k)
test_pred <- predict(final_model, x_test)
test_RMSPE_reg[i] <- rmspe(y_test, test_pred)
}mean_curve <- colMeans(rmspe_curve_reg, na.rm = TRUE)
curve_df <- data.frame(k = k_grid_reg, mean_rmspe = mean_curve)
opt_k_overall <- curve_df$k[which.min(curve_df$mean_rmspe)]
opt_rmspe_overall <- min(curve_df$mean_rmspe)
curve_df %>%
ggplot(aes(x = k, y = mean_rmspe)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
geom_vline(xintercept = opt_k_overall, linetype = "dashed") +
labs(
title = "Mean Bootstrap RMSPE vs k (AmesHousing.csv)",
subtitle = paste0("Overall optimal k ≈ ", opt_k_overall,
" | Mean RMSPE ≈ ", round(opt_rmspe_overall, 2)),
x = "k",
y = "Mean Bootstrap RMSPE"
) +
theme_minimal()Question 3: Mean RMSPE vs K + optimal k + curve shape
The mean bootstrap RMSPE curve shows a steep decline from k = 1 (≈ 37,500) to approximately k = 8–12 (≈ 31,000). The overall minimum occurs at k ≈ 20, with mean RMSPE ≈ 31,029.
For k > 20, error begins to increase slightly, reaching ≈ 31,400 by k = 30.
This pattern indicates:
- k = 1 overfits (high variance)
- Moderate k stabilizes predictions
- Larger k begins oversmoothing (increasing bias)
The curve shape is consistent across splits, suggesting the optimal region is not driven by random sampling.
Question 4: Mean/SD test RMSPE + histogram
rmspe_df <- data.frame(test_RMSPE = test_RMSPE_reg)
rmspe_df %>%
summarise(
mean_test_RMSPE = mean(test_RMSPE),
sd_test_RMSPE = sd(test_RMSPE)
) %>%
kable(digits = 3, caption = "Test RMSPE Summary Across 20 Splits (AmesHousing.csv)") %>%
kable_styling(full_width = FALSE)| mean_test_RMSPE | sd_test_RMSPE |
|---|---|
| 30553.31 | 3042.183 |
ggplot(rmspe_df, aes(x = test_RMSPE)) +
geom_histogram(bins = 12, color = "white") +
labs(
title = "Distribution of Test RMSPE Across 20 Splits (AmesHousing.csv)",
x = "Test RMSPE",
y = "Count"
) +
theme_minimal()Across 20 outer splits:
Mean test RMSPE = 30,553 | Standard deviation = 3,042
The histogram shows most test errors fall between 28,000 and 33,000, with a few splits exceeding 35,000. A standard deviation of ~3,000 relative to a mean of ~30,500 indicates moderate variability (≈10%). This suggests the model generalizes consistently across different partitions of the data.
Question 5: Actual vs Predicted scatter
set.seed(5560)
idx_train <- createDataPartition(ames$SalePrice, p = 0.8, list = FALSE)
train_data <- ames[idx_train, ]
test_data <- ames[-idx_train, ]
split_obj <- prep_ames_split(train_data, test_data)
x_train <- split_obj$x_train
y_train <- split_obj$y_train
x_test <- split_obj$x_test
y_test <- split_obj$y_test
# Use a representative k: overall curve minimum (computed earlier)
k_rep <- opt_k_overall
final_model <- knnreg(x = x_train, y = y_train, k = k_rep)
pred <- predict(final_model, x_test)
plot_df <- data.frame(actual = y_test, predicted = pred)
ggplot(plot_df, aes(x = actual, y = predicted)) +
geom_point(alpha = 0.6) +
geom_abline(linetype = "dashed") +
labs(
title = "Actual vs Predicted SalePrice (Representative Split, AmesHousing.csv)",
subtitle = paste0("k = ", k_rep),
x = "Actual SalePrice",
y = "Predicted SalePrice"
) +
theme_minimal()Question 5: Performance evaluation
Response:
The scatter plot shows predicted versus actual SalePrice for a samp split using k = 20. Predictions align closely with the 45-degree line in the mid-range of prices ($120k–$300k).
However, dispersion increases for homes above $400k. This reflects kNN’s averaging behavior: high-priced homes have fewer comparable neighbors, and predictions are pulled toward the local mean.
The model captures general structure well but struggles with extreme values.
Section 1.3: Comparison with caret::train()
#code only allowed for this section.
set.seed(5560)
pp_all <- preProcess(ames %>% select(-SalePrice), method = c("center", "scale"))
ames_train_ready <- data.frame(
SalePrice = ames$SalePrice,
predict(pp_all, ames %>% select(-SalePrice))
)
ctrl <- trainControl(method = "cv", number = 5)
train_model <- train(
SalePrice ~ .,
data = ames_train_ready,
method = "knn",
tuneGrid = data.frame(k = 1:30),
trControl = ctrl
)
train_model$bestTunetrain_model$results %>%
arrange(RMSE) %>%
head(10)Question 6: Compare optimal k
The manual bootstrap procedure selected k ≈ 20. Using caret’s 5-fold cross-validation, the optimal k was 8, with RMSE ≈ 29,702. Both methods selected moderate values of k within the 8–20 range. The small difference likely reflects:
- Cross-validation vs bootstrap differences | 2. Random variation across folds
Importantly, both approaches reject very small k values and favor moderate smoothing.
Question 7: Bias-variance tradeoff for kNN
Section 2.1: Prediction Customer Churn kNN Classification
# Load Telco Customer Churn Dataset
telco_raw <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn(1).csv")
# Remove ID
telco_raw$customerID <- NULL
# Convert TotalCharges to numeric (creates NAs where blanks existed)
telco_raw$TotalCharges <- as.numeric(telco_raw$TotalCharges)# Subset required variables
telco <- telco_raw %>%
select(
Churn,
tenure,
MonthlyCharges,
TotalCharges,
Contract,
InternetService,
PaymentMethod
)
# Handle NAs (from TotalCharges conversion)
telco <- telco %>%
mutate(across(where(is.numeric),
~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Churn as factor with explicit levels
telco$Churn <- factor(telco$Churn, levels = c("No", "Yes"))
glimpse(telco)Rows: 7,043
Columns: 7
$ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, No…
$ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 25…
$ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75…
$ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949.…
$ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One y…
$ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opti…
$ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "B…
Question 1: Observations…Disitribtuions…Balance
The Telco dataset contains 7,043 observations and 7 variables.
Class distribution: No churn: 5,174 (73.5%) | Yes churn: 1,869 (26.5%)
This imbalance is substantial. A naive classifier predicting “No” for every customer would achieve 73.5% accuracy. Therefore, accuracy must be interpreted carefully. In business terms, the higher the churn, the worse for overall health & longevity
Question 2: The power of standardization
n_obs_telco <- nrow(telco)
class_dist <- telco %>%
count(Churn) %>%
mutate(pct = n / sum(n))
class_dist %>%
kable(digits = 3, caption = "Telco Churn Class Distribution (Target: Churn)") %>%
kable_styling(full_width = FALSE)| Churn | n | pct |
|---|---|---|
| No | 5174 | 0.735 |
| Yes | 1869 | 0.265 |
ggplot(telco, aes(x = Churn)) +
geom_bar() +
labs(
title = "Churn Class Distribution (WA_Fn-UseC_-Telco-Customer-Churn.csv)",
x = "Churn",
y = "Count"
) +
theme_minimal()Question 2.
Repsonse:
Tenure ranges from 1 to 72 months. MonthlyCharges ranges roughly from 20 to 120. TotalCharges can exceed several thousand.
Without scaling, TotalCharges would dominate distance calculations. Dummy encoding was applied to categorical variables, followed by scaling within each training split.
This ensures neighborhoods reflect combined behavioral and financial similarity. Adhering to the integrity of the data.
Section 2.2 kNN Classification: Tuning and Evaluation
prep_telco_split <- function(train_df, test_df) {
# Dummy encoding on TRAIN ONLY (purist)
dmy <- dummyVars(Churn ~ ., data = train_df, fullRank = TRUE)
x_train <- predict(dmy, newdata = train_df)
x_test <- predict(dmy, newdata = test_df)
# Scale on TRAIN ONLY (purist)
pp <- preProcess(x_train, method = c("center", "scale"))
x_train_s <- predict(pp, x_train)
x_test_s <- predict(pp, x_test)
list(
x_train = as.matrix(x_train_s),
y_train = train_df$Churn,
x_test = as.matrix(x_test_s),
y_test = test_df$Churn
)
}Core Loop
- Accuracy Curve & Metrics with AUC
set.seed(5560)
k_grid_class <- seq(1, 29, by = 2)
optimal_k_class <- numeric(20)
acc_curve_class <- matrix(NA_real_, nrow = 20, ncol = length(k_grid_class))
accuracy_vec <- numeric(20)
precision_vec <- numeric(20)
recall_vec <- numeric(20)
f1_vec <- numeric(20)
auc_vec <- numeric(20)
for (i in 1:20) {
idx_train <- createDataPartition(telco$Churn, p = 0.8, list = FALSE)
train_data <- telco[idx_train, ]
test_data <- telco[-idx_train, ]
mean_acc_k <- numeric(length(k_grid_class))
for (k_idx in seq_along(k_grid_class)) {
k_val <- k_grid_class[k_idx]
boot_acc <- numeric(20)
for (j in 1:20) {
boot_index <- sample(seq_len(nrow(train_data)), replace = TRUE)
oob_index <- setdiff(seq_len(nrow(train_data)), unique(boot_index))
if (length(oob_index) == 0) next
boot_train <- train_data[boot_index, ]
boot_val <- train_data[oob_index, ]
# Purist dummy + scale inside bootstrap
dmy_boot <- dummyVars(Churn ~ ., data = boot_train, fullRank = TRUE)
x_bt <- predict(dmy_boot, newdata = boot_train)
x_bv <- predict(dmy_boot, newdata = boot_val)
pp_boot <- preProcess(x_bt, method = c("center", "scale"))
x_bt_s <- as.matrix(predict(pp_boot, x_bt))
x_bv_s <- as.matrix(predict(pp_boot, x_bv))
y_bt <- boot_train$Churn
y_bv <- boot_val$Churn
model <- knn3(x = x_bt_s, y = y_bt, k = k_val)
preds <- predict(model, x_bv_s, type = "class")
boot_acc[j] <- mean(preds == y_bv)
}
mean_acc_k[k_idx] <- mean(boot_acc, na.rm = TRUE)
}
acc_curve_class[i, ] <- mean_acc_k
best_k <- k_grid_class[which.max(mean_acc_k)]
optimal_k_class[i] <- best_k
# Final fit on OUTER TRAIN with purist preprocessing
split_obj <- prep_telco_split(train_data, test_data)
x_train <- split_obj$x_train
y_train <- split_obj$y_train
x_test <- split_obj$x_test
y_test <- split_obj$y_test
final_model <- knn3(x = x_train, y = y_train, k = best_k)
class_preds <- predict(final_model, x_test, type = "class")
prob_preds <- predict(final_model, x_test, type = "prob")
accuracy_vec[i] <- mean(class_preds == y_test)
cm <- confusionMatrix(class_preds, y_test, positive = "Yes")
precision_vec[i] <- as.numeric(cm$byClass["Precision"])
recall_vec[i] <- as.numeric(cm$byClass["Recall"])
f1_vec[i] <- as.numeric(cm$byClass["F1"])
roc_obj <- roc(y_test, prob_preds[, "Yes"])
auc_vec[i] <- as.numeric(auc(roc_obj))
}Question 3: Mean Accuracy
The bootstrap accuracy curve increases steadily from ≈0.739 at k = 1 to ≈0.785 at k ≈ 27–29. The overall optimal k is in the high 20s, with mean accuracy ≈ 0.785.
Unlike regression, the curve does not exhibit a strong U-shape. Larger k values improve stability under class imbalance.
mean_acc_curve <- colMeans(acc_curve_class, na.rm = TRUE)
acc_df <- data.frame(k = k_grid_class, mean_accuracy = mean_acc_curve)
opt_k_class_overall <- acc_df$k[which.max(acc_df$mean_accuracy)]
opt_acc_overall <- max(acc_df$mean_accuracy)
ggplot(acc_df, aes(x = k, y = mean_accuracy)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
geom_vline(xintercept = opt_k_class_overall, linetype = "dashed") +
labs(
title = "Mean Bootstrap Accuracy vs k (Telco Churn Dataset)",
subtitle = paste0("Overall optimal k ≈ ", opt_k_class_overall,
" | Mean Accuracy ≈ ", round(opt_acc_overall, 3)),
x = "k",
y = "Mean Bootstrap Accuracy"
) +
theme_minimal()Question 4: Mean & SD characteristics n metrics
metrics_tbl <- data.frame(
Metric = c("Accuracy","Precision","Recall","F1","AUC"),
Mean = c(mean(accuracy_vec), mean(precision_vec), mean(recall_vec), mean(f1_vec), mean(auc_vec)),
SD = c(sd(accuracy_vec), sd(precision_vec), sd(recall_vec), sd(f1_vec), sd(auc_vec))
)
metrics_tbl %>%
kable(digits = 3, caption = "Telco Churn: Test Metrics Across 20 Splits") %>%
kable_styling(full_width = FALSE)| Metric | Mean | SD |
|---|---|---|
| Accuracy | 0.792 | 0.009 |
| Precision | 0.639 | 0.022 |
| Recall | 0.495 | 0.026 |
| F1 | 0.558 | 0.022 |
| AUC | 0.832 | 0.009 |
Across 20 splits:
Accuracy: 0.792 (SD 0.009) | Precision: 0.639 (SD 0.022) | Recall: 0.495 (SD 0.026) | | F1: 0.558 (SD 0.022) | AUC: 0.832 (SD 0.009) |
Accuracy exceeds the baseline (0.735) by ~6 percentage points.
However, recall remains below 0.50, meaning the model identifies fewer than half of churners at the default threshold.
Question 5: Confusion Matrix
set.seed(5560)
idx_train <- createDataPartition(telco$Churn, p = 0.8, list = FALSE)
train_data <- telco[idx_train, ]
test_data <- telco[-idx_train, ]
split_obj <- prep_telco_split(train_data, test_data)
x_train <- split_obj$x_train
y_train <- split_obj$y_train
x_test <- split_obj$x_test
y_test <- split_obj$y_test
k_rep_class <- opt_k_class_overall
final_model <- knn3(x = x_train, y = y_train, k = k_rep_class)
class_preds <- predict(final_model, x_test, type = "class")
confusionMatrix(class_preds, y_test, positive = "Yes")Confusion Matrix and Statistics
Reference
Prediction No Yes
No 933 198
Yes 101 175
Accuracy : 0.7875
95% CI : (0.7652, 0.8086)
No Information Rate : 0.7349
P-Value [Acc > NIR] : 2.830e-06
Kappa : 0.4052
Mcnemar's Test P-Value : 2.827e-08
Sensitivity : 0.4692
Specificity : 0.9023
Pos Pred Value : 0.6341
Neg Pred Value : 0.8249
Prevalence : 0.2651
Detection Rate : 0.1244
Detection Prevalence : 0.1962
Balanced Accuracy : 0.6857
'Positive' Class : Yes
For the representative split:
Sensitivity (Recall): 0.469 | Specificity: 0.902 | Precision: 0.634 | Accuracy: 0.788
The model correctly identifies 90% of non-churners but only 47% of churners. Balanced accuracy = 0.686, which better reflects performance under imbalance.
Sectction 2.3: Interpretation
Question 6. How misleading is accuracy
Very. Accuracy of 0.788 appears strong relative to 0.735 baseline. However, recall of 0.469 indicates that more than half of churners are missed. AUC of 0.832 shows the model ranks churn risk well, even if the default threshold under-detects churn.
Therefore, AUC provides a more complete assessment than accuracy alone.
Question 7. Steps to improve churn?
Lowering the probability threshold increases recall substantially.
At threshold = 0.2:
Recall increases to 0.863 | Precision drops to 0.489 | Accuracy falls to 0.725
At threshold = 0.5:
Recall = 0.469 | Precision = 0.634
Youden’s optimal threshold ≈ 0.237 yields:
Sensitivity = 0.836 | Specificity = 0.714
This threshold better balances false positives and false negatives, depending on business priorities. For this set, Youden’s threshold holds better with this dataset.
Bonus: Threshold Tuning + Youden’s J+ AUC:
# Representative split from above assumed available:
# final_model, x_test, y_test
prob_preds <- predict(final_model, x_test, type = "prob")[, "Yes"]
thresholds <- c(0.2, 0.3, 0.4, 0.5, 0.6)
threshold_results <- map_dfr(thresholds, function(th) {
pred_custom <- ifelse(prob_preds > th, "Yes", "No")
pred_custom <- factor(pred_custom, levels = c("No", "Yes"))
cm <- confusionMatrix(pred_custom, y_test, positive = "Yes")
tibble(
threshold = th,
accuracy = cm$overall["Accuracy"],
precision = cm$byClass["Precision"],
recall = cm$byClass["Recall"],
f1 = cm$byClass["F1"]
)
})
threshold_results %>%
kable(digits = 3, caption = "Threshold Tuning Results (Telco Churn)") %>%
kable_styling(full_width = FALSE)| threshold | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0.2 | 0.725 | 0.489 | 0.863 | 0.625 |
| 0.3 | 0.758 | 0.530 | 0.764 | 0.626 |
| 0.4 | 0.782 | 0.579 | 0.649 | 0.612 |
| 0.5 | 0.787 | 0.634 | 0.469 | 0.539 |
| 0.6 | 0.787 | 0.699 | 0.343 | 0.460 |
threshold_long <- threshold_results %>%
pivot_longer(cols = c(accuracy, precision, recall, f1),
names_to = "metric", values_to = "value")
ggplot(threshold_long, aes(x = threshold, y = value)) +
geom_line() +
geom_point(size = 2) +
facet_wrap(~ metric, scales = "free_y") +
labs(
title = "Metric Sensitivity to Threshold (Telco Churn)",
x = "Threshold (Predict Yes if P(Yes) > threshold)",
y = "Metric Value"
) +
theme_minimal()- Youden’s
roc_obj <- roc(y_test, prob_preds)
plot(roc_obj, main = "ROC Curve (Telco Churn, kNN)")youden <- coords(roc_obj, "best", best.method = "youden",
ret = c("threshold", "sensitivity", "specificity"))
youdenBonus:
Using the optimal k (≈ 27–29) and one representative split, I evaluated five probability thresholds from 0.2 to 0.6. As the threshold increased, precision rose from 0.489 to 0.699, while recall dropped sharply from 0.863 to 0.343. Accuracy remained relatively stable (0.725–0.787), which reinforces that accuracy alone does not capture model behavior under class imbalance. The F1-score peaked around 0.2–0.3 (≈ 0.626), then declined as recall deteriorated. At the default 0.5 threshold, recall was only 0.469, meaning more than half of churners were missed.
For a churn problem, missing a churner is typically more costly than contacting a non-churner. A threshold around 0.3 provides a better balance: recall increases to 0.764 while precision remains reasonable at 0.530. This setting materially improves churn detection compared to 0.5 without collapsing precision. Therefore, I would recommend a threshold near 0.3, as it aligns better with the objective of identifying at-risk customers while maintaining controlled false positives.
Section 3: Wrap up Analysis.
curve_reg_df <- data.frame(k = k_grid_reg, mean_rmspe = colMeans(rmspe_curve_reg, na.rm = TRUE))
curve_cls_df <- data.frame(k = k_grid_class, mean_acc = colMeans(acc_curve_class, na.rm = TRUE))
p_reg <- ggplot(curve_reg_df, aes(x = k, y = mean_rmspe)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(title = "Regression: RMSPE vs k (AmesHousing.csv)",
x = "k", y = "Mean Bootstrap RMSPE") +
theme_minimal()
p_cls <- ggplot(curve_cls_df, aes(x = k, y = mean_acc)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(title = "Classification: Accuracy vs k (Telco Churn)",
x = "k", y = "Mean Bootstrap Accuracy") +
theme_minimal()
p_reg + p_clsSection 3: Comparison
The side-by-side comparison shows clear differences in how performance changes with k for regression and classification.
For regression (Ames Housing)
The optimal k from bootstrap tuning was approximately k ≈ 20.
The minimum mean bootstrap RMSPE was approximately 31,029.
The curve displays a clear U-shape: RMSPE is very high at k = 1 (≈ 37,500), declines sharply until about k = 8–12, stabilizes near k = 20, and then increases slightly for larger k.
For classification (Telco Churn).
The optimal k was in the high range, approximately k ≈ 27–29.
The maximum mean bootstrap accuracy was approximately 0.785.
The accuracy curve increases steadily with k and does not show a strong U-shape.
The optimal k values differ because the two tasks behave differently under smoothing:
In regression, increasing k beyond the optimal region increases bias and leads to oversmoothing, which increases RMSPE.
In classification, especially under class imbalance (73.5% non-churn), larger k values stabilize predictions and reduce variance, leading to improved accuracy.
In short, regression required a balance between variance and bias, while classification benefited more consistently from smoothing due to noise and imbalance in class labels.
library(DiagrammeR)
grViz("
digraph G {
rankdir=TB;
node [shape=box];
A[label='Outer Split (80/20)'];
B[label='Grid Search over k'];
C[label='Bootstrap (20x)'];
D[label='Mean Metric per k'];
E[label='Select Optimal k'];
F[label='Evaluate on Test Set'];
G[label='Store Metrics'];
A -> B -> C -> D -> E -> F -> G;
}
")set.seed(5560)
n <- 100
boot_index <- sample(1:n, replace = TRUE)
data.frame(index = boot_index) %>%
ggplot(aes(x = index)) +
geom_histogram(binwidth = 1) +
labs(
title = "Example Bootstrap Sampling (With Replacement)",
subtitle = "Notice some observations appear multiple times, others are missing",
x = "Original Observation Index",
y = "Frequency in Bootstrap Sample"
) +
theme_minimal()set.seed(5560)
n <- 100
boot_index <- sample(1:n, replace = TRUE)
data.frame(index = boot_index) %>%
ggplot(aes(x = index)) +
geom_histogram(binwidth = 1) +
labs(
title = "Example Bootstrap Sampling (With Replacement)",
subtitle = "Notice some observations appear multiple times, others are missing",
x = "Original Observation Index",
y = "Frequency in Bootstrap Sample"
) +
theme_minimal()oob_index <- setdiff(1:n, unique(boot_index))
length(oob_index) / n[1] 0.37
library(ggplot2)
x <- seq(0, 10, length.out = 100)
y_true <- sin(x)
y_noisy <- y_true + rnorm(100, sd = 0.3)
df <- data.frame(x, y_noisy)
ggplot(df, aes(x, y_noisy)) +
geom_point(alpha = 0.5) +
labs(
title = "Noisy Data Example: What kNN Is Trying to Approximate",
subtitle = "Small k → jagged fit | Large k → smoother fit"
) +
theme_minimal()mean_acc <- colMeans(acc_curve_class)
sd_acc <- apply(acc_curve_class, 2, sd)
acc_df <- data.frame(
k = k_grid_class,
mean = mean_acc,
lower = mean_acc - sd_acc,
upper = mean_acc + sd_acc
)
ggplot(acc_df, aes(x = k, y = mean)) +
geom_line(linewidth = 1) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
labs(
title = "Bootstrap Accuracy vs k (with Variability)",
subtitle = "Ribbon = ±1 SD across outer splits",
x = "k",
y = "Mean Accuracy"
) +
theme_minimal()# Compute ROC
roc_obj <- roc(y_test, prob_preds)
roc_df <- data.frame(
tpr = roc_obj$sensitivities,
fpr = 1 - roc_obj$specificities
)
# Compute Youden optimal point
youden <- coords(roc_obj, "best", best.method = "youden")
youden_df <- data.frame(
fpr = 1 - as.numeric(youden["specificity"]),
tpr = as.numeric(youden["sensitivity"])
)
ggplot(roc_df, aes(x = fpr, y = tpr)) +
geom_line(linewidth = 1) +
geom_abline(linetype = "dashed") +
geom_point(data = youden_df,
aes(x = fpr, y = tpr),
color = "red",
size = 3) +
labs(
title = "ROC Curve with Youden Optimal Threshold (Telco Churn kNN)",
x = "False Positive Rate",
y = "True Positive Rate"
) +
theme_minimal()k_vals <- 1:30
bias_curve <- log(k_vals)
variance_curve <- rev(log(k_vals))
df_bias <- data.frame(
k = k_vals,
Bias = bias_curve,
Variance = variance_curve
)
df_long <- pivot_longer(df_bias, -k)
ggplot(df_long, aes(x = k, y = value, color = name)) +
geom_line() +
labs(
title = "Conceptual Bias–Variance Tradeoff in kNN",
x = "k",
y = "Relative Magnitude"
) +
theme_minimal()grViz("
digraph G {
rankdir=LR;
node [shape=box];
A[label='Raw Data'];
B[label='Train/Test Split'];
C[label='Bootstrap Tuning'];
D[label='Select k'];
E[label='Fit Final Model'];
F[label='Evaluate Metrics'];
G[label='ROC / Threshold Optimization'];
A -> B -> C -> D -> E -> F -> G;
}
")Use of AI: AI was used as a tool to enhance the analysis. Materials used were only internal - non from invalidated open-source. AI kept me in alignment with the assignment parameters while outlining blindspots & improving user reasoning & task execution.