Assignment 2: k-Nearest Neighbors for Regression and Classification

Author

Gavin Shklanka

Published

February 27, 2026

# Core
library(tidyverse)
library(caret)
library(knitr)
library(kableExtra)
library(patchwork)

# Classification extras (AUC/ROC + threshold)
library(pROC)
library(yardstick)

set.seed(5560)

Introduction:

This assignment applies k-nearest neighbours (kNN) to two structured datasets:

• Ames Housing** (2,930 observations, 7 variables) for regression
• Telco Customer Churn** (7,043 observations, 7 variables) for classification

In both cases, the hyperparameter *k* was tuned using a manual grid search combined with 20 bootstrap resamples inside each of 20 outer 80/20 splits. Scaling was performed strictly within each training split to avoid information leakage.

Model performance was evaluated on held-out test sets. Results are interpreted using error curves, variability across splits, confusion matrices, ROC analysis, and threshold tuning. The bias–variance tradeoff is discussed in the context of the observed tuning curves rather than abstract theory.

Section 1 - Back with Ames Housing (kNN Regression)

# Load Ames Housing Dataset (AmesHousing.csv)
ames_raw <- read.csv("AmesHousing.csv")

# Subset to required variables for kNN regression
ames <- ames_raw %>%
  select(
    SalePrice,
    Gr.Liv.Area,
    Total.Bsmt.SF,
    Garage.Area,
    Year.Built,
    Overall.Qual,
    Overall.Cond
  )

# Quick check
dim(ames)

[1] 2930    7

glimpse(ames)

Rows: 2,930
Columns: 7
$ SalePrice     <int> 215000, 105000, 172000, 244000, 189900, 195500, 213500, …
$ Gr.Liv.Area   <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 180…
$ Total.Bsmt.SF <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994, …
$ Garage.Area   <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 440, 4…
$ Year.Built    <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, 19…
$ Overall.Qual  <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, 4,…
$ Overall.Cond  <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, 5,…

Missing Values & chosen Standardization

Scaling parameters (center/scale) are fit on training data only per split. Inside each bootstrap validation, scaling is fit on bootstrap train only, then applied to out of bag.

# Median imputation (deterministic)
ames <- ames %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

stopifnot(sum(is.na(ames)) == 0)

Observations& Sale Price

n_obs_ames <- nrow(ames)

sale_summary <- ames %>%
  summarise(
    n = n(),
    min = min(SalePrice),
    q1  = quantile(SalePrice, 0.25),
    median = median(SalePrice),
    mean = mean(SalePrice),
    q3  = quantile(SalePrice, 0.75),
    max = max(SalePrice)
  )

sale_summary %>%
  kable(digits = 2, caption = "Ames Housing: Summary of SalePrice (Target)") %>%
  kable_styling(full_width = FALSE)

Ames Housing: Summary of SalePrice (Target)
n	min	q1	median	mean	q3	max
2930	12789	129500	160000	180796.1	213500	755000

ggplot(ames, aes(x = SalePrice)) +
  geom_histogram(bins = 35, color = "white") +
  labs(
    title = "SalePrice Distribution (AmesHousing.csv)",
    x = "SalePrice",
    y = "Count"
  ) +
  theme_minimal()

Question 1: How many Obeservations?

The Ames dataset contains 2,930 observations and 7 variables (1 target and 6 predictors).

After the 80/20 split, each training set contains approximately 2,344 observations and each test set contains about 586 observations. With k evaluated up to 30, even the largest k uses only about 1.3% of the training data for prediction (30 / 2344 ≈ 0.013). This ensures that the model remains local rather than collapsing toward a global average. The dataset size is therefore sufficient to support stable neighborhood estimation and repeated bootstrap tuning.

The SalePrice summary shows: Minimum: $12,789 | Median: $160,000 | Mean: $180,796 | Maximum: $755,000

The histogram indicates right-skewness. Most homes are clustered between roughly $120,000 and $250,000, while a small number of high-value properties extend far into the upper tail.

This skewness explains why prediction error increases for expensive homes in the scatter plot later. kNN averages nearby observations, so extreme values are partially smoothed.

Question 2: Why standardize kNN?

kNN relies on Euclidean distance (Straight line distance between two points in space). The predictors include square footage (e.g., Gr.Liv.Area ≈ 800–2,500), garage area (hundreds), and quality scores (1–10).

Without scaling, square footage variables would dominate distance calculations purely due to magnitude. Centering and scaling ensures each predictor contributes proportionally to neighborhood selection.

Scaling was fit on training data only within each split and bootstrap iteration to prevent leakage from test data.

Section 1.2 kNN Regression: Manual BTS Tuning & Eval

rmspe <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

Helper:

This helper function fits on the preProcess() training preditors only. Only towards scaling to help train and test.

prep_ames_split <- function(train_df, test_df) {
  # Separate X/Y
  y_train <- train_df$SalePrice
  y_test  <- test_df$SalePrice

  x_train <- train_df %>% select(-SalePrice)
  x_test  <- test_df %>% select(-SalePrice)

  # Fit scaler on TRAIN ONLY (purist)
  pp <- preProcess(x_train, method = c("center", "scale"))
  x_train_s <- predict(pp, x_train)
  x_test_s  <- predict(pp, x_test)

  list(
    x_train = as.matrix(x_train_s),
    y_train = y_train,
    x_test  = as.matrix(x_test_s),
    y_test  = y_test
  )
}

This is the core loop (tuning curves). It holds:

optimal_k_reg[i] |test_RMSPE_reg[i]| rmspe_curve_reg[i, k] - mean bootstrap RMSPE for each k in the split/

set.seed(5560)

k_grid_reg <- 1:30

optimal_k_reg <- numeric(20)
test_RMSPE_reg <- numeric(20)
rmspe_curve_reg <- matrix(NA_real_, nrow = 20, ncol = length(k_grid_reg))

for (i in 1:20) {

  # Outer split: 80/20
  idx_train <- createDataPartition(ames$SalePrice, p = 0.8, list = FALSE)
  train_data <- ames[idx_train, ]
  test_data  <- ames[-idx_train, ]

  # Purist scaling for this outer split (train only)
  split_obj <- prep_ames_split(train_data, test_data)
  x_train <- split_obj$x_train
  y_train <- split_obj$y_train
  x_test  <- split_obj$x_test
  y_test  <- split_obj$y_test

  # For bootstrap tuning we need access to the training data rows
  # We'll keep a data.frame version for resampling indexes, but do scaling inside each bootstrap (purist)
  train_df <- train_data

  mean_rmspe_k <- numeric(length(k_grid_reg))

  for (k_idx in seq_along(k_grid_reg)) {
    k_val <- k_grid_reg[k_idx]
    boot_rmspe <- numeric(20)

    for (j in 1:20) {
      boot_index <- sample(seq_len(nrow(train_df)), replace = TRUE)
      oob_index  <- setdiff(seq_len(nrow(train_df)), unique(boot_index))
      if (length(oob_index) == 0) next

      boot_train <- train_df[boot_index, ]
      boot_val   <- train_df[oob_index, ]

      # Purist scaling INSIDE bootstrap: fit on boot_train only
      pp_boot <- preProcess(boot_train %>% select(-SalePrice),
                            method = c("center", "scale"))

      x_boot_train <- as.matrix(predict(pp_boot, boot_train %>% select(-SalePrice)))
      y_boot_train <- boot_train$SalePrice

      x_boot_val <- as.matrix(predict(pp_boot, boot_val %>% select(-SalePrice)))
      y_boot_val <- boot_val$SalePrice

      model <- knnreg(x = x_boot_train, y = y_boot_train, k = k_val)
      pred  <- predict(model, x_boot_val)

      boot_rmspe[j] <- rmspe(y_boot_val, pred)
    }

    mean_rmspe_k[k_idx] <- mean(boot_rmspe, na.rm = TRUE)
  }

  rmspe_curve_reg[i, ] <- mean_rmspe_k

  best_k <- k_grid_reg[which.min(mean_rmspe_k)]
  optimal_k_reg[i] <- best_k

  # Fit final on OUTER TRAIN (already scaled with outer train-only pp)
  final_model <- knnreg(x = x_train, y = y_train, k = best_k)
  test_pred <- predict(final_model, x_test)

  test_RMSPE_reg[i] <- rmspe(y_test, test_pred)
}

mean_curve <- colMeans(rmspe_curve_reg, na.rm = TRUE)
curve_df <- data.frame(k = k_grid_reg, mean_rmspe = mean_curve)

opt_k_overall <- curve_df$k[which.min(curve_df$mean_rmspe)]
opt_rmspe_overall <- min(curve_df$mean_rmspe)

curve_df %>%
  ggplot(aes(x = k, y = mean_rmspe)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  geom_vline(xintercept = opt_k_overall, linetype = "dashed") +
  labs(
    title = "Mean Bootstrap RMSPE vs k (AmesHousing.csv)",
    subtitle = paste0("Overall optimal k ≈ ", opt_k_overall,
                      " | Mean RMSPE ≈ ", round(opt_rmspe_overall, 2)),
    x = "k",
    y = "Mean Bootstrap RMSPE"
  ) +
  theme_minimal()

Question 3: Mean RMSPE vs K + optimal k + curve shape

The mean bootstrap RMSPE curve shows a steep decline from k = 1 (≈ 37,500) to approximately k = 8–12 (≈ 31,000). The overall minimum occurs at k ≈ 20, with mean RMSPE ≈ 31,029.

For k > 20, error begins to increase slightly, reaching ≈ 31,400 by k = 30.

This pattern indicates:

k = 1 overfits (high variance)
Moderate k stabilizes predictions
Larger k begins oversmoothing (increasing bias)

The curve shape is consistent across splits, suggesting the optimal region is not driven by random sampling.

Question 4: Mean/SD test RMSPE + histogram

rmspe_df <- data.frame(test_RMSPE = test_RMSPE_reg)

rmspe_df %>%
  summarise(
    mean_test_RMSPE = mean(test_RMSPE),
    sd_test_RMSPE   = sd(test_RMSPE)
  ) %>%
  kable(digits = 3, caption = "Test RMSPE Summary Across 20 Splits (AmesHousing.csv)") %>%
  kable_styling(full_width = FALSE)

Test RMSPE Summary Across 20 Splits (AmesHousing.csv)
mean_test_RMSPE	sd_test_RMSPE
30553.31	3042.183

ggplot(rmspe_df, aes(x = test_RMSPE)) +
  geom_histogram(bins = 12, color = "white") +
  labs(
    title = "Distribution of Test RMSPE Across 20 Splits (AmesHousing.csv)",
    x = "Test RMSPE",
    y = "Count"
  ) +
  theme_minimal()

Across 20 outer splits:

Mean test RMSPE = 30,553 | Standard deviation = 3,042

The histogram shows most test errors fall between 28,000 and 33,000, with a few splits exceeding 35,000. A standard deviation of ~3,000 relative to a mean of ~30,500 indicates moderate variability (≈10%). This suggests the model generalizes consistently across different partitions of the data.

Question 5: Actual vs Predicted scatter

set.seed(5560)

idx_train <- createDataPartition(ames$SalePrice, p = 0.8, list = FALSE)
train_data <- ames[idx_train, ]
test_data  <- ames[-idx_train, ]

split_obj <- prep_ames_split(train_data, test_data)
x_train <- split_obj$x_train
y_train <- split_obj$y_train
x_test  <- split_obj$x_test
y_test  <- split_obj$y_test

# Use a representative k: overall curve minimum (computed earlier)
k_rep <- opt_k_overall

final_model <- knnreg(x = x_train, y = y_train, k = k_rep)
pred <- predict(final_model, x_test)

plot_df <- data.frame(actual = y_test, predicted = pred)

ggplot(plot_df, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6) +
  geom_abline(linetype = "dashed") +
  labs(
    title = "Actual vs Predicted SalePrice (Representative Split, AmesHousing.csv)",
    subtitle = paste0("k = ", k_rep),
    x = "Actual SalePrice",
    y = "Predicted SalePrice"
  ) +
  theme_minimal()

Question 5: Performance evaluation

Response:

The scatter plot shows predicted versus actual SalePrice for a samp split using k = 20. Predictions align closely with the 45-degree line in the mid-range of prices ($120k–$300k).

However, dispersion increases for homes above $400k. This reflects kNN’s averaging behavior: high-priced homes have fewer comparable neighbors, and predictions are pulled toward the local mean.

The model captures general structure well but struggles with extreme values.

Section 1.3: Comparison with caret::train()

#code only allowed for this section. 
set.seed(5560)

pp_all <- preProcess(ames %>% select(-SalePrice), method = c("center", "scale"))
ames_train_ready <- data.frame(
  SalePrice = ames$SalePrice,
  predict(pp_all, ames %>% select(-SalePrice))
)

ctrl <- trainControl(method = "cv", number = 5)

train_model <- train(
  SalePrice ~ .,
  data = ames_train_ready,
  method = "knn",
  tuneGrid = data.frame(k = 1:30),
  trControl = ctrl
)

train_model$bestTune

train_model$results %>%
  arrange(RMSE) %>%
  head(10)

Question 6: Compare optimal k

The manual bootstrap procedure selected k ≈ 20. Using caret’s 5-fold cross-validation, the optimal k was 8, with RMSE ≈ 29,702. Both methods selected moderate values of k within the 8–20 range. The small difference likely reflects:

Cross-validation vs bootstrap differences | 2. Random variation across folds

Importantly, both approaches reject very small k values and favor moderate smoothing.

Question 7: Bias-variance tradeoff for kNN

Section 2.1: Prediction Customer Churn kNN Classification

# Load Telco Customer Churn Dataset
telco_raw <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn(1).csv")

# Remove ID
telco_raw$customerID <- NULL

# Convert TotalCharges to numeric (creates NAs where blanks existed)
telco_raw$TotalCharges <- as.numeric(telco_raw$TotalCharges)

# Subset required variables
telco <- telco_raw %>%
  select(
    Churn,
    tenure,
    MonthlyCharges,
    TotalCharges,
    Contract,
    InternetService,
    PaymentMethod
  )

# Handle NAs (from TotalCharges conversion)
telco <- telco %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Churn as factor with explicit levels
telco$Churn <- factor(telco$Churn, levels = c("No", "Yes"))

glimpse(telco)

Rows: 7,043
Columns: 7
$ Churn           <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, No…
$ tenure          <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 25…
$ MonthlyCharges  <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75…
$ TotalCharges    <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949.…
$ Contract        <chr> "Month-to-month", "One year", "Month-to-month", "One y…
$ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opti…
$ PaymentMethod   <chr> "Electronic check", "Mailed check", "Mailed check", "B…

Question 1: Observations…Disitribtuions…Balance

The Telco dataset contains 7,043 observations and 7 variables.

Class distribution: No churn: 5,174 (73.5%) | Yes churn: 1,869 (26.5%)

This imbalance is substantial. A naive classifier predicting “No” for every customer would achieve 73.5% accuracy. Therefore, accuracy must be interpreted carefully. In business terms, the higher the churn, the worse for overall health & longevity

Question 2: The power of standardization

n_obs_telco <- nrow(telco)

class_dist <- telco %>%
  count(Churn) %>%
  mutate(pct = n / sum(n))

class_dist %>%
  kable(digits = 3, caption = "Telco Churn Class Distribution (Target: Churn)") %>%
  kable_styling(full_width = FALSE)

Telco Churn Class Distribution (Target: Churn)
Churn	n	pct
No	5174	0.735
Yes	1869	0.265

ggplot(telco, aes(x = Churn)) +
  geom_bar() +
  labs(
    title = "Churn Class Distribution (WA_Fn-UseC_-Telco-Customer-Churn.csv)",
    x = "Churn",
    y = "Count"
  ) +
  theme_minimal()

Question 2.

Repsonse:

Tenure ranges from 1 to 72 months. MonthlyCharges ranges roughly from 20 to 120. TotalCharges can exceed several thousand.

Without scaling, TotalCharges would dominate distance calculations. Dummy encoding was applied to categorical variables, followed by scaling within each training split.

This ensures neighborhoods reflect combined behavioral and financial similarity. Adhering to the integrity of the data.

Section 2.2 kNN Classification: Tuning and Evaluation

prep_telco_split <- function(train_df, test_df) {

  # Dummy encoding on TRAIN ONLY (purist)
  dmy <- dummyVars(Churn ~ ., data = train_df, fullRank = TRUE)
  x_train <- predict(dmy, newdata = train_df)
  x_test  <- predict(dmy, newdata = test_df)

  # Scale on TRAIN ONLY (purist)
  pp <- preProcess(x_train, method = c("center", "scale"))
  x_train_s <- predict(pp, x_train)
  x_test_s  <- predict(pp, x_test)

  list(
    x_train = as.matrix(x_train_s),
    y_train = train_df$Churn,
    x_test  = as.matrix(x_test_s),
    y_test  = test_df$Churn
  )
}

Core Loop

Accuracy Curve & Metrics with AUC

set.seed(5560)

k_grid_class <- seq(1, 29, by = 2)

optimal_k_class <- numeric(20)
acc_curve_class <- matrix(NA_real_, nrow = 20, ncol = length(k_grid_class))

accuracy_vec  <- numeric(20)
precision_vec <- numeric(20)
recall_vec    <- numeric(20)
f1_vec        <- numeric(20)
auc_vec       <- numeric(20)

for (i in 1:20) {

  idx_train <- createDataPartition(telco$Churn, p = 0.8, list = FALSE)
  train_data <- telco[idx_train, ]
  test_data  <- telco[-idx_train, ]

  mean_acc_k <- numeric(length(k_grid_class))

  for (k_idx in seq_along(k_grid_class)) {
    k_val <- k_grid_class[k_idx]
    boot_acc <- numeric(20)

    for (j in 1:20) {
      boot_index <- sample(seq_len(nrow(train_data)), replace = TRUE)
      oob_index  <- setdiff(seq_len(nrow(train_data)), unique(boot_index))
      if (length(oob_index) == 0) next

      boot_train <- train_data[boot_index, ]
      boot_val   <- train_data[oob_index, ]

      # Purist dummy + scale inside bootstrap
      dmy_boot <- dummyVars(Churn ~ ., data = boot_train, fullRank = TRUE)
      x_bt <- predict(dmy_boot, newdata = boot_train)
      x_bv <- predict(dmy_boot, newdata = boot_val)

      pp_boot <- preProcess(x_bt, method = c("center", "scale"))
      x_bt_s <- as.matrix(predict(pp_boot, x_bt))
      x_bv_s <- as.matrix(predict(pp_boot, x_bv))

      y_bt <- boot_train$Churn
      y_bv <- boot_val$Churn

      model <- knn3(x = x_bt_s, y = y_bt, k = k_val)
      preds <- predict(model, x_bv_s, type = "class")

      boot_acc[j] <- mean(preds == y_bv)
    }

    mean_acc_k[k_idx] <- mean(boot_acc, na.rm = TRUE)
  }

  acc_curve_class[i, ] <- mean_acc_k

  best_k <- k_grid_class[which.max(mean_acc_k)]
  optimal_k_class[i] <- best_k

  # Final fit on OUTER TRAIN with purist preprocessing
  split_obj <- prep_telco_split(train_data, test_data)
  x_train <- split_obj$x_train
  y_train <- split_obj$y_train
  x_test  <- split_obj$x_test
  y_test  <- split_obj$y_test

  final_model <- knn3(x = x_train, y = y_train, k = best_k)

  class_preds <- predict(final_model, x_test, type = "class")
  prob_preds  <- predict(final_model, x_test, type = "prob")

  accuracy_vec[i] <- mean(class_preds == y_test)

  cm <- confusionMatrix(class_preds, y_test, positive = "Yes")
  precision_vec[i] <- as.numeric(cm$byClass["Precision"])
  recall_vec[i]    <- as.numeric(cm$byClass["Recall"])
  f1_vec[i]        <- as.numeric(cm$byClass["F1"])

  roc_obj <- roc(y_test, prob_preds[, "Yes"])
  auc_vec[i] <- as.numeric(auc(roc_obj))
}

Question 3: Mean Accuracy

The bootstrap accuracy curve increases steadily from ≈0.739 at k = 1 to ≈0.785 at k ≈ 27–29. The overall optimal k is in the high 20s, with mean accuracy ≈ 0.785.

Unlike regression, the curve does not exhibit a strong U-shape. Larger k values improve stability under class imbalance.

mean_acc_curve <- colMeans(acc_curve_class, na.rm = TRUE)
acc_df <- data.frame(k = k_grid_class, mean_accuracy = mean_acc_curve)

opt_k_class_overall <- acc_df$k[which.max(acc_df$mean_accuracy)]
opt_acc_overall <- max(acc_df$mean_accuracy)

ggplot(acc_df, aes(x = k, y = mean_accuracy)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  geom_vline(xintercept = opt_k_class_overall, linetype = "dashed") +
  labs(
    title = "Mean Bootstrap Accuracy vs k (Telco Churn Dataset)",
    subtitle = paste0("Overall optimal k ≈ ", opt_k_class_overall,
                      " | Mean Accuracy ≈ ", round(opt_acc_overall, 3)),
    x = "k",
    y = "Mean Bootstrap Accuracy"
  ) +
  theme_minimal()

Question 4: Mean & SD characteristics n metrics

metrics_tbl <- data.frame(
  Metric = c("Accuracy","Precision","Recall","F1","AUC"),
  Mean = c(mean(accuracy_vec), mean(precision_vec), mean(recall_vec), mean(f1_vec), mean(auc_vec)),
  SD   = c(sd(accuracy_vec), sd(precision_vec), sd(recall_vec), sd(f1_vec), sd(auc_vec))
)

metrics_tbl %>%
  kable(digits = 3, caption = "Telco Churn: Test Metrics Across 20 Splits") %>%
  kable_styling(full_width = FALSE)

Telco Churn: Test Metrics Across 20 Splits
Metric	Mean	SD
Accuracy	0.792	0.009
Precision	0.639	0.022
Recall	0.495	0.026
F1	0.558	0.022
AUC	0.832	0.009

Across 20 splits:

Accuracy exceeds the baseline (0.735) by ~6 percentage points.

However, recall remains below 0.50, meaning the model identifies fewer than half of churners at the default threshold.

Question 5: Confusion Matrix

set.seed(5560)

idx_train <- createDataPartition(telco$Churn, p = 0.8, list = FALSE)
train_data <- telco[idx_train, ]
test_data  <- telco[-idx_train, ]

split_obj <- prep_telco_split(train_data, test_data)
x_train <- split_obj$x_train
y_train <- split_obj$y_train
x_test  <- split_obj$x_test
y_test  <- split_obj$y_test

k_rep_class <- opt_k_class_overall

final_model <- knn3(x = x_train, y = y_train, k = k_rep_class)
class_preds <- predict(final_model, x_test, type = "class")

confusionMatrix(class_preds, y_test, positive = "Yes")

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  933 198
       Yes 101 175
                                          
               Accuracy : 0.7875          
                 95% CI : (0.7652, 0.8086)
    No Information Rate : 0.7349          
    P-Value [Acc > NIR] : 2.830e-06       
                                          
                  Kappa : 0.4052          
                                          
 Mcnemar's Test P-Value : 2.827e-08       
                                          
            Sensitivity : 0.4692          
            Specificity : 0.9023          
         Pos Pred Value : 0.6341          
         Neg Pred Value : 0.8249          
             Prevalence : 0.2651          
         Detection Rate : 0.1244          
   Detection Prevalence : 0.1962          
      Balanced Accuracy : 0.6857          
                                          
       'Positive' Class : Yes

For the representative split:

Sensitivity (Recall): 0.469 | Specificity: 0.902 | Precision: 0.634 | Accuracy: 0.788

The model correctly identifies 90% of non-churners but only 47% of churners. Balanced accuracy = 0.686, which better reflects performance under imbalance.

Sectction 2.3: Interpretation

Question 6. How misleading is accuracy

Very. Accuracy of 0.788 appears strong relative to 0.735 baseline. However, recall of 0.469 indicates that more than half of churners are missed. AUC of 0.832 shows the model ranks churn risk well, even if the default threshold under-detects churn.

Therefore, AUC provides a more complete assessment than accuracy alone.

Question 7. Steps to improve churn?

Lowering the probability threshold increases recall substantially.

At threshold = 0.2:

Recall increases to 0.863 | Precision drops to 0.489 | Accuracy falls to 0.725

At threshold = 0.5:

Recall = 0.469 | Precision = 0.634

Youden’s optimal threshold ≈ 0.237 yields:

Sensitivity = 0.836 | Specificity = 0.714

This threshold better balances false positives and false negatives, depending on business priorities. For this set, Youden’s threshold holds better with this dataset.

Bonus: Threshold Tuning + Youden’s J+ AUC:

# Representative split from above assumed available:
# final_model, x_test, y_test
prob_preds <- predict(final_model, x_test, type = "prob")[, "Yes"]

thresholds <- c(0.2, 0.3, 0.4, 0.5, 0.6)

threshold_results <- map_dfr(thresholds, function(th) {
  pred_custom <- ifelse(prob_preds > th, "Yes", "No")
  pred_custom <- factor(pred_custom, levels = c("No", "Yes"))

  cm <- confusionMatrix(pred_custom, y_test, positive = "Yes")

  tibble(
    threshold = th,
    accuracy  = cm$overall["Accuracy"],
    precision = cm$byClass["Precision"],
    recall    = cm$byClass["Recall"],
    f1        = cm$byClass["F1"]
  )
})

threshold_results %>%
  kable(digits = 3, caption = "Threshold Tuning Results (Telco Churn)") %>%
  kable_styling(full_width = FALSE)

Threshold Tuning Results (Telco Churn)
threshold	accuracy	precision	recall	f1
0.2	0.725	0.489	0.863	0.625
0.3	0.758	0.530	0.764	0.626
0.4	0.782	0.579	0.649	0.612
0.5	0.787	0.634	0.469	0.539
0.6	0.787	0.699	0.343	0.460

threshold_long <- threshold_results %>%
  pivot_longer(cols = c(accuracy, precision, recall, f1),
               names_to = "metric", values_to = "value")

ggplot(threshold_long, aes(x = threshold, y = value)) +
  geom_line() +
  geom_point(size = 2) +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "Metric Sensitivity to Threshold (Telco Churn)",
    x = "Threshold (Predict Yes if P(Yes) > threshold)",
    y = "Metric Value"
  ) +
  theme_minimal()

Youden’s

roc_obj <- roc(y_test, prob_preds)
plot(roc_obj, main = "ROC Curve (Telco Churn, kNN)")

youden <- coords(roc_obj, "best", best.method = "youden",
                 ret = c("threshold", "sensitivity", "specificity"))
youden

Bonus:

Using the optimal k (≈ 27–29) and one representative split, I evaluated five probability thresholds from 0.2 to 0.6. As the threshold increased, precision rose from 0.489 to 0.699, while recall dropped sharply from 0.863 to 0.343. Accuracy remained relatively stable (0.725–0.787), which reinforces that accuracy alone does not capture model behavior under class imbalance. The F1-score peaked around 0.2–0.3 (≈ 0.626), then declined as recall deteriorated. At the default 0.5 threshold, recall was only 0.469, meaning more than half of churners were missed.

For a churn problem, missing a churner is typically more costly than contacting a non-churner. A threshold around 0.3 provides a better balance: recall increases to 0.764 while precision remains reasonable at 0.530. This setting materially improves churn detection compared to 0.5 without collapsing precision. Therefore, I would recommend a threshold near 0.3, as it aligns better with the objective of identifying at-risk customers while maintaining controlled false positives.

Section 3: Wrap up Analysis.

curve_reg_df <- data.frame(k = k_grid_reg, mean_rmspe = colMeans(rmspe_curve_reg, na.rm = TRUE))
curve_cls_df <- data.frame(k = k_grid_class, mean_acc  = colMeans(acc_curve_class, na.rm = TRUE))

p_reg <- ggplot(curve_reg_df, aes(x = k, y = mean_rmspe)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(title = "Regression: RMSPE vs k (AmesHousing.csv)",
       x = "k", y = "Mean Bootstrap RMSPE") +
  theme_minimal()

p_cls <- ggplot(curve_cls_df, aes(x = k, y = mean_acc)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(title = "Classification: Accuracy vs k (Telco Churn)",
       x = "k", y = "Mean Bootstrap Accuracy") +
  theme_minimal()

p_reg + p_cls

Section 3: Comparison

The side-by-side comparison shows clear differences in how performance changes with k for regression and classification.

For regression (Ames Housing)

The optimal k from bootstrap tuning was approximately k ≈ 20.

The minimum mean bootstrap RMSPE was approximately 31,029.

The curve displays a clear U-shape: RMSPE is very high at k = 1 (≈ 37,500), declines sharply until about k = 8–12, stabilizes near k = 20, and then increases slightly for larger k.

For classification (Telco Churn).

The optimal k was in the high range, approximately k ≈ 27–29.

The maximum mean bootstrap accuracy was approximately 0.785.

The accuracy curve increases steadily with k and does not show a strong U-shape.

The optimal k values differ because the two tasks behave differently under smoothing:

In regression, increasing k beyond the optimal region increases bias and leads to oversmoothing, which increases RMSPE.

In classification, especially under class imbalance (73.5% non-churn), larger k values stabilize predictions and reduce variance, leading to improved accuracy.

In short, regression required a balance between variance and bias, while classification benefited more consistently from smoothing due to noise and imbalance in class labels.

library(DiagrammeR)

grViz("
digraph G {
  rankdir=TB;
  
  node [shape=box];

  A[label='Outer Split (80/20)'];
  B[label='Grid Search over k'];
  C[label='Bootstrap (20x)'];
  D[label='Mean Metric per k'];
  E[label='Select Optimal k'];
  F[label='Evaluate on Test Set'];
  G[label='Store Metrics'];

  A -> B -> C -> D -> E -> F -> G;
}
")

set.seed(5560)

n <- 100
boot_index <- sample(1:n, replace = TRUE)

data.frame(index = boot_index) %>%
  ggplot(aes(x = index)) +
  geom_histogram(binwidth = 1) +
  labs(
    title = "Example Bootstrap Sampling (With Replacement)",
    subtitle = "Notice some observations appear multiple times, others are missing",
    x = "Original Observation Index",
    y = "Frequency in Bootstrap Sample"
  ) +
  theme_minimal()

set.seed(5560)

n <- 100
boot_index <- sample(1:n, replace = TRUE)

data.frame(index = boot_index) %>%
  ggplot(aes(x = index)) +
  geom_histogram(binwidth = 1) +
  labs(
    title = "Example Bootstrap Sampling (With Replacement)",
    subtitle = "Notice some observations appear multiple times, others are missing",
    x = "Original Observation Index",
    y = "Frequency in Bootstrap Sample"
  ) +
  theme_minimal()

oob_index <- setdiff(1:n, unique(boot_index))
length(oob_index) / n

[1] 0.37

library(ggplot2)

x <- seq(0, 10, length.out = 100)
y_true <- sin(x)
y_noisy <- y_true + rnorm(100, sd = 0.3)

df <- data.frame(x, y_noisy)

ggplot(df, aes(x, y_noisy)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Noisy Data Example: What kNN Is Trying to Approximate",
    subtitle = "Small k → jagged fit | Large k → smoother fit"
  ) +
  theme_minimal()

mean_acc <- colMeans(acc_curve_class)
sd_acc   <- apply(acc_curve_class, 2, sd)

acc_df <- data.frame(
  k = k_grid_class,
  mean = mean_acc,
  lower = mean_acc - sd_acc,
  upper = mean_acc + sd_acc
)

ggplot(acc_df, aes(x = k, y = mean)) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  labs(
    title = "Bootstrap Accuracy vs k (with Variability)",
    subtitle = "Ribbon = ±1 SD across outer splits",
    x = "k",
    y = "Mean Accuracy"
  ) +
  theme_minimal()

# Compute ROC
roc_obj <- roc(y_test, prob_preds)

roc_df <- data.frame(
  tpr = roc_obj$sensitivities,
  fpr = 1 - roc_obj$specificities
)

# Compute Youden optimal point
youden <- coords(roc_obj, "best", best.method = "youden")

youden_df <- data.frame(
  fpr = 1 - as.numeric(youden["specificity"]),
  tpr = as.numeric(youden["sensitivity"])
)

ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line(linewidth = 1) +
  geom_abline(linetype = "dashed") +
  geom_point(data = youden_df,
             aes(x = fpr, y = tpr),
             color = "red",
             size = 3) +
  labs(
    title = "ROC Curve with Youden Optimal Threshold (Telco Churn kNN)",
    x = "False Positive Rate",
    y = "True Positive Rate"
  ) +
  theme_minimal()

k_vals <- 1:30
bias_curve <- log(k_vals)
variance_curve <- rev(log(k_vals))

df_bias <- data.frame(
  k = k_vals,
  Bias = bias_curve,
  Variance = variance_curve
)

df_long <- pivot_longer(df_bias, -k)

ggplot(df_long, aes(x = k, y = value, color = name)) +
  geom_line() +
  labs(
    title = "Conceptual Bias–Variance Tradeoff in kNN",
    x = "k",
    y = "Relative Magnitude"
  ) +
  theme_minimal()

grViz("
digraph G {
  rankdir=LR;
  node [shape=box];

  A[label='Raw Data'];
  B[label='Train/Test Split'];
  C[label='Bootstrap Tuning'];
  D[label='Select k'];
  E[label='Fit Final Model'];
  F[label='Evaluate Metrics'];
  G[label='ROC / Threshold Optimization'];

  A -> B -> C -> D -> E -> F -> G;
}
")

Use of AI: AI was used as a tool to enhance the analysis. Materials used were only internal - non from invalidated open-source. AI kept me in alignment with the assignment parameters while outlining blindspots & improving user reasoning & task execution.