Lab-9

Author

Jagath Kumar Reddy Katama Reddy

library(ISLR2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tree)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
library(reshape2)
library(ggplot2)

Attaching package: 'ggplot2'
The following object is masked from 'package:randomForest':

    margin
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
The following object is masked from 'package:ISLR2':

    Boston
library(BART)
Loading required package: nlme

Attaching package: 'nlme'
The following object is masked from 'package:dplyr':

    collapse
Loading required package: survival
library(gbm)
Loaded gbm 2.2.2
This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(class)
library(caret)
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:survival':

    cluster
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var

7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.

set.seed(18)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- Boston[-train, ]
test_medv <- test$medv

# mtry values spanning roughly p (near-bagging), p/2, sqrt(p), and a small value
# Note: MASS::Boston masks ISLR2::Boston here, so there are 13 predictors (sqrt(13) ≈ 3.6)
mtry_vals <- c(12, 6, 3, 2)
ntree_vals <- seq(25, 500, by = 25)  # Reduced range for computational efficiency

# Create an empty matrix to store test MSE
test_mse <- matrix(NA, nrow = length(ntree_vals), ncol = length(mtry_vals),
                   dimnames = list(ntree_vals, paste0("mtry_", mtry_vals)))

# Loop over combinations of mtry and ntree
for (i in seq_along(mtry_vals)) {
  for (j in seq_along(ntree_vals)) {
    rf.model <- randomForest(medv ~ ., data = Boston, subset = train,
                             mtry = mtry_vals[i], ntree = ntree_vals[j])
    pred <- predict(rf.model, newdata = test)
    test_mse[j, i] <- mean((pred - test_medv)^2)
  }
}
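Refitting a separate forest for every ntree value is computationally wasteful. As a minimal alternative sketch (re-using the train, test_medv, and mtry_vals objects defined above), randomForest can record the test MSE after each tree in a single fit per mtry value when xtest and ytest are supplied:

# Alternative sketch: one fit per mtry, with per-tree test MSE tracked by randomForest
x_train <- Boston[train, setdiff(names(Boston), "medv")]
y_train <- Boston[train, "medv"]
x_test  <- Boston[-train, setdiff(names(Boston), "medv")]

rf_curves <- sapply(mtry_vals, function(m) {
  fit <- randomForest(x_train, y_train, xtest = x_test, ytest = test_medv,
                      mtry = m, ntree = 500)
  fit$test$mse   # test MSE after 1, 2, ..., 500 trees
})
colnames(rf_curves) <- paste0("mtry_", mtry_vals)

The resulting 500 x 4 matrix could then be melted and plotted in the same way as test_mse below.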

# Create data frame for plotting
library(reshape2)
library(ggplot2)

df_plot <- melt(test_mse, varnames = c("ntree", "mtry"), value.name = "TestMSE")
df_plot$mtry <- as.factor(gsub("mtry_", "", df_plot$mtry))
df_plot$ntree <- as.numeric(as.character(df_plot$ntree))

# Plot with better formatting
ggplot(df_plot, aes(x = ntree, y = TestMSE, color = mtry)) +
  geom_line(linewidth = 1) +
  geom_point(size = 0.8, alpha = 0.5) +
  scale_color_brewer(palette = "Set1", name = "mtry") +
  labs(title = "Random Forest Test MSE vs. Number of Trees",
       subtitle = "Boston Housing Dataset",
       x = "Number of Trees (ntree)",
       y = "Test Mean Squared Error") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"))

Observations: 1) With the validation-set approach, the test MSE tends to decrease as mtry increases. 2) As ntree increases, the test MSE decreases up to a certain number of trees and then stabilises. 3) Therefore mtry = 12 (essentially all of the predictors, i.e. close to bagging) gives a better test MSE than smaller values of mtry, and increasing ntree beyond a few hundred trees yields minimal gains and only small fluctuations in test MSE.
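As a quick numerical check on these observations, the grid cell with the lowest test MSE can be pulled directly from the test_mse matrix computed above (a minimal sketch; the exact winner depends on the random seed):

# Locate the (ntree, mtry) combination with the smallest test MSE
min_idx <- which(test_mse == min(test_mse), arr.ind = TRUE)
data.frame(ntree   = rownames(test_mse)[min_idx[1, "row"]],
           mtry    = colnames(test_mse)[min_idx[1, "col"]],
           TestMSE = min(test_mse))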

Question 11: Caravan Dataset Analysis

(a) Split into Training and Test Sets

library(ISLR2)
library(dplyr)

CaravanClean <- na.omit(Caravan)

# Split: first 1000 = training, rest = test
train_indices <- 1:1000
caravan_train <- CaravanClean[train_indices, ]
caravan_test <- CaravanClean[-train_indices, ]
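Since purchases are rare in the Caravan data (a point the discussion at the end relies on), a quick look at the class balance in each split is useful before modelling (a minimal sketch using the objects just created):

# Class balance of the response in the two splits
table(caravan_train$Purchase)
prop.table(table(caravan_train$Purchase))
prop.table(table(caravan_test$Purchase))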

(b) Boosting Models with Different Shrinkage Values

library(gbm)

# Convert response to binary for gbm
caravan_train$Purchase <- as.numeric(caravan_train$Purchase == "Yes")

boost_0.01 <- gbm(
  Purchase ~ ., 
  data = caravan_train,
  distribution = "bernoulli",
  n.trees = 1000,
  shrinkage = 0.01,
  verbose = FALSE
)
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 71: AVRAAUT has no variation.
# Variable importance plot
summary(boost_0.01)

              var     rel.inf
PPERSAUT PPERSAUT 14.56833294
MKOOPKLA MKOOPKLA 10.31184614
MOPLHOOG MOPLHOOG  6.85258691
PBRAND     PBRAND  6.25025129
MBERMIDD MBERMIDD  4.97334157
MGODGE     MGODGE  4.66671342
ABRAND     ABRAND  3.83792152
MINK3045 MINK3045  3.48544853
MSKC         MSKC  2.44618746
MGODPR     MGODPR  2.36668937
MSKA         MSKA  2.33872928
MSKB1       MSKB1  2.27075876
MAUT1       MAUT1  2.21480274
MAUT2       MAUT2  2.19984153
MOSTYPE   MOSTYPE  2.19287146
PWAPART   PWAPART  2.12535640
MBERARBG MBERARBG  1.99835717
PBYSTAND PBYSTAND  1.97723504
MBERHOOG MBERHOOG  1.96517994
MAUT0       MAUT0  1.78956623
MGODOV     MGODOV  1.75254205
MINKGEM   MINKGEM  1.57391055
MRELGE     MRELGE  1.20349723
MFGEKIND MFGEKIND  1.16494562
APERSAUT APERSAUT  1.05449772
MBERARBO MBERARBO  0.97495004
MFWEKIND MFWEKIND  0.84339940
MOPLMIDD MOPLMIDD  0.82364704
MBERBOER MBERBOER  0.80841336
MINK4575 MINK4575  0.78777079
MGODRK     MGODRK  0.70763986
MSKB2       MSKB2  0.63749603
MINK7512 MINK7512  0.61411491
MSKD         MSKD  0.59186787
MRELOV     MRELOV  0.59123209
MZPART     MZPART  0.58944741
MFALLEEN MFALLEEN  0.53088755
MHHUUR     MHHUUR  0.46304963
PMOTSCO   PMOTSCO  0.46008297
MGEMOMV   MGEMOMV  0.40798393
PLEVEN     PLEVEN  0.37089384
MGEMLEEF MGEMLEEF  0.36698885
MHKOOP     MHKOOP  0.31999996
MOPLLAAG MOPLLAAG  0.31259175
MINKM30   MINKM30  0.29567346
MOSHOOFD MOSHOOFD  0.29522611
MZFONDS   MZFONDS  0.29036252
MINK123M MINK123M  0.12077005
MBERZELF MBERZELF  0.09414952
MRELSA     MRELSA  0.07492588
MAANTHUI MAANTHUI  0.04502430
PWABEDR   PWABEDR  0.00000000
PWALAND   PWALAND  0.00000000
PBESAUT   PBESAUT  0.00000000
PVRAAUT   PVRAAUT  0.00000000
PAANHANG PAANHANG  0.00000000
PTRACTOR PTRACTOR  0.00000000
PWERKT     PWERKT  0.00000000
PBROM       PBROM  0.00000000
PPERSONG PPERSONG  0.00000000
PGEZONG   PGEZONG  0.00000000
PWAOREG   PWAOREG  0.00000000
PZEILPL   PZEILPL  0.00000000
PPLEZIER PPLEZIER  0.00000000
PFIETS     PFIETS  0.00000000
PINBOED   PINBOED  0.00000000
AWAPART   AWAPART  0.00000000
AWABEDR   AWABEDR  0.00000000
AWALAND   AWALAND  0.00000000
ABESAUT   ABESAUT  0.00000000
AMOTSCO   AMOTSCO  0.00000000
AVRAAUT   AVRAAUT  0.00000000
AAANHANG AAANHANG  0.00000000
ATRACTOR ATRACTOR  0.00000000
AWERKT     AWERKT  0.00000000
ABROM       ABROM  0.00000000
ALEVEN     ALEVEN  0.00000000
APERSONG APERSONG  0.00000000
AGEZONG   AGEZONG  0.00000000
AWAOREG   AWAOREG  0.00000000
AZEILPL   AZEILPL  0.00000000
APLEZIER APLEZIER  0.00000000
AFIETS     AFIETS  0.00000000
AINBOED   AINBOED  0.00000000
ABYSTAND ABYSTAND  0.00000000
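The full relative-influence table above is long; for reporting, the top predictors can be extracted and plotted instead (a minimal sketch; summary() with plotit = FALSE returns the same table as a data frame):

# Bar chart of the ten most influential predictors in boost_0.01
rel_inf <- summary(boost_0.01, plotit = FALSE)
top10   <- head(rel_inf[order(-rel_inf$rel.inf), ], 10)

ggplot(top10, aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Relative influence",
       title = "Top 10 predictors (boosting, shrinkage = 0.01)") +
  theme_minimal()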

(c) Prediction with Boosting and Evaluation

caravan_test$Purchase_bin <- as.numeric(caravan_test$Purchase == "Yes")

shrinkages <- c(0.01, 0.05, 0.1, 0.2)
boosting_results <- data.frame()
best_model <- NULL
best_precision <- 0

for (lambda in shrinkages) {
  model <- gbm(
    Purchase ~ ., 
    data = caravan_train,
    distribution = "bernoulli",
    n.trees = 1000,
    shrinkage = lambda,
    verbose = FALSE
  )
  
  probs <- predict(model, newdata = caravan_test, n.trees = 1000, type = "response")
  preds <- ifelse(probs > 0.2, 1, 0)
  
  cm <- table(Predicted = preds, Actual = caravan_test$Purchase_bin)
  
  precision <- ifelse("1" %in% rownames(cm), cm["1", "1"] / sum(cm["1", ]), NA)
  accuracy <- sum(diag(cm)) / sum(cm)
  
  boosting_results <- rbind(boosting_results, data.frame(Shrinkage = lambda, Precision = precision, Accuracy = accuracy))
  
  if (!is.na(precision) && precision > best_precision) {
    best_precision <- precision
    best_model <- model
  }
}
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 71: AVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 71: AVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 71: AVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
: variable 71: AVRAAUT has no variation.
print(boosting_results)
  Shrinkage Precision  Accuracy
1      0.01 0.2187500 0.9214019
2      0.05 0.1306991 0.8896723
3      0.10 0.1338583 0.8822066
4      0.20 0.1202532 0.8654085
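The discussion below compares these results with the KNN confusion matrices, so it helps to print the confusion matrix of the best boosting model at the same 20% threshold (a minimal sketch; re-uses best_model and caravan_test from above):

# Confusion matrix for the best boosting model at the 20% threshold
probs_best <- predict(best_model, newdata = caravan_test, n.trees = 1000, type = "response")
preds_best <- ifelse(probs_best > 0.2, 1, 0)
table(Predicted = preds_best, Actual = caravan_test$Purchase_bin)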

KNN Implementation

library(class)
library(caret)

clean_data <- na.omit(Caravan)
train_data <- clean_data[1:1000, ]
test_data <- clean_data[-(1:1000), ]

# Select numeric columns and remove zero-variance columns
numeric_cols <- sapply(clean_data[,-86], is.numeric)
numeric_vars <- names(clean_data)[-86][numeric_cols]
zero_var_cols <- sapply(train_data[, numeric_vars], function(col) sd(col) == 0)
numeric_vars <- numeric_vars[!zero_var_cols]

# Standardize using training data
train_X <- scale(train_data[, numeric_vars])
test_X <- scale(test_data[, numeric_vars],
                center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))

train_y_knn <- train_data$Purchase
test_y_knn <- test_data$Purchase

evaluate_knn <- function(k) {
  pred_knn <- knn(train = train_X, test = test_X, cl = train_y_knn, k = k)
  cm <- table(Predicted = pred_knn, Actual = test_y_knn)
  
  if ("Yes" %in% rownames(cm)) {
    precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])
  } else {
    precision <- NA
  }

  cat(paste0("\n--- KNN (k = ", k, ") ---\n"))
  print(cm)
  cat(sprintf("Fraction of predicted 'Yes' that are correct: %.3f\n", precision))
  
  return(list(pred = pred_knn, cm = cm, precision = precision))
}

# Try multiple k values
knn_results <- list()
for (k in c(1, 3, 5, 7, 9)) {
  knn_results[[paste0("k", k)]] <- evaluate_knn(k)
}

--- KNN (k = 1) ---
         Actual
Predicted   No  Yes
      No  4255  249
      Yes  278   40
Fraction of predicted 'Yes' that are correct: 0.126

--- KNN (k = 3) ---
         Actual
Predicted   No  Yes
      No  4449  267
      Yes   84   22
Fraction of predicted 'Yes' that are correct: 0.208

--- KNN (k = 5) ---
         Actual
Predicted   No  Yes
      No  4516  281
      Yes   17    8
Fraction of predicted 'Yes' that are correct: 0.320

--- KNN (k = 7) ---
         Actual
Predicted   No  Yes
      No  4525  288
      Yes    8    1
Fraction of predicted 'Yes' that are correct: 0.111

--- KNN (k = 9) ---
         Actual
Predicted   No  Yes
      No  4532  289
      Yes    1    0
Fraction of predicted 'Yes' that are correct: 0.000
# Store best KNN result
best_k <- 5  # k = 5 gave the highest precision (0.320) among the values tried above
best_knn_pred <- knn_results[[paste0("k", best_k)]]$pred
test_Y <- test_y_knn
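Rather than hard-coding best_k, the best value can also be chosen programmatically from the stored results (a minimal sketch using knn_results from above):

# Pick the k with the highest precision among the values tried
precisions    <- sapply(knn_results, function(res) res$precision)
best_k_name   <- names(which.max(precisions))   # which.max skips NA precisions
best_knn_pred <- knn_results[[best_k_name]]$pred
precisions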
  • Best Shrinkage Value: The optimal λ is 0.01, which yielded the highest precision of about 21.9% (0.219 in the table above). A smaller step size lets the boosting procedure make careful, incremental improvements that capture subtle structure in the data without overfitting.

  • Most Important Predictors: The top predictors, PPERSAUT (likely representing car ownership) and MKOOPKLA (likely representing purchasing power or socioeconomic class), are the most influential variables for whether someone purchases insurance. This is a useful business insight, highlighting customer characteristics that correlate strongly with purchase behaviour.

  • Precision: With a precision of about 21.9% at shrinkage 0.01 and the 20% threshold, roughly 1 in 5 customers predicted to purchase actually did so. While this seems low in absolute terms, it is a substantial improvement over random targeting, given the low base rate of purchases in insurance data.

  • Comparison with KNN: KNN with k = 5 achieved a higher precision of 32%, meaning 32% of those predicted to purchase actually did, but it made far fewer positive predictions overall (only 25 test cases were predicted as purchases, 8 of them correct). The boosting model flags more potential customers while maintaining a reasonable precision rate.

  • Interpretability Advantage: Although KNN had the higher precision, the boosting model identifies the influential variables (PPERSAUT and MKOOPKLA), which provides actionable insight for broader marketing strategy that KNN cannot offer.

  • Thresholding Strategy: The 20% probability threshold is a sensible choice for this imbalanced dataset; it identifies many more potential customers than the default 50% cutoff, trading some false positives for fewer missed opportunities (false negatives), which are the costlier error in this domain (see the sketch after this list).

  • In summary, while KNN with k = 5 achieved higher precision (32% vs. about 21.9%), the boosting model provides a better balance between prediction volume and precision, making it potentially more useful in a practical marketing setting where a reasonable number of likely purchasers must be identified.
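To illustrate the thresholding point referenced above, the sketch below contrasts the 20% cutoff with the default 50% cutoff for the best boosting model (re-uses best_model and caravan_test from above; the exact counts will depend on the fitted model):

# Number of customers flagged and precision at two probability cutoffs
probs_best <- predict(best_model, newdata = caravan_test, n.trees = 1000, type = "response")
for (thr in c(0.2, 0.5)) {
  flagged <- probs_best > thr
  prec <- if (any(flagged)) mean(caravan_test$Purchase_bin[flagged]) else NA_real_
  cat(sprintf("cutoff = %.2f: predicted purchasers = %d, precision = %.3f\n",
              thr, sum(flagged), prec))
}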