This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a)
Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.4.3
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:ISLR2':
Boston
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:MASS':
select
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# Display a summary of the Weekly dataset
summary(Weekly)
Year Lag1 Lag2 Lag3
Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
Lag4 Lag5 Volume Today
Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
Direction
Down:484
Up :605
# Let's first check the pairs of the Weekly dataset to visualize relationships
plot_obj <- ggpairs(
  Weekly,
  aes(color = factor(Direction), alpha = 0.5),
  upper = list(continuous = wrap("cor", size = 2.5)),
  lower = list(continuous = wrap("points", alpha = 0.6, size = 1.5)),
  diag = list(continuous = wrap("densityDiag"))
) +
  theme_minimal() +
  labs(
    title = "Scatterplot Matrix of Weekly Data",
    subtitle = "Colored by Direction",
    color = "Direction"
  )
suppressWarnings(suppressMessages(print(plot_obj)))
Response:
The scatterplot matrix above provides a visual summary of the relationships between the variables in the Weekly dataset. For the most part the plots are unremarkable, with few discernible patterns. However, there are some observations:
- Although the correlation between Year and Volume is high (r = 0.84), the relationship does not appear strictly linear. A visual inspection of the scatterplot suggests a nonlinear, upward-curving trend in Volume over time, indicating that trading volume has generally increased over the years (see the quick check below).
- The lag variables (Lag1 through Lag5) exhibit only weak relationships with the Direction variable (whether the market went Up or Down in a given week). Among these, Lag2, Lag4, and Lag5 stand out slightly more than the others.
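As a quick numerical check of the Year–Volume relationship, the correlation can be compared on the original and log scales; a markedly higher value on the log scale would support the upward-curving trend noted above. This is a small sketch: only the r = 0.84 figure appears in the output above, and the log-scale value is not reproduced here.
# Correlation of Volume with Year on the original scale (about 0.84)
cor(Weekly$Year, Weekly$Volume)
# Correlation after a log transform of Volume; a higher value here would
# indicate roughly exponential growth in trading volume over time
cor(Weekly$Year, log(Weekly$Volume))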
(b)
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
# Logistic regression with Direction as the response and lag variables plus Volume as predictors
logistic_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                      data = Weekly, family = binomial)
summary(logistic_model)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Weekly)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
Response:
In the logistic regression model, the coefficient for Lag2 is estimated at 0.05844 with a standard error of 0.02686, resulting in a z-value of 2.175 and a p-value of 0.0296. Since the p-value is below the 0.05 significance threshold, we conclude that Lag2 is statistically significant at the 5% level.
This positive coefficient indicates that higher returns two weeks ago are associated with an increased likelihood that the market will go Up in the current week. Specifically, for every 1% increase in Lag2, the log-odds of an Up week increase by approximately 0.05844. When translated to odds, this equates to roughly a 6% increase in the odds of the market going Up, holding all other variables constant.
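For reference, this odds multiplier can be computed directly from the fitted model above (a one-line check using the logistic_model object):
# Multiplicative change in the odds of an Up week per 1% increase in Lag2
exp(coef(logistic_model)["Lag2"])   # exp(0.05844) is approximately 1.06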
This result suggests that Lag2 may carry some limited predictive power for weekly market direction.
(c)
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
# Compute predicted probabilities
predicted_probs <- predict(logistic_model, type = "response")
# Convert probabilities to binary predictions
predicted_classes <- ifelse(predicted_probs > 0.5, "Up", "Down")
# Create confusion matrix
confusion_matrix <- table(Predicted = predicted_classes, Actual = Weekly$Direction)
# Compute overall fraction of correct predictions
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
# Print confusion matrix and accuracy
print(confusion_matrix)
Actual
Predicted Down Up
Down 54 48
Up 430 557
cat("Overall fraction of correct predictions (accuracy):", accuracy, "\n")
Overall fraction of correct predictions (accuracy): 0.5610652
Response:
The model correctly predicted the market direction in 611 out of 1,089 weeks (54 + 557), giving an overall accuracy of about 56.1%.
This is only slightly better than the naive strategy of always predicting Up, which would have been correct in 605 of 1,089 weeks (about 55.6%), so the model offers little improvement over majority-class guessing.
Notably, the model frequently misclassifies Down weeks, predicting “Up” in 430 of the 484 weeks when the market actually went down. This indicates a strong bias toward predicting the more common outcome, market increases, at the expense of accurately detecting downturns. This imbalance may limit the model’s practical usefulness, especially for applications like trading strategies or risk management, where correctly identifying Down weeks is critical.
(d)
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
# Subset the data into training and testing sets
train_data <- subset(Weekly, Year < 2009)
test_data <- subset(Weekly, Year >= 2009)
# Fit the logistic regression model
logistic_model_1990_2008 <- glm(Direction ~ Lag2, data = train_data, family = binomial)
# Compute predicted probabilities on the test set
predicted_probs_test <- predict(logistic_model_1990_2008, newdata = test_data, type = "response")
# Convert probabilities to binary predictions
predicted_classes_test <- ifelse(predicted_probs_test > 0.5, "Up", "Down")
# Create confusion matrix
confusion_matrix_test <- table(Predicted = predicted_classes_test, Actual = test_data$Direction)
# Compute overall fraction of correct predictions
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
# Print confusion matrix and accuracy
print(confusion_matrix_test)
Actual
Predicted Down Up
Down 9 5
Up 34 56
cat("Overall fraction of correct predictions (accuracy):", accuracy_test, "\n")
Overall fraction of correct predictions (accuracy): 0.625
(e)
Repeat (d) using LDA.
# Fit the LDA model
lda_model <- lda(Direction ~ Lag2, data = train_data)
# Compute predicted probabilities on the test set
predicted_probs_lda <- predict(lda_model, newdata = test_data)$posterior[, "Up"]
# Convert probabilities to binary predictions
predicted_classes_lda <- ifelse(predicted_probs_lda > 0.5, "Up", "Down")
# Create confusion matrix
confusion_matrix_lda <- table(Predicted = predicted_classes_lda, Actual = test_data$Direction)
# Compute overall fraction of correct predictions
accuracy_lda <- sum(diag(confusion_matrix_lda)) / sum(confusion_matrix_lda)
# Print confusion matrix and accuracy
print(confusion_matrix_lda)
Actual
Predicted Down Up
Down 9 5
Up 34 56
cat("Overall fraction of correct predictions (accuracy):", accuracy_lda, "\n")
Overall fraction of correct predictions (accuracy): 0.625
(f)
Repeat (d) using QDA.
# Fit the QDA model
qda_model <- qda(Direction ~ Lag2, data = train_data)
# Compute predicted probabilities on the test set
predicted_probs_qda <- predict(qda_model, newdata = test_data)$posterior[, "Up"]
# Convert probabilities to binary predictions
predicted_classes_qda <- ifelse(predicted_probs_qda > 0.5, "Up", "Down")
# Ensure both predicted and actual are factors with consistent levels
predicted_classes_qda <- factor(predicted_classes_qda, levels = c("Down", "Up"))
actual_classes_qda <- factor(test_data$Direction, levels = c("Down", "Up"))
# Create confusion matrix
confusion_matrix_qda <- table(Predicted = predicted_classes_qda, Actual = actual_classes_qda)
# Compute overall fraction of correct predictions
accuracy_qda <- sum(diag(confusion_matrix_qda)) / sum(confusion_matrix_qda)
# Print confusion matrix and accuracy
print(confusion_matrix_qda)
Actual
Predicted Down Up
Down 0 0
Up 43 61
cat("Overall fraction of correct predictions (accuracy):", accuracy_qda, "\n")
Overall fraction of correct predictions (accuracy): 0.5865385
(g)
Repeat (d) using KNN with K = 1.
# knn() comes from the class package
library(class)
# Prepare training and test sets for k-NN
# Only use Lag2 as the predictor (as in the QDA example)
train_X <- data.frame(Lag2 = train_data$Lag2)
test_X <- data.frame(Lag2 = test_data$Lag2)
# Target variable must be a factor
train_Y <- train_data$Direction
# Apply k-NN with k = 1
predicted_classes_knn <- knn(train = train_X, test = test_X, cl = train_Y, k = 1)
# Create confusion matrix
confusion_matrix_knn <- table(Predicted = predicted_classes_knn, Actual = test_data$Direction)
# Compute overall fraction of correct predictions
accuracy_knn <- sum(diag(confusion_matrix_knn)) / sum(confusion_matrix_knn)
# Print confusion matrix and accuracy
print(confusion_matrix_knn)
Actual
Predicted Down Up
Down 21 30
Up 22 31
cat("Overall fraction of correct predictions (accuracy):", accuracy_knn, "\n")
Overall fraction of correct predictions (accuracy): 0.5
(h)
Repeat (d) using naive Bayes.
# naiveBayes() comes from the e1071 package
library(e1071)
# Fit the Naive Bayes model using only Lag2
nb_model <- naiveBayes(Direction ~ Lag2, data = train_data)
# Compute predicted probabilities on the test set
predicted_probs_nb <- predict(nb_model, newdata = test_data, type = "raw")[, "Up"]
# Convert probabilities to binary predictions
predicted_classes_nb <- ifelse(predicted_probs_nb > 0.5, "Up", "Down")
# Ensure both predicted and actual are factors with the same levels
predicted_classes_nb <- factor(predicted_classes_nb, levels = c("Down", "Up"))
actual_classes <- factor(test_data$Direction, levels = c("Down", "Up"))
# Create confusion matrix
confusion_matrix_nb <- table(Predicted = predicted_classes_nb, Actual = actual_classes)
# Compute overall fraction of correct predictions
accuracy_nb <- sum(diag(confusion_matrix_nb)) / sum(confusion_matrix_nb)
# Print confusion matrix and accuracy
print(confusion_matrix_nb)
Actual
Predicted Down Up
Down 0 0
Up 43 61
cat("Overall fraction of correct predictions (accuracy):", accuracy_nb, "\n")
Overall fraction of correct predictions (accuracy): 0.5865385
(i)
Which of these methods appears to provide the best results on this data?
# First, store the accuracy values from each model
accuracy_results <- data.frame(
  Model = c("Logistic Regression", "LDA", "QDA", "k-NN (k = 1)", "Naive Bayes"),
  Accuracy = c(accuracy_test, accuracy_lda, accuracy_qda, accuracy_knn, accuracy_nb)
)
# Print as a clean table (requires knitr package)
library(knitr)
kable(accuracy_results, digits = 3, caption = "Model Accuracy Comparison")
Table: Model Accuracy Comparison

Model                  Accuracy
Logistic Regression       0.625
LDA                       0.625
QDA                       0.587
k-NN (k = 1)              0.500
Naive Bayes               0.587
Response:
Logistic regression and LDA provide the best results on this data, with identical test accuracy (62.5%). This is expected since both models produce linear decision boundaries and are fit using only a single predictor (Lag2), so their predictions often align closely. More flexible models such as QDA and k-NN with k = 1 appear to overfit or underperform given how little signal a single lag variable carries.
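One way to confirm how closely the two methods agree is to compare their test-set predictions directly (a quick check using the prediction vectors created above; the resulting proportion is not shown in the output):
# Fraction of test weeks where logistic regression and LDA make the same prediction
mean(predicted_classes_test == predicted_classes_lda)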
Problem 14
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a)
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
# Load the Auto dataset
data("Auto")
# Remove rows with missing values
Auto <- na.omit(Auto)
# Report the median mpg that will be used to define mpg01
cat("Median mpg:", median(Auto$mpg), "\n")
Median mpg: 22.75
# Create the binary response mpg01: "High" if mpg is above its median, "Low" otherwise
Auto$mpg01 <- factor(ifelse(Auto$mpg > median(Auto$mpg), "High", "Low"))
# Convert relevant variables to factors
Auto$cylinders <- factor(Auto$cylinders)
Auto$year <- factor(Auto$year)
Auto$origin <- factor(Auto$origin)
# Drop the 'name' and 'year' variables (not useful for this modeling exercise)
Auto <- Auto %>% select(-name, -year)
# View structure of the cleaned dataset
str(Auto)
(b)
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
# Select numeric predictors only (excluding mpg01, which is a factor)
predictors <- Auto %>% select(-mpg01) %>% select(where(is.numeric))
# Create the pairs plot with mpg01 as the color
plot_obj <- ggpairs(
  data = bind_cols(predictors, mpg01 = Auto$mpg01),
  mapping = aes(color = mpg01, alpha = 0.5),
  upper = list(continuous = wrap("cor", size = 2.5)),
  lower = list(continuous = wrap("points", alpha = 0.6, size = 1.5)),
  diag = list(continuous = wrap("densityDiag"))
) +
  theme_minimal() +
  labs(
    title = "Scatterplot Matrix of Auto Dataset",
    subtitle = "Colored by mpg01 (High vs Low MPG)",
    color = "mpg01"
  )
# Suppress warnings and print the plot
suppressWarnings(suppressMessages(print(plot_obj)))
Response:
Based on the pairs plot, the features most useful in predicting mpg01 appear to be weight, horsepower, and displacement. These variables show strong negative correlations with mpg and clear class separation between high and low MPG vehicles in both the scatterplots and the density plots along the diagonal. Vehicles with lower weight, horsepower, and displacement are much more likely to fall into the high MPG category. In contrast, acceleration appears to have a weaker relationship and may be less useful for classification on its own.
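To back this impression up numerically, the group means of the candidate predictors can be compared across the two mpg01 classes (a small sketch; the resulting values are not reproduced here):
# Mean of each candidate predictor within the High and Low MPG groups
aggregate(cbind(weight, horsepower, displacement, acceleration) ~ mpg01,
          data = Auto, FUN = mean)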
# pivot_longer() comes from the tidyr package
library(tidyr)
# Select relevant categorical predictors and mpg01
categorical_vars <- Auto %>% select(mpg01, cylinders, origin)
# Pivot into long format for faceted plotting
long_cat <- categorical_vars %>%
  pivot_longer(cols = c(cylinders, origin), names_to = "variable", values_to = "category")
# Plot proportion of High vs Low mpg01 for each category of each variable
ggplot(long_cat, aes(x = category, fill = mpg01)) +
  geom_bar(position = "fill") +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Proportion of High vs Low MPG by Categorical Variables",
       y = "Proportion", x = NULL, fill = "mpg01") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Reshape for boxplot faceting
boxplot_data <- Auto %>%
  select(mpg, cylinders, origin) %>%
  pivot_longer(cols = c(cylinders, origin), names_to = "variable", values_to = "category")
# Faceted boxplots of mpg by each categorical variable
ggplot(boxplot_data, aes(x = category, y = mpg)) +
  geom_boxplot(fill = "#69b3a2", alpha = 0.7) +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Distribution of MPG by Categorical Predictors",
       x = NULL, y = "MPG") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
(c)
Split the data into a training set and a test set.
# createDataPartition() and findCorrelation() come from the caret package
library(caret)
# Set a seed for reproducibility
set.seed(123)
# Create a stratified split: 70% train, 30% test (preserving mpg01 balance)
train_index <- createDataPartition(Auto$mpg01, p = 0.7, list = FALSE)
# Split the data
train_data <- Auto[train_index, ]
test_data <- Auto[-train_index, ]
# Drop highly correlated predictors
# Caveat: findCorrelation() returns positions within the numeric-only correlation
# matrix, while the drop below applies those positions to the full data frame,
# so the columns actually removed are not the flagged ones and mpg itself
# remains among the predictors (see the note after part (f))
cor_matrix <- cor(select_if(train_data, is.numeric))
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
if (length(high_cor) > 0) {
  train_data <- train_data[, -high_cor]
  test_data <- test_data[, -high_cor]
  print(high_cor)
}
[1] 3 2 4
# Confirm the class balance is preservedprop.table(table(train_data$mpg01))
High Low
0.5 0.5
prop.table(table(test_data$mpg01))
High Low
0.5 0.5
# I am going to try some of the techniques we learned in class with cross-validation.
# I will try to weave this into the questions as I go along.
# Define the cross-validation control
train_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)
(d)
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# Fit LDA model using caret
model_lda <- train(
  mpg01 ~ .,          # Predict mpg01 using all other variables
  data = train_data,
  method = "lda",
  trControl = train_control,
  metric = "ROC"
)
# Print cross-validation results
print(model_lda)
Linear Discriminant Analysis
276 samples
4 predictor
2 classes: 'High', 'Low'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 248, 248, 248, 250, 248, 248, ...
Resampling results:
ROC Sens Spec
0.997619 0.9445055 0.9855311
# Predict on the test set
pred_classes <- predict(model_lda, newdata = test_data)
pred_probs <- predict(model_lda, newdata = test_data, type = "prob")
# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction High Low
High 55 2
Low 3 56
Accuracy : 0.9569
95% CI : (0.9023, 0.9859)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9138
Mcnemar's Test P-Value : 1
Sensitivity : 0.9483
Specificity : 0.9655
Pos Pred Value : 0.9649
Neg Pred Value : 0.9492
Prevalence : 0.5000
Detection Rate : 0.4741
Detection Prevalence : 0.4914
Balanced Accuracy : 0.9569
'Positive' Class : High
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy
# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
cat("Test error on test set:", round(test_error, 3), "\n")
(e)
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# Fit QDA model using caret
model_qda <- train(
  mpg01 ~ .,          # Predict mpg01 using all other variables
  data = train_data,
  method = "qda",
  trControl = train_control,
  metric = "ROC"
)
# Print cross-validation results
print(model_qda)
# Predict on the test set
pred_classes <- predict(model_qda, newdata = test_data)
pred_probs <- predict(model_qda, newdata = test_data, type = "prob")
# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction High Low
High 55 4
Low 3 54
Accuracy : 0.9397
95% CI : (0.8796, 0.9754)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8793
Mcnemar's Test P-Value : 1
Sensitivity : 0.9483
Specificity : 0.9310
Pos Pred Value : 0.9322
Neg Pred Value : 0.9474
Prevalence : 0.5000
Detection Rate : 0.4741
Detection Prevalence : 0.5086
Balanced Accuracy : 0.9397
'Positive' Class : High
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy
# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
cat("Test error on test set:", round(test_error, 3), "\n")
(f)
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# Fit logistic regression model on the training data
model_logit <- train(
  mpg01 ~ .,          # all predictors remaining after the part (c) preprocessing
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "ROC"
)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
(this pair of glm.fit warnings is repeated verbatim for each of the 30 cross-validation resamples and the final model fit)
Generalized Linear Model
276 samples
4 predictor
2 classes: 'High', 'Low'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 248, 248, 249, 248, 248, 249, ...
Resampling results:
ROC Sens Spec
0.9983778 0.9855311 1
# Predict on the test set
pred_classes <- predict(model_logit, newdata = test_data)
pred_probs <- predict(model_logit, newdata = test_data, type = "prob")
# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction High Low
High 58 0
Low 0 58
Accuracy : 1
95% CI : (0.9687, 1)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0
Specificity : 1.0
Pos Pred Value : 1.0
Neg Pred Value : 1.0
Prevalence : 0.5
Detection Rate : 0.5
Detection Prevalence : 0.5
Balanced Accuracy : 1.0
'Positive' Class : High
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy
# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
cat("Test error on test set:", round(test_error, 3), "\n")
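The perfect test accuracy and the glm.fit warnings above are almost certainly a consequence of the index-based column drop in part (c): mpg itself is still among the predictors, and since mpg01 is defined directly from mpg, the logistic model can separate the two classes perfectly. A quick check, and a refit that excludes the leaked variable, might look like the following sketch (model_logit_no_mpg is a hypothetical name, not part of the original analysis, and its results are not shown here):
# Confirm that mpg survived the column drop from part (c)
names(train_data)
# Refit the logistic model without mpg to remove the leakage
model_logit_no_mpg <- train(
  mpg01 ~ . - mpg,
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "ROC"
)
# Test-set performance without the leaked predictor
confusionMatrix(predict(model_logit_no_mpg, newdata = test_data),
                test_data$mpg01, positive = "High")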
(g)
Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# Fit Naive Bayes model using the remaining predictors
model_nb <- train(
  mpg01 ~ .,
  data = train_data,
  method = "naive_bayes",
  trControl = train_control,
  metric = "ROC"
)
# Print cross-validation results
print(model_nb)
Naive Bayes
276 samples
4 predictor
2 classes: 'High', 'Low'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 248, 248, 248, 248, 250, 250, ...
Resampling results across tuning parameters:
usekernel ROC Sens Spec
FALSE 0.9737270 0.9664835 0.8620879
TRUE 0.9974097 0.9690476 0.9761905
Tuning parameter 'laplace' was held constant at a value of 0
Tuning
parameter 'adjust' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = TRUE
and adjust = 1.
# Predict on the test set
pred_classes <- predict(model_nb, newdata = test_data)
pred_probs <- predict(model_nb, newdata = test_data, type = "prob")
# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction High Low
High 56 1
Low 2 57
Accuracy : 0.9741
95% CI : (0.9263, 0.9946)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9483
Mcnemar's Test P-Value : 1
Sensitivity : 0.9655
Specificity : 0.9828
Pos Pred Value : 0.9825
Neg Pred Value : 0.9661
Prevalence : 0.5000
Detection Rate : 0.4828
Detection Prevalence : 0.4914
Balanced Accuracy : 0.9741
'Positive' Class : High
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy
# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
cat("Test error on test set:", round(test_error, 3), "\n")
(h)
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
# Define a grid of K values to try (1 to 20)
k_grid <- expand.grid(k = 1:20)
# Fit KNN model using caret with the remaining predictors
model_knn <- train(
  mpg01 ~ .,
  data = train_data,
  method = "knn",
  trControl = train_control,
  tuneGrid = k_grid,
  metric = "ROC"
)
# Print cross-validation results for all K values
print(model_knn)
k-Nearest Neighbors
276 samples
4 predictor
2 classes: 'High', 'Low'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 248, 249, 248, 248, 248, 249, ...
Resampling results across tuning parameters:
k ROC Sens Spec
1 0.8560440 0.8739927 0.8380952
2 0.8936043 0.8668498 0.8424908
3 0.9074382 0.8838828 0.8357143
4 0.9102307 0.8716117 0.8357143
5 0.9079887 0.8622711 0.8210623
6 0.9147999 0.8743590 0.8382784
7 0.9160604 0.8695971 0.8331502
8 0.9194607 0.8670330 0.8430403
9 0.9224570 0.8695971 0.8426740
10 0.9248355 0.8717949 0.8424908
11 0.9274539 0.8793040 0.8452381
12 0.9295466 0.8840659 0.8525641
13 0.9304895 0.8793040 0.8478022
14 0.9322314 0.8794872 0.8426740
15 0.9341414 0.8719780 0.8428571
16 0.9366009 0.8816850 0.8503663
17 0.9348740 0.8816850 0.8452381
18 0.9363196 0.8864469 0.8551282
19 0.9369148 0.8791209 0.8551282
20 0.9374381 0.8840659 0.8622711
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 20.
plot(model_knn)
# Predict on the test set using the best K found via CV
pred_classes <- predict(model_knn, newdata = test_data)
pred_probs <- predict(model_knn, newdata = test_data, type = "prob")
# Confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction High Low
High 53 5
Low 5 53
Accuracy : 0.9138
95% CI : (0.8472, 0.9579)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8276
Mcnemar's Test P-Value : 1
Sensitivity : 0.9138
Specificity : 0.9138
Pos Pred Value : 0.9138
Neg Pred Value : 0.9138
Prevalence : 0.5000
Detection Rate : 0.4569
Detection Prevalence : 0.5000
Balanced Accuracy : 0.9138
'Positive' Class : High
# Accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy
cat("Best K:", model_knn$bestTune$k, "\n")
Best K: 20
cat("Accuracy on test set:", round(accuracy, 3), "\n")
Response:
Caret automatically selects the best K value based on cross-validation performance: the K with the highest ROC AUC across the grid is chosen. Here caret chose K = 20, the largest value in the grid, which is somewhat high for KNN. Looking at the plot of K versus ROC, the gains appear to level off around K = 14, with diminishing returns beyond that point, so I am going to force caret to use K = 14 and see how it performs.
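For reference, the cross-validated metrics for every K and the value caret actually selected can be read directly off the fitted object (a short sketch using model_knn from above):
# Cross-validated ROC, sensitivity, and specificity for each K in the grid
model_knn$results
# The tuning value caret selected (the K with the largest ROC)
model_knn$bestTune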
# Force caret to use K = 14
model_knn_k14 <- train(
  mpg01 ~ .,
  data = train_data,
  method = "knn",
  tuneGrid = data.frame(k = 14),   # explicitly set K = 14
  trControl = train_control,
  metric = "ROC"
)
# Print model details
print(model_knn_k14)
k-Nearest Neighbors
276 samples
4 predictor
2 classes: 'High', 'Low'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 248, 248, 248, 249, 248, 248, ...
Resampling results:
ROC Sens Spec
0.9366703 0.8772894 0.8549451
Tuning parameter 'k' was held constant at a value of 14
# Predict class labels and probabilities on test set
pred_classes <- predict(model_knn_k14, newdata = test_data)
pred_probs <- predict(model_knn_k14, newdata = test_data, type = "prob")
# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction High Low
High 52 5
Low 6 53
Accuracy : 0.9052
95% CI : (0.8367, 0.9517)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8103
Mcnemar's Test P-Value : 1
Sensitivity : 0.8966
Specificity : 0.9138
Pos Pred Value : 0.9123
Neg Pred Value : 0.8983
Prevalence : 0.5000
Detection Rate : 0.4483
Detection Prevalence : 0.4914
Balanced Accuracy : 0.9052
'Positive' Class : High
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy
cat("Accuracy on test set (K = 14):", round(accuracy, 3), "\n")
cat("Test error on test set (K = 14):", round(test_error, 3), "\n")
Response:
The KNN model with K = 14 achieved a test accuracy of 0.905 (test error of roughly 0.095), slightly below the 0.914 obtained with the cross-validation-selected K = 20. Since the cross-validated ROC values for the two settings are nearly identical, either choice is defensible: K = 14 gives a simpler, smoother decision rule, while the larger K happens to perform marginally better on this particular test split.
Problem 16
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
# Load the Boston dataset
data("Boston")
# Create a binary response: "High" if crim > median, else "Low"
median_crim <- median(Boston$crim)
Boston$crim01 <- factor(ifelse(Boston$crim > median_crim, "High", "Low"))
# Drop 'crim' to avoid leakage
Boston <- Boston %>% select(-crim)
# Split the data into training and test sets
set.seed(123)
train_index <- createDataPartition(Boston$crim01, p = 0.7, list = FALSE)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
# Check class balance
prop.table(table(train_data$crim01))
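The chunk that actually fits the four models and populates the `results` list is not shown above. The following is a sketch of how it was likely built, reusing the same caret workflow and train_control from Problem 14; the `models` list, the use of all predictors, and the K grid are assumptions rather than the original code (the exercise also mentions trying various predictor subsets, which is omitted here):
# Fit the four classifiers on the training data with repeated cross-validation
models <- list(
  Logistic   = train(crim01 ~ ., data = train_data, method = "glm",
                     family = "binomial", trControl = train_control, metric = "ROC"),
  LDA        = train(crim01 ~ ., data = train_data, method = "lda",
                     trControl = train_control, metric = "ROC"),
  NaiveBayes = train(crim01 ~ ., data = train_data, method = "naive_bayes",
                     trControl = train_control, metric = "ROC"),
  KNN        = train(crim01 ~ ., data = train_data, method = "knn",
                     tuneGrid = expand.grid(k = 1:20),
                     trControl = train_control, metric = "ROC")
)
# Test-set accuracy for each fitted model
results <- lapply(models, function(m) {
  cm <- confusionMatrix(predict(m, newdata = test_data),
                        test_data$crim01, positive = "High")
  cm$overall["Accuracy"]
})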
# Convert results list to a data frame
accuracy_df <- data.frame(
  Model = names(results),
  Accuracy = round(unlist(results), 3)
)
# Display table
library(knitr)
kable(accuracy_df, caption = "Test Set Accuracy by Model")
Table: Test Set Accuracy by Model

                     Model        Accuracy
Logistic.Accuracy    Logistic        0.927
LDA.Accuracy         LDA             0.913
NaiveBayes.Accuracy  NaiveBayes      0.900
KNN.Accuracy         KNN             0.953
Model Comparison Results
We evaluated four classification models to predict whether a Boston census tract has a crime rate above the median. Test set accuracies were as follows: KNN (95.3%), Logistic Regression (92.7%), LDA (91.3%), and Naive Bayes (90.0%). KNN performed the best overall, suggesting that neighborhood-based classification effectively captures the structure in the data. All models performed well, with only slight differences in accuracy.