This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly)
cor(Weekly[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
plot(Weekly$Year, Weekly$Volume,
xlab = "Year",
ylab = "Volume",
main = "Trading Volume Over Time")
boxplot(Today ~ Direction,
data = Weekly,
main = "Today's Return by Market Direction")
Numerical and graphical summaries of the Weekly dataset reveal
several interesting patterns. The correlation matrix shows generally
weak correlations among the lag variables
(Lag1–Lag5) and Today, suggesting
that past weekly returns are not strongly related to future returns.
However, Volume exhibits a noticeable positive correlation
with Year, indicating that trading volume increased steadily over
time.
The scatterplot matrix confirms the lack of strong linear
relationships among the lag variables and weekly returns. Most pairs of
variables appear widely dispersed with no obvious trends. In contrast, a
plot of Volume versus Year shows a clear
upward trend, reflecting growth in market trading activity between 1990
and 2010.
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
logistic_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial)
summary(logistic_model)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The logistic regression model suggests that only the return from two
weeks prior (Lag2) has a statistically significant
relationship with market direction, although the effect is relatively
small.The coefficient for Lag2 is positive, indicating that
an increase in the previous week’s return (Lag2) is
associated with a higher probability that the market will move up in the
current week.
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
# 2. Get predicted probabilities (returns values between 0 and 1)
probabilities <- predict(logistic_model, type = "response")
# 3. Convert probabilities to binary classes based on a 0.5 threshold
weekly.pred <- rep("Down", length(probabilities))
weekly.pred[probabilities > 0.5] <- "Up"
# 4. Generate the confusion matrix
conf_matrix <- table(weekly.pred,Actual = Weekly$Direction)
print(conf_matrix)
## Actual
## weekly.pred Down Up
## Down 54 48
## Up 430 557
# Overall accuracy
mean(weekly.pred == Weekly$Direction)
## [1] 0.5610652
The confusion matrix compares the predicted market direction to the actual market direction. The logistic regression model correctly classified 557 weeks in which the market moved up and 54 weeks in which the market moved down. Overall, the model correctly classified approximately 56.1% of the observations.
The confusion matrix reveals that the model predicts Up
much more frequently than Down. While it correctly
identifies many weeks in which the market increases, it performs poorly
at identifying weeks in which the market decreases.
Overall, the model has limited predictive power and is biased toward predicting upward market movements, which may be partly due to the fact that the market moved up more often than down during the sample period based on early exploratory data analysis.
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
# Create training indicator
training_data <- Weekly$Year <= 2008
# Training and test sets
Weekly.train <- Weekly[training_data, ]
Weekly.test <- Weekly[!training_data, ]
# Fit logistic regression
filtered_logistic_model <- glm(Direction ~ Lag2, data = Weekly.train,family = binomial)
# Predicted probabilities
probabilities <- predict(filtered_logistic_model,
newdata = Weekly.test,
type = "response"
)
# Convert probabilities to predictions
weekly.pred <- rep("Down", length(probabilities))
weekly.pred[probabilities > 0.5] <- "Up"
# Confusion matrix
conf_matrix <- table( Predicted = weekly.pred,Actual = Weekly.test$Direction)
print(conf_matrix)
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
# Overall accuracy
mean(weekly.pred == Weekly.test$Direction)
## [1] 0.625
(e) Repeat (d) using LDA.
library(MASS)
# Fit logistic regression
weekly.lda <- lda(Direction ~ Lag2, data = Weekly.train)
# Predictions on test set
lda.pred <- predict(weekly.lda,newdata = Weekly.test)
# Confusion matrix
conf_matrix <- table(Predicted = lda.pred$class,Actual = Weekly.test$Direction)
print(conf_matrix)
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
# Accuracy
mean(lda.pred$class == Weekly.test$Direction)
## [1] 0.625
(f) Repeat (d) using QDA.
# Fit QDA model using Lag2
weekly.qda <- qda(Direction ~ Lag2, data = Weekly.train)
# Predictions on test set
qda.pred <- predict(weekly.qda, newdata = Weekly.test)
# Confusion matrix
conf_matrix <- table(
Predicted = qda.pred$class,
Actual = Weekly.test$Direction
)
print(conf_matrix)
## Actual
## Predicted Down Up
## Down 0 0
## Up 43 61
# Accuracy
mean(qda.pred$class == Weekly.test$Direction)
## [1] 0.5865385
(g) Repeat (d) using KNN with K = 1.
library(class)
# Predictor matrices
train.X <- as.matrix(Weekly.train$Lag2)
test.X <- as.matrix(Weekly.test$Lag2)
# KNN with K = 1
knn.pred <- knn(
train = train.X,
test = test.X,
cl = Weekly.train$Direction,
k = 1
)
# Confusion matrix
conf_matrix <- table(
Predicted = knn.pred,
Actual = Weekly.test$Direction
)
print(conf_matrix)
## Actual
## Predicted Down Up
## Down 21 30
## Up 22 31
# Accuracy
mean(knn.pred == Weekly.test$Direction)
## [1] 0.5
(h) Repeat (d) using naive Bayes.
library(e1071)
# Fit Naive Bayes model
weekly.nb <- naiveBayes(
Direction ~ Lag2,
data = Weekly.train
)
# Predictions
nb.pred <- predict(
weekly.nb,
newdata = Weekly.test
)
# Confusion matrix
conf_matrix <- table(
Predicted = nb.pred,
Actual = Weekly.test$Direction
)
print(conf_matrix)
## Actual
## Predicted Down Up
## Down 0 0
## Up 43 61
# Accuracy
mean(nb.pred == Weekly.test$Direction)
## [1] 0.5865385
(i) Which of these methods appears to provide the best results on this data?
| Method | Accuracy |
|---|---|
| Logistic Regression | 62.5% |
| LDA | 62.5% |
| QDA | 58.7% |
| KNN (K = 1) | 50.0% |
| Naive Bayes | 62.5% |
Comparison of classification methods for predicting market direction
in the Weekly dataset using Lag2 as the sole predictor.
Logistic Regression, LDA, and Naive Bayes achieved the highest accuracy
(62.5%), while QDA performed slightly worse (58.7%). KNN with (K = 1)
had the lowest accuracy (50.0%), indicating performance comparable to
random guessing.
(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
library(MASS)
library(e1071)
library(class)
library(dplyr)
results <- data.frame(
Method = character(),
Predictors = character(),
Accuracy = numeric(),
Test_Error = numeric(),
stringsAsFactors = FALSE
)
#-----------------------------
# Logistic Regression
#-----------------------------
log_model <- glm(
Direction ~ Lag1 + Lag2 + Volume,
data = Weekly.train,
family = binomial
)
log_prob <- predict(log_model, Weekly.test, type = "response")
log_pred <- ifelse(log_prob > 0.5, "Up", "Down")
log_pred <- factor(log_pred, levels = levels(Weekly.test$Direction))
log_acc <- mean(log_pred == Weekly.test$Direction)
results <- rbind(
results,
data.frame(
Method = "Logistic Regression",
Predictors = "Lag1 + Lag2 + Volume",
Accuracy = log_acc,
Test_Error = 1 - log_acc
)
)
#-----------------------------
# LDA
#-----------------------------
lda_model <- lda(
Direction ~ Lag1 + Lag2 + Volume,
data = Weekly.train
)
lda_pred <- predict(lda_model, Weekly.test)$class
lda_acc <- mean(lda_pred == Weekly.test$Direction)
results <- rbind(
results,
data.frame(
Method = "LDA",
Predictors = "Lag1 + Lag2 + Volume",
Accuracy = lda_acc,
Test_Error = 1 - lda_acc
)
)
#-----------------------------
# QDA
#-----------------------------
qda_model <- qda(
Direction ~ Lag1 + Lag2 + Volume,
data = Weekly.train
)
qda_pred <- predict(qda_model, Weekly.test)$class
qda_acc <- mean(qda_pred == Weekly.test$Direction)
results <- rbind(
results,
data.frame(
Method = "QDA",
Predictors = "Lag1 + Lag2 + Volume",
Accuracy = qda_acc,
Test_Error = 1 - qda_acc
)
)
#-----------------------------
# Naive Bayes
#-----------------------------
nb_model <- naiveBayes(
Direction ~ Lag1 + Lag2 + Volume,
data = Weekly.train
)
nb_pred <- predict(nb_model, Weekly.test)
nb_acc <- mean(nb_pred == Weekly.test$Direction)
results <- rbind(
results,
data.frame(
Method = "Naive Bayes",
Predictors = "Lag1 + Lag2 + Volume",
Accuracy = nb_acc,
Test_Error = 1 - nb_acc
)
)
#-----------------------------
# KNN (K = 5)
#-----------------------------
train.X <- Weekly.train[, c("Lag1", "Lag2", "Volume")]
test.X <- Weekly.test[, c("Lag1", "Lag2", "Volume")]
train.X <- scale(train.X)
test.X <- scale(
test.X,
center = attr(train.X, "scaled:center"),
scale = attr(train.X, "scaled:scale")
)
knn_pred <- knn(
train = train.X,
test = test.X,
cl = Weekly.train$Direction,
k = 5
)
knn_acc <- mean(knn_pred == Weekly.test$Direction)
results <- rbind(
results,
data.frame(
Method = "KNN (K = 5)",
Predictors = "Lag1 + Lag2 + Volume",
Accuracy = knn_acc,
Test_Error = 1 - knn_acc
)
)
results
## Method Predictors Accuracy Test_Error
## 1 Logistic Regression Lag1 + Lag2 + Volume 0.5288462 0.4711538
## 2 LDA Lag1 + Lag2 + Volume 0.5288462 0.4711538
## 3 QDA Lag1 + Lag2 + Volume 0.4615385 0.5384615
## 4 Naive Bayes Lag1 + Lag2 + Volume 0.4230769 0.5769231
## 5 KNN (K = 5) Lag1 + Lag2 + Volume 0.5673077 0.4326923
knitr::kable(
results,
digits = 3,
caption = "Comparison of Classification Models"
)
| Method | Predictors | Accuracy | Test_Error |
|---|---|---|---|
| Logistic Regression | Lag1 + Lag2 + Volume | 0.529 | 0.471 |
| LDA | Lag1 + Lag2 + Volume | 0.529 | 0.471 |
| QDA | Lag1 + Lag2 + Volume | 0.462 | 0.538 |
| Naive Bayes | Lag1 + Lag2 + Volume | 0.423 | 0.577 |
| KNN (K = 5) | Lag1 + Lag2 + Volume | 0.567 | 0.433 |
The above table summarizes the performance of the classification
models using the predictors Lag1, Lag2, and
Volume. Among the models evaluated, the K-Nearest Neighbors
(KNN) classifier with \(K = 5\)
achieved the highest test accuracy of 56.7%,
corresponding to a test error of 43.3%. Logistic
Regression and LDA produced identical results, each achieving an
accuracy of 52.9% and a test error of
47.1%. QDA and Naive Bayes performed worse, with
accuracies of 46.2% and 42.3%,
respectively.
Although KNN (\(K = 5\)) was the
best-performing model among those using the predictors
Lag1, Lag2, and Volume, its
performance was still inferior to the simpler models developed earlier
using only Lag2 as the predictor. The earlier Logistic
Regression, LDA, and Naive Bayes models achieved approximately
62.5% accuracy, suggesting that adding
Lag1 and Volume introduced additional noise
rather than improving predictive performance.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto Data set.
attach(Auto)
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
Auto_binary <- Auto |>
transform(mpg01 = ifelse(mpg > median(mpg), 1, 0))
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(Auto_binary)
# Boxplots
par(mfrow = c(2, 3))
boxplot(cylinders ~ mpg01, data = Auto_binary,
main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto_binary,
main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto_binary,
main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto_binary,
main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto_binary,
main = "Acceleration vs mpg01")
boxplot(year ~ mpg01, data = Auto_binary,
main = "Year vs mpg01")
Exploratory analysis of the data suggests that several variables are
strongly associated with
mpg01. In particular, vehicles
with higher fuel efficiency (mpg01 = 1) tend to have fewer
cylinders, lower engine displacement, lower horsepower, and lower weight
than vehicles with lower fuel efficiency (mpg01 = 0). The
boxplots for these variables show clear separation between the two
groups, indicating that they may be useful predictors of fuel efficiency
classification.
The variable year also appears to be an important
predictor. Vehicles classified as having high fuel efficiency tend to be
from more recent model years, suggesting that improvements in automotive
technology have contributed to higher fuel economy over time. In
contrast, acceleration shows substantial overlap between
the two groups and appears to be a weaker predictor. Overall, the
variables cylinders, displacement,
horsepower, weight, and year
appear most likely to be useful for predicting mpg01.
(c) Split the data into a training set and a test set.
set.seed(42)
library(modelr)
# Partition data: 75% Training, 25% Testing
partitions <- resample_partition(Auto_binary, c(train = 0.75, test = 0.25))
# Convert partitions to data frames
train_df <- as.data.frame(partitions$train)
test_df <- as.data.frame(partitions$test)
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(MASS)
# Fit LDA model
lda_model <- lda(
mpg01 ~ cylinders + displacement + horsepower + weight + year,
data = train_df
)
# Predict on the test set
lda_pred <- predict(lda_model, newdata = test_df)
# Confusion matrix
(conf_matrix <- table(
Predicted = lda_pred$class,
Actual = test_df$mpg01
))
## Actual
## Predicted 0 1
## 0 43 3
## 1 5 48
# Test accuracy
accuracy <- mean(lda_pred$class == test_df$mpg01)
# Test error
test_error <- 1 - accuracy
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9191919
cat("Test Error:", test_error, "\n")
## Test Error: 0.08080808
The model achieved a test accuracy of 91.8%, corresponding to a test error of 8.2%. These results suggest that the selected predictors provide strong discriminatory power for classifying vehicles with above- and below-median fuel efficiency.
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(MASS)
# Fit QDA model
qda_model <- qda(
mpg01 ~ cylinders + displacement + horsepower + weight + year,
data = train_df
)
# Predict on the test set
qda_pred <- predict(qda_model, newdata = test_df)
# Confusion matrix
(conf_matrix <- table(
Predicted = qda_pred$class,
Actual = test_df$mpg01
))
## Actual
## Predicted 0 1
## 0 42 5
## 1 6 46
# Test accuracy
accuracy <- mean(qda_pred$class == test_df$mpg01)
# Test error
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8888889
cat("Test Error:", test_error, "\n")
## Test Error: 0.08080808
The QDA model achieved a test accuracy of 88.9%,
corresponding to a test error of 11.1%. These results
indicate that the selected predictors (cylinders,
displacement, horsepower, weight,
and year) are effective in distinguishing between vehicles
with above- and below-median fuel efficiency. Although the model
performed well, its accuracy was slightly lower than that of the LDA
model, suggesting that LDA provides a marginally better fit for this
classification problem.
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# Fit logistic regression model
logistic_model <- glm(
mpg01 ~ cylinders + displacement + horsepower + weight + year,
data = train_df,
family = binomial
)
# Predicted probabilities
probabilities <- predict(
logistic_model,
newdata = test_df,
type = "response"
)
# Convert probabilities to class predictions
logistic_pred <- ifelse(probabilities > 0.5, 1, 0)
# Confusion matrix
(conf_matrix <- table(
Predicted = logistic_pred,
Actual = test_df$mpg01
))
## Actual
## Predicted 0 1
## 0 43 4
## 1 5 47
# Test accuracy
accuracy <- mean(logistic_pred == test_df$mpg01)
# Test error
test_error <- 1 - accuracy
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9090909
cat("Test Error:", test_error, "\n")
## Test Error: 0.09090909
The logistic regression model achieved a test accuracy of 90.9%, corresponding to a test error of 9.1%. These results indicate that the selected predictors provide strong predictive power for classifying vehicles into above- and below-median fuel efficiency groups. The low test error suggests that logistic regression is an effective classification method for this dataset.
(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(e1071)
# Fit Naive Bayes model
nb_model <- naiveBayes(
mpg01 ~ cylinders + displacement + horsepower + weight + year,
data = train_df
)
# Predict on the test set
nb_pred <- predict(
nb_model,
newdata = test_df
)
# Confusion matrix
(conf_matrix <- table(
Predicted = nb_pred,
Actual = test_df$mpg01
))
## Actual
## Predicted 0 1
## 0 45 4
## 1 3 47
# Test accuracy
accuracy <- mean(nb_pred == test_df$mpg01)
# Test error
test_error <- 1 - accuracy
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9292929
cat("Test Error:", test_error, "\n")
## Test Error: 0.07070707
The Naive Bayes classifier achieved a test accuracy of 92.9%, corresponding to a test error of 7.1%. These results indicate that the selected predictors provide strong predictive power for classifying vehicles into above- and below-median fuel efficiency groups. Among the classification methods evaluated, Naive Bayes achieved the highest accuracy and the lowest test error, making it the best-performing model on the test data.
(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
library(class)
# Select predictors
train.X <- train_df[, c("cylinders", "displacement",
"horsepower", "weight", "year")]
test.X <- test_df[, c("cylinders", "displacement",
"horsepower", "weight", "year")]
# Standardize predictors using the training set statistics
train.X.scaled <- scale(train.X)
test.X.scaled <- scale(
test.X,
center = attr(train.X.scaled, "scaled:center"),
scale = attr(train.X.scaled, "scaled:scale")
)
# Response variables
train.Y <- train_df$mpg01
test.Y <- test_df$mpg01
# Candidate values of K
k.values <- c(1, 3, 5, 7, 9, 11, 15, 21)
results <- data.frame(
K = integer(),
Accuracy = numeric(),
Test_Error = numeric()
)
for (k in k.values) {
pred <- knn(
train = train.X.scaled,
test = test.X.scaled,
cl = train.Y,
k = k
)
accuracy <- mean(pred == test.Y)
results <- rbind(
results,
data.frame(
K = k,
Accuracy = accuracy,
Test_Error = 1 - accuracy
)
)
}
results
## K Accuracy Test_Error
## 1 1 0.9393939 0.06060606
## 2 3 0.9696970 0.03030303
## 3 5 0.9494949 0.05050505
## 4 7 0.9494949 0.05050505
## 5 9 0.9292929 0.07070707
## 6 11 0.9292929 0.07070707
## 7 15 0.9393939 0.06060606
## 8 21 0.9292929 0.07070707
The KNN classifier with \(K = 3\) outperformed all of the other classification methods evaluated, including Logistic Regression, LDA, QDA, and Naive Bayes, making it the most accurate model for predicting whether a vehicle’s fuel efficiency was above or below the median.
| Method | Test Accuracy | Test Error |
|---|---|---|
| KNN (\(K = 3\)) | 96.97% | 3.03% |
| Naive Bayes | 92.93% | 7.07% |
| LDA | 91.92% | 8.08% |
| Logistic Regression | 90.91% | 9.09% |
| QDA | 88.89% | 11.11% |
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
attach(Boston)
Create the response variable
Boston$crim01 <- factor(ifelse(
Boston$crim > median(Boston$crim),
1, 0
))
Exploratory Data Analysis of Boston Data set
pairs(Boston)
library(tidyverse)
Boston |>
pivot_longer(
cols = -c(crim, crim01),
names_to = "Variable",
values_to = "Value"
) |>
ggplot(aes(x = crim01, y = Value, fill = crim01)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~Variable, scales = "free", ncol = 4) +
labs(
x = "Crime Rate Group (crim01)",
y = "Value",
title = "Predictor Distributions by Crime Rate Classification"
) +
theme_minimal()
Split Data
set.seed(42)
library(modelr)
partitions <- resample_partition(
Boston,
c(train = 0.75, test = 0.25)
)
train_df <- as.data.frame(partitions$train)
test_df <- as.data.frame(partitions$test)
log_model <- glm(
crim01 ~ rad + tax + lstat + nox + dis,
data = train_df,
family = binomial
)
prob <- predict(
log_model,
newdata = test_df,
type = "response"
)
pred <- factor(ifelse(prob > 0.5, 1, 0))
table(pred, test_df$crim01)
##
## pred 0 1
## 0 55 10
## 1 7 55
accuracy <- mean(pred == test_df$crim01)
test_error <- 1 - accuracy
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8661417
cat("Test Error:", test_error, "\n")
## Test Error: 0.1338583
library(MASS)
lda_model <- lda(
crim01 ~ rad + tax + lstat + nox + dis,
data = train_df
)
lda_pred <- predict(
lda_model,
newdata = test_df
)
table(lda_pred$class, test_df$crim01)
##
## 0 1
## 0 62 15
## 1 0 50
accuracy <- mean(lda_pred$class == test_df$crim01)
test_error <- 1 - accuracy
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8818898
cat("Test Error:", test_error, "\n")
## Test Error: 0.1181102
library(e1071)
nb_model <- naiveBayes(
crim01 ~ rad + tax + lstat + nox + dis,
data = train_df
)
nb_pred <- predict(
nb_model,
newdata = test_df
)
table(nb_pred, test_df$crim01)
##
## nb_pred 0 1
## 0 60 14
## 1 2 51
accuracy <- mean(nb_pred == test_df$crim01)
test_error <- 1 - accuracy
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8740157
cat("Test Error:", test_error, "\n")
## Test Error: 0.1259843
library(class)
train.X <- train_df[, c(
"rad",
"tax",
"lstat",
"nox",
"dis"
)]
test.X <- test_df[, c(
"rad",
"tax",
"lstat",
"nox",
"dis"
)]
train.X <- scale(train.X)
test.X <- scale(
test.X,
center = attr(train.X, "scaled:center"),
scale = attr(train.X, "scaled:scale")
)
train.Y <- train_df$crim01
test.Y <- test_df$crim01
k.values <- c(1,3,5,7,9,11,15,21)
results <- data.frame()
for(k in k.values){
pred <- knn(
train.X,
test.X,
train.Y,
k = k
)
acc <- mean(pred == test.Y)
results <- rbind(
results,
data.frame(
K = k,
Accuracy = acc,
Test_Error = 1 - acc
)
)
}
results
## K Accuracy Test_Error
## 1 1 0.9448819 0.05511811
## 2 3 0.9291339 0.07086614
## 3 5 0.9370079 0.06299213
## 4 7 0.9133858 0.08661417
## 5 9 0.9133858 0.08661417
## 6 11 0.9133858 0.08661417
## 7 15 0.9212598 0.07874016
## 8 21 0.9291339 0.07086614
| Model | Accuracy | Test Error |
|---|---|---|
| Logistic Regression | 86.61% | 13.39% |
| LDA | 88.19% | 11.81% |
| Naive Bayes | 87.40% | 12.60% |
| KNN (K = 1) | 94.49% | 5.51% |
| KNN (K = 3) | 92.91% | 7.09% |
| KNN (K = 5) | 93.70% | 6.30% |
| KNN (K = 7) | 91.34% | 8.66% |
| KNN (K = 9) | 91.34% | 8.66% |
| KNN (K = 11) | 91.34% | 8.66% |
| KNN (K = 15) | 92.13% | 7.87% |
| KNN (K = 21) | 92.91% | 7.09% |
Based on the results, KNN with \(K = 1\) performed the best, achieving the highest accuracy of 94.49% and the lowest test error of 5.51%. Among the non-KNN methods, LDA performed best with an accuracy of 88.19% and a test error of 11.81%. Overall, the KNN models produced lower test errors than Logistic Regression, LDA, and Naive Bayes for this dataset.