library(ISLR2)
library(caret)
library(MASS)
library(class)
library(e1071)
data("Weekly")
pairs(Weekly[, -9], main = "Scatterplot Matrix of Weekly Data (Numeric Variables)", col = "blue")
I don't see any obvious patterns in the pairs scatterplot matrix, so I will try some other plots to keep checking.
hist(Weekly$Today, breaks = 30, col = "lightblue", main = "Histogram of Weekly Returns",
xlab = "Weekly Return", border = "white")
boxplot(Weekly[,2:6], main = "Boxplots of Lagged Weekly Returns",
col = c("lightblue", "lightgreen", "pink", "yellow", "orange"))
plot(Weekly$Volume, type = "l", col = "blue",
main = "Time Series of Trading Volume",
ylab = "Volume", xlab = "Time Index")
plot(Weekly$Lag1, Weekly$Today,
main = "Scatterplot of Today’s Return vs. Lag1",
xlab = "Lag1", ylab = "Today", col = "red", pch = 19)
abline(lm(Today ~ Lag1, data = Weekly), col = "blue", lwd = 2)
table(Weekly$Direction)
Down Up
484 605
barplot(table(Weekly$Direction), col = c("red", "green"),
main = "Market Direction Frequency",
ylab = "Count")
The histogram of weekly returns looks roughly normal, the boxplots of the lagged returns look similar to one another, and the frequency plot shows more Up weeks than Down weeks, which is good news for the American investing public.
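To put some numbers behind these impressions, here is a quick check of the return distribution and the Up/Down balance (a small sketch using the Weekly data already loaded):
# Summary statistics for the weekly returns
summary(Weekly$Today)
sd(Weekly$Today)
# Proportion of weeks the market moved Up vs. Down
prop.table(table(Weekly$Direction))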
# Convert Direction to a binary factor (Up = 1, Down = 0)
Weekly$Direction <- as.factor(Weekly$Direction)
# Fit the logistic regression model
logit_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
# Display model summary
summary(logit_model)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Weekly)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
Lag2 is the only predictor that appears significant (p ≈ 0.03), and it is also the only one with a positive estimate. The other lags and Volume have negative, non-significant coefficients, and the model ends up predicting Up far more often than Down.
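To make the direction of the Lag2 effect concrete, its coefficient can be converted to an odds ratio (a quick sketch using the logit_model object fitted above):
# Odds ratio: a one-unit increase in Lag2 multiplies the odds of an Up week by this factor
exp(coef(logit_model)["Lag2"])
# Approximate 95% Wald confidence interval on the odds-ratio scale
exp(confint.default(logit_model, "Lag2"))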
# Predict probabilities
pred_probs <- predict(logit_model, type = "response")
# Convert probabilities to class labels (threshold = 0.5)
pred_labels <- ifelse(pred_probs > 0.5, "Up", "Down")
# Convert to factor to match actual Direction variable
pred_labels <- factor(pred_labels, levels = levels(Weekly$Direction))
# Create confusion matrix
conf_matrix <- table(Predicted = pred_labels, Actual = Weekly$Direction)
print(conf_matrix)
Actual
Predicted Down Up
Down 54 48
Up 430 557
# Compute overall accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4)))
[1] "Accuracy: 0.5611"
confusionMatrix(pred_labels, Weekly$Direction, positive = "Up")
Confusion Matrix and Statistics
Reference
Prediction Down Up
Down 54 48
Up 430 557
Accuracy : 0.5611
95% CI : (0.531, 0.5908)
No Information Rate : 0.5556
P-Value [Acc > NIR] : 0.369
Kappa : 0.035
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9207
Specificity : 0.1116
Pos Pred Value : 0.5643
Neg Pred Value : 0.5294
Prevalence : 0.5556
Detection Rate : 0.5115
Detection Prevalence : 0.9063
Balanced Accuracy : 0.5161
'Positive' Class : Up
The model predicts "Up" far more often than "Down": 430 weeks that actually went Down were predicted as Up (false positives for the Up class), while only 54 Down weeks were identified correctly. This suggests that logistic regression with these predictors does a poor job of detecting market declines.
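For comparison, a rule that always predicts "Up" would be right about 55.6% of the time, so the fitted model barely improves on that baseline (a quick check):
# Accuracy of the naive "always Up" rule
mean(Weekly$Direction == "Up")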
# Split data into training (1990-2008) and testing (2009-2010)
train_data <- subset(Weekly, Year <= 2008)
test_data <- subset(Weekly, Year > 2008)
# Fit logistic regression using Lag2 as the only predictor
logit_model_lag2 <- glm(Direction ~ Lag2, data = train_data, family = binomial)
# Print summary of the model
summary(logit_model_lag2)
Call:
glm(formula = Direction ~ Lag2, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.20326 0.06428 3.162 0.00157 **
Lag2 0.05810 0.02870 2.024 0.04298 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1354.7 on 984 degrees of freedom
Residual deviance: 1350.5 on 983 degrees of freedom
AIC: 1354.5
Number of Fisher Scoring iterations: 4
# Predict probabilities for test data
pred_probs_test <- predict(logit_model_lag2, test_data, type = "response")
# Convert probabilities to class labels (threshold = 0.5)
pred_labels_test <- ifelse(pred_probs_test > 0.5, "Up", "Down")
# Convert to factor to match actual Direction variable
pred_labels_test <- factor(pred_labels_test, levels = levels(Weekly$Direction))
# Compute confusion matrix
conf_matrix_test <- table(Predicted = pred_labels_test, Actual = test_data$Direction)
print(conf_matrix_test)
Actual
Predicted Down Up
Down 9 5
Up 34 56
# Compute overall accuracy
accuracy_test <- sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
print(paste("Test Accuracy:", round(accuracy_test, 4)))
[1] "Test Accuracy: 0.625"
# Training data: 1990-2008
train_data <- subset(Weekly, Year <= 2008)
# Test data: 2009-2010
test_data <- subset(Weekly, Year > 2008)
# Fit LDA using Lag2 as the only predictor
lda_model <- lda(Direction ~ Lag2, data = train_data)
# Print model summary
lda_model
Call:
lda(Direction ~ Lag2, data = train_data)
Prior probabilities of groups:
Down Up
0.4477157 0.5522843
Group means:
Lag2
Down -0.03568254
Up 0.26036581
Coefficients of linear discriminants:
LD1
Lag2 0.4414162
# Predict class labels on test data
lda_predictions <- predict(lda_model, test_data)
# Extract predicted class labels
pred_labels_lda <- lda_predictions$class
# Compute confusion matrix
conf_matrix_lda <- table(Predicted = pred_labels_lda, Actual = test_data$Direction)
print(conf_matrix_lda)
Actual
Predicted Down Up
Down 9 5
Up 34 56
# Compute overall accuracy
accuracy_lda <- sum(diag(conf_matrix_lda)) / sum(conf_matrix_lda)
print(paste("LDA Test Accuracy:", round(accuracy_lda, 4)))
[1] "LDA Test Accuracy: 0.625"
# Training data: 1990-2008
train_data <- subset(Weekly, Year <= 2008)
# Test data: 2009-2010
test_data <- subset(Weekly, Year > 2008)
# Fit QDA using Lag2 as the only predictor
qda_model <- qda(Direction ~ Lag2, data = train_data)
# Print model summary
qda_model
Call:
qda(Direction ~ Lag2, data = train_data)
Prior probabilities of groups:
Down Up
0.4477157 0.5522843
Group means:
Lag2
Down -0.03568254
Up 0.26036581
# Predict class labels on test data
qda_predictions <- predict(qda_model, test_data)
# Extract predicted class labels
pred_labels_qda <- qda_predictions$class
# Compute confusion matrix
conf_matrix_qda <- table(Predicted = pred_labels_qda, Actual = test_data$Direction)
print(conf_matrix_qda)
Actual
Predicted Down Up
Down 0 0
Up 43 61
# Compute overall accuracy
accuracy_qda <- sum(diag(conf_matrix_qda)) / sum(conf_matrix_qda)
print(paste("QDA Test Accuracy:", round(accuracy_qda, 4)))
[1] "QDA Test Accuracy: 0.5865"
accuracy_knn <- sum(diag(conf_matrix_knn)) / sum(conf_matrix_knn)
print(paste("KNN Test Accuracy (K=1):", round(accuracy_knn, 4)))
[1] "KNN Test Accuracy (K=1): 0.5"
confusionMatrix(knn_pred, test_data$Direction, positive = "Up")
Confusion Matrix and Statistics
Reference
Prediction Down Up
Down 21 30
Up 22 31
Accuracy : 0.5
95% CI : (0.4003, 0.5997)
No Information Rate : 0.5865
P-Value [Acc > NIR] : 0.9700
Kappa : -0.0033
Mcnemar's Test P-Value : 0.3317
Sensitivity : 0.5082
Specificity : 0.4884
Pos Pred Value : 0.5849
Neg Pred Value : 0.4118
Prevalence : 0.5865
Detection Rate : 0.2981
Detection Prevalence : 0.5096
Balanced Accuracy : 0.4983
'Positive' Class : Up
# Split data into training (1990-2008) and testing (2009-2010)
train_data <- subset(Weekly, Year <= 2008)
test_data <- subset(Weekly, Year > 2008)
# Define predictor and response variables
train_X <- train_data[, "Lag2", drop = FALSE] # Keep as data frame
test_X <- test_data[, "Lag2", drop = FALSE]
train_Y <- train_data$Direction # Response variable for training
test_Y <- test_data$Direction # Response variable for testing
# Fit Naïve Bayes model
nb_model <- naiveBayes(Direction ~ Lag2, data = train_data)
# Print model summary
print(nb_model)
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
Down Up
0.4477157 0.5522843
Conditional probabilities:
Lag2
Y [,1] [,2]
Down -0.03568254 2.199504
Up 0.26036581 2.317485
# Predict class labels on test data
nb_predictions <- predict(nb_model, test_X)
# Compute confusion matrix
conf_matrix_nb <- table(Predicted = nb_predictions, Actual = test_Y)
print(conf_matrix_nb)
Actual
Predicted Down Up
Down 0 0
Up 43 61
# Compute overall accuracy
accuracy_nb <- sum(diag(conf_matrix_nb)) / sum(conf_matrix_nb)
print(paste("Naïve Bayes Test Accuracy:", round(accuracy_nb, 4)))
[1] "Naïve Bayes Test Accuracy: 0.5865"
Test accuracy on the 2009-2010 data:
Logistic Regression (Lag2 only): 62.5%
LDA (Linear Discriminant Analysis): 62.5%
QDA (Quadratic Discriminant Analysis): 58.7%
KNN (K = 1): 50.0%
Naïve Bayes: 58.7%
This suggests that logistic regression and LDA gave the best results for this data set, but none of the methods performs particularly well.
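One way to keep this comparison tidy is to collect the accuracies computed above into a single table (a small sketch, assuming the accuracy objects from the previous chunks are still in the workspace):
# Collect the 2009-2010 test accuracies into one data frame
data.frame(
Model = c("Logistic (Lag2)", "LDA", "QDA", "KNN (K = 1)", "Naive Bayes"),
Accuracy = round(c(accuracy_test, accuracy_lda, accuracy_qda, accuracy_knn, accuracy_nb), 4)
)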
# Create a new model with additional predictors
train_data <- subset(Weekly, Year <= 2008)
test_data <- subset(Weekly, Year > 2008)
# Create interactions
train_data$Lag1_Lag2 <- train_data$Lag1 * train_data$Lag2
test_data$Lag1_Lag2 <- test_data$Lag1 * test_data$Lag2
# Logistic regression with multiple predictors
logit_model_exp <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume + Lag1_Lag2,
data = train_data, family = binomial)
# Make predictions on test data
logit_probs <- predict(logit_model_exp, test_data, type = "response")
logit_preds <- factor(ifelse(logit_probs > 0.5, "Up", "Down"), levels = levels(test_data$Direction))
# Compute confusion matrix
conf_matrix_logit <- table(Predicted = logit_preds, Actual = test_data$Direction)
print(conf_matrix_logit)
Actual
Predicted Down Up
Down 29 42
Up 14 19
# Compute accuracy
accuracy_logit <- sum(diag(conf_matrix_logit)) / sum(conf_matrix_logit)
print(paste("Logistic Regression Accuracy:", round(accuracy_logit, 4)))
[1] "Logistic Regression Accuracy: 0.4615"
# LDA with more predictors
train_data$Direction <- as.factor(train_data$Direction)
lda_model_exp <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,data = train_data)
# Make predictions
lda_preds <- predict(lda_model_exp, test_data)$class
# Compute confusion matrix
conf_matrix_lda <- table(Predicted = lda_preds, Actual = test_data$Direction)
print(conf_matrix_lda)
Actual
Predicted Down Up
Down 31 44
Up 12 17
# Compute accuracy
accuracy_lda <- sum(diag(conf_matrix_lda)) / sum(conf_matrix_lda)
print(paste("LDA Accuracy:", round(accuracy_lda, 4)))
[1] "LDA Accuracy: 0.4615"
# Standardize predictors
train_X <- scale(train_data[, c("Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume")])
test_X <- scale(test_data[, c("Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume")],
center = attr(train_X, "scaled:center"),
scale = attr(train_X, "scaled:scale"))
train_Y <- train_data$Direction
test_Y <- test_data$Direction
# Try different values of K
for (k in c(1, 3, 5, 10, 20)) {
knn_preds <- knn(train_X, test_X, train_Y, k = k)
conf_matrix_knn <- table(Predicted = knn_preds, Actual = test_Y)
accuracy_knn <- sum(diag(conf_matrix_knn)) / sum(conf_matrix_knn)
print(paste("KNN Accuracy (K =", k, "):", round(accuracy_knn, 4)))
}
[1] "KNN Accuracy (K = 1 ): 0.4519"
[1] "KNN Accuracy (K = 3 ): 0.4904"
[1] "KNN Accuracy (K = 5 ): 0.5096"
[1] "KNN Accuracy (K = 10 ): 0.5481"
[1] "KNN Accuracy (K = 20 ): 0.5288"
# Train Naïve Bayes with more predictors
nb_model_exp <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = train_data)
# Make predictions
nb_preds <- predict(nb_model_exp, test_data)
# Compute confusion matrix
conf_matrix_nb <- table(Predicted = nb_preds, Actual = test_data$Direction)
print(conf_matrix_nb)
Actual
Predicted Down Up
Down 42 56
Up 1 5
# Compute accuracy
accuracy_nb <- sum(diag(conf_matrix_nb)) / sum(conf_matrix_nb)
print(paste("Naïve Bayes Accuracy:", round(accuracy_nb, 4)))
[1] "Naïve Bayes Accuracy: 0.4519"
data("Auto")
mpg_median <- median(Auto$mpg)
Auto$mpg01 <- ifelse(Auto$mpg > mpg_median, 1, 0)
# Boxplots to compare numeric features across mpg01 groups
boxplot(Auto$horsepower ~ Auto$mpg01, col = c("red", "blue"),
main = "Horsepower vs. MPG",
xlab = "MPG Category (0 = Low, 1 = High)", ylab = "Horsepower")
boxplot(Auto$weight ~ Auto$mpg01, col = c("red", "blue"),
main = "Weight vs. MPG",
xlab = "MPG Category (0 = Low, 1 = High)", ylab = "Weight")
boxplot(Auto$acceleration ~ Auto$mpg01, col = c("red", "blue"),
main = "Acceleration vs. MPG",
xlab = "MPG Category (0 = Low, 1 = High)", ylab = "Acceleration")
boxplot(Auto$displacement ~ Auto$mpg01, col = c("red", "blue"),
main = "Displacement vs. MPG",
xlab = "MPG Category (0 = Low, 1 = High)", ylab = "Displacement")
# Scatterplot of weight vs. horsepower, colored by mpg01
plot(Auto$horsepower, Auto$weight, col = as.numeric(Auto$mpg01) + 2,
main = "Weight vs. Horsepower (Colored by MPG Category)",
xlab = "Horsepower", ylab = "Weight", pch = 19)
legend("topright", legend = c("Low MPG", "High MPG"), col = c(2, 3), pch = 19)
Separating MPG into high and low categories turns out to be very useful because there is clear separation between the two groups. High-MPG cars tend to have lower weight, horsepower, and displacement (acceleration separates the groups less cleanly), and the scatterplot shows the strong positive relationship between weight and horsepower.
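The correlations back up this visual impression (a quick check on the numeric Auto variables):
# Correlations between mpg and the candidate predictors
round(cor(Auto[, c("mpg", "horsepower", "weight", "displacement", "acceleration")]), 2)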
set.seed(42)
# Create a partition (80% training, 20% testing)
train_idx <- createDataPartition(Auto$mpg01, p = 0.8, list = FALSE)
# Create training and test sets
train <- Auto[train_idx, ]
test <- Auto[-train_idx, ]
# Check the dimensions of the split
dim(train) # Training set size
[1] 314 10
dim(test) # Test set size
[1] 78 10
lda_mpg <- lda(mpg01 ~ horsepower + weight + displacement, data=train)
summary(lda_mpg)
Length Class Mode
prior 2 -none- numeric
counts 2 -none- numeric
means 6 -none- numeric
scaling 3 -none- numeric
lev 2 -none- character
svd 1 -none- numeric
N 1 -none- numeric
call 3 -none- call
terms 3 terms call
xlevels 0 -none- list
lda_preds <- predict(lda_mpg, test)$class
conf_matrix_lda <- table(Predicted = lda_preds, Actual = test$mpg01)
print(conf_matrix_lda)
Actual
Predicted 0 1
0 31 1
1 8 38
confusionMatrix(lda_preds, test$mpg01, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 31 1
1 8 38
Accuracy : 0.8846
95% CI : (0.7922, 0.9459)
No Information Rate : 0.5
P-Value [Acc > NIR] : 6.906e-13
Kappa : 0.7692
Mcnemar's Test P-Value : 0.0455
Sensitivity : 0.9744
Specificity : 0.7949
Pos Pred Value : 0.8261
Neg Pred Value : 0.9688
Prevalence : 0.5000
Detection Rate : 0.4872
Detection Prevalence : 0.5897
Balanced Accuracy : 0.8846
'Positive' Class : 1
test_error_lda <- 1 - sum(diag(conf_matrix_lda)) / sum(conf_matrix_lda)
print(paste("LDA Test Error Rate:", round(test_error_lda, 4)))
[1] "LDA Test Error Rate: 0.1154"
The LDA test error rate was 11.54%.
qda_mpg <- qda(mpg01 ~ horsepower + weight + displacement, data = train)
qda_preds <- predict(qda_mpg, test)$class
conf_matrix_qda <- table(Predicted = qda_preds, Actual = test$mpg01)
confusionMatrix(qda_preds, test$mpg01, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 33 3
1 6 36
Accuracy : 0.8846
95% CI : (0.7922, 0.9459)
No Information Rate : 0.5
P-Value [Acc > NIR] : 6.906e-13
Kappa : 0.7692
Mcnemar's Test P-Value : 0.505
Sensitivity : 0.9231
Specificity : 0.8462
Pos Pred Value : 0.8571
Neg Pred Value : 0.9167
Prevalence : 0.5000
Detection Rate : 0.4615
Detection Prevalence : 0.5385
Balanced Accuracy : 0.8846
'Positive' Class : 1
test_error_qda <- 1 - sum(diag(conf_matrix_qda)) / sum(conf_matrix_qda)
print(paste("QDA Test Error Rate:", round(test_error_qda, 4)))
[1] "QDA Test Error Rate: 0.1154"
It is notable that the error rates for LDA and QDA are identical. I suspect it has to do with the nature of the two methods: with the same predictors and a small test set, they draw similar decision boundaries and end up making the same number of mistakes.
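To see whether the two models are actually making the same predictions or just the same number of mistakes, the two prediction vectors can be compared directly (a quick sketch using the lda_preds and qda_preds objects above):
# Cross-tabulate the LDA and QDA test predictions
table(LDA = lda_preds, QDA = qda_preds)
# Proportion of test cars on which the two models agree
mean(as.character(lda_preds) == as.character(qda_preds))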
logit_mpg <- glm(mpg01 ~ horsepower + weight + displacement,
data = train, family = binomial)
logit_probs <- predict(logit_mpg, test, type = "response")
logit_preds <- factor(ifelse(logit_probs > 0.5, 1, 0), levels = c(0, 1))
conf_matrix_logit <- table(Predicted = logit_preds, Actual = test$mpg01)
confusionMatrix(logit_preds, test$mpg01, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 32 2
1 7 37
Accuracy : 0.8846
95% CI : (0.7922, 0.9459)
No Information Rate : 0.5
P-Value [Acc > NIR] : 6.906e-13
Kappa : 0.7692
Mcnemar's Test P-Value : 0.1824
Sensitivity : 0.9487
Specificity : 0.8205
Pos Pred Value : 0.8409
Neg Pred Value : 0.9412
Prevalence : 0.5000
Detection Rate : 0.4744
Detection Prevalence : 0.5641
Balanced Accuracy : 0.8846
'Positive' Class : 1
test_error_logit <- 1 - sum(diag(conf_matrix_logit)) / sum(conf_matrix_logit)
print(paste("Logistic Regression Test Error Rate:", round(test_error_logit, 4)))
[1] "Logistic Regression Test Error Rate: 0.1154"
The error rate is the same on this test as well. Either I am doing something terribly wrong, or this is simply how this data affects all of these methods.
nb_mpg <- naiveBayes(mpg01 ~ horsepower + weight + displacement, data = train)
nb_preds <- predict(nb_mpg, test)
confusionMatrix(nb_preds, test$mpg01, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 31 1
1 8 38
Accuracy : 0.8846
95% CI : (0.7922, 0.9459)
No Information Rate : 0.5
P-Value [Acc > NIR] : 6.906e-13
Kappa : 0.7692
Mcnemar's Test P-Value : 0.0455
Sensitivity : 0.9744
Specificity : 0.7949
Pos Pred Value : 0.8261
Neg Pred Value : 0.9688
Prevalence : 0.5000
Detection Rate : 0.4872
Detection Prevalence : 0.5897
Balanced Accuracy : 0.8846
'Positive' Class : 1
conf_matrix_nb <- table(Predicted = nb_preds, Actual = test$mpg01)
test_error_nb <- 1 - sum(diag(conf_matrix_nb)) / sum(conf_matrix_nb)
print(paste("Naïve Bayes Test Error Rate:", round(test_error_nb, 4)))
[1] "Naïve Bayes Test Error Rate: 0.1154"
The error rate coming out of this test is the same as the other tests, which is unsettling at first. With only 78 test cars and well-separated classes, though, several different classifiers can easily make the same number of errors without making the same predictions.
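One way to confirm nothing is wrong is to count each model's errors directly; identical error rates only mean the models miss the same number of test cars, not necessarily the same ones (a small sketch, assuming the four prediction vectors from the chunks above are still in the workspace):
# Number of misclassified test cars for each model
sapply(list(LDA = lda_preds, QDA = qda_preds, Logistic = logit_preds, NaiveBayes = nb_preds),
function(p) sum(as.character(p) != as.character(test$mpg01)))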
train_X <- scale(train[, c("horsepower", "weight", "displacement")])
test_X <- scale(test[, c("horsepower", "weight", "displacement")],
center = attr(train_X, "scaled:center"),
scale = attr(train_X, "scaled:scale"))
train_Y <- train$mpg01
test_Y <- test$mpg01
k_values <- c(1, 3, 5, 10, 15, 20)
test_errors <- numeric(length(k_values))
for (i in seq_along(k_values)) {
set.seed(123) # Ensure reproducibility
knn_preds <- knn(train_X, test_X, train_Y, k = k_values[i])
conf_matrix_knn <- table(Predicted = knn_preds, Actual = test_Y)
test_errors[i] <- 1 - sum(diag(conf_matrix_knn)) / sum(conf_matrix_knn)
print(paste("K =", k_values[i], "Test Error Rate:", round(test_errors[i], 4)))
}
[1] "K = 1 Test Error Rate: 0.1026"
[1] "K = 3 Test Error Rate: 0.0897"
[1] "K = 5 Test Error Rate: 0.0641"
[1] "K = 10 Test Error Rate: 0.1026"
[1] "K = 15 Test Error Rate: 0.1026"
[1] "K = 20 Test Error Rate: 0.1154"
K = 5 was the most accurate, while K = 20 gives the same error rate as the other classifiers.
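Plotting the stored error rates against K makes this comparison easier to see (a small sketch using the k_values and test_errors vectors from the loop above):
# Test error as a function of K
plot(k_values, test_errors, type = "b", pch = 19, col = "blue",
xlab = "K (number of neighbors)", ylab = "Test Error Rate",
main = "KNN Test Error vs. K")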
# Compute the median crime rate
crime_median <- median(Boston$crim)
# Create the binary response variable
Boston$HighCrime <- ifelse(Boston$crim > crime_median, "High", "Low")
# Convert it to a factor for classification
Boston$HighCrime <- factor(Boston$HighCrime, levels = c("Low", "High"))
# Check distribution
table(Boston$HighCrime)
Low High
253 253
set.seed(123) # Ensure reproducibility
train_idx <- createDataPartition(Boston$HighCrime, p = 0.7, list = FALSE)
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]
# Fit logistic regression model using selected predictors
logit_model <- glm(HighCrime ~ lstat + rm + dis + tax + ptratio,
data = train_data, family = binomial)
# Make predictions on test data
logit_probs <- predict(logit_model, test_data, type = "response")
# Convert probabilities to class labels
logit_preds <- factor(ifelse(logit_probs > 0.5, "High", "Low"), levels = c("Low", "High"))
# Compute confusion matrix
conf_matrix_logit <- confusionMatrix(logit_preds, test_data$HighCrime)
print(conf_matrix_logit)
Confusion Matrix and Statistics
Reference
Prediction Low High
Low 68 13
High 7 62
Accuracy : 0.8667
95% CI : (0.8016, 0.9166)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7333
Mcnemar's Test P-Value : 0.2636
Sensitivity : 0.9067
Specificity : 0.8267
Pos Pred Value : 0.8395
Neg Pred Value : 0.8986
Prevalence : 0.5000
Detection Rate : 0.4533
Detection Prevalence : 0.5400
Balanced Accuracy : 0.8667
'Positive' Class : Low
# Load MASS package (if not already loaded)
library(MASS)
# Fit LDA model
lda_model <- lda(HighCrime ~ lstat + rm + dis + tax + ptratio, data = train_data)
# Make predictions
lda_preds <- predict(lda_model, test_data)$class
# Compute confusion matrix
conf_matrix_lda <- confusionMatrix(lda_preds, test_data$HighCrime)
print(conf_matrix_lda)
Confusion Matrix and Statistics
Reference
Prediction Low High
Low 68 16
High 7 59
Accuracy : 0.8467
95% CI : (0.7789, 0.9002)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6933
Mcnemar's Test P-Value : 0.09529
Sensitivity : 0.9067
Specificity : 0.7867
Pos Pred Value : 0.8095
Neg Pred Value : 0.8939
Prevalence : 0.5000
Detection Rate : 0.4533
Detection Prevalence : 0.5600
Balanced Accuracy : 0.8467
'Positive' Class : Low
# Fit Naïve Bayes model
nb_model <- naiveBayes(HighCrime ~ lstat + rm + dis + tax + ptratio, data = train_data)
# Make predictions
nb_preds <- predict(nb_model, test_data)
# Compute confusion matrix
conf_matrix_nb <- confusionMatrix(nb_preds, test_data$HighCrime)
print(conf_matrix_nb)
Confusion Matrix and Statistics
Reference
Prediction Low High
Low 67 16
High 8 59
Accuracy : 0.84
95% CI : (0.7714, 0.8947)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.68
Mcnemar's Test P-Value : 0.153
Sensitivity : 0.8933
Specificity : 0.7867
Pos Pred Value : 0.8072
Neg Pred Value : 0.8806
Prevalence : 0.5000
Detection Rate : 0.4467
Detection Prevalence : 0.5533
Balanced Accuracy : 0.8400
'Positive' Class : Low
# Extract predictors and response
train_X <- scale(train_data[, c("lstat", "rm", "dis", "tax", "ptratio")])
test_X <- scale(test_data[, c("lstat", "rm", "dis", "tax", "ptratio")],
center = attr(train_X, "scaled:center"),
scale = attr(train_X, "scaled:scale"))
train_Y <- train_data$HighCrime
test_Y <- test_data$HighCrime
# Try different values of K
for (k in c(1, 3, 5, 10, 15)) {
knn_preds <- knn(train_X, test_X, train_Y, k = k)
conf_matrix_knn <- confusionMatrix(knn_preds, test_Y)
print(paste("KNN Accuracy (K =", k, "):", round(conf_matrix_knn$overall["Accuracy"], 4)))
}
[1] "KNN Accuracy (K = 1 ): 0.9267"
[1] "KNN Accuracy (K = 3 ): 0.9267"
[1] "KNN Accuracy (K = 5 ): 0.9267"
[1] "KNN Accuracy (K = 10 ): 0.9133"
[1] "KNN Accuracy (K = 15 ): 0.9267"
Model accuracy on the Boston test set:
Logistic Regression: ~87%
LDA: ~85%
Naïve Bayes: ~84%
KNN (best, K = 1/3/5/15): ~93%
KNN was the most accurate, although all four models were highly accurate and close to one another. I think this also shows how much the data set matters: compared with the Weekly and Auto exercises, this one produced by far the most accurate models across the whole variety of tests.
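As a final check, the accuracies can be pulled straight out of the stored caret confusionMatrix objects rather than copied by hand (a small sketch; the best KNN accuracy is not stored by the loop above, so it is read from the printed output):
# Accuracies from the confusionMatrix objects computed above
c(Logistic = conf_matrix_logit$overall["Accuracy"],
LDA = conf_matrix_lda$overall["Accuracy"],
NaiveBayes = conf_matrix_nb$overall["Accuracy"])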