In this lab I will be performing classification method on three datasets from the ISLR2 package, Weekly, Auto, and Boston. The following three questions I will answer are exercises 13, 14, and 16 from the ISLR textbook. I will perform necessary data analysis such as test/training splits, data summaries, data visualization, and variable creation. The methods I will be testing with will be Logistic Regression, LDA, QDA, Naive Bayes, and KNN. Using these methods I will determine the test error/accuracy score of the model by utilizing a confusion matrix. Please feel free to toggle the code Show/Hide button to view my code.


library(tidyverse)
library(ISLR2)
library(gtsummary)
library(gt)
library(MASS)
library(class)
library(e1071)
library(rsample)

Question 13 - Weekly Dataset


This question should be Interpretationed using the Weekly data set, which is part of the ISLP package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.


(a) Data Exploration

Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

str(Weekly) 
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
dim(Weekly)
## [1] 1089    9
cor(Weekly [,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

Interpretation: This Data set contains 8 numerical variables with one Factor variable – Direction. Direction has two levels “Up” and “Down” that can be used in a classification problem. There is an interesting upward relationship between volume and year showing the newer stocks contain a higher volume of shares traded.The only significant correlation appears to be between Volume and Year. The Volume variable increases as time goes on visible in the graph below.

# Expore the Volume variable
attach(Weekly)
plot(Volume, col="maroon", pch =8, cex=0.7)


(b) Logistic Regression Model

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?


logistic = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(logistic)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Interpretation: The only variable statistically significant variable in the model would be Lag2. With a P value of 0.024503 Lag2 is a significant predictor of direction although the differences between the boxplot is very minimal this remains the strongest predictor.


Weekly |>
  ggplot(aes(
    x=Direction, 
    y=Lag2)) +
  geom_boxplot(fill = "maroon", color="black") + 
  labs(title = "Lag2's Relationship with Direction")


(c) Confusion Matrix

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm.probs <- predict(logistic, type = "response")
glm.pred <- rep("Down", 1089)
glm.pred[glm.probs > .5] = "Up"

# Create the table
t=table(glm.pred, Weekly$Direction)
t |>
  as.data.frame.matrix() |>
  rownames_to_column("Predicted") |>
  gt()
Predicted Down Up
Down 54 48
Up 430 557

mean(glm.pred == Weekly$Direction)
## [1] 0.5610652

Interpretation: Here the table is telling us the model predicts the stock market to go down 49 of the days and for it to go up the other 564 days. The other incorrect predictions are 41 and 435. I will compute a fraction to determine how often this model is correct. This model correctly predicted the direction of the market 56.3% of the time.


(d) Refined Logistic Regression Model

Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

w = Weekly |> filter(Year > 1990 & Year <= 2008)

logistic1 = glm(Direction ~ Lag2, data = w, family = binomial)
summary(logistic1)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = w)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.21128    0.06592   3.205  0.00135 **
## Lag2         0.06058    0.02940   2.061  0.03934 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1289.2  on 937  degrees of freedom
## Residual deviance: 1284.9  on 936  degrees of freedom
## AIC: 1288.9
## 
## Number of Fisher Scoring iterations: 4
glm.probs1 <- predict(logistic1, type = "response")
glm.pred1 <- rep("Down", 938)
glm.pred1[glm.probs1 > .5] = "Up"
t1=table(glm.pred1, w$Direction)
t1 |>
  as.data.frame.matrix() |>
  rownames_to_column("Predicted") |>
  gt()
Predicted Down Up
Down 22 19
Up 396 501
mean(glm.pred1 == w$Direction)
## [1] 0.5575693

Interpretation: Here the table is telling us the model predicts the stock market to go down 22 of the days between 1990 and 2008 and for it to go up the other 501 days. The other predictions are incorrect that read are 396 and 19. Between 1990 and 2008 we can predict the direction of the market with a 55.28% accuracy score.


(e) Linear Discriminant Analysis

Repeat (d) using LDA.

lda = lda(Direction ~ Lag2, data = w)

print(lda)
## Call:
## lda(Direction ~ Lag2, data = w)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.445629 0.554371 
## 
## Group means:
##             Lag2
## Down -0.04077751
## Up    0.26935769
## 
## Coefficients of linear discriminants:
##           LD1
## Lag2 0.440232
# Computer the prediction scores
lda.pred <- predict(lda, w)
lda.class <- lda.pred$class

# Create the table and fraction
t2=table(lda.class, w$Direction)
t2 |>
  as.data.frame.matrix() |>
  rownames_to_column("Predicted") |>
  gt()
Predicted Down Up
Down 21 19
Up 397 501
mean(lda.class == w$Direction)
## [1] 0.5565032

Interpretation: The table is telling us the model predicts the stock market to go down 21 of the days between 1990 and 2008 and for it to go up the other 501 days. The other predictions are incorrect that read are 397 and 19. Between 1990 and 2008 we can predict the direction of the market with a 55.65% accuracy score.


(f) Quadratic Discriminant Analysis

Repeat (d) using QDA.

qda = qda(Direction ~ Lag2, data = w)

print(qda)
## Call:
## qda(Direction ~ Lag2, data = w)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.445629 0.554371 
## 
## Group means:
##             Lag2
## Down -0.04077751
## Up    0.26935769
qda.class <- predict(qda, w)$class

# Convert to table
t3=table(qda.class, w$Direction)

t3 |>
  as.data.frame.matrix() |>
  rownames_to_column("Predicted") |>
  gt()
Predicted Down Up
Down 0 0
Up 418 520
mean(qda.class == w$Direction)
## [1] 0.554371

Interpretation: This models accuracy score is 55.44%.

Both models show that within the training data, the market goes down 44.56% of the time and 55.44% goes up. According to the model, the average down score of Lag2 was -0.04 and the average up score was .2694

(g) K Nearest Neighbors

Repeat (d) using KNN with K =1.

# Define train
train = Weekly |>  filter(Year > 1990 & Year <= 2008)
test = Weekly |>  filter(Year > 2008)

# Create test and training dataset for KNN
train.x = cbind(train$Lag2)
test.x = cbind(test$Lag2)

train.Direction = train$Direction

knn <-  knn(train.x, test.x, train.Direction, k=1)
t5=table(knn, test$Direction)

t5 |>
  as.data.frame.matrix() |>
  rownames_to_column("Predicted") |>
  gt()
Predicted Down Up
Down 21 28
Up 22 33
mean(knn == test$Direction)
## [1] 0.5192308

Interpretation: This models accuracy score is 51.92%.

(h) Naive Bayes

Repeat (d) using naive Bayes.

nb <- naiveBayes(Direction ~ Lag2, data = w)

nb.class <- predict(nb, w)
table(nb.class, w$Direction)
##         
## nb.class Down  Up
##     Down    0   0
##     Up    418 520
mean(nb.class == w$Direction)
## [1] 0.554371

Interpretation: This models accuracy score is 55.44%.


(i) Model Assessment

Which of these methods appears to provide the best results on this data?

Best Model: Out of the models that only assessed Lag2 the most accurate model was the Linear Discriminant Analysis model with a score of 55.65%.


(j) Variable Combination and Model Fit

Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Using QDA and the following model I achieve a 57.36% accuracy score.

qda(Direction ~ Lag2 +Lag1+Lag5+Lag3, data = w)

Question 14 - Auto Dataset

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Data Preperation

Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() method of the data frame. Note you may find it helpful to add a column mpg01 to the data frame by assignment. Assuming you have stored the data frame as Auto, this can be done as follows: Auto[‘mpg01’] = mpg01

auto = Auto |> 
  mutate(mpg01 = factor(case_when(
    mpg > 22.75 ~ 1, 
    mpg <= 22.75 ~ 0))
  )

glimpse(auto)
## Rows: 392
## Columns: 10
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
## $ mpg01        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, …

(b) Exploration and Selection

Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to Interpretation this question. Describe your findings.

auto |> ggplot(
    aes(x = mpg01, 
        y = horsepower)) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of Horsepower on MPG") + 
    theme_gray()

auto |> ggplot(
    aes(x = mpg01, 
        y = displacement)) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of Displacement on MPG") + 
    theme_gray()

auto |> ggplot(
    aes(x = mpg01, 
        y = weight)) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of Weight on MPG") + 
    theme_gray()

auto |> ggplot(
    aes(x = mpg01, 
        y = acceleration )) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of Acceleration  on MPG") + 
    theme_gray()

Interpretation: The useful variable in predicting a car with higher or lower case mileage appear to be Weight, Acceleration, Displacement, and Horsepower.


(c) Test/Training Data Split

Split the data into a training set and a test set.

split <- initial_split(auto, prop = .80, strata = mpg01)

auto_test <- testing(split)
auto_training <-  training(split)

glimpse(auto_test)
## Rows: 80
## Columns: 10
## $ mpg          <dbl> 14, 21, 24, 26, 11, 25, 16, 14, 13, 18, 31, 35, 20, 14, 1…
## $ cylinders    <int> 8, 6, 4, 4, 8, 4, 6, 8, 8, 6, 4, 4, 4, 8, 8, 8, 4, 4, 4, …
## $ displacement <dbl> 340, 200, 107, 121, 318, 113, 225, 351, 400, 250, 71, 72,…
## $ horsepower   <int> 160, 85, 90, 113, 210, 95, 105, 153, 170, 88, 65, 69, 90,…
## $ weight       <int> 3609, 2587, 2430, 2234, 4382, 2228, 3439, 4154, 4746, 313…
## $ acceleration <dbl> 8.0, 16.0, 14.5, 12.5, 13.5, 14.0, 15.5, 13.5, 12.0, 14.5…
## $ year         <int> 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 7…
## $ origin       <int> 1, 1, 2, 2, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 2, 2, 3, …
## $ name         <fct> plymouth 'cuda 340, ford maverick, audi 100 ls, bmw 2002,…
## $ mpg01        <fct> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, …
glimpse(auto_training)
## Rows: 312
## Columns: 10
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 15, 14, 22, 1…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 8, 8, 8, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 40…
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 15…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
## $ mpg01        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

(d) LDA

Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

auto_lda = lda(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training)

print(auto_lda)
## Call:
## lda(mpg01 ~ weight + acceleration + displacement + horsepower, 
##     data = auto_training)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##     weight acceleration displacement horsepower
## 0 3614.397     14.41474     275.7372  131.80128
## 1 2336.026     16.47821     116.3942   78.57692
## 
## Coefficients of linear discriminants:
##                       LD1
## weight       -0.000938530
## acceleration  0.008114803
## displacement -0.008745945
## horsepower    0.004525878
lda.pred1 <- predict(auto_lda, auto_training)
lda.class1 <- lda.pred1$class

# Create the table and fraction
t6=table(lda.class1, auto_training$mpg01)
t6 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
0 129 5
1 27 151

42 / 312 = .135~

Interpretation: The LDA performed with a 13.5% test error on the model.

(e) QDA

Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

auto_qda = qda(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training)

print(auto_qda)
## Call:
## qda(mpg01 ~ weight + acceleration + displacement + horsepower, 
##     data = auto_training)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##     weight acceleration displacement horsepower
## 0 3614.397     14.41474     275.7372  131.80128
## 1 2336.026     16.47821     116.3942   78.57692
qda.pred1 <- predict(auto_qda, auto_training)
qda.class1 <- lda.pred1$class

# Create the table and fraction
t7=table(qda.class1, auto_training$mpg01)
t7 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
0 129 5
1 27 151

42 / 312 = .135~

Interpretation: The QDA performed with a 13.5% test error on the model.


(f) Logistic Regression

Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

auto_logistic = glm(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training, family = binomial)

print(auto_logistic)
## 
## Call:  glm(formula = mpg01 ~ weight + acceleration + displacement + 
##     horsepower, family = binomial, data = auto_training)
## 
## Coefficients:
##  (Intercept)        weight  acceleration  displacement    horsepower  
##   11.7484130    -0.0009667    -0.0416305    -0.0161292    -0.0585108  
## 
## Degrees of Freedom: 311 Total (i.e. Null);  307 Residual
## Null Deviance:       432.5 
## Residual Deviance: 167.4     AIC: 177.4
glm.probs2 <- predict(auto_logistic, type = "response")
glm.pred2 <- rep("Down", 312)
glm.pred2[glm.probs2 > .5] = "Up"

# Create the table
t8=table(glm.pred2, auto_training$mpg01)
t8 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
Down 138 12
Up 18 144

37 / 312 = .119 ~

Interpretation: The logistic regression model performs with a 11.9% test error on its model.


(g) Naive Bayes

Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

auto_nb <- naiveBayes(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training)

print(auto_logistic)
## 
## Call:  glm(formula = mpg01 ~ weight + acceleration + displacement + 
##     horsepower, family = binomial, data = auto_training)
## 
## Coefficients:
##  (Intercept)        weight  acceleration  displacement    horsepower  
##   11.7484130    -0.0009667    -0.0416305    -0.0161292    -0.0585108  
## 
## Degrees of Freedom: 311 Total (i.e. Null);  307 Residual
## Null Deviance:       432.5 
## Residual Deviance: 167.4     AIC: 177.4
nb.pred1 <- predict(auto_nb, auto_training)
t9=table(nb.pred1, Actual = auto_training$mpg01)

# Format into table
t9 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
0 128 9
1 28 147

156 / 312 = .5 ~

Interpretation: The Naive Bayes model performs with a 50% test error on its model.


(h) KNN

Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

train_mpg01 = auto_training$mpg01
auto_training1 = auto_training |>  dplyr::select(weight, acceleration, displacement, horsepower)
auto_test1 = auto_test |>  dplyr::select(weight, acceleration, displacement, horsepower)

auto_knn <-  knn(
  train = auto_training1, 
  test = auto_test1, 
  cl = train_mpg01, 
  k=1)

print(auto_knn)
##  [1] 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1
## [39] 1 1 1 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [77] 0 1 1 0
## Levels: 0 1
# Create the table
t10=table(Predicted = auto_knn, Actual = auto_test$mpg01)
t10 |>
  as.data.frame.matrix() |>
  rownames_to_column("Predicted") |>
  gt()
Predicted 0 1
0 33 6
1 7 34

5 / 80 = .0625 ~

Interpretation: The KNN model performs with a 6.25% test error on its TEST model. The KNN model was the best model to predict whether a car will have a high MPG or low MPG.


Question 16 - Boston Dataset

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

(a) Data Preperation

# Create the binary variable based on the median 0.25651
boston = Boston |> 
  mutate(crim01 = factor(case_when(
    crim > 0.25651 ~ 1, 
    crim <= 0.25651 ~ 0))
  )

glimpse(boston)
## Rows: 506
## Columns: 15
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
## $ crim01  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…

(b) Exploration and Selection

boston |> ggplot(
    aes(x = crim01, 
        y = dis)) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of  Weighted distance to employment centers on Crime Rate") + 
    theme_gray()

boston |> ggplot(
    aes(x = crim01, 
        y = zn)) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of  Zoned residential land proportion on Crime Rate") + 
    theme_gray()

boston |> ggplot(
    aes(x = crim01, 
        y = nox)) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of Nitric oxides concentration on Crime Rate") + 
    theme_gray()

boston |> ggplot(
    aes(x = crim01, 
        y = age )) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of Average rooms per dwelling on Crime Rate") + 
    theme_gray()

boston |> ggplot(
    aes(x = crim01, 
        y = lstat )) +
    geom_boxplot(fill = "darkred") + 
    labs(title = "The Affect of Acceleration on Crime Rate") + 
    theme_gray()

Interpretation: The useful variable in predicting crime rate in the Boston dataset are dis, zn, nox, age, and lstat


(c) Test/Training Data Split

Split the data into a training set and a test set.

split1 <- initial_split(boston, prop = .80, strata = crim01)

boston_test <- testing(split1)
boston_training <-  training(split1)

glimpse(boston_test)
## Rows: 102
## Columns: 15
## $ crim    <dbl> 0.02731, 0.02985, 0.78420, 0.84054, 1.38799, 1.15172, 1.61282,…
## $ zn      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 85, 0, 0, 0, 25, 0, 0, 0, 0, 0, 0, 0, …
## $ indus   <dbl> 7.07, 2.18, 8.14, 8.14, 8.14, 8.14, 8.14, 5.96, 0.74, 10.81, 1…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.469, 0.458, 0.538, 0.538, 0.538, 0.538, 0.538, 0.499, 0.410,…
## $ rm      <dbl> 6.421, 6.430, 5.990, 5.599, 5.950, 5.701, 6.096, 5.841, 6.383,…
## $ age     <dbl> 78.9, 58.7, 81.7, 85.7, 82.0, 95.0, 96.9, 61.4, 35.7, 6.6, 53.…
## $ dis     <dbl> 4.9671, 6.0622, 4.2579, 4.4546, 3.9900, 3.7872, 3.7598, 3.3779…
## $ rad     <int> 2, 3, 4, 4, 4, 4, 4, 5, 2, 4, 5, 5, 4, 2, 2, 2, 5, 5, 5, 2, 4,…
## $ tax     <dbl> 242, 222, 307, 307, 307, 307, 307, 279, 313, 305, 398, 398, 28…
## $ ptratio <dbl> 17.8, 18.7, 21.0, 21.0, 21.0, 21.0, 21.0, 19.2, 17.3, 19.2, 18…
## $ black   <dbl> 396.90, 394.12, 386.75, 303.42, 232.60, 358.77, 248.31, 377.56…
## $ lstat   <dbl> 9.14, 5.21, 14.67, 16.51, 27.71, 18.35, 20.34, 11.41, 5.77, 6.…
## $ medv    <dbl> 21.6, 28.7, 17.5, 13.9, 13.2, 13.1, 13.5, 20.0, 24.7, 24.2, 21…
## $ crim01  <fct> 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
glimpse(boston_training)
## Rows: 404
## Columns: 15
## $ crim    <dbl> 0.00632, 0.02729, 0.03237, 0.06905, 0.08829, 0.14455, 0.21124,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5,…
## $ indus   <dbl> 2.31, 7.07, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 7.185, 6.998, 7.147, 6.012, 6.172, 5.631, 6.004, 6.377,…
## $ age     <dbl> 65.2, 61.1, 45.8, 54.2, 66.6, 96.1, 100.0, 85.9, 94.3, 82.9, 3…
## $ dis     <dbl> 4.0900, 4.9671, 6.0622, 6.0622, 5.5605, 5.9505, 6.0821, 6.5921…
## $ rad     <int> 1, 2, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3,…
## $ tax     <dbl> 296, 242, 222, 222, 311, 311, 311, 311, 311, 311, 311, 279, 27…
## $ ptratio <dbl> 15.3, 17.8, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 392.83, 394.63, 396.90, 395.60, 396.90, 386.63, 386.71…
## $ lstat   <dbl> 4.98, 4.03, 2.94, 5.33, 12.43, 19.15, 29.93, 17.10, 20.45, 13.…
## $ medv    <dbl> 24.0, 34.7, 33.4, 36.2, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21…
## $ crim01  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

(d) LDA

Perform LDA on the training data in order to predict crim01

boston_lda = lda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)

print(boston_lda)
## Call:
## lda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##        dis        zn       nox      age     lstat
## 0 5.141737 20.638614 0.4712718 51.47525  9.628812
## 1 2.494462  1.306931 0.6382921 85.27871 15.857574
## 
## Coefficients of linear discriminants:
##                LD1
## dis   -0.079498246
## zn    -0.002608143
## nox    9.886254941
## age    0.007922197
## lstat -0.004336295
lda.pred2 <- predict(boston_lda, boston_training)
lda.class2 <- lda.pred2$class

# Create the table and fraction
t10 =table(lda.class2, boston_training$crim01)
t10 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
0 175 40
1 27 162

58 / 404 = 0.144~

Interpretation: The LDA performed with a 14.4% test error on the model.

(e) QDA

Perform QDA on the training data in order to predict crim01

boston_qda = qda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)

print(boston_qda)
## Call:
## qda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##        dis        zn       nox      age     lstat
## 0 5.141737 20.638614 0.4712718 51.47525  9.628812
## 1 2.494462  1.306931 0.6382921 85.27871 15.857574
qda.pred2<- predict(boston_qda, boston_training)
qda.class2 <- lda.pred2$class

# Create the table and fraction
t11=table(qda.class2, boston_training$crim01)
t11 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
0 175 40
1 27 162

58 / 404 = 0.144~

Interpretation: The QDA performed with a 14.4% test error on the model.


(f) Logistic Regression

Perform logistic regression on the training data in order to predict crim01

boston_logistic = glm(crim01 ~ dis + zn + nox + age + lstat, data = boston_training, family = binomial)

print(boston_logistic)
## 
## Call:  glm(formula = crim01 ~ dis + zn + nox + age + lstat, family = binomial, 
##     data = boston_training)
## 
## Coefficients:
## (Intercept)          dis           zn          nox          age        lstat  
##   -18.76335      0.32642     -0.05275     33.22231      0.00459     -0.03989  
## 
## Degrees of Freedom: 403 Total (i.e. Null);  398 Residual
## Null Deviance:       560.1 
## Residual Deviance: 248.2     AIC: 260.2
glm.probs3 <- predict(boston_logistic, type = "response")
glm.pred3 <- rep("Down", 404)
glm.pred3[glm.probs3 > .5] = "Up"

# Create the table
t12=table(glm.pred3, boston_training$crim01)
t12 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
Down 172 27
Up 30 175

53 / 404 = 0.131~

Interpretation: The logistic regression performed with a 13.1% test error on the model.


(g) Naive Bayes

Perform naive Bayes on the training data in order to predict crim01

boston_nb <- naiveBayes(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)

print(boston_nb)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   0   1 
## 0.5 0.5 
## 
## Conditional probabilities:
##    dis
## Y       [,1]     [,2]
##   0 5.141737 2.098737
##   1 2.494462 1.132570
## 
##    zn
## Y        [,1]     [,2]
##   0 20.638614 28.35210
##   1  1.306931  4.99899
## 
##    nox
## Y        [,1]       [,2]
##   0 0.4712718 0.05526323
##   1 0.6382921 0.09847185
## 
##    age
## Y       [,1]     [,2]
##   0 51.47525 25.85871
##   1 85.27871 19.15260
## 
##    lstat
## Y        [,1]     [,2]
##   0  9.628812 4.950626
##   1 15.857574 7.625921
nb.pred2 <- predict(boston_nb, boston_training)
t13=table(nb.pred2, Actual = boston_training$crim01)

# Format into table
t13 |>
    as.data.frame.matrix() |>
    rownames_to_column("Predicted") |>
    gt()
Predicted 0 1
0 151 17
1 51 185

73 / 404 = .5 ~

Interpretation: The Naive Bayes model performs with a 18.1% test error on its model.


(h) KNN

Perform KNN on the training data, with several values of K, in order to predict crim01

train_crim01 = boston_training$crim01
boston_training1 = boston_training |>  dplyr::select(dis, zn, nox, age, lstat)
boston_test1 = boston_test |>  dplyr::select(dis, zn, nox, age, lstat)

boston_knn <-  knn(
  train = boston_training1, 
  test = boston_test1, 
  cl = train_crim01, 
  k=1)

print(boston_knn)
##   [1] 0 0 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0
##  [38] 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1
##  [75] 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1
## Levels: 0 1
# Create the table
t14=table(Predicted = boston_knn, Actual = boston_test$crim01)
t14 |>
  as.data.frame.matrix() |>
  rownames_to_column("Predicted") |>
  gt()
Predicted 0 1
0 35 17
1 16 34

25 / 80 = .3125 ~

Interpretation: The KNN model performs with a 31.25% test error on its TEST model.



Conclusion

The best model to predict crime rates was the logistic regression model with a test error of 13.1%. The best variables to put into this model for predicting crime rates were dis, zn, nox, age, and lstat.