Assignment 3

Assignment 3: Irving Cedillo

library(ISLR2)

Warning: package 'ISLR2' was built under R version 4.5.3

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(psych)


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

library(ggplot2)
library(GGally)
library(mlbench)

Warning: package 'mlbench' was built under R version 4.5.2

library(tidyr)
library(modelr)

Warning: package 'modelr' was built under R version 4.5.2

library(MASS)

Warning: package 'MASS' was built under R version 4.5.2


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

The following object is masked from 'package:ISLR2':

    Boston

library(gtsummary)


Attaching package: 'gtsummary'

The following object is masked from 'package:MASS':

    select

library(caret)

Warning: package 'caret' was built under R version 4.5.2

Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

attach(Weekly)

13

From the describe function, volume is the one that has the highest skew at 1.62 which signifies a long tail. From the graphical representation we see that the volume is indeed showing a long tail. Volume also demonstrates a non linear pattern with year from the graphical output, but the two features also have the highest correlation value of .8

describe(Weekly)

           vars    n    mean   sd  median trimmed  mad     min     max range
Year          1 1089 2000.05 6.03 2000.00 2000.05 7.41 1990.00 2010.00 20.00
Lag1          2 1089    0.15 2.36    0.24    0.18 1.87  -18.20   12.03 30.22
Lag2          3 1089    0.15 2.36    0.24    0.18 1.87  -18.20   12.03 30.22
Lag3          4 1089    0.15 2.36    0.24    0.18 1.87  -18.20   12.03 30.22
Lag4          5 1089    0.15 2.36    0.24    0.17 1.87  -18.20   12.03 30.22
Lag5          6 1089    0.14 2.36    0.23    0.17 1.88  -18.20   12.03 30.22
Volume        7 1089    1.57 1.69    1.00    1.25 1.04    0.09    9.33  9.24
Today         8 1089    0.15 2.36    0.24    0.18 1.87  -18.20   12.03 30.22
Direction*    9 1089    1.56 0.50    2.00    1.57 0.00    1.00    2.00  1.00
            skew kurtosis   se
Year        0.00    -1.21 0.18
Lag1       -0.48     5.67 0.07
Lag2       -0.48     5.67 0.07
Lag3       -0.48     5.62 0.07
Lag4       -0.48     5.63 0.07
Lag5       -0.47     5.61 0.07
Volume      1.62     2.06 0.05
Today      -0.48     5.67 0.07
Direction* -0.22    -1.95 0.02

summary(Weekly)

      Year           Lag1               Lag2               Lag3         
 Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
 1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
 Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
 Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
 3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
 Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
      Lag4               Lag5              Volume            Today         
 Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
 1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
 Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
 Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
 3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
 Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
 Direction 
 Down:484  
 Up  :605

ggpairs(Weekly, progress = FALSE)

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

From the model summary we see that only Lag2 is statistically significant with a P value of .0296. Holding all other variables constant one unit increase in Lag2, the odds of the market moving up increase by \(6.02\%\approx (e^{.05844}-1)*100\).

log_model <- glm(Direction ~ . -Year  - Today, data = Weekly, family = "binomial")
summary(log_model)


Call:
glm(formula = Direction ~ . - Year - Today, family = "binomial", 
    data = Weekly)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181   
Lag2         0.05844    0.02686   2.175   0.0296 * 
Lag3        -0.01606    0.02666  -0.602   0.5469   
Lag4        -0.02779    0.02646  -1.050   0.2937   
Lag5        -0.01447    0.02638  -0.549   0.5833   
Volume      -0.02274    0.03690  -0.616   0.5377   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1486.4  on 1082  degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4

set.seed(42)
partitions<- resample_partition(Weekly, c(train=.8, test=.2))
train_df<-as.data.frame(partitions$train)
test_df<-as.data.frame(partitions$test)
train_df|>
  tbl_summary( by = Direction, include = c(Lag1, Lag2, Lag3, Lag4, Lag5, Volume))|>
  add_p()

Characteristic	Down N = 391¹	Up N = 480¹	p-value²
Lag1	0.37 (-0.94, 1.59)	0.13 (-1.21, 1.21)	0.033
Lag2	0.15 (-1.45, 1.32)	0.33 (-0.98, 1.44)	0.036
Lag3	0.24 (-1.16, 1.41)	0.23 (-1.07, 1.45)	>0.9
Lag4	0.34 (-1.00, 1.52)	0.24 (-1.15, 1.32)	0.2
Lag5	0.33 (-1.11, 1.44)	0.13 (-1.18, 1.32)	0.3
Volume	1.08 (0.35, 1.89)	0.94 (0.33, 2.23)	0.6
¹ Median (Q1, Q3)
² Wilcoxon rank sum test

From the corrected labeling as positive = “Up”, we see that the matrix points at a false positive having a count. This would cause a client to purchase a stock that will eventually lose value and lose money altogether. Sensitivity has a high percentage of \(81\%\) meaning that a large percentage of Up days were successfully identified. Unfortunately with a specificity of \(15\%\) this points to many Down days not being identified, reinforced by the matrix count.

fit_logistic_model <- glm(Direction ~ . -Year  - Today, 
                    data= train_df,
                    family="binomial")

pred_probs <- predict(fit_logistic_model, newdata = test_df, type = "response")
pred_classes <- ifelse(pred_probs >= 0.5, "Up", "Down")

actual_factor <- as.factor(test_df$Direction)
pred_factor <- as.factor(pred_classes)

results <- confusionMatrix(pred_factor, actual_factor, positive = "Up")
print(results)

Confusion Matrix and Statistics

          Reference
Prediction Down  Up
      Down   14  23
      Up     79 102
                                          
               Accuracy : 0.5321          
                 95% CI : (0.4635, 0.5998)
    No Information Rate : 0.5734          
    P-Value [Acc > NIR] : 0.903           
                                          
                  Kappa : -0.0363         
                                          
 Mcnemar's Test P-Value : 5.157e-08       
                                          
            Sensitivity : 0.8160          
            Specificity : 0.1505          
         Pos Pred Value : 0.5635          
         Neg Pred Value : 0.3784          
             Prevalence : 0.5734          
         Detection Rate : 0.4679          
   Detection Prevalence : 0.8303          
      Balanced Accuracy : 0.4833          
                                          
       'Positive' Class : Up

subset_weekly <- Weekly|>
  filter(between(Year, 1990,2008))|>
  select(Direction, Lag2, Year)

set.seed(42)
sub_partitions<- resample_partition(subset_weekly, c(train=.8, test=.2))
sub_train_df<-as.data.frame(sub_partitions$train)
sub_test_df<-as.data.frame(sub_partitions$test)
sub_train_df|>
  tbl_summary( by = Direction, include = c(Lag2))|>
  add_p()

Characteristic	Down N = 347¹	Up N = 440¹	p-value²
Lag2	0.13 (-1.36, 1.17)	0.27 (-1.14, 1.38)	0.070
¹ Median (Q1, Q3)
² Wilcoxon rank sum test

library(e1071)

Warning: package 'e1071' was built under R version 4.5.2


Attaching package: 'e1071'

The following object is masked from 'package:ggplot2':

    element

fit_sub_log <- glm(Direction ~ Lag2, 
                    data= sub_train_df,
                    family="binomial")

fit_lda <- lda(Direction ~ Lag2, 
                    data= sub_train_df)

fit_qda <- qda(Direction ~ Lag2, 
                    data= sub_train_df)

fit_bayes <- naiveBayes(Direction ~ Lag2, 
                        data = sub_train_df)

fit_knn <- train(Direction ~ Lag2, 
                 data = sub_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 1))


optimal_knn  <- train(Direction ~ I(Lag2^2), 
                 data = sub_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 2))


my_models <- list(
  "Log Reg" = fit_sub_log,
  "LDA" = fit_lda,
  "Qda " = fit_qda,
  "Bayes"         = fit_bayes,
  "KNN" = fit_knn,
  "Optimal - KNN" = optimal_knn
)

evaluate_model <- function(model, test_data, target_var, pos, neg) {
  if (inherits(model, "glm")) {
    probs <- predict(model, newdata = test_data, type = "response")
    preds <- ifelse(probs >= 0.5, pos, neg)
    
  } else if (inherits(model, "lda") || inherits(model, "qda")) {
    preds <- predict(model, newdata = test_data)$class
    
  } else {
    preds <- predict(model, newdata = test_data)
  }
  

  actual_factor <- factor(test_data[[target_var]], levels = c(neg, pos))
  pred_factor   <- factor(preds, levels = c(neg, pos))
  

  return(confusionMatrix(pred_factor, actual_factor, positive = pos))
}

From the output below all models performed equally as bad at around \(54\%\) accuracy. Having to choose a specific model I would choose to optimize the KNN. Though it has the lowest accuracy of \(44\%\) it does have the advantage of having a high specificity.

for (model_name in names(my_models)) {
  
  cat("\n\nMODEL :", model_name, "\n")
  results <- evaluate_model(my_models[[model_name]], 
                            sub_test_df, 
                            target_var = "Direction", 
                            pos = "Up",
                            neg="Down")
  

  print(results$table)
  cat("Overall Accuracy: ", round(results$overall["Accuracy"], 4), "\n")
  cat("Test Error:      ", round(1-results$overall["Accuracy"], 4), "\n")
  cat("Sensitivity:      ", round(results$byClass["Sensitivity"], 4), "\n")
  cat("Specificity:      ", round(results$byClass["Specificity"], 4), "\n")
}



MODEL : Log Reg 
          Reference
Prediction Down  Up
      Down    7   4
      Up     87 100
Overall Accuracy:  0.5404 
Test Error:       0.4596 
Sensitivity:       0.9615 
Specificity:       0.0745 


MODEL : LDA 
          Reference
Prediction Down  Up
      Down    7   4
      Up     87 100
Overall Accuracy:  0.5404 
Test Error:       0.4596 
Sensitivity:       0.9615 
Specificity:       0.0745 


MODEL : Qda  
          Reference
Prediction Down  Up
      Down    0   0
      Up     94 104
Overall Accuracy:  0.5253 
Test Error:       0.4747 
Sensitivity:       1 
Specificity:       0 


MODEL : Bayes 
          Reference
Prediction Down  Up
      Down    0   0
      Up     94 104
Overall Accuracy:  0.5253 
Test Error:       0.4747 
Sensitivity:       1 
Specificity:       0 


MODEL : KNN 
          Reference
Prediction Down Up
      Down   42 55
      Up     52 49
Overall Accuracy:  0.4596 
Test Error:       0.5404 
Sensitivity:       0.4712 
Specificity:       0.4468 


MODEL : Optimal - KNN 
          Reference
Prediction Down Up
      Down   27 52
      Up     67 52
Overall Accuracy:  0.399 
Test Error:       0.601 
Sensitivity:       0.5 
Specificity:       0.2872

14

The auto data set is loaded and Cylinders and Origin were changed to type Factor since Cylinders (3-8) is categorical on the number of cylinders an engine has and Origin is a location (1-3). Once the mpg01 indicator was created there was no need for mpg since it will be directly highly correlated to the indicator variable. From the pairs plot one can see that cylinders looks like a good candidate as it is able to differentiate quite well between the mpg01 categories. Displacement, horsepower and weight are also showing clear differences of mean among between the two groups of mpg01 which could indicate good predictors.

attach(Auto)

The following object is masked from package:lubridate:

    origin

The following object is masked from package:ggplot2:

    mpg

AutoClean <- Auto |> 
  mutate(
    cylinders = as.factor(cylinders),
    origin = as.factor(origin),
  )|>
  mutate(
    mpg01 = factor(ifelse(median(mpg)<mpg,1,0))
  )|>
  select(
    mpg01,
    origin,
    cylinders, 
    displacement, 
    horsepower,
    weight,
    acceleration
  )


summary(AutoClean)

 mpg01   origin  cylinders  displacement     horsepower        weight    
 0:196   1:245   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1:196   2: 68   4:199     1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
         3: 79   5:  3     Median :151.0   Median : 93.5   Median :2804  
                 6: 83     Mean   :194.4   Mean   :104.5   Mean   :2978  
                 8:103     3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
                           Max.   :455.0   Max.   :230.0   Max.   :5140  
  acceleration  
 Min.   : 8.00  
 1st Qu.:13.78  
 Median :15.50  
 Mean   :15.54  
 3rd Qu.:17.02  
 Max.   :24.80

describe(AutoClean)

             vars   n    mean     sd median trimmed    mad  min    max  range
mpg01*          1 392    1.50   0.50    1.5    1.50   0.74    1    2.0    1.0
origin*         2 392    1.58   0.81    1.0    1.47   0.00    1    3.0    2.0
cylinders*      3 392    3.21   1.33    2.0    3.15   0.00    1    5.0    4.0
displacement    4 392  194.41 104.64  151.0  183.83  90.44   68  455.0  387.0
horsepower      5 392  104.47  38.49   93.5   99.82  28.91   46  230.0  184.0
weight          6 392 2977.58 849.40 2803.5 2916.94 948.12 1613 5140.0 3527.0
acceleration    7 392   15.54   2.76   15.5   15.48   2.52    8   24.8   16.8
             skew kurtosis    se
mpg01*       0.00    -2.01  0.03
origin*      0.91    -0.86  0.04
cylinders*   0.26    -1.69  0.07
displacement 0.70    -0.79  5.29
horsepower   1.08     0.65  1.94
weight       0.52    -0.83 42.90
acceleration 0.29     0.41  0.14

ggpairs(AutoClean, progress = FALSE)

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

set.seed(42)
auto_partition<- resample_partition(AutoClean, c(train=.8, test=.2))
auto_train_df<-as.data.frame(auto_partition$train)
auto_test_df<-as.data.frame(auto_partition$test)
auto_train_df|>
  tbl_summary( by = mpg01, include = c(origin, cylinders, displacement, horsepower, weight, acceleration))|>
  add_p()

Characteristic	0 N = 156¹	1 N = 157¹	p-value²
origin			<0.001
1	135 (87%)	58 (37%)
2	12 (7.7%)	42 (27%)
3	9 (5.8%)	57 (36%)
cylinders			<0.001
3	3 (1.9%)	1 (0.6%)
4	18 (12%)	143 (91%)
5	1 (0.6%)	1 (0.6%)
6	55 (35%)	10 (6.4%)
8	79 (51%)	2 (1.3%)
displacement	260 (225, 350)	105 (91, 122)	<0.001
horsepower	121 (100, 150)	76 (67, 88)	<0.001
weight	3,617 (3,094, 4,132)	2,220 (2,020, 2,595)	<0.001
acceleration	14.50 (13.00, 16.30)	16.40 (14.70, 17.80)	<0.001
¹ n (%); Median (Q1, Q3)
² Pearson’s Chi-squared test; Fisher’s exact test; Wilcoxon rank sum test

The QDA had the lowest test error .05 out of all models. The KNN model for various K had K=3 performing the best with a test error of : .08

fit_sub_log <- glm(mpg01 ~ cylinders+horsepower+displacement+weight, 
                    data= auto_train_df,
                    family="binomial")

fit_lda <- lda(mpg01 ~ cylinders+horsepower+displacement+weight, 
                    data= auto_train_df)

fit_qda <- qda(mpg01 ~ cylinders+horsepower+displacement+weight, 
                    data= auto_train_df)

fit_bayes <- naiveBayes(mpg01 ~ cylinders+horsepower+displacement+weight, 
                        data = auto_train_df)

fit_knn <- train(mpg01 ~ cylinders+horsepower+displacement+weight, 
                 data = auto_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 1))

fit_knn3 <- train(mpg01 ~ cylinders+horsepower+displacement+weight, 
                 data = auto_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 3))

fit_knn5 <- train(mpg01 ~ cylinders+horsepower+displacement+weight, 
                 data = auto_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 5))

fit_knn10 <- train(mpg01 ~ cylinders+horsepower+displacement+weight, 
                 data = auto_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 10))

my_models <- list(
  "Log Reg" = fit_sub_log,
  "LDA" = fit_lda,
  "Qda " = fit_qda,
  "Bayes"       = fit_bayes,
  "KNN-1" = fit_knn,
  "KNN-3" = fit_knn3,
  "KNN-5" = fit_knn5,
  "KNN-10" = fit_knn10
)


for (model_name in names(my_models)) {
  
  cat("\n\nMODEL :", model_name, "\n")
  results <- evaluate_model(my_models[[model_name]], 
                            auto_test_df, 
                            target_var = "mpg01",
                            pos="1",
                            neg="0")
  

  print(results$table)
  cat("Overall Accuracy: ", round(results$overall["Accuracy"], 4), "\n")
    cat("Test Error:      ", round(1-results$overall["Accuracy"], 4), "\n")
  cat("Sensitivity:      ", round(results$byClass["Sensitivity"], 4), "\n")
  cat("Specificity:      ", round(results$byClass["Specificity"], 4), "\n")
}



MODEL : Log Reg 
          Reference
Prediction  0  1
         0 37  3
         1  3 36
Overall Accuracy:  0.9241 
Test Error:       0.0759 
Sensitivity:       0.9231 
Specificity:       0.925 


MODEL : LDA 
          Reference
Prediction  0  1
         0 38  3
         1  2 36
Overall Accuracy:  0.9367 
Test Error:       0.0633 
Sensitivity:       0.9231 
Specificity:       0.95 


MODEL : Qda  
          Reference
Prediction  0  1
         0 38  2
         1  2 37
Overall Accuracy:  0.9494 
Test Error:       0.0506 
Sensitivity:       0.9487 
Specificity:       0.95 


MODEL : Bayes 
          Reference
Prediction  0  1
         0 37  3
         1  3 36
Overall Accuracy:  0.9241 
Test Error:       0.0759 
Sensitivity:       0.9231 
Specificity:       0.925 


MODEL : KNN-1 
          Reference
Prediction  0  1
         0 36  6
         1  4 33
Overall Accuracy:  0.8734 
Test Error:       0.1266 
Sensitivity:       0.8462 
Specificity:       0.9 


MODEL : KNN-3 
          Reference
Prediction  0  1
         0 37  4
         1  3 35
Overall Accuracy:  0.9114 
Test Error:       0.0886 
Sensitivity:       0.8974 
Specificity:       0.925 


MODEL : KNN-5 
          Reference
Prediction  0  1
         0 37  5
         1  3 34
Overall Accuracy:  0.8987 
Test Error:       0.1013 
Sensitivity:       0.8718 
Specificity:       0.925 


MODEL : KNN-10 
          Reference
Prediction  0  1
         0 38  6
         1  2 33
Overall Accuracy:  0.8987 
Test Error:       0.1013 
Sensitivity:       0.8462 
Specificity:       0.95

16

I started by importing the Boston data set and cleaning up the variables that should be interpreted as factor / categorical types : chas and rad.

From the scatterplots the variables that seem most significant in separating the two classes are: indus, nox, age, dis, tax, lstat

attach(Boston)
Boston_Clean <- Boston|>
  mutate(
    Crime_Class = factor(ifelse(crim>median(crim),1,0)),
    chas = as.factor(chas),
    rad = as.factor(rad)
  )|>
  select(-crim)

summary(Boston_Clean)

       zn             indus       chas         nox               rm       
 Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850   Min.   :3.561  
 1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490   1st Qu.:5.886  
 Median :  0.00   Median : 9.69           Median :0.5380   Median :6.208  
 Mean   : 11.36   Mean   :11.14           Mean   :0.5547   Mean   :6.285  
 3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240   3rd Qu.:6.623  
 Max.   :100.00   Max.   :27.74           Max.   :0.8710   Max.   :8.780  
                                                                          
      age              dis              rad           tax       
 Min.   :  2.90   Min.   : 1.130   24     :132   Min.   :187.0  
 1st Qu.: 45.02   1st Qu.: 2.100   5      :115   1st Qu.:279.0  
 Median : 77.50   Median : 3.207   4      :110   Median :330.0  
 Mean   : 68.57   Mean   : 3.795   3      : 38   Mean   :408.2  
 3rd Qu.: 94.08   3rd Qu.: 5.188   6      : 26   3rd Qu.:666.0  
 Max.   :100.00   Max.   :12.127   2      : 24   Max.   :711.0  
                                   (Other): 61                  
    ptratio          black            lstat            medv       Crime_Class
 Min.   :12.60   Min.   :  0.32   Min.   : 1.73   Min.   : 5.00   0:253      
 1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02   1:253      
 Median :19.05   Median :391.44   Median :11.36   Median :21.20              
 Mean   :18.46   Mean   :356.67   Mean   :12.65   Mean   :22.53              
 3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00              
 Max.   :22.00   Max.   :396.90   Max.   :37.97   Max.   :50.00

describe(Boston_Clean)

             vars   n   mean     sd median trimmed    mad    min    max  range
zn              1 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00
indus           2 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28
chas*           3 506   1.07   0.25   1.00    1.00   0.00   1.00   2.00   1.00
nox             4 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49
rm              5 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22
age             6 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10
dis             7 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00
rad*            8 506   5.64   2.44   5.00    5.70   2.97   1.00   9.00   8.00
tax             9 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00
ptratio        10 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40
black          11 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58
lstat          12 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24
medv           13 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00
Crime_Class*   14 506   1.50   0.50   1.50    1.50   0.74   1.00   2.00   1.00
              skew kurtosis   se
zn            2.21     3.95 1.04
indus         0.29    -1.24 0.30
chas*         3.39     9.48 0.01
nox           0.72    -0.09 0.01
rm            0.40     1.84 0.03
age          -0.60    -0.98 1.25
dis           1.01     0.46 0.09
rad*          0.14    -1.13 0.11
tax           0.67    -1.15 7.49
ptratio      -0.80    -0.30 0.10
black        -2.87     7.10 4.06
lstat         0.90     0.46 0.32
medv          1.10     1.45 0.41
Crime_Class*  0.00    -2.00 0.02

Boston_Num <- Boston_Clean |> 
  select(where(is.numeric), Crime_Class)

ggpairs(Boston_Num, 
        mapping = aes(color = Crime_Class, alpha =.6), 
        lower = list(continuous=wrap("points",size=.25)),
        upper = list(continuous=wrap("cor", size = 1.5)),
        progress = FALSE,
        title = "Numer Varby Crime Class")

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Boston_Fact <- Boston_Clean |> 
  select(where(is.factor))

ggpairs(Boston_Fact, 
        mapping = aes(fill = Crime_Class, alpha =.6), 
        progress = FALSE,
        title = "Factor Var")

Fitting the full model: the significant variables are: zn, nox, dis, tax, black, medv

full_model_boston <- glm(Crime_Class ~ ., data = Boston_Clean, family = "binomial")

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(full_model_boston)


Call:
glm(formula = Crime_Class ~ ., family = "binomial", data = Boston_Clean)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.900e+01  2.826e+03  -0.017  0.98616    
zn          -1.653e-01  5.910e-02  -2.797  0.00515 ** 
indus       -1.359e-01  1.016e-01  -1.337  0.18114    
chas1       -3.993e-01  9.157e-01  -0.436  0.66276    
nox          6.437e+01  1.179e+01   5.460 4.76e-08 ***
rm          -8.848e-01  9.912e-01  -0.893  0.37208    
age          1.292e-02  1.470e-02   0.879  0.37948    
dis          5.308e-01  2.521e-01   2.105  0.03527 *  
rad2         1.626e+01  2.826e+03   0.006  0.99541    
rad3         1.689e+01  2.826e+03   0.006  0.99523    
rad4         2.124e+01  2.826e+03   0.008  0.99400    
rad5         1.874e+01  2.826e+03   0.007  0.99471    
rad6         1.722e+01  2.826e+03   0.006  0.99514    
rad7         2.547e+01  2.826e+03   0.009  0.99281    
rad8         2.430e+01  2.826e+03   0.009  0.99314    
rad24        4.163e+01  3.027e+03   0.014  0.98903    
tax         -9.844e-03  4.857e-03  -2.027  0.04267 *  
ptratio      6.237e-02  1.810e-01   0.344  0.73048    
black       -1.143e-02  5.754e-03  -1.986  0.04706 *  
lstat        5.931e-02  6.052e-02   0.980  0.32711    
medv         1.970e-01  9.696e-02   2.031  0.04221 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 701.46  on 505  degrees of freedom
Residual deviance: 140.55  on 485  degrees of freedom
AIC: 182.55

Number of Fisher Scoring iterations: 19

I ran both options of selected features to determine whether each set of features were significant on their own.

set.seed(42)
boston_partition<- resample_partition(Boston_Clean, c(train=.8, test=.2))

boston_train_df<-as.data.frame(boston_partition$train)
boston_test_df<-as.data.frame(boston_partition$test)


opt_1 <- c('zn', 'nox', 'dis', 'tax', 'black', 'medv')
opt_2 <- c('indus', 'nox', 'age', 'dis', 'tax', 'lstat')



boston_train_df|>
  tbl_summary( by = Crime_Class, include = all_of(opt_1))|>
  add_p()|>
  modify_caption("Opt 1 Columns: 'zn', 'nox', 'dis', 'tax', 'black', 'medv'")

Opt 1 Columns: ‘zn’, ‘nox’, ‘dis’, ‘tax’, ‘black’, ‘medv’
Characteristic	0 N = 201¹	1 N = 203¹	p-value²
zn	0 (0, 35)	0 (0, 0)	<0.001
nox	0.45 (0.43, 0.52)	0.62 (0.54, 0.70)	<0.001
dis	5.21 (3.37, 6.50)	2.19 (1.77, 3.05)	<0.001
tax	289 (254, 352)	666 (307, 666)	<0.001
black	393 (388, 397)	380 (316, 395)	<0.001
medv	23 (20, 29)	18 (14, 23)	<0.001
¹ Median (Q1, Q3)
² Wilcoxon rank sum test

boston_train_df|>
  tbl_summary( by = Crime_Class, include = all_of(opt_2))|>
  add_p()|>
  modify_caption("Opt 2 Columns: 'indus', 'nox', 'age', 'dis', 'tax', 'lstat'")

Opt 2 Columns: ‘indus’, ‘nox’, ‘age’, ‘dis’, ‘tax’, ‘lstat’
Characteristic	0 N = 201¹	1 N = 203¹	p-value²
indus	5.6 (3.4, 9.7)	18.1 (9.9, 18.1)	<0.001
nox	0.45 (0.43, 0.52)	0.62 (0.54, 0.70)	<0.001
age	48 (32, 70)	92 (80, 97)	<0.001
dis	5.21 (3.37, 6.50)	2.19 (1.77, 3.05)	<0.001
tax	289 (254, 352)	666 (307, 666)	<0.001
lstat	9 (6, 13)	15 (11, 20)	<0.001
¹ Median (Q1, Q3)
² Wilcoxon rank sum test

cat('***********OPTION 1 FEATURES ',paste0(all_of(opt_1)),'***********')

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

***********OPTION 1 FEATURES  zn nox dis tax black medv ***********

dynamic_formula <- reformulate(termlabels = opt_1, response = "Crime_Class")

fit_sub_log <- glm(dynamic_formula, 
                    data= boston_train_df,
                    family="binomial")

fit_lda <- lda(dynamic_formula, 
                    data= boston_train_df)

fit_qda <- qda(dynamic_formula, 
                    data= boston_train_df)

fit_bayes <- naiveBayes(dynamic_formula, 
                        data = boston_train_df)

fit_knn <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 1),
                 preProcess = c("center", "scale"))

fit_knn3 <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 3),
                 preProcess = c("center", "scale"))

fit_knn5 <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 5),
                 preProcess = c("center", "scale"))

fit_knn10 <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 10),
                 preProcess = c("center", "scale"))

my_models <- list(
  "Log Reg" = fit_sub_log,
  "LDA" = fit_lda,
  "Qda " = fit_qda,
  "Bayes"       = fit_bayes,
  "KNN-1" = fit_knn,
  "KNN-3" = fit_knn3,
  "KNN-5" = fit_knn5,
  "KNN-10" = fit_knn10
)


for (model_name in names(my_models)) {
  
  cat("\n\nMODEL :", model_name, "\n")
  results <- evaluate_model(my_models[[model_name]], 
                            boston_test_df, 
                            target_var = "Crime_Class",
                            pos="1",
                            neg="0")
  

  print(results$table)
  cat("Overall Accuracy: ", round(results$overall["Accuracy"], 4), "\n")
    cat("Test Error:      ", round(1-results$overall["Accuracy"], 4), "\n")
  cat("Sensitivity:      ", round(results$byClass["Sensitivity"], 4), "\n")
  cat("Specificity:      ", round(results$byClass["Specificity"], 4), "\n")
}



MODEL : Log Reg 
          Reference
Prediction  0  1
         0 47  5
         1  5 45
Overall Accuracy:  0.902 
Test Error:       0.098 
Sensitivity:       0.9 
Specificity:       0.9038 


MODEL : LDA 
          Reference
Prediction  0  1
         0 51 10
         1  1 40
Overall Accuracy:  0.8922 
Test Error:       0.1078 
Sensitivity:       0.8 
Specificity:       0.9808 


MODEL : Qda  
          Reference
Prediction  0  1
         0 50  7
         1  2 43
Overall Accuracy:  0.9118 
Test Error:       0.0882 
Sensitivity:       0.86 
Specificity:       0.9615 


MODEL : Bayes 
          Reference
Prediction  0  1
         0 50 13
         1  2 37
Overall Accuracy:  0.8529 
Test Error:       0.1471 
Sensitivity:       0.74 
Specificity:       0.9615 


MODEL : KNN-1 
          Reference
Prediction  0  1
         0 48  1
         1  4 49
Overall Accuracy:  0.951 
Test Error:       0.049 
Sensitivity:       0.98 
Specificity:       0.9231 


MODEL : KNN-3 
          Reference
Prediction  0  1
         0 44  2
         1  8 48
Overall Accuracy:  0.902 
Test Error:       0.098 
Sensitivity:       0.96 
Specificity:       0.8462 


MODEL : KNN-5 
          Reference
Prediction  0  1
         0 46  1
         1  6 49
Overall Accuracy:  0.9314 
Test Error:       0.0686 
Sensitivity:       0.98 
Specificity:       0.8846 


MODEL : KNN-10 
          Reference
Prediction  0  1
         0 44  1
         1  8 49
Overall Accuracy:  0.9118 
Test Error:       0.0882 
Sensitivity:       0.98 
Specificity:       0.8462

cat('***********OPTION 2 FEATURES ',paste0(all_of(opt_2)),'***********')

***********OPTION 2 FEATURES  indus nox age dis tax lstat ***********

dynamic_formula <- reformulate(termlabels = opt_2, response = "Crime_Class")

fit_sub_log <- glm(dynamic_formula, 
                    data= boston_train_df,
                    family="binomial")

fit_lda <- lda(dynamic_formula, 
                    data= boston_train_df)

fit_qda <- qda(dynamic_formula, 
                    data= boston_train_df)

fit_bayes <- naiveBayes(dynamic_formula, 
                        data = boston_train_df)

fit_knn <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 1),
                 preProcess = c("center", "scale"))

fit_knn3 <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 3),
                 preProcess = c("center", "scale"))

fit_knn5 <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 5),
                 preProcess = c("center", "scale"))

fit_knn10 <- train(dynamic_formula, 
                 data = boston_train_df, 
                 method = "knn", 
                 tuneGrid = data.frame(k = 10),
                 preProcess = c("center", "scale"))

my_models <- list(
  "Log Reg" = fit_sub_log,
  "LDA" = fit_lda,
  "Qda " = fit_qda,
  "Bayes"       = fit_bayes,
  "KNN-1" = fit_knn,
  "KNN-3" = fit_knn3,
  "KNN-5" = fit_knn5,
  "KNN-10" = fit_knn10
)


for (model_name in names(my_models)) {
  
  cat("\n\nMODEL :", model_name, "\n")
  results <- evaluate_model(my_models[[model_name]], 
                            boston_test_df, 
                            target_var = "Crime_Class",
                            pos="1",
                            neg="0")
  

  print(results$table)
  cat("Overall Accuracy: ", round(results$overall["Accuracy"], 4), "\n")
    cat("Test Error:      ", round(1-results$overall["Accuracy"], 4), "\n")
  cat("Sensitivity:      ", round(results$byClass["Sensitivity"], 4), "\n")
  cat("Specificity:      ", round(results$byClass["Specificity"], 4), "\n")
}



MODEL : Log Reg 
          Reference
Prediction  0  1
         0 48  7
         1  4 43
Overall Accuracy:  0.8922 
Test Error:       0.1078 
Sensitivity:       0.86 
Specificity:       0.9231 


MODEL : LDA 
          Reference
Prediction  0  1
         0 51 12
         1  1 38
Overall Accuracy:  0.8725 
Test Error:       0.1275 
Sensitivity:       0.76 
Specificity:       0.9808 


MODEL : Qda  
          Reference
Prediction  0  1
         0 49 11
         1  3 39
Overall Accuracy:  0.8627 
Test Error:       0.1373 
Sensitivity:       0.78 
Specificity:       0.9423 


MODEL : Bayes 
          Reference
Prediction  0  1
         0 47  9
         1  5 41
Overall Accuracy:  0.8627 
Test Error:       0.1373 
Sensitivity:       0.82 
Specificity:       0.9038 


MODEL : KNN-1 
          Reference
Prediction  0  1
         0 45  2
         1  7 48
Overall Accuracy:  0.9118 
Test Error:       0.0882 
Sensitivity:       0.96 
Specificity:       0.8654 


MODEL : KNN-3 
          Reference
Prediction  0  1
         0 46  1
         1  6 49
Overall Accuracy:  0.9314 
Test Error:       0.0686 
Sensitivity:       0.98 
Specificity:       0.8846 


MODEL : KNN-5 
          Reference
Prediction  0  1
         0 46  1
         1  6 49
Overall Accuracy:  0.9314 
Test Error:       0.0686 
Sensitivity:       0.98 
Specificity:       0.8846 


MODEL : KNN-10 
          Reference
Prediction  0  1
         0 44  1
         1  8 49
Overall Accuracy:  0.9118 
Test Error:       0.0882 
Sensitivity:       0.98 
Specificity:       0.8462

For both models KNN where K=3 proved to be the best choice of model regardless of the feature set chosen. Sensitivity in option 2 of features was pushed to 98%.