Assignment 3

#13.

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

attach(Weekly)

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

str(Weekly)

## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

knitr::kable(summary(Weekly))

Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today	Direction
Min. :1990	Min. :-18.1950	Min. :-18.1950	Min. :-18.1950	Min. :-18.1950	Min. :-18.1950	Min. :0.08747	Min. :-18.1950	Down:484
1st Qu.:1995	1st Qu.: -1.1540	1st Qu.: -1.1540	1st Qu.: -1.1580	1st Qu.: -1.1580	1st Qu.: -1.1660	1st Qu.:0.33202	1st Qu.: -1.1540	Up :605
Median :2000	Median : 0.2410	Median : 0.2410	Median : 0.2410	Median : 0.2380	Median : 0.2340	Median :1.00268	Median : 0.2410	NA
Mean :2000	Mean : 0.1506	Mean : 0.1511	Mean : 0.1472	Mean : 0.1458	Mean : 0.1399	Mean :1.57462	Mean : 0.1499	NA
3rd Qu.:2005	3rd Qu.: 1.4050	3rd Qu.: 1.4090	3rd Qu.: 1.4090	3rd Qu.: 1.4090	3rd Qu.: 1.4050	3rd Qu.:2.05373	3rd Qu.: 1.4050	NA
Max. :2010	Max. : 12.0260	Max. : 12.0260	Max. : 12.0260	Max. : 12.0260	Max. : 12.0260	Max. :9.32821	Max. : 12.0260	NA

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

fit_glm = glm(Direction ~ . -Year -Today, data = Weekly, family = binomial)
summary(fit_glm)

## 
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial, 
##     data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Lag 2 is statistically significant.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm_wk_probs = predict(fit_glm, type = "response")
glm_wk_preds = ifelse(glm_wk_probs > 0.5, "Up", "Down")
caret::confusionMatrix(as.factor(glm_wk_preds), Direction, positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   54  48
##       Up    430 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.5556         
##     P-Value [Acc > NIR] : 0.369          
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.1116         
##          Pos Pred Value : 0.5643         
##          Neg Pred Value : 0.5294         
##              Prevalence : 0.5556         
##          Detection Rate : 0.5115         
##    Detection Prevalence : 0.9063         
##       Balanced Accuracy : 0.5161         
##                                          
##        'Positive' Class : Up             
##

The model is predicting a lot of Ups relative to the amount of Downs while getting 56% of its predictions correct.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

# splitting data
train = (Year < 2009)
Weekly_test = Weekly[!train,]
Direction_test = Weekly_test$Direction

# fitting glm
glm_fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)

# getting predictions
glm_probs = predict(glm_fit, newdata = Weekly_test, type = "response")
glm_pred = rep("Down", 104)
glm_pred[glm_probs > 0.5] = "Up"

# calculate accuracy
mean(glm_pred == Direction_test)

## [1] 0.625

# confusion matrix
caret::confusionMatrix(as.factor(glm_pred), Direction_test, positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
##

This model predicts with a 62.5% accuracy on the held out data.

(e) Repeat (d) using LDA.

# fitting lda
lda_fit = MASS::lda(Direction ~ Lag2, data = Weekly, subset = train)

# two ways of getting accuracy
lda_pred = predict(lda_fit, newdata = Weekly_test)
names(lda_pred)

## [1] "class"     "posterior" "x"

lda_class = lda_pred$class
table(lda_class, Direction_test)

##          Direction_test
## lda_class Down Up
##      Down    9  5
##      Up     34 56

mean(lda_class == Direction_test)

## [1] 0.625

lda_probs = lda_pred$posterior[,2]
lda_pred = rep("Down", 104)
lda_pred[lda_probs > 0.5] = 'Up'
table(lda_pred, Direction_test)

##         Direction_test
## lda_pred Down Up
##     Down    9  5
##     Up     34 56

mean(lda_pred == Direction_test)

## [1] 0.625

#confusion matrix for easier visualization
caret::confusionMatrix(as.factor(lda_pred), Direction_test, positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
##

The lda model predicted with the same accuracy as the glm.

(f) Repeat (d) using QDA.

# fitting qda
qda_fit = MASS::qda(Direction ~ Lag2, data = Weekly, subset = train)

# calculating accuracy
qda_class = predict(qda_fit, newdata = Weekly_test)$class
table(qda_class, Direction_test)

##          Direction_test
## qda_class Down Up
##      Down    0  0
##      Up     43 61

mean(qda_class == Direction_test)

## [1] 0.5865385

# confusion matrix
caret::confusionMatrix(as.factor(qda_class), Direction_test, positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
##

The qda did not perform as well as glm or lda.

(g) Repeat (d) using KNN with K = 1.

train.X = cbind(Lag2, Lag2)[train,]
test.X = cbind(Lag2, Lag2)[!train,]
train.Direction = Direction[train]

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction_test)

##         Direction_test
## knn.pred Down Up
##     Down   21 30
##     Up     22 31

mean(knn.pred == Direction_test)

## [1] 0.5

(h) Repeat (d) using naive Bayes.

# fit naive Bayes
nb_fit = e1071::naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)

# calculating accuracy
nb_class = predict(nb_fit, newdata = Weekly_test)
table(nb_class, Direction_test)

##         Direction_test
## nb_class Down Up
##     Down    0  0
##     Up     43 61

mean(nb_class == Direction_test)

## [1] 0.5865385

# confusion matrix
caret::confusionMatrix(as.factor(nb_class), Direction_test, positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
##

(i) Which of these methods appears to provide the best results on this data? The logistic regression looks to be the best because it was tied for the highest accuracy.

(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

# knn with k = 3
train.X = cbind(Lag2, Lag2)[train,]
test.X = cbind(Lag2, Lag2)[!train,]
train.Direction = Direction[train]

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction_test)

##         Direction_test
## knn.pred Down Up
##     Down   16 20
##     Up     27 41

mean(knn.pred == Direction_test)

## [1] 0.5480769

# knn with k = 3 and different predictors
train.X = cbind(Lag1, Lag2)[train,]
test.X = cbind(Lag1, Lag2)[!train,]
train.Direction = Direction[train]

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction_test)

##         Direction_test
## knn.pred Down Up
##     Down   22 29
##     Up     21 32

mean(knn.pred == Direction_test)

## [1] 0.5192308

#14.

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

Auto_df = Auto
attach(Auto_df)

## The following object is masked from package:lubridate:
## 
##     origin

## The following object is masked from package:ggplot2:
## 
##     mpg

Auto_df$mpg01 = ifelse(Auto_df$mpg > median(Auto_df$mpg), 1, 0)
Auto_df$cylinders = as.factor(Auto_df$cylinders)

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

glimpse(Auto_df)

## Rows: 392
## Columns: 10
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <fct> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
## $ mpg01        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, …

Auto_df |>
  ggplot(mapping = aes(x = mpg01)) +
  geom_bar()

Auto_df |>
  ggplot(mapping = aes(x = as.factor(cylinders), y = mpg)) +
  geom_boxplot(varwidth = T)

pairs(Auto_df[,1:10])

displacement, horsepower, and weight look to be have an effect on mpg and therefore mpg01. mpg should also be a strong predictor since mpg01 was created from mpg.

(c) Split the data into a training set and a test set.

# splitting data
set.seed(1)
index = sample(1:nrow(Auto_df), 0.7 * nrow(Auto_df))
train_auto = Auto_df[index,]
test_auto = Auto_df[-index,]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# fit lda with mpg
lda_fit_auto = MASS::lda(mpg01 ~ mpg, data = train_auto)

# getting accuracy
lda_pred_auto = predict(lda_fit_auto, newdata = test_auto)
lda_class_auto = lda_pred_auto$class
table(lda_class_auto, test_auto$mpg01)

##               
## lda_class_auto  0  1
##              0 61  2
##              1  0 55

mean(lda_class_auto == test_auto$mpg01)

## [1] 0.9830508

# fit lda with other features
lda_fit_auto = MASS::lda(mpg01 ~ weight + horsepower + displacement, data = train_auto)

# getting accuracy
lda_pred_auto = predict(lda_fit_auto, newdata = test_auto)
lda_class_auto = lda_pred_auto$class
table(lda_class_auto, test_auto$mpg01)

##               
## lda_class_auto  0  1
##              0 47  1
##              1 14 56

mean(lda_class_auto == test_auto$mpg01)

## [1] 0.8728814

1 - mean(lda_class_auto == test_auto$mpg01)

## [1] 0.1271186

For the model without using mpg, the error rate was 12.7%.

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# fit qda
qda_fit_auto = MASS::qda(mpg01 ~ weight + horsepower + displacement, data = train_auto)

# getting accuracy
qda_pred_auto = predict(qda_fit_auto, newdata = test_auto)
qda_class_auto = qda_pred_auto$class
table(qda_class_auto, test_auto$mpg01)

##               
## qda_class_auto  0  1
##              0 51  3
##              1 10 54

mean(qda_class_auto == test_auto$mpg01)

## [1] 0.8898305

1 - mean(qda_class_auto == test_auto$mpg01)

## [1] 0.1101695

There was a lower test error with the qda. 11%

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# fit glm with other features
glm_fit_auto = glm(mpg01 ~ weight + horsepower + displacement, data = train_auto, family = binomial)

# getting predictions
glm_probs_auto = predict(glm_fit_auto, newdata = test_auto, type = "response")
glm_pred_auto = rep("Down", 118)
glm_pred_auto[glm_probs_auto > 0.5] = "1"

table(glm_pred_auto, test_auto$mpg01)

##              
## glm_pred_auto  0  1
##          1     8 54
##          Down 53  3

# calculate accuracy
mean(glm_pred_auto == test_auto$mpg01)

## [1] 0.4576271

1 - mean(glm_pred_auto == test_auto$mpg01)

## [1] 0.5423729

There was a 54% error rate with logistic regression.

(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# fit naive Bayes
nb_fit_auto = e1071::naiveBayes(mpg01 ~ weight + horsepower + displacement, data = train_auto)

# getting accuracy
nb_class_auto = predict(nb_fit_auto, newdata = test_auto)
table(nb_class_auto, test_auto$mpg01)

##              
## nb_class_auto  0  1
##             0 50  2
##             1 11 55

mean(nb_class_auto == test_auto$mpg01)

## [1] 0.8898305

1 - mean(nb_class_auto == test_auto$mpg01)

## [1] 0.1101695

The error rate was 11% with a naive Bayes model.

(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

# k = 1
train.X = cbind(train_auto$displacement, train_auto$weight, train_auto$horsepower)
test.X = cbind(test_auto$displacement, test_auto$weight, test_auto$horsepower)
train.mpg01 = train_auto$mpg01

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.mpg01, k = 1)
table(knn.pred, test_auto$mpg01)

##         
## knn.pred  0  1
##        0 51  6
##        1 10 51

mean(knn.pred == test_auto$mpg01)

## [1] 0.8644068

1 - mean(knn.pred == test_auto$mpg01)

## [1] 0.1355932

k = 1 had a test error of 13.5%

# k = 3
train.X = cbind(train_auto$displacement, train_auto$weight, train_auto$horsepower)
test.X = cbind(test_auto$displacement, test_auto$weight, test_auto$horsepower)
train.mpg01 = train_auto$mpg01

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.mpg01, k = 3)
table(knn.pred, test_auto$mpg01)

##         
## knn.pred  0  1
##        0 52  4
##        1  9 53

mean(knn.pred == test_auto$mpg01)

## [1] 0.8898305

1 - mean(knn.pred == test_auto$mpg01)

## [1] 0.1101695

k = 3 had a test error of 11%

# k = 5
train.X = cbind(train_auto$displacement, train_auto$weight, train_auto$horsepower)
test.X = cbind(test_auto$displacement, test_auto$weight, test_auto$horsepower)
train.mpg01 = train_auto$mpg01

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.mpg01, k = 5)
table(knn.pred, test_auto$mpg01)

##         
## knn.pred  0  1
##        0 51  5
##        1 10 52

mean(knn.pred == test_auto$mpg01)

## [1] 0.8728814

1 - mean(knn.pred == test_auto$mpg01)

## [1] 0.1271186

k = 5 had a test error of 12.7%

# k = 4
train.X = cbind(train_auto$displacement, train_auto$weight, train_auto$horsepower)
test.X = cbind(test_auto$displacement, test_auto$weight, test_auto$horsepower)
train.mpg01 = train_auto$mpg01

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.mpg01, k = 4)
table(knn.pred, test_auto$mpg01)

##         
## knn.pred  0  1
##        0 52  6
##        1  9 51

mean(knn.pred == test_auto$mpg01)

## [1] 0.8728814

1 - mean(knn.pred == test_auto$mpg01)

## [1] 0.1271186

# k = 2
train.X = cbind(train_auto$displacement, train_auto$weight, train_auto$horsepower)
test.X = cbind(test_auto$displacement, test_auto$weight, test_auto$horsepower)
train.mpg01 = train_auto$mpg01

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.mpg01, k = 2)
table(knn.pred, test_auto$mpg01)

##         
## knn.pred  0  1
##        0 52  5
##        1  9 52

mean(knn.pred == test_auto$mpg01)

## [1] 0.8813559

1 - mean(knn.pred == test_auto$mpg01)

## [1] 0.1186441

k = 2 had an error rate of 11.9%

The best k value for this dataset is k = 3.

#16.

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

attach(Boston)

glimpse(Boston)

## Rows: 506
## Columns: 13
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

# creating new df with crim01 column
# 1 for above median `crim` or 0 for below median `crim`

bos = Boston
bos$crim01 = ifelse(bos$crim > median(bos$crim), 1, 0)

glimpse(bos)

## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
## $ crim01  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…

attach(bos)

## The following objects are masked from Boston:
## 
##     age, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad, rm,
##     tax, zn

# splitting the data
set.seed(1)
index = sample(1:nrow(bos), 0.7 * nrow(bos))
train_bos = bos[index,]
test_bos = bos[-index,]

pairs(train_bos[1:13])

# fit glm
glm_fit_bos = glm(crim01 ~ . -crim, data = train_bos, family = binomial)
summary(glm_fit_bos)

## 
## Call:
## glm(formula = crim01 ~ . - crim, family = binomial, data = train_bos)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -45.733897   7.752156  -5.900 3.65e-09 ***
## zn           -0.106404   0.045295  -2.349 0.018816 *  
## indus        -0.066759   0.056948  -1.172 0.241088    
## chas          0.093650   0.802718   0.117 0.907125    
## nox          56.037879  10.016233   5.595 2.21e-08 ***
## rm           -0.109568   0.873739  -0.125 0.900205    
## age           0.021431   0.014179   1.511 0.130674    
## dis           1.116968   0.298179   3.746 0.000180 ***
## rad           0.661968   0.182005   3.637 0.000276 ***
## tax          -0.006115   0.003145  -1.944 0.051862 .  
## ptratio       0.262614   0.149593   1.756 0.079170 .  
## lstat         0.118890   0.058298   2.039 0.041415 *  
## medv          0.176224   0.085081   2.071 0.038335 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.65  on 353  degrees of freedom
## Residual deviance: 143.40  on 341  degrees of freedom
## AIC: 169.4
## 
## Number of Fisher Scoring iterations: 9

glm_fit_bos_step = step(glm_fit_bos, direction = "both")

## Start:  AIC=169.4
## crim01 ~ (crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + lstat + medv) - crim
## 
##           Df Deviance    AIC
## - chas     1   143.41 167.41
## - rm       1   143.41 167.41
## - indus    1   144.85 168.85
## <none>         143.40 169.40
## - age      1   145.78 169.78
## - ptratio  1   146.66 170.66
## - tax      1   147.53 171.53
## - lstat    1   147.56 171.56
## - medv     1   148.09 172.09
## - zn       1   152.29 176.29
## - dis      1   161.00 185.00
## - rad      1   179.83 203.83
## - nox      1   207.91 231.91
## 
## Step:  AIC=167.41
## crim01 ~ zn + indus + nox + rm + age + dis + rad + tax + ptratio + 
##     lstat + medv
## 
##           Df Deviance    AIC
## - rm       1   143.43 165.43
## - indus    1   144.86 166.86
## <none>         143.41 167.41
## - age      1   145.91 167.91
## - ptratio  1   146.76 168.76
## + chas     1   143.40 169.40
## - lstat    1   147.63 169.63
## - tax      1   147.78 169.78
## - medv     1   148.10 170.10
## - zn       1   152.72 174.72
## - dis      1   161.02 183.02
## - rad      1   182.00 204.00
## - nox      1   208.86 230.86
## 
## Step:  AIC=165.43
## crim01 ~ zn + indus + nox + age + dis + rad + tax + ptratio + 
##     lstat + medv
## 
##           Df Deviance    AIC
## - indus    1   144.86 164.86
## <none>         143.43 165.43
## - age      1   146.38 166.38
## - ptratio  1   147.13 167.13
## + rm       1   143.41 167.41
## + chas     1   143.41 167.41
## - tax      1   147.92 167.92
## - lstat    1   147.99 167.99
## - zn       1   152.84 172.84
## - medv     1   157.60 177.60
## - dis      1   161.51 181.51
## - rad      1   182.00 202.00
## - nox      1   209.78 229.78
## 
## Step:  AIC=164.86
## crim01 ~ zn + nox + age + dis + rad + tax + ptratio + lstat + 
##     medv
## 
##           Df Deviance    AIC
## <none>         144.86 164.86
## + indus    1   143.43 165.43
## - age      1   148.11 166.11
## - lstat    1   148.56 166.56
## + chas     1   144.85 166.85
## + rm       1   144.86 166.86
## - ptratio  1   148.93 166.93
## - tax      1   152.16 170.16
## - zn       1   155.62 173.62
## - medv     1   158.79 176.79
## - dis      1   162.40 180.40
## - rad      1   188.02 206.02
## - nox      1   215.39 233.39

summary(glm_fit_bos_step)

## 
## Call:
## glm(formula = crim01 ~ zn + nox + age + dis + rad + tax + ptratio + 
##     lstat + medv, family = binomial, data = train_bos)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -44.036854   7.430601  -5.926 3.10e-09 ***
## zn           -0.113387   0.043432  -2.611 0.009036 ** 
## nox          50.679587   8.394722   6.037 1.57e-09 ***
## age           0.021726   0.012422   1.749 0.080287 .  
## dis           1.100549   0.295029   3.730 0.000191 ***
## rad           0.733634   0.170595   4.300 1.70e-05 ***
## tax          -0.007160   0.002873  -2.493 0.012676 *  
## ptratio       0.269421   0.135602   1.987 0.046939 *  
## lstat         0.105409   0.054666   1.928 0.053827 .  
## medv          0.170248   0.051208   3.325 0.000885 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.65  on 353  degrees of freedom
## Residual deviance: 144.86  on 344  degrees of freedom
## AIC: 164.86
## 
## Number of Fisher Scoring iterations: 9

# getting predictions
glm_probs_bos = predict(glm_fit_bos_step, newdata = test_bos, type = "response")
glm_pred_bos = rep("Down", 152)
glm_pred_bos[glm_probs_bos > 0.5] = "1"

table(glm_pred_bos, test_bos$crim01)

##             
## glm_pred_bos  0  1
##         1    13 73
##         Down 60  6

# calculate accuracy
mean(glm_pred_bos == test_bos$crim01)

## [1] 0.4802632

# fit lda
lda_fit_bos = MASS::lda(crim01 ~ zn + nox + age + dis + rad + tax + ptratio + lstat + medv, data = train_bos)

# getting accuracy
lda_pred_bos = predict(lda_fit_bos, newdata = test_bos)
lda_class_bos = lda_pred_bos$class
table(lda_class_bos, test_bos$crim01)

##              
## lda_class_bos  0  1
##             0 71 20
##             1  2 59

mean(lda_class_bos == test_bos$crim01)

## [1] 0.8552632

# fit qda
qda_fit_bos = MASS::qda(crim01 ~ zn + nox + age + dis + rad + tax + ptratio + lstat + medv, data = train_bos)

# getting accuracy
qda_pred_bos = predict(qda_fit_bos, newdata = test_bos)
qda_class_bos = qda_pred_bos$class
table(qda_class_bos, test_bos$crim01)

##              
## qda_class_bos  0  1
##             0 69 11
##             1  4 68

mean(qda_class_bos == test_bos$crim01)

## [1] 0.9013158

# fit naive Bayes
nb_fit_bos = e1071::naiveBayes(crim01 ~ zn + nox + age + dis + rad + tax + ptratio + lstat + medv, data = train_bos)

# getting accuracy
nb_class_bos = predict(nb_fit_bos, newdata = test_bos)
table(nb_class_bos, test_bos$crim01)

##             
## nb_class_bos  0  1
##            0 62 17
##            1 11 62

mean(nb_class_bos == test_bos$crim01)

## [1] 0.8157895

# k = 3
train.X = cbind(train_bos$zn, train_bos$nox, train_bos$age, train_bos$dis, train_bos$rad, train_bos$tax, train_bos$ptratio, train_bos$lstat, train_bos$medv)
test.X = cbind(test_bos$zn, test_bos$nox, test_bos$age, test_bos$dis, test_bos$rad, test_bos$tax, test_bos$ptratio, test_bos$lstat, test_bos$medv)
train.crim01 = train_bos$crim01

set.seed(1)
knn.pred = class::knn(train.X, test.X, train.crim01, k = 3)
table(knn.pred, test_bos$crim01)

##         
## knn.pred  0  1
##        0 64  5
##        1  9 74

mean(knn.pred == test_bos$crim01)

## [1] 0.9078947

The KNN with k = 3 was the best model with a 90.8% accuracy.

Assignment 3

Nicholas Mata jnm705

2025-03-06