Chapter 04 (page 168): 13, 14, 16

Problem 13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR2)
library(scales)
library(MASS)
library(e1071)
library(class)
str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
dim(Weekly)
## [1] 1089    9
View(Weekly)

Q13(a): Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
A13(a): There are a few patterns that can be seen both in the graphs and through correlations. The correlations between the lag variables and today’s returns are close to zero meaning there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume with a correlation of .84194162. On plot that stood out is Volume by Year showing that Volume is increasing over time.

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
attach(Weekly)
pairs(Weekly)

plot(Volume, Year)

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

Q13(b): Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

A13(b): The Lag2 predictor is the only predictor that according to this model appears to be statistically significant with a P value of 0.0296. All other variables in the model are statistically insignificant with p-values above .05.

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
            data=Weekly,
            family=binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Q13(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
A13(c) According to the confusion matrix there were 611 of 1089 correct predictions, resulting in a classification Accuracy/True Positive of 56.107%” and an error rate (classifications) of 43.893%. Sensitivity (accuracy for accounting for “Up”) is 92.066%“, and Specificity (accuracy for accounting for”Down”) is 11.157%.

glm.prob = predict(glm.fit, 
                   type="response")
glm.pred = rep("Down", 1089)
glm.pred[glm.prob > .5] = "Up"
table(glm.pred, 
      Weekly$Direction)
##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557
TruePos = mean(glm.pred==Direction)
print(paste("Accuracy/True Positive is", percent(TruePos, accuracy=.001)))
## [1] "Accuracy/True Positive is 56.107%"
print(paste("Error Rate", percent((1-TruePos), accuracy=.001)))
## [1] "Error Rate 43.893%"
Sensitivity = 557/(557+48)
print(paste("Sensitivity is", percent(Sensitivity, accuracy=.001)))
## [1] "Sensitivity is 92.066%"
Specificity = 54/(430+54)
print(paste("Specificity is", percent(Specificity, accuracy=.001)))
## [1] "Specificity is 11.157%"

Q13(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
A13(d) LR - The logistic regression model Accuracy/True Positive is 62.500%.

train = (Year < 2009)
test.2009.2010 = Weekly[!train,]
Direction.2009.2010 = Direction[!train]

lr.fit = glm(Direction ~ Lag2, 
             data = Weekly,
             family = binomial, 
             subset = train)
lr.Prob = predict(lr.fit, test.2009.2010, type = "response")

lr.pred = rep("Down", 104)

lr.pred[lr.Prob > .5] = "Up"

table(lr.pred, Direction.2009.2010)
##        Direction.2009.2010
## lr.pred Down Up
##    Down    9  5
##    Up     34 56
accuracy = mean(lr.pred==Direction.2009.2010)

print(paste("Log Reg Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "Log Reg Accuracy/True Positive is 62.500%"

Q13(e) Repeat (d) using LDA.
A13(e) LDA - The LDA model Accuracy/True Positive is 62.500%.

lda.fit = lda(Direction ~ Lag2, 
              data = Weekly, 
              subset = train)
lda.prob = predict(lda.fit, test.2009.2010, type = "response")

lda.class = lda.prob$class

table(lda.class, Direction.2009.2010)
##          Direction.2009.2010
## lda.class Down Up
##      Down    9  5
##      Up     34 56
accuracy = mean(lda.class==Direction.2009.2010)

print(paste("LDA Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "LDA Accuracy/True Positive is 62.500%"

Q13(f) Repeat (d) using QDA.
A13(f) QDA - The QDA model Accuracy/True Positive is 58.654%. The QDA model didn’t predict any down predictions, questioning its reliability.

qda.fit = qda(Direction ~ Lag2, 
              data = Weekly, 
              subset = train)
qda.prob = predict(qda.fit, test.2009.2010, type = "response")

qda.class = qda.prob$class

table(qda.class, Direction.2009.2010)
##          Direction.2009.2010
## qda.class Down Up
##      Down    0  0
##      Up     43 61
accuracy = mean(qda.class==Direction.2009.2010)

print(paste("QDA Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "QDA Accuracy/True Positive is 58.654%"

Q13(g) Repeat (d) using KNN with K = 1.
A13(g) KNN - The KNN model Accuracy/True Positive is 50%.

train.set = Lag2[train]
test.set = Lag2[!train]
train.Direction = Direction[train]

length(train.set) == length(test.set)
## [1] FALSE
set.seed(1)


knn.pred = knn(data.frame(train.set), 
               data.frame(test.set), 
               train.Direction, k=1)
table(knn.pred, Direction.2009.2010)
##         Direction.2009.2010
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
accuracy = mean(knn.pred==Direction.2009.2010)

print(paste("KNN = 1 Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "KNN = 1 Accuracy/True Positive is 50.000%"

Q13(h) Repeat (d) using naive Bayes.
A13(h) naive Bayes - naive Bayes Accuracy/True Positive is 58.654%.

Bayes.fit = naiveBayes(Direction ~ Lag2,
                       data = Weekly, 
                       subset = train)
Bayes.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Conditional probabilities:
##       Lag2
## Y             [,1]     [,2]
##   Down -0.03568254 2.199504
##   Up    0.26036581 2.317485
Bayes.class = predict(Bayes.fit, test.set)
table(Bayes.class, Direction.2009.2010)
##            Direction.2009.2010
## Bayes.class Down Up
##        Down    0  0
##        Up     43 61
accuracy = mean(Bayes.class == Direction.2009.2010)

print(paste("naive Bayes Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "naive Bayes Accuracy/True Positive is 58.654%"

Q13(i) Which of these methods appears to provide the best results on this data?
A13(i) The model method that appears to provide the best results on this data is the Logistic Regression and LDA models, both with an accuracy of 62.5% and test error rate of 37.5%.

Q13(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
A13(j) From my various experiments, the best results from the held out data was utilizing the LDA method and an interaction between the Lag2 and Lag3 variables resulting in an accuracy/true positive of 62.5%.

Bayes.fit = naiveBayes(Direction ~ Lag1,
                       data = Weekly,
                       subset = train)
Bayes.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Conditional probabilities:
##       Lag1
## Y              [,1]     [,2]
##   Down  0.289444444 2.211721
##   Up   -0.009213235 2.308387
Bayes.class = predict(Bayes.fit, test.set)
table(Bayes.class, Direction.2009.2010)
##            Direction.2009.2010
## Bayes.class Down Up
##        Down    0  0
##        Up     43 61
accuracy = mean(Bayes.class == Direction.2009.2010)

print(paste("Utilizing Lag1 naive Bayes Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "Utilizing Lag1 naive Bayes Accuracy/True Positive is 58.654%"
train.X=cbind(Lag1,Lag2)[train,]
test.X=cbind(Lag1,Lag2)[!train,]
train.Direction = Direction[train]

length(train.X) == length(test.X)
## [1] FALSE
set.seed(1)


knn.pred = knn(train.X, test.X, train.Direction, k=3)
table(knn.pred, Direction.2009.2010)
##         Direction.2009.2010
## knn.pred Down Up
##     Down   22 29
##     Up     21 32
accuracy = mean(knn.pred==Direction.2009.2010)

print(paste("KNN = 3 Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "KNN = 3 Accuracy/True Positive is 51.923%"
lda.fit = lda(Direction ~ Lag2*Lag3,
              data = Weekly,
              subset = train)
lda.pred = predict(lda.fit,
                    test.2009.2010,
                    type = "response")

lda.class = lda.pred$class

table(lda.class, Direction.2009.2010)
##          Direction.2009.2010
## lda.class Down Up
##      Down    8  4
##      Up     35 57
mean(lda.class == Direction.2009.2010)
## [1] 0.625
detach(Weekly)

Problem 14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

library(ISLR2)
library(scales)
library(MASS)
library(e1071)
library(class)
attach(Auto)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
dim(Auto)
## [1] 392   9
View(Auto)

Q14(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
A14(a): The median is for mpg is 22.75, the binary variable, mpg01, was creating, it contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below the median.

median(mpg)
## [1] 22.75
mpg01 = rep(1, 392)
mpg01[mpg < median(mpg)] = 0

Auto2 = Auto
Auto2$mpg01 = mpg01
View(Auto2)

Q14(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings. A14(b) The variables cylinders, weight, displacement, and horsepower are most highly correlated with mpg01 with a negative correlation.

cor(Auto2[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000
pairs(Auto2[,-9])

library(corrplot)
corrplot(cor(Auto2[, -9]), method = "color")

Q14(c) Split the data into a training set and a test set.
A14(c) Data has been split into a 70/30 ratio for training and test sets.

train = Auto2[(1:(392*.7)),]
test = Auto2[(392*.7):392,]
target = test$mpg01

Q14(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
A14(d) The test error for the LDA model with an 70/30 split is 15.25%.

lda.fit = lda(mpg01 ~ cylinders + weight + displacement + horsepower,
              data = train)

lda.pred = predict(lda.fit, test, type='response')
lda.class = lda.pred$class
table(lda.class, target)
##          target
## lda.class  0  1
##         0 19 16
##         1  2 81
lda.accuracy = mean(lda.class == target)
lda.error = 1 - mean(lda.class == target) 

The test error for the LDA model with an 70/30 split is 15.254%.

Q14(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
A14(e) The test error for the QDA model with an 70/30 split is 17.797%

qda.fit = qda(mpg01 ~ cylinders + weight + displacement + horsepower,
              data = train)

qda.pred = predict(qda.fit, test, type='response')
qda.class = qda.pred$class
table(qda.class, target)
##          target
## qda.class  0  1
##         0 19 19
##         1  2 78
qda.accuracy = mean(qda.class == target)
qda.error = 1 - mean(qda.class == target) 
qda.error
## [1] 0.1779661

The test error for the QDA model with an 70/30 split is 17.797%.

Q14(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
A14(f) The test error for the Logistic Regression model with an 70/30 split is 24.58%.

lr.fit = glm(mpg01 ~ cylinders + weight + displacement + horsepower,
             data = train, 
             family = binomial)
lr.prob = predict(lr.fit, test, type = "response")

lr.pred = rep("0", 118)

lr.pred[lr.Prob > .5] = 1

table(lr.pred, target)
##        target
## lr.pred  0  1
##       0  6 14
##       1 15 83
lr.accuracy = mean(lr.pred==target)
lr.error = 1 - lr.accuracy
lr.error
## [1] 0.2457627

The test error for the Logistic Regression model with an 70/30 split is 24.576%.

Q14(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
A14(g) The test error for the naive Bayes model with an 70/30 split is 13.56%.

Bayes.fit = naiveBayes(mpg01 ~ cylinders + weight + displacement + horsepower,
                       data = train)

Bayes.class = predict(Bayes.fit, test)
table(Bayes.class, target)
##            target
## Bayes.class  0  1
##           0 19 14
##           1  2 83
Bayes.accuracy = mean(Bayes.class == target)
Bayes.error = 1 - Bayes.accuracy

The test error for the naive Bayes model with an 70/30 split is 13.559%.

Q14(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
A14(h) 5 is value of K which seems to perform the best on this data set with a test error rate of 19.49%. Here are the test errors obtained using different K values. KNN = 1 Test Error is 22.034%. KNN = 3 Test Error is 22.034%. KNN = 5 Test Error is 19.492%. KNN = 7 Test Error is 19.492%.

detach(Auto)
attach(Auto2)
train = cbind(cylinders, weight, displacement, horsepower)[(1:(392*.7)),]
test = cbind(cylinders, weight, displacement, horsepower)[(392*.7):392,]
train.target = Auto2$mpg01[(1:(392*.7))]

set.seed(1)

knn1.pred = knn(train = train, test = test, cl = train.target, k=1)
table(knn1.pred, target)
##          target
## knn1.pred  0  1
##         0 19 24
##         1  2 73
knn.test.error = 1 - mean(knn1.pred==target)

print(paste("KNN = 1 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 1 Test Error is 22.034%"
knn3.pred = knn(train = train, test = test, cl = train.target, k=3)
table(knn3.pred, target)
##          target
## knn3.pred  0  1
##         0 21 26
##         1  0 71
knn.test.error = 1- mean(knn3.pred==target)

print(paste("KNN = 3 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 3 Test Error is 22.034%"
knn5.pred = knn(train = train, test = test, cl = train.target, k=5)
table(knn5.pred, target)
##          target
## knn5.pred  0  1
##         0 21 23
##         1  0 74
knn.test.error = 1 - mean(knn5.pred==target)

print(paste("KNN = 5 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 5 Test Error is 19.492%"
knn7.pred = knn(train = train, test = test, cl = train.target, k=7)
table(knn7.pred, target)
##          target
## knn7.pred  0  1
##         0 21 23
##         1  0 74
knn.test.error = 1 - mean(knn7.pred==target)

print(paste("KNN = 7 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 7 Test Error is 19.492%"

Problem 16

Q16Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
A16 For this problem I started by fitting the logistic regression, LDA, naive Bayes, and KNN models using a 70/30 train/set split and predictors that have high correlation (above +/- .6) with crim01 (nox, chas, age, dis, rad, tax, indus). Model performance from best to worst is Logistic Regression model with an accuracy of 96.05%, KNN model where K = 3 with an accuracy of 94.737%, LDA model with an accuracy of 93.42% and lastly the naive Bayes model with an accuracy of 32.24%

attach(Boston)

Create response variable for crime rates above/below median

crim01 = rep(0, length(crim))
crim01[crim > median(crim)] = 1
Boston2 = Boston
Boston2$crim01 = crim01
View(Boston2)

Exploring variables and correlations

cor(Boston2)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## crim01   0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
## crim01  -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128  0.2535684
##               black      lstat       medv      crim01
## crim    -0.38506394  0.4556215 -0.3883046  0.40939545
## zn       0.17552032 -0.4129946  0.3604453 -0.43615103
## indus   -0.35697654  0.6037997 -0.4837252  0.60326017
## chas     0.04878848 -0.0539293  0.1752602  0.07009677
## nox     -0.38005064  0.5908789 -0.4273208  0.72323480
## rm       0.12806864 -0.6138083  0.6953599 -0.15637178
## age     -0.27353398  0.6023385 -0.3769546  0.61393992
## dis      0.29151167 -0.4969958  0.2499287 -0.61634164
## rad     -0.44441282  0.4886763 -0.3816262  0.61978625
## tax     -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio -0.17738330  0.3740443 -0.5077867  0.25356836
## black    1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat   -0.36608690  1.0000000 -0.7376627  0.45326273
## medv     0.33346082 -0.7376627  1.0000000 -0.26301673
## crim01  -0.35121093  0.4532627 -0.2630167  1.00000000
pairs(Boston2)

length(Boston2[,1])
## [1] 506
corrplot(cor(Boston2), method = "color")

Variables that show a high correlation (above +/- .6) with crim01 are nox, chas, age, dis, rad, tax, indus.

Creating a train/test set with a 70/30 split.

train = Boston2[(1:(506*.7)),]
test = Boston2[((506*.7): 506), ]
test.target = test$crim01

attach(Boston2)

logistic regression The logistic regression model had an accuracy of 96.05%.

lr.fit = glm(crim01 ~ nox + chas + age + dis + rad + tax + indus, 
             data = train,
             family = binomial)
lr.prob = predict(lr.fit, test, type = "response")
lr.pred = ifelse(lr.prob>0.5,1,0)

table(lr.pred, test$crim01)
##        
## lr.pred   0   1
##       0  11   0
##       1   6 135
lr.accuracy.boston = mean(lr.pred == test$crim01)

mean(lr.pred ==  test$crim01)
## [1] 0.9605263

LDA Model The LDA model had an accuracy of 93.42%.

lda.fit = lda(crim01 ~ nox + chas + age + dis + rad + tax + indus,
              data = train)
lda.pred = predict(lda.fit, test, type = "response")

lda.class = lda.pred$class

table(lda.class, test.target)
##          test.target
## lda.class   0   1
##         0   7   0
##         1  10 135
mean(lda.class == test.target)
## [1] 0.9342105

naive Bayes The accuracy rate for the naive Bayes model with an 70/30 split is 32.24%.

Bayes.fit = naiveBayes(crim01 ~ nox + chas + age + dis + rad + tax + indus,
                     data = train)

Bayes.class = predict(Bayes.fit, test)
table(Bayes.class, test.target)
##            test.target
## Bayes.class  0  1
##           0  4 90
##           1 13 45
mean(Bayes.class == test.target)
## [1] 0.3223684

KNN Out of the tested values, K=3 has the best accuracy rate. KNN = 1 accuracy is 10.526%, KNN = 3 accuracy is 94.737%, KNN = 5 accuracy is 94.737%, KNN = 7 accuracy is 94.737%”

train = cbind.data.frame(nox, chas, age, dis, rad, tax, indus)[(1:(506*.7)),]
test = cbind.data.frame(nox, chas, age, dis, rad, tax, indus)[(506*.7):506,]
train.target = crim01[(1:(506*.7))]

set.seed(1)

knn1.pred = knn(train = train, test = test, cl = train.target, k=1)
table(knn1.pred, test.target)
##          test.target
## knn1.pred   0   1
##         0  16 135
##         1   1   0
knn1.accuracy = mean(knn1.pred==test.target)

print(paste("KNN = 1 accuracy is", percent(knn1.accuracy, accuracy=.001)))
## [1] "KNN = 1 accuracy is 10.526%"
knn3.pred = knn(train = train, test = test, cl = train.target, k=3)
table(knn3.pred, test.target)
##          test.target
## knn3.pred   0   1
##         0  12   3
##         1   5 132
knn3.accuracy = mean(knn3.pred==test.target)

print(paste("KNN = 3 accuracy is", percent(knn3.accuracy, accuracy=.001)))
## [1] "KNN = 3 accuracy is 94.737%"
knn5.pred = knn(train = train, test = test, cl = train.target, k=5)
table(knn5.pred, test.target)
##          test.target
## knn5.pred   0   1
##         0  12   3
##         1   5 132
knn5.accuracy = mean(knn5.pred==test.target)

print(paste("KNN = 5 accuracy is", percent(knn5.accuracy, accuracy=.001)))
## [1] "KNN = 5 accuracy is 94.737%"
knn7.pred = knn(train = train, test = test, cl = train.target, k=7)
table(knn7.pred, test.target)
##          test.target
## knn7.pred   0   1
##         0  12   3
##         1   5 132
knn7.accuracy = mean(knn7.pred==test.target)

print(paste("KNN = 7 accuracy is", percent(knn7.accuracy, accuracy=.001)))
## [1] "KNN = 7 accuracy is 94.737%"