Assignment

##This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

##a). Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns? install.packages(ISLR2)

library(MASS)
library(class)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.92 loaded

library(ISLR)
library(e1071)
library(knitr)

cor_tbl<- kable(cor(Weekly[,-9]))
cor_tbl

	Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today
Year	1.0000000	-0.0322893	-0.0333900	-0.0300065	-0.0311279	-0.0305191	0.8419416	-0.0324599
Lag1	-0.0322893	1.0000000	-0.0748531	0.0586357	-0.0712739	-0.0081831	-0.0649513	-0.0750318
Lag2	-0.0333900	-0.0748531	1.0000000	-0.0757209	0.0583815	-0.0724995	-0.0855131	0.0591667
Lag3	-0.0300065	0.0586357	-0.0757209	1.0000000	-0.0753959	0.0606572	-0.0692877	-0.0712436
Lag4	-0.0311279	-0.0712739	0.0583815	-0.0753959	1.0000000	-0.0756750	-0.0610746	-0.0078259
Lag5	-0.0305191	-0.0081831	-0.0724995	0.0606572	-0.0756750	1.0000000	-0.0585174	0.0110127
Volume	0.8419416	-0.0649513	-0.0855131	-0.0692877	-0.0610746	-0.0585174	1.0000000	-0.0330778
Today	-0.0324599	-0.0750318	0.0591667	-0.0712436	-0.0078259	0.0110127	-0.0330778	1.0000000

summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

plot(Today~Lag1, data=Weekly)
simplelm = lm(Today~Lag1, data=Weekly)
abline(simplelm, lwd= 3)

pairs(Weekly)

With the exception of Year and Volume and possibly even Today and Direction, it appears that there are no discernible correlations between any two variables based on an examination of the pair-wise graphs. According to the plots, volume will rise over time, and if a market is up one week, the percentage return for that week (today) will typically be larger.The correlation table (correlation coefficient of ~0.84) demonstrates the strong relationship between Year and Volume.

##b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

logmod = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,family = "binomial", data=Weekly)
summary(logmod)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Only Lag2, or the values from two weeks prior to the current value, appears to be statistically significant to predict the Direction among the six predictors. Given that the estimate coefficient is 0.058, an increase of 1 in Lag2 corresponds to an increase in the probability of a positive direction of e^0.058 = 1,06.

##c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression

probs = predict(logmod, type="response")
preds = rep("Down", 1089)
preds[probs > 0.5] = "Up"
table(preds, Weekly$Direction)

##       
## preds  Down  Up
##   Down   54  48
##   Up    430 557

We predicted the majority of cases to go up because, as the confusion matrix results show, we are really not conservative enough to consider the results as going up. Naturally, this indicates that we detect a very high proportion of true positives, but at a significant cost: we also detect 430 False Positives. Only 54/484 of the real negatives are considered as negative by us, which is a remarkably low percentage. This model has a misclassification rate of (48+430)/(54+48+430+557) = 43.9%.

##d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

trainset= Weekly[Weekly$Year<2009,]
logreg = glm(Direction~Lag2, data= trainset, family = "binomial")
summary(logreg)

## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = trainset)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4

testset= Weekly[Weekly$Year>2008,]
testprob = predict(logreg, type="response", newdata = testset)
testsetpreds = rep("Down", 104)
testsetpreds[testprob>0.5] = "Up"
testdiractual = Weekly$Direction[Weekly$Year>2008]
confusionmatrix2<-table(testsetpreds, testdiractual)
print(confusionmatrix2)

##             testdiractual
## testsetpreds Down Up
##         Down    9  5
##         Up     34 56

The sum of correct/total predictions, or (9+56)/9+5+34+56, yields the overall fraction of correct forecasts, which is 62.5%.The number of True Positives (TP) that the model accurately predicted would be “Up” is 56. In these cases, the actual outcome was truly “Up.” The number of True Negatives (TN) that the model accurately predicted would be “Down” is nine. In these cases, the actual outcome was truly “Down.” Five of these are False Positives (FP), or situations in which the model predicted “Up” while the real result was “Down.” The number of False Negatives (FN) is 34. In these cases, the model predicted “Down” when the true result was “Up.”

##(e) Repeat (d) using LDA.

Lda = lda(Direction ~ Lag2,
               data = trainset)

Lda

## Call:
## lda(Direction ~ Lag2, data = trainset)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

testdata = Weekly[Weekly$Year>2008,]
ldateste = predict(Lda, newdata=testdata, type="response")
ldaclass = ldateste$class
table(ldaclass, testdata$Direction)

##         
## ldaclass Down Up
##     Down    9  5
##     Up     34 56

The total fraction of accurate predictions is equal to the sum of correct/total predictions, or (9+56)/9+5+34+56, or 62.5%, as in logistic regression.

(f) Repeat (d) using QDA.

qda = qda(Direction ~ Lag2,
               data = trainset)

qda

## Call:
## qda(Direction ~ Lag2, data = trainset)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

testdata = Weekly[Weekly$Year>2008,]
qdatest = predict(qda, newdata=testdata, type="response")
qdaclass = qdatest$class
table(qdaclass, testdata$Direction)

##         
## qdaclass Down Up
##     Down    0  0
##     Up     43 61

61% is the overall fraction of correct predictions, calculated as the sum of correct/total forecasts, or (0+61)/0+0+43+61. The number of True Positives (TP) that the model accurately predicted would be “Up” is 61. In these cases, the actual outcome was truly “Up.”There are 43 False Positives (FP)—instances when the model predicted “Up” while the real result was “Down.”True Negatives (TN): 0 - In this scenario, there are no cases where the model accurately predicted “Down” when the actual result was “Down.”False Negatives (FN): 0 - In this scenario, there are no cases where the model predicted “Down” while the actual result was “Up”.

(g) Repeat (d) using KNN with K = 1.

set.seed(1)
trainknn = cbind(trainset$Lag2)
testknn = cbind(testdata$Lag2)
trainknn = cbind(trainset$Direction)
knnpred = knn(trainknn, testknn, trainknn, k=1)
table(knnpred, testdata$Direction)

##        
## knnpred Down Up
##       1   28 39
##       2   15 22

The Overall fraction of correct predictions = sum of correct/Total predictions ,(28+22)/28+39+15+22 which is equal to 76%.

True Positives (TP): These are instances where the model correctly predicted “Up” when the actual outcome was indeed “Up.” In this case, it’s 22.

False Positives (FP): These are instances where the model incorrectly predicted “Up” when the actual outcome was “Down.” Here, it’s 15.

True Negatives (TN): These are instances where the model correctly predicted “Down” when the actual outcome was indeed “Down.” It’s 28 in this scenario.

False Negatives (FN): These are instances where the model incorrectly predicted “Down” when the actual outcome was “Up.” There are 39 instances.

(h) Repeat (d) using naive Bayes.

nb =  naiveBayes(Direction ~ Lag2,
               data = trainset)

nb

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Conditional probabilities:
##       Lag2
## Y             [,1]     [,2]
##   Down -0.03568254 2.199504
##   Up    0.26036581 2.317485

testdata = Weekly[Weekly$Year>2008,]
nbtest = predict(nb, newdata=testdata, type="class")
nbclass = nbtest
table(nbclass, testdata$Direction)

##        
## nbclass Down Up
##    Down    0  0
##    Up     43 61

The overall fraction of correct predictions is 61%, which is the same as the qda model. It is calculated as the sum of correct/total predictions, or (0+61)/0+0+43+61.

##(i) Which of these methods appears to provide the best results on this data? 62.5% was the accurate prediction made using LDA and logistic regression.

j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

GLM - Lag4 predictor:

trainset4= Weekly[Weekly$Year<2009,]
logreg4 = glm(Direction~Lag4, data= trainset4, family = "binomial")
testset4= Weekly[Weekly$Year>2008,]
testprob4 = predict(logreg, type="response", newdata = testset4)
testpreds4 = rep("Down", 104)
testpreds4[testprob4>0.5] = "Up"
testactual4 = Weekly$Direction[Weekly$Year>2008]
confusionmatrix4<-table(testpreds4, testactual4)
sum(diag(confusionmatrix4)) / sum(confusionmatrix4)

## [1] 0.625

GLM - lag interactions:

trainset= Weekly[Weekly$Year<2009,]
logall = glm(Direction ~ Lag1 * Lag2 * Lag3 * Lag4, data= trainset, family = "binomial")
testset= Weekly[Weekly$Year>2008,]
testprob = predict(logall, type="response", newdata = testset)
testpreds = rep("Down", 104)
testpreds[testprob>0.5] = "Up"
testactual = Weekly$Direction[Weekly$Year>2008]
confusionall<-table(testpreds, testactual)
sum(diag(confusionall)) / sum(confusionall)

## [1] 0.5961538

GLM - lag predictors:

trainset= Weekly[Weekly$Year<2009,]
logall2 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4, data= trainset, family = "binomial")
testset= Weekly[Weekly$Year>2008,]
testprob = predict(logall2, type="response", newdata = testset)
testpreds = rep("Down", 104)
testpreds[testprob>0.5] = "Up"
testactual = Weekly$Direction[Weekly$Year>2008]
confusionall2<-table(testpreds, testactual)
sum(diag(confusionall2)) / sum(confusionall2)

## [1] 0.5865385

NaiveBayes:

nbh3 =  naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = trainset)
testdata = Weekly[Weekly$Year>2008,]
nbtest3 = predict(nbh3, newdata=testdata, type="class")
nbclass3 = nbtest3
nb3=table(nbclass3, testdata$Direction)
sum(diag(nb3)) / sum(nb3)

## [1] 0.5096154

LDA:

ldaall =  lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = trainset)
testdata = Weekly[Weekly$Year>2008,]
ldatest = predict(ldaall, newdata=testdata, type="response")
ldatestclass = ldatest$class
ldacon=table(ldatestclass, testdata$Direction)
sum(diag(ldacon)) / sum(ldacon)

## [1] 0.5769231

QDA:

qdaall =  qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = trainset)
testdata = Weekly[Weekly$Year>2008,]
qdatest = predict(qdaall, newdata=testdata, type="response")
qdatestclass = qdatest$class
qdacon=table(qdatestclass, testdata$Direction)
sum(diag(qdacon)) / sum(qdacon)

## [1] 0.5192308

KNN with best k- value with lag2 to lag4 predictors:

set.seed(1)

trainXknn <- trainset[, 3:5]
testXknn <- testdata[, 3:5]
trainYknn <- trainset$Direction

allk <- numeric(30)
bestkconfusionmatrix <- NULL

for (k in 1:30) {
    knnpred <- knn(trainXknn, testXknn, trainYknn, k = k)
    confusionmatrix <- table(knnpred, testdata$Direction)
    allk[k] <- sum(diag(confusionmatrix)) / sum(confusionmatrix)
    if (k == which.max(allk)) {
        bestkconfusionmatrix <- confusionmatrix
    }
}

bestk <- which.max(allk)

print(allk)

##  [1] 0.5096154 0.4903846 0.5000000 0.4807692 0.5192308 0.4903846 0.5384615
##  [8] 0.5576923 0.5576923 0.4903846 0.5673077 0.5769231 0.5769231 0.5192308
## [15] 0.5288462 0.5384615 0.5576923 0.5192308 0.5384615 0.5096154 0.5096154
## [22] 0.5384615 0.5576923 0.5576923 0.5576923 0.5576923 0.5769231 0.5576923
## [29] 0.5480769 0.5384615

print(paste("k value:", bestk, "Accuracy:", allk[bestk]))

## [1] "k value: 12 Accuracy: 0.576923076923077"

print("Confusion Matrix for k value:")

## [1] "Confusion Matrix for k value:"

print(bestkconfusionmatrix)

##        
## knnpred Down Up
##    Down   17 18
##    Up     26 43

correctpredictions <- sum(diag(bestkconfusionmatrix))
totalpredictions <- sum(bestkconfusionmatrix)
correctpredictionrate <- correctpredictions / totalpredictions

print(paste("Correct Prediction Rate k value:", correctpredictionrate))

## [1] "Correct Prediction Rate k value: 0.576923076923077"

When k is set as the best value at 12, KNN with the lag2–lag4 variables yields higher accuracy than logistic regression with additional predictors.

14. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

##(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

attach(Auto)

## The following object is masked from package:lubridate:
## 
##     origin

## The following object is masked from package:ggplot2:
## 
##     mpg

mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] <- 1
Auto <- data.frame(Auto, mpg01)

##(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

pairs(na.omit(Auto[-9]))

boxplot(origin ~ mpg01, data = Auto, main = "Origin vs mpg01")

boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")

boxplot(year ~ mpg01, data = Auto, main = "year vs mpg01")

boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")

boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")

boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")

boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")

“mpg01”, “cylinders”, “weight”, “displacement”, “year”, and “horsepower” are related.

(c) Split the data into a training set and a test set.

split <- sample(1:dim(Auto)[1], size=dim(Auto)[1]*0.80)
train_dataset2 <- Auto[split,]
test_dataset2 = Auto[-split,]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

LDA_d = lda(mpg01~displacement+weight+cylinders+year, data=train_dataset2)
LDA_d

## Call:
## lda(mpg01 ~ displacement + weight + cylinders + year, data = train_dataset2)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5079872 0.4920128 
## 
## Group means:
##   displacement   weight cylinders     year
## 0     274.5031 3593.113  6.767296 74.44025
## 1     119.1591 2368.740  4.220779 77.61039
## 
## Coefficients of linear discriminants:
##                       LD1
## displacement  0.001519587
## weight       -0.001043916
## cylinders    -0.444835554
## year          0.122867997

Pred_lda_d = predict(LDA_d, newdata=test_dataset2, type="response")$class
confu_m_2_d=table(Pred_lda_d, test_dataset2$mpg01)
confu_m_2_d

##           
## Pred_lda_d  0  1
##          0 33  0
##          1  4 42

1-sum(diag(confu_m_2_d)) / sum(confu_m_2_d)

## [1] 0.05063291

Test error of the LDA model obtained is 0.05063291

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

QDA_e= qda(mpg01~displacement+weight+cylinders+year, data=train_dataset2)
QDA_e

## Call:
## qda(mpg01 ~ displacement + weight + cylinders + year, data = train_dataset2)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5079872 0.4920128 
## 
## Group means:
##   displacement   weight cylinders     year
## 0     274.5031 3593.113  6.767296 74.44025
## 1     119.1591 2368.740  4.220779 77.61039

Pred_QDA_e = predict(QDA_e, newdata=test_dataset2, type="response")$class
confu_QDA_e=table(Pred_QDA_e, test_dataset2$mpg01)
confu_QDA_e

##           
## Pred_QDA_e  0  1
##          0 34  0
##          1  3 42

1-sum(diag(confu_QDA_e)) / sum(confu_QDA_e)

## [1] 0.03797468

Test error of the QDA model obtained is 0.03797468

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

GLM_f = glm(mpg01~displacement+weight+cylinders+year, data=train_dataset2,family="binomial")
summary(GLM_f)

## 
## Call:
## glm(formula = mpg01 ~ displacement + weight + cylinders + year, 
##     family = "binomial", data = train_dataset2)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.991e+01  5.104e+00  -3.901 9.58e-05 ***
## displacement -7.408e-03  1.012e-02  -0.732    0.464    
## weight       -3.897e-03  9.345e-04  -4.170 3.05e-05 ***
## cylinders    -7.234e-02  4.214e-01  -0.172    0.864    
## year          4.278e-01  7.926e-02   5.397 6.77e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 433.83  on 312  degrees of freedom
## Residual deviance: 144.38  on 308  degrees of freedom
## AIC: 154.38
## 
## Number of Fisher Scoring iterations: 7

Pred_GLM_f = predict(GLM_f, newdata=test_dataset2, type="response")
preds = rep(0, dim(test_dataset2)[1])
preds[Pred_GLM_f>0.5]=1            
confu_mat_f=table(preds, Auto$mpg01[-split])
confu_mat_f

##      
## preds  0  1
##     0 35  3
##     1  2 39

1-sum(diag(confu_mat_f)) / sum(confu_mat_f)

## [1] 0.06329114

Test error of the glm model obtained is 0.06329114

(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

selectedvars <- c("displacement", "weight", "cylinders", "year")
train_dataX <- train_dataset2[, selectedvars]
test_dataX <- test_dataset2[, selectedvars]
train_dataY <- train_dataset2$mpg01

accuracies <- numeric(30)
errors <- numeric(30)

for (k in 1:30) {
    knn_model <- knn(train = train_dataX, test = test_dataX, cl = train_dataY, k = k)
    accuracy <- mean(knn_model == test_dataset2$mpg01)
    accuracies[k] <- accuracy
    error <- 1 - accuracy
    errors[k] <- error
}

best_k <- which.max(accuracies)

print("Accuracies of k value:")

## [1] "Accuracies of k value:"

print(accuracies)

##  [1] 0.8481013 0.8734177 0.9113924 0.9367089 0.9240506 0.9240506 0.9113924
##  [8] 0.8987342 0.9113924 0.9113924 0.9113924 0.8987342 0.8987342 0.8987342
## [15] 0.9113924 0.8987342 0.9113924 0.9113924 0.8987342 0.8987342 0.9113924
## [22] 0.8987342 0.9113924 0.9113924 0.9113924 0.9367089 0.9240506 0.9367089
## [29] 0.9367089 0.9240506

print("Errors of k value:")

## [1] "Errors of k value:"

print(errors)

##  [1] 0.15189873 0.12658228 0.08860759 0.06329114 0.07594937 0.07594937
##  [7] 0.08860759 0.10126582 0.08860759 0.08860759 0.08860759 0.10126582
## [13] 0.10126582 0.10126582 0.08860759 0.10126582 0.08860759 0.08860759
## [19] 0.10126582 0.10126582 0.08860759 0.10126582 0.08860759 0.08860759
## [25] 0.08860759 0.06329114 0.07594937 0.06329114 0.06329114 0.07594937

print(paste("Best k value:", best_k, "Accuracy:", accuracies[best_k]))

## [1] "Best k value: 4 Accuracy: 0.936708860759494"

testerrorbestk <- 1 - accuracies[best_k]
print(paste("Test Error of k value:", testerrorbestk))

## [1] "Test Error of k value: 0.0632911392405063"

From the KNN best k value is 4 with accuracy:0.936708860759494 and error:0.0632911392405063

16. Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

data("Boston")

Boston$Crime_Median <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
pairs(Boston)

Correlated variables are nox, age, dis, and medv.

Split the data into training and test sets

set.seed(100)
trainidx_set <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
traindata_set <- Boston[trainidx_set, ]
testdata_set <- Boston[-trainidx_set, ]

Logistic Regression

glmmodel <- glm(Crime_Median ~nox + age + dis + medv, data = traindata_set, family = binomial)
summary(glmmodel)

## 
## Call:
## glm(formula = Crime_Median ~ nox + age + dis + medv, family = binomial, 
##     data = traindata_set)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -24.661969   3.698424  -6.668 2.59e-11 ***
## nox          37.489248   5.205419   7.202 5.94e-13 ***
## age           0.015628   0.009448   1.654  0.09810 .  
## dis           0.405202   0.157548   2.572  0.01011 *  
## medv          0.080958   0.028236   2.867  0.00414 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 560.05  on 403  degrees of freedom
## Residual deviance: 251.15  on 399  degrees of freedom
## AIC: 261.15
## 
## Number of Fisher Scoring iterations: 7

Age is not important.eliminating the factors that are not significant in order to carry out more modeling.

GLM without age:

glmmodel2 <- glm(Crime_Median ~nox + dis + medv, data = traindata_set, family = binomial)
summary(glmmodel2)

## 
## Call:
## glm(formula = Crime_Median ~ nox + dis + medv, family = binomial, 
##     data = traindata_set)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -23.24752    3.51155  -6.620 3.58e-11 ***
## nox          38.18751    5.14900   7.416 1.20e-13 ***
## dis           0.30324    0.14663   2.068   0.0386 *  
## medv          0.06676    0.02616   2.552   0.0107 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 560.05  on 403  degrees of freedom
## Residual deviance: 253.94  on 400  degrees of freedom
## AIC: 261.94
## 
## Number of Fisher Scoring iterations: 7

The dis is negligible. Eliminating the factors that are not significant in order to carry out more modeling.

GLM without Displacement:

glmmodel3 <- glm(Crime_Median ~nox + medv, data = traindata_set, family = binomial)
summary(glmmodel3)

## 
## Call:
## glm(formula = Crime_Median ~ nox + medv, family = binomial, data = traindata_set)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -18.02179    2.10973  -8.542   <2e-16 ***
## nox          30.99656    3.30098   9.390   <2e-16 ***
## medv          0.05420    0.02458   2.205   0.0274 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 560.05  on 403  degrees of freedom
## Residual deviance: 258.07  on 401  degrees of freedom
## AIC: 264.07
## 
## Number of Fisher Scoring iterations: 6

When utilizing this model to compute confusion matrix and accuracy, nox and medv are significant.

Confusion Matrix:

glmpred <- predict(glmmodel3, newdata = testdata_set, type = "response")

glmpredbinary <- ifelse(glmpred > 0.5, 1, 0)

confmatrix16 <- table(glmpredbinary, testdata_set$Crime_Median)
confmatrix16

##              
## glmpredbinary  0  1
##             0 48 14
##             1  2 38

Accuracy16=sum(diag(confmatrix16)) / sum(confmatrix16)
Error=1-sum(diag(confmatrix16)) / sum(confmatrix16)
print(paste("Accuracy:",Accuracy16))

## [1] "Accuracy: 0.843137254901961"

print(paste("Error:", Error))

## [1] "Error: 0.156862745098039"

The logistic regression model achieved an accuracy of 84.31% with an error rate of 15.69%. It correctly classified 48 instances as class 0 and 38 instances as class 1. However, it misclassified 14 instances from class 0 as class 1 and 2 instances from class 1 as class 0. Overall, the model demonstrated good predictive performance but may benefit from adjustments to minimize misclassifications.

LDA

ldamodel16 <- lda(Crime_Median ~nox + age+ dis+medv, data = traindata_set)
ldamodel16

## Call:
## lda(Crime_Median ~ nox + age + dis + medv, data = traindata_set)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5024752 0.4975248 
## 
## Group means:
##         nox      age      dis     medv
## 0 0.4737714 51.20690 5.034042 24.62906
## 1 0.6410199 85.28756 2.516774 19.67313
## 
## Coefficients of linear discriminants:
##              LD1
## nox  10.40924908
## age   0.01167138
## dis  -0.02797181
## medv  0.01504052

Confusion Matrix:

ldapredbinary = predict(ldamodel16, newdata=testdata_set, type="response")$class
confmatrix16lda=table(ldapredbinary, testdata_set$Crime_Median)
confmatrix16lda

##              
## ldapredbinary  0  1
##             0 48 12
##             1  2 40

Accuracy16lda=sum(diag(confmatrix16lda)) / sum(confmatrix16lda)
Errorlda=1-sum(diag(confmatrix16lda)) / sum(confmatrix16lda)
print(paste("Accuracy:",Accuracy16lda))

## [1] "Accuracy: 0.862745098039216"

print(paste("Error:", Errorlda))

## [1] "Error: 0.137254901960784"

The linear discriminant analysis (LDA) model achieved an accuracy of 86.27% with an error rate of 13.73%. It correctly classified 48 instances as class 0 and 40 instances as class 1. However, it misclassified 12 instances from class 0 as class 1 and 2 instances from class 1 as class 0. Overall, the LDA model demonstrated strong predictive performance with a low error rate.

naiveBayes

nbmodel <- naiveBayes(Crime_Median ~ nox + age + dis + medv, data = traindata_set)
nbmodel

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.5024752 0.4975248 
## 
## Conditional probabilities:
##    nox
## Y        [,1]       [,2]
##   0 0.4737714 0.05786753
##   1 0.6410199 0.09892426
## 
##    age
## Y       [,1]     [,2]
##   0 51.20690 25.81400
##   1 85.28756 18.14747
## 
##    dis
## Y       [,1]     [,2]
##   0 5.034042 2.067880
##   1 2.516774 1.111571
## 
##    medv
## Y       [,1]     [,2]
##   0 24.62906 6.987867
##   1 19.67313 9.929394

nb_pred <- predict(nbmodel, newdata = testdata_set)
conf_nbpred <- table(nb_pred, testdata_set$Crime_Median)
conf_nbpred

##        
## nb_pred  0  1
##       0 44  6
##       1  6 46

accuracy_nb <- sum(diag(conf_nbpred)) / sum(conf_nbpred)
print(paste("Accuracy:", accuracy_nb))

## [1] "Accuracy: 0.88235294117647"

# Calculate error rate
error_rate_nb <- 1 - accuracy_nb
print(paste("Error Rate:", error_rate_nb))

## [1] "Error Rate: 0.117647058823529"

The Naive Bayes model achieved an accuracy of 88.24% and an error rate of 11.76%. It accurately predicted 44 instances in class 0 and 46 instances in class 1. However, it misclassified 6 instances from class 0 as class 1 and 6 instances from class 1 as class 0. Overall, the model demonstrated strong predictive performance but may benefit from further refinement to minimize misclassifications.

KNN

set.seed(100)

trainX <- traindata_set[, c("nox", "age", "dis", "medv")]
trainY <- traindata_set$Crime_Median

testX <- testdata_set[, c("nox", "age", "dis", "medv")]
testY <- testdata_set$Crime_Median

accuracies <- numeric(35)
errorrates <- numeric(35)

for (k in 1:35) {
  knnmodel <- knn(trainX, testX, trainY, k = k)
  accuracy <- mean(knnmodel == testY)
  accuracies[k] <- accuracy
  errorrate <- 1 - accuracy
  errorrates[k] <- errorrate
}

bestk <- which.max(accuracies)

print(paste("K Value | Accuracy | Error Rate"))

## [1] "K Value | Accuracy | Error Rate"

for (i in 1:35) {
  print(paste(i, " | ", accuracies[i], " | ", errorrates[i]))
}

## [1] "1  |  0.784313725490196  |  0.215686274509804"
## [1] "2  |  0.725490196078431  |  0.274509803921569"
## [1] "3  |  0.754901960784314  |  0.245098039215686"
## [1] "4  |  0.754901960784314  |  0.245098039215686"
## [1] "5  |  0.803921568627451  |  0.196078431372549"
## [1] "6  |  0.803921568627451  |  0.196078431372549"
## [1] "7  |  0.803921568627451  |  0.196078431372549"
## [1] "8  |  0.803921568627451  |  0.196078431372549"
## [1] "9  |  0.794117647058823  |  0.205882352941177"
## [1] "10  |  0.813725490196078  |  0.186274509803922"
## [1] "11  |  0.794117647058823  |  0.205882352941177"
## [1] "12  |  0.823529411764706  |  0.176470588235294"
## [1] "13  |  0.803921568627451  |  0.196078431372549"
## [1] "14  |  0.823529411764706  |  0.176470588235294"
## [1] "15  |  0.823529411764706  |  0.176470588235294"
## [1] "16  |  0.803921568627451  |  0.196078431372549"
## [1] "17  |  0.833333333333333  |  0.166666666666667"
## [1] "18  |  0.833333333333333  |  0.166666666666667"
## [1] "19  |  0.823529411764706  |  0.176470588235294"
## [1] "20  |  0.813725490196078  |  0.186274509803922"
## [1] "21  |  0.813725490196078  |  0.186274509803922"
## [1] "22  |  0.823529411764706  |  0.176470588235294"
## [1] "23  |  0.813725490196078  |  0.186274509803922"
## [1] "24  |  0.813725490196078  |  0.186274509803922"
## [1] "25  |  0.813725490196078  |  0.186274509803922"
## [1] "26  |  0.813725490196078  |  0.186274509803922"
## [1] "27  |  0.813725490196078  |  0.186274509803922"
## [1] "28  |  0.813725490196078  |  0.186274509803922"
## [1] "29  |  0.813725490196078  |  0.186274509803922"
## [1] "30  |  0.803921568627451  |  0.196078431372549"
## [1] "31  |  0.813725490196078  |  0.186274509803922"
## [1] "32  |  0.803921568627451  |  0.196078431372549"
## [1] "33  |  0.803921568627451  |  0.196078431372549"
## [1] "34  |  0.803921568627451  |  0.196078431372549"
## [1] "35  |  0.803921568627451  |  0.196078431372549"

print(paste("Best k value:", bestk, "Accuracy:", accuracies[bestk],"Error:", errorrates[bestk]))

## [1] "Best k value: 17 Accuracy: 0.833333333333333 Error: 0.166666666666667"

KKN model predicted k value at 17 with Accuracy: 0.833333333333333 and Error: 0.166666666666667.

Assignment_3

Karthik

2024-02-24

(f) Repeat (d) using QDA.

(g) Repeat (d) using KNN with K = 1.

(h) Repeat (d) using naive Bayes.

GLM - Lag4 predictor:

GLM - lag interactions:

GLM - lag predictors:

NaiveBayes:

LDA:

QDA:

KNN with best k- value with lag2 to lag4 predictors:

14. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(c) Split the data into a training set and a test set.

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

Split the data into training and test sets

Logistic Regression

GLM without age:

GLM without Displacement:

Confusion Matrix:

LDA

Confusion Matrix:

naiveBayes

KNN