##This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
##a). Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns? install.packages(ISLR2)
library(MASS)
library(class)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(ISLR)
library(e1071)
library(knitr)
cor_tbl<- kable(cor(Weekly[,-9]))
cor_tbl
| Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | |
|---|---|---|---|---|---|---|---|---|
| Year | 1.0000000 | -0.0322893 | -0.0333900 | -0.0300065 | -0.0311279 | -0.0305191 | 0.8419416 | -0.0324599 |
| Lag1 | -0.0322893 | 1.0000000 | -0.0748531 | 0.0586357 | -0.0712739 | -0.0081831 | -0.0649513 | -0.0750318 |
| Lag2 | -0.0333900 | -0.0748531 | 1.0000000 | -0.0757209 | 0.0583815 | -0.0724995 | -0.0855131 | 0.0591667 |
| Lag3 | -0.0300065 | 0.0586357 | -0.0757209 | 1.0000000 | -0.0753959 | 0.0606572 | -0.0692877 | -0.0712436 |
| Lag4 | -0.0311279 | -0.0712739 | 0.0583815 | -0.0753959 | 1.0000000 | -0.0756750 | -0.0610746 | -0.0078259 |
| Lag5 | -0.0305191 | -0.0081831 | -0.0724995 | 0.0606572 | -0.0756750 | 1.0000000 | -0.0585174 | 0.0110127 |
| Volume | 0.8419416 | -0.0649513 | -0.0855131 | -0.0692877 | -0.0610746 | -0.0585174 | 1.0000000 | -0.0330778 |
| Today | -0.0324599 | -0.0750318 | 0.0591667 | -0.0712436 | -0.0078259 | 0.0110127 | -0.0330778 | 1.0000000 |
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
plot(Today~Lag1, data=Weekly)
simplelm = lm(Today~Lag1, data=Weekly)
abline(simplelm, lwd= 3)
pairs(Weekly)
With the exception of Year and Volume and possibly even Today and Direction, it appears that there are no discernible correlations between any two variables based on an examination of the pair-wise graphs. According to the plots, volume will rise over time, and if a market is up one week, the percentage return for that week (today) will typically be larger.The correlation table (correlation coefficient of ~0.84) demonstrates the strong relationship between Year and Volume.
##b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
logmod = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,family = "binomial", data=Weekly)
summary(logmod)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only Lag2, or the values from two weeks prior to the current value, appears to be statistically significant to predict the Direction among the six predictors. Given that the estimate coefficient is 0.058, an increase of 1 in Lag2 corresponds to an increase in the probability of a positive direction of e^0.058 = 1,06.
##c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression
probs = predict(logmod, type="response")
preds = rep("Down", 1089)
preds[probs > 0.5] = "Up"
table(preds, Weekly$Direction)
##
## preds Down Up
## Down 54 48
## Up 430 557
We predicted the majority of cases to go up because, as the confusion matrix results show, we are really not conservative enough to consider the results as going up. Naturally, this indicates that we detect a very high proportion of true positives, but at a significant cost: we also detect 430 False Positives. Only 54/484 of the real negatives are considered as negative by us, which is a remarkably low percentage. This model has a misclassification rate of (48+430)/(54+48+430+557) = 43.9%.
##d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
trainset= Weekly[Weekly$Year<2009,]
logreg = glm(Direction~Lag2, data= trainset, family = "binomial")
summary(logreg)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = trainset)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
testset= Weekly[Weekly$Year>2008,]
testprob = predict(logreg, type="response", newdata = testset)
testsetpreds = rep("Down", 104)
testsetpreds[testprob>0.5] = "Up"
testdiractual = Weekly$Direction[Weekly$Year>2008]
confusionmatrix2<-table(testsetpreds, testdiractual)
print(confusionmatrix2)
## testdiractual
## testsetpreds Down Up
## Down 9 5
## Up 34 56
The sum of correct/total predictions, or (9+56)/9+5+34+56, yields the overall fraction of correct forecasts, which is 62.5%.The number of True Positives (TP) that the model accurately predicted would be “Up” is 56. In these cases, the actual outcome was truly “Up.” The number of True Negatives (TN) that the model accurately predicted would be “Down” is nine. In these cases, the actual outcome was truly “Down.” Five of these are False Positives (FP), or situations in which the model predicted “Up” while the real result was “Down.” The number of False Negatives (FN) is 34. In these cases, the model predicted “Down” when the true result was “Up.”
##(e) Repeat (d) using LDA.
Lda = lda(Direction ~ Lag2,
data = trainset)
Lda
## Call:
## lda(Direction ~ Lag2, data = trainset)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
testdata = Weekly[Weekly$Year>2008,]
ldateste = predict(Lda, newdata=testdata, type="response")
ldaclass = ldateste$class
table(ldaclass, testdata$Direction)
##
## ldaclass Down Up
## Down 9 5
## Up 34 56
The total fraction of accurate predictions is equal to the sum of correct/total predictions, or (9+56)/9+5+34+56, or 62.5%, as in logistic regression.
qda = qda(Direction ~ Lag2,
data = trainset)
qda
## Call:
## qda(Direction ~ Lag2, data = trainset)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
testdata = Weekly[Weekly$Year>2008,]
qdatest = predict(qda, newdata=testdata, type="response")
qdaclass = qdatest$class
table(qdaclass, testdata$Direction)
##
## qdaclass Down Up
## Down 0 0
## Up 43 61
61% is the overall fraction of correct predictions, calculated as the sum of correct/total forecasts, or (0+61)/0+0+43+61. The number of True Positives (TP) that the model accurately predicted would be “Up” is 61. In these cases, the actual outcome was truly “Up.”There are 43 False Positives (FP)—instances when the model predicted “Up” while the real result was “Down.”True Negatives (TN): 0 - In this scenario, there are no cases where the model accurately predicted “Down” when the actual result was “Down.”False Negatives (FN): 0 - In this scenario, there are no cases where the model predicted “Down” while the actual result was “Up”.
set.seed(1)
trainknn = cbind(trainset$Lag2)
testknn = cbind(testdata$Lag2)
trainknn = cbind(trainset$Direction)
knnpred = knn(trainknn, testknn, trainknn, k=1)
table(knnpred, testdata$Direction)
##
## knnpred Down Up
## 1 28 39
## 2 15 22
The Overall fraction of correct predictions = sum of correct/Total predictions ,(28+22)/28+39+15+22 which is equal to 76%.
True Positives (TP): These are instances where the model correctly predicted “Up” when the actual outcome was indeed “Up.” In this case, it’s 22.
False Positives (FP): These are instances where the model incorrectly predicted “Up” when the actual outcome was “Down.” Here, it’s 15.
True Negatives (TN): These are instances where the model correctly predicted “Down” when the actual outcome was indeed “Down.” It’s 28 in this scenario.
False Negatives (FN): These are instances where the model incorrectly predicted “Down” when the actual outcome was “Up.” There are 39 instances.
nb = naiveBayes(Direction ~ Lag2,
data = trainset)
nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
testdata = Weekly[Weekly$Year>2008,]
nbtest = predict(nb, newdata=testdata, type="class")
nbclass = nbtest
table(nbclass, testdata$Direction)
##
## nbclass Down Up
## Down 0 0
## Up 43 61
The overall fraction of correct predictions is 61%, which is the same as the qda model. It is calculated as the sum of correct/total predictions, or (0+61)/0+0+43+61.
##(i) Which of these methods appears to provide the best results on this data? 62.5% was the accurate prediction made using LDA and logistic regression.
trainset4= Weekly[Weekly$Year<2009,]
logreg4 = glm(Direction~Lag4, data= trainset4, family = "binomial")
testset4= Weekly[Weekly$Year>2008,]
testprob4 = predict(logreg, type="response", newdata = testset4)
testpreds4 = rep("Down", 104)
testpreds4[testprob4>0.5] = "Up"
testactual4 = Weekly$Direction[Weekly$Year>2008]
confusionmatrix4<-table(testpreds4, testactual4)
sum(diag(confusionmatrix4)) / sum(confusionmatrix4)
## [1] 0.625
trainset= Weekly[Weekly$Year<2009,]
logall = glm(Direction ~ Lag1 * Lag2 * Lag3 * Lag4, data= trainset, family = "binomial")
testset= Weekly[Weekly$Year>2008,]
testprob = predict(logall, type="response", newdata = testset)
testpreds = rep("Down", 104)
testpreds[testprob>0.5] = "Up"
testactual = Weekly$Direction[Weekly$Year>2008]
confusionall<-table(testpreds, testactual)
sum(diag(confusionall)) / sum(confusionall)
## [1] 0.5961538
trainset= Weekly[Weekly$Year<2009,]
logall2 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4, data= trainset, family = "binomial")
testset= Weekly[Weekly$Year>2008,]
testprob = predict(logall2, type="response", newdata = testset)
testpreds = rep("Down", 104)
testpreds[testprob>0.5] = "Up"
testactual = Weekly$Direction[Weekly$Year>2008]
confusionall2<-table(testpreds, testactual)
sum(diag(confusionall2)) / sum(confusionall2)
## [1] 0.5865385
nbh3 = naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = trainset)
testdata = Weekly[Weekly$Year>2008,]
nbtest3 = predict(nbh3, newdata=testdata, type="class")
nbclass3 = nbtest3
nb3=table(nbclass3, testdata$Direction)
sum(diag(nb3)) / sum(nb3)
## [1] 0.5096154
ldaall = lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = trainset)
testdata = Weekly[Weekly$Year>2008,]
ldatest = predict(ldaall, newdata=testdata, type="response")
ldatestclass = ldatest$class
ldacon=table(ldatestclass, testdata$Direction)
sum(diag(ldacon)) / sum(ldacon)
## [1] 0.5769231
qdaall = qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = trainset)
testdata = Weekly[Weekly$Year>2008,]
qdatest = predict(qdaall, newdata=testdata, type="response")
qdatestclass = qdatest$class
qdacon=table(qdatestclass, testdata$Direction)
sum(diag(qdacon)) / sum(qdacon)
## [1] 0.5192308
set.seed(1)
trainXknn <- trainset[, 3:5]
testXknn <- testdata[, 3:5]
trainYknn <- trainset$Direction
allk <- numeric(30)
bestkconfusionmatrix <- NULL
for (k in 1:30) {
knnpred <- knn(trainXknn, testXknn, trainYknn, k = k)
confusionmatrix <- table(knnpred, testdata$Direction)
allk[k] <- sum(diag(confusionmatrix)) / sum(confusionmatrix)
if (k == which.max(allk)) {
bestkconfusionmatrix <- confusionmatrix
}
}
bestk <- which.max(allk)
print(allk)
## [1] 0.5096154 0.4903846 0.5000000 0.4807692 0.5192308 0.4903846 0.5384615
## [8] 0.5576923 0.5576923 0.4903846 0.5673077 0.5769231 0.5769231 0.5192308
## [15] 0.5288462 0.5384615 0.5576923 0.5192308 0.5384615 0.5096154 0.5096154
## [22] 0.5384615 0.5576923 0.5576923 0.5576923 0.5576923 0.5769231 0.5576923
## [29] 0.5480769 0.5384615
print(paste("k value:", bestk, "Accuracy:", allk[bestk]))
## [1] "k value: 12 Accuracy: 0.576923076923077"
print("Confusion Matrix for k value:")
## [1] "Confusion Matrix for k value:"
print(bestkconfusionmatrix)
##
## knnpred Down Up
## Down 17 18
## Up 26 43
correctpredictions <- sum(diag(bestkconfusionmatrix))
totalpredictions <- sum(bestkconfusionmatrix)
correctpredictionrate <- correctpredictions / totalpredictions
print(paste("Correct Prediction Rate k value:", correctpredictionrate))
## [1] "Correct Prediction Rate k value: 0.576923076923077"
When k is set as the best value at 12, KNN with the lag2–lag4 variables yields higher accuracy than logistic regression with additional predictors.
##(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
attach(Auto)
## The following object is masked from package:lubridate:
##
## origin
## The following object is masked from package:ggplot2:
##
## mpg
mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
##(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(na.omit(Auto[-9]))
boxplot(origin ~ mpg01, data = Auto, main = "Origin vs mpg01")
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "year vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")
“mpg01”, “cylinders”, “weight”, “displacement”, “year”, and “horsepower” are related.
split <- sample(1:dim(Auto)[1], size=dim(Auto)[1]*0.80)
train_dataset2 <- Auto[split,]
test_dataset2 = Auto[-split,]
LDA_d = lda(mpg01~displacement+weight+cylinders+year, data=train_dataset2)
LDA_d
## Call:
## lda(mpg01 ~ displacement + weight + cylinders + year, data = train_dataset2)
##
## Prior probabilities of groups:
## 0 1
## 0.5079872 0.4920128
##
## Group means:
## displacement weight cylinders year
## 0 274.5031 3593.113 6.767296 74.44025
## 1 119.1591 2368.740 4.220779 77.61039
##
## Coefficients of linear discriminants:
## LD1
## displacement 0.001519587
## weight -0.001043916
## cylinders -0.444835554
## year 0.122867997
Pred_lda_d = predict(LDA_d, newdata=test_dataset2, type="response")$class
confu_m_2_d=table(Pred_lda_d, test_dataset2$mpg01)
confu_m_2_d
##
## Pred_lda_d 0 1
## 0 33 0
## 1 4 42
1-sum(diag(confu_m_2_d)) / sum(confu_m_2_d)
## [1] 0.05063291
Test error of the LDA model obtained is 0.05063291
QDA_e= qda(mpg01~displacement+weight+cylinders+year, data=train_dataset2)
QDA_e
## Call:
## qda(mpg01 ~ displacement + weight + cylinders + year, data = train_dataset2)
##
## Prior probabilities of groups:
## 0 1
## 0.5079872 0.4920128
##
## Group means:
## displacement weight cylinders year
## 0 274.5031 3593.113 6.767296 74.44025
## 1 119.1591 2368.740 4.220779 77.61039
Pred_QDA_e = predict(QDA_e, newdata=test_dataset2, type="response")$class
confu_QDA_e=table(Pred_QDA_e, test_dataset2$mpg01)
confu_QDA_e
##
## Pred_QDA_e 0 1
## 0 34 0
## 1 3 42
1-sum(diag(confu_QDA_e)) / sum(confu_QDA_e)
## [1] 0.03797468
Test error of the QDA model obtained is 0.03797468
GLM_f = glm(mpg01~displacement+weight+cylinders+year, data=train_dataset2,family="binomial")
summary(GLM_f)
##
## Call:
## glm(formula = mpg01 ~ displacement + weight + cylinders + year,
## family = "binomial", data = train_dataset2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.991e+01 5.104e+00 -3.901 9.58e-05 ***
## displacement -7.408e-03 1.012e-02 -0.732 0.464
## weight -3.897e-03 9.345e-04 -4.170 3.05e-05 ***
## cylinders -7.234e-02 4.214e-01 -0.172 0.864
## year 4.278e-01 7.926e-02 5.397 6.77e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 433.83 on 312 degrees of freedom
## Residual deviance: 144.38 on 308 degrees of freedom
## AIC: 154.38
##
## Number of Fisher Scoring iterations: 7
Pred_GLM_f = predict(GLM_f, newdata=test_dataset2, type="response")
preds = rep(0, dim(test_dataset2)[1])
preds[Pred_GLM_f>0.5]=1
confu_mat_f=table(preds, Auto$mpg01[-split])
confu_mat_f
##
## preds 0 1
## 0 35 3
## 1 2 39
1-sum(diag(confu_mat_f)) / sum(confu_mat_f)
## [1] 0.06329114
Test error of the glm model obtained is 0.06329114
selectedvars <- c("displacement", "weight", "cylinders", "year")
train_dataX <- train_dataset2[, selectedvars]
test_dataX <- test_dataset2[, selectedvars]
train_dataY <- train_dataset2$mpg01
accuracies <- numeric(30)
errors <- numeric(30)
for (k in 1:30) {
knn_model <- knn(train = train_dataX, test = test_dataX, cl = train_dataY, k = k)
accuracy <- mean(knn_model == test_dataset2$mpg01)
accuracies[k] <- accuracy
error <- 1 - accuracy
errors[k] <- error
}
best_k <- which.max(accuracies)
print("Accuracies of k value:")
## [1] "Accuracies of k value:"
print(accuracies)
## [1] 0.8481013 0.8734177 0.9113924 0.9367089 0.9240506 0.9240506 0.9113924
## [8] 0.8987342 0.9113924 0.9113924 0.9113924 0.8987342 0.8987342 0.8987342
## [15] 0.9113924 0.8987342 0.9113924 0.9113924 0.8987342 0.8987342 0.9113924
## [22] 0.8987342 0.9113924 0.9113924 0.9113924 0.9367089 0.9240506 0.9367089
## [29] 0.9367089 0.9240506
print("Errors of k value:")
## [1] "Errors of k value:"
print(errors)
## [1] 0.15189873 0.12658228 0.08860759 0.06329114 0.07594937 0.07594937
## [7] 0.08860759 0.10126582 0.08860759 0.08860759 0.08860759 0.10126582
## [13] 0.10126582 0.10126582 0.08860759 0.10126582 0.08860759 0.08860759
## [19] 0.10126582 0.10126582 0.08860759 0.10126582 0.08860759 0.08860759
## [25] 0.08860759 0.06329114 0.07594937 0.06329114 0.06329114 0.07594937
print(paste("Best k value:", best_k, "Accuracy:", accuracies[best_k]))
## [1] "Best k value: 4 Accuracy: 0.936708860759494"
testerrorbestk <- 1 - accuracies[best_k]
print(paste("Test Error of k value:", testerrorbestk))
## [1] "Test Error of k value: 0.0632911392405063"
From the KNN best k value is 4 with accuracy:0.936708860759494 and error:0.0632911392405063
data("Boston")
Boston$Crime_Median <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
pairs(Boston)
Correlated variables are nox, age, dis, and medv.
set.seed(100)
trainidx_set <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
traindata_set <- Boston[trainidx_set, ]
testdata_set <- Boston[-trainidx_set, ]
glmmodel <- glm(Crime_Median ~nox + age + dis + medv, data = traindata_set, family = binomial)
summary(glmmodel)
##
## Call:
## glm(formula = Crime_Median ~ nox + age + dis + medv, family = binomial,
## data = traindata_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.661969 3.698424 -6.668 2.59e-11 ***
## nox 37.489248 5.205419 7.202 5.94e-13 ***
## age 0.015628 0.009448 1.654 0.09810 .
## dis 0.405202 0.157548 2.572 0.01011 *
## medv 0.080958 0.028236 2.867 0.00414 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.05 on 403 degrees of freedom
## Residual deviance: 251.15 on 399 degrees of freedom
## AIC: 261.15
##
## Number of Fisher Scoring iterations: 7
Age is not important.eliminating the factors that are not significant in order to carry out more modeling.
glmmodel2 <- glm(Crime_Median ~nox + dis + medv, data = traindata_set, family = binomial)
summary(glmmodel2)
##
## Call:
## glm(formula = Crime_Median ~ nox + dis + medv, family = binomial,
## data = traindata_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -23.24752 3.51155 -6.620 3.58e-11 ***
## nox 38.18751 5.14900 7.416 1.20e-13 ***
## dis 0.30324 0.14663 2.068 0.0386 *
## medv 0.06676 0.02616 2.552 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.05 on 403 degrees of freedom
## Residual deviance: 253.94 on 400 degrees of freedom
## AIC: 261.94
##
## Number of Fisher Scoring iterations: 7
The dis is negligible. Eliminating the factors that are not significant in order to carry out more modeling.
glmmodel3 <- glm(Crime_Median ~nox + medv, data = traindata_set, family = binomial)
summary(glmmodel3)
##
## Call:
## glm(formula = Crime_Median ~ nox + medv, family = binomial, data = traindata_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.02179 2.10973 -8.542 <2e-16 ***
## nox 30.99656 3.30098 9.390 <2e-16 ***
## medv 0.05420 0.02458 2.205 0.0274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.05 on 403 degrees of freedom
## Residual deviance: 258.07 on 401 degrees of freedom
## AIC: 264.07
##
## Number of Fisher Scoring iterations: 6
When utilizing this model to compute confusion matrix and accuracy, nox and medv are significant.
glmpred <- predict(glmmodel3, newdata = testdata_set, type = "response")
glmpredbinary <- ifelse(glmpred > 0.5, 1, 0)
confmatrix16 <- table(glmpredbinary, testdata_set$Crime_Median)
confmatrix16
##
## glmpredbinary 0 1
## 0 48 14
## 1 2 38
Accuracy16=sum(diag(confmatrix16)) / sum(confmatrix16)
Error=1-sum(diag(confmatrix16)) / sum(confmatrix16)
print(paste("Accuracy:",Accuracy16))
## [1] "Accuracy: 0.843137254901961"
print(paste("Error:", Error))
## [1] "Error: 0.156862745098039"
The logistic regression model achieved an accuracy of 84.31% with an error rate of 15.69%. It correctly classified 48 instances as class 0 and 38 instances as class 1. However, it misclassified 14 instances from class 0 as class 1 and 2 instances from class 1 as class 0. Overall, the model demonstrated good predictive performance but may benefit from adjustments to minimize misclassifications.
ldamodel16 <- lda(Crime_Median ~nox + age+ dis+medv, data = traindata_set)
ldamodel16
## Call:
## lda(Crime_Median ~ nox + age + dis + medv, data = traindata_set)
##
## Prior probabilities of groups:
## 0 1
## 0.5024752 0.4975248
##
## Group means:
## nox age dis medv
## 0 0.4737714 51.20690 5.034042 24.62906
## 1 0.6410199 85.28756 2.516774 19.67313
##
## Coefficients of linear discriminants:
## LD1
## nox 10.40924908
## age 0.01167138
## dis -0.02797181
## medv 0.01504052
ldapredbinary = predict(ldamodel16, newdata=testdata_set, type="response")$class
confmatrix16lda=table(ldapredbinary, testdata_set$Crime_Median)
confmatrix16lda
##
## ldapredbinary 0 1
## 0 48 12
## 1 2 40
Accuracy16lda=sum(diag(confmatrix16lda)) / sum(confmatrix16lda)
Errorlda=1-sum(diag(confmatrix16lda)) / sum(confmatrix16lda)
print(paste("Accuracy:",Accuracy16lda))
## [1] "Accuracy: 0.862745098039216"
print(paste("Error:", Errorlda))
## [1] "Error: 0.137254901960784"
The linear discriminant analysis (LDA) model achieved an accuracy of 86.27% with an error rate of 13.73%. It correctly classified 48 instances as class 0 and 40 instances as class 1. However, it misclassified 12 instances from class 0 as class 1 and 2 instances from class 1 as class 0. Overall, the LDA model demonstrated strong predictive performance with a low error rate.
nbmodel <- naiveBayes(Crime_Median ~ nox + age + dis + medv, data = traindata_set)
nbmodel
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5024752 0.4975248
##
## Conditional probabilities:
## nox
## Y [,1] [,2]
## 0 0.4737714 0.05786753
## 1 0.6410199 0.09892426
##
## age
## Y [,1] [,2]
## 0 51.20690 25.81400
## 1 85.28756 18.14747
##
## dis
## Y [,1] [,2]
## 0 5.034042 2.067880
## 1 2.516774 1.111571
##
## medv
## Y [,1] [,2]
## 0 24.62906 6.987867
## 1 19.67313 9.929394
nb_pred <- predict(nbmodel, newdata = testdata_set)
conf_nbpred <- table(nb_pred, testdata_set$Crime_Median)
conf_nbpred
##
## nb_pred 0 1
## 0 44 6
## 1 6 46
accuracy_nb <- sum(diag(conf_nbpred)) / sum(conf_nbpred)
print(paste("Accuracy:", accuracy_nb))
## [1] "Accuracy: 0.88235294117647"
# Calculate error rate
error_rate_nb <- 1 - accuracy_nb
print(paste("Error Rate:", error_rate_nb))
## [1] "Error Rate: 0.117647058823529"
The Naive Bayes model achieved an accuracy of 88.24% and an error rate of 11.76%. It accurately predicted 44 instances in class 0 and 46 instances in class 1. However, it misclassified 6 instances from class 0 as class 1 and 6 instances from class 1 as class 0. Overall, the model demonstrated strong predictive performance but may benefit from further refinement to minimize misclassifications.
set.seed(100)
trainX <- traindata_set[, c("nox", "age", "dis", "medv")]
trainY <- traindata_set$Crime_Median
testX <- testdata_set[, c("nox", "age", "dis", "medv")]
testY <- testdata_set$Crime_Median
accuracies <- numeric(35)
errorrates <- numeric(35)
for (k in 1:35) {
knnmodel <- knn(trainX, testX, trainY, k = k)
accuracy <- mean(knnmodel == testY)
accuracies[k] <- accuracy
errorrate <- 1 - accuracy
errorrates[k] <- errorrate
}
bestk <- which.max(accuracies)
print(paste("K Value | Accuracy | Error Rate"))
## [1] "K Value | Accuracy | Error Rate"
for (i in 1:35) {
print(paste(i, " | ", accuracies[i], " | ", errorrates[i]))
}
## [1] "1 | 0.784313725490196 | 0.215686274509804"
## [1] "2 | 0.725490196078431 | 0.274509803921569"
## [1] "3 | 0.754901960784314 | 0.245098039215686"
## [1] "4 | 0.754901960784314 | 0.245098039215686"
## [1] "5 | 0.803921568627451 | 0.196078431372549"
## [1] "6 | 0.803921568627451 | 0.196078431372549"
## [1] "7 | 0.803921568627451 | 0.196078431372549"
## [1] "8 | 0.803921568627451 | 0.196078431372549"
## [1] "9 | 0.794117647058823 | 0.205882352941177"
## [1] "10 | 0.813725490196078 | 0.186274509803922"
## [1] "11 | 0.794117647058823 | 0.205882352941177"
## [1] "12 | 0.823529411764706 | 0.176470588235294"
## [1] "13 | 0.803921568627451 | 0.196078431372549"
## [1] "14 | 0.823529411764706 | 0.176470588235294"
## [1] "15 | 0.823529411764706 | 0.176470588235294"
## [1] "16 | 0.803921568627451 | 0.196078431372549"
## [1] "17 | 0.833333333333333 | 0.166666666666667"
## [1] "18 | 0.833333333333333 | 0.166666666666667"
## [1] "19 | 0.823529411764706 | 0.176470588235294"
## [1] "20 | 0.813725490196078 | 0.186274509803922"
## [1] "21 | 0.813725490196078 | 0.186274509803922"
## [1] "22 | 0.823529411764706 | 0.176470588235294"
## [1] "23 | 0.813725490196078 | 0.186274509803922"
## [1] "24 | 0.813725490196078 | 0.186274509803922"
## [1] "25 | 0.813725490196078 | 0.186274509803922"
## [1] "26 | 0.813725490196078 | 0.186274509803922"
## [1] "27 | 0.813725490196078 | 0.186274509803922"
## [1] "28 | 0.813725490196078 | 0.186274509803922"
## [1] "29 | 0.813725490196078 | 0.186274509803922"
## [1] "30 | 0.803921568627451 | 0.196078431372549"
## [1] "31 | 0.813725490196078 | 0.186274509803922"
## [1] "32 | 0.803921568627451 | 0.196078431372549"
## [1] "33 | 0.803921568627451 | 0.196078431372549"
## [1] "34 | 0.803921568627451 | 0.196078431372549"
## [1] "35 | 0.803921568627451 | 0.196078431372549"
print(paste("Best k value:", bestk, "Accuracy:", accuracies[bestk],"Error:", errorrates[bestk]))
## [1] "Best k value: 17 Accuracy: 0.833333333333333 Error: 0.166666666666667"
KKN model predicted k value at 17 with Accuracy: 0.833333333333333 and Error: 0.166666666666667.