This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(class)
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.2
##
## Attaching package: 'ISLR2'
##
## The following object is masked from 'package:MASS':
##
## Boston
library(e1071)
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## 13. This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
corrplot(cor(Weekly[, -9]), type = "full", method = "number")  # column 9 is the Direction factor
pairs(Weekly)
Interpretation: From both the correlation matrix and the pairs plot, Volume and Year are highly correlated (r ≈ 0.84): weekly trading volume has increased over time.
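As a quick check (a minimal sketch; Weekly is the ISLR2 data set loaded above), the quoted correlation can be computed directly:
cor(Weekly$Year, Weekly$Volume)  # ~0.84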
logreg_10b <- glm(Direction~Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(logreg_10b)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Interpretation: Lag2, with a p-value of 0.0296, is the only predictor that appears to be statistically significant at the 5% level.
prd <- predict(logreg_10b, type = "response") > 0.5
(confusion_matrix <- table(ifelse(prd, "Up (prd)", "Down (prd)"), Weekly$Direction))
##
## Down Up
## Down (prd) 54 48
## Up (prd) 430 557
Interpretation:
Overall fraction of correct predictions = correct / total = (54 + 557) / (54 + 48 + 430 + 557) = 611 / 1089 ≈ 56.1%.
True Positives (TP): 557 - instances where the model correctly predicted “Up.”
True Negatives (TN): 54 - instances where the model correctly predicted “Down.”
False Positives (FP): 430 - instances where the model predicted “Up” when the actual outcome was “Down” (a Type I error).
False Negatives (FN): 48 - instances where the model predicted “Down” when the actual outcome was “Up” (a Type II error).
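The same accuracy/sensitivity/specificity arithmetic recurs for every model below, so a small helper keeps it in one place. This is a sketch rather than part of the original analysis; confusion_stats is a hypothetical name, and it assumes a 2x2 table with predictions in rows and the “Down” class listed first:
confusion_stats <- function(cm) {
  # cm: 2x2 table, rows = predicted, columns = actual, negative class first
  tn <- cm[1, 1]; fn <- cm[1, 2]  # predicted Down
  fp <- cm[2, 1]; tp <- cm[2, 2]  # predicted Up
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # true positive rate
    specificity = tn / (tn + fp))   # true negative rate
}
confusion_stats(confusion_matrix)  # accuracy ~0.561 on the full data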
train_set= Weekly[Weekly$Year<2009,]
log_reg2 = glm(Direction~Lag2, data= train_set, family = "binomial")
summary(log_reg2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = train_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
test_set= Weekly[Weekly$Year>2008,]
test_prob = predict(log_reg2, type="response", newdata = test_set)
test_set_preds = rep("Down", 104)
test_set_preds[test_prob>0.5] = "Up"
test_dir_actual = Weekly$Direction[Weekly$Year>2008]
confusion_matrix2<-table(test_set_preds, test_dir_actual)
print(confusion_matrix2)
## test_dir_actual
## test_set_preds Down Up
## Down 9 5
## Up 34 56
Interpretation:
Overall fraction of correct predictions = (9 + 56) / (9 + 5 + 34 + 56) = 65 / 104 = 62.5%.
True Positives (TP): 56 - the model predicted “Up” and the actual outcome was “Up.”
True Negatives (TN): 9 - the model predicted “Down” and the actual outcome was “Down.”
False Positives (FP): 34 - the model predicted “Up” when the actual outcome was “Down.”
False Negatives (FN): 5 - the model predicted “Down” when the actual outcome was “Up.”
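Applying the hypothetical confusion_stats helper sketched above reproduces the 62.5% directly:
confusion_stats(confusion_matrix2)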
Lda_e = lda(Direction ~ Lag2, data = train_set)
Lda_e
## Call:
## lda(Direction ~ Lag2, data = train_set)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
test_data = Weekly[Weekly$Year>2008,]
lda_test_e = predict(Lda_e, newdata=test_data, type="response")
lda_e_class = lda_test_e$class
table(lda_e_class, test_data$Direction)
##
## lda_e_class Down Up
## Down 9 5
## Up 34 56
Interpretation:
Overall fraction of correct predictions = (9 + 56) / 104 = 62.5%, the same as logistic regression.
qda_f = qda(Direction ~ Lag2, data = train_set)
qda_f
## Call:
## qda(Direction ~ Lag2, data = train_set)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
test_data = Weekly[Weekly$Year>2008,]
qda_test_f = predict(qda_f, newdata=test_data, type="response")
qda_f_class = qda_test_f$class
table(qda_f_class, test_data$Direction)
##
## qda_f_class Down Up
## Down 0 0
## Up 43 61
Interpretation:
Overall fraction of correct predictions = (0 + 61) / (0 + 0 + 43 + 61) = 61 / 104 ≈ 58.7%. Note that QDA predicts “Up” for every week in the test set.
True Positives (TP): 61 - the model predicted “Up” and the actual outcome was “Up.”
False Positives (FP): 43 - the model predicted “Up” when the actual outcome was “Down.”
True Negatives (TN): 0 - no week was correctly predicted “Down.”
False Negatives (FN): 0 - no week was incorrectly predicted “Down.”
set.seed(1)
train_X_knn = cbind(train_set$Lag2)
test_X_knn = cbind(test_data$Lag2)
train_Y_knn = train_set$Direction  # keep as a factor; cbind() would coerce the class labels to 1/2
knn_pred = knn(train_X_knn, test_X_knn, train_Y_knn, k=1)
table(knn_pred, test_data$Direction)
##
## knn_pred Down Up
##     Down   21 30
##     Up     22 31
Interpretation:
Overall fraction of correct predictions = (21 + 31) / (21 + 30 + 22 + 31) = 52 / 104 = 50%.
True Positives (TP): 31 - the model predicted “Up” and the actual outcome was “Up.”
False Positives (FP): 22 - the model predicted “Up” when the actual outcome was “Down.”
True Negatives (TN): 21 - the model predicted “Down” and the actual outcome was “Down.”
False Negatives (FN): 30 - the model predicted “Down” when the actual outcome was “Up.”
nb_h = naiveBayes(Direction ~ Lag2, data = train_set)
nb_h
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
test_data = Weekly[Weekly$Year>2008,]
nb_test_h = predict(nb_h, newdata=test_data, type="class")
nb_h_class = nb_test_h
table(nb_h_class, test_data$Direction)
##
## nb_h_class Down Up
## Down 0 0
## Up 43 61
Interpretation:
Overall fraction of correct predictions = (0 + 61) / 104 ≈ 58.7%, identical to the QDA model; naive Bayes also predicts “Up” for every test week.
Answer: Logistic regression and LDA perform best, each with 62.5% correct predictions on the held-out 2009–2010 data.
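pROC is loaded at the top of this document but never used; as a sketch, it can summarize the Lag2 logistic model on the 2009–2010 test set in a threshold-free way (test_prob and test_dir_actual are the objects computed above):
roc_lag2 <- roc(test_dir_actual, test_prob)  # pROC::roc(response, predictor)
auc(roc_lag2)
plot(roc_lag2, main = "Logistic (Lag2) ROC, 2009-2010")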
train_set4= Weekly[Weekly$Year<2009,]
log_reg4 = glm(Direction~Lag4, data= train_set4, family = "binomial")
test_set4= Weekly[Weekly$Year>2008,]
test_prob4 = predict(log_reg4, type="response", newdata = test_set4)
test_set_preds4 = rep("Down", 104)
test_set_preds4[test_prob4>0.5] = "Up"
test_dir_actual4 = Weekly$Direction[Weekly$Year>2008]
confusion_matrix4<-table(test_set_preds4, test_dir_actual4)
sum(diag(confusion_matrix4)) / sum(confusion_matrix4)
## [1] 0.4134615
train_set= Weekly[Weekly$Year<2009,]
log_reg_all = glm(Direction ~ Lag1 * Lag2 * Lag3 * Lag4, data= train_set, family = "binomial")
test_set= Weekly[Weekly$Year>2008,]
test_prob = predict(log_reg_all, type="response", newdata = test_set)
test_set_preds = rep("Down", 104)
test_set_preds[test_prob>0.5] = "Up"
test_dir_actual = Weekly$Direction[Weekly$Year>2008]
confusion_matrix_all<-table(test_set_preds, test_dir_actual)
sum(diag(confusion_matrix_all)) / sum(confusion_matrix_all)
## [1] 0.5961538
train_set= Weekly[Weekly$Year<2009,]
log_reg_all2 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4, data= train_set, family = "binomial")
test_set= Weekly[Weekly$Year>2008,]
test_prob = predict(log_reg_all2, type="response", newdata = test_set)
test_set_preds = rep("Down", 104)
test_set_preds[test_prob>0.5] = "Up"
test_dir_actual = Weekly$Direction[Weekly$Year>2008]
confusion_matrix_all2<-table(test_set_preds, test_dir_actual)
sum(diag(confusion_matrix_all2)) / sum(confusion_matrix_all2)
## [1] 0.5865385
nb_h3 = naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = train_set)
test_data = Weekly[Weekly$Year>2008,]
nb_test_h3 = predict(nb_h3, newdata=test_data, type="class")
nb_h_class3 = nb_test_h3
nb3=table(nb_h_class3, test_data$Direction)
sum(diag(nb3)) / sum(nb3)
## [1] 0.5096154
lda_all = lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = train_set)
test_data = Weekly[Weekly$Year>2008,]
lda_test = predict(lda_all, newdata=test_data, type="response")
lda_test_class = lda_test$class
lda_con=table(lda_test_class, test_data$Direction)
sum(diag(lda_con)) / sum(lda_con)
## [1] 0.5769231
qda_all = qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4,data = train_set)
test_data = Weekly[Weekly$Year>2008,]
qda_test = predict(qda_all, newdata=test_data, type="response")
qda_test_class = qda_test$class
qda_con=table(qda_test_class, test_data$Direction)
sum(diag(qda_con)) / sum(qda_con)
## [1] 0.5192308
set.seed(1)
train_X_knn <- train_set[, 2:4]  # columns 2:4 are Lag1-Lag3
test_X_knn <- test_set[, 2:4]
train_Y_knn <- train_set$Direction
accuracies <- numeric(20)
for (k in 1:20) {
knn_pred <- knn(train_X_knn, test_X_knn, train_Y_knn, k = k)
accuracy <- mean(knn_pred == test_dir_actual)
accuracies[k] <- accuracy
}
best_k <- which.max(accuracies)
print(accuracies)
## [1] 0.4903846 0.5000000 0.5096154 0.5576923 0.5288462 0.5576923 0.5769231
## [8] 0.5384615 0.5288462 0.5865385 0.5673077 0.5673077 0.6153846 0.5673077
## [15] 0.5865385 0.5961538 0.5961538 0.6057692 0.6250000 0.6153846
print(paste("Best k value:", best_k, "Accuracy:", accuracies[best_k]))
## [1] "Best k value: 19 Accuracy: 0.625"
Interpretation: KNN using the Lag1-Lag3 variables, with the best value k = 19, provides better test accuracy (62.5%) than the logistic regression variants with other predictor combinations tried above.
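A quick plot of the k sweep makes the pattern easier to read (a sketch using the accuracies vector computed above):
plot(1:20, accuracies, type = "b", xlab = "k", ylab = "Test accuracy",
     main = "KNN on Lag1-Lag3: test accuracy vs k")
abline(v = best_k, lty = 2)  # best k = 19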
attach(Auto)
## The following object is masked from package:lubridate:
##
## origin
## The following object is masked from package:ggplot2:
##
## mpg
mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
pairs(na.omit(Auto[-9]))
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
Interpretation: The boxplots show an association between mpg01 and cylinders, weight, displacement, horsepower, and year.
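Those associations can be quantified with the correlation of each predictor against mpg01 (a sketch; column 9 of Auto is the non-numeric name column, and the ISLR2 Auto has no missing values):
round(cor(Auto[, -9])[, "mpg01"], 2)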
idxs <- sample(1:dim(Auto)[1], size=dim(Auto)[1]*0.75)  # no seed is set, so this 75/25 split (and the error rates below) varies between runs
training <- Auto[idxs,]
test = Auto[-idxs,]
LDA_2d = lda(mpg01~displacement+weight+cylinders+year, data=training)
LDA_2d
## Call:
## lda(mpg01 ~ displacement + weight + cylinders + year, data = training)
##
## Prior probabilities of groups:
## 0 1
## 0.4829932 0.5170068
##
## Group means:
## displacement weight cylinders year
## 0 274.2606 3610.380 6.774648 74.34507
## 1 117.0164 2372.783 4.190789 77.65132
##
## Coefficients of linear discriminants:
## LD1
## displacement 0.0020071281
## weight -0.0008940318
## cylinders -0.5629470320
## year 0.1173041428
Pred_lda2d = predict(LDA_2d, newdata=test, type="response")$class
confu_2d=table(Pred_lda2d, test$mpg01)
confu_2d
##
## Pred_lda2d 0 1
## 0 46 0
## 1 8 44
1-sum(diag(confu_2d)) / sum(confu_2d)
## [1] 0.08163265
Interpretation: The test error of the LDA model on this split is 0.0816 (about 8.2%).
QDA_2e = qda(mpg01~displacement+weight+cylinders+year, data=training)
QDA_2e
## Call:
## qda(mpg01 ~ displacement + weight + cylinders + year, data = training)
##
## Prior probabilities of groups:
## 0 1
## 0.4829932 0.5170068
##
## Group means:
## displacement weight cylinders year
## 0 274.2606 3610.380 6.774648 74.34507
## 1 117.0164 2372.783 4.190789 77.65132
Pred_QDA_2e = predict(QDA_2e, newdata=test, type="response")$class
confu_2e=table(Pred_QDA_2e, test$mpg01)
confu_2e
##
## Pred_QDA_2e 0 1
## 0 48 1
## 1 6 43
1-sum(diag(confu_2e)) / sum(confu_2e)
## [1] 0.07142857
Interpretation: The test error of the QDA model on this split is 0.0714 (about 7.1%).
GLM_2f = glm(mpg01~displacement+weight+cylinders+year, data=training,family="binomial")
summary(GLM_2f)
##
## Call:
## glm(formula = mpg01 ~ displacement + weight + cylinders + year,
## family = "binomial", data = training)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.990e+01 5.499e+00 -3.619 0.000296 ***
## displacement -6.854e-03 1.072e-02 -0.639 0.522674
## weight -3.612e-03 9.965e-04 -3.624 0.000290 ***
## cylinders -2.488e-01 4.250e-01 -0.585 0.558257
## year 4.291e-01 8.587e-02 4.996 5.84e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 407.23 on 293 degrees of freedom
## Residual deviance: 133.25 on 289 degrees of freedom
## AIC: 143.25
##
## Number of Fisher Scoring iterations: 7
Pred_GLM_2f = predict(GLM_2f, newdata=test, type="response")
preds = rep(0, dim(test)[1])
preds[Pred_GLM_2f>0.5]=1
confu_2f=table(preds, test$mpg01)  # equivalent to Auto$mpg01[-idxs]
confu_2f
##
## preds 0 1
## 0 49 2
## 1 5 42
1-sum(diag(confu_2f)) / sum(confu_2f)
## [1] 0.07142857
Interpretation: The test error of the logistic regression model on this split is 0.0714, matching QDA.
nb_model_14g <- naiveBayes(mpg01 ~ displacement + weight + cylinders + year, data = training)
nb_pred_14g <- predict(nb_model_14g, newdata = test, type = "raw")
preds_14g <- ifelse(nb_pred_14g[, "1"] > 0.5, 1, 0)
confusion_matrix_14g <- table(preds_14g, test$mpg01)
confusion_matrix_14g
##
## preds_14g 0 1
## 0 45 1
## 1 9 43
1 - sum(diag(confusion_matrix_14g)) / sum(confusion_matrix_14g)
## [1] 0.1020408
Interpretation: The test error of the naiveBayes model on this split is 0.1020 (about 10.2%).
selected_vars <- c("displacement", "weight", "cylinders", "year")
train_X <- training[, selected_vars]
test_X <- test[, selected_vars]
train_Y <- training$mpg01
accuracies <- numeric(30)
errors <- numeric(30)
for (k in 1:30) {
knn_model <- knn(train = train_X, test = test_X, cl = train_Y, k = k)
accuracy <- mean(knn_model == test$mpg01)
accuracies[k] <- accuracy
error <- 1 - accuracy
errors[k] <- error
}
best_k <- which.max(accuracies)
print("Accuracies for each k value:")
## [1] "Accuracies for each k value:"
print(accuracies)
## [1] 0.8673469 0.8775510 0.8979592 0.9081633 0.8775510 0.8775510 0.8877551
## [8] 0.8673469 0.8877551 0.8877551 0.8877551 0.8877551 0.8877551 0.8877551
## [15] 0.8877551 0.8877551 0.8775510 0.8775510 0.8775510 0.8775510 0.8877551
## [22] 0.8877551 0.8877551 0.8877551 0.8877551 0.8877551 0.8877551 0.8979592
## [29] 0.8979592 0.8979592
print("Errors for each k value:")
## [1] "Errors for each k value:"
print(errors)
## [1] 0.13265306 0.12244898 0.10204082 0.09183673 0.12244898 0.12244898
## [7] 0.11224490 0.13265306 0.11224490 0.11224490 0.11224490 0.11224490
## [13] 0.11224490 0.11224490 0.11224490 0.11224490 0.12244898 0.12244898
## [19] 0.12244898 0.12244898 0.11224490 0.11224490 0.11224490 0.11224490
## [25] 0.11224490 0.11224490 0.11224490 0.10204082 0.10204082 0.10204082
print(paste("Best k value:", best_k, "Accuracy:", accuracies[best_k]))
## [1] "Best k value: 4 Accuracy: 0.908163265306122"
test_error_best_k <- 1 - accuracies[best_k]
print(paste("Test Error for the Best k value:", test_error_best_k))
## [1] "Test Error for the Best k value: 0.0918367346938775"
data("Boston")
# Binary variable with census tract has a crime rate above or below the median
Boston$CrimeAboveMedian <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
pairs(Boston)
Interpretation: From the pairs plot, nox, age, dis, and medv appear to be associated with the crime-rate indicator.
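The eyeballed relationships can be quantified (a sketch; every column of this Boston copy is numeric after adding the indicator):
round(cor(Boston)[, "CrimeAboveMedian"], 2)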
set.seed(123)
train_idx <- sample(1:nrow(Boston), 0.7 * nrow(Boston))
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]
# Fit logistic regression model
glm_model <- glm(CrimeAboveMedian ~nox + age + dis + medv, data = train_data, family = binomial,maxit = 1000)
summary(glm_model)
##
## Call:
## glm(formula = CrimeAboveMedian ~ nox + age + dis + medv, family = binomial,
## data = train_data, maxit = 1000)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.64257 3.95520 -6.230 4.65e-10 ***
## nox 36.96201 5.65857 6.532 6.49e-11 ***
## age 0.02096 0.01031 2.033 0.04210 *
## dis 0.34586 0.16695 2.072 0.03830 *
## medv 0.09356 0.02930 3.193 0.00141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.74 on 353 degrees of freedom
## Residual deviance: 216.40 on 349 degrees of freedom
## AIC: 226.4
##
## Number of Fisher Scoring iterations: 7
Interpretation: age has the largest p-value (0.042) and is only marginally significant, so we refit the model without it.
glm_model2 <- glm(CrimeAboveMedian ~nox + dis + medv, data = train_data, family = binomial,maxit = 1000)
summary(glm_model2)
##
## Call:
## glm(formula = CrimeAboveMedian ~ nox + dis + medv, family = binomial,
## data = train_data, maxit = 1000)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -22.89086 3.74758 -6.108 1.01e-09 ***
## nox 38.14579 5.59055 6.823 8.90e-12 ***
## dis 0.21348 0.15575 1.371 0.17047
## medv 0.07548 0.02669 2.828 0.00468 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.74 on 353 degrees of freedom
## Residual deviance: 220.75 on 350 degrees of freedom
## AIC: 228.75
##
## Number of Fisher Scoring iterations: 7
Interpretation: dis is no longer significant (p = 0.17), so we refit again without it.
glm_model3 <- glm(CrimeAboveMedian ~nox + medv, data = train_data, family = binomial,maxit = 1000)
summary(glm_model3)
##
## Call:
## glm(formula = CrimeAboveMedian ~ nox + medv, family = binomial,
## data = train_data, maxit = 1000)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.19463 2.37661 -8.076 6.67e-16 ***
## nox 33.00251 3.79765 8.690 < 2e-16 ***
## medv 0.06698 0.02573 2.603 0.00925 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.74 on 353 degrees of freedom
## Residual deviance: 222.56 on 351 degrees of freedom
## AIC: 228.56
##
## Number of Fisher Scoring iterations: 6
Interpretation: Both nox and medv are significant in this model, so it is used below to compute the confusion matrix and accuracy.
glm_pred <- predict(glm_model3, newdata = test_data, type = "response")
glm_pred_binary <- ifelse(glm_pred > 0.5, 1, 0)
conf_matrix_last_question <- table(glm_pred_binary, test_data$CrimeAboveMedian)
conf_matrix_last_question
##
## glm_pred_binary 0 1
## 0 63 13
## 1 12 64
Accuracy_last=sum(diag(conf_matrix_last_question)) / sum(conf_matrix_last_question)
Error=1-sum(diag(conf_matrix_last_question)) / sum(conf_matrix_last_question)
print(paste("Accuracy:",Accuracy_last))
## [1] "Accuracy: 0.835526315789474"
print(paste("Error:", Error))
## [1] "Error: 0.164473684210526"
Interpretation:
True Positives (TP): 64 - tracts correctly predicted to have a crime rate above the median.
True Negatives (TN): 63 - tracts correctly predicted to have a crime rate below the median.
False Positives (FP): 12 - tracts predicted above the median that are actually below it.
False Negatives (FN): 13 - tracts predicted below the median that are actually above it.
Accuracy: 83.55% of the model’s predictions are correct.
Error rate: 16.45% of the predictions are incorrect.
lda_model3 <- lda(CrimeAboveMedian ~nox + age+ dis+medv, data = train_data)
lda_model3
## Call:
## lda(CrimeAboveMedian ~ nox + age + dis + medv, data = train_data)
##
## Prior probabilities of groups:
## 0 1
## 0.5028249 0.4971751
##
## Group means:
## nox age dis medv
## 0 0.4695708 50.20730 5.184995 24.48427
## 1 0.6344489 86.15795 2.510232 20.09318
##
## Coefficients of linear discriminants:
## LD1
## nox 9.85966127
## age 0.01496032
## dis -0.03523792
## medv 0.01852111
lda_pred_binary = predict(lda_model3, newdata=test_data, type="response")$class
conf_matrix_last_question_lda=table(lda_pred_binary, test_data$CrimeAboveMedian)
conf_matrix_last_question_lda
##
## lda_pred_binary 0 1
## 0 64 10
## 1 11 67
Accuracy_last_lda=sum(diag(conf_matrix_last_question_lda)) / sum(conf_matrix_last_question_lda)
Error_lda=1-sum(diag(conf_matrix_last_question_lda)) / sum(conf_matrix_last_question_lda)
print(paste("Accuracy:",Accuracy_last_lda))
## [1] "Accuracy: 0.861842105263158"
print(paste("Error:", Error_lda))
## [1] "Error: 0.138157894736842"
Interpretation:
True Negatives (TN): 64 - tracts correctly predicted below the median.
False Positives (FP): 11 - tracts predicted above the median that are actually below it.
False Negatives (FN): 10 - tracts predicted below the median that are actually above it.
True Positives (TP): 67 - tracts correctly predicted above the median.
Accuracy: 86.18% of the predictions are correct.
Error rate: 13.82% of the predictions are incorrect.
set.seed(123)
train_X <- train_data[, c("nox", "age", "dis", "medv")]
train_Y <- train_data$CrimeAboveMedian
test_X <- test_data[, c("nox", "age", "dis", "medv")]
test_Y <- test_data$CrimeAboveMedian
accuracies <- numeric(20)
error_rates <- numeric(20)
for (k in 1:20) {
knn_model <- knn(train_X, test_X, train_Y, k = k)
accuracy <- mean(knn_model == test_Y)
accuracies[k] <- accuracy
error_rate <- 1 - accuracy
error_rates[k] <- error_rate
}
best_k <- which.max(accuracies)
print(paste("K Value | Accuracy | Error Rate"))
## [1] "K Value | Accuracy | Error Rate"
for (i in 1:20) {
print(paste(i, " | ", accuracies[i], " | ", error_rates[i]))
}
## [1] "1 | 0.743421052631579 | 0.256578947368421"
## [1] "2 | 0.756578947368421 | 0.243421052631579"
## [1] "3 | 0.756578947368421 | 0.243421052631579"
## [1] "4 | 0.756578947368421 | 0.243421052631579"
## [1] "5 | 0.756578947368421 | 0.243421052631579"
## [1] "6 | 0.769736842105263 | 0.230263157894737"
## [1] "7 | 0.743421052631579 | 0.256578947368421"
## [1] "8 | 0.743421052631579 | 0.256578947368421"
## [1] "9 | 0.769736842105263 | 0.230263157894737"
## [1] "10 | 0.75 | 0.25"
## [1] "11 | 0.75 | 0.25"
## [1] "12 | 0.756578947368421 | 0.243421052631579"
## [1] "13 | 0.736842105263158 | 0.263157894736842"
## [1] "14 | 0.75 | 0.25"
## [1] "15 | 0.75 | 0.25"
## [1] "16 | 0.75 | 0.25"
## [1] "17 | 0.75 | 0.25"
## [1] "18 | 0.75 | 0.25"
## [1] "19 | 0.776315789473684 | 0.223684210526316"
## [1] "20 | 0.763157894736842 | 0.236842105263158"
print(paste("Best k value:", best_k, "Accuracy:", accuracies[best_k],"Error:", error_rates[best_k]))
## [1] "Best k value: 19 Accuracy: 0.776315789473684 Error: 0.223684210526316"
nb_model <- naiveBayes(CrimeAboveMedian ~ nox + age + dis + medv, data = train_data)
nb_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5028249 0.4971751
##
## Conditional probabilities:
## nox
## Y [,1] [,2]
## 0 0.4695708 0.05622812
## 1 0.6344489 0.09839784
##
## age
## Y [,1] [,2]
## 0 50.20730 25.70990
## 1 86.15795 17.82768
##
## dis
## Y [,1] [,2]
## 0 5.184995 2.162853
## 1 2.510232 1.090256
##
## medv
## Y [,1] [,2]
## 0 24.48427 6.674894
## 1 20.09318 10.580481
nb_pred <- predict(nb_model, newdata = test_data)
conf_matrix_nb_pred <- table(nb_pred, test_data$CrimeAboveMedian)
conf_matrix_nb_pred
##
## nb_pred 0 1
## 0 61 8
## 1 14 69
accuracy_nb <- sum(diag(conf_matrix_nb_pred)) / sum(conf_matrix_nb_pred)
print(paste("Accuracy:", accuracy_nb))
## [1] "Accuracy: 0.855263157894737"
# Calculate error rate
error_rate_nb <- 1 - accuracy_nb
print(paste("Error Rate:", error_rate_nb))
## [1] "Error Rate: 0.144736842105263"
Interpretation:
True Negatives (TN): 61 - tracts correctly predicted as class 0 (below the median).
False Positives (FP): 14 - tracts incorrectly predicted as class 1 (above the median).
False Negatives (FN): 8 - tracts incorrectly predicted as class 0 (below the median).
True Positives (TP): 69 - tracts correctly predicted as class 1 (above the median).
Accuracy: 85.53% of the predictions are correct.
Error rate: 14.47% of the predictions are incorrect.
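For a side-by-side view, the held-out accuracies computed above can be collected into one table (a sketch that assumes the objects from this session are still available):
data.frame(
  model    = c("Logistic (nox + medv)", "LDA", paste0("KNN (k = ", best_k, ")"), "Naive Bayes"),
  accuracy = c(Accuracy_last, Accuracy_last_lda, accuracies[best_k], accuracy_nb)
)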