This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010. (a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
#Solution
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly)
cor(Weekly[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
The correlation matrix shows that the only significant correlation is the one between volumes and shares.
attach(Weekly)
plot(Volume)
From the plot above, we see that there is a positive association between
volume and year,as the volume variable is increasing with time. This
implies that the average number of daily shares traded in billions
increased from 1990 to 2010.
#Solution
logistic.model <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family="binomial")
summary(logistic.model)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The lag2 variable appears to be statistically significant as it has a small p-value.
#Solution
direction.prob <- predict(logistic.model, type="response")
direction.pred <- rep("Down", 1089)
direction.pred[direction.prob>0.5]="Up"
table(direction.pred, Direction)
## Direction
## direction.pred Down Up
## Down 54 48
## Up 430 557
mean(direction.pred==Direction)
## [1] 0.5610652
From the confusion matrix, we see that our model correctly predicted that the market had a positive return on 557 weeks and and would have a negative return on 54 weeks. From the output, we see that the model correctly predicted the movement of the market 56.1% of the time. The mistake logistic regression made was train and test the model using the same set of observations. We achieved a 43.9% training error rate.
#Solution
train <- (Year<2009)
Weekly.2008 <- Weekly[!train,]
Direction.2008 <- Direction[!train]
train.model <- glm(Direction~Lag2,data=Weekly, family="binomial",subset=train )
train.probs <- predict(train.model, Weekly.2008, type="response")
train.pred <- rep("Down",104 )
train.pred[train.probs>0.5] <- "Up"
table(train.pred, Direction.2008)
## Direction.2008
## train.pred Down Up
## Down 9 5
## Up 34 56
mean(train.pred==Direction.2008)
## [1] 0.625
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
## The following object is masked from 'package:dplyr':
##
## select
direction.lda <- lda(Direction~Lag2, data=Weekly, subset = train)
lda.pred <- predict(direction.lda, Weekly.2008)
lda.class <- lda.pred$class
table(lda.class,Direction.2008)
## Direction.2008
## lda.class Down Up
## Down 9 5
## Up 34 56
mean(lda.class==Direction.2008)
## [1] 0.625
#Solution
direction.qda <- qda(Direction~Lag2, data=Weekly, subset = train)
qda.class <- predict(direction.qda, Weekly.2008)$class
table(qda.class,Direction.2008)
## Direction.2008
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class==Direction.2008)
## [1] 0.5865385
#Solution
library(class)
train.X <- as.matrix(Lag2[train])
test.X <- as.matrix((Lag2[!train]))
train.Direction <- Direction[train]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction,k=1)
table(knn.pred, Direction.2008)
## Direction.2008
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred==Direction.2008)
## [1] 0.5
#Solution
library(e1071)
## Warning: package 'e1071' was built under R version 4.1.3
nb.fit <- naiveBayes(Direction~Lag2, data=Weekly, subset=train)
nb.class <- predict(nb.fit, Direction.2008)
## Warning in predict.naiveBayes(nb.fit, Direction.2008): Type mismatch between
## training and new data for variable 'Lag2'. Did you use factors with numeric
## labels for training, and numeric values for new data?
table(nb.class, Direction.2008)
## Direction.2008
## nb.class Down Up
## Down 0 0
## Up 43 61
mean(nb.class==Direction.2008)
## [1] 0.5865385
#solution
mean(train.pred!=Direction.2008)
## [1] 0.375
mean(lda.class!=Direction.2008)
## [1] 0.375
mean(qda.class!=Direction.2008)
## [1] 0.4134615
mean(knn.pred!=Direction.2008)
## [1] 0.5
Based on the results of the test error rate, we see that the logistic regression model and the linear discriminant analysis classifier provided the best results on the data.
#Solution
#1: The combination of lag1 and lag2 except naives bayes
#LG model
train.model1 <- glm(Direction~Lag1*Lag2,data=Weekly, family="binomial",subset=train )
train.probs1 <- predict(train.model1, Weekly.2008, type="response")
train.pred1 <- rep("Down",104 )
train.pred1[train.probs1>0.5] <- "Up"
table(train.pred1, Direction.2008)
## Direction.2008
## train.pred1 Down Up
## Down 7 8
## Up 36 53
mean(train.pred1==Direction.2008)
## [1] 0.5769231
#LDA MODEL
library(MASS)
direction.lda1 <- lda(Direction~Lag1+Lag2+Lag1:Lag2, data=Weekly, subset = train)
lda.pred1 <- predict(direction.lda1, Weekly.2008)
lda.class1 <- lda.pred1$class
table(lda.class1,Direction.2008)
## Direction.2008
## lda.class1 Down Up
## Down 7 8
## Up 36 53
mean(lda.class1==Direction.2008)
## [1] 0.5769231
#QDA MODEL
direction.qda1 <- qda(Direction~Lag1+Lag2+Lag1:Lag2, data=Weekly, subset = train)
qda.class1 <- predict(direction.qda1, Weekly.2008)$class
table(qda.class1,Direction.2008)
## Direction.2008
## qda.class1 Down Up
## Down 23 36
## Up 20 25
mean(qda.class1==Direction.2008)
## [1] 0.4615385
#KNN model
library(class)
train.X1 <-cbind(Lag1, Lag2)[train,]
test.X1 <- cbind(Lag1, Lag2)[!train,]
train.Direction1 <- Direction[train]
set.seed(3)
knn.pred1 <- knn(train.X1, test.X1, train.Direction1,k=1)
table(knn.pred1, Direction.2008)
## Direction.2008
## knn.pred1 Down Up
## Down 18 29
## Up 25 32
mean(knn.pred1==Direction.2008)
## [1] 0.4807692
# Naives Bayes Model
library(e1071)
nb.fit1 <- naiveBayes(Direction~Lag1+Lag2, data=Weekly, subset=train)
nb.class1 <- predict(nb.fit1, Direction.2008)
## Warning in predict.naiveBayes(nb.fit1, Direction.2008): Type mismatch between
## training and new data for variable 'Lag1'. Did you use factors with numeric
## labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(nb.fit1, Direction.2008): Type mismatch between
## training and new data for variable 'Lag2'. Did you use factors with numeric
## labels for training, and numeric values for new data?
table(nb.class1, Direction.2008)
## Direction.2008
## nb.class1 Down Up
## Down 0 0
## Up 43 61
mean(nb.class1==Direction.2008)
## [1] 0.5865385
From the output above, we see that the logistic regression model and LDA classifier provides the best output for the data.
#2 Transforming the predictors with quadratic transformation
#LG model
train.model2 <- glm(Direction~Lag1+I(Lag1^2),data=Weekly, family="binomial",subset=train )
train.probs2 <- predict(train.model2, Weekly.2008, type="response")
train.pred2 <- rep("Down",104 )
train.pred2[train.probs2>0.5] <- "Up"
table(train.pred2, Direction.2008)
## Direction.2008
## train.pred2 Down Up
## Down 4 5
## Up 39 56
mean(train.pred2==Direction.2008)
## [1] 0.5769231
#LDA MODEL
library(MASS)
direction.lda2 <- lda(Direction~Lag1+I(Lag1^2), data=Weekly, subset = train)
lda.pred2 <- predict(direction.lda2, Weekly.2008)
lda.class2 <- lda.pred2$class
table(lda.class2,Direction.2008)
## Direction.2008
## lda.class2 Down Up
## Down 4 6
## Up 39 55
mean(lda.class2==Direction.2008)
## [1] 0.5673077
#QDA MODEL
direction.qda2 <- qda(Direction~Lag1+I(Lag1^2), data=Weekly, subset = train)
qda.class2 <- predict(direction.qda2, Weekly.2008)$class
table(qda.class2,Direction.2008)
## Direction.2008
## qda.class2 Down Up
## Down 32 54
## Up 11 7
mean(qda.class2==Direction.2008)
## [1] 0.375
#KNN model
library(class)
train.X2 <-cbind(Lag1, I(Lag1^2))[train,]
test.X2 <- cbind(Lag1, I(Lag1^2))[!train,]
train.Direction2 <- Direction[train]
set.seed(4)
knn.pred2 <- knn(train.X2, test.X2, train.Direction2,k=1)
table(knn.pred2, Direction.2008)
## Direction.2008
## knn.pred2 Down Up
## Down 18 30
## Up 25 31
mean(knn.pred2==Direction.2008)
## [1] 0.4711538
# Naives Bayes Model
library(e1071)
nb.fit2 <- naiveBayes(Direction~Lag1+I(Lag1^2), data=Weekly, subset=train)
nb.class2 <- predict(nb.fit2, Direction.2008)
## Warning in predict.naiveBayes(nb.fit2, Direction.2008): Type mismatch between
## training and new data for variable 'Lag1'. Did you use factors with numeric
## labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(nb.fit2, Direction.2008): Type mismatch between
## training and new data for variable 'I(Lag1^2)'. Did you use factors with numeric
## labels for training, and numeric values for new data?
table(nb.class2, Direction.2008)
## Direction.2008
## nb.class2 Down Up
## Down 0 0
## Up 43 61
mean(nb.class2==Direction.2008)
## [1] 0.5865385
Here, we see that the naives bayes classifier produced the best result for the data.
library(class)
train.X3 <- as.matrix(Lag2[train])
test.X3 <- as.matrix((Lag2[!train]))
train.Direction3 <- Direction[train]
set.seed(1)
knn.pred3 <- knn(train.X3, test.X3, train.Direction3,k=2)
table(knn.pred3, Direction.2008)
## Direction.2008
## knn.pred3 Down Up
## Down 19 27
## Up 24 34
mean(knn.pred3==Direction.2008)
## [1] 0.5096154
library(class)
train.X4 <- as.matrix(Lag2[train])
test.X4 <- as.matrix((Lag2[!train]))
train.Direction4 <- Direction[train]
set.seed(1)
knn.pred4 <- knn(train.X4, test.X4, train.Direction4,k=2)
table(knn.pred4, Direction.2008)
## Direction.2008
## knn.pred4 Down Up
## Down 19 27
## Up 24 34
mean(knn.pred4==Direction.2008)
## [1] 0.5096154
#K=10
library(class)
train.X5 <- as.matrix(Lag2[train])
test.X5 <- as.matrix((Lag2[!train]))
train.Direction5 <- Direction[train]
set.seed(1)
knn.pred5 <- knn(train.X5, test.X5, train.Direction5,k=2)
table(knn.pred5, Direction.2008)
## Direction.2008
## knn.pred5 Down Up
## Down 19 27
## Up 24 34
mean(knn.pred5==Direction.2008)
## [1] 0.5096154
From the results above, we see that increasing K did not improve the prediction accuracy of the KNN classifer.
#Solution
attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
Auto <- Auto|>mutate(mpg01=case_when(mpg>median(mpg)~1, mpg<median(mpg)~0))
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
par(mfrow=c(2,2))
boxplot(cylinders~mpg01, data=Auto)
boxplot(displacement~mpg01, data=Auto)
boxplot(horsepower~mpg01, data=Auto)
boxplot(weight~mpg01, data=Auto)
boxplot(acceleration~mpg01, data=Auto)
boxplot(year~mpg01, data=Auto)
boxplot(origin~mpg01, data=Auto)
From the plot and pairwisw correlation matrix, There seem to be an
association between mpg01 and cylinders, displacement,
horsepower,weight,and origin.
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
set.seed(20)
split <- sample.split(Auto, SplitRatio = 0.8)
Autotrain <- subset(Auto, split==TRUE)
Autotest<- subset(Auto, split==FALSE)
library(MASS)
mpg01.lda <- lda(mpg01 ~ cylinders + weight + displacement + horsepower+origin, data = Autotrain)
mpg01.pred <- predict(mpg01.lda, Autotest)
table(mpg01.pred$class, Autotest$mpg01)
##
## 0 1
## 0 36 4
## 1 4 34
mean(mpg01.pred$class != Autotest$mpg01)
## [1] 0.1025641
The test error is 10.26%
fit.qda <- qda(mpg01 ~ cylinders + weight + displacement + horsepower +origin, data = Autotrain)
pred.qda <- predict(fit.qda, Autotest)
table(pred.qda$class, Autotest$mpg01)
##
## 0 1
## 0 37 7
## 1 3 31
mean(pred.qda$class !=Autotest$mpg01 )
## [1] 0.1282051
The test error is 12.82%
log.glm <- glm(mpg01 ~ cylinders + weight + displacement + horsepower+origin, data = Autotrain, family = binomial)
probs.log <- predict(log.glm, Autotest, type = "response")
pred.log <- rep(0, length(probs.log))
pred.log[probs.log > 0.5] <- 1
table(pred.log, Autotest$mpg01)
##
## pred.log 0 1
## 0 38 4
## 1 2 34
mean(pred.log != Autotest$mpg01)
## [1] 0.07692308
The test error is 7.69%
library(e1071)
nb.train <- naiveBayes(mpg01 ~ cylinders + weight + displacement + horsepower+origin, data = Autotrain)
class.train <- predict(nb.train, Autotest)
table(class.train, Autotest$mpg01)
##
## class.train 0 1
## 0 36 3
## 1 4 35
mean(class.train!=Autotest$mpg01)
## [1] 0.08974359
The test error is 8.97%
#STANDARDIZE THE DATA
library(caTools)
set.seed(2055)
standardized.X <- scale(Auto[,c(-1,-6,-7,-9,-10)])
standard.split<- sample(1:nrow(standardized.X), size=nrow(standardized.X)*0.8,replace=F)
standard.train <- standardized.X[standard.split,]
standard.test<- standardized.X[-standard.split,]
mpg09.train <- Auto[standard.split, 10]
mpg09.test <- Auto[-standard.split, 10]
library(class)
set.seed(689)
knn.1 <- knn(standard.train,standard.test,mpg09.train,k=1)
knn.3 <- knn(standard.train, standard.test,mpg09.train,k=3)
knn.5 <- knn(standard.train, standard.test,mpg09.train,k=5)
knn.7 <- knn(standard.train, standard.test,mpg09.train,k=7)
knn.9 <- knn(standard.train, standard.test,mpg09.train,k=9)
knn.11 <- knn(standard.train, standard.test,mpg09.train,k=11)
knn.13 <- knn(standard.train, standard.test,mpg09.train,k=13)
knn.15 <- knn(standard.train, standard.test,mpg09.train,k=15)
knn.17 <- knn(standard.train, standard.test,mpg09.train,k=17)
table (knn.1, mpg09.test)
## mpg09.test
## knn.1 0 1
## 0 37 3
## 1 5 34
table (knn.3, mpg09.test)
## mpg09.test
## knn.3 0 1
## 0 36 3
## 1 6 34
table (knn.5, mpg09.test)
## mpg09.test
## knn.5 0 1
## 0 37 3
## 1 5 34
table (knn.7, mpg09.test)
## mpg09.test
## knn.7 0 1
## 0 36 2
## 1 6 35
table (knn.9, mpg09.test)
## mpg09.test
## knn.9 0 1
## 0 36 1
## 1 6 36
table (knn.11, mpg09.test)
## mpg09.test
## knn.11 0 1
## 0 36 1
## 1 6 36
table (knn.13, mpg09.test)
## mpg09.test
## knn.13 0 1
## 0 36 2
## 1 6 35
table (knn.15, mpg09.test)
## mpg09.test
## knn.15 0 1
## 0 36 2
## 1 6 35
table (knn.17, mpg09.test)
## mpg09.test
## knn.17 0 1
## 0 36 2
## 1 6 35
mean(knn.1!=mpg09.test)
## [1] 0.1012658
mean(knn.3!=mpg09.test)
## [1] 0.1139241
mean(knn.5!=mpg09.test)
## [1] 0.1012658
mean(knn.7!=mpg09.test)
## [1] 0.1012658
mean(knn.9!=mpg09.test)
## [1] 0.08860759
mean(knn.11!=mpg09.test)
## [1] 0.08860759
mean(knn.13!=mpg09.test)
## [1] 0.1012658
mean(knn.15!=mpg09.test)
## [1] 0.1012658
mean(knn.17!=mpg09.test)
## [1] 0.1012658
We see that the KNN classifier with k=9 and k=11 had the lowest test error rate of 8.86%
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
##Solution
attach(Boston)
Boston <- Boston|>mutate(crime=case_when(crim>median(crim)~1, crim<median(crim)~0))
cor(Boston[, ])
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crime 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crime -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crime
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crime -0.35121093 0.4532627 -0.2630167 1.00000000
From the pairwise correlation matrix, we see there is an association between crime and the indus, nox, age, dis, rad, tax predictors. So our classifiers will be built based on those predictors.
Boston$crime <- as.factor(Boston$crime)
library(caTools)
set.seed(27)
split1<- sample.split(Boston, SplitRatio = 0.8)
Bostontrain <- subset(Boston, split1==TRUE)
Bostontest<- subset(Boston, split1==FALSE)
Boston.glm <- glm(crime ~ indus+ nox + age +dis +rad +tax, data = Bostontrain, family = binomial)
probs.Boston<- predict(Boston.glm, Bostontest, type = "response")
pred.Boston<- rep(0, length(probs.Boston))
pred.Boston[probs.Boston> 0.5] <- 1
table(pred.Boston, Bostontest$crime)
##
## pred.Boston 0 1
## 0 48 9
## 1 2 42
mean(pred.Boston == Bostontest$crime)
## [1] 0.8910891
mean(pred.Boston != Bostontest$crime)
## [1] 0.1089109
library(MASS)
Boston.lda <- lda(crime ~ indus+ nox + age +dis +rad +tax, data = Bostontrain)
Boston.predict<- predict(Boston.lda, Bostontest)
table(Boston.predict$class, Bostontest$crime)
##
## 0 1
## 0 48 14
## 1 2 37
mean(Boston.predict$class==Bostontest$crime)
## [1] 0.8415842
mean(Boston.predict$class!=Bostontest$crime)
## [1] 0.1584158
library(e1071)
nb.Boston <- naiveBayes(crime ~ indus+ nox + age +dis +rad +tax, data = Bostontrain)
predict.Boston <- predict(nb.Boston, Bostontest)
table(predict.Boston, Bostontest$crime)
##
## predict.Boston 0 1
## 0 48 13
## 1 2 38
mean(predict.Boston==Bostontest$crime)
## [1] 0.8514851
mean(predict.Boston!=Bostontest$crime)
## [1] 0.1485149
Boston1 <- Boston
Boston1$crime <- as.numeric(Boston1$crime)
#STANDARDIZE THE DATA
library(caTools)
set.seed(2955)
standardized.K <- scale(Boston1[,c(3,5,7,8,9,10)])
standard.splits<- sample(1:nrow(standardized.K), size=nrow(standardized.K)*0.8,replace=F)
Boston.train <- standardized.K[standard.splits,]
Boston.test<- standardized.K[-standard.splits,]
crime.train <- Boston1[standard.splits, 15]
crime.test <- Boston1[-standard.splits, 15]
library(class)
set.seed(689)
knna.1 <- knn(Boston.train,Boston.test,crime.train,k=1)
knnb.3 <- knn(Boston.train, Boston.test,crime.train,k=3)
knnc.5 <- knn(Boston.train, Boston.test,crime.train,k=5)
knnd.7 <- knn(Boston.train, Boston.test,crime.train,k=7)
knne.9 <- knn(Boston.train, Boston.test,crime.train,k=9)
knnf.11 <- knn(Boston.train, Boston.test,crime.train,k=11)
knng.13 <- knn(Boston.train, Boston.test,crime.train,k=13)
knnh.15 <- knn(Boston.train,Boston.test,crime.train,k=15)
knni.17 <- knn(Boston.train, Boston.test,crime.train,k=17)
table (knna.1, crime.test)
## crime.test
## knna.1 1 2
## 1 51 3
## 2 5 43
table (knnb.3,crime.test)
## crime.test
## knnb.3 1 2
## 1 52 3
## 2 4 43
table (knnc.5, crime.test)
## crime.test
## knnc.5 1 2
## 1 52 3
## 2 4 43
table (knnd.7, crime.test)
## crime.test
## knnd.7 1 2
## 1 52 4
## 2 4 42
table (knne.9, crime.test)
## crime.test
## knne.9 1 2
## 1 50 3
## 2 6 43
table (knnf.11, crime.test)
## crime.test
## knnf.11 1 2
## 1 50 3
## 2 6 43
table (knng.13, crime.test)
## crime.test
## knng.13 1 2
## 1 50 3
## 2 6 43
table (knnh.15, crime.test)
## crime.test
## knnh.15 1 2
## 1 49 3
## 2 7 43
table (knni.17, crime.test)
## crime.test
## knni.17 1 2
## 1 49 3
## 2 7 43
mean(knna.1!=crime.test)
## [1] 0.07843137
mean(knnb.3!=crime.test)
## [1] 0.06862745
mean(knnc.5!=crime.test)
## [1] 0.06862745
mean(knnd.7!=crime.test)
## [1] 0.07843137
mean(knne.9!=crime.test)
## [1] 0.08823529
mean(knnf.11!=crime.test)
## [1] 0.08823529
mean(knng.13!=crime.test)
## [1] 0.08823529
mean(knnh.15!=crime.test)
## [1] 0.09803922
mean(knni.17!=crime.test)
## [1] 0.09803922
After building the classifiers from the training dataset, their performance was assessed using the confusion matrix and the corresponding test error rate. From the output, we see that the KNN classifiers performed the best as they have the smallest error rate, and the naives bayes classifier performed the worst.
We also see that the KNN classifier with K=3 and K=5 performed the best.