This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010. a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ISLR2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
library(class)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:ISLR2':
##
## Boston
##
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
corrplot(cor(Weekly[,-9]), method="square")
plot(Today~Lag1, col="darkred", data=Weekly)
simplelm = lm(Today~Lag1, data=Weekly)
abline(simplelm, lwd= 3, col= "darkgreen")
pairs(Weekly)
The
only variables that are shown to have a Linear relationship is Year
& Volume. The correlation plot doesn’t show that other variables are
linearly related. From the pair plot we see the exponential increase in
volume vs year. b) Use the full data set to perform a logistic
regression with Direction as the response and the five lag variables
plus Volume as predictors. Use the summary function to print the
results. Do any of the predictors appear to be statistically
significant? If so, which ones?
attach(Weekly)
Modlogreg<-glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly,family=binomial)
summary(Modlogreg)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only variable testing as statistically significant. The other variables fail to rejects the null hypothesis.Lag 1 is close to rejecting but fails to.
probs= predict(Modlogreg, type='response')
preds =rep("Down", length(probs))
preds[probs > 0.5] = "Up"
table(preds, Direction)
## Direction
## preds Down Up
## Down 54 48
## Up 430 557
For the determination of the percentage of the current predictions, the following was used total- (54+577)/(54+48+430+557)= 0.5611 UP- 557/(48+557)=0.9207 down- 54/(430+54)=0.1115 Once found the shows the model predicted the market trend correctly %56.11 of the time. The model also correctly predicted the UP weekly trends %92.07 of the time and the contrast down only %11.15 of the time.
hist(probs, breaks= 100, col= "darkred")
abline(v = mean(probs), lwd = 2)
plot(probs, col= ifelse(Weekly$Direction=="Down", "red","green"), pch=16)
abline(h = 0.5, lwd= 3)
train = (Year<2009)
Weekly0910 <-Weekly[!train,]
Weeklyfit<-glm(Direction~Lag2, data=Weekly,family=binomial, subset=train)
logWeeklyprob= predict(Weeklyfit, Weekly0910, type = "response")
logWeeklypred = rep("Down", length(logWeeklyprob))
logWeeklypred[logWeeklyprob > 0.5] = "Up"
Direction0910 = Direction[!train]
table(logWeeklypred, Direction0910)
## Direction0910
## logWeeklypred Down Up
## Down 9 5
## Up 34 56
mean(logWeeklypred == Direction0910)
## [1] 0.625
When separating the Weekly data set into separate training & testing data sets the model predicted the weekly trends at a rate of 62.5%, which is an improvement from the one model that used the whole data set. To add it also wasn’t better at predicting the upward trends to 91.80% & the downward trends to 20.93% which is better.
Weeklyldafit<-lda(Direction~Lag2, data=Weekly,family=binomial, subset=train)
Weeklyldapred<-predict(Weeklyldafit, Weekly0910)
table(Weeklyldapred$class, Direction0910)
## Direction0910
## Down Up
## Down 9 5
## Up 34 56
mean(Weeklyldapred$class==Direction0910)
## [1] 0.625
WHen using the Linear Discriminant Analysis to develop a classifaction model showed simular result as the Logistic regression did in part d.
Weeklyqdafit = qda(Direction ~ Lag2, data = Weekly, subset = train)
Weeklyqdapred = predict(Weeklyqdafit, Weekly0910)$class
table(Weeklyqdapred, Direction0910)
## Direction0910
## Weeklyqdapred Down Up
## Down 0 0
## Up 43 61
mean(Weeklyqdapred==Direction0910)
## [1] 0.5865385
The Quadratic Linear Analysis got the accuracy of 58.65% which is lower than the other methods. Also this model did not consider predicting the down trends and only considered the up trends. g) Repeat (d) using KNN with K = 1.
Weektrain=as.matrix(Lag2[train])
Weektest=as.matrix(Lag2[!train])
trainDirection =Direction[train]
set.seed(1)
Weekknnpred=knn(Weektrain,Weektest,trainDirection,k=1)
table(Weekknnpred,Direction0910)
## Direction0910
## Weekknnpred Down Up
## Down 21 30
## Up 22 31
mean(Weekknnpred == Direction0910)
## [1] 0.5
Showing a more random chance of 50% accuracy of the classifying model as resulted by yhr K-nearest nieghbors.
nbayes=naiveBayes(Direction~Lag2 ,data=Weekly ,subset=train)
nbayes
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
nbayes.class=predict(nbayes ,Weekly0910)
table(nbayes.class ,Direction0910)
## Direction0910
## nbayes.class Down Up
## Down 0 0
## Up 43 61
mean(nbayes.class == Direction0910)
## [1] 0.5865385
As shown the naive Bayes model has a prediction accuracy of 58.65% which is the same as the QDA model.
The Logistic Regression model and the Linear Discriminant Analysis both have the highest accuracy rate of 62.5% meaning they resulted the best within the model.
Weeklyfit<-glm(Direction~Lag2:Lag4+Lag2, data=Weekly,family=binomial, subset=train)
logWeeklyprob= predict(Weeklyfit, Weekly0910, type = "response")
logWeeklypred = rep("Down", length(logWeeklyprob))
logWeeklypred[logWeeklyprob > 0.5] = "Up"
Direction0910 = Direction[!train]
table(logWeeklypred, Direction0910)
## Direction0910
## logWeeklypred Down Up
## Down 3 4
## Up 40 57
mean(logWeeklypred == Direction0910)
## [1] 0.5769231
Weeklyldafit<-lda(Direction~Lag2:Lag4+Lag2, data=Weekly,family=binomial, subset=train)
Weeklyldapred<-predict(Weeklyldafit, Weekly0910)
table(Weeklyldapred$class, Direction0910)
## Direction0910
## Down Up
## Down 3 3
## Up 40 58
mean(Weeklyldapred$class==Direction0910)
## [1] 0.5865385
Weeklyqdafit = qda(Direction ~ poly(Lag2,2), data = Weekly, subset = train)
Weeklyqdapred = predict(Weeklyqdafit, Weekly0910)$class
table(Weeklyqdapred, Direction0910)
## Direction0910
## Weeklyqdapred Down Up
## Down 7 3
## Up 36 58
mean(Weeklyqdapred==Direction0910)
## [1] 0.625
Weektrain=as.matrix(Lag2[train])
Weektest=as.matrix(Lag2[!train])
trainDirection =Direction[train]
set.seed(1)
Weekknnpred=knn(Weektrain,Weektest,trainDirection,k=10)
table(Weekknnpred,Direction0910)
## Direction0910
## Weekknnpred Down Up
## Down 17 21
## Up 26 40
mean(Weekknnpred == Direction0910)
## [1] 0.5480769
Weektrain=as.matrix(Lag2[train])
Weektest=as.matrix(Lag2[!train])
trainDirection =Direction[train]
set.seed(1)
Weekknnpred=knn(Weektrain,Weektest,trainDirection,k=100)
table(Weekknnpred,Direction0910)
## Direction0910
## Weekknnpred Down Up
## Down 10 11
## Up 33 50
mean(Weekknnpred == Direction0910)
## [1] 0.5769231
detach(Weekly)
After the attempts of the different transformations for the different methods the original model results of the LDA and logistic regression still have the best accuracy rates. To add When squaring the Lag2 and using it with the QDA method, it resulted the same accuracy as the original LDA & Logistic regression models accuracy.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(tidyverse)
library(ISLR2)
library(corrplot)
library(MASS)
library(ggplot2)
library(e1071)
data("Auto")
mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
corrplot::corrplot.mixed(cor(Auto[, -9]), upper="circle")
pairs(Auto[, -9])
par(mfrow=c(2,3))
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")
There is an association that is shown with displacement, horsepower, weight, and cylinders. and the MPG01 variable.
set.seed(123)
train <- sample(1:dim(Auto)[1], dim(Auto)[1]*.7, rep=FALSE)
test <- -train
training_data<- Auto[train, ]
testing_data= Auto[test, ]
mpg01.test <- mpg01[test]
lda_model <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
lda_model
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
##
## Prior probabilities of groups:
## 0 1
## 0.4963504 0.5036496
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.786765 3641.022 275.2941 130.96324
## 1 4.188406 2314.000 114.5290 78.00725
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.3974647924
## weight -0.0009670704
## displacement -0.0029615583
## horsepower 0.0049004106
lda_pred = predict(lda_model, testing_data)
names(lda_pred)
## [1] "class" "posterior" "x"
pred.lda <- predict(lda_model, testing_data)
table(pred.lda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 50 3
## 1 10 55
mean(pred.lda$class != mpg01.test)
## [1] 0.1101695
as found above the Test error is 11.02% with the LDA model
qda_model = qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data=training_data)
qda_model
## Call:
## qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data = training_data)
##
## Prior probabilities of groups:
## 0 1
## 0.4963504 0.5036496
##
## Group means:
## cylinders horsepower weight acceleration
## 0 6.786765 130.96324 3641.022 14.55588
## 1 4.188406 78.00725 2314.000 16.55072
qda.class=predict(qda_model, testing_data)$class
table(qda.class, testing_data$mpg01)
##
## qda.class 0 1
## 0 53 4
## 1 7 54
mean(qda.class != testing_data$mpg01)
## [1] 0.09322034
AS as found above the Test error is 9.325 with the QDA model.
glm_model <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data, family = binomial)
summary(glm_model)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower,
## family = binomial, data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44120 -0.17870 0.08712 0.31147 3.05303
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.8103006 2.0819718 5.673 1.41e-08 ***
## cylinders 0.1869071 0.3972245 0.471 0.63797
## weight -0.0020251 0.0008573 -2.362 0.01817 *
## displacement -0.0164493 0.0095899 -1.715 0.08629 .
## horsepower -0.0443408 0.0172072 -2.577 0.00997 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 379.83 on 273 degrees of freedom
## Residual deviance: 138.27 on 269 degrees of freedom
## AIC: 148.27
##
## Number of Fisher Scoring iterations: 7
probs <- predict(glm_model, testing_data, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, mpg01.test)
## mpg01.test
## pred.glm 0 1
## 0 53 6
## 1 7 52
mean(pred.glm != mpg01.test)
## [1] 0.1101695
AS found above the Test error is 11.02% with the logistic regression model.
bayes.Auto=naiveBayes(mpg01~weight+acceleration+horsepower,data=training_data)
pred4.bayes=predict(bayes.Auto,testing_data)
table(pred4.bayes,mpg01.test)
## mpg01.test
## pred4.bayes 0 1
## 0 47 5
## 1 13 53
mean(pred4.bayes != mpg01.test)
## [1] 0.1525424
As found above the Test error is 15.25% with the Naïve Bayes model.
str(Auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : num 0 0 0 0 0 0 0 0 0 0 ...
data = scale(Auto[,-c(9,10)])
set.seed(1234)
train <- sample(1:dim(Auto)[1], 392*.7, rep=FALSE)
#train <- sample(1:dim(Auto)[1], dim(Auto)[1]*.7, rep=FALSE)
test <- -train
training_data = data[train,c("cylinders","horsepower","weight","acceleration")]
testing_data = data[test, c("cylinders", "horsepower","weight","acceleration")]
train.mpg01 = Auto$mpg01[train]
test.mpg01= Auto$mpg01[test]
set.seed(1234)
knn_pred_y = knn(training_data, testing_data, train.mpg01, k = 1)
table(knn_pred_y, test.mpg01)
## test.mpg01
## knn_pred_y 0 1
## 0 57 5
## 1 7 49
mean(knn_pred_y != test.mpg01)
## [1] 0.1016949
As found above the Test error is 10.17% with the KNN model with K being 1
knn_pred_y = NULL
error_rate = NULL
for(i in 1:dim(testing_data)[1]){
set.seed(1234)
knn_pred_y = knn(training_data,testing_data,train.mpg01,k=i)
error_rate[i] = mean(test.mpg01 != knn_pred_y)
}
min_error_rate = min(error_rate)
print(min_error_rate)
## [1] 0.09322034
K = which(error_rate == min_error_rate)
print(K)
## [1] 4
Now when me train the k=4 we get the lower rate of 9.32%
qplot(1:dim(testing_data)[1], error_rate, xlab = "K",
ylab = "Error Rate",
geom=c("point", "line"))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
library(MASS)
library(corrplot)
library(ggplot2)
library(tidyverse)
library(e1071)
library(ISLR2)
data("Boston")
crim01 <- rep(0, length(Boston$crim))
crim01[Boston$crim > median(Boston$crim)] <- 1
Boston <- data.frame(Boston, crim01)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv crim01
## Min. : 1.73 Min. : 5.00 Min. :0.0
## 1st Qu.: 6.95 1st Qu.:17.02 1st Qu.:0.0
## Median :11.36 Median :21.20 Median :0.5
## Mean :12.65 Mean :22.53 Mean :0.5
## 3rd Qu.:16.95 3rd Qu.:25.00 3rd Qu.:1.0
## Max. :37.97 Max. :50.00 Max. :1.0
set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crim01.test <- crim01[test]
fit.glm1 <- glm(crim01 ~ . - crim01 - crim, data = Boston, family = binomial)
fit.glm1
##
## Call: glm(formula = crim01 ~ . - crim01 - crim, family = binomial,
## data = Boston)
##
## Coefficients:
## (Intercept) zn indus chas nox rm
## -34.103704 -0.079918 -0.059389 0.785327 48.523782 -0.425596
## age dis rad tax ptratio black
## 0.022172 0.691400 0.656465 -0.006412 0.368716 -0.013524
## lstat medv
## 0.043862 0.167130
##
## Degrees of Freedom: 505 Total (i.e. Null); 492 Residual
## Null Deviance: 701.5
## Residual Deviance: 211.9 AIC: 239.9
corrplot::corrplot.mixed(cor(Boston[, -1]), upper="circle")
fit.glm <- glm(crim01 ~ nox + indus + age + rad, data = Boston, family = binomial)
probs <- predict(fit.glm, Boston.test, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, crim01.test)
## crim01.test
## pred.glm 0 1
## 0 68 18
## 1 7 59
mean(pred.glm != crim01.test)
## [1] 0.1644737
As found in the logistic regression model the test error rate is 16.45%
fit.lda <- lda(crim01 ~ nox + indus + age + rad , data = Boston)
pred.lda <- predict(fit.lda, Boston.test)
table(pred.lda$class, crim01.test)
## crim01.test
## 0 1
## 0 72 25
## 1 3 52
mean(pred.lda$class != crim01.test)
## [1] 0.1842105
As seen in the LDA model, the test shows a test error rate of 18.42%. This showing as higher than the Logistic regression model.
pred.lda <- predict(fit.lda, Boston.test)
table(pred.lda$class, crim01.test)
## crim01.test
## 0 1
## 0 72 25
## 1 3 52
mean(pred.lda$class != crim01.test)
## [1] 0.1842105
As shown in the Naive Bayes model, the test shows a test error rate of 18.42%. which is the same as the LDA model
data = scale(Boston[,-c(1,15)])
set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
training_data = data[train, c("nox" , "indus" , "age" , "rad")]
testing_data = data[test, c("nox" , "indus" , "age" , "rad")]
train.crime01 = Boston$crim01[train]
test.crime01= Boston$crim01[test]
set.seed(1234)
knn_pred_y = knn(training_data, testing_data, train.crime01, k = 1)
table(knn_pred_y, test.crime01)
## test.crime01
## knn_pred_y 0 1
## 0 67 6
## 1 8 71
mean(knn_pred_y != test.crime01)
## [1] 0.09210526
As seen in the KNN model, the test shows a test error rate of 9.21% making it the lowest.
knn_pred_y = NULL
error_rate = NULL
for(i in 1:dim(testing_data)[1]){
set.seed(1234)
knn_pred_y = knn(training_data,testing_data,train.crime01,k=i)
error_rate[i] = mean(test.crime01 != knn_pred_y)
}
min_error_rate = min(error_rate)
print(min_error_rate)
## [1] 0.06578947
K = which(error_rate == min_error_rate)
print(K)
## [1] 4
Now once tried a loop to find the best/lowest error rate of 6.58% the best K is 4
qplot(1:dim(testing_data)[1], error_rate, xlab = "K",
ylab = "Error Rate",
geom=c("point", "line"))
In conclusion dues to having the lowest test error rate of 6.58% the best overall model to use for the Boston data set is the KNN model with the K as Four. This means it has the best accuracy rate of 93.42%.