This question should be answered using the Weekly data
set, which is part of the ISLR2 package. This data is
similar in nature to the Smarket data from this chapter’s
lab, except that it contains 1, 089 weekly returns for 21 years, from
the beginning of 1990 to the end of 2010.
library(ISLR2)
library(scales)
library(MASS)
library(e1071)
library(class)
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
dim(Weekly)
## [1] 1089 9
View(Weekly)
Q13(a): Produce some numerical and graphical
summaries of the Weekly data. Do there appear to be any patterns?
A13(a): There are a few patterns that can be seen both
in the graphs and through correlations. The correlations between the lag
variables and today’s returns are close to zero meaning there appears to
be little correlation between today’s returns and previous days’
returns. The only substantial correlation is between Year
and Volume with a correlation of .84194162. On
plot that stood out is Volume by Year showing
that Volume is increasing over time.
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
attach(Weekly)
pairs(Weekly)
plot(Volume, Year)
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
Q13(b): Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
A13(b): The Lag2 predictor is the only
predictor that according to this model appears to be statistically
significant with a P value of 0.0296. All other variables in the model
are statistically insignificant with p-values above .05.
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data=Weekly,
family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Q13(c) Compute the confusion matrix and overall
fraction of correct predictions. Explain what the confusion matrix is
telling you about the types of mistakes made by logistic
regression.
A13(c) According to the confusion matrix there were 611
of 1089 correct predictions, resulting in a classification Accuracy/True
Positive of 56.107%” and an error rate (classifications) of 43.893%.
Sensitivity (accuracy for accounting for “Up”) is 92.066%“, and
Specificity (accuracy for accounting for”Down”) is 11.157%.
glm.prob = predict(glm.fit,
type="response")
glm.pred = rep("Down", 1089)
glm.pred[glm.prob > .5] = "Up"
table(glm.pred,
Weekly$Direction)
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
TruePos = mean(glm.pred==Direction)
print(paste("Accuracy/True Positive is", percent(TruePos, accuracy=.001)))
## [1] "Accuracy/True Positive is 56.107%"
print(paste("Error Rate", percent((1-TruePos), accuracy=.001)))
## [1] "Error Rate 43.893%"
Sensitivity = 557/(557+48)
print(paste("Sensitivity is", percent(Sensitivity, accuracy=.001)))
## [1] "Sensitivity is 92.066%"
Specificity = 54/(430+54)
print(paste("Specificity is", percent(Specificity, accuracy=.001)))
## [1] "Specificity is 11.157%"
Q13(d) Now fit the logistic regression model using a
training data period from 1990 to 2008, with Lag2 as the only predictor.
Compute the confusion matrix and the overall fraction of correct
predictions for the held out data (that is, the data from 2009 and
2010).
A13(d) LR - The logistic regression model Accuracy/True
Positive is 62.500%.
train = (Year < 2009)
test.2009.2010 = Weekly[!train,]
Direction.2009.2010 = Direction[!train]
lr.fit = glm(Direction ~ Lag2,
data = Weekly,
family = binomial,
subset = train)
lr.Prob = predict(lr.fit, test.2009.2010, type = "response")
lr.pred = rep("Down", 104)
lr.pred[lr.Prob > .5] = "Up"
table(lr.pred, Direction.2009.2010)
## Direction.2009.2010
## lr.pred Down Up
## Down 9 5
## Up 34 56
accuracy = mean(lr.pred==Direction.2009.2010)
print(paste("Log Reg Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "Log Reg Accuracy/True Positive is 62.500%"
Q13(e) Repeat (d) using LDA.
A13(e) LDA - The LDA model Accuracy/True Positive is
62.500%.
lda.fit = lda(Direction ~ Lag2,
data = Weekly,
subset = train)
lda.prob = predict(lda.fit, test.2009.2010, type = "response")
lda.class = lda.prob$class
table(lda.class, Direction.2009.2010)
## Direction.2009.2010
## lda.class Down Up
## Down 9 5
## Up 34 56
accuracy = mean(lda.class==Direction.2009.2010)
print(paste("LDA Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "LDA Accuracy/True Positive is 62.500%"
Q13(f) Repeat (d) using QDA.
A13(f) QDA - The QDA model Accuracy/True Positive is
58.654%. The QDA model didn’t predict any down predictions, questioning
its reliability.
qda.fit = qda(Direction ~ Lag2,
data = Weekly,
subset = train)
qda.prob = predict(qda.fit, test.2009.2010, type = "response")
qda.class = qda.prob$class
table(qda.class, Direction.2009.2010)
## Direction.2009.2010
## qda.class Down Up
## Down 0 0
## Up 43 61
accuracy = mean(qda.class==Direction.2009.2010)
print(paste("QDA Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "QDA Accuracy/True Positive is 58.654%"
Q13(g) Repeat (d) using KNN with K = 1.
A13(g) KNN - The KNN model Accuracy/True Positive is
50%.
train.set = Lag2[train]
test.set = Lag2[!train]
train.Direction = Direction[train]
length(train.set) == length(test.set)
## [1] FALSE
set.seed(1)
knn.pred = knn(data.frame(train.set),
data.frame(test.set),
train.Direction, k=1)
table(knn.pred, Direction.2009.2010)
## Direction.2009.2010
## knn.pred Down Up
## Down 21 30
## Up 22 31
accuracy = mean(knn.pred==Direction.2009.2010)
print(paste("KNN = 1 Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "KNN = 1 Accuracy/True Positive is 50.000%"
Q13(h) Repeat (d) using naive Bayes.
A13(h) naive Bayes - naive Bayes Accuracy/True Positive
is 58.654%.
Bayes.fit = naiveBayes(Direction ~ Lag2,
data = Weekly,
subset = train)
Bayes.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
Bayes.class = predict(Bayes.fit, test.set)
table(Bayes.class, Direction.2009.2010)
## Direction.2009.2010
## Bayes.class Down Up
## Down 0 0
## Up 43 61
accuracy = mean(Bayes.class == Direction.2009.2010)
print(paste("naive Bayes Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "naive Bayes Accuracy/True Positive is 58.654%"
Q13(i) Which of these methods appears to provide the
best results on this data?
A13(i) The model method that appears to provide the
best results on this data is the Logistic Regression and LDA models,
both with an accuracy of 62.5% and test error rate of 37.5%.
Q13(j) Experiment with different combinations of
predictors, including possible transformations and interactions, for
each of the methods. Report the variables, method, and associated
confusion matrix that appears to provide the best results on the held
out data. Note that you should also experiment with values for K in the
KNN classifier.
A13(j) From my various experiments, the best results
from the held out data was utilizing the LDA method and an interaction
between the Lag2 and Lag3 variables resulting
in an accuracy/true positive of 62.5%.
Bayes.fit = naiveBayes(Direction ~ Lag1,
data = Weekly,
subset = train)
Bayes.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag1
## Y [,1] [,2]
## Down 0.289444444 2.211721
## Up -0.009213235 2.308387
Bayes.class = predict(Bayes.fit, test.set)
table(Bayes.class, Direction.2009.2010)
## Direction.2009.2010
## Bayes.class Down Up
## Down 0 0
## Up 43 61
accuracy = mean(Bayes.class == Direction.2009.2010)
print(paste("Utilizing Lag1 naive Bayes Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "Utilizing Lag1 naive Bayes Accuracy/True Positive is 58.654%"
train.X=cbind(Lag1,Lag2)[train,]
test.X=cbind(Lag1,Lag2)[!train,]
train.Direction = Direction[train]
length(train.X) == length(test.X)
## [1] FALSE
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k=3)
table(knn.pred, Direction.2009.2010)
## Direction.2009.2010
## knn.pred Down Up
## Down 22 29
## Up 21 32
accuracy = mean(knn.pred==Direction.2009.2010)
print(paste("KNN = 3 Accuracy/True Positive is", percent(accuracy, accuracy=.001)))
## [1] "KNN = 3 Accuracy/True Positive is 51.923%"
lda.fit = lda(Direction ~ Lag2*Lag3,
data = Weekly,
subset = train)
lda.pred = predict(lda.fit,
test.2009.2010,
type = "response")
lda.class = lda.pred$class
table(lda.class, Direction.2009.2010)
## Direction.2009.2010
## lda.class Down Up
## Down 8 4
## Up 35 57
mean(lda.class == Direction.2009.2010)
## [1] 0.625
detach(Weekly)
In this problem, you will develop a model to predict whether a given
car gets high or low gas mileage based on the Auto data
set.
library(ISLR2)
library(scales)
library(MASS)
library(e1071)
library(class)
attach(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
dim(Auto)
## [1] 392 9
View(Auto)
Q14(a) Create a binary variable, mpg01, that
contains a 1 if mpg contains a value above its median, and a 0 if mpg
contains a value below its median. You can compute the median using the
median() function. Note you may find it helpful to use the data.frame()
function to create a single data set containing both mpg01 and the other
Auto variables.
A14(a): The median is for mpg is 22.75, the binary
variable, mpg01, was creating, it contains a 1 if mpg contains a value
above its median, and a 0 if mpg contains a value below the median.
median(mpg)
## [1] 22.75
mpg01 = rep(1, 392)
mpg01[mpg < median(mpg)] = 0
Auto2 = Auto
Auto2$mpg01 = mpg01
View(Auto2)
Q14(b) Explore the data graphically in order to
investigate the association between mpg01 and the other features. Which
of the other features seem most likely to be useful in predicting mpg01?
Scatterplots and boxplots may be useful tools to answer this question.
Describe your findings. A14(b) The variables
cylinders, weight, displacement,
and horsepower are most highly correlated with
mpg01 with a negative correlation.
cor(Auto2[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
pairs(Auto2[,-9])
library(corrplot)
corrplot(cor(Auto2[, -9]), method = "color")
Q14(c) Split the data into a training set and a test
set.
A14(c) Data has been split into a 70/30 ratio for
training and test sets.
train = Auto2[(1:(392*.7)),]
test = Auto2[(392*.7):392,]
target = test$mpg01
Q14(d) Perform LDA on the training data in order to
predict mpg01 using the variables that seemed most associated with mpg01
in (b). What is the test error of the model obtained?
A14(d) The test error for the LDA model with an 70/30
split is 15.25%.
lda.fit = lda(mpg01 ~ cylinders + weight + displacement + horsepower,
data = train)
lda.pred = predict(lda.fit, test, type='response')
lda.class = lda.pred$class
table(lda.class, target)
## target
## lda.class 0 1
## 0 19 16
## 1 2 81
lda.accuracy = mean(lda.class == target)
lda.error = 1 - mean(lda.class == target)
The test error for the LDA model with an 70/30 split is 15.254%.
Q14(e) Perform QDA on the training data in order to
predict mpg01 using the variables that seemed most associated with mpg01
in (b). What is the test error of the model obtained?
A14(e) The test error for the QDA model with an 70/30
split is 17.797%
qda.fit = qda(mpg01 ~ cylinders + weight + displacement + horsepower,
data = train)
qda.pred = predict(qda.fit, test, type='response')
qda.class = qda.pred$class
table(qda.class, target)
## target
## qda.class 0 1
## 0 19 19
## 1 2 78
qda.accuracy = mean(qda.class == target)
qda.error = 1 - mean(qda.class == target)
qda.error
## [1] 0.1779661
The test error for the QDA model with an 70/30 split is 17.797%.
Q14(f) Perform logistic regression on the training
data in order to predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the model
obtained?
A14(f) The test error for the Logistic Regression model
with an 70/30 split is 24.58%.
lr.fit = glm(mpg01 ~ cylinders + weight + displacement + horsepower,
data = train,
family = binomial)
lr.prob = predict(lr.fit, test, type = "response")
lr.pred = rep("0", 118)
lr.pred[lr.Prob > .5] = 1
table(lr.pred, target)
## target
## lr.pred 0 1
## 0 6 14
## 1 15 83
lr.accuracy = mean(lr.pred==target)
lr.error = 1 - lr.accuracy
lr.error
## [1] 0.2457627
The test error for the Logistic Regression model with an 70/30 split is 24.576%.
Q14(g) Perform naive Bayes on the training data in
order to predict mpg01 using the variables that seemed most associated
with mpg01 in (b). What is the test error of the model obtained?
A14(g) The test error for the naive Bayes model with an
70/30 split is 13.56%.
Bayes.fit = naiveBayes(mpg01 ~ cylinders + weight + displacement + horsepower,
data = train)
Bayes.class = predict(Bayes.fit, test)
table(Bayes.class, target)
## target
## Bayes.class 0 1
## 0 19 14
## 1 2 83
Bayes.accuracy = mean(Bayes.class == target)
Bayes.error = 1 - Bayes.accuracy
The test error for the naive Bayes model with an 70/30 split is 13.559%.
Q14(h) Perform KNN on the training data, with
several values of K, in order to predict mpg01. Use only the variables
that seemed most associated with mpg01 in (b). What test errors do you
obtain? Which value of K seems to perform the best on this data
set?
A14(h) 5 is value of K which seems to perform the best
on this data set with a test error rate of 19.49%. Here are the test
errors obtained using different K values. KNN = 1 Test Error is 22.034%.
KNN = 3 Test Error is 22.034%. KNN = 5 Test Error is 19.492%. KNN = 7
Test Error is 19.492%.
detach(Auto)
attach(Auto2)
train = cbind(cylinders, weight, displacement, horsepower)[(1:(392*.7)),]
test = cbind(cylinders, weight, displacement, horsepower)[(392*.7):392,]
train.target = Auto2$mpg01[(1:(392*.7))]
set.seed(1)
knn1.pred = knn(train = train, test = test, cl = train.target, k=1)
table(knn1.pred, target)
## target
## knn1.pred 0 1
## 0 19 24
## 1 2 73
knn.test.error = 1 - mean(knn1.pred==target)
print(paste("KNN = 1 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 1 Test Error is 22.034%"
knn3.pred = knn(train = train, test = test, cl = train.target, k=3)
table(knn3.pred, target)
## target
## knn3.pred 0 1
## 0 21 26
## 1 0 71
knn.test.error = 1- mean(knn3.pred==target)
print(paste("KNN = 3 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 3 Test Error is 22.034%"
knn5.pred = knn(train = train, test = test, cl = train.target, k=5)
table(knn5.pred, target)
## target
## knn5.pred 0 1
## 0 21 23
## 1 0 74
knn.test.error = 1 - mean(knn5.pred==target)
print(paste("KNN = 5 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 5 Test Error is 19.492%"
knn7.pred = knn(train = train, test = test, cl = train.target, k=7)
table(knn7.pred, target)
## target
## knn7.pred 0 1
## 0 21 23
## 1 0 74
knn.test.error = 1 - mean(knn7.pred==target)
print(paste("KNN = 7 Test Error is", percent(knn.test.error, accuracy=.001)))
## [1] "KNN = 7 Test Error is 19.492%"
Q16Using the Boston data set, fit classification
models in order to predict whether a given census tract has a crime rate
above or below the median. Explore logistic regression, LDA, naive
Bayes, and KNN models using various subsets of the predictors. Describe
your findings. Hint: You will have to create the response variable
yourself, using the variables that are contained in the Boston data
set.
A16 For this problem I started by fitting the logistic
regression, LDA, naive Bayes, and KNN models using a 70/30 train/set
split and predictors that have high correlation (above +/- .6) with
crim01 (nox, chas,
age, dis, rad, tax,
indus). Model performance from best to worst is Logistic
Regression model with an accuracy of 96.05%, KNN model where K = 3 with
an accuracy of 94.737%, LDA model with an accuracy of 93.42% and lastly
the naive Bayes model with an accuracy of 32.24%
attach(Boston)
Create response variable for crime rates above/below median
crim01 = rep(0, length(crim))
crim01[crim > median(crim)] = 1
Boston2 = Boston
Boston2$crim01 = crim01
View(Boston2)
Exploring variables and correlations
cor(Boston2)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crim01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crim01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crim01
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crim01 -0.35121093 0.4532627 -0.2630167 1.00000000
pairs(Boston2)
length(Boston2[,1])
## [1] 506
corrplot(cor(Boston2), method = "color")
Variables that show a high correlation (above +/- .6) with
crim01 are nox, chas,
age, dis, rad, tax,
indus.
Creating a train/test set with a 70/30 split.
train = Boston2[(1:(506*.7)),]
test = Boston2[((506*.7): 506), ]
test.target = test$crim01
attach(Boston2)
logistic regression The logistic regression model had an accuracy of 96.05%.
lr.fit = glm(crim01 ~ nox + chas + age + dis + rad + tax + indus,
data = train,
family = binomial)
lr.prob = predict(lr.fit, test, type = "response")
lr.pred = ifelse(lr.prob>0.5,1,0)
table(lr.pred, test$crim01)
##
## lr.pred 0 1
## 0 11 0
## 1 6 135
lr.accuracy.boston = mean(lr.pred == test$crim01)
mean(lr.pred == test$crim01)
## [1] 0.9605263
LDA Model The LDA model had an accuracy of 93.42%.
lda.fit = lda(crim01 ~ nox + chas + age + dis + rad + tax + indus,
data = train)
lda.pred = predict(lda.fit, test, type = "response")
lda.class = lda.pred$class
table(lda.class, test.target)
## test.target
## lda.class 0 1
## 0 7 0
## 1 10 135
mean(lda.class == test.target)
## [1] 0.9342105
naive Bayes The accuracy rate for the naive Bayes model with an 70/30 split is 32.24%.
Bayes.fit = naiveBayes(crim01 ~ nox + chas + age + dis + rad + tax + indus,
data = train)
Bayes.class = predict(Bayes.fit, test)
table(Bayes.class, test.target)
## test.target
## Bayes.class 0 1
## 0 4 90
## 1 13 45
mean(Bayes.class == test.target)
## [1] 0.3223684
KNN Out of the tested values, K=3 has the best accuracy rate. KNN = 1 accuracy is 10.526%, KNN = 3 accuracy is 94.737%, KNN = 5 accuracy is 94.737%, KNN = 7 accuracy is 94.737%”
train = cbind.data.frame(nox, chas, age, dis, rad, tax, indus)[(1:(506*.7)),]
test = cbind.data.frame(nox, chas, age, dis, rad, tax, indus)[(506*.7):506,]
train.target = crim01[(1:(506*.7))]
set.seed(1)
knn1.pred = knn(train = train, test = test, cl = train.target, k=1)
table(knn1.pred, test.target)
## test.target
## knn1.pred 0 1
## 0 16 135
## 1 1 0
knn1.accuracy = mean(knn1.pred==test.target)
print(paste("KNN = 1 accuracy is", percent(knn1.accuracy, accuracy=.001)))
## [1] "KNN = 1 accuracy is 10.526%"
knn3.pred = knn(train = train, test = test, cl = train.target, k=3)
table(knn3.pred, test.target)
## test.target
## knn3.pred 0 1
## 0 12 3
## 1 5 132
knn3.accuracy = mean(knn3.pred==test.target)
print(paste("KNN = 3 accuracy is", percent(knn3.accuracy, accuracy=.001)))
## [1] "KNN = 3 accuracy is 94.737%"
knn5.pred = knn(train = train, test = test, cl = train.target, k=5)
table(knn5.pred, test.target)
## test.target
## knn5.pred 0 1
## 0 12 3
## 1 5 132
knn5.accuracy = mean(knn5.pred==test.target)
print(paste("KNN = 5 accuracy is", percent(knn5.accuracy, accuracy=.001)))
## [1] "KNN = 5 accuracy is 94.737%"
knn7.pred = knn(train = train, test = test, cl = train.target, k=7)
table(knn7.pred, test.target)
## test.target
## knn7.pred 0 1
## 0 12 3
## 1 5 132
knn7.accuracy = mean(knn7.pred==test.target)
print(paste("KNN = 7 accuracy is", percent(knn7.accuracy, accuracy=.001)))
## [1] "KNN = 7 accuracy is 94.737%"