Use the Weekly data set from the ISLR package.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
data(Weekly)
Produce numerical and graphical summaries to show any patterns.
names(Weekly)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
pairs(Weekly)
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
library(corrplot)
corrplot(cor(Weekly[,-9]), method = 'circle')
Interpretation
The pairs plot and correlation matrix show that the only pair of variables with a substantial relationship is “Volume” and “Year” (correlation of about 0.84). All other correlations are close to 0, indicating little to no linear relationship.
plot(Weekly$Volume)
Interpretation
The plot shows that Volume (the average number of daily shares traded, in billions) increases over time.
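To make the trend explicit, a smoother can be overlaid on the scatterplot. This is a minimal sketch using base R's lowess; it is not part of the original analysis.
# Sketch: overlay a lowess smoother to highlight the upward trend in Volume.
plot(Weekly$Volume, xlab = "Week index", ylab = "Volume")
lines(lowess(Weekly$Volume), col = "red", lwd = 2)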
Logistic regression with “Direction” as the response and the five lag variables plus “Volume” as predictors. Identify which predictors appear to be significant.
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm_fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Interpretation
Only “Lag2” has a p-value below 0.05 (0.0296), so it is the only predictor with a statistically significant relationship to “Direction”. All of the other predictors have larger p-values and show no significant relationship.
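As a supplementary reading of the one significant coefficient (a sketch, not part of the original output), exponentiating puts it on the odds scale.
# Sketch: exp() of the Lag2 coefficient gives the multiplicative change in the
# odds of an "Up" week per one-unit increase in Lag2 (roughly 1.06 here).
exp(coef(glm_fit)["Lag2"])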
Confusion Matrix and overall Fraction of Correct Predictions.
glm_prob <- predict(glm_fit, type = 'response')
glm_pred <- rep("Down", length(glm_prob))
glm_pred[glm_prob > 0.5] <- "Up"
table(glm_pred, Weekly$Direction)
##
## glm_pred Down Up
## Down 54 48
## Up 430 557
mean(glm_pred == Weekly$Direction)
## [1] 0.5610652
Interpretation
Based on the confusion matrix, the overall accuracy of the model is 56.11%. Note that this is training accuracy, since the model was fit and evaluated on the same data, so it likely overstates performance on new weeks.
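The overall accuracy hides how asymmetric the errors are. A minimal base-R sketch that splits the matrix into sensitivity and specificity (the TP/TN/FP/FN names are introduced here for illustration):
# Sketch: break the training confusion matrix into error types.
cm <- table(glm_pred, Weekly$Direction)
TP <- cm["Up", "Up"]; FN <- cm["Down", "Up"]      # true "Up" weeks
TN <- cm["Down", "Down"]; FP <- cm["Up", "Down"]  # true "Down" weeks
c(sensitivity = TP / (TP + FN),  # ~0.92: most "Up" weeks are caught
  specificity = TN / (TN + FP))  # ~0.11: most "Down" weeks are missed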
Fit the logistic regression model on the training period 1990 to 2008 with “Lag2” as the only predictor. Compute the confusion matrix and overall fraction of correct predictions for the held-out data (2009 and 2010).
train <- (Weekly$Year < 2009)
Weekly_2009 <- Weekly[!train,]
dim(Weekly)
## [1] 1089 9
dim(Weekly_2009)
## [1] 104 9
Direction_2009 <- Weekly$Direction[!train]
glm_fit_2 <- glm(Direction ~ Lag2, data = Weekly, subset = train, family = binomial)
glm_prob_2 <- predict(glm_fit_2, Weekly_2009, type = 'response')
glm_pred_2 <- rep("Down", length(glm_prob_2))
glm_pred_2[glm_prob_2 > 0.5] <- "Up"
table(glm_pred_2, Direction_2009)
## Direction_2009
## glm_pred_2 Down Up
## Down 9 5
## Up 34 56
mean(glm_pred_2 == Direction_2009)
## [1] 0.625
Interpretation
Based on the confusion matrix, the accuracy rate of the model is 62.5%.
Repeat Part D using an LDA model.
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
lda_fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda_fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
plot(lda_fit)
lda_pred <- predict(lda_fit, Weekly_2009)
names(lda_pred)
## [1] "class" "posterior" "x"
lda_class <- lda_pred$class
table(lda_class, Direction_2009)
## Direction_2009
## lda_class Down Up
## Down 9 5
## Up 34 56
mean(lda_class == Direction_2009)
## [1] 0.625
Interpretation
Based on the confusion matrix, the accuracy rate of this model is also 62.5%; the confusion matrix is identical to the logistic regression's. The prior probabilities show the market went down in 44.77% of the training weeks and up in 55.23%. The group means show that “Lag2” averages slightly negative (-0.036) in weeks when the market went down and slightly positive (0.260) in weeks when it went up.
Repeat Part D using a QDA model.
qda_fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda_fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda_pred <- predict(qda_fit, Weekly_2009)
names(qda_pred)
## [1] "class" "posterior"
qda_class <- qda_pred$class
table(qda_class, Direction_2009)
## Direction_2009
## qda_class Down Up
## Down 0 0
## Up 43 61
mean(qda_class == Direction_2009)
## [1] 0.5865385
Interpretation
Based on the confusion matrix, the accuracy rate of this model is 58.65%. Note that QDA predicts “Up” for every week in the test set, so its accuracy is simply the fraction of weeks the market actually went up. The prior probabilities and group means match the LDA fit, since both are estimated from the same training data.
Repeat Part D using KNN with K = 1.
library(class)
train_X <- as.matrix(Weekly$Lag2[train])
test_X <- as.matrix(Weekly$Lag2[!train])
train_Direction <- Weekly$Direction[train]
length(train_X)
## [1] 985
length(test_X)
## [1] 104
length(train_Direction)
## [1] 985
set.seed(1)
knn_pred <- knn(train_X, test_X, train_Direction, k = 1)
table(knn_pred, Direction_2009)
## Direction_2009
## knn_pred Down Up
## Down 21 30
## Up 22 31
mean(knn_pred == Direction_2009)
## [1] 0.5
Interpretation
Based on the confusion matrix, the accuracy rate of this model is 50%, no better than random guessing.
Which method appears to produce the best results from the data?
Logistic Regression and Linear Discriminant Analysis (LDA) have the highest accuracy rates for the data models. Both are 62.5% accurate.
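For a side-by-side view, the four held-out accuracies computed above can be collected into a single data frame (a sketch that reuses the prediction objects already in the workspace):
# Sketch: summarize 2009-2010 test accuracy for each model fit above.
data.frame(model = c("Logistic", "LDA", "QDA", "KNN (K = 1)"),
           accuracy = c(mean(glm_pred_2 == Direction_2009),
                        mean(lda_class == Direction_2009),
                        mean(qda_class == Direction_2009),
                        mean(knn_pred == Direction_2009)))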
Experiment with different combinations of predictors, including interactions.
Logistic Regression with “Lag2” and “Lag2”:“Lag5”
glm_fit_3 <- glm(Direction ~ Lag2:Lag5 + Lag2, data = Weekly, family = binomial, subset = train)
glm_prob_3 <- predict(glm_fit_3, Weekly_2009, type = 'response')
glm_pred_3 <- rep("Down", length(glm_prob_3))
glm_pred_3[glm_prob_3 > 0.5] = "Up"
table(glm_pred_3, Direction_2009)
## Direction_2009
## glm_pred_3 Down Up
## Down 9 4
## Up 34 57
mean(glm_pred_3 == Direction_2009)
## [1] 0.6346154
KNN with K = 200
train_X_2 <- as.matrix(Weekly$Lag2[train])
test_X_2 <- as.matrix(Weekly$Lag2[!train])
train_Direction_2 <- Weekly$Direction[train]
set.seed(1)
knn_pred_2 <- knn(train_X_2, test_X_2, train_Direction_2, k = 200)
table(knn_pred_2, Direction_2009)
## Direction_2009
## knn_pred_2 Down Up
## Down 2 0
## Up 41 61
mean(knn_pred_2 == Direction_2009)
## [1] 0.6057692
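Rather than checking a single large K, a small grid scan shows how test accuracy varies with K (a sketch; knn() breaks distance ties at random, so the seed is reset before each fit):
# Sketch: scan several K values for the Lag2-only KNN classifier.
for (k in c(1, 5, 10, 50, 100, 200)) {
  set.seed(1)
  pred <- knn(train_X_2, test_X_2, train_Direction_2, k = k)
  cat("K =", k, "accuracy =", round(mean(pred == Direction_2009), 4), "\n")
}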
LDA with “Lag2” and “Lag2”:“Lag5”
lda_fit_2 <- lda(Direction ~ Lag2:Lag5 + Lag2, data = Weekly, subset = train)
lda_pred_2 <- predict(lda_fit_2, Weekly_2009)
table(lda_pred_2$class, Direction_2009)
## Direction_2009
## Down Up
## Down 9 4
## Up 34 57
mean(lda_pred_2$class == Direction_2009)
## [1] 0.6346154
QDA with “Lag2” and “Lag2”:“Lag5”
qda_fit_2 <- qda(Direction ~ Lag2:Lag5 + Lag2, data = Weekly, subset = train)
qda_pred_2 <- predict(qda_fit_2, Weekly_2009)
table(qda_pred_2$class, Direction_2009)
## Direction_2009
## Down Up
## Down 5 13
## Up 38 48
mean(qda_pred_2$class == Direction_2009)
## [1] 0.5096154
Interpretation
Based on the above experiments, logistic regression and LDA with “Lag2” plus the “Lag2”:“Lag5” interaction appear to be the most accurate models, both with a 63.46% accuracy rate.
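The predict-threshold-score pattern repeats for every logistic fit above. A small helper (hypothetical; the name glm_test_accuracy is introduced here) would keep further experiments compact:
# Sketch: wrap the repeated predict -> threshold at 0.5 -> accuracy steps.
glm_test_accuracy <- function(fit, newdata, truth, cutoff = 0.5) {
  prob <- predict(fit, newdata, type = "response")
  pred <- ifelse(prob > cutoff, "Up", "Down")
  mean(pred == truth)
}
glm_test_accuracy(glm_fit_3, Weekly_2009, Direction_2009)  # should match 0.6346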
Use the Auto data set to develop a model for predicting whether a car gets high or low gas mileage.
library(ISLR)
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
Create a binary variable, “mpg01”, that equals 1 if “mpg” is above its median and 0 otherwise.
mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] = 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
Explore the data graphically.
pairs(Auto)
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
corrplot(cor(Auto[,-9]), method = 'circle')
Interpretation
First, the pairs plot shows the pairwise relationships between variables, and the correlation matrix quantifies them. “mpg01” has a positive relationship with “acceleration”, “year”, and “origin”, and a negative relationship with “cylinders”, “displacement”, “horsepower”, and “weight”.
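Boxplots of each candidate predictor against “mpg01” make the separation easier to see than the pairs plot; a minimal sketch (mpg01 is coerced to a factor for plotting):
# Sketch: compare predictor distributions across the two mpg01 classes.
par(mfrow = c(2, 2))
for (v in c("cylinders", "displacement", "horsepower", "weight")) {
  boxplot(Auto[[v]] ~ factor(Auto$mpg01), xlab = "mpg01", ylab = v)
}
par(mfrow = c(1, 1))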
Split the data into training and testing sets.
auto_train <- (year %% 2 == 0)
training <- Auto[auto_train,]
testing <- Auto[!auto_train,]
Fit LDA on the training data to predict “mpg01” using the variables from Part B. What is the test error?
a_lda_fit <- lda(mpg01 ~ displacement + horsepower + weight + year + cylinders + origin, data = training)
a_lda_pred <- predict(a_lda_fit, testing)
table(a_lda_pred$class, testing$mpg01)
##
## 0 1
## 0 166 7
## 1 29 189
mean(a_lda_pred$class != testing$mpg01)
## [1] 0.09207161
Interpretation
The LDA model has a 9.21% test error rate.
Fit QDA on the training data to predict “mpg01” using the variables from Part B. What is the test error?
a_qda_fit <- qda(mpg01 ~ displacement + horsepower + weight + year + cylinders + origin, data = training)
a_qda_pred <- predict(a_qda_fit, testing)
table(a_qda_pred$class, testing$mpg01)
##
## 0 1
## 0 175 14
## 1 20 182
mean(a_qda_pred$class != testing$mpg01)
## [1] 0.08695652
Interpretation
The QDA model has an 8.70% test error rate.
Fit logistic regression on the training data to predict “mpg01” using the variables from Part B. What is the test error?
a_glm_fit <- glm(mpg01 ~ displacement + horsepower + weight + year + cylinders + origin, data = training, family = binomial)
a_glm_prob <- predict(a_glm_fit, testing, type = 'response')
a_glm_pred <- rep(0, length(a_glm_prob))
a_glm_pred[a_glm_prob > 0.5] = 1
table(a_glm_pred, testing$mpg01)
##
## a_glm_pred 0 1
## 0 172 13
## 1 23 183
mean(a_glm_pred != testing$mpg01)
## [1] 0.09207161
Interpretation
The logistic regression model has a 9.21% test error rate.
KNN to predict “mpg01” with the variables from Part B. What are the test errors? Which K values perform best?
KNN_auto_train <- cbind(displacement, horsepower, weight, cylinders, year, origin)[auto_train,]
KNN_auto_test <- cbind(displacement, horsepower, weight, cylinders, year, origin)[!auto_train,]
set.seed(1)
KNN_auto_pred <- knn(KNN_auto_train, KNN_auto_test, training$mpg01, k = 1)
mean(KNN_auto_pred != testing$mpg01)
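Since the question asks which K performs best, a grid scan is more informative than a single K (a sketch using the matrices built above):
# Sketch: test error for several K values; smaller is better.
for (k in c(1, 3, 5, 10, 25, 50, 100)) {
  set.seed(1)
  pred <- knn(KNN_auto_train, KNN_auto_test, training$mpg01, k = k)
  cat("K =", k, "test error =", round(mean(pred != testing$mpg01), 4), "\n")
}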
detach(Auto)
Use the Boston data set to fit classification models that predict whether a suburb has a higher or lower crime rate than the median.
Data Set
attach(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Crime Rate Variable
crime01 <- rep(0, length(crim))
crime01[crim > median(crim)] <- 1
Boston <- data.frame(Boston, crime01)
Training and Testing Data Sets
train_bos <- 1:(dim(Boston)[1]/2)
test_bos <- (dim(Boston)[1]/2 + 1):dim(Boston)[1]
training_b <- Boston[train_bos, ]
testing_b <- Boston[test_bos, ]
c01testing <- crime01[test_bos]
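One design note: the first-half/second-half split assumes the rows of Boston are in random order, which is not guaranteed. A seeded random split (a sketch; not used in the results below) avoids that assumption:
# Sketch: alternative random 50/50 split (commented out to leave the
# positional split above in effect).
# set.seed(1)
# train_bos <- sample(nrow(Boston), nrow(Boston) %/% 2)
# test_bos <- setdiff(seq_len(nrow(Boston)), train_bos)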
Visual Representation
pairs(Boston)
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crime01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crime01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crime01
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crime01 -0.35121093 0.4532627 -0.2630167 1.00000000
corrplot(cor(Boston), method = 'circle')
Interpretation
The pairs and correlation plots show that “indus”, “nox”, “age”, “dis”, “rad”, and “tax” have the strongest relationships with “crime01”.
Logistic Regression
set.seed(1)
GLM_B_fit <- glm(crime01 ~ indus + nox + age + dis + rad + tax, data = training_b, family = binomial)
GLM_B_prob <- predict(GLM_B_fit, testing_b, type = "response")
GLM_B_pred <- rep(0, length(GLM_B_prob))
GLM_B_pred[GLM_B_prob > 0.5] = 1
table(GLM_B_pred, c01testing)
## c01testing
## GLM_B_pred 0 1
## 0 75 8
## 1 15 155
mean(GLM_B_pred != c01testing)
## [1] 0.09090909
Interpretation
The model misclassifies 8 high-crime suburbs as low-crime (false negatives) and 15 low-crime suburbs as high-crime (false positives). Test error = 9.09%.
LDA
lda.boston.fit <- lda(crime01 ~ indus + nox + age + dis + rad + tax, data = training_b)
lda.boston.pred <- predict(lda.boston.fit, testing_b)
table(lda.boston.pred$class, c01testing)
## c01testing
## 0 1
## 0 81 18
## 1 9 145
mean(lda.boston.pred$class != c01testing)
## [1] 0.1067194
Interpretation
This model misclassifies 18 high-crime suburbs as low-crime and 9 low-crime suburbs as high-crime. Test error = 10.67%.
KNN K = 1
knn.boston.train <- cbind(indus, nox, age, dis, rad, tax)[train_bos,]
knn.boston.test <- cbind(indus, nox, age, dis, rad, tax)[test_bos,]
set.seed(1)
c01training <- crime01[train_bos]  # knn()'s cl argument must be the training labels, not the test labels
knn.boston.pred <- knn(knn.boston.train, knn.boston.test, c01training, k=1)
table(knn.boston.pred, c01testing)
## c01testing
## knn.boston.pred 0 1
## 0 31 155
## 1 59 8
mean(knn.boston.pred != c01testing)
## [1] 0.8458498
K = 50
knn.boston.pred <- knn(knn.boston.train, knn.boston.test, c01training, k=50)
mean(knn.boston.pred != c01testing)
## [1] 0.2648221
K = 100
knn.boston.pred <- knn(knn.boston.train, knn.boston.test, c01training, k=100)
mean(knn.boston.pred != c01testing)
## [1] 0.3043478
Interpretation
With K = 1, the confusion matrix shows 155 high-crime suburbs misclassified as low-crime and 59 low-crime suburbs misclassified as high-crime. Test errors: K = 1: 84.58% | K = 50: 26.48% | K = 100: 30.43%. K = 50 has the lowest test error of the KNN fits. After running all of the models, logistic regression is the most accurate, with only a 9.09% test error rate.
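With the training labels corrected (see the KNN code above), a grid scan over K would give a fairer comparison; a sketch:
# Sketch: re-scan K using the training labels rather than the test labels.
for (k in c(1, 5, 10, 50, 100)) {
  set.seed(1)
  pred <- knn(knn.boston.train, knn.boston.test, c01training, k = k)
  cat("K =", k, "test error =", round(mean(pred != c01testing), 4), "\n")
}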