Exercises from ISLR, Chapter 4
knitr::opts_chunk$set(echo = TRUE)
library(ISLR2)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071)
weekly = Weekly
str(weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
plot(weekly)
weekly %>%
ggplot(aes(x = Direction, y = Year, fill = Direction, colour = Direction)) +
geom_boxplot(alpha = 0.5)
weekly %>%
ggplot(aes(x = Direction, y = Today, fill = Direction, colour = Direction)) +
geom_boxplot(alpha = 0.5)
weekly %>%
ggplot(aes(x = Year, y = Volume)) +
geom_boxplot(aes(group = Year), alpha = 0.5)
skimr::skim(weekly)
Name | weekly
---|---
Number of rows | 1089
Number of columns | 9
Column type frequency: factor | 1
Column type frequency: numeric | 8
Group variables | None

Variable type: factor

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
---|---|---|---|---|---
Direction | 0 | 1 | FALSE | 2 | Up: 605, Dow: 484

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
---|---|---|---|---|---|---|---|---|---|---
Year | 0 | 1 | 2000.05 | 6.03 | 1990.00 | 1995.00 | 2000.00 | 2005.00 | 2010.00 | ▇▆▆▆▆
Lag1 | 0 | 1 | 0.15 | 2.36 | -18.20 | -1.15 | 0.24 | 1.41 | 12.03 | ▁▁▆▇▁
Lag2 | 0 | 1 | 0.15 | 2.36 | -18.20 | -1.15 | 0.24 | 1.41 | 12.03 | ▁▁▆▇▁
Lag3 | 0 | 1 | 0.15 | 2.36 | -18.20 | -1.16 | 0.24 | 1.41 | 12.03 | ▁▁▆▇▁
Lag4 | 0 | 1 | 0.15 | 2.36 | -18.20 | -1.16 | 0.24 | 1.41 | 12.03 | ▁▁▆▇▁
Lag5 | 0 | 1 | 0.14 | 2.36 | -18.20 | -1.17 | 0.23 | 1.41 | 12.03 | ▁▁▆▇▁
Volume | 0 | 1 | 1.57 | 1.69 | 0.09 | 0.33 | 1.00 | 2.05 | 9.33 | ▇▂▁▁▁
Today | 0 | 1 | 0.15 | 2.36 | -18.20 | -1.15 | 0.24 | 1.41 | 12.03 | ▁▁▆▇▁
There is a clear relationship between Year and Volume. It is also very important to point out that Today and Direction are perfectly associated: Direction is Up exactly when Today is positive and Down otherwise, so Today must be excluded from any model predicting Direction.
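A quick cross-tabulation confirms this (a minimal check; it assumes no week has Today exactly equal to 0):

table(weekly$Today > 0, weekly$Direction, dnn = c("Today > 0", "Direction"))

All 1089 observations should fall in the FALSE/Down and TRUE/Up cells.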
glm1 = glm(Direction ~ . - Today - Year, data = weekly, family = "binomial")
summary(glm1)
##
## Call:
## glm(formula = Direction ~ . - Today - Year, family = "binomial",
## data = weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only Lag2 appears to be statistically significant.
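As a programmatic double-check (a sketch assuming the broom package is installed), we can filter the coefficient table by p-value; only the intercept and Lag2 should survive a 0.05 cutoff:

broom::tidy(glm1) %>%
  filter(p.value < 0.05)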
glm1_prob = predict(glm1, type = "response")
glm1_pred = rep("Down", length(glm1_prob))
glm1_pred[glm1_prob > 0.5] = "Up"
table(glm1_pred, weekly$Direction)
##
## glm1_pred Down Up
## Down 54 48
## Up 430 557
The confusion matrix tells us that our model is much more likely to predict Up than Down, leading to a lot of false positives. Our model makes the correct prediction roughly 56% of the time ((54 + 557) / 1089 ≈ 0.561). We could potentially address this issue by changing the threshold of 0.5.
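For example, a higher cutoff would classify more weeks as Down (the 0.55 value here is arbitrary, chosen purely for illustration):

glm1_pred_55 = rep("Down", length(glm1_prob))
glm1_pred_55[glm1_prob > 0.55] = "Up"  # predict Up only when the model is more confident
table(glm1_pred_55, weekly$Direction)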
train_weekly = weekly %>%
filter(Year %in% c(1990:2008))
test_weekly = weekly %>%
filter(Year > 2008)
glm2 = glm(Direction ~ Lag2, data = train_weekly, family = "binomial")
summary(glm2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = train_weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
glm2_prob = predict(glm2, test_weekly, type = "response")
glm2_pred = rep("Down", length(glm2_prob))  # 104 test weeks (2009-2010)
glm2_pred[glm2_prob > 0.5] = "Up"
table(glm2_pred, test_weekly$Direction)
##
## glm2_pred Down Up
## Down 9 5
## Up 34 56
This model also vastly over-predicts Up, but it is slightly improved over the previous model, making the correct prediction roughly 63% of the time.
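The same number can be computed directly rather than read off the table; from the counts above, (9 + 56) / 104 = 0.625:

mean(glm2_pred == test_weekly$Direction)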
lda1 = MASS::lda(Direction ~ Lag2, data = train_weekly)
lda1
## Call:
## lda(Direction ~ Lag2, data = train_weekly)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda1_pred = predict(lda1, test_weekly)$class
table(lda1_pred, test_weekly$Direction)
##
## lda1_pred Down Up
## Down 9 5
## Up 34 56
The confusion matrix from the LDA model is identical to that of the previous glm model: we correctly predicted market direction roughly 63% of the time. This is not surprising, since logistic regression and LDA both produce linear decision boundaries and often give very similar classifications.
qda1 = MASS::qda(Direction ~ Lag2, data = train_weekly)
qda1
## Call:
## qda(Direction ~ Lag2, data = train_weekly)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda1_pred = predict(qda1, test_weekly)$class
table(qda1_pred, test_weekly$Direction)
##
## qda1_pred Down Up
## Down 0 0
## Up 43 61
The QDA model predicted that every week would be an Up week. Because Up weeks are the majority class in the test set, this still resulted in a 59% accuracy rate (61 / 104 ≈ 0.587).
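To see why, we can inspect the fitted posterior probabilities: if the posterior for Up exceeds 0.5 on every test week, the all-Up predictions follow (a quick check):

qda1_post = predict(qda1, test_weekly)$posterior
summary(qda1_post[, "Up"])  # the minimum should be above 0.5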
train_KNN = data.frame(train_weekly$Lag2)
test_KNN = data.frame(test_weekly$Lag2)
train_direction_KNN = train_weekly$Direction
set.seed(23)
knn_pred = class::knn(train_KNN, test_KNN, train_direction_KNN, k = 1)
table(knn_pred, test_weekly$Direction)
##
## knn_pred Down Up
## Down 21 29
## Up 22 32
The KNN model achieved 51% accuracy with k = 1.
nb1 = naiveBayes(Direction ~ Lag2, data = train_weekly)
nb1
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
nb1_class = predict(nb1, test_weekly)
table(nb1_class, test_weekly$Direction)
##
## nb1_class Down Up
## Down 0 0
## Up 43 61
The Naive Bayes model performed the same as the QDA model and predicted Up for every week, again resulting in a 59% accuracy rate.
The glm and lda models perform the best if we are only considering accuracy as our measure of performance.
I am going to transform the data set by summing Lag1, Lag3, Lag4, and Lag5 into a new variable, Lag1345, which I will include in my models alongside Lag2.
weekly_trans = weekly
weekly_trans$Lag1345 = weekly_trans$Lag1 + weekly_trans$Lag3 + weekly_trans$Lag4 + weekly_trans$Lag5
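Equivalently, the same column can be built in one step with dplyr (a sketch producing an identical data frame):

weekly_trans = weekly %>%
  mutate(Lag1345 = Lag1 + Lag3 + Lag4 + Lag5)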
I will now split the data into training and test sets.
train_weekly_trans = weekly_trans %>%
filter(Year %in% c(1990:2008))
test_weekly_trans = weekly_trans %>%
filter(Year > 2008)
cor(weekly_trans$Today, weekly_trans$Lag2)
## [1] 0.05916672
cor(weekly_trans$Today, weekly_trans$Lag1345)
## [1] -0.07358489
glm model
glm3 = glm(Direction ~ Lag2 + Lag1345, data = train_weekly_trans, family = "binomial")
summary(glm3)
##
## Call:
## glm(formula = Direction ~ Lag2 + Lag1345, family = "binomial",
## data = train_weekly_trans)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22030 0.06502 3.388 0.000704 ***
## Lag2 0.05423 0.02900 1.870 0.061460 .
## Lag1345 -0.03050 0.01505 -2.027 0.042695 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1346.3 on 982 degrees of freedom
## AIC: 1352.3
##
## Number of Fisher Scoring iterations: 4
glm3_prob = predict(glm3, test_weekly_trans, type = "response")
glm3_pred = rep("Down", length(glm3_prob))  # 104 test weeks
glm3_pred[glm3_prob > 0.5] = "Up"
table(glm3_pred, test_weekly_trans$Direction)
##
## glm3_pred Down Up
## Down 11 7
## Up 32 54
mean(glm3_pred == test_weekly_trans$Direction)
## [1] 0.625
Interestingly, we got the same accuracy as the previous glm model; however, this model is better at predicting Down.
lda model
lda2 = MASS::lda(Direction ~ Lag2 + Lag1345, data = train_weekly_trans)
lda2
## Call:
## lda(Direction ~ Lag2 + Lag1345, data = train_weekly_trans)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2 Lag1345
## Down -0.03568254 0.8335941
## Up 0.26036581 0.2125257
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.2871243
## Lag1345 -0.1593457
lda2_pred = predict(lda2, test_weekly_trans)$class
table(lda2_pred, test_weekly_trans$Direction)
##
## lda2_pred Down Up
## Down 10 7
## Up 33 54
mean(lda2_pred == test_weekly_trans$Direction)
## [1] 0.6153846
Performed slightly worse than the glm, with an accuracy of roughly 62%.
qda model
qda2 = MASS::qda(Direction ~ Lag2 + Lag1345, data = train_weekly_trans)
qda2
## Call:
## qda(Direction ~ Lag2 + Lag1345, data = train_weekly_trans)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2 Lag1345
## Down -0.03568254 0.8335941
## Up 0.26036581 0.2125257
qda2_pred = predict(qda2, test_weekly_trans)$class
table(qda2_pred, test_weekly_trans$Direction)
##
## qda2_pred Down Up
## Down 0 0
## Up 43 61
Performed the same as the previous QDA model: every prediction is Up.
KNN model
train_KNN_trans = cbind(train_weekly_trans$Lag2, train_weekly_trans$Lag1345)
test_KNN_trans = cbind(test_weekly_trans$Lag2, test_weekly_trans$Lag1345)
train_direction_KNN_trans = train_weekly_trans$Direction
set.seed(23)
knn_pred_trans = class::knn(train_KNN_trans, test_KNN_trans, train_direction_KNN_trans, k = 1)
table(knn_pred_trans, test_weekly_trans$Direction)
##
## knn_pred_trans Down Up
## Down 21 29
## Up 22 32
set.seed(23)
knn_pred_trans = class::knn(train_KNN_trans, test_KNN_trans, train_direction_KNN_trans, k = 2)
table(knn_pred_trans, test_weekly_trans$Direction)
##
## knn_pred_trans Down Up
## Down 23 33
## Up 20 28
set.seed(23)
knn_pred_trans = class::knn(train_KNN_trans, test_KNN_trans, train_direction_KNN_trans, k = 3)
table(knn_pred_trans, test_weekly_trans$Direction)
##
## knn_pred_trans Down Up
## Down 16 30
## Up 27 31
Interestingly, the KNN model with k = 2 was more likely to predict Down than the other KNN models. The KNN model with k = 1 performed exactly the same as the earlier k = 1 model built on Lag2 alone.
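Rather than rerunning knn() by hand for each k, a short sweep collects the accuracies in one place (a sketch; reseeding inside the loop keeps the random tie-breaking consistent with the individual runs above):

sapply(1:3, function(k) {
  set.seed(23)  # reseed so tie-breaking matches the runs above
  pred = class::knn(train_KNN_trans, test_KNN_trans, train_direction_KNN_trans, k = k)
  mean(pred == test_weekly_trans$Direction)
})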
naive bayes model
nb2 = naiveBayes(Direction ~ Lag2 + Lag1345, data = train_weekly_trans)
nb2
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
##
## Lag1345
## Y [,1] [,2]
## Down 0.8335941 4.328928
## Up 0.2125257 4.515744
nb2_class = predict(nb2, test_weekly_trans)
table(nb2_class, test_weekly_trans$Direction)
##
## nb2_class Down Up
## Down 9 11
## Up 34 50
Although this Naive Bayes model had worse accuracy than the previous one (roughly 57%), it did make some Down predictions, which the previous one did not.
Overall, the best model I have found is the glm that includes Lag2 and Lag1345 as predictors. Though it has the same accuracy as the glm with just Lag2 as the predictor, it made more balanced predictions.
auto = read_csv("~/Library/CloudStorage/OneDrive-Personal/Desktop/School/MSDA/Spring 2025/Predictive Modeling/Assignment 1/Auto.csv")
## Rows: 397 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): horsepower, name
## dbl (7): mpg, cylinders, displacement, weight, acceleration, year, origin
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(auto)
## spc_tbl_ [397 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ mpg : num [1:397] 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num [1:397] 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num [1:397] 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr [1:397] "130" "165" "150" "150" ...
## $ weight : num [1:397] 3504 3693 3436 3433 3449 ...
## $ acceleration: num [1:397] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num [1:397] 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num [1:397] 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr [1:397] "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "spec")=
## .. cols(
## .. mpg = col_double(),
## .. cylinders = col_double(),
## .. displacement = col_double(),
## .. horsepower = col_character(),
## .. weight = col_double(),
## .. acceleration = col_double(),
## .. year = col_double(),
## .. origin = col_double(),
## .. name = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
skimr::skim(auto)
Name | auto
---|---
Number of rows | 397
Number of columns | 9
Column type frequency: character | 2
Column type frequency: numeric | 7
Group variables | None

Variable type: character

skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace
---|---|---|---|---|---|---|---
horsepower | 0 | 1 | 1 | 3 | 0 | 94 | 0
name | 0 | 1 | 6 | 36 | 0 | 304 | 0

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
---|---|---|---|---|---|---|---|---|---|---
mpg | 0 | 1 | 23.52 | 7.83 | 9 | 17.5 | 23.0 | 29.0 | 46.6 | ▆▇▆▃▁
cylinders | 0 | 1 | 5.46 | 1.70 | 3 | 4.0 | 4.0 | 8.0 | 8.0 | ▇▁▃▁▃
displacement | 0 | 1 | 193.53 | 104.38 | 68 | 104.0 | 146.0 | 262.0 | 455.0 | ▇▂▂▃▁
weight | 0 | 1 | 2970.26 | 847.90 | 1613 | 2223.0 | 2800.0 | 3609.0 | 5140.0 | ▇▇▅▅▂
acceleration | 0 | 1 | 15.56 | 2.75 | 8 | 13.8 | 15.5 | 17.1 | 24.8 | ▁▆▇▂▁
year | 0 | 1 | 75.99 | 3.69 | 70 | 73.0 | 76.0 | 79.0 | 82.0 | ▇▆▇▆▇
origin | 0 | 1 | 1.57 | 0.80 | 1 | 1.0 | 1.0 | 2.0 | 3.0 | ▇▁▂▁▂
auto = na.omit(auto)  # no effect yet: missing horsepower values are coded as "?" strings, not NA
auto$horsepower = as.numeric(auto$horsepower)  # the "?" entries become NA here, hence the warning
## Warning: NAs introduced by coercion
auto$origin = as.factor(auto$origin)
mpg01 = rep(0, dim(auto)[1])
mpg01[auto$mpg > median(auto$mpg)] = 1
auto = data.frame(auto, mpg01)
auto$mpg01 = as.factor(auto$mpg01)
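Because the cutoff is the median, the two classes should be nearly balanced, which a quick count can confirm (ties at the median fall into class 0, so the split need not be exactly 50/50):

table(auto$mpg01)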
plot(auto)
auto %>%
ggplot(aes(x = mpg01, y = cylinders, fill = mpg01, colour = mpg01)) +
geom_boxplot(alpha = 0.5) +
theme_minimal()
auto %>%
ggplot(aes(x = mpg01, y = displacement, fill = mpg01, colour = mpg01)) +
geom_boxplot(alpha = 0.5) +
theme_minimal()
auto %>%
ggplot(aes(x = mpg01, y = horsepower, fill = mpg01, colour = mpg01)) +
geom_boxplot(alpha = 0.5) +
theme_minimal()
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
auto %>%
ggplot(aes(x = mpg01, y = weight, fill = mpg01, colour = mpg01)) +
geom_boxplot(alpha = 0.5) +
theme_minimal()
auto %>%
ggplot(aes(x = mpg01, y = acceleration, fill = mpg01, colour = mpg01)) +
geom_boxplot(alpha = 0.5) +
theme_minimal()
auto %>%
ggplot(aes(x = mpg01, y = year, fill = mpg01, colour = mpg01)) +
geom_boxplot(alpha = 0.5) +
theme_minimal()
ggplot(auto, aes(fill = origin, x = mpg01)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Origin is correlated with mpg")
Looking at these visualizations, it seems reasonable to conclude that cylinders, displacement, horsepower, and weight will be most useful in predicting mpg01. Origin and year may also be useful, but they do not seem as important as the predictors previously mentioned.
set.seed(23)
index <- sample(nrow(auto),0.8*nrow(auto),replace = F) # Setting training sample to be 80% of the data
train_auto <- auto[index,]
test_auto <- auto[-index,]
lda1_auto = MASS::lda(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = train_auto)
lda1_auto
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight +
## year + origin, data = train_auto)
##
## Prior probabilities of groups:
## 0 1
## 0.5096154 0.4903846
##
## Group means:
## cylinders displacement horsepower weight year origin2 origin3
## 0 6.641509 266.8679 127.36478 3574.906 74.22642 0.0754717 0.05031447
## 1 4.189542 115.6830 79.60784 2338.065 77.56863 0.3006536 0.35294118
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.321266330
## displacement 0.001538272
## horsepower 0.010696725
## weight -0.001398274
## year 0.164717235
## origin2 0.746028869
## origin3 0.536875850
lda1_auto_pred = predict(lda1_auto, test_auto)$class
table(lda1_auto_pred, test_auto$mpg01, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 40 0
## 1 6 34
1 - mean(lda1_auto_pred == test_auto$mpg01)
## [1] 0.075
We got an overall test error of 7.5%.
qda1_auto = MASS::qda(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = train_auto)
qda1_auto
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight +
## year + origin, data = train_auto)
##
## Prior probabilities of groups:
## 0 1
## 0.5096154 0.4903846
##
## Group means:
## cylinders displacement horsepower weight year origin2 origin3
## 0 6.641509 266.8679 127.36478 3574.906 74.22642 0.0754717 0.05031447
## 1 4.189542 115.6830 79.60784 2338.065 77.56863 0.3006536 0.35294118
qda1_auto_pred = predict(qda1_auto, test_auto)$class
table(qda1_auto_pred, test_auto$mpg01, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 39 0
## 1 7 34
1 - mean(qda1_auto_pred == test_auto$mpg01)
## [1] 0.0875
We got an overall test error of 8.75%.
glm1_auto = glm(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = train_auto, family = "binomial")
summary(glm1_auto)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + year + origin, family = "binomial", data = train_auto)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.944481 6.016814 -4.478 7.53e-06 ***
## cylinders -0.280399 0.497307 -0.564 0.57287
## displacement 0.020963 0.015701 1.335 0.18182
## horsepower -0.025287 0.019342 -1.307 0.19108
## weight -0.006885 0.001534 -4.488 7.17e-06 ***
## year 0.600381 0.105224 5.706 1.16e-08 ***
## origin2 2.642035 0.903624 2.924 0.00346 **
## origin3 1.313454 0.772377 1.701 0.08903 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 432.41 on 311 degrees of freedom
## Residual deviance: 121.29 on 304 degrees of freedom
## (5 observations deleted due to missingness)
## AIC: 137.29
##
## Number of Fisher Scoring iterations: 8
glm1_auto_prob = predict(glm1_auto, test_auto, type = "response")
glm1_auto_pred = rep(0, length(glm1_auto_prob))  # one prediction per test-set row
glm1_auto_pred[glm1_auto_prob > 0.5] = 1
table(glm1_auto_pred, test_auto$mpg01, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 41 1
## 1 5 33
1 - mean(glm1_auto_pred == test_auto$mpg01)
## [1] 0.075
Test error of 7.5%.
nb1_auto = naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = train_auto)
nb1_auto
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5047319 0.4952681
##
## Conditional probabilities:
## cylinders
## Y [,1] [,2]
## 0 6.637500 1.4941711
## 1 4.184713 0.6681553
##
## displacement
## Y [,1] [,2]
## 0 266.4500 92.12575
## 1 115.4299 36.25057
##
## horsepower
## Y [,1] [,2]
## 0 127.36478 37.98774
## 1 79.60784 16.40933
##
## weight
## Y [,1] [,2]
## 0 3570.531 697.2998
## 1 2336.497 393.6503
##
## year
## Y [,1] [,2]
## 0 74.22500 2.900770
## 1 77.57962 3.693597
##
## origin
## Y 1 2 3
## 0 0.8750000 0.0750000 0.0500000
## 1 0.3503185 0.3057325 0.3439490
nb1_auto_class = predict(nb1_auto, test_auto)
table(nb1_auto_class, test_auto$mpg01)
##
## nb1_auto_class 0 1
## 0 40 0
## 1 6 34
1 - mean(nb1_auto_class == test_auto$mpg01)
## [1] 0.075
Test error of 7.5%.
train_auto = na.omit(train_auto)  # drop the rows where the horsepower coercion produced NA
test_auto = na.omit(test_auto)
# cbind() turns the origin factor into its integer codes, so KNN treats it as numeric
knn_auto_train = cbind(train_auto$cylinders, train_auto$displacement, train_auto$horsepower, train_auto$weight, train_auto$year, train_auto$origin)
knn_auto_test = cbind(test_auto$cylinders, test_auto$displacement, test_auto$horsepower, test_auto$weight, test_auto$year, test_auto$origin)
knn_train_mpg = train_auto$mpg01
set.seed(23)
knn_auto_pred = class::knn(knn_auto_train, knn_auto_test, knn_train_mpg, k = 1)
table(knn_auto_pred, test_auto$mpg01)
##
## knn_auto_pred 0 1
## 0 38 0
## 1 8 34
1 - mean(knn_auto_pred == test_auto$mpg01)
## [1] 0.1
For k = 1, the test error is 10%.
set.seed(23)
knn_auto_pred = class::knn(knn_auto_train, knn_auto_test, knn_train_mpg, k = 2)
table(knn_auto_pred, test_auto$mpg01)
##
## knn_auto_pred 0 1
## 0 39 0
## 1 7 34
1 - mean(knn_auto_pred == test_auto$mpg01)
## [1] 0.0875
For k = 2, the test error is 8.75%.
set.seed(23)
knn_auto_pred = class::knn(knn_auto_train, knn_auto_test, knn_train_mpg, k = 3)
table(knn_auto_pred, test_auto$mpg01)
##
## knn_auto_pred 0 1
## 0 38 0
## 1 8 34
1 - mean(knn_auto_pred == test_auto$mpg01)
## [1] 0.1
For k = 3, the test error is 10%.
set.seed(23)
knn_auto_pred = class::knn(knn_auto_train, knn_auto_test, knn_train_mpg, k = 4)
table(knn_auto_pred, test_auto$mpg01)
##
## knn_auto_pred 0 1
## 0 40 1
## 1 6 33
1 - mean(knn_auto_pred == test_auto$mpg01)
## [1] 0.0875
For k = 4, the test error is 8.75%.
set.seed(23)
knn_auto_pred = class::knn(knn_auto_train, knn_auto_test, knn_train_mpg, k = 5)
table(knn_auto_pred, test_auto$mpg01)
##
## knn_auto_pred 0 1
## 0 36 0
## 1 10 34
1 - mean(knn_auto_pred == test_auto$mpg01)
## [1] 0.125
For k = 5, the test error is 12.5%.
The best values are k = 2 and k = 4, which tie at a test error of 8.75%.
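One caveat: these KNN fits use the predictors on their raw scales, so weight (in the thousands) dominates the Euclidean distances. A sketch of the same fit with standardized columns, applying the training means and SDs to the test set (whether this actually improves the error here would need to be checked):

# standardize the training matrix, then apply the same centering/scaling to the test matrix
knn_auto_train_sc = scale(knn_auto_train)
knn_auto_test_sc = scale(knn_auto_test,
                         center = attr(knn_auto_train_sc, "scaled:center"),
                         scale = attr(knn_auto_train_sc, "scaled:scale"))
set.seed(23)
knn_auto_pred_sc = class::knn(knn_auto_train_sc, knn_auto_test_sc, knn_train_mpg, k = 3)
1 - mean(knn_auto_pred_sc == test_auto$mpg01)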
boston = Boston
crim_med = rep(0, dim(boston)[1])
crim_med[boston$crim > median(boston$crim)] = 1
boston = data.frame(boston, crim_med)
cor(boston[,-14])[, "crim"]
## crim zn indus chas nox rm
## 1.00000000 -0.20046922 0.40658341 -0.05589158 0.42097171 -0.21924670
## age dis rad tax ptratio lstat
## 0.35273425 -0.37967009 0.62550515 0.58276431 0.28994558 0.45562148
## medv
## -0.38830461
The most strongly correlated variables are indus, nox, age, dis, rad, tax, lstat, and medv.
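The same shortlist can be pulled out programmatically; a sketch using an absolute-correlation cutoff of 0.35 (the cutoff is arbitrary, chosen to reproduce the list above):

crim_cors = cor(boston[, -14])[, "crim"]
names(crim_cors)[abs(crim_cors) > 0.35 & names(crim_cors) != "crim"]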
Convert crim_med to factor
boston$crim_med = as.factor(boston$crim_med)
Split the data
set.seed(23)
index <- sample(nrow(boston),0.8*nrow(boston),replace = F) # Setting training sample to be 80% of the data
train_boston <- boston[index,]
test_boston <- boston[-index,]
I am going to run through all the models, first using all the predictors except crim and then a second time using only the predictors highlighted above as most correlated.
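Since the same test-error computation repeats for every model below, a small helper would keep it consistent (a sketch; the code that follows spells the computation out explicitly instead):

err_rate = function(pred, actual) mean(pred != actual)  # equivalent to 1 - mean(pred == actual)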
glm1_bos = glm(crim_med ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = train_boston, family = "binomial")
glm1_bos_prob = predict(glm1_bos, test_boston, type = "response")
glm1_bos_pred = rep(0, length(glm1_bos_prob))  # 102 test rows
glm1_bos_pred[glm1_bos_prob > 0.5] = 1
table(glm1_bos_pred, test_boston$crim_med, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 39 4
## 1 7 52
1 - mean(glm1_bos_pred == test_boston$crim_med)
## [1] 0.1078431
With all the predictors, we have a test error rate of 10.8%.
glm2_bos = glm(crim_med ~ indus + nox + age + dis + rad + tax + lstat + medv, data = train_boston, family = "binomial")
glm2_bos_prob = predict(glm2_bos, test_boston, type = "response")
glm2_bos_pred = rep(0, length(glm2_bos_prob))  # 102 test rows
glm2_bos_pred[glm2_bos_prob > 0.5] = 1
table(glm2_bos_pred, test_boston$crim_med, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 41 10
## 1 5 46
1 - mean(glm2_bos_pred == test_boston$crim_med)
## [1] 0.1470588
With the selected predictors, we have a test error rate of 14.7%.
lda1_bos = MASS::lda(crim_med ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = train_boston)
lda1_bos_pred = predict(lda1_bos, test_boston)$class
table(lda1_bos_pred, test_boston$crim_med, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 40 12
## 1 6 44
1 - mean(lda1_bos_pred == test_boston$crim_med)
## [1] 0.1764706
With all the predictors, we have a test error rate of 17.6%.
lda2_bos = MASS::lda(crim_med ~ indus + nox + age + dis + rad + tax + lstat + medv, data = train_boston)
lda2_bos_pred = predict(lda2_bos, test_boston)$class
table(lda2_bos_pred, test_boston$crim_med, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 41 12
## 1 5 44
1 - mean(lda2_bos_pred == test_boston$crim_med)
## [1] 0.1666667
With the selected predictors, we have a test error rate of 16.7%.
nb1_bos = naiveBayes(crim_med ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, data = train_boston)
nb1_bos_pred = predict(nb1_bos, test_boston)
table(nb1_bos_pred, test_boston$crim_med, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 38 9
## 1 8 47
1 - mean(nb1_bos_pred == test_boston$crim_med)
## [1] 0.1666667
With all the predictors, we have a test error rate of 16.7%.
nb2_bos = naiveBayes(crim_med ~ indus + nox + age + dis + rad + tax + lstat + medv, data = train_boston)
nb2_bos_pred = predict(nb2_bos, test_boston)
table(nb2_bos_pred, test_boston$crim_med, dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 40 12
## 1 6 44
1 - mean(nb2_bos_pred == test_boston$crim_med)
## [1] 0.1764706
With the selected predictors, we have a test error rate of 17.6%.
knn_bos_train = cbind(train_boston$zn, train_boston$indus, train_boston$chas, train_boston$nox, train_boston$rm, train_boston$age, train_boston$dis, train_boston$rad, train_boston$tax, train_boston$ptratio, train_boston$lstat, train_boston$medv)
knn_bos_test = cbind(test_boston$zn, test_boston$indus, test_boston$chas, test_boston$nox, test_boston$rm, test_boston$age, test_boston$dis, test_boston$rad, test_boston$tax, test_boston$ptratio, test_boston$lstat, test_boston$medv)
knn_train_crim_med = train_boston$crim_med
set.seed(23)
knn_bos_pred = class::knn(knn_bos_train, knn_bos_test, knn_train_crim_med, k = 1)
table(knn_bos_pred, test_boston$crim_med)
##
## knn_bos_pred 0 1
## 0 40 2
## 1 6 54
1 - mean(knn_bos_pred == test_boston$crim_med)
## [1] 0.07843137
For our KNN model with all the predictors and k = 1, we have a test error rate of 7.8%.
set.seed(23)
knn_bos_pred = class::knn(knn_bos_train, knn_bos_test, knn_train_crim_med, k = 2)
table(knn_bos_pred, test_boston$crim_med)
##
## knn_bos_pred 0 1
## 0 40 2
## 1 6 54
1 - mean(knn_bos_pred == test_boston$crim_med)
## [1] 0.07843137
For our KNN model with all the predictors and k = 2, we have a test error rate of 7.8%.
set.seed(23)
knn_bos_pred = class::knn(knn_bos_train, knn_bos_test, knn_train_crim_med, k = 3)
table(knn_bos_pred, test_boston$crim_med)
##
## knn_bos_pred 0 1
## 0 41 2
## 1 5 54
1 - mean(knn_bos_pred == test_boston$crim_med)
## [1] 0.06862745
For our KNN model with all the predictors and k = 3, we have a test error rate of 6.9%.
set.seed(23)
knn_bos_pred = class::knn(knn_bos_train, knn_bos_test, knn_train_crim_med, k = 4)
table(knn_bos_pred, test_boston$crim_med)
##
## knn_bos_pred 0 1
## 0 42 2
## 1 4 54
1 - mean(knn_bos_pred == test_boston$crim_med)
## [1] 0.05882353
For our KNN model with all the predictors and k = 4, we have a test error rate of 5.9%.
set.seed(23)
knn_bos_pred = class::knn(knn_bos_train, knn_bos_test, knn_train_crim_med, k = 5)
table(knn_bos_pred, test_boston$crim_med)
##
## knn_bos_pred 0 1
## 0 42 2
## 1 4 54
1 - mean(knn_bos_pred == test_boston$crim_med)
## [1] 0.05882353
For our KNN model with all the predictors and k = 5, we have a test error rate of 5.9%.
knn2_bos_train = cbind(train_boston$indus, train_boston$nox, train_boston$age, train_boston$dis, train_boston$rad, train_boston$tax, train_boston$lstat, train_boston$medv)
knn2_bos_test = cbind(test_boston$indus, test_boston$nox, test_boston$age, test_boston$dis, test_boston$rad, test_boston$tax, test_boston$lstat, test_boston$medv)
knn2_train_crim_med = train_boston$crim_med
set.seed(23)
knn2_bos_pred = class::knn(knn2_bos_train, knn2_bos_test, knn2_train_crim_med, k = 1)
table(knn2_bos_pred, test_boston$crim_med)
##
## knn2_bos_pred 0 1
## 0 41 3
## 1 5 53
1 - mean(knn2_bos_pred == test_boston$crim_med)
## [1] 0.07843137
For our KNN model with the selected predictors and k = 1, we have a test error rate of 7.8%.
set.seed(23)
knn2_bos_pred = class::knn(knn2_bos_train, knn2_bos_test, knn2_train_crim_med, k = 2)
table(knn2_bos_pred, test_boston$crim_med)
##
## knn2_bos_pred 0 1
## 0 42 2
## 1 4 54
1 - mean(knn2_bos_pred == test_boston$crim_med)
## [1] 0.05882353
For our KNN model with the selected predictors and k = 2, we have a test error rate of 5.9%.
set.seed(23)
knn2_bos_pred = class::knn(knn2_bos_train, knn2_bos_test, knn2_train_crim_med, k = 3)
table(knn2_bos_pred, test_boston$crim_med)
##
## knn2_bos_pred 0 1
## 0 42 2
## 1 4 54
1 - mean(knn2_bos_pred == test_boston$crim_med)
## [1] 0.05882353
For our KNN model with the selected predictors and k = 3, we have a test error rate of 5.9%.
set.seed(23)
knn2_bos_pred = class::knn(knn2_bos_train, knn2_bos_test, knn2_train_crim_med, k = 4)
table(knn2_bos_pred, test_boston$crim_med)
##
## knn2_bos_pred 0 1
## 0 43 2
## 1 3 54
1 - mean(knn2_bos_pred == test_boston$crim_med)
## [1] 0.04901961
For our KNN model with the selected predictors and k = 4, we have a test error rate of 4.9%.
set.seed(23)
knn2_bos_pred = class::knn(knn2_bos_train, knn2_bos_test, knn2_train_crim_med, k = 5)
table(knn2_bos_pred, test_boston$crim_med)
##
## knn2_bos_pred 0 1
## 0 43 3
## 1 3 53
1 - mean(knn2_bos_pred == test_boston$crim_med)
## [1] 0.05882353
For our KNN model with the selected predictors and k = 5, we have a test error rate of 5.9%.
Overall, the best models were the KNN models. Specifically, the KNN with the selected predictors and k = 4 performed best, with a very low test error rate of 4.9%.