In this lab I will be performing classification method on three datasets from the ISLR2 package, Weekly, Auto, and Boston. The following three questions I will answer are exercises 13, 14, and 16 from the ISLR textbook. I will perform necessary data analysis such as test/training splits, data summaries, data visualization, and variable creation. The methods I will be testing with will be Logistic Regression, LDA, QDA, Naive Bayes, and KNN. Using these methods I will determine the test error/accuracy score of the model by utilizing a confusion matrix. Please feel free to toggle the code Show/Hide button to view my code.
library(tidyverse)
library(ISLR2)
library(gtsummary)
library(gt)
library(MASS)
library(class)
library(e1071)
library(rsample)
This question should be Interpretationed using the Weekly data set, which is part of the ISLP package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
dim(Weekly)
## [1] 1089 9
cor(Weekly [,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
Interpretation: This Data set contains 8 numerical variables with one Factor variable – Direction. Direction has two levels “Up” and “Down” that can be used in a classification problem. There is an interesting upward relationship between volume and year showing the newer stocks contain a higher volume of shares traded.The only significant correlation appears to be between Volume and Year. The Volume variable increases as time goes on visible in the graph below.
# Expore the Volume variable
attach(Weekly)
plot(Volume, col="maroon", pch =8, cex=0.7)
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
logistic = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(logistic)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Interpretation: The only variable statistically significant variable in the model would be Lag2. With a P value of 0.024503 Lag2 is a significant predictor of direction although the differences between the boxplot is very minimal this remains the strongest predictor.
Weekly |>
ggplot(aes(
x=Direction,
y=Lag2)) +
geom_boxplot(fill = "maroon", color="black") +
labs(title = "Lag2's Relationship with Direction")
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm.probs <- predict(logistic, type = "response")
glm.pred <- rep("Down", 1089)
glm.pred[glm.probs > .5] = "Up"
# Create the table
t=table(glm.pred, Weekly$Direction)
t |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | Down | Up |
|---|---|---|
| Down | 54 | 48 |
| Up | 430 | 557 |
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652
Interpretation: Here the table is telling us the model predicts the stock market to go down 49 of the days and for it to go up the other 564 days. The other incorrect predictions are 41 and 435. I will compute a fraction to determine how often this model is correct. This model correctly predicted the direction of the market 56.3% of the time.
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
w = Weekly |> filter(Year > 1990 & Year <= 2008)
logistic1 = glm(Direction ~ Lag2, data = w, family = binomial)
summary(logistic1)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = w)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21128 0.06592 3.205 0.00135 **
## Lag2 0.06058 0.02940 2.061 0.03934 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1289.2 on 937 degrees of freedom
## Residual deviance: 1284.9 on 936 degrees of freedom
## AIC: 1288.9
##
## Number of Fisher Scoring iterations: 4
glm.probs1 <- predict(logistic1, type = "response")
glm.pred1 <- rep("Down", 938)
glm.pred1[glm.probs1 > .5] = "Up"
t1=table(glm.pred1, w$Direction)
t1 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | Down | Up |
|---|---|---|
| Down | 22 | 19 |
| Up | 396 | 501 |
mean(glm.pred1 == w$Direction)
## [1] 0.5575693
Interpretation: Here the table is telling us the model predicts the stock market to go down 22 of the days between 1990 and 2008 and for it to go up the other 501 days. The other predictions are incorrect that read are 396 and 19. Between 1990 and 2008 we can predict the direction of the market with a 55.28% accuracy score.
Repeat (d) using LDA.
lda = lda(Direction ~ Lag2, data = w)
print(lda)
## Call:
## lda(Direction ~ Lag2, data = w)
##
## Prior probabilities of groups:
## Down Up
## 0.445629 0.554371
##
## Group means:
## Lag2
## Down -0.04077751
## Up 0.26935769
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.440232
# Computer the prediction scores
lda.pred <- predict(lda, w)
lda.class <- lda.pred$class
# Create the table and fraction
t2=table(lda.class, w$Direction)
t2 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | Down | Up |
|---|---|---|
| Down | 21 | 19 |
| Up | 397 | 501 |
mean(lda.class == w$Direction)
## [1] 0.5565032
Interpretation: The table is telling us the model predicts the stock market to go down 21 of the days between 1990 and 2008 and for it to go up the other 501 days. The other predictions are incorrect that read are 397 and 19. Between 1990 and 2008 we can predict the direction of the market with a 55.65% accuracy score.
Repeat (d) using QDA.
qda = qda(Direction ~ Lag2, data = w)
print(qda)
## Call:
## qda(Direction ~ Lag2, data = w)
##
## Prior probabilities of groups:
## Down Up
## 0.445629 0.554371
##
## Group means:
## Lag2
## Down -0.04077751
## Up 0.26935769
qda.class <- predict(qda, w)$class
# Convert to table
t3=table(qda.class, w$Direction)
t3 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | Down | Up |
|---|---|---|
| Down | 0 | 0 |
| Up | 418 | 520 |
mean(qda.class == w$Direction)
## [1] 0.554371
Interpretation: This models accuracy score is 55.44%.
Both models show that within the training data, the market goes down 44.56% of the time and 55.44% goes up. According to the model, the average down score of Lag2 was -0.04 and the average up score was .2694
Repeat (d) using KNN with K =1.
# Define train
train = Weekly |> filter(Year > 1990 & Year <= 2008)
test = Weekly |> filter(Year > 2008)
# Create test and training dataset for KNN
train.x = cbind(train$Lag2)
test.x = cbind(test$Lag2)
train.Direction = train$Direction
knn <- knn(train.x, test.x, train.Direction, k=1)
t5=table(knn, test$Direction)
t5 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | Down | Up |
|---|---|---|
| Down | 21 | 28 |
| Up | 22 | 33 |
mean(knn == test$Direction)
## [1] 0.5192308
Interpretation: This models accuracy score is 51.92%.
Repeat (d) using naive Bayes.
nb <- naiveBayes(Direction ~ Lag2, data = w)
nb.class <- predict(nb, w)
table(nb.class, w$Direction)
##
## nb.class Down Up
## Down 0 0
## Up 418 520
mean(nb.class == w$Direction)
## [1] 0.554371
Interpretation: This models accuracy score is 55.44%.
Which of these methods appears to provide the best results on this data?
Best Model: Out of the models that only assessed Lag2 the most accurate model was the Linear Discriminant Analysis model with a score of 55.65%.
Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
Using QDA and the following model I achieve a 57.36% accuracy score.
qda(Direction ~ Lag2 +Lag1+Lag5+Lag3, data = w)
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() method of the data frame. Note you may find it helpful to add a column mpg01 to the data frame by assignment. Assuming you have stored the data frame as Auto, this can be done as follows: Auto[‘mpg01’] = mpg01
auto = Auto |>
mutate(mpg01 = factor(case_when(
mpg > 22.75 ~ 1,
mpg <= 22.75 ~ 0))
)
glimpse(auto)
## Rows: 392
## Columns: 10
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
## $ mpg01 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, …
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to Interpretation this question. Describe your findings.
auto |> ggplot(
aes(x = mpg01,
y = horsepower)) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Horsepower on MPG") +
theme_gray()
auto |> ggplot(
aes(x = mpg01,
y = displacement)) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Displacement on MPG") +
theme_gray()
auto |> ggplot(
aes(x = mpg01,
y = weight)) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Weight on MPG") +
theme_gray()
auto |> ggplot(
aes(x = mpg01,
y = acceleration )) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Acceleration on MPG") +
theme_gray()
Interpretation: The useful variable in predicting a car with higher or lower case mileage appear to be Weight, Acceleration, Displacement, and Horsepower.
Split the data into a training set and a test set.
split <- initial_split(auto, prop = .80, strata = mpg01)
auto_test <- testing(split)
auto_training <- training(split)
glimpse(auto_test)
## Rows: 80
## Columns: 10
## $ mpg <dbl> 14, 21, 24, 26, 11, 25, 16, 14, 13, 18, 31, 35, 20, 14, 1…
## $ cylinders <int> 8, 6, 4, 4, 8, 4, 6, 8, 8, 6, 4, 4, 4, 8, 8, 8, 4, 4, 4, …
## $ displacement <dbl> 340, 200, 107, 121, 318, 113, 225, 351, 400, 250, 71, 72,…
## $ horsepower <int> 160, 85, 90, 113, 210, 95, 105, 153, 170, 88, 65, 69, 90,…
## $ weight <int> 3609, 2587, 2430, 2234, 4382, 2228, 3439, 4154, 4746, 313…
## $ acceleration <dbl> 8.0, 16.0, 14.5, 12.5, 13.5, 14.0, 15.5, 13.5, 12.0, 14.5…
## $ year <int> 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 7…
## $ origin <int> 1, 1, 2, 2, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 2, 2, 3, …
## $ name <fct> plymouth 'cuda 340, ford maverick, audi 100 ls, bmw 2002,…
## $ mpg01 <fct> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, …
glimpse(auto_training)
## Rows: 312
## Columns: 10
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 15, 14, 22, 1…
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 8, 8, 8, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 40…
## $ horsepower <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 15…
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
## $ mpg01 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_lda = lda(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training)
print(auto_lda)
## Call:
## lda(mpg01 ~ weight + acceleration + displacement + horsepower,
## data = auto_training)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## weight acceleration displacement horsepower
## 0 3614.397 14.41474 275.7372 131.80128
## 1 2336.026 16.47821 116.3942 78.57692
##
## Coefficients of linear discriminants:
## LD1
## weight -0.000938530
## acceleration 0.008114803
## displacement -0.008745945
## horsepower 0.004525878
lda.pred1 <- predict(auto_lda, auto_training)
lda.class1 <- lda.pred1$class
# Create the table and fraction
t6=table(lda.class1, auto_training$mpg01)
t6 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 129 | 5 |
| 1 | 27 | 151 |
42 / 312 = .135~
Interpretation: The LDA performed with a 13.5% test error on the model.
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_qda = qda(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training)
print(auto_qda)
## Call:
## qda(mpg01 ~ weight + acceleration + displacement + horsepower,
## data = auto_training)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## weight acceleration displacement horsepower
## 0 3614.397 14.41474 275.7372 131.80128
## 1 2336.026 16.47821 116.3942 78.57692
qda.pred1 <- predict(auto_qda, auto_training)
qda.class1 <- lda.pred1$class
# Create the table and fraction
t7=table(qda.class1, auto_training$mpg01)
t7 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 129 | 5 |
| 1 | 27 | 151 |
42 / 312 = .135~
Interpretation: The QDA performed with a 13.5% test error on the model.
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_logistic = glm(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training, family = binomial)
print(auto_logistic)
##
## Call: glm(formula = mpg01 ~ weight + acceleration + displacement +
## horsepower, family = binomial, data = auto_training)
##
## Coefficients:
## (Intercept) weight acceleration displacement horsepower
## 11.7484130 -0.0009667 -0.0416305 -0.0161292 -0.0585108
##
## Degrees of Freedom: 311 Total (i.e. Null); 307 Residual
## Null Deviance: 432.5
## Residual Deviance: 167.4 AIC: 177.4
glm.probs2 <- predict(auto_logistic, type = "response")
glm.pred2 <- rep("Down", 312)
glm.pred2[glm.probs2 > .5] = "Up"
# Create the table
t8=table(glm.pred2, auto_training$mpg01)
t8 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| Down | 138 | 12 |
| Up | 18 | 144 |
37 / 312 = .119 ~
Interpretation: The logistic regression model performs with a 11.9% test error on its model.
Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_nb <- naiveBayes(mpg01 ~ weight + acceleration + displacement + horsepower, data = auto_training)
print(auto_logistic)
##
## Call: glm(formula = mpg01 ~ weight + acceleration + displacement +
## horsepower, family = binomial, data = auto_training)
##
## Coefficients:
## (Intercept) weight acceleration displacement horsepower
## 11.7484130 -0.0009667 -0.0416305 -0.0161292 -0.0585108
##
## Degrees of Freedom: 311 Total (i.e. Null); 307 Residual
## Null Deviance: 432.5
## Residual Deviance: 167.4 AIC: 177.4
nb.pred1 <- predict(auto_nb, auto_training)
t9=table(nb.pred1, Actual = auto_training$mpg01)
# Format into table
t9 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 128 | 9 |
| 1 | 28 | 147 |
156 / 312 = .5 ~
Interpretation: The Naive Bayes model performs with a 50% test error on its model.
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
train_mpg01 = auto_training$mpg01
auto_training1 = auto_training |> dplyr::select(weight, acceleration, displacement, horsepower)
auto_test1 = auto_test |> dplyr::select(weight, acceleration, displacement, horsepower)
auto_knn <- knn(
train = auto_training1,
test = auto_test1,
cl = train_mpg01,
k=1)
print(auto_knn)
## [1] 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1
## [39] 1 1 1 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [77] 0 1 1 0
## Levels: 0 1
# Create the table
t10=table(Predicted = auto_knn, Actual = auto_test$mpg01)
t10 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 33 | 6 |
| 1 | 7 | 34 |
5 / 80 = .0625 ~
Interpretation: The KNN model performs with a 6.25% test error on its TEST model. The KNN model was the best model to predict whether a car will have a high MPG or low MPG.
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
# Create the binary variable based on the median 0.25651
boston = Boston |>
mutate(crim01 = factor(case_when(
crim > 0.25651 ~ 1,
crim <= 0.25651 ~ 0))
)
glimpse(boston)
## Rows: 506
## Columns: 15
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
## $ crim01 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
boston |> ggplot(
aes(x = crim01,
y = dis)) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Weighted distance to employment centers on Crime Rate") +
theme_gray()
boston |> ggplot(
aes(x = crim01,
y = zn)) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Zoned residential land proportion on Crime Rate") +
theme_gray()
boston |> ggplot(
aes(x = crim01,
y = nox)) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Nitric oxides concentration on Crime Rate") +
theme_gray()
boston |> ggplot(
aes(x = crim01,
y = age )) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Average rooms per dwelling on Crime Rate") +
theme_gray()
boston |> ggplot(
aes(x = crim01,
y = lstat )) +
geom_boxplot(fill = "darkred") +
labs(title = "The Affect of Acceleration on Crime Rate") +
theme_gray()
Interpretation: The useful variable in predicting crime rate in the Boston dataset are dis, zn, nox, age, and lstat
Split the data into a training set and a test set.
split1 <- initial_split(boston, prop = .80, strata = crim01)
boston_test <- testing(split1)
boston_training <- training(split1)
glimpse(boston_test)
## Rows: 102
## Columns: 15
## $ crim <dbl> 0.02731, 0.02985, 0.78420, 0.84054, 1.38799, 1.15172, 1.61282,…
## $ zn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 85, 0, 0, 0, 25, 0, 0, 0, 0, 0, 0, 0, …
## $ indus <dbl> 7.07, 2.18, 8.14, 8.14, 8.14, 8.14, 8.14, 5.96, 0.74, 10.81, 1…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.469, 0.458, 0.538, 0.538, 0.538, 0.538, 0.538, 0.499, 0.410,…
## $ rm <dbl> 6.421, 6.430, 5.990, 5.599, 5.950, 5.701, 6.096, 5.841, 6.383,…
## $ age <dbl> 78.9, 58.7, 81.7, 85.7, 82.0, 95.0, 96.9, 61.4, 35.7, 6.6, 53.…
## $ dis <dbl> 4.9671, 6.0622, 4.2579, 4.4546, 3.9900, 3.7872, 3.7598, 3.3779…
## $ rad <int> 2, 3, 4, 4, 4, 4, 4, 5, 2, 4, 5, 5, 4, 2, 2, 2, 5, 5, 5, 2, 4,…
## $ tax <dbl> 242, 222, 307, 307, 307, 307, 307, 279, 313, 305, 398, 398, 28…
## $ ptratio <dbl> 17.8, 18.7, 21.0, 21.0, 21.0, 21.0, 21.0, 19.2, 17.3, 19.2, 18…
## $ black <dbl> 396.90, 394.12, 386.75, 303.42, 232.60, 358.77, 248.31, 377.56…
## $ lstat <dbl> 9.14, 5.21, 14.67, 16.51, 27.71, 18.35, 20.34, 11.41, 5.77, 6.…
## $ medv <dbl> 21.6, 28.7, 17.5, 13.9, 13.2, 13.1, 13.5, 20.0, 24.7, 24.2, 21…
## $ crim01 <fct> 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
glimpse(boston_training)
## Rows: 404
## Columns: 15
## $ crim <dbl> 0.00632, 0.02729, 0.03237, 0.06905, 0.08829, 0.14455, 0.21124,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5,…
## $ indus <dbl> 2.31, 7.07, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 7.185, 6.998, 7.147, 6.012, 6.172, 5.631, 6.004, 6.377,…
## $ age <dbl> 65.2, 61.1, 45.8, 54.2, 66.6, 96.1, 100.0, 85.9, 94.3, 82.9, 3…
## $ dis <dbl> 4.0900, 4.9671, 6.0622, 6.0622, 5.5605, 5.9505, 6.0821, 6.5921…
## $ rad <int> 1, 2, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3,…
## $ tax <dbl> 296, 242, 222, 222, 311, 311, 311, 311, 311, 311, 311, 279, 27…
## $ ptratio <dbl> 15.3, 17.8, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 392.83, 394.63, 396.90, 395.60, 396.90, 386.63, 386.71…
## $ lstat <dbl> 4.98, 4.03, 2.94, 5.33, 12.43, 19.15, 29.93, 17.10, 20.45, 13.…
## $ medv <dbl> 24.0, 34.7, 33.4, 36.2, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21…
## $ crim01 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Perform LDA on the training data in order to predict crim01
boston_lda = lda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)
print(boston_lda)
## Call:
## lda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## dis zn nox age lstat
## 0 5.141737 20.638614 0.4712718 51.47525 9.628812
## 1 2.494462 1.306931 0.6382921 85.27871 15.857574
##
## Coefficients of linear discriminants:
## LD1
## dis -0.079498246
## zn -0.002608143
## nox 9.886254941
## age 0.007922197
## lstat -0.004336295
lda.pred2 <- predict(boston_lda, boston_training)
lda.class2 <- lda.pred2$class
# Create the table and fraction
t10 =table(lda.class2, boston_training$crim01)
t10 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 175 | 40 |
| 1 | 27 | 162 |
58 / 404 = 0.144~
Interpretation: The LDA performed with a 14.4% test error on the model.
Perform QDA on the training data in order to predict crim01
boston_qda = qda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)
print(boston_qda)
## Call:
## qda(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## dis zn nox age lstat
## 0 5.141737 20.638614 0.4712718 51.47525 9.628812
## 1 2.494462 1.306931 0.6382921 85.27871 15.857574
qda.pred2<- predict(boston_qda, boston_training)
qda.class2 <- lda.pred2$class
# Create the table and fraction
t11=table(qda.class2, boston_training$crim01)
t11 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 175 | 40 |
| 1 | 27 | 162 |
58 / 404 = 0.144~
Interpretation: The QDA performed with a 14.4% test error on the model.
Perform logistic regression on the training data in order to predict crim01
boston_logistic = glm(crim01 ~ dis + zn + nox + age + lstat, data = boston_training, family = binomial)
print(boston_logistic)
##
## Call: glm(formula = crim01 ~ dis + zn + nox + age + lstat, family = binomial,
## data = boston_training)
##
## Coefficients:
## (Intercept) dis zn nox age lstat
## -18.76335 0.32642 -0.05275 33.22231 0.00459 -0.03989
##
## Degrees of Freedom: 403 Total (i.e. Null); 398 Residual
## Null Deviance: 560.1
## Residual Deviance: 248.2 AIC: 260.2
glm.probs3 <- predict(boston_logistic, type = "response")
glm.pred3 <- rep("Down", 404)
glm.pred3[glm.probs3 > .5] = "Up"
# Create the table
t12=table(glm.pred3, boston_training$crim01)
t12 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| Down | 172 | 27 |
| Up | 30 | 175 |
53 / 404 = 0.131~
Interpretation: The logistic regression performed with a 13.1% test error on the model.
Perform naive Bayes on the training data in order to predict crim01
boston_nb <- naiveBayes(crim01 ~ dis + zn + nox + age + lstat, data = boston_training)
print(boston_nb)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## dis
## Y [,1] [,2]
## 0 5.141737 2.098737
## 1 2.494462 1.132570
##
## zn
## Y [,1] [,2]
## 0 20.638614 28.35210
## 1 1.306931 4.99899
##
## nox
## Y [,1] [,2]
## 0 0.4712718 0.05526323
## 1 0.6382921 0.09847185
##
## age
## Y [,1] [,2]
## 0 51.47525 25.85871
## 1 85.27871 19.15260
##
## lstat
## Y [,1] [,2]
## 0 9.628812 4.950626
## 1 15.857574 7.625921
nb.pred2 <- predict(boston_nb, boston_training)
t13=table(nb.pred2, Actual = boston_training$crim01)
# Format into table
t13 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 151 | 17 |
| 1 | 51 | 185 |
73 / 404 = .5 ~
Interpretation: The Naive Bayes model performs with a 18.1% test error on its model.
Perform KNN on the training data, with several values of K, in order to predict crim01
train_crim01 = boston_training$crim01
boston_training1 = boston_training |> dplyr::select(dis, zn, nox, age, lstat)
boston_test1 = boston_test |> dplyr::select(dis, zn, nox, age, lstat)
boston_knn <- knn(
train = boston_training1,
test = boston_test1,
cl = train_crim01,
k=1)
print(boston_knn)
## [1] 0 0 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0
## [38] 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1
## [75] 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1
## Levels: 0 1
# Create the table
t14=table(Predicted = boston_knn, Actual = boston_test$crim01)
t14 |>
as.data.frame.matrix() |>
rownames_to_column("Predicted") |>
gt()
| Predicted | 0 | 1 |
|---|---|---|
| 0 | 35 | 17 |
| 1 | 16 | 34 |
25 / 80 = .3125 ~
Interpretation: The KNN model performs with a 31.25% test error on its TEST model.
The best model to predict crime rates was the logistic regression model with a test error of 13.1%. The best variables to put into this model for predicting crime rates were dis, zn, nox, age, and lstat.