library(ISLR2)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:ISLR2':
##
## Boston
library(class)
library(e1071)
#Load Weekly dataset
data('Weekly')
Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly)
corrplot(cor(Weekly[, sapply(Weekly, is.numeric)]), method = 'number')
ggplot(Weekly, aes(x = Year, y = Volume, group = Year)) +
geom_boxplot() +
ggtitle('Trading Volume has increased over time')
ggplot(Weekly, aes(x = Today)) +
geom_histogram(binwidth = 0.5, fill = 'blue', color = 'black') +
ggtitle('Returns has a normally distributed behavior')
ggplot(Weekly, aes(x = Direction, y = Volume, fill = Direction)) +
geom_boxplot() +
ggtitle("Volume does not differ between Directions")
Lags are likely to be multicollinear because the value of one, directly affects the other, and summary statistics look eerily similar.
The only strong correlation is the positive correlation between
Volume and Year. Volume's trend
is positive with each increase in Year.
Today has a normal distribution. This shows that there
aren’t more positive or negative returns, they are about the same.
Volume does not differ between Direction's
which shows that Trading Volume is probably not associated with
Direction.
Our data seems to be balanced. Direction has 484 ‘Down’
and 605 ‘Up’.
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
logistic_model_13b <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = 'binomial')
summary(logistic_model_13b)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only predictor in our model that is
statistically significant. P-value(0.0296) is lower than 0.05.
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
#Probabilities
pred_probs_13c <- predict(logistic_model_13b, type = 'response')
#Convert probabilities to class labels
pred_class_13c <- ifelse(pred_probs_13c > 0.5, 'Up', 'Down')
#Confusion Matrix
confusionMatrix(as.factor(pred_class_13c), Weekly$Direction, positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 54 48
## Up 430 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.369
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9207
## Specificity : 0.1116
## Pos Pred Value : 0.5643
## Neg Pred Value : 0.5294
## Prevalence : 0.5556
## Detection Rate : 0.5115
## Detection Prevalence : 0.9063
## Balanced Accuracy : 0.5161
##
## 'Positive' Class : Up
##
Our confusion matrix is telling us that our model is biased towards
predicting Up, as seen in the large number of False
Positives(430). It struggles to correctly predict
Down cases, meaning it overestimates the likelihood of the
market going up.
For model improvement, an option would be to use the pROC to find the optimal threshold of balance.
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train_df <- Weekly$Year <= 2008
test_df <- Weekly$Year > 2008
logistic_model_13d <- glm(Direction ~ Lag2, data = Weekly, family = 'binomial', subset = train_df)
#Prediction probabilities
pred_probs_13d <- predict(logistic_model_13d, Weekly[test_df, ], type = 'response')
#Class
pred_class_13d <- ifelse(pred_probs_13d > 0.5, 'Up', 'Down')
confusionMatrix(as.factor(pred_class_13d), Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
Repeat (d) using LDA.
#Fit model
lda_model_13e <- lda(Direction ~ Lag2, data = Weekly, subset = train_df)
#Probabilities
pred_probs_13e <- predict(lda_model_13e, Weekly[test_df, ])
#Class
pred_class_13e <- pred_probs_13e$class
#Confusion matrix
confusionMatrix(pred_class_13e, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
Repead (d) using QDA.
qda_model_13f <- qda(Direction ~ Lag2, data = Weekly, subset = train_df)
#Prediction probabilities
pred_probs_13f <- predict(qda_model_13f, Weekly[test_df, ])
pred_class_13f<- pred_probs_13f$class
#Confusion Matrix
confusionMatrix(pred_class_13f, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
Repeat (d) using KNN with K = 1.
train_x <- Weekly[train_df, 'Lag2', drop = FALSE]
test_x <- Weekly[test_df, 'Lag2', drop = FALSE]
train_y <- Weekly$Direction[train_df]
test_y <- Weekly$Direction[test_df]
# KNN model
set.seed(42)
knn_preds_13g <- knn(train_x, test_x, train_y, k = 1)
#Confusion Matrix
confusionMatrix(knn_preds_13g, test_y, positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 30
## Up 22 31
##
## Accuracy : 0.5
## 95% CI : (0.4003, 0.5997)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9700
##
## Kappa : -0.0033
##
## Mcnemar's Test P-Value : 0.3317
##
## Sensitivity : 0.5082
## Specificity : 0.4884
## Pos Pred Value : 0.5849
## Neg Pred Value : 0.4118
## Prevalence : 0.5865
## Detection Rate : 0.2981
## Detection Prevalence : 0.5096
## Balanced Accuracy : 0.4983
##
## 'Positive' Class : Up
##
Repeat (d) using naive Bayes.
nb_model_13h <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train_df)
#Predict on test set
pred_probs_13h <- predict(nb_model_13h, Weekly[test_df, ])
#Confusion matrix
confusionMatrix(pred_probs_13h, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
Which of these methods appears to provide the best results on this data?
Logistic Regression and LDA both appear to provide the best results. Both have an accuracy of 62.5%, sensitivity of 91.8%, and sensitivity 20.9%. However, the best model would be Logistic Regression due to the interpret-ability we can get from the summary() function.
Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
log_model_13j <- glm(Direction ~ I(Lag1^2) * Lag2, data = Weekly, subset = train_df, family = 'binomial')
summary(log_model_13j)
##
## Call:
## glm(formula = Direction ~ I(Lag1^2) * Lag2, family = "binomial",
## data = Weekly, subset = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1954492 0.0719405 2.717 0.00659 **
## I(Lag1^2) 0.0001890 0.0074888 0.025 0.97986
## Lag2 0.0716183 0.0315379 2.271 0.02316 *
## I(Lag1^2):Lag2 -0.0007183 0.0011226 -0.640 0.52229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1349.3 on 981 degrees of freedom
## AIC: 1357.3
##
## Number of Fisher Scoring iterations: 4
pred_probs_13j_log <- predict(log_model_13j, Weekly[test_df, ], type = 'response')
pred_class_13j_log <- ifelse(pred_probs_13j_log > 0.5, 'Up', 'Down')
confusionMatrix(as.factor(pred_class_13j_log), Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 8 2
## Up 35 59
##
## Accuracy : 0.6442
## 95% CI : (0.5443, 0.7357)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.1364
##
## Kappa : 0.1728
##
## Mcnemar's Test P-Value : 1.435e-07
##
## Sensitivity : 0.9672
## Specificity : 0.1860
## Pos Pred Value : 0.6277
## Neg Pred Value : 0.8000
## Prevalence : 0.5865
## Detection Rate : 0.5673
## Detection Prevalence : 0.9038
## Balanced Accuracy : 0.5766
##
## 'Positive' Class : Up
##
#Fit model
lda_model_13j <- lda(Direction ~ Lag1 * Lag2 + Volume, data = Weekly, subset = train_df)
#Probabilities
pred_probs_13j <- predict(lda_model_13j, Weekly[test_df, ])
#Class
pred_class_13j <- pred_probs_13j$class
#Confusion matrix
confusionMatrix(pred_class_13j, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 27 32
## Up 16 29
##
## Accuracy : 0.5385
## 95% CI : (0.438, 0.6367)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.86308
##
## Kappa : 0.0979
##
## Mcnemar's Test P-Value : 0.03038
##
## Sensitivity : 0.4754
## Specificity : 0.6279
## Pos Pred Value : 0.6444
## Neg Pred Value : 0.4576
## Prevalence : 0.5865
## Detection Rate : 0.2788
## Detection Prevalence : 0.4327
## Balanced Accuracy : 0.5517
##
## 'Positive' Class : Up
##
qda_model_13j <- qda(Direction ~ Lag1 * Lag2, data = Weekly, subset = train_df)
#Prediction probabilities
pred_probs_13j <- predict(qda_model_13j, Weekly[test_df, ])
pred_class_13j<- pred_probs_13j$class
#Confusion Matrix
confusionMatrix(pred_class_13j, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 23 36
## Up 20 25
##
## Accuracy : 0.4615
## 95% CI : (0.3633, 0.562)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.99616
##
## Kappa : -0.0524
##
## Mcnemar's Test P-Value : 0.04502
##
## Sensitivity : 0.4098
## Specificity : 0.5349
## Pos Pred Value : 0.5556
## Neg Pred Value : 0.3898
## Prevalence : 0.5865
## Detection Rate : 0.2404
## Detection Prevalence : 0.4327
## Balanced Accuracy : 0.4724
##
## 'Positive' Class : Up
##
train_x <- Weekly[train_df, c('Lag1', 'Lag2'), drop = FALSE]
test_x <- Weekly[test_df, c('Lag1', 'Lag2'), drop = FALSE]
train_y <- Weekly$Direction[train_df]
test_y <- Weekly$Direction[test_df]
train_x <- scale(train_x)
test_x <- scale(test_x)
# KNN model
set.seed(42)
knn_preds_13j <- knn(train_x, test_x, train_y, k = 10)
#Confusion Matrix
confusionMatrix(knn_preds_13j, test_y, positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 23 28
## Up 20 33
##
## Accuracy : 0.5385
## 95% CI : (0.438, 0.6367)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.8631
##
## Kappa : 0.0738
##
## Mcnemar's Test P-Value : 0.3123
##
## Sensitivity : 0.5410
## Specificity : 0.5349
## Pos Pred Value : 0.6226
## Neg Pred Value : 0.4510
## Prevalence : 0.5865
## Detection Rate : 0.3173
## Detection Prevalence : 0.5096
## Balanced Accuracy : 0.5379
##
## 'Positive' Class : Up
##
nb_model_13j <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Volume, data = Weekly, subset = train_df)
#Predict on test set
pred_probs_13j <- predict(nb_model_13j, Weekly[test_df, ])
#Confusion matrix
confusionMatrix(pred_probs_13j, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 41 58
## Up 2 3
##
## Accuracy : 0.4231
## 95% CI : (0.3268, 0.5239)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9997
##
## Kappa : 0.0022
##
## Mcnemar's Test P-Value : 1.243e-12
##
## Sensitivity : 0.04918
## Specificity : 0.95349
## Pos Pred Value : 0.60000
## Neg Pred Value : 0.41414
## Prevalence : 0.58654
## Detection Rate : 0.02885
## Detection Prevalence : 0.04808
## Balanced Accuracy : 0.50133
##
## 'Positive' Class : Up
##
In terms of accuracy, our logistic regression model
with the interaction of Lag1 ^2 and Lag2 gave
us the best with 64.44%. The only concerning thing
about this model is that most of our models predictions are ‘Up’
In terms of balance in sensitivity and specificity as well as high accuracy, our KNN and LDA models show solid performance by both having an accuracy of 53.85%, with KNN having the best balance of sensitivity and specificity of 54.10% and 53.49%.
KNN used Lag1 and Lag2 while LDA used the
interaction of Lag1 and Lag2 +
Volume.
data('Auto')
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(Auto[, sapply(Auto, is.numeric)])
We have got a few variables with linear negative relationships to
mpg like displacement,
horsepower, weight, and a few positive linear
relationships like acceleration and year.
boxplot(Auto$horsepower ~ Auto$mpg01, main = "Horsepower vs mpg01", xlab = "mpg01", ylab = "Horsepower",
col = c("red", "blue"))
boxplot(Auto$displacement ~ Auto$mpg01, main = "Displacement vs mpg01", xlab = "mpg01", ylab = "Displacement",
col = c("red", "blue"))
boxplot(Auto$weight ~ Auto$mpg01, main = "Weight vs mpg01", xlab = "mpg01", ylab = "Weight",
col = c("red", "blue"))
boxplot(Auto$acceleration ~ Auto$mpg01, main = "Acceleration vs mpg01", xlab = "mpg01", ylab = "Acceleration",
col = c("red", "blue"))
boxplot(Auto$year ~ Auto$mpg01, main = "Year vs mpg01", xlab = "mpg01", ylab = "Year",
col = c("red", "blue"))
boxplot(Auto$cylinders ~ Auto$mpg01, main = "Cylinders vs mpg01", xlab = "mpg01", ylab = "Cylinders",
col = c("red", "blue"))
Since we are interested in our new categorical variable of
mpg01, we use this with boxplots to see if our predictors
can separate our variable.
Based on our boxplots, we can see that those same variables that had
a linear relationship with mpg also have a trend that
separates our mpg01 which just reinforces our findings of
using those variables to predict higher vs lower mileage. Another
interesting finding is that cylinders shows a big
separation, which could be due to an unbalance in the amount of cars
with lower cylinders, but generally more cylinders mean lower
mileage.
knitr::kable(table(Auto$cylinders), col.names = c('Cylinders', 'Frequency'), align = 'c')
| Cylinders | Frequency |
|---|---|
| 3 | 4 |
| 4 | 199 |
| 5 | 3 |
| 6 | 83 |
| 8 | 103 |
We can confirm that our box plots behavior for cylinders
was not due to an unbalance in the data, so it has potential for being a
significant predictor to our outcome variable mpg01.
Split the data into a training set and a test set.
set.seed(42)
train_index_Auto <- createDataPartition(Auto$mpg01, p = 0.7, list = FALSE)
train_data_Auto <- Auto[train_index_Auto, ]
test_data_Auto <- Auto[-train_index_Auto, ]
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
lda_model_14d <- lda(mpg01 ~ horsepower + weight + displacement + acceleration + as.factor(cylinders) + year,
data = train_data_Auto)
lda_preds_14d <- predict(lda_model_14d, test_data_Auto)
lda_class_14d <- lda_preds_14d$class
confusionMatrix(lda_class_14d, as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 4
## 1 6 54
##
## Accuracy : 0.9138
## 95% CI : (0.8472, 0.9579)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8276
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.8966
## Specificity : 0.9310
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.9000
## Prevalence : 0.5000
## Detection Rate : 0.4483
## Detection Prevalence : 0.4828
## Balanced Accuracy : 0.9138
##
## 'Positive' Class : 0
##
After creating a LDA model using the variables that seemed associated
the most with mpg01, we got a test accuracy of
91.38%.
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda_model_15e <- qda(mpg01 ~ horsepower + weight + displacement + acceleration + cylinders + year,
data = train_data_Auto)
qda_preds_15e <- predict(qda_model_15e, test_data_Auto)
qda_class_15e <- qda_preds_15e$class
confusionMatrix(qda_class_15e, as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50 3
## 1 8 55
##
## Accuracy : 0.9052
## 95% CI : (0.8367, 0.9517)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8103
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.8621
## Specificity : 0.9483
## Pos Pred Value : 0.9434
## Neg Pred Value : 0.8730
## Prevalence : 0.5000
## Detection Rate : 0.4310
## Detection Prevalence : 0.4569
## Balanced Accuracy : 0.9052
##
## 'Positive' Class : 0
##
After creating a QDA model using the variables that seemed associated
the most with mpg01, we got a test accuracy of
90.52%.
Also, we kept cylinders as a numerical value because QDA
does not take categorical variables as predictors.
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b).
What is the test error of the model obtained?
log_model_15f <- glm(mpg01 ~ horsepower + weight + displacement + acceleration + as.factor(cylinders) + year,
data = train_data_Auto, family = 'binomial')
log_probs_15f <- predict(log_model_15f, test_data_Auto, type = 'response')
log_preds_15f <- ifelse(log_probs_15f > 0.5, 1, 0)
confusionMatrix(as.factor(log_preds_15f), as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 51 3
## 1 7 55
##
## Accuracy : 0.9138
## 95% CI : (0.8472, 0.9579)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8276
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.8793
## Specificity : 0.9483
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.8871
## Prevalence : 0.5000
## Detection Rate : 0.4397
## Detection Prevalence : 0.4655
## Balanced Accuracy : 0.9138
##
## 'Positive' Class : 0
##
After creating a logistic regression model using the variables that
seemed associated the most with mpg01, we got a test
accuracy of 91.38%.
Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
nb_model_15g <- naiveBayes(mpg01 ~ horsepower + weight + displacement + acceleration + as.factor(cylinders) + year,
data = train_data_Auto)
nb_preds_15g <- predict(nb_model_15g, test_data_Auto)
confusionMatrix(as.factor(nb_preds_15g), as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 46 1
## 1 12 57
##
## Accuracy : 0.8879
## 95% CI : (0.816, 0.939)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7759
##
## Mcnemar's Test P-Value : 0.005546
##
## Sensitivity : 0.7931
## Specificity : 0.9828
## Pos Pred Value : 0.9787
## Neg Pred Value : 0.8261
## Prevalence : 0.5000
## Detection Rate : 0.3966
## Detection Prevalence : 0.4052
## Balanced Accuracy : 0.8879
##
## 'Positive' Class : 0
##
After creating a Naive Bayes model using the variables that seemed
associated the most with mpg01, we got a test
accuracy of 88.79%.
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b).
What test errors do you obtain? Which value of K seems to perform the best on this data set?
set.seed(42)
train_x_15h <- train_data_Auto[, c('horsepower', 'weight', 'displacement', 'acceleration', 'cylinders', 'year')]
test_x_15h <- test_data_Auto[, c('horsepower', 'weight', 'displacement', 'acceleration', 'cylinders', 'year')]
train_y_15h <- train_data_Auto$mpg01
test_y_15h <- test_data_Auto$mpg01
knn_preds_15h <- knn(train_x_15h, test_x_15h, train_y_15h, k = 1)
knn_preds_15h2 <- knn(train_x_15h, test_x_15h, train_y_15h, k = 4)
knn_preds_15h3 <- knn(train_x_15h, test_x_15h, train_y_15h, k = 8)
#Confusion Matrix
confusionMatrix(as.factor(knn_preds_15h), as.factor(test_y_15h))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 4
## 1 11 54
##
## Accuracy : 0.8707
## 95% CI : (0.7957, 0.9258)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7414
##
## Mcnemar's Test P-Value : 0.1213
##
## Sensitivity : 0.8103
## Specificity : 0.9310
## Pos Pred Value : 0.9216
## Neg Pred Value : 0.8308
## Prevalence : 0.5000
## Detection Rate : 0.4052
## Detection Prevalence : 0.4397
## Balanced Accuracy : 0.8707
##
## 'Positive' Class : 0
##
confusionMatrix(as.factor(knn_preds_15h2), as.factor(test_y_15h))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50 4
## 1 8 54
##
## Accuracy : 0.8966
## 95% CI : (0.8263, 0.9454)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7931
##
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.8621
## Specificity : 0.9310
## Pos Pred Value : 0.9259
## Neg Pred Value : 0.8710
## Prevalence : 0.5000
## Detection Rate : 0.4310
## Detection Prevalence : 0.4655
## Balanced Accuracy : 0.8966
##
## 'Positive' Class : 0
##
confusionMatrix(as.factor(knn_preds_15h3), as.factor(test_y_15h))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48 3
## 1 10 55
##
## Accuracy : 0.8879
## 95% CI : (0.816, 0.939)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7759
##
## Mcnemar's Test P-Value : 0.09609
##
## Sensitivity : 0.8276
## Specificity : 0.9483
## Pos Pred Value : 0.9412
## Neg Pred Value : 0.8462
## Prevalence : 0.5000
## Detection Rate : 0.4138
## Detection Prevalence : 0.4397
## Balanced Accuracy : 0.8879
##
## 'Positive' Class : 0
##
After creating 3 KNN models using the variables that seemed
associated the most with mpg01, we got:
For k = 1 a test accuracy of 87.07%
For k = 4 a test accuracy of 89.66%
For k = 8 a test accuracy of 88.79%
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
data('Boston')
# (1) is high crime and (0) is low crime
Boston$crim01 <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
knitr::kable(summary(Boston))
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | crim01 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.00632 | Min. : 0.00 | Min. : 0.46 | Min. :0.00000 | Min. :0.3850 | Min. :3.561 | Min. : 2.90 | Min. : 1.130 | Min. : 1.000 | Min. :187.0 | Min. :12.60 | Min. : 0.32 | Min. : 1.73 | Min. : 5.00 | Min. :0.0 | |
| 1st Qu.: 0.08205 | 1st Qu.: 0.00 | 1st Qu.: 5.19 | 1st Qu.:0.00000 | 1st Qu.:0.4490 | 1st Qu.:5.886 | 1st Qu.: 45.02 | 1st Qu.: 2.100 | 1st Qu.: 4.000 | 1st Qu.:279.0 | 1st Qu.:17.40 | 1st Qu.:375.38 | 1st Qu.: 6.95 | 1st Qu.:17.02 | 1st Qu.:0.0 | |
| Median : 0.25651 | Median : 0.00 | Median : 9.69 | Median :0.00000 | Median :0.5380 | Median :6.208 | Median : 77.50 | Median : 3.207 | Median : 5.000 | Median :330.0 | Median :19.05 | Median :391.44 | Median :11.36 | Median :21.20 | Median :0.5 | |
| Mean : 3.61352 | Mean : 11.36 | Mean :11.14 | Mean :0.06917 | Mean :0.5547 | Mean :6.285 | Mean : 68.57 | Mean : 3.795 | Mean : 9.549 | Mean :408.2 | Mean :18.46 | Mean :356.67 | Mean :12.65 | Mean :22.53 | Mean :0.5 | |
| 3rd Qu.: 3.67708 | 3rd Qu.: 12.50 | 3rd Qu.:18.10 | 3rd Qu.:0.00000 | 3rd Qu.:0.6240 | 3rd Qu.:6.623 | 3rd Qu.: 94.08 | 3rd Qu.: 5.188 | 3rd Qu.:24.000 | 3rd Qu.:666.0 | 3rd Qu.:20.20 | 3rd Qu.:396.23 | 3rd Qu.:16.95 | 3rd Qu.:25.00 | 3rd Qu.:1.0 | |
| Max. :88.97620 | Max. :100.00 | Max. :27.74 | Max. :1.00000 | Max. :0.8710 | Max. :8.780 | Max. :100.00 | Max. :12.127 | Max. :24.000 | Max. :711.0 | Max. :22.00 | Max. :396.90 | Max. :37.97 | Max. :50.00 | Max. :1.0 |
pairs(Boston)
chas is 0’s and 1’s so we convert it to categorical
Boston$chas <- as.factor(Boston$chas)
Boston$crim01 <- as.factor(Boston$crim01)
boxplot(Boston$indus, Boston$crim01, main = 'Industrial Area Perfectly separates Crime Rate',
xlab = 'Crime Rate 0 or 1', ylab = 'Proportion of Industrial Area',
col = c('red', 'blue'))
boxplot(Boston$nox, Boston$crim01, main = 'Air Contamination Perfectly separates Crime Rate',
xlab = 'Crime Rate 0 or 1', ylab = 'Nitrogen Oxide Concentration',
col = c('red', 'blue'))
boxplot(Boston$tax, Boston$crim01, main = 'Tax rate values Perfectly separates Crime Rate',
xlab = 'Crime Rate 0 or 1', ylab = 'Properties Tax Rate',
col = c('red', 'blue'))
boxplot(Boston$medv, Boston$crim01, main = 'Value of homes Perfectly separates Crime Rate',
xlab = 'Crime Rate 0 or 1', ylab = 'value of homes',
col = c('red', 'blue'))
boxplot(Boston$rad, Boston$crim01,
main = 'Zone with highway access Perfectly separates Crime Rate',
xlab = 'Crime Rate 0 or 1', ylab = 'Access to highways',
col = c('red', 'blue'))
boxplot(Boston$rm, Boston$crim01,
main = 'Zone with highway access Perfectly separates Crime Rate',
xlab = 'Crime Rate 0 or 1', ylab = 'Access to highways',
col = c('red', 'blue'))
It appears as though indus, nox,
tax, medv, rad and
rm all perfectly separate crime between high and low.
set.seed(42)
train_index_Boston <- createDataPartition(Boston$crim01, p = 0.7, list = FALSE)
train_data_Boston <- Boston[train_index_Boston, ]
test_data_Boston <- Boston[-train_index_Boston, ]
log_model_16 <- glm(crim01 ~ indus + nox + tax + medv + rad, data = train_data_Boston,
family = 'binomial')
probs_log <- predict(log_model_16, test_data_Boston, type = 'response')
preds_log <- ifelse(probs_log > 0.5, 1, 0)
confusionMatrix(as.factor(preds_log), as.factor(test_data_Boston$crim01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 69 14
## 1 6 61
##
## Accuracy : 0.8667
## 95% CI : (0.8016, 0.9166)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7333
##
## Mcnemar's Test P-Value : 0.1175
##
## Sensitivity : 0.9200
## Specificity : 0.8133
## Pos Pred Value : 0.8313
## Neg Pred Value : 0.9104
## Prevalence : 0.5000
## Detection Rate : 0.4600
## Detection Prevalence : 0.5533
## Balanced Accuracy : 0.8667
##
## 'Positive' Class : 0
##
Logistic Regression model has an accuracy of 86.67% when predicting if crime is high or low.
lda_model_16 <- lda(crim01 ~ indus + tax + nox + medv + rad, data = train_data_Boston)
lda_preds_16 <- predict(lda_model_16, test_data_Boston)$class
# Confusion Matrix
confusionMatrix(lda_preds_16, as.factor(test_data_Boston$crim01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 74 19
## 1 1 56
##
## Accuracy : 0.8667
## 95% CI : (0.8016, 0.9166)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7333
##
## Mcnemar's Test P-Value : 0.0001439
##
## Sensitivity : 0.9867
## Specificity : 0.7467
## Pos Pred Value : 0.7957
## Neg Pred Value : 0.9825
## Prevalence : 0.5000
## Detection Rate : 0.4933
## Detection Prevalence : 0.6200
## Balanced Accuracy : 0.8667
##
## 'Positive' Class : 0
##
LDA model has an accuracy of 86.67% when predicting if crime is high or low.
nb_model_16 <- naiveBayes(crim01 ~ indus + tax + nox + medv + rad, data = train_data_Boston)
nb_preds_16 <- predict(nb_model_16, test_data_Boston)
confusionMatrix(nb_preds_16, as.factor(test_data_Boston$crim01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 73 24
## 1 2 51
##
## Accuracy : 0.8267
## 95% CI : (0.7564, 0.8835)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6533
##
## Mcnemar's Test P-Value : 3.814e-05
##
## Sensitivity : 0.9733
## Specificity : 0.6800
## Pos Pred Value : 0.7526
## Neg Pred Value : 0.9623
## Prevalence : 0.5000
## Detection Rate : 0.4867
## Detection Prevalence : 0.6467
## Balanced Accuracy : 0.8267
##
## 'Positive' Class : 0
##
Naive Bayes model has an accuracy of 82.67% when predicting if crime is high or low.
set.seed(42)
train_x_16 <- train_data_Boston[, c('indus', 'tax', 'nox', 'medv', 'rad')]
test_x_16 <- test_data_Boston[, c('indus', 'tax', 'nox', 'medv', 'rad')]
train_y_16 <- train_data_Boston$crim01
test_y_16 <- test_data_Boston$crim01
knn_preds_16 <- knn(train_x_16, test_x_16, train_y_16, k = 1)
knn_preds_16_2 <- knn(train_x_16, test_x_16, train_y_16, k = 2)
knn_preds_16_3 <- knn(train_x_16, test_x_16, train_y_16, k = 4)
#Confusion Matrix
confusionMatrix(as.factor(knn_preds_16), as.factor(test_y_16))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 65 5
## 1 10 70
##
## Accuracy : 0.9
## 95% CI : (0.8404, 0.9429)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.8667
## Specificity : 0.9333
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.8750
## Prevalence : 0.5000
## Detection Rate : 0.4333
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 0
##
confusionMatrix(as.factor(knn_preds_16_2), as.factor(test_y_16))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 3
## 1 7 72
##
## Accuracy : 0.9333
## 95% CI : (0.8808, 0.9676)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8667
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.9067
## Specificity : 0.9600
## Pos Pred Value : 0.9577
## Neg Pred Value : 0.9114
## Prevalence : 0.5000
## Detection Rate : 0.4533
## Detection Prevalence : 0.4733
## Balanced Accuracy : 0.9333
##
## 'Positive' Class : 0
##
confusionMatrix(as.factor(knn_preds_16_3), as.factor(test_y_16))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 67 3
## 1 8 72
##
## Accuracy : 0.9267
## 95% CI : (0.8726, 0.9628)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8533
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.8933
## Specificity : 0.9600
## Pos Pred Value : 0.9571
## Neg Pred Value : 0.9000
## Prevalence : 0.5000
## Detection Rate : 0.4467
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.9267
##
## 'Positive' Class : 0
##
After creating 3 KNN models using the variables that seemed
associated the most with crim01, we got:
For k = 1 a test accuracy of 90%
For k = 2 a test accuracy of 93.33%
For k = 4 a test accuracy of 92.67%
After exploring Logistic Regression, LDA, Naive Bayes, and KNN
models, we saw the best model for our data was KNN. We
ran 3 different models of KNN basically changing the number of K, and
found that specifically K = 2 was our best
performing model with an accuracy of 93.33% which is very
high.
On the other hand, our worst performing model was our Naive Bayes with an accuracy of 82.67% which could still be considered relatively high, but is still vastly outperformed by our KNN model.
Our findings also suggest that changing the predictor variables could help improve other models, as well as using transformations, since each model interprets data differently.