library(tidymodels)
library(ISLR2) # For the Bikeshare data set
library(discrim)
library(poissonreg)
To look at the correlation between the variables, we will use the corrr package. The correlate() function will calculate the correlation matrix between all the variables that it is being fed. We will therefore remove Direction as it is not numeric. Then we pass that to rplot() to quickly visualize the correlation matrix.
library(corrr)
cor_Smarket <- Smarket %>%
select(-Direction) %>%
correlate()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))
We see that these variables are more or less uncorrelated with each other. The only exception is Year and Volume, which are a little correlated.
# Heatmap-styled correlation plot
library(paletteer)
cor_Smarket %>%
stretch() %>%
ggplot(aes(x, y, fill = r)) +
geom_tile() +
geom_text(aes(label = as.character(fashion(r)))) +
scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1)
If we plot Year against Volume, we see that there is an upwards trend in Volume with time.
ggplot(Smarket, aes(Year, Volume)) +
geom_jitter(height = 0)
We will use logistic_reg() to create a logistic regression model specification.
lr_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
We want to model the Direction of the stock market based on the percentage returns from the 5 previous days plus the volume of shares traded. Fitting a classification model with parsnip requires that the response variable is a factor. This is already the case for the Smarket data set, so we don't need to make any adjustments.
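We can verify this with a quick sketch:
# Confirm that the response is already a factor, and inspect its levels
is.factor(Smarket$Direction)
levels(Smarket$Direction)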
lr_fit <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket
)
lr_fit
## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
## Volume
## 0.135441
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
## Null Deviance: 1731
## Residual Deviance: 1728 AIC: 1742
This fit is done using the glm() function, and it comes with a very handy summary() method as well.
lr_fit %>%
pluck("fit") %>%
summary()
##
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
This lets us see a couple of different things, such as parameter estimates, standard errors, p-values, and model fit statistics. We can use the tidy() function to extract some of these model attributes for further analysis or presentation.
tidy(lr_fit)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.126 0.241 -0.523 0.601
## 2 Lag1 -0.0731 0.0502 -1.46 0.145
## 3 Lag2 -0.0423 0.0501 -0.845 0.398
## 4 Lag3 0.0111 0.0499 0.222 0.824
## 5 Lag4 0.00936 0.0500 0.187 0.851
## 6 Lag5 0.0103 0.0495 0.208 0.835
## 7 Volume 0.135 0.158 0.855 0.392
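Similarly, if we want the model-level statistics (deviance, AIC, and so on) as a one-row tibble, a minimal sketch applying broom's glance() to the underlying glm object:
# Model-level fit statistics for the underlying glm fit
lr_fit %>%
  pluck("fit") %>%
  glance()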
Predictions are done much the same way. Here we use the model to predict on the data it was trained on.
predict(lr_fit, new_data = Smarket)
## # A tibble: 1,250 × 1
## .pred_class
## <fct>
## 1 Up
## 2 Down
## 3 Down
## 4 Up
## 5 Up
## 6 Up
## 7 Down
## 8 Up
## 9 Up
## 10 Down
## # ℹ 1,240 more rows
The result is a tibble with a single column, .pred_class, which is a factor with the same levels as the response in the original training data set.
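A quick sketch to confirm that the predicted classes carry those levels:
# The predicted factor reuses the levels of the training response
predict(lr_fit, new_data = Smarket) %>%
  pull(.pred_class) %>%
  levels()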
We can also get back probability predictions by specifying type = "prob".
predict(lr_fit, new_data = Smarket, type = "prob")
## # A tibble: 1,250 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.493 0.507
## 2 0.519 0.481
## 3 0.519 0.481
## 4 0.485 0.515
## 5 0.489 0.511
## 6 0.493 0.507
## 7 0.507 0.493
## 8 0.491 0.509
## 9 0.482 0.518
## 10 0.511 0.489
## # ℹ 1,240 more rows
Using augment() we can add the predictions to the data.frame and then use that to look at model performance metrics.
augment(lr_fit, new_data = Smarket) %>%
conf_mat(truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 145 141
## Up 457 507
A more visual representation of the confusion matrix (a ggplot2 chart) can be generated by piping the result of conf_mat() into autoplot().
augment(lr_fit, new_data = Smarket) %>%
conf_mat(truth = Direction, estimate = .pred_class) %>%
autoplot(type = "heatmap")
We can also calculate various performance metrics. One of the most common metrics is accuracy, which is the proportion of observations the model predicts correctly.
augment(lr_fit, new_data = Smarket) %>%
accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.522
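Accuracy is not the only option. As a sketch, yardstick's metric_set() bundles several class metrics (here accuracy, sensitivity, and specificity; smarket_metrics is a name of our choosing) so they can be computed in one call:
# Bundle several yardstick metrics into a single callable metric set
smarket_metrics <- metric_set(accuracy, sens, spec)

augment(lr_fit, new_data = Smarket) %>%
  smarket_metrics(truth = Direction, estimate = .pred_class)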
We just fit a model and evaluated it on the same data it was trained on, which doesn't tell us much about real model performance. Let us instead split up the data, train the model on one part, and then evaluate it on the other part. Since we are working with data that has a time component, it is natural to fit the model on the earlier years of data and evaluate it on the final year. This more closely matches how such a model would be used in real life.
Smarket_train <- Smarket %>%
filter(Year != 2005)
Smarket_test <- Smarket %>%
filter(Year == 2005)
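As a quick sanity check on the split (Smarket covers the years 2001 through 2005), we can tabulate the years in each part:
# Training data should cover 2001-2004, testing data only 2005
count(Smarket_train, Year)
count(Smarket_test, Year)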
Now that we have split the data into Smarket_train and Smarket_test, we can fit a logistic regression model to Smarket_train and evaluate it on Smarket_test to see how well the model generalizes.
lr_fit2 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket_train
)
And we will evaluate on the testing data set.
augment(lr_fit2, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 77 97
## Up 34 44
augment(lr_fit2, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.480
We see that this model is now more likely to predict "Down" than "Up". Also, note how the model performs worse than the previous model. This is expected, since we are evaluating on new data.
We recall that the logistic regression model had underwhelming p-values. Let us see what happens if we remove some of the variables that appear not to be helpful. We might end up with a more predictive model, since variables that have no relationship with the response tend to increase variance without a corresponding decrease in bias.
lr_fit3 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2,
data = Smarket_train
)
augment(lr_fit3, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
augment(lr_fit3, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and −0.8.
For this we start by creating a tibble corresponding to the scenarios we want to predict for:
Smarket_new <- tibble(
Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8)
)
And then we will use predict():
predict(
lr_fit3,
new_data = Smarket_new,
type = "prob"
)
## # A tibble: 2 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.521 0.479
## 2 0.504 0.496
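If we instead want hard class predictions for the same two scenarios, we can simply omit the type argument; a minimal sketch:
# Class predictions for the two hypothetical days
predict(lr_fit3, new_data = Smarket_new)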
Now we will perform LDA on the Smarket data, using the discrim_linear() function to create an LDA specification. We will continue to use the same 2 predictors for easy comparison.
lda_spec <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
lda_fit <- lda_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
lda_fit
## parsnip model object
##
## Call:
## lda(Direction ~ Lag1 + Lag2, data = data)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
One of the things to look for in the LDA output is the group means. We see that there is a slight difference between the means of the two groups. These differences suggest a tendency for the previous 2 days' returns to be negative on days when the market increases, and a tendency for the previous days' returns to be positive on days when the market declines.
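If we want the group means as an object rather than printed output, one sketch is to pull them out of the underlying MASS::lda fit, which stores them in its means component:
# Extract the matrix of group means from the engine fit
lda_fit %>%
  pluck("fit") %>%
  pluck("means")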
predict(lda_fit, new_data = Smarket_test)
## # A tibble: 252 × 1
## .pred_class
## <fct>
## 1 Up
## 2 Up
## 3 Up
## 4 Up
## 5 Up
## 6 Up
## 7 Up
## 8 Up
## 9 Up
## 10 Up
## # ℹ 242 more rows
predict(lda_fit, new_data = Smarket_test, type = "prob")
## # A tibble: 252 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.490 0.510
## 2 0.479 0.521
## 3 0.467 0.533
## 4 0.474 0.526
## 5 0.493 0.507
## 6 0.494 0.506
## 7 0.495 0.505
## 8 0.487 0.513
## 9 0.491 0.509
## 10 0.484 0.516
## # ℹ 242 more rows
augment(lda_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
augment(lda_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
And we see no markedly different performance between this model and the logistic regression model.
We will now fit a QDA model. The discrim_quad() function is used here.
qda_spec <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
qda_fit <- qda_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
augment(qda_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 30 20
## Up 81 121
augment(qda_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.599
And we are seeing another increase in accuracy. However, this model still rarely predicts "Down". This makes it appear that the quadratic form assumed by QDA captures the relationship more clearly.
We will now fit a Naive Bayes model to the Smarket data. For this, we will be using the naive_Bayes() function to create the specification and also set the usekernel argument to FALSE. This means that we are assuming that the predictors Lag1 and Lag2 are drawn from Gaussian distributions.
nb_spec <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("klaR") %>%
set_args(usekernel = FALSE)
nb_fit <- nb_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
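If we did not want to assume Gaussian distributions, setting usekernel = TRUE would instead use kernel density estimates for each predictor. A sketch we won't pursue further here (nb_kde_spec is a name of our choosing):
# Same model, but with kernel density estimates for the predictors
nb_kde_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("klaR") %>%
  set_args(usekernel = TRUE)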
# Confusion matrix based on the testing data
augment(nb_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 28 20
## Up 83 121
augment(nb_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.591
The accuracy of the Naive Bayes model is very similar to that of the QDA model. This seems reasonable, since the scatter plot below shows no apparent relationship between Lag1 and Lag2, and thus the Naive Bayes assumption of independently distributed predictors is not unreasonable.
ggplot(Smarket, aes(Lag1, Lag2)) +
geom_point(alpha = 0.1, size = 2) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "No apparent correlation between Lag1 and Lag2")
Lastly, let us take a look at a K-nearest neighbors model. This is the first model we have looked at that has a hyperparameter we need to specify. I have set it to 3 with neighbors = 3. Fitting is done as usual.
knn_spec <- nearest_neighbor(neighbors = 3) %>%
set_mode("classification") %>%
set_engine("kknn")
knn_fit <- knn_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
knn_fit
## parsnip model object
##
##
## Call:
## kknn::train.kknn(formula = Direction ~ Lag1 + Lag2, data = data, ks = min_rows(3, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.492986
## Best kernel: optimal
## Best k: 3
augment(knn_fit, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 43 58
## Up 68 83
augment(knn_fit, new_data = Smarket_test) %>%
accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.5
It appears that this model is not performing that well.
We will try using a K-nearest neighbors model in an application to caravan insurance data. This data set includes 85 predictors that measure demographic characteristics for 5822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.
We want to build a predictive model that uses the demographic characteristics to predict whether an individual is going to purchase caravan insurance. Before we go on, we split the data set into a training data set and a testing data set. (This is not the proper way this should be done. See the next chapter for the correct way.)
Caravan_test <- Caravan[seq_len(1000), ]
Caravan_train <- Caravan[-seq_len(1000), ]
Since we are using a K-nearest neighbor model, it is important that the variables are centered and scaled so that they all have a uniform influence. We can accomplish this transformation with step_normalize(), which does centering and scaling in one go.
rec_spec <- recipe(Purchase ~ ., data = Caravan_train) %>%
step_normalize(all_numeric_predictors())
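If we want to inspect what the recipe actually does to the training data, a minimal sketch using prep() and bake():
# Estimate the normalization on the training data and return the processed data
rec_spec %>%
  prep() %>%
  bake(new_data = NULL)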
We will be trying different values of K to see how the number of neighbors affects the model performance. A workflow object is created, with just the recipe added.
Caravan_wf <- workflow() %>%
add_recipe(rec_spec)
Next we create a general KNN model specification.
knn_spec <- nearest_neighbor() %>%
set_mode("classification") %>%
set_engine("kknn")
We can then use this model specification along with Caravan_wf to create 3 full workflow objects for K = 1, 3, 5.
knn1_wf <- Caravan_wf %>%
add_model(knn_spec %>% set_args(neighbors = 1))
knn3_wf <- Caravan_wf %>%
add_model(knn_spec %>% set_args(neighbors = 3))
knn5_wf <- Caravan_wf %>%
add_model(knn_spec %>% set_args(neighbors = 5))
With all these workflow specifications, we can fit the models one by one.
knn1_fit <- fit(knn1_wf, data = Caravan_train)
knn3_fit <- fit(knn3_wf, data = Caravan_train)
knn5_fit <- fit(knn5_wf, data = Caravan_train)
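If we had many more values of K to try, the same idea could be written as a loop. A sketch using purrr's map() (knn_fits is a name of our choosing; the proper way to tune this is covered in the next chapter):
# Fit one workflow per value of neighbors
knn_fits <- map(
  c(1, 3, 5),
  ~ Caravan_wf %>%
    add_model(knn_spec %>% set_args(neighbors = .x)) %>%
    fit(data = Caravan_train)
)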
And we can calculate all the confusion matrices.
augment(knn1_fit, new_data = Caravan_test) %>%
conf_mat(truth = Purchase, estimate = .pred_class)
## Truth
## Prediction No Yes
## No 874 50
## Yes 67 9
augment(knn3_fit, new_data = Caravan_test) %>%
conf_mat(truth = Purchase, estimate = .pred_class)
## Truth
## Prediction No Yes
## No 875 50
## Yes 66 9
augment(knn5_fit, new_data = Caravan_test) %>%
conf_mat(truth = Purchase, estimate = .pred_class)
## Truth
## Prediction No Yes
## No 874 50
## Yes 67 9
And it appears that the model performance doesn't change much when K is increased from 1 to 5.