In this assignment, I will use tidymodels rather than base R for the coding.
Suppose we collect data for a group of students in a statistics class with variables \(X_1\) = hours studied, \(X_2\) = undergrad GPA, and \(Y\) = receive an A. We fit a logistic regression and obtain estimated coefficients \(\hat{\beta}_0=-6\), \(\hat{\beta}_1=0.05\), \(\hat{\beta}_2=1\).
The equation of logistic regression is: \[ p(X) = \frac{e^{\beta_0 + \beta_1X_1 + … + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + … + \beta_pX_p}} \] a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of \(3.5\) gets an A in the class.
According to the question, \(X_1 = 40\) hours and \(X_2 = 3.5\), so the estimated probability is: \[ \hat{p}(X) = \frac{e^{-6 + 0.05X_1 + 1X_2}}{1 + e^{-6 + 0.05X_1 + 1X_2}} \]
Here \(e\) is Euler's number, approximately 2.71828. Evaluating with exp() in the console, the probability of getting an A is about 0.3775: \[
\hat{p}(X) = \frac{e^{-6 + 0.05 \times 40 + 1 \times 3.5}}{1 + e^{-6 + 0.05 \times 40 + 1 \times 3.5}} = 0.3775407
\] We can also write a function to compute the probability; the answer is \(p \approx 37.8\%\).
prob <- function(x1, x2){
  stopifnot(is.numeric(x1), is.numeric(x2))
  tmp <- exp(-6 + 0.05 * x1 + 1 * x2)
  tmp / (1 + tmp)
}
prob(40, 3.5)
## [1] 0.3775407
Now suppose we do not know the number of study hours; we need to find the \(X_1\) that gives a 50% chance of getting an A: \[ \frac{e^{-6 + 0.05X_1 + 1 \times 3.5}}{1 + e^{-6 + 0.05X_1 + 1 \times 3.5}} = 0.5 \]
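Note that \(\hat{p}(X) = 0.5\) exactly when the log-odds are zero, so the answer can be worked out directly: \[ -6 + 0.05X_1 + 3.5 = 0 \quad\Rightarrow\quad X_1 = \frac{2.5}{0.05} = 50 \] To confirm, we test hours 1 through 100 to see which \(X_1\) gives exactly a 50% chance of getting an A; the answer is 50 hours.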
hours <- seq(1, 100, 1)
probs <- mapply(hours, 3.5, FUN = prob) # X1, X2, function
names(probs) <- paste(hours, "Hours")
probs %>%
  tidy() %>%
  filter(x == 0.5)
## # A tibble: 1 x 2
## names x
## <chr> <dbl>
## 1 50 Hours 0.5
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First, we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next, we use 1-nearest neighbors (i.e. \(K = 1\)) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
The training error and the test error are two important concepts in machine learning. The test error rate is computed on data that was not seen during the training phase. For this question, we should focus on the test error rate.
According to the question, we can conclude:
Even though the average error rate for the KNN procedure is 18%, with \(K=1\) the KNN training error rate is exactly 0%, because each training observation is its own nearest neighbor and is therefore classified perfectly. Since the average of the training and test error rates is 18%, the test error rate must be 36%. The KNN test error rate of 36% is higher than the logistic regression test error rate of 30%, so we prefer logistic regression for classification of new observations.
In this exercise, you will examine the differences between LDA and QDA.
In this exercise, we will explore a data set about cars called auto which you can find here.
The data set contains 1 factor variable and 6 numeric variables. The factor variable mpg has two levels, high and low, indicating whether the car gets high or low miles per gallon. In this exercise, we will investigate whether a logistic regression classifier can predict, from the other variables, if a car has high or low mpg.
Create an rsplit object of auto using initial_split(). Use the default arguments for initial_split().
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## mpg = col_character(),
## cylinders = col_double(),
## displacement = col_double(),
## horsepower = col_double(),
## weight = col_double(),
## acceleration = col_double(),
## year = col_double()
## )
## spec_tbl_df [397 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ mpg : chr [1:397] "low" "low" "low" "low" ...
## $ cylinders : num [1:397] 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num [1:397] 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num [1:397] 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num [1:397] 3504 3693 3436 3433 3449 ...
## $ acceleration: num [1:397] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num [1:397] 70 70 70 70 70 70 70 70 70 70 ...
## - attr(*, "spec")=
## .. cols(
## .. mpg = col_character(),
## .. cylinders = col_double(),
## .. displacement = col_double(),
## .. horsepower = col_double(),
## .. weight = col_double(),
## .. acceleration = col_double(),
## .. year = col_double()
## .. )
# mutate the response variable to be a factor type
autoDF %>%
  mutate(mpg = as.factor(mpg)) -> autoDF
# set a seed
set.seed(1)
# split into training and testing sets
auto_split <- initial_split(autoDF, strata = mpg)
auto_split
## <Analysis/Assess/Total>
## <298/99/397>
Extract the training and testing data sets with training() and testing() respectively; the training set has 298 rows and the test set has 99.
## [1] 298
## [1] 99
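The extraction chunk itself is not echoed above; a minimal sketch, assuming the names auto_train and auto_test used in later chunks:
auto_train <- training(auto_split)
auto_test <- testing(auto_split)
nrow(auto_train) # 298
nrow(auto_test) # 99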
Fit a logistic regression model with logistic_reg(). Use all 6 numeric variables as predictors (a formula shorthand is to write mpg ~ ., where . means everything else). Remember to fit the model using only the training data set.
# create a model formula
auto_formula <- mpg ~ .
# set the logistic regression engine
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
# fit the model
lr_spec %>%
  fit(auto_formula, data = auto_train) -> lr_fit
lr_fit
## parsnip model object
##
## Fit time: 9ms
##
## Call: stats::glm(formula = mpg ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) cylinders displacement horsepower weight
## 17.897923 -0.166662 0.007754 0.026401 0.004506
## acceleration year
## -0.067708 -0.427629
##
## Degrees of Freedom: 297 Total (i.e. Null); 291 Residual
## Null Deviance: 413.1
## Residual Deviance: 123.4 AIC: 137.4
Inspect the fit with summary() and tidy(). Which of the variables are significant? According to the two tables below, the intercept, weight, and year are significant; that is, \(b_0\), \(b_4\), and \(b_6\). The significant predictor variables are weight and year.
##
## Call:
## stats::glm(formula = mpg ~ ., family = stats::binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2241 -0.2326 -0.0209 0.1218 2.1663
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.897923 6.566039 2.726 0.006414 **
## cylinders -0.166662 0.470427 -0.354 0.723131
## displacement 0.007754 0.011691 0.663 0.507163
## horsepower 0.026401 0.027258 0.969 0.332766
## weight 0.004506 0.001225 3.680 0.000233 ***
## acceleration -0.067708 0.152738 -0.443 0.657553
## year -0.427629 0.083449 -5.124 2.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 413.10 on 297 degrees of freedom
## Residual deviance: 123.35 on 291 degrees of freedom
## AIC: 137.35
##
## Number of Fisher Scoring iterations: 8
# using tidy()
tidy(lr_fit) %>%
  mutate(signPredictor = ifelse(p.value < 0.05, "Yes", "No")) %>%
  filter(signPredictor == "Yes")
## # A tibble: 3 x 6
## term estimate std.error statistic p.value signPredictor
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 17.9 6.57 2.73 0.00641 Yes
## 2 weight 0.00451 0.00122 3.68 0.000233 Yes
## 3 year -0.428 0.0834 -5.12 0.000000298 Yes
Predict on the training data set and save the result as training_pred. To combine the two data sets without conflicting columns in the following question, we will use predict() rather than augment(). (The name auto_training should be renamed to match your training data set if needed.) The accuracy on the training data is 90.3%, which is excellent.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.903
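The chunk producing the table above is not echoed; a minimal sketch using predict() and bind_cols(), assuming the training data is named auto_train:
predict(lr_fit, new_data = auto_train) %>%
  bind_cols(auto_train) -> training_pred
training_pred %>%
  accuracy(truth = mpg, estimate = .pred_class)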
The accuracy on the testing data is 89.9%, which is also very good.
augment(lr_fit, new_data = auto_test) -> testing_pred
accuracy(testing_pred, truth = mpg, estimate = .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.899
The training and testing accuracies are very close, which suggests the model is not overfitting. The high accuracies indicate that the predictors in the auto data frame do a good job of predicting whether a car has high or low miles per gallon (mpg).
This exercise should be answered using the Weekly data set, which is part of the ISLR package. If you don't have it installed already, you can install it with:
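A minimal install, assuming the CRAN package name ISLR:
install.packages("ISLR")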
To load the data set run the following code
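The loading chunk was not captured in the original; presumably:
library(ISLR)
data("Weekly")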
This data is similar in nature to the Smarket data from chapter 4's lab; it contains 1089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
The summary table below shows the approximate range of each variable, but it does not reveal any obvious relationships between the variables.
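The chunk producing the dimensions and summary below is not echoed; presumably:
dim(Weekly)
summary(Weekly)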
## [1] 1089 9
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
The table of correlations is shown below. It is also worth plotting this table so that the pattern is clearer to the reader.
## # A tibble: 8 x 8
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.0323 -0.0334 -0.0300 -0.0311 -0.0305 0.842 -0.0325
## 2 -0.0323 1 -0.0749 0.0586 -0.0713 -0.00818 -0.0650 -0.0750
## 3 -0.0334 -0.0749 1 -0.0757 0.0584 -0.0725 -0.0855 0.0592
## 4 -0.0300 0.0586 -0.0757 1 -0.0754 0.0607 -0.0693 -0.0712
## 5 -0.0311 -0.0713 0.0584 -0.0754 1 -0.0757 -0.0611 -0.00783
## 6 -0.0305 -0.00818 -0.0725 0.0607 -0.0757 1 -0.0585 0.0110
## 7 0.842 -0.0650 -0.0855 -0.0693 -0.0611 -0.0585 1 -0.0331
## 8 -0.0325 -0.0750 0.0592 -0.0712 -0.00783 0.0110 -0.0331 1
After removing the Direction variable (it is qualitative), we can compute the correlations between the remaining eight variables. The plot shows that Year and Volume have a strong linear relationship.
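The correlation chunk is not echoed above; a minimal sketch that reproduces the tibble (the plot itself could be drawn with, e.g., corrplot::corrplot() on the same matrix):
Weekly %>%
  select(-Direction) %>%
  cor() %>%
  as_tibble()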
This plot depicts an increasing volume of stock trading over time; the volume keeps rising as time passes. Is everyone now wealthy? Recall that Direction is a factor with levels Down and Up indicating whether the market had a negative or positive return on a given week: Down = negative return, Up = positive return.
Fit a logistic regression model (logistic_reg()) with Direction as the response and the five lag variables plus Volume as predictors. Use summary() (remember to call summary(model_fit$fit)) to print the results. Do any of the predictors appear to be statistically significant? If so, which ones? The p-values of the intercept and \(b_2\) are statistically significant at the \(\alpha = 0.05\) level. That is, for the predictor Lag2 we can reject the null hypothesis, meaning there is evidence that Lag2 has an impact on Direction while the other predictors are held constant.
lr_spec %>%
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly) -> lr_Weekly_fit
lr_Weekly_fit$fit %>%
  summary()
##
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Use conf_mat() and accuracy() from the yardstick package to compute the confusion matrix and the accuracy (overall fraction of correct predictions). Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression. As shown in the confusion matrix, the logistic regression model uses the five lag variables and Volume as predictors. The rows of the confusion matrix represent the predictions, while the columns represent the ground truth. The top-left cell holds the true Downs and the bottom-right cell the true Ups; these two values indicate how often the model is right. The model predicts Up for almost every week: it captures 557 of the 605 true Up weeks, but correctly identifies only 54 of the 484 true Down weeks, misclassifying the other 430 as Up. Compared to the ground truth, the predictions do not work well; the model essentially calls nearly everything Up.
preds_Weekly <- augment(lr_Weekly_fit, new_data = Weekly)
preds_Weekly %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> confusionMatrixWeekly
confusionMatrixWeekly
## Truth
## Prediction Down Up
## Down 54 48
## Up 430 557
# we can also plot the confusion matrix
confusionMatrixWeekly %>%
  autoplot(type = "heatmap") +
  theme_minimal()
Look at the table below, which reports the fraction of observations predicted correctly. The model predicts the direction correctly in only 611 (54 + 557) of the 1089 weeks, for an accuracy of about 0.561.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.561
Now fit the logistic regression model with Lag2 as the only predictor. Compute the confusion matrix and accuracy metric using the testing data set. The intercept and Lag2 are both significant at the 5% level.
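The creation of weekly_training and weekly_testing is not shown; a sketch assuming the conventional ISLR split, training on 1990-2008 and testing on 2009-2010 (consistent with the 104 test weeks in the confusion matrices below):
weekly_training <- Weekly %>% filter(Year <= 2008)
weekly_testing <- Weekly %>% filter(Year > 2008)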
lr_spec %>%
  fit(Direction ~ Lag2, data = weekly_training) -> lr_Lag2_fit
lr_Lag2_fit$fit %>%
  tidy() %>%
  mutate(hypo = ifelse(p.value < 0.05, "H1", "H0"))
## # A tibble: 2 x 6
## term estimate std.error statistic p.value hypo
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 0.203 0.0643 3.16 0.00157 H1
## 2 Lag2 0.0581 0.0287 2.02 0.0430 H1
Keeping only Lag2 as a predictor, the test-set accuracy of the logistic regression model rises to 0.625.
preds_lr_Lag2_fit <- augment(lr_Lag2_fit, new_data = weekly_testing)
# confusion matrix table
preds_lr_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> lrMatrix
lrMatrix
## Truth
## Prediction Down Up
## Down 9 5
## Up 34 56
# plot the confusion matrix
lrMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")
# accuracy
preds_lr_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "Logistic Regression") -> lr_acc
lr_acc
## # A tibble: 1 x 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.625 Logistic Regression
# set LDA specification
lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS") # the default engine is "MASS"; the alternative is "mda"
Fit the linear discriminant analysis model.
## parsnip model object
##
## Fit time: 4ms
## Call:
## lda(Direction ~ Lag2, data = data)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
The result is identical to that of the logistic regression model above: using linear discriminant analysis, the model predicts the Direction trend with 0.625 accuracy.
preds_lda_Lag2_fit <- augment(lda_Lag2_fit, new_data = weekly_testing)
# confusion matrix table
preds_lda_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> ldaMatrix
ldaMatrix
## Truth
## Prediction Down Up
## Down 9 5
## Up 34 56
# plot the confusion matrix
ldaMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")
# accuracy
preds_lda_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "Linear Discriminant Analysis") -> lda_acc
lda_acc
## # A tibble: 1 x 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.625 Linear Discriminant Analysis
# set QDA specification
qda_spec <- discrim_regularized() %>%
  set_mode("classification") %>%
  set_args(frac_common_cov = 0, frac_identity = 0) %>%
  set_engine("klaR")
Fit the Quadratic Discriminant Analysis model.
## parsnip model object
##
## Fit time: 34ms
## Call:
## rda(formula = Direction ~ Lag2, data = data, lambda = ~0, gamma = ~0)
##
## Regularization parameters:
## gamma lambda
## 0 0
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Misclassification rate:
## apparent: 44.772 %
The Quadratic Discriminant Analysis model has a lower accuracy, 0.587, than the LDA and logistic regression models. Looking at the confusion matrix, the model never predicts a downward weekly trend; classifying every week as Up is the main weakness dragging down its accuracy.
preds_qda_Lag2_fit <- augment(qda_Lag2_fit, new_data = weekly_testing)
# confusion matrix table
preds_qda_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> qdaMatrix
qdaMatrix
## Truth
## Prediction Down Up
## Down 0 0
## Up 43 61
# plot the confusion matrix
qdaMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")
# accuracy
preds_qda_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "Quadratic Discriminant Analysis") -> qda_acc
qda_acc
## # A tibble: 1 x 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.587 Quadratic Discriminant Analysis
Fit a K-nearest neighbors model with K = 1.
# set a KNN specification
knn_spec <- nearest_neighbor(neighbors = 1) %>%
  set_mode("classification") %>%
  set_engine("kknn")
Fit the K-Nearest Neighbors algorithm model.
## parsnip model object
##
## Fit time: 26ms
##
## Call:
## kknn::train.kknn(formula = Direction ~ Lag2, data = data, ks = min_rows(1, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.4741117
## Best kernel: optimal
## Best k: 1
The K-Nearest Neighbors algorithm produced a classification model with a 50% accuracy rate, which is the same as random chance.
preds_knn_Lag2_fit <- augment(knn_Lag2_fit, new_data = weekly_testing)
# confusion matrix table
preds_knn_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> knnMatrix
knnMatrix
## Truth
## Prediction Down Up
## Down 22 31
## Up 21 30
# plot the confusion matrix
knnMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")
# accuracy
preds_knn_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "K-Nearest Neighbors") -> knn_acc
knn_acc
## # A tibble: 1 x 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.5 K-Nearest Neighbors
Using Lag2 as the only predictor of Direction in the Weekly data set, the logistic regression and linear discriminant analysis models are the best methods based on accuracy; the table below lists all four models in descending order.
## # A tibble: 4 x 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.625 Logistic Regression
## 2 accuracy binary 0.625 Linear Discriminant Analysis
## 3 accuracy binary 0.587 Quadratic Discriminant Analysis
## 4 accuracy binary 0.5 K-Nearest Neighbors
Alternatively, we can use another approach, one that takes a little more statistical sense and programming skill, to compare several models at once.
# create a list of the models we fit before
allModels <- list("Logistic Regression" = lr_Lag2_fit,
                  "LDA" = lda_Lag2_fit,
                  "QDA" = qda_Lag2_fit,
                  "KNN" = knn_Lag2_fit)
Then, using the purrr package's imap_dfr(), apply augment() to each of the models with the testing data set. The argument .id = "model" adds a column called "model" to the resulting tibble containing the names of allModels.
allPreds <- imap_dfr(allModels, augment,
                     new_data = weekly_testing, .id = "model")
allPreds %>%
  select(model, Direction, .pred_class, .pred_Down, .pred_Up) %>%
  head()
## model Direction .pred_class .pred_Down .pred_Up
## 1 Logistic Regression Down Up 0.4738709 0.5261291
## 2 Logistic Regression Down Up 0.3552636 0.6447364
## 3 Logistic Regression Down Down 0.5137841 0.4862159
## 4 Logistic Regression Down Down 0.5147999 0.4852001
## 5 Logistic Regression Up Up 0.4802333 0.5197667
## 6 Logistic Regression Down Up 0.4598745 0.5401255
The yardstick package provides many different metrics, which can be bundled into a single function with metric_set().
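The definition of the multi_metric function used below is not echoed; based on the metrics appearing in the output (accuracy, sensitivity, specificity), presumably:
multi_metric <- metric_set(accuracy, sens, spec)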
Group the predictions with group_by() to calculate the metrics for each model, and arrange with desc() to compare the models across the different metrics.
allPreds %>%
  group_by(model) %>%
  multi_metric(truth = Direction, estimate = .pred_class) %>%
  arrange(desc(.estimate))
## # A tibble: 12 x 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 QDA spec binary 1
## 2 LDA spec binary 0.918
## 3 Logistic Regression spec binary 0.918
## 4 LDA accuracy binary 0.625
## 5 Logistic Regression accuracy binary 0.625
## 6 QDA accuracy binary 0.587
## 7 KNN sens binary 0.512
## 8 KNN accuracy binary 0.5
## 9 KNN spec binary 0.492
## 10 LDA sens binary 0.209
## 11 Logistic Regression sens binary 0.209
## 12 QDA sens binary 0
The yardstick package also provides roc_curve() for creating receiver operating characteristic (ROC) curves. Because the LDA curve lies exactly on top of the logistic regression curve, it is not separately visible here.
allPreds %>%
  group_by(model) %>%
  roc_curve(Direction, .pred_Down) %>%
  autoplot() +
  ggtitle("ROC Curves")
There are many ways to tune a model. In this section, we focus only on the number of neighbors in the KNN model. A common rule of thumb is to start near the square root of the number of observations; the square root of 1089 is 33, so we start with 33 neighbors and then look for the best value of k.
## [1] 33
# set a KNN specification
knn_spec_33 <- nearest_neighbor(neighbors = 33) %>%
  set_mode("classification") %>%
  set_engine("kknn")
Fit the K-Nearest Neighbors algorithm model.
## parsnip model object
##
## Fit time: 43ms
##
## Call:
## kknn::train.kknn(formula = Direction ~ Lag2, data = data, ks = min_rows(33, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.4284264
## Best kernel: optimal
## Best k: 33
The test accuracy of the knn33_Lag2_fit model is about 6 percentage points higher than that of the previous KNN model (K = 1).
preds_knn33_Lag2_fit <- augment(knn33_Lag2_fit, new_data = weekly_testing)
# confusion matrix table
preds_knn33_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> knn33Matrix
knn33Matrix
## Truth
## Prediction Down Up
## Down 19 22
## Up 24 39
# plot the confusion matrix
knn33Matrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")
# accuracy
preds_knn33_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "KNN with 33 Neighbors") -> knn33_acc
bind_rows(knn33_acc, knn_acc) %>%
  mutate(model = recode(model, "K-Nearest Neighbors" = "KNN with 1 Neighbor"))
## # A tibble: 2 x 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.558 KNN with 33 Neighbors
## 2 accuracy binary 0.5 KNN with 1 Neighbor