In this assignment, I will be using tidymodels instead of base R for the coding.

1 Exercise 1 (10 points)

Suppose we collect data for a group of students in a statistics class with variables \(X_1\) = hours studied, \(X_2\) = undergrad GPA, and \(Y\) = receive an A. We fit a logistic regression and produce estimated coefficients \(\hat{\beta}_0=-6\), \(\hat{\beta}_1=0.05\), and \(\hat{\beta}_2=1\).

The equation of logistic regression is: \[ p(X) = \frac{e^{\beta_0 + \beta_1X_1 + … + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + … + \beta_pX_p}} \]

  a. Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of \(3.5\) gets an A in the class.

According to the question, \(X_1\) is 40 hours and \(X_2\) is \(3.5\), so the probability will be: \[ \hat{p}(X) = \frac{e^{-6 + 0.05X_1 + 1X_2}}{(1 + e^{-6 + 0.05X_1 + 1X_2})} \]

e is Euler’s number, approximately 2.7182818284590452353602874713527…. Using exp() in the console, the probability of getting an A is 0.3775: \[ \hat{p}(X) = \frac{e^{-6 + 0.05 \times 40 + 1 \times 3.5}}{(1 + e^{-6 + 0.05 \times 40 + 1 \times 3.5})} = 0.3775407 \] We can also write a function to compute the probability; the answer is \(p \approx 37.8\%\).

prob <- function(x1, x2) {
  stopifnot(is.numeric(x1), is.numeric(x2))
  # odds of getting an A: exponentiated linear predictor
  tmp <- exp(-6 + 0.05 * x1 + 1 * x2)
  # logistic transformation: odds / (1 + odds)
  tmp / (1 + tmp)
}

prob(40, 3.5)
## [1] 0.3775407
  b. How many hours would that student in part (a) need to study to have a 50% chance of getting an A in the class?

Now the number of study hours is unknown, so we need to solve for \(X_1\): \[ \frac{e^{-6 + 0.05X_1 + 1 \times 3.5}}{(1 + e^{-6 + 0.05X_1 + 1 \times 3.5})} = 0.5 \] We test the values 1–100 to find which number of hours (\(X_1\)) matches exactly a 50% chance of getting an A. The answer is 50 hours; a closed-form solution is shown after the numerical search below.

hours <- seq(1 ,100 ,1)
probs <- mapply(hours, 3.5, FUN = prob) #X1, X2, function

names(probs) <- paste(hours,"Hours")
probs %>% 
  tidy() %>%
  filter(x == 0.5)
## # A tibble: 1 x 2
##   names        x
##   <chr>    <dbl>
## 1 50 Hours   0.5
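
Alternatively, we can solve for \(X_1\) in closed form: the probability is \(0.5\) exactly when the log-odds equal zero, so \[ -6 + 0.05X_1 + 1 \times 3.5 = 0 \iff X_1 = \frac{6 - 3.5}{0.05} = 50 \] A quick check in the console:

# solve -6 + 0.05 * x1 + 1 * 3.5 = 0 for x1
(6 - 1 * 3.5) / 0.05
## [1] 50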

2 Exercise 2 (10 points)

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First, we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next, we use 1-nearest neighbors (i.e. \(K = 1\)) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

The training error and the test error are two important concepts used in machine learning.

The test error rate is the error computed on data that was not seen during the training phase. In this question, we should focus on the test error rate.

According to the question, we can conclude:

Even though the average error rate is 18% for the KNN procedure, with \(K=1\) the KNN training error rate is actually 0%: each training observation is its own nearest neighbor, so no errors are made on the training data. Because the training and test sets are equally sized, an average error rate of 18% implies a test error of \(2 \times 18\% - 0\% = 36\%\). In conclusion, the KNN test error rate is 36%, which is higher than the 30% test error rate of logistic regression. Therefore, we prefer logistic regression for classification of new observations.
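
As a quick check of the arithmetic (using the equally-sized split stated in the question):

# K = 1 fits the training data perfectly, so its training error is 0
knn_train_error <- 0
knn_avg_error <- 0.18
# the average is (train + test) / 2, so test = 2 * average - train
knn_test_error <- 2 * knn_avg_error - knn_train_error
knn_test_error
## [1] 0.36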

3 Exercise 3 (15 points)

In this exercise you will examine the differences between LDA and QDA.

  a. If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
  b. If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
  c. In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
  d. True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer. (A small simulation sketch follows this list.)
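
The exercise asks for reasoning, but a small simulation helps illustrate parts (c) and (d). This is only a sketch with invented parameters: the helper make_data() and every number in it are arbitrary. Both classes share a common covariance, so the Bayes boundary is linear, and QDA's extra flexibility typically adds variance without reducing bias. The model specifications mirror the ones used in Exercise 5 below.

# a minimal simulation sketch (all parameters are arbitrary)
library(discrim) # for discrim_linear() and discrim_regularized()
set.seed(42)

# two Gaussian classes with a shared identity covariance: linear Bayes boundary
make_data <- function(n) {
  tibble(
    class = factor(rep(c("A", "B"), each = n / 2)),
    x1 = rnorm(n, mean = ifelse(class == "A", 0, 1)),
    x2 = rnorm(n, mean = ifelse(class == "A", 0, 1))
  )
}
sim_train <- make_data(100)
sim_test <- make_data(10000)

# LDA assumes a common covariance, matching the truth here
lda_sim <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS") %>%
  fit(class ~ x1 + x2, data = sim_train)

# frac_common_cov = 0 and frac_identity = 0 reduce the regularized model to QDA
qda_sim <- discrim_regularized(frac_common_cov = 0, frac_identity = 0) %>%
  set_mode("classification") %>%
  set_engine("klaR") %>%
  fit(class ~ x1 + x2, data = sim_train)

# compare test accuracies; LDA should win slightly on average
bind_rows(
  augment(lda_sim, new_data = sim_test) %>%
    accuracy(truth = class, estimate = .pred_class) %>%
    mutate(model = "LDA"),
  augment(qda_sim, new_data = sim_test) %>%
    accuracy(truth = class, estimate = .pred_class) %>%
    mutate(model = "QDA")
)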

4 Exercise 4 (20 points)

In this exercise, we will explore a data set about cars called auto, which you can find here.

The data set contains 1 factor variable and 6 numeric variables. The factor variable mpg has two levels, high and low, indicating whether the car has high or low miles per gallon. In this exercise we will investigate whether we can use a logistic regression classifier to predict if a car has high or low mpg from the other variables.

  a. Read in the data and create a test-train rsplit object of auto using initial_split(). Use default arguments for initial_split().
# read the data
autoDF <- read_csv("data/auto.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   mpg = col_character(),
##   cylinders = col_double(),
##   displacement = col_double(),
##   horsepower = col_double(),
##   weight = col_double(),
##   acceleration = col_double(),
##   year = col_double()
## )
# see each variable's type
str(autoDF)
## spec_tbl_df [397 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mpg         : chr [1:397] "low" "low" "low" "low" ...
##  $ cylinders   : num [1:397] 8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num [1:397] 307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num [1:397] 130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num [1:397] 3504 3693 3436 3433 3449 ...
##  $ acceleration: num [1:397] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num [1:397] 70 70 70 70 70 70 70 70 70 70 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mpg = col_character(),
##   ..   cylinders = col_double(),
##   ..   displacement = col_double(),
##   ..   horsepower = col_double(),
##   ..   weight = col_double(),
##   ..   acceleration = col_double(),
##   ..   year = col_double()
##   .. )
# mutate the response variable to be a factor type
autoDF %>%
  mutate(mpg = as.factor(mpg)) -> autoDF
# set a seed
set.seed(1)
# splitting to train and test sets
auto_split <- initial_split(autoDF, strata = mpg)
auto_split
## <Analysis/Assess/Total>
## <298/99/397>
  b. Create the training and testing data set with training() and testing() respectively.
auto_train <- training(auto_split)
auto_train %>%
  nrow()
## [1] 298
auto_test <- testing(auto_split)
auto_test %>%
  nrow()
## [1] 99
  c. Fit a logistic regression model using logistic_reg(). Use all 6 numeric variables as predictors (a formula shorthand is to write mpg ~ . where . means everything). Remember to fit the model using only the training data set.
# create a model formula
auto_formula <- mpg ~ .

# set the logistic regression engine
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# fit the model
lr_spec %>%
  fit(auto_formula, data = auto_train) -> lr_fit
lr_fit
## parsnip model object
## 
## Fit time:  9ms 
## 
## Call:  stats::glm(formula = mpg ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##  (Intercept)     cylinders  displacement    horsepower        weight  
##    17.897923     -0.166662      0.007754      0.026401      0.004506  
## acceleration          year  
##    -0.067708     -0.427629  
## 
## Degrees of Freedom: 297 Total (i.e. Null);  291 Residual
## Null Deviance:       413.1 
## Residual Deviance: 123.4     AIC: 137.4
  d. Inspect the model with summary() and tidy(). Which of the variables are significant?

According to the two tables below, the intercept, weight, and year are significant; that is, \(\hat{\beta}_0\), \(\hat{\beta}_4\), and \(\hat{\beta}_6\). The significant predictor variables are weight and year.

# using summary()
lr_fit %>%
  pluck("fit") %>% # unest the list
  summary()
## 
## Call:
## stats::glm(formula = mpg ~ ., family = stats::binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2241  -0.2326  -0.0209   0.1218   2.1663  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  17.897923   6.566039   2.726 0.006414 ** 
## cylinders    -0.166662   0.470427  -0.354 0.723131    
## displacement  0.007754   0.011691   0.663 0.507163    
## horsepower    0.026401   0.027258   0.969 0.332766    
## weight        0.004506   0.001225   3.680 0.000233 ***
## acceleration -0.067708   0.152738  -0.443 0.657553    
## year         -0.427629   0.083449  -5.124 2.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 413.10  on 297  degrees of freedom
## Residual deviance: 123.35  on 291  degrees of freedom
## AIC: 137.35
## 
## Number of Fisher Scoring iterations: 8
# using tidy()
tidy(lr_fit) %>%
  mutate(signPredictor = ifelse(p.value < 0.05, "Yes", "No")) %>%
  filter(signPredictor == "Yes")
## # A tibble: 3 x 6
##   term        estimate std.error statistic     p.value signPredictor
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl> <chr>        
## 1 (Intercept) 17.9       6.57         2.73 0.00641     Yes          
## 2 weight       0.00451   0.00122      3.68 0.000233    Yes          
## 3 year        -0.428     0.0834      -5.12 0.000000298 Yes
  e. Predict values for the training data set and save them as training_pred.

In order to combine the two data sets in the following question without conflicting columns, we will use predict() rather than augment(); the difference is illustrated after the code below.

predict(lr_fit, new_data = auto_train) -> training_pred
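
To see the difference, compare the columns each function returns: predict() yields only the prediction column, while augment() carries every original column plus the predictions, which would collide with auto_train inside bind_cols(). A quick check:

# predict() returns only .pred_class; augment() also keeps all original columns
predict(lr_fit, new_data = auto_train) %>% names()
augment(lr_fit, new_data = auto_train) %>% names()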
  f. Use the following code to calculate the training accuracy (auto_training should be renamed to match your training data set if needed).

The accuracy on the training data is 90.3%, which is excellent.

bind_cols(training_pred, auto_train) %>%
  accuracy(truth = mpg, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.903
  g. Predict values for the testing data set and use the above code to calculate the testing accuracy. Compare.

The accuracy on the testing data is 89.9%, which is outstanding.

augment(lr_fit, new_data = auto_test) -> testing_pred 
accuracy(testing_pred, truth = mpg, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.899

The accuracies on the training and testing data sets are extremely close, which suggests the model is not overfitting. Moreover, the high accuracy values indicate that the predictors in the auto data frame are good at predicting whether a car has high or low miles per gallon (mpg).

5 Exercise 5 (20 points)

This exercise should be answered using the Weekly data set, which is part of the ISLR package. If you don’t have it installed already you can install it with

# if necessary please run the below line
# install.packages("ISLR")

To load the data set run the following code

library(ISLR)
data("Weekly")

This data is similar in nature to the Smarket data from chapter 4’s lab; it contains 1089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

Weekly %>%
  head()
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
  a. Produce some numerical and graphical summaries of the data. Do there appear to be any patterns?

Although we can see the approximate range of each variable, the summary table does not reveal any obvious relationships between the variables.

# numerical summary
dim(Weekly)
## [1] 1089    9
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 

The table of correlations is shown below. We suggest plotting this table so that it is clearer to the reader.

cor(Weekly[-9]) %>%
  as_tibble()
## # A tibble: 8 x 8
##      Year     Lag1    Lag2    Lag3     Lag4     Lag5  Volume    Today
##     <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1  1      -0.0323  -0.0334 -0.0300 -0.0311  -0.0305   0.842  -0.0325 
## 2 -0.0323  1       -0.0749  0.0586 -0.0713  -0.00818 -0.0650 -0.0750 
## 3 -0.0334 -0.0749   1      -0.0757  0.0584  -0.0725  -0.0855  0.0592 
## 4 -0.0300  0.0586  -0.0757  1      -0.0754   0.0607  -0.0693 -0.0712 
## 5 -0.0311 -0.0713   0.0584 -0.0754  1       -0.0757  -0.0611 -0.00783
## 6 -0.0305 -0.00818 -0.0725  0.0607 -0.0757   1       -0.0585  0.0110 
## 7  0.842  -0.0650  -0.0855 -0.0693 -0.0611  -0.0585   1      -0.0331 
## 8 -0.0325 -0.0750   0.0592 -0.0712 -0.00783  0.0110  -0.0331  1

We remove the Direction variable because it is qualitative, and we can then see the correlations between the remaining eight variables. The plot shows that Year and Volume have a strong linear relationship.

# graphical summary (load corrplot if it is not already attached)
library(corrplot)
corrplot(cor(Weekly[1:8]), method = "square")

This plot depicts an increasing volume of stock trading over time: the volume continues to rise as time passes. Is everyone now wealthy? Recall that Direction is a factor with levels Down and Up indicating whether the market had a negative or positive return on a given week; that is, Down = negative return, Up = positive return.

Weekly %>%
  ggplot(aes(as.factor(Year), Volume)) +
  geom_boxplot() +
  theme_bw()

  b. Use the whole data set to perform a logistic regression (with logistic_reg()) with Direction as the response and the five lag variables plus Volume as predictors. Use the summary() function (remember to call summary(model_fit$fit)) to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

The p-values of the intercept and \(\hat{\beta}_2\) are statistically significant at the significance level \(\alpha = 0.05\). That is, for the predictor Lag2 we can reject the null hypothesis, meaning there is evidence that Lag2 has an impact on Direction while the other predictors are held constant.

lr_spec %>%
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly) -> lr_Weekly_fit

lr_Weekly_fit$fit %>%
  summary()
## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4
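
To make the Lag2 estimate easier to interpret, we can exponentiate it into an odds ratio:

# each one-unit increase in Lag2 multiplies the odds of an Up week by about 1.06
exp(0.05844)
## [1] 1.060181

That is, holding the other predictors constant, the odds of an Up week rise by roughly 6% per unit increase in Lag2.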
  c. Use conf_mat() and accuracy() from the yardstick package to calculate the confusion matrix and the accuracy (overall fraction of correct predictions). Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

As shown in the confusion matrix, the logistic regression model uses the five Lag variables and Volume as predictors. The rows of the confusion matrix represent the predictions, while the columns represent the ground truth. The upper-left cell counts the true Downs and the lower-right cell counts the true Ups; these two cells tell us when the model is right. There are 54 true Downs against 48 false Downs, and 557 true Ups against 430 false Ups. Compared to the ground truth, the predictions do not work well: the model predicts Up for 987 of the 1089 weeks, so it catches only 54 of the 484 true Down weeks, and its overall accuracy is barely better than a coin flip.

preds_Weekly <- augment(lr_Weekly_fit, new_data = Weekly)
preds_Weekly %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> confusionMatrixWeekly
confusionMatrixWeekly
##           Truth
## Prediction Down  Up
##       Down   54  48
##       Up    430 557
# we can also plot the confusion matrix
confusionMatrixWeekly %>%
  autoplot(type = "heatmap") +
  theme_minimal()

Look at the table below, which reports the overall fraction of correct predictions. The model predicts the direction correctly in only 611 (54 + 557) of the 1089 weeks, for an accuracy of about 0.561.

# accuracy
preds_Weekly %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.561
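
The two kinds of mistakes can also be quantified directly with sensitivity and specificity from yardstick (by default the first factor level, Down, is treated as the event):

# sensitivity: true Down weeks caught, 54 / 484 (about 0.11)
preds_Weekly %>%
  sens(truth = Direction, estimate = .pred_class)
# specificity: true Up weeks caught, 557 / 605 (about 0.92)
preds_Weekly %>%
  spec(truth = Direction, estimate = .pred_class)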
  d. Split the data into a training and testing data set using the following code
weekly_training <- Weekly %>% filter(Year <= 2008)
weekly_testing <- Weekly %>% filter(Year > 2008)
  e. Now fit the logistic regression model using the training data, with Lag2 as the only predictor. Compute the confusion matrix and accuracy metric using the testing data set.

The intercept and Lag2 are both significant at the 0.05 level.

lr_spec %>%
  fit(Direction ~ Lag2, data = weekly_training) -> lr_Lag2_fit

lr_Lag2_fit$fit %>%
  tidy() %>%
  mutate(hypo = ifelse(p.value < 0.05, "H1", "H0"))
## # A tibble: 2 x 6
##   term        estimate std.error statistic p.value hypo 
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>
## 1 (Intercept)   0.203     0.0643      3.16 0.00157 H1   
## 2 Lag2          0.0581    0.0287      2.02 0.0430  H1

Keeping only Lag2 as a predictor, the predicted accuracy of the logistic regression model on the testing data rises to 0.625.

preds_lr_Lag2_fit <- augment(lr_Lag2_fit, new_data = weekly_testing)

# confusion matrix table
preds_lr_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> lrMatrix
lrMatrix
##           Truth
## Prediction Down Up
##       Down    9  5
##       Up     34 56
# plot the confusion matrix
lrMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")

# accuracy
preds_lr_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "Logistic Regression") -> lr_acc
lr_acc 
## # A tibble: 1 x 4
##   .metric  .estimator .estimate model              
##   <chr>    <chr>          <dbl> <chr>              
## 1 accuracy binary         0.625 Logistic Regression
  f. Repeat (e) using LDA.
# set LDA specification
lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS") # the default is "MASS", otherwise is "mda"

Fit the linear discriminant analysis model.

lda_Lag2_fit <- fit(lda_spec, Direction ~ Lag2, data = weekly_training)
lda_Lag2_fit
## parsnip model object
## 
## Fit time:  4ms 
## Call:
## lda(Direction ~ Lag2, data = data)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

The result is identical to that of the logistic regression model above: using linear discriminant analysis, the model predicts the Direction trend with the same 0.625 accuracy and the same confusion matrix.

preds_lda_Lag2_fit <- augment(lda_Lag2_fit , new_data = weekly_testing)

# confusion matrix table
preds_lda_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> ldaMatrix
ldaMatrix
##           Truth
## Prediction Down Up
##       Down    9  5
##       Up     34 56
# plot the confusion matrix
ldaMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")

# accuracy
preds_lda_Lag2_fit %>%
accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "Linear Discriminant Analysis") -> lda_acc
lda_acc
## # A tibble: 1 x 4
##   .metric  .estimator .estimate model                       
##   <chr>    <chr>          <dbl> <chr>                       
## 1 accuracy binary         0.625 Linear Discriminant Analysis
  g. Repeat (e) using QDA.
# set QDA specification: with frac_common_cov = 0 and frac_identity = 0,
# the regularized discriminant model reduces to QDA
qda_spec <- discrim_regularized() %>%
  set_mode("classification") %>%
  set_args(frac_common_cov = 0, frac_identity = 0) %>%
  set_engine("klaR")

Fit the Quadratic Discriminant Analysis model.

qda_Lag2_fit <- fit(qda_spec, Direction ~ Lag2, data = weekly_training)
qda_Lag2_fit
## parsnip model object
## 
## Fit time:  34ms 
## Call: 
## rda(formula = Direction ~ Lag2, data = data, lambda = ~0, gamma = ~0)
## 
## Regularization parameters: 
##  gamma lambda 
##      0      0 
## 
## Prior probabilities of groups: 
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Misclassification rate: 
##        apparent: 44.772 %

The quadratic discriminant analysis model has a lower accuracy, 0.587, than the LDA and logistic regression models. Looking at the confusion matrix, the model never predicts a downward week; this is its main weakness and lowers the predicted accuracy.

preds_qda_Lag2_fit <- augment(qda_Lag2_fit , new_data = weekly_testing)

# confusion matrix table
preds_qda_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> qdaMatrix
qdaMatrix
##           Truth
## Prediction Down Up
##       Down    0  0
##       Up     43 61
# plot the confusion matrix
qdaMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")

# accuracy
preds_qda_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "Quadratic Discriminant Analysis") -> qda_acc
qda_acc
## # A tibble: 1 x 4
##   .metric  .estimator .estimate model                          
##   <chr>    <chr>          <dbl> <chr>                          
## 1 accuracy binary         0.587 Quadratic Discriminant Analysis
  h. Repeat (e) using KNN with K = 1.
# set a KNN specification
knn_spec <- nearest_neighbor(neighbors = 1) %>%
  set_mode("classification") %>%
  set_engine("kknn")

Fit the K-nearest neighbors model.

knn_Lag2_fit <- fit(knn_spec, Direction ~ Lag2, data = weekly_training)
knn_Lag2_fit
## parsnip model object
## 
## Fit time:  26ms 
## 
## Call:
## kknn::train.kknn(formula = Direction ~ Lag2, data = data, ks = min_rows(1,     data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.4741117
## Best kernel: optimal
## Best k: 1

The K-Nearest Neighbors algorithm produced a classification model with a 50% accuracy rate, which is the same as random chance.

preds_knn_Lag2_fit <- augment(knn_Lag2_fit , new_data = weekly_testing)

# confusion matrix table
preds_knn_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> knnMatrix
knnMatrix
##           Truth
## Prediction Down Up
##       Down   22 31
##       Up     21 30
# plot the confusion matrix
knnMatrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")

# accuracy
preds_knn_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "K-Nearest Neighbors") -> knn_acc
knn_acc
## # A tibble: 1 x 4
##   .metric  .estimator .estimate model              
##   <chr>    <chr>          <dbl> <chr>              
## 1 accuracy binary           0.5 K-Nearest Neighbors
  i. Which of these methods appear to provide the best results on the data?

Using only Lag2 as the predictor of upward and downward Direction in the Weekly dataset, the logistic regression and linear discriminant analysis models are the best methods, based on the accuracy rates sorted in descending order below.

lr_acc %>%
  bind_rows(lda_acc, qda_acc, knn_acc) %>%
  arrange(desc(.estimate))
## # A tibble: 4 x 4
##   .metric  .estimator .estimate model                          
##   <chr>    <chr>          <dbl> <chr>                          
## 1 accuracy binary         0.625 Logistic Regression            
## 2 accuracy binary         0.625 Linear Discriminant Analysis   
## 3 accuracy binary         0.587 Quadratic Discriminant Analysis
## 4 accuracy binary         0.5   K-Nearest Neighbors

Alternatively, we can compare several models at once with a more programmatic approach.

# create a list of the models fitted above
allModels <- list("Logistic Regression" = lr_Lag2_fit,
                  "LDA" = lda_Lag2_fit,
                  "QDA" = qda_Lag2_fit,
                  "KNN" = knn_Lag2_fit)

Then, using the purrr package’s imap_dfr(), apply augment() to each of the models using the testing data set. The argument .id = "model" adds a column called “model” to the resulting tibble that contains the names of allModels.

allPreds <- imap_dfr(allModels, augment, 
                  new_data = weekly_testing, .id = "model")

allPreds %>%
  select(model, Direction, .pred_class, .pred_Down, .pred_Up) %>%
  head()
##                 model Direction .pred_class .pred_Down  .pred_Up
## 1 Logistic Regression      Down          Up  0.4738709 0.5261291
## 2 Logistic Regression      Down          Up  0.3552636 0.6447364
## 3 Logistic Regression      Down        Down  0.5137841 0.4862159
## 4 Logistic Regression      Down        Down  0.5147999 0.4852001
## 5 Logistic Regression        Up          Up  0.4802333 0.5197667
## 6 Logistic Regression      Down          Up  0.4598745 0.5401255

The yardstick package provides multiple metrics at once through metric_set().

multi_metric <- metric_set(accuracy, sensitivity, specificity)
allPreds %>%
  group_by(model) %>%
  multi_metric(truth = Direction, estimate = .pred_class) %>%
  arrange(desc(.estimate))
## # A tibble: 12 x 4
##    model               .metric  .estimator .estimate
##    <chr>               <chr>    <chr>          <dbl>
##  1 QDA                 spec     binary         1    
##  2 LDA                 spec     binary         0.918
##  3 Logistic Regression spec     binary         0.918
##  4 LDA                 accuracy binary         0.625
##  5 Logistic Regression accuracy binary         0.625
##  6 QDA                 accuracy binary         0.587
##  7 KNN                 sens     binary         0.512
##  8 KNN                 accuracy binary         0.5  
##  9 KNN                 spec     binary         0.492
## 10 LDA                 sens     binary         0.209
## 11 Logistic Regression sens     binary         0.209
## 12 QDA                 sens     binary         0

The yardstick package also provides roc_curve() to create receiver operating characteristic (ROC) curves. Because the LDA curve is perfectly hidden behind the logistic regression curve, it is not visible here.

allPreds %>%
  group_by(model) %>%
  roc_curve(Direction, .pred_Down) %>%
  autoplot() +
  ggtitle("ROC Curves")

  j. (Optional) Experiment with different combinations of predictors for each of the methods. Report the variables, methods, and associated confusion matrix that appears to provide the best results on the held-out data. Note that you can also experiment with different values of K in KNN. (This kind of running many models and testing on the testing data set many times is not good practice. We will look at ways in later weeks on how we can properly explore multiple models.)

There are many ways to select predictors. In this section, we focus only on the number of neighbors in the KNN model. A common rule of thumb for a starting value of k is the square root of the total number of observations (1089), which is 33. We start with 33 neighbors and then look for the best value of k.

sqrt(nrow(Weekly))
## [1] 33
# set a KNN specification
knn_spec_33 <- nearest_neighbor(neighbors = 33) %>%
  set_mode("classification") %>%
  set_engine("kknn")

Fit the K-nearest neighbors model with 33 neighbors.

knn33_Lag2_fit <- fit(knn_spec_33, Direction ~ Lag2, data = weekly_training)
knn33_Lag2_fit
## parsnip model object
## 
## Fit time:  43ms 
## 
## Call:
## kknn::train.kknn(formula = Direction ~ Lag2, data = data, ks = min_rows(33,     data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.4284264
## Best kernel: optimal
## Best k: 33

The predicted accuracy of the knn33_Lag2_fit model is about 6 percentage points higher than that of the previous KNN model (K = 1).

preds_knn33_Lag2_fit <- augment(knn33_Lag2_fit , new_data = weekly_testing)

# confusion matrix table
preds_knn33_Lag2_fit %>%
  conf_mat(truth = Direction, estimate = .pred_class) -> knn33Matrix
knn33Matrix
##           Truth
## Prediction Down Up
##       Down   19 22
##       Up     24 39
# plot the confusion matrix
knn33Matrix %>%
  autoplot(type = "heatmap") +
  theme_tufte() +
  ggtitle("Confusion Matrix")

# accuracy
preds_knn33_Lag2_fit %>%
  accuracy(truth = Direction, estimate = .pred_class) %>%
  mutate(model = "KNN with 33 Neighbors") -> knn33_acc
bind_rows(knn33_acc, knn_acc) %>%
  mutate(model = recode(model, "K-Nearest Neighbors" = "KNN with 1 Neighbor"))
## # A tibble: 2 x 4
##   .metric  .estimator .estimate model                
##   <chr>    <chr>          <dbl> <chr>                
## 1 accuracy binary         0.558 KNN with 33 Neighbors
## 2 accuracy binary         0.5   KNN with 1 Neighbor
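
Rather than checking a single value, we can also sketch a small scan over several candidate values of k (the candidates below are arbitrary) and compare the testing accuracies:

# fit one KNN model per candidate k and collect its testing accuracy
candidate_ks <- c(1, 5, 15, 33, 51)

map_dfr(candidate_ks, function(k) {
  nearest_neighbor(neighbors = k) %>%
    set_mode("classification") %>%
    set_engine("kknn") %>%
    fit(Direction ~ Lag2, data = weekly_training) %>%
    augment(new_data = weekly_testing) %>%
    accuracy(truth = Direction, estimate = .pred_class) %>%
    mutate(neighbors = k)
}) %>%
  arrange(desc(.estimate))

As the optional note above warns, repeatedly evaluating on the test set is not good practice; in later weeks this kind of search would be done properly with cross-validation.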
