We will be using the add-on package discrim, which provides parsnip interfaces to discriminant analysis models, and the klaR package to perform the QDA calculations. If you haven't already installed them, run the commented-out install.packages() call in the setup chunk below.

Create a test-train rsplit object of mlc_churn using initial_split(). Use the arguments to set the proportion of the training data to 80%. Stratify the sampling according to the churn variable.

Do the following tasks for the LDA, QDA, and KNN models.

Setup

# install.packages(c("discrim", "klaR"))
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom     0.7.6      ✓ recipes   0.1.14
## ✓ dials     0.0.9      ✓ rsample   0.0.9 
## ✓ infer     0.5.3      ✓ tune      0.1.1 
## ✓ modeldata 0.1.0      ✓ workflows 0.2.1 
## ✓ parsnip   0.1.5      ✓ yardstick 0.0.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
library(discrim)
## 
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
## 
##     smoothness
data("mlc_churn")
head(mlc_churn)
## # A tibble: 6 x 20
##   state account_length area_code     international_plan voice_mail_plan
##   <fct>          <int> <fct>         <fct>              <fct>          
## 1 KS               128 area_code_415 no                 yes            
## 2 OH               107 area_code_415 no                 yes            
## 3 NJ               137 area_code_415 no                 no             
## 4 OH                84 area_code_408 yes                no             
## 5 OK                75 area_code_415 yes                no             
## 6 AL               118 area_code_510 yes                no             
## # … with 15 more variables: number_vmail_messages <int>,
## #   total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## #   total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## #   total_night_minutes <dbl>, total_night_calls <int>,
## #   total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## #   total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>
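
Before splitting, note that the outcome is imbalanced: far fewer customers churn than stay. This is why we stratify the split on churn below, and it will matter again when we read the confusion matrices. An optional quick check:

mlc_churn %>%
  count(churn)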

Splitting the data into training and testing sets.

set.seed(1234)
mlc_split <- initial_split(mlc_churn, prop = 0.8, strata = churn)
mlc_split
## <Analysis/Assess/Total>
## <4001/999/5000>
mlc_training <- training(mlc_split)
mlc_testing <- testing(mlc_split)
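
As another optional sanity check, stratified sampling should keep the churn proportions nearly identical in the two sets:

mlc_training %>% count(churn) %>% mutate(prop = n / sum(n))
mlc_testing %>% count(churn) %>% mutate(prop = n / sum(n))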

# model formula
churn_formula <- churn ~ number_vmail_messages + total_intl_minutes + total_intl_calls + total_intl_charge + number_customer_service_calls

LDA (Linear Discriminant Analysis)

  1. Fit a classification model. Use number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls as predictors. Remember to fit the model only using the training data set.
lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS") # the default is "MASS", otherwise is "mda"

Fit the LDA model

lda_fit <- fit(lda_spec, churn_formula, data = mlc_training)
lda_fit
## parsnip model object
## 
## Fit time:  13ms 
## Call:
## lda(churn ~ number_vmail_messages + total_intl_minutes + total_intl_calls + 
##     total_intl_charge + number_customer_service_calls, data = data)
## 
## Prior probabilities of groups:
##       yes        no 
## 0.1414646 0.8585354 
## 
## Group means:
##     number_vmail_messages total_intl_minutes total_intl_calls total_intl_charge
## yes              4.680212           10.62597         4.121908          2.869470
## no               8.084716           10.17383         4.487045          2.747441
##     number_customer_service_calls
## yes                      2.291519
## no                       1.453857
## 
## Coefficients of linear discriminants:
##                                       LD1
## number_vmail_messages          0.02711856
## total_intl_minutes             1.00746332
## total_intl_calls               0.08589699
## total_intl_charge             -4.07968773
## number_customer_service_calls -0.69807223

The summary shows that the training prior probabilities of churn are 0.14 for yes and 0.86 for no. The group means give the mean of each predictor within each class, and the coefficients of linear discriminants define the linear combination of predictors that forms the LDA decision rule: LD1 = 0.02711856*number_vmail_messages + 1.00746332*total_intl_minutes + 0.08589699*total_intl_calls - 4.07968773*total_intl_charge - 0.69807223*number_customer_service_calls. Because there are only two response classes in this model, there is only one set of coefficients (LD1).
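
To confirm that LD1 really is just this linear combination, here is a minimal sketch (assuming the MASS engine, which stores the coefficient vector in lda_fit$fit$scaling) that reproduces the discriminant scores by hand. With the default proportional priors, predict() centers each predictor at its overall training mean before projecting:

# A sketch: recompute the LD1 scores manually from the stored coefficients.
predictors <- c("number_vmail_messages", "total_intl_minutes",
                "total_intl_calls", "total_intl_charge",
                "number_customer_service_calls")
X <- as.matrix(mlc_training[, predictors])
ld1_manual <- scale(X, center = TRUE, scale = FALSE) %*% lda_fit$fit$scaling
head(ld1_manual)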

  2. Inspect the model with summary() and tidy(). How good are the variables we have chosen?

We can use summary(lda_fit$fit$means) to summarize the group means of each predictor. Since there are only two groups, the Min. and Max. rows are simply the yes and no group means. The means differ between the classes, most visibly for number_vmail_messages and number_customer_service_calls, so the chosen variables do carry some signal.

summary(lda_fit$fit$means)
##  number_vmail_messages total_intl_minutes total_intl_calls total_intl_charge
##  Min.   :4.680         Min.   :10.17      Min.   :4.122    Min.   :2.747    
##  1st Qu.:5.531         1st Qu.:10.29      1st Qu.:4.213    1st Qu.:2.778    
##  Median :6.382         Median :10.40      Median :4.304    Median :2.808    
##  Mean   :6.382         Mean   :10.40      Mean   :4.304    Mean   :2.808    
##  3rd Qu.:7.234         3rd Qu.:10.51      3rd Qu.:4.396    3rd Qu.:2.839    
##  Max.   :8.085         Max.   :10.63      Max.   :4.487    Max.   :2.869    
##  number_customer_service_calls
##  Min.   :1.454                
##  1st Qu.:1.663                
##  Median :1.873                
##  Mean   :1.873                
##  3rd Qu.:2.082                
##  Max.   :2.292

tidy() does not work for this model:

tidy(lda_fit)
## Error: No tidy method for objects of class lda

  3. Predict values for the testing data set.

Look at the predicted probabilities of yes and no. We show only the first six rows to keep the output readable; remove head() to see all the results. We will keep using head() on the remaining outputs as well.

predict(lda_fit, new_data = mlc_testing, type = "prob") %>%
  head()
## # A tibble: 6 x 2
##   .pred_yes .pred_no
##       <dbl>    <dbl>
## 1    0.0582    0.942
## 2    0.269     0.731
## 3    0.0673    0.933
## 4    0.320     0.680
## 5    0.319     0.681
## 6    0.135     0.865

Using augment() is the simplest way to attach the predictions to an existing data frame.

augment(lda_fit, new_data = mlc_testing) -> ldaPredDF
head(ldaPredDF)
## # A tibble: 6 x 23
##   state account_length area_code     international_plan voice_mail_plan
##   <fct>          <int> <fct>         <fct>              <fct>          
## 1 RI                74 area_code_415 no                 no             
## 2 MT                95 area_code_510 no                 no             
## 3 FL               147 area_code_415 no                 no             
## 4 NE               174 area_code_415 no                 no             
## 5 MT                54 area_code_408 no                 no             
## 6 HI                49 area_code_510 no                 no             
## # … with 18 more variables: number_vmail_messages <int>,
## #   total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## #   total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## #   total_night_minutes <dbl>, total_night_calls <int>,
## #   total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## #   total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## #   .pred_class <fct>, .pred_yes <dbl>, .pred_no <dbl>

  4. Use conf_mat() to construct a confusion matrix. Does the confusion matrix look good?

With yes (churn) as the positive class (yardstick treats the first factor level as positive by default), the top-left cell holds the true positives (12), the top-right the false positives (5), the bottom-left the false negatives (129), and the bottom-right the true negatives (853). At first glance the matrix looks fine, but note that only 12 of the 141 actual churners in the test set are identified; the strong totals come mostly from the large no class.

ldaPredDF %>%
  conf_mat(estimate = .pred_class, truth = churn)
##           Truth
## Prediction yes  no
##        yes  12   5
##        no  129 853

The linear discriminant analysis model has about 87% accuracy for predicting customer churn.

ldaPredDF %>%
  accuracy(truth = churn, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.866
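
Given the class imbalance, accuracy alone can be misleading. As a quick follow-up (a sketch, not part of the original tasks), yardstick's sens() and spec() report the per-class performance, again treating the first factor level yes as the positive class:

ldaPredDF %>%
  sens(truth = churn, estimate = .pred_class)
ldaPredDF %>%
  spec(truth = churn, estimate = .pred_class)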

QDA (Quadratic Discriminant Analysis)

  1. Fit a classification model. Use number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls as predictors. Remember to fit the model only using the training data set.
# With frac_common_cov = 0 and frac_identity = 0, regularized discriminant
# analysis reduces to ordinary QDA.
qda_spec <- discrim_regularized(frac_common_cov = 0, frac_identity = 0) %>%
  set_mode("classification") %>%
  set_engine("klaR")

Fit the QDA model

qda_fit <- fit(qda_spec, formula = churn_formula, data = mlc_training)
qda_fit
## parsnip model object
## 
## Fit time:  141ms 
## Call: 
## rda(formula = churn ~ number_vmail_messages + total_intl_minutes + 
##     total_intl_calls + total_intl_charge + number_customer_service_calls, 
##     data = data, lambda = ~0, gamma = ~0)
## 
## Regularization parameters: 
##  gamma lambda 
##      0      0 
## 
## Prior probabilities of groups: 
##       yes        no 
## 0.1414646 0.8585354 
## 
## Misclassification rate: 
##        apparent: 13.647 %

The fit took 141 ms. The printed summary shows the regularization parameters, the prior probabilities of the groups, and the apparent (training) misclassification rate.

  2. Inspect the model with summary() and tidy(). How good are the variables we have chosen?

summary(qda_fit$fit)
##                Length Class  Mode     
## call            5     -none- call     
## regularization  2     -none- numeric  
## classes         2     -none- character
## prior           2     -none- numeric  
## error.rate      1     -none- numeric  
## varnames        5     -none- character
## means          10     -none- numeric  
## covariances    50     -none- numeric  
## covpooled      25     -none- numeric  
## converged       1     -none- logical  
## iter            1     -none- numeric  
## terms           3     terms  call     
## xlevels         0     -none- list

tidy() does not work for this model either:

tidy(qda_fit)
## Error: No tidy method for objects of class rda

To look more closely at how the two classes differ, we can divide the yes covariance matrix element-wise by the no covariance matrix. Entries far from 1 indicate predictor pairs whose covariance differs strongly between the classes, which is exactly the situation QDA is designed to handle:

qda_fit$fit$covariances[,, "yes"] / qda_fit$fit$covariances[,, "no"]
##                               number_vmail_messages total_intl_minutes
## number_vmail_messages                     0.7337243           1.883922
## total_intl_minutes                        1.8839221           1.088717
## total_intl_calls                         -3.6384360           3.541611
## total_intl_charge                         1.8850231           1.088725
## number_customer_service_calls           -96.2601351          49.440174
##                               total_intl_calls total_intl_charge
## number_vmail_messages                -3.638436          1.885023
## total_intl_minutes                    3.541611          1.088725
## total_intl_calls                      1.063298          3.533929
## total_intl_charge                     3.533929          1.088730
## number_customer_service_calls        -9.311266         48.668483
##                               number_customer_service_calls
## number_vmail_messages                            -96.260135
## total_intl_minutes                                49.440174
## total_intl_calls                                  -9.311266
## total_intl_charge                                 48.668483
## number_customer_service_calls                      2.398223

  3. Predict values for the testing data set.

predict(qda_fit, new_data = mlc_testing, type = "prob") %>%
  head()
## # A tibble: 6 x 2
##   .pred_yes .pred_no
##       <dbl>    <dbl>
## 1     0.101    0.899
## 2     0.232    0.768
## 3     0.118    0.882
## 4     0.262    0.738
## 5     0.274    0.726
## 6     0.129    0.871
augment(qda_fit, new_data = mlc_testing) -> qdaPredDF
qdaPredDF %>%
  head()
## # A tibble: 6 x 23
##   state account_length area_code     international_plan voice_mail_plan
##   <fct>          <int> <fct>         <fct>              <fct>          
## 1 RI                74 area_code_415 no                 no             
## 2 MT                95 area_code_510 no                 no             
## 3 FL               147 area_code_415 no                 no             
## 4 NE               174 area_code_415 no                 no             
## 5 MT                54 area_code_408 no                 no             
## 6 HI                49 area_code_510 no                 no             
## # … with 18 more variables: number_vmail_messages <int>,
## #   total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## #   total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## #   total_night_minutes <dbl>, total_night_calls <int>,
## #   total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## #   total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## #   .pred_class <fct>, .pred_yes <dbl>, .pred_no <dbl>

  4. Use conf_mat() to construct a confusion matrix. Does the confusion matrix look good?

As before, the top-left cell holds the true positives (27), the top-right the false positives (17), the bottom-left the false negatives (114), and the bottom-right the true negatives (841). QDA catches more than twice as many churners as LDA (27 versus 12), though most actual churners are still missed.

qdaPredDF %>%
  conf_mat(estimate = .pred_class, truth = churn)
##           Truth
## Prediction yes  no
##        yes  27  17
##        no  114 841

The quadratic discriminant analysis model has about 87% accuracy for predicting customer churn.

qdaPredDF %>%
  accuracy(truth = churn, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.869
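
Rather than computing each metric separately, summary() on a conf_mat object returns a whole panel of metrics (accuracy, sensitivity, specificity, and more) in one call:

qdaPredDF %>%
  conf_mat(truth = churn, estimate = .pred_class) %>%
  summary()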

KNN (k-Nearest Neighbors Algorithm)

  1. Fit a classification model. Use number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls as predictors. Remember to fit the model only using the training data set.
knn_spec <- nearest_neighbor(neighbors = 5) %>% # default k is 5
  set_mode("classification") %>%
  set_engine("kknn")

knn_fit <- fit(knn_spec, data = mlc_training, formula = churn_formula)
knn_fit
## parsnip model object
## 
## Fit time:  279ms 
## 
## Call:
## kknn::train.kknn(formula = churn ~ number_vmail_messages + total_intl_minutes +     total_intl_calls + total_intl_charge + number_customer_service_calls,     data = data, ks = min_rows(5, data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.1539615
## Best kernel: optimal
## Best k: 5

  2. Inspect the model with summary() and tidy(). How good are the variables we have chosen?

The misclassification rate is the proportion of incorrect classifications. The minimal misclassification reported for the training data is about 0.15, which suggests the chosen predictors carry useful signal.

summary(knn_fit$fit)
## 
## Call:
## kknn::train.kknn(formula = churn ~ number_vmail_messages + total_intl_minutes +     total_intl_calls + total_intl_charge + number_customer_service_calls,     data = data, ks = min_rows(5, data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.1539615
## Best kernel: optimal
## Best k: 5

tidy() does not work for this model either:

tidy(knn_fit)
## Error: No tidy method for objects of class train.kknn
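
Here k was fixed at 5. To check whether another value works better, one option is to mark neighbors for tuning and resample; a sketch using the tune and rsample packages that ship with tidymodels (an extension, not part of the original exercise):

# A sketch: tune the number of neighbors with 5-fold cross-validation.
knn_tune_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")

set.seed(1234)
mlc_folds <- vfold_cv(mlc_training, v = 5, strata = churn)

knn_tuned <- tune_grid(
  knn_tune_spec,
  churn_formula,
  resamples = mlc_folds,
  grid = tibble(neighbors = c(1, 5, 11, 21))
)
show_best(knn_tuned, metric = "accuracy")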

  3. Predict values for the testing data set.

augment(knn_fit, new_data = mlc_testing) -> knnPredDF
knnPredDF %>%
  head()
## # A tibble: 6 x 23
##   state account_length area_code     international_plan voice_mail_plan
##   <fct>          <int> <fct>         <fct>              <fct>          
## 1 RI                74 area_code_415 no                 no             
## 2 MT                95 area_code_510 no                 no             
## 3 FL               147 area_code_415 no                 no             
## 4 NE               174 area_code_415 no                 no             
## 5 MT                54 area_code_408 no                 no             
## 6 HI                49 area_code_510 no                 no             
## # … with 18 more variables: number_vmail_messages <int>,
## #   total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## #   total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## #   total_night_minutes <dbl>, total_night_calls <int>,
## #   total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## #   total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## #   .pred_class <fct>, .pred_yes <dbl>, .pred_no <dbl>

  4. Use conf_mat() to construct a confusion matrix. Does the confusion matrix look good?

As before, the top-left cell holds the true positives (23), the top-right the false positives (49), the bottom-left the false negatives (118), and the bottom-right the true negatives (809). Here the picture is weaker: the model flags more non-churners as churners (49) than actual churners it catches (23), so this confusion matrix is less convincing than the LDA and QDA ones.

knnPredDF %>%
  conf_mat(estimate = .pred_class, truth = churn)
##           Truth
## Prediction yes  no
##        yes  23  49
##        no  118 809

The KNN model with 5 neighbors has about 83% accuracy for predicting customer churn.

knnPredDF %>%
  accuracy(truth = churn, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.833
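
To wrap up (a sketch that goes slightly beyond the original tasks), we can compare the three fitted models on the same test-set metrics with a yardstick metric_set():

# A sketch: accuracy, sensitivity, and specificity for all three models.
churn_metrics <- metric_set(accuracy, sens, spec)
list(lda = ldaPredDF, qda = qdaPredDF, knn = knnPredDF) %>%
  map_dfr(~ churn_metrics(.x, truth = churn, estimate = .pred_class),
          .id = "model")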
