We will be using the add-on package discrim to access functions for discriminant analysis models with parsnip, and the klaR package to perform the QDA calculations. If you haven't already installed them, run:
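install.packages(c("discrim", "klaR")) # both are on CRAN; install once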
Create a train-test rsplit object of mlc_churn using initial_split(). Use the arguments to set the proportion of the training data to 80%, and stratify the sampling according to the churn variable.
Do the following tasks for the LDA, QDA, and KNN models.
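Start by loading the packages; the startup messages below correspond to these calls:
library(tidyverse)
library(tidymodels)
library(discrim)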
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom 0.7.6 ✓ recipes 0.1.14
## ✓ dials 0.0.9 ✓ rsample 0.0.9
## ✓ infer 0.5.3 ✓ tune 0.1.1
## ✓ modeldata 0.1.0 ✓ workflows 0.2.1
## ✓ parsnip 0.1.5 ✓ yardstick 0.0.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
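Take a look at the mlc_churn data, which ships with the modeldata package that tidymodels attaches:
data("mlc_churn")
head(mlc_churn)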
## # A tibble: 6 x 20
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 KS 128 area_code_415 no yes
## 2 OH 107 area_code_415 no yes
## 3 NJ 137 area_code_415 no no
## 4 OH 84 area_code_408 yes no
## 5 OK 75 area_code_415 yes no
## 6 AL 118 area_code_510 yes no
## # … with 15 more variables: number_vmail_messages <int>,
## # total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## # total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## # total_night_minutes <dbl>, total_night_calls <int>,
## # total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## # total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>
Splitting the data into training and testing sets.
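A sketch of the split that matches the <4001/999/5000> counts below; the mlc_training and mlc_testing names are reused in the modeling steps, and a random seed was presumably set beforehand for reproducibility:
mlc_split <- initial_split(mlc_churn, prop = 0.8, strata = churn)
mlc_training <- training(mlc_split)
mlc_testing <- testing(mlc_split)
mlc_split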
## <Analysis/Assess/Total>
## <4001/999/5000>
Fit an LDA model with churn as the response and number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, and number_customer_service_calls as predictors. Remember to fit the model only using the training data set.
lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS") # "MASS" is the default engine; "mda" is the alternative
Fit the LDA model:
## parsnip model object
##
## Fit time: 13ms
## Call:
## lda(churn ~ number_vmail_messages + total_intl_minutes + total_intl_calls +
## total_intl_charge + number_customer_service_calls, data = data)
##
## Prior probabilities of groups:
## yes no
## 0.1414646 0.8585354
##
## Group means:
## number_vmail_messages total_intl_minutes total_intl_calls total_intl_charge
## yes 4.680212 10.62597 4.121908 2.869470
## no 8.084716 10.17383 4.487045 2.747441
## number_customer_service_calls
## yes 2.291519
## no 1.453857
##
## Coefficients of linear discriminants:
## LD1
## number_vmail_messages 0.02711856
## total_intl_minutes 1.00746332
## total_intl_calls 0.08589699
## total_intl_charge -4.07968773
## number_customer_service_calls -0.69807223
The summary shows that the training prior probabilities of customer churn are 0.14 for yes and 0.86 for no. The group means give the mean of each predictor within each class, and the coefficients of linear discriminants give the linear combination of predictors that forms the LDA decision rule: LD1 = 0.02711856*number_vmail_messages + 1.00746332*total_intl_minutes + 0.08589699*total_intl_calls - 4.07968773*total_intl_charge - 0.69807223*number_customer_service_calls. Because there are only two response classes in this model, there is only one set of coefficients (LD1).
Inspect the model with summary() and tidy(). How good are the variables we have chosen? We can use summary(lda_fit$fit$means) to summarize the group means of each predictor; the outcome looks reasonable.
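summary(lda_fit$fit$means)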
## number_vmail_messages total_intl_minutes total_intl_calls total_intl_charge
## Min. :4.680 Min. :10.17 Min. :4.122 Min. :2.747
## 1st Qu.:5.531 1st Qu.:10.29 1st Qu.:4.213 1st Qu.:2.778
## Median :6.382 Median :10.40 Median :4.304 Median :2.808
## Mean :6.382 Mean :10.40 Mean :4.304 Mean :2.808
## 3rd Qu.:7.234 3rd Qu.:10.51 3rd Qu.:4.396 3rd Qu.:2.839
## Max. :8.085 Max. :10.63 Max. :4.487 Max. :2.869
## number_customer_service_calls
## Min. :1.454
## 1st Qu.:1.663
## Median :1.873
## Mean :1.873
## 3rd Qu.:2.082
## Max. :2.292
tidy() does not work:
## Error: No tidy method for objects of class lda
See the predicted probabilities of yes and no. We only show the first six rows to keep the output clear, but you can remove head() to see all of the results. Subsequent outputs will also use head() to keep them compact.
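A sketch of the call, assuming the lda_fit and mlc_testing objects defined above:
predict(lda_fit, new_data = mlc_testing, type = "prob") %>%
  head()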
## # A tibble: 6 x 2
## .pred_yes .pred_no
## <dbl> <dbl>
## 1 0.0582 0.942
## 2 0.269 0.731
## 3 0.0673 0.933
## 4 0.320 0.680
## 5 0.319 0.681
## 6 0.135 0.865
Using augment() is the simplest way to add the predictions to an existing data frame.
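augment(lda_fit, new_data = mlc_testing) %>%
  head()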
## # A tibble: 6 x 23
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 RI 74 area_code_415 no no
## 2 MT 95 area_code_510 no no
## 3 FL 147 area_code_415 no no
## 4 NE 174 area_code_415 no no
## 5 MT 54 area_code_408 no no
## 6 HI 49 area_code_510 no no
## # … with 18 more variables: number_vmail_messages <int>,
## # total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## # total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## # total_night_minutes <dbl>, total_night_calls <int>,
## # total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## # total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## # .pred_class <fct>, .pred_yes <dbl>, .pred_no <dbl>
Use conf_mat() to construct a confusion matrix. Does the confusion matrix look good? With yes as the event level, the upper-left cell holds the true positives and the lower-right cell the true negatives. The true positives (12) outnumber the false positives (5), and the true negatives (853) outnumber the false negatives (129), so overall the confusion matrix looks reasonable, although most of the actual churners (129 of 141) are missed.
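For example, using the augmented test data:
augment(lda_fit, new_data = mlc_testing) %>%
  conf_mat(truth = churn, estimate = .pred_class)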
## Truth
## Prediction yes no
## yes 12 5
## no 129 853
The linear discriminant analysis model has roughly 87% accuracy for predicting customer churn.
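Computed with yardstick's accuracy():
augment(lda_fit, new_data = mlc_testing) %>%
  accuracy(truth = churn, estimate = .pred_class)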
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.866
Fit a QDA model with churn as the response and number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, and number_customer_service_calls as predictors. Remember to fit the model only using the training data set. The klaR engine fits QDA as a regularized discriminant analysis with both regularization parameters set to zero.
qda_spec <- discrim_regularized(frac_common_cov = 0, frac_identity = 0) %>%
  set_mode("classification") %>%
  set_engine("klaR")
Fit the QDA model:
## parsnip model object
##
## Fit time: 141ms
## Call:
## rda(formula = churn ~ number_vmail_messages + total_intl_minutes +
## total_intl_calls + total_intl_charge + number_customer_service_calls,
## data = data, lambda = ~0, gamma = ~0)
##
## Regularization parameters:
## gamma lambda
## 0 0
##
## Prior probabilities of groups:
## yes no
## 0.1414646 0.8585354
##
## Misclassification rate:
## apparent: 13.647 %
The fit time was 141ms. We can find the regularization parameters, the prior probabilities of the groups, and the apparent misclassification rate in this output.
Inspect the model with summary() and tidy(). How good are the variables we have chosen? Calling summary() on the underlying rda object (qda_fit$fit) only lists its components:
## Length Class Mode
## call 5 -none- call
## regularization 2 -none- numeric
## classes 2 -none- character
## prior 2 -none- numeric
## error.rate 1 -none- numeric
## varnames 5 -none- character
## means 10 -none- numeric
## covariances 50 -none- numeric
## covpooled 25 -none- numeric
## converged 1 -none- logical
## iter 1 -none- numeric
## terms 3 terms call
## xlevels 0 -none- list
tidy() does not work either:
## Error: No tidy method for objects of class rda
To explore the relationships between these predictors further, we can divide the yes group covariances by the no group covariances.
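A sketch of one way to compute this ratio, assuming klaR stores the group covariance matrices in a three-dimensional array whose third dimension is indexed by class:
qda_fit$fit$covariances[, , "yes"] /
  qda_fit$fit$covariances[, , "no"] # element-wise ratio of the two 5 x 5 matrices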
## number_vmail_messages total_intl_minutes
## number_vmail_messages 0.7337243 1.883922
## total_intl_minutes 1.8839221 1.088717
## total_intl_calls -3.6384360 3.541611
## total_intl_charge 1.8850231 1.088725
## number_customer_service_calls -96.2601351 49.440174
## total_intl_calls total_intl_charge
## number_vmail_messages -3.638436 1.885023
## total_intl_minutes 3.541611 1.088725
## total_intl_calls 1.063298 3.533929
## total_intl_charge 3.533929 1.088730
## number_customer_service_calls -9.311266 48.668483
## number_customer_service_calls
## number_vmail_messages -96.260135
## total_intl_minutes 49.440174
## total_intl_calls -9.311266
## total_intl_charge 48.668483
## number_customer_service_calls 2.398223
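As with LDA, we can look at the predicted class probabilities on the test set:
predict(qda_fit, new_data = mlc_testing, type = "prob") %>%
  head()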
## # A tibble: 6 x 2
## .pred_yes .pred_no
## <dbl> <dbl>
## 1 0.101 0.899
## 2 0.232 0.768
## 3 0.118 0.882
## 4 0.262 0.738
## 5 0.274 0.726
## 6 0.129 0.871
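And augment() again adds the predictions to the test data:
augment(qda_fit, new_data = mlc_testing) %>%
  head()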
## # A tibble: 6 x 23
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 RI 74 area_code_415 no no
## 2 MT 95 area_code_510 no no
## 3 FL 147 area_code_415 no no
## 4 NE 174 area_code_415 no no
## 5 MT 54 area_code_408 no no
## 6 HI 49 area_code_510 no no
## # … with 18 more variables: number_vmail_messages <int>,
## # total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## # total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## # total_night_minutes <dbl>, total_night_calls <int>,
## # total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## # total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## # .pred_class <fct>, .pred_yes <dbl>, .pred_no <dbl>
Use conf_mat() to construct a confusion matrix. Does the confusion matrix look good? The true positives (27) outnumber the false positives (17), and the true negatives (841) outnumber the false negatives (114), so the overall picture again looks reasonable, though most of the actual churners are still missed.
## Truth
## Prediction yes no
## yes 27 17
## no 114 841
The quadratic discriminant analysis model has roughly 87% accuracy for predicting customer churn.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.869
Fit a KNN model with churn as the response and number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, and number_customer_service_calls as predictors. Remember to fit the model only using the training data set.
knn_spec <- nearest_neighbor(neighbors = 5) %>% # the default k is 5
  set_mode("classification") %>%
  set_engine("kknn")
knn_fit <- fit(knn_spec, data = mlc_training, formula = churn_formula)
knn_fit
## parsnip model object
##
## Fit time: 279ms
##
## Call:
## kknn::train.kknn(formula = churn ~ number_vmail_messages + total_intl_minutes + total_intl_calls + total_intl_charge + number_customer_service_calls, data = data, ks = min_rows(5, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.1539615
## Best kernel: optimal
## Best k: 5
Inspect the model with summary() and tidy(). How good are the variables we have chosen? The misclassification rate is the percentage of incorrect classifications; the minimal misclassification of 0.15 suggests the predictor variables are reasonable.
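The output below matches what printing the underlying train.kknn object produces:
knn_fit$fit # repeats the training summary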
##
## Call:
## kknn::train.kknn(formula = churn ~ number_vmail_messages + total_intl_minutes + total_intl_calls + total_intl_charge + number_customer_service_calls, data = data, ks = min_rows(5, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.1539615
## Best kernel: optimal
## Best k: 5
tidy() does not work here either:
## Error: No tidy method for objects of class train.kknn
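Adding the KNN predictions to the test data with augment():
augment(knn_fit, new_data = mlc_testing) %>%
  head()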
## # A tibble: 6 x 23
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 RI 74 area_code_415 no no
## 2 MT 95 area_code_510 no no
## 3 FL 147 area_code_415 no no
## 4 NE 174 area_code_415 no no
## 5 MT 54 area_code_408 no no
## 6 HI 49 area_code_510 no no
## # … with 18 more variables: number_vmail_messages <int>,
## # total_day_minutes <dbl>, total_day_calls <int>, total_day_charge <dbl>,
## # total_eve_minutes <dbl>, total_eve_calls <int>, total_eve_charge <dbl>,
## # total_night_minutes <dbl>, total_night_calls <int>,
## # total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## # total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## # .pred_class <fct>, .pred_yes <dbl>, .pred_no <dbl>
Use conf_mat() to construct a confusion matrix. Does the confusion matrix look good? Here the true positives (23) are fewer than the false positives (49), while the true negatives (809) outnumber the false negatives (118). The overall accuracy still looks acceptable, but the yes predictions are unreliable, so the model is only arguably good.
## Truth
## Prediction yes no
## yes 23 49
## no 118 809
The KNN model with 5 neighbors has roughly 83% accuracy for predicting customer churn.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.833