This set of labs introduces logistic regression. It is also your first introduction to the rsample package, which we will use to perform a train-test split.

Exercise 1

In this exercise we will explore the mlc_churn data set, which ships with the modeldata package loaded by tidymodels.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
data("mlc_churn")

# read the documentation for the customer churn data
help("mlc_churn")

The data set contains an outcome variable called churn, which we will try to predict:

mlc_churn %>%
  count(churn) %>%
  ggplot(aes(churn, n)) +
  geom_col()
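
Raw counts already show that churners are a clear minority. A quick sketch to quantify the imbalance (this step is not part of the exercise):

# class proportions; this imbalance matters later when judging accuracy
mlc_churn %>%
  count(churn) %>%
  mutate(prop = n / sum(n))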

  1. Create a test-train rsplit object of mlc_churn using initial_split(). Use the arguments to set the proportion of the training data to 80%. Stratify the sampling according to the churn variable. How many observations are in the testing and training sets?
set.seed(1) # set a seed so the split is reproducible
mlc_split <- initial_split(mlc_churn, prop = 0.8, strata = churn)
mlc_split
## <Analysis/Assess/Total>
## <4001/999/5000>
  2. Create the training and testing data sets with training() and testing() respectively. Do the observation counts match what you found in the last question?
mlc_training <- training(mlc_split)
nrow(mlc_training)
## [1] 4001
mlc_testing <- testing(mlc_split)
nrow(mlc_testing)
## [1] 999
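Yes: 4001 training rows and 999 testing rows, matching the split object (stratified sampling rounds within each churn stratum, which is likely why we get 4001 rather than exactly 4000). As a quick check, not part of the exercise, the stratified split should keep the churn rate similar in both sets:

# churn proportions in the training and testing sets
bind_rows(
  training = count(mlc_training, churn),
  testing  = count(mlc_testing, churn),
  .id = "set"
) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n))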
  3. Fit a logistic regression model using logistic_reg(). Use number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls as predictors. Remember to fit the model using only the training data set.
mlc_formula <- churn ~ number_vmail_messages + total_intl_minutes + total_intl_calls + total_intl_charge + number_customer_service_calls

# specify a logistic regression model with the glm engine
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_fit <- lr_spec %>%
  fit(mlc_formula, data = mlc_training)
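
For context, the "glm" engine here wraps stats::glm() (visible in the Call: line of the summary() output below), so an equivalent base-R fit would be:

# equivalent base-R fit with a binomial (logistic) family
glm_fit <- glm(mlc_formula, data = mlc_training, family = binomial)
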
  4. Inspect the model with summary() and tidy(). How good are the variables we have chosen?

Only 3 of the 5 predictors are significantly associated with the outcome: number_vmail_messages, total_intl_calls, and number_customer_service_calls all have small p-values, while total_intl_minutes and total_intl_charge do not. The very large standard errors on those last two suggest they are nearly collinear (the charge is essentially a fixed rate times the minutes), which would explain why neither appears significant. Each coefficient is the change in the log odds of the outcome per one-unit increase in that predictor. Note that because churn has the levels yes/no and glm() models the probability of the second factor level, these are log odds of not churning: for example, each additional customer service call lowers the log odds of staying by about 0.43, making churn more likely.

# using summary()
lr_fit %>%
  pluck("fit") %>% # unest the list
  summary()
## 
## Call:
## stats::glm(formula = churn ~ number_vmail_messages + total_intl_minutes + 
##     total_intl_calls + total_intl_charge + number_customer_service_calls, 
##     family = stats::binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7805   0.3518   0.4614   0.5777   1.4428  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    2.875577   0.221515  12.981  < 2e-16 ***
## number_vmail_messages          0.023680   0.004096   5.782  7.4e-09 ***
## total_intl_minutes            -1.347891   4.454241  -0.303 0.762188    
## total_intl_calls               0.069261   0.021027   3.294 0.000988 ***
## total_intl_charge              4.730498  16.497912   0.287 0.774317    
## number_customer_service_calls -0.434863   0.032924 -13.208  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3261.7  on 4000  degrees of freedom
## Residual deviance: 3020.6  on 3995  degrees of freedom
## AIC: 3032.6
## 
## Number of Fisher Scoring iterations: 5

Note:

  1. Null deviance: how well the response is predicted by a model that includes only the intercept.
  2. Residual deviance: how well the response is predicted by the model that includes all of the predictors.
  3. AIC: a measure for comparing the quality of related models fitted to the same data; smaller values are better.
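
One way to use the two deviances is an overall model test: the drop in deviance is approximately chi-squared with degrees of freedom equal to the number of predictors. A sketch, not part of the exercise:

# p-value for the whole model: drop in deviance vs. chi-squared
lr_fit %>%
  pluck("fit") %>%
  with(pchisq(null.deviance - deviance,
              df = df.null - df.residual,
              lower.tail = FALSE))
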
# using tidy()
tidy(lr_fit)
## # A tibble: 6 x 5
##   term                          estimate std.error statistic  p.value
##   <chr>                            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                     2.88     0.222      13.0   1.56e-38
## 2 number_vmail_messages           0.0237   0.00410     5.78  7.40e- 9
## 3 total_intl_minutes             -1.35     4.45       -0.303 7.62e- 1
## 4 total_intl_calls                0.0693   0.0210      3.29  9.88e- 4
## 5 total_intl_charge               4.73    16.5         0.287 7.74e- 1
## 6 number_customer_service_calls  -0.435    0.0329    -13.2   7.86e-40
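
Coefficients can be easier to read as odds ratios. tidy() can exponentiate them; keep in mind these are multiplicative changes in the odds of the second level, "no":

# exponentiated coefficients as odds ratios
tidy(lr_fit, exponentiate = TRUE)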


  5. Predict values for the testing data set. Use the type argument to also get probability predictions.
  • The simplest predict() call returns hard class predictions for the new data. The resulting .pred_class column is a factor.
predict(lr_fit, new_data = mlc_testing) %>% # class predictions on the test set
  head()
## # A tibble: 6 x 1
##   .pred_class
##   <fct>      
## 1 no         
## 2 no         
## 3 no         
## 4 no         
## 5 no         
## 6 no
  • We can also get probability predictions by using type = "prob" in predict().
predict(lr_fit, new_data = mlc_testing, type = "prob") %>%
  head()
## # A tibble: 6 x 2
##   .pred_yes .pred_no
##       <dbl>    <dbl>
## 1    0.260     0.740
## 2    0.0831    0.917
## 3    0.0541    0.946
## 4    0.0836    0.916
## 5    0.0927    0.907
## 6    0.149     0.851
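
The hard class predictions above are simply these probabilities cut at 0.5. A sketch that reproduces .pred_class by hand (manual_class is a made-up name):

# .pred_class corresponds to thresholding .pred_yes at 0.5
predict(lr_fit, new_data = mlc_testing, type = "prob") %>%
  mutate(manual_class = ifelse(.pred_yes > 0.5, "yes", "no")) %>%
  head()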
  • augment() adds .pred_class and the yes/no class probabilities to the existing data frame as new columns.
preds <- augment(lr_fit, new_data = mlc_testing)
preds %>%
  names()
##  [1] "state"                         "account_length"               
##  [3] "area_code"                     "international_plan"           
##  [5] "voice_mail_plan"               "number_vmail_messages"        
##  [7] "total_day_minutes"             "total_day_calls"              
##  [9] "total_day_charge"              "total_eve_minutes"            
## [11] "total_eve_calls"               "total_eve_charge"             
## [13] "total_night_minutes"           "total_night_calls"            
## [15] "total_night_charge"            "total_intl_minutes"           
## [17] "total_intl_calls"              "total_intl_charge"            
## [19] "number_customer_service_calls" "churn"                        
## [21] ".pred_class"                   ".pred_yes"                    
## [23] ".pred_no"
  6. Use conf_mat() to construct a confusion matrix (error matrix). Does the confusion matrix look good?
  • With "yes" as the event of interest, the top-left cell (predicted yes, truly yes) counts true positives and the bottom-right cell (predicted no, truly no) counts true negatives. The model gets 856 of the 858 non-churners right, but identifies only 10 of the 141 actual churners, leaving 131 false negatives. So although most predictions are correct overall, the matrix looks much less good for the minority churn class we presumably care about.
confusionMatrix <- preds %>%
  conf_mat(truth = churn, estimate = .pred_class)
confusionMatrix
##           Truth
## Prediction yes  no
##        yes  10   2
##        no  131 856
# we can also plot the confusion matrix
confusionMatrix %>%
  autoplot(type = "heatmap")
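
summary() on a conf_mat object derives a whole battery of metrics from the table at once; a sketch:

# accuracy, sensitivity, specificity, and more from the confusion matrix
confusionMatrix %>%
  summary()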

  • accuracy() reports the proportion of correct predictions. The logistic regression model is about 87% accurate at predicting customer churn on the test set. Be careful, though: with this class imbalance, always predicting "no" would already be about 86% accurate (858 of the 999 test customers did not churn), so the high accuracy mostly reflects the majority class.
preds %>%
  accuracy(truth = churn, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.867
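
A sketch of the majority-class baseline mentioned above (.pred_null is a made-up column name):

# baseline: always predict the majority class "no"
preds %>%
  mutate(.pred_null = factor("no", levels = levels(churn))) %>%
  accuracy(truth = churn, estimate = .pred_null)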
Accuracy alone hides the fact that most churners are missed. One remedy is to move the classification threshold: for example, predict "no" only when .pred_no exceeds 0.4, which flags more customers as likely churners. Note that the manual factor needs the same levels as churn for conf_mat() to accept it:

preds %>%
  mutate(.pred_class_40 = factor(ifelse(.pred_no > 0.4, "no", "yes"),
                                 levels = levels(churn))) %>%
  conf_mat(estimate = .pred_class_40, truth = churn) %>%
  autoplot(type = "heatmap")

Note: in conf_mat() calls like those above, truth names the column of true response values and estimate names the column of predicted classes.
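
Finally, a threshold-free check that does not depend on any single cutoff, using the predicted probability of the "yes" event (a sketch, not part of the exercise):

# ROC AUC; "yes" is the first level of churn, so .pred_yes is the
# event probability that yardstick expects by default
preds %>%
  roc_auc(truth = churn, .pred_yes)

preds %>%
  roc_curve(truth = churn, .pred_yes) %>%
  autoplot()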