This lab will introduce you to logistic regression. It is also your first introduction to the rsample package, which we will use to perform a train-test split.
In this exercise we will explore the mlc_churn data set, which ships with the modeldata package loaded as part of tidymodels.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
data("mlc_churn")
# read about the customer churn data before modeling
help("mlc_churn")
The data set contains a variable called churn, which records whether a customer left ("yes") or stayed ("no"). In the bar chart below, we can see that far more customers stayed than churned, so the classes are quite imbalanced.
mlc_churn %>%
count(churn) %>%
ggplot(aes(churn, n)) +
geom_col()
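To put numbers on that imbalance, we can also compute the class proportions directly (a quick sketch):
mlc_churn %>%
  count(churn) %>%
  mutate(prop = n / sum(n)) # share of customers in each churn class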
Create an rsplit object of mlc_churn using initial_split(). Use the arguments to set the proportion of the training data to 80%, and stratify the sampling according to the churn variable. How many observations are in the training and testing sets?
set.seed(1) # make the random split reproducible
mlc_split <- initial_split(mlc_churn, prop = 0.8, strata = churn)
mlc_split
## <Analysis/Assess/Total>
## <4001/999/5000>
Extract the training and testing data sets with training() and testing(), respectively. Do the observation counts match what you found in the last question? Using nrow(), we can see that the training and testing sets have 4001 and 999 observations, respectively.
mlc_training <- training(mlc_split)
nrow(mlc_training)
## [1] 4001
mlc_testing <- testing(mlc_split)
nrow(mlc_testing)
## [1] 999
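Because we stratified on churn, the class proportions should be nearly identical in the two sets. A quick check (a minimal sketch):
# churn proportions in the training set
mlc_training %>%
  count(churn) %>%
  mutate(prop = n / sum(n))
# churn proportions in the testing set; these should closely match the above
mlc_testing %>%
  count(churn) %>%
  mutate(prop = n / sum(n))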
Fit a logistic regression model using logistic_reg(). Use number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, and number_customer_service_calls as predictors. Remember to fit the model using only the training data set.
mlc_formula <- churn ~ number_vmail_messages + total_intl_minutes + total_intl_calls + total_intl_charge + number_customer_service_calls
# specify the model: logistic regression with the glm engine
lr_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
lr_fit <- lr_spec %>%
  fit(mlc_formula, data = mlc_training)
Inspect the fit with summary() and tidy(). How good are the variables we have chosen? Only 3 of the 5 predictors are significantly associated with the outcome: number_vmail_messages, total_intl_calls, and number_customer_service_calls all have small p-values. Each coefficient represents the change in the log odds of the outcome for a one-unit increase in that predictor, holding the others fixed. The very large standard errors on total_intl_minutes and total_intl_charge suggest those two predictors are nearly collinear (the charge is presumably computed from the minutes); we check this just after the summary output.
# using summary()
lr_fit %>%
pluck("fit") %>% # unest the list
##
## Call:
## stats::glm(formula = churn ~ number_vmail_messages + total_intl_minutes +
## total_intl_calls + total_intl_charge + number_customer_service_calls,
## family = stats::binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7805 0.3518 0.4614 0.5777 1.4428
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.875577 0.221515 12.981 < 2e-16 ***
## number_vmail_messages 0.023680 0.004096 5.782 7.4e-09 ***
## total_intl_minutes -1.347891 4.454241 -0.303 0.762188
## total_intl_calls 0.069261 0.021027 3.294 0.000988 ***
## total_intl_charge 4.730498 16.497912 0.287 0.774317
## number_customer_service_calls -0.434863 0.032924 -13.208 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3261.7 on 4000 degrees of freedom
## Residual deviance: 3020.6 on 3995 degrees of freedom
## AIC: 3032.6
##
## Number of Fisher Scoring iterations: 5
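As suspected above, total_intl_charge appears to be a deterministic function of total_intl_minutes. A quick correlation check (a minimal sketch; a value near 1 confirms the collinearity):
# correlation between international minutes and the charge billed for them
mlc_training %>%
  summarise(cor_minutes_charge = cor(total_intl_minutes, total_intl_charge))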
Note: tidy() returns the same coefficient table as a tibble, which is easier to filter and manipulate than the printed summary.
# using tidy()
tidy(lr_fit)
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.88 0.222 13.0 1.56e-38
## 2 number_vmail_messages 0.0237 0.00410 5.78 7.40e- 9
## 3 total_intl_minutes -1.35 4.45 -0.303 7.62e- 1
## 4 total_intl_calls 0.0693 0.0210 3.29 9.88e- 4
## 5 total_intl_charge 4.73 16.5 0.287 7.74e- 1
## 6 number_customer_service_calls -0.435 0.0329 -13.2 7.86e-40
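Log-odds coefficients are hard to interpret directly; exponentiating them yields odds ratios. A minimal sketch built on the tidy() output:
# convert each log-odds estimate into an odds ratio
tidy(lr_fit) %>%
  mutate(odds_ratio = exp(estimate)) %>%
  select(term, odds_ratio, p.value)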
Predict values of the testing data set using predict(). Note that the resulting .pred_class column is a factor.
predict(lr_fit, new_data = mlc_testing) %>% # hard class predictions on the testing set
head()
## # A tibble: 6 x 1
## .pred_class
## <fct>
## 1 no
## 2 no
## 3 no
## 4 no
## 5 no
## 6 no
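We can confirm that the predictions come back as a factor with the same levels as churn (a quick check):
# the factor levels of the predicted classes
predict(lr_fit, new_data = mlc_testing) %>%
  pull(.pred_class) %>%
  levels()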
type = "prob" in predict().predict(lr_fit, new_data = mlc_testing, type = "prob") %>%
head()
## # A tibble: 6 x 2
## .pred_yes .pred_no
## <dbl> <dbl>
## 1 0.260 0.740
## 2 0.0831 0.917
## 3 0.0541 0.946
## 4 0.0836 0.916
## 5 0.0927 0.907
## 6 0.149 0.851
augment() adds .pred_class and the yes/no class probabilities to the existing data frame as new columns, all in one step.
preds <- augment(lr_fit, new_data = mlc_testing)
preds %>%
names()
## [1] "state" "account_length"
## [3] "area_code" "international_plan"
## [5] "voice_mail_plan" "number_vmail_messages"
## [7] "total_day_minutes" "total_day_calls"
## [9] "total_day_charge" "total_eve_minutes"
## [11] "total_eve_calls" "total_eve_charge"
## [13] "total_night_minutes" "total_night_calls"
## [15] "total_night_charge" "total_intl_minutes"
## [17] "total_intl_calls" "total_intl_charge"
## [19] "number_customer_service_calls" "churn"
## [21] ".pred_class" ".pred_yes"
## [23] ".pred_no"
Use conf_mat() to construct a confusion matrix (error matrix). Does the confusion matrix look good? Not really: the model predicts "no" for almost every customer, and it correctly identifies only 10 of the 141 customers who actually churned.
confusionMatrix <- preds %>%
  conf_mat(truth = churn, estimate = .pred_class)
confusionMatrix
## Truth
## Prediction yes no
## yes 10 2
## no 131 856
# we can also plot the confusion matrix
confusionMatrix %>%
autoplot(type = "heatmap")
accuracy() reports the proportion of predictions the model got right. The logistic regression model is 86.7% accurate on the testing set, but given the class imbalance this is less impressive than it sounds: always predicting "no" would already be about 86% accurate, so accuracy alone should not persuade us that this is a good model.
preds %>%
accuracy(truth = churn, estimate = .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.867
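To make that comparison concrete, here is the accuracy of a majority-class baseline that always predicts "no" (a minimal sketch):
# baseline: predict the majority class for every customer
preds %>%
  mutate(.pred_majority = factor("no", levels = levels(churn))) %>%
  accuracy(truth = churn, estimate = .pred_majority)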
# Optional: instead of the default 0.5 cutoff, classify a customer as "no"
# only when .pred_no > 0.4, then rebuild the confusion matrix.
# preds %>%
#   mutate(.pred_class_40 = ifelse(.pred_no > 0.4, 'no', 'yes')) %>%
#   mutate(.pred_class_40 = as.factor(.pred_class_40)) %>%
#   conf_mat(estimate = .pred_class_40, truth = churn) %>%
#   autoplot(type = "heatmap")
Note: in conf_mat(), truth is the name of the true response variable and estimate is the name of the column holding the predicted classes.
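Since augment() gave us class probabilities, we can also evaluate the model across all possible thresholds at once. A minimal sketch using yardstick's ROC helpers (assuming "yes", the first factor level, is treated as the event, which is yardstick's default):
# area under the ROC curve, a threshold-free summary of performance
preds %>%
  roc_auc(truth = churn, .pred_yes)
# the full ROC curve, plotted across all probability cutoffs
preds %>%
  roc_curve(truth = churn, .pred_yes) %>%
  autoplot()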