Load Packages and Dataset
Packages and data used:
tidymodels
telecom_df: a dataset containing information on customers of a telecommunications company. The outcome variable is canceled_service, and it records whether a customer canceled their contract with the company. The predictor variables contain information about customers' cell phone and internet usage as well as their contract type and monthly charges.
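A minimal setup sketch, assuming telecom_df has been saved locally (the file name below is hypothetical; adjust it to wherever the data actually lives):
library(tidymodels)                      # loads rsample, parsnip, yardstick, dplyr, ggplot2, ...
telecom_df <- readRDS("telecom_df.rds")  # hypothetical local copy of the dataset
glimpse(telecom_df)                      # inspect canceled_service and the predictor columns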
Practice
Objective: we will predict whether a customer cancels the service, using the predictors available in the dataset and the labelled outcome canceled_service.
First - data resampling
This step is the same as in the previous tidymodels session on regression, so we do not provide much description alongside the code below.
Note that here we want a 75%/25% split for training/testing data, stratified by the outcome.
telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)
telecom_training <- telecom_split %>%
training()
telecom_test <- telecom_split %>%
testing()
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244
Second - model fitting with training dataset
- create a logistic model
logistic_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
logistic_model
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
- fit to the training data. Printing a model fit object will display the estimated model coefficients.
logistic_fit <- logistic_model %>%
fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
data = telecom_training)
logistic_fit
## parsnip model object
##
## Fit time: 9ms
##
## Call: stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins +
## monthly_charges, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) avg_call_mins avg_intl_mins monthly_charges
## 2.089671 -0.009769 0.021364 -0.001908
##
## Degrees of Freedom: 730 Total (i.e. Null); 727 Residual
## Null Deviance: 932.4
## Residual Deviance: 817.1 AIC: 825.1
Third - combining test data results
A couple of points to note:
- unlike base R's predict(), here we use the new_data argument; also, for the type argument we pass "class", not "classification"
class_preds <- predict(logistic_fit,
                       new_data = telecom_test,
                       type = "class")
- obtain estimated probabilities for each outcome value - here type = "prob" since we are getting the probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test, type = "prob")
- combine test set results - here we create a data frame of model results from the test dataset
telecom_results <- telecom_test %>%
select(canceled_service) %>%
cbind(class_preds, prob_preds)
head(telecom_results, 6)
## canceled_service .pred_class .pred_yes .pred_no
## 1 yes yes 0.52979047 0.4702095
## 2 no no 0.38170832 0.6182917
## 3 no yes 0.56777475 0.4322252
## 4 no no 0.23016143 0.7698386
## 5 no no 0.24749052 0.7525095
## 6 no no 0.02215866 0.9778413
Assessing model fit
Key points
in tidymodels, the outcome variable needs to be a factor, and the first factor level is treated as the positive class
to check the ordering of a factor vector, pass it to levels() (see the short check after this list)
confusion matrix for checking correct vs. incorrect predictions: conf_mat()
matrix with counts of all combinations of actual and predicted outcome values
correct predictions: TP (true positive) and TN (true negative)
there are two classification errors: FP and FN
classification metrics with yardstick - need to define truth and estimate
creating confusion matrices and other model fit metrics with yardstick
require a tibble of model results which contain
true outcome values
predicted outcome categories
estimated probabilities of each category
classification accuracy with accuracy function - need to define truth and estimate
not the best metric on its own - classifying every customer as "no" would still achieve roughly 66% accuracy on this test set
takes same arguments as conf_mat
calculates the classification accuracy
- \(\frac{TP + TN}{TP + TN + FP + FN}\)
sensitivity - proportion of all positive cases that were correctly classified. \(\frac{TP}{TP + FN}\)
specificity - proportion of all negative cases that were correctly classified. \(\frac{TN}{TN + FP}\)
can also create a custom metric set
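A quick sketch of the two checks mentioned in this list, assuming the objects created above: levels() confirms which class is treated as positive, and the class counts on the test set give the naive all-"no" baseline accuracy.
levels(telecom_training$canceled_service)   # "yes" should be listed first: it is the positive class in the results below
telecom_test %>%
  count(canceled_service) %>%
  mutate(prop = n / sum(n))                 # share of "no" (about 0.66) = accuracy of always predicting "no"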
Exercise
Calculate conf_mat, accuracy, sensitivity, and specificity. The specificity is much higher than the sensitivity, which means the model is much better at detecting customers who will not cancel their telecom service than those who will.
conf_mat(telecom_results, truth = canceled_service,
estimate = .pred_class)
## Truth
## Prediction yes no
## yes 25 15
## no 57 147
accuracy(telecom_results, truth = canceled_service,
estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.705
sens(telecom_results, truth = canceled_service,
estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sens binary 0.305
spec(telecom_results, truth = canceled_service,
estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 spec binary 0.907
Alternatively:
- create a custom metric function with metric_set()
telecom_metrics <- metric_set(accuracy, sens, spec)
- calculate metrics using the model results
telecom_metrics(telecom_results, truth = canceled_service,
estimate = .pred_class)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.705
## 2 sens binary 0.305
## 3 spec binary 0.907
- create a confusion matrix and pass it to summary() to calculate all available binary classification metrics at once
conf_mat(telecom_results,
truth = canceled_service,
estimate = .pred_class) %>%
summary()
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.705
## 2 kap binary 0.243
## 3 sens binary 0.305
## 4 spec binary 0.907
## 5 ppv binary 0.625
## 6 npv binary 0.721
## 7 mcc binary 0.271
## 8 j_index binary 0.212
## 9 bal_accuracy binary 0.606
## 10 detection_prevalence binary 0.164
## 11 precision binary 0.625
## 12 recall binary 0.305
## 13 f_meas binary 0.410
Visualization of model performance
A couple of key points:
heatmap of the confusion matrix with autoplot()
the default binary classification threshold probability is 0.5
test performance across a range of probability thresholds (a small re-thresholding sketch follows this list)
use the ROC curve as well - covered before. It plots the proportion of actual positives classified correctly (sensitivity) against the proportion of actual negatives classified incorrectly (1 - specificity), visualizing performance across probability thresholds. Also calculate the ROC AUC.
use roc_curve() to calculate the curve
pass the result to autoplot() to plot the ROC curve
use roc_auc() to calculate the AUC value
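A small re-thresholding sketch (the names thresh_results and .pred_custom are made up for illustration): classify a customer as "yes" whenever the predicted probability is at least 0.3 instead of the default 0.5, then recompute sensitivity and specificity with yardstick.
thresh_results <- telecom_results %>%
  mutate(.pred_custom = factor(if_else(.pred_yes >= 0.3, "yes", "no"),
                               levels = levels(canceled_service)))  # keep "yes" as the first level
sens(thresh_results, truth = canceled_service, estimate = .pred_custom)
spec(thresh_results, truth = canceled_service, estimate = .pred_custom)
Lowering the threshold trades specificity for sensitivity, which is one way to catch more of the customers who actually cancel.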
Practice:
- plot the confusion matrix as a heatmap
conf_mat(telecom_results,
truth = canceled_service,
estimate = .pred_class) %>%
autoplot(type = "heatmap")- plot the confusion matrix in mosaic - clearly as stated previously, model performs better in terms of specificity than sensitivity.
conf_mat(telecom_results,
truth = canceled_service,
estimate = .pred_class) %>%
autoplot(type = "mosaic")- ROC curve
First, calculate the metrics (sensitivity and specificity) across thresholds.
threshold_df <- telecom_results %>%
roc_curve(truth = canceled_service,
.pred_yes)
threshold_df
## # A tibble: 246 × 3
## .threshold specificity sensitivity
## <dbl> <dbl> <dbl>
## 1 -Inf 0 1
## 2 0.0222 0 1
## 3 0.0468 0.00617 1
## 4 0.0522 0.0123 1
## 5 0.0556 0.0185 1
## 6 0.0585 0.0247 1
## 7 0.0688 0.0309 1
## 8 0.0715 0.0370 1
## 9 0.0739 0.0432 1
## 10 0.0752 0.0494 1
## # … with 236 more rows
Second, plot the ROC curve.
threshold_df %>%
autoplot()
Third, calculate the AUC - the area under the ROC curve is 0.784.
roc_auc(telecom_results, truth = canceled_service, .pred_yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.784
Automating the modelling workflow
using last_fit()
accepts classification models
speeds up the modeling process
fits the model to the training data and produces predictions on the test dataset
steps
create a data split object with rsample
specifying a model with parsnip
pass to last_fit(): the parsnip model object, the model formula, and the data split object; then use collect_metrics() to calculate metrics on the test dataset
pass to collect_predictions() to create a tibble with all the columns needed by yardstick functions
for metrics, note that accuracy, sensitivity, and specificity take 'truth' and 'estimate' as required arguments, whereas roc_auc takes 'truth' and the estimated probabilities; so when building a custom metric function with metric_set(), both the class estimate and the probability column need to be supplied when it is called
The model fitting process with last_fit():
- train the model with last_fit() and view the test set metrics
telecom_last_fit <- logistic_model %>%
last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
split = telecom_split)
telecom_last_fit %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.705 Preprocessor1_Model1
## 2 roc_auc binary 0.784 Preprocessor1_Model1
- collect predictions, view results and calculate metrics
last_fit_results <- telecom_last_fit %>%
collect_predictions()
last_fit_results
## # A tibble: 244 × 7
## id .pred_yes .pred_no .row .pred_class canceled_service .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/tes… 0.530 0.470 11 yes yes Preprocesso…
## 2 train/tes… 0.382 0.618 15 no no Preprocesso…
## 3 train/tes… 0.568 0.432 18 yes no Preprocesso…
## 4 train/tes… 0.230 0.770 20 no no Preprocesso…
## 5 train/tes… 0.247 0.753 23 no no Preprocesso…
## 6 train/tes… 0.0222 0.978 37 no no Preprocesso…
## 7 train/tes… 0.369 0.631 48 no no Preprocesso…
## 8 train/tes… 0.145 0.855 49 no no Preprocesso…
## 9 train/tes… 0.578 0.422 53 yes yes Preprocesso…
## 10 train/tes… 0.309 0.691 59 no no Preprocesso…
## # … with 234 more rows
last_fit_metrics <- metric_set(accuracy, sens, spec, roc_auc)
last_fit_metrics(last_fit_results, truth = canceled_service,
estimate = .pred_class, .pred_yes)
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.705
## 2 sens binary 0.305
## 3 spec binary 0.907
## 4 roc_auc binary 0.784
- test whether the model performance improves with an additional predictor, months_with_company. The ROC AUC increases from 0.784 to 0.887 (and accuracy from 0.705 to 0.832).
#train a logistic regression model
logistic_fit <- logistic_model %>%
last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges + months_with_company,
split = telecom_split)
#collect metrics
logistic_fit %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.832 Preprocessor1_Model1
## 2 roc_auc binary 0.887 Preprocessor1_Model1
#collect model predictions and plot ROC curve
logistic_fit %>%
collect_predictions() %>%
roc_curve(truth = canceled_service, .pred_yes) %>%
autoplot()