Your Name: Elham Jafarghomi
Your G Number: G00954064
# Suppress dplyr summarise grouping warning messages
options(dplyr.summarise.inform = FALSE)
## Add R libraries here
library(tidyverse)
library(tidymodels)
# Load the dataset
telecom_df <- read_rds(url('https://gmubusinessanalytics.netlify.app/data/telecom_df.rds'))
In this section, you must think of at least 5 relevant questions that explore the relationship between canceled_service and the other variables in the telecom_df data set. The goal of your analysis should be discovering which variables drive the differences between customers who do and do not cancel their service.
You must answer each question and provide supporting data summaries with either a summary data frame (using dplyr/tidyr) or a plot (using ggplot) or both.
In total, you must have a minimum of 3 plots (created with ggplot) and 3 summary data frames (created with dplyr) for the exploratory data analysis section. Among the plots you produce, you must have at least 3 different types (ex. box plot, bar chart, histogram, scatter plot, etc…)
See the Data Analysis Project for an example of a question answered with a summary table and plot.
Note: To add an R code chunk to any section of your project, you can use the keyboard shortcut Ctrl + Alt + i or the insert button at the top of your R project template notebook file.
************************ Part 1: Exploratory Data Analysis ********************************
Before starting the questions, in this section we look at how many customers canceled their service and how many did not.
telecom_df %>% group_by(canceled_service) %>%
summarise(num_customers = n())
## # A tibble: 2 x 2
## canceled_service num_customers
## * <fct> <int>
## 1 yes 427
## 2 no 748
##Data Visualization
ggplot(data = telecom_df, mapping = aes(x = canceled_service, fill = canceled_service))+
geom_bar()
From the summary and the bar plot, we can see that out of 1,175 customers, 427 canceled their service and 748 did not. Below, we explore whether there is a relationship between customers canceling their service and several other variables.
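For reference, the overall cancellation rate can also be expressed as a percentage; a minimal check using the same data (the pct_canceled column name is just an illustrative label):
telecom_df %>%
  summarise(num_customers = n(),
            # share of all customers whose canceled_service value is "yes"
            pct_canceled = round(100*mean(canceled_service == "yes"), 2))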
Question: Is there a relationship between canceled service and whether customers had tech support?
Answer: Yes. The data and summary table indicate that customers without tech support canceled their service at a significantly higher rate than customers with tech support. Among customers without tech support, 45.71% canceled their service; among customers with tech support, the rate drops to roughly one third of that, only 15.95%.
The stacked percentage bar chart also shows that the attrition rate among customers with tech support is much lower than among customers without it, as indicated by the orange area on each bar.
telecom_tech_support <- telecom_df %>% group_by(canceled_service, tech_support) %>%
  summarise(num_customers = n()) %>%
  arrange(tech_support, canceled_service) %>%
  group_by(tech_support) %>%
  # percentage of customers within each tech_support group
  mutate(relative_pct_custs = round(100*(num_customers/sum(num_customers)), 2)) %>%
  # join in each combination's percentage of all customers
  left_join(telecom_df %>% group_by(canceled_service, tech_support) %>%
              summarise(num_customers = n()) %>%
              ungroup() %>%
              mutate(pct_of_total_custs = round(100*(num_customers/sum(num_customers)), 2)),
            by = c("canceled_service", "tech_support", "num_customers"))
telecom_tech_support
## # A tibble: 4 x 5
## # Groups: tech_support [2]
## canceled_service tech_support num_customers relative_pct_cus~ pct_of_total_cu~
## <fct> <fct> <int> <dbl> <dbl>
## 1 yes yes 59 16.0 5.02
## 2 no yes 311 84.0 26.5
## 3 yes no 368 45.7 31.3
## 4 no no 437 54.3 37.2
ggplot(data = telecom_df, mapping = aes(x = tech_support,
fill = canceled_service))+
geom_bar(stat = "count", position = "fill")+
labs(title ="Canceled service Prevalence by tech support Status(Yes-No)" ,
x = "Tech Support", y ="Proportion of Customers")
Question: Is there a relationship between canceled service and customers' internet service type?
Answer: Yes. The data indicate a relationship between internet service type and cancellation status. The summary table and the plot show that customers with fiber optic internet service cancel at a higher rate than those with digital service: 45.33% of fiber optic customers canceled, compared with only 18.53% of digital customers.
The stacked percentage chart also shows that the cancellation rate among customers with digital service (18.5%, the orange area under digital) is much lower than the 45.33% rate among customers with fiber optic service.
In general, fiber optic subscriptions are more common, but fiber optic attrition is also higher. Almost 67% of all customers purchased fiber optic and only 33% chose digital internet service, yet cancellations among all customers break down to about 30% (fiber optic) versus 6% (digital); a quick check of the overall subscription split appears after the plot below.
##Code
telecom_internet_service <- telecom_df %>% group_by(canceled_service, internet_service) %>%
summarise(num_customers = n()) %>%
arrange(internet_service,canceled_service) %>%
group_by(internet_service) %>%
mutate(relative_pct_custs = round(100*(num_customers/sum(num_customers)), 2)) %>%
left_join(telecom_df %>% group_by(canceled_service, internet_service) %>%
summarise(num_customers = n()) %>%
ungroup() %>%
mutate(pct_of_total_custs = round(100*(num_customers/sum(num_customers)), 2)),
by =c("canceled_service", "internet_service","num_customers"))
telecom_internet_service
## # A tibble: 4 x 5
## # Groups: internet_service [2]
## canceled_service internet_service num_customers relative_pct_custs
## <fct> <fct> <int> <dbl>
## 1 yes fiber_optic 354 45.3
## 2 no fiber_optic 427 54.7
## 3 yes digital 73 18.5
## 4 no digital 321 81.5
## # ... with 1 more variable: pct_of_total_custs <dbl>
ggplot(data = telecom_df, mapping = aes(x = internet_service, fill = canceled_service)) +
geom_bar(stat = "count", position = "fill")+
labs(title = "Canceled Service Prevalence Internet Service Type",
x = "Internet Service", y= "Proportion of Customers")
Question: Is there a relationship between canceled service and online security?
Answer: Yes, the data indicate a strong relationship between having online security and service cancellation. The summary table and the stacked bar chart show that customers without online security canceled their service at a significantly higher rate than customers with online security: almost 47% of customers without online security canceled, compared with only 16% of those who had it.
telecom_online_security <- telecom_df %>% group_by(canceled_service, online_security) %>%
summarise(num_customers = n()) %>%
arrange(online_security,canceled_service) %>%
group_by(online_security) %>%
mutate(relative_pct_custs = round(100*(num_customers/sum(num_customers)), 2)) %>%
left_join(telecom_df %>% group_by(canceled_service, online_security) %>%
summarise(num_customers = n()) %>%
ungroup() %>%
mutate(pct_of_total_custs = round(100*(num_customers/sum(num_customers)), 2)),
by =c("canceled_service", "online_security","num_customers"))
telecom_online_security
## # A tibble: 4 x 5
## # Groups: online_security [2]
## canceled_service online_security num_customers relative_pct_custs
## <fct> <fct> <int> <dbl>
## 1 yes yes 64 16
## 2 no yes 336 84
## 3 yes no 363 46.8
## 4 no no 412 53.2
## # ... with 1 more variable: pct_of_total_custs <dbl>
ggplot(data = telecom_df, mapping = aes(x = online_security, fill = canceled_service)) +
geom_bar(stat = "count", position = "fill")+
labs(title = "Canceled Service Prevalence Online security Status(Yes-No)",
x = "Online Security", y= "Proportion of Customers")
Question: Is there a relationship between canceled service and total number of months with the company?
Answer: Yes. The data indicate a strong relationship between total months with the company and canceling service. The summary table shows that customers who canceled their service tended to stay with the company for fewer months than customers who did not. Among the 427 customers who canceled, the average tenure is about 18 months (17.69), while the average for customers who did not cancel is about 39 months (38.66), a difference of roughly 21 months.
Among customers who did not cancel their service, only 26% stayed with the company less than 18 months; among customers who did cancel, this share rises to 66%.
Description of the box plot: the median tenure of customers who canceled their service is 10 months, four times shorter than the 40-month median of customers who did not cancel.
Description of the violin plot: the violin plot shows that the data for customers who canceled are concentrated between 0 and roughly 18-20 months; the plot is widest in that range, meaning most of those data points fall there.
telecom_df %>% group_by(canceled_service) %>%
summarise(num_customers = n(),
min_months_with_co = min(months_with_company),
avg_months_with_co = round(mean(months_with_company),2),
median_months_with_co = median(months_with_company),
max_months_with_co = max(months_with_company),
sd_months_with_co = sd(months_with_company),
pct_less_18months = round(mean(months_with_company<18),2))
## # A tibble: 2 x 8
## canceled_service num_customers min_months_with_co avg_months_with_co
## * <fct> <int> <dbl> <dbl>
## 1 yes 427 1 17.7
## 2 no 748 1 38.7
## # ... with 4 more variables: median_months_with_co <dbl>,
## # max_months_with_co <dbl>, sd_months_with_co <dbl>, pct_less_18months <dbl>
#Violin Plot
ggplot(data = telecom_df, mapping = aes(x = canceled_service,
                                        y = months_with_company, fill = canceled_service)) +
  geom_violin() +
  geom_jitter(width = 0.07, alpha = 0.2) +
  labs(title = "Customer Months with Company by Status (Canceled Service - Yes/No)",
       x = "Canceled Service", y = "Total Months with Company")
#BoxPlot
ggplot(data = telecom_df, mapping = aes(x = canceled_service, y = months_with_company, fill = canceled_service)) +
  geom_boxplot() +
  labs(title = "Customer Months with Company by Service Status (Canceled Service - Yes/No)",
       x = "Canceled Service", y = "Number of Months with Company")
Question: Is there a relationship between canceled service and customers' monthly charges?
Answer: No. The data do not support a relationship between customers canceling their service and their monthly charges. The average monthly charge for customers who canceled is almost 82 dollars (81.88), essentially the same as for customers who did not cancel (81.91).
Among the 427 customers who canceled, 58% had a monthly charge above 80 dollars, compared with 54% of customers who did not cancel, only a small difference.
telecom_df %>% group_by(canceled_service) %>%
summarise(num_customers = n(),
min_monthly_chrg = min(monthly_charges),
avg_monthly_chrg = round(mean(monthly_charges),2),
median_monthly_chrg = median(monthly_charges),
max_monthly_chrg = max(monthly_charges),
sd_monthly_chrg = sd(monthly_charges),
pct_greater_80 = round(mean(monthly_charges>80),2))
## # A tibble: 2 x 8
## canceled_service num_customers min_monthly_chrg avg_monthly_chrg
## * <fct> <int> <dbl> <dbl>
## 1 yes 427 44.2 81.9
## 2 no 748 43.4 81.9
## # ... with 4 more variables: median_monthly_chrg <dbl>, max_monthly_chrg <dbl>,
## # sd_monthly_chrg <dbl>, pct_greater_80 <dbl>
ggplot(data = telecom_df, mapping = aes(x = canceled_service, y = monthly_charges,
fill = canceled_service))+
geom_boxplot()+
labs(title = "Customers Service Monthly Charges by Status(Canceled Service-Yes/No)",
x = "Canceled Service", y = "Service Monthly Charge")
In this section of the project, you will fit three classification algorithms to predict the response variable, canceled_service. You should use all of the other variables in the telecom_df data as predictor variables for each model.
You must follow the machine learning steps below.
The data splitting and feature engineering steps should only be done once so that your models are using the same data and feature engineering steps for training.
1. Split the telecom_df data into a training and test set (remember to set your seed)
2. Specify a feature engineering pipeline with the recipes package
3. Specify a parsnip model object
   - You may choose from the following classification algorithms:
4. Package your recipe and model into a workflow
5. Fit your workflow to the training data
   - If your model has hyperparameters:
     - Split the training data into folds with vfold_cv (remember to set your seed)
     - Perform hyperparameter tuning with a random grid search using the grid_random() function
     - Please do not set the size argument of grid_random() too large. I recommend size = 10 or smaller.
     - Select the best model with select_best() and finalize your workflow
6. Evaluate model performance on the test set by plotting an ROC curve using autoplot() and calculating the area under the ROC curve on your test data
****************************** Part 2: Machine Learning **********************************
Before training any ML models, we split the data into training and test sets so that model performance can be evaluated on data the models have never seen, which guards against an overly optimistic assessment and overfitting.
##Step 1: Data Splitting
library(tidymodels)
set.seed(222)
telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)
telecom_split
## <Analysis/Assess/Total>
## <882/293/1175>
In the section above, we split the telecom_df data into training and test sets, randomly selecting 75% of the rows for training and 25% for testing, stratified by canceled_service. We have 882 rows in the training set and 293 rows in the test set for assessing model performance later.
telecom_training <- telecom_split %>% training()
telecom_test <- telecom_split %>% testing()

# Create 5 cross-validation folds from the training data for hyperparameter tuning
set.seed(222)
telecom_folds <- vfold_cv(telecom_training, v = 5)
telecom_training
## # A tibble: 882 x 19
## canceled_service senior_citizen spouse_partner dependents cellular_service
## <fct> <fct> <fct> <fct> <fct>
## 1 no no no no single_line
## 2 no no yes yes single_line
## 3 no no yes no multiple_lines
## 4 yes yes yes no multiple_lines
## 5 no no no yes multiple_lines
## 6 no yes no no single_line
## 7 no yes yes no multiple_lines
## 8 no no no no multiple_lines
## 9 no yes yes yes multiple_lines
## 10 no no yes no multiple_lines
## # ... with 872 more rows, and 14 more variables: avg_call_mins <dbl>,
## # avg_intl_mins <dbl>, internet_service <fct>, online_security <fct>,
## # online_backup <fct>, device_protection <fct>, tech_support <fct>,
## # streaming_tv <fct>, streaming_movies <fct>, contract <fct>,
## # paperless_bill <fct>, payment_method <fct>, months_with_company <dbl>,
## # monthly_charges <dbl>
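Because the split was stratified on canceled_service, the cancellation rate should be similar in the training and test sets. The quick check below is a sketch reusing the objects created above; the data_set label column is added only for this comparison.
bind_rows(telecom_training %>% mutate(data_set = "training"),
          telecom_test %>% mutate(data_set = "test")) %>%
  group_by(data_set) %>%
  summarise(num_customers = n(),
            # cancellation rate within each data set
            pct_canceled = round(100*mean(canceled_service == "yes"), 2))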
Below, the skim() function is used to quickly check the distributions, and in particular the skewness, of the numeric columns.
library(skimr)
skim(telecom_training)
| Name | telecom_training |
| Number of rows | 882 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| factor | 15 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| canceled_service | 0 | 1 | FALSE | 2 | no: 561, yes: 321 |
| senior_citizen | 0 | 1 | FALSE | 2 | no: 709, yes: 173 |
| spouse_partner | 0 | 1 | FALSE | 2 | no: 483, yes: 399 |
| dependents | 0 | 1 | FALSE | 2 | no: 656, yes: 226 |
| cellular_service | 0 | 1 | FALSE | 2 | mul: 476, sin: 406 |
| internet_service | 0 | 1 | FALSE | 2 | fib: 578, dig: 304 |
| online_security | 0 | 1 | FALSE | 2 | no: 575, yes: 307 |
| online_backup | 0 | 1 | FALSE | 2 | no: 505, yes: 377 |
| device_protection | 0 | 1 | FALSE | 2 | no: 487, yes: 395 |
| tech_support | 0 | 1 | FALSE | 2 | no: 596, yes: 286 |
| streaming_tv | 0 | 1 | FALSE | 2 | yes: 456, no: 426 |
| streaming_movies | 0 | 1 | FALSE | 2 | no: 454, yes: 428 |
| contract | 0 | 1 | FALSE | 3 | mon: 575, one: 160, two: 147 |
| paperless_bill | 0 | 1 | FALSE | 2 | yes: 613, no: 269 |
| payment_method | 0 | 1 | FALSE | 4 | ele: 370, ban: 191, cre: 184, mai: 137 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| avg_call_mins | 0 | 1 | 349.49 | 73.05 | 107.00 | 305.00 | 348.5 | 395.00 | 562.0 | ▁▃▇▅▁ |
| avg_intl_mins | 0 | 1 | 106.00 | 31.73 | 15.00 | 84.00 | 106.0 | 127.00 | 211.0 | ▁▅▇▃▁ |
| months_with_company | 0 | 1 | 31.22 | 24.43 | 1.00 | 8.00 | 27.0 | 54.00 | 72.0 | ▇▃▃▃▅ |
| monthly_charges | 0 | 1 | 81.77 | 18.63 | 43.45 | 69.81 | 83.8 | 95.94 | 118.6 | ▅▅▇▇▃ |
It looks like the avg_call_mins and avg_intl_mins columns have roughly normal distributions, but months_with_company and monthly_charges are not normally distributed. Hence, we will include a skewness-correcting transformation in our feature engineering recipe.
##Step 2: Feature Engineering
Since the telecom data contains both nominal (factor) and numeric variables, we will create a feature engineering recipe that removes skewness from the numeric predictors, normalizes all numeric predictors, and creates dummy variables from the nominal predictors.
telecom_transformation <- recipe(canceled_service ~ ., data = telecom_training) %>%
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%   # reduce skewness in numeric predictors
  step_normalize(all_numeric(), -all_outcomes()) %>%    # center and scale numeric predictors
  step_dummy(all_nominal(), -all_outcomes())            # create dummy variables from nominal predictors
telecom_transformation %>%
prep(training = telecom_training) %>%
bake(new_data = NULL)
## # A tibble: 882 x 22
## avg_call_mins avg_intl_mins months_with_com~ monthly_charges canceled_service
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 -0.600 -0.0293 0.325 -0.856 no
## 2 -1.09 0.631 -0.789 -1.69 no
## 3 -0.0281 0.757 1.34 1.58 no
## 4 0.0823 -1.48 -0.939 0.679 yes
## 5 0.263 2.35 1.23 0.462 no
## 6 0.543 -0.124 -0.939 -0.928 no
## 7 0.249 -0.534 0.834 1.52 no
## 8 -0.356 1.13 -0.322 1.39 no
## 9 -0.586 1.42 1.20 1.44 no
## 10 -0.800 -1.39 1.36 1.29 no
## # ... with 872 more rows, and 17 more variables: senior_citizen_no <dbl>,
## # spouse_partner_no <dbl>, dependents_no <dbl>,
## # cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## # online_security_no <dbl>, online_backup_no <dbl>,
## # device_protection_no <dbl>, tech_support_no <dbl>, streaming_tv_no <dbl>,
## # streaming_movies_no <dbl>, contract_one_year <dbl>,
## # contract_two_year <dbl>, paperless_bill_no <dbl>,
## # payment_method_credit_card <dbl>, payment_method_electronic_check <dbl>,
## # payment_method_mailed_check <dbl>
# Performance metrics to calculate on the test set for each model
my_metrics <- metric_set(accuracy, sens, spec, f_meas, roc_auc)
##Step 3: Models
logistic_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
##Create the workflow for the logistic regression
telecom_logistic_wf <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(telecom_transformation)
telecom_logistic_fit <- telecom_logistic_wf %>%
last_fit(split = telecom_split,
metrics = my_metrics)
logistic_results <- telecom_logistic_fit %>%
collect_predictions()
logistic_results
## # A tibble: 293 x 7
## id .pred_class .row .pred_yes .pred_no canceled_service .config
## <chr> <fct> <int> <dbl> <dbl> <fct> <chr>
## 1 train/tes~ yes 9 0.843 0.157 no Preprocesso~
## 2 train/tes~ no 20 0.166 0.834 no Preprocesso~
## 3 train/tes~ yes 21 0.882 0.118 yes Preprocesso~
## 4 train/tes~ yes 23 0.710 0.290 no Preprocesso~
## 5 train/tes~ no 33 0.120 0.880 no Preprocesso~
## 6 train/tes~ yes 38 0.665 0.335 no Preprocesso~
## 7 train/tes~ no 41 0.0320 0.968 no Preprocesso~
## 8 train/tes~ no 42 0.143 0.857 no Preprocesso~
## 9 train/tes~ yes 46 0.801 0.199 yes Preprocesso~
## 10 train/tes~ no 50 0.359 0.641 no Preprocesso~
## # ... with 283 more rows
telecom_logistic_fit %>%collect_metrics()
## # A tibble: 5 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.792 Preprocessor1_Model1
## 2 sens binary 0.708 Preprocessor1_Model1
## 3 spec binary 0.840 Preprocessor1_Model1
## 4 f_meas binary 0.711 Preprocessor1_Model1
## 5 roc_auc binary 0.856 Preprocessor1_Model1
roc_curve(logistic_results,
truth = canceled_service,
estimate = .pred_yes) %>%
autoplot()
## The Logistic Model Confusion Matrix
conf_mat(logistic_results,
truth = canceled_service,
estimate = .pred_class)%>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n)) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "red", alpha = 1, size = 8) +
labs(y = "True Class", x = "Predicted Class", fill = NULL,
title = "Confusion Matrix", subtitle = "For Logistic Regression Model")+
theme_bw()
From the confusion matrix of the logistic regression model, we can see that the number of false positives is 23 and the number of false negatives is 24.
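If a plain table is preferred over the heatmap, the same counts can be printed directly; this is a minimal alternative reusing the objects above.
# Confusion matrix printed as a simple table of counts
conf_mat(logistic_results,
         truth = canceled_service,
         estimate = .pred_class)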
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine('ranger', importance = "impurity") %>%
set_mode('classification')
telecom_rf_wf <- workflow() %>%
add_model(rf_model) %>%
add_recipe(telecom_transformation)
For the mtry() hyperparameter we set the range from 5 to 15. After the feature engineering recipe creates dummy variables there are 21 predictor columns, so we want to test mtry() values toward the middle of the range from 1 to 21, since it is suggested to avoid values close to either end.
# Setting the seed for reproducibility
set.seed(222)

# Random grid of 10 hyperparameter combinations, with mtry between 5 and 15
rf_grid <- grid_random(mtry() %>% range_set(c(5, 15)),
                       trees(),
                       min_n(),
                       size = 10)
##Tuning the Hyperparameters of the Random Forest Workflow
set.seed(222)
rf_tuning <- telecom_rf_wf %>%
tune_grid(resamples = telecom_folds,
grid = rf_grid)
rf_tuning %>% show_best('roc_auc')
## # A tibble: 5 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 144 5 roc_auc binary 0.877 5 0.00761 Preprocessor1_Model10
## 2 8 1385 10 roc_auc binary 0.876 5 0.00765 Preprocessor1_Model06
## 3 12 1454 2 roc_auc binary 0.876 5 0.00746 Preprocessor1_Model03
## 4 10 602 6 roc_auc binary 0.876 5 0.00770 Preprocessor1_Model04
## 5 6 1121 38 roc_auc binary 0.875 5 0.00855 Preprocessor1_Model01
We can see that the best combination based on roc_auc is mtry = 6, trees = 144, and min_n = 5.
best_rf <- rf_tuning %>%
select_best(metric = 'roc_auc')
Using the best hyperparameter combination based on roc_auc, we can finalize our workflow as follows:
final_rf_wf <- telecom_rf_wf %>%
finalize_workflow(best_rf)
telecom_rf_lastfit <- final_rf_wf %>%
last_fit(telecom_split, metrics = my_metrics)
##View the Random Forest Model Performance Metrics on the Test Set
telecom_rf_lastfit %>% collect_metrics()
## # A tibble: 5 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.792 Preprocessor1_Model1
## 2 sens binary 0.689 Preprocessor1_Model1
## 3 spec binary 0.850 Preprocessor1_Model1
## 4 f_meas binary 0.705 Preprocessor1_Model1
## 5 roc_auc binary 0.844 Preprocessor1_Model1
rf_results <- telecom_rf_lastfit %>%
collect_predictions()
rf_results
## # A tibble: 293 x 7
## id .pred_class .row .pred_yes .pred_no canceled_service .config
## <chr> <fct> <int> <dbl> <dbl> <fct> <chr>
## 1 train/tes~ yes 9 0.716 0.284 no Preprocesso~
## 2 train/tes~ no 20 0.203 0.797 no Preprocesso~
## 3 train/tes~ yes 21 0.791 0.209 yes Preprocesso~
## 4 train/tes~ no 23 0.416 0.584 no Preprocesso~
## 5 train/tes~ no 33 0.439 0.561 no Preprocesso~
## 6 train/tes~ yes 38 0.530 0.470 no Preprocesso~
## 7 train/tes~ no 41 0.184 0.816 no Preprocesso~
## 8 train/tes~ no 42 0.182 0.818 no Preprocesso~
## 9 train/tes~ yes 46 0.694 0.306 yes Preprocesso~
## 10 train/tes~ no 50 0.389 0.611 no Preprocesso~
## # ... with 283 more rows
conf_mat(rf_results,
truth = canceled_service,
estimate = .pred_class)%>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n)) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "red", alpha = 1, size = 8) +
labs(y = "True Class", x = "Predicted Class", fill = NULL,
title = "Confusion Matrix", subtitle = "For Random Forest Model")+
theme_bw()
From the confusion matrix of the random forest model, we can see that the number of false positives is 23 and the number of false negatives is 27.
roc_curve(rf_results,
truth = canceled_service,
estimate = .pred_yes) %>%
autoplot()
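Since the ranger engine was specified with importance = "impurity", we can also look at which predictors the random forest relied on most. The sketch below assumes the vip package is installed and a version of tune that provides extract_fit_parsnip(); it is illustrative rather than part of the required workflow.
library(vip)

# Pull the fitted ranger model out of the last_fit object and plot
# the impurity-based importance of the top predictors
telecom_rf_lastfit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)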
library(discrim)
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
lda_model <- discrim_regularized(frac_common_cov = 1) %>%
set_engine('klaR') %>%
set_mode('classification')
telecom_lda_wf <- workflow() %>%
add_model(lda_model) %>%
add_recipe(telecom_transformation)
telecom_lda_fit <- telecom_lda_wf %>%
last_fit(split = telecom_split,
metrics = my_metrics)
lda_results <- telecom_lda_fit %>%
collect_predictions()
telecom_lda_fit %>% collect_metrics()
## # A tibble: 5 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.782 Preprocessor1_Model1
## 2 sens binary 0.745 Preprocessor1_Model1
## 3 spec binary 0.802 Preprocessor1_Model1
## 4 f_meas binary 0.712 Preprocessor1_Model1
## 5 roc_auc binary 0.853 Preprocessor1_Model1
roc_curve(lda_results,
truth = canceled_service,
estimate = .pred_yes) %>%
autoplot()
## The LDA Model Confusion Matrix
conf_mat(lda_results,
truth = canceled_service,
estimate = .pred_class)%>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n)) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "red", alpha = 1, size = 8) +
labs(y = "True Class", x = "Predicted Class", fill = NULL,
title = "Confusion Matrix", subtitle = "For Linear Discriminant Analysis Model")+
theme_bw()
From the confusion matrix of the LDA model, we can see that the number of false positives is 22 and the number of false negatives is also 22, which indicates that the LDA model misclassified the fewest customers of the three models.
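Before summarizing, it can be helpful to compare the three models side by side. The sketch below reuses the prediction and metric objects created above to collect the test-set metrics in one table and to overlay the three ROC curves; the model label column is added only for this comparison.
# Collect the test-set metrics from all three fitted workflows
bind_rows(telecom_logistic_fit %>% collect_metrics() %>% mutate(model = "Logistic Regression"),
          telecom_rf_lastfit %>% collect_metrics() %>% mutate(model = "Random Forest"),
          telecom_lda_fit %>% collect_metrics() %>% mutate(model = "LDA")) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

# Overlay the ROC curves of the three models on the same axes
bind_rows(logistic_results %>% mutate(model = "Logistic Regression"),
          rf_results %>% mutate(model = "Random Forest"),
          lda_results %>% mutate(model = "LDA")) %>%
  group_by(model) %>%
  roc_curve(truth = canceled_service, estimate = .pred_yes) %>%
  autoplot()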
*************************************** Summary of Results *****************************************
Write a summary of your overall findings and recommendations to the executives at the company. Think of this section as your closing remarks of a presentation, where you summarize your key findings, model performance, and make recommendations to improve customer retention and service at this company.
Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:
- An introduction where you explain the business problem and goals of your data analysis
  - What problem(s) is this company trying to solve? Why are they important to their future success?
  - What was the goal of your analysis? What questions were you trying to answer and why do they matter?
- Highlights and key findings from your Exploratory Data Analysis section
  - What were the interesting findings from your analysis and why are they important for the business?
  - This section is meant to establish the need for your recommendations in the following section
- Your “best” classification model and an analysis of its performance
  - In this section you should talk about the expected error of your model on future data
  - You should discuss at least one performance metric, such as an F1, sensitivity, specificity, or ROC AUC for your model. However, you must explain the results in an intuitive, non-technical manner. Your audience in this case are executives at a telecommunications company with limited knowledge of machine learning.
- Your recommendations to the company on how to reduce customer attrition rates
  - Each recommendation must be supported by your data analysis results
  - You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation
  - You must also describe the potential business impact of your recommendation:
    - Why is this a good recommendation?
    - What benefits will the business achieve?
Summary
Add your summary here. Please do not place your text within R code chunks.
The goal of this project is first to explore five factors through exploratory data analysis and identify whether there is a relationship between customers canceling their service and any of those factors, and then, based on those relationships, to provide the company with insight and recommendations. Additionally, the goal is to provide the company with our best predictive model of whether a customer will cancel their service, so the company can take proactive action to improve its services and offers and, ultimately, customer satisfaction.
For this purpose, the following five questions were addressed:
Is there a relationship between canceled service and whether customers had tech support?
Is there a relationship between canceled service and customers' internet service type?
Is there a relationship between canceled service and whether customers had online security?
Is there a relationship between canceled service and total months with company?
Is there a relationship between canceled service and customers' monthly charges?
These questions matter because, by exploring the data and answering them, we can discover factors that may be correlated with customers canceling their service. By presenting key findings and recommendations, we can help the company minimize its customer attrition rate; in the competitive world of telecommunications, customer retention is key.
The analysis in this project revealed several interesting patterns and relationships. The results show that total months with the company is one of the most significant factors in customer attrition: the median tenure of customers who canceled their service is 10 months, while the median for customers who did not cancel is 40 months, four times longer. The analysis also indicated strong relationships between customer attrition and internet service type, tech support, and online security. In contrast, and somewhat surprisingly, customers' monthly charges did not appear to be related to attrition.
** Below, all of these key findings are explained in detail:
The analysis of the first question, whether there is a relationship between canceling service and tech support, indicates that customers without tech support canceled their service at a significantly higher rate than customers with tech support: 45.71% of customers without tech support canceled, compared with only 15.95% (roughly one third of that rate) among customers with tech support. The stacked percentage bar chart also shows that the attrition rate among customers with tech support is much lower than among customers without it.
The outcome of the second question, whether there is a relationship between canceled service and internet service type, shows that customers with fiber optic internet service cancel at a higher rate than those with digital service: the cancellation rate was 45.33% for fiber optic customers but only 18.53% for digital customers. Interestingly, fiber optic is the more popular service yet also has the higher attrition. The summary table shows that almost 67% of all customers purchased fiber optic and only 33% chose digital internet service, while cancellations among all customers break down to about 30% (fiber optic) versus 6% (digital).
The third question, on the relationship between canceled service and online security, indicated that customers without online security canceled their service at a significantly higher rate than customers with online security: almost 47% of customers without online security canceled, compared with only 16% of those who had it.
The fourth question, on the relationship between canceled service and total months with the company, indicated a strong relationship, as mentioned earlier. Customers who canceled tended to stay with the company for fewer months than customers who did not cancel. Among the 427 customers who canceled, the average tenure is about 18 months (17.69), while the average for customers who did not cancel is about 39 months (38.66), a difference of roughly 21 months.
Among customers who did not cancel their service, only 26% stayed with the company less than 18 months; among customers who did cancel, this share rises to 66%.
The last question, on the relationship between canceled service and customers' monthly charges, indicated no relationship between canceling service and monthly charges. The average monthly charge for customers who canceled is almost 82 dollars (81.88), essentially the same as for customers who did not cancel (81.91). Among the 427 customers who canceled, 58% had a monthly charge above 80 dollars, compared with 54% of customers who did not cancel. Therefore, there is no evidence of a strong relationship between canceling service and monthly charges.
3) Results of predictive models
Three classification models, logistic regression, random forest, and linear discriminant analysis (LDA), were fit to predict whether a customer cancels their service based on all of the other variables. The performance of the three models was very close. Among them, we consider the LDA model the best choice: on the test set it achieved the highest sensitivity (74.5%) and the highest F1 score (71.2%), with an accuracy of 78.2% and an area under the ROC curve of 85.3%, and it produced the lowest counts of false positives and false negatives. The logistic regression model had a marginally higher accuracy (79.2%) and ROC AUC (85.6%), but the LDA model catches more of the customers who actually cancel, which matters most for retention. A false positive here is a customer the model predicted would cancel who did not actually cancel; a false negative is a customer the model predicted would stay who actually canceled their service. Both of these errors are lowest for the LDA model. The area under the ROC curve summarizes how well a model separates the two classes: the closer this value is to 1, the better the model is at telling customers who cancel apart from those who do not.
** Recommendations:
Based on the results of the analysis and the key findings, it is crucial for the company to improve the online security offered to customers, since the attrition rate among customers without online security is almost 47%, compared with only 16% among customers who had it. It is also recommended to improve and expand tech support, as customers without tech support canceled their service at a significantly higher rate than customers who had it (45.71% versus 15.95%).
Additionally, it is highly recommended to dig into the root causes around the fiber optic internet service and investigate why customers purchase it more often than the digital service yet cancel it at a much higher rate. This should not be ignored, because the cancellation rate among fiber optic customers is 45.33%, compared with only 18.53% among digital customers.
As the data show, fiber optic subscriptions are more common but fiber optic attrition is also higher: almost 67% of all customers purchased fiber optic and only 33% chose digital internet service, yet cancellations among all customers break down to about 30% (fiber optic) versus 6% (digital). Last but not least, it is recommended that the company offer promotions and deals to newer customers (those with roughly 0-18 months of tenure). The analysis shows that only 26% of customers who did not cancel had been with the company less than 18 months, while 66% of customers who did cancel fell in that range, so targeted offers in this early period could encourage customers to stay.
In conclusion, the factors mentioned above should be investigated, and every attempt should be made to improve them, in order to increase customer satisfaction and ultimately minimize financial and reputational losses.