Your Name: Jing-Ting Huang
Your G Number: G01226692Q
# Suppress dplyr summarise grouping warning messages
options(dplyr.summarise.inform = FALSE)
## Add R libraries here
library(tidyverse)
library(tidymodels)
library(discrim)
library(klaR)
library(kknn)
library(magrittr)
library(dplyr)
library(ggplot2)
library(vip)
# Load the dataset
telecom_df <- read_rds(url('https://gmubusinessanalytics.netlify.app/data/telecom_df.rds'))
In this section, you must think of at least 5 relevant questions that explore the relationship between canceled_service
and the other variables in the telecom_df
data set. The goal of your analysis should be discovering which variables drive the differences between customers who do and do not cancel their service.
You must answer each question and provide supporting data summaries with either a summary data frame (using dplyr
/tidyr
) or a plot (using ggplot
) or both.
In total, you must have a minimum of 3 plots (created with ggplot
) and 3 summary data frames (created with dplyr
) for the exploratory data analysis section. Among the plots you produce, you must have at least 3 different types (ex. box plot, bar chart, histogram, scatter plot, etc…)
See the Data Analysis Project for an example of a question answered with a summary table and plot.
Note: To add an R code chunk to any section of your project, you can use the keyboard shortcut Ctrl
+ Alt
+ i
or the insert
button at the top of your R project template notebook file.
Question: Is there a relationship between canceled_service, avg_intl_mins ?
Answer: We can tell from the graph that there are indeed many international users who are unwilling to cancel the service, especially if the average call time more than 100 minutes, it will reduce the willingness to cancel the service, so we can know that if we want to improve, we should aim for the people who’s average call time is less than 100 minutes and change service.
Q1 <- telecom_df %>% group_by(canceled_service)%>%
summarise(n_call = n(),
avg_call_mins = mean(avg_intl_mins),
sd_call_mins = sd(avg_intl_mins))
Q1
## # A tibble: 2 x 4
## canceled_service n_call avg_call_mins sd_call_mins
## * <fct> <int> <dbl> <dbl>
## 1 yes 427 93.6 29.5
## 2 no 748 113. 30.9
ggplot(data = telecom_df, aes(x=avg_intl_mins, fill = canceled_service))+
geom_histogram(color = "blue", bins = 30)+
facet_wrap(~ canceled_service, nrow = 2) +
labs(title = "Canceled Services Distribution by Average monthly international call minutes",
x = "Call_Mins", y = "Among of Calls")
Question: Is there a relationship between canceled_service, contract, and months_with_company?
Answer: They are partly related. We can tell from the graph that the willingness of canceled service is not significantly different when people sign in contracts 1 year and 2 years. Most of the people who sign in the contracts 1 year and 2 years have will consider whether to cancel the contract after more than 30 months. However, the one who signs in the month-to-month contract is very unstable. The average month of the month-to-month contract is 19.08949 which is much less than contracts 1 year and 2 years which are 45.70 months and 62.15 months.
Q2 <- telecom_df %>% group_by(contract)%>%
summarise(n_call = n(),
avg_months_with_company = mean(months_with_company),
sd_months_with_company = sd(months_with_company))
Q2
## # A tibble: 3 x 4
## contract n_call avg_months_with_company sd_months_with_company
## * <fct> <int> <dbl> <dbl>
## 1 month_to_month 771 19.1 18.3
## 2 one_year 204 45.7 17.7
## 3 two_year 200 62.2 13.2
ggplot(data = telecom_df, aes(x=months_with_company, fill = canceled_service))+
geom_boxplot(aes (y = contract),color = "blue", bins = 30) +
labs(title = "Canceled Services Distribution Divide By Contract",
x = "Months With Company", y = "Contract")+
theme(plot.title = element_text(size=9))
## Warning: Ignoring unknown parameters: bins
Question: Is there a relationship between canceled_service, contract, and internet_service?
Answer: Yes, We can know from the graph that the cancel service is related to the internet service. The user who use digital as their internet service is easier to start thinking about whether to continue using the service. The most user who use digital will stop the service in 10 months. Compare with the fiber_optic, more user will stop the service in 5 months. In comclude, we can suggest that the company can expand the fiber_optic business since the user are more stable.
telecom_df %>% group_by(internet_service, online_security, online_backup) %>%
summarise(n_employees = n())
## # A tibble: 8 x 4
## # Groups: internet_service, online_security [4]
## internet_service online_security online_backup n_employees
## <fct> <fct> <fct> <int>
## 1 fiber_optic yes yes 125
## 2 fiber_optic yes no 86
## 3 fiber_optic no yes 213
## 4 fiber_optic no no 357
## 5 digital yes yes 97
## 6 digital yes no 92
## 7 digital no yes 62
## 8 digital no no 143
ggplot(data = telecom_df, aes(x=months_with_company, fill = canceled_service))+
geom_boxplot(aes (y = internet_service),color = "blue", bins = 30) +
labs(title = "Canceled Services Distribution Divide By Internet Service",
x = "Months With Company", y = "Internet Service")+
theme(plot.title = element_text(size=10))
## Warning: Ignoring unknown parameters: bins
Question: Is there a relationship between cancel_service and monthly-charge?
Answer: Yes, there are partly relationship between cancel_service and monthly-charge. We can observe from the chart that people in the section between 70 to 100 have more willing to cancel the service. The people in the section between 100 to 120 will hold the service. I think there is no problem with high-end service, but company should review the service in 70 to 100 section which is mid-level service. We can know that mid-level has the most users and the most dissatisfied users. Thus, I suggest comapny can modify mid-level projects.
ggplot(telecom_df, aes(x =canceled_service, y = monthly_charges , fill = canceled_service)) +
geom_violin() +
geom_boxplot(fill = NA, width = 0.1, colour = "black") +
labs(title = "Monthly Charges Distribution Divide By Canceled Services",
x = "Canceled Services", y = "Monyhly Charges")
Question: Is there a relationship between cancel_service, payment_method, and monthly_charges?
Answer: Yes, there are partly relationship between cancel_service and payment_method. We can observe from the chart that the electronic check, bank draft, and credit card are the main payment method. Electronic checks have 43.31% users, but there are 51.86% of the users who canceled the service. The canceled rate for Electronic checks is lower than other payment methods which have an average canceled rate is between 22.52% to 31.21%.
telecom_df %>% group_by(payment_method, canceled_service ) %>%
summarise(n_employees = n())
## # A tibble: 8 x 3
## # Groups: payment_method [4]
## payment_method canceled_service n_employees
## <fct> <fct> <int>
## 1 bank_draft yes 57
## 2 bank_draft no 196
## 3 credit_card yes 52
## 4 credit_card no 188
## 5 electronic_check yes 264
## 6 electronic_check no 245
## 7 mailed_check yes 54
## 8 mailed_check no 119
telecom_df %>% group_by(payment_method) %>%
summarise(n_employees = n())
## # A tibble: 4 x 2
## payment_method n_employees
## * <fct> <int>
## 1 bank_draft 253
## 2 credit_card 240
## 3 electronic_check 509
## 4 mailed_check 173
ggplot(data = telecom_df, aes(x= payment_method, color = canceled_service))+
geom_point(position=position_jitterdodge(),alpha=.6,aes (y = monthly_charges)) +
labs(title = "Canceled Services Distribution Divide By Payment Method",
x = "Payment Method", y = "Monyhly Charges")+
theme(axis.text.x = element_text(size=11) )
In this section of the project, you will fit three classification algorithms to predict the response variable,canceled_service
. You should use all of the other variables in the telecom_df
data as predictor variables for each model.
You must follow the machine learning steps below.
The data splitting and feature engineering steps should only be done once so that your models are using the same data and feature engineering steps for training.
telecom_df
data into a training and test set (remember to set your seed)recipes
package
parsnip
model object
vfold_cv
(remember to set your seed)grid_random()
functionsize
argument of grid_random()
too large. I recommend size
= 10 or smaller.select_best()
and finalize your workflowautoplot()
and calculating the area under the ROC curve on your test dataLogistic Regression
set.seed(271)
telecom_split <- initial_split(telecom_df, prop = 0.75,
strata = canceled_service)
telecom_training <- telecom_split %>%
training()
telecom_test <- telecom_split %>%
testing()
# Create cross validation folds for hyperparameter tuning
set.seed(271)
telecom_folds <- vfold_cv(telecom_training, v = 10)
# Feature Engineering
telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
# Feature Engineering
telecom_recipe %>%
prep(training = telecom_training) %>%
bake(new_data = NULL)
## # A tibble: 882 x 22
## avg_call_mins avg_intl_mins months_with_com~ monthly_charges canceled_service
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 -0.600 -0.0284 0.358 -0.854 no
## 2 -1.07 0.618 -0.766 -1.69 no
## 3 -0.0438 0.740 1.36 1.58 no
## 4 0.0646 -1.49 -0.920 0.683 yes
## 5 0.519 -0.122 -0.920 -0.926 no
## 6 0.228 -0.528 0.862 1.53 no
## 7 -0.364 1.11 -0.291 1.40 no
## 8 -0.587 1.38 1.23 1.45 no
## 9 -0.794 -1.39 1.38 1.30 no
## 10 2.08 -1.68 0.358 1.26 yes
## # ... with 872 more rows, and 17 more variables: senior_citizen_no <dbl>,
## # spouse_partner_no <dbl>, dependents_no <dbl>,
## # cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## # online_security_no <dbl>, online_backup_no <dbl>,
## # device_protection_no <dbl>, tech_support_no <dbl>, streaming_tv_no <dbl>,
## # streaming_movies_no <dbl>, contract_one_year <dbl>,
## # contract_two_year <dbl>, paperless_bill_no <dbl>,
## # payment_method_credit_card <dbl>, payment_method_electronic_check <dbl>,
## # payment_method_mailed_check <dbl>
# Specify Logistic Regression Model
logistic_model <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
# Create a Workflow
logistic_wf <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(telecom_recipe)
# Fit Model
logistic_fit <- logistic_wf %>%
last_fit(split = telecom_split)
# Collect Predictions
logistic_results <- logistic_fit %>%
collect_predictions()
roc_curve(logistic_results,
truth = canceled_service,
estimate = .pred_yes) %>%
autoplot()
# ROC AUC
roc_auc(logistic_results,
truth = canceled_service,
.pred_yes)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.854
# Confusion Matrix
conf_mat(logistic_results,
truth = canceled_service,
estimate = .pred_class)
## Truth
## Prediction yes no
## yes 69 29
## no 37 158
model <- glm( canceled_service ~., data = telecom_training, family = binomial)
tidymodel <- tidy(model)
summary(model)
##
## Call:
## glm(formula = canceled_service ~ ., family = binomial, data = telecom_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7604 -0.5484 0.2027 0.5666 2.3851
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.864538 1.687110 0.512 0.608345
## senior_citizenno 0.242228 0.248553 0.975 0.329782
## spouse_partnerno -0.334288 0.232744 -1.436 0.150921
## dependentsno 0.195467 0.265237 0.737 0.461151
## cellular_servicesingle_line 0.238362 0.265344 0.898 0.369018
## avg_call_mins -0.011029 0.001555 -7.091 1.34e-12 ***
## avg_intl_mins 0.029069 0.003805 7.641 2.16e-14 ***
## internet_servicedigital 1.084527 0.444891 2.438 0.014780 *
## online_securityno -0.542299 0.250539 -2.165 0.030424 *
## online_backupno -0.334633 0.232272 -1.441 0.149670
## device_protectionno -0.007819 0.245870 -0.032 0.974632
## tech_supportno -0.454237 0.272624 -1.666 0.095680 .
## streaming_tvno 0.591607 0.196622 3.009 0.002622 **
## streaming_moviesno 0.749691 0.198046 3.785 0.000153 ***
## contractone_year 0.883310 0.345813 2.554 0.010640 *
## contracttwo_year 0.790897 0.519970 1.521 0.128249
## paperless_billno 0.451350 0.225593 2.001 0.045421 *
## payment_methodcredit_card 0.268948 0.333116 0.807 0.419453
## payment_methodelectronic_check -0.306253 0.275853 -1.110 0.266911
## payment_methodmailed_check -0.111534 0.362938 -0.307 0.758609
## months_with_company 0.031387 0.007058 4.447 8.70e-06 ***
## monthly_charges -0.008919 0.013301 -0.671 0.502515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1156.57 on 881 degrees of freedom
## Residual deviance: 675.09 on 860 degrees of freedom
## AIC: 719.09
##
## Number of Fisher Scoring iterations: 6
vip(model)
Specify LDA model
lda_model <- discrim_regularized(frac_common_cov = 1) %>%
set_engine('klaR') %>%
set_mode('classification')
lda_wf <- workflow() %>%
add_model(lda_model) %>%
add_recipe(telecom_recipe)
lda_fit <- lda_wf %>%
last_fit(split = telecom_split)
lda_results <- lda_fit %>%
collect_predictions()
roc_curve(lda_results,
truth = canceled_service,
estimate = .pred_yes) %>%
autoplot()
roc_auc(lda_results,
truth = canceled_service,
.pred_yes)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.852
conf_mat(lda_results,
truth = canceled_service,
estimate = .pred_class)
## Truth
## Prediction yes no
## yes 71 32
## no 35 155
Random Forest
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine('ranger', importance = "impurity") %>%
set_mode('classification')
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(telecom_recipe)
# Hyperparameter Tuning
set.seed(314)
rf_grid <- grid_random(mtry() %>% range_set(c(2, 4)),
trees(),
min_n(),
size = 10)
rf_grid
## # A tibble: 10 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 3 1304 37
## 2 4 477 24
## 3 4 1621 32
## 4 4 1841 6
## 5 3 609 29
## 6 3 1235 16
## 7 4 1822 14
## 8 2 678 19
## 9 3 138 14
## 10 3 1218 4
rf_tuning <- rf_workflow %>%
tune_grid(resamples = telecom_folds,
grid = rf_grid)
## Select best model based on roc_auc
best_rf <- rf_tuning %>%
select_best(metric = 'roc_auc')
# View the best parameters
best_rf
## # A tibble: 1 x 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 2 678 19 Preprocessor1_Model08
final_rf_workflow <- rf_workflow %>%
finalize_workflow(best_rf)
#fit the model
rf_wf_fit <- final_rf_workflow %>%
fit(data = telecom_training)
rf_fit <- rf_wf_fit %>%
pull_workflow_fit()
vip(rf_fit)
#Train with last fit
rf_last_fit <- final_rf_workflow %>%
last_fit(telecom_split)
rf_last_fit %>% collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.771 Preprocessor1_Model1
## 2 roc_auc binary 0.852 Preprocessor1_Model1
#ROC
rf_last_fit %>% collect_predictions() %>%
roc_curve(truth = canceled_service, estimate = .pred_yes) %>%
autoplot()
#Confusion Matrix
rf_predictions <- rf_last_fit %>% collect_predictions()
conf_mat(rf_predictions, truth = canceled_service, estimate = .pred_class)
## Truth
## Prediction yes no
## yes 59 20
## no 47 167
Write a summary of your overall findings and recommendations to the executives at the company. Think of this section as your closing remarks of a presentation, where you summarize your key findings, model performance, and make recommendations to improve customer retention and service at this company.
Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:
An introduction where you explain the business problem and goals of your data analysis
What problem(s) is this company trying to solve? Why are they important to their future success?
What was the goal of your analysis? What questions were you trying to answer and why do they matter?
Highlights and key findings from your Exploratory Data Analysis section
What were the interesting findings from your analysis and why are they important for the business?
This section is meant to establish the need for your recommendations in the following section
Your “best” classification model and an analysis of its performance
Your recommendations to the company on how to reduce customer attrition rates
Each recommendation must be supported by your data analysis results
You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation
You must also describe the potential business impact of your recommendation:
Why is this a good recommendation?
What benefits will the business achieve?
Summary
Add your summary here. Please do not place your text within R code chunks.
The telecom company is keeping losing their customer, the canceled service rate is increasing. The telecom company is trying to find out the key reason that affects their customer canceled the telecom service. The goal of this project is to use machine learning tools to find out the key reason and help the company reduce the customer’s willingness to cancel the service.
Exploratory Data Analysis can help us to make the problem more clear and find out valuable information. While doing the data exploration, I find out the dataset content of 14 factors and 4 numeric columns. Thus, I using 4 numeric columns to find out the relationship between canceled_service and other factors. 1.The relationship between canceled_service, avg_intl_minsus is significant and we can figure out that the longer time users use service, the fewer chance users cancel the service. 2. The relationship between canceled_service, contract, and months_with_company are not significant. The graph shows that users who signed the 1-year and 2-year contract will remain their contract after one year. However, The users who signed the month_to_month contract will have more chances to cancel the service in 10 months. 3. The relationship between canceled_service and internet_service is significant. We can tell from the graph that users who use fiber as their internet have more willing not to cancel the service. 4. The relationship between cancel_service and monthly-charge is significant. We can tell from the graph that user in the section between 70 to 100 have more willing to cancel the service. The company should review the service in the 70 to 100 section.
In order to find the most significant factors that have the greatest impact on service cancellation, I used Logistic Regression, LDA model, and Random Forest to find the most significant factors. And use ROC AUC to measure the accuracy of the model. The higher the ROC AUC, the higher the accuracy of the model’s prediction. And use the most accurate model to find the factors that have the most serious impact on the cancellation of services. The ROC AUC of the Logistic Regression model is 85.435%. The ROC AUC of the LDA model is 85.243%. The ROC AUC of the Random Forest model is 85.087%. After confirm Logistic Regression model is the best model, I try to find out the significant factor from the model. As we can tell from the vip function, there are 4 factors affect user cancel the service significantly. 4 factors including avg_call_mins, avg_intl_mins, streaming_moviesno, and months_with_company. Among the 4 factors, avg_call_mins and avg_intl_mins are the most important factors.
Your recommendations to the company on how to reduce customer attrition rates
According to the results of the Logistic Regression model, avg_call_mins and avg_intl_mins are the most important factors. This means that consumers are very sensitive to these factors, so I suggest that companies should investigate consumers who are dissatisfied with this factor. From the relationship diagram between avg_intl_mins and cancelled_service, it can be found that consumers with a shorter time are more likely to cancel the service, so I suggest that the company should make a plan for customers who use avg_intl_mins for a shorter time. 2.We can find that there is a partial relationship between monthly-charge from the process of data exploration. Customers with a price range of 70 to 100 are more likely to cancel the service, while customers with a monthly fee of more than 100 are more likely to continue using the company’s services. However, we can find from the user structure that the number of customers with monthly fees ranging from 70 to 100 is huge. Therefore, the company should improve the mid-priced service to enable these customers to continue to use the company’s services. 3.Customers who use digital as an internet service are more likely to lose, while customers who use fiber do not. Therefore, I suggest that the company should reduce recommending customers to use digital as an internet service, and encourage customers to switch to fiber to consolidate the user base. 4.Users who use 1 year contract and 2 year contract are less likely to cancel the service. On the contrary, users of month_to_month are more likely to cancel the service. Therefore, I suggest that the company should propose more favorable plans for users of month_to_month and encourage these use Use 1 year contract or 2 year contract.