Your Name: Jing-Ting Huang
Your G Number: G01226692Q

# Suppress dplyr summarise grouping warning messages
options(dplyr.summarise.inform = FALSE)

## Add R libraries here
library(tidyverse)
library(tidymodels)
library(discrim)
library(klaR)
library(kknn)
library(magrittr)
library(dplyr)
library(ggplot2)
library(vip)

# Load the dataset
telecom_df <- read_rds(url('https://gmubusinessanalytics.netlify.app/data/telecom_df.rds'))

Data Analysis

In this section, you must think of at least 5 relevant questions that explore the relationship between canceled_service and the other variables in the telecom_df data set. The goal of your analysis should be discovering which variables drive the differences between customers who do and do not cancel their service.

You must answer each question and provide supporting data summaries with either a summary data frame (using dplyr/tidyr) or a plot (using ggplot) or both.

In total, you must have a minimum of 3 plots (created with ggplot) and 3 summary data frames (created with dplyr) for the exploratory data analysis section. Among the plots you produce, you must have at least 3 different types (ex. box plot, bar chart, histogram, scatter plot, etc…)

See the Data Analysis Project for an example of a question answered with a summary table and plot.

Note: To add an R code chunk to any section of your project, you can use the keyboard shortcut Ctrl + Alt + i or the insert button at the top of your R project template notebook file.

Question 1

Question: Is there a relationship between canceled_service and avg_intl_mins?

Answer: The plot shows that customers with heavy international usage rarely cancel: once average international call time exceeds 100 minutes, the willingness to cancel drops noticeably. Customers who canceled average about 94 international minutes per month, compared with roughly 113 minutes for those who stayed. If the company wants to improve retention, it should therefore focus on customers whose average international call time is below 100 minutes and adjust the service offered to them.

Q1 <- telecom_df %>% group_by(canceled_service)%>%                 
                            summarise(n_call = n(),
                            avg_call_mins = mean(avg_intl_mins),
                            sd_call_mins = sd(avg_intl_mins)) 
Q1 
## # A tibble: 2 x 4
##   canceled_service n_call avg_call_mins sd_call_mins
## * <fct>             <int>         <dbl>        <dbl>
## 1 yes                 427          93.6         29.5
## 2 no                  748         113.          30.9
ggplot(data = telecom_df, aes(x = avg_intl_mins, fill = canceled_service)) +
  geom_histogram(color = "blue", bins = 30) +
  facet_wrap(~ canceled_service, nrow = 2) +
  labs(title = "Canceled Service Distribution by Average Monthly International Call Minutes",
       x = "Average International Call Minutes", y = "Number of Customers")

Question 2

Question: Is there a relationship between canceled_service, contract, and months_with_company?

Answer: They are partly related. The plot shows little difference in cancellation behavior between customers on one-year and two-year contracts; most of them stay with the company for more than 30 months before considering cancellation. Month-to-month customers, however, are much less stable: their average tenure is about 19.1 months, compared with 45.7 months for one-year contracts and 62.2 months for two-year contracts.

Q2 <- telecom_df %>% group_by(contract)%>%                 
                            summarise(n_call = n(),
                            avg_months_with_company = mean(months_with_company),
                            sd_months_with_company = sd(months_with_company)) 
Q2
## # A tibble: 3 x 4
##   contract       n_call avg_months_with_company sd_months_with_company
## * <fct>           <int>                   <dbl>                  <dbl>
## 1 month_to_month    771                    19.1                   18.3
## 2 one_year          204                    45.7                   17.7
## 3 two_year          200                    62.2                   13.2
ggplot(data = telecom_df, aes(x = months_with_company, fill = canceled_service)) +
  geom_boxplot(aes(y = contract), color = "blue") +
  labs(title = "Canceled Service Distribution by Contract",
       x = "Months With Company", y = "Contract") +
  theme(plot.title = element_text(size = 9))
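The table above reports tenure by contract type but not the cancellation rate itself; a small follow-up summary along the same lines (sketched below) makes the month-to-month instability explicit.

# Cancellation rate by contract type
telecom_df %>% 
  group_by(contract) %>% 
  summarise(n_customers = n(),
            cancel_rate = mean(canceled_service == "yes"))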

Question 3

Question: Is there a relationship between canceled_service, internet_service, and months_with_company?

Answer: Yes, the plot shows that cancellation is related to the type of internet service. Customers who use digital internet service start reconsidering the service sooner: most digital customers who cancel do so within about 10 months, while fiber_optic cancellations are concentrated within about 5 months. Overall, fiber_optic customers are more stable, so the company could consider expanding its fiber_optic business.

telecom_df %>% group_by(internet_service, online_security, online_backup) %>% 
                  summarise(n_customers = n())
## # A tibble: 8 x 4
## # Groups:   internet_service, online_security [4]
##   internet_service online_security online_backup n_customers
##   <fct>            <fct>           <fct>               <int>
## 1 fiber_optic      yes             yes                   125
## 2 fiber_optic      yes             no                     86
## 3 fiber_optic      no              yes                   213
## 4 fiber_optic      no              no                    357
## 5 digital          yes             yes                    97
## 6 digital          yes             no                     92
## 7 digital          no              yes                    62
## 8 digital          no              no                    143
ggplot(data = telecom_df, aes(x = months_with_company, fill = canceled_service)) +
  geom_boxplot(aes(y = internet_service), color = "blue") +
  labs(title = "Canceled Service Distribution by Internet Service",
       x = "Months With Company", y = "Internet Service") +
  theme(plot.title = element_text(size = 10))
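To compare digital and fiber_optic customers more directly, the cancellation rate and typical tenure can be summarised per internet service type; this is a minimal sketch in the same style as the summaries above.

# Cancellation rate and median tenure by internet service type
telecom_df %>% 
  group_by(internet_service) %>% 
  summarise(n_customers = n(),
            cancel_rate = mean(canceled_service == "yes"),
            median_months = median(months_with_company))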

Question 4

Question: Is there a relationship between canceled_service and monthly_charges?

Answer: Yes, canceled_service and monthly_charges are partly related. The chart shows that customers paying roughly 70 to 100 per month are the most likely to cancel, while customers paying 100 to 120 tend to keep the service. The high-end service does not appear to be a problem, but the company should review the mid-level service in the 70 to 100 range, which has both the most users and the most dissatisfied users. I therefore suggest the company revise its mid-level plans.

ggplot(telecom_df, aes(x = canceled_service, y = monthly_charges, fill = canceled_service)) + 
  geom_violin() +
  geom_boxplot(fill = NA, width = 0.1, colour = "black") +
  labs(title = "Monthly Charges Distribution by Canceled Service",
       x = "Canceled Service", y = "Monthly Charges")

Question 5

Question: Is there a relationship between canceled_service, payment_method, and monthly_charges?

Answer: Yes, canceled_service and payment_method are partly related. The chart shows that electronic check, bank draft, and credit card are the main payment methods. Electronic check accounts for 43.31% of customers, yet 51.86% of those customers canceled the service. That cancellation rate is much higher than for the other payment methods, whose cancellation rates range from roughly 22.52% to 31.21%.

telecom_df %>% group_by(payment_method, canceled_service ) %>% 
                  summarise(n_customers = n())
## # A tibble: 8 x 3
## # Groups:   payment_method [4]
##   payment_method   canceled_service n_customers
##   <fct>            <fct>                  <int>
## 1 bank_draft       yes                       57
## 2 bank_draft       no                       196
## 3 credit_card      yes                       52
## 4 credit_card      no                       188
## 5 electronic_check yes                      264
## 6 electronic_check no                       245
## 7 mailed_check     yes                       54
## 8 mailed_check     no                       119
telecom_df %>% group_by(payment_method) %>% 
                  summarise(n_customers = n())
## # A tibble: 4 x 2
##   payment_method   n_customers
## * <fct>                  <int>
## 1 bank_draft               253
## 2 credit_card              240
## 3 electronic_check         509
## 4 mailed_check             173
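The percentages quoted in the answer can be reproduced from these counts; a minimal sketch of that calculation:

# Share of customers and cancellation rate by payment method
telecom_df %>% 
  group_by(payment_method) %>% 
  summarise(n_customers = n(),
            cancel_rate = mean(canceled_service == "yes")) %>% 
  mutate(pct_of_customers = n_customers / sum(n_customers))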
ggplot(data = telecom_df, aes(x = payment_method, color = canceled_service)) +
  geom_point(aes(y = monthly_charges), position = position_jitterdodge(), alpha = 0.6) +
  labs(title = "Canceled Service Distribution by Payment Method",
       x = "Payment Method", y = "Monthly Charges") +
  theme(axis.text.x = element_text(size = 11))

Machine Learning

In this section of the project, you will fit three classification algorithms to predict the response variable, canceled_service. You should use all of the other variables in the telecom_df data as predictor variables for each model.

You must follow the machine learning steps below.

The data splitting and feature engineering steps should only be done once so that your models are using the same data and feature engineering steps for training.

Model 1

Logistic Regression

set.seed(271)

telecom_split <- initial_split(telecom_df, prop = 0.75,
                              strata = canceled_service)

telecom_training <- telecom_split %>% 
                   training()

telecom_test <- telecom_split %>%
               testing()

# Create cross validation folds for hyperparameter tuning
set.seed(271)

telecom_folds <- vfold_cv(telecom_training, v = 10)

# Feature Engineering

telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>% 
                 step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
                 step_normalize(all_numeric(), -all_outcomes()) %>% 
                 step_dummy(all_nominal(), -all_outcomes())

# Check the transformed training data
telecom_recipe %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = NULL)
## # A tibble: 882 x 22
##    avg_call_mins avg_intl_mins months_with_com~ monthly_charges canceled_service
##            <dbl>         <dbl>            <dbl>           <dbl> <fct>           
##  1       -0.600        -0.0284            0.358          -0.854 no              
##  2       -1.07          0.618            -0.766          -1.69  no              
##  3       -0.0438        0.740             1.36            1.58  no              
##  4        0.0646       -1.49             -0.920           0.683 yes             
##  5        0.519        -0.122            -0.920          -0.926 no              
##  6        0.228        -0.528             0.862           1.53  no              
##  7       -0.364         1.11             -0.291           1.40  no              
##  8       -0.587         1.38              1.23            1.45  no              
##  9       -0.794        -1.39              1.38            1.30  no              
## 10        2.08         -1.68              0.358           1.26  yes             
## # ... with 872 more rows, and 17 more variables: senior_citizen_no <dbl>,
## #   spouse_partner_no <dbl>, dependents_no <dbl>,
## #   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## #   online_security_no <dbl>, online_backup_no <dbl>,
## #   device_protection_no <dbl>, tech_support_no <dbl>, streaming_tv_no <dbl>,
## #   streaming_movies_no <dbl>, contract_one_year <dbl>,
## #   contract_two_year <dbl>, paperless_bill_no <dbl>,
## #   payment_method_credit_card <dbl>, payment_method_electronic_check <dbl>,
## #   payment_method_mailed_check <dbl>
# Specify Logistic Regression Model
logistic_model <- logistic_reg() %>% 
                  set_engine('glm') %>% 
                  set_mode('classification')

# Create a Workflow
logistic_wf <- workflow() %>% 
               add_model(logistic_model) %>% 
               add_recipe(telecom_recipe)
# Fit Model
logistic_fit <- logistic_wf %>% 
                last_fit(split = telecom_split)

# Collect Predictions

logistic_results <-  logistic_fit %>% 
                     collect_predictions()

roc_curve(logistic_results, 
          truth = canceled_service, 
          estimate = .pred_yes) %>% 
  autoplot()

# ROC AUC
roc_auc(logistic_results, 
        truth = canceled_service,
        .pred_yes)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.854
# Confusion Matrix
conf_mat(logistic_results, 
         truth = canceled_service, 
         estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  69  29
##        no   37 158
# Fit a standard glm on the training data to examine coefficients and variable importance
model <- glm(canceled_service ~ ., data = telecom_training, family = binomial)
tidymodel <- tidy(model)
summary(model)
## 
## Call:
## glm(formula = canceled_service ~ ., family = binomial, data = telecom_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7604  -0.5484   0.2027   0.5666   2.3851  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     0.864538   1.687110   0.512 0.608345    
## senior_citizenno                0.242228   0.248553   0.975 0.329782    
## spouse_partnerno               -0.334288   0.232744  -1.436 0.150921    
## dependentsno                    0.195467   0.265237   0.737 0.461151    
## cellular_servicesingle_line     0.238362   0.265344   0.898 0.369018    
## avg_call_mins                  -0.011029   0.001555  -7.091 1.34e-12 ***
## avg_intl_mins                   0.029069   0.003805   7.641 2.16e-14 ***
## internet_servicedigital         1.084527   0.444891   2.438 0.014780 *  
## online_securityno              -0.542299   0.250539  -2.165 0.030424 *  
## online_backupno                -0.334633   0.232272  -1.441 0.149670    
## device_protectionno            -0.007819   0.245870  -0.032 0.974632    
## tech_supportno                 -0.454237   0.272624  -1.666 0.095680 .  
## streaming_tvno                  0.591607   0.196622   3.009 0.002622 ** 
## streaming_moviesno              0.749691   0.198046   3.785 0.000153 ***
## contractone_year                0.883310   0.345813   2.554 0.010640 *  
## contracttwo_year                0.790897   0.519970   1.521 0.128249    
## paperless_billno                0.451350   0.225593   2.001 0.045421 *  
## payment_methodcredit_card       0.268948   0.333116   0.807 0.419453    
## payment_methodelectronic_check -0.306253   0.275853  -1.110 0.266911    
## payment_methodmailed_check     -0.111534   0.362938  -0.307 0.758609    
## months_with_company             0.031387   0.007058   4.447 8.70e-06 ***
## monthly_charges                -0.008919   0.013301  -0.671 0.502515    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1156.57  on 881  degrees of freedom
## Residual deviance:  675.09  on 860  degrees of freedom
## AIC: 719.09
## 
## Number of Fisher Scoring iterations: 6
vip(model)

Model 2

Specify LDA model

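# Setting frac_common_cov = 1 pools the covariance matrix across classes, so this regularized discriminant model corresponds to LDA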
lda_model <- discrim_regularized(frac_common_cov = 1) %>% 
             set_engine('klaR') %>% 
             set_mode('classification')

lda_wf <- workflow() %>% 
          add_model(lda_model) %>% 
          add_recipe(telecom_recipe)

lda_fit <- lda_wf %>% 
           last_fit(split = telecom_split)

lda_results <-  lda_fit %>% 
                collect_predictions()

roc_curve(lda_results, 
          truth = canceled_service, 
          estimate = .pred_yes) %>% 
  autoplot()

roc_auc(lda_results, 
        truth = canceled_service, 
        .pred_yes)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.852
conf_mat(lda_results, 
         truth = canceled_service, 
         estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  71  32
##        no   35 155

Model 3

Random Forest

rf_model <- rand_forest(mtry = tune(),
                        trees = tune(),
                        min_n = tune()) %>% 
            set_engine('ranger', importance = "impurity") %>% 
            set_mode('classification')

rf_workflow <- workflow() %>% 
               add_model(rf_model) %>% 
               add_recipe(telecom_recipe)
# Hyperparameter Tuning
set.seed(314)

rf_grid <- grid_random(mtry() %>% range_set(c(2, 4)),
                       trees(),
                       min_n(),
                       size = 10)

rf_grid
## # A tibble: 10 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     3  1304    37
##  2     4   477    24
##  3     4  1621    32
##  4     4  1841     6
##  5     3   609    29
##  6     3  1235    16
##  7     4  1822    14
##  8     2   678    19
##  9     3   138    14
## 10     3  1218     4
rf_tuning <- rf_workflow %>% 
             tune_grid(resamples = telecom_folds,
                       grid = rf_grid)
# Select the best model based on roc_auc
best_rf <- rf_tuning %>% 
           select_best(metric = 'roc_auc')

# View the best parameters
best_rf
## # A tibble: 1 x 4
##    mtry trees min_n .config              
##   <int> <int> <int> <chr>                
## 1     2   678    19 Preprocessor1_Model08
final_rf_workflow <- rf_workflow %>% 
                     finalize_workflow(best_rf)
# Fit the final workflow on the training data
rf_wf_fit <- final_rf_workflow %>% 
             fit(data = telecom_training)

rf_fit <- rf_wf_fit %>% 
          pull_workflow_fit()

vip(rf_fit)

# Fit on the training data and evaluate on the test set with last_fit

rf_last_fit <- final_rf_workflow %>% 
               last_fit(telecom_split)

rf_last_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.771 Preprocessor1_Model1
## 2 roc_auc  binary         0.852 Preprocessor1_Model1
# ROC curve

rf_last_fit %>% collect_predictions() %>% 
                roc_curve(truth  = canceled_service, estimate = .pred_yes) %>% 
                autoplot()

#Confusion Matrix
rf_predictions <- rf_last_fit %>% collect_predictions()

conf_mat(rf_predictions, truth = canceled_service, estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  59  20
##        no   47 167
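Before summarizing, the test-set ROC AUC values of the three models can be collected into a single comparison table; this is a minimal sketch that reuses the logistic_results, lda_results, and rf_last_fit objects created above.

# Gather test-set ROC AUC for the three models into one table
bind_rows(
  roc_auc(logistic_results, truth = canceled_service, .pred_yes) %>% 
    mutate(model = "logistic_regression"),
  roc_auc(lda_results, truth = canceled_service, .pred_yes) %>% 
    mutate(model = "lda"),
  rf_last_fit %>% collect_predictions() %>% 
    roc_auc(truth = canceled_service, .pred_yes) %>% 
    mutate(model = "random_forest")
) %>% 
  select(model, .metric, .estimate)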

Summary of Results

Write a summary of your overall findings and recommendations to the executives at the company. Think of this section as your closing remarks of a presentation, where you summarize your key findings, model performance, and make recommendations to improve customer retention and service at this company.

Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:

  1. An introduction where you explain the business problem and goals of your data analysis

    • What problem(s) is this company trying to solve? Why are they important to their future success?

    • What was the goal of your analysis? What questions were you trying to answer and why do they matter?

  2. Highlights and key findings from your Exploratory Data Analysis section

    • What were the interesting findings from your analysis and why are they important for the business?

    • This section is meant to establish the need for your recommendations in the following section

  3. Your “best” classification model and an analysis of its performance

    • In this section you should talk about the expected error of your model on future data
      • To estimate future performance, you can use your model performance results on the test data
    • You should discuss at least one performance metric, such as an F1, sensitivity, specificity, or ROC AUC for your model. However, you must explain the results in an intuitive, non-technical manner. Your audience in this case are executives at a telecommunications company with limited knowledge of machine learning.
  4. Your recommendations to the company on how to reduce customer attrition rates

    • Each recommendation must be supported by your data analysis results

    • You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation

    • You must also describe the potential business impact of your recommendation:

      • Why is this a good recommendation?

      • What benefits will the business achieve?

Summary


  1. An introduction where you explain the business problem and goals of your data analysis

The telecom company keeps losing customers, and its service cancellation rate is increasing. The company wants to identify the key factors that drive customers to cancel their telecom service. The goal of this project is to use machine learning tools to find those key factors and help the company reduce customers' willingness to cancel the service.

  2. Highlights and key findings from your Exploratory Data Analysis section

Exploratory data analysis helps clarify the problem and surface valuable information. During data exploration I found that the data set contains 14 factor columns and 4 numeric columns, and I used the 4 numeric columns to explore the relationship between canceled_service and the other variables. 1. The relationship between canceled_service and avg_intl_mins is significant: the more average international minutes a customer uses, the less likely they are to cancel the service. 2. The relationship between canceled_service, contract, and months_with_company is only partial: customers on one-year and two-year contracts tend to keep their contracts well beyond the first year, while month-to-month customers are much more likely to cancel within about 10 months. 3. The relationship between canceled_service and internet_service is significant: customers who use fiber_optic internet are more willing to keep the service. 4. The relationship between canceled_service and monthly_charges is significant: customers paying between 70 and 100 per month are the most likely to cancel, so the company should review the service in that price range.

  3. Your “best” classification model and an analysis of its performance

To find the factors with the greatest impact on service cancellation, I fit a Logistic Regression model, an LDA model, and a Random Forest model, and used ROC AUC to measure each model's performance on the test data. The higher the ROC AUC, the better the model distinguishes customers who will cancel from those who will not. The ROC AUC of the Logistic Regression model is 85.435%, the LDA model is 85.243%, and the Random Forest model is 85.087%. Since the Logistic Regression model performed best, I used it to identify the most important factors. According to the vip variable importance plot, four factors significantly affect whether a customer cancels the service: avg_call_mins, avg_intl_mins, streaming_moviesno, and months_with_company. Among these, avg_call_mins and avg_intl_mins are the most important.

  4. Your recommendations to the company on how to reduce customer attrition rates

  1. According to the Logistic Regression model, avg_call_mins and avg_intl_mins are the most important factors, which means customers are very sensitive to them, so the company should investigate customers who are dissatisfied in these areas. The relationship between avg_intl_mins and canceled_service shows that customers with shorter international call time are more likely to cancel, so the company should design a plan targeted at customers with low avg_intl_mins.

  2. Data exploration shows a partial relationship between monthly_charges and cancellation. Customers paying 70 to 100 per month are more likely to cancel, while customers paying more than 100 are more likely to keep the service. The 70-to-100 segment is also the largest group of customers, so the company should improve its mid-priced service to keep these customers.

  3. Customers who use digital internet service are more likely to leave, while fiber_optic customers are not. The company should therefore recommend digital less often and encourage customers to switch to fiber_optic to consolidate its user base.

  4. Customers on one-year and two-year contracts are less likely to cancel, whereas month-to-month customers are more likely to cancel. The company should offer more attractive plans that encourage month-to-month customers to move to one-year or two-year contracts.