Your Name: Elham Jafarghomi

Your G Number: G00954064

# Suppress dplyr summarise grouping warning messages
options(dplyr.summarise.inform = FALSE)

## Add R libraries here
library(tidyverse)
library(tidymodels)

# Load the dataset
telecom_df <- read_rds(url('https://gmubusinessanalytics.netlify.app/data/telecom_df.rds'))

Data Analysis

In this section, you must think of at least 5 relevant questions that explore the relationship between canceled_service and the other variables in the telecom_df data set. The goal of your analysis should be discovering which variables drive the differences between customers who do and do not cancel their service.

You must answer each question and provide supporting data summaries with either a summary data frame (using dplyr/tidyr) or a plot (using ggplot) or both.

In total, you must have a minimum of 3 plots (created with ggplot) and 3 summary data frames (created with dplyr) for the exploratory data analysis section. Among the plots you produce, you must have at least 3 different types (ex. box plot, bar chart, histogram, scatter plot, etc…)

See the Data Analysis Project for an example of a question answered with a summary table and plot.

Note: To add an R code chunk to any section of your project, you can use the keyboard shortcut Ctrl + Alt + i or the insert button at the top of your R project template notebook file.

************************ Part 1: Exploratory data Analysis ********************************

Before Starting the questions, in this section we would like to know the number of customers who did cancel their service and who did not.

Count the number of customers who canceled their service and who did not cancel their Service

telecom_df %>% group_by(canceled_service) %>% 
  summarise(num_customers = n())

## # A tibble: 2 x 2
##   canceled_service num_customers
## * <fct>                    <int>
## 1 yes                        427
## 2 no                         748

##Data Visualization

ggplot(data = telecom_df, mapping = aes(x = canceled_service, fill = canceled_service))+
  geom_bar()

From the summary and the bar plot we can see that out of 1175 customers, 427 canceled their service and 748 customers did not cancel their service. Below we are trying to find out if there is any relationship between customer canceling their service and some variables.

Question 1

Question:Is there a relationship between canceled service and whether customers had tech support or not?

Answer: Yes. The data and summary table indicates that customers who did not have tech support, tend to cancel their service at a significantly higher rate when compare to customer who had tech support. Among the customers who did not have tech support 45.71% canceled their service which this rate decrease to 1/3 among customers who had tech support and it is only 15.95%.

Also, the stacked percentage bar chart represents this fact that among customers that had tech support the attrition rate is a lot lower than the attrition rate among customers without the tech support. This is shown by the orange area on each bar.

Code

telecom_tech_support <- telecom_df %>% group_by(canceled_service, tech_support) %>% 
  summarise(num_customers = n()) %>% 
  arrange(tech_support,canceled_service) %>% 
  group_by(tech_support) %>% 
  mutate(relative_pct_custs = round(100*(num_customers/sum(num_customers)), 2)) %>% 
  left_join(telecom_df %>% group_by(canceled_service, tech_support) %>% 
              summarise(num_customers = n()) %>% 
              ungroup() %>% 
              mutate(pct_of_total_custs = round(100*(num_customers/sum(num_customers)), 2)), 
            by =c("canceled_service", "tech_support","num_customers"))

telecom_tech_support

## # A tibble: 4 x 5
## # Groups:   tech_support [2]
##   canceled_service tech_support num_customers relative_pct_cus~ pct_of_total_cu~
##   <fct>            <fct>                <int>             <dbl>            <dbl>
## 1 yes              yes                     59              16.0             5.02
## 2 no               yes                    311              84.0            26.5 
## 3 yes              no                     368              45.7            31.3 
## 4 no               no                     437              54.3            37.2

Data Visualization

ggplot(data = telecom_df, mapping = aes(x = tech_support,
        fill = canceled_service))+ 
         geom_bar(stat = "count", position = "fill")+
         labs(title ="Canceled service Prevalence by tech support Status(Yes-No)" , 
              x = "Tech Support", y ="Proportion of Customers")

Question 2

Question: Is there a relationship between canceled service and customers internet service type?

Answer: Yes. The data indicates that there is a relationship between the internet service type and service cancelation status. The summary table and the plot shows that customers who have fiber optic internet service, tend to cancel their service at a higher rate than those with digital service. Among the customers with fiber optic internet service, the canceling rate is 45.33% ,but the canceling rate among customers with digital internet service is lot lower and only 18.53%.

Also, the stacked percentage chart represents this fact that among customers with digital service ,the rate of canceling service is a lot lower and only 18.5% (orange area under digital) compare to customers with fiber optic service this rate increases to 45.33%.

It looks like in general, the fiber optic subscription is higher but fiber optic attrition is also higher. We can see that almost 67% of the total customers had purchased fiber optic and only 33% got digital internet service, but we can see that attrition rate is higher in fiber optic 30% fiber optic attrition vs 6% digital attrition among all customers.

##Code

telecom_interent_service <- telecom_df %>% group_by(canceled_service, internet_service) %>% 
  summarise(num_customers = n()) %>% 
  arrange(internet_service,canceled_service) %>% 
  group_by(internet_service) %>% 
  mutate(relative_pct_custs = round(100*(num_customers/sum(num_customers)), 2)) %>% 
  left_join(telecom_df %>% group_by(canceled_service, internet_service) %>% 
              summarise(num_customers = n()) %>% 
              ungroup() %>% 
              mutate(pct_of_total_custs = round(100*(num_customers/sum(num_customers)), 2)), 
            by =c("canceled_service", "internet_service","num_customers"))

telecom_interent_service

## # A tibble: 4 x 5
## # Groups:   internet_service [2]
##   canceled_service internet_service num_customers relative_pct_custs
##   <fct>            <fct>                    <int>              <dbl>
## 1 yes              fiber_optic                354               45.3
## 2 no               fiber_optic                427               54.7
## 3 yes              digital                     73               18.5
## 4 no               digital                    321               81.5
## # ... with 1 more variable: pct_of_total_custs <dbl>

Data Visualization

ggplot(data = telecom_df, mapping = aes(x = internet_service, fill = canceled_service)) +
         geom_bar(stat = "count", position = "fill")+
         labs(title = "Canceled Service Prevalence Internet Service Type", 
              x = "Internet Service", y= "Proportion of Customers")

Question 3

Question:Is there a relationship between canceled service and online security?

Answer:Yes, the data indicates that there is a strong relationship between having online security and service cancelation. The summary table and the stacked bar char represents that customers who did not have online security tend to cancel their service at a significantly higher rate than customers who had online security. Among customers who did not have online security almost 47% canceled their service when compare this to customers who had online security this rate decrease significantly to 16%.

Code

telecom_online_security <- telecom_df %>% group_by(canceled_service, online_security) %>% 
  summarise(num_customers = n()) %>% 
  arrange(online_security,canceled_service) %>% 
  group_by(online_security) %>% 
  mutate(relative_pct_custs = round(100*(num_customers/sum(num_customers)), 2)) %>% 
  left_join(telecom_df %>% group_by(canceled_service, online_security) %>% 
              summarise(num_customers = n()) %>% 
              ungroup() %>% 
              mutate(pct_of_total_custs = round(100*(num_customers/sum(num_customers)), 2)), 
            by =c("canceled_service", "online_security","num_customers"))

telecom_online_security

## # A tibble: 4 x 5
## # Groups:   online_security [2]
##   canceled_service online_security num_customers relative_pct_custs
##   <fct>            <fct>                   <int>              <dbl>
## 1 yes              yes                        64               16  
## 2 no               yes                       336               84  
## 3 yes              no                        363               46.8
## 4 no               no                        412               53.2
## # ... with 1 more variable: pct_of_total_custs <dbl>

Data Visualization

ggplot(data = telecom_df, mapping = aes(x = online_security, fill = canceled_service)) +
         geom_bar(stat = "count", position = "fill")+
         labs(title = "Canceled Service Prevalence Online security Status(Yes-No)", 
              x = "Online Security", y= "Proportion of Customers")

Question 4

Question:Is there a relationship between canceled service and total number of months with company?

Answer:Yes. The data indicates that there is a strong relationship between total months with company and canceling service. The data and summary table shows that the customers who canceled their service tend to stay fewer months with the company compare with customers who did not cancel their service. Among 427 customers who did cancel their service, the average total months with the company is around 18 months(17.69). But the average total months with company for customers who did not cancel their service is around 39 month (38.66). This indicates that the average total months of the customers who did cancel their service is almost 21 months less than the average total months with the company of those who did not cancel their service.

Among the customers who did not cancel their service, only 26 % of them remained less than 18 months with the company. When compare this with customers who did cancel their service, this increase to 66 %.

Description of Boxplot: Also, from the Box Plot below, we can see that the median total months with company for customers who canceled their service is 10 months, which is 4 times smaller than median total months with company for customers who did not cancel their service.(40 months)
Description of the Violin plot: the violin plot also represents that concentration of the data for customers who canceled their service is between 0 to 18 or 20 months. The plot is wider at the bottom and in that range, and this means most of the data points are concentrated there.

telecom_df %>% group_by(canceled_service) %>% 
   summarise(num_customers = n(), 
            min_months_with_co = min(months_with_company), 
            avg_months_with_co = round(mean(months_with_company),2), 
            median_months_with_co = median(months_with_company), 
            max_months_with_co = max(months_with_company), 
            sd_months_with_co = sd(months_with_company),
            pct_less_18months = round(mean(months_with_company<18),2))

## # A tibble: 2 x 8
##   canceled_service num_customers min_months_with_co avg_months_with_co
## * <fct>                    <int>              <dbl>              <dbl>
## 1 yes                        427                  1               17.7
## 2 no                         748                  1               38.7
## # ... with 4 more variables: median_months_with_co <dbl>,
## #   max_months_with_co <dbl>, sd_months_with_co <dbl>, pct_less_18months <dbl>

Data Visualization

#Violin Plot

ggplot(data = telecom_df, mapping = aes(x = canceled_service, months_with_company, 
        y = months_with_company, fill = canceled_service)) +
        geom_violin() +  
        geom_jitter(width = 0.07, alpha = 0.2) +
  labs(title = "Customers Months with Company by status(Canceled Service-Yes/No)",
       x = "Canceled Service", y = "Total Months with Comapny")

#BoxPlot

ggplot(data = telecom_df, mapping = aes(x = canceled_service, y = months_with_company,                  fill=canceled_service)) + geom_boxplot() + 
       labs(title="Customers Service Monthly Charges by Service Status (Cancelled Service - Yes/No)", x = "Canceled Service", y ="Number of Months With Company" )

Question 5

Question:Is there a relationship between canceled service and customers monthly charges?

Answer:No. The data does not support that there is a relationship between customers canceling their service and the monthly service charges. The average monthly service charge for the customers who canceled their service is almost 82 dollars (81.88) which is the same for customers who did not cancel their service. They also have average monthly charge of 82 dollars (81.91).

Among 427 customers who did cancel their service, 58% of them had average monthly service charge of more than 80 dollars, which when compare to customers who did not cancel their service, this only decrease to 54%.

Description of the BoxPlot: The boxplot below shows that the median service monthly charge for customers who canceled their service is 84.5 dollars, which is about the same of median monthly charge of customers who did not cancel their service.(their median is 82.72).

telecom_df %>% group_by(canceled_service) %>% 
  summarise(num_customers = n(), 
            min_monthly_chrg = min(monthly_charges),
            avg_monthly_chrg = round(mean(monthly_charges),2), 
            median_monthly_chrg = median(monthly_charges),
            max_monthly_chrg = max(monthly_charges),
            
            sd_monthly_chrg = sd(monthly_charges), 
            pct_greater_80 = round(mean(monthly_charges>80),2))

## # A tibble: 2 x 8
##   canceled_service num_customers min_monthly_chrg avg_monthly_chrg
## * <fct>                    <int>            <dbl>            <dbl>
## 1 yes                        427             44.2             81.9
## 2 no                         748             43.4             81.9
## # ... with 4 more variables: median_monthly_chrg <dbl>, max_monthly_chrg <dbl>,
## #   sd_monthly_chrg <dbl>, pct_greater_80 <dbl>

Data Visualization

ggplot(data = telecom_df, mapping = aes(x = canceled_service, y = monthly_charges, 
            fill = canceled_service))+
            geom_boxplot()+
            labs(title = "Customers Service Monthly Charges by Status(Canceled Service-Yes/No)",
                 x = "Canceled Service", y = "Service Monthly Charge")

Machine Learning

In this section of the project, you will fit three classification algorithms to predict the response variable,canceled_service. You should use all of the other variables in the telecom_df data as predictor variables for each model.

You must follow the machine learning steps below.

The data splitting and feature engineering steps should only be done once so that your models are using the same data and feature engineering steps for training.

Split the telecom_df data into a training and test set (remember to set your seed)
Specify a feature engineering pipeline with the recipes package
- You can include steps such as skewness transformation, dummy variable encoding or any other steps you find appropriate
Specify a parsnip model object
- You may choose from the following classification algorithms:
  - Logistic Regression
  - LDA
  - QDA
  - KNN
  - Decision Tree
  - Random Forest
Package your recipe and model into a workflow
Fit your workflow to the training data
- If your model has hyperparameters:
  - Split the training data into 5 folds for 5-fold cross validation using vfold_cv (remember to set your seed)
  - Perform hyperparamter tuning with a random grid search using the grid_random() function
  - Refer to the following tutorial for an example - Random Grid Search
  - Hyperparameter tuning can take a significant amount of computing time. Be careful not to set the size argument of grid_random() too large. I recommend size = 10 or smaller.
  - Select the best model with select_best() and finalize your workflow
Evaluate model performance on the test set by plotting an ROC curve using autoplot() and calculating the area under the ROC curve on your test data

****************************** Part 2: Machine Learning **********************************

Before training any ML models, we will split our data to train and test set for guarding against under-fitting and over-fitting.

##Step 1: Data Spliting

library(tidymodels)
set.seed(222) 
telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)

telecom_split

## <Analysis/Assess/Total>
## <882/293/1175>

In above section we split our telecom_df data into training and test set with randomly selecting 75 % for training and 25 % for testing. We have 882 rows in our training and 293 rows in our test set for later assessing the models performance.

Extracting training and test sets

telecom_training <- telecom_split %>% training()

telecom_test <- telecom_split %>% testing()

Create cross validation folds for hyperparameter tuning

set.seed(222)

telecom_folds <- vfold_cv(telecom_training, v = 5)

View the training data

telecom_training

## # A tibble: 882 x 19
##    canceled_service senior_citizen spouse_partner dependents cellular_service
##    <fct>            <fct>          <fct>          <fct>      <fct>           
##  1 no               no             no             no         single_line     
##  2 no               no             yes            yes        single_line     
##  3 no               no             yes            no         multiple_lines  
##  4 yes              yes            yes            no         multiple_lines  
##  5 no               no             no             yes        multiple_lines  
##  6 no               yes            no             no         single_line     
##  7 no               yes            yes            no         multiple_lines  
##  8 no               no             no             no         multiple_lines  
##  9 no               yes            yes            yes        multiple_lines  
## 10 no               no             yes            no         multiple_lines  
## # ... with 872 more rows, and 14 more variables: avg_call_mins <dbl>,
## #   avg_intl_mins <dbl>, internet_service <fct>, online_security <fct>,
## #   online_backup <fct>, device_protection <fct>, tech_support <fct>,
## #   streaming_tv <fct>, streaming_movies <fct>, contract <fct>,
## #   paperless_bill <fct>, payment_method <fct>, months_with_company <dbl>,
## #   monthly_charges <dbl>

Below the skim function has been used to quickly check for skewness of numeric columns.

library(skimr)

## Warning: package 'skimr' was built under R version 4.0.3

skim(telecom_training)

Data summary
Name	telecom_training
Number of rows	882
Number of columns	19
_______________________
Column type frequency:
factor	15
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
canceled_service	1	FALSE	2	no: 561, yes: 321
senior_citizen	1	FALSE	2	no: 709, yes: 173
spouse_partner	1	FALSE	2	no: 483, yes: 399
dependents	1	FALSE	2	no: 656, yes: 226
cellular_service	1	FALSE	2	mul: 476, sin: 406
internet_service	1	FALSE	2	fib: 578, dig: 304
online_security	1	FALSE	2	no: 575, yes: 307
online_backup	1	FALSE	2	no: 505, yes: 377
device_protection	1	FALSE	2	no: 487, yes: 395
tech_support	1	FALSE	2	no: 596, yes: 286
streaming_tv	1	FALSE	2	yes: 456, no: 426
streaming_movies	1	FALSE	2	no: 454, yes: 428
contract	1	FALSE	3	mon: 575, one: 160, two: 147
paperless_bill	1	FALSE	2	yes: 613, no: 269
payment_method	1	FALSE	4	ele: 370, ban: 191, cre: 184, mai: 137

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
avg_call_mins	1	349.49	73.05	107.00	305.00	348.5	395.00	562.0	▁▃▇▅▁
avg_intl_mins	1	106.00	31.73	15.00	84.00	106.0	127.00	211.0	▁▅▇▃▁
months_with_company	1	31.22	24.43	1.00	8.00	27.0	54.00	72.0	▇▃▃▃▅
monthly_charges	1	81.77	18.63	43.45	69.81	83.8	95.94	118.6	▅▅▇▇▃

It looks like avg_call_mins and avg_intl_mins columns have almost normal distributions, but months_with_company and monthly_charges columns are not normally distributed. Hense, we will consider the skewness transformation in our data.

##Step 2: Feature Engineering

Since the telecom data has dummy variables and numeric variables, we will create a feature engineering recipe that removes skewness from our numeric predictors, normalize all numeric predictors and create dummy variables for our nominal predictors

Create the recipe

telecom_transformation <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

Check to see if our feature engineering is doing what we expect.

telecom_transformation  %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = NULL)

## # A tibble: 882 x 22
##    avg_call_mins avg_intl_mins months_with_com~ monthly_charges canceled_service
##            <dbl>         <dbl>            <dbl>           <dbl> <fct>           
##  1       -0.600        -0.0293            0.325          -0.856 no              
##  2       -1.09          0.631            -0.789          -1.69  no              
##  3       -0.0281        0.757             1.34            1.58  no              
##  4        0.0823       -1.48             -0.939           0.679 yes             
##  5        0.263         2.35              1.23            0.462 no              
##  6        0.543        -0.124            -0.939          -0.928 no              
##  7        0.249        -0.534             0.834           1.52  no              
##  8       -0.356         1.13             -0.322           1.39  no              
##  9       -0.586         1.42              1.20            1.44  no              
## 10       -0.800        -1.39              1.36            1.29  no              
## # ... with 872 more rows, and 17 more variables: senior_citizen_no <dbl>,
## #   spouse_partner_no <dbl>, dependents_no <dbl>,
## #   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## #   online_security_no <dbl>, online_backup_no <dbl>,
## #   device_protection_no <dbl>, tech_support_no <dbl>, streaming_tv_no <dbl>,
## #   streaming_movies_no <dbl>, contract_one_year <dbl>,
## #   contract_two_year <dbl>, paperless_bill_no <dbl>,
## #   payment_method_credit_card <dbl>, payment_method_electronic_check <dbl>,
## #   payment_method_mailed_check <dbl>

Create Custom Metrics for using in model perfomance

my_metrics <- metric_set(accuracy, sens, spec, f_meas, roc_auc)

##Step3: Models

Model 1: Logistic Regression

logistic_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

##Create the workflow for the logistic regression

telecom_logistic_wf <- workflow() %>% 
            add_model(logistic_model) %>% 
            add_recipe(telecom_transformation)

Fit the logistic model to the training set

telecom_logistic_fit <- telecom_logistic_wf %>% 
                        last_fit(split = telecom_split, 
                         metrics = my_metrics)

## Warning: package 'rlang' was built under R version 4.0.3

## Warning: package 'vctrs' was built under R version 4.0.3

Collect Predictions

logistic_results <-  telecom_logistic_fit %>% 
                     collect_predictions()

logistic_results

## # A tibble: 293 x 7
##    id         .pred_class  .row .pred_yes .pred_no canceled_service .config     
##    <chr>      <fct>       <int>     <dbl>    <dbl> <fct>            <chr>       
##  1 train/tes~ yes             9    0.843     0.157 no               Preprocesso~
##  2 train/tes~ no             20    0.166     0.834 no               Preprocesso~
##  3 train/tes~ yes            21    0.882     0.118 yes              Preprocesso~
##  4 train/tes~ yes            23    0.710     0.290 no               Preprocesso~
##  5 train/tes~ no             33    0.120     0.880 no               Preprocesso~
##  6 train/tes~ yes            38    0.665     0.335 no               Preprocesso~
##  7 train/tes~ no             41    0.0320    0.968 no               Preprocesso~
##  8 train/tes~ no             42    0.143     0.857 no               Preprocesso~
##  9 train/tes~ yes            46    0.801     0.199 yes              Preprocesso~
## 10 train/tes~ no             50    0.359     0.641 no               Preprocesso~
## # ... with 283 more rows

Evaluate The Logistic Model Performance

telecom_logistic_fit %>%collect_metrics()

## # A tibble: 5 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.792 Preprocessor1_Model1
## 2 sens     binary         0.708 Preprocessor1_Model1
## 3 spec     binary         0.840 Preprocessor1_Model1
## 4 f_meas   binary         0.711 Preprocessor1_Model1
## 5 roc_auc  binary         0.856 Preprocessor1_Model1

The Logistic Model Roc Curve

roc_curve(logistic_results, 
          truth = canceled_service, 
          estimate = .pred_yes) %>% 
  autoplot()

## The Logistic Model Confusion Matrix

conf_mat(logistic_results, 
         truth = canceled_service, 
         estimate = .pred_class)%>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "red", alpha = 1, size = 8) +
  labs(y = "True Class", x = "Predicted Class", fill = NULL, 
  title = "Confusion Matrix", subtitle = "For Logistic Regression Model")+ 
  theme_bw()

From the confusion matrix of the logistic regression we can see that the number of False positive is 23 and number of False Negative is 24.

Model 2: Random Forest

rf_model <- rand_forest(mtry = tune(),
                        trees = tune(),
                        min_n = tune()) %>% 
            set_engine('ranger', importance = "impurity") %>% 
            set_mode('classification')

Create workflow for the random forest model

telecom_rf_wf <- workflow() %>% 
               add_model(rf_model) %>% 
               add_recipe(telecom_transformation)

Create a grid of hyper-parameter values to test

we set the range from 5 to 15. This is because we have 19 predictor variables and we would like to test mtry() values somewhere in the middle between 1 and 19, and also it is suggested to avoid values that are close to the ends.

#setting the seed for reproducability
set.seed(222)

rf_grid <- grid_random(mtry() %>% range_set(c(5, 15)),
                       trees(),
                       min_n(),
                       size = 10)

##Tuning our Hyperparameters on random forest workflow

set.seed(222)

rf_tuning <- telecom_rf_wf %>% 
             tune_grid(resamples = telecom_folds,
                       grid = rf_grid)

## Warning: package 'ranger' was built under R version 4.0.3

Show my best rf models based on roc_auc metric

rf_tuning %>% show_best('roc_auc')

## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     6   144     5 roc_auc binary     0.877     5 0.00761 Preprocessor1_Model10
## 2     8  1385    10 roc_auc binary     0.876     5 0.00765 Preprocessor1_Model06
## 3    12  1454     2 roc_auc binary     0.876     5 0.00746 Preprocessor1_Model03
## 4    10   602     6 roc_auc binary     0.876     5 0.00770 Preprocessor1_Model04
## 5     6  1121    38 roc_auc binary     0.875     5 0.00855 Preprocessor1_Model01

We can see that the best choice is with with mtry 5 number of trees 867.

Selecting best model based on roc_auc for using in finalized workflow

best_rf <- rf_tuning %>% 
           select_best(metric = 'roc_auc')

using the best model based on roc_auc we can finalize our workflow as following:

Finalize Random Forest model workflow

final_rf_wf <- telecom_rf_wf %>% 
                finalize_workflow(best_rf)

Train and Evaluate using “last_fit” function

telecom_rf_lastfit <- final_rf_wf %>% 
               last_fit(telecom_split, metrics = my_metrics)

##View the Random forest model performance metric on the test set

telecom_rf_lastfit %>% collect_metrics()

## # A tibble: 5 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.792 Preprocessor1_Model1
## 2 sens     binary         0.689 Preprocessor1_Model1
## 3 spec     binary         0.850 Preprocessor1_Model1
## 4 f_meas   binary         0.705 Preprocessor1_Model1
## 5 roc_auc  binary         0.844 Preprocessor1_Model1

Collect Predictions for Random Forest

rf_results <-  telecom_rf_lastfit %>% 
                 collect_predictions()

rf_results

## # A tibble: 293 x 7
##    id         .pred_class  .row .pred_yes .pred_no canceled_service .config     
##    <chr>      <fct>       <int>     <dbl>    <dbl> <fct>            <chr>       
##  1 train/tes~ yes             9     0.716    0.284 no               Preprocesso~
##  2 train/tes~ no             20     0.203    0.797 no               Preprocesso~
##  3 train/tes~ yes            21     0.791    0.209 yes              Preprocesso~
##  4 train/tes~ no             23     0.416    0.584 no               Preprocesso~
##  5 train/tes~ no             33     0.439    0.561 no               Preprocesso~
##  6 train/tes~ yes            38     0.530    0.470 no               Preprocesso~
##  7 train/tes~ no             41     0.184    0.816 no               Preprocesso~
##  8 train/tes~ no             42     0.182    0.818 no               Preprocesso~
##  9 train/tes~ yes            46     0.694    0.306 yes              Preprocesso~
## 10 train/tes~ no             50     0.389    0.611 no               Preprocesso~
## # ... with 283 more rows

Confusion Matrix for random forest model

conf_mat(rf_results, 
         truth = canceled_service, 
         estimate = .pred_class)%>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "red", alpha = 1, size = 8) +
  labs(y = "True Class", x = "Predicted Class", fill = NULL, 
  title = "Confusion Matrix", subtitle = "For Random Forest Model")+ 
  theme_bw()

From the confusion matrix of the Random Forest Model we can see that the number of False positive is 23 and number of False Negative is 27.

The Random Forest Model Roc Curve

roc_curve(rf_results, 
          truth = canceled_service, 
          estimate = .pred_yes) %>% 
  autoplot()

Model 3: Linear Discriminant Analysis

library(discrim)

## Warning: package 'discrim' was built under R version 4.0.4

## 
## Attaching package: 'discrim'

## The following object is masked from 'package:dials':
## 
##     smoothness

lda_model <- discrim_regularized(frac_common_cov = 1) %>% 
             set_engine('klaR') %>% 
             set_mode('classification')

Create workflow for the LDA model

telecom_lda_wf <- workflow() %>% 
            add_model(lda_model) %>% 
            add_recipe(telecom_transformation)

Fit the LDA Model using “last_fit” function

telecom_lda_fit <- telecom_lda_wf %>% 
                  last_fit(split = telecom_split,
                           metrics = my_metrics)

## Warning: package 'MASS' was built under R version 4.0.4

Collect Predictions

lda_results <-  telecom_lda_fit  %>% 
                 collect_predictions()

Evaluate the LDA Model Performance

telecom_lda_fit %>% collect_metrics()

## # A tibble: 5 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.782 Preprocessor1_Model1
## 2 sens     binary         0.745 Preprocessor1_Model1
## 3 spec     binary         0.802 Preprocessor1_Model1
## 4 f_meas   binary         0.712 Preprocessor1_Model1
## 5 roc_auc  binary         0.853 Preprocessor1_Model1

The LDA Model Roc Curve

roc_curve(lda_results, 
          truth = canceled_service, 
          estimate = .pred_yes) %>% 
  autoplot()

## The LDA Model Confusion Matrix

conf_mat(lda_results, 
         truth = canceled_service, 
         estimate = .pred_class)%>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "red", alpha = 1, size = 8) +
  labs(y = "True Class", x = "Predicted Class", fill = NULL, 
  title = "Confusion Matrix", subtitle = "For Linear Discriminant Analysis Model")+ 
  theme_bw()

From the confusion matrix of the LDA Model we can see that the number of False positive is 22 and number of False Negative is also 22 which indicates that the LDA model outperformed.

*************************************** Summary of Results *****************************************

Write a summary of your overall findings and recommendations to the executives at the company. Think of this section as your closing remarks of a presentation, where you summarize your key findings, model performance, and make recommendations to improve customer retention and service at this company.

Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:

An introduction where you explain the business problem and goals of your data analysis
- What problem(s) is this company trying to solve? Why are they important to their future success?
- What was the goal of your analysis? What questions were you trying to answer and why do they matter?
Highlights and key findings from your Exploratory Data Analysis section
- What were the interesting findings from your analysis and why are they important for the business?
- This section is meant to establish the need for your recommendations in the following section
Your “best” classification model and an analysis of its performance
- In this section you should talk about the expected error of your model on future data
  - To estimate future performance, you can use your model performance results on the test data
- You should discuss at least one performance metric, such as an F1, sensitivity, specificity, or ROC AUC for your model. However, you must explain the results in an intuitive, non-technical manner. Your audience in this case are executives at a telecommunications company with limited knowledge of machine learning.
Your recommendations to the company on how to reduce customer attrition rates
- Each recommendation must be supported by your data analysis results
- You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation
- You must also describe the potential business impact of your recommendation:
  - Why is this a good recommendation?
  - What benefits will the business achieve?

Summary

Add your summary here. Please do not place your text within R code chunks.

The U.S. telecommunications company is a company which provides internet and cellular service. The company data indicates that among 1175 of their customers, 427 of them have canceled their service. Therefore, The company is looking to see if they can determine the factors that lead to customers canceling their service and whether it can predict if a customer will cancel their service in the future. This is important for the company, because customer retention is crucial for company success and growth since high customer attrition can significantly harm company financially, and also affect their reputation. Therefore, the company has a vested interest in keeping their customer. According to researches obtaining new customers can cost up to 25 times more than retaining existing customers.

The goal of this project is to first explore 5 factors and do exploratory analysis to identify whether there is a relationship between customer canceling the service and any of those selected factors. Finally, by identifying those factors and possible relationships, try to provide the company some insight and recommendations. Additionally, the goal is to provide the company with our best predictive model on whether a customer will cancel their service or not, so the company can take dynamic actions to improve services and offers, and eventually customer satisfactions.

For this purpose, 5 following questions were tried to be answered:

Is there a relationship between canceled service and whether customers had tech support?

Is there a relationship between canceled service and customers internet service type?

Is there a relationship between canceled service and whether customers had online security?

Is there a relationship between canceled service and total months with company?

Is there a relationship between canceled service and customers monthly charges?

Again these questions matters, because by exploring the data and answering them , we can discover potential factors that might have correlation with customers canceling their service and by providing key findings and recommendations to the company we can help them to minimize customer attrition rate, since in the competitive world of telecommunication companies, customer retention is key.

The analysis in this project revealed some interesting patterns and relationships. The result of the analysis represented that the total number of months with company is one of the most significant factor in customer attrition. The result of analysis showed that the median months with company for customers who canceled their service is 10 months. When compare this to median months of customers who did not cancel their service, this median increases to 40 months which is 4 times bigger. Additionally, the result of analysis indicated that there is strong relationship between customer attrition and their internet service type, technical support and online security. In contrast and surprisingly, customers’ monthly charges and fees did not seem to have a relationship with customers attrition.

** Below all of these key findings are explained in detail:

The result of analysis on the first question which was to find if there is a relationship between customer canceling their service and tech support, indicates that customers who did not have tech support, tend to cancel their service at a significantly higher rate when compare to customer who had tech support. The data and summary table above indicates that among the customers who did not have tech support 45.71% canceled their service which this rate decrease to 1/3 among customers who had tech support and it is only 15.95%. Also, the stacked percentage bar chart represents this fact that among customers that have tech support the attrition rate is a lot lower than the attrition rate among customers without the tech support.

The outcome of the second question which was to find if there is a relationship between canceled service and customers internet service type, shows that customers who have fiber optic internet service, tend to cancel their service at a higher rate than those with digital service. Among the customers with fiber optic internet service the canceling rate was 45.33% ,but the canceling rate among customers with digital internet service was a lot lower and only 18.53%. It is interesting that in general, it seems the fiber optic subscription or purchase is higher but the attrition on that is also higher. We can see from the summary table that almost 67% of the total customers had purchased fiber optic and only 33% got digital internet service, but it is observed that attrition rate is higher in fiber optic type, 30% fiber optic attrition vs 6% digital attrition among all customers.

The outcome of the third question and analysis on finding a relationship between between canceled service and online security, indicated that customers who did not have online security tend to cancel their service at a significantly higher rate than customers who had online security. The summary table and the plot indicates that the attrition rate among customers who did not have online security is almost 47% when compare this to customers who had online security this rate decrease significantly to 16%.

The result of analysis on the forth question which was to find if there is a relationship between canceled service and total months with company, as mentioned earlier indicated a strong relationship. Result shows customers who canceled their service tend to stay fewer months with the company than customers who did not cancel their service. Among 427 customers who canceled their service, the average total months with the company is around 18 months(17.69). But the average total months with company for customers who did not cancel their service is around 39 month (38.66). This shows that the average total months of the customers who did cancel their service is almost 21 months less than the average total months with the company, of those who did not cancel their service.
Among the customers who did not cancel their service, only 26 % of them remained less than 18 months with the company. When compare this with customers who did cancel their service, this increase to 66 %.

The result of analysis on last question which was to find if there is a relationship between canceled service and customers monthly charges, indicated that there is no relationship between customers canceling their service and the monthly service charges. The average monthly service charge for the customers who canceled their service is almost 82 dollars (81.88) which is the same for customers who did not cancel their service. The current customers also have average monthly charge of 82 dollars (81.91). Among 427 customers who did cancel their service, 58% of them had average monthly service charge of more than 80 dollars, when compare to customers who did not cancel their service, this only decrease to 54%.Therefore, there is no evidence to show a strong relationship between canceling service and customer monthly charges.

3)Results of predictive models

Three Classification models, Logistic Regression, Random Forest and Linear Discriminant Analysis, were performed in this analysis, for predicting whether customer cancel their service or not based on the set of factors(all variables). The performance of all three models in this project were almost close to each other and results were not significantly different. However, among all three, the best classification model in terms of performance is LDA model with highest accuracy of 85%, the area under the ROC curve of 90%(highest roc-auc), highest sensitivity of 79.24 % and highest F1 Score of 79.24%. The LDA model outperformed all other models with the lower False positive and lower False Negative. False positive here refers to number of customers that model predicted as yes under canceled service and they actually did not cancel service. False Negative here refers to to number of customers that model predicted as no under canceled service and they actually canceled their service. Both of these errors are the lowest for the LDA model . Also, the area under the roc curve represents the performance of the model on classifying the classes. The higher this value and the closest to 1, the better the model in predicting the customer did or did not cancel the service.

** Recommendations:

Based on the result of analysis and some key findings it is crucial to the company to increase and improve the online security for the customers since the analysis indicates that the attrition rate among customers who did not have online security is almost 47% in compare to customers who had online security. This rate decrease significantly to 16% for customers who had online security. Also, it is recommended to improve and provide more tech support for customers, as the data indicates that customers who did not have tech support, tend to cancel their service at a significantly higher rate when compare to customer who had tech support.

Additionally, it is highly recommended to deep-dive into the root of the fiber optic internet service and investigate why people get this service more than the digital service but they cancel this service at a very higher rate than digital. This should not be ignored because among the customers with fiber optic internet service the canceling rate is 45.33% ,but the canceling rate among customers with digital internet service was only 18.53%.

As data supports, it seems the fiber optic subscription or purchase is higher but the attrition on that is also higher. Almost 67% of the total customers had purchased fiber optic and only 33% got digital internet service, but it is observed that attrition rate is higher in fiber optic type, 30% fiber optic attrition vs 6% digital attrition among all customers. Last but not least, it is recommended that companies provide promotions and offers to customers that are new to company (customers with total months in range 0-18 or 20 months). Since analysis shows among the customers who did not cancel their service, only 26 % of them remained less than 18 months with the company. When compare this with customers who did cancel their service, this increase to 66 %. Therefore, it is possible to encourage customer stay by offering promotions or deals.

In conclusion, The mentioned factors suggested to be investigated and make sure every attempt is being made to improve these factors, in order to increase customer satisfaction and eventually minimizing the financial and reputation loss.

Machine Learning Project

Data Analysis

Count the number of customers who canceled their service and who did not cancel their Service

Question 1

Code

Data Visualization

Question 2

Data Visualization

Question 3

Code

Data Visualization

Question 4

Data Visualization

Question 5

Data Visualization

Machine Learning

Extracting training and test sets

Create cross validation folds for hyperparameter tuning

View the training data

Create the recipe

Check to see if our feature engineering is doing what we expect.

Create Custom Metrics for using in model perfomance

Model 1: Logistic Regression

Fit the logistic model to the training set

Collect Predictions

Evaluate The Logistic Model Performance

The Logistic Model Roc Curve

Model 2: Random Forest

Create workflow for the random forest model

Create a grid of hyper-parameter values to test

Show my best rf models based on roc_auc metric

Selecting best model based on roc_auc for using in finalized workflow

Finalize Random Forest model workflow

Train and Evaluate using “last_fit” function

Collect Predictions for Random Forest

Confusion Matrix for random forest model

The Random Forest Model Roc Curve

Model 3: Linear Discriminant Analysis

Create workflow for the LDA model

Fit the LDA Model using “last_fit” function

Collect Predictions

Evaluate the LDA Model Performance

The LDA Model Roc Curve