Description of the dataset

The dataset for our classification machine learning project contains the relevant Telco customer data. It was taken from Kaggle and stems from the IBM sample data set collection (https://www.kaggle.com/blastchar/telco-customer-churn).

Each row of the data represents a customer; each column contains one of the customer's attributes.

The data contains 7043 rows (customers) and 21 columns (features):

  • Churn = Whether the customer left within the last month (Yes or No): our target variable
  • customerID = ID of Customer
  • gender = Whether the customer is a male or a female
  • SeniorCitizen = Whether the customer is a senior citizen or not
  • Partner = Whether the customer has a partner or not
  • Dependents = Whether the customer has dependents or not
  • Tenure = Number of months the customer has stayed with the company
  • PhoneService = Whether the customer has a phone service or not
  • MultipleLines = Whether the customer has multiple lines or not
  • InternetService = Customer’s internet service type (e.g. DSL or Fiber optic)
  • OnlineSecurity = Whether the customer has online security or not
  • OnlineBackup = Whether the customer has online backup or not
  • DeviceProtection = Whether the customer has device protection or not
  • TechSupport = Whether the customer has tech support or not
  • StreamingTV = Whether the customer has streaming TV or not
  • StreamingMovies = Whether the customer has streaming movies or not
  • Contract = The contract term of the customer (Month-to-month, One year or Two year)
  • PaperlessBilling = Whether the customer has paperless billing or not
  • PaymentMethod = The customer’s payment method (Electronic check, Mailed check, Bank transfer or Credit card)
  • MonthlyCharges = The amount charged to the customer monthly
  • TotalCharges = The total amount charged to the customer
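
The code in this report assumes that the following packages are loaded. This is a minimal setup sketch: the package list is inferred from the functions called throughout the report.

# Packages inferred from the functions used below
library(tidyverse)   # %>%, glimpse(), ggplot2
library(plyr)        # mapvalues()
library(caret)       # createDataPartition(), train(), confusionMatrix()
library(rpart)       # rpart(), prune(), printcp()
library(rattle)      # fancyRpartPlot()
library(pROC)        # roc(), auc()
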
Telco <- read.csv('data/Telco-Customer-Churn.csv')
Telco %>% glimpse()
## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOC…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female"…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Ye…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No",…
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", …
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber op…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", …
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Y…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Ye…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One…
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No",…
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", …
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 194…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "…

Analyses of the data

Nowadays it is necessary for telecom firms to acquire new customers while preventing the termination of existing contracts.

Looking at churn, multiple factors lead consumers to cancel their contracts, such as better pricing deals, more interesting packages, poor service experiences or changing personal circumstances.

The main challenge is to determine whether or not an individual customer is going to churn.

Machine learning models are trained on 70% of the sample data.

The remaining 30% is used to apply the trained models and to test their predictive power with respect to churn/no churn.

Accuracy is measured in order to compare models and select the best ones for this task.

In conclusion, we are going to build a prediction in order to discover the maximum amount that can be charged to a customer monthly without causing him to churn.

Clean data and Variable transformations

In order to detect missing values we used the function summary.

There are 11 NAs in the variable TotalCharges; we decided to drop that variable because it depends on how long the customer has stayed with Telco.

summary(Telco)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
Telco <- Telco[,-20]   # drop TotalCharges (column 20)

We decided not to consider the variable customerID, and we transform every character variable into a factor.

Notice that the values of the variable SeniorCitizen, which indicates whether the customer is a senior citizen or not, need to be recoded from 0/1 to No/Yes.

Telco <- Telco[,-1]   # drop customerID (column 1)
Telco$gender <- as.factor(Telco$gender)
Telco$Partner <- as.factor(Telco$Partner)
Telco$Dependents <- as.factor(Telco$Dependents)
Telco$PhoneService <- as.factor(Telco$PhoneService)
Telco$MultipleLines <- as.factor(Telco$MultipleLines)
Telco$InternetService <- as.factor(Telco$InternetService)
Telco$OnlineSecurity <- as.factor(Telco$OnlineSecurity)
Telco$OnlineBackup <- as.factor(Telco$OnlineBackup)
Telco$DeviceProtection <- as.factor(Telco$DeviceProtection)
Telco$TechSupport <- as.factor(Telco$TechSupport)
Telco$StreamingTV <- as.factor(Telco$StreamingTV)
Telco$StreamingMovies <- as.factor(Telco$StreamingMovies)
Telco$Contract <- as.factor(Telco$Contract)
Telco$PaperlessBilling <- as.factor(Telco$PaperlessBilling)
Telco$PaymentMethod <- as.factor(Telco$PaymentMethod)
Telco$Churn <- as.factor(Telco$Churn)
Telco$SeniorCitizen<-as.factor(mapvalues(Telco$SeniorCitizen,from=c("0","1"),to=c("No", "Yes")))
Telco %>% glimpse()
## Rows: 7,043
## Columns: 19
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Fema…
## $ SeniorCitizen    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Y…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, N…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, …
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes,…
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber …
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No…
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No i…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No i…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No int…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No in…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No int…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, …
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, N…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank t…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, …
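
As an aside, the block of as.factor() calls above could be written more concisely with dplyr; the following is an equivalent sketch applied to the raw data (assuming dplyr >= 1.0 for across()), not an additional step.

# Equivalent one-shot transformation (sketch only, not run again)
Telco <- Telco %>%
  mutate(across(where(is.character), as.factor),
         SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1),
                                labels = c("No", "Yes")))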

First view of variables

Continuous variables

We create a multiplot in order to show the distribution of the continuous variables with histograms.
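
Since the figure itself is not reproduced here, below is a sketch of how such a multiplot could be built with ggplot2 and gridExtra (the gridExtra dependency and the bin counts are assumptions).

library(gridExtra)   # assumed here for arranging the plots side by side
p1 <- ggplot(Telco, aes(x = tenure)) + geom_histogram(bins = 30)
p2 <- ggplot(Telco, aes(x = MonthlyCharges)) + geom_histogram(bins = 30)
grid.arrange(p1, p2, ncol = 2)   # multiplot of the two continuous variables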

Categorical variables

The barplots below show the different categories of the factor variables.
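
A sketch of one such barplot; the choice of Contract as the example variable is ours, and the same pattern applies to the other factors.

ggplot(Telco, aes(x = Contract, fill = Contract)) +
  geom_bar() +
  theme(legend.position = "none")   # barplot of one categorical variable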

Target variable

We decided to create a bar plot in order to illustrate the distribution of customers who have churned or not: over 70% of our customers did not churn.
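
The distribution can also be checked numerically; a minimal sketch:

prop.table(table(Telco$Churn))   # roughly 73% "No" vs 27% "Yes"
ggplot(Telco, aes(x = Churn, fill = Churn)) + geom_bar()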

Training/test Telco division

In order to split the data into training and test sets, we used the function createDataPartition from the caret package.

Let's visualize the result of the partition: we assigned 70% of the observations to the training set and 30% to the test set.

        Churn Yes   Churn No
Train        1309       3622
Test          560       1552
set.seed(123)
train <- createDataPartition(Telco$Churn, p = 0.7, list = FALSE)   # stratified 70% split on Churn
Telco.train <- Telco[train,]
Telco.test  <- Telco[-train,]
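
The counts reported in the table above can be verified directly:

table(Telco.train$Churn)   # expected: No 3622, Yes 1309
table(Telco.test$Churn)    # expected: No 1552, Yes 560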

Definition of Formula

We decide to keep all the variables in our formula.

model1.formula <- Churn ~ .

Gini's tree

The default splitting criterion is the Gini Index (CART algorithm).

First of all, we impose no complexity restriction on the tree growth (cp = -1) in order to visualize the full tree.

The resulting tree is large and difficult to interpret; for this reason we will later decide to prune it.

set.seed(123)
Telco.tree <- 
  rpart(model1.formula,
        data = Telco.train,
        method = "class",
        minsplit = 100,
        minbucket = 50,
        cp = -1)

printcp(Telco.tree)
## 
## Classification tree:
## rpart(formula = model1.formula, data = Telco.train, method = "class", 
##     minsplit = 100, minbucket = 50, cp = -1)
## 
## Variables actually used in tree construction:
##  [1] Contract         Dependents       DeviceProtection gender          
##  [5] InternetService  MonthlyCharges   MultipleLines    OnlineBackup    
##  [9] OnlineSecurity   PaperlessBilling Partner          PaymentMethod   
## [13] PhoneService     StreamingMovies  tenure          
## 
## Root node error: 1309/4931 = 0.26546
## 
## n= 4931 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0542399      0   1.00000 1.00000 0.023688
## 2  0.0118411      3   0.79144 0.79832 0.021923
## 3  0.0068755      5   0.76776 0.80290 0.021969
## 4  0.0030558      8   0.74102 0.78610 0.021799
## 5  0.0022918     11   0.73186 0.80214 0.021961
## 6  0.0010186     18   0.71352 0.80443 0.021984
## 7  0.0000000     21   0.71047 0.80978 0.022037
## 8 -1.0000000     68   0.71047 0.80978 0.022037
fancyRpartPlot(Telco.tree, palettes = c("Greys", "Oranges"), sub="Full classification tree")

Now we analyse the fit of the model on the training set.

  • Accuracy = 0.8114: the proportion of observations classified correctly.
  • Kappa = 0.4812: a chance-corrected measure of agreement between predictions and actual values.
  • Sensitivity = 0.5386: the ratio of correctly classified events (Churn = Yes) to the total number of events.
  • Specificity = 0.9100: the ratio of correctly classified non-events (Churn = No) to the total number of non-events.

The confusion matrix shows a grid of true and false predictions compared to the actual values.
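
The matrix below was presumably produced by a call of the following form (the object name pred.tree.train is our assumption), mirroring the calls shown later for the pruned tree.

pred.tree.train <- predict(Telco.tree, Telco.train, type = "class")
confusionMatrix(data = pred.tree.train,
                reference = Telco.train$Churn,
                positive = "Yes")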

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3296  604
##        Yes  326  705
##                                           
##                Accuracy : 0.8114          
##                  95% CI : (0.8002, 0.8222)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4812          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5386          
##             Specificity : 0.9100          
##          Pos Pred Value : 0.6838          
##          Neg Pred Value : 0.8451          
##              Prevalence : 0.2655          
##          Detection Rate : 0.1430          
##    Detection Prevalence : 0.2091          
##       Balanced Accuracy : 0.7243          
##                                           
##        'Positive' Class : Yes             
## 

Then we create the same confusion matrix for the test set.
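
Again, a sketch of the likely call (object name assumed):

pred.tree.test <- predict(Telco.tree, Telco.test, type = "class")
confusionMatrix(data = pred.tree.test,
                reference = Telco.test$Churn,
                positive = "Yes")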

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1393  283
##        Yes  159  277
##                                           
##                Accuracy : 0.7907          
##                  95% CI : (0.7727, 0.8079)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 1.476e-09       
##                                           
##                   Kappa : 0.4221          
##                                           
##  Mcnemar's Test P-Value : 4.901e-09       
##                                           
##             Sensitivity : 0.4946          
##             Specificity : 0.8976          
##          Pos Pred Value : 0.6353          
##          Neg Pred Value : 0.8311          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1312          
##    Detection Prevalence : 0.2064          
##       Balanced Accuracy : 0.6961          
##                                           
##        'Positive' Class : Yes             
## 

The confusion matrices show that the accuracy on the training data is higher than on the test data: the tree is overfitted!

Pruning

In order to reduce the size and complexity of the tree, we prune it.

The complexity parameter (cp) is used to control the complexity of the tree.

The second part of the printcp output above shows the values of the complexity parameter and the related prediction error in cross-validation.

In order to choose the complexity parameter with the lowest cross-validated prediction error, we use the function which.min and then inspect the result:

(opt <- which.min(Telco.tree$cptable[,"xerror"]))
## 4 
## 4
(cp <- Telco.tree$cptable[opt, "CP"])
## [1] 0.003055768

Now we can proceed with the pruning step!

Telco.tree_p <- prune(Telco.tree, cp = cp )

Let's print a confusion matrix in order to evaluate the accuracy of the Pruned Tree.

pred.tree_p.train <- predict(Telco.tree_p,Telco.train,type = "class")

confusionMatrix(data = pred.tree_p.train,
                reference = Telco.train$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3189  537
##        Yes  433  772
##                                           
##                Accuracy : 0.8033          
##                  95% CI : (0.7919, 0.8143)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4825          
##                                           
##  Mcnemar's Test P-Value : 0.0009426       
##                                           
##             Sensitivity : 0.5898          
##             Specificity : 0.8805          
##          Pos Pred Value : 0.6407          
##          Neg Pred Value : 0.8559          
##              Prevalence : 0.2655          
##          Detection Rate : 0.1566          
##    Detection Prevalence : 0.2444          
##       Balanced Accuracy : 0.7351          
##                                           
##        'Positive' Class : Yes             
## 
pred.tree_p.test <- predict(Telco.tree_p, Telco.test, type = "class")
confusionMatrix(data = pred.tree_p.test,
                reference = Telco.test$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1357  239
##        Yes  195  321
##                                           
##                Accuracy : 0.7945          
##                  95% CI : (0.7766, 0.8116)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 1.072e-10       
##                                           
##                   Kappa : 0.4591          
##                                           
##  Mcnemar's Test P-Value : 0.03901         
##                                           
##             Sensitivity : 0.5732          
##             Specificity : 0.8744          
##          Pos Pred Value : 0.6221          
##          Neg Pred Value : 0.8503          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1520          
##    Detection Prevalence : 0.2443          
##       Balanced Accuracy : 0.7238          
##                                           
##        'Positive' Class : Yes             
## 

The pruned tree seems less overfitted, and its accuracy on the test data is higher than that of the full tree.

Interpretation of pruned tree

In order to interpret each node better, we plotted two different renderings of the same pruned tree.

Each node of the tree below shows:

  • the predicted class (Churn=No or Churn=Yes)
  • the predicted probability of churn (color intensity denotes how close the probability is to 0 or 1)
  • the percentage of observations in the node
  • the node number

Terminal node number 15, which represents the largest group of churners (14% of all customers), is characterised by:

  1. Contract = Month to month
  2. Internet Service = Fiber optic
  3. Duration with company < 15 months

Node number 27, which represents a group of churners (5% of all customers), is characterised by:

  1. Contract = Month to month
  2. Internet Service = DSL
  3. Duration with company < 3.5 months
  4. Payment method = Bank transfer

Terminal node number 119, which represents a further group of churners (5% of all customers), is characterised by:

  1. Contract = Month to month
  2. Internet Service = Fiber optic
  3. Duration with company < 55 months
  4. Multiple lines = yes
fancyRpartPlot(Telco.tree_p, palettes = c("Greys", "Oranges"), sub="Pruned classification tree")

fancyRpartPlot(Telco.tree_p, palettes = c("Greys", "Oranges"), sub="Pruned classification tree", type= 5)

Comparison between pruned and unpruned trees

First of all, we predict the value of the target variable directly on the training and test sets.

Then we calculate values of the ROC curves, which show the diagnostic ability of a model by bringing together true positive rate and false positive rate for different thresholds of class predictions.

Finally we visualise both ROC curves on the plot and also the information about values of the Gini coefficients.

If the Gini coefficient equals 1, the model predicts or classifies perfectly.
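
For reference, a sketch of how a ROC curve and the corresponding Gini coefficient can be computed with pROC for the unpruned tree (the object names are assumptions; Gini = 2 * AUC - 1):

prob.tree.test <- predict(Telco.tree, Telco.test, type = "prob")[, "Yes"]
ROC.test.tree  <- roc(as.numeric(Telco.test$Churn == "Yes"), prob.tree.test)
gini.test.tree <- 2 * auc(ROC.test.tree) - 1   # Gini coefficient from the AUC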

As we already said before, the original tree is overfitted.

Bagging with logistic regression model

Bagging is used when the aim is to reduce the variance of a model such as a decision tree.

The idea is to create several subsets of the data by randomly sampling the training set with replacement.

Each subset is then used to train its own model; as a consequence, we end up with a collection of different models.

Thus, we estimated 19 different models and made predictions on the training and test sets with each of them; a sketch of this estimation step follows.
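
The estimation step itself is not shown above. Below is a minimal sketch of how the 19 bootstrap logistic regressions could be fitted; the loop is our assumption, but the prediction matrices feed the majority-voting code that follows.

n.models <- 19
Telco.pred_bag_train <- matrix(NA, nrow(Telco.train), n.models)
Telco.pred_bag_test  <- matrix(NA, nrow(Telco.test),  n.models)
set.seed(123)
for (i in 1:n.models) {
  boot <- sample(nrow(Telco.train), replace = TRUE)   # bootstrap sample with replacement
  fit  <- glm(model1.formula, data = Telco.train[boot, ], family = binomial)
  Telco.pred_bag_train[, i] <- predict(fit, Telco.train, type = "response")
  Telco.pred_bag_test[, i]  <- predict(fit, Telco.test,  type = "response")
}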

Now we generate the aggregated prediction by majority voting.

Telco.pred_bag_train_final <-
  ifelse(rowSums(Telco.pred_bag_train < 0.5) > 19/2,  # "No" wins if most of the 19 models give P(churn) < 0.5
         "No", "Yes") %>% 
  factor(., levels = c("No", "Yes"))

Telco.pred_bag_test_final <-
  ifelse(rowSums(Telco.pred_bag_test < 0.5) > 19/2,
         "No", "Yes") %>% 
  factor(., levels = c("No", "Yes"))

In contrast to the previous models, the prediction accuracy on the test data is greater than on the training data: the model is not overfitted!

confusionMatrix(data = Telco.pred_bag_train_final,
                reference = Telco.train$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3243  596
##        Yes  379  713
##                                           
##                Accuracy : 0.8023          
##                  95% CI : (0.7909, 0.8133)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4646          
##                                           
##  Mcnemar's Test P-Value : 4.596e-12       
##                                           
##             Sensitivity : 0.5447          
##             Specificity : 0.8954          
##          Pos Pred Value : 0.6529          
##          Neg Pred Value : 0.8448          
##              Prevalence : 0.2655          
##          Detection Rate : 0.1446          
##    Detection Prevalence : 0.2215          
##       Balanced Accuracy : 0.7200          
##                                           
##        'Positive' Class : Yes             
## 
confusionMatrix(data = Telco.pred_bag_test_final,
                reference = Telco.test$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1385  244
##        Yes  167  316
##                                           
##                Accuracy : 0.8054          
##                  95% CI : (0.7879, 0.8221)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 2.048e-14       
##                                           
##                   Kappa : 0.4777          
##                                           
##  Mcnemar's Test P-Value : 0.0001777       
##                                           
##             Sensitivity : 0.5643          
##             Specificity : 0.8924          
##          Pos Pred Value : 0.6542          
##          Neg Pred Value : 0.8502          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1496          
##    Detection Prevalence : 0.2287          
##       Balanced Accuracy : 0.7283          
##                                           
##        'Positive' Class : Yes             
## 

Comparison without cross-validation

In order to compare the model above with a logistic regression estimated without cross-validation, we fit the latter and use the classical confusion matrix to study its accuracy.

As we can see in the output below, the model estimated without cross-validation provides almost the same results.

ctrl_nocv <- trainControl(method = "none")
Telco.logit <- 
  train(model1.formula, 
        data = Telco.train, 
        method = "glm",
        family = "binomial", 
        trControl = ctrl_nocv)

ROC.test.logit <-
  roc(as.numeric(Telco.test$Churn == "Yes"), 
      predict(Telco.logit, newdata = Telco.test, type = "prob")[, "Yes"])

ROC.train.logit <-
  roc(as.numeric(Telco.train$Churn == "Yes"), 
      predict(Telco.logit, newdata = Telco.train, type = "prob")[, "Yes"])

confusionMatrix(data = predict(Telco.logit,
                               newdata = Telco.train),
                reference = Telco.train$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3246  600
##        Yes  376  709
##                                           
##                Accuracy : 0.8021          
##                  95% CI : (0.7907, 0.8131)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4631          
##                                           
##  Mcnemar's Test P-Value : 9.466e-13       
##                                           
##             Sensitivity : 0.5416          
##             Specificity : 0.8962          
##          Pos Pred Value : 0.6535          
##          Neg Pred Value : 0.8440          
##              Prevalence : 0.2655          
##          Detection Rate : 0.1438          
##    Detection Prevalence : 0.2200          
##       Balanced Accuracy : 0.7189          
##                                           
##        'Positive' Class : Yes             
## 
confusionMatrix(data = predict(Telco.logit,
                               newdata = Telco.test),
                reference = Telco.test$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1387  246
##        Yes  165  314
##                                           
##                Accuracy : 0.8054          
##                  95% CI : (0.7879, 0.8221)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 2.048e-14       
##                                           
##                   Kappa : 0.4764          
##                                           
##  Mcnemar's Test P-Value : 7.943e-05       
##                                           
##             Sensitivity : 0.5607          
##             Specificity : 0.8937          
##          Pos Pred Value : 0.6555          
##          Neg Pred Value : 0.8494          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1487          
##    Detection Prevalence : 0.2268          
##       Balanced Accuracy : 0.7272          
##                                           
##        'Positive' Class : Yes             
## 

We calculate the values of the ROC curves based on the averaged churn probabilities.

In order to compare the two models created above, we plot a graph that shows both ROC curves (sketched below).

The graph shows that the differences between the two models are not very pronounced.
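
A sketch of the comparison plot: the ROC object for the bagged model is not named in the text, so ROC.test.bag below is our assumption, built from the averaged test-set probabilities.

ROC.test.bag <- roc(as.numeric(Telco.test$Churn == "Yes"),
                    rowMeans(Telco.pred_bag_test))
plot(ROC.test.logit, col = "blue", legacy.axes = TRUE)
lines(ROC.test.bag, col = "red")
legend("bottomright", legend = c("logit", "bagging"),
       col = c("blue", "red"), lty = 1)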

Random Forest

Random Forest is an extension of bagging.

In addition to taking random subsets of the data, it also randomly selects a subset of the features at each split rather than using all features to grow the trees.

The advantage of this technique is that it handles high-dimensional data very well and can manage missing values while maintaining accuracy.

On the other hand, a drawback is that, since the final forecast is based on averaging the predictions of the individual trees, it does not provide exact values for regression problems.

Let's build our model using out-of-bag (OOB) resampling in order to choose the best value of the tuning parameter.

The output tells us that the final value used for the model was mtry = 10, i.e. the number of variables randomly sampled as candidates at each split.

parameters_rf <- expand.grid(mtry = 2:14)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)

set.seed(123)
Telco.rf <- train(model1.formula,
          data = Telco.train,
          method = "rf",
          ntree = 100,
          nodesize = 100,
          tuneGrid = parameters_rf,
          trControl = ctrl_oob,
          importance = TRUE)
Telco.rf
## Random Forest 
## 
## 4931 samples
##   18 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7779355  0.2982673
##    3    0.7888866  0.3623904
##    4    0.7945650  0.4064827
##    5    0.7988238  0.4353900
##    6    0.7992294  0.4347173
##    7    0.8012574  0.4504364
##    8    0.7992294  0.4468656
##    9    0.8002434  0.4445643
##   10    0.8014602  0.4470701
##   11    0.7988238  0.4431211
##   12    0.8014602  0.4534413
##   13    0.8002434  0.4430912
##   14    0.7990266  0.4420664
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.

Moreover, we mark the chosen value with a dashed blue line in the two tuning plots (not reproduced here).

Then we use the classical confusion matrix to study the model accuracy.

confusionMatrix(data = predict(Telco.rf,
                               newdata = Telco.train),
                reference = Telco.train$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3304  594
##        Yes  318  715
##                                           
##                Accuracy : 0.815           
##                  95% CI : (0.8039, 0.8258)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4915          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5462          
##             Specificity : 0.9122          
##          Pos Pred Value : 0.6922          
##          Neg Pred Value : 0.8476          
##              Prevalence : 0.2655          
##          Detection Rate : 0.1450          
##    Detection Prevalence : 0.2095          
##       Balanced Accuracy : 0.7292          
##                                           
##        'Positive' Class : Yes             
## 
confusionMatrix(data = predict(Telco.rf,
                               newdata = Telco.test),
                reference = Telco.test$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1400  280
##        Yes  152  280
##                                           
##                Accuracy : 0.7955          
##                  95% CI : (0.7776, 0.8125)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 5.410e-11       
##                                           
##                   Kappa : 0.4337          
##                                           
##  Mcnemar's Test P-Value : 9.945e-10       
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.9021          
##          Pos Pred Value : 0.6481          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1326          
##    Detection Prevalence : 0.2045          
##       Balanced Accuracy : 0.7010          
##                                           
##        'Positive' Class : Yes             
## 

Ranger Random Forest

Ranger is a fast implementation of random forests (recursive partitioning) and is especially suited for high-dimensional data.

It supports classification, regression, and survival forests.

We tune 3 parameters (mtry, splitrule and min.node.size) and apply a resampling method with 5-fold cross-validation.

The chosen Ranger random forest model uses mtry = 7, splitrule = gini and min.node.size = 200.

parameters_ranger <- expand.grid(mtry = 2:14,
              splitrule = "gini",
              min.node.size = c(50, 100, 200))

ctrl_cv5 <- trainControl(method = "cv", 
                         number =  5,
                         classProbs = T)

set.seed(123)
Telco.rf.ranger <- train(model1.formula, 
          data = Telco.train, 
          method = "ranger", 
          num.trees = 100,
          num.threads = 3,
          importance = "impurity",
          tuneGrid = parameters_ranger, 
          trControl = ctrl_cv5)

print(Telco.rf.ranger)
## Random Forest 
## 
## 4931 samples
##   18 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3945, 3945, 3944, 3945, 3945 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy   Kappa    
##    2     50            0.7838156  0.3147776
##    2    100            0.7801638  0.2945094
##    2    200            0.7755020  0.2653441
##    3     50            0.7965934  0.4078623
##    3    100            0.7905083  0.3769035
##    3    200            0.7911159  0.3670282
##    4     50            0.7939565  0.4137413
##    4    100            0.7969973  0.4158929
##    4    200            0.7929415  0.3927668
##    5     50            0.7955801  0.4276405
##    5    100            0.7972022  0.4290590
##    5    200            0.7953743  0.4100288
##    6     50            0.7963902  0.4321922
##    6    100            0.7967961  0.4300578
##    6    200            0.7996334  0.4261647
##    7     50            0.7955792  0.4311877
##    7    100            0.7994318  0.4392973
##    7    200            0.8022697  0.4416368
##    8     50            0.7965926  0.4350400
##    8    100            0.7998374  0.4391814
##    8    200            0.7990263  0.4351171
##    9     50            0.7992299  0.4447359
##    9    100            0.7998379  0.4439872
##    9    200            0.8004455  0.4413977
##   10     50            0.7955797  0.4329162
##   10    100            0.7969985  0.4380967
##   10    200            0.7974032  0.4337527
##   11     50            0.7933499  0.4278833
##   11    100            0.7994318  0.4448631
##   11    200            0.7984180  0.4407784
##   12     50            0.7984190  0.4396406
##   12    100            0.7965932  0.4387845
##   12    200            0.7990244  0.4396662
##   13     50            0.7963912  0.4339012
##   13    100            0.7990279  0.4434905
##   13    200            0.7996328  0.4445237
##   14     50            0.7996350  0.4438419
##   14    100            0.7976074  0.4382475
##   14    200            0.7976050  0.4386939
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7, splitrule = gini
##  and min.node.size = 200.
plot(Telco.rf.ranger, xlab = "Random Predictors", main = "Random Predictor Accuracy")

Then we use the classical confusion matrix to study the model accuracy.

confusionMatrix(data = predict(Telco.rf.ranger,
                               newdata = Telco.train),
                reference = Telco.train$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3323  619
##        Yes  299  690
##                                           
##                Accuracy : 0.8138          
##                  95% CI : (0.8027, 0.8246)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4822          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5271          
##             Specificity : 0.9174          
##          Pos Pred Value : 0.6977          
##          Neg Pred Value : 0.8430          
##              Prevalence : 0.2655          
##          Detection Rate : 0.1399          
##    Detection Prevalence : 0.2006          
##       Balanced Accuracy : 0.7223          
##                                           
##        'Positive' Class : Yes             
## 
confusionMatrix(data = predict(Telco.rf.ranger,
                               newdata = Telco.test),
                reference = Telco.test$Churn,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1419  293
##        Yes  133  267
##                                           
##                Accuracy : 0.7983          
##                  95% CI : (0.7805, 0.8152)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : 6.491e-12       
##                                           
##                   Kappa : 0.4304          
##                                           
##  Mcnemar's Test P-Value : 1.323e-14       
##                                           
##             Sensitivity : 0.4768          
##             Specificity : 0.9143          
##          Pos Pred Value : 0.6675          
##          Neg Pred Value : 0.8289          
##              Prevalence : 0.2652          
##          Detection Rate : 0.1264          
##    Detection Prevalence : 0.1894          
##       Balanced Accuracy : 0.6955          
##                                           
##        'Positive' Class : Yes             
## 

Finally, we compare the two random forest models.

set.seed(123)
pred.train.rf <- predict(Telco.rf, 
                          Telco.train, 
                          type = "prob")[, "Yes"]
ROC.train.rf  <- roc(as.numeric(Telco.train$Churn == "Yes"), 
                      pred.train.rf)

pred.test.rf  <- predict(Telco.rf, 
                          Telco.test, 
                          type = "prob")[, "Yes"]
ROC.test.rf   <- roc(as.numeric(Telco.test$Churn == "Yes"), 
                      pred.test.rf)

pred.train.rf.ranger <- predict(Telco.rf.ranger, 
                           Telco.train, 
                           type = "prob")[, "Yes"]
ROC.train.rf.ranger  <- roc(as.numeric(Telco.train$Churn == "Yes"), 
                       pred.train.rf.ranger)
pred.test.rf.ranger  <- predict(Telco.rf.ranger, 
                           Telco.test, 
                           type = "prob")[, "Yes"]
ROC.test.rf.ranger  <- roc(as.numeric(Telco.test$Churn == "Yes"), 
                       pred.test.rf.ranger)

Both models seem overfitted, but we can also notice that the Ranger random forest achieved slightly better accuracy on the test data.

Next, we present them on the plot.
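
A sketch of the plot (the colours and legend labels are our choices):

plot(ROC.test.rf, col = "darkgreen", legacy.axes = TRUE)
lines(ROC.test.rf.ranger, col = "orange")
legend("bottomright", legend = c("randomForest", "ranger"),
       col = c("darkgreen", "orange"), lty = 1)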

The Ranger random forest improved our model; in fact, it presents a greater Gini coefficient than its competitor.

Summary and conclusion

In order to compare all the models created in our project, we decided to visualize all the ROC curves.

Then, we created a table reporting the Gini values and accuracies obtained from our models:

  • the most overfitted models are the two random forests
  • the best models, with the highest test accuracy (80.54%), are those based on logistic regression (bagging and plain logit)
                    tree     tree_p   bagging   logit    rf       rf.ranger
Gini (train)        73.4%    63.8%    69.2%     69.2%    72%      74.2%
Gini (test)         65.1%    60.9%    69%       68.9%    67.4%    68.9%
Accuracy (train)    81.14%   80.33%   80.23%    80.21%   81.58%   81.5%
Accuracy (test)     79.07%   79.45%   80.54%    80.54%   79.5%    79.64%
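
The Gini values in the table can be recovered from the stored ROC objects; a sketch for the test-set columns (object names as defined earlier, Gini = 2 * AUC - 1):

sapply(list(logit = ROC.test.logit, rf = ROC.test.rf, ranger = ROC.test.rf.ranger),
       function(r) round(100 * (2 * auc(r) - 1), 1))   # Gini in %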