The dataset for our classification machine learning project contains the relevant Telco customer data. It is taken from Kaggle (https://www.kaggle.com/blastchar/telco-customer-churn) and stems from the IBM sample data set collection.
Each row of the data represents a customer; each column contains one of the customer's attributes.
The data contains 7043 rows (customers) and 21 columns (features):
- Churn = Customers who left within the last month: our target (Yes or No)
- customerID = ID of the customer
- gender = Whether the customer is a male or a female
- SeniorCitizen = Whether the customer is a senior citizen or not
- Partner = Whether the customer has a partner or not
- Dependents = Whether the customer has dependents or not
- tenure = Number of months the customer has stayed with the company
- PhoneService = Whether the customer has a phone service or not
- MultipleLines = Whether the customer has multiple lines or not
- InternetService = Customer's internet service provider (DSL, Fiber optic)
- OnlineSecurity = Whether the customer has online security or not
- OnlineBackup = Whether the customer has online backup or not
- DeviceProtection = Whether the customer has device protection or not
- TechSupport = Whether the customer has tech support or not
- StreamingTV = Whether the customer has streaming TV or not
- StreamingMovies = Whether the customer has streaming movies or not
- Contract = The contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling = Whether the customer has paperless billing or not
- PaymentMethod = The customer's payment method (Electronic check, Mailed check, Bank transfer, Credit card)
- MonthlyCharges = The amount charged to the customer monthly
- TotalCharges = The total amount charged to the customer
Telco <- read.csv('data/Telco-Customer-Churn.csv')
Telco %>% glimpse()
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOC…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female"…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Ye…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No",…
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", …
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber op…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", …
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Y…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Ye…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One…
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No",…
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", …
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 194…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "…
Nowadays telco firms need to acquire new customers while preventing existing ones from terminating their contracts.
Looking at churn, multiple factors lead consumers to cancel their contracts, such as better pricing deals, more attractive packages, poor service experiences or changes in customers' personal circumstances.
The main challenge is to determine whether or not an individual customer is going to churn.
The machine learning models are trained on 70% of the sample data.
The remaining 30% is used to apply the fitted models and to test their predictive power with respect to churn / no churn.
Accuracy is measured in order to compare the models and select the best one for this task.
Finally, we would like to use the predictions to find the maximum amount that can be charged to a customer monthly without causing them to churn.
In order to inspect missing values we used the function summary.
There are 11 NAs in the variable TotalCharges; we decided to drop that variable because it depends on the time the customers have stayed with Telco.
summary(Telco)
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Length:7043
## Class :character
## Mode :character
##
##
##
##
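As a quick cross-check (a minimal sketch, not part of the original script), the per-column count of missing values can also be obtained directly:
# count NAs in every column of the raw data
colSums(is.na(Telco))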
Telco <- Telco[,-20]
We decided not to consider the variable customerID, and we transform every character variable into a factor.
Notice that the variable SeniorCitizen, which indicates whether the customer is a senior citizen or not, needs to be recoded from 0/1 to No and Yes.
Telco <- Telco[,-1]
Telco$gender <- as.factor(Telco$gender)
Telco$Partner <- as.factor(Telco$Partner)
Telco$Dependents <- as.factor(Telco$Dependents)
Telco$PhoneService <- as.factor(Telco$PhoneService)
Telco$MultipleLines <- as.factor(Telco$MultipleLines)
Telco$InternetService <- as.factor(Telco$InternetService)
Telco$OnlineSecurity <- as.factor(Telco$OnlineSecurity)
Telco$OnlineBackup <- as.factor(Telco$OnlineBackup)
Telco$DeviceProtection <- as.factor(Telco$DeviceProtection)
Telco$TechSupport <- as.factor(Telco$TechSupport)
Telco$StreamingTV <- as.factor(Telco$StreamingTV)
Telco$StreamingMovies <- as.factor(Telco$StreamingMovies)
Telco$Contract <- as.factor(Telco$Contract)
Telco$PaperlessBilling <- as.factor(Telco$PaperlessBilling)
Telco$PaymentMethod <- as.factor(Telco$PaymentMethod)
Telco$Churn <- as.factor(Telco$Churn)
Telco$SeniorCitizen<-as.factor(mapvalues(Telco$SeniorCitizen,from=c("0","1"),to=c("No", "Yes")))
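An equivalent, more compact way to do the same conversions (a sketch assuming dplyr >= 1.0 is loaded and the customerID column has already been dropped) would be:
# convert all character columns to factors and recode SeniorCitizen in one step
Telco <- Telco %>%
  mutate(across(where(is.character), as.factor),
         SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1),
                                labels = c("No", "Yes")))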
Telco %>% glimpse()
## Rows: 7,043
## Columns: 19
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Fema…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Y…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, N…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, …
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes,…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber …
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No i…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No i…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No int…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No in…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No int…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, …
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, N…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank t…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, …
We create a multiplot in order to show the distribution of the continuous variables with histograms.
The bar plots below show the different categories of the factor variables.
We also created a bar plot to illustrate the distribution of the target variable: over 70% of our customers did not churn.
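A minimal sketch of how such a churn bar plot can be produced (assuming ggplot2 is loaded; the exact plotting code used in the report is not shown):
# bar plot of the target variable Churn
Telco %>%
  ggplot(aes(x = Churn, fill = Churn)) +
  geom_bar() +
  labs(title = "Distribution of Churn", y = "Number of customers")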
In order to split the data into train and test sets we used the function createDataPartition.
Let's visualize the result of the partition: we assign 70% of the observations to the train set and 30% to the test set.
|  | Churn Yes | Churn No |
|---|---|---|
| Train | 1309 | 3622 |
| Test | 560 | 1552 |
set.seed(123)
train <- createDataPartition(Telco$Churn, p = 0.7, list = FALSE)
Telco.train <- Telco[train,]
Telco.test <- Telco[-train,]
We decided to keep all the variables in our model formula.
model1.formula <- Churn ~ .
The default splitting criterion is the Gini Index (CART algorithm).
First of all, we don't impose any restriction on the tree growth in order to visualize the full tree.
The resulting tree is large and difficult to interpret; in fact, we will later decide to prune it.
set.seed(123)
Telco.tree <-
rpart(model1.formula,
data = Telco.train,
method = "class",
minsplit = 100,
minbucket = 50,
cp = -1)
printcp(Telco.tree)
##
## Classification tree:
## rpart(formula = model1.formula, data = Telco.train, method = "class",
## minsplit = 100, minbucket = 50, cp = -1)
##
## Variables actually used in tree construction:
## [1] Contract Dependents DeviceProtection gender
## [5] InternetService MonthlyCharges MultipleLines OnlineBackup
## [9] OnlineSecurity PaperlessBilling Partner PaymentMethod
## [13] PhoneService StreamingMovies tenure
##
## Root node error: 1309/4931 = 0.26546
##
## n= 4931
##
## CP nsplit rel error xerror xstd
## 1 0.0542399 0 1.00000 1.00000 0.023688
## 2 0.0118411 3 0.79144 0.79832 0.021923
## 3 0.0068755 5 0.76776 0.80290 0.021969
## 4 0.0030558 8 0.74102 0.78610 0.021799
## 5 0.0022918 11 0.73186 0.80214 0.021961
## 6 0.0010186 18 0.71352 0.80443 0.021984
## 7 0.0000000 21 0.71047 0.80978 0.022037
## 8 -1.0000000 68 0.71047 0.80978 0.022037
fancyRpartPlot(Telco.tree, palettes = c("Greys", "Oranges"), sub="Full classification tree")
Now let's analyse the fit of the model on the training set:
- Accuracy = 0.8114 indicates the share of observations correctly classified.
- Kappa = 0.4812 measures the agreement between predictions and actual values, corrected for chance.
- Sensitivity = 0.5386 is the ratio of correctly classified events (Churn = Yes) to the total number of events.
- Specificity = 0.9100 is the ratio of correctly classified non-events (Churn = No) to the total number of non-events.

The confusion matrix below shows a grid of true and false predictions compared to the actual values.
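The call that produced this matrix is not shown in the output; a minimal sketch, mirroring the pruned-tree code used later in the report, would be:
# predicted classes of the full tree on the training set
pred.tree.train <- predict(Telco.tree, Telco.train, type = "class")
confusionMatrix(data = pred.tree.train,
                reference = Telco.train$Churn,
                positive = "Yes")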
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3296 604
## Yes 326 705
##
## Accuracy : 0.8114
## 95% CI : (0.8002, 0.8222)
## No Information Rate : 0.7345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4812
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5386
## Specificity : 0.9100
## Pos Pred Value : 0.6838
## Neg Pred Value : 0.8451
## Prevalence : 0.2655
## Detection Rate : 0.1430
## Detection Prevalence : 0.2091
## Balanced Accuracy : 0.7243
##
## 'Positive' Class : Yes
##
Then we create the same confusion matrix for the test set as well.
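Again the call is not shown; a sketch along the same lines would be:
# predicted classes of the full tree on the test set
pred.tree.test <- predict(Telco.tree, Telco.test, type = "class")
confusionMatrix(data = pred.tree.test,
                reference = Telco.test$Churn,
                positive = "Yes")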
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1393 283
## Yes 159 277
##
## Accuracy : 0.7907
## 95% CI : (0.7727, 0.8079)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 1.476e-09
##
## Kappa : 0.4221
##
## Mcnemar's Test P-Value : 4.901e-09
##
## Sensitivity : 0.4946
## Specificity : 0.8976
## Pos Pred Value : 0.6353
## Neg Pred Value : 0.8311
## Prevalence : 0.2652
## Detection Rate : 0.1312
## Detection Prevalence : 0.2064
## Balanced Accuracy : 0.6961
##
## 'Positive' Class : Yes
##
The confusion matrices show that the accuracy on the train data is higher than on the test data: the tree is overfitted!
In order to reduce the size and complexity of the tree, we prune it.
The complexity parameter (cp) is used to control the complexity of the tree.
The last part of the printcp output above shows the values of the complexity parameter and the related prediction error from cross-validation.
In order to choose the complexity parameter with the lowest cross-validated prediction error, we use the function which.min and then display it:
(opt <- which.min(Telco.tree$cptable[,"xerror"]))
## 4
## 4
(cp <- Telco.tree$cptable[opt, "CP"])
## [1] 0.003055768
Now we can proceed with the pruning step!
Telco.tree_p <- prune(Telco.tree, cp = cp)
Let's print a confusion matrix in order to evaluate the accuracy of the pruned tree.
pred.tree_p.train <- predict(Telco.tree_p,Telco.train,type = "class")
confusionMatrix(data = pred.tree_p.train,
reference = Telco.train$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3189 537
## Yes 433 772
##
## Accuracy : 0.8033
## 95% CI : (0.7919, 0.8143)
## No Information Rate : 0.7345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4825
##
## Mcnemar's Test P-Value : 0.0009426
##
## Sensitivity : 0.5898
## Specificity : 0.8805
## Pos Pred Value : 0.6407
## Neg Pred Value : 0.8559
## Prevalence : 0.2655
## Detection Rate : 0.1566
## Detection Prevalence : 0.2444
## Balanced Accuracy : 0.7351
##
## 'Positive' Class : Yes
##
pred.tree_p.test <- predict(Telco.tree_p, Telco.test, type = "class")
confusionMatrix(data = pred.tree_p.test,
reference = Telco.test$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1357 239
## Yes 195 321
##
## Accuracy : 0.7945
## 95% CI : (0.7766, 0.8116)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 1.072e-10
##
## Kappa : 0.4591
##
## Mcnemar's Test P-Value : 0.03901
##
## Sensitivity : 0.5732
## Specificity : 0.8744
## Pos Pred Value : 0.6221
## Neg Pred Value : 0.8503
## Prevalence : 0.2652
## Detection Rate : 0.1520
## Detection Prevalence : 0.2443
## Balanced Accuracy : 0.7238
##
## 'Positive' Class : Yes
##
The pruned tree seems less overfitted, and its accuracy on the test data is now higher than that of the full tree.
In order to interpret each node better, we plotted the same pruned tree in two different ways.
Each node of the tree below shows the predicted class, the proportion of each class within the node and the percentage of observations falling into the node.
The final node number 15, which represents the largest group of churners (14% of the total), is characterised by:
- Contract = Month-to-month
- InternetService = Fiber optic
- tenure (duration with the company) < 15 months

Node number 27, which represents churners amounting to 5% of the total, is characterised by:
- Contract = Month-to-month
- InternetService = DSL
- tenure < 3.5 months
- PaymentMethod = Bank transfer

The final node number 119, which represents churners amounting to 5% of the total, is characterised by:
- Contract = Month-to-month
- InternetService = Fiber optic
- tenure < 55 months
- MultipleLines = Yes

fancyRpartPlot(Telco.tree_p, palettes = c("Greys", "Oranges"), sub="Pruned classification tree")
fancyRpartPlot(Telco.tree_p, palettes = c("Greys", "Oranges"), sub="Pruned classification tree", type = 5)
First of all, we predict the value of the target variable directly on the training and test sets.
Then we calculate the ROC curves, which show the diagnostic ability of a model by plotting the true positive rate against the false positive rate for different thresholds of the class predictions.
Finally we visualise both ROC curves on a plot, together with the values of the Gini coefficients.
A Gini value equal to 1 means that the model predicts or classifies perfectly.
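A minimal sketch of this step for the pruned tree (the object names prob.tree_p.test and ROC.test.tree_p are assumptions; roc() and auc() come from the pROC package, and the Gini coefficient follows from Gini = 2*AUC - 1):
# predicted churn probabilities of the pruned tree on the test set
prob.tree_p.test <- predict(Telco.tree_p, Telco.test, type = "prob")[, "Yes"]
ROC.test.tree_p  <- roc(as.numeric(Telco.test$Churn == "Yes"), prob.tree_p.test)
2 * auc(ROC.test.tree_p) - 1   # Gini coefficient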
As we already said before, the original tree is overfitted.
Bagging is used when the aim is to reduce a decision tree's variance.
The idea is to create several subsets of the data by randomly sampling the training set with replacement.
Each subset is then used to train its own decision tree, so we end up with a collection of different models.
Thus, we estimated 19 different models and made predictions on the training and test sets for each of them.
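The loop that fits the 19 bootstrap trees is not shown in the output; a sketch of what it could look like is given below (the matrix names Telco.pred_bag_train and Telco.pred_bag_test are taken from the aggregation step that follows, everything else is an assumption):
# assumed reconstruction of the bagging loop
n_bags <- 19
set.seed(123)
Telco.pred_bag_train <- matrix(NA, nrow(Telco.train), n_bags)
Telco.pred_bag_test  <- matrix(NA, nrow(Telco.test), n_bags)
for (b in 1:n_bags) {
  # bootstrap sample of the training set (sampling with replacement)
  boot_idx <- sample(nrow(Telco.train), replace = TRUE)
  tree_b <- rpart(model1.formula,
                  data = Telco.train[boot_idx, ],
                  method = "class")
  # store the predicted probability of "Yes" for each observation
  Telco.pred_bag_train[, b] <- predict(tree_b, Telco.train, type = "prob")[, "Yes"]
  Telco.pred_bag_test[, b]  <- predict(tree_b, Telco.test, type = "prob")[, "Yes"]
}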
Now we generate the aggregated prediction with majority voting.
Telco.pred_bag_train_final <-
ifelse(rowSums(Telco.pred_bag_train < 0.5) > 19/2,
"No", "Yes") %>%
factor(., levels = c("No", "Yes"))
Telco.pred_bag_test_final <-
ifelse(rowSums(Telco.pred_bag_test < 0.5) > 19/2,
"No", "Yes") %>%
factor(., levels = c("No", "Yes"))In comparison than before models, the prediction on test data has an accuracy grather than that on train data: the model is not overfitted!
confusionMatrix(data = Telco.pred_bag_train_final,
reference = Telco.train$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3243 596
## Yes 379 713
##
## Accuracy : 0.8023
## 95% CI : (0.7909, 0.8133)
## No Information Rate : 0.7345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4646
##
## Mcnemar's Test P-Value : 4.596e-12
##
## Sensitivity : 0.5447
## Specificity : 0.8954
## Pos Pred Value : 0.6529
## Neg Pred Value : 0.8448
## Prevalence : 0.2655
## Detection Rate : 0.1446
## Detection Prevalence : 0.2215
## Balanced Accuracy : 0.7200
##
## 'Positive' Class : Yes
##
confusionMatrix(data = Telco.pred_bag_test_final,
reference = Telco.test$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1385 244
## Yes 167 316
##
## Accuracy : 0.8054
## 95% CI : (0.7879, 0.8221)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 2.048e-14
##
## Kappa : 0.4777
##
## Mcnemar's Test P-Value : 0.0001777
##
## Sensitivity : 0.5643
## Specificity : 0.8924
## Pos Pred Value : 0.6542
## Neg Pred Value : 0.8502
## Prevalence : 0.2652
## Detection Rate : 0.1496
## Detection Prevalence : 0.2287
## Balanced Accuracy : 0.7283
##
## 'Positive' Class : Yes
##
In order to compare the model above with a model estimated without any cross-validation process, we fit a logistic regression and use the classical confusion matrix to study its accuracy.
As we can see in the output below, the approach without cross-validation provides almost the same results.
ctrl_nocv <- trainControl(method = "none")
Telco.logit <-
train(model1.formula,
data = Telco.train,
method = "glm",
family = "binomial",
trControl = ctrl_nocv)
ROC.test.logit <-
roc(as.numeric(Telco.test$Churn == "Yes"),
predict(Telco.logit, newdata = Telco.test, type = "prob")[, "Yes"])
ROC.train.logit <-
roc(as.numeric(Telco.train$Churn == "Yes"),
predict(Telco.logit, newdata = Telco.train, type = "prob")[, "Yes"])
confusionMatrix(data = predict(Telco.logit,
newdata = Telco.train),
reference = Telco.train$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3246 600
## Yes 376 709
##
## Accuracy : 0.8021
## 95% CI : (0.7907, 0.8131)
## No Information Rate : 0.7345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4631
##
## Mcnemar's Test P-Value : 9.466e-13
##
## Sensitivity : 0.5416
## Specificity : 0.8962
## Pos Pred Value : 0.6535
## Neg Pred Value : 0.8440
## Prevalence : 0.2655
## Detection Rate : 0.1438
## Detection Prevalence : 0.2200
## Balanced Accuracy : 0.7189
##
## 'Positive' Class : Yes
##
confusionMatrix(data = predict(Telco.logit,
newdata = Telco.test),
reference = Telco.test$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1387 246
## Yes 165 314
##
## Accuracy : 0.8054
## 95% CI : (0.7879, 0.8221)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 2.048e-14
##
## Kappa : 0.4764
##
## Mcnemar's Test P-Value : 7.943e-05
##
## Sensitivity : 0.5607
## Specificity : 0.8937
## Pos Pred Value : 0.6555
## Neg Pred Value : 0.8494
## Prevalence : 0.2652
## Detection Rate : 0.1487
## Detection Prevalence : 0.2268
## Balanced Accuracy : 0.7272
##
## 'Positive' Class : Yes
##
We calculate the ROC curves based on the churn probabilities averaged over the bagged trees.
In order to compare the two models, we plot a graph showing both ROC curves.
The graph shows that the differences between them are not very pronounced.
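A sketch of this computation (the object names ROC.train.bag and ROC.test.bag are assumptions; roc() is from pROC):
# average the 19 per-tree churn probabilities and build the ROC curves
ROC.train.bag <- roc(as.numeric(Telco.train$Churn == "Yes"),
                     rowMeans(Telco.pred_bag_train))
ROC.test.bag  <- roc(as.numeric(Telco.test$Churn == "Yes"),
                     rowMeans(Telco.pred_bag_test))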
An extension of bagging is the Random Forest.
In addition to using random subsets of the data, it also selects a random subset of the features at each split, rather than using all features to grow the trees.
The advantage of this technique is that it handles high-dimensional data very well and can cope with missing values while maintaining accuracy.
On the other hand, a drawback is that, since the final forecast is based on averaging the predictions of the individual trees, it will not provide exact values in a regression setting.
Let's build our model, tuning mtry with out-of-bag (OOB) resampling in order to choose the best value.
The output tells us that the final value used for the model was mtry = 10, i.e. the number of variables randomly sampled as candidates at each split.
parameters_rf <- expand.grid(mtry = 2:14)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)
set.seed(123)
Telco.rf <- train(model1.formula,
data = Telco.train,
method = "rf",
ntree = 100,
nodesize = 100,
tuneGrid = parameters_rf,
trControl = ctrl_oob,
importance = TRUE)
Telco.rf
## Random Forest
##
## 4931 samples
## 18 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7779355 0.2982673
## 3 0.7888866 0.3623904
## 4 0.7945650 0.4064827
## 5 0.7988238 0.4353900
## 6 0.7992294 0.4347173
## 7 0.8012574 0.4504364
## 8 0.7992294 0.4468656
## 9 0.8002434 0.4445643
## 10 0.8014602 0.4470701
## 11 0.7988238 0.4431211
## 12 0.8014602 0.4534413
## 13 0.8002434 0.4430912
## 14 0.7990266 0.4420664
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
Moreover, we visualise the chosen value of mtry with a dashed blue line in the following two plots.
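The plotting code is not included in the output; a minimal sketch of how the chosen mtry could be marked (using caret's ggplot method for train objects) is:
# tuning profile with the selected mtry highlighted by a dashed blue line
ggplot(Telco.rf) +
  geom_vline(xintercept = Telco.rf$bestTune$mtry,
             linetype = "dashed", colour = "blue")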
Then we use the classical confusion matrix to study the model accuracy.
confusionMatrix(data = predict(Telco.rf,
newdata = Telco.train),
reference = Telco.train$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3304 594
## Yes 318 715
##
## Accuracy : 0.815
## 95% CI : (0.8039, 0.8258)
## No Information Rate : 0.7345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4915
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5462
## Specificity : 0.9122
## Pos Pred Value : 0.6922
## Neg Pred Value : 0.8476
## Prevalence : 0.2655
## Detection Rate : 0.1450
## Detection Prevalence : 0.2095
## Balanced Accuracy : 0.7292
##
## 'Positive' Class : Yes
##
confusionMatrix(data = predict(Telco.rf,
newdata = Telco.test),
reference = Telco.test$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1400 280
## Yes 152 280
##
## Accuracy : 0.7955
## 95% CI : (0.7776, 0.8125)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 5.410e-11
##
## Kappa : 0.4337
##
## Mcnemar's Test P-Value : 9.945e-10
##
## Sensitivity : 0.5000
## Specificity : 0.9021
## Pos Pred Value : 0.6481
## Neg Pred Value : 0.8333
## Prevalence : 0.2652
## Detection Rate : 0.1326
## Detection Prevalence : 0.2045
## Balanced Accuracy : 0.7010
##
## 'Positive' Class : Yes
##
Ranger is a fast implementation of random forests and recursive partitioning, particularly suited for high-dimensional data.
It supports classification, regression and survival forests.
We tune three parameters (mtry, splitrule and min.node.size) and apply a resampling method with 5-fold cross-validation.
The chosen ranger random forest model uses mtry = 7 and min.node.size = 200.
parameters_ranger <- expand.grid(mtry = 2:14,
splitrule = "gini",
min.node.size = c(50, 100, 200))
ctrl_cv5 <- trainControl(method = "cv",
number = 5,
classProbs = T)
set.seed(123)
Telco.rf.ranger <- train(model1.formula,
data = Telco.train,
method = "ranger",
num.trees = 100,
num.threads = 3,
importance = "impurity",
tuneGrid = parameters_ranger,
trControl = ctrl_cv5)
print(Telco.rf.ranger)
## Random Forest
##
## 4931 samples
## 18 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3945, 3945, 3944, 3945, 3945
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 2 50 0.7838156 0.3147776
## 2 100 0.7801638 0.2945094
## 2 200 0.7755020 0.2653441
## 3 50 0.7965934 0.4078623
## 3 100 0.7905083 0.3769035
## 3 200 0.7911159 0.3670282
## 4 50 0.7939565 0.4137413
## 4 100 0.7969973 0.4158929
## 4 200 0.7929415 0.3927668
## 5 50 0.7955801 0.4276405
## 5 100 0.7972022 0.4290590
## 5 200 0.7953743 0.4100288
## 6 50 0.7963902 0.4321922
## 6 100 0.7967961 0.4300578
## 6 200 0.7996334 0.4261647
## 7 50 0.7955792 0.4311877
## 7 100 0.7994318 0.4392973
## 7 200 0.8022697 0.4416368
## 8 50 0.7965926 0.4350400
## 8 100 0.7998374 0.4391814
## 8 200 0.7990263 0.4351171
## 9 50 0.7992299 0.4447359
## 9 100 0.7998379 0.4439872
## 9 200 0.8004455 0.4413977
## 10 50 0.7955797 0.4329162
## 10 100 0.7969985 0.4380967
## 10 200 0.7974032 0.4337527
## 11 50 0.7933499 0.4278833
## 11 100 0.7994318 0.4448631
## 11 200 0.7984180 0.4407784
## 12 50 0.7984190 0.4396406
## 12 100 0.7965932 0.4387845
## 12 200 0.7990244 0.4396662
## 13 50 0.7963912 0.4339012
## 13 100 0.7990279 0.4434905
## 13 200 0.7996328 0.4445237
## 14 50 0.7996350 0.4438419
## 14 100 0.7976074 0.4382475
## 14 200 0.7976050 0.4386939
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7, splitrule = gini
## and min.node.size = 200.
plot(Telco.rf.ranger, xlab = "Random Predictors", main = "Random Predictor Accuracy")
Then we use the classical confusion matrix to study the model accuracy.
confusionMatrix(data = predict(Telco.rf.ranger,
newdata = Telco.train),
reference = Telco.train$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3323 619
## Yes 299 690
##
## Accuracy : 0.8138
## 95% CI : (0.8027, 0.8246)
## No Information Rate : 0.7345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4822
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5271
## Specificity : 0.9174
## Pos Pred Value : 0.6977
## Neg Pred Value : 0.8430
## Prevalence : 0.2655
## Detection Rate : 0.1399
## Detection Prevalence : 0.2006
## Balanced Accuracy : 0.7223
##
## 'Positive' Class : Yes
##
confusionMatrix(data = predict(Telco.rf.ranger,
newdata = Telco.test),
reference = Telco.test$Churn,
positive = "Yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1419 293
## Yes 133 267
##
## Accuracy : 0.7983
## 95% CI : (0.7805, 0.8152)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : 6.491e-12
##
## Kappa : 0.4304
##
## Mcnemar's Test P-Value : 1.323e-14
##
## Sensitivity : 0.4768
## Specificity : 0.9143
## Pos Pred Value : 0.6675
## Neg Pred Value : 0.8289
## Prevalence : 0.2652
## Detection Rate : 0.1264
## Detection Prevalence : 0.1894
## Balanced Accuracy : 0.6955
##
## 'Positive' Class : Yes
##
Finally, we make a comparison between the two random forest models.
set.seed(123)
pred.train.rf <- predict(Telco.rf,
Telco.train,
type = "prob")[, "Yes"]
ROC.train.rf <- roc(as.numeric(Telco.train$Churn == "Yes"),
pred.train.rf)
pred.test.rf <- predict(Telco.rf,
Telco.test,
type = "prob")[, "Yes"]
ROC.test.rf <- roc(as.numeric(Telco.test$Churn == "Yes"),
pred.test.rf)
pred.train.rf.ranger <- predict(Telco.rf.ranger,
Telco.train,
type = "prob")[, "Yes"]
ROC.train.rf.ranger <- roc(as.numeric(Telco.train$Churn == "Yes"),
pred.train.rf.ranger)
pred.test.rf.ranger <- predict(Telco.rf.ranger,
Telco.test,
type = "prob")[, "Yes"]
ROC.test.rf.ranger <- roc(as.numeric(Telco.test$Churn == "Yes"),
pred.test.rf.ranger)
Both models seem overfitted, but we can also notice that the ranger random forest achieves a slightly better accuracy on the test data.
Next, we present them on the plot.
The ranger random forest improved our model: in fact, it presents a Gini index greater than its competitor's.
In order to compare all the models created in our project, we decided to visualize all the ROC curves.
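A sketch of how two of these test-set ROC curves can be overlaid (using pROC's plot and lines methods; the colours and legend are arbitrary choices, not the report's original styling):
# overlay two test-set ROC curves for a visual comparison
plot(ROC.test.rf, col = "grey40", legacy.axes = TRUE,
     main = "ROC curves on test data")
lines(ROC.test.rf.ranger, col = "darkorange")
legend("bottomright", legend = c("rf", "rf.ranger"),
       col = c("grey40", "darkorange"), lty = 1)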
Then, we created a table to report the Gini values and accuracies obtained from all our models:
|  | tree | tree_p | bagging | logit | rf | rf.ranger |
|---|---|---|---|---|---|---|
| Gini.Train | 73.4% | 63.8% | 69.2% | 69.2% | 72% | 74.2% |
| Gini.Test | 65.1% | 60.9% | 69% | 68.9% | 67.4% | 68.9% |
| Train Accuracy | 81.14% | 80.33% | 80.23% | 80.21% | 81.58% | 81.5% |
| Test Accuracy | 79.07% | 79.45% | 80.54% | 80.54% | 79.5% | 79.64% |
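The Gini entries in the table can be filled from the stored ROC objects via Gini = 2*AUC - 1; a small helper (the function name gini_from_roc is an assumption) makes this explicit:
# Gini coefficient of a pROC roc object
gini_from_roc <- function(roc_obj) 2 * as.numeric(auc(roc_obj)) - 1
gini_from_roc(ROC.test.rf.ranger)   # e.g. the rf.ranger entry on test data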