Telecom Customer Churn

Problem statement and findings

Problem statement

Mobile operator Telecom is struggling with customer churn, which means that the company is losing too many customers.
The company does not know what makes customers to leave and neither whether it is random or they have something in common.
It is much cheaper to retain a customer than to gain a new one, but Telecom finds it challenging to come up with an approach to it.

Objective

Telecom observed a set of customers for a certain period of time and provided the information whether the customer churned or not. Along with this information, Telecom provided several other information that could effect the churn.

The dataset is downloadable here: link

The objective was to look for general patterns in the provided data that would help to identify customers that are more likely to churn, before they do that, so the company could adress them and prevent their churn.

Project duration

Given dataset had 20 columns and 7043 rows. It took me 3 weeks to complete the project.

Business benefits

The graph demonstrates how taking advantage of data can help Telecom to identify the customers that are most likely to churn and give the company opportunity to prevent it. The blue diagonal line represents the random scenario (without using a model). In the test set 467 customers out of 1758 churned, which approximately 26.56 %. If the company paid extra attention to randomly chosen 50 % of the customers, only 50 % of those 467 would have been addressed.

The red diagonal line represents scenario using the delivered model of logistic regression. When the company pays attention to 50 % of the customers with the highest probability of churn computed by the model, 86 % of those 467 will be addressed.
To address these 36 % extra customers that will churn without the model, Telecom would have pay for approaching another 36 % customers.
So in case of addressing 50 % customers, the savings from using the model represents the cost of addressing 36 % customers.
As a matter of fact, the more is the curve representing the model scenario in the top left corner, the better is the model.

Another option of evaluating the model is plotting the receiver operating characteristic curve, as shown below.

ROC Curve is used to see how well the classifier can separate positive and negative cases.
On the x axis is false positive rate, which is the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events.
on the y axis is true positive rate, which it the ratio between the number of correctly categorized positive events and the total number of actual positive events.
The diagonal line represents the random model, that means scenario without using information about patterns found in the data.
The curve represents the scenario, in which is the churn (positive class) predicted by the obtianed model (the delivered logistic regression).

The area under the ROC curve (AUROC) should be between 0.5 and 1.0. This area is a measure of the predictive accuracy of a model. An AUROC equal to 0.5 (i.e. coinciding with the diagonal) indicates the random classification model. As a matter of fact, this area should be greater than 0.5 for a model to be acceptable; a model with AUROC of 0.5 or less is worthless.

AUROC for the delivered logistic model was computed:

## [1] "Area under the receiver operating characteristic curve is 0.83"

Taking into consideration that area under the ROC curve of a perfect model is 1, this model is predicting the churn very well.

For ilustration below is the confusion matrix computed from the 1758 cases used for evaluating the model.

##      y_pred_LR
##         0   1
##   No  925 366
##   Yes  94 373

In the first row and first column here is the number of correctly classified customers that did not churn at the end of the given period of time. In the first row and the second column here is the number of customers that were classified to churn, but they did not. In the second row and the first column here is the number of customers that were classified to not churn, but they did. This is the number we want to minimize without allowing the false positive rate to increase too much as this can add to marketing costs. In the second row and the second column here is the number of correctly classified customers that churned.

Except the information whether a customer is likely to churn or not, I found out which features effect this most.

According to the plot of the decision tree those features are type of contract, internet provider and tenure.
Furthemore the numbers in nodes represent the probability of churn. The top nod has number 0.50. Since I used a sampling method to get a training set with balanced classes, without using any model there is a 50 % chance for each customer to churn.

As we move down the plot, we can see that probability to churn of customers that have one or two years contract is only 0.17, for others it is 0.67. From those, in case they do not have internet service or they have DSL, the probability of churn is 0.52, otherwise for those whose internet service provider is Fiber optic (the last option of internet service) it is 0.77.

Customers whose internet service provider is not Fiber optic and are customers of Telecom at least 12 months have probability to churn 0.34, otherwise it is 0.62.

This was just a look at the most important features generated by decision tree algorithm. Now let’s take a look at the delivered logistic model itself.

General formula for logistic regression is:

ln(p/(1-p)) = b0 + b1x1+ … + bnxn.

Coeffitients b0,b1, …, bn are computed by logistic regression model after training it on a training dataset, where b0 is a constant, which means that is the same for every case (customer) and x1,x2, …, xn are the values of concrete features of the particular case (customer). In this case the solution of the formula is probability that the given customer will churn.

The formula of our logistic regression is

ln(p/(1-p)) = - 1.19163 - 0.34180 * PhoneServiceYes + 0.36673 * MultipleLinesYes + 0.93406 * InternetServiceFiber optic -
- 0.91576 * InternetServiceNo - 0.38856 * OnlineSecurityYes - 0.41665 * TechSupportYes + 0.31064 * StreamingMoviesYes -
- 0.68940 * ContractOne year - 1.74532 * ContractTwo year + 0.21014 * PaperlessBillingYes -
- 0.01245 * PaymentMethodCredit card (automatic) + 0.41793 * PaymentMethodElectronic check -0.08411 * PaymentMethodMailed check +
+ 2.00364 * tenure_group0-12 Months + 0.91525 * tenure_group12-24 Months + 0.63350 * tenure_group24-48 Months +
+ 0.40948 * tenure_group48-60 Months

The model predicts customers with probabilities lower than 0.5 to not churn, and customers with probabilities with 0.5 and higher to churn.

Let’s now take one of our customers from the test set and compute the probability of leaving.

When we fit into the equation the customers information we get:

ln(p/(1-p)) = - 1.19163 - 0.34180 * 0 + 0.36673 * 0 + 0.93406 * 0 -
- 0.91576 * 0 - 0.38856 * 0 - 0.41665 * 0 + 0.31064 * 0 -
- 0.68940 * 0 - 1.74532 * 0 + 0.21014 * 1 - 0.01245 * 0 +
+ 0.41793 * 1 - 0.08411 * 0 + 2.00364 * 1 + 0.91525 * 0 +
+ 0.63350 * 0 + 0.40948 * 0

Computed probability is

##       52 
## 0.607221

As we can see from the table, this customer churned and it was also predicted by our model, since computed p was higher than 0.5.

Let’s now look at our formula of logistic regression and think about how can we take advantage of knowing what the probabilit of churn actually effects.

The b1 coefficient has negative sign, which means that if a customer has a phone service the probability of churn is lower then probability of customer that has not a phone service and the other features are the same in both cases.

In case of our customer we fitted into the formula the x1 is 0, because as we can see from the table, the customer does not have a phone service.

Every other feature also has given available values and works on the same principle.

The another coefficient says if a customer has multiple lines, the probability of churn is higher. We can not effect this, but we can pay extra attention to those, who have them, since the coefficient is quite high.

The next two coefficients are related to internet service. Customers whose internet provider is Fiber optic are more likely to leave and as we saw in the decission tree, internet provider feature is the second most important feature. That is why customers with Fiber optic provider need extra attention. On the other hand having no internet service is lowering the probability of churn.

Having online security and technical support lowers the probability of churn as well, whereas having streaming movies increases it.

Next two coefficients are related to contract. Options for contract are month-to-month, that has no effect on the probability of churn, both one year and two years contract lowers the probability, where two years tremendiously. It is the most important feature, as we also saw on the decission tree plot. Contract can Telecom effect and it is highly recommended to make sure that customers has two years contract.

The next coefficient says that the paperless billing increases the churn probability, so it is not recommended to provide it to customers.

Next three coefficients are related to customer’s payment method. They are not that important except electronic card method which is increasing the churn probability. Again paying extra attention to these customers are recommended.

The last four coefficients are related to tenure. It is very important to pay extra attention to customers that are with Telecom less then one year. But also after that period, they need extra attention.

My approach

Top 5 rows of the dataset are showed below:

head(dataset, 5)

The data preparation, data exploration and data modeling was performed in RStudio.

Dataset contained 11 customers, that were missing information about Total Charges and since it was such few of them they were excluded from the dataset. Information such ‘No internet service’ and ‘No phone service’ where rewritten to ‘No’.

Feature ‘total charges’ were excluded from the dataset, because it is a combination of ‘monthly charges’ and ‘tenure’ and in case it had been kept in the model, the model would not perform well.

Data exploration revealed that most of the provided featers probably effect whether the customer churned or not after the observed period of time. Because the class sizes were very imbalanced, as we can see on the graph below, the training set were sampled right after the split (70 % cases were used for training set - dataset used for building the model, and 30 % case were used for test set - dataset used for evaluation models performance).

In case of not sampling the dataset the accuracy of logistic regression (which is most suitable algorithm for this problem) would be poor, because this makes model more likely to predict the majority class, no matter how well is the modeling done. With logistic regression, the cutoff point can be between 0 and 1. As the cutoff point moves away from 0.5 in either direction, the false positive or false negative rate is being changed.

There are pros and cons with both approaches, i.e. tighter parameter estimates vs a more accurate predictive model. In exploratory modeling the higher precision in the relationship between the parameters is wanted, whereas in predictive modeling it is the lower error rate. The sampled training set was used (graph of the proportions is below), in order to predict model that would minimize false negatives, i. e. ‘catch’ the most people that will churn as possible.

The final subset of features for modeling was obtained using function stepAIC with parameter k adjusted to select variables with p < 0.05.

Listed features were identified as important:

PhoneService,
MultipleLines,
InternetService,
OnlineSecurity,
TechSupport,
StreamingMovies,
Contract,
PaperlessBilling,
PaymentMethod,
tenure_group.

This list leaves as unimportant featues that were originally based on exploratory analasys taken into consideration:

SeniorCitizen,
Partner,
Dependents,
OnlineBackup,
DeviceProtecion,
StreamingTV.

Before modeling the dataset with another algorithms the feature tenure was binned into 5 bins: “0–12 Months”, “12–24 Months”, “24–48 Months”, “48–60 Month”, “> 60 Month” and the feature tenure was excluded from the training set.

Several algorithms were used for data modeling: decision tree, logistic regression, support vector machine and kernel support vector machine. Kernel support vector machine was used in case that the dataset was not linearly separable, but since it did not performed better than the others algorithms, linear separability was assumed and both logistic regression and support vector machine were taken into consideration. The dataset was randomly splitted into a training set, that contained 75 % of the customers and was used to train the models, and a test set, that contained reamining 25 % of the customers and was used to check the models performance.

When choosing the best model two criterias were taken into consideration. The first one the was cross validation and the second one was number of people that churned, even though the model predicted them to stay.

The cross validation is used as a reassurence that the model is not overfitting.

Since Telecom does not want to just find the most accurate model (the model that classify correctly the customers that will churn and those who will churn not) but besides that also the model that will classify correctly most of those, that will churn if the company will not prevent it (that will minimize Type || errors), because it is cheaper to prevent a customer to churn than to find a new customer. This number is obtained by computing a confusion matrix on the test dataset. For ilustration here is the one for logistic regression.

##      y_pred_LR
##         0   1
##   No  925 366
##   Yes  94 373

After evaluation of the models support vector machine and logistic regression was considered as the final models. Logistic regression had slightly higher accuracy but according to SVM’s confussion matrix (showed below), SVM did slightly better in predicting customers that will churn.

##    y_pred_SVM
##      No Yes
##   0 832 459
##   1  82 385

Support vector machine was tried to improve, but tuning revealed that the model performs best with parameter c = 1, so improvement was not possible, since this is the default value.

Eventually the logistic regression algorithm were chosen as the best solution to implement for its easier interpretability and ability to predict probabilities of churn. This means that it is possible to compute the probability of a customer to churn and sort the customers by it. Thanks to that Telecom has option to pay extra attention to for example 100 customers (the number might depends on resources) with the highest probability of churn.

The R code

dataset <- read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv', na.strings = c('','?'))

Data preprocessing

Missing values, wrong values, data types

str(dataset)

## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

summary(dataset)

##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies
##  No                 :2810   No                 :2785  
##  No internet service:1526   No internet service:1526  
##  Yes                :2707   Yes                :2732  
##                                                       
##                                                       
##                                                       
##                                                       
##            Contract    PaperlessBilling                   PaymentMethod 
##  Month-to-month:3875   No :2872         Bank transfer (automatic):1544  
##  One year      :1473   Yes:4171         Credit card (automatic)  :1522  
##  Two year      :1695                    Electronic check         :2365  
##                                         Mailed check             :1612  
##                                                                         
##                                                                         
##                                                                         
##  MonthlyCharges    TotalCharges    Churn     
##  Min.   : 18.25   Min.   :  18.8   No :5174  
##  1st Qu.: 35.50   1st Qu.: 401.4   Yes:1869  
##  Median : 70.35   Median :1397.5             
##  Mean   : 64.76   Mean   :2283.3             
##  3rd Qu.: 89.85   3rd Qu.:3794.7             
##  Max.   :118.75   Max.   :8684.8             
##                   NA's   :11

Computing the percentage of missing values

nrow(dataset[!complete.cases(dataset),])/nrow(dataset) * 100

## [1] 0.1561834

Looking for missing values

sapply(dataset, function(x) sum(is.na(x)))

##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

Deleting incomplete rows, since it is so few of them

dataset <- dataset[complete.cases(dataset),]
summary(dataset)

##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3483   Min.   :0.0000   No :3639   No :4933  
##  0003-MKNFE:   1   Male  :3549   1st Qu.:0.0000   Yes:3393   Yes:2099  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1624                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7026                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 1.00   No : 680     No              :3385   DSL        :2416  
##  1st Qu.: 9.00   Yes:6352     No phone service: 680   Fiber optic:3096  
##  Median :29.00                Yes             :2967   No         :1520  
##  Mean   :32.42                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3497   No                 :3087  
##  No internet service:1520   No internet service:1520  
##  Yes                :2015   Yes                :2425  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3094    No                 :3472  
##  No internet service:1520    No internet service:1520  
##  Yes                :2418    Yes                :2040  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies
##  No                 :2809   No                 :2781  
##  No internet service:1520   No internet service:1520  
##  Yes                :2703   Yes                :2731  
##                                                       
##                                                       
##                                                       
##                                                       
##            Contract    PaperlessBilling                   PaymentMethod 
##  Month-to-month:3875   No :2864         Bank transfer (automatic):1542  
##  One year      :1472   Yes:4168         Credit card (automatic)  :1521  
##  Two year      :1685                    Electronic check         :2365  
##                                         Mailed check             :1604  
##                                                                         
##                                                                         
##                                                                         
##  MonthlyCharges    TotalCharges    Churn     
##  Min.   : 18.25   Min.   :  18.8   No :5163  
##  1st Qu.: 35.59   1st Qu.: 401.4   Yes:1869  
##  Median : 70.35   Median :1397.5             
##  Mean   : 64.80   Mean   :2283.3             
##  3rd Qu.: 89.86   3rd Qu.:3794.7             
##  Max.   :118.75   Max.   :8684.8             
##

Changing ‘No internet servise’ into ‘No’ in 6 variables

dataset[,c(10:15)] <- apply(dataset[,c(10:15)], 2, function(x) (gsub('No internet service', 'No', x)))
dataset[,c(10:15)]<- lapply(dataset[,c(10:15)], as.factor)
str(dataset)

## 'data.frame':    7032 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Changing ‘No phone service’ into ‘No’ in MultipleLines variable

dataset$MultipleLines <- gsub('No phone service', 'No', dataset$MultipleLines)
dataset$MultipleLines <- as.factor(dataset$MultipleLines)
str(dataset)

## 'data.frame':    7032 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Changing variable SeniorCitizen from 0,1 to ‘No’, ‘Yes’

library(plyr)
dataset$SeniorCitizen <- mapvalues(from = c('0', '1'), 
                                   to = c('No', 'Yes'), dataset$SeniorCitizen)
dataset$SeniorCitizen <- as.factor(dataset$SeniorCitizen)

Removing the columns I do not need for the analysis

dataset$customerID <- NULL

Exploratory data analysis

str(dataset)

## 'data.frame':    7032 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Looking at the proportion of customers that churned

library(ggplot2)
ggplot(dataset, aes(x = Churn, fill = Churn)) +
  geom_bar() +
  theme(legend.position="none")

Unevenly sized classes => sampling of the training set is needed.

Identifying and correcting collinearity among numeric variables

Dropping dependent variable for calculating Multicollinearity

dataset_independent = subset(dataset, select = -c(Churn))

Identifying numeric variables

dataset_independent_numeric <- dataset_independent[sapply(dataset_independent, is.numeric)]

Calculating Correlation

descrCor <- cor(dataset_independent_numeric)

Print correlation matrix and look at max correlation

print(descrCor)

##                   tenure MonthlyCharges TotalCharges
## tenure         1.0000000      0.2468618    0.8258805
## MonthlyCharges 0.2468618      1.0000000    0.6510648
## TotalCharges   0.8258805      0.6510648    1.0000000

rm(descrCor)
rm(dataset_independent)
rm(dataset_independent_numeric)

MonthlyCharges and TotalCharges are highly correlated, one of them must be dropped.

Dropping TotalCharges, since it is function of MonthlyCharges and Tenure group

dataset$TotalCharges <- NULL

Looking for outliers in MonthlyCharges with help of standardized values

Computing of the standardized values

zcharges <- scale(dataset$MonthlyCharges, scale = TRUE)

Looking for values higher then 3 and lower then -3

zcharges_dec <- sort(zcharges, decreasing = TRUE)
zcharges_asc <- sort(zcharges, decreasing = FALSE)
head(zcharges_dec, 20)

##  [1] 1.793254 1.789930 1.788268 1.788268 1.779959 1.774973 1.761678
##  [8] 1.755030 1.751706 1.750044 1.746721 1.741735 1.740073 1.733425
## [15] 1.730102 1.728440 1.726778 1.721792 1.721792 1.720130

head(zcharges_asc, 20)

##  [1] -1.547173 -1.542187 -1.537202 -1.532216 -1.532216 -1.530554 -1.528892
##  [8] -1.528892 -1.528892 -1.528892 -1.528892 -1.528892 -1.528892 -1.527230
## [15] -1.527230 -1.527230 -1.527230 -1.527230 -1.525568 -1.525568

No outliers present.

rm(zcharges)
rm(zcharges_dec)
rm(zcharges_asc)

Patterns exploration

Boxplot for continuous variable

ggplot(dataset, aes(x = Churn, y = MonthlyCharges)) +
  geom_boxplot()

Expecting MonthlyCharges to be predictor.

Proportion of churn for categorical variable

str(dataset)

## 'data.frame':    7032 obs. of  19 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

p_gender <- ggplot(dataset, aes(x = gender, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count') 

p_SeniorCitizen <- ggplot(dataset, aes(x = SeniorCitizen, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_partner <- ggplot(dataset, aes(x = Partner, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_Dependents <- ggplot(dataset, aes(x = Dependents, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

library(Rmisc)
multiplot(p_gender, p_SeniorCitizen, p_partner, p_Dependents, cols = 2)

Expecting Partner, SeniorCitizen and Dependents to be predictors.

p_PhoneService <- ggplot(dataset, aes(x = PhoneService, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count') 

p_MultipleLines <- ggplot(dataset, aes(x = MultipleLines, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_InternetService <- ggplot(dataset, aes(x = InternetService, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_OnlineSecurity <- ggplot(dataset, aes(x = OnlineSecurity, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

multiplot(p_PhoneService, p_MultipleLines, p_InternetService, p_OnlineSecurity, cols = 2)

Expecting InternetService, and OnlineSecurity to be predictors.

p_OnlineBackup <- ggplot(dataset, aes(x = OnlineBackup, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count') 

p_DeviceProtection <- ggplot(dataset, aes(x = DeviceProtection, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_TechSupport <- ggplot(dataset, aes(x = TechSupport, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_StreamingTV <- ggplot(dataset, aes(x = StreamingTV, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

multiplot(p_OnlineBackup, p_DeviceProtection, p_TechSupport, p_StreamingTV, cols = 2)

Expecting OnlineBackup, TechSupport, DeviceProtection and StreamingTV to be predictors.

p_StreamingMovies <- ggplot(dataset, aes(x = StreamingMovies, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count') 

p_Contract <- ggplot(dataset, aes(x = Contract, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_PaperlessBilling <- ggplot(dataset, aes(x = PaperlessBilling, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count')

p_PaymentMethod <- ggplot(dataset, aes(x = PaymentMethod, fill = Churn)) +
  geom_bar(position = 'fill') +
  ylab('Count') +
  coord_flip()

multiplot(p_StreamingMovies, p_Contract, p_PaperlessBilling, p_PaymentMethod, cols = 2)

Expecting StreamingMovies, PaperlessBilling, Contract and PaymentMethod to be predictors.

Considering changing Tenure into factor

library(ggplot2)
ggplot(dataset, aes(x = as.factor(tenure))) +
  geom_bar()

min(dataset$tenure)

## [1] 1

max(dataset$tenure)

## [1] 72

Since the minimum tenure is 1 month and maximum tenure is 72 months, we can group them into five tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Months”, “48–60 Month”, “> 60 Month”.

group_tenure <- function(tenure){
  if (tenure >= 0 & tenure <= 12){
    return('0-12 Months')
  }else if(tenure > 12 & tenure <= 24){
    return('12-24 Months')
  }else if (tenure > 24 & tenure <= 48){
    return('24-48 Months')
  }else if (tenure > 48 & tenure <=60){
    return('48-60 Months')
  }else if (tenure > 60){
    return('> 60 Months')
  }
}

dataset$tenure_group <- sapply(dataset$tenure,group_tenure)
dataset$tenure_group <- as.factor(dataset$tenure_group)

ggplot(dataset, aes(x = tenure_group, fill = Churn)) +
  geom_bar(position = 'fill')

Expecting tenure_group to be a predictor.

str(dataset)

## 'data.frame':    7032 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
##  $ tenure_group    : Factor w/ 5 levels "> 60 Months",..: 2 4 2 4 2 2 3 2 4 1 ...

Training and test set preparation

Splitting dataset into training set and test set

library(caTools)
set.seed(123)
split = sample.split(dataset$Churn, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Looking at the proportion of customers that churned

prop_before <- ggplot(training_set, aes(x = Churn, fill = Churn)) +
                 geom_bar() +
                 theme(legend.position="none")
prop_before

Sampling the training set in order to get evenly sized classes with ROSE

library(ROSE)
training_set <- ROSE(Churn ~ ., data=training_set, seed=123)$data

Looking at the proportion of customers that churned after sampling

prop_after <- ggplot(training_set, aes(x = Churn, fill = Churn)) +
               geom_bar() +
               theme(legend.position="none")
prop_after

Exploring the patterns in dataset using Decission Tree

Dropping tenure_group variable for DT

training_set_DT <- training_set[-20]
test_set_DT <- test_set[-20]

Fitting decision tree to the training set without tenure_group variable

library(rpart)
str(training_set_DT)

## 'data.frame':    5274 obs. of  19 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 2 2 2 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 1 2 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 1 2 1 1 ...
##  $ tenure          : num  56.1 69.1 55.5 54.4 41.1 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 2 1 2 2 1 2 2 2 2 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 2 1 3 2 2 1 1 2 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 1 2 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 2 2 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 2 2 2 2 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 2 2 2 2 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 3 2 2 1 3 2 2 3 1 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 1 2 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 1 1 3 1 3 2 1 1 1 3 ...
##  $ MonthlyCharges  : num  34.8 120.9 89.4 64.5 16.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

str(test_set_DT)

## 'data.frame':    1758 obs. of  19 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 1 2 2 1 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 2 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 1 2 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 1 1 1 ...
##  $ tenure          : int  34 22 10 16 69 58 71 2 1 13 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 1 3 2 1 2 2 1 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 2 1 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 1 1 2 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 1 2 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 2 1 1 3 3 3 3 1 1 1 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 2 2 1 2 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 2 4 2 2 2 2 2 1 3 ...
##  $ MonthlyCharges  : num  57 89.1 29.8 18.9 113.2 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

classifier_DT = rpart(formula = Churn ~ .,
                      data = training_set_DT)

Predicting the Test set results

y_pred_DT = predict(classifier_DT, newdata = test_set[-19], 
                    type = 'class')

Applying k-Fold Cross Validation

library(caret)
set.seed(1)
folds_DT <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_DT, function(x) {
  training_fold = training_set[-x,]
  test_fold = training_set[x,]
  classifier = rpart(formula = Churn ~ .-c(tenure_group),
                     data = training_fold)
  y_pred_DT = predict(classifier, newdata = test_fold[-19], 
                      type = 'class')
  cm = table(test_fold[,19], y_pred_DT)
  accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
})

Getting the accuracy

accuracy_DT <- mean(as.numeric(cv))
accuracy_DT

## [1] 0.7574904

Creating the Confussion Matrix

cm_DT <- table(test_set[,19], y_pred_DT)
cm_DT

##      y_pred_DT
##        No Yes
##   No  859 432
##   Yes  84 383

Accuracy is 75.75 %
There are 84 of those who left anaway, 432 of those who didn’t left anyway

Plotting the decision tree without pruning it

library(rpart.plot)
tree <- rpart(Churn ~ ., data = training_set_DT, method = "class")
tree

## n= 5274 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 5274 2617 No (0.5037922 0.4962078)  
##    2) Contract=One year,Two year 1835  308 No (0.8321526 0.1678474) *
##    3) Contract=Month-to-month 3439 1130 Yes (0.3285839 0.6714161)  
##      6) InternetService=DSL,No 1373  662 Yes (0.4821559 0.5178441)  
##       12) tenure>=12.46047 501  169 No (0.6626747 0.3373253) *
##       13) tenure< 12.46047 872  330 Yes (0.3784404 0.6215596) *
##      7) InternetService=Fiber optic 2066  468 Yes (0.2265247 0.7734753) *

prp(tree, type = 2, extra = 6, nn = TRUE, 
    fallen.leaves = TRUE, varlen = 0, faclen = 0)

rm(training_set_DT)
rm(test_set_DT)

contract, InternetService and tenure are identified as the most important predictors.

Dimension reduction

Removing the tenure variable

training_set$tenure <- NULL
test_set$tenure <- NULL

Variable selection for logistic regression with stepAIC

Computing k value for stepAIC to have p = 0.05 as a treshold

computed_k <- qchisq(p = 0.05, 1, lower.tail = F)

Running stepAIC()

initial <- glm(Churn ~ . ,
               data = training_set,
               family = binomial)
library(MASS)
stepAIC(initial, direction = 'both', k = computed_k)

## Start:  AIC=5133.01
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + 
##     tenure_group
## 
##                    Df Deviance    AIC
## - OnlineBackup      1   5033.1 5129.2
## - MonthlyCharges    1   5033.7 5129.7
## - DeviceProtection  1   5034.0 5130.0
## - StreamingTV       1   5034.3 5130.3
## - gender            1   5034.3 5130.4
## - SeniorCitizen     1   5035.1 5131.2
## - Dependents        1   5036.1 5132.1
## - Partner           1   5036.3 5132.4
## <none>                  5033.1 5133.0
## - StreamingMovies   1   5038.4 5134.5
## - PaperlessBilling  1   5038.7 5134.7
## - PhoneService      1   5038.7 5134.7
## - MultipleLines     1   5046.5 5142.6
## - OnlineSecurity    1   5052.6 5148.6
## - TechSupport       1   5054.7 5150.7
## - PaymentMethod     3   5065.6 5153.9
## - InternetService   2   5072.0 5164.2
## - Contract          2   5150.7 5242.9
## - tenure_group      4   5273.4 5357.9
## 
## Step:  AIC=5129.18
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + DeviceProtection + 
##     TechSupport + StreamingTV + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges + tenure_group
## 
##                    Df Deviance    AIC
## - MonthlyCharges    1   5033.8 5126.0
## - DeviceProtection  1   5034.0 5126.2
## - StreamingTV       1   5034.3 5126.5
## - gender            1   5034.3 5126.5
## - SeniorCitizen     1   5035.2 5127.4
## - Dependents        1   5036.1 5128.3
## - Partner           1   5036.3 5128.5
## <none>                  5033.1 5129.2
## - StreamingMovies   1   5038.5 5130.7
## - PaperlessBilling  1   5038.7 5130.9
## - PhoneService      1   5039.0 5131.1
## + OnlineBackup      1   5033.1 5133.0
## - MultipleLines     1   5046.6 5138.8
## - OnlineSecurity    1   5052.7 5144.9
## - TechSupport       1   5054.7 5146.9
## - PaymentMethod     3   5065.6 5150.1
## - InternetService   2   5073.0 5161.4
## - Contract          2   5150.8 5239.2
## - tenure_group      4   5281.8 5362.5
## 
## Step:  AIC=5126
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + DeviceProtection + 
##     TechSupport + StreamingTV + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + tenure_group
## 
##                    Df Deviance    AIC
## - DeviceProtection  1   5034.4 5122.7
## - gender            1   5035.0 5123.3
## - SeniorCitizen     1   5035.8 5124.2
## - StreamingTV       1   5036.3 5124.6
## - Dependents        1   5036.7 5125.1
## - Partner           1   5037.0 5125.3
## <none>                  5033.8 5126.0
## - PhoneService      1   5039.2 5127.5
## - PaperlessBilling  1   5039.4 5127.8
## + MonthlyCharges    1   5033.1 5129.2
## + OnlineBackup      1   5033.7 5129.7
## - StreamingMovies   1   5042.4 5130.8
## - MultipleLines     1   5049.7 5138.0
## - OnlineSecurity    1   5052.7 5141.1
## - TechSupport       1   5054.9 5143.2
## - PaymentMethod     3   5066.1 5146.8
## - Contract          2   5152.0 5236.5
## - InternetService   2   5231.8 5316.3
## - tenure_group      4   5281.9 5358.7
## 
## Step:  AIC=5122.71
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + tenure_group
## 
##                    Df Deviance    AIC
## - gender            1   5035.5 5120.1
## - SeniorCitizen     1   5036.4 5121.0
## - StreamingTV       1   5036.7 5121.2
## - Dependents        1   5037.3 5121.8
## - Partner           1   5037.4 5121.9
## <none>                  5034.4 5122.7
## - PhoneService      1   5039.7 5124.2
## - PaperlessBilling  1   5040.0 5124.5
## + DeviceProtection  1   5033.8 5126.0
## + MonthlyCharges    1   5034.0 5126.2
## + OnlineBackup      1   5034.3 5126.5
## - StreamingMovies   1   5042.6 5127.1
## - MultipleLines     1   5050.0 5134.5
## - OnlineSecurity    1   5053.3 5137.8
## - TechSupport       1   5055.9 5140.4
## - PaymentMethod     3   5066.8 5143.6
## - Contract          2   5155.3 5235.9
## - InternetService   2   5233.2 5313.9
## - tenure_group      4   5287.9 5360.9
## 
## Step:  AIC=5120.06
## Churn ~ SeniorCitizen + Partner + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + tenure_group
## 
##                    Df Deviance    AIC
## - SeniorCitizen     1   5037.6 5118.3
## - StreamingTV       1   5037.9 5118.6
## - Dependents        1   5038.4 5119.1
## - Partner           1   5038.5 5119.2
## <none>                  5035.5 5120.1
## - PhoneService      1   5040.8 5121.5
## - PaperlessBilling  1   5041.1 5121.8
## + gender            1   5034.4 5122.7
## + DeviceProtection  1   5035.0 5123.3
## + MonthlyCharges    1   5035.2 5123.6
## + OnlineBackup      1   5035.5 5123.8
## - StreamingMovies   1   5043.7 5124.4
## - MultipleLines     1   5051.2 5131.9
## - OnlineSecurity    1   5054.9 5135.6
## - TechSupport       1   5057.1 5137.8
## - PaymentMethod     3   5068.3 5141.3
## - Contract          2   5156.8 5233.7
## - InternetService   2   5233.9 5310.7
## - tenure_group      4   5288.5 5357.6
## 
## Step:  AIC=5118.3
## Churn ~ Partner + Dependents + PhoneService + MultipleLines + 
##     InternetService + OnlineSecurity + TechSupport + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     tenure_group
## 
##                    Df Deviance    AIC
## - StreamingTV       1   5040.0 5116.8
## - Partner           1   5041.2 5118.0
## <none>                  5037.6 5118.3
## - Dependents        1   5041.6 5118.4
## + SeniorCitizen     1   5035.5 5120.1
## - PhoneService      1   5043.3 5120.1
## - PaperlessBilling  1   5043.6 5120.4
## + gender            1   5036.4 5121.0
## + DeviceProtection  1   5037.0 5121.5
## + MonthlyCharges    1   5037.3 5121.8
## + OnlineBackup      1   5037.5 5122.1
## - StreamingMovies   1   5045.7 5122.5
## - MultipleLines     1   5054.0 5130.8
## - OnlineSecurity    1   5057.2 5134.0
## - TechSupport       1   5060.2 5137.1
## - PaymentMethod     3   5071.4 5140.5
## - Contract          2   5161.9 5234.9
## - InternetService   2   5243.3 5316.3
## - tenure_group      4   5288.6 5353.9
## 
## Step:  AIC=5116.78
## Churn ~ Partner + Dependents + PhoneService + MultipleLines + 
##     InternetService + OnlineSecurity + TechSupport + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + tenure_group
## 
##                    Df Deviance    AIC
## - Partner           1   5043.6 5116.5
## <none>                  5040.0 5116.8
## - Dependents        1   5043.9 5116.9
## + StreamingTV       1   5037.6 5118.3
## + SeniorCitizen     1   5037.9 5118.6
## - PhoneService      1   5045.9 5118.9
## + MonthlyCharges    1   5038.6 5119.2
## + gender            1   5038.7 5119.4
## - PaperlessBilling  1   5046.5 5119.5
## + DeviceProtection  1   5039.5 5120.2
## + OnlineBackup      1   5039.8 5120.5
## - StreamingMovies   1   5053.2 5126.2
## - MultipleLines     1   5057.2 5130.2
## - OnlineSecurity    1   5059.5 5132.5
## - TechSupport       1   5061.5 5134.5
## - PaymentMethod     3   5074.9 5140.2
## - Contract          2   5162.3 5231.5
## - InternetService   2   5259.6 5328.7
## - tenure_group      4   5289.2 5350.7
## 
## Step:  AIC=5116.54
## Churn ~ Dependents + PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + TechSupport + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + tenure_group
## 
##                    Df Deviance    AIC
## - Dependents        1   5045.2 5114.3
## <none>                  5043.6 5116.5
## + Partner           1   5040.0 5116.8
## + SeniorCitizen     1   5040.9 5117.7
## + StreamingTV       1   5041.2 5118.0
## - PhoneService      1   5049.5 5118.6
## + MonthlyCharges    1   5042.2 5119.0
## + gender            1   5042.3 5119.2
## - PaperlessBilling  1   5050.2 5119.3
## + DeviceProtection  1   5043.2 5120.1
## + OnlineBackup      1   5043.4 5120.3
## - StreamingMovies   1   5057.5 5126.7
## - MultipleLines     1   5061.0 5130.1
## - OnlineSecurity    1   5062.6 5131.7
## - TechSupport       1   5065.8 5134.9
## - PaymentMethod     3   5078.9 5140.3
## - Contract          2   5166.0 5231.3
## - InternetService   2   5264.3 5329.6
## - tenure_group      4   5290.5 5348.1
## 
## Step:  AIC=5114.34
## Churn ~ PhoneService + MultipleLines + InternetService + OnlineSecurity + 
##     TechSupport + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + tenure_group
## 
##                    Df Deviance    AIC
## <none>                  5045.2 5114.3
## + SeniorCitizen     1   5041.9 5114.9
## + StreamingTV       1   5042.9 5115.9
## + Dependents        1   5043.6 5116.5
## - PhoneService      1   5051.3 5116.6
## + MonthlyCharges    1   5043.9 5116.8
## + Partner           1   5043.9 5116.9
## + gender            1   5044.0 5117.0
## - PaperlessBilling  1   5052.3 5117.6
## + DeviceProtection  1   5044.8 5117.8
## + OnlineBackup      1   5045.1 5118.1
## - StreamingMovies   1   5059.0 5124.3
## - MultipleLines     1   5063.2 5128.5
## - OnlineSecurity    1   5064.6 5129.9
## - TechSupport       1   5067.7 5133.0
## - PaymentMethod     3   5081.4 5139.0
## - Contract          2   5171.8 5233.2
## - InternetService   2   5270.7 5332.1
## - tenure_group      4   5295.2 5349.0

## 
## Call:  glm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + TechSupport + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + tenure_group, family = binomial, 
##     data = training_set)
## 
## Coefficients:
##                          (Intercept)  
##                             -1.19163  
##                      PhoneServiceYes  
##                             -0.34180  
##                     MultipleLinesYes  
##                              0.36673  
##           InternetServiceFiber optic  
##                              0.93406  
##                    InternetServiceNo  
##                             -0.91576  
##                    OnlineSecurityYes  
##                             -0.38856  
##                       TechSupportYes  
##                             -0.41665  
##                   StreamingMoviesYes  
##                              0.31064  
##                     ContractOne year  
##                             -0.68940  
##                     ContractTwo year  
##                             -1.74532  
##                  PaperlessBillingYes  
##                              0.21014  
## PaymentMethodCredit card (automatic)  
##                             -0.01245  
##        PaymentMethodElectronic check  
##                              0.41793  
##            PaymentMethodMailed check  
##                             -0.08411  
##              tenure_group0-12 Months  
##                              2.00364  
##             tenure_group12-24 Months  
##                              0.91525  
##             tenure_group24-48 Months  
##                              0.63350  
##             tenure_group48-60 Months  
##                              0.40948  
## 
## Degrees of Freedom: 5273 Total (i.e. Null);  5256 Residual
## Null Deviance:       7311 
## Residual Deviance: 5045  AIC: 5081

Data modeling

Fitting Logistic regression to the training set

classifier_LR <- glm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
                       OnlineSecurity + TechSupport + StreamingMovies + Contract + 
                       PaperlessBilling + PaymentMethod + tenure_group, family = binomial, 
                     data = training_set)
summary(classifier_LR)

## 
## Call:
## glm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + TechSupport + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + tenure_group, family = binomial, 
##     data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3555  -0.7476  -0.1730   0.7765   2.7892  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          -1.19163    0.21620  -5.512 3.55e-08
## PhoneServiceYes                      -0.34180    0.13804  -2.476 0.013280
## MultipleLinesYes                      0.36673    0.08660   4.235 2.29e-05
## InternetServiceFiber optic            0.93406    0.09642   9.688  < 2e-16
## InternetServiceNo                    -0.91576    0.13162  -6.957 3.46e-12
## OnlineSecurityYes                    -0.38856    0.08796  -4.417 9.99e-06
## TechSupportYes                       -0.41665    0.08745  -4.765 1.89e-06
## StreamingMoviesYes                    0.31064    0.08402   3.697 0.000218
## ContractOne year                     -0.68940    0.10405  -6.626 3.46e-11
## ContractTwo year                     -1.74532    0.17400 -10.031  < 2e-16
## PaperlessBillingYes                   0.21014    0.07845   2.679 0.007391
## PaymentMethodCredit card (automatic) -0.01245    0.11763  -0.106 0.915728
## PaymentMethodElectronic check         0.41793    0.10005   4.177 2.95e-05
## PaymentMethodMailed check            -0.08411    0.11643  -0.722 0.470009
## tenure_group0-12 Months               2.00364    0.16700  11.998  < 2e-16
## tenure_group12-24 Months              0.91525    0.16296   5.616 1.95e-08
## tenure_group24-48 Months              0.63350    0.14764   4.291 1.78e-05
## tenure_group48-60 Months              0.40948    0.16264   2.518 0.011814
##                                         
## (Intercept)                          ***
## PhoneServiceYes                      *  
## MultipleLinesYes                     ***
## InternetServiceFiber optic           ***
## InternetServiceNo                    ***
## OnlineSecurityYes                    ***
## TechSupportYes                       ***
## StreamingMoviesYes                   ***
## ContractOne year                     ***
## ContractTwo year                     ***
## PaperlessBillingYes                  ** 
## PaymentMethodCredit card (automatic)    
## PaymentMethodElectronic check        ***
## PaymentMethodMailed check               
## tenure_group0-12 Months              ***
## tenure_group12-24 Months             ***
## tenure_group24-48 Months             ***
## tenure_group48-60 Months             *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7311.0  on 5273  degrees of freedom
## Residual deviance: 5045.2  on 5256  degrees of freedom
## AIC: 5081.2
## 
## Number of Fisher Scoring iterations: 5

library(car)
vif(classifier_LR)

##                      GVIF Df GVIF^(1/(2*Df))
## PhoneService     1.316723  1        1.147485
## MultipleLines    1.523190  1        1.234176
## InternetService  2.299331  2        1.231404
## OnlineSecurity   1.151293  1        1.072983
## TechSupport      1.177450  1        1.085104
## StreamingMovies  1.404921  1        1.185294
## Contract         1.790894  2        1.156824
## PaperlessBilling 1.142792  1        1.069015
## PaymentMethod    1.392604  3        1.056748
## tenure_group     2.381935  4        1.114593

Getting the probabilities of the training set

prob_pred_LR_train = predict(classifier_LR, type = 'response', 
                       newdata = training_set[-18])

Predicting the Test set results

str(test_set)

## 'data.frame':    1758 obs. of  19 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 1 2 2 1 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 2 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 1 2 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 1 1 1 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 1 3 2 1 2 2 1 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 2 1 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 1 1 2 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 1 2 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 2 1 1 3 3 3 3 1 1 1 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 2 2 1 2 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 2 4 2 2 2 2 2 1 3 ...
##  $ MonthlyCharges  : num  57 89.1 29.8 18.9 113.2 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tenure_group    : Factor w/ 5 levels "> 60 Months",..: 4 3 2 3 1 5 1 2 2 3 ...

prob_pred_LR = predict(classifier_LR, type = 'response', 
                       newdata = test_set[-18]) 
y_pred_LR = ifelse(prob_pred_LR > 0.5, 1, 0)

Applying k-Fold Cross Validation

set.seed(2)
folds_LR <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_LR, function(x) {
  training_fold = training_set[-x,]
  test_fold = training_set[x,]
  classifier = glm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
                     OnlineSecurity + TechSupport + StreamingMovies + Contract + 
                     PaperlessBilling + PaymentMethod + tenure_group, 
                   family = binomial, 
                   data = training_set)
  prob_pred_LR = predict(classifier, newdata = test_fold[-18]) 
  y_pred_LR = ifelse(prob_pred_LR > 0.5, 1, 0)
  cm = table(test_fold[,18], y_pred_LR)
  accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
})

Getting the accuracy

accuracy_LR <- mean(as.numeric(cv))
accuracy_LR

## [1] 0.7582512

Creating the Confussion Matrix

cm_LR <- table(test_set[,18], y_pred_LR)
cm_LR

##      y_pred_LR
##         0   1
##   No  925 366
##   Yes  94 373

Accuracy is 75.83 %. There are 94 of those who left anaway, 366 of those who didn’t left anyway.

Plotting the ROC curve

library(ROCR)
plot(performance(prediction(prob_pred_LR, test_set$Churn), measure ='tpr', 
                 x.measure ='fpr'),lwd = 7, main = 'ROC curve', col = 'green')
lines(x=c(0, 1), y=c(0, 1), col="blue", lwd=1)

Computing area under the ROC curve

pr <- prediction(prob_pred_LR, test_set$Churn)
auroc <- performance(pr, measure = 'auc')
auroc <- auroc@y.values[[1]]
auc_rounded <- round(auroc,2)
print(paste('Area under the receiver operating characteristic curve is', auc_rounded))

## [1] "Area under the receiver operating characteristic curve is 0.83"

Plotting the CAP curve

test_set$Churn <- as.character(test_set$Churn)
test_set$Churn <- as.numeric(mapvalues(from = c('Yes', 'No'), 
                                       to = c('1', '0'), test_set$Churn))

library(plotrix)
plot(performance(prediction(prob_pred_LR, test_set$Churn), 'tpr', 'rpp'),
     lwd = 7, main = 'CAP curve', col = 'red')
lines(x=c(0, 1), y=c(0, 1), col="blue", lwd=1)
ablineclip(v = 0.5, y2 = 0.863, lty = 'dashed', col = 'green', lwd = 3)
ablineclip(h = 0.863, x2 = 0.5, lty = 'dashed', col = 'green', lwd = 3)

Fitting SVM to the Training set

library(e1071)
classifier_SVM = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
                       OnlineSecurity + TechSupport + StreamingMovies + Contract + 
                       PaperlessBilling + PaymentMethod + tenure_group,
                     data = training_set,
                     type = 'C-classification',
                     kernel = 'linear')

Predicting the Test set results

y_pred_SVM = predict(classifier_SVM, newdata = test_set[-18])

Applying k-Fold Cross Validation

set.seed(3)
folds_SVM <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_SVM, function(x) {
  training_fold = training_set[-x,]
  test_fold = training_set[x,]
  classifier = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
                     OnlineSecurity + TechSupport + StreamingMovies + Contract + 
                     PaperlessBilling + PaymentMethod + tenure_group, 
                   data = training_fold,
                   type = 'C-classification',
                   kernel = 'linear')
  y_pred_SVM = predict(classifier, newdata = test_fold[-18])
  cm = table(test_fold[,18], y_pred_SVM)
  accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
})

Getting the accuracy

accuracy_SVM <- mean(as.numeric(cv))
accuracy_SVM

## [1] 0.7400554

Creating the Confussion Matrix

cm_SVM <- table(test_set[,18], y_pred_SVM)
cm_SVM

##    y_pred_SVM
##      No Yes
##   0 832 459
##   1  82 385

Accuracy is 74.00 %. There are 82 of those who left anaway, 459 of those who didn’t left anyway.

Fitting Kernel SVM to the Training set

classifier_KSVM = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
                        OnlineSecurity + TechSupport + StreamingMovies + Contract + 
                        PaperlessBilling + PaymentMethod + tenure_group,
                      data = training_set,
                      type = 'C-classification',
                      kernel = 'radial')

Predicting the Test set results

y_pred_KSVM = predict(classifier_KSVM, newdata = test_set[-18])

Applying k-Fold Cross Validation

set.seed(4)
folds_KSVM <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_SVM, function(x) {
  training_fold = training_set[-x,]
  test_fold = training_set[x,]
  classifier = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService + 
                     OnlineSecurity + TechSupport + StreamingMovies + Contract + 
                     PaperlessBilling + PaymentMethod + tenure_group,
                   data = training_fold,
                   type = 'C-classification',
                   kernel = 'radial')
  y_pred_KSVM = predict(classifier, newdata = test_fold[-18])
  cm = table(test_fold[,18], y_pred_KSVM)
  accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
})

Getting the accuracy

accuracy_KSVM <- mean(as.numeric(cv))
accuracy_KSVM

## [1] 0.7635665

Creating the Confussion Matrix

cm_KSVM <- table(test_set[,18], y_pred_KSVM)
cm_KSVM

##    y_pred_KSVM
##      No Yes
##   0 917 374
##   1  99 368

Accuracy is 76.45%. There are 99 of those who left anaway, 374 of those who didn’t left anyway Not signifficantly better result from logistic regression and SVM => data are linearly separable.

Logistic regression and SVM have similar results => tuning the SVM model.

Tuning the SVM model

Applying Grid search to find the best parameters

classifier_SVM_tuned <- train(Churn ~ PhoneService + MultipleLines + InternetService + 
                                OnlineSecurity + TechSupport + StreamingMovies + Contract + 
                                PaperlessBilling + PaymentMethod + tenure_group,
                              data = training_set,
                              method = 'svmLinear')
classifier_SVM_tuned

## Support Vector Machines with Linear Kernel 
## 
## 5274 samples
##   10 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5274, 5274, 5274, 5274, 5274, 5274, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.747551  0.4956568
## 
## Tuning parameter 'C' was held constant at a value of 1

The best performance is with parameter c = 1, which is default, so the performance is still the same.

Logistic regression is choosed, because of better interpretation and ability to rank customers by highest probability to leave.

The example:

example <- test_set[13,]
example

example_prob <- prob_pred_LR[13]
example_prob

##       52 
## 0.607221

Data dictionary

List of the further provided information:

Variable name	Data type	Variable description
customerID	factor	a unique number for identifying a customer
gender	factor	available values - female/male
SeniorCitizen	integer	whether the customer is a senior citizen or not (available values: 1/0, where 1 means ‘Yes’ and 0 means ‘No’)
Partner	factor	whether the customer has a partner or not (available values: Yes/No)
Dependents	factor	whether the customer has dependents or not (available values: Yes/No)
tenure	integer	number of months the customer has stayed with the company
PhoneService	factor	whether the customer has a phone service or not (available values: Yes/No)
MultipleLines	factor	whether the customer has multiple lines or not (available values: Yes/No/No phone service)
InternetService	factor	customer’s internet service provider (available values: DSL/Fiber optic/No)
OnlineSecurity	factor	whether the customer has online security or not (available values: Yes/No/No internet service)
OnlineBackup	factor	whether the customer has online backup or not (available values: Yes/No/No internet service)
DeviceProtection	factor	whether the customer has device protection or not (available values: Yes/No/No internet service)
TechSupport	factor	whether the customer has tech support or not (available values: Yes/No/No internet service)
streamingTV	factor	whether the customer has streaming TV or not (available values: Yes/No/No internet service)
streamingMovies	factor	whether the customer has streaming movies or not (available values: Yes/No/No internet service)
Contract	factor	the contract term of the customer (available values: Month-to-month/One year/Two year)
PaperlessBilling	factor	whether the customer has paperless billing or not (available values: Yes/No)
PaymentMethod	factor	the customer’s payment method (available values: Electronic check/Mailed check/Bank transfer (automatic)/Credit card (automatic))
MonthlyCharges	integer	the amount charged to the customer monthly(numeric variable)
TotalCharges	integer	the total amount charged to the customer(numeric variable)

Provided dataset contains information about 7043 customers.

Technical terms explained

stepAIC

It is an R function from MASS package that uses Akaike information criterion for selecting the best subset of features. Akaike information criterion offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. So the goal is to find the model with the smallest AIC by removing or adding variables in the scope. Since each variable in the model is penalized with a factor of 2, this leads to logistic regression tests with p-values < 0.1573. In order to add criterium for selecting the variables with p < 0.05, the correct k parameter was computed .

The cross validation

It is used as a reassurence that the model is not overfitting. Overfitting would mean that the model did not find the general patterns in the dataset, but the prediction on a test set are based on memorizing the training set. In this case, the model would performed very well on the training set (since it would be memorized), but very bad on the new sets of data (which the test set and every new case that needs to be predicted), because the model would not know the general patterns. Cross validation is able to check for overfitting, because within this procedure the train set is devided into 10 portions, the model is trained 10 times, each time on different combination of 9 portions of the training set and the remaining 10th portion is used for the testing and computing the accuracy. The final accuracy of the model is computed as the average of 10 accuracies computed within cross validation.

Logistic regression

Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a probability that the given input point belongs to a certain class. So when we have two classes and the probability in question is P_+ -> the probability that a certain data point belongs to the ‘+‘ class. Ofcourse, P_- = 1 - P_+. Thus, the output of Logistic Regression always lies in [0, 1].

SVM

Support vector machine algorithm is in some way different from other machine learning algorithms. The way SVM searches for the best line wich is a boundary between two classes (called maximum margin hyperplane) is through the maximum margin. In two dimensial space margin is represented by the two distances. The distances are between the boundary line and closest points from each class, that are called support vectors. They are called ‘support’ because the other vectors doesn’t contribute to the result of the algorithm, and they are called vectors, because in more dimensional space we can’t visualize them like points, they are vectors. SVM are special because if the machine is supposed to distinquish for example between apples and oranges, the most of the algorithms would look at the most stock standard looking apples in the training set to learn, how an apple look like and the same for oranges. But SVM algorithm look at an apple that looks like an orange, and at the orange that looks like an apple - and these examples are those support vectors.

Linearly separable dataset

It means that the input space can be separated into two nice ‘regions’, one for each class, by a linear(straight) boundary. For two dimensions, its a straight line- no curving. For three dimensions, its a plane. And so on. This boundary will ofcourse be decided by the input data and the learning algorithm, but in for example two dimensional space it means that the data points MUST be separable into the two aforementioned regions by a linear boundary. If the data points do satisfy this constraint, they are said to be linear-separable.

Kernel SVM

There is a method used for non linearly separable dataset. With help of a mapping function it moves the data into higher dimension space, where the data become linearly separable and finds the boundary between them (using SVM) and then the data are projected to the original space. But his is highly compute-intensive.

What Kernel SVM does is, that it uses the Kernel trick. In example of two dimensional space, where the data are separable by a circle, radial kernel uses a method to find a landmark in the middle of that circle and by fitting every data point into a kernel function it is possible to calculate which class the given data point belongs to. The size of the circle is given by the parameter sigma. It is possible to add up more Kernel functions in order to create more complex boundaries.

Another examples are Sigmoid Kernel and Polynomial Kernel which is are directional functions.

Telecom Customer Churn

Ivana Hybenova

February 02, 2018

Problem statement and findings

Problem statement

Objective

Project duration

Business benefits

My approach

The R code

Data preprocessing

Missing values, wrong values, data types

Exploratory data analysis

Identifying and correcting collinearity among numeric variables

Looking for outliers in MonthlyCharges with help of standardized values

Patterns exploration

Training and test set preparation

Splitting dataset into training set and test set

Sampling the training set in order to get evenly sized classes with ROSE

Exploring the patterns in dataset using Decission Tree

Dimension reduction

Data modeling

Fitting Logistic regression to the training set

Fitting SVM to the Training set

Fitting Kernel SVM to the Training set

Tuning the SVM model

Data dictionary

Technical terms explained