Mobile operator Telecom is struggling with customer churn, which means that the company is losing too many customers.
The company does not know what makes customers to leave and neither whether it is random or they have something in common.
It is much cheaper to retain a customer than to gain a new one, but Telecom finds it challenging to come up with an approach to it.
Telecom observed a set of customers for a certain period of time and provided the information whether the customer churned or not. Along with this information, Telecom provided several other information that could effect the churn.
The dataset is downloadable here: link
The objective was to look for general patterns in the provided data that would help to identify customers that are more likely to churn, before they do that, so the company could adress them and prevent their churn.
Given dataset had 20 columns and 7043 rows. It took me 3 weeks to complete the project.
The graph demonstrates how taking advantage of data can help Telecom to identify the customers that are most likely to churn and give the company opportunity to prevent it. The blue diagonal line represents the random scenario (without using a model). In the test set 467 customers out of 1758 churned, which approximately 26.56 %. If the company paid extra attention to randomly chosen 50 % of the customers, only 50 % of those 467 would have been addressed.
The red diagonal line represents scenario using the delivered model of logistic regression. When the company pays attention to 50 % of the customers with the highest probability of churn computed by the model, 86 % of those 467 will be addressed.
To address these 36 % extra customers that will churn without the model, Telecom would have pay for approaching another 36 % customers.
So in case of addressing 50 % customers, the savings from using the model represents the cost of addressing 36 % customers.
As a matter of fact, the more is the curve representing the model scenario in the top left corner, the better is the model.
Another option of evaluating the model is plotting the receiver operating characteristic curve, as shown below.
ROC Curve is used to see how well the classifier can separate positive and negative cases.
On the x axis is false positive rate, which is the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events.
on the y axis is true positive rate, which it the ratio between the number of correctly categorized positive events and the total number of actual positive events.
The diagonal line represents the random model, that means scenario without using information about patterns found in the data.
The curve represents the scenario, in which is the churn (positive class) predicted by the obtianed model (the delivered logistic regression).
The area under the ROC curve (AUROC) should be between 0.5 and 1.0. This area is a measure of the predictive accuracy of a model. An AUROC equal to 0.5 (i.e. coinciding with the diagonal) indicates the random classification model. As a matter of fact, this area should be greater than 0.5 for a model to be acceptable; a model with AUROC of 0.5 or less is worthless.
AUROC for the delivered logistic model was computed:
## [1] "Area under the receiver operating characteristic curve is 0.83"
Taking into consideration that area under the ROC curve of a perfect model is 1, this model is predicting the churn very well.
For ilustration below is the confusion matrix computed from the 1758 cases used for evaluating the model.
## y_pred_LR
## 0 1
## No 925 366
## Yes 94 373
In the first row and first column here is the number of correctly classified customers that did not churn at the end of the given period of time. In the first row and the second column here is the number of customers that were classified to churn, but they did not. In the second row and the first column here is the number of customers that were classified to not churn, but they did. This is the number we want to minimize without allowing the false positive rate to increase too much as this can add to marketing costs. In the second row and the second column here is the number of correctly classified customers that churned.
Except the information whether a customer is likely to churn or not, I found out which features effect this most.
According to the plot of the decision tree those features are type of contract, internet provider and tenure.
Furthemore the numbers in nodes represent the probability of churn. The top nod has number 0.50. Since I used a sampling method to get a training set with balanced classes, without using any model there is a 50 % chance for each customer to churn.
As we move down the plot, we can see that probability to churn of customers that have one or two years contract is only 0.17, for others it is 0.67. From those, in case they do not have internet service or they have DSL, the probability of churn is 0.52, otherwise for those whose internet service provider is Fiber optic (the last option of internet service) it is 0.77.
Customers whose internet service provider is not Fiber optic and are customers of Telecom at least 12 months have probability to churn 0.34, otherwise it is 0.62.
This was just a look at the most important features generated by decision tree algorithm. Now let’s take a look at the delivered logistic model itself.
General formula for logistic regression is:
ln(p/(1-p)) = b0 + b1x1+ … + bnxn.
Coeffitients b0,b1, …, bn are computed by logistic regression model after training it on a training dataset, where b0 is a constant, which means that is the same for every case (customer) and x1,x2, …, xn are the values of concrete features of the particular case (customer). In this case the solution of the formula is probability that the given customer will churn.
The formula of our logistic regression is
ln(p/(1-p)) = - 1.19163 - 0.34180 * PhoneServiceYes + 0.36673 * MultipleLinesYes + 0.93406 * InternetServiceFiber optic -
- 0.91576 * InternetServiceNo - 0.38856 * OnlineSecurityYes - 0.41665 * TechSupportYes + 0.31064 * StreamingMoviesYes -
- 0.68940 * ContractOne year - 1.74532 * ContractTwo year + 0.21014 * PaperlessBillingYes -
- 0.01245 * PaymentMethodCredit card (automatic) + 0.41793 * PaymentMethodElectronic check -0.08411 * PaymentMethodMailed check +
+ 2.00364 * tenure_group0-12 Months + 0.91525 * tenure_group12-24 Months + 0.63350 * tenure_group24-48 Months +
+ 0.40948 * tenure_group48-60 Months
The model predicts customers with probabilities lower than 0.5 to not churn, and customers with probabilities with 0.5 and higher to churn.
Let’s now take one of our customers from the test set and compute the probability of leaving.
When we fit into the equation the customers information we get:
ln(p/(1-p)) = - 1.19163 - 0.34180 * 0 + 0.36673 * 0 + 0.93406 * 0 -
- 0.91576 * 0 - 0.38856 * 0 - 0.41665 * 0 + 0.31064 * 0 -
- 0.68940 * 0 - 1.74532 * 0 + 0.21014 * 1 - 0.01245 * 0 +
+ 0.41793 * 1 - 0.08411 * 0 + 2.00364 * 1 + 0.91525 * 0 +
+ 0.63350 * 0 + 0.40948 * 0
Computed probability is
## 52
## 0.607221
As we can see from the table, this customer churned and it was also predicted by our model, since computed p was higher than 0.5.
Let’s now look at our formula of logistic regression and think about how can we take advantage of knowing what the probabilit of churn actually effects.
The b1 coefficient has negative sign, which means that if a customer has a phone service the probability of churn is lower then probability of customer that has not a phone service and the other features are the same in both cases.
In case of our customer we fitted into the formula the x1 is 0, because as we can see from the table, the customer does not have a phone service.
Every other feature also has given available values and works on the same principle.
The another coefficient says if a customer has multiple lines, the probability of churn is higher. We can not effect this, but we can pay extra attention to those, who have them, since the coefficient is quite high.
The next two coefficients are related to internet service. Customers whose internet provider is Fiber optic are more likely to leave and as we saw in the decission tree, internet provider feature is the second most important feature. That is why customers with Fiber optic provider need extra attention. On the other hand having no internet service is lowering the probability of churn.
Having online security and technical support lowers the probability of churn as well, whereas having streaming movies increases it.
Next two coefficients are related to contract. Options for contract are month-to-month, that has no effect on the probability of churn, both one year and two years contract lowers the probability, where two years tremendiously. It is the most important feature, as we also saw on the decission tree plot. Contract can Telecom effect and it is highly recommended to make sure that customers has two years contract.
The next coefficient says that the paperless billing increases the churn probability, so it is not recommended to provide it to customers.
Next three coefficients are related to customer’s payment method. They are not that important except electronic card method which is increasing the churn probability. Again paying extra attention to these customers are recommended.
The last four coefficients are related to tenure. It is very important to pay extra attention to customers that are with Telecom less then one year. But also after that period, they need extra attention.
Top 5 rows of the dataset are showed below:
head(dataset, 5)
The data preparation, data exploration and data modeling was performed in RStudio.
Dataset contained 11 customers, that were missing information about Total Charges and since it was such few of them they were excluded from the dataset. Information such ‘No internet service’ and ‘No phone service’ where rewritten to ‘No’.
Feature ‘total charges’ were excluded from the dataset, because it is a combination of ‘monthly charges’ and ‘tenure’ and in case it had been kept in the model, the model would not perform well.
Data exploration revealed that most of the provided featers probably effect whether the customer churned or not after the observed period of time. Because the class sizes were very imbalanced, as we can see on the graph below, the training set were sampled right after the split (70 % cases were used for training set - dataset used for building the model, and 30 % case were used for test set - dataset used for evaluation models performance).
In case of not sampling the dataset the accuracy of logistic regression (which is most suitable algorithm for this problem) would be poor, because this makes model more likely to predict the majority class, no matter how well is the modeling done. With logistic regression, the cutoff point can be between 0 and 1. As the cutoff point moves away from 0.5 in either direction, the false positive or false negative rate is being changed.
There are pros and cons with both approaches, i.e. tighter parameter estimates vs a more accurate predictive model. In exploratory modeling the higher precision in the relationship between the parameters is wanted, whereas in predictive modeling it is the lower error rate. The sampled training set was used (graph of the proportions is below), in order to predict model that would minimize false negatives, i. e. ‘catch’ the most people that will churn as possible.
The final subset of features for modeling was obtained using function stepAIC with parameter k adjusted to select variables with p < 0.05.
Listed features were identified as important:
PhoneService,
MultipleLines,
InternetService,
OnlineSecurity,
TechSupport,
StreamingMovies,
Contract,
PaperlessBilling,
PaymentMethod,
tenure_group.
This list leaves as unimportant featues that were originally based on exploratory analasys taken into consideration:
SeniorCitizen,
Partner,
Dependents,
OnlineBackup,
DeviceProtecion,
StreamingTV.
Before modeling the dataset with another algorithms the feature tenure was binned into 5 bins: “0–12 Months”, “12–24 Months”, “24–48 Months”, “48–60 Month”, “> 60 Month” and the feature tenure was excluded from the training set.
Several algorithms were used for data modeling: decision tree, logistic regression, support vector machine and kernel support vector machine. Kernel support vector machine was used in case that the dataset was not linearly separable, but since it did not performed better than the others algorithms, linear separability was assumed and both logistic regression and support vector machine were taken into consideration. The dataset was randomly splitted into a training set, that contained 75 % of the customers and was used to train the models, and a test set, that contained reamining 25 % of the customers and was used to check the models performance.
When choosing the best model two criterias were taken into consideration. The first one the was cross validation and the second one was number of people that churned, even though the model predicted them to stay.
The cross validation is used as a reassurence that the model is not overfitting.
Since Telecom does not want to just find the most accurate model (the model that classify correctly the customers that will churn and those who will churn not) but besides that also the model that will classify correctly most of those, that will churn if the company will not prevent it (that will minimize Type || errors), because it is cheaper to prevent a customer to churn than to find a new customer. This number is obtained by computing a confusion matrix on the test dataset. For ilustration here is the one for logistic regression.
## y_pred_LR
## 0 1
## No 925 366
## Yes 94 373
After evaluation of the models support vector machine and logistic regression was considered as the final models. Logistic regression had slightly higher accuracy but according to SVM’s confussion matrix (showed below), SVM did slightly better in predicting customers that will churn.
## y_pred_SVM
## No Yes
## 0 832 459
## 1 82 385
Support vector machine was tried to improve, but tuning revealed that the model performs best with parameter c = 1, so improvement was not possible, since this is the default value.
Eventually the logistic regression algorithm were chosen as the best solution to implement for its easier interpretability and ability to predict probabilities of churn. This means that it is possible to compute the probability of a customer to churn and sort the customers by it. Thanks to that Telecom has option to pay extra attention to for example 100 customers (the number might depends on resources) with the highest probability of churn.
dataset <- read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv', na.strings = c('','?'))
str(dataset)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
summary(dataset)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies
## No :2810 No :2785
## No internet service:1526 No internet service:1526
## Yes :2707 Yes :2732
##
##
##
##
## Contract PaperlessBilling PaymentMethod
## Month-to-month:3875 No :2872 Bank transfer (automatic):1544
## One year :1473 Yes:4171 Credit card (automatic) :1522
## Two year :1695 Electronic check :2365
## Mailed check :1612
##
##
##
## MonthlyCharges TotalCharges Churn
## Min. : 18.25 Min. : 18.8 No :5174
## 1st Qu.: 35.50 1st Qu.: 401.4 Yes:1869
## Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
Computing the percentage of missing values
nrow(dataset[!complete.cases(dataset),])/nrow(dataset) * 100
## [1] 0.1561834
Looking for missing values
sapply(dataset, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
Deleting incomplete rows, since it is so few of them
dataset <- dataset[complete.cases(dataset),]
summary(dataset)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3483 Min. :0.0000 No :3639 No :4933
## 0003-MKNFE: 1 Male :3549 1st Qu.:0.0000 Yes:3393 Yes:2099
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1624
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7026
## tenure PhoneService MultipleLines InternetService
## Min. : 1.00 No : 680 No :3385 DSL :2416
## 1st Qu.: 9.00 Yes:6352 No phone service: 680 Fiber optic:3096
## Median :29.00 Yes :2967 No :1520
## Mean :32.42
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3497 No :3087
## No internet service:1520 No internet service:1520
## Yes :2015 Yes :2425
##
##
##
##
## DeviceProtection TechSupport
## No :3094 No :3472
## No internet service:1520 No internet service:1520
## Yes :2418 Yes :2040
##
##
##
##
## StreamingTV StreamingMovies
## No :2809 No :2781
## No internet service:1520 No internet service:1520
## Yes :2703 Yes :2731
##
##
##
##
## Contract PaperlessBilling PaymentMethod
## Month-to-month:3875 No :2864 Bank transfer (automatic):1542
## One year :1472 Yes:4168 Credit card (automatic) :1521
## Two year :1685 Electronic check :2365
## Mailed check :1604
##
##
##
## MonthlyCharges TotalCharges Churn
## Min. : 18.25 Min. : 18.8 No :5163
## 1st Qu.: 35.59 1st Qu.: 401.4 Yes:1869
## Median : 70.35 Median :1397.5
## Mean : 64.80 Mean :2283.3
## 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
##
Changing ‘No internet servise’ into ‘No’ in 6 variables
dataset[,c(10:15)] <- apply(dataset[,c(10:15)], 2, function(x) (gsub('No internet service', 'No', x)))
dataset[,c(10:15)]<- lapply(dataset[,c(10:15)], as.factor)
str(dataset)
## 'data.frame': 7032 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Changing ‘No phone service’ into ‘No’ in MultipleLines variable
dataset$MultipleLines <- gsub('No phone service', 'No', dataset$MultipleLines)
dataset$MultipleLines <- as.factor(dataset$MultipleLines)
str(dataset)
## 'data.frame': 7032 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Changing variable SeniorCitizen from 0,1 to ‘No’, ‘Yes’
library(plyr)
dataset$SeniorCitizen <- mapvalues(from = c('0', '1'),
to = c('No', 'Yes'), dataset$SeniorCitizen)
dataset$SeniorCitizen <- as.factor(dataset$SeniorCitizen)
Removing the columns I do not need for the analysis
dataset$customerID <- NULL
str(dataset)
## 'data.frame': 7032 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Looking at the proportion of customers that churned
library(ggplot2)
ggplot(dataset, aes(x = Churn, fill = Churn)) +
geom_bar() +
theme(legend.position="none")
Unevenly sized classes => sampling of the training set is needed.
Dropping dependent variable for calculating Multicollinearity
dataset_independent = subset(dataset, select = -c(Churn))
Identifying numeric variables
dataset_independent_numeric <- dataset_independent[sapply(dataset_independent, is.numeric)]
Calculating Correlation
descrCor <- cor(dataset_independent_numeric)
Print correlation matrix and look at max correlation
print(descrCor)
## tenure MonthlyCharges TotalCharges
## tenure 1.0000000 0.2468618 0.8258805
## MonthlyCharges 0.2468618 1.0000000 0.6510648
## TotalCharges 0.8258805 0.6510648 1.0000000
rm(descrCor)
rm(dataset_independent)
rm(dataset_independent_numeric)
MonthlyCharges and TotalCharges are highly correlated, one of them must be dropped.
Dropping TotalCharges, since it is function of MonthlyCharges and Tenure group
dataset$TotalCharges <- NULL
Computing of the standardized values
zcharges <- scale(dataset$MonthlyCharges, scale = TRUE)
Looking for values higher then 3 and lower then -3
zcharges_dec <- sort(zcharges, decreasing = TRUE)
zcharges_asc <- sort(zcharges, decreasing = FALSE)
head(zcharges_dec, 20)
## [1] 1.793254 1.789930 1.788268 1.788268 1.779959 1.774973 1.761678
## [8] 1.755030 1.751706 1.750044 1.746721 1.741735 1.740073 1.733425
## [15] 1.730102 1.728440 1.726778 1.721792 1.721792 1.720130
head(zcharges_asc, 20)
## [1] -1.547173 -1.542187 -1.537202 -1.532216 -1.532216 -1.530554 -1.528892
## [8] -1.528892 -1.528892 -1.528892 -1.528892 -1.528892 -1.528892 -1.527230
## [15] -1.527230 -1.527230 -1.527230 -1.527230 -1.525568 -1.525568
No outliers present.
rm(zcharges)
rm(zcharges_dec)
rm(zcharges_asc)
Boxplot for continuous variable
ggplot(dataset, aes(x = Churn, y = MonthlyCharges)) +
geom_boxplot()
Expecting MonthlyCharges to be predictor.
Proportion of churn for categorical variable
str(dataset)
## 'data.frame': 7032 obs. of 19 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
p_gender <- ggplot(dataset, aes(x = gender, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_SeniorCitizen <- ggplot(dataset, aes(x = SeniorCitizen, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_partner <- ggplot(dataset, aes(x = Partner, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_Dependents <- ggplot(dataset, aes(x = Dependents, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
library(Rmisc)
multiplot(p_gender, p_SeniorCitizen, p_partner, p_Dependents, cols = 2)
Expecting Partner, SeniorCitizen and Dependents to be predictors.
p_PhoneService <- ggplot(dataset, aes(x = PhoneService, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_MultipleLines <- ggplot(dataset, aes(x = MultipleLines, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_InternetService <- ggplot(dataset, aes(x = InternetService, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_OnlineSecurity <- ggplot(dataset, aes(x = OnlineSecurity, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
multiplot(p_PhoneService, p_MultipleLines, p_InternetService, p_OnlineSecurity, cols = 2)
Expecting InternetService, and OnlineSecurity to be predictors.
p_OnlineBackup <- ggplot(dataset, aes(x = OnlineBackup, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_DeviceProtection <- ggplot(dataset, aes(x = DeviceProtection, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_TechSupport <- ggplot(dataset, aes(x = TechSupport, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_StreamingTV <- ggplot(dataset, aes(x = StreamingTV, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
multiplot(p_OnlineBackup, p_DeviceProtection, p_TechSupport, p_StreamingTV, cols = 2)
Expecting OnlineBackup, TechSupport, DeviceProtection and StreamingTV to be predictors.
p_StreamingMovies <- ggplot(dataset, aes(x = StreamingMovies, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_Contract <- ggplot(dataset, aes(x = Contract, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_PaperlessBilling <- ggplot(dataset, aes(x = PaperlessBilling, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count')
p_PaymentMethod <- ggplot(dataset, aes(x = PaymentMethod, fill = Churn)) +
geom_bar(position = 'fill') +
ylab('Count') +
coord_flip()
multiplot(p_StreamingMovies, p_Contract, p_PaperlessBilling, p_PaymentMethod, cols = 2)
Expecting StreamingMovies, PaperlessBilling, Contract and PaymentMethod to be predictors.
Considering changing Tenure into factor
library(ggplot2)
ggplot(dataset, aes(x = as.factor(tenure))) +
geom_bar()
min(dataset$tenure)
## [1] 1
max(dataset$tenure)
## [1] 72
Since the minimum tenure is 1 month and maximum tenure is 72 months, we can group them into five tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Months”, “48–60 Month”, “> 60 Month”.
group_tenure <- function(tenure){
if (tenure >= 0 & tenure <= 12){
return('0-12 Months')
}else if(tenure > 12 & tenure <= 24){
return('12-24 Months')
}else if (tenure > 24 & tenure <= 48){
return('24-48 Months')
}else if (tenure > 48 & tenure <=60){
return('48-60 Months')
}else if (tenure > 60){
return('> 60 Months')
}
}
dataset$tenure_group <- sapply(dataset$tenure,group_tenure)
dataset$tenure_group <- as.factor(dataset$tenure_group)
ggplot(dataset, aes(x = tenure_group, fill = Churn)) +
geom_bar(position = 'fill')
Expecting tenure_group to be a predictor.
str(dataset)
## 'data.frame': 7032 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## $ tenure_group : Factor w/ 5 levels "> 60 Months",..: 2 4 2 4 2 2 3 2 4 1 ...
library(caTools)
set.seed(123)
split = sample.split(dataset$Churn, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Looking at the proportion of customers that churned
prop_before <- ggplot(training_set, aes(x = Churn, fill = Churn)) +
geom_bar() +
theme(legend.position="none")
prop_before
library(ROSE)
training_set <- ROSE(Churn ~ ., data=training_set, seed=123)$data
Looking at the proportion of customers that churned after sampling
prop_after <- ggplot(training_set, aes(x = Churn, fill = Churn)) +
geom_bar() +
theme(legend.position="none")
prop_after
Dropping tenure_group variable for DT
training_set_DT <- training_set[-20]
test_set_DT <- test_set[-20]
Fitting decision tree to the training set without tenure_group variable
library(rpart)
str(training_set_DT)
## 'data.frame': 5274 obs. of 19 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 2 2 2 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 1 2 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 1 2 1 1 ...
## $ tenure : num 56.1 69.1 55.5 54.4 41.1 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 2 1 2 2 1 2 2 2 2 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 2 1 3 2 2 1 1 2 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 1 2 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 2 2 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 2 2 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 2 2 2 2 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 2 2 2 2 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 3 2 2 1 3 2 2 3 1 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 1 2 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 1 1 3 1 3 2 1 1 1 3 ...
## $ MonthlyCharges : num 34.8 120.9 89.4 64.5 16.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
str(test_set_DT)
## 'data.frame': 1758 obs. of 19 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 1 2 2 1 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 2 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 1 2 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 1 1 1 ...
## $ tenure : int 34 22 10 16 69 58 71 2 1 13 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 1 3 2 1 2 2 1 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 2 1 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 1 1 2 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 1 2 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 2 1 1 3 3 3 3 1 1 1 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 2 2 1 2 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 2 4 2 2 2 2 2 1 3 ...
## $ MonthlyCharges : num 57 89.1 29.8 18.9 113.2 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
classifier_DT = rpart(formula = Churn ~ .,
data = training_set_DT)
Predicting the Test set results
y_pred_DT = predict(classifier_DT, newdata = test_set[-19],
type = 'class')
Applying k-Fold Cross Validation
library(caret)
set.seed(1)
folds_DT <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_DT, function(x) {
training_fold = training_set[-x,]
test_fold = training_set[x,]
classifier = rpart(formula = Churn ~ .-c(tenure_group),
data = training_fold)
y_pred_DT = predict(classifier, newdata = test_fold[-19],
type = 'class')
cm = table(test_fold[,19], y_pred_DT)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
Getting the accuracy
accuracy_DT <- mean(as.numeric(cv))
accuracy_DT
## [1] 0.7574904
Creating the Confussion Matrix
cm_DT <- table(test_set[,19], y_pred_DT)
cm_DT
## y_pred_DT
## No Yes
## No 859 432
## Yes 84 383
Accuracy is 75.75 %
There are 84 of those who left anaway, 432 of those who didn’t left anyway
Plotting the decision tree without pruning it
library(rpart.plot)
tree <- rpart(Churn ~ ., data = training_set_DT, method = "class")
tree
## n= 5274
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 5274 2617 No (0.5037922 0.4962078)
## 2) Contract=One year,Two year 1835 308 No (0.8321526 0.1678474) *
## 3) Contract=Month-to-month 3439 1130 Yes (0.3285839 0.6714161)
## 6) InternetService=DSL,No 1373 662 Yes (0.4821559 0.5178441)
## 12) tenure>=12.46047 501 169 No (0.6626747 0.3373253) *
## 13) tenure< 12.46047 872 330 Yes (0.3784404 0.6215596) *
## 7) InternetService=Fiber optic 2066 468 Yes (0.2265247 0.7734753) *
prp(tree, type = 2, extra = 6, nn = TRUE,
fallen.leaves = TRUE, varlen = 0, faclen = 0)
rm(training_set_DT)
rm(test_set_DT)
contract, InternetService and tenure are identified as the most important predictors.
Removing the tenure variable
training_set$tenure <- NULL
test_set$tenure <- NULL
Variable selection for logistic regression with stepAIC
Computing k value for stepAIC to have p = 0.05 as a treshold
computed_k <- qchisq(p = 0.05, 1, lower.tail = F)
Running stepAIC()
initial <- glm(Churn ~ . ,
data = training_set,
family = binomial)
library(MASS)
stepAIC(initial, direction = 'both', k = computed_k)
## Start: AIC=5133.01
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies +
## Contract + PaperlessBilling + PaymentMethod + MonthlyCharges +
## tenure_group
##
## Df Deviance AIC
## - OnlineBackup 1 5033.1 5129.2
## - MonthlyCharges 1 5033.7 5129.7
## - DeviceProtection 1 5034.0 5130.0
## - StreamingTV 1 5034.3 5130.3
## - gender 1 5034.3 5130.4
## - SeniorCitizen 1 5035.1 5131.2
## - Dependents 1 5036.1 5132.1
## - Partner 1 5036.3 5132.4
## <none> 5033.1 5133.0
## - StreamingMovies 1 5038.4 5134.5
## - PaperlessBilling 1 5038.7 5134.7
## - PhoneService 1 5038.7 5134.7
## - MultipleLines 1 5046.5 5142.6
## - OnlineSecurity 1 5052.6 5148.6
## - TechSupport 1 5054.7 5150.7
## - PaymentMethod 3 5065.6 5153.9
## - InternetService 2 5072.0 5164.2
## - Contract 2 5150.7 5242.9
## - tenure_group 4 5273.4 5357.9
##
## Step: AIC=5129.18
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + DeviceProtection +
## TechSupport + StreamingTV + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + MonthlyCharges + tenure_group
##
## Df Deviance AIC
## - MonthlyCharges 1 5033.8 5126.0
## - DeviceProtection 1 5034.0 5126.2
## - StreamingTV 1 5034.3 5126.5
## - gender 1 5034.3 5126.5
## - SeniorCitizen 1 5035.2 5127.4
## - Dependents 1 5036.1 5128.3
## - Partner 1 5036.3 5128.5
## <none> 5033.1 5129.2
## - StreamingMovies 1 5038.5 5130.7
## - PaperlessBilling 1 5038.7 5130.9
## - PhoneService 1 5039.0 5131.1
## + OnlineBackup 1 5033.1 5133.0
## - MultipleLines 1 5046.6 5138.8
## - OnlineSecurity 1 5052.7 5144.9
## - TechSupport 1 5054.7 5146.9
## - PaymentMethod 3 5065.6 5150.1
## - InternetService 2 5073.0 5161.4
## - Contract 2 5150.8 5239.2
## - tenure_group 4 5281.8 5362.5
##
## Step: AIC=5126
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + DeviceProtection +
## TechSupport + StreamingTV + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + tenure_group
##
## Df Deviance AIC
## - DeviceProtection 1 5034.4 5122.7
## - gender 1 5035.0 5123.3
## - SeniorCitizen 1 5035.8 5124.2
## - StreamingTV 1 5036.3 5124.6
## - Dependents 1 5036.7 5125.1
## - Partner 1 5037.0 5125.3
## <none> 5033.8 5126.0
## - PhoneService 1 5039.2 5127.5
## - PaperlessBilling 1 5039.4 5127.8
## + MonthlyCharges 1 5033.1 5129.2
## + OnlineBackup 1 5033.7 5129.7
## - StreamingMovies 1 5042.4 5130.8
## - MultipleLines 1 5049.7 5138.0
## - OnlineSecurity 1 5052.7 5141.1
## - TechSupport 1 5054.9 5143.2
## - PaymentMethod 3 5066.1 5146.8
## - Contract 2 5152.0 5236.5
## - InternetService 2 5231.8 5316.3
## - tenure_group 4 5281.9 5358.7
##
## Step: AIC=5122.71
## Churn ~ gender + SeniorCitizen + Partner + Dependents + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + tenure_group
##
## Df Deviance AIC
## - gender 1 5035.5 5120.1
## - SeniorCitizen 1 5036.4 5121.0
## - StreamingTV 1 5036.7 5121.2
## - Dependents 1 5037.3 5121.8
## - Partner 1 5037.4 5121.9
## <none> 5034.4 5122.7
## - PhoneService 1 5039.7 5124.2
## - PaperlessBilling 1 5040.0 5124.5
## + DeviceProtection 1 5033.8 5126.0
## + MonthlyCharges 1 5034.0 5126.2
## + OnlineBackup 1 5034.3 5126.5
## - StreamingMovies 1 5042.6 5127.1
## - MultipleLines 1 5050.0 5134.5
## - OnlineSecurity 1 5053.3 5137.8
## - TechSupport 1 5055.9 5140.4
## - PaymentMethod 3 5066.8 5143.6
## - Contract 2 5155.3 5235.9
## - InternetService 2 5233.2 5313.9
## - tenure_group 4 5287.9 5360.9
##
## Step: AIC=5120.06
## Churn ~ SeniorCitizen + Partner + Dependents + PhoneService +
## MultipleLines + InternetService + OnlineSecurity + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + tenure_group
##
## Df Deviance AIC
## - SeniorCitizen 1 5037.6 5118.3
## - StreamingTV 1 5037.9 5118.6
## - Dependents 1 5038.4 5119.1
## - Partner 1 5038.5 5119.2
## <none> 5035.5 5120.1
## - PhoneService 1 5040.8 5121.5
## - PaperlessBilling 1 5041.1 5121.8
## + gender 1 5034.4 5122.7
## + DeviceProtection 1 5035.0 5123.3
## + MonthlyCharges 1 5035.2 5123.6
## + OnlineBackup 1 5035.5 5123.8
## - StreamingMovies 1 5043.7 5124.4
## - MultipleLines 1 5051.2 5131.9
## - OnlineSecurity 1 5054.9 5135.6
## - TechSupport 1 5057.1 5137.8
## - PaymentMethod 3 5068.3 5141.3
## - Contract 2 5156.8 5233.7
## - InternetService 2 5233.9 5310.7
## - tenure_group 4 5288.5 5357.6
##
## Step: AIC=5118.3
## Churn ~ Partner + Dependents + PhoneService + MultipleLines +
## InternetService + OnlineSecurity + TechSupport + StreamingTV +
## StreamingMovies + Contract + PaperlessBilling + PaymentMethod +
## tenure_group
##
## Df Deviance AIC
## - StreamingTV 1 5040.0 5116.8
## - Partner 1 5041.2 5118.0
## <none> 5037.6 5118.3
## - Dependents 1 5041.6 5118.4
## + SeniorCitizen 1 5035.5 5120.1
## - PhoneService 1 5043.3 5120.1
## - PaperlessBilling 1 5043.6 5120.4
## + gender 1 5036.4 5121.0
## + DeviceProtection 1 5037.0 5121.5
## + MonthlyCharges 1 5037.3 5121.8
## + OnlineBackup 1 5037.5 5122.1
## - StreamingMovies 1 5045.7 5122.5
## - MultipleLines 1 5054.0 5130.8
## - OnlineSecurity 1 5057.2 5134.0
## - TechSupport 1 5060.2 5137.1
## - PaymentMethod 3 5071.4 5140.5
## - Contract 2 5161.9 5234.9
## - InternetService 2 5243.3 5316.3
## - tenure_group 4 5288.6 5353.9
##
## Step: AIC=5116.78
## Churn ~ Partner + Dependents + PhoneService + MultipleLines +
## InternetService + OnlineSecurity + TechSupport + StreamingMovies +
## Contract + PaperlessBilling + PaymentMethod + tenure_group
##
## Df Deviance AIC
## - Partner 1 5043.6 5116.5
## <none> 5040.0 5116.8
## - Dependents 1 5043.9 5116.9
## + StreamingTV 1 5037.6 5118.3
## + SeniorCitizen 1 5037.9 5118.6
## - PhoneService 1 5045.9 5118.9
## + MonthlyCharges 1 5038.6 5119.2
## + gender 1 5038.7 5119.4
## - PaperlessBilling 1 5046.5 5119.5
## + DeviceProtection 1 5039.5 5120.2
## + OnlineBackup 1 5039.8 5120.5
## - StreamingMovies 1 5053.2 5126.2
## - MultipleLines 1 5057.2 5130.2
## - OnlineSecurity 1 5059.5 5132.5
## - TechSupport 1 5061.5 5134.5
## - PaymentMethod 3 5074.9 5140.2
## - Contract 2 5162.3 5231.5
## - InternetService 2 5259.6 5328.7
## - tenure_group 4 5289.2 5350.7
##
## Step: AIC=5116.54
## Churn ~ Dependents + PhoneService + MultipleLines + InternetService +
## OnlineSecurity + TechSupport + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + tenure_group
##
## Df Deviance AIC
## - Dependents 1 5045.2 5114.3
## <none> 5043.6 5116.5
## + Partner 1 5040.0 5116.8
## + SeniorCitizen 1 5040.9 5117.7
## + StreamingTV 1 5041.2 5118.0
## - PhoneService 1 5049.5 5118.6
## + MonthlyCharges 1 5042.2 5119.0
## + gender 1 5042.3 5119.2
## - PaperlessBilling 1 5050.2 5119.3
## + DeviceProtection 1 5043.2 5120.1
## + OnlineBackup 1 5043.4 5120.3
## - StreamingMovies 1 5057.5 5126.7
## - MultipleLines 1 5061.0 5130.1
## - OnlineSecurity 1 5062.6 5131.7
## - TechSupport 1 5065.8 5134.9
## - PaymentMethod 3 5078.9 5140.3
## - Contract 2 5166.0 5231.3
## - InternetService 2 5264.3 5329.6
## - tenure_group 4 5290.5 5348.1
##
## Step: AIC=5114.34
## Churn ~ PhoneService + MultipleLines + InternetService + OnlineSecurity +
## TechSupport + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + tenure_group
##
## Df Deviance AIC
## <none> 5045.2 5114.3
## + SeniorCitizen 1 5041.9 5114.9
## + StreamingTV 1 5042.9 5115.9
## + Dependents 1 5043.6 5116.5
## - PhoneService 1 5051.3 5116.6
## + MonthlyCharges 1 5043.9 5116.8
## + Partner 1 5043.9 5116.9
## + gender 1 5044.0 5117.0
## - PaperlessBilling 1 5052.3 5117.6
## + DeviceProtection 1 5044.8 5117.8
## + OnlineBackup 1 5045.1 5118.1
## - StreamingMovies 1 5059.0 5124.3
## - MultipleLines 1 5063.2 5128.5
## - OnlineSecurity 1 5064.6 5129.9
## - TechSupport 1 5067.7 5133.0
## - PaymentMethod 3 5081.4 5139.0
## - Contract 2 5171.8 5233.2
## - InternetService 2 5270.7 5332.1
## - tenure_group 4 5295.2 5349.0
##
## Call: glm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
## OnlineSecurity + TechSupport + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + tenure_group, family = binomial,
## data = training_set)
##
## Coefficients:
## (Intercept)
## -1.19163
## PhoneServiceYes
## -0.34180
## MultipleLinesYes
## 0.36673
## InternetServiceFiber optic
## 0.93406
## InternetServiceNo
## -0.91576
## OnlineSecurityYes
## -0.38856
## TechSupportYes
## -0.41665
## StreamingMoviesYes
## 0.31064
## ContractOne year
## -0.68940
## ContractTwo year
## -1.74532
## PaperlessBillingYes
## 0.21014
## PaymentMethodCredit card (automatic)
## -0.01245
## PaymentMethodElectronic check
## 0.41793
## PaymentMethodMailed check
## -0.08411
## tenure_group0-12 Months
## 2.00364
## tenure_group12-24 Months
## 0.91525
## tenure_group24-48 Months
## 0.63350
## tenure_group48-60 Months
## 0.40948
##
## Degrees of Freedom: 5273 Total (i.e. Null); 5256 Residual
## Null Deviance: 7311
## Residual Deviance: 5045 AIC: 5081
classifier_LR <- glm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
OnlineSecurity + TechSupport + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod + tenure_group, family = binomial,
data = training_set)
summary(classifier_LR)
##
## Call:
## glm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
## OnlineSecurity + TechSupport + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + tenure_group, family = binomial,
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3555 -0.7476 -0.1730 0.7765 2.7892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.19163 0.21620 -5.512 3.55e-08
## PhoneServiceYes -0.34180 0.13804 -2.476 0.013280
## MultipleLinesYes 0.36673 0.08660 4.235 2.29e-05
## InternetServiceFiber optic 0.93406 0.09642 9.688 < 2e-16
## InternetServiceNo -0.91576 0.13162 -6.957 3.46e-12
## OnlineSecurityYes -0.38856 0.08796 -4.417 9.99e-06
## TechSupportYes -0.41665 0.08745 -4.765 1.89e-06
## StreamingMoviesYes 0.31064 0.08402 3.697 0.000218
## ContractOne year -0.68940 0.10405 -6.626 3.46e-11
## ContractTwo year -1.74532 0.17400 -10.031 < 2e-16
## PaperlessBillingYes 0.21014 0.07845 2.679 0.007391
## PaymentMethodCredit card (automatic) -0.01245 0.11763 -0.106 0.915728
## PaymentMethodElectronic check 0.41793 0.10005 4.177 2.95e-05
## PaymentMethodMailed check -0.08411 0.11643 -0.722 0.470009
## tenure_group0-12 Months 2.00364 0.16700 11.998 < 2e-16
## tenure_group12-24 Months 0.91525 0.16296 5.616 1.95e-08
## tenure_group24-48 Months 0.63350 0.14764 4.291 1.78e-05
## tenure_group48-60 Months 0.40948 0.16264 2.518 0.011814
##
## (Intercept) ***
## PhoneServiceYes *
## MultipleLinesYes ***
## InternetServiceFiber optic ***
## InternetServiceNo ***
## OnlineSecurityYes ***
## TechSupportYes ***
## StreamingMoviesYes ***
## ContractOne year ***
## ContractTwo year ***
## PaperlessBillingYes **
## PaymentMethodCredit card (automatic)
## PaymentMethodElectronic check ***
## PaymentMethodMailed check
## tenure_group0-12 Months ***
## tenure_group12-24 Months ***
## tenure_group24-48 Months ***
## tenure_group48-60 Months *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7311.0 on 5273 degrees of freedom
## Residual deviance: 5045.2 on 5256 degrees of freedom
## AIC: 5081.2
##
## Number of Fisher Scoring iterations: 5
library(car)
vif(classifier_LR)
## GVIF Df GVIF^(1/(2*Df))
## PhoneService 1.316723 1 1.147485
## MultipleLines 1.523190 1 1.234176
## InternetService 2.299331 2 1.231404
## OnlineSecurity 1.151293 1 1.072983
## TechSupport 1.177450 1 1.085104
## StreamingMovies 1.404921 1 1.185294
## Contract 1.790894 2 1.156824
## PaperlessBilling 1.142792 1 1.069015
## PaymentMethod 1.392604 3 1.056748
## tenure_group 2.381935 4 1.114593
Getting the probabilities of the training set
prob_pred_LR_train = predict(classifier_LR, type = 'response',
newdata = training_set[-18])
Predicting the Test set results
str(test_set)
## 'data.frame': 1758 obs. of 19 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 1 2 2 1 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 2 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 1 2 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 1 1 1 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 1 3 2 1 2 2 1 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 2 1 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 2 1 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 1 1 2 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 1 2 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 2 1 1 3 3 3 3 1 1 1 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 2 2 1 2 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 2 4 2 2 2 2 2 1 3 ...
## $ MonthlyCharges : num 57 89.1 29.8 18.9 113.2 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ tenure_group : Factor w/ 5 levels "> 60 Months",..: 4 3 2 3 1 5 1 2 2 3 ...
prob_pred_LR = predict(classifier_LR, type = 'response',
newdata = test_set[-18])
y_pred_LR = ifelse(prob_pred_LR > 0.5, 1, 0)
Applying k-Fold Cross Validation
set.seed(2)
folds_LR <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_LR, function(x) {
training_fold = training_set[-x,]
test_fold = training_set[x,]
classifier = glm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
OnlineSecurity + TechSupport + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod + tenure_group,
family = binomial,
data = training_set)
prob_pred_LR = predict(classifier, newdata = test_fold[-18])
y_pred_LR = ifelse(prob_pred_LR > 0.5, 1, 0)
cm = table(test_fold[,18], y_pred_LR)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
Getting the accuracy
accuracy_LR <- mean(as.numeric(cv))
accuracy_LR
## [1] 0.7582512
Creating the Confussion Matrix
cm_LR <- table(test_set[,18], y_pred_LR)
cm_LR
## y_pred_LR
## 0 1
## No 925 366
## Yes 94 373
Accuracy is 75.83 %. There are 94 of those who left anaway, 366 of those who didn’t left anyway.
Plotting the ROC curve
library(ROCR)
plot(performance(prediction(prob_pred_LR, test_set$Churn), measure ='tpr',
x.measure ='fpr'),lwd = 7, main = 'ROC curve', col = 'green')
lines(x=c(0, 1), y=c(0, 1), col="blue", lwd=1)
Computing area under the ROC curve
pr <- prediction(prob_pred_LR, test_set$Churn)
auroc <- performance(pr, measure = 'auc')
auroc <- auroc@y.values[[1]]
auc_rounded <- round(auroc,2)
print(paste('Area under the receiver operating characteristic curve is', auc_rounded))
## [1] "Area under the receiver operating characteristic curve is 0.83"
Plotting the CAP curve
test_set$Churn <- as.character(test_set$Churn)
test_set$Churn <- as.numeric(mapvalues(from = c('Yes', 'No'),
to = c('1', '0'), test_set$Churn))
library(plotrix)
plot(performance(prediction(prob_pred_LR, test_set$Churn), 'tpr', 'rpp'),
lwd = 7, main = 'CAP curve', col = 'red')
lines(x=c(0, 1), y=c(0, 1), col="blue", lwd=1)
ablineclip(v = 0.5, y2 = 0.863, lty = 'dashed', col = 'green', lwd = 3)
ablineclip(h = 0.863, x2 = 0.5, lty = 'dashed', col = 'green', lwd = 3)
library(e1071)
classifier_SVM = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
OnlineSecurity + TechSupport + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod + tenure_group,
data = training_set,
type = 'C-classification',
kernel = 'linear')
Predicting the Test set results
y_pred_SVM = predict(classifier_SVM, newdata = test_set[-18])
Applying k-Fold Cross Validation
set.seed(3)
folds_SVM <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_SVM, function(x) {
training_fold = training_set[-x,]
test_fold = training_set[x,]
classifier = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
OnlineSecurity + TechSupport + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod + tenure_group,
data = training_fold,
type = 'C-classification',
kernel = 'linear')
y_pred_SVM = predict(classifier, newdata = test_fold[-18])
cm = table(test_fold[,18], y_pred_SVM)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
Getting the accuracy
accuracy_SVM <- mean(as.numeric(cv))
accuracy_SVM
## [1] 0.7400554
Creating the Confussion Matrix
cm_SVM <- table(test_set[,18], y_pred_SVM)
cm_SVM
## y_pred_SVM
## No Yes
## 0 832 459
## 1 82 385
Accuracy is 74.00 %. There are 82 of those who left anaway, 459 of those who didn’t left anyway.
classifier_KSVM = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
OnlineSecurity + TechSupport + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod + tenure_group,
data = training_set,
type = 'C-classification',
kernel = 'radial')
Predicting the Test set results
y_pred_KSVM = predict(classifier_KSVM, newdata = test_set[-18])
Applying k-Fold Cross Validation
set.seed(4)
folds_KSVM <- createFolds(training_set$Churn, k = 10)
cv = lapply(folds_SVM, function(x) {
training_fold = training_set[-x,]
test_fold = training_set[x,]
classifier = svm(formula = Churn ~ PhoneService + MultipleLines + InternetService +
OnlineSecurity + TechSupport + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod + tenure_group,
data = training_fold,
type = 'C-classification',
kernel = 'radial')
y_pred_KSVM = predict(classifier, newdata = test_fold[-18])
cm = table(test_fold[,18], y_pred_KSVM)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
Getting the accuracy
accuracy_KSVM <- mean(as.numeric(cv))
accuracy_KSVM
## [1] 0.7635665
Creating the Confussion Matrix
cm_KSVM <- table(test_set[,18], y_pred_KSVM)
cm_KSVM
## y_pred_KSVM
## No Yes
## 0 917 374
## 1 99 368
Accuracy is 76.45%. There are 99 of those who left anaway, 374 of those who didn’t left anyway Not signifficantly better result from logistic regression and SVM => data are linearly separable.
Logistic regression and SVM have similar results => tuning the SVM model.
Applying Grid search to find the best parameters
classifier_SVM_tuned <- train(Churn ~ PhoneService + MultipleLines + InternetService +
OnlineSecurity + TechSupport + StreamingMovies + Contract +
PaperlessBilling + PaymentMethod + tenure_group,
data = training_set,
method = 'svmLinear')
classifier_SVM_tuned
## Support Vector Machines with Linear Kernel
##
## 5274 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5274, 5274, 5274, 5274, 5274, 5274, ...
## Resampling results:
##
## Accuracy Kappa
## 0.747551 0.4956568
##
## Tuning parameter 'C' was held constant at a value of 1
The best performance is with parameter c = 1, which is default, so the performance is still the same.
Logistic regression is choosed, because of better interpretation and ability to rank customers by highest probability to leave.
The example:
example <- test_set[13,]
example
example_prob <- prob_pred_LR[13]
example_prob
## 52
## 0.607221
List of the further provided information:
| Variable name | Data type | Variable description |
|---|---|---|
| customerID | factor | a unique number for identifying a customer |
| gender | factor | available values - female/male |
| SeniorCitizen | integer | whether the customer is a senior citizen or not (available values: 1/0, where 1 means ‘Yes’ and 0 means ‘No’) |
| Partner | factor | whether the customer has a partner or not (available values: Yes/No) |
| Dependents | factor | whether the customer has dependents or not (available values: Yes/No) |
| tenure | integer | number of months the customer has stayed with the company |
| PhoneService | factor | whether the customer has a phone service or not (available values: Yes/No) |
| MultipleLines | factor | whether the customer has multiple lines or not (available values: Yes/No/No phone service) |
| InternetService | factor | customer’s internet service provider (available values: DSL/Fiber optic/No) |
| OnlineSecurity | factor | whether the customer has online security or not (available values: Yes/No/No internet service) |
| OnlineBackup | factor | whether the customer has online backup or not (available values: Yes/No/No internet service) |
| DeviceProtection | factor | whether the customer has device protection or not (available values: Yes/No/No internet service) |
| TechSupport | factor | whether the customer has tech support or not (available values: Yes/No/No internet service) |
| streamingTV | factor | whether the customer has streaming TV or not (available values: Yes/No/No internet service) |
| streamingMovies | factor | whether the customer has streaming movies or not (available values: Yes/No/No internet service) |
| Contract | factor | the contract term of the customer (available values: Month-to-month/One year/Two year) |
| PaperlessBilling | factor | whether the customer has paperless billing or not (available values: Yes/No) |
| PaymentMethod | factor | the customer’s payment method (available values: Electronic check/Mailed check/Bank transfer (automatic)/Credit card (automatic)) |
| MonthlyCharges | integer | the amount charged to the customer monthly(numeric variable) |
| TotalCharges | integer | the total amount charged to the customer(numeric variable) |
Provided dataset contains information about 7043 customers.
stepAIC
It is an R function from MASS package that uses Akaike information criterion for selecting the best subset of features. Akaike information criterion offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. So the goal is to find the model with the smallest AIC by removing or adding variables in the scope. Since each variable in the model is penalized with a factor of 2, this leads to logistic regression tests with p-values < 0.1573. In order to add criterium for selecting the variables with p < 0.05, the correct k parameter was computed .
The cross validation
It is used as a reassurence that the model is not overfitting. Overfitting would mean that the model did not find the general patterns in the dataset, but the prediction on a test set are based on memorizing the training set. In this case, the model would performed very well on the training set (since it would be memorized), but very bad on the new sets of data (which the test set and every new case that needs to be predicted), because the model would not know the general patterns. Cross validation is able to check for overfitting, because within this procedure the train set is devided into 10 portions, the model is trained 10 times, each time on different combination of 9 portions of the training set and the remaining 10th portion is used for the testing and computing the accuracy. The final accuracy of the model is computed as the average of 10 accuracies computed within cross validation.
Logistic regression
Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a probability that the given input point belongs to a certain class. So when we have two classes and the probability in question is P_+ -> the probability that a certain data point belongs to the ‘+‘ class. Ofcourse, P_- = 1 - P_+. Thus, the output of Logistic Regression always lies in [0, 1].
SVM
Support vector machine algorithm is in some way different from other machine learning algorithms. The way SVM searches for the best line wich is a boundary between two classes (called maximum margin hyperplane) is through the maximum margin. In two dimensial space margin is represented by the two distances. The distances are between the boundary line and closest points from each class, that are called support vectors. They are called ‘support’ because the other vectors doesn’t contribute to the result of the algorithm, and they are called vectors, because in more dimensional space we can’t visualize them like points, they are vectors. SVM are special because if the machine is supposed to distinquish for example between apples and oranges, the most of the algorithms would look at the most stock standard looking apples in the training set to learn, how an apple look like and the same for oranges. But SVM algorithm look at an apple that looks like an orange, and at the orange that looks like an apple - and these examples are those support vectors.
Linearly separable dataset
It means that the input space can be separated into two nice ‘regions’, one for each class, by a linear(straight) boundary. For two dimensions, its a straight line- no curving. For three dimensions, its a plane. And so on. This boundary will ofcourse be decided by the input data and the learning algorithm, but in for example two dimensional space it means that the data points MUST be separable into the two aforementioned regions by a linear boundary. If the data points do satisfy this constraint, they are said to be linear-separable.
Kernel SVM
There is a method used for non linearly separable dataset. With help of a mapping function it moves the data into higher dimension space, where the data become linearly separable and finds the boundary between them (using SVM) and then the data are projected to the original space. But his is highly compute-intensive.
What Kernel SVM does is, that it uses the Kernel trick. In example of two dimensional space, where the data are separable by a circle, radial kernel uses a method to find a landmark in the middle of that circle and by fitting every data point into a kernel function it is possible to calculate which class the given data point belongs to. The size of the circle is given by the parameter sigma. It is possible to add up more Kernel functions in order to create more complex boundaries.
Another examples are Sigmoid Kernel and Polynomial Kernel which is are directional functions.