This report is intended to predicting Customer Churn in a Telecommunication Company. By using the historical data of their customer’s profile, we hope that we can get some insightful outputs, and predict which kind of customer that going to end their subscription, and determine what kind of treatments that expected going to retain them effectively. Therefore, we will going to need some deep interpretation on our analytics and select the most significant variables. Hopefully, this task will be successfully delivered by using Logistic Regression Model.
Data Exploration
First of all, let’s take a look closer to the Customer Churn data.
customer <- read.csv("Customer-Churn.csv")
master.customer <- customer
#customer <- master.customer
str(customer)## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies
## No :2810 No :2785
## No internet service:1526 No internet service:1526
## Yes :2707 Yes :2732
##
##
##
##
## Contract PaperlessBilling PaymentMethod
## Month-to-month:3875 No :2872 Bank transfer (automatic):1544
## One year :1473 Yes:4171 Credit card (automatic) :1522
## Two year :1695 Electronic check :2365
## Mailed check :1612
##
##
##
## MonthlyCharges TotalCharges Churn
## Min. : 18.25 Min. : 18.8 No :5174
## 1st Qu.: 35.50 1st Qu.: 401.4 Yes:1869
## Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
Here is the brief explanation of all 21 variables on this data:
+ customerID - Customer ID
+ gender - Whether the customer is a male or a female
+ SeniorCitizen - Whether the customer is a senior citizen or not (1, 0)
+ Partner - Whether the customer has a partner or not (Yes, No)
+ Dependents - Whether the customer has dependents or not (Yes, No)
+ tenure - Number of months the customer has stayed with the company
+ PhoneService - Whether the customer has a phone service or not (Yes, No)
+ MultipleLines - Whether the customer has multiple lines or not (Yes, No, No phone service)
+ InternetService - Customer’s internet service provider (DSL, Fiber optic, No)
+ OnlineSecurity - Whether the customer has online security or not (Yes, No, No internet service)
+ OnlineBackup - Whether the customer has online backup or not (Yes, No, No internet service)
+ DeviceProtection - Whether the customer has device protection or not (Yes, No, No internet service)
+ TechSupport - Whether the customer has tech support or not (Yes, No, No internet service)
+ StreamingTV - Whether the customer has streaming TV or not (Yes, No, No internet service)
+ StreamingMovies - Whether the customer has streaming movies or not (Yes, No, No internet service)
+ Contract - The contract term of the customer (Month-to-month, One year, Two year)
+ PaperlessBilling - Whether the customer has paperless billing or not (Yes, No)
+ PaymentMethod - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
+ MonthlyCharges - The amount charged to the customer monthly
+ TotalCharges - The total amount charged to the customer
+ Churn - Whether the customer churned or not (Yes or No)
As we can see on the str function, the customerID won’t give us much information since it was a unique variable. So we will going to take it out. And then, we can see from the summary function we can see that we will need to modify the SeniorCitizen because it only consist of two value, and should be represented as categorical variable. Another finding, there are some NA’s value at TotalCharges. We will going to modify them a little bit:
customer <- customer %>%
select(-customerID) %>%
mutate(SeniorCitizen = as.factor(ifelse(SeniorCitizen == 1,"Yes","No"))) %>%
drop_na() # modify the dataA quick check whether still NA’s values.
## [1] FALSE
Since we are going to predict Customer Churn, the variable which going to be the target is Churn. Let’s see the proportion between churned customer and not churned.
##
## No Yes
## 5163 1869
##
## No Yes
## 0.734215 0.265785
Unfortunately, the data proportion is heavily dominated by the non-churner -a common practice actually, but it will need some balancing on the next process.
Feature Selection on Business Wise
Gender Bias
Based on the Company’s Policy, there should be no discrimination among customers based on their gender. Therefore, the gender variable should be eliminated in the analysis. But as a data scientist, we should take a look to the data, just to ensure it won’t going to give us trouble on next process.
##
## Female Male
## No 0.3617747 0.3724403
## Yes 0.1335324 0.1322526
Based on the proportion table, clearly we can see that there is no difference between Male and Female on their decision to be a churner or no. So it will not going to affect the prediction process if we decide to take them out:
Phone Service
If we take a look closer at the summary of customer dataset, we can see that there is strong collinearity between PhoneService and MultipleLines. The number of customers who don’t have PhoneService are precisely reflected on the MultipleLine, which categorized as “No phone service”. This may create singularity during modeling. Therefore, we are going to take out the customer who dont have PhoneService (categorized as “No”), and automatically this variable won’t be relevant and can be eliminated:
Internet Service
Similar finding with the PhoneService, the InternetService also giving the same behavior. There are 6 Variable with “No internet service” category which have the same numbers with customer who didn’t have InternetService. Clearly, we are going to take them out. But this time we will keep the InternetService variable since there are still two category left, and can be compared:
Cross Validation
1st Split
Before going to take another step on modelling, let us split the data for modelling (80%) and evaluation (20%) purpose:
set.seed(289)
index <- sample(nrow(customer), nrow(customer)*0.2)
eval_customer <- customer[index, ]
model_customer <- customer[-index, ]
prop.table(table(eval_customer$Churn))##
## No Yes
## 0.679089 0.320911
##
## No Yes
## 0.6699431 0.3300569
## [1] 966 17
## [1] 3866 17
The proportion of churner and non-churner seems changed from the previous calculation, because we eliminate some observations during the Feature Selection. But still, it is not balanced.
2nd Split
Let us split the modelling data into train set (70%) and test set (30%).
set.seed(168)
index1 <- sample(nrow(model_customer), nrow(model_customer)*0.7)
train_customer <- model_customer[index1, ]
test_customer <- model_customer[-index1, ]
prop.table(table(train_customer$Churn))##
## No Yes
## 0.6747967 0.3252033
##
## No Yes
## 0.6586207 0.3413793
## [1] 2706 17
## [1] 1160 17
The proportion seems similar with the evaluation set, let’s proceed to the next step.
Pra Processing Data
It was clearly that we need to balance the proportion of churner and non-churner, and we will balance it on train and test set.
train_customer.new <- downSample(x = train_customer[ , -17],
y = train_customer$Churn,
yname = "Churn")
test_customer.new <- downSample(x = test_customer[ , -17],
y = test_customer$Churn,
yname = "Churn")
prop.table(table(train_customer.new$Churn))##
## No Yes
## 0.5 0.5
##
## No Yes
## 0.5 0.5
Good, both of them now are perfectly balanced. Let’s go to the modelling part.
Modelling
First of all, since we are going to use Logistic Regression, let’s change the category in Churn variable from “Yes” into “1” and “No into”0" for the sake of simplicity.
train_customer.new <- train_customer.new %>%
mutate(Churn = ifelse(Churn == "Yes",1,0))
test_customer.new <- test_customer.new %>%
mutate(Churn = ifelse(Churn == "Yes",1,0))First Modelling
model_cust1 <- glm(formula = Churn~., data = train_customer.new, family = "binomial")
summary(model_cust1)##
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = train_customer.new)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0159 -0.8780 0.1714 0.8407 2.7981
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.9838335 0.4530403 -2.172
## SeniorCitizenYes 0.1454629 0.1443824 1.007
## PartnerYes 0.0832949 0.1346886 0.618
## DependentsYes -0.1866081 0.1547590 -1.206
## tenure -0.0583239 0.0136871 -4.261
## MultipleLinesYes 0.1901313 0.1353745 1.404
## OnlineSecurityYes -0.4937652 0.1335909 -3.696
## OnlineBackupYes -0.3290763 0.1324216 -2.485
## DeviceProtectionYes -0.1534875 0.1356088 -1.132
## TechSupportYes -0.3858567 0.1372921 -2.810
## StreamingTVYes -0.1166251 0.1527883 -0.763
## StreamingMoviesYes -0.3130408 0.1530001 -2.046
## ContractOne year -0.6390219 0.1803937 -3.542
## ContractTwo year -1.5839567 0.2897794 -5.466
## PaperlessBillingYes 0.4046841 0.1316211 3.075
## PaymentMethodCredit card (automatic) 0.1562398 0.1851553 0.844
## PaymentMethodElectronic check 0.3938941 0.1564075 2.518
## PaymentMethodMailed check 0.3878645 0.2083423 1.862
## MonthlyCharges 0.0239912 0.0065636 3.655
## TotalCharges 0.0003766 0.0001495 2.519
## Pr(>|z|)
## (Intercept) 0.029884 *
## SeniorCitizenYes 0.313703
## PartnerYes 0.536294
## DependentsYes 0.227895
## tenure 0.000020330 ***
## MultipleLinesYes 0.160175
## OnlineSecurityYes 0.000219 ***
## OnlineBackupYes 0.012953 *
## DeviceProtectionYes 0.257702
## TechSupportYes 0.004947 **
## StreamingTVYes 0.445278
## StreamingMoviesYes 0.040755 *
## ContractOne year 0.000397 ***
## ContractTwo year 0.000000046 ***
## PaperlessBillingYes 0.002108 **
## PaymentMethodCredit card (automatic) 0.398764
## PaymentMethodElectronic check 0.011789 *
## PaymentMethodMailed check 0.062650 .
## MonthlyCharges 0.000257 ***
## TotalCharges 0.011768 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2439.9 on 1759 degrees of freedom
## Residual deviance: 1861.9 on 1740 degrees of freedom
## AIC: 1901.9
##
## Number of Fisher Scoring iterations: 5
Interpretation:
- The most significant coefs are tenure, ContractOne year, ContractTwo year and TotalCharges. The significant variables are OnlineBackupYes and MonthlyCharges. As for the quite significant variables are MultipleLinesyes, OnlineSecurityYes, StreamingTVYes and PaymentMethodElectronic check.
- Let’s start from the numeric variables. As we can see, the increment of tenure will 0.92 (exp(-0.0784127)) times decreased the churners. Well this is valid since a customer will less-likely to leave when they spend more time as subcriber than the new one. We can spend more promotion (like free trials) to the new customers to retain them on longer experience.
- In contrary, the increment of MonthlyCharges and TotalCharges will increasing the churners, each for 1.02 (exp(0.0203037)) and 1.00 (exp(0.0005480)) times. It seems the customers more likely to leave when they have high charges. We can set some discount with particular terms and conditions to retain them. - Next, the Contract of One and Two year will going to decrease likelihood of churner. This is valid since customer who attached with long-term contract will going to stay rather than the month-to-month basis customer. We should have promotion campaign for monthly basis customer.
- We also could see that customers who have extra service such as MultipleLines, OnlineSecurity, StreamingTV, and OnlineBackup will tend to be a non-churner rather than the one who only after for the standard services.
- Customer who pay with Electronic check is tend to be a churner rather than the other method.
Evaluate the Model
Let’s evaluate the model to check whether it was fitted or no. Let’s use the 0.5 threshold for initial test.
test_customer.new$Pred.Risk <- predict(model_cust1, test_customer.new, type = "response")
confusionMatrix(data=as.factor(as.numeric(test_customer.new$Pred.Risk>=0.5)), reference=as.factor(test_customer.new$Churn), dnn = c("Pred","Act"), positive="1")## Confusion Matrix and Statistics
##
## Act
## Pred 0 1
## 0 272 63
## 1 124 333
##
## Accuracy : 0.7639
## 95% CI : (0.7327, 0.7931)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5278
##
## Mcnemar's Test P-Value : 0.00001146
##
## Sensitivity : 0.8409
## Specificity : 0.6869
## Pos Pred Value : 0.7287
## Neg Pred Value : 0.8119
## Prevalence : 0.5000
## Detection Rate : 0.4205
## Detection Prevalence : 0.5770
## Balanced Accuracy : 0.7639
##
## 'Positive' Class : 1
##
The Accuracy of the initial model is 0.7626, not bad, but it seems we can make it better. And since our goal is to decrease the chance of churner who predicted as non-churner, we will seek for a higher Sensitivity (initial model scores 0.8359)
Improve the model
Stepwise Method
## Start: AIC=1901.89
## Churn ~ SeniorCitizen + Partner + Dependents + tenure + MultipleLines +
## OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## - Partner 1 1862.3 1900.3
## - StreamingTV 1 1862.5 1900.5
## - SeniorCitizen 1 1862.9 1900.9
## - DeviceProtection 1 1863.2 1901.2
## - Dependents 1 1863.3 1901.3
## - MultipleLines 1 1863.9 1901.9
## <none> 1861.9 1901.9
## - PaymentMethod 3 1869.3 1903.3
## - StreamingMovies 1 1866.1 1904.1
## - OnlineBackup 1 1868.0 1906.0
## - TotalCharges 1 1868.3 1906.3
## - TechSupport 1 1869.7 1907.7
## - PaperlessBilling 1 1871.3 1909.3
## - MonthlyCharges 1 1875.1 1913.1
## - OnlineSecurity 1 1875.4 1913.4
## - tenure 1 1880.8 1918.8
## - Contract 2 1897.3 1933.3
##
## Step: AIC=1900.27
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines +
## OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## - StreamingTV 1 1862.9 1898.9
## - Dependents 1 1863.4 1899.4
## - SeniorCitizen 1 1863.5 1899.5
## - DeviceProtection 1 1863.5 1899.5
## <none> 1862.3 1900.3
## - MultipleLines 1 1864.3 1900.3
## - PaymentMethod 3 1869.6 1901.6
## - StreamingMovies 1 1866.5 1902.5
## - OnlineBackup 1 1868.5 1904.5
## - TotalCharges 1 1868.7 1904.7
## - TechSupport 1 1870.1 1906.1
## - PaperlessBilling 1 1871.7 1907.7
## - OnlineSecurity 1 1875.8 1911.8
## - MonthlyCharges 1 1875.8 1911.8
## - tenure 1 1880.9 1916.9
## - Contract 2 1898.0 1932.0
##
## Step: AIC=1898.87
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines +
## OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
## StreamingMovies + Contract + PaperlessBilling + PaymentMethod +
## MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## - Dependents 1 1864.0 1898.0
## - DeviceProtection 1 1864.0 1898.0
## - SeniorCitizen 1 1864.2 1898.2
## <none> 1862.9 1898.9
## - MultipleLines 1 1865.2 1899.2
## - PaymentMethod 3 1870.1 1900.1
## - StreamingMovies 1 1867.2 1901.2
## - OnlineBackup 1 1868.8 1902.8
## - TotalCharges 1 1869.0 1903.0
## - TechSupport 1 1870.8 1904.8
## - PaperlessBilling 1 1872.1 1906.1
## - OnlineSecurity 1 1876.0 1910.0
## - MonthlyCharges 1 1876.2 1910.2
## - tenure 1 1881.2 1915.2
## - Contract 2 1899.7 1931.7
##
## Step: AIC=1897.97
## Churn ~ SeniorCitizen + tenure + MultipleLines + OnlineSecurity +
## OnlineBackup + DeviceProtection + TechSupport + StreamingMovies +
## Contract + PaperlessBilling + PaymentMethod + MonthlyCharges +
## TotalCharges
##
## Df Deviance AIC
## - DeviceProtection 1 1865.2 1897.2
## - SeniorCitizen 1 1865.7 1897.7
## <none> 1864.0 1898.0
## - MultipleLines 1 1866.3 1898.3
## - PaymentMethod 3 1871.2 1899.2
## - StreamingMovies 1 1868.3 1900.3
## - OnlineBackup 1 1869.9 1901.9
## - TotalCharges 1 1870.5 1902.5
## - TechSupport 1 1871.9 1903.9
## - PaperlessBilling 1 1873.5 1905.5
## - OnlineSecurity 1 1877.2 1909.2
## - MonthlyCharges 1 1877.2 1909.2
## - tenure 1 1883.1 1915.1
## - Contract 2 1901.6 1931.6
##
## Step: AIC=1897.2
## Churn ~ SeniorCitizen + tenure + MultipleLines + OnlineSecurity +
## OnlineBackup + TechSupport + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## - SeniorCitizen 1 1867.0 1897.0
## <none> 1865.2 1897.2
## - MultipleLines 1 1868.0 1898.0
## - PaymentMethod 3 1872.4 1898.4
## - StreamingMovies 1 1869.3 1899.3
## - OnlineBackup 1 1871.0 1901.0
## - TotalCharges 1 1871.5 1901.5
## - TechSupport 1 1873.3 1903.3
## - PaperlessBilling 1 1875.0 1905.0
## - MonthlyCharges 1 1877.3 1907.3
## - OnlineSecurity 1 1878.2 1908.2
## - tenure 1 1884.3 1914.3
## - Contract 2 1905.9 1933.9
##
## Step: AIC=1897.05
## Churn ~ tenure + MultipleLines + OnlineSecurity + OnlineBackup +
## TechSupport + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges + TotalCharges
##
## Df Deviance AIC
## <none> 1867.0 1897.0
## - MultipleLines 1 1870.1 1898.1
## - PaymentMethod 3 1874.5 1898.5
## - StreamingMovies 1 1871.1 1899.1
## - OnlineBackup 1 1873.2 1901.2
## - TotalCharges 1 1873.4 1901.4
## - TechSupport 1 1876.3 1904.3
## - PaperlessBilling 1 1877.4 1905.4
## - MonthlyCharges 1 1879.8 1907.8
## - OnlineSecurity 1 1880.8 1908.8
## - tenure 1 1886.0 1914.0
## - Contract 2 1909.9 1935.9
##
## Call:
## glm(formula = Churn ~ tenure + MultipleLines + OnlineSecurity +
## OnlineBackup + TechSupport + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges,
## family = "binomial", data = train_customer.new)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0476 -0.8926 0.1831 0.8404 2.8188
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.8528966 0.4302405 -1.982
## tenure -0.0578205 0.0135470 -4.268
## MultipleLinesYes 0.2314720 0.1334928 1.734
## OnlineSecurityYes -0.4938240 0.1325385 -3.726
## OnlineBackupYes -0.3267657 0.1315175 -2.485
## TechSupportYes -0.4152916 0.1360380 -3.053
## StreamingMoviesYes -0.3069506 0.1521664 -2.017
## ContractOne year -0.7031985 0.1776107 -3.959
## ContractTwo year -1.6909384 0.2853521 -5.926
## PaperlessBillingYes 0.4199034 0.1307430 3.212
## PaymentMethodCredit card (automatic) 0.1567953 0.1845085 0.850
## PaymentMethodElectronic check 0.3993476 0.1560503 2.559
## PaymentMethodMailed check 0.3617219 0.2073220 1.745
## MonthlyCharges 0.0212018 0.0059136 3.585
## TotalCharges 0.0003727 0.0001486 2.507
## Pr(>|z|)
## (Intercept) 0.047438 *
## tenure 0.00001970997 ***
## MultipleLinesYes 0.082924 .
## OnlineSecurityYes 0.000195 ***
## OnlineBackupYes 0.012970 *
## TechSupportYes 0.002267 **
## StreamingMoviesYes 0.043674 *
## ContractOne year 0.00007519748 ***
## ContractTwo year 0.00000000311 ***
## PaperlessBillingYes 0.001320 **
## PaymentMethodCredit card (automatic) 0.395436
## PaymentMethodElectronic check 0.010494 *
## PaymentMethodMailed check 0.081031 .
## MonthlyCharges 0.000337 ***
## TotalCharges 0.012174 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2439.9 on 1759 degrees of freedom
## Residual deviance: 1867.0 on 1745 degrees of freedom
## AIC: 1897
##
## Number of Fisher Scoring iterations: 5
test_pred <- predict(test_step, test_customer.new, type = "response")
confusionMatrix(data=as.factor(as.numeric(test_pred>=0.5)), reference=as.factor(test_customer.new$Churn), dnn = c("Pred","Act"), positive="1")## Confusion Matrix and Statistics
##
## Act
## Pred 0 1
## 0 269 64
## 1 127 332
##
## Accuracy : 0.7588
## 95% CI : (0.7275, 0.7883)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5177
##
## Mcnemar's Test P-Value : 0.000007252
##
## Sensitivity : 0.8384
## Specificity : 0.6793
## Pos Pred Value : 0.7233
## Neg Pred Value : 0.8078
## Prevalence : 0.5000
## Detection Rate : 0.4192
## Detection Prevalence : 0.5795
## Balanced Accuracy : 0.7588
##
## 'Positive' Class : 1
##
- By using stepwise, some of the variables are eliminated, such as:
DeviceProtection,Partner,DependentsandSeniorCitizen.
- Unfortunately, the Accuracy doesn’t increase and even worst, the Specivity is dropped into 0.8333.
- We will not going to use the stepwise to improve the model.
Changing Threshold
Let’s try to decrease the threshold to get a better Sensitivity, according to our goal.
## Confusion Matrix and Statistics
##
## Act
## Pred 0 1
## 0 232 34
## 1 164 362
##
## Accuracy : 0.75
## 95% CI : (0.7183, 0.7798)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.9141
## Specificity : 0.5859
## Pos Pred Value : 0.6882
## Neg Pred Value : 0.8722
## Prevalence : 0.5000
## Detection Rate : 0.4571
## Detection Prevalence : 0.6641
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : 1
##
Voila, the Sensivity increased into 0.9141 and even the Accuracy is decreased into 0.7361, it still a reasonable result. Therefore, we will going to use the Initial model, but use 0.4 as the threshold.
Ultimate Test
For the final test, let us test our model into the Evaluation Data.
eval_customer <- eval_customer %>%
mutate(Churn = ifelse(Churn == "Yes",1,0))
eval_customer$Pred.Risk <- predict(model_cust1, eval_customer, type = "response")
confusionMatrix(data=as.factor(as.numeric(eval_customer$Pred.Risk>=0.4)), reference=as.factor(eval_customer$Churn), dnn = c("Pred","Act"), positive="1")## Confusion Matrix and Statistics
##
## Act
## Pred 0 1
## 0 361 27
## 1 295 283
##
## Accuracy : 0.6667
## 95% CI : (0.6359, 0.6964)
## No Information Rate : 0.6791
## P-Value [Acc > NIR] : 0.8058
##
## Kappa : 0.3772
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.9129
## Specificity : 0.5503
## Pos Pred Value : 0.4896
## Neg Pred Value : 0.9304
## Prevalence : 0.3209
## Detection Rate : 0.2930
## Detection Prevalence : 0.5983
## Balanced Accuracy : 0.7316
##
## 'Positive' Class : 1
##
The sensitivity reach the number that similar with test data. Unfortunately the Accuracy quite dropped, but we will accept it since the evaluation data is randomly selected and inbalanced. In conclusion, the chosen Logistic Regression model is fitted to predict the Telco Customer Churn.