This report is intended to predicting Customer Churn in a Telecommunication Company. By using the historical data of their customer’s profile, we hope that we can get some insightful outputs, and predict which kind of customer that going to end their subscription, and determine what kind of treatments that expected going to retain them effectively. Therefore, we will going to need some deep interpretation on our analytics and select the most significant variables. Hopefully, this task will be successfully delivered by using Logistic Regression Model.

Data Exploration

First of all, let’s take a look closer to the Customer Churn data.

customer <- read.csv("Customer-Churn.csv")
master.customer <- customer
#customer <- master.customer

str(customer)

## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

summary(customer)

##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies
##  No                 :2810   No                 :2785  
##  No internet service:1526   No internet service:1526  
##  Yes                :2707   Yes                :2732  
##                                                       
##                                                       
##                                                       
##                                                       
##            Contract    PaperlessBilling                   PaymentMethod 
##  Month-to-month:3875   No :2872         Bank transfer (automatic):1544  
##  One year      :1473   Yes:4171         Credit card (automatic)  :1522  
##  Two year      :1695                    Electronic check         :2365  
##                                         Mailed check             :1612  
##                                                                         
##                                                                         
##                                                                         
##  MonthlyCharges    TotalCharges    Churn     
##  Min.   : 18.25   Min.   :  18.8   No :5174  
##  1st Qu.: 35.50   1st Qu.: 401.4   Yes:1869  
##  Median : 70.35   Median :1397.5             
##  Mean   : 64.76   Mean   :2283.3             
##  3rd Qu.: 89.85   3rd Qu.:3794.7             
##  Max.   :118.75   Max.   :8684.8             
##                   NA's   :11

Here is the brief explanation of all 21 variables on this data:
+ customerID - Customer ID
+ gender - Whether the customer is a male or a female
+ SeniorCitizen - Whether the customer is a senior citizen or not (1, 0)
+ Partner - Whether the customer has a partner or not (Yes, No)
+ Dependents - Whether the customer has dependents or not (Yes, No)
+ tenure - Number of months the customer has stayed with the company
+ PhoneService - Whether the customer has a phone service or not (Yes, No)
+ MultipleLines - Whether the customer has multiple lines or not (Yes, No, No phone service)
+ InternetService - Customer’s internet service provider (DSL, Fiber optic, No)
+ OnlineSecurity - Whether the customer has online security or not (Yes, No, No internet service)
+ OnlineBackup - Whether the customer has online backup or not (Yes, No, No internet service)
+ DeviceProtection - Whether the customer has device protection or not (Yes, No, No internet service)
+ TechSupport - Whether the customer has tech support or not (Yes, No, No internet service)
+ StreamingTV - Whether the customer has streaming TV or not (Yes, No, No internet service)
+ StreamingMovies - Whether the customer has streaming movies or not (Yes, No, No internet service)
+ Contract - The contract term of the customer (Month-to-month, One year, Two year)
+ PaperlessBilling - Whether the customer has paperless billing or not (Yes, No)
+ PaymentMethod - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
+ MonthlyCharges - The amount charged to the customer monthly
+ TotalCharges - The total amount charged to the customer
+ Churn - Whether the customer churned or not (Yes or No)

As we can see on the str function, the customerID won’t give us much information since it was a unique variable. So we will going to take it out. And then, we can see from the summary function we can see that we will need to modify the SeniorCitizen because it only consist of two value, and should be represented as categorical variable. Another finding, there are some NA’s value at TotalCharges. We will going to modify them a little bit:

customer <- customer %>% 
                  select(-customerID) %>% 
                  mutate(SeniorCitizen = as.factor(ifelse(SeniorCitizen == 1,"Yes","No"))) %>% 
                         drop_na() # modify the data

A quick check whether still NA’s values.

anyNA(customer)

## [1] FALSE

Since we are going to predict Customer Churn, the variable which going to be the target is Churn. Let’s see the proportion between churned customer and not churned.

table(customer$Churn)

## 
##   No  Yes 
## 5163 1869

prop.table(table(customer$Churn))

## 
##       No      Yes 
## 0.734215 0.265785

Unfortunately, the data proportion is heavily dominated by the non-churner -a common practice actually, but it will need some balancing on the next process.

Feature Selection on Business Wise

Gender Bias

Based on the Company’s Policy, there should be no discrimination among customers based on their gender. Therefore, the gender variable should be eliminated in the analysis. But as a data scientist, we should take a look to the data, just to ensure it won’t going to give us trouble on next process.

prop.table(table(customer$Churn, customer$gender))

##      
##          Female      Male
##   No  0.3617747 0.3724403
##   Yes 0.1335324 0.1322526

Based on the proportion table, clearly we can see that there is no difference between Male and Female on their decision to be a churner or no. So it will not going to affect the prediction process if we decide to take them out:

customer <- customer %>% 
                        select(-gender)

Phone Service

If we take a look closer at the summary of customer dataset, we can see that there is strong collinearity between PhoneService and MultipleLines. The number of customers who don’t have PhoneService are precisely reflected on the MultipleLine, which categorized as “No phone service”. This may create singularity during modeling. Therefore, we are going to take out the customer who dont have PhoneService (categorized as “No”), and automatically this variable won’t be relevant and can be eliminated:

customer <- customer %>% 
                        filter(PhoneService != "No") %>% 
                        select(-PhoneService) %>% 
                        droplevels()

Internet Service

Similar finding with the PhoneService, the InternetService also giving the same behavior. There are 6 Variable with “No internet service” category which have the same numbers with customer who didn’t have InternetService. Clearly, we are going to take them out. But this time we will keep the InternetService variable since there are still two category left, and can be compared:

customer <- customer %>% 
                        filter(InternetService != "No") %>% 
                        select(-InternetService) %>% 
                        droplevels()

Cross Validation

1st Split

Before going to take another step on modelling, let us split the data for modelling (80%) and evaluation (20%) purpose:

set.seed(289)
index <- sample(nrow(customer), nrow(customer)*0.2)
eval_customer <- customer[index, ]
model_customer <- customer[-index, ]

prop.table(table(eval_customer$Churn))

## 
##       No      Yes 
## 0.679089 0.320911

prop.table(table(model_customer$Churn))

## 
##        No       Yes 
## 0.6699431 0.3300569

dim(eval_customer)

## [1] 966  17

dim(model_customer)

## [1] 3866   17

The proportion of churner and non-churner seems changed from the previous calculation, because we eliminate some observations during the Feature Selection. But still, it is not balanced.

2nd Split

Let us split the modelling data into train set (70%) and test set (30%).

set.seed(168)
index1 <- sample(nrow(model_customer), nrow(model_customer)*0.7)
train_customer <- model_customer[index1, ]
test_customer <- model_customer[-index1, ]

prop.table(table(train_customer$Churn))

## 
##        No       Yes 
## 0.6747967 0.3252033

prop.table(table(test_customer$Churn))

## 
##        No       Yes 
## 0.6586207 0.3413793

dim(train_customer)

## [1] 2706   17

dim(test_customer)

## [1] 1160   17

The proportion seems similar with the evaluation set, let’s proceed to the next step.

Pra Processing Data

It was clearly that we need to balance the proportion of churner and non-churner, and we will balance it on train and test set.

train_customer.new <-  downSample(x = train_customer[ , -17], 
                                  y = train_customer$Churn, 
                                  yname = "Churn")

test_customer.new <-  downSample(x = test_customer[ , -17], 
                                 y = test_customer$Churn, 
                                 yname = "Churn")

prop.table(table(train_customer.new$Churn))

## 
##  No Yes 
## 0.5 0.5

prop.table(table(test_customer.new$Churn))

## 
##  No Yes 
## 0.5 0.5

Good, both of them now are perfectly balanced. Let’s go to the modelling part.

Modelling

First of all, since we are going to use Logistic Regression, let’s change the category in Churn variable from “Yes” into “1” and “No into”0" for the sake of simplicity.

train_customer.new <- train_customer.new %>% 
                                            mutate(Churn = ifelse(Churn == "Yes",1,0))

test_customer.new <- test_customer.new %>% 
                                          mutate(Churn = ifelse(Churn == "Yes",1,0))

First Modelling

model_cust1 <- glm(formula = Churn~., data = train_customer.new, family = "binomial")
summary(model_cust1)

## 
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = train_customer.new)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0159  -0.8780   0.1714   0.8407   2.7981  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                          -0.9838335  0.4530403  -2.172
## SeniorCitizenYes                      0.1454629  0.1443824   1.007
## PartnerYes                            0.0832949  0.1346886   0.618
## DependentsYes                        -0.1866081  0.1547590  -1.206
## tenure                               -0.0583239  0.0136871  -4.261
## MultipleLinesYes                      0.1901313  0.1353745   1.404
## OnlineSecurityYes                    -0.4937652  0.1335909  -3.696
## OnlineBackupYes                      -0.3290763  0.1324216  -2.485
## DeviceProtectionYes                  -0.1534875  0.1356088  -1.132
## TechSupportYes                       -0.3858567  0.1372921  -2.810
## StreamingTVYes                       -0.1166251  0.1527883  -0.763
## StreamingMoviesYes                   -0.3130408  0.1530001  -2.046
## ContractOne year                     -0.6390219  0.1803937  -3.542
## ContractTwo year                     -1.5839567  0.2897794  -5.466
## PaperlessBillingYes                   0.4046841  0.1316211   3.075
## PaymentMethodCredit card (automatic)  0.1562398  0.1851553   0.844
## PaymentMethodElectronic check         0.3938941  0.1564075   2.518
## PaymentMethodMailed check             0.3878645  0.2083423   1.862
## MonthlyCharges                        0.0239912  0.0065636   3.655
## TotalCharges                          0.0003766  0.0001495   2.519
##                                         Pr(>|z|)    
## (Intercept)                             0.029884 *  
## SeniorCitizenYes                        0.313703    
## PartnerYes                              0.536294    
## DependentsYes                           0.227895    
## tenure                               0.000020330 ***
## MultipleLinesYes                        0.160175    
## OnlineSecurityYes                       0.000219 ***
## OnlineBackupYes                         0.012953 *  
## DeviceProtectionYes                     0.257702    
## TechSupportYes                          0.004947 ** 
## StreamingTVYes                          0.445278    
## StreamingMoviesYes                      0.040755 *  
## ContractOne year                        0.000397 ***
## ContractTwo year                     0.000000046 ***
## PaperlessBillingYes                     0.002108 ** 
## PaymentMethodCredit card (automatic)    0.398764    
## PaymentMethodElectronic check           0.011789 *  
## PaymentMethodMailed check               0.062650 .  
## MonthlyCharges                          0.000257 ***
## TotalCharges                            0.011768 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2439.9  on 1759  degrees of freedom
## Residual deviance: 1861.9  on 1740  degrees of freedom
## AIC: 1901.9
## 
## Number of Fisher Scoring iterations: 5

Interpretation:
- The most significant coefs are tenure, ContractOne year, ContractTwo year and TotalCharges. The significant variables are OnlineBackupYes and MonthlyCharges. As for the quite significant variables are MultipleLinesyes, OnlineSecurityYes, StreamingTVYes and PaymentMethodElectronic check.
- Let’s start from the numeric variables. As we can see, the increment of tenure will 0.92 (exp(-0.0784127)) times decreased the churners. Well this is valid since a customer will less-likely to leave when they spend more time as subcriber than the new one. We can spend more promotion (like free trials) to the new customers to retain them on longer experience.
- In contrary, the increment of MonthlyCharges and TotalCharges will increasing the churners, each for 1.02 (exp(0.0203037)) and 1.00 (exp(0.0005480)) times. It seems the customers more likely to leave when they have high charges. We can set some discount with particular terms and conditions to retain them. - Next, the Contract of One and Two year will going to decrease likelihood of churner. This is valid since customer who attached with long-term contract will going to stay rather than the month-to-month basis customer. We should have promotion campaign for monthly basis customer.
- We also could see that customers who have extra service such as MultipleLines, OnlineSecurity, StreamingTV, and OnlineBackup will tend to be a non-churner rather than the one who only after for the standard services.
- Customer who pay with Electronic check is tend to be a churner rather than the other method.

Evaluate the Model

Let’s evaluate the model to check whether it was fitted or no. Let’s use the 0.5 threshold for initial test.

test_customer.new$Pred.Risk <- predict(model_cust1, test_customer.new, type = "response")

confusionMatrix(data=as.factor(as.numeric(test_customer.new$Pred.Risk>=0.5)), reference=as.factor(test_customer.new$Churn), dnn = c("Pred","Act"), positive="1")

## Confusion Matrix and Statistics
## 
##     Act
## Pred   0   1
##    0 272  63
##    1 124 333
##                                                
##                Accuracy : 0.7639               
##                  95% CI : (0.7327, 0.7931)     
##     No Information Rate : 0.5                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5278               
##                                                
##  Mcnemar's Test P-Value : 0.00001146           
##                                                
##             Sensitivity : 0.8409               
##             Specificity : 0.6869               
##          Pos Pred Value : 0.7287               
##          Neg Pred Value : 0.8119               
##              Prevalence : 0.5000               
##          Detection Rate : 0.4205               
##    Detection Prevalence : 0.5770               
##       Balanced Accuracy : 0.7639               
##                                                
##        'Positive' Class : 1                    
##

The Accuracy of the initial model is 0.7626, not bad, but it seems we can make it better. And since our goal is to decrease the chance of churner who predicted as non-churner, we will seek for a higher Sensitivity (initial model scores 0.8359)

Improve the model

Stepwise Method

test_step <- step(model_cust1)

## Start:  AIC=1901.89
## Churn ~ SeniorCitizen + Partner + Dependents + tenure + MultipleLines + 
##     OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## - Partner           1   1862.3 1900.3
## - StreamingTV       1   1862.5 1900.5
## - SeniorCitizen     1   1862.9 1900.9
## - DeviceProtection  1   1863.2 1901.2
## - Dependents        1   1863.3 1901.3
## - MultipleLines     1   1863.9 1901.9
## <none>                  1861.9 1901.9
## - PaymentMethod     3   1869.3 1903.3
## - StreamingMovies   1   1866.1 1904.1
## - OnlineBackup      1   1868.0 1906.0
## - TotalCharges      1   1868.3 1906.3
## - TechSupport       1   1869.7 1907.7
## - PaperlessBilling  1   1871.3 1909.3
## - MonthlyCharges    1   1875.1 1913.1
## - OnlineSecurity    1   1875.4 1913.4
## - tenure            1   1880.8 1918.8
## - Contract          2   1897.3 1933.3
## 
## Step:  AIC=1900.27
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines + 
##     OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## - StreamingTV       1   1862.9 1898.9
## - Dependents        1   1863.4 1899.4
## - SeniorCitizen     1   1863.5 1899.5
## - DeviceProtection  1   1863.5 1899.5
## <none>                  1862.3 1900.3
## - MultipleLines     1   1864.3 1900.3
## - PaymentMethod     3   1869.6 1901.6
## - StreamingMovies   1   1866.5 1902.5
## - OnlineBackup      1   1868.5 1904.5
## - TotalCharges      1   1868.7 1904.7
## - TechSupport       1   1870.1 1906.1
## - PaperlessBilling  1   1871.7 1907.7
## - OnlineSecurity    1   1875.8 1911.8
## - MonthlyCharges    1   1875.8 1911.8
## - tenure            1   1880.9 1916.9
## - Contract          2   1898.0 1932.0
## 
## Step:  AIC=1898.87
## Churn ~ SeniorCitizen + Dependents + tenure + MultipleLines + 
##     OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## - Dependents        1   1864.0 1898.0
## - DeviceProtection  1   1864.0 1898.0
## - SeniorCitizen     1   1864.2 1898.2
## <none>                  1862.9 1898.9
## - MultipleLines     1   1865.2 1899.2
## - PaymentMethod     3   1870.1 1900.1
## - StreamingMovies   1   1867.2 1901.2
## - OnlineBackup      1   1868.8 1902.8
## - TotalCharges      1   1869.0 1903.0
## - TechSupport       1   1870.8 1904.8
## - PaperlessBilling  1   1872.1 1906.1
## - OnlineSecurity    1   1876.0 1910.0
## - MonthlyCharges    1   1876.2 1910.2
## - tenure            1   1881.2 1915.2
## - Contract          2   1899.7 1931.7
## 
## Step:  AIC=1897.97
## Churn ~ SeniorCitizen + tenure + MultipleLines + OnlineSecurity + 
##     OnlineBackup + DeviceProtection + TechSupport + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + 
##     TotalCharges
## 
##                    Df Deviance    AIC
## - DeviceProtection  1   1865.2 1897.2
## - SeniorCitizen     1   1865.7 1897.7
## <none>                  1864.0 1898.0
## - MultipleLines     1   1866.3 1898.3
## - PaymentMethod     3   1871.2 1899.2
## - StreamingMovies   1   1868.3 1900.3
## - OnlineBackup      1   1869.9 1901.9
## - TotalCharges      1   1870.5 1902.5
## - TechSupport       1   1871.9 1903.9
## - PaperlessBilling  1   1873.5 1905.5
## - OnlineSecurity    1   1877.2 1909.2
## - MonthlyCharges    1   1877.2 1909.2
## - tenure            1   1883.1 1915.1
## - Contract          2   1901.6 1931.6
## 
## Step:  AIC=1897.2
## Churn ~ SeniorCitizen + tenure + MultipleLines + OnlineSecurity + 
##     OnlineBackup + TechSupport + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## - SeniorCitizen     1   1867.0 1897.0
## <none>                  1865.2 1897.2
## - MultipleLines     1   1868.0 1898.0
## - PaymentMethod     3   1872.4 1898.4
## - StreamingMovies   1   1869.3 1899.3
## - OnlineBackup      1   1871.0 1901.0
## - TotalCharges      1   1871.5 1901.5
## - TechSupport       1   1873.3 1903.3
## - PaperlessBilling  1   1875.0 1905.0
## - MonthlyCharges    1   1877.3 1907.3
## - OnlineSecurity    1   1878.2 1908.2
## - tenure            1   1884.3 1914.3
## - Contract          2   1905.9 1933.9
## 
## Step:  AIC=1897.05
## Churn ~ tenure + MultipleLines + OnlineSecurity + OnlineBackup + 
##     TechSupport + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges
## 
##                    Df Deviance    AIC
## <none>                  1867.0 1897.0
## - MultipleLines     1   1870.1 1898.1
## - PaymentMethod     3   1874.5 1898.5
## - StreamingMovies   1   1871.1 1899.1
## - OnlineBackup      1   1873.2 1901.2
## - TotalCharges      1   1873.4 1901.4
## - TechSupport       1   1876.3 1904.3
## - PaperlessBilling  1   1877.4 1905.4
## - MonthlyCharges    1   1879.8 1907.8
## - OnlineSecurity    1   1880.8 1908.8
## - tenure            1   1886.0 1914.0
## - Contract          2   1909.9 1935.9

summary(test_step)

## 
## Call:
## glm(formula = Churn ~ tenure + MultipleLines + OnlineSecurity + 
##     OnlineBackup + TechSupport + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges, 
##     family = "binomial", data = train_customer.new)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0476  -0.8926   0.1831   0.8404   2.8188  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                          -0.8528966  0.4302405  -1.982
## tenure                               -0.0578205  0.0135470  -4.268
## MultipleLinesYes                      0.2314720  0.1334928   1.734
## OnlineSecurityYes                    -0.4938240  0.1325385  -3.726
## OnlineBackupYes                      -0.3267657  0.1315175  -2.485
## TechSupportYes                       -0.4152916  0.1360380  -3.053
## StreamingMoviesYes                   -0.3069506  0.1521664  -2.017
## ContractOne year                     -0.7031985  0.1776107  -3.959
## ContractTwo year                     -1.6909384  0.2853521  -5.926
## PaperlessBillingYes                   0.4199034  0.1307430   3.212
## PaymentMethodCredit card (automatic)  0.1567953  0.1845085   0.850
## PaymentMethodElectronic check         0.3993476  0.1560503   2.559
## PaymentMethodMailed check             0.3617219  0.2073220   1.745
## MonthlyCharges                        0.0212018  0.0059136   3.585
## TotalCharges                          0.0003727  0.0001486   2.507
##                                           Pr(>|z|)    
## (Intercept)                               0.047438 *  
## tenure                               0.00001970997 ***
## MultipleLinesYes                          0.082924 .  
## OnlineSecurityYes                         0.000195 ***
## OnlineBackupYes                           0.012970 *  
## TechSupportYes                            0.002267 ** 
## StreamingMoviesYes                        0.043674 *  
## ContractOne year                     0.00007519748 ***
## ContractTwo year                     0.00000000311 ***
## PaperlessBillingYes                       0.001320 ** 
## PaymentMethodCredit card (automatic)      0.395436    
## PaymentMethodElectronic check             0.010494 *  
## PaymentMethodMailed check                 0.081031 .  
## MonthlyCharges                            0.000337 ***
## TotalCharges                              0.012174 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2439.9  on 1759  degrees of freedom
## Residual deviance: 1867.0  on 1745  degrees of freedom
## AIC: 1897
## 
## Number of Fisher Scoring iterations: 5

test_pred <- predict(test_step, test_customer.new, type = "response")
confusionMatrix(data=as.factor(as.numeric(test_pred>=0.5)), reference=as.factor(test_customer.new$Churn), dnn = c("Pred","Act"), positive="1")

## Confusion Matrix and Statistics
## 
##     Act
## Pred   0   1
##    0 269  64
##    1 127 332
##                                                
##                Accuracy : 0.7588               
##                  95% CI : (0.7275, 0.7883)     
##     No Information Rate : 0.5                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5177               
##                                                
##  Mcnemar's Test P-Value : 0.000007252          
##                                                
##             Sensitivity : 0.8384               
##             Specificity : 0.6793               
##          Pos Pred Value : 0.7233               
##          Neg Pred Value : 0.8078               
##              Prevalence : 0.5000               
##          Detection Rate : 0.4192               
##    Detection Prevalence : 0.5795               
##       Balanced Accuracy : 0.7588               
##                                                
##        'Positive' Class : 1                    
##

By using stepwise, some of the variables are eliminated, such as: DeviceProtection, Partner, Dependents and SeniorCitizen.
Unfortunately, the Accuracy doesn’t increase and even worst, the Specivity is dropped into 0.8333.
We will not going to use the stepwise to improve the model.

Changing Threshold

Let’s try to decrease the threshold to get a better Sensitivity, according to our goal.

confusionMatrix(data=as.factor(as.numeric(test_customer.new$Pred.Risk>=0.4)), reference=as.factor(test_customer.new$Churn), dnn = c("Pred","Act"), positive="1")

## Confusion Matrix and Statistics
## 
##     Act
## Pred   0   1
##    0 232  34
##    1 164 362
##                                                
##                Accuracy : 0.75                 
##                  95% CI : (0.7183, 0.7798)     
##     No Information Rate : 0.5                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5                  
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.9141               
##             Specificity : 0.5859               
##          Pos Pred Value : 0.6882               
##          Neg Pred Value : 0.8722               
##              Prevalence : 0.5000               
##          Detection Rate : 0.4571               
##    Detection Prevalence : 0.6641               
##       Balanced Accuracy : 0.7500               
##                                                
##        'Positive' Class : 1                    
##

Voila, the Sensivity increased into 0.9141 and even the Accuracy is decreased into 0.7361, it still a reasonable result. Therefore, we will going to use the Initial model, but use 0.4 as the threshold.

Ultimate Test

For the final test, let us test our model into the Evaluation Data.

eval_customer <- eval_customer %>% 
                                            mutate(Churn = ifelse(Churn == "Yes",1,0))

eval_customer$Pred.Risk <- predict(model_cust1, eval_customer, type = "response")

confusionMatrix(data=as.factor(as.numeric(eval_customer$Pred.Risk>=0.4)), reference=as.factor(eval_customer$Churn), dnn = c("Pred","Act"), positive="1")

## Confusion Matrix and Statistics
## 
##     Act
## Pred   0   1
##    0 361  27
##    1 295 283
##                                              
##                Accuracy : 0.6667             
##                  95% CI : (0.6359, 0.6964)   
##     No Information Rate : 0.6791             
##     P-Value [Acc > NIR] : 0.8058             
##                                              
##                   Kappa : 0.3772             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.9129             
##             Specificity : 0.5503             
##          Pos Pred Value : 0.4896             
##          Neg Pred Value : 0.9304             
##              Prevalence : 0.3209             
##          Detection Rate : 0.2930             
##    Detection Prevalence : 0.5983             
##       Balanced Accuracy : 0.7316             
##                                              
##        'Positive' Class : 1                  
##

The sensitivity reach the number that similar with test data. Unfortunately the Accuracy quite dropped, but we will accept it since the evaluation data is randomly selected and inbalanced. In conclusion, the chosen Logistic Regression model is fitted to predict the Telco Customer Churn.

Predicting Telco Customer Churn

Arya Andhika

2019-08-21