Preface

Logistic Regression

Why do we use logistic regression?

Logistic Regression is a classification algorithm which is used to predict a categorical variable (as an output) (Yes/No) based on a set of independent variable(s).

Logistic Regression is used to explain the relationship between the categorical dependent variable (target variable) and one or more independent variables. When the dependent variable is dichotomous, we use binary logistic regression.

Main Objective: Customer Churn Prediction

Customer churn is the percentage of customers that stopped using your company’s product or service during a certain time frame. Customer churn prediction is essential to evaluate a business strategy, and of course to avoid more customer churn.

In this section we’re going to predict a customer churn using 2 models which are logistic regression and K-NN model.

Data source: Telco Customer Churn



Library Preparation

# libraries
library(readr)
library(tidyverse)
library(rsample)
library(caret)
library(nortest)
library(class)
library(psych)
library(performance)

Read Data and Data Information

Note: We can put an argument stringsAsFactors = T directly to read.csv() function for efficiency

churn <- read.csv("Kaggle DataSet/Telco Churn/WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = T)
churn

Data Information

  1. customerID : Customer ID
  2. gender : Whether the customer is a male or a female
  3. SeniorCitizen : Whether the customer is a senior citizen or not
  4. Partner : Whether the customer has a partner or not
  5. Dependents : Whether the customer has dependents or not
  6. tenure : Number of months the customer has stayed with the company
  7. PhoneService : Whether the customer has a phone service or not
  8. MultipleLines : Whether the customer has multiple lines or not
  9. InternetService : Customer’s internet service provider
  10. OnlineSecurity : Whether the customer has online security or not
  11. OnlineBackup : Whether the customer has online backup or not
  12. DeviceProtection: Whether the customer has device protection or not
  13. TechSupport : Whether the customer has tech support or not
  14. StreamingTV : Whether the customer has streaming TV or not
  15. StreamingMovies : Whether the customer has streaming movies or not
  16. Contract : The contract term of the customer
  17. PaperlessBilling: Whether the customer has paperless billing or not
  18. PaymentMethod :The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  19. MonthlyCharges : The amount charged to the customer monthly
  20. TotalCharges : The total amount charged to the customer
  21. Churn : Whether the customer churned or not


Data Wrangling

Now let’s see the summary of our data

summary(churn)
#>       customerID      gender     SeniorCitizen    Partner    Dependents
#>  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
#>  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
#>  0004-TLHLJ:   1                 Median :0.0000                        
#>  0011-IGKFF:   1                 Mean   :0.1621                        
#>  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
#>  0013-MHZWF:   1                 Max.   :1.0000                        
#>  (Other)   :7037                                                       
#>      tenure      PhoneService          MultipleLines     InternetService
#>  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
#>  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
#>  Median :29.00                Yes             :2971   No         :1526  
#>  Mean   :32.37                                                          
#>  3rd Qu.:55.00                                                          
#>  Max.   :72.00                                                          
#>                                                                         
#>              OnlineSecurity              OnlineBackup 
#>  No                 :3498   No                 :3088  
#>  No internet service:1526   No internet service:1526  
#>  Yes                :2019   Yes                :2429  
#>                                                       
#>                                                       
#>                                                       
#>                                                       
#>             DeviceProtection              TechSupport  
#>  No                 :3095    No                 :3473  
#>  No internet service:1526    No internet service:1526  
#>  Yes                :2422    Yes                :2044  
#>                                                        
#>                                                        
#>                                                        
#>                                                        
#>               StreamingTV              StreamingMovies           Contract   
#>  No                 :2810   No                 :2785   Month-to-month:3875  
#>  No internet service:1526   No internet service:1526   One year      :1473  
#>  Yes                :2707   Yes                :2732   Two year      :1695  
#>                                                                             
#>                                                                             
#>                                                                             
#>                                                                             
#>  PaperlessBilling                   PaymentMethod  MonthlyCharges  
#>  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
#>  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
#>                   Electronic check         :2365   Median : 70.35  
#>                   Mailed check             :1612   Mean   : 64.76  
#>                                                    3rd Qu.: 89.85  
#>                                                    Max.   :118.75  
#>                                                                    
#>   TotalCharges    Churn     
#>  Min.   :  18.8   No :5174  
#>  1st Qu.: 401.4   Yes:1869  
#>  Median :1397.5             
#>  Mean   :2283.3             
#>  3rd Qu.:3794.7             
#>  Max.   :8684.8             
#>  NA's   :11

Highlight Our data has 11 missing value in Total Charges variable.

Handling NA value will be performed in further processing

Checking data structure

str(churn)
#> 'data.frame':    7043 obs. of  21 variables:
#>  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
#>  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
#>  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
#>  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
#>  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
#>  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
#>  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
#>  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
#>  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
#>  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
#>  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
#>  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
#>  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
#>  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
#>  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
#>  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
#>  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
#>  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
#>  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
#>  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

There are some columns which have redundant information in variable (factor) levels:

  • No internet service level is in column : OnlineSecurity, OnlineBackup, DeviceProtection, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
  • we can interpret No internet service as No, therefore to simplify variable levels, we’re going to change No internet service -> No
  • Remove all No internet service level

Adjusting Variable’s Level

# Change the No internet service -> No
churn[churn$MultipleLines == "No phone service", "MultipleLines"] <- "No"
churn[churn$OnlineSecurity == "No internet service", "OnlineSecurity"] <- "No"
churn[churn$OnlineBackup == "No internet service", "OnlineBackup"] <- "No"
churn[churn$DeviceProtection == "No internet service", "DeviceProtection"] <- "No"
churn[churn$TechSupport == "No internet service", "TechSupport"] <- "No"
churn[churn$StreamingTV == "No internet service", "StreamingTV"] <- "No"
churn[churn$StreamingMovies == "No internet service", "StreamingMovies"] <- "No"

# Remove all No internet service level
levels(churn$MultipleLines)[levels(churn$MultipleLines) == 'No phone service'] <- "No"
levels(churn$OnlineSecurity)[levels(churn$OnlineSecurity) == 'No internet service'] <- "No"
levels(churn$OnlineBackup)[levels(churn$OnlineBackup) == 'No internet service'] <- "No"
levels(churn$DeviceProtection)[levels(churn$DeviceProtection) == 'No internet service'] <- "No"
levels(churn$TechSupport)[levels(churn$TechSupport) == 'No internet service'] <- "No"
levels(churn$StreamingTV)[levels(churn$StreamingTV  ) == 'No internet service'] <- "No"
levels(churn$StreamingMovies)[levels(churn$StreamingMovies  ) == 'No internet service'] <- "No"

Recheck our data structure

str(churn)
#> 'data.frame':    7043 obs. of  21 variables:
#>  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
#>  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
#>  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
#>  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
#>  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
#>  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
#>  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#>  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
#>  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
#>  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
#>  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
#>  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
#>  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#>  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
#>  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
#>  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
#>  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
#>  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
#>  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
#>  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Now our data has more meaningful and more direct grouping level

Checking Duplicate

any(duplicated(churn[, -1]))
#> [1] TRUE

Our data contained duplicated data

Note: We excluded ID from checking duplicate value, so that duplicated() function checking our main data only. Supposedly our data contained different value to each customer(Id) in columns: MonthlyCharges & TotalCharges, so when the two column have identical information, we can assumed that they are indeed duplicated

churn[duplicated(churn[, -1]),]

Handling Duplicated and NA Value

as we see from above summary, our data contain duplicated data and NA value. We’re going to drop all duplicated and NA, considering the total number of NA and duplicated data are not significant.

Note: In this process I’d like to drop customerID column (irrelevant to prediction) and also change SeniorCitizen column to factor

churn <- churn %>% 
  select(-customerID) %>% 
  mutate(SeniorCitizen = as.factor(SeniorCitizen)) %>% 
  na.omit()

churn <- churn[!duplicated(churn),]
summary(churn)
#>     gender     SeniorCitizen Partner    Dependents     tenure      PhoneService
#>  Female:3475   0:5869        No :3617   No :4911   Min.   : 1.00   No : 680    
#>  Male  :3535   1:1141        Yes:3393   Yes:2099   1st Qu.: 9.00   Yes:6330    
#>                                                    Median :29.00               
#>                                                    Mean   :32.52               
#>                                                    3rd Qu.:56.00               
#>                                                    Max.   :72.00               
#>  MultipleLines    InternetService OnlineSecurity OnlineBackup DeviceProtection
#>  No :4043      DSL        :2414   No :4995       No :4585     No :4592        
#>  Yes:2967      Fiber optic:3090   Yes:2015       Yes:2425     Yes:2418        
#>                No         :1506                                               
#>                                                                               
#>                                                                               
#>                                                                               
#>  TechSupport StreamingTV StreamingMovies           Contract    PaperlessBilling
#>  No :4970    No :4307    No :4279        Month-to-month:3853   No :2852        
#>  Yes:2040    Yes:2703    Yes:2731        One year      :1472   Yes:4158        
#>                                          Two year      :1685                   
#>                                                                                
#>                                                                                
#>                                                                                
#>                    PaymentMethod  MonthlyCharges    TotalCharges    Churn     
#>  Bank transfer (automatic):1542   Min.   : 18.25   Min.   :  18.8   No :5153  
#>  Credit card (automatic)  :1521   1st Qu.: 35.75   1st Qu.: 408.3   Yes:1857  
#>  Electronic check         :2359   Median : 70.40   Median :1403.9             
#>  Mailed check             :1588   Mean   : 64.89   Mean   :2290.4             
#>                                   3rd Qu.: 89.90   3rd Qu.:3807.8             
#>                                   Max.   :118.75   Max.   :8684.8

Checking outliers

par(mfrow=c(1,3))
boxplot(churn$tenure, main = "Tenure")
boxplot(churn$MonthlyCharges, main = "MonthlyCharges")
boxplot(churn$TotalCharges, main = "TotalCharges" )

Highlights Our numerical variable doesn’t contain otliers



Logistic Regression Model

churn

Splitting data into Data Train and Data Test

Before creating a model, we’re going to split our data into data train and data test.

Data train is data sampling from the main data (churn data frame) which will be used to train our machine learning.

Data test is data sampling from the main data (churn data frame), which will be used to test our prediction (unseen data from modelling).

Splitting Proportion

  • Data Train : 80%
  • Data Test : 20%
  • Target variable: Churn
RNGkind(sample.kind = "Rounding")
set.seed(101) 

#sampling data
row.sample <- initial_split(data=churn,
                       prop = 0.8, 
                       strata = Churn) 

# Split data into data train and data test
churn_train <- training(row.sample)
churn_test <- testing(row.sample)

Checking Target Variable Distribution

prop.table(table(churn_train$Churn))
#> 
#>        No       Yes 
#> 0.7351525 0.2648475

Checking target variable in data train is essential after splitting our data. We want our model to learn the pattern equally for both target variable.

As we can see that our data train has more ‘No’ value than ‘Yes’ class, therefore I’m going to apply down sampling method to balance the target variable proportion in the data train.

Perform downsampling

RNGkind(sample.kind = "Rounding")
set.seed(101)

churn_train_down <- downSample(x = churn_train %>% select(-Churn),
                         y = churn_train$Churn,
                         yname = "Churn")
prop.table(table(churn_train_down$Churn))
#> 
#>  No Yes 
#> 0.5 0.5

Now our data train has balance target variable, we can now prepare our model.

Model Building

1. Logistic Regression Model

For efficiency we’re going to create a model with all variable

model.glm.all <- glm(formula = Churn ~ .,
                 data = churn_train_down,
                 family = "binomial")

summary(model.glm.all)
#> 
#> Call:
#> glm(formula = Churn ~ ., family = "binomial", data = churn_train_down)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3540  -0.7160   0.1118   0.7254   2.9098  
#> 
#> Coefficients:
#>                                        Estimate Std. Error z value
#> (Intercept)                           3.7699084  1.1890968   3.170
#> genderMale                            0.0008891  0.0940060   0.009
#> SeniorCitizen1                        0.3217475  0.1293025   2.488
#> PartnerYes                            0.0110704  0.1137114   0.097
#> DependentsYes                        -0.1816949  0.1263558  -1.438
#> tenure                               -0.0580430  0.0078784  -7.367
#> PhoneServiceYes                       1.3154225  0.9342042   1.408
#> MultipleLinesYes                      0.7639486  0.2589283   2.950
#> InternetServiceFiber optic            3.3090974  1.1576471   2.858
#> InternetServiceNo                    -3.1941767  1.1677907  -2.735
#> OnlineSecurityYes                     0.1878893  0.2589138   0.726
#> OnlineBackupYes                       0.2273537  0.2543363   0.894
#> DeviceProtectionYes                   0.4546489  0.2581146   1.761
#> TechSupportYes                        0.1078538  0.2581989   0.418
#> StreamingTVYes                        1.2415122  0.4707330   2.637
#> StreamingMoviesYes                    1.2959644  0.4723263   2.744
#> ContractOne year                     -0.6622757  0.1453015  -4.558
#> ContractTwo year                     -1.6071267  0.2291407  -7.014
#> PaperlessBillingYes                   0.1891272  0.1059667   1.785
#> PaymentMethodCredit card (automatic) -0.0848362  0.1573657  -0.539
#> PaymentMethodElectronic check         0.3407592  0.1363990   2.498
#> PaymentMethodMailed check            -0.3109473  0.1600670  -1.943
#> MonthlyCharges                       -0.1008175  0.0460016  -2.192
#> TotalCharges                          0.0003170  0.0000922   3.438
#>                                               Pr(>|z|)    
#> (Intercept)                                   0.001522 ** 
#> genderMale                                    0.992454    
#> SeniorCitizen1                                0.012834 *  
#> PartnerYes                                    0.922444    
#> DependentsYes                                 0.150445    
#> tenure                               0.000000000000174 ***
#> PhoneServiceYes                               0.159111    
#> MultipleLinesYes                              0.003173 ** 
#> InternetServiceFiber optic                    0.004257 ** 
#> InternetServiceNo                             0.006234 ** 
#> OnlineSecurityYes                             0.468033    
#> OnlineBackupYes                               0.371370    
#> DeviceProtectionYes                           0.078167 .  
#> TechSupportYes                                0.676155    
#> StreamingTVYes                                0.008354 ** 
#> StreamingMoviesYes                            0.006073 ** 
#> ContractOne year                     0.000005165774522 ***
#> ContractTwo year                     0.000000000002321 ***
#> PaperlessBillingYes                           0.074297 .  
#> PaymentMethodCredit card (automatic)          0.589816    
#> PaymentMethodElectronic check                 0.012481 *  
#> PaymentMethodMailed check                     0.052064 .  
#> MonthlyCharges                                0.028408 *  
#> TotalCharges                                  0.000585 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 4117.3  on 2969  degrees of freedom
#> Residual deviance: 2779.5  on 2946  degrees of freedom
#> AIC: 2827.5
#> 
#> Number of Fisher Scoring iterations: 6

Logistic Regression Model Output Interpretation:

  1. When we include all variable as predictors, we can see that there are some variables that don’t have significance correlation to our target variable
  2. Residual deviance is 2779.5, we will compare to other model which will has lower residual deviance number

2. Step-wise Regression

As we can see from previous model, there are some variable that don’t have significance correlation to the target variable. We will try to use step-wise regression for feature selection

# Step-wise model (Backward)
model.step.bwd <- step(model.glm.all,
                  direction = "backward",
                  trace = F)
summary(model.step.bwd)
#> 
#> Call:
#> glm(formula = Churn ~ SeniorCitizen + Dependents + tenure + PhoneService + 
#>     MultipleLines + InternetService + DeviceProtection + StreamingTV + 
#>     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
#>     MonthlyCharges + TotalCharges, family = "binomial", data = churn_train_down)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3222  -0.7163   0.1143   0.7265   2.9351  
#> 
#> Coefficients:
#>                                         Estimate  Std. Error z value
#> (Intercept)                           2.96055331  0.45905807   6.449
#> SeniorCitizen1                        0.32493734  0.12860131   2.527
#> DependentsYes                        -0.17469692  0.11409329  -1.531
#> tenure                               -0.05751853  0.00781551  -7.360
#> PhoneServiceYes                       0.65700097  0.31614852   2.078
#> MultipleLinesYes                      0.59952553  0.13213733   4.537
#> InternetServiceFiber optic            2.49590173  0.32350517   7.715
#> InternetServiceNo                    -2.37917705  0.41859359  -5.684
#> DeviceProtectionYes                   0.28780612  0.13221166   2.177
#> StreamingTVYes                        0.91253593  0.17307548   5.272
#> StreamingMoviesYes                    0.96726840  0.17648317   5.481
#> ContractOne year                     -0.67344946  0.14443492  -4.663
#> ContractTwo year                     -1.62353712  0.22624622  -7.176
#> PaperlessBillingYes                   0.19296684  0.10564085   1.827
#> PaymentMethodCredit card (automatic) -0.08247750  0.15725161  -0.524
#> PaymentMethodElectronic check         0.34126955  0.13622324   2.505
#> PaymentMethodMailed check            -0.31398460  0.15987148  -1.964
#> MonthlyCharges                       -0.06811462  0.01298500  -5.246
#> TotalCharges                          0.00031851  0.00009192   3.465
#>                                                Pr(>|z|)    
#> (Intercept)                          0.0000000001124486 ***
#> SeniorCitizen1                                  0.01151 *  
#> DependentsYes                                   0.12573    
#> tenure                               0.0000000000001845 ***
#> PhoneServiceYes                                 0.03770 *  
#> MultipleLinesYes                     0.0000057022313052 ***
#> InternetServiceFiber optic           0.0000000000000121 ***
#> InternetServiceNo                    0.0000000131780691 ***
#> DeviceProtectionYes                             0.02949 *  
#> StreamingTVYes                       0.0000001345971066 ***
#> StreamingMoviesYes                   0.0000000423414440 ***
#> ContractOne year                     0.0000031216374031 ***
#> ContractTwo year                     0.0000000000007179 ***
#> PaperlessBillingYes                             0.06776 .  
#> PaymentMethodCredit card (automatic)            0.59994    
#> PaymentMethodElectronic check                   0.01224 *  
#> PaymentMethodMailed check                       0.04953 *  
#> MonthlyCharges                       0.0000001557418990 ***
#> TotalCharges                                    0.00053 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 4117.3  on 2969  degrees of freedom
#> Residual deviance: 2780.6  on 2951  degrees of freedom
#> AIC: 2818.6
#> 
#> Number of Fisher Scoring iterations: 6

Step wise Model (forward)

# Step-wise Model (forward)
model.glm.none <- glm(formula = Churn ~ 1,
                  data = churn_train_down,
                  family = "binomial")
summary(step(model.glm.none,
             direction = 'forward',
             scope = list(upper = model.glm.all),
             trace = F))
#> 
#> Call:
#> glm(formula = Churn ~ Contract + InternetService + tenure + PaymentMethod + 
#>     StreamingMovies + SeniorCitizen + TechSupport + StreamingTV + 
#>     MonthlyCharges + MultipleLines + TotalCharges + PaperlessBilling + 
#>     Dependents, family = "binomial", data = churn_train_down)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3105  -0.7198   0.1151   0.7212   2.9337  
#> 
#> Coefficients:
#>                                         Estimate  Std. Error z value
#> (Intercept)                           2.26689319  0.40282864   5.627
#> ContractOne year                     -0.66802097  0.14469099  -4.617
#> ContractTwo year                     -1.59907286  0.22840324  -7.001
#> InternetServiceFiber optic            1.81334528  0.25376504   7.146
#> InternetServiceNo                    -1.60115932  0.25067272  -6.387
#> tenure                               -0.05803751  0.00779286  -7.448
#> PaymentMethodCredit card (automatic) -0.08759279  0.15696219  -0.558
#> PaymentMethodElectronic check         0.34102017  0.13602633   2.507
#> PaymentMethodMailed check            -0.30544359  0.15964923  -1.913
#> StreamingMoviesYes                    0.70265221  0.14123215   4.975
#> SeniorCitizen1                        0.32262122  0.12863613   2.508
#> TechSupportYes                       -0.19498046  0.13031466  -1.496
#> StreamingTVYes                        0.64479558  0.13694876   4.708
#> MonthlyCharges                       -0.03934125  0.00790485  -4.977
#> MultipleLinesYes                      0.47121501  0.12790702   3.684
#> TotalCharges                          0.00031207  0.00009168   3.404
#> PaperlessBillingYes                   0.19219913  0.10551222   1.822
#> DependentsYes                        -0.17974536  0.11390380  -1.578
#>                                                Pr(>|z|)    
#> (Intercept)                          0.0000000182905857 ***
#> ContractOne year                     0.0000038955164861 ***
#> ContractTwo year                     0.0000000000025397 ***
#> InternetServiceFiber optic           0.0000000000008950 ***
#> InternetServiceNo                    0.0000000001686753 ***
#> tenure                               0.0000000000000951 ***
#> PaymentMethodCredit card (automatic)           0.576810    
#> PaymentMethodElectronic check                  0.012176 *  
#> PaymentMethodMailed check                      0.055720 .  
#> StreamingMoviesYes                   0.0000006519461547 ***
#> SeniorCitizen1                                 0.012141 *  
#> TechSupportYes                                 0.134594    
#> StreamingTVYes                       0.0000024979388303 ***
#> MonthlyCharges                       0.0000006462721718 ***
#> MultipleLinesYes                               0.000230 ***
#> TotalCharges                                   0.000664 ***
#> PaperlessBillingYes                            0.068518 .  
#> DependentsYes                                  0.114555    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 4117.3  on 2969  degrees of freedom
#> Residual deviance: 2784.7  on 2952  degrees of freedom
#> AIC: 2820.7
#> 
#> Number of Fisher Scoring iterations: 6

From step-wise regression (forward direction), we can see that Residual deviance is not so much different, and yet we certainly won’t use this model because it has slightly bigger Residual deviance. It seems that our best model at this moment is step-wise backward model.

Model Performance Comparison

compare_performance(model.glm.all, model.step.bwd)

Prediction

We’re going to predict to our data test (using the range of probability to churn from our model) >where the condition is churn == 1

churn_test$glm.prediction <- predict(model.step.bwd, newdata = churn_test, type = "response")

Now we’re giving a label based on the range number of probability, for the first trial try let’s assume that :

if the probablity greater that 0.5, then the customer might churn (labelled as Yes)

churn_test$glm.Label <- ifelse(churn_test$glm.prediction > 0.5, "Yes" , "No") %>% as.factor()

Let’s see the glimpse of our model prediction to the data test

churn_test %>% 
  select(Churn, glm.prediction,glm.Label) %>% 
  head(6)

Model Evaluation

We’ve saved our prediction to our data test, now let’s evaluate our prediction using confusion matrix

confusionMatrix(data=churn_test$glm.Label, reference=churn_test$Churn, positive="Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  733  82
#>        Yes 298 290
#>                                              
#>                Accuracy : 0.7292             
#>                  95% CI : (0.7051, 0.7523)   
#>     No Information Rate : 0.7349             
#>     P-Value [Acc > NIR] : 0.6976             
#>                                              
#>                   Kappa : 0.4138             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7796             
#>             Specificity : 0.7110             
#>          Pos Pred Value : 0.4932             
#>          Neg Pred Value : 0.8994             
#>              Prevalence : 0.2651             
#>          Detection Rate : 0.2067             
#>    Detection Prevalence : 0.4191             
#>       Balanced Accuracy : 0.7453             
#>                                              
#>        'Positive' Class : Yes                
#> 

Model Interpretation

1. Accuracy rate: 73%

  • Means that our model could predict accurately of Churn and Not Churn class (compared to data test) for 73%
  • Predict True as Churn: 290 (out of 372)
  • Predicted True as Not Churn: 733 (out 0f 1031)
  • Accuracy : \[ \frac{290 + 733}{372 + 1031} = \frac{1023}{1403} = 0.7291518 \]

2. Focus output: Sensitivity Rate 78%

Focus output on sensitivity because we want to decrease the chance of wrong prediction to predict customer won’t churn but in fact the customer does churn

  • Means that our model accurately predict True Positive (true prediction to customer churn), and false prediction as customer won’t churn (yet in fact, customer does churn on data test) is 78%
  • Predict True as Churn: 290 (out of 372)
  • Predicted True as Not Churn: 82 (out 0f 372)
  • Sensitivity : \[ \frac{290}{372} = 0.7795699 \]

Increasing Sensitivity

we can increasing sensitivity of our model by adjusting the threshold of condition whether the customer churn or not. But of course the adjustment needed a reasonable consideration.

For example based on our model prediction, our model sensitivity rate will increase if we set the threshold to 0.45, let’s see the output

churn_test$glm.Label.4 <- ifelse(churn_test$glm.prediction > 0.45, "Yes" , "No") %>% as.factor()
confusionMatrix(data=churn_test$glm.Label.4, reference=churn_test$Churn, positive="Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  699  72
#>        Yes 332 300
#>                                              
#>                Accuracy : 0.712              
#>                  95% CI : (0.6876, 0.7356)   
#>     No Information Rate : 0.7349             
#>     P-Value [Acc > NIR] : 0.9746             
#>                                              
#>                   Kappa : 0.396              
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.8065             
#>             Specificity : 0.6780             
#>          Pos Pred Value : 0.4747             
#>          Neg Pred Value : 0.9066             
#>              Prevalence : 0.2651             
#>          Detection Rate : 0.2138             
#>    Detection Prevalence : 0.4505             
#>       Balanced Accuracy : 0.7422             
#>                                              
#>        'Positive' Class : Yes                
#> 

Now our Sensitivity rate increase to 80%, we’ve decreased 10 false negative prediction compared to previous threshold (0.5)



KNN Model

Dummy Code for Categorical Variable

K-NN involves calculating distances between data points, we must use numeric variables only. This only applies to the predictor variables. The outcome variable for k-NN classification should remain a factor variable. Note that our data uses mixed data type (categorical and numerical), to handle this problem, we can dummy code any factor or categorical variables.

Now let’s see our data structure

str(churn)
#> 'data.frame':    7010 obs. of  20 variables:
#>  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
#>  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
#>  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
#>  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
#>  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
#>  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#>  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
#>  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
#>  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
#>  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
#>  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
#>  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#>  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
#>  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
#>  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
#>  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
#>  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
#>  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
#>  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
#>  - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
#>   ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...

Our data has some categorical variable with 2 levels and more than 2 levels

1. Copy a Dataframe

churn_dum <- data.frame(churn)
churn_dum

From this point on we’re going to use churn_dum data frame

For variable that has 2 factor we’ll change the value into binomial.

Column list:

  • gender
  • Partner
  • Dependents
  • PhoneService
  • MultipleLines
  • OnlineSecurity
  • OnlineBackup
  • DeviceProtection
  • TechSupport
  • StreamingTV
  • StreamingMovies
  • PaperlessBilling

2. Dummy code variables that have 2 levels and coded 1 & 0 **

churn_dum$gender <- ifelse(churn_dum$gender == "Male", 1, 0)
churn_dum$Partner <- ifelse(churn_dum$Partner == "Yes", 1, 0)
churn_dum$Dependents <- ifelse(churn_dum$Dependents == "Yes", 1, 0)
churn_dum$PhoneService <- ifelse(churn_dum$PhoneService == "Yes", 1, 0)
churn_dum$MultipleLines <- ifelse(churn_dum$MultipleLines == "Yes", 1, 0)
churn_dum$OnlineSecurity <- ifelse(churn_dum$OnlineSecurity == "Yes", 1, 0)
churn_dum$OnlineBackup <- ifelse(churn_dum$OnlineBackup == "Yes", 1, 0)
churn_dum$DeviceProtection <- ifelse(churn_dum$DeviceProtection == "Yes", 1, 0)
churn_dum$TechSupport <- ifelse(churn_dum$TechSupport == "Yes", 1, 0)
churn_dum$StreamingTV <- ifelse(churn_dum$StreamingTV == "Yes", 1, 0)
churn_dum$StreamingMovies <- ifelse(churn_dum$StreamingMovies == "Yes", 1, 0)
churn_dum$PaperlessBilling <- ifelse(churn_dum$PaperlessBilling == "Yes", 1, 0)

3. Dummy code variables that have more that 2 levels

  • place dummy code of churn_dum variable into a data frame

  • rename column to be more meaningful name (optional)

  • Column InternetService

InternetService <- as.data.frame(dummy.code(churn_dum$InternetService))
InternetService <- rename(InternetService, no.InternetService = No)
  • Column Contract
Contract <- as.data.frame(dummy.code(churn_dum$Contract))
  • Column PaymentMethod
PaymentMethod <- as.data.frame(dummy.code(churn_dum$PaymentMethod)) 

Feature Engineering

Drop variable that has been turn into dummy code, and combine dummy code to churn_dum data frame

churn_dum <- churn_dum %>% 
  select(-c(InternetService,Contract,PaymentMethod)) %>% 
  rename(is.male = gender) %>% 
  mutate(SeniorCitizen = as.numeric(SeniorCitizen))


churn_dum <- cbind(churn_dum, InternetService, Contract, PaymentMethod)
head(churn_dum)

Recheck our data structure

str(churn_dum)
#> 'data.frame':    7010 obs. of  27 variables:
#>  $ is.male                  : num  0 1 1 1 0 0 1 0 0 1 ...
#>  $ SeniorCitizen            : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Partner                  : num  1 0 0 0 0 0 0 0 1 0 ...
#>  $ Dependents               : num  0 0 0 0 0 0 1 0 0 1 ...
#>  $ tenure                   : int  1 34 2 45 2 8 22 10 28 62 ...
#>  $ PhoneService             : num  0 1 1 0 1 1 1 0 1 1 ...
#>  $ MultipleLines            : num  0 0 0 0 0 1 1 0 1 0 ...
#>  $ OnlineSecurity           : num  0 1 1 1 0 0 0 1 0 1 ...
#>  $ OnlineBackup             : num  1 0 1 0 0 0 1 0 0 1 ...
#>  $ DeviceProtection         : num  0 1 0 1 0 1 0 0 1 0 ...
#>  $ TechSupport              : num  0 0 0 1 0 0 0 0 1 0 ...
#>  $ StreamingTV              : num  0 0 0 0 0 1 1 0 1 0 ...
#>  $ StreamingMovies          : num  0 0 0 0 0 1 0 0 1 0 ...
#>  $ PaperlessBilling         : num  1 0 1 0 1 1 1 0 1 0 ...
#>  $ MonthlyCharges           : num  29.9 57 53.9 42.3 70.7 ...
#>  $ TotalCharges             : num  29.9 1889.5 108.2 1840.8 151.7 ...
#>  $ Churn                    : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
#>  $ Fiber optic              : num  0 0 0 0 1 1 1 0 1 0 ...
#>  $ DSL                      : num  1 1 1 1 0 0 0 1 0 1 ...
#>  $ no.InternetService       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Month-to-month           : num  1 0 1 0 1 1 1 1 1 0 ...
#>  $ Two year                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ One year                 : num  0 1 0 1 0 0 0 0 0 1 ...
#>  $ Electronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
#>  $ Mailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
#>  $ Bank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
#>  $ Credit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...

Splitting Data Train dan Data Test for K-NN

as we did in previous model, we’re going to split data into data test and data train

RNGkind(sample.kind = "Rounding")
set.seed(101)

no.index <- initial_split(data=churn_dum,
                       prop = 0.8, 
                       strata = Churn) 

churndum_trainKNN <- training(no.index)
churndum_testKNN <- testing(no.index)

Recheck our target variable proportion in our data train

prop.table(table(churndum_trainKNN$Churn))
#> 
#>        No       Yes 
#> 0.7351525 0.2648475

Perform down sampling

Down sampling is to balance target variable proportion

RNGkind(sample.kind = "Rounding")
set.seed(101)

churndum_trainKNN_down <- downSample(x = churndum_trainKNN %>% select(-Churn),
                         y = churndum_trainKNN$Churn,
                         yname = "Churn")
prop.table(table(churndum_trainKNN_down$Churn))
#> 
#>  No Yes 
#> 0.5 0.5

Split into predictor variable and target variable

Splitting data into predictors (numerical) and target variable (categorical), is needed because we’re going to perform scaling to all of our numeric value, so that all numerical variable is in the same range.

# Data train & Data Test Predictor (numeric)
train_numeric <- churndum_trainKNN_down %>% select_if(is.numeric)
test_numeric <- churndum_trainKNN_down %>% select_if(is.numeric)

# Data train & Data Test Target (categorical/ Churn)
train_target <- churndum_trainKNN_down %>% select(Churn)
test_target <- churndum_trainKNN_down %>% select(Churn)

Scaling

train_numeric <- scale(train_numeric)

test_numeric <- scale(test_numeric, 
                center=attr(train_numeric, "scaled:center"),
                scale=attr(train_numeric, "scaled:scale"))
head(train_numeric, 1)
#>         is.male SeniorCitizen   Partner Dependents    tenure PhoneService
#> [1,] -0.9964708    -0.4739901 -0.890624 -0.5847687 0.3680759    0.3307779
#>      MultipleLines OnlineSecurity OnlineBackup DeviceProtection TechSupport
#> [1,]    -0.8839573     -0.5728475   -0.7005702       -0.7091312  -0.5816587
#>      StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges
#> [1,]  -0.8284386      -0.8129276         -1.33233      -1.687379   -0.5714615
#>      Fiber optic        DSL no.InternetService Month-to-month Two year
#> [1,]   -1.036176 -0.6893836           2.291476      -1.374216 2.170242
#>        One year Electronic check Mailed check Bank transfer (automatic)
#> [1,] -0.4541661       -0.8359721     1.930814                -0.4809006
#>      Credit card (automatic)
#> [1,]              -0.4830219
head(test_numeric, 1)
#>         is.male SeniorCitizen   Partner Dependents    tenure PhoneService
#> [1,] -0.9964708    -0.4739901 -0.890624 -0.5847687 0.3680759    0.3307779
#>      MultipleLines OnlineSecurity OnlineBackup DeviceProtection TechSupport
#> [1,]    -0.8839573     -0.5728475   -0.7005702       -0.7091312  -0.5816587
#>      StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges
#> [1,]  -0.8284386      -0.8129276         -1.33233      -1.687379   -0.5714615
#>      Fiber optic        DSL no.InternetService Month-to-month Two year
#> [1,]   -1.036176 -0.6893836           2.291476      -1.374216 2.170242
#>        One year Electronic check Mailed check Bank transfer (automatic)
#> [1,] -0.4541661       -0.8359721     1.930814                -0.4809006
#>      Credit card (automatic)
#> [1,]              -0.4830219

K-NN Model Building

Checking root of total rows

To determine the K number (argument from knn(function)) we’re going to use the root of total rows from data train

nrow(train_numeric)
#> [1] 2970
sqrt(nrow(train_numeric))
#> [1] 54.49771

Because we have event number of train model, we’re going to use odd number to K argument

Prepare K-NN Model

churn_knn <- knn(train = train_numeric,
    test = test_numeric,
    cl = train_target$Churn, 
    k=55)


head(churn_knn, 5)
#> [1] No  Yes Yes No  No 
#> Levels: No Yes

Model Evaluation

Let’s evaluate our prediction using K-NN model

confusionMatrix(data=churn_knn, reference=test_target$Churn, positive="Yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   No  Yes
#>        No  1008  233
#>        Yes  477 1252
#>                                                
#>                Accuracy : 0.7609               
#>                  95% CI : (0.7452, 0.7762)     
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.5219               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8431               
#>             Specificity : 0.6788               
#>          Pos Pred Value : 0.7241               
#>          Neg Pred Value : 0.8122               
#>              Prevalence : 0.5000               
#>          Detection Rate : 0.4215               
#>    Detection Prevalence : 0.5822               
#>       Balanced Accuracy : 0.7609               
#>                                                
#>        'Positive' Class : Yes                  
#> 

KNN Model Interpretation

From the summary of K-NN model, we can see that K-NN model accuracy is 76% and sensitivity rate at 84 %, means that K-nn model quite good (76%) at predicting True Positive and True Negative value.

Sensitifity at 84%: K-NN model has 86% accurately predct True Positif from total of TP + FN. Note that we want to decrease the number of FN (a customer we predict as not churn but in fact they churn).

Summary

  1. Logistic Regression (Step-wise Backward) Model Output:
  • Accuracy : 71 % _ Sensitivity : 80 %
  1. K-NN Model Output:
  • Accuracy : 76 % _ Sensitivity : 84 %

Conclusion: Comparing both model (in this case), K-NN has better performance than logistic regression to predict customer churn. To increase accuracy rate and sensitivity, may needed more data information/ other model.