Preface
Logistic Regression
Why do we use logistic regression?
Logistic regression is a classification algorithm used to predict a categorical output variable (e.g., Yes/No) from a set of independent variables.
Logistic regression explains the relationship between a categorical dependent variable (the target variable) and one or more independent variables. When the dependent variable is dichotomous, we use binary logistic regression.
Main Objective: Customer Churn Prediction

Customer churn is the percentage of customers that stopped using your
company’s product or service during a certain time frame. Customer churn
prediction is essential to evaluate a business strategy, and of course
to avoid more customer churn.
In this section we’re going to predict customer churn using two models: logistic regression and K-NN.
Data source: Telco Customer Churn
Data Wrangling
Now let’s see the summary of our data
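(The chunk that produced this summary is not shown in the rendered output; below is a minimal sketch of the assumed setup. The CSV file name is hypothetical.)
# Assumed setup: read the Telco Customer Churn data with character columns as factors, then summarise it
churn <- read.csv("telco_customer_churn.csv", stringsAsFactors = TRUE) # file name is hypothetical
summary(churn)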
#> customerID gender SeniorCitizen Partner Dependents
#> 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
#> 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
#> 0004-TLHLJ: 1 Median :0.0000
#> 0011-IGKFF: 1 Mean :0.1621
#> 0013-EXCHZ: 1 3rd Qu.:0.0000
#> 0013-MHZWF: 1 Max. :1.0000
#> (Other) :7037
#> tenure PhoneService MultipleLines InternetService
#> Min. : 0.00 No : 682 No :3390 DSL :2421
#> 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
#> Median :29.00 Yes :2971 No :1526
#> Mean :32.37
#> 3rd Qu.:55.00
#> Max. :72.00
#>
#> OnlineSecurity OnlineBackup
#> No :3498 No :3088
#> No internet service:1526 No internet service:1526
#> Yes :2019 Yes :2429
#>
#>
#>
#>
#> DeviceProtection TechSupport
#> No :3095 No :3473
#> No internet service:1526 No internet service:1526
#> Yes :2422 Yes :2044
#>
#>
#>
#>
#> StreamingTV StreamingMovies Contract
#> No :2810 No :2785 Month-to-month:3875
#> No internet service:1526 No internet service:1526 One year :1473
#> Yes :2707 Yes :2732 Two year :1695
#>
#>
#>
#>
#> PaperlessBilling PaymentMethod MonthlyCharges
#> No :2872 Bank transfer (automatic):1544 Min. : 18.25
#> Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
#> Electronic check :2365 Median : 70.35
#> Mailed check :1612 Mean : 64.76
#> 3rd Qu.: 89.85
#> Max. :118.75
#>
#> TotalCharges Churn
#> Min. : 18.8 No :5174
#> 1st Qu.: 401.4 Yes:1869
#> Median :1397.5
#> Mean :2283.3
#> 3rd Qu.:3794.7
#> Max. :8684.8
#> NA's :11
Highlight: Our data has 11 missing values in the TotalCharges variable. Handling these NA values will be performed in further processing.
Checking data structure
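(The call itself is hidden in the rendered output; presumably it is the one below.)
str(churn)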
#> 'data.frame': 7043 obs. of 21 variables:
#> $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
#> $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
#> $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
#> $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
#> $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
#> $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
#> $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
#> $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
#> $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
#> $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
#> $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
#> $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
#> $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
#> $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
#> $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
#> $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
#> $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
#> $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
#> $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
#> $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
There are some columns with redundant information in their factor levels:
- The No internet service level appears in the columns OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies (MultipleLines has the analogous No phone service level).
- We can interpret No internet service as No; therefore, to simplify the variable levels, we’re going to change No internet service -> No.
- Then remove the now-unused No internet service level.
Adjusting Variable’s Level
# Recode the "No phone service" / "No internet service" values to "No"
churn[churn$MultipleLines == "No phone service", "MultipleLines"] <- "No"
churn[churn$OnlineSecurity == "No internet service", "OnlineSecurity"] <- "No"
churn[churn$OnlineBackup == "No internet service", "OnlineBackup"] <- "No"
churn[churn$DeviceProtection == "No internet service", "DeviceProtection"] <- "No"
churn[churn$TechSupport == "No internet service", "TechSupport"] <- "No"
churn[churn$StreamingTV == "No internet service", "StreamingTV"] <- "No"
churn[churn$StreamingMovies == "No internet service", "StreamingMovies"] <- "No"
# Merge the redundant level into "No", which drops the unused "No phone service" / "No internet service" levels
levels(churn$MultipleLines)[levels(churn$MultipleLines) == 'No phone service'] <- "No"
levels(churn$OnlineSecurity)[levels(churn$OnlineSecurity) == 'No internet service'] <- "No"
levels(churn$OnlineBackup)[levels(churn$OnlineBackup) == 'No internet service'] <- "No"
levels(churn$DeviceProtection)[levels(churn$DeviceProtection) == 'No internet service'] <- "No"
levels(churn$TechSupport)[levels(churn$TechSupport) == 'No internet service'] <- "No"
levels(churn$StreamingTV)[levels(churn$StreamingTV ) == 'No internet service'] <- "No"
levels(churn$StreamingMovies)[levels(churn$StreamingMovies ) == 'No internet service'] <- "No"
Recheck our data structure
#> 'data.frame': 7043 obs. of 21 variables:
#> $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
#> $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
#> $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
#> $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
#> $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
#> $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
#> $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#> $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
#> $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
#> $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
#> $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
#> $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
#> $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#> $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
#> $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
#> $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
#> $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
#> $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
#> $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
#> $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Now our data has more meaningful, more direct factor levels.
Checking Duplicate
any(duplicated(churn[, -1]))
#> [1] TRUE
Our data contains duplicated rows.
Note: We excluded the ID column when checking for duplicates, so that duplicated() checks the substantive columns only.
Each customer (ID) should have its own values in columns such as MonthlyCharges and TotalCharges, so when two rows carry identical information across all remaining columns we can assume they are indeed duplicates.
churn[duplicated(churn[, -1]),]
Handling Duplicated and NA Values
As we saw in the summary above, our data contains duplicated rows and NA values. We’re going to drop all duplicates and NAs, considering that their total number is not significant.
Note: In this step I also drop the customerID column (irrelevant to prediction) and convert the SeniorCitizen column to a factor.
churn <- churn %>%
select(-customerID) %>%
mutate(SeniorCitizen = as.factor(SeniorCitizen)) %>%
na.omit()
churn <- churn[!duplicated(churn),]
#> gender SeniorCitizen Partner Dependents tenure PhoneService
#> Female:3475 0:5869 No :3617 No :4911 Min. : 1.00 No : 680
#> Male :3535 1:1141 Yes:3393 Yes:2099 1st Qu.: 9.00 Yes:6330
#> Median :29.00
#> Mean :32.52
#> 3rd Qu.:56.00
#> Max. :72.00
#> MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
#> No :4043 DSL :2414 No :4995 No :4585 No :4592
#> Yes:2967 Fiber optic:3090 Yes:2015 Yes:2425 Yes:2418
#> No :1506
#>
#>
#>
#> TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
#> No :4970 No :4307 No :4279 Month-to-month:3853 No :2852
#> Yes:2040 Yes:2703 Yes:2731 One year :1472 Yes:4158
#> Two year :1685
#>
#>
#>
#> PaymentMethod MonthlyCharges TotalCharges Churn
#> Bank transfer (automatic):1542 Min. : 18.25 Min. : 18.8 No :5153
#> Credit card (automatic) :1521 1st Qu.: 35.75 1st Qu.: 408.3 Yes:1857
#> Electronic check :2359 Median : 70.40 Median :1403.9
#> Mailed check :1588 Mean : 64.89 Mean :2290.4
#> 3rd Qu.: 89.90 3rd Qu.:3807.8
#> Max. :118.75 Max. :8684.8
Checking outliers
par(mfrow=c(1,3))
boxplot(churn$tenure, main = "Tenure")
boxplot(churn$MonthlyCharges, main = "MonthlyCharges")
boxplot(churn$TotalCharges, main = "TotalCharges" )

Highlight: Our numerical variables don’t contain outliers.
Logistic Regression Model
Splitting data into Data Train and Data Test
Before creating a model, we’re going to split our data into a training set and a test set.
The training data is sampled from the main data (the churn data frame) and will be used to train our model.
The test data is also sampled from the main data and will be used to evaluate our predictions on data unseen during modelling.
Splitting Proportion
- Data Train : 80%
- Data Test : 20%
- Target variable: Churn
RNGkind(sample.kind = "Rounding")
set.seed(101)
#sampling data
row.sample <- initial_split(data=churn,
prop = 0.8,
strata = Churn)
# Split data into data train and data test
churn_train <- training(row.sample)
churn_test <- testing(row.sample)
Checking Target Variable Distribution
prop.table(table(churn_train$Churn))
#>
#> No Yes
#> 0.7351525 0.2648475
Checking the target variable distribution in the training data is essential after splitting. We want our model to learn the pattern of both target classes equally.
As we can see, the training data has more ‘No’ than ‘Yes’ observations, so I’m going to apply down-sampling to balance the class proportions in the training data (a sketch of this step follows).
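The chunk that creates churn_train_down is not shown above; here is a minimal sketch of one way to build it, assuming caret’s downSample() (the function choice and seed are assumptions, not the author’s confirmed code).
# Assumed down-sampling step: keep every "Yes" row and sample the "No" class down to the same size
library(caret)
set.seed(101)
churn_train_down <- downSample(x = churn_train %>% select(-Churn),
                               y = churn_train$Churn,
                               yname = "Churn")
prop.table(table(churn_train_down$Churn)) # both classes should now be 0.5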
Model Building
1. Logistic Regression Model
As a baseline, we’re going to create a model with all variables as predictors.
model.glm.all <- glm(formula = Churn ~ .,
data = churn_train_down,
family = "binomial")
summary(model.glm.all)
#>
#> Call:
#> glm(formula = Churn ~ ., family = "binomial", data = churn_train_down)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3540 -0.7160 0.1118 0.7254 2.9098
#>
#> Coefficients:
#> Estimate Std. Error z value
#> (Intercept) 3.7699084 1.1890968 3.170
#> genderMale 0.0008891 0.0940060 0.009
#> SeniorCitizen1 0.3217475 0.1293025 2.488
#> PartnerYes 0.0110704 0.1137114 0.097
#> DependentsYes -0.1816949 0.1263558 -1.438
#> tenure -0.0580430 0.0078784 -7.367
#> PhoneServiceYes 1.3154225 0.9342042 1.408
#> MultipleLinesYes 0.7639486 0.2589283 2.950
#> InternetServiceFiber optic 3.3090974 1.1576471 2.858
#> InternetServiceNo -3.1941767 1.1677907 -2.735
#> OnlineSecurityYes 0.1878893 0.2589138 0.726
#> OnlineBackupYes 0.2273537 0.2543363 0.894
#> DeviceProtectionYes 0.4546489 0.2581146 1.761
#> TechSupportYes 0.1078538 0.2581989 0.418
#> StreamingTVYes 1.2415122 0.4707330 2.637
#> StreamingMoviesYes 1.2959644 0.4723263 2.744
#> ContractOne year -0.6622757 0.1453015 -4.558
#> ContractTwo year -1.6071267 0.2291407 -7.014
#> PaperlessBillingYes 0.1891272 0.1059667 1.785
#> PaymentMethodCredit card (automatic) -0.0848362 0.1573657 -0.539
#> PaymentMethodElectronic check 0.3407592 0.1363990 2.498
#> PaymentMethodMailed check -0.3109473 0.1600670 -1.943
#> MonthlyCharges -0.1008175 0.0460016 -2.192
#> TotalCharges 0.0003170 0.0000922 3.438
#> Pr(>|z|)
#> (Intercept) 0.001522 **
#> genderMale 0.992454
#> SeniorCitizen1 0.012834 *
#> PartnerYes 0.922444
#> DependentsYes 0.150445
#> tenure 0.000000000000174 ***
#> PhoneServiceYes 0.159111
#> MultipleLinesYes 0.003173 **
#> InternetServiceFiber optic 0.004257 **
#> InternetServiceNo 0.006234 **
#> OnlineSecurityYes 0.468033
#> OnlineBackupYes 0.371370
#> DeviceProtectionYes 0.078167 .
#> TechSupportYes 0.676155
#> StreamingTVYes 0.008354 **
#> StreamingMoviesYes 0.006073 **
#> ContractOne year 0.000005165774522 ***
#> ContractTwo year 0.000000000002321 ***
#> PaperlessBillingYes 0.074297 .
#> PaymentMethodCredit card (automatic) 0.589816
#> PaymentMethodElectronic check 0.012481 *
#> PaymentMethodMailed check 0.052064 .
#> MonthlyCharges 0.028408 *
#> TotalCharges 0.000585 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4117.3 on 2969 degrees of freedom
#> Residual deviance: 2779.5 on 2946 degrees of freedom
#> AIC: 2827.5
#>
#> Number of Fisher Scoring iterations: 6
Logistic Regression Model Output Interpretation:
- When we include all variables as predictors, we can see that some variables do not have a statistically significant relationship with the target variable.
- The residual deviance is 2779.5; we will compare it against other models to see whether they achieve lower residual deviance (and AIC).
2. Step-wise Regression
As we saw in the previous model, some variables do not have a significant relationship with the target variable. We will try step-wise regression for feature selection.
# Step-wise model (Backward)
model.step.bwd <- step(model.glm.all,
direction = "backward",
trace = F)
summary(model.step.bwd)
#>
#> Call:
#> glm(formula = Churn ~ SeniorCitizen + Dependents + tenure + PhoneService +
#> MultipleLines + InternetService + DeviceProtection + StreamingTV +
#> StreamingMovies + Contract + PaperlessBilling + PaymentMethod +
#> MonthlyCharges + TotalCharges, family = "binomial", data = churn_train_down)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3222 -0.7163 0.1143 0.7265 2.9351
#>
#> Coefficients:
#> Estimate Std. Error z value
#> (Intercept) 2.96055331 0.45905807 6.449
#> SeniorCitizen1 0.32493734 0.12860131 2.527
#> DependentsYes -0.17469692 0.11409329 -1.531
#> tenure -0.05751853 0.00781551 -7.360
#> PhoneServiceYes 0.65700097 0.31614852 2.078
#> MultipleLinesYes 0.59952553 0.13213733 4.537
#> InternetServiceFiber optic 2.49590173 0.32350517 7.715
#> InternetServiceNo -2.37917705 0.41859359 -5.684
#> DeviceProtectionYes 0.28780612 0.13221166 2.177
#> StreamingTVYes 0.91253593 0.17307548 5.272
#> StreamingMoviesYes 0.96726840 0.17648317 5.481
#> ContractOne year -0.67344946 0.14443492 -4.663
#> ContractTwo year -1.62353712 0.22624622 -7.176
#> PaperlessBillingYes 0.19296684 0.10564085 1.827
#> PaymentMethodCredit card (automatic) -0.08247750 0.15725161 -0.524
#> PaymentMethodElectronic check 0.34126955 0.13622324 2.505
#> PaymentMethodMailed check -0.31398460 0.15987148 -1.964
#> MonthlyCharges -0.06811462 0.01298500 -5.246
#> TotalCharges 0.00031851 0.00009192 3.465
#> Pr(>|z|)
#> (Intercept) 0.0000000001124486 ***
#> SeniorCitizen1 0.01151 *
#> DependentsYes 0.12573
#> tenure 0.0000000000001845 ***
#> PhoneServiceYes 0.03770 *
#> MultipleLinesYes 0.0000057022313052 ***
#> InternetServiceFiber optic 0.0000000000000121 ***
#> InternetServiceNo 0.0000000131780691 ***
#> DeviceProtectionYes 0.02949 *
#> StreamingTVYes 0.0000001345971066 ***
#> StreamingMoviesYes 0.0000000423414440 ***
#> ContractOne year 0.0000031216374031 ***
#> ContractTwo year 0.0000000000007179 ***
#> PaperlessBillingYes 0.06776 .
#> PaymentMethodCredit card (automatic) 0.59994
#> PaymentMethodElectronic check 0.01224 *
#> PaymentMethodMailed check 0.04953 *
#> MonthlyCharges 0.0000001557418990 ***
#> TotalCharges 0.00053 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4117.3 on 2969 degrees of freedom
#> Residual deviance: 2780.6 on 2951 degrees of freedom
#> AIC: 2818.6
#>
#> Number of Fisher Scoring iterations: 6
Step-wise Model (Forward)
# Step-wise Model (forward)
model.glm.none <- glm(formula = Churn ~ 1,
data = churn_train_down,
family = "binomial")
summary(step(model.glm.none,
direction = 'forward',
scope = list(upper = model.glm.all),
trace = F))
#>
#> Call:
#> glm(formula = Churn ~ Contract + InternetService + tenure + PaymentMethod +
#> StreamingMovies + SeniorCitizen + TechSupport + StreamingTV +
#> MonthlyCharges + MultipleLines + TotalCharges + PaperlessBilling +
#> Dependents, family = "binomial", data = churn_train_down)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3105 -0.7198 0.1151 0.7212 2.9337
#>
#> Coefficients:
#> Estimate Std. Error z value
#> (Intercept) 2.26689319 0.40282864 5.627
#> ContractOne year -0.66802097 0.14469099 -4.617
#> ContractTwo year -1.59907286 0.22840324 -7.001
#> InternetServiceFiber optic 1.81334528 0.25376504 7.146
#> InternetServiceNo -1.60115932 0.25067272 -6.387
#> tenure -0.05803751 0.00779286 -7.448
#> PaymentMethodCredit card (automatic) -0.08759279 0.15696219 -0.558
#> PaymentMethodElectronic check 0.34102017 0.13602633 2.507
#> PaymentMethodMailed check -0.30544359 0.15964923 -1.913
#> StreamingMoviesYes 0.70265221 0.14123215 4.975
#> SeniorCitizen1 0.32262122 0.12863613 2.508
#> TechSupportYes -0.19498046 0.13031466 -1.496
#> StreamingTVYes 0.64479558 0.13694876 4.708
#> MonthlyCharges -0.03934125 0.00790485 -4.977
#> MultipleLinesYes 0.47121501 0.12790702 3.684
#> TotalCharges 0.00031207 0.00009168 3.404
#> PaperlessBillingYes 0.19219913 0.10551222 1.822
#> DependentsYes -0.17974536 0.11390380 -1.578
#> Pr(>|z|)
#> (Intercept) 0.0000000182905857 ***
#> ContractOne year 0.0000038955164861 ***
#> ContractTwo year 0.0000000000025397 ***
#> InternetServiceFiber optic 0.0000000000008950 ***
#> InternetServiceNo 0.0000000001686753 ***
#> tenure 0.0000000000000951 ***
#> PaymentMethodCredit card (automatic) 0.576810
#> PaymentMethodElectronic check 0.012176 *
#> PaymentMethodMailed check 0.055720 .
#> StreamingMoviesYes 0.0000006519461547 ***
#> SeniorCitizen1 0.012141 *
#> TechSupportYes 0.134594
#> StreamingTVYes 0.0000024979388303 ***
#> MonthlyCharges 0.0000006462721718 ***
#> MultipleLinesYes 0.000230 ***
#> TotalCharges 0.000664 ***
#> PaperlessBillingYes 0.068518 .
#> DependentsYes 0.114555
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 4117.3 on 2969 degrees of freedom
#> Residual deviance: 2784.7 on 2952 degrees of freedom
#> AIC: 2820.7
#>
#> Number of Fisher Scoring iterations: 6
From the step-wise regression in the forward direction, the residual deviance is not much different, yet we won’t use this model because its residual deviance (2784.7 vs 2780.6) and AIC (2820.7 vs 2818.6) are slightly higher than those of the backward model. Our best model at this moment is therefore the step-wise backward model.
Prediction
We’re going to predict on our test data, using the probability of churning produced by our model (i.e., the probability that Churn == "Yes").
churn_test$glm.prediction <- predict(model.step.bwd, newdata = churn_test, type = "response")
Now we’ll assign a label based on that probability. For a first trial, let’s assume that:
if the probability is greater than 0.5, the customer is likely to churn (labelled Yes).
churn_test$glm.Label <- ifelse(churn_test$glm.prediction > 0.5, "Yes" , "No") %>% as.factor()
Let’s glimpse our model’s predictions on the test data.
churn_test %>%
select(Churn, glm.prediction,glm.Label) %>%
head(6)
Model Evaluation
We’ve saved our predictions to the test data; now let’s evaluate them using a confusion matrix.
confusionMatrix(data=churn_test$glm.Label, reference=churn_test$Churn, positive="Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 733 82
#> Yes 298 290
#>
#> Accuracy : 0.7292
#> 95% CI : (0.7051, 0.7523)
#> No Information Rate : 0.7349
#> P-Value [Acc > NIR] : 0.6976
#>
#> Kappa : 0.4138
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7796
#> Specificity : 0.7110
#> Pos Pred Value : 0.4932
#> Neg Pred Value : 0.8994
#> Prevalence : 0.2651
#> Detection Rate : 0.2067
#> Detection Prevalence : 0.4191
#> Balanced Accuracy : 0.7453
#>
#> 'Positive' Class : Yes
#>
Model Interpretation
1. Accuracy rate: 73%
- Means that our model correctly predicts the Churn and Not Churn classes (against the test data) 73% of the time.
- Correctly predicted as Churn: 290 (out of 372)
- Correctly predicted as Not Churn: 733 (out of 1031)
- Accuracy : \[
\frac{290 + 733}{372 + 1031} = \frac{1023}{1403} = 0.7291518
\]
2. Focus output: Sensitivity rate 78%
We focus on sensitivity because we want to reduce the chance of wrongly predicting that a customer won’t churn when in fact the customer does churn.
- Means that out of all customers who actually churn in the test data, our model correctly identifies 78% of them.
- Correctly predicted as Churn (true positives): 290 (out of 372)
- Wrongly predicted as Not Churn (false negatives): 82 (out of 372)
- Sensitivity : \[
\frac{290}{372} = 0.7795699
\]
Increasing Sensitivity
We can increase the sensitivity of our model by adjusting the probability threshold that decides whether a customer is predicted to churn or not. Of course, any adjustment needs reasonable consideration, since a lower threshold usually trades specificity (and precision) for sensitivity; one way to explore the trade-off is sketched below.
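As an illustration (this helper is not part of the original analysis), we could scan a grid of candidate thresholds and compare the resulting sensitivity and specificity on the test data:
# Hypothetical threshold scan: sensitivity/specificity trade-off at several cut-offs
thresholds <- seq(0.30, 0.70, by = 0.05)
trade_off <- t(sapply(thresholds, function(th) {
  pred <- ifelse(churn_test$glm.prediction > th, "Yes", "No")
  c(threshold = th,
    sensitivity = mean(pred[churn_test$Churn == "Yes"] == "Yes"),
    specificity = mean(pred[churn_test$Churn == "No"] == "No"))
}))
trade_off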
For example, based on our model’s predictions, the sensitivity rate will increase if we set the threshold to 0.45; let’s see the output.
churn_test$glm.Label.4 <- ifelse(churn_test$glm.prediction > 0.45, "Yes" , "No") %>% as.factor()
confusionMatrix(data=churn_test$glm.Label.4, reference=churn_test$Churn, positive="Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 699 72
#> Yes 332 300
#>
#> Accuracy : 0.712
#> 95% CI : (0.6876, 0.7356)
#> No Information Rate : 0.7349
#> P-Value [Acc > NIR] : 0.9746
#>
#> Kappa : 0.396
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.8065
#> Specificity : 0.6780
#> Pos Pred Value : 0.4747
#> Neg Pred Value : 0.9066
#> Prevalence : 0.2651
#> Detection Rate : 0.2138
#> Detection Prevalence : 0.4505
#> Balanced Accuracy : 0.7422
#>
#> 'Positive' Class : Yes
#>
Now our sensitivity rate has increased to 80%; we’ve reduced the false negatives by 10 compared with the previous threshold (0.5).
KNN Model
Dummy Coding for Categorical Variables
K-NN involves calculating distances between data points, so we must use numeric variables only. This applies to the predictor variables only; the outcome variable for K-NN classification should remain a factor. Note that our data uses mixed data types (categorical and numerical); to handle this, we can dummy code any factor or categorical variables.
Now let’s see our data structure
#> 'data.frame': 7010 obs. of 20 variables:
#> $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
#> $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
#> $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
#> $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
#> $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
#> $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#> $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
#> $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
#> $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
#> $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
#> $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
#> $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
#> $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
#> $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
#> $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
#> $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
#> $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
#> $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
#> $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
#> - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
#> ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...
Our data has some categorical variables with 2 levels and others with more than 2 levels.
1. Copy a Dataframe
churn_dum <- data.frame(churn)
churn_dum
From this point on we’re going to use the churn_dum data frame.
For variables that have 2 factor levels, we’ll recode the values to binary (1/0).
Column list:
- gender
- Partner
- Dependents
- PhoneService
- MultipleLines
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- TechSupport
- StreamingTV
- StreamingMovies
- PaperlessBilling
2. Dummy code variables that have 2 levels (coded as 1 and 0)
churn_dum$gender <- ifelse(churn_dum$gender == "Male", 1, 0)
churn_dum$Partner <- ifelse(churn_dum$Partner == "Yes", 1, 0)
churn_dum$Dependents <- ifelse(churn_dum$Dependents == "Yes", 1, 0)
churn_dum$PhoneService <- ifelse(churn_dum$PhoneService == "Yes", 1, 0)
churn_dum$MultipleLines <- ifelse(churn_dum$MultipleLines == "Yes", 1, 0)
churn_dum$OnlineSecurity <- ifelse(churn_dum$OnlineSecurity == "Yes", 1, 0)
churn_dum$OnlineBackup <- ifelse(churn_dum$OnlineBackup == "Yes", 1, 0)
churn_dum$DeviceProtection <- ifelse(churn_dum$DeviceProtection == "Yes", 1, 0)
churn_dum$TechSupport <- ifelse(churn_dum$TechSupport == "Yes", 1, 0)
churn_dum$StreamingTV <- ifelse(churn_dum$StreamingTV == "Yes", 1, 0)
churn_dum$StreamingMovies <- ifelse(churn_dum$StreamingMovies == "Yes", 1, 0)
churn_dum$PaperlessBilling <- ifelse(churn_dum$PaperlessBilling == "Yes", 1, 0)
3. Dummy code variables that have more than 2 levels, using dummy.code() from the psych package (which expands a factor into one 0/1 indicator column per level)
InternetService <- as.data.frame(dummy.code(churn_dum$InternetService))
InternetService <- rename(InternetService, no.InternetService = No)
Contract <- as.data.frame(dummy.code(churn_dum$Contract))
PaymentMethod <- as.data.frame(dummy.code(churn_dum$PaymentMethod))
Feature Engineering
Drop the variables that have been turned into dummy codes, and combine the dummy columns with the churn_dum data frame.
churn_dum <- churn_dum %>%
select(-c(InternetService,Contract,PaymentMethod)) %>%
rename(is.male = gender) %>%
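# note: as.numeric() on this factor yields codes 1/2 rather than 0/1; after scaling, the shift is harmless for K-NN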
mutate(SeniorCitizen = as.numeric(SeniorCitizen))
churn_dum <- cbind(churn_dum, InternetService, Contract, PaymentMethod)
head(churn_dum)
Recheck our data structure
#> 'data.frame': 7010 obs. of 27 variables:
#> $ is.male : num 0 1 1 1 0 0 1 0 0 1 ...
#> $ SeniorCitizen : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ Partner : num 1 0 0 0 0 0 0 0 1 0 ...
#> $ Dependents : num 0 0 0 0 0 0 1 0 0 1 ...
#> $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
#> $ PhoneService : num 0 1 1 0 1 1 1 0 1 1 ...
#> $ MultipleLines : num 0 0 0 0 0 1 1 0 1 0 ...
#> $ OnlineSecurity : num 0 1 1 1 0 0 0 1 0 1 ...
#> $ OnlineBackup : num 1 0 1 0 0 0 1 0 0 1 ...
#> $ DeviceProtection : num 0 1 0 1 0 1 0 0 1 0 ...
#> $ TechSupport : num 0 0 0 1 0 0 0 0 1 0 ...
#> $ StreamingTV : num 0 0 0 0 0 1 1 0 1 0 ...
#> $ StreamingMovies : num 0 0 0 0 0 1 0 0 1 0 ...
#> $ PaperlessBilling : num 1 0 1 0 1 1 1 0 1 0 ...
#> $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
#> $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
#> $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
#> $ Fiber optic : num 0 0 0 0 1 1 1 0 1 0 ...
#> $ DSL : num 1 1 1 1 0 0 0 1 0 1 ...
#> $ no.InternetService : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ Month-to-month : num 1 0 1 0 1 1 1 1 1 0 ...
#> $ Two year : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ One year : num 0 1 0 1 0 0 0 0 0 1 ...
#> $ Electronic check : num 1 0 0 0 1 1 0 0 1 0 ...
#> $ Mailed check : num 0 1 1 0 0 0 0 1 0 0 ...
#> $ Bank transfer (automatic): num 0 0 0 1 0 0 0 0 0 1 ...
#> $ Credit card (automatic) : num 0 0 0 0 0 0 1 0 0 0 ...
Splitting Data Train and Data Test for K-NN
As we did for the previous model, we’re going to split the data into a training set and a test set.
RNGkind(sample.kind = "Rounding")
set.seed(101)
no.index <- initial_split(data=churn_dum,
prop = 0.8,
strata = Churn)
churndum_trainKNN <- training(no.index)
churndum_testKNN <- testing(no.index)
Recheck the target variable proportion in our training data; it is still imbalanced, so we’ll down-sample it again (see the sketch after the output below).
prop.table(table(churndum_trainKNN$Churn))
#>
#> No Yes
#> 0.7351525 0.2648475
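As with the logistic regression workflow, the chunk that creates churndum_trainKNN_down is not shown; a minimal sketch, again assuming caret’s downSample():
# Assumed down-sampling of the K-NN training data to balance the Churn classes
set.seed(101)
churndum_trainKNN_down <- downSample(x = churndum_trainKNN %>% select(-Churn),
                                     y = churndum_trainKNN$Churn,
                                     yname = "Churn")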
Split into predictor variables and target variable
Splitting the data into predictors (numeric) and the target variable (categorical) is needed because we’re going to scale all of our numeric values, so that every numeric variable is on the same range.
# Data train & data test predictors (numeric)
train_numeric <- churndum_trainKNN_down %>% select_if(is.numeric)
test_numeric <- churndum_testKNN %>% select_if(is.numeric)
# Data train & data test target (categorical / Churn)
train_target <- churndum_trainKNN_down %>% select(Churn)
test_target <- churndum_testKNN %>% select(Churn)
Scaling
train_numeric <- scale(train_numeric)
test_numeric <- scale(test_numeric,
center=attr(train_numeric, "scaled:center"),
scale=attr(train_numeric, "scaled:scale"))
#> is.male SeniorCitizen Partner Dependents tenure PhoneService
#> [1,] -0.9964708 -0.4739901 -0.890624 -0.5847687 0.3680759 0.3307779
#> MultipleLines OnlineSecurity OnlineBackup DeviceProtection TechSupport
#> [1,] -0.8839573 -0.5728475 -0.7005702 -0.7091312 -0.5816587
#> StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges
#> [1,] -0.8284386 -0.8129276 -1.33233 -1.687379 -0.5714615
#> Fiber optic DSL no.InternetService Month-to-month Two year
#> [1,] -1.036176 -0.6893836 2.291476 -1.374216 2.170242
#> One year Electronic check Mailed check Bank transfer (automatic)
#> [1,] -0.4541661 -0.8359721 1.930814 -0.4809006
#> Credit card (automatic)
#> [1,] -0.4830219
#> is.male SeniorCitizen Partner Dependents tenure PhoneService
#> [1,] -0.9964708 -0.4739901 -0.890624 -0.5847687 0.3680759 0.3307779
#> MultipleLines OnlineSecurity OnlineBackup DeviceProtection TechSupport
#> [1,] -0.8839573 -0.5728475 -0.7005702 -0.7091312 -0.5816587
#> StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges
#> [1,] -0.8284386 -0.8129276 -1.33233 -1.687379 -0.5714615
#> Fiber optic DSL no.InternetService Month-to-month Two year
#> [1,] -1.036176 -0.6893836 2.291476 -1.374216 2.170242
#> One year Electronic check Mailed check Bank transfer (automatic)
#> [1,] -0.4541661 -0.8359721 1.930814 -0.4809006
#> Credit card (automatic)
#> [1,] -0.4830219
K-NN Model Building
Checking the square root of the total rows
To determine k (the k argument of knn()), we’re going to use the square root of the number of rows in the training data.
nrow(train_numeric)
#> [1] 2970
sqrt(nrow(train_numeric))
#> [1] 54.49771
Because our target has two classes, we use an odd value for k to avoid tied votes, so we round the square root (about 54.5) to k = 55. The knn() function used below comes from the class package.
Prepare K-NN Model
churn_knn <- knn(train = train_numeric,
test = test_numeric,
cl = train_target$Churn,
k=55)
head(churn_knn, 5)
#> [1] No Yes Yes No No
#> Levels: No Yes
Model Evaluation
Let’s evaluate our K-NN predictions using a confusion matrix.
confusionMatrix(data=churn_knn, reference=test_target$Churn, positive="Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 1008 233
#> Yes 477 1252
#>
#> Accuracy : 0.7609
#> 95% CI : (0.7452, 0.7762)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.5219
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8431
#> Specificity : 0.6788
#> Pos Pred Value : 0.7241
#> Neg Pred Value : 0.8122
#> Prevalence : 0.5000
#> Detection Rate : 0.4215
#> Detection Prevalence : 0.5822
#> Balanced Accuracy : 0.7609
#>
#> 'Positive' Class : Yes
#>
KNN Model Interpretation
From the confusion matrix of the K-NN model, we can see that its accuracy is 76% and its sensitivity rate is 84%, meaning the model is reasonably good (76%) at predicting the True Positive and True Negative values overall.
Sensitivity at 84%: the K-NN model correctly identifies 84% of the true positives out of the total TP + FN. Note that we want to reduce the number of FN (customers we predict as not churning but who in fact churn).