1 About the Dataset

In this case we use the Orange Telecom’s churn dataset, which consists of cleaned customer activity data (features), along with a churn label specifying whether a customer canceled the subscription, will be used to develop predictive models.

1.1 Business Question

How to mitigate churn customer in Orange Telecom?

In this case we will comparing predictive model with Logistic Regression, Stepwise Regression, and KNN to find highest sensitivity (recall) with suitable accuracy & AIC.

2 Prepare the Data

Read Data

telcochurn_train <- read.csv("churn-bigml-80.csv", stringsAsFactors = TRUE) #80% of data
telcochurn_test <- read.csv("churn-bigml-20.csv", stringsAsFactors = TRUE) #20% of data

nrow(telcochurn_train)
#> [1] 2666
nrow(telcochurn_test)
#> [1] 667

Data Train

head(telcochurn_train)

Data Test

head(telcochurn_test)

Selection variable in both data

dropcols <- c("State","Area.code")

telcochurn_train <-  telcochurn_train %>%
  select(-one_of(dropcols))

telcochurn_test <- telcochurn_test %>%
  select(-one_of(dropcols))

3 Explarotary Data

Focus on Data Train to create model

glimpse(telcochurn_train)
#> Rows: 2,666
#> Columns: 18
#> $ Account.length         <int> 128, 107, 137, 84, 75, 118, 121, 147, 141, 74, …
#> $ International.plan     <fct> No, No, No, Yes, Yes, Yes, No, Yes, Yes, No, No…
#> $ Voice.mail.plan        <fct> Yes, Yes, No, No, No, No, Yes, No, Yes, No, No,…
#> $ Number.vmail.messages  <int> 25, 26, 0, 0, 0, 0, 24, 0, 37, 0, 0, 0, 0, 27, …
#> $ Total.day.minutes      <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2…
#> $ Total.day.calls        <int> 110, 123, 114, 71, 113, 98, 88, 79, 84, 127, 96…
#> $ Total.day.charge       <dbl> 45.07, 27.47, 41.38, 50.90, 28.34, 37.98, 37.09…
#> $ Total.eve.minutes      <dbl> 197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5,…
#> $ Total.eve.calls        <int> 99, 103, 110, 88, 122, 101, 108, 94, 111, 148, …
#> $ Total.eve.charge       <dbl> 16.78, 16.62, 10.30, 5.26, 12.61, 18.75, 29.62,…
#> $ Total.night.minutes    <dbl> 244.7, 254.4, 162.6, 196.9, 186.9, 203.9, 212.6…
#> $ Total.night.calls      <int> 91, 103, 104, 89, 121, 118, 118, 96, 97, 94, 12…
#> $ Total.night.charge     <dbl> 11.01, 11.45, 7.32, 8.86, 8.41, 9.18, 9.57, 9.5…
#> $ Total.intl.minutes     <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 11.…
#> $ Total.intl.calls       <int> 3, 3, 5, 7, 3, 6, 7, 6, 5, 5, 2, 5, 6, 4, 3, 5,…
#> $ Total.intl.charge      <dbl> 2.70, 3.70, 3.29, 1.78, 2.73, 1.70, 2.03, 1.92,…
#> $ Customer.service.calls <int> 1, 1, 0, 2, 3, 0, 3, 0, 0, 0, 1, 3, 4, 1, 3, 1,…
#> $ Churn                  <fct> False, False, False, False, False, False, False…

STATE: 51 Unique States in United States of America Account Length. Length of The Account Area Code 415 relates to San Francisco,408 is of San Jose and 510 is of City of Okland International Plan Yes Indicate International Plan is Present and No Indicates no subscription for Internatinal Plan Voice Mail Plan Yes Indicates Voice Mail Plan is Present and No Indicates no subscription for Voice Mail Plan Number vmail messages Number of Voice Mail Messages ranging from 0 to 50 Total day minutes Total Number of Minutes Spent By Customers in Morning Total day calls Total Number of Calls made by Customer in Morning. Total day charge Total Charge to the Customers in Morning. Total eve minutesTotal Number of Minutes Spent By Customers in Evening Total eve calls Total Number of Calls made by Customer in Evening. Total eve charge Total Charge to the Customers in Morning. Total night minutes Total Number of Minutes Spent By Customers in the Night. Total night calls Total Number of Calls made by Customer in Night. Total night charge Total Charge to the Customers in Night.

4 Data Wrangling

Check NA

colSums(is.na(telcochurn_train))
#>         Account.length     International.plan        Voice.mail.plan 
#>                      0                      0                      0 
#>  Number.vmail.messages      Total.day.minutes        Total.day.calls 
#>                      0                      0                      0 
#>       Total.day.charge      Total.eve.minutes        Total.eve.calls 
#>                      0                      0                      0 
#>       Total.eve.charge    Total.night.minutes      Total.night.calls 
#>                      0                      0                      0 
#>     Total.night.charge     Total.intl.minutes       Total.intl.calls 
#>                      0                      0                      0 
#>      Total.intl.charge Customer.service.calls                  Churn 
#>                      0                      0                      0
summary(telcochurn_train)
#>  Account.length  International.plan Voice.mail.plan Number.vmail.messages
#>  Min.   :  1.0   No :2396           No :1933        Min.   : 0.000       
#>  1st Qu.: 73.0   Yes: 270           Yes: 733        1st Qu.: 0.000       
#>  Median :100.0                                      Median : 0.000       
#>  Mean   :100.6                                      Mean   : 8.022       
#>  3rd Qu.:127.0                                      3rd Qu.:19.000       
#>  Max.   :243.0                                      Max.   :50.000       
#>  Total.day.minutes Total.day.calls Total.day.charge Total.eve.minutes
#>  Min.   :  0.0     Min.   :  0.0   Min.   : 0.00    Min.   :  0.0    
#>  1st Qu.:143.4     1st Qu.: 87.0   1st Qu.:24.38    1st Qu.:165.3    
#>  Median :179.9     Median :101.0   Median :30.59    Median :200.9    
#>  Mean   :179.5     Mean   :100.3   Mean   :30.51    Mean   :200.4    
#>  3rd Qu.:215.9     3rd Qu.:114.0   3rd Qu.:36.70    3rd Qu.:235.1    
#>  Max.   :350.8     Max.   :160.0   Max.   :59.64    Max.   :363.7    
#>  Total.eve.calls Total.eve.charge Total.night.minutes Total.night.calls
#>  Min.   :  0     Min.   : 0.00    Min.   : 43.7       Min.   : 33.0    
#>  1st Qu.: 87     1st Qu.:14.05    1st Qu.:166.9       1st Qu.: 87.0    
#>  Median :100     Median :17.08    Median :201.2       Median :100.0    
#>  Mean   :100     Mean   :17.03    Mean   :201.2       Mean   :100.1    
#>  3rd Qu.:114     3rd Qu.:19.98    3rd Qu.:236.5       3rd Qu.:113.0    
#>  Max.   :170     Max.   :30.91    Max.   :395.0       Max.   :166.0    
#>  Total.night.charge Total.intl.minutes Total.intl.calls Total.intl.charge
#>  Min.   : 1.970     Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
#>  1st Qu.: 7.513     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
#>  Median : 9.050     Median :10.20      Median : 4.000   Median :2.750    
#>  Mean   : 9.053     Mean   :10.24      Mean   : 4.467   Mean   :2.764    
#>  3rd Qu.:10.640     3rd Qu.:12.10      3rd Qu.: 6.000   3rd Qu.:3.270    
#>  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400    
#>  Customer.service.calls   Churn     
#>  Min.   :0.000          False:2278  
#>  1st Qu.:1.000          True : 388  
#>  Median :1.000                      
#>  Mean   :1.563                      
#>  3rd Qu.:2.000                      
#>  Max.   :9.000

4.1 Tuning Target Variable

Check proportion of target variable

table(telcochurn_train$Churn)
#> 
#> False  True 
#>  2278   388
prop.table(table(telcochurn_train$Churn))
#> 
#>     False      True 
#> 0.8544636 0.1455364

From result above we know our proportion of Churn variable as target is unbalance. The next step is balancing the Churn variable with up sample, down sample, both sample, ROSE, and SMOTE.

#Up Sampling
set.seed(309)
telcochurn_train_ups <- upSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
a <- table(telcochurn_train_ups$Churn)

#Down Sampling
set.seed(309)
telcochurn_train_downs <- downSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
b <- table(telcochurn_train_downs$Churn)

#Both Sampling
telcochurn_train_boths <- ovun.sample(Churn~.,data = telcochurn_train, method= "both", p=0.51, N = nrow(telcochurn_train), seed = 309)$data
c <- table(telcochurn_train_boths$Churn)

#ROSE method
set.seed(309)
telcochurn_train_rose <- ROSE(Churn~.,data=telcochurn_train)$data
d <- table(telcochurn_train_rose$Churn)

#SMOTE method
set.seed(309)
telcochurn_train_smote <- SMOTE(Churn~.,data = telcochurn_train, k = 51, perc.over = 180, perc.under = 170)
e <- table(telcochurn_train_smote$Churn)

data.tuning <- data.frame(a,b,c,d,e)%>%
  select(Var1,Freq,Freq.1,Freq.2,Freq.3,Freq.4) %>%
  rename(Churn = Var1,
         "Up Sample" = Freq,
         "Down Sample" = Freq.1,
         "Both Sample" = Freq.2,
         ROSE = Freq.3,
         SMOTE = Freq.4)

data.tuning

5 Create Model Prediction

5.1 Logistic Regression

Generate Model in all data tuning

#Up Sampling
model_train_up <- glm(Churn ~. , data = telcochurn_train_ups, family = "binomial")
summary(model_train_up)
#> 
#> Call:
#> glm(formula = Churn ~ ., family = "binomial", data = telcochurn_train_ups)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.98180  -0.83079  -0.04096   0.85446   2.66123  
#> 
#> Coefficients:
#>                            Estimate   Std. Error z value             Pr(>|z|)
#> (Intercept)             -6.46532520   0.43999832 -14.694 < 0.0000000000000002
#> Account.length           0.00146982   0.00088772   1.656             0.097776
#> International.planYes    2.57904622   0.11511925  22.403 < 0.0000000000000002
#> Voice.mail.planYes      -1.82799632   0.33615387  -5.438      0.0000000538895
#> Number.vmail.messages    0.03703686   0.01073208   3.451             0.000558
#> Total.day.minutes        0.50910547   2.12062620   0.240             0.810274
#> Total.day.calls          0.00279397   0.00174049   1.605             0.108433
#> Total.day.charge        -2.92115053  12.47420834  -0.234             0.814849
#> Total.eve.minutes        2.22298947   1.06311894   2.091             0.036527
#> Total.eve.calls          0.00009287   0.00176469   0.053             0.958027
#> Total.eve.charge       -26.08491734  12.50753240  -2.086             0.037021
#> Total.night.minutes      0.66213219   0.56059887   1.181             0.237557
#> Total.night.calls        0.00175552   0.00183449   0.957             0.338589
#> Total.night.charge     -14.65135495  12.45742822  -1.176             0.239549
#> Total.intl.minutes      -2.92485233   3.43091241  -0.852             0.393937
#> Total.intl.calls        -0.09584360   0.01468474  -6.527      0.0000000000672
#> Total.intl.charge       11.12699356  12.70678413   0.876             0.381208
#> Customer.service.calls   0.56423034   0.02614708  21.579 < 0.0000000000000002
#>                           
#> (Intercept)            ***
#> Account.length         .  
#> International.planYes  ***
#> Voice.mail.planYes     ***
#> Number.vmail.messages  ***
#> Total.day.minutes         
#> Total.day.calls           
#> Total.day.charge          
#> Total.eve.minutes      *  
#> Total.eve.calls           
#> Total.eve.charge       *  
#> Total.night.minutes       
#> Total.night.calls         
#> Total.night.charge        
#> Total.intl.minutes        
#> Total.intl.calls       ***
#> Total.intl.charge         
#> Customer.service.calls ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 6316.0  on 4555  degrees of freedom
#> Residual deviance: 4701.4  on 4538  degrees of freedom
#> AIC: 4737.4
#> 
#> Number of Fisher Scoring iterations: 5

Create all model

#Down Sampling
model_train_down <- glm(Churn ~. , data = telcochurn_train_downs, family = "binomial")

#Both Sampling
model_train_both <- glm(Churn ~. , data = telcochurn_train_boths, family = "binomial")

#ROSE
model_train_rose <- glm(Churn ~. , data = telcochurn_train_rose, family = "binomial")

#SMOTE
model_train_smote <- glm(Churn ~. , data = telcochurn_train_smote, family = "binomial")

Comparing AIC value

summary(model_train_up)$aic
#> [1] 4737.433
summary(model_train_down)$aic
#> [1] 836.3498
summary(model_train_both)$aic
#> [1] 2749.511
summary(model_train_rose)$aic
#> [1] 2951.251
summary(model_train_smote)$aic
#> [1] 1469.022

Down sample has lowest AIC value then other model, but we not conclude using down sample because ammount of data observation too small.

Execute the Model

#Up Sampling
telcochurn_test$prob_up <- predict(model_train_up, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_up <- ifelse(telcochurn_test$prob_up > 0.5, "True", "False")
                              
telcochurn_test <- telcochurn_test %>%
  mutate(prediction_up = as.factor(prediction_up))

cm_up <- confusionMatrix(data = telcochurn_test$prediction_up, reference = telcochurn_test$Churn, positive = "True")
cm_up
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction False True
#>      False   449   24
#>      True    123   71
#>                                                
#>                Accuracy : 0.7796               
#>                  95% CI : (0.7462, 0.8105)     
#>     No Information Rate : 0.8576               
#>     P-Value [Acc > NIR] : 1                    
#>                                                
#>                   Kappa : 0.3711               
#>                                                
#>  Mcnemar's Test P-Value : 0.0000000000000006324
#>                                                
#>             Sensitivity : 0.7474               
#>             Specificity : 0.7850               
#>          Pos Pred Value : 0.3660               
#>          Neg Pred Value : 0.9493               
#>              Prevalence : 0.1424               
#>          Detection Rate : 0.1064               
#>    Detection Prevalence : 0.2909               
#>       Balanced Accuracy : 0.7662               
#>                                                
#>        'Positive' Class : True                 
#> 
#Down Sampling
telcochurn_test$prob_down <- predict(model_train_down, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_down <- ifelse(telcochurn_test$prob_down > 0.5, "True", "False")
                              
telcochurn_test <- telcochurn_test %>%
  mutate(prediction_down = as.factor(prediction_down))

cm_down <- confusionMatrix(data = telcochurn_test$prediction_down, reference = telcochurn_test$Churn, positive = "True")

#Both Sampling
telcochurn_test$prob_both <- predict(model_train_both, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_both <- ifelse(telcochurn_test$prob_both > 0.5, "True", "False")
                              
telcochurn_test <- telcochurn_test %>%
  mutate(prediction_both = as.factor(prediction_both))

cm_both <- confusionMatrix(data = telcochurn_test$prediction_both, reference = telcochurn_test$Churn, positive = "True")

#ROSE
telcochurn_test$prob_rose <- predict(model_train_rose, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_rose <- ifelse(telcochurn_test$prob_rose > 0.5, "True", "False")
                              
telcochurn_test <- telcochurn_test %>%
  mutate(prediction_rose = as.factor(prediction_rose))

cm_rose <- confusionMatrix(data = telcochurn_test$prediction_rose, reference = telcochurn_test$Churn, positive = "True")

#SMOTE
telcochurn_test$prob_smote <- predict(model_train_smote, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_smote <- ifelse(telcochurn_test$prob_smote > 0.5, "True", "False")
                              
telcochurn_test <- telcochurn_test %>%
  mutate(prediction_smote = as.factor(prediction_smote))

cm_smote <- confusionMatrix(data = telcochurn_test$prediction_smote, reference = telcochurn_test$Churn, positive = "True")

5.1.1 Logistic Regression Model Evaluation

#AIC
f <- as_vector(summary(model_train_up)["aic"])
g <- as_vector(summary(model_train_down)["aic"])
h <- as_vector(summary(model_train_both)["aic"])
i <- as_vector(summary(model_train_rose)["aic"])
j <- as_vector(summary(model_train_smote)["aic"])

#Accuracy
k <- cm_up$overall["Accuracy"]
l <- cm_down$overall["Accuracy"]
m <- cm_both$overall["Accuracy"]
n <- cm_rose$overall["Accuracy"]
o <- cm_smote$overall["Accuracy"]

#Sensitivity
p <- cm_up$byClass["Sensitivity"]
q <- cm_down$byClass["Sensitivity"]
r <- cm_both$byClass["Sensitivity"]
s <- cm_rose$byClass["Sensitivity"]
t <- cm_smote$byClass["Sensitivity"]

Model <- c("Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE")
AIC <- c(f,g,h,i,j)
Accuracy <- c(k,l,m,n,o)
Sensitivity <- c(p,q,r,s,t)

data.frame(Model,AIC,Accuracy,Sensitivity)

5.2 Stepwise Regression

model_train_backward <- step(model_train_smote, direction = "backward", trace = F)
summary(model_train_backward)
#> 
#> Call:
#> glm(formula = Churn ~ Account.length + International.plan + Voice.mail.plan + 
#>     Total.day.charge + Total.eve.charge + Total.night.minutes + 
#>     Total.intl.calls + Total.intl.charge + Customer.service.calls, 
#>     family = "binomial", data = telcochurn_train_smote)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.5602  -0.8266   0.2653   0.8238   2.3967  
#> 
#> Coefficients:
#>                         Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)            -6.478280   0.616437 -10.509 < 0.0000000000000002 ***
#> Account.length          0.002672   0.001699   1.573               0.1157    
#> International.planYes   2.491231   0.193065  12.904 < 0.0000000000000002 ***
#> Voice.mail.planYes     -0.620511   0.149038  -4.163        0.00003134846 ***
#> Total.day.charge        0.072947   0.007237  10.079 < 0.0000000000000002 ***
#> Total.eve.charge        0.065697   0.015626   4.204        0.00002618385 ***
#> Total.night.minutes     0.002776   0.001390   1.998               0.0457 *  
#> Total.intl.calls       -0.116059   0.028389  -4.088        0.00004346854 ***
#> Total.intl.charge       0.523696   0.090163   5.808        0.00000000631 ***
#> Customer.service.calls  0.545336   0.048412  11.265 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1979.8  on 1434  degrees of freedom
#> Residual deviance: 1436.9  on 1425  degrees of freedom
#> AIC: 1456.9
#> 
#> Number of Fisher Scoring iterations: 5
sr_aic <- summary(model_train_backward)$aic
sr_aic
#> [1] 1456.917

5.2.1 Stepwise Regression Model Evaluation

telcochurn_test$prob_backward <- predict(model_train_backward, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_backward <- ifelse(telcochurn_test$prob_backward > 0.5, "True", "False")

telcochurn_test <- telcochurn_test %>% 
  mutate(prediction_backward = as.factor(prediction_backward))

sr <- confusionMatrix(data = telcochurn_test$prediction_backward, reference = telcochurn_test$Churn, positive = "True")

method <- c("Logistic Regression","Logistic Regression","Logistic Regression","Logistic Regression","Logistic Regression","Stepwise Regression")
model <- c("Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE","SMOTE")
aic <- c(f,g,h,i,j,sr_aic)
accuracy <- c(k,l,m,n,o,sr$overall["Accuracy"])
sensitivity <- c(p,q,r,s,t,sr$byClass["Sensitivity"])

data.frame(method,model,aic,accuracy,sensitivity)

5.3 KNN

Generate Train Data from previous trial

train_knn_up <- telcochurn_train_ups %>%
  select(-c(International.plan,Voice.mail.plan))

train_knn_down <- telcochurn_train_downs %>%
  select(-c(International.plan,Voice.mail.plan))

train_knn_both <- telcochurn_train_boths %>%
  select(-c(International.plan,Voice.mail.plan))

train_knn_rose <- telcochurn_train_rose %>%
  select(-c(International.plan,Voice.mail.plan))

train_knn_smote <- telcochurn_train_smote %>%
  select(-c(International.plan,Voice.mail.plan))

Scaling train data

#Up Sampling
train_knn_up_scaled <- train_knn_up %>%
  select_if(is.numeric) %>%
  scale()

#Down Sampling
train_knn_down_scaled <- train_knn_down %>%
  select_if(is.numeric) %>%
  scale()

#Both Sampling
train_knn_both_scaled <- train_knn_both %>%
  select_if(is.numeric) %>%
  scale()

#ROSE
train_knn_rose_scaled <- train_knn_rose %>%
  select_if(is.numeric) %>%
  scale()

#SMOTE
train_knn_smote_scaled <- train_knn_smote %>%
  select_if(is.numeric) %>%
  scale()

Generate Test Data

telcochurn_test_knn <- read.csv("churn-bigml-20.csv", stringsAsFactors = TRUE) %>%
  select(-one_of(dropcols)) %>%
  select(-c(International.plan,Voice.mail.plan))

Scaling Data Test for each tuning

#Up Sampling
telcochurn_test_knn_scaled_up <- telcochurn_test_knn %>%
  select_if(is.numeric) %>%
  scale(center = attr(train_knn_up_scaled, "scaled:center"),
        scale = attr(train_knn_up_scaled, "scaled:scale"))

#Down Sampling
telcochurn_test_knn_scaled_down <- telcochurn_test_knn %>% 
  select_if(is.numeric) %>%
  scale(center = attr(train_knn_down_scaled, "scaled:center"),
        scale = attr(train_knn_down_scaled, "scaled:scale"))

#Both Sampling
telcochurn_test_knn_scaled_both <- telcochurn_test_knn %>%
  select_if(is.numeric) %>%
  scale(center = attr(train_knn_both_scaled, "scaled:center"),
        scale = attr(train_knn_both_scaled, "scaled:scale"))

#ROSE
telcochurn_test_knn_scaled_rose <- telcochurn_test_knn %>% 
  select_if(is.numeric) %>%
  scale(center = attr(train_knn_rose_scaled, "scaled:center"),
        scale = attr(train_knn_rose_scaled, "scaled:scale"))

#SMOTE
telcochurn_test_knn_scaled_smote <- telcochurn_test_knn %>% 
  select_if(is.numeric) %>%
  scale(center = attr(train_knn_smote_scaled, "scaled:center"),
        scale = attr(train_knn_smote_scaled, "scaled:scale"))

Generate optimum K value

k_up <- sqrt(nrow(train_knn_up_scaled))
k_down <- sqrt(nrow(train_knn_down_scaled))
k_both <- sqrt(nrow(train_knn_both_scaled))
k_rose <- sqrt(nrow(train_knn_rose_scaled))
k_smote <- sqrt(nrow(train_knn_smote_scaled))

k_value <- c(k_up,k_down,k_both,k_rose,k_smote)

data.frame(Model,k_value)

In this case our target variable is binominal (even number) so for K value we must use odd number. To do so we can rounding our result and turns the number to odd.

5.3.1 KNN Model Evaluation

Create KNN Model

#Up Sampling
knn_telcochurn_up <- knn(train = train_knn_up_scaled, test = telcochurn_test_knn_scaled_up, cl = train_knn_up$Churn, k = 67)
u <- confusionMatrix(data = knn_telcochurn_up, reference = telcochurn_test_knn$Churn, positive = "True")

#Down Sampling
knn_telcochurn_down <- knn(train = train_knn_down_scaled, test = telcochurn_test_knn_scaled_down, cl = train_knn_down$Churn, k = 27)
v <- confusionMatrix(data = knn_telcochurn_down, reference = telcochurn_test_knn$Churn, positive = "True")

#Both Sampling
knn_telcochurn_both <- knn(train = train_knn_both_scaled, test = telcochurn_test_knn_scaled_both, cl = train_knn_both$Churn, k = 51)
w <- confusionMatrix(data = knn_telcochurn_both, reference = telcochurn_test_knn$Churn, positive = "True")

#ROSE
knn_telcochurn_rose <- knn(train = train_knn_rose_scaled, test = telcochurn_test_knn_scaled_rose, cl = train_knn_rose$Churn, k = 51)
x <- confusionMatrix(data = knn_telcochurn_rose, reference = telcochurn_test_knn$Churn, positive = "True")

#SMOTE
knn_telcochurn_smote <- knn(train = train_knn_smote_scaled, test = telcochurn_test_knn_scaled_smote, cl = train_knn_smote$Churn, k = 37)
y <- confusionMatrix(data = knn_telcochurn_smote, reference = telcochurn_test_knn$Churn, positive = "True")

#Comparison KNN Model
knn_accuracy = c(u$overall["Accuracy"],v$overall["Accuracy"],w$overall["Accuracy"],x$overall["Accuracy"],y$overall["Accuracy"])
knn_sensitivity = c(u$byClass["Sensitivity"],v$byClass["Sensitivity"],w$byClass["Sensitivity"],x$byClass["Sensitivity"],y$byClass["Sensitivity"])

data.frame(Model,knn_accuracy,knn_sensitivity)

5.3.2 Tuning SMOTE Model

Re-tunning SMOTE for KNN to increase accuracy

#re-tunning SMOTE
set.seed(309)
telcochurn_train_smote2_knn <- SMOTE(Churn~.,data = telcochurn_train, k = 51, perc.over = 200, perc.under = 180) %>%
  select(-c(International.plan,Voice.mail.plan))

#Scaling data train
train_knn_smote2_scaled <- telcochurn_train_smote2_knn %>%
  select_if(is.numeric) %>%
  scale()

#Scaling data test
telcochurn_test_knn_scaled_smote2 <- telcochurn_test_knn %>% 
  select_if(is.numeric) %>%
  scale(center = attr(train_knn_smote2_scaled, "scaled:center"),
        scale = attr(train_knn_smote2_scaled, "scaled:scale"))

#K-value
k_smote2 <- sqrt(nrow(train_knn_smote2_scaled))

#SMOTE New Model
knn_telcochurn_smote2 <- knn(train = train_knn_smote2_scaled, test = telcochurn_test_knn_scaled_smote2, cl = telcochurn_train_smote2_knn$Churn, k = 51)
z <- confusionMatrix(data = knn_telcochurn_smote2, reference = telcochurn_test_knn$Churn, positive = "True")

Comparing All KNN Model

Model_knn <- c("Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE", "SMOTE2")

knn_accuracy <- c(u$overall["Accuracy"],v$overall["Accuracy"],w$overall["Accuracy"],x$overall["Accuracy"],y$overall["Accuracy"],z$overall["Accuracy"])
knn_sensitivity <- c(u$byClass["Sensitivity"],v$byClass["Sensitivity"],w$byClass["Sensitivity"],x$byClass["Sensitivity"],y$byClass["Sensitivity"],z$byClass["Sensitivity"])

data.frame(Model_knn,knn_accuracy,knn_sensitivity)

6 Logistic Regression Vs Stepwise Regression Vs KNN

method_all <- c(method,"KNN","KNN","KNN","KNN","KNN","KNN")
model_all <- c(model,"Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE", "SMOTE2")
accuracy_all <- c(accuracy,knn_accuracy)
sensitivity_all <- c(sensitivity,knn_sensitivity)

data.frame(method_all,model_all,accuracy_all,sensitivity_all)

7 Conclusion

Here’s some conclusions we can take from this study case:

  • For this dataset, we try 4 method up sampling, down sampling, both sampling, ROSE, and SMOTE.
  • The chosen Matrix Evaluation for this study case is Recall / Sensitivity, because we want to focus on class positive. In this case, as we want to improve our services regarding to customer’s feedback, firstly we have to know which customer is going to stop using our services.
  • Model generated by KNN and SMOTE is the best model for this study case. It generate sensitivity of 0.8526.