In this case we use the Orange Telecom’s churn dataset, which consists of cleaned customer activity data (features), along with a churn label specifying whether a customer canceled the subscription, will be used to develop predictive models.
How to mitigate churn customer in Orange Telecom?
In this case we will comparing predictive model with Logistic Regression, Stepwise Regression, and KNN to find highest sensitivity (recall) with suitable accuracy & AIC.
Read Data
telcochurn_train <- read.csv("churn-bigml-80.csv", stringsAsFactors = TRUE) #80% of data
telcochurn_test <- read.csv("churn-bigml-20.csv", stringsAsFactors = TRUE) #20% of data
nrow(telcochurn_train)
#> [1] 2666
nrow(telcochurn_test)
#> [1] 667
Data Train
head(telcochurn_train)
Data Test
head(telcochurn_test)
Selection variable in both data
dropcols <- c("State","Area.code")
telcochurn_train <- telcochurn_train %>%
select(-one_of(dropcols))
telcochurn_test <- telcochurn_test %>%
select(-one_of(dropcols))
Focus on Data Train to create model
glimpse(telcochurn_train)
#> Rows: 2,666
#> Columns: 18
#> $ Account.length <int> 128, 107, 137, 84, 75, 118, 121, 147, 141, 74, …
#> $ International.plan <fct> No, No, No, Yes, Yes, Yes, No, Yes, Yes, No, No…
#> $ Voice.mail.plan <fct> Yes, Yes, No, No, No, No, Yes, No, Yes, No, No,…
#> $ Number.vmail.messages <int> 25, 26, 0, 0, 0, 0, 24, 0, 37, 0, 0, 0, 0, 27, …
#> $ Total.day.minutes <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2…
#> $ Total.day.calls <int> 110, 123, 114, 71, 113, 98, 88, 79, 84, 127, 96…
#> $ Total.day.charge <dbl> 45.07, 27.47, 41.38, 50.90, 28.34, 37.98, 37.09…
#> $ Total.eve.minutes <dbl> 197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5,…
#> $ Total.eve.calls <int> 99, 103, 110, 88, 122, 101, 108, 94, 111, 148, …
#> $ Total.eve.charge <dbl> 16.78, 16.62, 10.30, 5.26, 12.61, 18.75, 29.62,…
#> $ Total.night.minutes <dbl> 244.7, 254.4, 162.6, 196.9, 186.9, 203.9, 212.6…
#> $ Total.night.calls <int> 91, 103, 104, 89, 121, 118, 118, 96, 97, 94, 12…
#> $ Total.night.charge <dbl> 11.01, 11.45, 7.32, 8.86, 8.41, 9.18, 9.57, 9.5…
#> $ Total.intl.minutes <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 11.…
#> $ Total.intl.calls <int> 3, 3, 5, 7, 3, 6, 7, 6, 5, 5, 2, 5, 6, 4, 3, 5,…
#> $ Total.intl.charge <dbl> 2.70, 3.70, 3.29, 1.78, 2.73, 1.70, 2.03, 1.92,…
#> $ Customer.service.calls <int> 1, 1, 0, 2, 3, 0, 3, 0, 0, 0, 1, 3, 4, 1, 3, 1,…
#> $ Churn <fct> False, False, False, False, False, False, False…
STATE: 51 Unique States in United States of America Account Length. Length of The Account Area Code 415 relates to San Francisco,408 is of San Jose and 510 is of City of Okland International Plan Yes Indicate International Plan is Present and No Indicates no subscription for Internatinal Plan Voice Mail Plan Yes Indicates Voice Mail Plan is Present and No Indicates no subscription for Voice Mail Plan Number vmail messages Number of Voice Mail Messages ranging from 0 to 50 Total day minutes Total Number of Minutes Spent By Customers in Morning Total day calls Total Number of Calls made by Customer in Morning. Total day charge Total Charge to the Customers in Morning. Total eve minutesTotal Number of Minutes Spent By Customers in Evening Total eve calls Total Number of Calls made by Customer in Evening. Total eve charge Total Charge to the Customers in Morning. Total night minutes Total Number of Minutes Spent By Customers in the Night. Total night calls Total Number of Calls made by Customer in Night. Total night charge Total Charge to the Customers in Night.
Check NA
colSums(is.na(telcochurn_train))
#> Account.length International.plan Voice.mail.plan
#> 0 0 0
#> Number.vmail.messages Total.day.minutes Total.day.calls
#> 0 0 0
#> Total.day.charge Total.eve.minutes Total.eve.calls
#> 0 0 0
#> Total.eve.charge Total.night.minutes Total.night.calls
#> 0 0 0
#> Total.night.charge Total.intl.minutes Total.intl.calls
#> 0 0 0
#> Total.intl.charge Customer.service.calls Churn
#> 0 0 0
summary(telcochurn_train)
#> Account.length International.plan Voice.mail.plan Number.vmail.messages
#> Min. : 1.0 No :2396 No :1933 Min. : 0.000
#> 1st Qu.: 73.0 Yes: 270 Yes: 733 1st Qu.: 0.000
#> Median :100.0 Median : 0.000
#> Mean :100.6 Mean : 8.022
#> 3rd Qu.:127.0 3rd Qu.:19.000
#> Max. :243.0 Max. :50.000
#> Total.day.minutes Total.day.calls Total.day.charge Total.eve.minutes
#> Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
#> 1st Qu.:143.4 1st Qu.: 87.0 1st Qu.:24.38 1st Qu.:165.3
#> Median :179.9 Median :101.0 Median :30.59 Median :200.9
#> Mean :179.5 Mean :100.3 Mean :30.51 Mean :200.4
#> 3rd Qu.:215.9 3rd Qu.:114.0 3rd Qu.:36.70 3rd Qu.:235.1
#> Max. :350.8 Max. :160.0 Max. :59.64 Max. :363.7
#> Total.eve.calls Total.eve.charge Total.night.minutes Total.night.calls
#> Min. : 0 Min. : 0.00 Min. : 43.7 Min. : 33.0
#> 1st Qu.: 87 1st Qu.:14.05 1st Qu.:166.9 1st Qu.: 87.0
#> Median :100 Median :17.08 Median :201.2 Median :100.0
#> Mean :100 Mean :17.03 Mean :201.2 Mean :100.1
#> 3rd Qu.:114 3rd Qu.:19.98 3rd Qu.:236.5 3rd Qu.:113.0
#> Max. :170 Max. :30.91 Max. :395.0 Max. :166.0
#> Total.night.charge Total.intl.minutes Total.intl.calls Total.intl.charge
#> Min. : 1.970 Min. : 0.00 Min. : 0.000 Min. :0.000
#> 1st Qu.: 7.513 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
#> Median : 9.050 Median :10.20 Median : 4.000 Median :2.750
#> Mean : 9.053 Mean :10.24 Mean : 4.467 Mean :2.764
#> 3rd Qu.:10.640 3rd Qu.:12.10 3rd Qu.: 6.000 3rd Qu.:3.270
#> Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
#> Customer.service.calls Churn
#> Min. :0.000 False:2278
#> 1st Qu.:1.000 True : 388
#> Median :1.000
#> Mean :1.563
#> 3rd Qu.:2.000
#> Max. :9.000
Check proportion of target variable
table(telcochurn_train$Churn)
#>
#> False True
#> 2278 388
prop.table(table(telcochurn_train$Churn))
#>
#> False True
#> 0.8544636 0.1455364
From result above we know our proportion of Churn variable as target is unbalance. The next step is balancing the Churn variable with up sample, down sample, both sample, ROSE, and SMOTE.
#Up Sampling
set.seed(309)
telcochurn_train_ups <- upSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
a <- table(telcochurn_train_ups$Churn)
#Down Sampling
set.seed(309)
telcochurn_train_downs <- downSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
b <- table(telcochurn_train_downs$Churn)
#Both Sampling
telcochurn_train_boths <- ovun.sample(Churn~.,data = telcochurn_train, method= "both", p=0.51, N = nrow(telcochurn_train), seed = 309)$data
c <- table(telcochurn_train_boths$Churn)
#ROSE method
set.seed(309)
telcochurn_train_rose <- ROSE(Churn~.,data=telcochurn_train)$data
d <- table(telcochurn_train_rose$Churn)
#SMOTE method
set.seed(309)
telcochurn_train_smote <- SMOTE(Churn~.,data = telcochurn_train, k = 51, perc.over = 180, perc.under = 170)
e <- table(telcochurn_train_smote$Churn)
data.tuning <- data.frame(a,b,c,d,e)%>%
select(Var1,Freq,Freq.1,Freq.2,Freq.3,Freq.4) %>%
rename(Churn = Var1,
"Up Sample" = Freq,
"Down Sample" = Freq.1,
"Both Sample" = Freq.2,
ROSE = Freq.3,
SMOTE = Freq.4)
data.tuning
Generate Model in all data tuning
#Up Sampling
model_train_up <- glm(Churn ~. , data = telcochurn_train_ups, family = "binomial")
summary(model_train_up)
#>
#> Call:
#> glm(formula = Churn ~ ., family = "binomial", data = telcochurn_train_ups)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.98180 -0.83079 -0.04096 0.85446 2.66123
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -6.46532520 0.43999832 -14.694 < 0.0000000000000002
#> Account.length 0.00146982 0.00088772 1.656 0.097776
#> International.planYes 2.57904622 0.11511925 22.403 < 0.0000000000000002
#> Voice.mail.planYes -1.82799632 0.33615387 -5.438 0.0000000538895
#> Number.vmail.messages 0.03703686 0.01073208 3.451 0.000558
#> Total.day.minutes 0.50910547 2.12062620 0.240 0.810274
#> Total.day.calls 0.00279397 0.00174049 1.605 0.108433
#> Total.day.charge -2.92115053 12.47420834 -0.234 0.814849
#> Total.eve.minutes 2.22298947 1.06311894 2.091 0.036527
#> Total.eve.calls 0.00009287 0.00176469 0.053 0.958027
#> Total.eve.charge -26.08491734 12.50753240 -2.086 0.037021
#> Total.night.minutes 0.66213219 0.56059887 1.181 0.237557
#> Total.night.calls 0.00175552 0.00183449 0.957 0.338589
#> Total.night.charge -14.65135495 12.45742822 -1.176 0.239549
#> Total.intl.minutes -2.92485233 3.43091241 -0.852 0.393937
#> Total.intl.calls -0.09584360 0.01468474 -6.527 0.0000000000672
#> Total.intl.charge 11.12699356 12.70678413 0.876 0.381208
#> Customer.service.calls 0.56423034 0.02614708 21.579 < 0.0000000000000002
#>
#> (Intercept) ***
#> Account.length .
#> International.planYes ***
#> Voice.mail.planYes ***
#> Number.vmail.messages ***
#> Total.day.minutes
#> Total.day.calls
#> Total.day.charge
#> Total.eve.minutes *
#> Total.eve.calls
#> Total.eve.charge *
#> Total.night.minutes
#> Total.night.calls
#> Total.night.charge
#> Total.intl.minutes
#> Total.intl.calls ***
#> Total.intl.charge
#> Customer.service.calls ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 6316.0 on 4555 degrees of freedom
#> Residual deviance: 4701.4 on 4538 degrees of freedom
#> AIC: 4737.4
#>
#> Number of Fisher Scoring iterations: 5
Create all model
#Down Sampling
model_train_down <- glm(Churn ~. , data = telcochurn_train_downs, family = "binomial")
#Both Sampling
model_train_both <- glm(Churn ~. , data = telcochurn_train_boths, family = "binomial")
#ROSE
model_train_rose <- glm(Churn ~. , data = telcochurn_train_rose, family = "binomial")
#SMOTE
model_train_smote <- glm(Churn ~. , data = telcochurn_train_smote, family = "binomial")
Comparing AIC value
summary(model_train_up)$aic
#> [1] 4737.433
summary(model_train_down)$aic
#> [1] 836.3498
summary(model_train_both)$aic
#> [1] 2749.511
summary(model_train_rose)$aic
#> [1] 2951.251
summary(model_train_smote)$aic
#> [1] 1469.022
Down sample has lowest AIC value then other model, but we not conclude using down sample because ammount of data observation too small.
Execute the Model
#Up Sampling
telcochurn_test$prob_up <- predict(model_train_up, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_up <- ifelse(telcochurn_test$prob_up > 0.5, "True", "False")
telcochurn_test <- telcochurn_test %>%
mutate(prediction_up = as.factor(prediction_up))
cm_up <- confusionMatrix(data = telcochurn_test$prediction_up, reference = telcochurn_test$Churn, positive = "True")
cm_up
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction False True
#> False 449 24
#> True 123 71
#>
#> Accuracy : 0.7796
#> 95% CI : (0.7462, 0.8105)
#> No Information Rate : 0.8576
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3711
#>
#> Mcnemar's Test P-Value : 0.0000000000000006324
#>
#> Sensitivity : 0.7474
#> Specificity : 0.7850
#> Pos Pred Value : 0.3660
#> Neg Pred Value : 0.9493
#> Prevalence : 0.1424
#> Detection Rate : 0.1064
#> Detection Prevalence : 0.2909
#> Balanced Accuracy : 0.7662
#>
#> 'Positive' Class : True
#>
#Down Sampling
telcochurn_test$prob_down <- predict(model_train_down, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_down <- ifelse(telcochurn_test$prob_down > 0.5, "True", "False")
telcochurn_test <- telcochurn_test %>%
mutate(prediction_down = as.factor(prediction_down))
cm_down <- confusionMatrix(data = telcochurn_test$prediction_down, reference = telcochurn_test$Churn, positive = "True")
#Both Sampling
telcochurn_test$prob_both <- predict(model_train_both, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_both <- ifelse(telcochurn_test$prob_both > 0.5, "True", "False")
telcochurn_test <- telcochurn_test %>%
mutate(prediction_both = as.factor(prediction_both))
cm_both <- confusionMatrix(data = telcochurn_test$prediction_both, reference = telcochurn_test$Churn, positive = "True")
#ROSE
telcochurn_test$prob_rose <- predict(model_train_rose, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_rose <- ifelse(telcochurn_test$prob_rose > 0.5, "True", "False")
telcochurn_test <- telcochurn_test %>%
mutate(prediction_rose = as.factor(prediction_rose))
cm_rose <- confusionMatrix(data = telcochurn_test$prediction_rose, reference = telcochurn_test$Churn, positive = "True")
#SMOTE
telcochurn_test$prob_smote <- predict(model_train_smote, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_smote <- ifelse(telcochurn_test$prob_smote > 0.5, "True", "False")
telcochurn_test <- telcochurn_test %>%
mutate(prediction_smote = as.factor(prediction_smote))
cm_smote <- confusionMatrix(data = telcochurn_test$prediction_smote, reference = telcochurn_test$Churn, positive = "True")
#AIC
f <- as_vector(summary(model_train_up)["aic"])
g <- as_vector(summary(model_train_down)["aic"])
h <- as_vector(summary(model_train_both)["aic"])
i <- as_vector(summary(model_train_rose)["aic"])
j <- as_vector(summary(model_train_smote)["aic"])
#Accuracy
k <- cm_up$overall["Accuracy"]
l <- cm_down$overall["Accuracy"]
m <- cm_both$overall["Accuracy"]
n <- cm_rose$overall["Accuracy"]
o <- cm_smote$overall["Accuracy"]
#Sensitivity
p <- cm_up$byClass["Sensitivity"]
q <- cm_down$byClass["Sensitivity"]
r <- cm_both$byClass["Sensitivity"]
s <- cm_rose$byClass["Sensitivity"]
t <- cm_smote$byClass["Sensitivity"]
Model <- c("Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE")
AIC <- c(f,g,h,i,j)
Accuracy <- c(k,l,m,n,o)
Sensitivity <- c(p,q,r,s,t)
data.frame(Model,AIC,Accuracy,Sensitivity)
model_train_backward <- step(model_train_smote, direction = "backward", trace = F)
summary(model_train_backward)
#>
#> Call:
#> glm(formula = Churn ~ Account.length + International.plan + Voice.mail.plan +
#> Total.day.charge + Total.eve.charge + Total.night.minutes +
#> Total.intl.calls + Total.intl.charge + Customer.service.calls,
#> family = "binomial", data = telcochurn_train_smote)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.5602 -0.8266 0.2653 0.8238 2.3967
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -6.478280 0.616437 -10.509 < 0.0000000000000002 ***
#> Account.length 0.002672 0.001699 1.573 0.1157
#> International.planYes 2.491231 0.193065 12.904 < 0.0000000000000002 ***
#> Voice.mail.planYes -0.620511 0.149038 -4.163 0.00003134846 ***
#> Total.day.charge 0.072947 0.007237 10.079 < 0.0000000000000002 ***
#> Total.eve.charge 0.065697 0.015626 4.204 0.00002618385 ***
#> Total.night.minutes 0.002776 0.001390 1.998 0.0457 *
#> Total.intl.calls -0.116059 0.028389 -4.088 0.00004346854 ***
#> Total.intl.charge 0.523696 0.090163 5.808 0.00000000631 ***
#> Customer.service.calls 0.545336 0.048412 11.265 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1979.8 on 1434 degrees of freedom
#> Residual deviance: 1436.9 on 1425 degrees of freedom
#> AIC: 1456.9
#>
#> Number of Fisher Scoring iterations: 5
sr_aic <- summary(model_train_backward)$aic
sr_aic
#> [1] 1456.917
telcochurn_test$prob_backward <- predict(model_train_backward, newdata = telcochurn_test, type = "response")
telcochurn_test$prediction_backward <- ifelse(telcochurn_test$prob_backward > 0.5, "True", "False")
telcochurn_test <- telcochurn_test %>%
mutate(prediction_backward = as.factor(prediction_backward))
sr <- confusionMatrix(data = telcochurn_test$prediction_backward, reference = telcochurn_test$Churn, positive = "True")
method <- c("Logistic Regression","Logistic Regression","Logistic Regression","Logistic Regression","Logistic Regression","Stepwise Regression")
model <- c("Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE","SMOTE")
aic <- c(f,g,h,i,j,sr_aic)
accuracy <- c(k,l,m,n,o,sr$overall["Accuracy"])
sensitivity <- c(p,q,r,s,t,sr$byClass["Sensitivity"])
data.frame(method,model,aic,accuracy,sensitivity)
Generate Train Data from previous trial
train_knn_up <- telcochurn_train_ups %>%
select(-c(International.plan,Voice.mail.plan))
train_knn_down <- telcochurn_train_downs %>%
select(-c(International.plan,Voice.mail.plan))
train_knn_both <- telcochurn_train_boths %>%
select(-c(International.plan,Voice.mail.plan))
train_knn_rose <- telcochurn_train_rose %>%
select(-c(International.plan,Voice.mail.plan))
train_knn_smote <- telcochurn_train_smote %>%
select(-c(International.plan,Voice.mail.plan))
Scaling train data
#Up Sampling
train_knn_up_scaled <- train_knn_up %>%
select_if(is.numeric) %>%
scale()
#Down Sampling
train_knn_down_scaled <- train_knn_down %>%
select_if(is.numeric) %>%
scale()
#Both Sampling
train_knn_both_scaled <- train_knn_both %>%
select_if(is.numeric) %>%
scale()
#ROSE
train_knn_rose_scaled <- train_knn_rose %>%
select_if(is.numeric) %>%
scale()
#SMOTE
train_knn_smote_scaled <- train_knn_smote %>%
select_if(is.numeric) %>%
scale()
Generate Test Data
telcochurn_test_knn <- read.csv("churn-bigml-20.csv", stringsAsFactors = TRUE) %>%
select(-one_of(dropcols)) %>%
select(-c(International.plan,Voice.mail.plan))
Scaling Data Test for each tuning
#Up Sampling
telcochurn_test_knn_scaled_up <- telcochurn_test_knn %>%
select_if(is.numeric) %>%
scale(center = attr(train_knn_up_scaled, "scaled:center"),
scale = attr(train_knn_up_scaled, "scaled:scale"))
#Down Sampling
telcochurn_test_knn_scaled_down <- telcochurn_test_knn %>%
select_if(is.numeric) %>%
scale(center = attr(train_knn_down_scaled, "scaled:center"),
scale = attr(train_knn_down_scaled, "scaled:scale"))
#Both Sampling
telcochurn_test_knn_scaled_both <- telcochurn_test_knn %>%
select_if(is.numeric) %>%
scale(center = attr(train_knn_both_scaled, "scaled:center"),
scale = attr(train_knn_both_scaled, "scaled:scale"))
#ROSE
telcochurn_test_knn_scaled_rose <- telcochurn_test_knn %>%
select_if(is.numeric) %>%
scale(center = attr(train_knn_rose_scaled, "scaled:center"),
scale = attr(train_knn_rose_scaled, "scaled:scale"))
#SMOTE
telcochurn_test_knn_scaled_smote <- telcochurn_test_knn %>%
select_if(is.numeric) %>%
scale(center = attr(train_knn_smote_scaled, "scaled:center"),
scale = attr(train_knn_smote_scaled, "scaled:scale"))
Generate optimum K value
k_up <- sqrt(nrow(train_knn_up_scaled))
k_down <- sqrt(nrow(train_knn_down_scaled))
k_both <- sqrt(nrow(train_knn_both_scaled))
k_rose <- sqrt(nrow(train_knn_rose_scaled))
k_smote <- sqrt(nrow(train_knn_smote_scaled))
k_value <- c(k_up,k_down,k_both,k_rose,k_smote)
data.frame(Model,k_value)
In this case our target variable is binominal (even number) so for K value we must use odd number. To do so we can rounding our result and turns the number to odd.
Create KNN Model
#Up Sampling
knn_telcochurn_up <- knn(train = train_knn_up_scaled, test = telcochurn_test_knn_scaled_up, cl = train_knn_up$Churn, k = 67)
u <- confusionMatrix(data = knn_telcochurn_up, reference = telcochurn_test_knn$Churn, positive = "True")
#Down Sampling
knn_telcochurn_down <- knn(train = train_knn_down_scaled, test = telcochurn_test_knn_scaled_down, cl = train_knn_down$Churn, k = 27)
v <- confusionMatrix(data = knn_telcochurn_down, reference = telcochurn_test_knn$Churn, positive = "True")
#Both Sampling
knn_telcochurn_both <- knn(train = train_knn_both_scaled, test = telcochurn_test_knn_scaled_both, cl = train_knn_both$Churn, k = 51)
w <- confusionMatrix(data = knn_telcochurn_both, reference = telcochurn_test_knn$Churn, positive = "True")
#ROSE
knn_telcochurn_rose <- knn(train = train_knn_rose_scaled, test = telcochurn_test_knn_scaled_rose, cl = train_knn_rose$Churn, k = 51)
x <- confusionMatrix(data = knn_telcochurn_rose, reference = telcochurn_test_knn$Churn, positive = "True")
#SMOTE
knn_telcochurn_smote <- knn(train = train_knn_smote_scaled, test = telcochurn_test_knn_scaled_smote, cl = train_knn_smote$Churn, k = 37)
y <- confusionMatrix(data = knn_telcochurn_smote, reference = telcochurn_test_knn$Churn, positive = "True")
#Comparison KNN Model
knn_accuracy = c(u$overall["Accuracy"],v$overall["Accuracy"],w$overall["Accuracy"],x$overall["Accuracy"],y$overall["Accuracy"])
knn_sensitivity = c(u$byClass["Sensitivity"],v$byClass["Sensitivity"],w$byClass["Sensitivity"],x$byClass["Sensitivity"],y$byClass["Sensitivity"])
data.frame(Model,knn_accuracy,knn_sensitivity)
Re-tunning SMOTE for KNN to increase accuracy
#re-tunning SMOTE
set.seed(309)
telcochurn_train_smote2_knn <- SMOTE(Churn~.,data = telcochurn_train, k = 51, perc.over = 200, perc.under = 180) %>%
select(-c(International.plan,Voice.mail.plan))
#Scaling data train
train_knn_smote2_scaled <- telcochurn_train_smote2_knn %>%
select_if(is.numeric) %>%
scale()
#Scaling data test
telcochurn_test_knn_scaled_smote2 <- telcochurn_test_knn %>%
select_if(is.numeric) %>%
scale(center = attr(train_knn_smote2_scaled, "scaled:center"),
scale = attr(train_knn_smote2_scaled, "scaled:scale"))
#K-value
k_smote2 <- sqrt(nrow(train_knn_smote2_scaled))
#SMOTE New Model
knn_telcochurn_smote2 <- knn(train = train_knn_smote2_scaled, test = telcochurn_test_knn_scaled_smote2, cl = telcochurn_train_smote2_knn$Churn, k = 51)
z <- confusionMatrix(data = knn_telcochurn_smote2, reference = telcochurn_test_knn$Churn, positive = "True")
Comparing All KNN Model
Model_knn <- c("Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE", "SMOTE2")
knn_accuracy <- c(u$overall["Accuracy"],v$overall["Accuracy"],w$overall["Accuracy"],x$overall["Accuracy"],y$overall["Accuracy"],z$overall["Accuracy"])
knn_sensitivity <- c(u$byClass["Sensitivity"],v$byClass["Sensitivity"],w$byClass["Sensitivity"],x$byClass["Sensitivity"],y$byClass["Sensitivity"],z$byClass["Sensitivity"])
data.frame(Model_knn,knn_accuracy,knn_sensitivity)
method_all <- c(method,"KNN","KNN","KNN","KNN","KNN","KNN")
model_all <- c(model,"Up Sampling","Down Sampling","Both Sampling","ROSE","SMOTE", "SMOTE2")
accuracy_all <- c(accuracy,knn_accuracy)
sensitivity_all <- c(sensitivity,knn_sensitivity)
data.frame(method_all,model_all,accuracy_all,sensitivity_all)
Here’s some conclusions we can take from this study case:
0.8526.