1 Introduction

In this Learning by Building task, I would like to make model from data wholesale.csv for predicting variabel Channel using Logistic Regression and KNN

    1. Horeca: Short for Hotel, Restaurant and Cafe
    1. Retail: Retail industry
library(tidyverse)
library(gtools)
library(caret)

theme_set(theme_minimal() +
            theme(legend.position = "top"))

options(scipen = 999)

2 Data Input

wholesale <- read.csv("data_input/wholesale.csv") %>% 
  select(-Region) %>% 
  mutate(Channel = ifelse(Channel == 1, 0, 1))%>% 
  mutate(Channel=as.factor(Channel))

wholesale

2.1 Check Null Value

No Null value in the dataset

colSums(is.na(wholesale))
##          Channel            Fresh             Milk          Grocery 
##                0                0                0                0 
##           Frozen Detergents_Paper       Delicassen 
##                0                0                0

3 Logistic Regression

3.1 Cross-Validation

3.2 Check Class Imbalance

Check if the data class imbalance or not by looking at proportion in imbalance class target variable

# Your code here
prop.table(table(data_train$Channel))
## 
##         0         1 
## 0.6846591 0.3153409

3.3 Handling Class Imbalance

We can see that data train is imbalance proportion. This will influence performance model because model will tend to predict majority class or class with dominant proportion.

There are some ways to handle Class Imbalance:

  • Upsampling: increasing minority class
  • Downsampling: decreasing majority class
  • Make synthetic data

In this case, I would like to do upsample to handle class imbalance

library(caret)
# Your code here
set.seed(123)
data_train <- upSample(x = data_train %>% select(-Channel),
                          y = data_train$Channel,
                          list = F,
                          yname = "Channel")

table(data_train$Channel)
## 
##   0   1 
## 241 241

3.4 Model Fitting

Make model regression logistic with Channel as target variable.

summary(data_train)
##      Fresh             Milk          Grocery          Frozen       
##  Min.   :     3   Min.   :   55   Min.   :  137   Min.   :   25.0  
##  1st Qu.:  3044   1st Qu.: 2100   1st Qu.: 2856   1st Qu.:  780.2  
##  Median :  8090   Median : 4980   Median : 7408   Median : 1447.0  
##  Mean   : 11698   Mean   : 7339   Mean   :10358   Mean   : 2871.3  
##  3rd Qu.: 16800   3rd Qu.: 9246   3rd Qu.:13164   3rd Qu.: 3232.0  
##  Max.   :112151   Max.   :73498   Max.   :92780   Max.   :60869.0  
##  Detergents_Paper    Delicassen      Channel
##  Min.   :    3.0   Min.   :    3.0   0:241  
##  1st Qu.:  386.2   1st Qu.:  430.2   1:241  
##  Median : 2171.0   Median : 1037.0          
##  Mean   : 4192.3   Mean   : 1567.8          
##  3rd Qu.: 6171.2   3rd Qu.: 2063.5          
##  Max.   :40827.0   Max.   :47943.0
# Your code here
model_channel <- glm(Channel ~ ., data_train, family = "binomial") 

# menggunakan stepwise mencari AIC terendah 
model_step <- step(model_channel, direction = "both", trace = 0)
summary(model_step) 
## 
## Call:
## glm(formula = Channel ~ Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen, family = "binomial", data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2837  -0.3124  -0.0006   0.1388   3.3881  
## 
## Coefficients:
##                     Estimate  Std. Error z value            Pr(>|z|)    
## (Intercept)      -3.14238368  0.39834696  -7.889 0.00000000000000306 ***
## Milk              0.00011687  0.00005440   2.148              0.0317 *  
## Grocery           0.00010868  0.00005899   1.842              0.0654 .  
## Frozen           -0.00022007  0.00009250  -2.379              0.0173 *  
## Detergents_Paper  0.00094407  0.00013541   6.972 0.00000000000312735 ***
## Delicassen       -0.00022092  0.00012692  -1.741              0.0818 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 668.19  on 481  degrees of freedom
## Residual deviance: 212.61  on 476  degrees of freedom
## AIC: 224.61
## 
## Number of Fisher Scoring iterations: 8

3.5 Model Prediction and Evaluation

Predict data test by using regression logistic model. Change prediction in probability into category, with terms if probability > 0.5 is change to be category 1 which means Retail.

3.5.1 Predict Data Train

# Predict data train
pred_train <- predict(model_step, data_train, type = "response")
pred_class_train <- ifelse(pred_train > 0.5, "Retail", "Horeca") %>% 
  as.factor() 
data_train$Channel <- as.factor(ifelse(data_train$Channel ==1 , "Retail", "Horeca"))

confusionMatrix(pred_class_train, data_train$Channel, positive = "Retail")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Horeca Retail
##     Horeca    224     22
##     Retail     17    219
##                                              
##                Accuracy : 0.9191             
##                  95% CI : (0.891, 0.9418)    
##     No Information Rate : 0.5                
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.8382             
##                                              
##  Mcnemar's Test P-Value : 0.5218             
##                                              
##             Sensitivity : 0.9087             
##             Specificity : 0.9295             
##          Pos Pred Value : 0.9280             
##          Neg Pred Value : 0.9106             
##              Prevalence : 0.5000             
##          Detection Rate : 0.4544             
##    Detection Prevalence : 0.4896             
##       Balanced Accuracy : 0.9191             
##                                              
##        'Positive' Class : Retail             
## 
data_train$predtrain <- data.frame(pred_train)
ggplot(data_train, aes(x=pred_train)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data in Data Train") +
  theme_minimal()


From graph above, can be interpreted that prediction data train of Retail has highest results than Horeca

3.5.2 Predict Data Test

# Your code here
pred_test <- predict(model_step, data_test, type = "response")
pred_class_test <- ifelse(pred_test > 0.5, "Retail", "Horeca") %>% as.factor() 
data_test$Channel <- as.factor(ifelse(data_test$Channel ==1 , "Retail", "Horeca"))
log_conf <- confusionMatrix(pred_class_test, data_test$Channel, positive = "Retail")
log_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Horeca Retail
##     Horeca     54      6
##     Retail      3     25
##                                           
##                Accuracy : 0.8977          
##                  95% CI : (0.8147, 0.9522)
##     No Information Rate : 0.6477          
##     P-Value [Acc > NIR] : 0.00000007518   
##                                           
##                   Kappa : 0.7708          
##                                           
##  Mcnemar's Test P-Value : 0.505           
##                                           
##             Sensitivity : 0.8065          
##             Specificity : 0.9474          
##          Pos Pred Value : 0.8929          
##          Neg Pred Value : 0.9000          
##              Prevalence : 0.3523          
##          Detection Rate : 0.2841          
##    Detection Prevalence : 0.3182          
##       Balanced Accuracy : 0.8769          
##                                           
##        'Positive' Class : Retail          
## 
data_test$predtest <- data.frame(pred_test)
ggplot(data_test, aes(x=pred_test)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data in Data Test") +
  theme_minimal()


From graph above, can be interpreted that prediction data test of Horeca has highest results than Retail

3.6 Result of Logistic Regression Model Confusion Matrix

For predicting Channel, Accuracy of Data train is 0.91 and accuracy of data test is 0.89 which is overall balance with high accuracy.

4 K-Nearest Neighbour

4.1 Cross-Validation

RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(wholesale)
index <- sample(row_data, row_data*0.8)

data_train <- wholesale[index, ]
data_test <- wholesale[-index, ] 

4.2 EDA

library(GGally)
ggcorr(data_train, label=T)

plot(as.factor(data_train$Channel), data_train$Fresh)

plot(as.factor(data_train$Channel), data_train$Milk)

plot(as.factor(data_train$Channel), data_train$Grocery)

plot(as.factor(data_train$Channel), data_train$Frozen)

plot(as.factor(data_train$Channel), data_train$Detergents_Paper)


From boxplot above, we can see that Retail tends to buy less Fresh and Frozen than Horeca, and Retail tends to buy more Milk, Grocery, and Detergents_Paper than Horeca. Grocery has high positive correlation with Detergents Paper, which means when the market buy high Grocery it means they tends to also buy Detergents Paper and vice versa. Milk also has high positive correlation with Grocery and Detergents Paper, which means when market buy high Milk, it means they tends to also buy high Grocery and high Detergents Paper and vice versa.

4.3 Data Preprocessing

train_x <- data_train %>% 
select(-Channel) %>% # buang target variable
scale() # lakukan scalling ke semua prediktor


# Menyimpan target variabel
train_y <- as.factor(data_train$Channel)
test_x <- data_test %>% 
select(-Channel) %>% #Buang target variable
scale(center = attr(train_x,"scaled:center"), # menunjukkan rata-rata dari setiap variabel
scale = attr(train_x, "scaled:scale") # menunjukkan standar deviasi dari setiap variabel
) # lakukan scalling dengan informasi dari data train

# Menyimpan variabel target
test_y <- as.factor(data_test$Channel)

4.4 Model Fitting

4.4.1 Make Model KNN

model_knn <- knn3(x = train_x, 
                  y = train_y, 
                  k = sqrt(nrow(train_x))
                  )

4.4.2 Predict Data Train

pred_knn_train <- predict(model_knn,train_x, type = "class")
head(pred_knn_train)
## [1] 0 1 0 0 0 0
## Levels: 0 1
confusionMatrix(pred_knn_train, train_y, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 225  12
##          1  16  99
##                                              
##                Accuracy : 0.9205             
##                  95% CI : (0.8871, 0.9465)   
##     No Information Rate : 0.6847             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.8176             
##                                              
##  Mcnemar's Test P-Value : 0.5708             
##                                              
##             Sensitivity : 0.8919             
##             Specificity : 0.9336             
##          Pos Pred Value : 0.8609             
##          Neg Pred Value : 0.9494             
##              Prevalence : 0.3153             
##          Detection Rate : 0.2812             
##    Detection Prevalence : 0.3267             
##       Balanced Accuracy : 0.9128             
##                                              
##        'Positive' Class : 1                  
## 

4.4.3 Predict Data Test

pred_knn <- predict(model_knn, test_x, type = "class")

or same result different code

pred_knn <- knn3Train(train = train_x, 
                      test = test_x, 
                      cl = train_y, 
                      k = sqrt(nrow(train_x)) %>% round()
                      ) %>% 
  as.factor()

head(pred_knn)
## [1] 0 0 0 0 1 1
## Levels: 0 1
pred_knn_conf <- confusionMatrix(pred_knn, test_y,positive = "1")
pred_knn_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 54  9
##          1  3 22
##                                           
##                Accuracy : 0.8636          
##                  95% CI : (0.7739, 0.9275)
##     No Information Rate : 0.6477          
##     P-Value [Acc > NIR] : 0.000004804     
##                                           
##                   Kappa : 0.6874          
##                                           
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.7097          
##             Specificity : 0.9474          
##          Pos Pred Value : 0.8800          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.3523          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.2841          
##       Balanced Accuracy : 0.8285          
##                                           
##        'Positive' Class : 1               
## 

4.5 Result of KNN Model Confusion Matrix

For predicting Channel, Accuracy of Data train is 0.92 and accuracy of data test is 0.86 which is overall balance with high accuracy.

5 Model Evaluation Logistic Regression and K-Nearest Neighbour

  • Results of Logistic Regression:
library(tibble)
eval_logit <- data_frame(Accuracy = log_conf$overall[1],
           Recall = log_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = log_conf$byClass[3])

eval_logit
  • Results of KNN:
eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
           Recall = pred_knn_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = pred_knn_conf$byClass[3])
eval_knn

If we can see from both method, using Logistic Regression and K-NN, ability of the model for predicting Retail or Horeca is both has good performance of prediction and quite similar high value of Accuracy, Specificity, and Precision. But highest recall in Logistic Regression model.

6 Conclusion

If I am an enterpreneur I would like to targeting both Horeca or Retail, both of them are my market targets, and I want my model in predicting Channel has a high accuracy. From the model that has been made in previous time, we can see that both model Logistic Regression and KNN has good performance with high accuracy, so both model is good for predicting Channel. But if we are prioritizing to have a good prediction in predicting Retail only, we can focusing on results of Recall/Sensitivity, then we can use Logistic Regression for predicting Channel.