Logistic Regression and KNN Model to Predict Wholesale Data

Gaos Tipki Alpandi | @gaostipkialpandi

2022-06-03

Data Explanation

Objective

The data used in this model consist of variables related to annual spending in monetary units on diverse product categories. This dataset refers to clients of a wholesale distributor. There are 440 rows with 6 predictor variables and 1 response variable. The source of this dataset comes from archive.ics.uci.edu.

Why?

The purpose of logistic regression modeling and K-Nearest Neighbor (KNN) on this data is to determine whether the customers on the channel are from Horeca (industry in the food and beverage sector) or not. The results of this modeling can be used as a reference in predicting the type of channel based on the existing variables.

Data Preparation

wholesale <- read.csv("wholesale_kaggle.csv")

Exploratory Data Analysis

Checking the type of each variable

str(wholesale)
## 'data.frame':    440 obs. of  8 variables:
##  $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
##  $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
##  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
##  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
##  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
##  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
##  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...

Changing The Data Type

library(dplyr)
wholesale_clean <- wholesale %>% 
  select(-Region) %>% #We get Region variabel out of from the dataset
  mutate(Channel=as.factor(Channel))

rmarkdown::paged_table(wholesale_clean)

Summary of The Data

summary(wholesale_clean)
##  Channel     Fresh             Milk          Grocery          Frozen       
##  1:298   Min.   :     3   Min.   :   55   Min.   :    3   Min.   :   25.0  
##  2:142   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153   1st Qu.:  742.2  
##          Median :  8504   Median : 3627   Median : 4756   Median : 1526.0  
##          Mean   : 12000   Mean   : 5796   Mean   : 7951   Mean   : 3071.9  
##          3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656   3rd Qu.: 3554.2  
##          Max.   :112151   Max.   :73498   Max.   :92780   Max.   :60869.0  
##  Detergents_Paper    Delicassen     
##  Min.   :    3.0   Min.   :    3.0  
##  1st Qu.:  256.8   1st Qu.:  408.2  
##  Median :  816.5   Median :  965.5  
##  Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :40827.0   Max.   :47943.0

Checking the Existence NaN Data

colSums(is.na(wholesale_clean))
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0

Result: There is no NaN data.

Logistic Regression Preparation

Checking The Balance of Data Proportion

prop.table(table(wholesale_clean$Channel))
## 
##         1         2 
## 0.6772727 0.3227273

Result: The proportion 0.67 and 0.33 is quite balanced to be used in modelling.

Splitting Train-Test Data

We can determine the proportion between data train and data test whatever it is. But, most of the researchers make a bigger value in data train proportion than data test. Here, I used proportion 0.8 for data train and the rest for data test.

set.seed(17)
index <- sample(x=nrow(wholesale_clean),
                size = nrow(wholesale_clean)*0.8)
whs_train <- wholesale_clean[index,]
whs_test <- wholesale_clean[-index,]

Logistic Regression Model

model_all <- glm(formula = Channel~., data = whs_train, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_all)
## 
## Call:
## glm(formula = Channel ~ ., family = binomial, data = whs_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1914  -0.2779  -0.1555   0.0295   3.3453  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.087e+00  9.101e-01  -5.589 2.29e-08 ***
## Region2           2.023e+00  1.061e+00   1.907   0.0565 .  
## Region3           1.300e+00  7.434e-01   1.749   0.0803 .  
## Fresh             9.944e-06  2.137e-05   0.465   0.6417    
## Milk              1.289e-04  6.984e-05   1.846   0.0649 .  
## Grocery           6.131e-05  6.770e-05   0.906   0.3651    
## Frozen           -2.760e-04  1.332e-04  -2.072   0.0382 *  
## Detergents_Paper  1.086e-03  1.786e-04   6.083 1.18e-09 ***
## Delicassen       -4.384e-05  1.201e-04  -0.365   0.7151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 444.80  on 351  degrees of freedom
## Residual deviance: 140.15  on 343  degrees of freedom
## AIC: 158.15
## 
## Number of Fisher Scoring iterations: 7

Result: We can see that there are several insignificant predictors such as Milk, Fresh, Grocery, Frozen, and Delicassen which those have the p-value greater than alpha 0.05. So, we can eliminate insignificant variables using stepwise model.

StepWise “Both”

model_step <- step(object = model_all,
                   direction = "both",
                   trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_step)
## 
## Call:
## glm(formula = Channel ~ Region + Milk + Frozen + Detergents_Paper, 
##     family = binomial, data = whs_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1976  -0.2825  -0.1577   0.0263   3.3124  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -4.947e+00  8.947e-01  -5.529 3.22e-08 ***
## Region2           2.101e+00  1.069e+00   1.965  0.04941 *  
## Region3           1.322e+00  7.433e-01   1.779  0.07520 .  
## Milk              1.630e-04  5.935e-05   2.747  0.00602 ** 
## Frozen           -2.743e-04  1.163e-04  -2.358  0.01836 *  
## Detergents_Paper  1.156e-03  1.536e-04   7.524 5.32e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 444.80  on 351  degrees of freedom
## Residual deviance: 141.13  on 346  degrees of freedom
## AIC: 153.13
## 
## Number of Fisher Scoring iterations: 7

Result: We can observe that there are three predictor variables that are left on the model (Grocery, Frozen, and Detergent Paper).

Predicting by Logistic Regression Model (After Stepwise)

whs_test$prediction <- predict(object = model_step,
                                      newdata = whs_test,
                                      type = "response")

Based on the dataset, we determine the probability that is greater than 0.5 is categorized as 2 (Non Horeca) and otherwise as 1 (Horeca).

whs_test$predict_label <- ifelse(test = whs_test$prediction>0.5,
                                 yes = "2",
                                 no="1")
head(whs_test)
##    Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 2        2      3  7057 9810    9568   1762             3293       1776
## 15       2      3 24653 9465   12091    294             5058       2168
## 17       2      3  1020 8816   12121    134             4508       1080
## 33       1      3 21632 1318    2886    266              918        405
## 37       1      3 29955 4362    5428   1729              862       4626
## 40       1      3 56159  555     902  10002              212       2916
##     prediction predict_label
## 2  0.785296341             2
## 15 0.975469292             2
## 17 0.951913401             2
## 33 0.081525194             1
## 37 0.083813194             1
## 40 0.002393493             1

Changing the Predict Label to Factor Data

whs_test$predict_label <- as.factor(whs_test$predict_label)

Evaluating Logistic Regression Model

library(caret)
confusionMatrix(data = whs_test$predict_label,
                reference = whs_test$Channel,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 55  3
##          2  6 24
##                                           
##                Accuracy : 0.8977          
##                  95% CI : (0.8147, 0.9522)
##     No Information Rate : 0.6932          
##     P-Value [Acc > NIR] : 4.871e-06       
##                                           
##                   Kappa : 0.7668          
##                                           
##  Mcnemar's Test P-Value : 0.505           
##                                           
##             Sensitivity : 0.9016          
##             Specificity : 0.8889          
##          Pos Pred Value : 0.9483          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.6932          
##          Detection Rate : 0.6250          
##    Detection Prevalence : 0.6591          
##       Balanced Accuracy : 0.8953          
##                                           
##        'Positive' Class : 1               
## 

Result: We get the accuracy 89.77%, recall 90.32%, and precision 94.92%.

K-Nearest Neighbor Preparation

whs_knn_train <- wholesale_clean[index,]
whs_knn_test <- wholesale_clean[-index,]

Checking The Balance of Data Train

prop.table(table(whs_knn_train$Channel))
## 
##         1         2 
## 0.6732955 0.3267045

Defining Some Variables Predictor-Target for Train-Test Data

library(dplyr)
#variabel prediktor pada whs train
whs_train_pred <- whs_knn_train %>% 
  select(-Channel)

#variabel target pada whs train
whs_train_targ <- whs_knn_train %>% 
  pull(Channel)

#variabel prediktor pada whs test
whs_test_pred <- whs_knn_test %>% 
  select(-Channel)

#variabel target pada whs test
whs_test_targ <- whs_knn_test %>% 
  pull(Channel)

Scaling Data

whs_train_pred_scale <-whs_train_pred %>% 
  scale()

whs_test_pred_scale <-whs_test_pred %>% 
  scale(center = attr(whs_train_pred_scale, "scaled:center"),
        scale = attr(whs_train_pred_scale, "scaled:scale"))

Looking for Optimum K Value

sqrt(nrow(whs_train_pred))
## [1] 18.76166

Result: The optimum K value is 18.76 or equal to 19.

K-Nearest Neighbor Model

library(class)
model_knn <- knn(train =whs_train_pred_scale,
                 test = whs_test_pred_scale,
                 cl =whs_train_targ,
                 k=19)

Evaluating KNN Model

confusionMatrix(data = model_knn,
                reference = whs_test_targ,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 54  4
##          2  7 23
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.7873, 0.9359)
##     No Information Rate : 0.6932          
##     P-Value [Acc > NIR] : 5.812e-05       
##                                           
##                   Kappa : 0.715           
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.8852          
##             Specificity : 0.8519          
##          Pos Pred Value : 0.9310          
##          Neg Pred Value : 0.7667          
##              Prevalence : 0.6932          
##          Detection Rate : 0.6136          
##    Detection Prevalence : 0.6591          
##       Balanced Accuracy : 0.8685          
##                                           
##        'Positive' Class : 1               
## 

Result: We can observe that the value of accuracy, recall, and precision are 93.18%, 93.55%, and 96.67% respectively.

Conclusion

Based on the two model we analyzed before using logistic regression and KNN we get that KNN is the best model based on all the three value of confusion matrix indicators. So, the wholesale distributor better to use KNN model in predicting whether the channel is Horeca or not. Though, in this case, there is no loss will be happened if the model is wrong in classifying the response variables. However, if the response variable refers to an action that will be carried out on a subject and has a risk, then we have to choose the model properly by considering all the risks.