Predicting Wholesaler Customers in wholesale.csv Using a Regression Model and a KNN Model

Adli Rikanda Saputra

2022-11-21

Objective

In this project, I will predict the customer type from a wholesaler dataset. The customers fall into two classes: Horeca (Hotel, Restaurant, Cafe) and Retail. My job is to predict whether a customer fits better as Horeca or Retail. I will use two models that estimate the probability of each category: a logistic regression model and a KNN model.

Installing Library

To process the data, I first need to load the necessary libraries.

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)   # knn()
library(tidyr)
library(MASS)    # stepAIC(); note: masks dplyr::select()
library(caret)   # confusionMatrix() and splitting helpers

Logistic Regression

The dataset I will use is wholesaler data from 2014. It records the annual spending of wholesale customers across several product categories and is used to decide how products are distributed to those customers. I save the data in an object for processing.

Saving dataset into object

sale <- read.csv("wholesale.csv")
glimpse(sale)
## Rows: 440
## Columns: 8
## $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

Checking the dataset

head(sale)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       2      3 12669 9656    7561    214             2674       1338
## 2       2      3  7057 9810    9568   1762             3293       1776
## 3       2      3  6353 8808    7684   2405             3516       7844
## 4       1      3 13265 1196    4221   6404              507       1788
## 5       2      3 22615 5410    7198   3915             1777       5185
## 6       2      3  9413 8259    5126    666             1795       1451

From the output above, the attribute information is:

  1. FRESH: annual spending (m.u.) on fresh products (Continuous);

  2. MILK: annual spending (m.u.) on milk products (Continuous);

  3. GROCERY: annual spending (m.u.) on grocery products (Continuous);

  4. FROZEN: annual spending (m.u.) on frozen products (Continuous);

  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);

  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);

  7. CHANNEL: customers’ Channel - (1) Horeca (Hotel/Restaurant/Café) or (2) Retail channel (Nominal);

  8. REGION: customers’ Region - (1) Lisnon, (2) Oporto or (3) Other (Nominal).

Data Manipulation

I want to coerce some columns to categorical classes so that I can analyze them further. Based on the attribute information, I will change the Channel and Region columns into factors.

Data coercion

sale <- sale %>% 
  mutate(Channel = factor(Channel, levels = c(1,2), labels = c("Horeca", "Retail")),
         Region = factor(Region, levels = c(1,2,3), labels = c("Lisnon", "Oporto", "Other")))

Checking coercion

glimpse(sale)
## Rows: 440
## Columns: 8
## $ Channel          <fct> Retail, Retail, Retail, Horeca, Retail, Retail, Retai…
## $ Region           <fct> Other, Other, Other, Other, Other, Other, Other, Othe…
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

After that, I want to check whether there are any missing values in the data.

Checking missing value

colSums(is.na(sale))
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0

Pre-Processing Data

I want to look at the class proportion of the data. A reasonably balanced target will help the modelling process later.

Checking proportion

prop.table(table(sale$Channel))
## 
##    Horeca    Retail 
## 0.6772727 0.3227273

Checking amount of data

table(sale$Channel)
## 
## Horeca Retail 
##    298    142

The class proportion (roughly 68:32) is not perfectly even, but it is balanced enough to proceed to the next step.
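As an aside, if the classes were badly skewed, caret offers simple rebalancing, for example downsampling the majority class. This is not needed here and is shown only as an illustration (balanced is a hypothetical object name):

# Downsample the majority class so both classes have equal counts
balanced <- downSample(x = sale[, -1],   # predictors (Channel is column 1)
                       y = sale$Channel, yname = "Channel")
table(balanced$Channel)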

Splitting Train-Test

Splitting the data lets me test the model on unseen data, which gives an honest estimate of its predictive performance. The split produces a train set and a test set; I will use 70% of the rows for training and 30% for testing.

Splitting Data

set.seed(100)
index <- sample(nrow(sale), nrow(sale)*0.7)
train <- sale[index,]
test <- sale[-index,]
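Note that a plain random sample does not guarantee that the class proportions survive in each split. If that mattered, caret's createDataPartition() could produce a stratified split instead; a minimal sketch (train_strat and test_strat are illustrative names, and the analysis below keeps the split above):

# Stratified 70/30 split that preserves the Channel proportions
idx_strat <- createDataPartition(sale$Channel, p = 0.7, list = FALSE)
train_strat <- sale[idx_strat, ]
test_strat  <- sale[-idx_strat, ]
prop.table(table(train_strat$Channel))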

If the split succeeded, the train data should have fewer rows than the original (about 308 of 440). I want to check the train data to confirm that.

Checking train data

glimpse(train)
## Rows: 308
## Columns: 8
## $ Channel          <fct> Retail, Retail, Retail, Retail, Horeca, Horeca, Horec…
## $ Region           <fct> Lisnon, Other, Other, Lisnon, Other, Other, Oporto, O…
## $ Fresh            <int> 4484, 37, 12579, 1107, 25066, 13265, 7034, 32717, 403…
## $ Milk             <int> 14399, 1275, 11114, 11711, 5010, 1196, 1492, 16784, 2…
## $ Grocery          <int> 24708, 22272, 17569, 23596, 5026, 4221, 2405, 13626, …
## $ Frozen           <int> 3549, 137, 805, 955, 9806, 6404, 12569, 60869, 774, 5…
## $ Detergents_Paper <int> 14235, 6747, 6457, 9265, 1092, 507, 299, 1272, 54, 31…
## $ Delicassen       <int> 1681, 110, 1519, 710, 960, 1788, 1117, 5609, 63, 1000…

Modelling Data

I will use glm() with a binomial family to model the probability that a customer belongs to the Retail channel (the second factor level).

Regression Probability Model

wholesale <- glm(formula = Channel ~ ., family = "binomial", 
                 data = train)
summary(wholesale)
## 
## Call:
## glm(formula = Channel ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.91292  -0.30769  -0.19112   0.05672   3.14662  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -4.880e+00  8.672e-01  -5.628 1.82e-08 ***
## RegionOporto      2.437e+00  9.659e-01   2.523   0.0116 *  
## RegionOther       1.376e+00  7.497e-01   1.836   0.0664 .  
## Fresh             1.069e-05  2.015e-05   0.530   0.5959    
## Milk              5.544e-05  7.093e-05   0.782   0.4344    
## Grocery           7.166e-05  6.551e-05   1.094   0.2740    
## Frozen           -1.949e-04  1.250e-04  -1.559   0.1189    
## Detergents_Paper  9.555e-04  1.619e-04   5.903 3.57e-09 ***
## Delicassen       -9.677e-05  1.212e-04  -0.799   0.4245    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.29  on 307  degrees of freedom
## Residual deviance: 139.50  on 299  degrees of freedom
## AIC: 157.5
## 
## Number of Fisher Scoring iterations: 7

I want to optimize the model by eliminating unnecessary variables. I will use stepAIC() with backward direction.

model_new <- stepAIC(wholesale, direction = "backward")
## Start:  AIC=157.5
## Channel ~ Region + Fresh + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen
## 
##                    Df Deviance    AIC
## - Fresh             1   139.78 155.78
## - Milk              1   140.10 156.10
## - Delicassen        1   140.15 156.15
## - Grocery           1   140.70 156.70
## <none>                  139.50 157.50
## - Frozen            1   143.42 159.42
## - Region            2   146.62 160.62
## - Detergents_Paper  1   181.76 197.76
## 
## Step:  AIC=155.78
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen
## 
##                    Df Deviance    AIC
## - Delicassen        1   140.35 154.35
## - Milk              1   140.58 154.58
## - Grocery           1   140.97 154.97
## <none>                  139.78 155.78
## - Frozen            1   143.46 157.46
## - Region            2   147.30 159.30
## - Detergents_Paper  1   182.11 196.11
## 
## Step:  AIC=154.35
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper
## 
##                    Df Deviance    AIC
## - Grocery           1   141.17 153.17
## - Milk              1   141.22 153.22
## <none>                  140.35 154.35
## - Region            2   147.85 157.85
## - Frozen            1   146.15 158.15
## - Detergents_Paper  1   184.10 196.10
## 
## Step:  AIC=153.17
## Channel ~ Region + Milk + Frozen + Detergents_Paper
## 
##                    Df Deviance    AIC
## <none>                  141.17 153.17
## - Milk              1   143.50 153.50
## - Frozen            1   147.28 157.28
## - Region            2   149.40 157.40
## - Detergents_Paper  1   248.25 258.25

The model can be summarized as below:

summary(model_new)
## 
## Call:
## glm(formula = Channel ~ Region + Milk + Frozen + Detergents_Paper, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9738  -0.3146  -0.1874   0.0617   3.1662  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -4.793e+00  8.615e-01  -5.564 2.64e-08 ***
## RegionOporto      2.627e+00  9.657e-01   2.721  0.00652 ** 
## RegionOther       1.470e+00  7.380e-01   1.992  0.04634 *  
## Milk              9.391e-05  5.969e-05   1.573  0.11564    
## Frozen           -2.098e-04  1.036e-04  -2.025  0.04285 *  
## Detergents_Paper  1.017e-03  1.415e-04   7.189 6.52e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.29  on 307  degrees of freedom
## Residual deviance: 141.17  on 302  degrees of freedom
## AIC: 153.17
## 
## Number of Fisher Scoring iterations: 7
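As a side note, the estimates above are on the log-odds scale. Exponentiating them gives odds ratios, which can be easier to read (a quick check, not part of the original flow):

# Multiplicative change in the odds of Retail per one-unit
# increase in each predictor
exp(coef(model_new))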

Prediction Model

With the new model, I will predict the test data.

test$pred<-predict(model_new, type = "response", newdata = test)
test$pred
##   [1] 1.383277e-01 2.969916e-01 5.423034e-01 2.100807e-01 9.934819e-01
##   [6] 8.481993e-01 8.870891e-01 7.279295e-02 2.739359e-02 9.685245e-01
##  [11] 2.538715e-02 9.991819e-01 9.519709e-02 7.448757e-02 2.203871e-02
##  [16] 6.093315e-02 9.121495e-01 9.999946e-01 3.196117e-02 9.612344e-01
##  [21] 1.000000e+00 4.599294e-02 5.719156e-01 1.000000e+00 6.306540e-01
##  [26] 6.174651e-02 5.771916e-01 7.212636e-02 2.294073e-02 2.566079e-01
##  [31] 9.999511e-01 3.401562e-02 6.209438e-02 1.000000e+00 5.783327e-03
##  [36] 3.829139e-02 1.406343e-02 3.341216e-05 5.145085e-02 9.925836e-01
##  [41] 5.109060e-03 1.316123e-01 3.635683e-02 9.981293e-01 4.903190e-01
##  [46] 4.355140e-02 2.421080e-02 3.181022e-01 4.494786e-02 1.224249e-01
##  [51] 1.096796e-01 5.403652e-02 3.029666e-02 9.804503e-01 9.960542e-01
##  [56] 7.934930e-01 9.999910e-01 9.847144e-01 1.979123e-01 3.029803e-02
##  [61] 9.996261e-01 9.992572e-01 8.415852e-01 2.858877e-02 4.578010e-02
##  [66] 4.087296e-01 9.774852e-01 8.583760e-01 3.111996e-02 5.059652e-01
##  [71] 1.111102e-02 2.068873e-02 3.955593e-01 8.440131e-01 9.999651e-01
##  [76] 7.158553e-03 2.076702e-02 7.585840e-02 3.867282e-02 4.601115e-03
##  [81] 8.078802e-03 5.392549e-02 2.418084e-03 1.840993e-02 2.305994e-03
##  [86] 1.172421e-02 6.581478e-02 6.507432e-01 6.974966e-03 3.504527e-03
##  [91] 1.749455e-02 7.260970e-02 1.707025e-02 7.952828e-03 1.854480e-02
##  [96] 9.931477e-02 4.776252e-02 2.779207e-02 9.913404e-01 1.365792e-01
## [101] 1.305378e-01 1.626878e-01 2.008238e-01 3.254560e-02 6.901159e-02
## [106] 1.046726e-01 1.179395e-01 2.980573e-02 9.094497e-01 3.044272e-02
## [111] 6.781503e-01 4.168059e-02 3.596900e-02 1.581787e-02 6.030694e-02
## [116] 2.492814e-01 4.278225e-02 4.124409e-02 1.152182e-01 2.578785e-02
## [121] 3.255173e-02 1.059982e-02 2.119325e-02 3.201368e-03 5.378019e-02
## [126] 2.010230e-03 1.127120e-02 5.725339e-02 3.345292e-01 1.728719e-02
## [131] 9.999980e-01 3.977621e-02

To see how the predicted probabilities are distributed, I will plot their density.

Probability Distribution

ggplot(test, aes(x=pred)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

In the graph above, the density of the predictions is concentrated near 0, so most customers tend to be predicted as Horeca.

Next, I will turn the probability column into a class label and compare the prediction with the actual data.

Sample of Data Prediction and Result

test$pred <- factor(ifelse(test$pred > 0.5 , "Retail","Horeca"))
test[1:10, c("pred", "Channel")]
##      pred Channel
## 5  Horeca  Retail
## 6  Horeca  Retail
## 8  Retail  Retail
## 9  Horeca  Horeca
## 10 Retail  Retail
## 13 Retail  Retail
## 17 Retail  Retail
## 18 Horeca  Horeca
## 22 Horeca  Horeca
## 24 Retail  Retail

In the data above, a predicted probability greater than 0.5 is labelled Retail; otherwise it is labelled Horeca.
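The 0.5 cutoff is only the default; moving it trades recall against precision. A short illustrative re-run with a stricter cutoff (the 0.7 threshold is arbitrary, chosen purely for demonstration):

# test$pred now holds factor labels, so re-compute the probabilities
prob <- predict(model_new, type = "response", newdata = test)

# A stricter cutoff predicts Retail less often: fewer false
# positives at the cost of more false negatives
pred_strict <- factor(ifelse(prob > 0.7, "Retail", "Horeca"),
                      levels = c("Horeca", "Retail"))
table(prediction = pred_strict, actual = test$Channel)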

Model Evaluation

I want to use a confusion matrix to evaluate the model.

Confusion matrix

log_model <- confusionMatrix(test$pred, test$Channel, positive = "Retail")
log_model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Horeca Retail
##     Horeca     87      9
##     Retail      3     33
##                                           
##                Accuracy : 0.9091          
##                  95% CI : (0.8466, 0.9521)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 5.202e-10       
##                                           
##                   Kappa : 0.7822          
##                                           
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.7857          
##             Specificity : 0.9667          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.9062          
##              Prevalence : 0.3182          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.2727          
##       Balanced Accuracy : 0.8762          
##                                           
##        'Positive' Class : Retail          
## 

From the output above we can interpret log_model as follows (a quick hand-check of these numbers is sketched after the list):

  1. Accuracy: 0.9091

  2. Sensitivity/Recall: 0.7857

  3. Precision/Pos Pred Value: 0.9167
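The metrics can be verified directly from the confusion matrix counts (a minimal hand-check; the positive class is Retail):

# Counts taken from the confusion matrix above
TP <- 33; TN <- 87; FP <- 3; FN <- 9
c(accuracy  = (TP + TN) / (TP + TN + FP + FN),  # 120/132 = 0.9091
  recall    = TP / (TP + FN),                   # 33/42   = 0.7857
  precision = TP / (TP + FP))                   # 33/36   = 0.9167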

KNN Model

The KNN model uses only numeric predictor variables, since it classifies by distance. The categorical Region column will therefore not be used in the KNN model.

Deselect column

wsale <- sale %>% dplyr::select(-Region)   # dplyr:: prefix because MASS masks select()
glimpse(wsale)
## Rows: 440
## Columns: 7
## $ Channel          <fct> Retail, Retail, Retail, Horeca, Retail, Retail, Retai…
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

Data Splitting

The process is the same as for the regression model. I will use the same 70:30 proportion so that the results are comparable.

Data train and test

RNGkind(sample.kind = "Rounding")
set.seed(300)

# index sampling: draw 70% of the row indices at random
index <- sample(nrow(wsale), nrow(wsale)*0.7)

# splitting
wsale_train <- wsale[index,]
wsale_test <-  wsale[-index,]

I want to check whether the class proportion of the new train sample is still balanced.

Checking proportion

# recheck class balance
prop.table((table(wsale_train$Channel)))
## 
##    Horeca    Retail 
## 0.6428571 0.3571429

The proportion is still reasonably balanced. I want to inspect the new train data to confirm the dataset.

Checking data

head(wsale_train)
##     Channel Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 282  Retail 12238 7108    6235   1093             2328       2079
## 235  Horeca 15603 2703    3833   4260              325       2563
## 247  Horeca  8885 2428    1777   1777              430        610
## 224  Retail  2790 2527    5265   5612              788       1360
## 208  Retail  2541 4737    6089   2946             5316        120
## 4    Horeca 13265 1196    4221   6404              507       1788

The train data is ready. Now I need to separate the numeric predictors from the categorical target, because knn() expects a matrix of predictors and a separate vector of class labels.

Separating predictors and target

# predictors
wsale_train_x <- wsale_train %>% select_if(is.numeric)

wsale_test_x <- wsale_test %>% select_if(is.numeric)

# target
wsale_train_y <- wsale_train[,"Channel"]

wsale_test_y <- wsale_test[,"Channel"]

Data Modelling

Now the data is ready for scaling. The test set must be scaled with the training set's center and scale so that no information leaks from the test data.

Data scaling

wsale_train_xs <- scale(wsale_train_x)

wsale_test_xs <- scale(wsale_test_x, center = attr(wsale_train_xs, "scaled:center"), scale = attr(wsale_train_xs, "scaled:scale"))
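As an optional sanity check, after z-score scaling every training column should have mean ≈ 0 and standard deviation 1:

# Means should be ~0 (up to floating-point noise), sds exactly 1
round(colMeans(wsale_train_xs), 10)
apply(wsale_train_xs, 2, sd)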

I need an optimum k value for the model; a common rule of thumb is the square root of the number of rows used in training.

Optimum k value

sqrt(nrow(wsale_train_xs))
## [1] 17.54993
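sqrt(308) ≈ 17.55, and with two classes an odd k avoids tied votes, so I use k = 17. A small helper along these lines (illustrative; k_opt is a hypothetical name, and the value is simply hard-coded below):

# Nearest odd k at or below sqrt(n_train), to avoid tied votes
k_opt <- floor(sqrt(nrow(wsale_train_xs)))
if (k_opt %% 2 == 0) k_opt <- k_opt - 1
k_opt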

Now I can build the KNN model and predict the test data.

knn model

wsale_pred <- knn(train=wsale_train_xs, 
                 test=wsale_test_xs, 
                 cl=wsale_train_y,
                 k=17)
wsale_pred
##   [1] Horeca Retail Horeca Horeca Retail Horeca Horeca Retail Horeca Horeca
##  [11] Horeca Retail Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca
##  [21] Horeca Horeca Horeca Retail Horeca Retail Horeca Retail Horeca Horeca
##  [31] Horeca Horeca Retail Retail Retail Retail Horeca Horeca Retail Retail
##  [41] Horeca Retail Horeca Retail Horeca Retail Horeca Horeca Horeca Retail
##  [51] Horeca Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca
##  [61] Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca Retail Horeca
##  [71] Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca Horeca
##  [81] Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca
##  [91] Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca Retail
## [101] Horeca Horeca Horeca Horeca Retail Horeca Horeca Retail Retail Retail
## [111] Retail Horeca Retail Retail Horeca Horeca Retail Horeca Retail Horeca
## [121] Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca Retail
## [131] Horeca Horeca
## Levels: Horeca Retail

The predictions are shown above. To evaluate their performance, I put them into a confusion matrix.

knn confusion matrix

knn_model <- confusionMatrix(data = as.factor(wsale_pred),
                reference = wsale_test_y,
                positive = "Retail")
knn_model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Horeca Retail
##     Horeca     93      5
##     Retail      7     27
##                                           
##                Accuracy : 0.9091          
##                  95% CI : (0.8466, 0.9521)
##     No Information Rate : 0.7576          
##     P-Value [Acc > NIR] : 6.964e-06       
##                                           
##                   Kappa : 0.7576          
##                                           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.8438          
##             Specificity : 0.9300          
##          Pos Pred Value : 0.7941          
##          Neg Pred Value : 0.9490          
##              Prevalence : 0.2424          
##          Detection Rate : 0.2045          
##    Detection Prevalence : 0.2576          
##       Balanced Accuracy : 0.8869          
##                                           
##        'Positive' Class : Retail          
## 

From the output above we can interpret knn_model as follows:

  1. Accuracy: 0.9091

  2. Recall/Sensitivity: 0.8438

  3. Precision/Pos Pred Value: 0.7941

Conclusion and Recommendation

In terms of false negatives and false positives, for these models:

FN: predicted negative (Horeca), but actually positive (Retail)

FP: predicted positive (Retail), but actually negative (Horeca)

I use recall when I want to minimize FN and precision when I want to minimize FP.

From both models we can compare the values of their confusion matrices; a summary table follows below. Several conclusions can be drawn:
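A small summary frame makes the comparison easier to see (the values are copied from the two confusion matrices above):

# Metrics from both confusion matrices; positive class is Retail
data.frame(
  model     = c("logistic regression", "knn (k = 17)"),
  accuracy  = c(0.9091, 0.9091),
  recall    = c(0.7857, 0.8438),
  precision = c(0.9167, 0.7941)
)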

  1. Both models achieve the same high accuracy of 90.91%. This means both models predict true positives and true negatives well and can distinguish Retail from Horeca reliably.

  2. The KNN model has a higher recall (sensitivity) than the regression model. If we want to miss as few actual Retail customers as possible (minimize false negatives), the KNN model is the better predictor.

  3. The regression model has a higher precision than the KNN model. If we want its Retail predictions to be as trustworthy as possible (minimize false positives), the regression model is the better predictor.