1 INTRODUCTION

The dataset refers to clients/customers of a wholesale distributor. It includes the annual spending in monetary units on diverse product categories (Data source: UCI Machine Learning https://archive.ics.uci.edu/ml/datasets/Wholesale+customers). In this case, we would like to see the prediction of channel ‘Horeca’ is correctly predicted.

2 Logistic Regression

2.1 Loading Packages

library(dplyr)
library(caret)
library(readr)
library(class)
library(gtools)
library(tidyverse)
library(inspectdf)
library(MASS)
library(plotly)

2.2 Import Data

wsLBB <- read.csv("data_input/wholesale.csv")
wsLBB

Data description:

  1. FRESH: annual spending on fresh products.

  2. MILK: annual spending on milk products.

  3. GROCERY: annual spending on grocery products.

  4. FROZEN: annual spending on frozen products.

  5. DETERGENTS_PAPER: annual spending on detergents and paper products.

  6. DELICATESSEN: annual spending on and delicatessen products.

  7. CHANNEL: customers Channel - Horeca (Hotel/Restaurant/Cafe) or Retail channel.

  • 1: Hotel/Restaurant/Cafe
  • 2: Retail
  1. REGION: customers Region.

2.3 Inspect Data

  1. Checking missing vaue
colSums(is.na(wsLBB))
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0
  1. Converting target data type from integer to factor. (this case target data = Channel)
wsLBB <- wsLBB %>% 
  select(-Region) %>%
  mutate(Channel = as.factor(Channel))

wsLBB
  1. Explanotory Data
num <- inspect_num(wsLBB)
cat <- inspect_cat(wsLBB)
show_plot(num); show_plot(cat)

  1. Check proportion of target variable
prop.table(table(wsLBB$Channel))
## 
##         1         2 
## 0.6772727 0.3227273

2.4 Cross Validation

set.seed(200)

index <- sample(nrow(wsLBB),0.8*nrow(wsLBB))

train_wsLBB <- wsLBB[index,]
test_wsLBB <- wsLBB[-index,]

2.5 Balancing Target Variable

  1. upSample
train_wsLBB.new <-  upSample(x = train_wsLBB[, -1], 
                                  y = train_wsLBB$Channel, 
                                  yname = "Channel")
  1. Check Proportion of Target Variable After upSample
prop.table(table(train_wsLBB.new$Channel))
## 
##   1   2 
## 0.5 0.5

2.6 Modelling

wsLBB_model <- glm(Channel~.,train_wsLBB.new, family='binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

2.7 Feature Selection Using ‘step()’

  • Selection using Backward Method.
step(wsLBB_model, method="backward")
## Start:  AIC=264.49
## Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## - Milk              1   251.05 263.05
## - Fresh             1   251.25 263.25
## - Delicassen        1   251.42 263.42
## <none>                  250.49 264.49
## - Frozen            1   256.41 268.41
## - Grocery           1   256.52 268.52
## - Detergents_Paper  1   309.65 321.65
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=263.05
## Channel ~ Fresh + Grocery + Frozen + Detergents_Paper + Delicassen
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## - Delicassen        1   251.64 261.64
## - Fresh             1   251.96 261.96
## <none>                  251.05 263.05
## - Grocery           1   256.64 266.64
## - Frozen            1   257.52 267.52
## - Detergents_Paper  1   309.89 319.89
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=261.64
## Channel ~ Fresh + Grocery + Frozen + Detergents_Paper
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## - Fresh             1   252.88 260.88
## <none>                  251.64 261.64
## - Frozen            1   257.54 265.54
## - Grocery           1   258.67 266.67
## - Detergents_Paper  1   310.18 318.18
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=260.88
## Channel ~ Grocery + Frozen + Detergents_Paper
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## <none>                  252.88 260.88
## - Frozen            1   257.76 263.76
## - Grocery           1   261.06 267.06
## - Detergents_Paper  1   310.28 316.28
## 
## Call:  glm(formula = Channel ~ Grocery + Frozen + Detergents_Paper, 
##     family = "binomial", data = train_wsLBB.new)
## 
## Coefficients:
##      (Intercept)           Grocery            Frozen  Detergents_Paper  
##       -3.0537035         0.0001570        -0.0001134         0.0008759  
## 
## Degrees of Freedom: 483 Total (i.e. Null);  480 Residual
## Null Deviance:       671 
## Residual Deviance: 252.9     AIC: 260.9
  • Create New Model Based on Feature Selection Results.
wsLBB_model <- glm(Channel ~ Grocery + Frozen + Detergents_Paper, 
    family = "binomial", data = train_wsLBB.new)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

2.8 Predicting to Data Test

test_wsLBB$prob <- predict(wsLBB_model, test_wsLBB, type = "response")

head(test_wsLBB)
hist(test_wsLBB$prob)

  • Setting Threshold = 0.5
test_wsLBB$predict <- as.factor(ifelse(test_wsLBB$prob >= 0.5, 2, 1))

head(test_wsLBB)
  • Setting Threshold = 0.4
test_wsLBB$predict2 <- as.factor(ifelse(test_wsLBB$prob >= 0.4, 2, 1))

head(test_wsLBB)

2.9 Evaluating

** positive class = 1 **

  • threshold >= 0.5
confusionMatrix(test_wsLBB$predict, test_wsLBB$Channel, positive ='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 50  5
##          2  6 27
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.7873, 0.9359)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : 4.749e-07       
##                                           
##                   Kappa : 0.7317          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8929          
##             Specificity : 0.8438          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.6364          
##          Detection Rate : 0.5682          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.8683          
##                                           
##        'Positive' Class : 1               
## 
  • threshold >= 0.4
confusionMatrix(test_wsLBB$predict2, test_wsLBB$Channel, positive ='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 49  3
##          2  7 29
##                                           
##                Accuracy : 0.8864          
##                  95% CI : (0.8009, 0.9441)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : 1.136e-07       
##                                           
##                   Kappa : 0.7609          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.8750          
##             Specificity : 0.9062          
##          Pos Pred Value : 0.9423          
##          Neg Pred Value : 0.8056          
##              Prevalence : 0.6364          
##          Detection Rate : 0.5568          
##    Detection Prevalence : 0.5909          
##       Balanced Accuracy : 0.8906          
##                                           
##        'Positive' Class : 1               
## 

Note: After changing the threshold from 0.5 to 0.4, the precision of the model is raise from 90.91 & to 94.23%

3 KNN Algorithm

3.1 Import Data

wsKNN <- read.csv("data_input/wholesale.csv")
wsKNN

3.2 Inspect Data

  1. Checking missing vaue
sapply(wsKNN, function(x) {sum(is.na(x))})
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0
  1. Convert target data type from integer to factor. (this case target data = Channel)
wsKNN <- wsKNN %>% 
  select(-Region) %>% 
  mutate(Channel = factor(Channel,
                          levels=c(1,2),
                          labels=c("horeca","retail")))

wsKNN

3.3 Scaling

  • Scaling data
wsKNN.sc <- wsLBB %>% 
  mutate_if(is.integer, scale)
  • Check data after scaling
range(wsKNN[,2:7]) # before scaling
## [1]      3 112151
range(wsKNN.sc[,2:7]) # after scaling
## [1] -0.9486033 16.4597113

3.4 Cross Validation

set.seed(200)

index <- sample(nrow(wsKNN.sc),0.8*nrow(wsKNN.sc))

train_wsKNN.sc <- wsKNN.sc[index,]
test_wsKNN.sc <- wsKNN.sc[-index,]

3.5 Modelling

sqrt(nrow(train_wsKNN.sc)) 
## [1] 18.76166

–> from sqrt(), the value of k = 18.76

  • k=19
wsKNN_pred <- knn(train=train_wsKNN.sc[,-1],
               test=test_wsKNN.sc[,-1],
               cl=train_wsKNN.sc$Channel,
               19)
  • k=17
wsKNN_pred2 <- knn(train=train_wsKNN.sc[,-1],
               test=test_wsKNN.sc[,-1],
               cl=train_wsKNN.sc$Channel,
               17)

3.6 Evaluating

  • Positive class in this case is 1 = horeca

  • k=19

confusionMatrix(wsKNN_pred,
                test_wsKNN.sc$Channel, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 53  4
##          2  3 28
##                                          
##                Accuracy : 0.9205         
##                  95% CI : (0.843, 0.9674)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : 7.867e-10      
##                                          
##                   Kappa : 0.827          
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9464         
##             Specificity : 0.8750         
##          Pos Pred Value : 0.9298         
##          Neg Pred Value : 0.9032         
##              Prevalence : 0.6364         
##          Detection Rate : 0.6023         
##    Detection Prevalence : 0.6477         
##       Balanced Accuracy : 0.9107         
##                                          
##        'Positive' Class : 1              
## 
  • k=17
confusionMatrix(wsKNN_pred2,
                test_wsKNN.sc$Channel, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 53  4
##          2  3 28
##                                          
##                Accuracy : 0.9205         
##                  95% CI : (0.843, 0.9674)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : 7.867e-10      
##                                          
##                   Kappa : 0.827          
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9464         
##             Specificity : 0.8750         
##          Pos Pred Value : 0.9298         
##          Neg Pred Value : 0.9032         
##              Prevalence : 0.6364         
##          Detection Rate : 0.6023         
##    Detection Prevalence : 0.6477         
##       Balanced Accuracy : 0.9107         
##                                          
##        'Positive' Class : 1              
## 

4 Summary

Our target of the analysis is to see the prediction of customers channel ‘Horeca’ is correctly predicted, in order to make sure that we do the right strategy to the correct channel to increase horeca channel in the population (i.e by giving special price to the variables/items to horeca channel). For this case, we will see the precision value from the analysis result.

Based on the regression method above, precision result using logistic regression with threshold = 0.4 is higher than KNN algorithm.

The precision value results:

  1. Logistic regression
    • threshold = 0.5 –> precision = 90,91%
    • threshold = 0.4 –> precision = 94,23% (reducing the threshold value could increase the precision value)
  2. KNN
    • k = 19 –> precision = 92.98%
    • k = 17 –> precision = 92.98% (reducing the k-value doesn’t change the precision value)