1. Objective

The objective of this analysis is to predict which sales channel a wholesale customer belongs to by classifying customer spending data. The target variable in this analysis is “Channel”. Two methods are used: Logistic Regression and K-NN. By applying both, this analysis determines which method performs better and can therefore be used for prediction.

Source of data: UCI Wholesale customers Data Set

2. Library & Setup and Data Input

2.1. Library & Setup

Here are the libraries used in this analysis:

library(tidyverse) # data wrangling & visualization (includes dplyr, tidyr, ggplot2)
library(gtools)    # assorted programming utilities
library(caret)     # upSample() and confusionMatrix()
library(GGally)    # pair/correlation plots
library(ggplot2)   # plotting
library(readr)     # data import
library(dplyr)     # data manipulation
library(class)     # knn()

2.2. Data Input

Here is the dataset used in this analysis.

wholesale <- read.csv("wholesale.csv")
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
## $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~

Column Description:

  • Channel: purchasing channel used by the customer, where 1 = “Horeca” (Hotel/Restaurant/Café) and 2 = “Retail”
  • Region: customer region (store location)
  • Fresh: annual spending on fresh products
  • Milk: annual spending on milk products
  • Grocery: annual spending on grocery products
  • Frozen: annual spending on frozen products
  • Detergents_Paper: annual spending on detergents and paper products
  • Delicassen: annual spending on delicatessen products

3. Data Preparation & EDA

3.1. Data Wrangling

Based on the dataset above, the following adjustments will be made:

  • Data type change: “Channel”, from “int” to “factor”
  • Data exclusion: the “Region” column will be excluded from this analysis

Here is the process:

wholesale <- wholesale %>% 
  select(-Region) %>% 
  mutate(Channel = ifelse(Channel == 1, "Horeca", "Retail")) %>% 
  mutate(Channel = as.factor(Channel))
head(wholesale)

3.2. Missing Value Check

colSums(is.na(wholesale))
##          Channel            Fresh             Milk          Grocery 
##                0                0                0                0 
##           Frozen Detergents_Paper       Delicassen 
##                0                0                0

Based on the check above, there are no missing values in the dataset.

3.3. Data Distribution Check

summary(wholesale)
##    Channel        Fresh             Milk          Grocery     
##  Horeca:298   Min.   :     3   Min.   :   55   Min.   :    3  
##  Retail:142   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153  
##               Median :  8504   Median : 3627   Median : 4756  
##               Mean   : 12000   Mean   : 5796   Mean   : 7951  
##               3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656  
##               Max.   :112151   Max.   :73498   Max.   :92780  
##      Frozen        Detergents_Paper    Delicassen     
##  Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

Insight:

  • There are indications of outliers in the data (e.g. Fresh, Milk, Grocery, Detergents_Paper, Delicassen), since the maximum values lie far above the third quartiles; a boxplot sketch follows below.
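
To visualize these outliers, here is a quick optional sketch that reshapes the numeric columns into long format and draws one boxplot per product category (it only uses tidyr and ggplot2, which are already loaded via the tidyverse):

wholesale %>% 
  pivot_longer(cols = -Channel,          # the six numeric spending columns
               names_to = "variable", 
               values_to = "spending") %>% 
  ggplot(aes(x = variable, y = spending)) +
  geom_boxplot() +
  labs(title = "Spending Distribution per Product Category") +
  theme_minimal()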

3.4. Data Imbalance Check

Before building a model, the proportion of the target variable needs to be checked.

prop.table(table(wholesale$Channel))
## 
##    Horeca    Retail 
## 0.6772727 0.3227273

According to the proportion check, there is a class imbalance in the data (roughly 68% Horeca vs. 32% Retail). To address this, the upsampling method will be used.

wholesale_up <- upSample(x = wholesale %>% select(-Channel),
                         y = wholesale$Channel, 
                         list = FALSE, 
                         yname = "Channel")

head(wholesale_up)

After upsampling has been completed, here is the updated proportion.

prop.table(table(wholesale_up$Channel))
## 
## Horeca Retail 
##    0.5    0.5

The classes are now balanced at 0.5 : 0.5.

4. Cross Validation: Train-Test Splitting

Before building a model, the dataset needs to be split into “train” and “test” sets: 80% of the data goes to “train” and 20% to “test”. Here is the splitting process.

RNGkind(sample.kind = "Rounding")
set.seed(132)

tts_data <- nrow(wholesale_up)
index <- sample(tts_data, tts_data*0.8)

train <- wholesale_up[index, ]
test <- wholesale_up[-index, ]
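
Because the split is random, a quick optional check confirms that both classes remain well represented in each split:

# Optional: class proportions after the split
prop.table(table(train$Channel))
prop.table(table(test$Channel))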

5. Logistic Regression

5.1. Modeling

Here is the logistic regression modeling process for the upsampled wholesale data, with “Channel” as the target variable.

set.seed(132)

model_log.reg <- glm(Channel ~., wholesale_up, family = "binomial")

summary(model_log.reg) 
## 
## Call:
## glm(formula = Channel ~ ., family = "binomial", data = wholesale_up)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2842  -0.3343  -0.0201   0.1733   2.8961  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.428e+00  4.028e-01  -8.511  < 2e-16 ***
## Fresh             2.012e-06  1.562e-05   0.129   0.8975    
## Milk              8.035e-05  4.993e-05   1.609   0.1076    
## Grocery           1.109e-04  5.635e-05   1.967   0.0491 *  
## Frozen           -1.189e-04  6.763e-05  -1.758   0.0787 .  
## Detergents_Paper  1.017e-03  1.293e-04   7.863 3.76e-15 ***
## Delicassen       -3.611e-05  8.768e-05  -0.412   0.6804    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 826.23  on 595  degrees of freedom
## Residual deviance: 287.67  on 589  degrees of freedom
## AIC: 301.67
## 
## Number of Fisher Scoring iterations: 7

The model is then refined with backward stepwise regression (step()), which drops predictors as long as removing them lowers the AIC.

set.seed(132)

model_log.reg.step <- step(model_log.reg, direction = "backward", trace = 0)

summary(model_log.reg.step)
## 
## Call:
## glm(formula = Channel ~ Milk + Grocery + Frozen + Detergents_Paper, 
##     family = "binomial", data = wholesale_up)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2740  -0.3374  -0.0179   0.1765   2.8931  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.391e+00  3.556e-01  -9.535  < 2e-16 ***
## Milk              7.386e-05  4.770e-05   1.548   0.1215    
## Grocery           1.089e-04  5.530e-05   1.969   0.0489 *  
## Frozen           -1.283e-04  5.994e-05  -2.140   0.0323 *  
## Detergents_Paper  1.015e-03  1.264e-04   8.027 9.99e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 826.23  on 595  degrees of freedom
## Residual deviance: 287.86  on 591  degrees of freedom
## AIC: 297.86
## 
## Number of Fisher Scoring iterations: 7
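
Since logistic regression coefficients are on the log-odds scale, one optional way to read them is to exponentiate them into odds ratios; for example, each additional monetary unit spent on Detergents_Paper multiplies the odds of the Retail channel by roughly exp(1.015e-03) ≈ 1.001. A minimal sketch:

# Exponentiate the stepwise model's coefficients to read them as odds ratios
exp(coef(model_log.reg.step))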

The “test” data is then predicted using the stepwise logistic regression model.

set.seed(132)


pred_test <- predict(object = model_log.reg.step, 
                     newdata = test, 
                     type = "response")

pred_class <- as.factor(if_else(pred_test > 0.5, "Retail", "Horeca"))
ggplot(test, aes(x=pred_test)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

According to the graph above, predicted probabilities concentrate near 0, which corresponds to “Horeca”. Therefore, a predicted probability above 0.5 is classified as “Retail”, and 0.5 or below as “Horeca”.

5.2. Model Evaluation

To evaluate the model performance, a confusion matrix is used, as shown below.

set.seed(132)

confusionMatrix(data = pred_class, 
                reference = test$Channel, 
                positive = "Retail")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Horeca Retail
##     Horeca     55      5
##     Retail      5     55
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8521, 0.9593)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8333          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9167          
##             Specificity : 0.9167          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.9167          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4583          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9167          
##                                           
##        'Positive' Class : Retail          
## 

According to the confusion matrix above, this model achieves accuracy, sensitivity, specificity, and precision (Pos Pred Value) of 91.67% each, all well above 80%.
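
For reference, these values can be reproduced directly from the confusion matrix counts (TP = 55, TN = 55, FP = 5, FN = 5, with “Retail” as the positive class); a minimal sketch of the arithmetic:

# Metrics derived from the confusion matrix counts ("Retail" = positive class)
TP <- 55; TN <- 55; FP <- 5; FN <- 5

(TP + TN) / (TP + TN + FP + FN)  # Accuracy    = 0.9167
TP / (TP + FN)                   # Sensitivity = 0.9167
TN / (TN + FP)                   # Specificity = 0.9167
TP / (TP + FP)                   # Precision (Pos Pred Value) = 0.9167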

6. K-Nearest Neighbour

6.1. Modeling

#Predictor

set.seed(132)

train.knn.x <- train %>% 
  select_if(is.numeric)

test.knn.x <- test %>% 
  select_if(is.numeric)

#Target

train.knn.y <- train[,"Channel"]


test.knn.y <- test[,"Channel"]

Here is the scaling process for the train and test predictors. The test set is scaled using the center and scale values of the train set, so that no information from the test set leaks into the preprocessing.

#Predictor Scaling

set.seed(132)

train.knn.xs <- scale(train.knn.x)

test.knn.xs <- scale(test.knn.x, 
                     center = attr(train.knn.xs, "scaled:center"),
                     scale = attr(train.knn.xs, "scaled:scale"))

Here is how to find the optimum K value for KNN: a common rule of thumb is the square root of the number of training rows.

#Find optimum K
sqrt(nrow(train.knn.xs))
## [1] 21.81742

Here is the KNN model building process. K = 21 is used, an odd number close to the square root above (an odd K avoids ties in the majority vote).

set.seed(132)

pred.knn <- knn(train = train.knn.xs, 
                test = test.knn.xs, 
                cl= train.knn.y, 
                k= 21)
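
As an optional sanity check on the choice of K, test accuracy could be compared across several odd values of k around the square-root rule; a quick sketch:

# Compare test accuracy for several odd k values around sqrt(n)
for (k_try in c(15, 17, 19, 21, 23, 25)) {
  pred_k <- knn(train = train.knn.xs, test = test.knn.xs, cl = train.knn.y, k = k_try)
  cat("k =", k_try, "accuracy =", round(mean(pred_k == test.knn.y), 4), "\n")
}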

6.2. Model Evaluation

The KNN model is likewise evaluated with a confusion matrix.

set.seed(132)

confusionMatrix(data = as.factor(pred.knn),
                reference = as.factor(test.knn.y),
                positive = "Retail")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Horeca Retail
##     Horeca     50      3
##     Retail     10     57
##                                          
##                Accuracy : 0.8917         
##                  95% CI : (0.8219, 0.941)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.7833         
##                                          
##  Mcnemar's Test P-Value : 0.09609        
##                                          
##             Sensitivity : 0.9500         
##             Specificity : 0.8333         
##          Pos Pred Value : 0.8507         
##          Neg Pred Value : 0.9434         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4750         
##    Detection Prevalence : 0.5583         
##       Balanced Accuracy : 0.8917         
##                                          
##        'Positive' Class : Retail         
## 

Overall, the KNN model achieves accuracy (89.17%), sensitivity (95.00%), specificity (83.33%), and precision (85.07%), all above 80%.

7. Conclusion

Based on the analysis of the wholesale dataset using the Logistic Regression and KNN models, both models perform above 80% in terms of accuracy, sensitivity, specificity, and precision. Overall, the Logistic Regression model performs better than KNN.
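
To make the comparison explicit, the key metrics could be collected side by side. The sketch below assumes the two confusionMatrix() results above were stored in objects, here hypothetically named cm_logreg and cm_knn (this was not done in the code above, which only printed them):

# Hypothetical: cm_logreg and cm_knn hold the stored confusionMatrix() results
data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "Precision"),
  LogReg = c(cm_logreg$overall["Accuracy"],
             cm_logreg$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")]),
  KNN    = c(cm_knn$overall["Accuracy"],
             cm_knn$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")])
)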

Since the aim of this analysis is to classify customers by channel so that management can refine its channel strategy and boost sales, the metric of primary interest is precision (Pos Pred Value). On this metric the Logistic Regression model (91.67%) outperforms KNN (85.07%), so the Logistic Regression model is recommended for prediction.