Introduction

The Project

In this project, our primary objective is to construct a model that can accurately categorize customers into their respective channels based on their annual spending on diverse product categories. Understanding customer behavior in wholesale distribution will allow the business to optimize their services based on the specific needs of each channel.

The Dataset

The dataset at hand contains annual spending in monetary units (m.u.) on diverse product categories for clients of a wholesale distributor. It comprises 8 variables which includes channel, region and six different types of product categories like Fresh, Milk, Grocery, etc. Our target variable here is channel, which categorizes customers into Hotel/Restaurant/Cafe or Retail.

Business goal

The primary business goal is to help the wholesale distributor in understanding the purchasing behavior of their customers. This will aid in optimizing their distribution strategy based on the type of customer. The model will also be beneficial in predicting the channel of new customers based on their spending habits, thus helping in providing more tailored services.

Library and Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
##  要求されたパッケージ lattice をロード中です 
## 
##  次のパッケージを付け加えます: 'caret' 
## 
##  以下のオブジェクトは 'package:purrr' からマスクされています:
## 
##     lift
library(e1071)
library(class)

Data Preparation

wholesale <- read.csv("data/wholesale.csv")
head(wholesale)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       2      3 12669 9656    7561    214             2674       1338
## 2       2      3  7057 9810    9568   1762             3293       1776
## 3       2      3  6353 8808    7684   2405             3516       7844
## 4       1      3 13265 1196    4221   6404              507       1788
## 5       2      3 22615 5410    7198   3915             1777       5185
## 6       2      3  9413 8259    5126    666             1795       1451

Exploratory Data Analysis

Data Manipulation

wholesale$Channel <- as.factor(wholesale$Channel)

Pre-Processing Data

colSums(is.na(wholesale))
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0
wholesale <- wholesale %>% mutate_if(is.numeric, scale)

set.seed(123)
index <- createDataPartition(wholesale$Channel, p=0.8, list=FALSE)
train_set <- wholesale[index, ]
test_set <- wholesale[-index, ]

Logistic Regression

model_lr <- glm(Channel ~ ., data=train_set, family=binomial)
## Warning: glm.fit: 数値的に 0 か 1 である確率が生じました
predict_lr <- predict(model_lr, newdata=test_set, type="response")

predict_lr_class <- ifelse(predict_lr > 0.5, 2, 1)

confusionMatrix(as.factor(predict_lr_class), as.factor(test_set$Channel))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 59  3
##          2  0 25
##                                           
##                Accuracy : 0.9655          
##                  95% CI : (0.9025, 0.9928)
##     No Information Rate : 0.6782          
##     P-Value [Acc > NIR] : 2.587e-11       
##                                           
##                   Kappa : 0.9187          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8929          
##          Pos Pred Value : 0.9516          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6782          
##          Detection Rate : 0.6782          
##    Detection Prevalence : 0.7126          
##       Balanced Accuracy : 0.9464          
##                                           
##        'Positive' Class : 1               
## 

Modelling

set.seed(123)
predicted_knn <- knn(train=train_set[, -1], test=test_set[, -1], cl=train_set$Channel, k=5)

Prediction

prediction_lr <- ifelse(predict_lr > 0.5, 2, 1)

Model Evaluation

confusionMatrix(factor(prediction_lr), test_set$Channel)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 59  3
##          2  0 25
##                                           
##                Accuracy : 0.9655          
##                  95% CI : (0.9025, 0.9928)
##     No Information Rate : 0.6782          
##     P-Value [Acc > NIR] : 2.587e-11       
##                                           
##                   Kappa : 0.9187          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8929          
##          Pos Pred Value : 0.9516          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6782          
##          Detection Rate : 0.6782          
##    Detection Prevalence : 0.7126          
##       Balanced Accuracy : 0.9464          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(predicted_knn, test_set$Channel)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 57  6
##          2  2 22
##                                           
##                Accuracy : 0.908           
##                  95% CI : (0.8268, 0.9595)
##     No Information Rate : 0.6782          
##     P-Value [Acc > NIR] : 3.999e-07       
##                                           
##                   Kappa : 0.7811          
##                                           
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.9661          
##             Specificity : 0.7857          
##          Pos Pred Value : 0.9048          
##          Neg Pred Value : 0.9167          
##              Prevalence : 0.6782          
##          Detection Rate : 0.6552          
##    Detection Prevalence : 0.7241          
##       Balanced Accuracy : 0.8759          
##                                           
##        'Positive' Class : 1               
## 

K-Nearest Neighbour

k <- seq(1, 20, by=2)
accuracy_rate <- c()

for(i in k) {
  set.seed(123)
  pred_knn <- knn(train=train_set[, -1], test=test_set[, -1], cl=train_set$Channel, k=i)
  accuracy <- sum(pred_knn == test_set$Channel) / nrow(test_set)
  accuracy_rate <- c(accuracy_rate, accuracy)
}
plot(k, accuracy_rate, type='b')
## Warning in check_font_path(italic, "italic"): 'italic' should be a length-one
## vector, using the first element

Model Evaluation Matrix for Logistic Regression and K-NN

predict_lr_class <- ifelse(predict_lr > 0.5, 2, 1)
confusionMatrix_lr <- confusionMatrix(as.factor(predict_lr_class), as.factor(test_set$Channel))

Logistic Regression

eval_logit <- tibble(
  Accuracy = confusionMatrix_lr$overall["Accuracy"],
  Recall = confusionMatrix_lr$byClass["Sensitivity"],
  Specificity = confusionMatrix_lr$byClass["Specificity"],
  Precision = confusionMatrix_lr$byClass["Pos Pred Value"]
)

Model Evaluation for K-NN

predictors_train <- train_set[,-1]
response_train <- train_set$Channel

predictors_test <- test_set[,-1]
response_test <- test_set$Channel

set.seed(123)
model_knn <- knn(train = predictors_train, test = predictors_test, cl = response_train, k=5)

confusionMatrix_knn <- confusionMatrix(model_knn, response_test)

eval_knn <- tibble(
  Accuracy = confusionMatrix_knn$overall["Accuracy"],
  Recall = confusionMatrix_knn$byClass["Sensitivity"],
  Specificity = confusionMatrix_knn$byClass["Specificity"],
  Precision = confusionMatrix_knn$byClass["Pos Pred Value"]
)

Model Evaluation Result

print(eval_logit)
## # A tibble: 1 × 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.966      1       0.893     0.952
print(eval_knn)
## # A tibble: 1 × 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.908  0.966       0.786     0.905

Conclusion

Given the performance of the Logistic Regression and K-NN models on the wholesale data, we can observe that both models showed considerable predictive capabilities, with slight differences in their performance metrics.

If we look at precision, which is particularly important in situations where false positives have a high cost (misclassifying a distributor as belonging to the wrong channel), the Logistic Regression model stands out, reaching a precision rate of 95.2%, compared to K-NN’s 90.5%.

On the other hand, if we focus on recall, which matters when false negatives (missing a distributor who actually belongs to a specific channel) have a high cost, both models perform well. However, Logistic Regression delivers a perfect recall of 100%, slightly higher than the K-NN model, which has a recall of 96.6%.

Thus, if we liken ourselves to a distributorship manager, who needs to assign the right channel to each distributor, different strategies have different implications. If we apply the wrong channel to a distributor, it could lead to inappropriate marketing strategies and sales plans, possibly hurting the relationship with the distributor. Therefore, we would look at the precision metric, where we don’t want our model to make mistakes in predicting the correct channel for each distributor.

Given these considerations, the Logistic Regression model, with its higher precision and recall rates, would be our model of choice. However, it’s essential to keep in mind that model performance can always be improved by further feature engineering, hyperparameter tuning, and using more complex models when necessary. Also, no single model is the best choice for all situations. It’s crucial to understand the specific needs and constraints of the problem you’re trying to solve and select a model accordingly.

References

Dataset

Dataset: Wholesale.csv (ADSS’s Classification in Machine Learning I Archive) ..