In this project, our primary objective is to construct a model that can accurately categorize customers into their respective channels based on their annual spending on diverse product categories. Understanding customer behavior in wholesale distribution will allow the business to optimize their services based on the specific needs of each channel.
The dataset at hand contains annual spending in monetary units (m.u.) on diverse product categories for clients of a wholesale distributor. It comprises 8 variables which includes channel, region and six different types of product categories like Fresh, Milk, Grocery, etc. Our target variable here is channel, which categorizes customers into Hotel/Restaurant/Cafe or Retail.
The primary business goal is to help the wholesale distributor in understanding the purchasing behavior of their customers. This will aid in optimizing their distribution strategy based on the type of customer. The model will also be beneficial in predicting the channel of new customers based on their spending habits, thus helping in providing more tailored services.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## 要求されたパッケージ lattice をロード中です
##
## 次のパッケージを付け加えます: 'caret'
##
## 以下のオブジェクトは 'package:purrr' からマスクされています:
##
## lift
library(e1071)
library(class)
wholesale <- read.csv("data/wholesale.csv")
head(wholesale)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
wholesale$Channel <- as.factor(wholesale$Channel)
colSums(is.na(wholesale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
wholesale <- wholesale %>% mutate_if(is.numeric, scale)
set.seed(123)
index <- createDataPartition(wholesale$Channel, p=0.8, list=FALSE)
train_set <- wholesale[index, ]
test_set <- wholesale[-index, ]
model_lr <- glm(Channel ~ ., data=train_set, family=binomial)
## Warning: glm.fit: 数値的に 0 か 1 である確率が生じました
predict_lr <- predict(model_lr, newdata=test_set, type="response")
predict_lr_class <- ifelse(predict_lr > 0.5, 2, 1)
confusionMatrix(as.factor(predict_lr_class), as.factor(test_set$Channel))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 59 3
## 2 0 25
##
## Accuracy : 0.9655
## 95% CI : (0.9025, 0.9928)
## No Information Rate : 0.6782
## P-Value [Acc > NIR] : 2.587e-11
##
## Kappa : 0.9187
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.8929
## Pos Pred Value : 0.9516
## Neg Pred Value : 1.0000
## Prevalence : 0.6782
## Detection Rate : 0.6782
## Detection Prevalence : 0.7126
## Balanced Accuracy : 0.9464
##
## 'Positive' Class : 1
##
set.seed(123)
predicted_knn <- knn(train=train_set[, -1], test=test_set[, -1], cl=train_set$Channel, k=5)
prediction_lr <- ifelse(predict_lr > 0.5, 2, 1)
confusionMatrix(factor(prediction_lr), test_set$Channel)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 59 3
## 2 0 25
##
## Accuracy : 0.9655
## 95% CI : (0.9025, 0.9928)
## No Information Rate : 0.6782
## P-Value [Acc > NIR] : 2.587e-11
##
## Kappa : 0.9187
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.8929
## Pos Pred Value : 0.9516
## Neg Pred Value : 1.0000
## Prevalence : 0.6782
## Detection Rate : 0.6782
## Detection Prevalence : 0.7126
## Balanced Accuracy : 0.9464
##
## 'Positive' Class : 1
##
confusionMatrix(predicted_knn, test_set$Channel)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 57 6
## 2 2 22
##
## Accuracy : 0.908
## 95% CI : (0.8268, 0.9595)
## No Information Rate : 0.6782
## P-Value [Acc > NIR] : 3.999e-07
##
## Kappa : 0.7811
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.9661
## Specificity : 0.7857
## Pos Pred Value : 0.9048
## Neg Pred Value : 0.9167
## Prevalence : 0.6782
## Detection Rate : 0.6552
## Detection Prevalence : 0.7241
## Balanced Accuracy : 0.8759
##
## 'Positive' Class : 1
##
k <- seq(1, 20, by=2)
accuracy_rate <- c()
for(i in k) {
set.seed(123)
pred_knn <- knn(train=train_set[, -1], test=test_set[, -1], cl=train_set$Channel, k=i)
accuracy <- sum(pred_knn == test_set$Channel) / nrow(test_set)
accuracy_rate <- c(accuracy_rate, accuracy)
}
plot(k, accuracy_rate, type='b')
## Warning in check_font_path(italic, "italic"): 'italic' should be a length-one
## vector, using the first element
predict_lr_class <- ifelse(predict_lr > 0.5, 2, 1)
confusionMatrix_lr <- confusionMatrix(as.factor(predict_lr_class), as.factor(test_set$Channel))
eval_logit <- tibble(
Accuracy = confusionMatrix_lr$overall["Accuracy"],
Recall = confusionMatrix_lr$byClass["Sensitivity"],
Specificity = confusionMatrix_lr$byClass["Specificity"],
Precision = confusionMatrix_lr$byClass["Pos Pred Value"]
)
predictors_train <- train_set[,-1]
response_train <- train_set$Channel
predictors_test <- test_set[,-1]
response_test <- test_set$Channel
set.seed(123)
model_knn <- knn(train = predictors_train, test = predictors_test, cl = response_train, k=5)
confusionMatrix_knn <- confusionMatrix(model_knn, response_test)
eval_knn <- tibble(
Accuracy = confusionMatrix_knn$overall["Accuracy"],
Recall = confusionMatrix_knn$byClass["Sensitivity"],
Specificity = confusionMatrix_knn$byClass["Specificity"],
Precision = confusionMatrix_knn$byClass["Pos Pred Value"]
)
print(eval_logit)
## # A tibble: 1 × 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.966 1 0.893 0.952
print(eval_knn)
## # A tibble: 1 × 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.908 0.966 0.786 0.905
Given the performance of the Logistic Regression and K-NN models on the wholesale data, we can observe that both models showed considerable predictive capabilities, with slight differences in their performance metrics.
If we look at precision, which is particularly important in situations where false positives have a high cost (misclassifying a distributor as belonging to the wrong channel), the Logistic Regression model stands out, reaching a precision rate of 95.2%, compared to K-NN’s 90.5%.
On the other hand, if we focus on recall, which matters when false negatives (missing a distributor who actually belongs to a specific channel) have a high cost, both models perform well. However, Logistic Regression delivers a perfect recall of 100%, slightly higher than the K-NN model, which has a recall of 96.6%.
Thus, if we liken ourselves to a distributorship manager, who needs to assign the right channel to each distributor, different strategies have different implications. If we apply the wrong channel to a distributor, it could lead to inappropriate marketing strategies and sales plans, possibly hurting the relationship with the distributor. Therefore, we would look at the precision metric, where we don’t want our model to make mistakes in predicting the correct channel for each distributor.
Given these considerations, the Logistic Regression model, with its higher precision and recall rates, would be our model of choice. However, it’s essential to keep in mind that model performance can always be improved by further feature engineering, hyperparameter tuning, and using more complex models when necessary. Also, no single model is the best choice for all situations. It’s crucial to understand the specific needs and constraints of the problem you’re trying to solve and select a model accordingly.
Dataset: Wholesale.csv (ADSS’s Classification in Machine Learning I Archive) ..