Data Explanation
Objective
The data used in this model consist of annual spending, in monetary units, on diverse product categories by clients of a wholesale distributor. There are 440 rows; setting aside the Region variable, we work with 6 predictor variables and 1 response variable (Channel). The dataset comes from archive.ics.uci.edu.
Why?
The purpose of fitting logistic regression and K-Nearest Neighbor (KNN) models on this data is to determine whether a customer's channel is Horeca (Hotel/Restaurant/Café, i.e. the food and beverage service sector) or not. The results of this modeling can be used as a reference for predicting the channel type from the existing variables.
Data Preparation
wholesale <- read.csv("wholesale_kaggle.csv")
Exploratory Data Analysis
Checking the type of each variable
str(wholesale)
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Removing Region and Changing the Data Type
library(dplyr)
wholesale_clean <- wholesale %>%
  select(-Region) %>% # drop the Region variable from the dataset
mutate(Channel=as.factor(Channel))
rmarkdown::paged_table(wholesale_clean)
Summary of The Data
summary(wholesale_clean)
## Channel Fresh Milk Grocery Frozen
## 1:298 Min. : 3 Min. : 55 Min. : 3 Min. : 25.0
## 2:142 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2
## Median : 8504 Median : 3627 Median : 4756 Median : 1526.0
## Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2
## Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
## Detergents_Paper Delicassen
## Min. : 3.0 Min. : 3.0
## 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 816.5 Median : 965.5
## Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :40827.0 Max. :47943.0
Checking for Missing (NA) Values
colSums(is.na(wholesale_clean))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Result: There are no missing values.
Logistic Regression Preparation
Checking The Balance of Data Proportion
prop.table(table(wholesale_clean$Channel))
##
## 1 2
## 0.6772727 0.3227273
Result: The class proportions of roughly 0.68 and 0.32 are balanced enough for modelling.
Splitting Train-Test Data
The train-test proportion can, in principle, be chosen freely, but most practitioners give the training set a larger share than the test set. Here, I use a proportion of 0.8 for the training data and the remaining 0.2 for the test data.
set.seed(17)
index <- sample(x=nrow(wholesale_clean),
size = nrow(wholesale_clean)*0.8)
whs_train <- wholesale_clean[index,]
whs_test <- wholesale_clean[-index,]
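Note that sample() above draws row indices without regard to the class labels. If we wanted the split to preserve the Channel proportions exactly in both sets, a stratified split could be used instead — a minimal sketch, assuming the caret package is available:
library(caret)
set.seed(17)
# createDataPartition samples within each class, keeping the class proportions
idx_strat <- createDataPartition(wholesale_clean$Channel, p = 0.8, list = FALSE)
whs_train_strat <- wholesale_clean[idx_strat, ]
whs_test_strat <- wholesale_clean[-idx_strat, ]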
Logistic Regression Model
model_all <- glm(formula = Channel~., data = whs_train, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_all)
##
## Call:
## glm(formula = Channel ~ ., family = binomial, data = whs_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1914 -0.2779 -0.1555 0.0295 3.3453
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.087e+00 9.101e-01 -5.589 2.29e-08 ***
## Region2 2.023e+00 1.061e+00 1.907 0.0565 .
## Region3 1.300e+00 7.434e-01 1.749 0.0803 .
## Fresh 9.944e-06 2.137e-05 0.465 0.6417
## Milk 1.289e-04 6.984e-05 1.846 0.0649 .
## Grocery 6.131e-05 6.770e-05 0.906 0.3651
## Frozen -2.760e-04 1.332e-04 -2.072 0.0382 *
## Detergents_Paper 1.086e-03 1.786e-04 6.083 1.18e-09 ***
## Delicassen -4.384e-05 1.201e-04 -0.365 0.7151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 444.80 on 351 degrees of freedom
## Residual deviance: 140.15 on 343 degrees of freedom
## AIC: 158.15
##
## Number of Fisher Scoring iterations: 7
Result: Several predictors are insignificant at alpha = 0.05: Fresh, Milk, Grocery, and Delicassen all have p-values greater than 0.05 (note that Frozen, with p = 0.0382, is actually significant). We can eliminate the insignificant variables using stepwise selection.
StepWise “Both”
model_step <- step(object = model_all,
direction = "both",
trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (this warning was emitted 30 times, once for each candidate model fitted during the stepwise search)
summary(model_step)
##
## Call:
## glm(formula = Channel ~ Region + Milk + Frozen + Detergents_Paper,
## family = binomial, data = whs_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1976 -0.2825 -0.1577 0.0263 3.3124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.947e+00 8.947e-01 -5.529 3.22e-08 ***
## Region2 2.101e+00 1.069e+00 1.965 0.04941 *
## Region3 1.322e+00 7.433e-01 1.779 0.07520 .
## Milk 1.630e-04 5.935e-05 2.747 0.00602 **
## Frozen -2.743e-04 1.163e-04 -2.358 0.01836 *
## Detergents_Paper 1.156e-03 1.536e-04 7.524 5.32e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 444.80 on 351 degrees of freedom
## Residual deviance: 141.13 on 346 degrees of freedom
## AIC: 153.13
##
## Number of Fisher Scoring iterations: 7
Result: Four predictor variables remain in the model: Region, Milk, Frozen, and Detergents_Paper.
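Because logistic-regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to read. A quick sketch:
# exp(beta) is the multiplicative change in the odds of Channel = 2
# for a one-unit increase in the corresponding predictor
exp(coef(model_step))
For instance, the Detergents_Paper coefficient of about 1.156e-03 corresponds to an odds ratio of exp(0.001156) ≈ 1.0012 per monetary unit, which compounds quickly over the spending ranges seen in this data.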
Predicting with the Logistic Regression Model (After Stepwise)
whs_test$prediction <- predict(object = model_step,
newdata = whs_test,
type = "response")
Based on the dataset, we classify observations with a predicted probability greater than 0.5 as 2 (non-Horeca, i.e. the Retail channel) and the rest as 1 (Horeca).
whs_test$predict_label <- ifelse(test = whs_test$prediction>0.5,
yes = "2",
no="1")
head(whs_test)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 2 2 3 7057 9810 9568 1762 3293 1776
## 15 2 3 24653 9465 12091 294 5058 2168
## 17 2 3 1020 8816 12121 134 4508 1080
## 33 1 3 21632 1318 2886 266 918 405
## 37 1 3 29955 4362 5428 1729 862 4626
## 40 1 3 56159 555 902 10002 212 2916
## prediction predict_label
## 2 0.785296341 2
## 15 0.975469292 2
## 17 0.951913401 2
## 33 0.081525194 1
## 37 0.083813194 1
## 40 0.002393493 1
Converting the Predicted Label to a Factor
whs_test$predict_label <- as.factor(whs_test$predict_label)
Evaluating Logistic Regression Model
library(caret)
confusionMatrix(data = whs_test$predict_label,
reference = whs_test$Channel,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 55 3
## 2 6 24
##
## Accuracy : 0.8977
## 95% CI : (0.8147, 0.9522)
## No Information Rate : 0.6932
## P-Value [Acc > NIR] : 4.871e-06
##
## Kappa : 0.7668
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.9016
## Specificity : 0.8889
## Pos Pred Value : 0.9483
## Neg Pred Value : 0.8000
## Prevalence : 0.6932
## Detection Rate : 0.6250
## Detection Prevalence : 0.6591
## Balanced Accuracy : 0.8953
##
## 'Positive' Class : 1
##
Result: We get an accuracy of 89.77%, recall (sensitivity) of 90.16%, and precision (positive predictive value) of 94.83%.
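Since the 0.5 cutoff is a modelling choice, it can also be informative to evaluate the model across all possible cutoffs with a ROC curve and its AUC — a minimal sketch, assuming the pROC package is installed:
library(pROC)
# the ROC curve is built from the raw predicted probabilities, not the hard labels
roc_whs <- roc(response = whs_test$Channel, predictor = whs_test$prediction)
plot(roc_whs)
auc(roc_whs)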
K-Nearest Neighbor Preparation
whs_knn_train <- wholesale_clean[index,]
whs_knn_test <- wholesale_clean[-index,]
Checking the Class Balance of the Training Data
prop.table(table(whs_knn_train$Channel))
##
## 1 2
## 0.6732955 0.3267045
Defining the Predictor and Target Variables for the Train-Test Data
library(dplyr)
# predictor variables for the training set
whs_train_pred <- whs_knn_train %>%
  select(-Channel)
# target variable for the training set
whs_train_targ <- whs_knn_train %>%
  pull(Channel)
# predictor variables for the test set
whs_test_pred <- whs_knn_test %>%
  select(-Channel)
# target variable for the test set
whs_test_targ <- whs_knn_test %>%
  pull(Channel)
Scaling Data
The test set is scaled using the centers and scales computed from the training set, so that no information from the test set leaks into the preprocessing.
whs_train_pred_scale <-whs_train_pred %>%
scale()
whs_test_pred_scale <-whs_test_pred %>%
scale(center = attr(whs_train_pred_scale, "scaled:center"),
scale = attr(whs_train_pred_scale, "scaled:scale"))
Looking for the Optimum K Value
sqrt(nrow(whs_train_pred))
## [1] 18.76166
Result: A common rule of thumb sets K near the square root of the number of training observations; sqrt(352) ≈ 18.76, which we round to K = 19 (an odd K also avoids ties in the majority vote).
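The sqrt(n) heuristic is only a starting point; K can also be tuned by cross-validation, for example with caret's train() function — a sketch, assuming 5-fold CV over a grid of odd K values:
library(caret)
set.seed(17)
# 5-fold cross-validation over odd K values from 3 to 31
knn_tuned <- train(x = whs_train_pred_scale,
                   y = whs_train_targ,
                   method = "knn",
                   tuneGrid = data.frame(k = seq(3, 31, by = 2)),
                   trControl = trainControl(method = "cv", number = 5))
knn_tuned$bestTune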
K-Nearest Neighbor Model
library(class)
model_knn <- knn(train =whs_train_pred_scale,
test = whs_test_pred_scale,
cl =whs_train_targ,
k=19)
Evaluating KNN Model
confusionMatrix(data = model_knn,
reference = whs_test_targ,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 54 4
## 2 7 23
##
## Accuracy : 0.875
## 95% CI : (0.7873, 0.9359)
## No Information Rate : 0.6932
## P-Value [Acc > NIR] : 5.812e-05
##
## Kappa : 0.715
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.8852
## Specificity : 0.8519
## Pos Pred Value : 0.9310
## Neg Pred Value : 0.7667
## Prevalence : 0.6932
## Detection Rate : 0.6136
## Detection Prevalence : 0.6591
## Balanced Accuracy : 0.8685
##
## 'Positive' Class : 1
##
Result: We can observe that the accuracy, recall, and precision are 87.50%, 88.52%, and 93.10% respectively.
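To compare the two models side by side, the headline metrics from both confusion matrices can be collected into a small table — a sketch built from the values reported above:
# metrics copied from the two confusion matrices above
comparison <- data.frame(
  model = c("Logistic Regression", "KNN"),
  accuracy = c(0.8977, 0.8750),
  recall = c(0.9016, 0.8852),
  precision = c(0.9483, 0.9310)
)
comparison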
Conclusion
Based on the two models analyzed above, logistic regression outperforms KNN on all three confusion-matrix indicators (accuracy, recall, and precision), so the wholesale distributor would do better to use the logistic regression model when predicting whether a channel is Horeca or not. In this case, little loss occurs if the model misclassifies the response variable. However, if the response variable referred to an action carried out on a subject and carrying real risk, we would have to choose the model carefully, weighing all of the risks.
Reference
- Dataset source: https://archive.ics.uci.edu/ml/datasets/wholesale+customers