Intro

The data set describes clients of a wholesale distributor and includes their annual spending in monetary units (m.u.) on diverse product categories. Source: https://archive.ics.uci.edu/ml/datasets/wholesale+customers. I will predict the distribution channel of these clients using logistic regression and k-nearest neighbors (KNN) models.

Data Preparation

Library

library(dplyr)   # data wrangling
library(caret)   # downSample() and confusionMatrix()
library(ggplot2) # visualization
library(plotly)  # interactive plots
library(class)   # knn()

Data Import

wholesale <- read.csv("data_input/wholesale.csv")
head(wholesale)
glimpse(wholesale)
#> Rows: 440
#> Columns: 8
#> $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
#> $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
#> $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~

Attribute Information:

  1. FRESH: annual spending (m.u.) on fresh products (Continuous)
  2. MILK: annual spending (m.u.) on milk products (Continuous)
  3. GROCERY: annual spending (m.u.) on grocery products (Continuous)
  4. FROZEN: annual spending (m.u.) on frozen products (Continuous)
  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  7. CHANNEL: customer channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal)
  8. REGION: customer region - Lisbon, Oporto or Other (Nominal)

Data Manipulation

wholesale <- wholesale %>% 
  select(-Region) %>%   # Region is not used as a predictor here
  mutate(Channel = factor(Channel, levels = c(1, 2),
                          labels = c("Horeca", "Retail")))
str(wholesale)
#> 'data.frame':    440 obs. of  7 variables:
#>  $ Channel         : Factor w/ 2 levels "Horeca","Retail": 2 2 2 1 2 2 2 2 1 2 ...
#>  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
#>  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
#>  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
#>  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
#>  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
#>  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...

Checking for missing values

anyNA(wholesale)
#> [1] FALSE
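
There are no missing values. As an extra per-column check (my own addition; output omitted), colSums() works as well:

colSums(is.na(wholesale))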

Pre-Processing Data

Splitting Data

RNGkind(sample.kind = "Rounding") # use pre-R-3.6.0 sampling for reproducibility
set.seed(123)

# index sampling
index <- sample(x = nrow(wholesale), 
                size = nrow(wholesale)*0.8) 

# splitting
sale_train <- wholesale[index, ]
sale_test <- wholesale[-index, ]

Checking the class proportions

prop.table(table(sale_train$Channel))
#> 
#>    Horeca    Retail 
#> 0.6846591 0.3153409

The class proportion is roughly 7:3 (Horeca vs. Retail), so the training data is imbalanced and a model could favor the majority class. I will balance it before modeling.

Balancing Data

RNGkind(sample.kind = "Rounding") 
set.seed(123)

sale_train <- downSample(x = sale_train %>% 
                           select(-Channel),
                         y = sale_train$Channel,
                         yname = "Channel")
prop.table(table(sale_train$Channel))
#> 
#> Horeca Retail 
#>    0.5    0.5
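
downSample() from caret randomly drops majority-class (Horeca) rows until both classes are equal in size. The absolute counts can be checked with table(); here that leaves 111 observations per class (222 rows in total, consistent with the 221 null-deviance degrees of freedom in the model summary below):

table(sale_train$Channel)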

Logistic Regression Model

model_all <- glm(formula = Channel~., data = sale_train, family = binomial)
summary(model_all)
#> 
#> Call:
#> glm(formula = Channel ~ ., family = binomial, data = sale_train)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.95563  -0.38246  -0.00624   0.24385   3.09751  
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -2.930e+00  6.078e-01  -4.821 1.43e-06 ***
#> Fresh             2.116e-05  2.266e-05   0.934    0.350    
#> Milk              1.022e-04  6.874e-05   1.486    0.137    
#> Grocery           5.982e-05  6.691e-05   0.894    0.371    
#> Frozen           -1.971e-04  1.155e-04  -1.706    0.088 .  
#> Detergents_Paper  8.510e-04  1.697e-04   5.014 5.33e-07 ***
#> Delicassen       -1.372e-04  1.290e-04  -1.063    0.288    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 307.76  on 221  degrees of freedom
#> Residual deviance: 118.05  on 215  degrees of freedom
#> AIC: 132.05
#> 
#> Number of Fisher Scoring iterations: 7
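
In the full model only Detergents_Paper is significant at the 5% level (Frozen is borderline at 10%), which suggests that stepwise selection by AIC will drop most of the remaining predictors.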

Stepwise Model

model_step <- step(object = model_all, direction = "both", trace = FALSE)
summary(model_step)
#> 
#> Call:
#> glm(formula = Channel ~ Milk + Frozen + Detergents_Paper, family = binomial, 
#>     data = sale_train)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.97606  -0.39799  -0.00721   0.25024   3.00802  
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -2.603e+00  5.075e-01  -5.130 2.90e-07 ***
#> Milk              1.241e-04  5.789e-05   2.144   0.0321 *  
#> Frozen           -2.003e-04  1.085e-04  -1.847   0.0647 .  
#> Detergents_Paper  8.759e-04  1.383e-04   6.332 2.42e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 307.76  on 221  degrees of freedom
#> Residual deviance: 120.47  on 218  degrees of freedom
#> AIC: 128.47
#> 
#> Number of Fisher Scoring iterations: 7
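
To interpret the selected coefficients on the odds scale, we can exponentiate them (a quick check of my own; output omitted):

exp(coef(model_step))

For example, each extra m.u. spent on Detergents_Paper multiplies the odds of the Retail channel by exp(8.759e-04) ≈ 1.0009, so an extra 1,000 m.u. roughly multiplies the odds by exp(0.8759) ≈ 2.4.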
Prediction

predictions <- predict(object = model_step, newdata = sale_test, type = "response")

# label an observation "Retail" when its predicted probability exceeds 0.5
sale_test$prediction <- factor(ifelse(predictions > 0.5, "Retail", "Horeca"),
                               levels = c("Horeca", "Retail"))

Model Evaluation

confusionMatrix(data = sale_test$prediction, reference = sale_test$Channel, positive = "Retail")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Horeca Retail
#>     Horeca     54      5
#>     Retail      3     26
#>                                           
#>                Accuracy : 0.9091          
#>                  95% CI : (0.8287, 0.9599)
#>     No Information Rate : 0.6477          
#>     P-Value [Acc > NIR] : 1.509e-08       
#>                                           
#>                   Kappa : 0.7978          
#>                                           
#>  Mcnemar's Test P-Value : 0.7237          
#>                                           
#>             Sensitivity : 0.8387          
#>             Specificity : 0.9474          
#>          Pos Pred Value : 0.8966          
#>          Neg Pred Value : 0.9153          
#>              Prevalence : 0.3523          
#>          Detection Rate : 0.2955          
#>    Detection Prevalence : 0.3295          
#>       Balanced Accuracy : 0.8930          
#>                                           
#>        'Positive' Class : Retail          
#> 

The model performs well, with an accuracy of 90.91%: about 91% of the test observations are classified correctly. Sensitivity is 83.87% and specificity is 94.74%.
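
As a sanity check, these metrics can be recomputed by hand from the confusion matrix above, with Retail as the positive class:

26 / (26 + 5)   # sensitivity = TP / (TP + FN) = 0.8387
54 / (54 + 3)   # specificity = TN / (TN + FP) = 0.9474
(54 + 26) / 88  # accuracy = (TP + TN) / n = 0.9091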

K-Nearest Neighbor (KNN) Model

We keep using the same train/test split as before, but knn() needs the predictors and the target variable passed separately.

sale_train_predictor <- sale_train %>% select(-Channel)

sale_train_target <- sale_train %>% pull(Channel)

sale_test_predictor <- sale_test %>% select(-c("Channel", "prediction"))

sale_test_target <- sale_test %>% pull(Channel)

Scaling

# z-score scaling of the training predictors
sale_train_predictor_scale <- sale_train_predictor %>% 
  scale()

# scale the test predictors with the TRAINING means and standard deviations,
# so no information leaks from the test set into pre-processing
sale_test_predictor_scale <- sale_test_predictor %>% 
  scale(center = attr(sale_train_predictor_scale, "scaled:center"),
        scale = attr(sale_train_predictor_scale, "scaled:scale"))
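
A quick way to verify the scaling (my own check; output omitted): the scaled training columns should have mean ~0 and standard deviation ~1.

round(colMeans(sale_train_predictor_scale), 3)
round(apply(sale_train_predictor_scale, 2, sd), 3)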

Finding the Optimum K-Value

sqrt(nrow(sale_train_predictor_scale))
#> [1] 14.89966

A common rule of thumb is k ≈ sqrt(n). With 222 training rows, sqrt(222) ≈ 14.9, so we take k = 15; an odd k also avoids ties between the two classes.
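
The same choice can be written as a small helper (my own sketch) that rounds sqrt(n) to the nearest odd integer:

# round sqrt(n) to the nearest integer, then bump to odd if needed
k_candidate <- round(sqrt(nrow(sale_train_predictor_scale)))
k_opt <- ifelse(k_candidate %% 2 == 0, k_candidate + 1, k_candidate)
k_opt
#> [1] 15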

K-NN Model

channel_pred <- knn(train = sale_train_predictor_scale,
                    test = sale_test_predictor_scale,
                    cl = sale_train_target,
                    k = 15
                    )

Evaluation

confusionMatrix(data = channel_pred,
                reference = sale_test_target,
                positive = "Retail")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Horeca Retail
#>     Horeca     51      5
#>     Retail      6     26
#>                                           
#>                Accuracy : 0.875           
#>                  95% CI : (0.7873, 0.9359)
#>     No Information Rate : 0.6477          
#>     P-Value [Acc > NIR] : 1.329e-06       
#>                                           
#>                   Kappa : 0.7281          
#>                                           
#>  Mcnemar's Test P-Value : 1               
#>                                           
#>             Sensitivity : 0.8387          
#>             Specificity : 0.8947          
#>          Pos Pred Value : 0.8125          
#>          Neg Pred Value : 0.9107          
#>              Prevalence : 0.3523          
#>          Detection Rate : 0.2955          
#>    Detection Prevalence : 0.3636          
#>       Balanced Accuracy : 0.8667          
#>                                           
#>        'Positive' Class : Retail          
#> 

This model also performs well, with an accuracy of 87.5%, meaning that 87.5% of the test observations are classified correctly. Sensitivity is 83.87% and specificity is 89.47%.

Conclusion

Based on the evaluation results of the two models above, the best model for predicting the channel in the wholesale data is the stepwise logistic regression model: its test accuracy (90.91%) is higher than the KNN model's (87.5%), with the same sensitivity (83.87%) and a higher specificity (94.74% vs. 89.47%).