Intro

The data set describes clients of a wholesale distributor and includes their annual spending in monetary units (m.u.) on diverse product categories. Source: https://archive.ics.uci.edu/ml/datasets/wholesale+customers. I will predict the distribution channel of these clients using logistic regression and k-nearest neighbors (KNN) models.

Data Preparation

Library

library(dplyr)   # data wrangling
library(caret)   # downSample() and confusionMatrix()
library(ggplot2) # visualization
library(plotly)  # interactive plots
library(class)   # knn()

Data Import

wholesale <- read.csv("data_input/wholesale.csv")
head(wholesale)
glimpse(wholesale)
#> Rows: 440
#> Columns: 8
#> $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
#> $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
#> $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~

Attribute Information:

  1. FRESH: annual spending (m.u.) on fresh products (Continuous)
  2. MILK: annual spending (m.u.) on milk products (Continuous)
  3. GROCERY: annual spending (m.u.) on grocery products (Continuous)
  4. FROZEN: annual spending (m.u.) on frozen products (Continuous)
  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  7. CHANNEL: customer channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal)
  8. REGION: customer region - Lisbon, Oporto or Other (Nominal)

Data Manipulation

wholesale <- wholesale %>% 
  select(-Region) %>%   # Region is not used as a predictor here
  mutate(Channel = factor(Channel, levels = c(1, 2),
                          labels = c("Horeca", "Retail")))
str(wholesale)
#> 'data.frame':    440 obs. of  7 variables:
#>  $ Channel         : Factor w/ 2 levels "Horeca","Retail": 2 2 2 1 2 2 2 2 1 2 ...
#>  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
#>  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
#>  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
#>  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
#>  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
#>  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...

Checking for missing values

anyNA(wholesale)
#> [1] FALSE
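
There are no missing values. As an extra per-column check (my own addition; output omitted), colSums() works as well:

colSums(is.na(wholesale))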

Pre-Processing Data

Splitting Data

RNGkind(sample.kind = "Rounding") # use pre-R-3.6.0 sampling for reproducibility
set.seed(123)

# index sampling
index <- sample(x = nrow(wholesale), 
                size = nrow(wholesale)*0.8) 

# splitting
sale_train <- wholesale[index, ]
sale_test <- wholesale[-index, ]

Checking the class proportions

prop.table(table(sale_train$Channel))
#> 
#>    Horeca    Retail 
#> 0.6846591 0.3153409

The class proportion is roughly 7:3 (Horeca vs. Retail), so the training data is imbalanced and a model could favor the majority class. I will balance it before modeling.

Balancing Data

RNGkind(sample.kind = "Rounding") 
set.seed(123)

sale_train <- downSample(x = sale_train %>% 
                           select(-Channel),
                         y = sale_train$Channel,
                         yname = "Channel")
prop.table(table(sale_train$Channel))
#> 
#> Horeca Retail 
#>    0.5    0.5
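
downSample() from caret randomly drops majority-class (Horeca) rows until both classes are equal in size. The absolute counts can be checked with table(); here that leaves 111 observations per class (222 rows in total, consistent with the 221 null-deviance degrees of freedom in the model summary below):

table(sale_train$Channel)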

Logistic Regression Model

model_all <- glm(formula = Channel~., data = sale_train, family = binomial)
summary(model_all)
#> 
#> Call:
#> glm(formula = Channel ~ ., family = binomial, data = sale_train)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.95563  -0.38246  -0.00624   0.24385   3.09751  
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -2.930e+00  6.078e-01  -4.821 1.43e-06 ***
#> Fresh             2.116e-05  2.266e-05   0.934    0.350    
#> Milk              1.022e-04  6.874e-05   1.486    0.137    
#> Grocery           5.982e-05  6.691e-05   0.894    0.371    
#> Frozen           -1.971e-04  1.155e-04  -1.706    0.088 .  
#> Detergents_Paper  8.510e-04  1.697e-04   5.014 5.33e-07 ***
#> Delicassen       -1.372e-04  1.290e-04  -1.063    0.288    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 307.76  on 221  degrees of freedom
#> Residual deviance: 118.05  on 215  degrees of freedom
#> AIC: 132.05
#> 
#> Number of Fisher Scoring iterations: 7
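
In the full model only Detergents_Paper is significant at the 5% level (Frozen is borderline at 10%), which suggests that stepwise selection by AIC will drop most of the remaining predictors.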

Stepwise Model

model_step <- step(object = model_all, direction = "both", trace = FALSE)
summary(model_step)
#> 
#> Call:
#> glm(formula = Channel ~ Milk + Frozen + Detergents_Paper, family = binomial, 
#>     data = sale_train)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.97606  -0.39799  -0.00721   0.25024   3.00802  
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -2.603e+00  5.075e-01  -5.130 2.90e-07 ***
#> Milk              1.241e-04  5.789e-05   2.144   0.0321 *  
#> Frozen           -2.003e-04  1.085e-04  -1.847   0.0647 .  
#> Detergents_Paper  8.759e-04  1.383e-04   6.332 2.42e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 307.76  on 221  degrees of freedom
#> Residual deviance: 120.47  on 218  degrees of freedom
#> AIC: 128.47
#> 
#> Number of Fisher Scoring iterations: 7
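
To interpret the selected coefficients on the odds scale, we can exponentiate them (a quick check of my own; output omitted):

exp(coef(model_step))

For example, each extra m.u. spent on Detergents_Paper multiplies the odds of the Retail channel by exp(8.759e-04) ≈ 1.0009, so an extra 1,000 m.u. roughly multiplies the odds by exp(0.8759) ≈ 2.4.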
Prediction

predictions <- predict(object = model_step, newdata = sale_test, type = "response")

# label an observation "Retail" when its predicted probability exceeds 0.5
sale_test$prediction <- factor(ifelse(predictions > 0.5, "Retail", "Horeca"),
                               levels = c("Horeca", "Retail"))

Model Evaluation

confusionMatrix(data = sale_test$prediction, reference = sale_test$Channel, positive = "Retail")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Horeca Retail
#>     Horeca     54      5
#>     Retail      3     26
#>                                           
#>                Accuracy : 0.9091          
#>                  95% CI : (0.8287, 0.9599)
#>     No Information Rate : 0.6477          
#>     P-Value [Acc > NIR] : 1.509e-08       
#>                                           
#>                   Kappa : 0.7978          
#>                                           
#>  Mcnemar's Test P-Value : 0.7237          
#>                                           
#>             Sensitivity : 0.8387          
#>             Specificity : 0.9474          
#>          Pos Pred Value : 0.8966          
#>          Neg Pred Value : 0.9153          
#>              Prevalence : 0.3523          
#>          Detection Rate : 0.2955          
#>    Detection Prevalence : 0.3295          
#>       Balanced Accuracy : 0.8930          
#>                                           
#>        'Positive' Class : Retail          
#> 

The model performs well, with an accuracy of 90.91%: about 91% of the test observations are classified correctly. Sensitivity is 83.87% and specificity is 94.74%.
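
As a sanity check, these metrics can be recomputed by hand from the confusion matrix above, with Retail as the positive class:

26 / (26 + 5)   # sensitivity = TP / (TP + FN) = 0.8387
54 / (54 + 3)   # specificity = TN / (TN + FP) = 0.9474
(54 + 26) / 88  # accuracy = (TP + TN) / n = 0.9091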

K-Nearest Neighbor (KNN) Model

We keep using the same train/test split as before, but knn() needs the predictors and the target variable passed separately.

sale_train_predictor <- sale_train %>% select(-Channel)

sale_train_target <- sale_train %>% pull(Channel)

sale_test_predictor <- sale_test %>% select(-c("Channel", "prediction"))

sale_test_target <- sale_test %>% pull(Channel)

Scaling

# z-score scaling of the training predictors
sale_train_predictor_scale <- sale_train_predictor %>% 
  scale()

# scale the test predictors with the TRAINING means and standard deviations,
# so no information leaks from the test set into pre-processing
sale_test_predictor_scale <- sale_test_predictor %>% 
  scale(center = attr(sale_train_predictor_scale, "scaled:center"),
        scale = attr(sale_train_predictor_scale, "scaled:scale"))
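
A quick way to verify the scaling (my own check; output omitted): the scaled training columns should have mean ~0 and standard deviation ~1.

round(colMeans(sale_train_predictor_scale), 3)
round(apply(sale_train_predictor_scale, 2, sd), 3)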

Finding the Optimum K-Value

sqrt(nrow(sale_train_predictor_scale))
#> [1] 14.89966

A common rule of thumb is k ≈ sqrt(n). With 222 training rows, sqrt(222) ≈ 14.9, so we take k = 15; an odd k also avoids ties between the two classes.
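
The same choice can be written as a small helper (my own sketch) that rounds sqrt(n) to the nearest odd integer:

# round sqrt(n) to the nearest integer, then bump to odd if needed
k_candidate <- round(sqrt(nrow(sale_train_predictor_scale)))
k_opt <- ifelse(k_candidate %% 2 == 0, k_candidate + 1, k_candidate)
k_opt
#> [1] 15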

K-NN Model

channel_pred <- knn(train = sale_train_predictor_scale,
                    test = sale_test_predictor_scale,
                    cl = sale_train_target,
                    k = 15
                    )

Evaluation

confusionMatrix(data = channel_pred,
                reference = sale_test_target,
                positive = "Retail")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Horeca Retail
#>     Horeca     51      5
#>     Retail      6     26
#>                                           
#>                Accuracy : 0.875           
#>                  95% CI : (0.7873, 0.9359)
#>     No Information Rate : 0.6477          
#>     P-Value [Acc > NIR] : 1.329e-06       
#>                                           
#>                   Kappa : 0.7281          
#>                                           
#>  Mcnemar's Test P-Value : 1               
#>                                           
#>             Sensitivity : 0.8387          
#>             Specificity : 0.8947          
#>          Pos Pred Value : 0.8125          
#>          Neg Pred Value : 0.9107          
#>              Prevalence : 0.3523          
#>          Detection Rate : 0.2955          
#>    Detection Prevalence : 0.3636          
#>       Balanced Accuracy : 0.8667          
#>                                           
#>        'Positive' Class : Retail          
#> 

This model also performs well, with an accuracy of 87.5%, meaning that 87.5% of the test observations are classified correctly. Sensitivity is 83.87% and specificity is 89.47%.

Conclusion

Based on the evaluation results of the two models above, the best model for predicting the channel in the wholesale data is the stepwise logistic regression model: its test accuracy (90.91%) is higher than the KNN model's (87.5%), with the same sensitivity (83.87%) and a higher specificity (94.74% vs. 89.47%).