Introduction

Machine learning offers several methods for prediction. In this analysis I will work with the wholesale dataset and examine how its variables, both numeric and categorical, influence the target, Channel. Two methods will be used, Logistic Regression and K-NN; we will compare them and conclude which is better suited to this prediction task.

Library

Before we start, we load the libraries we will need:

library(tidyverse) # data wrangling and visualization (includes dplyr, ggplot2)
library(caret)     # confusion matrix, upsampling
library(plotly)    # interactive plots
library(ggplot2)   # plotting
library(GGally)    # pairwise plots
library(dplyr)     # data manipulation
library(car)       # regression diagnostics
library(class)     # knn()
library(lmtest)    # model tests
library(gtools)    # utility functions
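
If any of these packages are missing, they can be installed first with a one-off install.packages() call (a sketch; trim the vector to whatever is actually missing):

# one-off installation for any missing packages
# (ggplot2 and dplyr come bundled with tidyverse)
install.packages(c("tidyverse", "caret", "plotly", "GGally",
                   "car", "class", "lmtest", "gtools"))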

Read Data & Data Understanding

Import Data

Import the data that we have prepared, namely wholesale.csv. We use read.csv() to match the file's .csv extension.

wholesale <- read.csv("wholesale.csv")

Data Inspection

Let's take a quick look at the data content with head():

head(wholesale)

We check the data types with glimpse():

glimpse(wholesale)
#> Rows: 440
#> Columns: 8
#> $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
#> $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
#> $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
#> $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
#> $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
#> $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

From the glimpse() output above, we can see the data has 440 rows and 8 columns. Here is an explanation of the variables:

  • Channel : Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)
  • Region : Lisbon, Oporto or Other (Nominal)
  • Fresh : annual spending (m.u.) on fresh products (Continuous)
  • Milk : annual spending (m.u.) on milk products (Continuous)
  • Grocery : annual spending (m.u.) on grocery products (Continuous)
  • Frozen : annual spending (m.u.) on frozen products (Continuous)
  • Detergents_Paper : annual spending (m.u.) on detergents and paper products (Continuous)
  • Delicassen : annual spending (m.u.) on delicatessen products (Continuous)

Data Manipulation

Change Data Types

Region and Channel will be converted from integer to factor, because the numbers are category labels (regions and channel types), not quantities:

wholesale_clean <- wholesale %>%
  mutate(Channel = ifelse(Channel == 1, "Horeca", "Retail")) %>%
  mutate(Channel = as.factor(Channel),
         Region = as.factor(Region))

Check Missing values

colSums(is.na(wholesale_clean))
#>          Channel           Region            Fresh             Milk 
#>                0                0                0                0 
#>          Grocery           Frozen Detergents_Paper       Delicassen 
#>                0                0                0                0

There are no missing values in the data above.

Exploratory Data Analysis

Check the distribution/pattern of the data

summary(wholesale_clean)
#>    Channel    Region      Fresh             Milk          Grocery     
#>  Horeca:298   1: 77   Min.   :     3   Min.   :   55   Min.   :    3  
#>  Retail:142   2: 47   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153  
#>               3:316   Median :  8504   Median : 3627   Median : 4756  
#>                       Mean   : 12000   Mean   : 5796   Mean   : 7951  
#>                       3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656  
#>                       Max.   :112151   Max.   :73498   Max.   :92780  
#>      Frozen        Detergents_Paper    Delicassen     
#>  Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
#>  1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
#>  Median : 1526.0   Median :  816.5   Median :  965.5  
#>  Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
#>  3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
#>  Max.   :60869.0   Max.   :40827.0   Max.   :47943.0
  • The majority of Channel is Horeca.
  • Region 3 has the most observations.
  • Fresh has a wide range, with a minimum of 3, a maximum of 112,151, and a mean of 12,000.
  • Milk has a wide range, with a minimum of 55, a maximum of 73,498, and a mean of 5,796.
  • Grocery has a wide range, with a minimum of 3, a maximum of 92,780, and a mean of 7,951.
  • Frozen has a wide range, between a minimum of 25 and a maximum of 60,869, with a mean of 3,071.9.
  • Detergents_Paper has a wide range, with a minimum of 3, a maximum of 40,827, and a mean of 2,881.5.
  • Delicassen has a mean of 1,524.9 with a wide range, from a minimum of 3 to a maximum of 47,943.

We will now look at the relationship between the target and each predictor.

ggplot(wholesale_clean, mapping = aes(x = Region, y = reorder(Channel, Region)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
     title = "Total Channel In Wholesale",
     subtitle = "Channel Vs Region",
     x = "Region",
     y = NULL,
     fill = "Region"
   ) + theme_light()

From the plot above we can conclude that the composition of Retail and Horeca is similar in each region.

ggplot(wholesale_clean, mapping = aes(x = Fresh, y = reorder(Channel, Fresh)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
     title = "Total Channel In Wholesale",
     subtitle = "Channel Vs Fresh",
     x = "Fresh",
     y = NULL,
     fill = "Fresh"
   ) + theme_light()

From the plot above, the Horeca channel has higher Fresh spending than the Retail channel.

ggplot(wholesale_clean, mapping = aes(x = Milk, y = reorder(Channel, Milk)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
     title = "Total Channel In Wholesale",
     subtitle = "Channel Vs Milk",
     x = "Milk",
     y = NULL,
     fill = "Milk"
   ) + theme_light()

From the plot above, the Retail channel has higher Milk spending than the Horeca channel.

ggplot(wholesale_clean, mapping = aes(x = Grocery, y = reorder(Channel, Grocery)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
     title = "Total Channel In Wholesale",
     subtitle = "Channel Vs Grocery",
     x = "Grocery",
     y = NULL,
     fill = "Grocery"
   ) + theme_light()

From the plot above, the Retail channel has higher Grocery spending than the Horeca channel.

ggplot(wholesale_clean, mapping = aes(x = Frozen, y = reorder(Channel, Frozen)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
     title = "Total Channel In Wholesale",
     subtitle = "Channel Vs Frozen",
     x = "Frozen",
     y = NULL,
     fill = "Frozen"
   ) + theme_light()

From the plot above, the Horeca channel has higher Frozen spending than the Retail channel.

ggplot(wholesale_clean, mapping = aes(x = Detergents_Paper, y = reorder(Channel, Detergents_Paper)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
     title = "Total Channel In Wholesale",
     subtitle = "Channel Vs Detergents Paper",
     x = "Detergents Paper",
     y = NULL,
     fill = "Detergents Paper"
   ) + theme_light()

From the plot above, the Retail channel has higher Detergents_Paper spending than the Horeca channel.

ggplot(wholesale_clean, mapping = aes(x = Delicassen, y = reorder(Channel, Delicassen)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
     title = "Total Channel In Wholesale",
     subtitle = "Channel Vs Delicassen",
     x = "Delicassen",
     y = NULL,
     fill = "Delicassen"
   ) + theme_light()

From the plot above, the Retail channel has higher Delicassen spending than the Horeca channel.

Logistic Regression

Logistic Regression is a statistical model used to predict the probability of an event occurring. The outcome variable is binary (0 or 1), representing two possible states (e.g., yes/no, true/false). It applies the logistic function to transform a linear combination of the predictors into a probability.
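
As a quick illustration (a sketch, not part of the analysis itself), the logistic function that performs this transformation is easy to write directly in R:

# logistic (sigmoid) function: maps any real-valued linear predictor
# to a probability between 0 and 1
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(c(-4, 0, 4)) # approximately 0.018, 0.500, 0.982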

Cross validation

We will split the data into two parts, train data and test data, with a proportion of 80:20.

RNGkind(sample.kind = "Rounding")
set.seed(123)

# index sampling
index <- sample(nrow(wholesale_clean), nrow(wholesale_clean)*0.8)
# splitting
wholesale_train <- wholesale_clean[index,]
wholesale_test <- wholesale_clean[-index,]

Check the proportion of data in wholesale_train

prop.table(table(wholesale_train$Channel))
#> 
#>    Horeca    Retail 
#> 0.6846591 0.3153409

The classes are not balanced, so we will upsample to balance the training data.

RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

wholesale_train_up <- upSample(x = wholesale_train %>% select(-Channel),
                       y = wholesale_train$Channel,
                       yname = "Channel")

Check the proportion of data in wholesale_train_up

prop.table(table(wholesale_train_up$Channel))
#> 
#> Horeca Retail 
#>    0.5    0.5

Modeling

We will build the model with the glm() function, using all variables as predictors and Channel as the target.

model_logistic <- glm(Channel ~ ., wholesale_train_up, family = "binomial") 
summary(model_logistic)
#> 
#> Call:
#> glm(formula = Channel ~ ., family = "binomial", data = wholesale_train_up)
#> 
#> Coefficients:
#>                     Estimate  Std. Error z value         Pr(>|z|)    
#> (Intercept)      -4.13752189  0.65765804  -6.291 0.00000000031482 ***
#> Region2           2.32470217  0.74967925   3.101          0.00193 ** 
#> Region3           0.70848987  0.55851530   1.269          0.20461    
#> Fresh            -0.00001571  0.00001752  -0.897          0.36983    
#> Milk              0.00008356  0.00005547   1.506          0.13196    
#> Grocery           0.00011132  0.00005450   2.043          0.04110 *  
#> Frozen           -0.00005594  0.00004684  -1.194          0.23238    
#> Detergents_Paper  0.00095894  0.00013768   6.965 0.00000000000328 ***
#> Delicassen       -0.00012365  0.00011335  -1.091          0.27532    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 668.19  on 481  degrees of freedom
#> Residual deviance: 216.49  on 473  degrees of freedom
#> AIC: 234.49
#> 
#> Number of Fisher Scoring iterations: 8
  • From the model above, the residual deviance (the model's error) is 216.49.

  • From the logistic model above, the information loss / AIC is 234.49.

We will run backward step-wise selection to find the model with the smallest AIC, i.e. the smallest information loss.

model_step <- step(model_logistic, direction = "backward", trace = F)
summary(model_step)
#> 
#> Call:
#> glm(formula = Channel ~ Region + Grocery + Frozen + Detergents_Paper, 
#>     family = "binomial", data = wholesale_train_up)
#> 
#> Coefficients:
#>                     Estimate  Std. Error z value           Pr(>|z|)    
#> (Intercept)      -4.20701054  0.65137428  -6.459 0.0000000001056282 ***
#> Region2           2.21093277  0.72161499   3.064            0.00218 ** 
#> Region3           0.64284322  0.55470272   1.159            0.24650    
#> Grocery           0.00013960  0.00004278   3.263            0.00110 ** 
#> Frozen           -0.00007404  0.00004909  -1.508            0.13146    
#> Detergents_Paper  0.00097744  0.00012912   7.570 0.0000000000000373 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 668.19  on 481  degrees of freedom
#> Residual deviance: 220.69  on 476  degrees of freedom
#> AIC: 232.69
#> 
#> Number of Fisher Scoring iterations: 8
  • From the model above, the residual deviance is 220.69.

  • From the backward step-wise model, the AIC of 232.69 is smaller than that of the full logistic model (all predictors).

We will predict wholesale_test with the predict() function, based on model_step.

Predict

With model_step as our machine-learning model and wholesale_test as our test data, we call the predict() function below:

pred_test_Log <- predict(model_step, type = "response", newdata = wholesale_test)

We then plot the distribution of the predicted probabilities:

ggplot(wholesale_test, aes(x=pred_test_Log)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution Prediction Data Channel") +
  theme_minimal()

In the graph above, the predicted probabilities lean towards 1, which corresponds to Retail. We convert them into class labels with a 0.5 cutoff:

pred_wholesale_test <- ifelse(pred_test_Log > 0.5, "Retail", "Horeca") %>% as.factor()
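
To spot-check the labeling, we can line up the probability, the predicted class, and the actual class (a quick sketch using the objects defined above):

# side-by-side view of probability, predicted class, and actual class
head(data.frame(prob   = round(pred_test_Log, 3),
                pred   = pred_wholesale_test,
                actual = wholesale_test$Channel))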

Model Evaluation

To evaluate the model we have created, we will use a confusion matrix:

wholesale_log_con <- confusionMatrix(pred_wholesale_test, wholesale_test$Channel, positive = "Retail")
wholesale_log_con
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Horeca Retail
#>     Horeca     54      7
#>     Retail      3     24
#>                                           
#>                Accuracy : 0.8864          
#>                  95% CI : (0.8009, 0.9441)
#>     No Information Rate : 0.6477          
#>     P-Value [Acc > NIR] : 0.0000003334    
#>                                           
#>                   Kappa : 0.7434          
#>                                           
#>  Mcnemar's Test P-Value : 0.3428          
#>                                           
#>             Sensitivity : 0.7742          
#>             Specificity : 0.9474          
#>          Pos Pred Value : 0.8889          
#>          Neg Pred Value : 0.8852          
#>              Prevalence : 0.3523          
#>          Detection Rate : 0.2727          
#>    Detection Prevalence : 0.3068          
#>       Balanced Accuracy : 0.8608          
#>                                           
#>        'Positive' Class : Retail          
#> 
  • Sensitivity / Recall = how well the model identifies the positive class
  • Specificity = how well the model identifies the negative class
  • Accuracy = how well our model predicts the target classes overall
  • Precision = how many of the predicted positives are actually positive
Recall <- round(24/(24+7),2)
Specificity <- round((54)/(3+54),2)
Accuracy <- round((54+24)/nrow(wholesale_test),2)
Precision <- round(24/(24+3),2)

performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance

Tuning Cutoff

We tune the cutoff to see how the performance metrics trade off across different thresholds.

performa <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}

co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = pred_test_Log, 
                     ref = wholesale_test$Channel, 
                     postarget = "Retail", 
                     negtarget = "Horeca")
}

data_frame("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
                   "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = performa)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff model perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

Based on the performance tradeoff plot above, a cutoff of 0.5 gives relatively high specificity, accuracy and precision, but a rather low recall.
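
If recall mattered more (for example, to avoid missing Retail customers), a lower cutoff could be tried. A sketch reusing the performa() helper above, with 0.4 as a purely illustrative value:

# evaluate an illustrative lower cutoff, trading precision for recall
performa(cutoff = 0.4,
         prob = pred_test_Log,
         ref = wholesale_test$Channel,
         postarget = "Retail",
         negtarget = "Horeca")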

Model Interpretation

# Odds ratio all coefficients
exp(model_step$coefficients) %>% 
  data.frame() 

Model interpretation:

  • Region2 multiplies the odds of the Retail channel by 9.124223 relative to the baseline Region1, provided the other variables have the same value.
  • Region3 multiplies the odds of the Retail channel by 1.901881 relative to Region1, provided the other variables have the same value.
  • Every 1-point increase in Grocery multiplies the odds of the Retail channel by 1.00014, holding all other variables constant.
  • Every 1-point increase in Frozen multiplies the odds of the Retail channel by 0.999926 (a slight decrease), holding all other variables constant.
  • Every 1-point increase in Detergents_Paper multiplies the odds of the Retail channel by 1.000978, holding all other variables constant; see the sketch below for a more readable multiplier.
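
Because the per-unit multipliers of the spending variables sit so close to 1, the odds ratio for a larger increase is easier to read. A sketch for a 1,000 m.u. increase in Detergents_Paper, using model_step as fitted above:

# odds multiplier for a 1,000 m.u. increase in Detergents_Paper:
# exp(1000 * 0.00097744) is roughly 2.66, i.e. the odds of Retail
# become about 2.7 times higher
exp(1000 * coef(model_step)["Detergents_Paper"])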

K-NN

K-NN is a non-parametric classification and regression algorithm. It assumes that similar instances belong to the same class (or have similar values), and it works by finding the K nearest neighbors of a new instance and assigning the most frequent class among them. We will reuse the previous data with this K-NN machine-learning method.
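
To make the idea concrete, here is a minimal sketch (toy data, not the wholesale dataset) of a majority vote among the k nearest neighbors:

# toy illustration of the K-NN idea: classify one new point by majority
# vote among its k nearest neighbors (Euclidean distance)
train_x <- matrix(c(1, 1,  1.2, 0.8,  5, 5,  5.3, 4.9), ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "B", "B"))
new_x   <- c(1.1, 0.9)

dists <- sqrt(rowSums((train_x - matrix(new_x, nrow = 4, ncol = 2, byrow = TRUE))^2))
votes <- train_y[order(dists)][1:3]  # labels of the 3 nearest neighbors
names(which.max(table(votes)))       # majority class: "A"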

Read Data

Use head() to preview the data:

head(wholesale_clean)

Data Wrangling

Check the data types with glimpse():

glimpse(wholesale_clean)
#> Rows: 440
#> Columns: 8
#> $ Channel          <fct> Retail, Retail, Retail, Horeca, Retail, Retail, Retai…
#> $ Region           <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
#> $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
#> $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
#> $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
#> $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

We will only use numeric data types as predictors.

wholesale_knn <- wholesale_clean %>% select(-Region)

Exploratory Data Analysis

Check the target proportion

prop.table(table(wholesale_knn$Channel))
#> 
#>    Horeca    Retail 
#> 0.6772727 0.3227273

Cross Validation

We will split the data into two parts, train data and test data, with a proportion of 80:20.

RNGkind(sample.kind = "Rounding")
set.seed(123)

# index sampling
index <- sample(nrow(wholesale_knn), nrow(wholesale_knn)*0.8)
# splitting
wholesale_knn_train <- wholesale_knn[index,]
wholesale_knn_test <- wholesale_knn[-index,] # rows not included in train

Check the target proportion wholesale_knn_train

prop.table(table(wholesale_knn_train$Channel))
#> 
#>    Horeca    Retail 
#> 0.6846591 0.3153409

The classes are not balanced, so we will upsample:

RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

wholesale_knn_train_up <- upSample(x = wholesale_knn_train %>% select(-Channel),
                       y = wholesale_knn_train$Channel,
                       yname = "Channel")

Check the target proportion wholesale_knn_train_up

prop.table(table(wholesale_knn_train_up$Channel))
#> 
#> Horeca Retail 
#>    0.5    0.5

Data Pre-Processing

For k-NN, the predictors and labels (target variables) are separated.

# train predictors
wholesale_knn_train_x <- wholesale_knn_train_up %>% select_if(is.numeric) 

# train target
wholesale_knn_train_y <- wholesale_knn_train_up[,"Channel"]

# test predictors
wholesale_knn_test_x <- wholesale_knn_test %>% select_if(is.numeric)

# test target
wholesale_knn_test_y <- wholesale_knn_test[,"Channel"]

Because the predictors have very different ranges, the next step is to scale them, using the train data's center and scale.

# scaling data
# train
wholesale_knn_train_xs <- scale(x = wholesale_knn_train_x) # train predictors

# test
wholesale_knn_test_xs <- scale(x = wholesale_knn_test_x,
                      center = attr(wholesale_knn_train_xs, "scaled:center"), # column means of the train data
                      scale = attr(wholesale_knn_train_xs,"scaled:scale")) # column sds of the train data
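
As a quick sanity check (a sketch), every scaled training column should now have mean ≈ 0 and standard deviation ≈ 1:

# each scaled train column should have mean ~0 and sd ~1
round(colMeans(wholesale_knn_train_xs), 3)
round(apply(wholesale_knn_train_xs, 2, sd), 3)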

We determine the K value with the common square-root rule below:

sqrt(nrow(wholesale_knn_train))
#> [1] 18.76166

K Optimum = 19
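
Because the target has two classes, an odd K avoids tied votes. A small sketch that rounds the square root to the nearest odd number:

# round sqrt(n) to the nearest odd integer so a two-class vote cannot tie
k_opt <- round(sqrt(nrow(wholesale_knn_train)))
if (k_opt %% 2 == 0) k_opt <- k_opt + 1
k_opt # 19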

Predict

The K-NN model does not require a separate training step: classification happens directly from the training data at prediction time.

library(class) 
pred_knn_wholsesale <- knn(train = wholesale_knn_train_xs, # prediktor data train
                test = wholesale_knn_test_xs, # prediktor data test
                cl = wholesale_knn_train_y, # label dari data train
                k = 19)
head(pred_knn_wholsesale)
#> [1] Retail Horeca Retail Retail Retail Retail
#> Levels: Horeca Retail

Model Evaluation

# confusion matrix
library(caret)
wholesale_knn_conf <- confusionMatrix(pred_knn_wholsesale, wholesale_knn_test_y, positive = "Retail")
wholesale_knn_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Horeca Retail
#>     Horeca     53      4
#>     Retail      4     27
#>                                           
#>                Accuracy : 0.9091          
#>                  95% CI : (0.8287, 0.9599)
#>     No Information Rate : 0.6477          
#>     P-Value [Acc > NIR] : 0.00000001509   
#>                                           
#>                   Kappa : 0.8008          
#>                                           
#>  Mcnemar's Test P-Value : 1               
#>                                           
#>             Sensitivity : 0.8710          
#>             Specificity : 0.9298          
#>          Pos Pred Value : 0.8710          
#>          Neg Pred Value : 0.9298          
#>              Prevalence : 0.3523          
#>          Detection Rate : 0.3068          
#>    Detection Prevalence : 0.3523          
#>       Balanced Accuracy : 0.9004          
#>                                           
#>        'Positive' Class : Retail          
#> 

Model Evaluation Logistic Regression and K-NN

eval_wholesale_log <- tibble(Accuracy = wholesale_log_con$overall[1],
                             Recall = wholesale_log_con$byClass[1],
                             Specificity = wholesale_log_con$byClass[2],
                             Precision = wholesale_log_con$byClass[3])

eval_wholesale_knn <- tibble(Accuracy = wholesale_knn_conf$overall[1],
                             Recall = wholesale_knn_conf$byClass[1],
                             Specificity = wholesale_knn_conf$byClass[2],
                             Precision = wholesale_knn_conf$byClass[3])
# Model Evaluation Logistic Regression
eval_wholesale_log
# Model Evaluation K-NN
eval_wholesale_knn
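
For a side-by-side view, the two evaluation frames can be stacked (a sketch using dplyr::bind_rows()):

# stack both evaluations for a side-by-side comparison
bind_rows("Logistic Regression" = eval_wholesale_log,
          "K-NN" = eval_wholesale_knn,
          .id = "Model")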

Comparing the two methods, Logistic Regression and K-NN, the K-NN model is better at correctly identifying the Retail channel: its recall of 87.1% is higher than the logistic regression model's 77.4%.

Conclusion

If I were a wholesale manager, considering how well each model guesses the channel correctly, I would use the K-NN model. If a customer who actually belongs to Retail is predicted as Horeca, the risk is that Retail runs out of stock. Therefore, I focus on the Recall metric, which measures the model's performance on the positive (Retail) class.