1 Explanation

1.1 Brief

We will build classification models on the Wholesale customers dataset. We want to understand the relationships among the variables, especially between the sales channel and the other variables, and to predict the channel of new customers from historical data.

1.2 Data’s Point of View

In the following code we use a dataset prepared by Margarida Cardoso and available on the UCI Machine Learning Repository.

2 Data Preparation

Load the required packages.

library(tidyverse)
library(GGally) 
library(car)
library(caret)
library(class)
library(dplyr)

options(scipen = 100, max.print = 1e+06)

Load the dataset.

# Read the dataset in, drop the "Region" feature because it's not interesting
wholesale <- read.csv("data_input/wholesale.csv")
wholesale <- wholesale[,-2]
glimpse(wholesale)
## Rows: 440
## Columns: 7
## $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

The data has 440 rows and 7 columns.

2.1 Check Missing Values

anyNA(wholesale)
## [1] FALSE

3 Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for any patterns that may indicate correlation between them.

Find the Pearson correlation between features.

ggcorr(wholesale, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The plot shows that a few of the variables have a fairly strong correlation with the Channel variable.

Let’s convert the ‘Channel’ feature into a new factor called ‘Industry’ and give its levels descriptive labels.

wholesale$Industry <- factor(wholesale$Channel, levels = c(1, 2), labels = c("horeca", "retail"))

# After doing that we can remove the original Channel feature
wholesale <- wholesale[,-1]

wholesale <-  wholesale %>%
  mutate(Industry=as.factor(Industry))
         
table(wholesale$Industry)
## 
## horeca retail 
##    298    142

Notice here that, unlike the credit risk analysis example, we do not have a balanced dataset. The prior or baseline accuracy for predicting the majority class would be 67.7%.
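
The same baseline can be checked directly from the class table (298 / 440 ≈ 0.677):

prop.table(table(wholesale$Industry))
# horeca ≈ 0.677, retail ≈ 0.323: always predicting "horeca" is right about 67.7% of the time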

4 Modeling

4.1 Holdout: Train-Test Split

Before we build the models, we need to split the data into a train set and a test set. We will use the train set to fit the classification models. The test set will be used as a comparison, to check whether a model overfits and cannot predict new data it has not seen during training. We will use 80% of the data for training and the rest for testing.

Normalization to z-score:

wholesale.z <- data.frame(scale(wholesale[,-7]))
summary(wholesale.z)
##      Fresh              Milk            Grocery            Frozen        
##  Min.   :-0.9486   Min.   :-0.7779   Min.   :-0.8364   Min.   :-0.62763  
##  1st Qu.:-0.7015   1st Qu.:-0.5776   1st Qu.:-0.6101   1st Qu.:-0.47988  
##  Median :-0.2764   Median :-0.2939   Median :-0.3363   Median :-0.31844  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.3901   3rd Qu.: 0.1889   3rd Qu.: 0.2846   3rd Qu.: 0.09935  
##  Max.   : 7.9187   Max.   : 9.1732   Max.   : 8.9264   Max.   :11.90545  
##  Detergents_Paper    Delicassen     
##  Min.   :-0.6037   Min.   :-0.5396  
##  1st Qu.:-0.5505   1st Qu.:-0.3960  
##  Median :-0.4331   Median :-0.1984  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2182   3rd Qu.: 0.1047  
##  Max.   : 7.9586   Max.   :16.4597
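
As a quick sanity check (purely illustrative, not part of the original analysis), the z-score produced by scale() for a single column can be reproduced by hand:

# z-score: subtract the column mean and divide by the column standard deviation
all.equal(as.numeric(scale(wholesale$Fresh)),
          (wholesale$Fresh - mean(wholesale$Fresh)) / sd(wholesale$Fresh))
# should return TRUE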

Merge the new data frame containing the z-score values with the Industry vector.

wholesale.n <- data.frame(cbind(wholesale.z, Industry = wholesale$Industry))

Let’s split the dataset into train and test sets:

# fix the RNG behaviour and seed so the split is reproducible
RNGkind(sample.kind = "Rounding")
set.seed(123)

# sample 80% of the row indices for the training set
index <- sample(nrow(wholesale.n), nrow(wholesale.n) * 0.8)

# 80% train - 20% test
wholesale_train <-wholesale.n[index,]
wholesale_test <- wholesale.n[-index,]
table(wholesale_train$Industry)
## 
## horeca retail 
##    241    111

4.2 Logistic Regression

We train the logistic regression model. For a single feature \(X\), the equation of a logistic regression model is: \[\log\left(\frac{p(X)}{1-p(X)}\right) = b_0 + b_1 X\] The left-hand side is called the log-odds or logit. On the right-hand side, \(b_0\) is the model intercept and \(b_1\) is the coefficient of feature \(X\).
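
Because the model works on the log-odds scale, a fitted logit can always be mapped back to a probability with the inverse-logit function. A tiny illustration (not part of the original analysis):

# inverse logit: p = exp(logit) / (1 + exp(logit)), available in base R as plogis()
plogis(0)   # a log-odds of 0 corresponds to a probability of 0.5
plogis(2)   # a log-odds of 2 corresponds to a probability of about 0.88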

logit_mod <- glm(Industry ~ ., data = wholesale_train, family = binomial("logit"))

summary(logit_mod)
## 
## Call:
## glm(formula = Industry ~ ., family = binomial("logit"), data = wholesale_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8246  -0.2821  -0.2092   0.0294   3.4120  
## 
## Coefficients:
##                  Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)      -0.64396    0.24972  -2.579      0.00992 ** 
## Fresh             0.04209    0.24377   0.173      0.86291    
## Milk              0.59561    0.46204   1.289      0.19737    
## Grocery           1.06188    0.65788   1.614      0.10651    
## Frozen           -1.00895    0.54665  -1.846      0.06493 .  
## Detergents_Paper  4.00611    0.74907   5.348 0.0000000889 ***
## Delicassen       -0.44435    0.39039  -1.138      0.25503    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 438.81  on 351  degrees of freedom
## Residual deviance: 153.08  on 345  degrees of freedom
## AIC: 167.08
## 
## Number of Fisher Scoring iterations: 7

Use backward stepwise selection to select features.

# model stepwise
logit_mod_step <- step(logit_mod,
                   direction = "backward",
                   trace = F)
summary(logit_mod_step)
## 
## Call:
## glm(formula = Industry ~ Grocery + Frozen + Detergents_Paper, 
##     family = binomial("logit"), data = wholesale_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8312  -0.2966  -0.2235   0.0340   3.2083  
## 
## Coefficients:
##                  Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)       -0.6433     0.2367  -2.717      0.00658 ** 
## Grocery            1.2108     0.5244   2.309      0.02095 *  
## Frozen            -0.9456     0.4137  -2.286      0.02226 *  
## Detergents_Paper   4.0783     0.7062   5.775 0.0000000077 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 438.81  on 351  degrees of freedom
## Residual deviance: 156.38  on 348  degrees of freedom
## AIC: 164.38
## 
## Number of Fisher Scoring iterations: 7

The output shows the summary of our stepwise logistic regression model. The predictors that remain are Grocery, Frozen, and Detergents_Paper. The intercept is -0.6433, and the coefficients are 1.2108 for Grocery, -0.9456 for Frozen, and 4.0783 for Detergents_Paper; all of them have small p-values (< 0.05), indicating that they are significant and should not be removed from the equation. Because the predictors are z-scaled, the intercept describes a customer with average spending in every category: its log-odds of -0.6433 corresponds to a probability of about 0.34 of belonging to the retail channel (James et al., 2013). The positive coefficients mean that above-average spending on Grocery and, especially, Detergents_Paper raises the odds of the retail class, while above-average spending on Frozen lowers it. The standard error of each coefficient can be used to build a confidence interval around its estimate.
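
Since the predictors are z-scaled, exponentiating the coefficients gives the multiplicative change in the odds of the retail class per one standard deviation of spending. A small sketch using the logit_mod_step object fitted above (the numbers in the comments follow from the summary):

# odds ratio per 1-standard-deviation increase of each predictor
exp(coef(logit_mod_step))
# e.g. exp(4.0783) is roughly 59: one standard deviation more spending on
# Detergents_Paper multiplies the odds of being a retail customer by about 59

# probability of "retail" for a customer with average (scaled = 0) spending everywhere
plogis(coef(logit_mod_step)[["(Intercept)"]])
# plogis(-0.6433) is about 0.34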

4.3 K-NN

K-NN classifies an observation by looking at its nearest “neighbours”. In other words, K-NN looks at the class of the data point(s) with the shortest distance to the data we want to classify. The distance is measured with the Euclidean distance, so points that lie farther away are less likely to be among the chosen neighbours.

\[dist(a, b) = \sqrt{\sum\limits^{m}_{i=1}(a_i-b_i)^2}\]
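
As a small illustration of this formula (not part of the original analysis), the Euclidean distance between two observations can be computed directly:

# Euclidean distance between two numeric vectors a and b
euclidean <- function(a, b) sqrt(sum((a - b)^2))

euclidean(c(1, 2, 3), c(4, 6, 3))   # sqrt(9 + 16 + 0) = 5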

We can choose the number of neighbours (K) used to classify our data. First, we try the common rule of thumb for K: the square root of the number of rows in the train set.

# recheck class balance
prop.table(table(wholesale_train$Industry))
## 
##    horeca    retail 
## 0.6846591 0.3153409

For k-NN, we separate the predictor variables from the target variable.

# predictor variables (x: the numeric columns to be scaled)
wholesale_train_x <- wholesale_train %>% select_if(is.numeric)
wholesale_test_x <- wholesale_test %>% select_if(is.numeric)

# target variable (y: the categorical Industry column)
wholesale_train_y <- wholesale_train[,7]
wholesale_test_y <- wholesale_test[,7]

We scale the training predictors and then apply the same center and scale values to the test predictors, so both sets use exactly the same transformation.

# scaling data
wholesale_train_x_scale <- scale(wholesale_train_x) 
attr(wholesale_train_x_scale, "scaled:center")
##            Fresh             Milk          Grocery           Frozen 
##       0.04207774       0.01388679       0.02790082       0.05516416 
## Detergents_Paper       Delicassen 
##       0.02691581       0.01860786
attr(wholesale_train_x_scale, "scaled:scale")
##            Fresh             Milk          Grocery           Frozen 
##         1.050467         1.052833         1.061578         1.085773 
## Detergents_Paper       Delicassen 
##         1.066339         1.056819
# scale the test set with the center and scale learned from the train data
wholesale_test_x_scale <- scale(x=wholesale_test_x,
                           center = attr(wholesale_train_x_scale, "scaled:center"),
                           scale = attr(wholesale_train_x_scale, "scaled:scale"))

head(wholesale_test_x_scale)
##          Fresh        Milk     Grocery      Frozen Detergents_Paper  Delicassen
## 3  -0.46512530  0.37440548 -0.05277607 -0.17733313       0.09955963  2.10266364
## 5   0.75890672 -0.06290046 -0.10095047  0.10913608      -0.24248376  1.21048367
## 6  -0.23480074  0.30375187 -0.30633596 -0.50724700      -0.23894335 -0.04239334
## 7  -0.03059468 -0.34744532 -0.12305518 -0.54253394       0.02560431 -0.34638546
## 14  0.65368001  0.03979823  0.67063286 -0.04642998       0.72719648 -0.32726013
## 17 -0.86653735  0.37543504  0.38703832 -0.60817522       0.29467582 -0.16687577

k-NN does not build an explicit model, so we go straight to prediction.

5 Evaluation

Evaluation of the models will be done with a confusion matrix. A confusion matrix is a table that shows four different categories: True Positive, True Negative, False Positive, and False Negative.

The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how much of our data is correctly predicted. Sensitivity measures how many of the positive outcomes are correctly predicted. Specificity measures how many of the negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.
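
These four metrics can be written directly in terms of the confusion-matrix cells. As a worked example, the counts below are taken from the logistic regression confusion matrix shown in the next subsection:

# cells of the logistic regression confusion matrix (positive class = "retail")
tp <- 20; tn <- 54; fp <- 3; fn <- 11

(tp + tn) / (tp + tn + fp + fn)   # accuracy    = 0.8409
tp / (tp + fn)                    # sensitivity = 0.6452
tn / (tn + fp)                    # specificity = 0.9474
tp / (tp + fp)                    # precision   = 0.8696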

5.1 Logistic Regression

Logistic regression returns a value in the range [0, 1] rather than a binary class. The value is an estimate of the probability that the observation belongs to the positive class (here, retail).

pred_logit <- predict(logit_mod_step,newdata = wholesale_test,type = "response")

rmarkdown::paged_table(head(as.data.frame(pred_logit),10))

We convert the probabilities into classes using a threshold value: any value above the threshold is classified as the positive class. By default, the threshold is 0.5.

# determine the class based on the threshold 0.5
pred_class <- as.factor(if_else(pred_logit > 0.5, "retail", "horeca"))

 

# confusion matrix
perf_logit1 <- confusionMatrix(data = pred_class, reference = wholesale_test$Industry, 
    positive = "retail")
perf_logit1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction horeca retail
##     horeca     54     11
##     retail      3     20
##                                           
##                Accuracy : 0.8409          
##                  95% CI : (0.7475, 0.9102)
##     No Information Rate : 0.6477          
##     P-Value [Acc > NIR] : 0.00004804      
##                                           
##                   Kappa : 0.6296          
##                                           
##  Mcnemar's Test P-Value : 0.06137         
##                                           
##             Sensitivity : 0.6452          
##             Specificity : 0.9474          
##          Pos Pred Value : 0.8696          
##          Neg Pred Value : 0.8308          
##              Prevalence : 0.3523          
##          Detection Rate : 0.2273          
##    Detection Prevalence : 0.2614          
##       Balanced Accuracy : 0.7963          
##                                           
##        'Positive' Class : retail          
## 

The result shows that our logistic regression model has an accuracy of 84.09% on the test set, meaning that 84.09% of the test data is correctly classified. The sensitivity is 64.52% and the specificity is 94.74%: roughly two-thirds of the actual retail customers are correctly identified, while almost all horeca customers are correctly classified. The precision (positive predictive value) is 86.96%, meaning that 86.96% of our retail predictions are correct.
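
If you prefer to pull these numbers out programmatically instead of reading them off the printout, the caret object stores them (a small sketch using the perf_logit1 object created above):

# accuracy and per-class metrics stored by caret::confusionMatrix()
perf_logit1$overall["Accuracy"]
perf_logit1$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")]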

5.2 K-NN

We want to find the optimum K. A common rule of thumb is the square root of the number of rows in the train data; here that is about 18.8, which we round to K = 19.

sqrt(nrow(wholesale_train))
## [1] 18.76166
#library(class)  
wholesale_pred <- knn(train = wholesale_train_x_scale,
                 test = wholesale_test_x_scale,
                 cl = wholesale_train_y,
                 k= 19)
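
The choice K = 19 follows the square-root rule of thumb, and an odd K also avoids ties in the majority vote. As an illustrative sketch (not part of the original analysis), one could compare a few neighbouring odd values of K on the test set:

# test-set accuracy for a few odd values of K
ks <- c(15, 17, 19, 21, 23)
acc <- sapply(ks, function(k) {
  pred <- knn(train = wholesale_train_x_scale,
              test = wholesale_test_x_scale,
              cl = wholesale_train_y,
              k = k)
  mean(pred == wholesale_test_y)
})
data.frame(k = ks, accuracy = acc)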

Check the confusion matrix.

confusionMatrix(data = wholesale_pred,
                reference = wholesale_test_y,
                positive="retail")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction horeca retail
##     horeca     54      9
##     retail      3     22
##                                           
##                Accuracy : 0.8636          
##                  95% CI : (0.7739, 0.9275)
##     No Information Rate : 0.6477          
##     P-Value [Acc > NIR] : 0.000004804     
##                                           
##                   Kappa : 0.6874          
##                                           
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.7097          
##             Specificity : 0.9474          
##          Pos Pred Value : 0.8800          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.3523          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.2841          
##       Balanced Accuracy : 0.8285          
##                                           
##        'Positive' Class : retail          
## 

The result shows that our K-NN model with K = 19 has an accuracy of 86.36% on the test set, meaning that 86.36% of the test data is correctly classified. The sensitivity is 70.97% and the specificity is 94.74%: most of the actual retail customers are correctly identified, and almost all horeca customers are correctly classified as well. The precision (positive predictive value) is 88%, meaning that 88% of our retail predictions are correct.

6 Conclusion

Both models perform well in terms of accuracy, and the difference between logistic regression and K-NN is small. However, K-NN is slightly better in accuracy, sensitivity, precision, and kappa, so we conclude that the K-NN model is the better choice in this case.