LBb5rev
BACKGROUND
LBB REQUIREMENTS
When writing the report, make sure to cover the following:
- Selection of the target variable, which depends on the perspective of the case you want to take.
- Data analysis and the process of selecting predictor variables (feature selection).
- Data pre-processing, from data cleansing to cross-validation.
- A description of the model evaluation: which metric is best for evaluating the model and why.
- Documentation of the analysis of how to improve model performance (for example, the process of selecting the optimum k for the current model) and/or a comparison of the logistic regression and k-NN models.
CASE STUDY
In recent years, the hotel, restaurant, and cafe (HORECA) industry has become one of the fastest growing markets in the world, and wholesale companies with a strong reputation are heavily involved in the FMCG supply chain that serves the HORECA segment. In this project we help a wholesale company distinguish Retail customers from HORECA customers using the Wholesale.csv dataset.
INSIGHT
The purpose of this analysis is to understand customer buying patterns from the perspective of a wholesale distributor: do the customers come from the Retail channel or from HORECA?
DATA PREPARATION
PACKAGES
These are the packages we used:
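The report does not list the packages explicitly; a minimal set assumed from the functions used later (confusionMatrix() from caret, knn() from class, plus dplyr for wrangling) is:

library(dplyr) # data wrangling
library(class) # knn()
library(caret) # confusionMatrix()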
DATA INPUT
The dataset used is wholesale.csv, in which customers are classified by Channel into two classes: Hotel, Restaurant and Cafe (HORECA) and Retail.
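A sketch of the data import (assuming the file is in the working directory):

wsl <- read.csv("wholesale.csv")
head(wsl)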
#> Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
#> 1 2 3 12669 9656 7561 214 2674 1338
#> 2 2 3 7057 9810 9568 1762 3293 1776
#> 3 2 3 6353 8808 7684 2405 3516 7844
#> 4 1 3 13265 1196 4221 6404 507 1788
#> 5 2 3 22615 5410 7198 3915 1777 5185
#> 6 2 3 9413 8259 5126 666 1795 1451
STRUCTURE AND MISSING VALUES
We check the structure of the data before moving on to the next step:
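The structure check below is assumed to come from str():

str(wsl)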
#> 'data.frame': 440 obs. of 8 variables:
#> $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
#> $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
#> $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
#> $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
#> $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
#> $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
#> $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
#> $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
There are no missing values:
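Assuming the check was done with anyNA():

anyNA(wsl)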
#> [1] FALSE
DATA WRANGLING
Drop the Region feature because we do not use it.
Channel must be converted to a factor with the labels HORECA and RETAIL.
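A minimal sketch of this wrangling step, assuming Channel 1 corresponds to HORECA and 2 to Retail (consistent with the class proportions shown below):

wsl <- wsl %>%
  select(-Region) %>%
  mutate(Channel = factor(Channel, levels = c(1, 2), labels = c("HORECA", "RETAIL")))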
EDA
Exploratory Data Analysis
Check the class proportions of Channel; the result is reasonably balanced, so no further balancing step is needed.
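The proportion table is assumed to come from:

prop.table(table(wsl$Channel))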
#>
#> HORECA RETAIL
#> 0.6772727 0.3227273
Cross Validation
We split the data into training and test sets (80/20), then check the class proportions of the training set:
set.seed(100)
intrain <- sample(nrow(wsl), nrow(wsl)*0.8)
wsl.train <- wsl[intrain, ]
wsl.test <- wsl[-intrain, ]
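The proportion check on the training labels is assumed to be:

prop.table(table(wsl.train$Channel))
#>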
#> HORECA RETAIL
#> 0.6789773 0.3210227
The classes are still balanced, so we can proceed to modeling with logistic regression and prediction with k-NN.
LOGISTIC REGRESSION
We want the best logistic regression model, so after fitting the full model we apply stepwise selection to obtain the set of predictors with the smallest AIC value, and then continue to prediction.
LOGISTIC MODEL
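The full model below reproduces the call shown in the summary output; the object name model.log is assumed:

model.log <- glm(Channel ~ ., family = "binomial", data = wsl.train)
summary(model.log)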
#>
#> Call:
#> glm(formula = Channel ~ ., family = "binomial", data = wsl.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7452 -0.3308 -0.2437 0.0535 3.2120
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.52671990 0.45063101 -7.826 0.00000000000000503 ***
#> Fresh 0.00001297 0.00001751 0.741 0.4590
#> Milk 0.00005286 0.00006000 0.881 0.3784
#> Grocery 0.00010705 0.00006134 1.745 0.0810 .
#> Frozen -0.00018934 0.00010350 -1.829 0.0674 .
#> Detergents_Paper 0.00082619 0.00014164 5.833 0.00000000544240551 ***
#> Delicassen -0.00007818 0.00011437 -0.684 0.4943
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 441.86 on 351 degrees of freedom
#> Residual deviance: 171.70 on 345 degrees of freedom
#> AIC: 185.7
#>
#> Number of Fisher Scoring iterations: 7
There are some insignificant variables, so we use the stepwise method to eliminate them.
STEPWISE MODEL
BACKWARD
We keep the model with the smallest AIC value.
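A sketch of the backward elimination, assuming the full model above is stored as model.log (the resulting object model.log.back is used in the prediction step):

model.log.back <- step(model.log, direction = "backward")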
#> Start: AIC=185.7
#> Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper +
#> Delicassen
#>
#> Df Deviance AIC
#> - Delicassen 1 172.17 184.17
#> - Fresh 1 172.25 184.25
#> - Milk 1 172.45 184.45
#> <none> 171.70 185.70
#> - Grocery 1 174.90 186.90
#> - Frozen 1 176.55 188.55
#> - Detergents_Paper 1 209.87 221.87
#>
#> Step: AIC=184.17
#> Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper
#>
#> Df Deviance AIC
#> - Fresh 1 172.58 182.58
#> - Milk 1 172.90 182.90
#> <none> 172.17 184.17
#> - Grocery 1 174.96 184.96
#> - Frozen 1 178.48 188.48
#> - Detergents_Paper 1 211.11 221.11
#>
#> Step: AIC=182.58
#> Channel ~ Milk + Grocery + Frozen + Detergents_Paper
#>
#> Df Deviance AIC
#> - Milk 1 173.38 181.38
#> <none> 172.58 182.58
#> - Grocery 1 175.62 183.62
#> - Frozen 1 178.67 186.67
#> - Detergents_Paper 1 211.41 219.41
#>
#> Step: AIC=181.38
#> Channel ~ Grocery + Frozen + Detergents_Paper
#>
#> Df Deviance AIC
#> <none> 173.38 181.38
#> - Frozen 1 178.69 184.69
#> - Grocery 1 179.41 185.41
#> - Detergents_Paper 1 213.03 219.03
LOGISTIC REGRESSION PREDICTION
log.Risk <- predict(model.log.back, newdata = wsl.test, type = "response")
# determine the class from the predicted probability
log.Label <- ifelse(log.Risk > 0.5, "RETAIL", "HORECA")
# convert the labels to factor
log.Label <- as.factor(log.Label)
head(log.Label)
#> 8 9 18 22 24 29
#> RETAIL HORECA HORECA HORECA RETAIL RETAIL
#> Levels: HORECA RETAIL
KNN
For the k-NN model we separate:
- the training data into predictors only and target only
- the test data into predictors only and target only
wsl.train.x <- wsl.train[,-1]
wsl.test.x <- wsl.test[,-1]
wsl.train.y <- wsl.train[,1]
wsl.test.y <- wsl.test[,1]
SCALING PREDICTORS
Next, we scale the predictors, using the training-set center and scale for the test set:
wsl.train.x <- scale(x = wsl.train.x)
wsl.test.x <- scale(x = wsl.test.x,
                    center = attr(wsl.train.x, "scaled:center"),
                    scale = attr(wsl.train.x, "scaled:scale"))
This is the k value we need to find:
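Assuming the usual rule of thumb k ≈ √(number of training rows):

sqrt(nrow(wsl.train.x))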
#> [1] 18.76166
KNN PREDICTION
nrow(wsl.test.x)
#> [1] 88
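The predictions below are assumed to come from class::knn with k = 19 (the odd number closest to the square root above); the object knn.Label is used later in the evaluation:

# k-NN prediction on the scaled predictors (k = 19 is an assumption)
knn.Label <- knn(train = wsl.train.x,
                 test = wsl.test.x,
                 cl = wsl.train.y,
                 k = 19)
knn.Label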
#> [1] RETAIL HORECA HORECA HORECA RETAIL RETAIL HORECA HORECA RETAIL RETAIL
#> [11] HORECA RETAIL RETAIL HORECA RETAIL RETAIL RETAIL HORECA HORECA RETAIL
#> [21] RETAIL HORECA HORECA HORECA HORECA HORECA RETAIL HORECA RETAIL HORECA
#> [31] HORECA HORECA RETAIL HORECA HORECA RETAIL HORECA HORECA RETAIL RETAIL
#> [41] RETAIL RETAIL RETAIL RETAIL HORECA HORECA RETAIL RETAIL HORECA RETAIL
#> [51] HORECA HORECA RETAIL HORECA HORECA HORECA HORECA HORECA HORECA HORECA
#> [61] HORECA HORECA RETAIL HORECA HORECA HORECA HORECA RETAIL HORECA HORECA
#> [71] HORECA HORECA HORECA RETAIL HORECA HORECA HORECA HORECA RETAIL HORECA
#> [81] HORECA HORECA HORECA HORECA HORECA HORECA HORECA HORECA
#> Levels: HORECA RETAIL
MODEL EVALUATION
Model evaluation is the process we use to choose between the models: do we pick logistic regression or k-NN?
LOGISTIC REGRESSION EVALUATION
cm_log <- confusionMatrix(data = log.Label,
reference = wsl.test$Channel,
positive = "RETAIL")
cm_log$overall[1]
#> Accuracy
#> 0.9090909
KNN EVALUATION
cm_knn <- confusionMatrix(data = knn.Label,
reference = wsl.test.y,
positive = "RETAIL")
cm_knn$overall[1]
#> Accuracy
#> 0.9431818
CONCLUSION
We choose the k-NN model because it achieves higher accuracy than logistic regression. With k-NN we also do not need to verify assumptions such as the absence of multicollinearity among variables or the linearity of variables, whereas the logistic regression workflow requires us to examine variable correlations first and to use stepwise regression and VIF. On the other hand, compared to k-NN, logistic regression does not require scaling the data and also works when there are non-numeric variables.