LBb5rev
BACKGROUND
LBB REQUIREMENTS
When writing the report, make sure to cover the following:
- Selection of the target variable, which depends on the perspective of the case you want to take.
- Data analysis and the process of selecting predictor variables (feature selection).
- Data pre-processing, from data cleansing to cross-validation.
- A description of the model evaluation: which metric is best for evaluating the model and why.
- Documentation of the analysis of how to improve model performance (for example, the process of selecting the optimum k for the current model) and/or a comparison of the logistic regression and k-NN models.
CASE STUDY
In recent years, the hotel, restaurant, and cafe (HORECA) industry has become one of the fastest growing markets in the world, and wholesale companies with a strong reputation are heavily involved in the FMCG supply chain that serves the HORECA segment. In this project we help a wholesale company distinguish Retail customers from HORECA customers using the Wholesale.csv dataset.
INSIGHT
The purpose of this analysis is to understand customer buying patterns from the perspective of a wholesale distributor: do the customers come from the Retail channel or from HORECA?
DATA PREPARATION
PACKAGES
These are the packages we used:
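The report does not list the packages explicitly; a minimal set assumed from the functions used later (confusionMatrix() from caret, knn() from class, plus dplyr for wrangling) is:

library(dplyr) # data wrangling
library(class) # knn()
library(caret) # confusionMatrix()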
DATA INPUT
The dataset used is wholesale.csv, in which customers are classified by Channel into two classes: Hotel, Restaurant and Cafe (HORECA) and Retail.
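A sketch of the data import (assuming the file is in the working directory):

wsl <- read.csv("wholesale.csv")
head(wsl)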
#> Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
#> 1 2 3 12669 9656 7561 214 2674 1338
#> 2 2 3 7057 9810 9568 1762 3293 1776
#> 3 2 3 6353 8808 7684 2405 3516 7844
#> 4 1 3 13265 1196 4221 6404 507 1788
#> 5 2 3 22615 5410 7198 3915 1777 5185
#> 6 2 3 9413 8259 5126 666 1795 1451
STRUCTURE AND MISSING VALUES
We check the structure of the data before moving on to the next step:
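The structure check below is assumed to come from str():

str(wsl)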
#> 'data.frame': 440 obs. of 8 variables:
#> $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
#> $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
#> $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
#> $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
#> $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
#> $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
#> $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
#> $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
There are no missing values:
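Assuming the check was done with anyNA():

anyNA(wsl)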
#> [1] FALSE
DATA WRANGLING
Drop the Region feature because we do not use it.
Channel must be converted to a factor with the labels HORECA and RETAIL.
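A minimal sketch of this wrangling step, assuming Channel 1 corresponds to HORECA and 2 to Retail (consistent with the class proportions shown below):

wsl <- wsl %>%
  select(-Region) %>%
  mutate(Channel = factor(Channel, levels = c(1, 2), labels = c("HORECA", "RETAIL")))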
EDA
Exploratory Data Analysis
Check the class proportions of Channel; the result is reasonably balanced, so no further balancing step is needed.
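The proportion table is assumed to come from:

prop.table(table(wsl$Channel))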
#>
#> HORECA RETAIL
#> 0.6772727 0.3227273
Cross Validation
We split the data into training and test sets (80/20), then check the class proportions of the training set:
set.seed(100)
intrain <- sample(nrow(wsl), nrow(wsl)*0.8)
wsl.train <- wsl[intrain, ]
wsl.test <- wsl[-intrain, ]
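The proportion check on the training labels is assumed to be:

prop.table(table(wsl.train$Channel))
#>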
#> HORECA RETAIL
#> 0.6789773 0.3210227
The classes are still balanced, so we can proceed to modeling with logistic regression and prediction with k-NN.
LOGISTIC REGRESSION
We want the best logistic regression model, so after fitting the full model we apply stepwise selection to obtain the set of predictors with the smallest AIC value, and then continue to prediction.
LOGISTIC MODEL
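The full model below reproduces the call shown in the summary output; the object name model.log is assumed:

model.log <- glm(Channel ~ ., family = "binomial", data = wsl.train)
summary(model.log)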
#>
#> Call:
#> glm(formula = Channel ~ ., family = "binomial", data = wsl.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7452 -0.3308 -0.2437 0.0535 3.2120
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.52671990 0.45063101 -7.826 0.00000000000000503 ***
#> Fresh 0.00001297 0.00001751 0.741 0.4590
#> Milk 0.00005286 0.00006000 0.881 0.3784
#> Grocery 0.00010705 0.00006134 1.745 0.0810 .
#> Frozen -0.00018934 0.00010350 -1.829 0.0674 .
#> Detergents_Paper 0.00082619 0.00014164 5.833 0.00000000544240551 ***
#> Delicassen -0.00007818 0.00011437 -0.684 0.4943
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 441.86 on 351 degrees of freedom
#> Residual deviance: 171.70 on 345 degrees of freedom
#> AIC: 185.7
#>
#> Number of Fisher Scoring iterations: 7
There are some insignificant variables, so we use the stepwise method to eliminate them.
STEPWISE MODEL
BACKWARD
We keep the model with the smallest AIC value.
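A sketch of the backward elimination, assuming the full model above is stored as model.log (the resulting object model.log.back is used in the prediction step):

model.log.back <- step(model.log, direction = "backward")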
#> Start: AIC=185.7
#> Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper +
#> Delicassen
#>
#> Df Deviance AIC
#> - Delicassen 1 172.17 184.17
#> - Fresh 1 172.25 184.25
#> - Milk 1 172.45 184.45
#> <none> 171.70 185.70
#> - Grocery 1 174.90 186.90
#> - Frozen 1 176.55 188.55
#> - Detergents_Paper 1 209.87 221.87
#>
#> Step: AIC=184.17
#> Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper
#>
#> Df Deviance AIC
#> - Fresh 1 172.58 182.58
#> - Milk 1 172.90 182.90
#> <none> 172.17 184.17
#> - Grocery 1 174.96 184.96
#> - Frozen 1 178.48 188.48
#> - Detergents_Paper 1 211.11 221.11
#>
#> Step: AIC=182.58
#> Channel ~ Milk + Grocery + Frozen + Detergents_Paper
#>
#> Df Deviance AIC
#> - Milk 1 173.38 181.38
#> <none> 172.58 182.58
#> - Grocery 1 175.62 183.62
#> - Frozen 1 178.67 186.67
#> - Detergents_Paper 1 211.41 219.41
#>
#> Step: AIC=181.38
#> Channel ~ Grocery + Frozen + Detergents_Paper
#>
#> Df Deviance AIC
#> <none> 173.38 181.38
#> - Frozen 1 178.69 184.69
#> - Grocery 1 179.41 185.41
#> - Detergents_Paper 1 213.03 219.03
LOGISTIC REGRESSION PREDICTION
log.Risk <- predict(model.log.back, newdata = wsl.test, type = "response")
# determine the class from the predicted probability
log.Label <- ifelse(log.Risk > 0.5, "RETAIL", "HORECA")
# convert the labels to factor
log.Label <- as.factor(log.Label)
head(log.Label)
#> 8 9 18 22 24 29
#> RETAIL HORECA HORECA HORECA RETAIL RETAIL
#> Levels: HORECA RETAIL
KNN
For the k-NN model we separate:
- the training data into predictors only and target only
- the test data into predictors only and target only
wsl.train.x <- wsl.train[,-1]
wsl.test.x <- wsl.test[,-1]
wsl.train.y <- wsl.train[,1]
wsl.test.y <- wsl.test[,1]
SCALING PREDICTORS
Next, we scale the predictors, using the training-set center and scale for the test set:
wsl.train.x <- scale(x = wsl.train.x)
wsl.test.x <- scale(x = wsl.test.x,
                    center = attr(wsl.train.x, "scaled:center"),
                    scale = attr(wsl.train.x, "scaled:scale"))
This is the k value we need to find:
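Assuming the usual rule of thumb k ≈ √(number of training rows):

sqrt(nrow(wsl.train.x))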
#> [1] 18.76166
KNN PREDICTION
nrow(wsl.test.x)
#> [1] 88
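The predictions below are assumed to come from class::knn with k = 19 (the odd number closest to the square root above); the object knn.Label is used later in the evaluation:

# k-NN prediction on the scaled predictors (k = 19 is an assumption)
knn.Label <- knn(train = wsl.train.x,
                 test = wsl.test.x,
                 cl = wsl.train.y,
                 k = 19)
knn.Label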
#> [1] RETAIL HORECA HORECA HORECA RETAIL RETAIL HORECA HORECA RETAIL RETAIL
#> [11] HORECA RETAIL RETAIL HORECA RETAIL RETAIL RETAIL HORECA HORECA RETAIL
#> [21] RETAIL HORECA HORECA HORECA HORECA HORECA RETAIL HORECA RETAIL HORECA
#> [31] HORECA HORECA RETAIL HORECA HORECA RETAIL HORECA HORECA RETAIL RETAIL
#> [41] RETAIL RETAIL RETAIL RETAIL HORECA HORECA RETAIL RETAIL HORECA RETAIL
#> [51] HORECA HORECA RETAIL HORECA HORECA HORECA HORECA HORECA HORECA HORECA
#> [61] HORECA HORECA RETAIL HORECA HORECA HORECA HORECA RETAIL HORECA HORECA
#> [71] HORECA HORECA HORECA RETAIL HORECA HORECA HORECA HORECA RETAIL HORECA
#> [81] HORECA HORECA HORECA HORECA HORECA HORECA HORECA HORECA
#> Levels: HORECA RETAIL
MODEL EVALUATION
Model evaluation is the process we use to choose between the models: do we pick logistic regression or k-NN?
LOGISTIC REGRESSION EVALUATION
cm_log <- confusionMatrix(data = log.Label,
reference = wsl.test$Channel,
positive = "RETAIL")
cm_log$overall[1]
#> Accuracy
#> 0.9090909
KNN EVALUATION
cm_knn <- confusionMatrix(data = knn.Label,
reference = wsl.test.y,
positive = "RETAIL")
cm_knn$overall[1]
#> Accuracy
#> 0.9431818
CONCLUSION
We choose the k-NN model because it achieves higher accuracy than logistic regression. With k-NN we also do not need to verify assumptions such as the absence of multicollinearity among variables or the linearity of variables, whereas the logistic regression workflow requires us to examine variable correlations first and to use stepwise regression and VIF. On the other hand, compared to k-NN, logistic regression does not require scaling the data and also works when there are non-numeric variables.