Intro
Greetings
Hi! Welcome.
I am using a dataset from an external source (UCI). I hope you'll enjoy it.

Content
We will classify the business industry based on product purchases.
Target variable: Channel
I will use the same threshold (>= 0.4) and set Retail as the positive class for the comparison between Logistic Regression and KNN.
- FRESH: annual spending (m.u.) on fresh products (Continuous);
- MILK: annual spending (m.u.) on milk products (Continuous);
- GROCERY: annual spending (m.u.) on grocery products (Continuous);
- FROZEN: annual spending (m.u.) on frozen products (Continuous);
- DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
- DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
- CHANNEL: customers’ Channel - Horeca (Hotel/Restaurant/Café) or Retail channel (Nominal)
- REGION: customers' Region – Lisbon, Oporto or Other (Nominal)

Descriptive statistics (minimum, maximum, mean, std. deviation):
- FRESH: 3, 112151, 12000.30, 12647.329
- MILK: 55, 73498, 5796.27, 7380.377
- GROCERY: 3, 92780, 7951.28, 9503.163
- FROZEN: 25, 60869, 3071.93, 4854.673
- DETERGENTS_PAPER: 3, 40827, 2881.49, 4767.854
- DELICATESSEN: 3, 47943, 1524.87, 2820.106

Region frequencies: Lisbon 77, Oporto 47, Other Region 316 (Total 440)
Channel frequencies: Horeca 298, Retail 142 (Total 440)
Load the required libraries
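The library-loading code isn't visible in this extract; a minimal sketch of packages that would cover the steps below (the exact list is an assumption):

```r
# Assumed package list, inferred from the outputs shown below -- not the author's exact code.
library(dplyr)    # glimpse(), mutate(), group_by()/summarise()
library(tidyr)    # pivot_longer() for the distribution plots
library(ggplot2)  # plotting
library(GGally)   # ggcorr() correlation plot
library(caret)    # confusionMatrix()
library(class)    # knn() for the KNN comparison
```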
Load Dataset
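The read step isn't shown either; assuming the UCI "Wholesale customers" CSV is saved locally (file name hypothetical), it could be loaded and inspected like this:

```r
# "wholesale.csv" is a hypothetical local copy of the UCI Wholesale customers data
ws <- read.csv("wholesale.csv")
glimpse(ws)
head(ws)
```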
Observations: 440
Variables: 8
$ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, ...
$ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
$ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126,...
$ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 495...
$ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 942...
$ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, ...
$ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321...
$ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566...
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
Let's check the correlation between all parameters.
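The correlation plot itself isn't reproduced in this extract; one way to produce it (using GGally's ggcorr(), an assumption) while Channel is still numeric:

```r
# pairwise correlations of all numeric columns, including the 1/2-coded Channel
ggcorr(ws, label = TRUE, label_round = 2)
```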
Interpretation:
Milk, Grocery, and Detergents_Paper appear to have high correlations with Channel.
Find out which value is Horeca and which is Retail, then rename the Channel column to Business and convert it to a factor type.
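The call behind the table below isn't shown; the "F" column header suggests something along these lines (a sketch, not the author's exact code):

```r
# how many rows fall into each Channel code
ws %>%
  group_by(Channel) %>%
  summarise(F = n())
```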
# A tibble: 2 x 2
Channel F
<int> <int>
1 1 298
2 2 142
1 = Horeca; 2 = Retail
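A hypothetical recoding step that matches the table below (1 = Horeca, 2 = Retail):

```r
# rename Channel to Business, convert to a labelled factor, and move it to the front
ws <- ws %>%
  mutate(Business = factor(Channel, levels = c(1, 2), labels = c("Horeca", "Retail"))) %>%
  select(-Channel) %>%
  select(Business, everything())
head(ws)
```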
Business Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 Retail 3 12669 9656 7561 214 2674 1338
2 Retail 3 7057 9810 9568 1762 3293 1776
3 Retail 3 6353 8808 7684 2405 3516 7844
4 Horeca 3 13265 1196 4221 6404 507 1788
5 Retail 3 22615 5410 7198 3915 1777 5185
6 Retail 3 9413 8259 5126 666 1795 1451
What's inside Region?
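Again a sketch of the call that could have produced the count below:

```r
ws %>%
  group_by(Region) %>%
  summarise(n())
```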
# A tibble: 3 x 2
Region `n()`
<int> <int>
1 1 77
2 2 47
3 3 316
1 = Lisbon; 2 = Oporto; 3 = Other Region
Data Exploration
Target Variable : Business
Let's delete the Region column, since its correlations with the other variables are mostly close to zero.
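A one-line sketch of that step:

```r
# drop Region; only the spending columns and the target remain
ws <- ws %>% select(-Region)
```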
Check proportion in target variable
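For example:

```r
# class balance of the target variable
prop.table(table(ws$Business))
```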
Horeca Retail
0.6772727 0.3227273
Our dataset consists of roughly 68% Horeca and 32% Retail. Because the classes are imbalanced, we can't rely on accuracy alone to evaluate our model.
Check distribution within each variable
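The charts aren't reproduced here; a sketch of per-segment boxplots split by Business (assuming tidyr and ggplot2):

```r
# one boxplot per spending segment, Horeca vs Retail side by side
ws %>%
  pivot_longer(cols = -Business, names_to = "segment", values_to = "spending") %>%
  ggplot(aes(x = Business, y = spending)) +
  geom_boxplot() +
  facet_wrap(~ segment, scales = "free_y")
```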
Interpretation:
Horeca mostly shows higher values than Retail in several purchasing segments, especially Delicassen, Fresh, and Frozen products. From the chart above, it looks like no single variable perfectly separates the two classes.
Logistic Regression
Using all parameters
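The train/test split isn't shown in this extract; the 330 training rows in the deviance output below and the 110-row confusion matrices later suggest roughly a 75/25 split. A sketch with hypothetical seed and object names:

```r
set.seed(100)  # hypothetical seed; the original split is not shown
idx <- sample(nrow(ws), size = 0.75 * nrow(ws))
ws_train <- ws[idx, ]
ws_test  <- ws[-idx, ]

# logistic regression on all predictors
model_init <- glm(Business ~ ., family = "binomial", data = ws_train)
summary(model_init)
```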
Call:
glm(formula = Business ~ ., family = "binomial", data = ws_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7997 -0.3028 -0.2261 0.0493 3.4111
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.55338992 0.47296921 -7.513 0.0000000000000578 ***
Fresh 0.00001656 0.00001905 0.869 0.3847
Milk 0.00007411 0.00006342 1.169 0.2426
Grocery 0.00007413 0.00006237 1.189 0.2346
Frozen -0.00024514 0.00012468 -1.966 0.0493 *
Detergents_Paper 0.00089039 0.00014776 6.026 0.0000000016810356 ***
Delicassen -0.00007869 0.00012060 -0.652 0.5141
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 154.40 on 323 degrees of freedom
AIC: 168.4
Number of Fisher Scoring iterations: 7
Interpretation:
The result above indicates that the dataset contains a perfect separator: the fit produces the warning 'glm.fit: fitted probabilities numerically 0 or 1 occurred'. We need to find which parameter is the perfect separator.
Finding the Perfect Separator Using Logistic Regression
Parameter: Fresh
Call:
glm(formula = Business ~ Fresh, family = "binomial", data = ws_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0235 -0.9324 -0.7886 1.3619 2.0330
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.37335552 0.17283225 -2.160 0.03076 *
Fresh -0.00003504 0.00001235 -2.836 0.00456 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 403.01 on 328 degrees of freedom
AIC: 407.01
Number of Fisher Scoring iterations: 4
Interpretation :
Fresh is not a perfect separator.
Parameter: Milk
Call:
glm(formula = Business ~ Milk, family = "binomial", data = ws_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.0938 -0.6896 -0.5519 0.8090 2.0062
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.09004460 0.22871406 -9.138 < 0.0000000000000002 ***
Milk 0.00023822 0.00003466 6.873 0.00000000000627 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 331.89 on 328 degrees of freedom
AIC: 335.89
Number of Fisher Scoring iterations: 5
Interpretation :
Milk is not a perfect separator
Parameter: Grocery
Call:
glm(formula = Business ~ Grocery, family = "binomial", data = ws_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7649 -0.4705 -0.3391 0.2235 2.2708
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.44356913 0.33652465 -10.233 <0.0000000000000002 ***
Grocery 0.00034425 0.00003866 8.905 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 230.22 on 328 degrees of freedom
AIC: 234.22
Number of Fisher Scoring iterations: 6
Interpretation :
Grocery is not a perfect separator
Parameter: Frozen
Call:
glm(formula = Business ~ Frozen, family = "binomial", data = ws_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0994 -0.9762 -0.7119 1.2934 2.5220
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.17992460 0.16783773 -1.072 0.284
Frozen -0.00025590 0.00006349 -4.031 0.0000556 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 385.74 on 328 degrees of freedom
AIC: 389.74
Number of Fisher Scoring iterations: 5
Interpretation :
Frozen is not a perfect separator
Parameter: Detergents_Paper
Call:
glm(formula = Business ~ Detergents_Paper, family = "binomial",
data = ws_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.84382 -0.32800 -0.26336 0.05962 2.53857
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.5454083 0.3548412 -9.992 <0.0000000000000002 ***
Detergents_Paper 0.0010962 0.0001211 9.054 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 165.37 on 328 degrees of freedom
AIC: 169.37
Number of Fisher Scoring iterations: 7
Interpretation :
Detergents_Paper is the perfect separator.
Parameter: Delicassen
Call:
glm(formula = Business ~ Delicassen, family = "binomial", data = ws_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1118 -0.8730 -0.8697 1.5093 1.5226
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.78266434 0.13170135 -5.943 0.0000000028 ***
Delicassen 0.00001306 0.00003662 0.357 0.721
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 412.70 on 328 degrees of freedom
AIC: 416.7
Number of Fisher Scoring iterations: 4
Interpretation :
Delicassen is not a perfect separator
We have to create new train and test data without Detergents_Paper.
Create the new train and test sets by deleting the 'Detergents_Paper' column.
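A sketch of that step (ws_trainnew matches the Call below; ws_testnew is an assumed name for the matching test split):

```r
ws_trainnew <- ws_train %>% select(-Detergents_Paper)
ws_testnew  <- ws_test  %>% select(-Detergents_Paper)
```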
Model without the perfect separator (Detergents_Paper) - all parameters
Call:
glm(formula = Business ~ Delicassen + Frozen + Fresh + Milk +
Grocery, family = "binomial", data = ws_trainnew)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5351 -0.4311 -0.2540 0.1302 3.5025
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.021435650 0.386005320 -7.827 0.00000000000000498 ***
Delicassen -0.000155068 0.000119981 -1.292 0.19621
Frozen -0.000307394 0.000100564 -3.057 0.00224 **
Fresh -0.000003322 0.000016524 -0.201 0.84069
Milk 0.000095351 0.000055344 1.723 0.08491 .
Grocery 0.000346650 0.000053877 6.434 0.00000000012418861 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 195.48 on 324 degrees of freedom
AIC: 207.48
Number of Fisher Scoring iterations: 6
Interpretation:
We no longer have a perfect separator.
Continue to prediction and the confusion matrix.
Set the threshold at >= 0.4 for Retail.
I want to capture more Retail cases, so Retail is set as the positive class.
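The prediction code isn't shown; a sketch using caret's confusionMatrix(), where model_all is taken to be the refit on ws_trainnew (the name the later interpretation uses) and ws_testnew is the assumed test split:

```r
# refit shown for completeness -- this is the model summarised above
model_all <- glm(Business ~ ., family = "binomial", data = ws_trainnew)

prob_all <- predict(model_all, newdata = ws_testnew, type = "response")
pred_all <- factor(ifelse(prob_all >= 0.4, "Retail", "Horeca"),
                   levels = c("Horeca", "Retail"))

confusionMatrix(pred_all, ws_testnew$Business, positive = "Retail")
```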
Confusion Matrix and Statistics
Reference
Prediction Horeca Retail
Horeca 67 5
Retail 6 32
Accuracy : 0.9
95% CI : (0.8281, 0.949)
No Information Rate : 0.6636
P-Value [Acc > NIR] : 0.000000007923
Kappa : 0.7775
Mcnemar's Test P-Value : 1
Sensitivity : 0.8649
Specificity : 0.9178
Pos Pred Value : 0.8421
Neg Pred Value : 0.9306
Prevalence : 0.3364
Detection Rate : 0.2909
Detection Prevalence : 0.3455
Balanced Accuracy : 0.8913
'Positive' Class : Retail
Interpretation:
Using all parameters (without the perfect separator, Detergents_Paper), we get a sensitivity of 86% and an accuracy of 90%.
Backward Stepwise Regression
Use all parameters without Detergents_Paper.
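A sketch of the backward elimination call (model_back is a hypothetical name):

```r
# start from the no-separator model and drop predictors while AIC improves
model_back <- step(model_all, direction = "backward")
summary(model_back)
```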
Start: AIC=207.48
Business ~ Delicassen + Frozen + Fresh + Milk + Grocery
Df Deviance AIC
- Fresh 1 195.52 205.52
- Delicassen 1 197.13 207.13
<none> 195.48 207.48
- Milk 1 198.42 208.42
- Frozen 1 210.36 220.36
- Grocery 1 265.40 275.40
Step: AIC=205.52
Business ~ Delicassen + Frozen + Milk + Grocery
Df Deviance AIC
- Delicassen 1 197.31 205.31
<none> 195.52 205.52
- Milk 1 198.43 206.43
- Frozen 1 214.20 222.20
- Grocery 1 265.83 273.83
Step: AIC=205.31
Business ~ Frozen + Milk + Grocery
Df Deviance AIC
<none> 197.31 205.31
- Milk 1 199.86 205.86
- Frozen 1 228.07 234.07
- Grocery 1 265.85 271.85
Call:
glm(formula = Business ~ Frozen + Milk + Grocery, family = "binomial",
data = ws_trainnew)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5156 -0.4363 -0.2552 0.1565 3.5231
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05775560 0.37122768 -8.237 < 0.0000000000000002 ***
Frozen -0.00035793 0.00009128 -3.921 0.000088077563 ***
Milk 0.00009161 0.00005690 1.610 0.107
Grocery 0.00033031 0.00005198 6.355 0.000000000208 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 197.31 on 326 degrees of freedom
AIC: 205.31
Number of Fisher Scoring iterations: 6
Continue to prediction, using the same threshold of >= 0.4.
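The same prediction pattern as before, now with the stepwise model (object names hypothetical):

```r
prob_back <- predict(model_back, newdata = ws_testnew, type = "response")
pred_back <- factor(ifelse(prob_back >= 0.4, "Retail", "Horeca"),
                    levels = c("Horeca", "Retail"))
confusionMatrix(pred_back, ws_testnew$Business, positive = "Retail")
```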
Confusion Matrix and Statistics
Reference
Prediction Horeca Retail
Horeca 68 5
Retail 5 32
Accuracy : 0.9091
95% CI : (0.8392, 0.9555)
No Information Rate : 0.6636
P-Value [Acc > NIR] : 0.000000001675
Kappa : 0.7964
Mcnemar's Test P-Value : 1
Sensitivity : 0.8649
Specificity : 0.9315
Pos Pred Value : 0.8649
Neg Pred Value : 0.9315
Prevalence : 0.3364
Detection Rate : 0.2909
Detection Prevalence : 0.3364
Balanced Accuracy : 0.8982
'Positive' Class : Retail
Interpretation:
With the backward stepwise method we use only Frozen, Milk, and Grocery as parameters, and we get a sensitivity of 86% and an accuracy of 90%. This result is essentially the same as the previous model (model_all).
Using Frozen and Grocery Parameters
Call:
glm(formula = Business ~ Grocery + Frozen, family = "binomial",
data = ws_trainnew)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6167 -0.4297 -0.2728 0.1507 3.3038
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.03791345 0.37479440 -8.106 0.000000000000000525 ***
Grocery 0.00038200 0.00004415 8.652 < 0.0000000000000002 ***
Frozen -0.00029961 0.00008074 -3.711 0.000207 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 412.82 on 329 degrees of freedom
Residual deviance: 199.86 on 327 degrees of freedom
AIC: 205.86
Number of Fisher Scoring iterations: 6
Continue to prediction, using the same threshold of >= 0.4.
5 8 9 10 18 22
0.18827377 0.51568470 0.31004512 0.97869685 0.10257970 0.03613515
Let's evaluate the model using a confusion matrix.
Confusion Matrix and Statistics
Reference
Prediction Horeca Retail
Horeca 69 6
Retail 4 31
Accuracy : 0.9091
95% CI : (0.8392, 0.9555)
No Information Rate : 0.6636
P-Value [Acc > NIR] : 0.000000001675
Kappa : 0.7936
Mcnemar's Test P-Value : 0.7518
Sensitivity : 0.8378
Specificity : 0.9452
Pos Pred Value : 0.8857
Neg Pred Value : 0.9200
Prevalence : 0.3364
Detection Rate : 0.2818
Detection Prevalence : 0.3182
Balanced Accuracy : 0.8915
'Positive' Class : Retail
Interpretation:
From model_FG (using only Frozen and Grocery as parameters) we get a lower sensitivity of 84% and the same accuracy of 90%.
Conclusion
Comparing Logistic Regression with KNN, this dataset gives better results with the KNN algorithm. With Logistic Regression, we saw that the dataset has a perfect separator (the Detergents_Paper parameter), i.e. a single parameter that can define the target variable on its own. The first thing we have to do is find the perfect separator and delete it from the dataset; after that we can continue the process of building the model.
In Logistic Regression, the best model comes from backward stepwise selection: using only three parameters (Frozen, Milk, Grocery) we get a sensitivity of 86%, an accuracy of 90%, and a positive predictive value of 86%. This is better than using all parameters (without Detergents_Paper as the perfect separator), which gives an accuracy of 90%, a sensitivity of 86%, and a positive predictive value of 84%. Even though the difference is very small, backward stepwise regression gives the better result.
On the other hand, the KNN algorithm gives a better result than Logistic Regression, with a sensitivity of 89%, an accuracy of 91%, and a positive predictive value of 86%. This shows that KNN works well when the predictors are numeric.