Intro

Greetings

Hi! Welcome!
I am using a dataset from an external source (the UCI Machine Learning Repository). I hope you'll enjoy it.

Content

We will classify the type of business (channel) based on its product purchases.
Target variable: Channel
I will use the same threshold (>= 0.4) for every model and set Retail as the positive class when comparing Logistic Regression and KNN.

  1. FRESH: annual spending (m.u.) on fresh products (Continuous);
  2. MILK: annual spending (m.u.) on milk products (Continuous);
  3. GROCERY: annual spending (m.u.) on grocery products (Continuous);
  4. FROZEN: annual spending (m.u.) on frozen products (Continuous);
  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
  7. CHANNEL: customers’ channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal);
  8. REGION: customers’ region - Lisbon, Oporto or Other (Nominal).

Descriptive statistics:

Variable           Minimum  Maximum      Mean  Std. Deviation
FRESH                    3   112151  12000.30       12647.329
MILK                    55    73498   5796.27        7380.377
GROCERY                  3    92780   7951.28        9503.163
FROZEN                  25    60869   3071.93        4854.673
DETERGENTS_PAPER         3    40827   2881.49        4767.854
DELICATESSEN             3    47943   1524.87        2820.106

REGION        Frequency
Lisbon               77
Oporto               47
Other Region        316
Total               440

CHANNEL   Frequency
Horeca          298
Retail          142
Total           440

Load the Required Libraries
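The library chunk itself is not echoed in this report; a minimal sketch of the packages assumed throughout (data wrangling, the correlation plot, confusion matrices, and KNN):

library(dplyr)    # data wrangling: glimpse(), mutate(), select(), group_by()
library(GGally)   # ggcorr() for the correlation plot
library(caret)    # confusionMatrix() for model evaluation
library(class)    # knn() for the k-nearest neighbours model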

Load the Dataset
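The data is the UCI Wholesale customers dataset; a sketch of the loading step, assuming it was saved locally as wholesale.csv (the file and object names are assumptions):

# Hypothetical file and object names
ws <- read.csv("wholesale.csv")
glimpse(ws)
head(ws)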

Observations: 440
Variables: 8
$ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, ...
$ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
$ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126,...
$ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 495...
$ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 942...
$ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, ...
$ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321...
$ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566...
  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1       2      3 12669 9656    7561    214             2674       1338
2       2      3  7057 9810    9568   1762             3293       1776
3       2      3  6353 8808    7684   2405             3516       7844
4       1      3 13265 1196    4221   6404              507       1788
5       2      3 22615 5410    7198   3915             1777       5185
6       2      3  9413 8259    5126    666             1795       1451

Let's check the correlation between all parameters.
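One way to produce the correlation plot (a sketch; the exact call used in the report is not shown):

# Correlation matrix of all numeric columns, Channel and Region included
ggcorr(ws, label = TRUE, label_round = 2)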

Interpretation:
Milk, Grocery, and Detergents_Paper appear to have high correlations with Channel.

Find out which value corresponds to Horeca and which to Retail, then rename the Channel column to Business and convert it to a factor.
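A sketch of the count, assuming the data frame is called ws:

# How many observations fall in each Channel code
ws %>%
  group_by(Channel) %>%
  summarise(F = n())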

# A tibble: 2 x 2
  Channel     F
    <int> <int>
1       1   298
2       2   142

1 = Horeca; 2 = Retail
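A sketch of the recoding step (the object name ws is an assumption):

# Recode Channel into a labelled factor named Business and put it first
ws <- ws %>%
  mutate(Business = factor(Channel, levels = c(1, 2),
                           labels = c("Horeca", "Retail"))) %>%
  select(-Channel) %>%
  select(Business, everything())
head(ws)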

  Business Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1   Retail      3 12669 9656    7561    214             2674       1338
2   Retail      3  7057 9810    9568   1762             3293       1776
3   Retail      3  6353 8808    7684   2405             3516       7844
4   Horeca      3 13265 1196    4221   6404              507       1788
5   Retail      3 22615 5410    7198   3915             1777       5185
6   Retail      3  9413 8259    5126    666             1795       1451

What's inside Region?
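A sketch of the count per Region code:

ws %>%
  group_by(Region) %>%
  summarise(n())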

# A tibble: 3 x 2
  Region `n()`
   <int> <int>
1      1    77
2      2    47
3      3   316

1 = Lisbon; 2 = Oporto; 3 = Other Region

Exploratory Data Analysis

Target variable: Business

Let's drop the Region column, since its correlation with the target is close to zero.
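A sketch of dropping the column:

# Region is barely correlated with the target, so drop it
ws <- ws %>% select(-Region)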

Check the class proportions of the target variable.
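A sketch of the proportion check:

# Class proportions of the target variable
prop.table(table(ws$Business))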


   Horeca    Retail 
0.6772727 0.3227273 

Our dataset consists of around 68% Horeca and 32% Retail. Because of this class imbalance, we cannot rely on accuracy alone as our evaluation metric; we will also look at sensitivity for the Retail class.

Check distribution within each variable
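The chart itself is not reproduced here; one possible way to draw it (a sketch, not necessarily the plot used in the report) is a set of boxplots per spending segment:

library(tidyr)    # pivot_longer()
library(ggplot2)

# Long format: one row per (observation, spending segment), then boxplots
ws %>%
  pivot_longer(-Business, names_to = "segment", values_to = "spending") %>%
  ggplot(aes(x = Business, y = spending)) +
  geom_boxplot() +
  facet_wrap(~ segment, scales = "free_y")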

Interpretation:
Horeca mostly shows higher values than Retail in several purchasing segments, especially Delicassen, Fresh, and Frozen products. From the chart above, it also looks as if no single variable perfectly separates the two classes.

Logistic Regression

Using all parameters
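The split itself is not echoed; a sketch that reproduces the 330/110 train/test sizes seen in the output (the seed and sampling method are assumptions, while the names ws_train and ws_test come from the model calls below):

# 75% train / 25% test split: 330 and 110 rows
set.seed(100)                                   # seed value is an assumption
idx <- sample(nrow(ws), size = 0.75 * nrow(ws))
ws_train <- ws[idx, ]
ws_test  <- ws[-idx, ]

# Logistic regression on all predictors
summary(glm(Business ~ ., family = "binomial", data = ws_train))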


Call:
glm(formula = Business ~ ., family = "binomial", data = ws_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7997  -0.3028  -0.2261   0.0493   3.4111  

Coefficients:
                    Estimate  Std. Error z value           Pr(>|z|)    
(Intercept)      -3.55338992  0.47296921  -7.513 0.0000000000000578 ***
Fresh             0.00001656  0.00001905   0.869             0.3847    
Milk              0.00007411  0.00006342   1.169             0.2426    
Grocery           0.00007413  0.00006237   1.189             0.2346    
Frozen           -0.00024514  0.00012468  -1.966             0.0493 *  
Detergents_Paper  0.00089039  0.00014776   6.026 0.0000000016810356 ***
Delicassen       -0.00007869  0.00012060  -0.652             0.5141    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 154.40  on 323  degrees of freedom
AIC: 168.4

Number of Fisher Scoring iterations: 7

Interpretation:
The result above indicates that the dataset contains a perfect separator: R raises the warning 'glm.fit: fitted probabilities numerically 0 or 1 occurred'. We need to find out which parameter is the perfect separator.

Finding the Perfect Separator Using Logistic Regression
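The subsections below fit one single-predictor model per variable. A compact, equivalent check (a sketch, not the code used in the report) loops over the predictors and records whether glm raises the separation warning:

# Fit Business ~ <predictor> for each variable and capture any glm warning
predictors <- c("Fresh", "Milk", "Grocery", "Frozen",
                "Detergents_Paper", "Delicassen")
for (p in predictors) {
  msg <- tryCatch({
    glm(as.formula(paste("Business ~", p)),
        family = "binomial", data = ws_train)
    "no warning"
  }, warning = function(w) conditionMessage(w))
  cat(p, ":", msg, "\n")
}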

Parameter Fresh


Call:
glm(formula = Business ~ Fresh, family = "binomial", data = ws_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0235  -0.9324  -0.7886   1.3619   2.0330  

Coefficients:
               Estimate  Std. Error z value Pr(>|z|)   
(Intercept) -0.37335552  0.17283225  -2.160  0.03076 * 
Fresh       -0.00003504  0.00001235  -2.836  0.00456 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 403.01  on 328  degrees of freedom
AIC: 407.01

Number of Fisher Scoring iterations: 4

Interpretation:
Fresh is not a perfect separator.

Parameter Milk


Call:
glm(formula = Business ~ Milk, family = "binomial", data = ws_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.0938  -0.6896  -0.5519   0.8090   2.0062  

Coefficients:
               Estimate  Std. Error z value             Pr(>|z|)    
(Intercept) -2.09004460  0.22871406  -9.138 < 0.0000000000000002 ***
Milk         0.00023822  0.00003466   6.873     0.00000000000627 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 331.89  on 328  degrees of freedom
AIC: 335.89

Number of Fisher Scoring iterations: 5

Interpretation:
Milk is not a perfect separator.

Parameter Grocery


Call:
glm(formula = Business ~ Grocery, family = "binomial", data = ws_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7649  -0.4705  -0.3391   0.2235   2.2708  

Coefficients:
               Estimate  Std. Error z value            Pr(>|z|)    
(Intercept) -3.44356913  0.33652465 -10.233 <0.0000000000000002 ***
Grocery      0.00034425  0.00003866   8.905 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 230.22  on 328  degrees of freedom
AIC: 234.22

Number of Fisher Scoring iterations: 6

Interpretation:
Grocery is not a perfect separator.

Parameter Frozen


Call:
glm(formula = Business ~ Frozen, family = "binomial", data = ws_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0994  -0.9762  -0.7119   1.2934   2.5220  

Coefficients:
               Estimate  Std. Error z value  Pr(>|z|)    
(Intercept) -0.17992460  0.16783773  -1.072     0.284    
Frozen      -0.00025590  0.00006349  -4.031 0.0000556 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 385.74  on 328  degrees of freedom
AIC: 389.74

Number of Fisher Scoring iterations: 5

Interpretation:
Frozen is not a perfect separator.

Parameter Detergents_Paper


Call:
glm(formula = Business ~ Detergents_Paper, family = "binomial", 
    data = ws_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.84382  -0.32800  -0.26336   0.05962   2.53857  

Coefficients:
                   Estimate Std. Error z value            Pr(>|z|)    
(Intercept)      -3.5454083  0.3548412  -9.992 <0.0000000000000002 ***
Detergents_Paper  0.0010962  0.0001211   9.054 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 165.37  on 328  degrees of freedom
AIC: 169.37

Number of Fisher Scoring iterations: 7

Interpretation:
Detergents_Paper is the perfect separator.

Parameter Delicassen


Call:
glm(formula = Business ~ Delicassen, family = "binomial", data = ws_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1118  -0.8730  -0.8697   1.5093   1.5226  

Coefficients:
               Estimate  Std. Error z value     Pr(>|z|)    
(Intercept) -0.78266434  0.13170135  -5.943 0.0000000028 ***
Delicassen   0.00001306  0.00003662   0.357        0.721    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 412.70  on 328  degrees of freedom
AIC: 416.7

Number of Fisher Scoring iterations: 4

Interpretation:
Delicassen is not a perfect separator.

We have to build new train and test sets without Detergents_Paper.

Create the new train and test datasets by dropping the Detergents_Paper column.
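A sketch of that step (ws_trainnew appears in the model call below; ws_testnew is an assumed name):

ws_trainnew <- ws_train %>% select(-Detergents_Paper)
ws_testnew  <- ws_test  %>% select(-Detergents_Paper)   # assumed name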

Model without the perfect separator (Detergents_Paper) - all remaining parameters
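The refit itself; the object name model_all is taken from a later reference in the text:

model_all <- glm(Business ~ Delicassen + Frozen + Fresh + Milk + Grocery,
                 family = "binomial", data = ws_trainnew)
summary(model_all)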


Call:
glm(formula = Business ~ Delicassen + Frozen + Fresh + Milk + 
    Grocery, family = "binomial", data = ws_trainnew)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5351  -0.4311  -0.2540   0.1302   3.5025  

Coefficients:
                Estimate   Std. Error z value            Pr(>|z|)    
(Intercept) -3.021435650  0.386005320  -7.827 0.00000000000000498 ***
Delicassen  -0.000155068  0.000119981  -1.292             0.19621    
Frozen      -0.000307394  0.000100564  -3.057             0.00224 ** 
Fresh       -0.000003322  0.000016524  -0.201             0.84069    
Milk         0.000095351  0.000055344   1.723             0.08491 .  
Grocery      0.000346650  0.000053877   6.434 0.00000000012418861 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 195.48  on 324  degrees of freedom
AIC: 207.48

Number of Fisher Scoring iterations: 6

Interpretation:
There is no longer a perfect separator in this model.

Continue to prediction and the confusion matrix.

Set the threshold at >= 0.4 for Retail.

I want to capture more Retail cases, so Retail is set as the positive class.
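A sketch of the prediction and evaluation step, assuming the reduced model is model_all and the reduced test set is ws_testnew:

# Predicted probability of Retail, then apply the 0.4 threshold
prob <- predict(model_all, newdata = ws_testnew, type = "response")
pred <- factor(ifelse(prob >= 0.4, "Retail", "Horeca"),
               levels = c("Horeca", "Retail"))

# Evaluate with Retail as the positive class
confusionMatrix(pred, ws_testnew$Business, positive = "Retail")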

Confusion Matrix and Statistics

          Reference
Prediction Horeca Retail
    Horeca     67      5
    Retail      6     32
                                         
               Accuracy : 0.9            
                 95% CI : (0.8281, 0.949)
    No Information Rate : 0.6636         
    P-Value [Acc > NIR] : 0.000000007923 
                                         
                  Kappa : 0.7775         
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.8649         
            Specificity : 0.9178         
         Pos Pred Value : 0.8421         
         Neg Pred Value : 0.9306         
             Prevalence : 0.3364         
         Detection Rate : 0.2909         
   Detection Prevalence : 0.3455         
      Balanced Accuracy : 0.8913         
                                         
       'Positive' Class : Retail         
                                         

Interpretation:
Using all parameters (without the perfect separator, Detergents_Paper), we get a sensitivity of 86% and an accuracy of 90%.

Backward Stepwise Regression

Use all parameters without Detergents_Paper.
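A sketch of the backward elimination, starting from the model above (model_back is an assumed name):

model_back <- step(model_all, direction = "backward")
summary(model_back)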

Start:  AIC=207.48
Business ~ Delicassen + Frozen + Fresh + Milk + Grocery

             Df Deviance    AIC
- Fresh       1   195.52 205.52
- Delicassen  1   197.13 207.13
<none>            195.48 207.48
- Milk        1   198.42 208.42
- Frozen      1   210.36 220.36
- Grocery     1   265.40 275.40

Step:  AIC=205.52
Business ~ Delicassen + Frozen + Milk + Grocery

             Df Deviance    AIC
- Delicassen  1   197.31 205.31
<none>            195.52 205.52
- Milk        1   198.43 206.43
- Frozen      1   214.20 222.20
- Grocery     1   265.83 273.83

Step:  AIC=205.31
Business ~ Frozen + Milk + Grocery

          Df Deviance    AIC
<none>         197.31 205.31
- Milk     1   199.86 205.86
- Frozen   1   228.07 234.07
- Grocery  1   265.85 271.85

Call:
glm(formula = Business ~ Frozen + Milk + Grocery, family = "binomial", 
    data = ws_trainnew)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5156  -0.4363  -0.2552   0.1565   3.5231  

Coefficients:
               Estimate  Std. Error z value             Pr(>|z|)    
(Intercept) -3.05775560  0.37122768  -8.237 < 0.0000000000000002 ***
Frozen      -0.00035793  0.00009128  -3.921       0.000088077563 ***
Milk         0.00009161  0.00005690   1.610                0.107    
Grocery      0.00033031  0.00005198   6.355       0.000000000208 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 197.31  on 326  degrees of freedom
AIC: 205.31

Number of Fisher Scoring iterations: 6

Continue to prediction, using the same threshold of >= 0.4.

Confusion Matrix and Statistics

          Reference
Prediction Horeca Retail
    Horeca     68      5
    Retail      5     32
                                          
               Accuracy : 0.9091          
                 95% CI : (0.8392, 0.9555)
    No Information Rate : 0.6636          
    P-Value [Acc > NIR] : 0.000000001675  
                                          
                  Kappa : 0.7964          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8649          
            Specificity : 0.9315          
         Pos Pred Value : 0.8649          
         Neg Pred Value : 0.9315          
             Prevalence : 0.3364          
         Detection Rate : 0.2909          
   Detection Prevalence : 0.3364          
      Balanced Accuracy : 0.8982          
                                          
       'Positive' Class : Retail          
                                          

Interpretation:
With the backward stepwise method we keep only Frozen, Milk, and Grocery as parameters, and we get a sensitivity of 86% and an accuracy of about 91%. This result is essentially the same as the previous model (model_all).

Using Frozen and Grocery Parameters


Call:
glm(formula = Business ~ Grocery + Frozen, family = "binomial", 
    data = ws_trainnew)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6167  -0.4297  -0.2728   0.1507   3.3038  

Coefficients:
               Estimate  Std. Error z value             Pr(>|z|)    
(Intercept) -3.03791345  0.37479440  -8.106 0.000000000000000525 ***
Grocery      0.00038200  0.00004415   8.652 < 0.0000000000000002 ***
Frozen      -0.00029961  0.00008074  -3.711             0.000207 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 412.82  on 329  degrees of freedom
Residual deviance: 199.86  on 327  degrees of freedom
AIC: 205.86

Number of Fisher Scoring iterations: 6

Continue to prediction, using the same threshold of >= 0.4.

         5          8          9         10         18         22 
0.18827377 0.51568470 0.31004512 0.97869685 0.10257970 0.03613515 

Let's evaluate the model using a confusion matrix.

Confusion Matrix and Statistics

          Reference
Prediction Horeca Retail
    Horeca     69      6
    Retail      4     31
                                          
               Accuracy : 0.9091          
                 95% CI : (0.8392, 0.9555)
    No Information Rate : 0.6636          
    P-Value [Acc > NIR] : 0.000000001675  
                                          
                  Kappa : 0.7936          
                                          
 Mcnemar's Test P-Value : 0.7518          
                                          
            Sensitivity : 0.8378          
            Specificity : 0.9452          
         Pos Pred Value : 0.8857          
         Neg Pred Value : 0.9200          
             Prevalence : 0.3364          
         Detection Rate : 0.2818          
   Detection Prevalence : 0.3182          
      Balanced Accuracy : 0.8915          
                                          
       'Positive' Class : Retail          
                                          

Interpretation:
From model_FG (using only Frozen and Grocery as parameters) we get a lower sensitivity of 84% and about the same accuracy of 91%.

KNN Model

Using All Parameters (without the perfect separator, Detergents_Paper)

Define the target variable (y)

Scaling the Train and Test Data
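A sketch covering this step and the previous one (defining y), with assumed object names; the test set is scaled with the training set's centre and spread:

# Targets and predictors
train_y <- ws_trainnew$Business
test_y  <- ws_testnew$Business
train_x <- ws_trainnew %>% select(-Business)
test_x  <- ws_testnew  %>% select(-Business)

# z-scale the training predictors, then reuse their centre/scale on the test set
train_x <- scale(train_x)
test_x  <- scale(test_x,
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))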

Continue Modeling using KNN
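A sketch of the KNN fit with class::knn; the value of k is an assumption (an odd number near the square root of the training size):

# k = 19 is an assumed choice, not necessarily the value used in the report
pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 19)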

Let's test the model.

Confusion Matrix and Statistics

          Reference
Prediction Horeca Retail
    Horeca     68      4
    Retail      5     33
                                          
               Accuracy : 0.9182          
                 95% CI : (0.8504, 0.9619)
    No Information Rate : 0.6636          
    P-Value [Acc > NIR] : 0.0000000003191 
                                          
                  Kappa : 0.8179          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8919          
            Specificity : 0.9315          
         Pos Pred Value : 0.8684          
         Neg Pred Value : 0.9444          
             Prevalence : 0.3364          
         Detection Rate : 0.3000          
   Detection Prevalence : 0.3455          
      Balanced Accuracy : 0.9117          
                                          
       'Positive' Class : Retail          
                                          

Interpretation:
With the KNN model using all remaining parameters (Region and Detergents_Paper excluded), we get a sensitivity of 89%. This is higher than all of the previous models.

Using Frozen and Grocery Parameters

  Business Fresh Milk Grocery Frozen Delicassen
1   Retail 12669 9656    7561    214       1338
2   Retail  7057 9810    9568   1762       1776
3   Retail  6353 8808    7684   2405       7844
4   Horeca 13265 1196    4221   6404       1788
5   Retail 22615 5410    7198   3915       5185
6   Retail  9413 8259    5126    666       1451

Let's test the model.

Confusion Matrix and Statistics

          Reference
Prediction Horeca Retail
    Horeca     68      4
    Retail      5     33
                                          
               Accuracy : 0.9182          
                 95% CI : (0.8504, 0.9619)
    No Information Rate : 0.6636          
    P-Value [Acc > NIR] : 0.0000000003191 
                                          
                  Kappa : 0.8179          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8919          
            Specificity : 0.9315          
         Pos Pred Value : 0.8684          
         Neg Pred Value : 0.9444          
             Prevalence : 0.3364          
         Detection Rate : 0.3000          
   Detection Prevalence : 0.3455          
      Balanced Accuracy : 0.9117          
                                          
       'Positive' Class : Retail          
                                          

Interpretation:
With the KNN model using only the Frozen and Grocery parameters, we get a sensitivity of 86%, which is lower than the previous KNN model that used all remaining parameters.

Conclusion

Comparing Logistic Regression with KNN, this dataset gives better results with the KNN algorithm. With Logistic Regression we saw that the dataset contains a perfect separator (the Detergents_Paper parameter), i.e. a single parameter that can define the target on its own. The first thing we had to do was find that perfect separator and remove it from the dataset; after that we could continue building the models.
Within logistic regression, the best model came from backward stepwise selection: using only three parameters (Frozen, Milk, Grocery) we got a sensitivity of 86%, an accuracy of about 91%, and a positive predictive value of 86%. This is slightly better than using all parameters (without Detergents_Paper as the perfect separator), which gave an accuracy of 90%, a sensitivity of 86%, and a positive predictive value of 84%. Although the difference is small, backward stepwise regression gives the better result.

On the other hand, the KNN algorithm gives a better result than logistic regression, with a sensitivity of 89%, an accuracy of about 92%, and a positive predictive value of about 87%. This suggests that KNN works well here, where all the predictors are numeric.