CLASSIFICATION IN MACHINE LEARNING 1 (C1)

1. Introduction

On this occasion, I will try to classify using wholesale data, where later we will try to explore the target channel variables to be able to analyze how the influence of the variables that greatly influence and be grouped as a particular channel. The algorithm that I will use is to use logistic regression and k-nearest neighbor which is included in supervised learning.

1.1 Library

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
## Loading required package: lattice
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:gtools':
## 
##     permutations

2. Data Explanatory

**Data Input & Structure*

wholesale <- read.csv("wholesale.csv")
str(wholesale)
## 'data.frame':    440 obs. of  8 variables:
##  $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
##  $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
##  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
##  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
##  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
##  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
##  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
  • Channel : Type of Business, 1 = horeca (hotel/Restaurant/cafe) and 2 = retail
  • Region : Customer Area, 1 = Lisbon; 2 = Porto; 3 = Others

Now we would like to check if there is missing values

colSums(is.na(wholesale))
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0

Great! No missing values!

2.1 Data Correlation

table(wholesale$Channel)
## 
##   1   2 
## 298 142
cor(wholesale$Channel, wholesale$Region)
## [1] 0.06202762

3. Data Pre-processing

wholesale$Channel %>% 
  table() %>% 
  prop.table()
## .
##         1         2 
## 0.6772727 0.3227273

AS we can see that the proportion of data is not balanced and there is more “hurray”Horeca" than the region. A balanced proportion is important so that the classification model studies the characteristics of positive and negative classes in a balanced way, not from just one class, but it all depends on needs. For this case I decided not to resampling (add data).

3.1 Data Cross Validaiton

set.seed(666)
wholesale_clean <- wholesale %>% 
  mutate(Channel = as.factor(Channel)) %>% 
  select(-Region)
wholesale_split <- sample(nrow(wholesale_clean), nrow(wholesale_clean)*0.6772727)
wholesale_train <- wholesale_clean[wholesale_split,]
wholesale_test <- wholesale_clean[-wholesale_split,]
wholesale %>% nrow()
## [1] 440
wholesale_train$Channel %>% 
  table() %>% 
  prop.table()
## .
##         1         2 
## 0.6666667 0.3333333
wholesale_test$Channel %>% 
  table() %>% 
  prop.table()
## .
##         1         2 
## 0.6993007 0.3006993

Logistic Regression model 0 Interpretation without predictor

whole_clean <- wholesale %>% 
  mutate(Channel = as.factor(Channel))

model0 <- glm(formula = Channel~1 , data = whole_clean, family = "binomial") # logistic regression, biner classification
summary(model0)
## 
## Call:
## glm(formula = Channel ~ 1, family = "binomial", data = whole_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8828  -0.8828  -0.8828   1.5040   1.5040  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7413     0.1020  -7.269 3.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 553.44  on 439  degrees of freedom
## Residual deviance: 553.44  on 439  degrees of freedom
## AIC: 555.44
## 
## Number of Fisher Scoring iterations: 4
sum_model0 <- summary(model0)
sum_model0$coefficients[1]
## [1] -0.7412664
inv.logit(-0.7413)
## [1] 0.3227199
exp(-0.7413)
## [1] 0.4764941

Conclusion: Horeca has a 32 times more chance to sell with a 47% chance.

K-nearest Neighbor

wholesale_x_train <- wholesale_train %>% 
  select(-Channel) %>% 
  scale()

wholesale_y_train <- wholesale_train %>% 
  select(Channel)

wholesale_x_test <- wholesale_test %>% 
  select(-Channel) %>% 
  scale(center = attr(wholesale_x_train, "scaled:center") , 
        scale = attr(wholesale_x_train, "scaled:scale"))

wholesale_y_test <- wholesale_test %>% 
  select(Channel)

sqrt(nrow(wholesale_x_train))
## [1] 17.23369

Model Evaluation

library(class)
wholesale_knn <- knn(train = wholesale_x_train,
                     test = wholesale_x_test,
                     cl = wholesale_y_train$Channel,
                     k = 17.23369)

wholesale_y_train$Channel %>% 
  table() %>% 
  prop.table()
## .
##         1         2 
## 0.6666667 0.3333333
confusionMatrix(wholesale_knn, reference = wholesale_y_test$Channel, positive = "2")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 93  9
##          2  7 34
##                                           
##                Accuracy : 0.8881          
##                  95% CI : (0.8247, 0.9347)
##     No Information Rate : 0.6993          
##     P-Value [Acc > NIR] : 7.134e-08       
##                                           
##                   Kappa : 0.7304          
##                                           
##  Mcnemar's Test P-Value : 0.8026          
##                                           
##             Sensitivity : 0.7907          
##             Specificity : 0.9300          
##          Pos Pred Value : 0.8293          
##          Neg Pred Value : 0.9118          
##              Prevalence : 0.3007          
##          Detection Rate : 0.2378          
##    Detection Prevalence : 0.2867          
##       Balanced Accuracy : 0.8603          
##                                           
##        'Positive' Class : 2               
## 

Conclusion: The test results have an accuracy of 93% with a P-Value of almost 10, which is 9.16.

Through these 2 models through wholesale data, it can be seen that the modeling rate has an accuracy of 93% with a P-Value of 10 which is 9.16, and based on the modeling of logistic regression it shows that Horeca has the opportunity to sell more than 32 times with a 47% probability.

This shows that the data can be trusted with a good level of accuracy and from the data it concludes that Horeca to sell more with a margin of 32 times, but with a percentage rate or opportunity below 50%, this shows that Horeca is likely to stagnate in sales wholesaler.

4. Logistic Regression Model 1

With all predictor

model1 <- glm(formula = Channel~. , data = whole_clean, family = "binomial") 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model0)
## 
## Call:
## glm(formula = Channel ~ 1, family = "binomial", data = whole_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8828  -0.8828  -0.8828   1.5040   1.5040  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7413     0.1020  -7.269 3.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 553.44  on 439  degrees of freedom
## Residual deviance: 553.44  on 439  degrees of freedom
## AIC: 555.44
## 
## Number of Fisher Scoring iterations: 4

Build model

modelstep <- step(model1, direction = "backward")
## Start:  AIC=217.29
## Channel ~ Region + Fresh + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## - Fresh             1   201.35 215.35
## - Delicassen        1   201.83 215.83
## - Milk              1   203.20 217.20
## <none>                  201.29 217.29
## - Region            1   203.91 217.91
## - Grocery           1   205.01 219.01
## - Frozen            1   205.40 219.40
## - Detergents_Paper  1   248.46 262.45
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=215.35
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## - Delicassen        1   201.85 213.85
## - Milk              1   203.29 215.29
## <none>                  201.35 215.35
## - Region            1   204.05 216.05
## - Grocery           1   205.12 217.12
## - Frozen            1   205.63 217.63
## - Detergents_Paper  1   249.68 261.68
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=213.85
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## - Milk              1   203.53 213.53
## <none>                  201.85 213.85
## - Region            1   204.38 214.38
## - Grocery           1   205.17 215.17
## - Frozen            1   208.34 218.34
## - Detergents_Paper  1   251.68 261.68
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=213.53
## Channel ~ Region + Grocery + Frozen + Detergents_Paper
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                    Df Deviance    AIC
## <none>                  203.53 213.53
## - Region            1   206.03 214.03
## - Frozen            1   208.41 216.41
## - Grocery           1   211.86 219.86
## - Detergents_Paper  1   254.20 262.20
summary(modelstep)
## 
## Call:
## glm(formula = Channel ~ Region + Grocery + Frozen + Detergents_Paper, 
##     family = "binomial", data = whole_clean)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93992  -0.31466  -0.22617   0.04413   3.08861  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -4.694e+00  8.334e-01  -5.633 1.78e-08 ***
## Region            4.088e-01  2.667e-01   1.533  0.12529    
## Grocery           1.430e-04  5.068e-05   2.822  0.00478 ** 
## Frozen           -1.353e-04  7.340e-05  -1.843  0.06536 .  
## Detergents_Paper  8.636e-04  1.336e-04   6.463 1.03e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 553.44  on 439  degrees of freedom
## Residual deviance: 203.53  on 435  degrees of freedom
## AIC: 213.53
## 
## Number of Fisher Scoring iterations: 7

Interpretation

inv.logit(-0.00013526)
## [1] 0.4999662
exp(-0.00013526)
## [1] 0.9998647

Conclusion: Frozen product has 5 times more chance to sell with 99% chance

5. Conclusion

Wholesale research with all predictors shows that frozen products have 5 times more chance of being sold with a 99% chance of success