Objective

Pada kesempatan kali ini saya akan mencoba untuk mengklasifikasikan dengan menggunakan data wholesale, yang dimana nanti kita akan mencoba untuk eksplorasi target variabel channel untuk dapat dianalisis bagaimana pengaruh variabel-variabel yang sangat mempengaruhi dan dikelompokan sebagai channel tertentu. Algoritma yang akan saya gunakan yaitu menggunakan logistik regression dan k-nearest neighbor yang termasuk dalam supervised learning.

Library

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
library(e1071)

Read Data

wholesale <- read.csv("Data/wholesale.csv")
glimpse(wholesale)

## Rows: 440
## Columns: 8
## $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
## $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~

Summary:
- Channel : jenis usaha, 1 = horeca (hotel/restoran/cafe) dan 2 = retail
- Region : Wilayah customer, 1 = Lisbon; 2 = Oporto; 3 = lainnya

Clean Data

anyNA(wholesale)

## [1] FALSE

colSums(is.na(wholesale))

##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0

Clean! No N/A

Correlation

table(wholesale$Channel)

## 
##   1   2 
## 298 142

cor(wholesale$Channel, wholesale$Region)

## [1] 0.06202762

Pre-Processing Data

Kita akan melihat proporsi data terlebih dahulu

wholesale$Channel %>% 
  table() %>% 
  prop.table()

## .
##         1         2 
## 0.6772727 0.3227273

Disini dapat kita lihat bahwa proporsi data tidak seimbang dan lebih banyak horeca dibandingkan region. Proporsi yang seimbang penting agar model klasifikasi mempelajari karakteristik kelas positif maupun negatif secara seimbang, tidak dari satu kelas saja, namun semua itu tergantung pada kebutuhan. Untuk kasus kali ini saya memutuskan untuk tidak melakukan resampling (menambah data).

Data Cross Validation

set.seed(666)
wholesale_clean <- wholesale %>% 
  mutate(Channel = as.factor(Channel)) %>% 
  select(-Region)
wholesale_split <- sample(nrow(wholesale_clean), nrow(wholesale_clean)*0.6772727)
wholesale_train <- wholesale_clean[wholesale_split,]
wholesale_test <- wholesale_clean[-wholesale_split,]
wholesale %>% nrow()

## [1] 440

wholesale_train$Channel %>% 
  table() %>% 
  prop.table()

## .
##         1         2 
## 0.6666667 0.3333333

wholesale_test$Channel %>% 
  table() %>% 
  prop.table()

## .
##         1         2 
## 0.6993007 0.3006993

Logistic Regression model 0

`Interpretasi tanpa prediktor`

whole_clean <- wholesale %>% 
  mutate(Channel = as.factor(Channel))

# fyi 1 = `Horeca`

model0 <- glm(formula = Channel~1 , data = whole_clean, family = "binomial") # logistic regression, biner classification
summary(model0)

## 
## Call:
## glm(formula = Channel ~ 1, family = "binomial", data = whole_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8828  -0.8828  -0.8828   1.5040   1.5040  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7413     0.1020  -7.269 3.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 553.44  on 439  degrees of freedom
## Residual deviance: 553.44  on 439  degrees of freedom
## AIC: 555.44
## 
## Number of Fisher Scoring iterations: 4

sum_model0 <- summary(model0)
sum_model0$coefficients[1]

## [1] -0.7412664

inv.logit(-0.7413) #Peluang

## [1] 0.3227199

exp(-0.7413) #Probabilitas

## [1] 0.4764941

Kesimpulan: Horeca memiliki peluang untuk menjual lebih banyak sebesar 32 kali dengan kemungkinan 47%.

k-nearest neighbor

Scalling dan pemilihan nilai k

wholesale_x_train <- wholesale_train %>% 
  select(-Channel) %>% 
  scale()

wholesale_y_train <- wholesale_train %>% 
  select(Channel)

wholesale_x_test <- wholesale_test %>% 
  select(-Channel) %>% 
  scale(center = attr(wholesale_x_train, "scaled:center") , 
        scale = attr(wholesale_x_train, "scaled:scale"))

wholesale_y_test <- wholesale_test %>% 
  select(Channel)

sqrt(nrow(wholesale_x_train))

## [1] 17.23369

Evaluasi Model

library(class)
wholesale_knn <- knn(train = wholesale_x_train,
                     test = wholesale_x_test,
                     cl = wholesale_y_train$Channel,
                     k = 17.23369)

wholesale_y_train$Channel %>% 
  table() %>% 
  prop.table()

## .
##         1         2 
## 0.6666667 0.3333333

confusionMatrix(wholesale_knn, reference = wholesale_y_test$Channel, positive = "2")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 93  9
##          2  7 34
##                                           
##                Accuracy : 0.8881          
##                  95% CI : (0.8247, 0.9347)
##     No Information Rate : 0.6993          
##     P-Value [Acc > NIR] : 7.134e-08       
##                                           
##                   Kappa : 0.7304          
##                                           
##  Mcnemar's Test P-Value : 0.8026          
##                                           
##             Sensitivity : 0.7907          
##             Specificity : 0.9300          
##          Pos Pred Value : 0.8293          
##          Neg Pred Value : 0.9118          
##              Prevalence : 0.3007          
##          Detection Rate : 0.2378          
##    Detection Prevalence : 0.2867          
##       Balanced Accuracy : 0.8603          
##                                           
##        'Positive' Class : 2               
##

Kesimpulan: Hasil test memiliki akurasi 93% dengan P-Value hampir mencapai 10 yaitu 9,16.

Melalui 2 model ini melalui data wholesale, dapat dilihat bahwa tingkat permodelan memiliki akurasi 93% dengan P-Value mencapai 10 yaitu 9,16, dan berdasarkan permodelan dari regressi logistik menunjukkan Horeca memiliki peluang untuk menjual lebih banyak sebesar 32 kali dengan kemungkinan 47%.

Hal ini menunjukkan data tersebut dapat dipercaya dengan tingkat akurasi yang baik dan dari data tersebut memberikan kesimpulan bahwa Horeca untuk menjual lebih banyak dengan margin 32 kali, namun dengan tingkat persentase atau peluang dibawah dari 50%, hal ini menunjukkan bahwa horeca kemungkinan akan stag di penjualan grosir.

Logistic Regression model 1

`Interpretasi semua prediktor`

model1 <- glm(formula = Channel~. , data = whole_clean, family = "binomial") # logistic regression, biner classification

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model0)

## 
## Call:
## glm(formula = Channel ~ 1, family = "binomial", data = whole_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8828  -0.8828  -0.8828   1.5040   1.5040  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7413     0.1020  -7.269 3.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 553.44  on 439  degrees of freedom
## Residual deviance: 553.44  on 439  degrees of freedom
## AIC: 555.44
## 
## Number of Fisher Scoring iterations: 4

Build Model

modelstep <- step(model1, direction = "backward")

## Start:  AIC=217.29
## Channel ~ Region + Fresh + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                    Df Deviance    AIC
## - Fresh             1   201.35 215.35
## - Delicassen        1   201.83 215.83
## - Milk              1   203.20 217.20
## <none>                  201.29 217.29
## - Region            1   203.91 217.91
## - Grocery           1   205.01 219.01
## - Frozen            1   205.40 219.40
## - Detergents_Paper  1   248.46 262.45

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Step:  AIC=215.35
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                    Df Deviance    AIC
## - Delicassen        1   201.85 213.85
## - Milk              1   203.29 215.29
## <none>                  201.35 215.35
## - Region            1   204.05 216.05
## - Grocery           1   205.12 217.12
## - Frozen            1   205.63 217.63
## - Detergents_Paper  1   249.68 261.68

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Step:  AIC=213.85
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                    Df Deviance    AIC
## - Milk              1   203.53 213.53
## <none>                  201.85 213.85
## - Region            1   204.38 214.38
## - Grocery           1   205.17 215.17
## - Frozen            1   208.34 218.34
## - Detergents_Paper  1   251.68 261.68

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Step:  AIC=213.53
## Channel ~ Region + Grocery + Frozen + Detergents_Paper

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                    Df Deviance    AIC
## <none>                  203.53 213.53
## - Region            1   206.03 214.03
## - Frozen            1   208.41 216.41
## - Grocery           1   211.86 219.86
## - Detergents_Paper  1   254.20 262.20

summary(modelstep)

## 
## Call:
## glm(formula = Channel ~ Region + Grocery + Frozen + Detergents_Paper, 
##     family = "binomial", data = whole_clean)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93992  -0.31466  -0.22617   0.04413   3.08861  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -4.694e+00  8.334e-01  -5.633 1.78e-08 ***
## Region            4.088e-01  2.667e-01   1.533  0.12529    
## Grocery           1.430e-04  5.068e-05   2.822  0.00478 ** 
## Frozen           -1.353e-04  7.340e-05  -1.843  0.06536 .  
## Detergents_Paper  8.636e-04  1.336e-04   6.463 1.03e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 553.44  on 439  degrees of freedom
## Residual deviance: 203.53  on 435  degrees of freedom
## AIC: 213.53
## 
## Number of Fisher Scoring iterations: 7

Interpretasi:

inv.logit(-0.00013526) #Peluang

## [1] 0.4999662

exp(-0.00013526) #Probabilitas

## [1] 0.9998647

Kesimpulan: Produk Frozen memiliki peluang untuk menjual lebih banyak sebesar 5 kali dengan kemungkinan 99%

`Conclusion`

Reserach wholesale dengan semua prediktor menunjukkan bahwa produk frozen memiliki peluang untuk dijual lebih banyak sebesar 5 kali dengan kemungkinan keberhasilan sebesar 99%

5. LBB : Classification in Machine Learning I (C1) “Wholesale”

Felicia Haliman

May 26, 2021