1. DATA INTRODUCTION

On this occasion, we will analyze using a clasification on the data wholesale.csv.

We will explore the target variable channel to be able to analyze how the influence of what variables are very influential for a data grouped as a particular channel.

1.1. Data Preparation

Read the data.

wholesale <- read.csv("wholesale.csv")
head(wholesale)

Check data structure.

str(wholesale)
## 'data.frame':    440 obs. of  8 variables:
##  $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
##  $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
##  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
##  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
##  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
##  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
##  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...

Check missing value.

anyNA(wholesale)
## [1] FALSE
colSums(is.na(wholesale))
##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0

1.2. Data Preprocessing

Data Wrangling.

We will change the datatype of Channel into a factor and We will throw away the Region variable, since it is a class type data like our target variable and We don’t need it.

library(dplyr)
wholesale <- wholesale %>% 
  mutate(Channel = as.factor(Channel),
         Region = as.factor(Region)) %>% 
         select(-Region) 

glimpse(wholesale)
## Rows: 440
## Columns: 7
## $ Channel          <fct> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

2. DATA ANALYSIS

2.1. Exploratory

Check data patern.

summary(wholesale)
##  Channel     Fresh             Milk          Grocery          Frozen       
##  1:298   Min.   :     3   Min.   :   55   Min.   :    3   Min.   :   25.0  
##  2:142   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153   1st Qu.:  742.2  
##          Median :  8504   Median : 3627   Median : 4756   Median : 1526.0  
##          Mean   : 12000   Mean   : 5796   Mean   : 7951   Mean   : 3071.9  
##          3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656   3rd Qu.: 3554.2  
##          Max.   :112151   Max.   :73498   Max.   :92780   Max.   :60869.0  
##  Detergents_Paper    Delicassen     
##  Min.   :    3.0   Min.   :    3.0  
##  1st Qu.:  256.8   1st Qu.:  408.2  
##  Median :  816.5   Median :  965.5  
##  Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :40827.0   Max.   :47943.0

Based on the summary above, we can see that there are no irregularities in each variable.

Check the distribution proportion of target class

prop.table(table(wholesale$Channel))
## 
##         1         2 
## 0.6772727 0.3227273
table(wholesale$Channel)
## 
##   1   2 
## 298 142

When viewed from the proportion of the two classes, it is quite balanced, so we don’t really need additional pre-processing to balance the proportion between the two target classes of variables.

2.2. Cross Validation

Splitting the data into data train(80%) and data test(20%).

RNGkind(sample.kind = "Rounding")
set.seed(417)

# index sampling
index <- sample(x = nrow(wholesale), size = nrow(wholesale)*0.85)

# splitting
wholesale_train <- wholesale[index , ]

wholesale_test <- wholesale[-index , ]

Check the distribution proportion of target class from data train.

prop.table(table(wholesale_train$Channel))
## 
##         1         2 
## 0.6737968 0.3262032
table(wholesale_train$Channel)
## 
##   1   2 
## 252 122

The prportion is quite balanced.

3. LOGISTIC REGRESSION

We will make a logistic regression model to predict Channel. Based on the data and our business inquiry, We will use all the predictor variable for building the model.

wholesale_LRMod <- glm(formula = Channel ~ ., data = wholesale_train, family = "binomial")

summary(wholesale_LRMod)
## 
## Call:
## glm(formula = Channel ~ ., family = "binomial", data = wholesale_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8624  -0.3243  -0.2289   0.0481   3.2562  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.636e+00  4.708e-01  -7.723 1.14e-14 ***
## Fresh             9.955e-06  1.812e-05   0.550   0.5827    
## Milk              9.016e-05  5.935e-05   1.519   0.1287    
## Grocery           9.621e-05  6.259e-05   1.537   0.1243    
## Frozen           -1.961e-04  1.015e-04  -1.933   0.0533 .  
## Detergents_Paper  8.721e-04  1.488e-04   5.862 4.56e-09 ***
## Delicassen       -5.699e-05  1.104e-04  -0.516   0.6056    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 472.33  on 373  degrees of freedom
## Residual deviance: 176.59  on 367  degrees of freedom
## AIC: 190.59
## 
## Number of Fisher Scoring iterations: 7

3.1. Model Fitting

In the first modeling, there are still many predictor variables that are not significant to the target variable, therefore we will try to do a model fitting using the stepwise method.

wholesale_LRMod_step <- step(wholesale_LRMod, direction = "backward")
## Start:  AIC=190.59
## Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper + 
##     Delicassen
## 
##                    Df Deviance    AIC
## - Delicassen        1   176.86 188.86
## - Fresh             1   176.89 188.89
## <none>                  176.59 190.59
## - Milk              1   178.86 190.86
## - Grocery           1   179.10 191.10
## - Frozen            1   182.25 194.25
## - Detergents_Paper  1   217.37 229.37
## 
## Step:  AIC=188.86
## Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper
## 
##                    Df Deviance    AIC
## - Fresh             1   177.09 187.09
## <none>                  176.86 188.86
## - Milk              1   179.01 189.01
## - Grocery           1   179.12 189.12
## - Frozen            1   184.24 194.24
## - Detergents_Paper  1   218.39 228.39
## 
## Step:  AIC=187.09
## Channel ~ Milk + Grocery + Frozen + Detergents_Paper
## 
##                    Df Deviance    AIC
## <none>                  177.09 187.09
## - Milk              1   179.37 187.37
## - Grocery           1   179.49 187.49
## - Frozen            1   184.50 192.50
## - Detergents_Paper  1   219.04 227.04
summary(wholesale_LRMod_step)
## 
## Call:
## glm(formula = Channel ~ Milk + Grocery + Frozen + Detergents_Paper, 
##     family = "binomial", data = wholesale_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8469  -0.3242  -0.2311   0.0514   3.2211  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.548e+00  4.268e-01  -8.313  < 2e-16 ***
## Milk              9.048e-05  5.904e-05   1.532   0.1254    
## Grocery           9.133e-05  6.070e-05   1.505   0.1324    
## Frozen           -1.950e-04  9.029e-05  -2.160   0.0308 *  
## Detergents_Paper  8.596e-04  1.453e-04   5.916  3.3e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 472.33  on 373  degrees of freedom
## Residual deviance: 177.09  on 369  degrees of freedom
## AIC: 187.09
## 
## Number of Fisher Scoring iterations: 7

3.2. Prediction

By using the results from stepwise model, we will try to predict using the test data that we already have.

We will predict the probability Channel for data wholesale_test and save it in a new column named wholesale_pred

wholesale_test$wholesale_pred <- predict(wholesale_LRMod_step,
                             wholesale_test,
                             type = "response")

wholesale_test

We will Classify the wholesale_test data based on wholesale_pred and save it in a new column named channel_pred.

wholesale_test$channel_pred <- ifelse(wholesale_test$wholesale_pred > 0.5,
                                      yes = "2",
                                      no = "1")

wholesale_test

3.3. Evaluation

To evaluate the model that we have created, we will use a confusion matrix.

lr_cm <- confusionMatrix(as.factor(wholesale_test$channel_pred),
                wholesale_test$Channel,
                positive = "2")

lr_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 43  2
##          2  3 18
##                                          
##                Accuracy : 0.9242         
##                  95% CI : (0.832, 0.9749)
##     No Information Rate : 0.697          
##     P-Value [Acc > NIR] : 7.577e-06      
##                                          
##                   Kappa : 0.8232         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9000         
##             Specificity : 0.9348         
##          Pos Pred Value : 0.8571         
##          Neg Pred Value : 0.9556         
##              Prevalence : 0.3030         
##          Detection Rate : 0.2727         
##    Detection Prevalence : 0.3182         
##       Balanced Accuracy : 0.9174         
##                                          
##        'Positive' Class : 2              
## 

Based on business questions, the best metrics are accuracy & sensitivity/recall. Because we want to predict whether a customer belongs to a certain customer group, in this case we have a marketing strategy for each group.

4. K-NN MODEL

In addition to predicting the logistic regression model, we will also make predictions using the K-NN method. Next, we will compare the results and we will take the best one.

4.1. Picking optimum K

Class Target.

levels(wholesale_train$Channel)
## [1] "1" "2"

Class Target = 2

k optimum is the root of our data sum: sqrt(nrow(data))

sqrt(nrow(wholesale_train))
## [1] 19.33908

Pay attention to the number of target classes + Even target class -> odd number of k + Odd target class -> even number of k

K optimum is 19.

4.2. Data Preprocessing

Cross Validation.

RNGkind(sample.kind = "Rounding")
set.seed(419)

# index sampling
index <- sample(x = nrow(wholesale), size = nrow(wholesale)*0.85)

# splitting
wholesale_KNNtrain <- wholesale[index , ]

wholesale_KNNtest <- wholesale[-index , ]

For k-NN, separate predictor and label (target variable)

# prediktor
wholesale_train_x <- wholesale_KNNtrain %>% select_if(is.numeric)

wholesale_test_x <- wholesale_KNNtest %>% select_if(is.numeric)

# target
wholesale_train_y <- wholesale_KNNtrain[,"Channel"]

wholesale_test_y <- wholesale_KNNtest[,"Channel"]

The range of each variable is not too different so there is no need for feature rescaling in the data pre-processing stage.

The K-NN method does not require prior modeling. So that predictions can be made immediately.

wholesale_knn <- knn(train = wholesale_train_x,
                 test = wholesale_test_x,
                 cl = wholesale_train_y,
                 k = 19)

4.3. Evaluation

To evaluate the model that we have created, we will use a confusion matrix.

knn_cm <- confusionMatrix(data = as.factor(wholesale_knn),
                reference = as.factor(wholesale_test_y),
                positive = "2")

knn_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 38  5
##          2  6 17
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7213, 0.9138)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.001989        
##                                           
##                   Kappa : 0.6292          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.7727          
##             Specificity : 0.8636          
##          Pos Pred Value : 0.7391          
##          Neg Pred Value : 0.8837          
##              Prevalence : 0.3333          
##          Detection Rate : 0.2576          
##    Detection Prevalence : 0.3485          
##       Balanced Accuracy : 0.8182          
##                                           
##        'Positive' Class : 2               
## 

5. CONCLUSION

eval_lr <- data_frame(Accuracy = lr_cm$overall[1],
           Recall = lr_cm$byClass[1],
           Specificity = lr_cm$byClass[2],
           Precision = lr_cm$byClass[3])

eval_knn <- data_frame(Accuracy = knn_cm$overall[1],
           Recall = knn_cm$byClass[1],
           Specificity = knn_cm$byClass[2],
           Precision = knn_cm$byClass[3])

Evaluation Comparison.

eval_lr
eval_knn

Based on the evaluation above, it can be seen that the results of the logistic regression have a better accuracy and sensitivity/recal.

So it was decided that we would use a logistic regression model to answer future business questions.