The objective of this analysis is to predict what channel affecting to wholesale by classifying related data to wholsale channel. The target varibale in this analysis is “Channel”. The method will use Logistic Regression and K-NN. By applying these methods, this analysis will determine which method that will perform better than the other, therefore, it can be used to predict.
Source of data: UCI Wholesale customers Data Set
Here are the libraries used in this analysis
library(tidyverse)
library(gtools)
library(caret)
library(GGally)
library(ggplot2)
library(readr)
library(dplyr)
library(class)Here is the data that is used.
wholesale <- read.csv("wholesale.csv")
glimpse(wholesale)## Rows: 440
## Columns: 8
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
## $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~
Column Description:
Based on dataset above, there will be adjusted in the dataset:
Here is the process:
wholesale <- wholesale %>%
select(-Region) %>%
mutate(Channel = ifelse(Channel == 1, "Horeca", "Retail")) %>%
mutate(Channel = as.factor(Channel))head(wholesale)colSums(is.na(wholesale))## Channel Fresh Milk Grocery
## 0 0 0 0
## Frozen Detergents_Paper Delicassen
## 0 0 0
From checking dataset above, there is no missing value.
summary(wholesale)## Channel Fresh Milk Grocery
## Horeca:298 Min. : 3 Min. : 55 Min. : 3
## Retail:142 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153
## Median : 8504 Median : 3627 Median : 4756
## Mean : 12000 Mean : 5796 Mean : 7951
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656
## Max. :112151 Max. :73498 Max. :92780
## Frozen Detergents_Paper Delicassen
## Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :60869.0 Max. :40827.0 Max. :47943.0
Insight:
Before builing a model, it is needed to check the propotion of the target variable.
prop.table(table(wholesale$Channel))##
## Horeca Retail
## 0.6772727 0.3227273
According to the propotion check, there is imbalance issue within the data. Therefore, to solve it, “Upsampling” Method will be used.
wholesale_up <- upSample(x = wholesale %>% select(-Channel),
y = wholesale$Channel,
list = FALSE,
yname = "Channel")
head(wholesale_up)After Upsampling has been completed, here is the updated propotion result.
prop.table(table(wholesale_up$Channel))##
## Horeca Retail
## 0.5 0.5
The result is balance with 0.5 : 0.5.
Before setting a model, it needs to split the dataset to “train” and “test”. The propotion of data will be categorized into “train” is 80%, while “test” is 20%. Here is the splitting process.
RNGkind(sample.kind = "Rounding")
set.seed(132)
tts_data <- nrow(wholesale_up)
index <- sample(tts_data, tts_data*0.8)
train <- wholesale_up[index, ]
test <- wholesale_up[-index, ]Here is the logistic modeling process for wholesale with “Channel” as the target variable.
set.seed(132)
model_log.reg <- glm(Channel ~., wholesale_up, family = "binomial")
summary(model_log.reg) ##
## Call:
## glm(formula = Channel ~ ., family = "binomial", data = wholesale_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2842 -0.3343 -0.0201 0.1733 2.8961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.428e+00 4.028e-01 -8.511 < 2e-16 ***
## Fresh 2.012e-06 1.562e-05 0.129 0.8975
## Milk 8.035e-05 4.993e-05 1.609 0.1076
## Grocery 1.109e-04 5.635e-05 1.967 0.0491 *
## Frozen -1.189e-04 6.763e-05 -1.758 0.0787 .
## Detergents_Paper 1.017e-03 1.293e-04 7.863 3.76e-15 ***
## Delicassen -3.611e-05 8.768e-05 -0.412 0.6804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 826.23 on 595 degrees of freedom
## Residual deviance: 287.67 on 589 degrees of freedom
## AIC: 301.67
##
## Number of Fisher Scoring iterations: 7
Here is the logistic modeling process for wholesale with “Channel” as the target variable by fitting it using stepwise regression.
set.seed(132)
model_log.reg.step <- step(model_log.reg, direction = "backward", trace = 0)
summary(model_log.reg.step)##
## Call:
## glm(formula = Channel ~ Milk + Grocery + Frozen + Detergents_Paper,
## family = "binomial", data = wholesale_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2740 -0.3374 -0.0179 0.1765 2.8931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.391e+00 3.556e-01 -9.535 < 2e-16 ***
## Milk 7.386e-05 4.770e-05 1.548 0.1215
## Grocery 1.089e-04 5.530e-05 1.969 0.0489 *
## Frozen -1.283e-04 5.994e-05 -2.140 0.0323 *
## Detergents_Paper 1.015e-03 1.264e-04 8.027 9.99e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 826.23 on 595 degrees of freedom
## Residual deviance: 287.86 on 591 degrees of freedom
## AIC: 297.86
##
## Number of Fisher Scoring iterations: 7
To predict “test” data, this analysis uses logistic regression method.
set.seed(132)
pred_test <- predict(object = model_log.reg.step,
newdata = test,
type = "response")
pred_class <- as.factor(if_else(pred_test > 0.5, "Retail", "Horeca"))ggplot(test, aes(x=pred_test)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()According to graph abbove, the prediction tends to 0 is “Horeca”. Therefore, if > 0.5 will be defined as “Retail”
To evaluate the model performance will be used confusion matrix, as it is proceeded below.
set.seed(132)
confusionMatrix(data = pred_class,
reference = test$Channel,
positive = "Retail")## Confusion Matrix and Statistics
##
## Reference
## Prediction Horeca Retail
## Horeca 55 5
## Retail 5 55
##
## Accuracy : 0.9167
## 95% CI : (0.8521, 0.9593)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8333
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9167
## Specificity : 0.9167
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.9167
## Prevalence : 0.5000
## Detection Rate : 0.4583
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9167
##
## 'Positive' Class : Retail
##
According to confusionMatrix above, the result shows:
In overall this model can predict accuracy, sensitivity, specificity, and precision is more than 80%.
#Predictor
set.seed(132)
train.knn.x <- train %>%
select_if(is.numeric)
test.knn.x <- test %>%
select_if(is.numeric)
#Target
train.knn.y <- train[,"Channel"]
test.knn.y <- test[,"Channel"]Here is the scaling process for the data train and test.
#Predictor Scaling
set.seed(132)
train.knn.xs <- scale(train.knn.x)
test.knn.xs <- scale(test.knn.x,
center = attr(train.knn.xs, "scaled:center"),
scale = attr(train.knn.xs, "scaled:scale"))Here is to find optimum K value that will be used in KNN
#Find optimum K
sqrt(nrow(train.knn.xs))## [1] 21.81742
Here is to KNN model building process.
set.seed(132)
pred.knn <- knn(train = train.knn.xs,
test = test.knn.xs,
cl= train.knn.y,
k= 21)Therefore, to evaluate the model will use ConfusionMatrix.
set.seed(132)
confusionMatrix(data = as.factor(pred.knn),
reference = as.factor(test.knn.y),
positive = "Retail")## Confusion Matrix and Statistics
##
## Reference
## Prediction Horeca Retail
## Horeca 50 3
## Retail 10 57
##
## Accuracy : 0.8917
## 95% CI : (0.8219, 0.941)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7833
##
## Mcnemar's Test P-Value : 0.09609
##
## Sensitivity : 0.9500
## Specificity : 0.8333
## Pos Pred Value : 0.8507
## Neg Pred Value : 0.9434
## Prevalence : 0.5000
## Detection Rate : 0.4750
## Detection Prevalence : 0.5583
## Balanced Accuracy : 0.8917
##
## 'Positive' Class : Retail
##
In overall this model can predict accuracy, sensitivity, specificity, and precision is more than 80%.
Based on the analysis of wholesale dataset using Logistic Regression Model and KNN Model, both model have performance above 80% in term of accuracy, sensitivity, specificity, and precision. In overall review, Logistic Regression Model has better performance than KNN.
Since the perspective of this analysis is to determine what channels can be categorize based on data so that it can be an input for the Management how to improve the effective channel strategy to boost sales. Therefore, this analysis uses the highest precision (Pos Pred Value) which is the Logistic Regression Model that has better performance with compared to KNN.