Objective
In this project I will predict the customer type in a wholesaler dataset. The customers are divided into two channels: Horeca (Hotel, Restaurant, Café) and Retail. My job is to predict whether a customer fits better as Horeca or Retail. I will use two models to classify the customers: a logistic regression model, which estimates the probability of each category, and a KNN model.
Loading Libraries
To process the data, I first need to load the necessary libraries.
library(dplyr)   # data manipulation
library(gtools)
library(gmodels)
library(ggplot2) # plotting
library(class)   # knn()
library(tidyr)
library(MASS)    # stepAIC(); note that MASS masks dplyr::select()
library(caret)   # confusionMatrix()
Logistic Regression
The dataset I will use is wholesaler data from 2014. It records each wholesale customer's annual spending per product category, which is used to distribute products to the customers. I save the data in an object for processing.
Saving the dataset into an object
sale <- read.csv("wholesale.csv")
glimpse(sale)
## Rows: 440
## Columns: 8
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
Checking the dataset
head(sale)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
From the data above, the attribute information is:
FRESH: annual spending (m.u.) on fresh products (continuous)
MILK: annual spending (m.u.) on milk products (continuous)
GROCERY: annual spending (m.u.) on grocery products (continuous)
FROZEN: annual spending (m.u.) on frozen products (continuous)
DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (continuous)
DELICATESSEN: annual spending (m.u.) on delicatessen products (continuous)
CHANNEL: customer channel - (1) Horeca (Hotel/Restaurant/Café) or (2) Retail (nominal)
REGION: customer region - (1) Lisbon, (2) Oporto or (3) Other (nominal)
Data Manipulation
I want to coerce some columns to a categorical class so that I can analyze them further. Based on the attribute information, I will change the Channel and Region columns into factors.
Data coercion
sale <- sale %>%
  mutate(Channel = factor(Channel, levels = c(1, 2), labels = c("Horeca", "Retail")),
         Region = factor(Region, levels = c(1, 2, 3), labels = c("Lisbon", "Oporto", "Other")))
Checking coercion
glimpse(sale)
## Rows: 440
## Columns: 8
## $ Channel <fct> Retail, Retail, Retail, Horeca, Retail, Retail, Retai…
## $ Region <fct> Other, Other, Other, Other, Other, Other, Other, Othe…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
After that, I want to check whether the data contains any missing values.
Checking missing values
colSums(is.na(sale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Pre-Processing Data
I want to look at the class proportions of the data. Reasonably balanced classes will help the modelling process later.
Checking proportion
prop.table(table(sale$Channel))
##
## Horeca Retail
## 0.6772727 0.3227273
Checking amount of data
table(sale$Channel)
##
## Horeca Retail
## 298 142
At roughly 68:32, the classes are not severely imbalanced, so I will move on to the next step.
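For reference, if the classes had been badly skewed, one could rebalance the data before modelling, for example with caret's downSample(). A minimal sketch, not needed for this data:
Downsampling sketch
# illustrative only: downsample the majority class so Channel is 50:50
balanced <- downSample(x = sale %>% dplyr::select(-Channel),
                       y = sale$Channel,
                       yname = "Channel")
prop.table(table(balanced$Channel))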
Splitting Train-Test
Splitting the data lets me test the model on unseen data, which gives an honest picture of its predictive performance. The split produces a train set and a test set; I will use 70% of the data for training and 30% for testing.
Splitting Data
set.seed(100)
index <- sample(nrow(sale), nrow(sale)*0.7)
train <- sale[index,]
test <- sale[-index,]
If the split succeeded, the train data should contain about 70% of the rows (308 of 440). I check the train data to confirm this.
Checking train data
glimpse(train)
## Rows: 308
## Columns: 8
## $ Channel <fct> Retail, Retail, Retail, Retail, Horeca, Horeca, Horec…
## $ Region           <fct> Lisbon, Other, Other, Lisbon, Other, Other, Oporto, O…
## $ Fresh <int> 4484, 37, 12579, 1107, 25066, 13265, 7034, 32717, 403…
## $ Milk <int> 14399, 1275, 11114, 11711, 5010, 1196, 1492, 16784, 2…
## $ Grocery <int> 24708, 22272, 17569, 23596, 5026, 4221, 2405, 13626, …
## $ Frozen <int> 3549, 137, 805, 955, 9806, 6404, 12569, 60869, 774, 5…
## $ Detergents_Paper <int> 14235, 6747, 6457, 9265, 1092, 507, 299, 1272, 54, 31…
## $ Delicassen <int> 1681, 110, 1519, 710, 960, 1788, 1117, 5609, 63, 1000…
Modelling Data
I will use glm() to fit a binomial logistic regression model, which estimates the probability that a customer belongs to the Retail channel.
Regression Probability Model
wholesale <- glm(formula = Channel ~ ., family = "binomial", data = train)
summary(wholesale)
##
## Call:
## glm(formula = Channel ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.91292 -0.30769 -0.19112 0.05672 3.14662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.880e+00 8.672e-01 -5.628 1.82e-08 ***
## RegionOporto 2.437e+00 9.659e-01 2.523 0.0116 *
## RegionOther 1.376e+00 7.497e-01 1.836 0.0664 .
## Fresh 1.069e-05 2.015e-05 0.530 0.5959
## Milk 5.544e-05 7.093e-05 0.782 0.4344
## Grocery 7.166e-05 6.551e-05 1.094 0.2740
## Frozen -1.949e-04 1.250e-04 -1.559 0.1189
## Detergents_Paper 9.555e-04 1.619e-04 5.903 3.57e-09 ***
## Delicassen -9.677e-05 1.212e-04 -0.799 0.4245
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.29 on 307 degrees of freedom
## Residual deviance: 139.50 on 299 degrees of freedom
## AIC: 157.5
##
## Number of Fisher Scoring iterations: 7
I want to optimize the model by eliminating variables that do not contribute. I will use stepAIC() with backward elimination, which repeatedly drops the predictor whose removal lowers the AIC the most.
model_new <- stepAIC(wholesale, direction = "backward")
## Start: AIC=157.5
## Channel ~ Region + Fresh + Milk + Grocery + Frozen + Detergents_Paper +
## Delicassen
##
## Df Deviance AIC
## - Fresh 1 139.78 155.78
## - Milk 1 140.10 156.10
## - Delicassen 1 140.15 156.15
## - Grocery 1 140.70 156.70
## <none> 139.50 157.50
## - Frozen 1 143.42 159.42
## - Region 2 146.62 160.62
## - Detergents_Paper 1 181.76 197.76
##
## Step: AIC=155.78
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper +
## Delicassen
##
## Df Deviance AIC
## - Delicassen 1 140.35 154.35
## - Milk 1 140.58 154.58
## - Grocery 1 140.97 154.97
## <none> 139.78 155.78
## - Frozen 1 143.46 157.46
## - Region 2 147.30 159.30
## - Detergents_Paper 1 182.11 196.11
##
## Step: AIC=154.35
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper
##
## Df Deviance AIC
## - Grocery 1 141.17 153.17
## - Milk 1 141.22 153.22
## <none> 140.35 154.35
## - Region 2 147.85 157.85
## - Frozen 1 146.15 158.15
## - Detergents_Paper 1 184.10 196.10
##
## Step: AIC=153.17
## Channel ~ Region + Milk + Frozen + Detergents_Paper
##
## Df Deviance AIC
## <none> 141.17 153.17
## - Milk 1 143.50 153.50
## - Frozen 1 147.28 157.28
## - Region 2 149.40 157.40
## - Detergents_Paper 1 248.25 258.25
The optimized model is summarized below:
summary(model_new)
##
## Call:
## glm(formula = Channel ~ Region + Milk + Frozen + Detergents_Paper,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9738 -0.3146 -0.1874 0.0617 3.1662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.793e+00 8.615e-01 -5.564 2.64e-08 ***
## RegionOporto 2.627e+00 9.657e-01 2.721 0.00652 **
## RegionOther 1.470e+00 7.380e-01 1.992 0.04634 *
## Milk 9.391e-05 5.969e-05 1.573 0.11564
## Frozen -2.098e-04 1.036e-04 -2.025 0.04285 *
## Detergents_Paper 1.017e-03 1.415e-04 7.189 6.52e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.29 on 307 degrees of freedom
## Residual deviance: 141.17 on 302 degrees of freedom
## AIC: 153.17
##
## Number of Fisher Scoring iterations: 7
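The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to read. A quick illustration:
Odds ratio interpretation
# exp() converts log-odds coefficients to odds ratios
exp(coef(model_new))
# e.g. exp(2.627) ~ 13.8: holding spending constant, an Oporto customer
# has about 13.8 times the odds of being Retail compared with the
# reference region (Lisbon)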
Prediction Model
With the new model I will predict the test data; type = "response" returns the predicted probabilities.
test$pred <- predict(model_new, type = "response", newdata = test)
test$pred
## [1] 1.383277e-01 2.969916e-01 5.423034e-01 2.100807e-01 9.934819e-01
## [6] 8.481993e-01 8.870891e-01 7.279295e-02 2.739359e-02 9.685245e-01
## [11] 2.538715e-02 9.991819e-01 9.519709e-02 7.448757e-02 2.203871e-02
## [16] 6.093315e-02 9.121495e-01 9.999946e-01 3.196117e-02 9.612344e-01
## [21] 1.000000e+00 4.599294e-02 5.719156e-01 1.000000e+00 6.306540e-01
## [26] 6.174651e-02 5.771916e-01 7.212636e-02 2.294073e-02 2.566079e-01
## [31] 9.999511e-01 3.401562e-02 6.209438e-02 1.000000e+00 5.783327e-03
## [36] 3.829139e-02 1.406343e-02 3.341216e-05 5.145085e-02 9.925836e-01
## [41] 5.109060e-03 1.316123e-01 3.635683e-02 9.981293e-01 4.903190e-01
## [46] 4.355140e-02 2.421080e-02 3.181022e-01 4.494786e-02 1.224249e-01
## [51] 1.096796e-01 5.403652e-02 3.029666e-02 9.804503e-01 9.960542e-01
## [56] 7.934930e-01 9.999910e-01 9.847144e-01 1.979123e-01 3.029803e-02
## [61] 9.996261e-01 9.992572e-01 8.415852e-01 2.858877e-02 4.578010e-02
## [66] 4.087296e-01 9.774852e-01 8.583760e-01 3.111996e-02 5.059652e-01
## [71] 1.111102e-02 2.068873e-02 3.955593e-01 8.440131e-01 9.999651e-01
## [76] 7.158553e-03 2.076702e-02 7.585840e-02 3.867282e-02 4.601115e-03
## [81] 8.078802e-03 5.392549e-02 2.418084e-03 1.840993e-02 2.305994e-03
## [86] 1.172421e-02 6.581478e-02 6.507432e-01 6.974966e-03 3.504527e-03
## [91] 1.749455e-02 7.260970e-02 1.707025e-02 7.952828e-03 1.854480e-02
## [96] 9.931477e-02 4.776252e-02 2.779207e-02 9.913404e-01 1.365792e-01
## [101] 1.305378e-01 1.626878e-01 2.008238e-01 3.254560e-02 6.901159e-02
## [106] 1.046726e-01 1.179395e-01 2.980573e-02 9.094497e-01 3.044272e-02
## [111] 6.781503e-01 4.168059e-02 3.596900e-02 1.581787e-02 6.030694e-02
## [116] 2.492814e-01 4.278225e-02 4.124409e-02 1.152182e-01 2.578785e-02
## [121] 3.255173e-02 1.059982e-02 2.119325e-02 3.201368e-03 5.378019e-02
## [126] 2.010230e-03 1.127120e-02 5.725339e-02 3.345292e-01 1.728719e-02
## [131] 9.999980e-01 3.977621e-02
To see how the predicted probabilities are distributed, I will plot their density.
Probability Distribution
ggplot(test, aes(x=pred)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()
In the graph above, the density is concentrated near 0, which means most test customers are predicted to be Horeca.
Next I will turn the prediction column into class labels and compare the predictions against the actual data.
Sample of Data Prediction and Result
test$pred <- factor(ifelse(test$pred > 0.5, "Retail", "Horeca"))
test[1:10, c("pred", "Channel")]
## pred Channel
## 5 Horeca Retail
## 6 Horeca Retail
## 8 Retail Retail
## 9 Horeca Horeca
## 10 Retail Retail
## 13 Retail Retail
## 17 Retail Retail
## 18 Horeca Horeca
## 22 Horeca Horeca
## 24 Retail Retail
In the data above, a probability greater than 0.5 is labelled Retail; otherwise it is labelled Horeca.
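The 0.5 cutoff is a convention, not a requirement: lowering it catches more Retail customers (higher recall) at the cost of more false positives. A minimal sketch, recomputing the raw probabilities since test$pred now holds labels:
Trying another cutoff
# recompute the probabilities, then compare two cutoffs
prob <- predict(model_new, newdata = test, type = "response")
table(pred = ifelse(prob > 0.5, "Retail", "Horeca"), actual = test$Channel)
table(pred = ifelse(prob > 0.3, "Retail", "Horeca"), actual = test$Channel)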
Model Evaluation
I will use a confusion matrix to evaluate the model.
Confusion matrix
log_model <- confusionMatrix(test$pred, test$Channel, positive = "Retail")
log_model
## Confusion Matrix and Statistics
##
## Reference
## Prediction Horeca Retail
## Horeca 87 9
## Retail 3 33
##
## Accuracy : 0.9091
## 95% CI : (0.8466, 0.9521)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 5.202e-10
##
## Kappa : 0.7822
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.7857
## Specificity : 0.9667
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.9062
## Prevalence : 0.3182
## Detection Rate : 0.2500
## Detection Prevalence : 0.2727
## Balanced Accuracy : 0.8762
##
## 'Positive' Class : Retail
##
From the output above we can interpret log_model as follows:
Accuracy: 0.9091
Sensitivity/Recall: 0.7857
Precision/Pos Pred Value: 0.9167
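These figures can be verified by hand from the confusion matrix counts (TP = 33, TN = 87, FP = 3, FN = 9, with Retail as the positive class):
Manual check
TP <- 33; TN <- 87; FP <- 3; FN <- 9
(TP + TN) / (TP + TN + FP + FN) # accuracy  = 0.9091
TP / (TP + FN)                  # recall    = 0.7857
TP / (TP + FP)                  # precision = 0.9167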
KNN Model
The KNN model measures distances between observations, so it uses only the numeric predictor variables. The Region column will therefore not be used in the KNN model.
Deselect column
wsale <- sale %>% dplyr::select(-Region) # dplyr:: prefix needed because MASS masks select()
glimpse(wsale)
## Rows: 440
## Columns: 7
## $ Channel <fct> Retail, Retail, Retail, Horeca, Retail, Retail, Retai…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
Data Splitting
The process is the same as for the regression model. I will use the same proportions so that I can compare the results with the regression model.
Data train and test
RNGkind(sample.kind = "Rounding")
set.seed(300)
# index sampling
index <- sample(nrow(wsale), nrow(wsale)*0.7)
# splitting
wsale_train <- wsale[index,]
wsale_test <- wsale[-index,]
I want to check whether the class proportions of the new sample are still balanced.
Checking proportion
# recheck class balance
prop.table((table(wsale_train$Channel)))
##
## Horeca Retail
## 0.6428571 0.3571429
The proportions are still reasonably balanced. I want to inspect the new train data to confirm it.
Checking data
head(wsale_train)
## Channel Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 282 Retail 12238 7108 6235 1093 2328 2079
## 235 Horeca 15603 2703 3833 4260 325 2563
## 247 Horeca 8885 2428 1777 1777 430 610
## 224 Retail 2790 2527 5265 5612 788 1360
## 208 Retail 2541 4737 6089 2946 5316 120
## 4 Horeca 13265 1196 4221 6404 507 1788
The train data is ready. Now I need to separate the numeric predictors from the categorical target, because knn() takes the predictor matrix and the class labels as separate arguments.
Separating predictors and target
wsale_train_x <- wsale_train %>% select_if(is.numeric)
wsale_test_x <- wsale_test %>% select_if(is.numeric)
# target
wsale_train_y <- wsale_train[,"Channel"]
wsale_test_y <- wsale_test[,"Channel"]
Data Modelling
Now the data is ready for scaling. Because KNN is distance-based, each predictor is standardized (z-score) using the training set's mean and standard deviation.
Data scaling
wsale_train_xs <- scale(wsale_train_x)
wsale_test_xs <- scale(wsale_test_x, center = attr(wsale_train_xs, "scaled:center"), scale = attr(wsale_train_xs, "scaled:scale"))
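Note that the test set is centered and scaled with the training parameters, so no information from the test data leaks into preprocessing. A quick sanity check on the result:
Checking the scaling
# each scaled training column should have mean ~0 and sd ~1; the test
# columns are only approximately standardized because they reuse the
# training parameters
round(colMeans(wsale_train_xs), 10)
apply(wsale_train_xs, 2, sd)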
I need an optimum k value for the model; a common rule of thumb is the square root of the number of rows in the training data.
Optimum k value
sqrt(nrow(wsale_train_xs))
## [1] 17.54993
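The square root rule suggests k = 17, which is also odd, so two-class votes cannot tie. If one wanted to choose k empirically instead, a minimal sketch (properly this would use a validation split or cross-validation rather than the test set):
Tuning k (illustrative)
# accuracy over a range of odd k values
ks  <- seq(3, 31, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = wsale_train_xs, test = wsale_test_xs,
              cl = wsale_train_y, k = k)
  mean(pred == wsale_test_y)
})
data.frame(k = ks, accuracy = acc)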
Now I can build the KNN model and predict the test set.
knn model
wsale_pred <- knn(train=wsale_train_xs,
test=wsale_test_xs,
cl=wsale_train_y,
k=17)
wsale_pred
## [1] Horeca Retail Horeca Horeca Retail Horeca Horeca Retail Horeca Horeca
## [11] Horeca Retail Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca
## [21] Horeca Horeca Horeca Retail Horeca Retail Horeca Retail Horeca Horeca
## [31] Horeca Horeca Retail Retail Retail Retail Horeca Horeca Retail Retail
## [41] Horeca Retail Horeca Retail Horeca Retail Horeca Horeca Horeca Retail
## [51] Horeca Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca
## [61] Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca Retail Horeca
## [71] Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca Horeca
## [81] Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca
## [91] Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca Horeca Retail
## [101] Horeca Horeca Horeca Horeca Retail Horeca Horeca Retail Retail Retail
## [111] Retail Horeca Retail Retail Horeca Horeca Retail Horeca Retail Horeca
## [121] Horeca Horeca Horeca Horeca Horeca Horeca Retail Horeca Horeca Retail
## [131] Horeca Horeca
## Levels: Horeca Retail
The predictions are shown above. To see the performance, I put them into a confusion matrix.
knn confusion matrix
knn_model <- confusionMatrix(data = as.factor(wsale_pred),
reference = wsale_test_y,
positive = "Retail")
knn_model
## Confusion Matrix and Statistics
##
## Reference
## Prediction Horeca Retail
## Horeca 93 5
## Retail 7 27
##
## Accuracy : 0.9091
## 95% CI : (0.8466, 0.9521)
## No Information Rate : 0.7576
## P-Value [Acc > NIR] : 6.964e-06
##
## Kappa : 0.7576
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.8438
## Specificity : 0.9300
## Pos Pred Value : 0.7941
## Neg Pred Value : 0.9490
## Prevalence : 0.2424
## Detection Rate : 0.2045
## Detection Prevalence : 0.2576
## Balanced Accuracy : 0.8869
##
## 'Positive' Class : Retail
##
From the output above we can interpret knn_model as follows:
Accuracy: 0.9091
Sensitivity/Recall: 0.8438
Precision/Pos Pred Value: 0.7941
Conclusion and Recommendation
In terms of false positives and false negatives, for these models:
FN: predicted negative (Horeca), but actually positive (Retail)
FP: predicted positive (Retail), but actually negative (Horeca)
Recall indicates how well a model avoids FN, and precision how well it avoids FP.
Comparing the confusion matrices of both models, several conclusions can be drawn.
Both models achieve the same high accuracy of 90.91%, which means both predict the Horeca and Retail classes well overall.
The KNN model has a higher sensitivity/recall (0.8438 vs 0.7857) than the regression model. If we want to miss as few Retail customers as possible (minimize false negatives), the KNN model is the better predictor.
The regression model has a higher precision (0.9167 vs 0.7941) than the KNN model. If we want the Retail predictions to be as reliable as possible (minimize false positives), the regression model is the better choice.
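For a side-by-side check, the metrics discussed above can be pulled directly from the two confusionMatrix objects (bearing in mind the two models were evaluated on different random splits, so the comparison is indicative rather than exact):
Comparing the models
# accuracy, recall and precision for both models in one table
rbind(logistic = c(log_model$overall["Accuracy"],
                   log_model$byClass[c("Sensitivity", "Pos Pred Value")]),
      knn      = c(knn_model$overall["Accuracy"],
                   knn_model$byClass[c("Sensitivity", "Pos Pred Value")]))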