On this occasion, I will try to classify using wholesale data, where later we will try to explore the target channel variables to be able to analyze how the influence of the variables that greatly influence and be grouped as a particular channel. The algorithm that I will use is to use logistic regression and k-nearest neighbor which is included in supervised learning.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
## Loading required package: lattice
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:gtools':
##
## permutations
**Data Input & Structure*
wholesale <- read.csv("wholesale.csv")
str(wholesale)
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Now we would like to check if there is missing values
colSums(is.na(wholesale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Great! No missing values!
table(wholesale$Channel)
##
## 1 2
## 298 142
cor(wholesale$Channel, wholesale$Region)
## [1] 0.06202762
wholesale$Channel %>%
table() %>%
prop.table()
## .
## 1 2
## 0.6772727 0.3227273
AS we can see that the proportion of data is not balanced and there is more “hurray”Horeca" than the region. A balanced proportion is important so that the classification model studies the characteristics of positive and negative classes in a balanced way, not from just one class, but it all depends on needs. For this case I decided not to resampling (add data).
set.seed(666)
wholesale_clean <- wholesale %>%
mutate(Channel = as.factor(Channel)) %>%
select(-Region)
wholesale_split <- sample(nrow(wholesale_clean), nrow(wholesale_clean)*0.6772727)
wholesale_train <- wholesale_clean[wholesale_split,]
wholesale_test <- wholesale_clean[-wholesale_split,]
wholesale %>% nrow()
## [1] 440
wholesale_train$Channel %>%
table() %>%
prop.table()
## .
## 1 2
## 0.6666667 0.3333333
wholesale_test$Channel %>%
table() %>%
prop.table()
## .
## 1 2
## 0.6993007 0.3006993
Logistic Regression model 0 Interpretation without predictor
whole_clean <- wholesale %>%
mutate(Channel = as.factor(Channel))
model0 <- glm(formula = Channel~1 , data = whole_clean, family = "binomial") # logistic regression, biner classification
summary(model0)
##
## Call:
## glm(formula = Channel ~ 1, family = "binomial", data = whole_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8828 -0.8828 -0.8828 1.5040 1.5040
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7413 0.1020 -7.269 3.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 553.44 on 439 degrees of freedom
## Residual deviance: 553.44 on 439 degrees of freedom
## AIC: 555.44
##
## Number of Fisher Scoring iterations: 4
sum_model0 <- summary(model0)
sum_model0$coefficients[1]
## [1] -0.7412664
inv.logit(-0.7413)
## [1] 0.3227199
exp(-0.7413)
## [1] 0.4764941
Conclusion: Horeca has a 32 times more chance to sell with a 47% chance.
K-nearest Neighbor
wholesale_x_train <- wholesale_train %>%
select(-Channel) %>%
scale()
wholesale_y_train <- wholesale_train %>%
select(Channel)
wholesale_x_test <- wholesale_test %>%
select(-Channel) %>%
scale(center = attr(wholesale_x_train, "scaled:center") ,
scale = attr(wholesale_x_train, "scaled:scale"))
wholesale_y_test <- wholesale_test %>%
select(Channel)
sqrt(nrow(wholesale_x_train))
## [1] 17.23369
Model Evaluation
library(class)
wholesale_knn <- knn(train = wholesale_x_train,
test = wholesale_x_test,
cl = wholesale_y_train$Channel,
k = 17.23369)
wholesale_y_train$Channel %>%
table() %>%
prop.table()
## .
## 1 2
## 0.6666667 0.3333333
confusionMatrix(wholesale_knn, reference = wholesale_y_test$Channel, positive = "2")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 93 9
## 2 7 34
##
## Accuracy : 0.8881
## 95% CI : (0.8247, 0.9347)
## No Information Rate : 0.6993
## P-Value [Acc > NIR] : 7.134e-08
##
## Kappa : 0.7304
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.7907
## Specificity : 0.9300
## Pos Pred Value : 0.8293
## Neg Pred Value : 0.9118
## Prevalence : 0.3007
## Detection Rate : 0.2378
## Detection Prevalence : 0.2867
## Balanced Accuracy : 0.8603
##
## 'Positive' Class : 2
##
Conclusion: The test results have an accuracy of 93% with a P-Value of almost 10, which is 9.16.
Through these 2 models through wholesale data, it can be seen that the modeling rate has an accuracy of 93% with a P-Value of 10 which is 9.16, and based on the modeling of logistic regression it shows that Horeca has the opportunity to sell more than 32 times with a 47% probability.
This shows that the data can be trusted with a good level of accuracy and from the data it concludes that Horeca to sell more with a margin of 32 times, but with a percentage rate or opportunity below 50%, this shows that Horeca is likely to stagnate in sales wholesaler.
With all predictor
model1 <- glm(formula = Channel~. , data = whole_clean, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model0)
##
## Call:
## glm(formula = Channel ~ 1, family = "binomial", data = whole_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8828 -0.8828 -0.8828 1.5040 1.5040
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7413 0.1020 -7.269 3.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 553.44 on 439 degrees of freedom
## Residual deviance: 553.44 on 439 degrees of freedom
## AIC: 555.44
##
## Number of Fisher Scoring iterations: 4
Build model
modelstep <- step(model1, direction = "backward")
## Start: AIC=217.29
## Channel ~ Region + Fresh + Milk + Grocery + Frozen + Detergents_Paper +
## Delicassen
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Fresh 1 201.35 215.35
## - Delicassen 1 201.83 215.83
## - Milk 1 203.20 217.20
## <none> 201.29 217.29
## - Region 1 203.91 217.91
## - Grocery 1 205.01 219.01
## - Frozen 1 205.40 219.40
## - Detergents_Paper 1 248.46 262.45
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=215.35
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper +
## Delicassen
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Delicassen 1 201.85 213.85
## - Milk 1 203.29 215.29
## <none> 201.35 215.35
## - Region 1 204.05 216.05
## - Grocery 1 205.12 217.12
## - Frozen 1 205.63 217.63
## - Detergents_Paper 1 249.68 261.68
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=213.85
## Channel ~ Region + Milk + Grocery + Frozen + Detergents_Paper
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Milk 1 203.53 213.53
## <none> 201.85 213.85
## - Region 1 204.38 214.38
## - Grocery 1 205.17 215.17
## - Frozen 1 208.34 218.34
## - Detergents_Paper 1 251.68 261.68
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=213.53
## Channel ~ Region + Grocery + Frozen + Detergents_Paper
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 203.53 213.53
## - Region 1 206.03 214.03
## - Frozen 1 208.41 216.41
## - Grocery 1 211.86 219.86
## - Detergents_Paper 1 254.20 262.20
summary(modelstep)
##
## Call:
## glm(formula = Channel ~ Region + Grocery + Frozen + Detergents_Paper,
## family = "binomial", data = whole_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93992 -0.31466 -0.22617 0.04413 3.08861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.694e+00 8.334e-01 -5.633 1.78e-08 ***
## Region 4.088e-01 2.667e-01 1.533 0.12529
## Grocery 1.430e-04 5.068e-05 2.822 0.00478 **
## Frozen -1.353e-04 7.340e-05 -1.843 0.06536 .
## Detergents_Paper 8.636e-04 1.336e-04 6.463 1.03e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 553.44 on 439 degrees of freedom
## Residual deviance: 203.53 on 435 degrees of freedom
## AIC: 213.53
##
## Number of Fisher Scoring iterations: 7
Interpretation
inv.logit(-0.00013526)
## [1] 0.4999662
exp(-0.00013526)
## [1] 0.9998647
Conclusion: Frozen product has 5 times more chance to sell with 99% chance
Wholesale research with all predictors shows that frozen products have 5 times more chance of being sold with a 99% chance of success