The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. source : https://archive.ics.uci.edu/ml/datasets/wholesale+customers. I will try to predict the channel on this data using logistic regression and the knn models.
library(dplyr)
library(caret)
library(ggplot2)
library(plotly)
library(class)wholesale <- read.csv("data_input/wholesale.csv")
head(wholesale)glimpse(wholesale)#> Rows: 440
#> Columns: 8
#> $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
#> $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
#> $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
#> $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
#> $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
#> $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
#> $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~
Attribute Information:
wholesale <- wholesale %>%
select(-Region) %>%
mutate(Channel = factor(Channel, levels = c(1, 2), labels = c("Horeca", "Retail"))
)str(wholesale)#> 'data.frame': 440 obs. of 7 variables:
#> $ Channel : Factor w/ 2 levels "Horeca","Retail": 2 2 2 1 2 2 2 2 1 2 ...
#> $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
#> $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
#> $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
#> $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
#> $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
#> $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
anyNA(wholesale)#> [1] FALSE
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(x = nrow(wholesale),
size = nrow(wholesale)*0.8)
# splitting
sale_train <- wholesale[index, ]
sale_test <- wholesale[-index, ]prop.table(table(sale_train$Channel))#>
#> Horeca Retail
#> 0.6846591 0.3153409
The proportion of the data is 7:3, I assume it’s not balanced.
RNGkind(sample.kind = "Rounding")
set.seed(123)
sale_train <- downSample(x = sale_train %>%
select(-Channel),
y = sale_train$Channel,
yname = "Channel")prop.table(table(sale_train$Channel))#>
#> Horeca Retail
#> 0.5 0.5
model_all <- glm(formula = Channel~., data = sale_train, family = binomial)
summary(model_all)#>
#> Call:
#> glm(formula = Channel ~ ., family = binomial, data = sale_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.95563 -0.38246 -0.00624 0.24385 3.09751
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.930e+00 6.078e-01 -4.821 1.43e-06 ***
#> Fresh 2.116e-05 2.266e-05 0.934 0.350
#> Milk 1.022e-04 6.874e-05 1.486 0.137
#> Grocery 5.982e-05 6.691e-05 0.894 0.371
#> Frozen -1.971e-04 1.155e-04 -1.706 0.088 .
#> Detergents_Paper 8.510e-04 1.697e-04 5.014 5.33e-07 ***
#> Delicassen -1.372e-04 1.290e-04 -1.063 0.288
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 307.76 on 221 degrees of freedom
#> Residual deviance: 118.05 on 215 degrees of freedom
#> AIC: 132.05
#>
#> Number of Fisher Scoring iterations: 7
model_step <- step(object = model_all, direction = "both", trace = F)summary(model_step)#>
#> Call:
#> glm(formula = Channel ~ Milk + Frozen + Detergents_Paper, family = binomial,
#> data = sale_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.97606 -0.39799 -0.00721 0.25024 3.00802
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.603e+00 5.075e-01 -5.130 2.90e-07 ***
#> Milk 1.241e-04 5.789e-05 2.144 0.0321 *
#> Frozen -2.003e-04 1.085e-04 -1.847 0.0647 .
#> Detergents_Paper 8.759e-04 1.383e-04 6.332 2.42e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 307.76 on 221 degrees of freedom
#> Residual deviance: 120.47 on 218 degrees of freedom
#> AIC: 128.47
#>
#> Number of Fisher Scoring iterations: 7
predictions <- predict(object = model_step, newdata = sale_test, type = "response")sale_test$prediction <- predictions
sale_test$prediction <- ifelse(test = sale_test$prediction > 0.5, yes = "Retail", no = "Horeca")sale_test <- sale_test %>%
mutate(prediction = as.factor(prediction))confusionMatrix(data = sale_test$prediction, reference = sale_test$Channel, positive = "Retail")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Horeca Retail
#> Horeca 54 5
#> Retail 3 26
#>
#> Accuracy : 0.9091
#> 95% CI : (0.8287, 0.9599)
#> No Information Rate : 0.6477
#> P-Value [Acc > NIR] : 1.509e-08
#>
#> Kappa : 0.7978
#>
#> Mcnemar's Test P-Value : 0.7237
#>
#> Sensitivity : 0.8387
#> Specificity : 0.9474
#> Pos Pred Value : 0.8966
#> Neg Pred Value : 0.9153
#> Prevalence : 0.3523
#> Detection Rate : 0.2955
#> Detection Prevalence : 0.3295
#> Balanced Accuracy : 0.8930
#>
#> 'Positive' Class : Retail
#>
Model has a good performance with an accuracy of 91%, meaning that 91% of our data is correctly classified. The value of sensitivity and specificity are 83.87 % and 94.74 %.
Now we will still use the same dataset that we have used before. And then, we will separate the target variable and predictor variable.
sale_train_predictor <- sale_train %>% select(-Channel)
sale_train_target <- sale_train %>% pull(Channel)
sale_test_predictor <- sale_test %>% select(-c("Channel", "prediction"))
sale_test_target <- sale_test %>% pull(Channel)sale_train_predictor_scale <- sale_train_predictor %>%
scale()sale_test_predictor_scale <- sale_test_predictor %>%
scale(center = attr(sale_train_predictor_scale,"scaled:center"), scale = attr(sale_train_predictor_scale,"scaled:scale"))sqrt(nrow(sale_train_predictor_scale))#> [1] 14.89966
The Optimum K-Value is 15
channel_pred <- knn(train = sale_train_predictor_scale,
test = sale_test_predictor_scale,
cl = sale_train_target,
k = 15
)confusionMatrix(data = channel_pred,
reference = sale_test_target,
positive = "Retail")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Horeca Retail
#> Horeca 51 5
#> Retail 6 26
#>
#> Accuracy : 0.875
#> 95% CI : (0.7873, 0.9359)
#> No Information Rate : 0.6477
#> P-Value [Acc > NIR] : 1.329e-06
#>
#> Kappa : 0.7281
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.8387
#> Specificity : 0.8947
#> Pos Pred Value : 0.8125
#> Neg Pred Value : 0.9107
#> Prevalence : 0.3523
#> Detection Rate : 0.2955
#> Detection Prevalence : 0.3636
#> Balanced Accuracy : 0.8667
#>
#> 'Positive' Class : Retail
#>
This model also has a good performance with an accuracy of 87.5%, meaning that 91% of our data is correctly classified. The value of sensitivity and specificity are 83.87 % and 89.47%.
Based on the evaluation results of the two models above, I can conclude that the best model for predicting channel values in wholesale data is the StepWise Model. The StepWise model has better accuracy in predicting channel values than the KNN model.