Machine learning offers several methods for prediction. In this analysis I will work with the wholesale dataset and examine how its variables, both numeric and categorical, influence the target Channel. The prediction will use two methods, Logistic Regression and K-NN. We will compare the two methods and conclude which one is better suited for this prediction.
Before we start, we load the required libraries:
library(tidyverse)
library(caret)
library(plotly)
library(ggplot2)
library(GGally)
library(dplyr)
library(car)
library(class)
library(lmtest)
library(gtools)

Import the data that we have prepared, namely wholesale.csv, using read.csv() to match the file extension:

wholesale <- read.csv("wholesale.csv")

Let's take a quick look at the data with head():

head(wholesale)

We check the data types with glimpse():

glimpse(wholesale)

#> Rows: 440
#> Columns: 8
#> $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
#> $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
#> $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
#> $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
#> $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
#> $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
From glimpse() above we can see that the data has 440 rows and 8 columns. Here is an explanation of the variables:

- Channel: Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)
- Region: Lisbon, Oporto or Other (Nominal)
- Fresh: annual spending (m.u.) on fresh products (Continuous)
- Milk: annual spending (m.u.) on milk products (Continuous)
- Grocery: annual spending (m.u.) on grocery products (Continuous)
- Frozen: annual spending (m.u.) on frozen products (Continuous)
- Detergents_Paper: annual spending (m.u.) on detergents and paper products (Continuous)
- Delicassen: annual spending (m.u.) on delicatessen products (Continuous)

Change data types
Region and Channel will be converted from integer to factor, because the numbers are category labels, not quantities:
wholesale_clean <- wholesale %>%
mutate(Channel = ifelse(Channel == 1, "Horeca", "Retail")) %>%
mutate(Channel = as.factor(Channel),
         Region = as.factor(Region))

Check for missing values:

colSums(is.na(wholesale_clean))

#> Channel Region Fresh Milk
#> 0 0 0 0
#> Grocery Frozen Detergents_Paper Delicassen
#> 0 0 0 0
There are no missing values in the data.

Check the distribution of the data:

summary(wholesale_clean)

#> Channel Region Fresh Milk Grocery
#> Horeca:298 1: 77 Min. : 3 Min. : 55 Min. : 3
#> Retail:142 2: 47 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153
#> 3:316 Median : 8504 Median : 3627 Median : 4756
#> Mean : 12000 Mean : 5796 Mean : 7951
#> 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656
#> Max. :112151 Max. :73498 Max. :92780
#> Frozen Detergents_Paper Delicassen
#> Min. : 25.0 Min. : 3.0 Min. : 3.0
#> 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
#> Median : 1526.0 Median : 816.5 Median : 965.5
#> Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
#> 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
#> Max. :60869.0 Max. :40827.0 Max. :47943.0
- Channel: the majority is Horeca.
- Region: most observations are in region 3.
- Fresh has a high range, from a minimum of 3 to a maximum of 112,151, with a mean of 12,000.
- Milk ranges from 55 to 73,498, with a mean of 5,796.
- Grocery has a high range, from a minimum of 3 to a maximum of 92,780, with a mean of 7,951.
- Frozen has a high range, from a minimum of 25 to a maximum of 60,869, with a mean of 3,071.9.
- Detergents_Paper has a high range, from a minimum of 3 to a maximum of 40,827, with a mean of 2,881.5.
- Delicassen has a mean of 1,524.9, with a high range from a minimum of 3 to a maximum of 47,943.

We will now look at the relationship between the target and each predictor.
ggplot(wholesale_clean, mapping = aes(x = Region, y = Channel))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
    title = "Total Channel In Wholesale",
    subtitle = "Channel Vs Region",
    x = "Region",
    y = NULL,
    fill = "Channel"
  ) + theme_light()

From the plot above we can conclude that the composition of Retail and Horeca is similar in each region.
ggplot(wholesale_clean, mapping = aes(x = Fresh, y = reorder(Channel, Fresh)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
    title = "Total Channel In Wholesale",
    subtitle = "Channel Vs Fresh",
    x = "Fresh",
    y = NULL,
    fill = "Channel"
  ) + theme_light()

From the plot above, channel Horeca has higher total Fresh spending than channel Retail.
ggplot(wholesale_clean, mapping = aes(x = Milk, y = reorder(Channel, Milk)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
    title = "Total Channel In Wholesale",
    subtitle = "Channel Vs Milk",
    x = "Milk",
    y = NULL,
    fill = "Channel"
  ) + theme_light()

From the plot above, channel Retail has higher total Milk spending than channel Horeca.
ggplot(wholesale_clean, mapping = aes(x = Grocery, y = reorder(Channel, Grocery)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
    title = "Total Channel In Wholesale",
    subtitle = "Channel Vs Grocery",
    x = "Grocery",
    y = NULL,
    fill = "Channel"
  ) + theme_light()

From the plot above, channel Retail has higher total Grocery spending than channel Horeca.
ggplot(wholesale_clean, mapping = aes(x = Frozen, y = reorder(Channel, Frozen)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
    title = "Total Channel In Wholesale",
    subtitle = "Channel Vs Frozen",
    x = "Frozen",
    y = NULL,
    fill = "Channel"
  ) + theme_light()

From the plot above, channel Horeca has higher total Frozen spending than channel Retail.
ggplot(wholesale_clean, mapping = aes(x = Detergents_Paper, y = reorder(Channel, Detergents_Paper)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
    title = "Total Channel In Wholesale",
    subtitle = "Channel Vs Detergents Paper",
    x = "Detergents Paper",
    y = NULL,
    fill = "Channel"
  ) + theme_light()

From the plot above, channel Retail has higher total Detergents_Paper spending than channel Horeca.
ggplot(wholesale_clean, mapping = aes(x = Delicassen, y = reorder(Channel, Delicassen)))+
  geom_col(aes(fill = Channel), position = "dodge")+
  labs(
    title = "Total Channel In Wholesale",
    subtitle = "Channel Vs Delicassen",
    x = "Delicassen",
    y = NULL,
    fill = "Channel"
  ) + theme_light()

From the plot above, channel Retail has higher total Delicassen spending than channel Horeca.
Logistic Regression

Logistic regression is a statistical model used to predict the probability of an event occurring. The outcome variable is binary (0 or 1), representing two possible states (e.g., yes/no, true/false). It applies the logistic function to transform a linear combination of the predictors into a probability.
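As a minimal illustration (a sketch with made-up numbers, not part of the analysis itself), plogis() maps any value of a linear predictor onto the (0, 1) probability scale; this is the same transform that glm(family = "binomial") applies internally:

# Hypothetical intercept and slope, for illustration only
b0 <- -4
b1 <- 0.001
x  <- c(1000, 4000, 8000)   # hypothetical spending values

eta  <- b0 + b1 * x         # linear predictor (log-odds)
prob <- plogis(eta)         # logistic transform: 1 / (1 + exp(-eta))
prob                        # probabilities between 0 and 1, increasing with x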
We will split the data into two parts, train data and test data, with an 80:20 proportion.
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(nrow(wholesale_clean), nrow(wholesale_clean)*0.8)
# splitting
wholesale_train <- wholesale_clean[index,]
wholesale_test <- wholesale_clean[-index,]

Check the class proportions in wholesale_train:

prop.table(table(wholesale_train$Channel))

#>
#> Horeca Retail
#> 0.6846591 0.3153409
The data is not balanced, so we will upsample the minority class to balance it:
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
wholesale_train_up <- upSample(x = wholesale_train %>% select(-Channel),
y = wholesale_train$Channel,
                               yname = "Channel")

Check the class proportions in wholesale_train_up:

prop.table(table(wholesale_train_up$Channel))

#>
#> Horeca Retail
#> 0.5 0.5
We fit the model with glm(), using all variables as predictors and Channel as the target:

model_logistic <- glm(Channel ~ ., wholesale_train_up, family = "binomial")
summary(model_logistic)

#>
#> Call:
#> glm(formula = Channel ~ ., family = "binomial", data = wholesale_train_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.13752189 0.65765804 -6.291 0.00000000031482 ***
#> Region2 2.32470217 0.74967925 3.101 0.00193 **
#> Region3 0.70848987 0.55851530 1.269 0.20461
#> Fresh -0.00001571 0.00001752 -0.897 0.36983
#> Milk 0.00008356 0.00005547 1.506 0.13196
#> Grocery 0.00011132 0.00005450 2.043 0.04110 *
#> Frozen -0.00005594 0.00004684 -1.194 0.23238
#> Detergents_Paper 0.00095894 0.00013768 6.965 0.00000000000328 ***
#> Delicassen -0.00012365 0.00011335 -1.091 0.27532
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 668.19 on 481 degrees of freedom
#> Residual deviance: 216.49 on 473 degrees of freedom
#> AIC: 234.49
#>
#> Number of Fisher Scoring iterations: 8
From the model above, the residual deviance (the model's error) is 216.49 and the AIC (information loss) is 234.49.
We will use backward step-wise selection to find the model with the smallest AIC, i.e., the smallest information loss.
model_step <- step(model_logistic, direction = "backward", trace = F)
summary(model_step)

#>
#> Call:
#> glm(formula = Channel ~ Region + Grocery + Frozen + Detergents_Paper,
#> family = "binomial", data = wholesale_train_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.20701054 0.65137428 -6.459 0.0000000001056282 ***
#> Region2 2.21093277 0.72161499 3.064 0.00218 **
#> Region3 0.64284322 0.55470272 1.159 0.24650
#> Grocery 0.00013960 0.00004278 3.263 0.00110 **
#> Frozen -0.00007404 0.00004909 -1.508 0.13146
#> Detergents_Paper 0.00097744 0.00012912 7.570 0.0000000000000373 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 668.19 on 481 degrees of freedom
#> Residual deviance: 220.69 on 476 degrees of freedom
#> AIC: 232.69
#>
#> Number of Fisher Scoring iterations: 8
The stepped model has a residual deviance of 220.69.
Although its deviance is slightly higher, its AIC of 232.69 is smaller than that of the full logistic model (all predictors, AIC 234.49), so the step-wise backward model is preferred.
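The two AICs can also be compared directly with stats::AIC(), which should reproduce the values reported in the summaries above:

AIC(model_logistic, model_step)   # returns a small table with df and AIC per model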
We now predict on wholesale_test with the predict() function, using model_step as our machine learning model and wholesale_test as the test data:

pred_test_Log <- predict(model_step, type = "response", newdata = wholesale_test)

Let's look at the distribution of the predicted probabilities:
ggplot(wholesale_test, aes(x=pred_test_Log)) +
geom_density(lwd=0.5) +
labs(title = "Distribution Prediction Data Channel") +
  theme_minimal()

From the graph above, the predicted probabilities lean towards 1, which corresponds to Retail.

We convert the probabilities to class labels with a 0.5 cutoff:

pred_wholesale_test <- ifelse(pred_test_Log > 0.5, "Retail", "Horeca") %>% as.factor()

To evaluate the model we have created, we use a confusion matrix:
wholesale_log_con <- confusionMatrix(pred_wholesale_test, wholesale_test$Channel, positive = "Retail")
wholesale_log_con

#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Horeca Retail
#> Horeca 54 7
#> Retail 3 24
#>
#> Accuracy : 0.8864
#> 95% CI : (0.8009, 0.9441)
#> No Information Rate : 0.6477
#> P-Value [Acc > NIR] : 0.0000003334
#>
#> Kappa : 0.7434
#>
#> Mcnemar's Test P-Value : 0.3428
#>
#> Sensitivity : 0.7742
#> Specificity : 0.9474
#> Pos Pred Value : 0.8889
#> Neg Pred Value : 0.8852
#> Prevalence : 0.3523
#> Detection Rate : 0.2727
#> Detection Prevalence : 0.3068
#> Balanced Accuracy : 0.8608
#>
#> 'Positive' Class : Retail
#>
# Metrics computed by hand from the confusion matrix above
Recall <- round(24/(24+7),2)                        # TP / (TP + FN)
Specificity <- round((54)/(3+54),2)                 # TN / (TN + FP)
Accuracy <- round((54+24)/nrow(wholesale_test),2)   # (TP + TN) / total
Precision <- round(24/(24+3),2)                     # TP / (TP + FP)
performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance

The function below is used to examine model performance across a range of cutoffs, so we can find the threshold that best suits our goal.
performa <- function(cutoff, prob, ref, postarget, negtarget)
{
predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
conf <- caret::confusionMatrix(predict , ref, positive = postarget)
acc <- conf$overall[1]
rec <- conf$byClass[1]
prec <- conf$byClass[3]
spec <- conf$byClass[2]
mat <- t(as.matrix(c(rec , acc , prec, spec)))
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
return(mat)
}
co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = performa(cutoff = co[i],
prob = pred_test_Log,
ref = wholesale_test$Channel,
postarget = "Retail",
negtarget = "Horeca")
}
data_frame("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff model perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

Based on the performance tradeoff plot above, at a cutoff of 0.5 we get high specificity, accuracy and precision, but recall is rather low.
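If recall matters more, a lower cutoff can be tried. As an illustration only (0.4 is an assumed value for the sketch, not a tuned one), re-classifying and re-evaluating:

# Re-classify at a lower cutoff to trade precision for recall
pred_wholesale_test_04 <- factor(ifelse(pred_test_Log > 0.4, "Retail", "Horeca"),
                                 levels = levels(wholesale_test$Channel))
confusionMatrix(pred_wholesale_test_04, wholesale_test$Channel, positive = "Retail")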
# Odds ratios for all coefficients
exp(model_step$coefficients) %>%
  data.frame()

Model interpretation:

- A customer in Region2 has 9.124223 times the odds of being Retail compared to the baseline Region1, provided the other variables have the same values.
- A customer in Region3 has 1.901881 times the odds of being Retail compared to Region1, provided the other variables have the same values.
- Each additional unit of Grocery multiplies the odds of Channel Retail by 1.00014, holding all other variables constant.
- Each additional unit of Frozen multiplies the odds of Channel Retail by 0.999926 (a slight decrease), holding all other variables constant.
- Each additional unit of Detergents_Paper multiplies the odds of Channel Retail by 1.000978, holding all other variables constant.
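For readability (a small optional sketch using the model fitted above), the same odds ratios can be shown alongside the percent change in odds per one-unit increase:

# Odds ratio and percent change in odds for each term in model_step
data.frame(
  odds_ratio      = exp(coef(model_step)),
  pct_odds_change = (exp(coef(model_step)) - 1) * 100
)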
K-NN

K-NN (k-nearest neighbors) is a non-parametric classification and regression algorithm. It assumes that similar instances belong to the same class or have similar values. It works by finding the k nearest neighbors of a new instance and assigning it the most frequent class among them. We will now model the same data using K-NN.
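To make the idea concrete, here is a tiny toy example (invented data, not from the wholesale set) classifying one new point by majority vote among its k = 3 nearest neighbors:

library(class)

# Toy training data: two numeric features, two well-separated classes
train_toy  <- data.frame(x1 = c(1, 2, 2, 8, 9, 10),
                         x2 = c(1, 1, 2, 8, 9, 10))
labels_toy <- factor(c("A", "A", "A", "B", "B", "B"))

# A new point near the "A" cluster
test_toy <- data.frame(x1 = 2.5, x2 = 1.5)

# Its 3 nearest neighbors are all "A", so the vote returns "A"
knn(train = train_toy, test = test_toy, cl = labels_toy, k = 3)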
Use head() to preview the data:

head(wholesale_clean)

Check the data types with glimpse():

glimpse(wholesale_clean)

#> Rows: 440
#> Columns: 8
#> $ Channel <fct> Retail, Retail, Retail, Horeca, Retail, Retail, Retai…
#> $ Region <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
#> $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
#> $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
#> $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
#> $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
#> $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
We will only use the numeric variables as predictors.

wholesale_knn <- wholesale_clean %>% select(-Region)

Check the target proportions:

prop.table(table(wholesale_knn$Channel))

#>
#> Horeca Retail
#> 0.6772727 0.3227273
We will split the data into two parts, train data and test data, with an 80:20 proportion.
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(nrow(wholesale_knn), nrow(wholesale_knn)*0.8)
# splitting
wholesale_knn_train <- wholesale_knn[index,]
wholesale_knn_test <- wholesale_knn[-index,] # rows not included in the train set

Check the target proportions in wholesale_knn_train:

prop.table(table(wholesale_knn_train$Channel))

#>
#> Horeca Retail
#> 0.6846591 0.3153409
The data is not balanced, so we will upsample:
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
wholesale_knn_train_up <- upSample(x = wholesale_knn_train %>% select(-Channel),
y = wholesale_knn_train$Channel,
                                   yname = "Channel")

Check the target proportions in wholesale_knn_train_up:

prop.table(table(wholesale_knn_train_up$Channel))

#>
#> Horeca Retail
#> 0.5 0.5
For k-NN, the predictors and labels (target variables) are separated.
# train predictors
wholesale_knn_train_x <- wholesale_knn_train_up %>% select_if(is.numeric)
# train target
wholesale_knn_train_y <- wholesale_knn_train_up[,"Channel"]
# test predictors
wholesale_knn_test_x <- wholesale_knn_test %>% select_if(is.numeric)
# test target
wholesale_knn_test_y <- wholesale_knn_test[,"Channel"]

Because the predictors have very different ranges, the next step is to scale the data, using the training set's parameters for both sets.
# scaling data
# train
wholesale_knn_train_xs <- scale(x = wholesale_knn_train_x) # scale the train predictors
# test
wholesale_knn_test_xs <- scale(x = wholesale_knn_test_x,
                               center = attr(wholesale_knn_train_xs, "scaled:center"), # means from the train data
                               scale = attr(wholesale_knn_train_xs, "scaled:scale"))   # standard deviations from the train data
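As an optional sanity check (my addition, not part of the original workflow), the scaled training predictors should now have mean approximately 0 and standard deviation approximately 1:

# Each scaled train column should have mean ~0 and sd ~1
round(colMeans(wholesale_knn_train_xs), 3)
round(apply(wholesale_knn_train_xs, 2, sd), 3)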
Determine the value of k with the common rule of thumb k = sqrt(n):

sqrt(nrow(wholesale_knn_train))

#> [1] 18.76166

Rounding to the nearest odd number gives the optimum k = 19.
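A small helper (an assumption of mine, not in the original code) makes the odd-k choice explicit, since an even k can produce tied votes with a binary target:

# Choose an odd k near sqrt(n) to avoid tied votes in binary classification
k_opt <- floor(sqrt(nrow(wholesale_knn_train)))
if (k_opt %% 2 == 0) k_opt <- k_opt + 1
k_opt
#> [1] 19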
K-NN is a lazy learner: there is no separate training step, so we can predict directly from the training data.
library(class)
pred_knn_wholesale <- knn(train = wholesale_knn_train_xs, # train predictors
                          test = wholesale_knn_test_xs,   # test predictors
                          cl = wholesale_knn_train_y,     # train labels
                          k = 19)
head(pred_knn_wholesale)

#> [1] Retail Horeca Retail Retail Retail Retail
#> Levels: Horeca Retail
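As an optional extra, class::knn() can also report the proportion of winning votes for each prediction via prob = TRUE, a rough confidence measure:

# Proportion of the k = 19 neighbors that voted for the winning class
pred_knn_prob <- knn(train = wholesale_knn_train_xs,
                     test = wholesale_knn_test_xs,
                     cl = wholesale_knn_train_y,
                     k = 19, prob = TRUE)
head(attr(pred_knn_prob, "prob"))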
# confusion matrix
library(caret)
wholesale_knn_conf <- confusionMatrix(pred_knn_wholesale, wholesale_knn_test_y, positive = "Retail")
wholesale_knn_conf

#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Horeca Retail
#> Horeca 53 4
#> Retail 4 27
#>
#> Accuracy : 0.9091
#> 95% CI : (0.8287, 0.9599)
#> No Information Rate : 0.6477
#> P-Value [Acc > NIR] : 0.00000001509
#>
#> Kappa : 0.8008
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.8710
#> Specificity : 0.9298
#> Pos Pred Value : 0.8710
#> Neg Pred Value : 0.9298
#> Prevalence : 0.3523
#> Detection Rate : 0.3068
#> Detection Prevalence : 0.3523
#> Balanced Accuracy : 0.9004
#>
#> 'Positive' Class : Retail
#>
eval_wholesale_log <- tibble(Accuracy = wholesale_log_con$overall[1],
                             Recall = wholesale_log_con$byClass[1],
                             Specificity = wholesale_log_con$byClass[2],
                             Precision = wholesale_log_con$byClass[3])
eval_wholesale_knn <- tibble(Accuracy = wholesale_knn_conf$overall[1],
                             Recall = wholesale_knn_conf$byClass[1],
                             Specificity = wholesale_knn_conf$byClass[2],
                             Precision = wholesale_knn_conf$byClass[3])

# Model evaluation: Logistic Regression
eval_wholesale_log
# Model evaluation: K-NN
eval_wholesale_knn

Comparing the two methods, the model that better identifies the Retail channel is K-NN: its recall of 87.1% is higher than the 77.4% achieved by logistic regression.
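For a side-by-side view (a small convenience sketch using the two tibbles defined above):

# Bind the two evaluation tables together, labeled by model
bind_rows("Logistic Regression" = eval_wholesale_log,
          "K-NN" = eval_wholesale_knn,
          .id = "Model")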
If I were a wholesale manager, given how well it identifies the positive class, I would use the K-NN model. If a customer who is actually Retail is predicted as Horeca, the risk is that the Retail customer runs out of stock. Therefore, I rely on the recall metric, which measures how well the model captures the positive (Retail) class.