We will learn to use classification models on the wholesale customers dataset. We want to understand the relationships among the variables, especially between Channel and the other variables, and to predict the channel of future customers based on historical data.
In the following code we will use a dataset prepared by Margarida Cardoso and available on the UCI Machine Learning Repository.
Load the required packages:
library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(dplyr)
options(scipen = 100, max.print = 1e+06)
Load the dataset.
# Read the dataset in, drop the "Region" feature because it's not interesting
wholesale <- read.csv("data_input/wholesale.csv")
wholesale <- wholesale[,-2]
glimpse(wholesale)
## Rows: 440
## Columns: 7
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
The data has 440 rows and 7 columns.
anyNA(wholesale)
## [1] FALSE
Exploratory data analysis is the phase where we explore the data variables and see if there are any patterns that indicate correlation between them.
Find the Pearson correlation between features.
ggcorr(wholesale, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
The graphic shows that a few of the variables have a strong correlation with the Channel variable.
Let’s convert the Channel feature into a factor named Industry with descriptive labels.
wholesale$Industry <- factor(wholesale$Channel, levels = c(1, 2), labels = c("horeca", "retail"))
# After doing that we can remove the original Channel feature
wholesale <- wholesale[,-1]
table(wholesale$Industry)
##
## horeca retail
## 298 142
Notice here that, unlike the credit risk analysis example, we do not have a balanced dataset. The prior or baseline accuracy for predicting the majority class would be 67.7%.
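We can verify this baseline directly from the class table:
# baseline accuracy: always predict the majority class ("horeca")
prop.table(table(wholesale$Industry))
# 298 / (298 + 142) = 0.677
max(prop.table(table(wholesale$Industry)))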
Before we build the models, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the classification models. The test dataset will be used as a comparison, to see whether a model overfits and cannot predict new data it has not seen during the training phase. We will use 80% of the data as training data and the rest as testing data.
Normalization to z-score:
wholesale.z <- data.frame(scale(wholesale[,-7]))
summary(wholesale.z)
## Fresh Milk Grocery Frozen
## Min. :-0.9486 Min. :-0.7779 Min. :-0.8364 Min. :-0.62763
## 1st Qu.:-0.7015 1st Qu.:-0.5776 1st Qu.:-0.6101 1st Qu.:-0.47988
## Median :-0.2764 Median :-0.2939 Median :-0.3363 Median :-0.31844
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3901 3rd Qu.: 0.1889 3rd Qu.: 0.2846 3rd Qu.: 0.09935
## Max. : 7.9187 Max. : 9.1732 Max. : 8.9264 Max. :11.90545
## Detergents_Paper Delicassen
## Min. :-0.6037 Min. :-0.5396
## 1st Qu.:-0.5505 1st Qu.:-0.3960
## Median :-0.4331 Median :-0.1984
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2182 3rd Qu.: 0.1047
## Max. : 7.9586 Max. :16.4597
Merge the new data frame containing z-score values with the Industry vector:
wholesale.n <- data.frame(cbind(wholesale.z, Industry = wholesale$Industry))
Let’s split the dataset into train and test sets:
# use the legacy sampler so that set.seed() reproduces older results
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(nrow(wholesale.n), nrow(wholesale.n) * 0.8)
# 80% train - 20% test
wholesale_train <- wholesale.n[index, ]
wholesale_test <- wholesale.n[-index, ]
table(wholesale_train$Industry)
##
## horeca retail
## 241 111
We train the logistic regression model. This is the equation of a logistic regression model: \[\log\left(\frac{p(X)}{1-p(X)}\right) = b_0 + b_1x_1 + \dots + b_mx_m\] The left-hand side is called the log-odds or logit. On the right-hand side, \(b_0\) is the model intercept and \(b_i\) is the coefficient of feature \(x_i\).
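Before fitting, it may help to see how a log-odds value maps back to a probability. A minimal sketch, where logodds_to_prob is our own illustrative helper (base R's plogis() computes the same thing):
# invert the logit: p = 1 / (1 + exp(-logodds))
logodds_to_prob <- function(logodds) 1 / (1 + exp(-logodds))
logodds_to_prob(0) # log-odds of 0 corresponds to a probability of 0.5
plogis(0)          # the built-in logistic CDF gives the same answer
Now we fit the model on the train set.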
logit_mod <- glm(Industry ~ ., data = wholesale_train, family = binomial("logit"))
summary(logit_mod)
##
## Call:
## glm(formula = Industry ~ ., family = binomial("logit"), data = wholesale_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8246 -0.2821 -0.2092 0.0294 3.4120
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.64396 0.24972 -2.579 0.00992 **
## Fresh 0.04209 0.24377 0.173 0.86291
## Milk 0.59561 0.46204 1.289 0.19737
## Grocery 1.06188 0.65788 1.614 0.10651
## Frozen -1.00895 0.54665 -1.846 0.06493 .
## Detergents_Paper 4.00611 0.74907 5.348 0.0000000889 ***
## Delicassen -0.44435 0.39039 -1.138 0.25503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 438.81 on 351 degrees of freedom
## Residual deviance: 153.08 on 345 degrees of freedom
## AIC: 167.08
##
## Number of Fisher Scoring iterations: 7
Use stepwise backward elimination for feature selection:
# model stepwise
logit_mod_step <- step(logit_mod,
direction = "backward",
trace = F)
summary(logit_mod_step)##
## Call:
## glm(formula = Industry ~ Grocery + Frozen + Detergents_Paper,
## family = binomial("logit"), data = wholesale_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8312 -0.2966 -0.2235 0.0340 3.2083
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6433 0.2367 -2.717 0.00658 **
## Grocery 1.2108 0.5244 2.309 0.02095 *
## Frozen -0.9456 0.4137 -2.286 0.02226 *
## Detergents_Paper 4.0783 0.7062 5.775 0.0000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 438.81 on 351 degrees of freedom
## Residual deviance: 156.38 on 348 degrees of freedom
## AIC: 164.38
##
## Number of Fisher Scoring iterations: 7
The output shows the summary of our stepwise model. The coefficient for the intercept is -0.6433, and the coefficients for Grocery, Frozen, and Detergents_Paper are 1.2108, -0.9456, and 4.0783. All four have small p-values (< 0.05), indicating that they are significant and should not be removed from the equation. Because the predictors are z-scores, the intercept is the log-odds of retail for a customer with average spending in every category: -0.6433, which corresponds to a probability of about 0.34. Each coefficient is the change in log-odds per one standard deviation of its predictor; for example, one standard deviation more Detergents_Paper spending multiplies the odds of retail by exp(4.0783), roughly 59 (James et al., 2013). The standard error gives the confidence interval for the estimate of each coefficient with the following equation: \[\hat{b_i} \pm 1.96 \times SE(\hat{b_i})\]
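To make these estimates easier to read, we can exponentiate the coefficients into odds ratios and build Wald intervals from the standard errors; a short sketch using the fitted logit_mod_step:
# odds ratio: multiplicative change in the odds of "retail"
# per one standard deviation of each (scaled) predictor
exp(coef(logit_mod_step))
# Wald 95% CI: estimate +/- 1.96 * SE, then exponentiated
exp(confint.default(logit_mod_step, level = 0.95))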
K-NN classifies the outcome by looking at the nearest “neighbours”. In other words, K-NN looks at the class of the data point(s) with the least/shortest distance to the data we want to classify. The distance is measured with the Euclidean distance, so neighbours at greater distances are penalized:
\[dist(A, B) = \sqrt{\sum\limits^{m}_{i=1}(x_i-y_i)^2}\]
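A minimal sketch of this formula in R, where dist_euclid is our own illustrative helper (base R's dist() computes the same quantity):
# Euclidean distance between two numeric vectors
dist_euclid <- function(x, y) sqrt(sum((x - y)^2))
dist_euclid(c(0, 0), c(3, 4)) # 5, the classic 3-4-5 triangle
dist(rbind(c(0, 0), c(3, 4))) # same result via base R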
We can customize the number of neighbours (K) used to classify our data. First, we try a rule-of-thumb value for K: the square root of the number of rows in the train dataset.
# recheck class balance
prop.table(table(wholesale_train$Industry))
##
## horeca retail
## 0.6846591 0.3153409
For k-NN, we will separate the predictor and target variables.
# predictor variables (x: numeric features to be scaled)
wholesale_train_x <- wholesale_train %>% select_if(is.numeric)
wholesale_test_x <- wholesale_test %>% select_if(is.numeric)
# target variable (y: categorical)
wholesale_train_y <- wholesale_train[,7]
wholesale_test_y <- wholesale_test[,7]
We will scale the data so that all predictors are on a comparable scale.
# scaling data
wholesale_train_x_scale <- scale(wholesale_train_x)
attr(wholesale_train_x_scale, "scaled:center")
## Fresh Milk Grocery Frozen
## 0.04207774 0.01388679 0.02790082 0.05516416
## Detergents_Paper Delicassen
## 0.02691581 0.01860786
attr(wholesale_train_x_scale, "scaled:scale")
## Fresh Milk Grocery Frozen
## 1.050467 1.052833 1.061578 1.085773
## Detergents_Paper Delicassen
## 1.066339 1.056819
# using the center and scale from the train data
wholesale_test_x_scale <- scale(x = wholesale_test_x,
                                center = attr(wholesale_train_x_scale, "scaled:center"),
                                scale = attr(wholesale_train_x_scale, "scaled:scale"))
head(wholesale_test_x_scale)
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 3 -0.46512530 0.37440548 -0.05277607 -0.17733313 0.09955963 2.10266364
## 5 0.75890672 -0.06290046 -0.10095047 0.10913608 -0.24248376 1.21048367
## 6 -0.23480074 0.30375187 -0.30633596 -0.50724700 -0.23894335 -0.04239334
## 7 -0.03059468 -0.34744532 -0.12305518 -0.54253394 0.02560431 -0.34638546
## 14 0.65368001 0.03979823 0.67063286 -0.04642998 0.72719648 -0.32726013
## 17 -0.86653735 0.37543504 0.38703832 -0.60817522 0.29467582 -0.16687577
k-NN does not build an explicit model, so we will go directly to prediction.
Evaluation of the models will be done with a confusion matrix. A confusion matrix is a table that counts four different categories: True Positives, True Negatives, False Positives, and False Negatives.
The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how much of our data is correctly predicted. Sensitivity measures, out of all positive outcomes, how many are correctly predicted. Specificity measures how many negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct, as sketched below.
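As a sketch, the four metrics follow directly from the confusion-matrix cells; the counts here are taken from the logistic regression result shown further below:
# confusion-matrix cells with positive class "retail"
# (counts from the logistic regression result below)
TP <- 20; TN <- 54; FP <- 3; FN <- 11
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.8409
sensitivity <- TP / (TP + FN)                   # 0.6452 (recall)
specificity <- TN / (TN + FP)                   # 0.9474
precision   <- TP / (TP + FP)                   # 0.8696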
Logistic regression returns a value within the range [0, 1] rather than a binary class. The value is an estimate of the probability that the observation belongs to the positive class (here, retail).
pred_logit <- predict(logit_mod_step, newdata = wholesale_test, type = "response")
rmarkdown::paged_table(head(as.data.frame(pred_logit), 10))
We will convert the probabilities into classes using a threshold value: any value above the threshold is classified as the positive class. By default, the threshold is 0.5.
# determine the class based on the threshold 0.5
pred_class <- as.factor(if_else(pred_logit > 0.5, "retail", "horeca"))
# confusion matrix
perf_logit1 <- confusionMatrix(data = pred_class,
                               reference = wholesale_test$Industry,
                               positive = "retail")
perf_logit1
## Confusion Matrix and Statistics
##
## Reference
## Prediction horeca retail
## horeca 54 11
## retail 3 20
##
## Accuracy : 0.8409
## 95% CI : (0.7475, 0.9102)
## No Information Rate : 0.6477
## P-Value [Acc > NIR] : 0.00004804
##
## Kappa : 0.6296
##
## Mcnemar's Test P-Value : 0.06137
##
## Sensitivity : 0.6452
## Specificity : 0.9474
## Pos Pred Value : 0.8696
## Neg Pred Value : 0.8308
## Prevalence : 0.3523
## Detection Rate : 0.2273
## Detection Prevalence : 0.2614
## Balanced Accuracy : 0.7963
##
## 'Positive' Class : retail
##
The result shows that our logistic regression model has an accuracy of 84.09% on the test dataset, meaning that 84.09% of the test data is correctly classified. The sensitivity and specificity are 64.52% and 94.74%: only about two thirds of the positive outcomes are correctly classified, while almost all of the negative outcomes are. The precision/positive predictive value is 86.96%, meaning that 86.96% of our positive predictions are correct.
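If catching more retail customers matters more than precision, the 0.5 threshold can be lowered. A sketch with an arbitrary illustrative threshold of 0.4; the resulting trade-off should be read off its confusion matrix:
# a lower threshold flags more observations as "retail":
# sensitivity tends to rise while precision tends to fall
pred_class_04 <- as.factor(if_else(pred_logit > 0.4, "retail", "horeca"))
confusionMatrix(data = pred_class_04,
                reference = wholesale_test$Industry,
                positive = "retail")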
We want to find the optimum k. A common rule of thumb is the square root of the number of rows in the train data:
sqrt(nrow(wholesale_train))
## [1] 18.76166
# k = 19: the rounded square root, kept odd to avoid ties
wholesale_pred <- knn(train = wholesale_train_x_scale,
                      test = wholesale_test_x_scale,
                      cl = wholesale_train_y,
                      k = 19)
Check the confusion matrix:
confusionMatrix(data = wholesale_pred,
                reference = wholesale_test_y,
                positive = "retail")
## Confusion Matrix and Statistics
##
## Reference
## Prediction horeca retail
## horeca 54 9
## retail 3 22
##
## Accuracy : 0.8636
## 95% CI : (0.7739, 0.9275)
## No Information Rate : 0.6477
## P-Value [Acc > NIR] : 0.000004804
##
## Kappa : 0.6874
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.7097
## Specificity : 0.9474
## Pos Pred Value : 0.8800
## Neg Pred Value : 0.8571
## Prevalence : 0.3523
## Detection Rate : 0.2500
## Detection Prevalence : 0.2841
## Balanced Accuracy : 0.8285
##
## 'Positive' Class : retail
##
The result shows that our K-NN model with K = 19 has an accuracy of 86.36% on the test dataset, meaning that 86.36% of the test data is correctly classified. The sensitivity and specificity are 70.97% and 94.74%: most of the positive outcomes and almost all of the negative outcomes are correctly classified. The precision/positive predictive value is 88%, meaning that 88% of our positive predictions are correct.
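The square-root rule of thumb is only a starting point. As a sketch, we can scan several odd values of k (odd values avoid ties in a two-class problem) and record the test-set accuracy; the grid is our own choice, and for a rigorous comparison the tuning should be done on a separate validation set rather than the test set:
# scan odd k values and record accuracy on the test set
ks <- seq(3, 31, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = wholesale_train_x_scale,
              test = wholesale_test_x_scale,
              cl = wholesale_train_y,
              k = k)
  mean(pred == wholesale_test_y)
})
data.frame(k = ks, accuracy = acc)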
Both models perform well in terms of accuracy, and there is no dramatic difference between logistic regression and K-NN. However, K-NN is slightly better on accuracy, sensitivity, and precision, so we can conclude that the K-NN model is the better choice for this case.