On this occasion, we will analyze using a clasification on the data
wholesale.csv.
We will explore the target variable channel to be able
to analyze how the influence of what variables are very influential for
a data grouped as a particular channel.
Read the data.
wholesale <- read.csv("wholesale.csv")
head(wholesale)Check data structure.
str(wholesale)## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Check missing value.
anyNA(wholesale)## [1] FALSE
colSums(is.na(wholesale))## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Data Wrangling.
We will change the datatype of Channel into a
factor and We will throw away the Region variable,
since it is a class type data like our target variable and We don’t need
it.
library(dplyr)
wholesale <- wholesale %>%
mutate(Channel = as.factor(Channel),
Region = as.factor(Region)) %>%
select(-Region)
glimpse(wholesale)## Rows: 440
## Columns: 7
## $ Channel <fct> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
Check data patern.
summary(wholesale)## Channel Fresh Milk Grocery Frozen
## 1:298 Min. : 3 Min. : 55 Min. : 3 Min. : 25.0
## 2:142 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2
## Median : 8504 Median : 3627 Median : 4756 Median : 1526.0
## Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2
## Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
## Detergents_Paper Delicassen
## Min. : 3.0 Min. : 3.0
## 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 816.5 Median : 965.5
## Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :40827.0 Max. :47943.0
Based on the summary above, we can see that there are no
irregularities in each variable.
Check the distribution proportion of target class
prop.table(table(wholesale$Channel))##
## 1 2
## 0.6772727 0.3227273
table(wholesale$Channel)##
## 1 2
## 298 142
When viewed from the proportion of the two classes, it is quite balanced, so we don’t really need additional pre-processing to balance the proportion between the two target classes of variables.
Splitting the data into data train(80%) and data test(20%).
RNGkind(sample.kind = "Rounding")
set.seed(417)
# index sampling
index <- sample(x = nrow(wholesale), size = nrow(wholesale)*0.85)
# splitting
wholesale_train <- wholesale[index , ]
wholesale_test <- wholesale[-index , ]Check the distribution proportion of target class from data train.
prop.table(table(wholesale_train$Channel))##
## 1 2
## 0.6737968 0.3262032
table(wholesale_train$Channel)##
## 1 2
## 252 122
The prportion is quite balanced.
We will make a logistic regression model to predict
Channel. Based on the data and our business inquiry, We
will use all the predictor variable for building the model.
wholesale_LRMod <- glm(formula = Channel ~ ., data = wholesale_train, family = "binomial")
summary(wholesale_LRMod)##
## Call:
## glm(formula = Channel ~ ., family = "binomial", data = wholesale_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8624 -0.3243 -0.2289 0.0481 3.2562
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.636e+00 4.708e-01 -7.723 1.14e-14 ***
## Fresh 9.955e-06 1.812e-05 0.550 0.5827
## Milk 9.016e-05 5.935e-05 1.519 0.1287
## Grocery 9.621e-05 6.259e-05 1.537 0.1243
## Frozen -1.961e-04 1.015e-04 -1.933 0.0533 .
## Detergents_Paper 8.721e-04 1.488e-04 5.862 4.56e-09 ***
## Delicassen -5.699e-05 1.104e-04 -0.516 0.6056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 472.33 on 373 degrees of freedom
## Residual deviance: 176.59 on 367 degrees of freedom
## AIC: 190.59
##
## Number of Fisher Scoring iterations: 7
In the first modeling, there are still many predictor variables that
are not significant to the target variable, therefore we will try to do
a model fitting using the stepwise method.
wholesale_LRMod_step <- step(wholesale_LRMod, direction = "backward")## Start: AIC=190.59
## Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper +
## Delicassen
##
## Df Deviance AIC
## - Delicassen 1 176.86 188.86
## - Fresh 1 176.89 188.89
## <none> 176.59 190.59
## - Milk 1 178.86 190.86
## - Grocery 1 179.10 191.10
## - Frozen 1 182.25 194.25
## - Detergents_Paper 1 217.37 229.37
##
## Step: AIC=188.86
## Channel ~ Fresh + Milk + Grocery + Frozen + Detergents_Paper
##
## Df Deviance AIC
## - Fresh 1 177.09 187.09
## <none> 176.86 188.86
## - Milk 1 179.01 189.01
## - Grocery 1 179.12 189.12
## - Frozen 1 184.24 194.24
## - Detergents_Paper 1 218.39 228.39
##
## Step: AIC=187.09
## Channel ~ Milk + Grocery + Frozen + Detergents_Paper
##
## Df Deviance AIC
## <none> 177.09 187.09
## - Milk 1 179.37 187.37
## - Grocery 1 179.49 187.49
## - Frozen 1 184.50 192.50
## - Detergents_Paper 1 219.04 227.04
summary(wholesale_LRMod_step)##
## Call:
## glm(formula = Channel ~ Milk + Grocery + Frozen + Detergents_Paper,
## family = "binomial", data = wholesale_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8469 -0.3242 -0.2311 0.0514 3.2211
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.548e+00 4.268e-01 -8.313 < 2e-16 ***
## Milk 9.048e-05 5.904e-05 1.532 0.1254
## Grocery 9.133e-05 6.070e-05 1.505 0.1324
## Frozen -1.950e-04 9.029e-05 -2.160 0.0308 *
## Detergents_Paper 8.596e-04 1.453e-04 5.916 3.3e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 472.33 on 373 degrees of freedom
## Residual deviance: 177.09 on 369 degrees of freedom
## AIC: 187.09
##
## Number of Fisher Scoring iterations: 7
By using the results from stepwise model, we will try to predict using the test data that we already have.
We will predict the probability Channel for
data wholesale_test and save it in a new column named
wholesale_pred
wholesale_test$wholesale_pred <- predict(wholesale_LRMod_step,
wholesale_test,
type = "response")
wholesale_testWe will Classify the wholesale_test data based on
wholesale_pred and save it in a new column named
channel_pred.
wholesale_test$channel_pred <- ifelse(wholesale_test$wholesale_pred > 0.5,
yes = "2",
no = "1")
wholesale_testTo evaluate the model that we have created, we will use a confusion matrix.
lr_cm <- confusionMatrix(as.factor(wholesale_test$channel_pred),
wholesale_test$Channel,
positive = "2")
lr_cm## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 43 2
## 2 3 18
##
## Accuracy : 0.9242
## 95% CI : (0.832, 0.9749)
## No Information Rate : 0.697
## P-Value [Acc > NIR] : 7.577e-06
##
## Kappa : 0.8232
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9000
## Specificity : 0.9348
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9556
## Prevalence : 0.3030
## Detection Rate : 0.2727
## Detection Prevalence : 0.3182
## Balanced Accuracy : 0.9174
##
## 'Positive' Class : 2
##
Based on business questions, the best metrics are
accuracy & sensitivity/recall. Because we
want to predict whether a customer belongs to a certain customer group,
in this case we have a marketing strategy for each group.
In addition to predicting the logistic regression model, we will also make predictions using the K-NN method. Next, we will compare the results and we will take the best one.
Class Target.
levels(wholesale_train$Channel)## [1] "1" "2"
Class Target = 2
k optimum is the root of our data sum:
sqrt(nrow(data))
sqrt(nrow(wholesale_train))## [1] 19.33908
Pay attention to the number of target classes + Even target class -> odd number of k + Odd target class -> even number of k
K optimum is 19.
Cross Validation.
RNGkind(sample.kind = "Rounding")
set.seed(419)
# index sampling
index <- sample(x = nrow(wholesale), size = nrow(wholesale)*0.85)
# splitting
wholesale_KNNtrain <- wholesale[index , ]
wholesale_KNNtest <- wholesale[-index , ]For k-NN, separate predictor and label (target variable)
# prediktor
wholesale_train_x <- wholesale_KNNtrain %>% select_if(is.numeric)
wholesale_test_x <- wholesale_KNNtest %>% select_if(is.numeric)
# target
wholesale_train_y <- wholesale_KNNtrain[,"Channel"]
wholesale_test_y <- wholesale_KNNtest[,"Channel"]The range of each variable is not too different so there is no need for feature rescaling in the data pre-processing stage.
The K-NN method does not require prior modeling. So that predictions can be made immediately.
wholesale_knn <- knn(train = wholesale_train_x,
test = wholesale_test_x,
cl = wholesale_train_y,
k = 19)To evaluate the model that we have created, we will use a confusion matrix.
knn_cm <- confusionMatrix(data = as.factor(wholesale_knn),
reference = as.factor(wholesale_test_y),
positive = "2")
knn_cm## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 38 5
## 2 6 17
##
## Accuracy : 0.8333
## 95% CI : (0.7213, 0.9138)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.001989
##
## Kappa : 0.6292
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.7727
## Specificity : 0.8636
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.8837
## Prevalence : 0.3333
## Detection Rate : 0.2576
## Detection Prevalence : 0.3485
## Balanced Accuracy : 0.8182
##
## 'Positive' Class : 2
##
eval_lr <- data_frame(Accuracy = lr_cm$overall[1],
Recall = lr_cm$byClass[1],
Specificity = lr_cm$byClass[2],
Precision = lr_cm$byClass[3])
eval_knn <- data_frame(Accuracy = knn_cm$overall[1],
Recall = knn_cm$byClass[1],
Specificity = knn_cm$byClass[2],
Precision = knn_cm$byClass[3])Evaluation Comparison.
eval_lreval_knnBased on the evaluation above, it can be seen that the results of the logistic regression have a better accuracy and sensitivity/recal.
So it was decided that we would use a logistic regression model to answer future business questions.