In this Learning by Building task, I would like to make model from data wholesale.csv for predicting variabel Channel using Logistic Regression and KNN
library(tidyverse)
library(gtools)
library(caret)
theme_set(theme_minimal() +
theme(legend.position = "top"))
options(scipen = 999)
wholesale <- read.csv("data_input/wholesale.csv") %>%
select(-Region) %>%
mutate(Channel = ifelse(Channel == 1, 0, 1))%>%
mutate(Channel=as.factor(Channel))
wholesale
No Null value in the dataset
colSums(is.na(wholesale))
## Channel Fresh Milk Grocery
## 0 0 0 0
## Frozen Detergents_Paper Delicassen
## 0 0 0
Check if the data class imbalance or not by looking at proportion in imbalance class target variable
# Your code here
prop.table(table(data_train$Channel))
##
## 0 1
## 0.6846591 0.3153409
We can see that data train is imbalance proportion. This will influence performance model because model will tend to predict majority class or class with dominant proportion.
There are some ways to handle Class Imbalance:
In this case, I would like to do upsample to handle class imbalance
library(caret)
# Your code here
set.seed(123)
data_train <- upSample(x = data_train %>% select(-Channel),
y = data_train$Channel,
list = F,
yname = "Channel")
table(data_train$Channel)
##
## 0 1
## 241 241
Make model regression logistic with Channel as target variable.
summary(data_train)
## Fresh Milk Grocery Frozen
## Min. : 3 Min. : 55 Min. : 137 Min. : 25.0
## 1st Qu.: 3044 1st Qu.: 2100 1st Qu.: 2856 1st Qu.: 780.2
## Median : 8090 Median : 4980 Median : 7408 Median : 1447.0
## Mean : 11698 Mean : 7339 Mean :10358 Mean : 2871.3
## 3rd Qu.: 16800 3rd Qu.: 9246 3rd Qu.:13164 3rd Qu.: 3232.0
## Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
## Detergents_Paper Delicassen Channel
## Min. : 3.0 Min. : 3.0 0:241
## 1st Qu.: 386.2 1st Qu.: 430.2 1:241
## Median : 2171.0 Median : 1037.0
## Mean : 4192.3 Mean : 1567.8
## 3rd Qu.: 6171.2 3rd Qu.: 2063.5
## Max. :40827.0 Max. :47943.0
# Your code here
model_channel <- glm(Channel ~ ., data_train, family = "binomial")
# menggunakan stepwise mencari AIC terendah
model_step <- step(model_channel, direction = "both", trace = 0)
summary(model_step)
##
## Call:
## glm(formula = Channel ~ Milk + Grocery + Frozen + Detergents_Paper +
## Delicassen, family = "binomial", data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2837 -0.3124 -0.0006 0.1388 3.3881
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.14238368 0.39834696 -7.889 0.00000000000000306 ***
## Milk 0.00011687 0.00005440 2.148 0.0317 *
## Grocery 0.00010868 0.00005899 1.842 0.0654 .
## Frozen -0.00022007 0.00009250 -2.379 0.0173 *
## Detergents_Paper 0.00094407 0.00013541 6.972 0.00000000000312735 ***
## Delicassen -0.00022092 0.00012692 -1.741 0.0818 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 668.19 on 481 degrees of freedom
## Residual deviance: 212.61 on 476 degrees of freedom
## AIC: 224.61
##
## Number of Fisher Scoring iterations: 8
Predict data test by using regression logistic model. Change prediction in probability into category, with terms if probability > 0.5 is change to be category 1 which means Retail.
# Predict data train
pred_train <- predict(model_step, data_train, type = "response")
pred_class_train <- ifelse(pred_train > 0.5, "Retail", "Horeca") %>%
as.factor()
data_train$Channel <- as.factor(ifelse(data_train$Channel ==1 , "Retail", "Horeca"))
confusionMatrix(pred_class_train, data_train$Channel, positive = "Retail")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Horeca Retail
## Horeca 224 22
## Retail 17 219
##
## Accuracy : 0.9191
## 95% CI : (0.891, 0.9418)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.8382
##
## Mcnemar's Test P-Value : 0.5218
##
## Sensitivity : 0.9087
## Specificity : 0.9295
## Pos Pred Value : 0.9280
## Neg Pred Value : 0.9106
## Prevalence : 0.5000
## Detection Rate : 0.4544
## Detection Prevalence : 0.4896
## Balanced Accuracy : 0.9191
##
## 'Positive' Class : Retail
##
data_train$predtrain <- data.frame(pred_train)
ggplot(data_train, aes(x=pred_train)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data in Data Train") +
theme_minimal()
From graph above, can be interpreted that prediction data train of Retail has highest results than Horeca
# Your code here
pred_test <- predict(model_step, data_test, type = "response")
pred_class_test <- ifelse(pred_test > 0.5, "Retail", "Horeca") %>% as.factor()
data_test$Channel <- as.factor(ifelse(data_test$Channel ==1 , "Retail", "Horeca"))
log_conf <- confusionMatrix(pred_class_test, data_test$Channel, positive = "Retail")
log_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Horeca Retail
## Horeca 54 6
## Retail 3 25
##
## Accuracy : 0.8977
## 95% CI : (0.8147, 0.9522)
## No Information Rate : 0.6477
## P-Value [Acc > NIR] : 0.00000007518
##
## Kappa : 0.7708
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.8065
## Specificity : 0.9474
## Pos Pred Value : 0.8929
## Neg Pred Value : 0.9000
## Prevalence : 0.3523
## Detection Rate : 0.2841
## Detection Prevalence : 0.3182
## Balanced Accuracy : 0.8769
##
## 'Positive' Class : Retail
##
data_test$predtest <- data.frame(pred_test)
ggplot(data_test, aes(x=pred_test)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data in Data Test") +
theme_minimal()
From graph above, can be interpreted that prediction data test of Horeca has highest results than Retail
For predicting Channel, Accuracy of Data train is 0.91 and accuracy of data test is 0.89 which is overall balance with high accuracy.
RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(wholesale)
index <- sample(row_data, row_data*0.8)
data_train <- wholesale[index, ]
data_test <- wholesale[-index, ]
library(GGally)
ggcorr(data_train, label=T)
plot(as.factor(data_train$Channel), data_train$Fresh)
plot(as.factor(data_train$Channel), data_train$Milk)
plot(as.factor(data_train$Channel), data_train$Grocery)
plot(as.factor(data_train$Channel), data_train$Frozen)
plot(as.factor(data_train$Channel), data_train$Detergents_Paper)
From boxplot above, we can see that Retail tends to buy less Fresh and Frozen than Horeca, and Retail tends to buy more Milk, Grocery, and Detergents_Paper than Horeca. Grocery has high positive correlation with Detergents Paper, which means when the market buy high Grocery it means they tends to also buy Detergents Paper and vice versa. Milk also has high positive correlation with Grocery and Detergents Paper, which means when market buy high Milk, it means they tends to also buy high Grocery and high Detergents Paper and vice versa.
train_x <- data_train %>%
select(-Channel) %>% # buang target variable
scale() # lakukan scalling ke semua prediktor
# Menyimpan target variabel
train_y <- as.factor(data_train$Channel)
test_x <- data_test %>%
select(-Channel) %>% #Buang target variable
scale(center = attr(train_x,"scaled:center"), # menunjukkan rata-rata dari setiap variabel
scale = attr(train_x, "scaled:scale") # menunjukkan standar deviasi dari setiap variabel
) # lakukan scalling dengan informasi dari data train
# Menyimpan variabel target
test_y <- as.factor(data_test$Channel)
model_knn <- knn3(x = train_x,
y = train_y,
k = sqrt(nrow(train_x))
)
pred_knn_train <- predict(model_knn,train_x, type = "class")
head(pred_knn_train)
## [1] 0 1 0 0 0 0
## Levels: 0 1
confusionMatrix(pred_knn_train, train_y, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 225 12
## 1 16 99
##
## Accuracy : 0.9205
## 95% CI : (0.8871, 0.9465)
## No Information Rate : 0.6847
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.8176
##
## Mcnemar's Test P-Value : 0.5708
##
## Sensitivity : 0.8919
## Specificity : 0.9336
## Pos Pred Value : 0.8609
## Neg Pred Value : 0.9494
## Prevalence : 0.3153
## Detection Rate : 0.2812
## Detection Prevalence : 0.3267
## Balanced Accuracy : 0.9128
##
## 'Positive' Class : 1
##
pred_knn <- predict(model_knn, test_x, type = "class")
or same result different code
pred_knn <- knn3Train(train = train_x,
test = test_x,
cl = train_y,
k = sqrt(nrow(train_x)) %>% round()
) %>%
as.factor()
head(pred_knn)
## [1] 0 0 0 0 1 1
## Levels: 0 1
pred_knn_conf <- confusionMatrix(pred_knn, test_y,positive = "1")
pred_knn_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 9
## 1 3 22
##
## Accuracy : 0.8636
## 95% CI : (0.7739, 0.9275)
## No Information Rate : 0.6477
## P-Value [Acc > NIR] : 0.000004804
##
## Kappa : 0.6874
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.7097
## Specificity : 0.9474
## Pos Pred Value : 0.8800
## Neg Pred Value : 0.8571
## Prevalence : 0.3523
## Detection Rate : 0.2500
## Detection Prevalence : 0.2841
## Balanced Accuracy : 0.8285
##
## 'Positive' Class : 1
##
For predicting Channel, Accuracy of Data train is 0.92 and accuracy of data test is 0.86 which is overall balance with high accuracy.
library(tibble)
eval_logit <- data_frame(Accuracy = log_conf$overall[1],
Recall = log_conf$byClass[1],
Specificity = log_conf$byClass[2],
Precision = log_conf$byClass[3])
eval_logit
eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
Recall = pred_knn_conf$byClass[1],
Specificity = log_conf$byClass[2],
Precision = pred_knn_conf$byClass[3])
eval_knn
If we can see from both method, using Logistic Regression and K-NN, ability of the model for predicting Retail or Horeca is both has good performance of prediction and quite similar high value of Accuracy, Specificity, and Precision. But highest recall in Logistic Regression model.
If I am an enterpreneur I would like to targeting both Horeca or Retail, both of them are my market targets, and I want my model in predicting Channel has a high accuracy. From the model that has been made in previous time, we can see that both model Logistic Regression and KNN has good performance with high accuracy, so both model is good for predicting Channel. But if we are prioritizing to have a good prediction in predicting Retail only, we can focusing on results of Recall/Sensitivity, then we can use Logistic Regression for predicting Channel.