This analysis uses the Social Network Ads dataset to predict whether a user purchased the product or not.
We’ll set up caching for this notebook, given how computationally expensive some of the code we write can get.
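The setup chunk itself isn’t shown here; a minimal sketch of what it might look like, assuming knitr chunk-level caching:
# enable caching for all chunks in this notebook
knitr::opts_chunk$set(cache = TRUE)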
We then load the libraries needed in this workspace using the library() function:
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(glue)
library(ggpubr)
library(scales)
library(caret)
library(e1071)
library(ipred)
library(class)
The libraries loaded above are used for data wrangling, visualization, and modeling throughout the analysis.
data = read.csv("E:/Algoritma/5_lbb_logistik/User_Data.csv", stringsAsFactors = T)
head(data)
A description of the dataset:
- User.ID : the user's ID
- Gender : the user's gender
- Age : the user's age
- EstimatedSalary : the user's estimated salary
- Purchased : whether the product was purchased (1) or not (0)
glimpse(data)
## Rows: 400
## Columns: 5
## $ User.ID <int> 15624510, 15810944, 15668575, 15603246, 15804002, 1572…
## $ Gender <fct> Male, Male, Female, Female, Male, Male, Female, Female…
## $ Age <int> 19, 35, 26, 27, 19, 27, 27, 32, 25, 35, 26, 26, 20, 32…
## $ EstimatedSalary <int> 19000, 20000, 43000, 57000, 76000, 58000, 84000, 15000…
## $ Purchased <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
The initial dataset has 5 columns, with the data types shown above.
data = data %>%
select(-c(User.ID)) %>%
mutate(
Purchased = as.factor(Purchased)
)
glimpse(data)
## Rows: 400
## Columns: 4
## $ Gender <fct> Male, Male, Female, Female, Male, Male, Female, Female…
## $ Age <int> 19, 35, 26, 27, 19, 27, 27, 32, 25, 35, 26, 26, 20, 32…
## $ EstimatedSalary <int> 19000, 20000, 43000, 57000, 76000, 58000, 84000, 15000…
## $ Purchased <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
The User.ID variable is dropped, and the Purchased variable is converted to a categorical (factor) data type.
colSums(is.na(data))
## Gender Age EstimatedSalary Purchased
## 0 0 0 0
The dataset has no missing values, so no missing-value handling is required and the analysis can proceed.
dim(data)
## [1] 400 4
prop.table(table(data$Purchased))
##
## 0 1
## 0.6425 0.3575
The data has 400 rows and 4 variables.
The class proportions are 64.25% not buying (0) versus 35.75% buying (1), so the target classes are imbalanced.
RNGkind(sample.kind = "Rounding")
set.seed(1)
data_up <- upSample(x = data %>% select(-Purchased),
y = data$Purchased,
yname = "Purchased")
prop.table(table(data_up$Purchased))
##
## 0 1
## 0.5 0.5
dim(data_up)
## [1] 514 4
After handling the imbalance by up-sampling, the class proportion is 50:50; in other words, the data are now balanced between buying and not buying. The data dimensions have also grown from 400 to 514 rows after up-sampling.
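As a side note, caret also provides the mirror-image helper downSample(), which balances the classes by discarding majority-class rows instead of duplicating minority-class rows; a minimal sketch with the same interface:
# alternative: balance classes by down-sampling the majority class
data_down <- downSample(x = data %>% select(-Purchased),
                        y = data$Purchased,
                        yname = "Purchased")
prop.table(table(data_down$Purchased)) # also 50:50, but with fewer rows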
set.seed(9)
index <- sample(x = nrow(data_up), size = nrow(data_up)*0.8)
# splitting (run this chunk as a whole)
data_train <- data_up[index,]
data_test <- data_up[-index,]
nrow(data_train)
## [1] 411
nrow(data_test)
## [1] 103
The data are split 80:20 into training and testing sets, giving 411 rows of training data and 103 rows of testing data.
Two approaches are used in this case: logistic regression and KNN.
For logistic regression, two variants are fitted: one using all variables and one selected by backward elimination.
model_all <- glm(
formula = Purchased ~ .,
data = data_train,
family = "binomial",
control = list(trace=FALSE)
)
summary(model_all)
##
## Call:
## glm(formula = Purchased ~ ., family = "binomial", data = data_train,
## control = list(trace = FALSE))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4049 -0.3473 -0.0212 0.3599 2.4415
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.038576118 1.699366419 -8.850 < 0.0000000000000002 ***
## GenderMale 0.212217570 0.315557440 0.673 0.501
## Age 0.295088529 0.032855079 8.982 < 0.0000000000000002 ***
## EstimatedSalary 0.000043154 0.000006398 6.745 0.0000000000153 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 569.75 on 410 degrees of freedom
## Residual deviance: 256.13 on 407 degrees of freedom
## AIC: 264.13
##
## Number of Fisher Scoring iterations: 6
exp(model_all$coefficient)
## (Intercept) GenderMale Age EstimatedSalary
## 0.0000002943265 1.2364168634805 1.3432452700561 1.0000431554145
In model_all, the Gender variable is not statistically significant (p-value 0.501), so it has no meaningful effect on the model. Interpreting the exponentiated coefficients as odds ratios:
- Intercept: if all predictors are held at zero, the odds that a user buys the product are 0.00000029432, essentially zero.
- GenderMale: the odds of buying for male users are 1.2364 times those of female users, holding all other predictors fixed.
- Age: each 1-year increase in age multiplies the odds of buying by 1.3432, holding all other predictors fixed.
- EstimatedSalary: each 1-unit increase in estimated salary multiplies the odds of buying by 1.0000431, holding all other predictors fixed.
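As a supplementary check (not part of the original analysis), the odds ratios can be paired with 95% profile-likelihood confidence intervals; an interval that covers 1 (as expected for GenderMale) is consistent with no effect:
# odds ratios with 95% confidence intervals on the odds scale
exp(cbind(OR = coef(model_all), confint(model_all)))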
model_backward = step(object = model_all,
direction = 'backward',
trace = F)
summary(model_backward)
##
## Call:
## glm(formula = Purchased ~ Age + EstimatedSalary, family = "binomial",
## data = data_train, control = list(trace = FALSE))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4181 -0.3635 -0.0220 0.3526 2.3870
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.771268650 1.632918439 -9.046 < 0.0000000000000002 ***
## Age 0.291823643 0.032201993 9.062 < 0.0000000000000002 ***
## EstimatedSalary 0.000042738 0.000006333 6.749 0.0000000000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 569.75 on 410 degrees of freedom
## Residual deviance: 256.59 on 408 degrees of freedom
## AIC: 262.59
##
## Number of Fisher Scoring iterations: 6
exp(model_backward$coefficient)
## (Intercept) Age EstimatedSalary
## 0.0000003845205 1.3388668786092 1.0000427390420
In model_backward, the Gender variable was removed by backward elimination, confirming it has no significant effect on the model. Interpreting the odds ratios:
- Intercept: if all predictors are held at zero, the odds that a user buys the product are 0.00000038452, essentially zero.
- Age: each 1-year increase in age multiplies the odds of buying by 1.3389, holding the other predictor fixed.
- EstimatedSalary: each 1-unit increase in estimated salary multiplies the odds of buying by 1.0000427, holding the other predictor fixed.
model_all$aic
## [1] 264.1305
model_backward$aic
## [1] 262.5853
Based on the smaller AIC value (262.59 vs 264.13), the model to use is model_backward.
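The same comparison can be made in a single call with base R's AIC() helper:
# side-by-side AIC comparison; lower is better
AIC(model_all, model_backward)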
data_test$pred_purchased <- predict(
object = model_backward,
newdata = data_test,
type = "response"
)
data_test$pred_label <-
ifelse(data_test$pred_purchased < 0.5, 0, 1) %>%
as.factor()
table(predict = data_test$pred_label,
      actual = data_test$Purchased)
## actual
## predict 0 1
## 0 40 11
## 1 10 42
- true positive (TP): predicted buy and actually bought: 42 users
- true negative (TN): predicted not buy and actually did not buy: 40 users
- false positive (FP): predicted buy but did not actually buy: 10 users
- false negative (FN): predicted not buy but actually bought: 11 users
Because we want to minimize false positives (users predicted to buy who do not buy), precision is the metric to prioritize.
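As a supplementary sketch (not part of the original analysis), precision can also be checked at a few cutoffs around the 0.5 used above, since raising the cutoff typically trades recall for precision:
# precision of the backward model at several probability cutoffs
for (cutoff in c(0.4, 0.5, 0.6)) {
  pred <- ifelse(data_test$pred_purchased >= cutoff, 1, 0)
  tp <- sum(pred == 1 & data_test$Purchased == 1)
  fp <- sum(pred == 1 & data_test$Purchased == 0)
  cat("cutoff:", cutoff, "-> precision:", round(tp / (tp + fp), 4), "\n")
}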
confusionMatrix(
data = data_test$pred_label,
reference = data_test$Purchased,
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 40 11
## 1 10 42
##
## Accuracy : 0.7961
## 95% CI : (0.7054, 0.8691)
## No Information Rate : 0.5146
## P-Value [Acc > NIR] : 0.000000003087
##
## Kappa : 0.5921
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7925
## Specificity : 0.8000
## Pos Pred Value : 0.8077
## Neg Pred Value : 0.7843
## Prevalence : 0.5146
## Detection Rate : 0.4078
## Detection Prevalence : 0.5049
## Balanced Accuracy : 0.7962
##
## 'Positive' Class : 1
##
- Accuracy: the model predicts the target class correctly 79.61% of the time.
- Sensitivity/Recall: the model's performance on the positive class is 79.25%.
- Specificity: the model's performance on the negative class is 80%.
- Pos Pred Value/Precision: when the model predicts the positive class, it is correct 80.77% of the time.
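To make the formulas behind these figures explicit, the same arithmetic can be done by hand from the confusion-matrix counts above:
# recompute the four headline metrics from the TP/TN/FP/FN counts
TP <- 42; TN <- 40; FP <- 10; FN <- 11
(TP + TN) / (TP + TN + FP + FN) # accuracy = 0.7961
TP / (TP + FN) # sensitivity/recall = 0.7925
TN / (TN + FP) # specificity = 0.8000
TP / (TP + FP) # precision = 0.8077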
knn_train_x <- data_train %>%
select(-c(Gender, Purchased))
knn_test_x <- data_test %>%
select(-c(Gender, Purchased, pred_purchased, pred_label))
knn_train_y <- data_train[,"Purchased"]
knn_test_y <- data_test[,"Purchased"]
For KNN modeling, the predictor variables and the target variable are separated in both the training and the testing data to simplify the analysis.
train_x_scale <- scale(x = knn_train_x)
test_x_scale <- scale(x = knn_test_x,
center = attr(train_x_scale, "scaled:center"),
                       scale = attr(train_x_scale, "scaled:scale"))
Scaling puts every variable on a comparable footing, since each predictor has a different range and units. Here the z-score method is used, and the test set is scaled with the training set's centers and scales so that no information from the test data leaks into the preprocessing.
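For illustration, here is the z-score computed by hand for one predictor; it matches what scale() produces column by column:
# manual z-score for Age: subtract the training mean, divide by the training sd
z_age <- (knn_train_x$Age - mean(knn_train_x$Age)) / sd(knn_train_x$Age)
head(z_age)
head(train_x_scale[, "Age"]) # same values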
sqrt(nrow(data_train))
## [1] 20.27313
A common rule of thumb sets k near the square root of the number of training rows; sqrt(411) is 20.27313, rounded here to k = 20. Nearby values of k can also be sanity-checked, as in the sketch below.
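Because an even k can produce tied votes in binary classification (knn() breaks such ties at random), it can be worth checking a few nearby values before committing; a quick exploratory sketch:
# test-set accuracy for a handful of k values around sqrt(n)
for (k in c(15, 17, 19, 20, 21, 23)) {
  pred_k <- knn(train = train_x_scale, test = test_x_scale,
                cl = knn_train_y, k = k)
  cat("k =", k, "-> accuracy:", round(mean(pred_k == knn_test_y), 4), "\n")
}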
knn_pred <- knn(train = train_x_scale, # scaled training data
                test = test_x_scale, # scaled testing data
                cl = knn_train_y, # target from the training data
k = 20)
confusionMatrix(
data = knn_pred,
reference = knn_test_y,
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43 7
## 1 7 46
##
## Accuracy : 0.8641
## 95% CI : (0.7825, 0.9237)
## No Information Rate : 0.5146
## P-Value [Acc > NIR] : 0.00000000000006894
##
## Kappa : 0.7279
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8679
## Specificity : 0.8600
## Pos Pred Value : 0.8679
## Neg Pred Value : 0.8600
## Prevalence : 0.5146
## Detection Rate : 0.4466
## Detection Prevalence : 0.5146
## Balanced Accuracy : 0.8640
##
## 'Positive' Class : 1
##
- true positive (TP): predicted buy and actually bought: 46 users
- true negative (TN): predicted not buy and actually did not buy: 43 users
- false positive (FP): predicted buy but did not actually buy: 7 users
- false negative (FN): predicted not buy but actually bought: 7 users
- Accuracy: the model predicts the target class correctly 86.41% of the time.
- Sensitivity/Recall: the model's performance on the positive class is 86.79%.
- Specificity: the model's performance on the negative class is 86%.
- Pos Pred Value/Precision: when the model predicts the positive class, it is correct 86.79% of the time.
Comparing the two approaches, the accuracy on the testing data is 79.61% for the Logistic Regression model and 86.41% for the KNN model.
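A quick recap of the two test-set accuracies, using objects already defined above:
# test accuracy of both final models side by side
data.frame(
  model = c("logistic regression (backward)", "KNN (k = 20)"),
  accuracy = c(mean(data_test$pred_label == data_test$Purchased),
               mean(knn_pred == knn_test_y))
)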
knn_pred_train <- knn(train = train_x_scale,
test = train_x_scale,
cl = knn_train_y,
k = 20)
confusionMatrix(
data = knn_pred_train,
reference = knn_train_y,
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 191 12
## 1 16 192
##
## Accuracy : 0.9319
## 95% CI : (0.903, 0.9543)
## No Information Rate : 0.5036
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.8638
##
## Mcnemar's Test P-Value : 0.5708
##
## Sensitivity : 0.9412
## Specificity : 0.9227
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.9409
## Prevalence : 0.4964
## Detection Rate : 0.4672
## Detection Prevalence : 0.5061
## Balanced Accuracy : 0.9319
##
## 'Positive' Class : 1
##
Based on these predictions, the accuracy on the training data is 0.9319 and on the testing data 0.8641, a difference of 0.0678. A model is considered overfit when this gap exceeds 0.1; since 0.0678 < 0.1, the model passes the check.
It can be concluded that the model generalizes well to the held-out test data; in other words, it classifies users who buy versus those who don't buy a product well.
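The gap described above can be computed directly; the 0.1 threshold is the rule of thumb stated in the text:
# difference between training and testing accuracy as a rough overfitting check
acc_train <- mean(knn_pred_train == knn_train_y)
acc_test <- mean(knn_pred == knn_test_y)
acc_train - acc_test # 0.0678 < 0.1, so the model is not flagged as overfit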
Based on the analysis, it can be concluded that:
- The better model in this case is the KNN model, with an accuracy of 86.41% on the testing data and 93.19% on the training data.
- With the KNN model, 46 users were predicted to buy and actually bought, while 7 users were predicted not to buy but actually bought, giving a Sensitivity of 86.79%.
- With the KNN model, 43 users were predicted not to buy and actually did not buy, while 7 users were predicted to buy but did not buy, giving a Specificity of 86%.
- With the KNN model, only 7 users were predicted to buy but did not buy, giving a Precision of 86.79%.