Logistic Regression

About Dataset

This analysis uses the Social Network Ads dataset to predict whether a user purchased the product or not.

Libraries and Setup

We’ll set up caching for this notebook, given how computationally expensive some of the code can get.

The following packages are loaded into the workspace using the library() function:

library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(glue)
library(ggpubr)
library(scales)
library(caret)
library(e1071)
library(ipred)
library(class) 

These packages cover the data wrangling, visualization, and modeling steps used throughout the analysis.

Dataset

data = read.csv("E:/Algoritma/5_lbb_logistik/User_Data.csv", stringsAsFactors = T)
head(data)

Description of the dataset:

  • User.ID : the user's ID
  • Gender : the user's gender
  • Age : the user's age
  • EstimatedSalary : the user's estimated salary
  • Purchased : whether the product was purchased (1) or not (0)

Exploratory Data Analysis

Data Wrangling

glimpse(data)
## Rows: 400
## Columns: 5
## $ User.ID         <int> 15624510, 15810944, 15668575, 15603246, 15804002, 1572…
## $ Gender          <fct> Male, Male, Female, Female, Male, Male, Female, Female…
## $ Age             <int> 19, 35, 26, 27, 19, 27, 27, 32, 25, 35, 26, 26, 20, 32…
## $ EstimatedSalary <int> 19000, 20000, 43000, 57000, 76000, 58000, 84000, 15000…
## $ Purchased       <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …

The initial dataset has 5 columns; glimpse() shows the data type of each.

data = data %>%
  select(-c(User.ID)) %>%
  mutate(
    Purchased = as.factor(Purchased)
  )
glimpse(data)
## Rows: 400
## Columns: 4
## $ Gender          <fct> Male, Male, Female, Female, Male, Male, Female, Female…
## $ Age             <int> 19, 35, 26, 27, 19, 27, 27, 32, 25, 35, 26, 26, 20, 32…
## $ EstimatedSalary <int> 19000, 20000, 43000, 57000, 76000, 58000, 84000, 15000…
## $ Purchased       <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …

The User.ID variable is dropped because it is only an identifier, and Purchased is converted to a factor so it can serve as a categorical target.

Missing Value

colSums(is.na(data))
##          Gender             Age EstimatedSalary       Purchased 
##               0               0               0               0

The dataset has no missing values, so no missing-value handling is required and the analysis can proceed.

Dimension and Proportion of Data

dim(data)
## [1] 400   4
prop.table(table(data$Purchased))
## 
##      0      1 
## 0.6425 0.3575

The data consists of 400 rows and 4 variables.

The proportion between buying (1) and not buying (0) is 35.75% to 64.25%, so the target classes are imbalanced.

RNGkind(sample.kind = "Rounding")
set.seed(1)

data_up <- upSample(x = data %>% select(-Purchased), 
                           y = data$Purchased, 
                           yname = "Purchased")
prop.table(table(data_up$Purchased))
## 
##   0   1 
## 0.5 0.5
dim(data_up)
## [1] 514   4

After handling the imbalance by up-sampling the minority class, the proportion is 50:50; in other words, the model will see buying and not buying equally often. The data dimensions have grown accordingly, from 400 to 514 rows.
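
For reference, caret also offers the opposite strategy; a minimal sketch of down-sampling the majority class instead (not used in this analysis):

data_down <- downSample(x = data %>% select(-Purchased), # same interface as upSample()
                        y = data$Purchased,
                        yname = "Purchased")
prop.table(table(data_down$Purchased)) # also 50:50, but with fewer rows (2 x 143 = 286)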

Cross Validation

set.seed(9)
index <- sample(x = nrow(data_up), size = nrow(data_up)*0.8)

# splitting (run the whole chunk at once)
data_train <- data_up[index,]
data_test <- data_up[-index,]
nrow(data_train)
## [1] 411
nrow(data_test)
## [1] 103

The data is split 80:20 into training and testing sets, giving 411 rows for training and 103 rows for testing.
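
As a side note, the simple random split above does not guarantee that the 50:50 class balance carries into each subset. A sketch of a stratified alternative using caret's createDataPartition() (an aside; the rest of the analysis keeps the random split above):

set.seed(9)
idx_strat <- createDataPartition(y = data_up$Purchased, p = 0.8, list = FALSE)
train_strat <- data_up[idx_strat, ]
test_strat  <- data_up[-idx_strat, ]
prop.table(table(train_strat$Purchased)) # close to 50:50 in both subsets by construction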

Modeling

The models used in this case are the Logistic Regression model and the K-Nearest Neighbors (KNN) model.

Logistic Regression Model

Two logistic regression models are fitted: one using all predictor variables (model_all) and one obtained by backward stepwise selection (model_backward).

Model All

model_all <- glm(
  formula = Purchased ~ .,
  data = data_train,
  family = "binomial",
  control = list(trace=FALSE) 
)
summary(model_all)
## 
## Call:
## glm(formula = Purchased ~ ., family = "binomial", data = data_train, 
##     control = list(trace = FALSE))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4049  -0.3473  -0.0212   0.3599   2.4415  
## 
## Coefficients:
##                      Estimate    Std. Error z value             Pr(>|z|)    
## (Intercept)     -15.038576118   1.699366419  -8.850 < 0.0000000000000002 ***
## GenderMale        0.212217570   0.315557440   0.673                0.501    
## Age               0.295088529   0.032855079   8.982 < 0.0000000000000002 ***
## EstimatedSalary   0.000043154   0.000006398   6.745      0.0000000000153 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 569.75  on 410  degrees of freedom
## Residual deviance: 256.13  on 407  degrees of freedom
## AIC: 264.13
## 
## Number of Fisher Scoring iterations: 6
exp(model_all$coefficient)
##     (Intercept)      GenderMale             Age EstimatedSalary 
## 0.0000002943265 1.2364168634805 1.3432452700561 1.0000431554145

In model_all the Gender variable is not statistically significant (p = 0.501), so it has no meaningful effect on the model. Interpreting the exponentiated coefficients (a numeric illustration follows this list):

  • Intercept: if all predictors are held at zero, the odds that the user buys the product are 0.00000029432, i.e. essentially zero.

  • GenderMale: male users have 1.2364 times the odds of buying compared to female users, holding all other predictors constant (though this effect is not significant).

  • Age: each additional year of age multiplies the odds of buying by 1.3432, holding all other predictors constant.

  • EstimatedSalary: each additional unit of EstimatedSalary multiplies the odds of buying by 1.0000431, holding all other predictors constant.
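
To make the odds interpretation concrete, here is a minimal sketch converting the model's log-odds into a purchase probability; the user profile below (a 30-year-old male earning 87000) is hypothetical:

new_user <- data.frame(Gender = factor("Male", levels = levels(data$Gender)),
                       Age = 30L,
                       EstimatedSalary = 87000L)
predict(model_all, newdata = new_user, type = "response") # applies the inverse logit for us
plogis(sum(coef(model_all) * c(1, 1, 30, 87000))) # the same value computed by hand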

Model Backward

model_backward = step(object = model_all,
                      direction = 'backward',
                      trace = F)
summary(model_backward)
## 
## Call:
## glm(formula = Purchased ~ Age + EstimatedSalary, family = "binomial", 
##     data = data_train, control = list(trace = FALSE))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4181  -0.3635  -0.0220   0.3526   2.3870  
## 
## Coefficients:
##                      Estimate    Std. Error z value             Pr(>|z|)    
## (Intercept)     -14.771268650   1.632918439  -9.046 < 0.0000000000000002 ***
## Age               0.291823643   0.032201993   9.062 < 0.0000000000000002 ***
## EstimatedSalary   0.000042738   0.000006333   6.749      0.0000000000149 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 569.75  on 410  degrees of freedom
## Residual deviance: 256.59  on 408  degrees of freedom
## AIC: 262.59
## 
## Number of Fisher Scoring iterations: 6
exp(model_backward$coefficient)
##     (Intercept)             Age EstimatedSalary 
## 0.0000003845205 1.3388668786092 1.0000427390420

In model_backward the stepwise procedure removed the Gender variable, since it does not significantly improve the model. Interpreting the remaining coefficients:

  • Intercept: if all predictors are held at zero, the odds that the user buys the product are 0.00000038452, i.e. essentially zero.

  • Age: each additional year of age multiplies the odds of buying by 1.33886, holding all other predictors constant.

  • EstimatedSalary: each additional unit of EstimatedSalary multiplies the odds of buying by 1.0000427, holding all other predictors constant.

model_all$aic
## [1] 264.1305
model_backward$aic
## [1] 262.5853

Based on the smaller AIC (262.59 vs 264.13), model_backward is the model selected for prediction.
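
As a quick sanity check on these numbers: for a GLM fit to 0/1 responses, AIC equals the residual deviance plus twice the number of estimated coefficients:

model_backward$deviance + 2 * length(coef(model_backward)) # 256.59 + 2*3 = 262.59, matching the summary above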

Predict Logistic Regression Model

data_test$pred_purchased <- predict(
  object = model_backward,
  newdata = data_test,
  type = "response"
)
data_test$pred_label <- 
  ifelse(data_test$pred_purchased < 0.5, 0, 1) %>% 
  as.factor()

table(predict = data_test$pred_label,
      actual = data_test$Purchased)
##        actual
## predict  0  1
##       0 40 11
##       1 10 42

From the confusion matrix:

  • true positive (TP): predicted to buy and actually bought, 42 users
  • true negative (TN): predicted not to buy and actually did not buy, 40 users
  • false positive (FP): predicted to buy but did not buy, 10 users
  • false negative (FN): predicted not to buy but actually bought, 11 users

Because we want to minimize false positives (users predicted to buy who do not), precision (Pos Pred Value) is the metric to prioritize.

confusionMatrix(
  data = data_test$pred_label, 
  reference = data_test$Purchased, 
  positive = "1"
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40 11
##          1 10 42
##                                           
##                Accuracy : 0.7961          
##                  95% CI : (0.7054, 0.8691)
##     No Information Rate : 0.5146          
##     P-Value [Acc > NIR] : 0.000000003087  
##                                           
##                   Kappa : 0.5921          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7925          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.8077          
##          Neg Pred Value : 0.7843          
##              Prevalence : 0.5146          
##          Detection Rate : 0.4078          
##    Detection Prevalence : 0.5049          
##       Balanced Accuracy : 0.7962          
##                                           
##        'Positive' Class : 1               
## 

From the summary statistics:

  • Accuracy: the model predicts the target class correctly 79.61% of the time
  • Sensitivity/Recall: performance on the positive class (buy) is 79.25%
  • Specificity: performance on the negative class (not buy) is 80%
  • Pos Pred Value/Precision: of the users predicted to buy, 80.77% actually bought
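
Since the stated goal is to keep false positives low, one option worth noting (an illustration, not part of the original workflow) is to raise the classification cutoff above 0.5, which makes the model predict "buy" less often and typically trades recall for precision:

pred_label_strict <- as.factor(ifelse(data_test$pred_purchased < 0.7, 0, 1)) # stricter cutoff
table(predict = pred_label_strict, actual = data_test$Purchased) # fewer predicted 1s, so FP tends to drop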

KNN Model

knn_train_x <- data_train %>% 
  select(-c(Gender, Purchased))
knn_test_x <- data_test %>% 
  select(-c(Gender, Purchased, pred_purchased, pred_label))

knn_train_y <- data_train[,"Purchased"] 
knn_test_y <- data_test[,"Purchased"]

For KNN modeling, the predictor variables and the target variable are separated in both the training and testing data. Gender is excluded from the predictors because knn() computes distances and requires numeric inputs.

train_x_scale <- scale(x = knn_train_x)
test_x_scale <- scale(x = knn_test_x,
        center = attr(train_x_scale, "scaled:center"),
        scale = attr(train_x_scale, "scaled:scale"))

Scaling puts each variable on a comparable footing, since the predictors have different ranges and units. Here the z-score method is used, and the test set is scaled with the training set's center and scale so that no information leaks from the test data.
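
A quick illustrative check that scale() with the stored training attributes is just the z-score transform z = (x - mean_train) / sd_train applied column by column:

all.equal(as.numeric(test_x_scale[, "Age"]),
          (knn_test_x$Age - mean(knn_train_x$Age)) / sd(knn_train_x$Age)) # should be TRUE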

sqrt(nrow(data_train))
## [1] 20.27313

A common heuristic sets k near the square root of the number of training rows: sqrt(411) ≈ 20.27, rounded here to k = 20.
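
One caveat with an even k: for a binary target, majority voting can tie. A common adjustment is to bump to the nearest odd number (the analysis below keeps k = 20 as chosen above):

k_opt <- round(sqrt(nrow(data_train))) # 20
if (k_opt %% 2 == 0) k_opt <- k_opt + 1 # 21, so the two classes can never tie a vote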

Predict KNN Model

knn_pred <- knn(train = train_x_scale, # scaled training data
                 test = test_x_scale, # scaled test data
                 cl = knn_train_y, # target labels from the training data
                 k = 20)
confusionMatrix(
  data = knn_pred,
  reference = knn_test_y,
  positive = "1"
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43  7
##          1  7 46
##                                              
##                Accuracy : 0.8641             
##                  95% CI : (0.7825, 0.9237)   
##     No Information Rate : 0.5146             
##     P-Value [Acc > NIR] : 0.00000000000006894
##                                              
##                   Kappa : 0.7279             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 0.8679             
##             Specificity : 0.8600             
##          Pos Pred Value : 0.8679             
##          Neg Pred Value : 0.8600             
##              Prevalence : 0.5146             
##          Detection Rate : 0.4466             
##    Detection Prevalence : 0.5146             
##       Balanced Accuracy : 0.8640             
##                                              
##        'Positive' Class : 1                  
## 

From the confusion matrix:

  • true positive (TP): predicted to buy and actually bought, 46 users
  • true negative (TN): predicted not to buy and actually did not buy, 43 users
  • false positive (FP): predicted to buy but did not buy, 7 users
  • false negative (FN): predicted not to buy but actually bought, 7 users

From the summary statistics:

  • Accuracy: the model predicts the target class correctly 86.41% of the time
  • Sensitivity/Recall: performance on the positive class (buy) is 86.79%
  • Specificity: performance on the negative class (not buy) is 86%
  • Pos Pred Value/Precision: of the users predicted to buy, 86.79% actually bought

Comparing the two models, accuracy is 79.61% for the Logistic Regression model and 86.41% for the KNN model.

Overfitting check

knn_pred_train <- knn(train = train_x_scale, 
                 test = train_x_scale, 
                 cl = knn_train_y, 
                 k = 20)

confusionMatrix(
  data = knn_pred_train,
  reference = knn_train_y,
  positive = "1"
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 191  12
##          1  16 192
##                                              
##                Accuracy : 0.9319             
##                  95% CI : (0.903, 0.9543)    
##     No Information Rate : 0.5036             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.8638             
##                                              
##  Mcnemar's Test P-Value : 0.5708             
##                                              
##             Sensitivity : 0.9412             
##             Specificity : 0.9227             
##          Pos Pred Value : 0.9231             
##          Neg Pred Value : 0.9409             
##              Prevalence : 0.4964             
##          Detection Rate : 0.4672             
##    Detection Prevalence : 0.5061             
##       Balanced Accuracy : 0.9319             
##                                              
##        'Positive' Class : 1                  
## 

Based on these predictions, the accuracy on the training data is 0.9319 and on the test data 0.8641, a difference of 0.0678. A model is commonly flagged as overfit when this gap exceeds 0.1; since 0.0678 < 0.1, the model does not appear to overfit.
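
The gap itself, computed from the two accuracy values reported above:

acc_train <- 0.9319
acc_test  <- 0.8641
acc_train - acc_test # 0.0678, below the 0.1 rule-of-thumb threshold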

It can be concluded that the model generalizes well to the available test data; in other words, it is a good model for classifying users who buy or don’t buy the product.

Conclusion

Based on the analysis that has been done, it can be concluded that:

  • The better model in this case is the KNN model, with an accuracy of 86.41% on the testing data and 93.19% on the training data

  • The KNN model predicted buy and actually bought (TP) for 46 users and predicted not to buy but actually bought (FN) for 7 users, giving a Sensitivity of 86.79%

  • The KNN model predicted not to buy and actually did not buy (TN) for 43 users and predicted to buy but did not buy (FP) for 7 users, giving a Specificity of 86%

  • The KNN model predicted to buy but did not buy (FP) for only 7 users, giving a Precision of 86.79%