Eduardo Ogawa Cardoso (IDUSP 10864890) - M.Sc. Student, Marketing
Maria Carolina Dias Cavalcante (IDUSP 12436263) - Ph.D. Candidate, Marketing
eogawac@usp.br; mcarolinadias@usp.br
This statistical model (also known as the logit model) is frequently employed for classification and predictive analytics. Logistic regression estimates the probability that a binary event occurs given a set of independent variables. Because the dependent variable is a probability, it is bounded between 0 and 1. In logistic regression, a logit transformation is applied to the odds, that is, the probability of success divided by the probability of failure.
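To make this transformation concrete, the short sketch below (with an arbitrary example probability, not taken from the dataset) computes the odds and the log-odds in R and maps the log-odds back to a probability with plogis():
p <- 0.75 # example probability of success (illustrative value only)
odds <- p / (1 - p) # odds of success: 0.75 / 0.25 = 3
log_odds <- log(odds) # the logit: the scale on which the model is linear
plogis(log_odds) # inverse logit, returns the original probability 0.75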
library(pscl)
library(car)
library(corrplot)
library(psych)
library(ggplot2)
library(olsrr)
library(MASS)
library(tidyverse)
library(caret)
library(caTools)
library(ROCR)
library(klaR)
library(devtools)
This data set is a sample of the age, gender, and estimated salary of an internal customer base. Each entry has a unique identifier and a boolean target variable (Purchased).
The intention of this study is to investigate whether the independent variables present in the dataset can be used as predictors in a classification model (logistic regression).
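The code below assumes the data have already been read into an object named social_network_ads; a minimal loading sketch, assuming the data sit in a CSV file (the file name here is hypothetical), would be:
social_network_ads <- read.csv("Social_Network_Ads.csv") # hypothetical file name; adjust to the actual location of the data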
df = social_network_ads # save the dataset in a variable
names(df) = gsub("\\.", "_", names(df)) # replace dots in column names with underscores
df = subset(df, select = -c(User_ID)) # remove the identifier column, which carries no predictive information
df$male_dummy = ifelse(df$Gender == "Male", 1, 0) # encode gender as a dummy variable
df = subset(df, select = -c(Gender)) # remove the original Gender column
minMax = function(x) { # min-max normalization: rescale a numeric vector to the [0, 1] range
(x - min(x)) / (max(x) - min(x))
}
df <- as.data.frame(lapply(df, minMax)) # apply the normalization to every column
head(df) #visualize new dataset
Display the names of the columns.
names(df)
[1] "Age" "EstimatedSalary" "Purchased" "male_dummy"
Now we perform a summary of all columns in the dataset. This summary helps us understand the main statistical characteristics of the data, such as the minimum, maximum, median, and mean.
summary(df)
Age EstimatedSalary Purchased male_dummy
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00
1st Qu.:0.2798 1st Qu.:0.2074 1st Qu.:0.0000 1st Qu.:0.00
Median :0.4524 Median :0.4074 Median :0.0000 Median :0.00
Mean :0.4680 Mean :0.4055 Mean :0.3575 Mean :0.49
3rd Qu.:0.6667 3rd Qu.:0.5407 3rd Qu.:1.0000 3rd Qu.:1.00
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00
First, we need to split the original dataset into training and testing subsets. The training subset is used to fit the predictive model. After the training phase, we evaluate the model on the test subset, and from those results we can calculate the accuracy of our logistic model.
df1 = subset(df, select = c(Age, EstimatedSalary, Purchased, male_dummy)) # keep only the model variables
split <- sample.split(df1$Purchased, SplitRatio = 0.8) # sample.split expects the target vector, not the whole data frame
train_reg <- subset(df1, split == TRUE)
test_reg <- subset(df1, split == FALSE)
# Training model
logistic_model <- glm(Purchased ~.,
data = train_reg,
family = "binomial")
logistic_model
Call: glm(formula = Purchased ~ ., family = "binomial", data = train_reg)
Coefficients:
(Intercept) Age EstimatedSalary male_dummy
-8.46454 11.06279 5.26288 0.09607
Degrees of Freedom: 299 Total (i.e. Null); 296 Residual
Null Deviance: 393.2
Residual Deviance: 200.5 AIC: 208.5
# Summary
summary(logistic_model)
Call:
glm(formula = Purchased ~ ., family = "binomial", data = train_reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1223 -0.5360 -0.1236 0.3333 2.4446
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.46454 1.08537 -7.799 6.25e-15 ***
Age 11.06279 1.44088 7.678 1.62e-14 ***
EstimatedSalary 5.26288 0.91242 5.768 8.02e-09 ***
male_dummy 0.09607 0.35776 0.269 0.788
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 393.19 on 299 degrees of freedom
Residual deviance: 200.52 on 296 degrees of freedom
AIC: 208.52
Number of Fisher Scoring iterations: 6
anova(logistic_model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: Purchased
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 299 393.19
Age 1 145.351 298 247.84 < 2.2e-16 ***
EstimatedSalary 1 47.245 297 200.59 6.265e-12 ***
male_dummy 1 0.072 296 200.52 0.7881
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The difference between the null deviance and the residual deviance indicates how our model performs relative to the null model (a model with only the intercept). The larger this gap, the better. Analyzing the table, we can observe the drop in deviance caused by adding each variable sequentially.
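As a quick check of this gap, the sketch below (reusing the logistic_model object fitted above) computes the drop in deviance and the corresponding chi-square p-value for the overall model:
dev_drop <- with(logistic_model, null.deviance - deviance) # about 393.19 - 200.52
df_drop <- with(logistic_model, df.null - df.residual) # number of predictors added, here 3
dev_drop
pchisq(dev_drop, df = df_drop, lower.tail = FALSE) # a very small p-value means the model clearly beats the null model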
Even though there is no exact equivalent to the R2 of linear regression, the McFadden R2 index can be utilized to evaluate how well a model fits the data.
pR2(logistic_model)
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-100.2596541 -196.5935985 192.6678888 0.4900157 0.4738802 0.6488414
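The McFadden value reported above can also be reproduced by hand from the log-likelihoods of the fitted model and of an intercept-only model, as sketched below:
null_model <- glm(Purchased ~ 1, data = train_reg, family = "binomial") # intercept-only (null) model
1 - as.numeric(logLik(logistic_model)) / as.numeric(logLik(null_model)) # McFadden R2, should match the value above (~0.49)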
As presented in the ANOVA test, the Age and EstimatedSalary variables can be included in the model, since their p-values are less than 0.05, indicating statistical significance; male_dummy, with a p-value of 0.788, is not significant.
Using the intercept and the significant betas, we can now generate a new variable (the linear predictor) that summarizes the features of the data set.
df$beta_age_salary = (-8.46454 + 11.06279*df$Age + 5.26288*df$EstimatedSalary) # linear predictor built from the intercept and the Age and EstimatedSalary coefficients
We now use the new column to visualize the logistic regression model’s probability curve.
ggplot(df, aes(x = beta_age_salary, y = Purchased)) +
geom_point(alpha = .5) +
stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial))
`geom_smooth()` using formula 'y ~ x'
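The S-shaped curve above is simply the inverse logit of the linear predictor. As a sketch (the prob_purchase column name is ours, introduced only for illustration), the fitted probabilities can be recovered explicitly with plogis():
df$prob_purchase <- plogis(df$beta_age_salary) # equivalent to 1 / (1 + exp(-beta_age_salary))
head(df[, c("beta_age_salary", "prob_purchase")]) # linear predictor and the corresponding probability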
# Predict purchase probabilities on the test set and classify them; a 0.5 probability cutoff is assumed here
predict_reg <- predict(logistic_model, test_reg, type = "response")
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
head(predict_reg)
1 5 9 13 17 21
0 0 0 0 0 0
test_reg$prediction = predict_reg
test_reg
table(test_reg$Purchased, predict_reg)
predict_reg
0 1
0 58 8
1 7 27
missing_classerr <- mean(test_reg$prediction != test_reg$Purchased)
missing_classerr
[1] 0.15
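The value above is the misclassification rate, so the accuracy is its complement (1 - 0.15 = 0.85). A richer summary (accuracy, sensitivity, and specificity in one call) can be obtained with confusionMatrix() from the caret package loaded earlier; a minimal sketch:
confusionMatrix(factor(test_reg$prediction, levels = c(0, 1)),
factor(test_reg$Purchased, levels = c(0, 1)),
positive = "1") # treat a purchase (1) as the positive class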
The ROC is a curve formed by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold values, and the AUC is the area under the ROC curve. As a general rule, the AUC of a model with strong predictive performance should be much closer to 1 than to 0.5.
ROCPred <- prediction(test_reg$prediction, test_reg$Purchased)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
[1] 0.8364528
plot(ROCPer)
plot(ROCPer, colorize = TRUE,
print.cutoffs.at = seq(0.1, 1, by = 0.1),
main = "ROC CURVE")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
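The ROC above is built from the hard 0/1 class predictions, which effectively uses a single cutoff. A common alternative, sketched below with the logistic_model and test_reg objects from above, is to pass the predicted probabilities to ROCR so the curve is traced across all thresholds:
prob_test <- predict(logistic_model, test_reg, type = "response") # predicted probabilities on the test set
ROCPred_prob <- prediction(prob_test, test_reg$Purchased)
ROCPer_prob <- performance(ROCPred_prob, measure = "tpr", x.measure = "fpr")
performance(ROCPred_prob, measure = "auc")@y.values[[1]] # AUC computed over all probability thresholds
plot(ROCPer_prob, colorize = TRUE, main = "ROC curve (probabilities)")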
The ROC curve demonstrates that the model has strong predictive performance. Note that the 0.15 reported above is the misclassification rate, which corresponds to an accuracy of roughly 85% on the test set. Since the independent variables Age and EstimatedSalary explain a substantial share of the variation in the target, they can reasonably be employed as predictors of purchase.
The basic purpose of linear discriminant analysis (LDA) is to estimate the relationship between a single categorical dependent variable and a set of quantitative independent variables.
In this particular analysis we will try to understand the relationship between the categorical variable “new_age” and the other variables.
df1 = social_network_ads # save the same data set in a new variable
names(df1) = gsub("\\.", "_", names(df1)) # replace dots in column names with underscores
df1 = subset(df1, select = -c(User_ID)) # remove the identifier column
df1$male_dummy = ifelse(df1$Gender == "Male", 1, 0) # encode gender as a dummy variable
df1 = subset(df1, select = -c(Gender)) # remove the original Gender column
df1$new_age = ifelse(df1$Age < 30, "Young", ifelse(df1$Age >= 30 & df1$Age < 50, "Adult", ifelse(df1$Age >= 50, "Old", "")))
# bin Age into three categorical groups: Young (< 30), Adult (30-49) and Old (>= 50)
df1 = subset(df1, select = -c(Age)) # remove the original Age column
# perform a new split
split <- sample.split(df1$new_age, SplitRatio = 0.8) # split on the new target variable
train_reg1 <- subset(df1, split == TRUE)
test_reg1 <- subset(df1, split == FALSE)
head(df1)
lda_reg = lda(new_age~., train_reg1)
lda_reg
Call:
lda(new_age ~ ., data = train_reg1)
Prior probabilities of groups:
Adult Old Young
0.6233333 0.1300000 0.2466667
Group means:
EstimatedSalary Purchased male_dummy
Adult 70550.80 0.37433155 0.5080214
Old 80461.54 0.92307692 0.3076923
Young 60351.35 0.04054054 0.4594595
Coefficients of linear discriminants:
LD1 LD2
EstimatedSalary 9.000415e-07 1.300687e-05
Purchased -2.452797e+00 1.013132e-02
male_dummy 1.737720e-01 1.881920e+00
Proportion of trace:
LD1 LD2
0.9694 0.0306
The prior probabilities show that about 62% of the sample belongs to the “Adult” group, while 13% are in the “Old” group and 25% in the “Young” group.
As seen above, the proportion of trace is about 97% for LD1 and 3% for LD2, so the first discriminant function captures almost all of the between-group separation.
The histogram analysis permits the identification of possible classification overlaps. As can be seen in the figure below (histogram of LD1), there are overlaps between the groups, indicating that this classification might be improved.
p <- predict(lda_reg, test_reg1)
ldahist(data = p$x[,1], g = test_reg1$new_age)
In the LD2 histogram, there is a substantial overlap between groups, which is undesirable.
p1 <- predict(lda_reg, test_reg1)
ldahist(data = p1$x[,2], g = test_reg1$new_age)
p2 <- predict(lda_reg, test_reg1)$class
tab1 <- table(Predicted = p2, Actual = test_reg1$new_age)
tab1
Actual
Predicted Adult Old Young
Adult 59 12 26
Old 1 2 0
Young 0 0 0
sum(diag(tab1))/sum(tab1)
[1] 0.61
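The overall accuracy hides how uneven the predictions are across groups; the short sketch below computes the recall of each actual group from the table above (diagonal divided by column totals):
diag(tab1) / colSums(tab1) # share of each actual group predicted correctly: Adult ~0.98, Old ~0.14, Young 0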
As a consequence of the LDA, an accuracy of 61 percent is obtained. Therefore, the age group of clients can be reasonably predicted based on their purchasing behavior, estimated salary, and gender, although most observations are classified into the “Adult” group.