mydata <- read.csv("C:/Users/pauli/Downloads/archive (13)/Customer_Behaviour.csv")
head(mydata)
## User.ID Gender Age EstimatedSalary Purchased
## 1 15624510 Male 19 19000 0
## 2 15810944 Male 35 20000 0
## 3 15668575 Female 26 43000 0
## 4 15603246 Female 27 57000 0
## 5 15804002 Male 19 76000 0
## 6 15728773 Male 27 58000 0
The data set contains details about 400 customers of a company: an ID, gender, age, estimated salary, and whether the customer decided to buy a specific product. It was retrieved from Kaggle on 18 February 2023; the author is unknown. The link to the data set is provided below: https://www.kaggle.com/datasets/denisadutca/customer-behaviour
The purpose of this analysis is to estimate the probability that a customer will buy this specific product based on their age, estimated salary and gender.
The data set consists of 400 observations and 5 variables:
- User.ID: unique ID of each customer
- Gender: gender of the customer
- Age: age of the customer, between 18 and 60 years
- EstimatedSalary: estimated salary of the customer
- Purchased: 1 = Yes (purchased the product), 0 = No (did not purchase the product)
library(psych)
describe(mydata[c(-1, -2, -5, -6), ]) # note: this indexes rows, dropping rows 1, 2, 5 and 6 (hence n = 396 below)
## vars n mean sd median trimmed mad
## User.ID 1 396 15691029.48 71439.87 15693776 15691012.87 92277.77
## Gender* 2 396 1.48 0.50 1 1.48 0.00
## Age 3 396 37.78 10.44 37 37.45 11.86
## EstimatedSalary 4 396 70010.10 34074.53 70000 67726.42 31134.60
## Purchased 5 396 0.36 0.48 0 0.33 0.00
## min max range skew kurtosis se
## User.ID 15566689 15815236 248547 -0.03 -1.19 3589.99
## Gender* 1 2 1 0.06 -2.00 0.03
## Age 18 60 42 0.23 -0.64 0.52
## EstimatedSalary 15000 150000 135000 0.49 -0.44 1712.31
## Purchased 0 1 1 0.58 -1.67 0.02
mydata$PurchasedF <- factor(mydata$Purchased,
                            levels = c(0, 1),
                            labels = c("No", "Yes")) # base is No
# recode Gender to 0/1 strings, then to a labelled factor
mydata$Gender[mydata$Gender == "Male"] <- "0"
mydata$Gender[mydata$Gender == "Female"] <- "1"
mydata$GenderF <- factor(mydata$Gender,
                         levels = c("0", "1"),
                         labels = c("Male", "Female")) # base is Male
head(mydata)
## User.ID Gender Age EstimatedSalary Purchased PurchasedF GenderF
## 1 15624510 0 19 19000 0 No Male
## 2 15810944 0 35 20000 0 No Male
## 3 15668575 1 26 43000 0 No Female
## 4 15603246 1 27 57000 0 No Female
## 5 15804002 0 19 76000 0 No Male
## 6 15728773 0 27 58000 0 No Male
summary(mydata[, c(-1)]) # summary without User.ID
## Gender Age EstimatedSalary Purchased
## Length:400 Min. :18.00 Min. : 15000 Min. :0.0000
## Class :character 1st Qu.:29.75 1st Qu.: 43000 1st Qu.:0.0000
## Mode :character Median :37.00 Median : 70000 Median :0.0000
## Mean :37.66 Mean : 69743 Mean :0.3575
## 3rd Qu.:46.00 3rd Qu.: 88000 3rd Qu.:1.0000
## Max. :60.00 Max. :150000 Max. :1.0000
## PurchasedF GenderF
## No :257 Male :196
## Yes:143 Female:204
##
##
##
##
The probability that a randomly selected customer purchases the product is P(Y=1) = 143/400 = 0.3575, i.e. there is a 0.3575 probability that a randomly selected person will purchase this specific product.
The odds that a randomly selected customer purchases the product are (143/400) / (257/400) = 143/257 ≈ 0.556: the probability of buying is about 0.56 times the probability of not buying, so a purchase is less likely than a non-purchase.
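These numbers can be reproduced directly in R (a quick check using the 0/1 Purchased column):
p_hat <- mean(mydata$Purchased)  # 143/400 = 0.3575
odds_hat <- p_hat / (1 - p_hat)  # 143/257, about 0.556
c(probability = p_hat, odds = odds_hat)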
fit0 <- glm(PurchasedF ~ 1,
            family = binomial,
            data = mydata)
summary(fit0)
##
## Call:
## glm(formula = PurchasedF ~ 1, family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9406 -0.9406 -0.9406 1.4343 1.4343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5862 0.1043 -5.619 1.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.57 on 399 degrees of freedom
## Residual deviance: 521.57 on 399 degrees of freedom
## AIC: 523.57
##
## Number of Fisher Scoring iterations: 4
exp(cbind(odds = fit0$coefficients, confint.default(fit0)))
## odds 2.5 % 97.5 %
## (Intercept) 0.5564202 0.4535236 0.6826623
The exponentiated intercept gives the odds of Y = 1 for fit0: 0.556, with a 95% confidence interval of 0.454 to 0.683.
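As a sanity check, these odds convert back to the sample proportion via p = odds/(1 + odds) (a minimal sketch in base R):
odds0 <- exp(coef(fit0))  # 0.5564
odds0 / (1 + odds0)       # 0.3575, the sample proportion of purchasers
# equivalently: plogis(coef(fit0))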
head(fitted(fit0)) #Estimated probability for Y=1
## 1 2 3 4 5 6
## 0.3575 0.3575 0.3575 0.3575 0.3575 0.3575
Proportion of correctly classified units under the null model:
Pseudo_R2_fit0 <- 257/400 # count R2: share of correctly classified units
Pseudo_R2_fit0
## [1] 0.6425
Every fitted probability equals 0.3575 < 0.5, so the null model classifies every customer as not buying (Y = 0). The 257 actual non-purchasers are therefore classified correctly: 64.25 percent of the units.
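The same reasoning, spelled out in R:
# every fitted probability is below the 0.5 cut-off, so all 400
# customers are classified as "No"; only the actual "No"s are correct
all(fitted(fit0) < 0.5)                        # TRUE
sum(mydata$PurchasedF == "No") / nrow(mydata)  # 0.6425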
fit1 <- glm(PurchasedF ~ Age,
            family = binomial,
            data = mydata)
summary(fit1)
##
## Call:
## glm(formula = PurchasedF ~ Age, family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5091 -0.6548 -0.2923 0.5706 2.4470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.04414 0.78417 -10.258 <2e-16 ***
## Age 0.18895 0.01915 9.866 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.57 on 399 degrees of freedom
## Residual deviance: 336.26 on 398 degrees of freedom
## AIC: 340.26
##
## Number of Fisher Scoring iterations: 5
We can see that Age is positive and statistically significant (z = 9.87, p < 0.001).
Next, we compare the models with the -2LL (likelihood-ratio) statistic:
H0: the simple model is adequate
H1: the complex model is better
anova(fit0, fit1, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: PurchasedF ~ 1
## Model 2: PurchasedF ~ Age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 399 521.57
## 2 398 336.26 1 185.31 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a deviance difference of 185.31 (p < 0.001), we reject H0 and conclude that the model with Age is better.
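The same statistic can be computed by hand from the two reported deviances (a minimal check):
LR <- fit0$deviance - fit1$deviance    # 521.57 - 336.26 = 185.31
pchisq(LR, df = 1, lower.tail = FALSE) # p < 0.001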
exp(cbind(OR = fit1$coefficients, confint.default(fit1)))
## OR 2.5 % 97.5 %
## (Intercept) 0.0003209765 6.902164e-05 0.001492661
## Age 1.2079800978 1.163477e+00 1.254185646
If age increases by 1 year, the odds of purchasing (Y = 1) are multiplied by 1.208 (95% CI: 1.163 to 1.254), p < 0.001.
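To see what this means on the probability scale, we can predict for a hypothetical customer; age 40 here is an arbitrary example value, not part of the original analysis:
# predicted purchase probability for a hypothetical 40-year-old
predict(fit1, newdata = data.frame(Age = 40), type = "response")
# same as plogis(-8.04414 + 0.18895 * 40), roughly 0.38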
We add estimated salary and gender to the model.
fit2 <- glm(PurchasedF ~ Age + EstimatedSalary + GenderF,
            family = binomial,
            data = mydata)
summary(fit2)
##
## Call:
## glm(formula = PurchasedF ~ Age + EstimatedSalary + GenderF, family = binomial,
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9109 -0.5218 -0.1406 0.3662 2.4254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.245e+01 1.309e+00 -9.510 < 2e-16 ***
## Age 2.370e-01 2.638e-02 8.984 < 2e-16 ***
## EstimatedSalary 3.644e-05 5.473e-06 6.659 2.77e-11 ***
## GenderFFemale -3.338e-01 3.052e-01 -1.094 0.274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.57 on 399 degrees of freedom
## Residual deviance: 275.84 on 396 degrees of freedom
## AIC: 283.84
##
## Number of Fisher Scoring iterations: 6
Gender is not statistically significant (p = 0.274).
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
vif(fit2)
## Age EstimatedSalary GenderF
## 1.400914 1.385439 1.027105
All VIF values are close to 1, so multicollinearity is not a concern.
Let's check for outliers and high-influence units.
mydata$StdResid <- rstandard(fit2)    # standardized residuals
mydata$CooksD <- cooks.distance(fit2) # Cook's distances
hist(mydata$StdResid,
     main = "Histogram of standardized residuals",
     ylab = "Frequency",
     xlab = "Standardized residuals")
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")
head(mydata[order(-mydata$CooksD), c("User.ID", "CooksD")])
## User.ID CooksD
## 285 15638646 0.08061434
## 307 15603942 0.07707895
## 65 15605000 0.07254503
## 208 15794566 0.05890553
## 213 15707596 0.04775044
## 271 15680752 0.03642241
head(mydata[order(mydata$StdResid), c("User.ID", "StdResid")])
## User.ID StdResid
## 65 15605000 -2.917086
## 307 15603942 -2.905049
## 285 15638646 -2.864438
## 208 15794566 -2.734718
## 213 15707596 -2.377311
## 271 15680752 -2.185483
We remove the three units with the largest Cook's distances: rows 285, 307 and 65. None of the standardized residuals exceed ±3 in absolute value, so there are no extreme outliers among the remaining units.
mydata <- mydata[c(-285, -307, -65), ] # drop the three high-influence rows and refit both models
fit1 <- glm(PurchasedF ~ Age,
            family = binomial,
            data = mydata)
fit2 <- glm(PurchasedF ~ Age + EstimatedSalary + GenderF,
            family = binomial,
            data = mydata)
Again, we check which model is better:
H0: the simple model (fit1) is adequate
H1: the complex model (fit2) is better
anova(fit1, fit2, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: PurchasedF ~ Age
## Model 2: PurchasedF ~ Age + EstimatedSalary + GenderF
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 395 323.21
## 2 393 248.02 2 75.198 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a deviance difference of 75.198 (p < 0.001), we reject H0 and accept H1: the complex model (fit2) fits better.
summary(fit2)
##
## Call:
## glm(formula = PurchasedF ~ Age + EstimatedSalary + GenderF, family = binomial,
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.04430 -0.46415 -0.09278 0.26897 2.55202
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476e+01 1.619e+00 -9.116 < 2e-16 ***
## Age 2.813e-01 3.231e-02 8.705 < 2e-16 ***
## EstimatedSalary 4.493e-05 6.423e-06 6.996 2.64e-12 ***
## GenderFFemale -3.655e-01 3.225e-01 -1.134 0.257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 518.91 on 396 degrees of freedom
## Residual deviance: 248.02 on 393 degrees of freedom
## AIC: 256.02
##
## Number of Fisher Scoring iterations: 6
If age increases by 1 year, the odds of purchase (Y = 1) are multiplied by 1.3248, holding the remaining variables constant (p < 0.001): older customers are more likely to purchase. If the estimated salary increases by 1 dollar, the odds of purchase are multiplied by 1.000045, holding the remaining variables constant: customers with higher salaries are more likely to purchase. Gender is not statistically significant, so we do not interpret its coefficient.
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
## OR 2.5 % 97.5 %
## (Intercept) 3.887628e-07 1.627495e-08 9.286449e-06
## Age 1.324847e+00 1.243544e+00 1.411465e+00
## EstimatedSalary 1.000045e+00 1.000032e+00 1.000058e+00
## GenderFFemale 6.938330e-01 3.687942e-01 1.305347e+00
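The per-dollar odds ratio for EstimatedSalary is hard to read because it is so close to 1; rescaling the coefficient to a 10,000-dollar step makes it more interpretable (a quick computation from the fitted model; the step size is an arbitrary choice for readability):
# odds ratio per 10,000 dollars of estimated salary
exp(coef(fit2)["EstimatedSalary"] * 10000) # exp(0.4493), about 1.57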
mydata$EstProb <- fitted(fit2)
head(mydata)
## User.ID Gender Age EstimatedSalary Purchased PurchasedF GenderF StdResid
## 1 15624510 0 19 19000 0 No Male -0.03759471
## 2 15810944 0 35 20000 0 No Male -0.25351285
## 3 15668575 1 26 43000 0 No Female -0.11284002
## 4 15603246 1 27 57000 0 No Female -0.16374958
## 5 15804002 0 19 76000 0 No Male -0.10615600
## 6 15728773 0 27 58000 0 No Male -0.19680569
## CooksD EstProb
## 1 7.273387e-08 0.0001912181
## 2 4.284827e-05 0.0177019461
## 3 3.071217e-06 0.0027871957
## 4 1.021259e-05 0.0068983435
## 5 2.568750e-06 0.0024710158
## 6 1.698819e-05 0.0103630307
The estimated probability that the person with User.ID 15624510 will purchase the product is about 0.0002.
The cut-off point is 0.5: if the estimated probability is below 0.5, the customer is classified as not purchasing (NO), and if it is above, as purchasing (YES).
mydata$Classification <- ifelse(test = mydata$EstProb < 0.50,
                                yes = "NO",
                                no = "YES")
mydata$ClassificationF <- factor(mydata$Classification,
                                 levels = c("NO", "YES"),
                                 labels = c("NO", "YES"))
ClassificationTable <- table(mydata$PurchasedF, mydata$ClassificationF) # observed (rows) vs predicted (columns)
ClassificationTable
##
## NO YES
## No 234 20
## Yes 38 105
How many customers were correctly classified by fit2 using 0.5 as the cut-off: (234 + 105)/397 = 0.8539, i.e. 85.39 percent.
# we verify the calculation from the classification table
Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata)
Pseudo_R2_fit2
## [1] 0.8539043
We correctly classified 85.4% of the individuals.
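As a supplementary check (not part of the original analysis), sensitivity and specificity can be read off the same table:
# sensitivity: share of actual purchasers classified as YES
ClassificationTable["Yes", "YES"] / sum(ClassificationTable["Yes", ]) # 105/143, about 0.73
# specificity: share of actual non-purchasers classified as NO
ClassificationTable["No", "NO"] / sum(ClassificationTable["No", ])    # 234/254, about 0.92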