mydata <- read.csv("C:/Users/pauli/Downloads/archive (13)/Customer_Behaviour.csv")
head(mydata)
## User.ID Gender Age EstimatedSalary Purchased
## 1 15624510 Male 19 19000 0
## 2 15810944 Male 35 20000 0
## 3 15668575 Female 26 43000 0
## 4 15603246 Female 27 57000 0
## 5 15804002 Male 19 76000 0
## 6 15728773 Male 27 58000 0
The data set contains details about 400 customers of a company: an ID, gender, age, estimated salary, and whether the customer decided to buy a specific product. It was retrieved from Kaggle on 18 February 2023; the author is unknown. The link to the data set is provided below: https://www.kaggle.com/datasets/denisadutca/customer-behaviour
The purpose of this analysis is to estimate the probability that a customer will buy this specific product based on their age, estimated salary and gender.
The data set consists of 400 observations and 5 variables:
- User.ID: unique ID of each customer
- Gender: gender of the customer
- Age: age of the customer, between 18 and 60 years
- EstimatedSalary: estimated salary of the customer
- Purchased: 1 = Yes (purchased the product), 0 = No (did not purchase the product)
library(psych)
describe(mydata[c(-1, -2, -5, -6), ]) # note: this indexes rows, dropping rows 1, 2, 5 and 6 (hence n = 396 below)
## vars n mean sd median trimmed mad
## User.ID 1 396 15691029.48 71439.87 15693776 15691012.87 92277.77
## Gender* 2 396 1.48 0.50 1 1.48 0.00
## Age 3 396 37.78 10.44 37 37.45 11.86
## EstimatedSalary 4 396 70010.10 34074.53 70000 67726.42 31134.60
## Purchased 5 396 0.36 0.48 0 0.33 0.00
## min max range skew kurtosis se
## User.ID 15566689 15815236 248547 -0.03 -1.19 3589.99
## Gender* 1 2 1 0.06 -2.00 0.03
## Age 18 60 42 0.23 -0.64 0.52
## EstimatedSalary 15000 150000 135000 0.49 -0.44 1712.31
## Purchased 0 1 1 0.58 -1.67 0.02
mydata$PurchasedF <- factor(mydata$Purchased,
                            levels = c(0, 1),
                            labels = c("No", "Yes")) # base is No
# recode Gender to 0/1 strings, then to a labelled factor
mydata$Gender[mydata$Gender == "Male"] <- "0"
mydata$Gender[mydata$Gender == "Female"] <- "1"
mydata$GenderF <- factor(mydata$Gender,
                         levels = c("0", "1"),
                         labels = c("Male", "Female")) # base is Male
head(mydata)
## User.ID Gender Age EstimatedSalary Purchased PurchasedF GenderF
## 1 15624510 0 19 19000 0 No Male
## 2 15810944 0 35 20000 0 No Male
## 3 15668575 1 26 43000 0 No Female
## 4 15603246 1 27 57000 0 No Female
## 5 15804002 0 19 76000 0 No Male
## 6 15728773 0 27 58000 0 No Male
summary(mydata[, c(-1)]) # summary without User.ID
## Gender Age EstimatedSalary Purchased
## Length:400 Min. :18.00 Min. : 15000 Min. :0.0000
## Class :character 1st Qu.:29.75 1st Qu.: 43000 1st Qu.:0.0000
## Mode :character Median :37.00 Median : 70000 Median :0.0000
## Mean :37.66 Mean : 69743 Mean :0.3575
## 3rd Qu.:46.00 3rd Qu.: 88000 3rd Qu.:1.0000
## Max. :60.00 Max. :150000 Max. :1.0000
## PurchasedF GenderF
## No :257 Male :196
## Yes:143 Female:204
##
##
##
##
The probability that a randomly selected customer purchases the product is P(Y=1) = 143/400 = 0.3575, i.e. there is a 0.3575 probability that a randomly selected person will purchase this specific product.
The odds that a randomly selected customer purchases the product are (143/400) / (257/400) = 143/257 ≈ 0.556: the probability of buying is about 0.56 times the probability of not buying, so a purchase is less likely than a non-purchase.
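These numbers can be reproduced directly in R (a quick check using the 0/1 Purchased column):
p_hat <- mean(mydata$Purchased)  # 143/400 = 0.3575
odds_hat <- p_hat / (1 - p_hat)  # 143/257, about 0.556
c(probability = p_hat, odds = odds_hat)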
fit0 <- glm(PurchasedF ~ 1,
            family = binomial,
            data = mydata)
summary(fit0)
##
## Call:
## glm(formula = PurchasedF ~ 1, family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9406 -0.9406 -0.9406 1.4343 1.4343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5862 0.1043 -5.619 1.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.57 on 399 degrees of freedom
## Residual deviance: 521.57 on 399 degrees of freedom
## AIC: 523.57
##
## Number of Fisher Scoring iterations: 4
exp(cbind(odds = fit0$coefficients, confint.default(fit0)))
## odds 2.5 % 97.5 %
## (Intercept) 0.5564202 0.4535236 0.6826623
The exponentiated intercept gives the odds of Y = 1 for fit0: 0.556, with a 95% confidence interval of 0.454 to 0.683.
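As a sanity check, these odds convert back to the sample proportion via p = odds/(1 + odds) (a minimal sketch in base R):
odds0 <- exp(coef(fit0))  # 0.5564
odds0 / (1 + odds0)       # 0.3575, the sample proportion of purchasers
# equivalently: plogis(coef(fit0))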
head(fitted(fit0)) #Estimated probability for Y=1
## 1 2 3 4 5 6
## 0.3575 0.3575 0.3575 0.3575 0.3575 0.3575
Proportion of correctly classified units under the null model:
Pseudo_R2_fit0 <- 257/400 # count R2: share of correctly classified units
Pseudo_R2_fit0
## [1] 0.6425
Every fitted probability equals 0.3575 < 0.5, so the null model classifies every customer as not buying (Y = 0). The 257 actual non-purchasers are therefore classified correctly: 64.25 percent of the units.
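The same reasoning, spelled out in R:
# every fitted probability is below the 0.5 cut-off, so all 400
# customers are classified as "No"; only the actual "No"s are correct
all(fitted(fit0) < 0.5)                        # TRUE
sum(mydata$PurchasedF == "No") / nrow(mydata)  # 0.6425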
fit1 <- glm(PurchasedF ~ Age,
            family = binomial,
            data = mydata)
summary(fit1)
##
## Call:
## glm(formula = PurchasedF ~ Age, family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5091 -0.6548 -0.2923 0.5706 2.4470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.04414 0.78417 -10.258 <2e-16 ***
## Age 0.18895 0.01915 9.866 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.57 on 399 degrees of freedom
## Residual deviance: 336.26 on 398 degrees of freedom
## AIC: 340.26
##
## Number of Fisher Scoring iterations: 5
We can see that Age is positive and statistically significant (z = 9.87, p < 0.001).
Next, we compare the models with the -2LL (likelihood-ratio) statistic:
H0: the simple model is adequate
H1: the complex model is better
anova(fit0, fit1, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: PurchasedF ~ 1
## Model 2: PurchasedF ~ Age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 399 521.57
## 2 398 336.26 1 185.31 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a deviance difference of 185.31 (p < 0.001), we reject H0 and conclude that the model with Age is better.
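The same statistic can be computed by hand from the two reported deviances (a minimal check):
LR <- fit0$deviance - fit1$deviance    # 521.57 - 336.26 = 185.31
pchisq(LR, df = 1, lower.tail = FALSE) # p < 0.001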
exp(cbind(OR = fit1$coefficients, confint.default(fit1)))
## OR 2.5 % 97.5 %
## (Intercept) 0.0003209765 6.902164e-05 0.001492661
## Age 1.2079800978 1.163477e+00 1.254185646
If age increases by 1 year, the odds of purchasing (Y = 1) are multiplied by 1.208 (95% CI: 1.163 to 1.254), p < 0.001.
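To see what this means on the probability scale, we can predict for a hypothetical customer; age 40 here is an arbitrary example value, not part of the original analysis:
# predicted purchase probability for a hypothetical 40-year-old
predict(fit1, newdata = data.frame(Age = 40), type = "response")
# same as plogis(-8.04414 + 0.18895 * 40), roughly 0.38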
We add estimated salary and gender to the model.
fit2 <- glm(PurchasedF ~ Age + EstimatedSalary + GenderF,
            family = binomial,
            data = mydata)
summary(fit2)
##
## Call:
## glm(formula = PurchasedF ~ Age + EstimatedSalary + GenderF, family = binomial,
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9109 -0.5218 -0.1406 0.3662 2.4254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.245e+01 1.309e+00 -9.510 < 2e-16 ***
## Age 2.370e-01 2.638e-02 8.984 < 2e-16 ***
## EstimatedSalary 3.644e-05 5.473e-06 6.659 2.77e-11 ***
## GenderFFemale -3.338e-01 3.052e-01 -1.094 0.274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.57 on 399 degrees of freedom
## Residual deviance: 275.84 on 396 degrees of freedom
## AIC: 283.84
##
## Number of Fisher Scoring iterations: 6
Gender is not statistically significant (p = 0.274).
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
vif(fit2)
## Age EstimatedSalary GenderF
## 1.400914 1.385439 1.027105
All VIF values are close to 1, so multicollinearity is not a concern.
Let's check for outliers and high-influence units.
mydata$StdResid <- rstandard(fit2)    # standardized residuals
mydata$CooksD <- cooks.distance(fit2) # Cook's distances
hist(mydata$StdResid,
     main = "Histogram of standardized residuals",
     ylab = "Frequency",
     xlab = "Standardized residuals")
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")
head(mydata[order(-mydata$CooksD), c("User.ID", "CooksD")])
## User.ID CooksD
## 285 15638646 0.08061434
## 307 15603942 0.07707895
## 65 15605000 0.07254503
## 208 15794566 0.05890553
## 213 15707596 0.04775044
## 271 15680752 0.03642241
head(mydata[order(mydata$StdResid), c("User.ID", "StdResid")])
## User.ID StdResid
## 65 15605000 -2.917086
## 307 15603942 -2.905049
## 285 15638646 -2.864438
## 208 15794566 -2.734718
## 213 15707596 -2.377311
## 271 15680752 -2.185483
We remove the three units with the largest Cook's distances: rows 285, 307 and 65. None of the standardized residuals exceed ±3 in absolute value, so there are no extreme outliers among the remaining units.
mydata <- mydata[c(-285, -307, -65), ] # drop the three high-influence rows and refit both models
fit1 <- glm(PurchasedF ~ Age,
            family = binomial,
            data = mydata)
fit2 <- glm(PurchasedF ~ Age + EstimatedSalary + GenderF,
            family = binomial,
            data = mydata)
Again, we check which model is better:
H0: the simple model (fit1) is adequate
H1: the complex model (fit2) is better
anova(fit1, fit2, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: PurchasedF ~ Age
## Model 2: PurchasedF ~ Age + EstimatedSalary + GenderF
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 395 323.21
## 2 393 248.02 2 75.198 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a deviance difference of 75.198 (p < 0.001), we reject H0 and accept H1: the complex model (fit2) fits better.
summary(fit2)
##
## Call:
## glm(formula = PurchasedF ~ Age + EstimatedSalary + GenderF, family = binomial,
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.04430 -0.46415 -0.09278 0.26897 2.55202
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476e+01 1.619e+00 -9.116 < 2e-16 ***
## Age 2.813e-01 3.231e-02 8.705 < 2e-16 ***
## EstimatedSalary 4.493e-05 6.423e-06 6.996 2.64e-12 ***
## GenderFFemale -3.655e-01 3.225e-01 -1.134 0.257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 518.91 on 396 degrees of freedom
## Residual deviance: 248.02 on 393 degrees of freedom
## AIC: 256.02
##
## Number of Fisher Scoring iterations: 6
If age increases by 1 year, the odds of purchase (Y = 1) are multiplied by 1.3248, holding the remaining variables constant (p < 0.001): older customers are more likely to purchase. If the estimated salary increases by 1 dollar, the odds of purchase are multiplied by 1.000045, holding the remaining variables constant: customers with higher salaries are more likely to purchase. Gender is not statistically significant, so we do not interpret its coefficient.
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
## OR 2.5 % 97.5 %
## (Intercept) 3.887628e-07 1.627495e-08 9.286449e-06
## Age 1.324847e+00 1.243544e+00 1.411465e+00
## EstimatedSalary 1.000045e+00 1.000032e+00 1.000058e+00
## GenderFFemale 6.938330e-01 3.687942e-01 1.305347e+00
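The per-dollar odds ratio for EstimatedSalary is hard to read because it is so close to 1; rescaling the coefficient to a 10,000-dollar step makes it more interpretable (a quick computation from the fitted model; the step size is an arbitrary choice for readability):
# odds ratio per 10,000 dollars of estimated salary
exp(coef(fit2)["EstimatedSalary"] * 10000) # exp(0.4493), about 1.57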
mydata$EstProb <- fitted(fit2)
head(mydata)
## User.ID Gender Age EstimatedSalary Purchased PurchasedF GenderF StdResid
## 1 15624510 0 19 19000 0 No Male -0.03759471
## 2 15810944 0 35 20000 0 No Male -0.25351285
## 3 15668575 1 26 43000 0 No Female -0.11284002
## 4 15603246 1 27 57000 0 No Female -0.16374958
## 5 15804002 0 19 76000 0 No Male -0.10615600
## 6 15728773 0 27 58000 0 No Male -0.19680569
## CooksD EstProb
## 1 7.273387e-08 0.0001912181
## 2 4.284827e-05 0.0177019461
## 3 3.071217e-06 0.0027871957
## 4 1.021259e-05 0.0068983435
## 5 2.568750e-06 0.0024710158
## 6 1.698819e-05 0.0103630307
The estimated probability that the person with User.ID 15624510 will purchase the product is about 0.0002.
The cut-off point is 0.5: if the estimated probability is below 0.5, the customer is classified as not purchasing (NO), and if it is above, as purchasing (YES).
mydata$Classification <- ifelse(test = mydata$EstProb < 0.50,
                                yes = "NO",
                                no = "YES")
mydata$ClassificationF <- factor(mydata$Classification,
                                 levels = c("NO", "YES"),
                                 labels = c("NO", "YES"))
ClassificationTable <- table(mydata$PurchasedF, mydata$ClassificationF) # observed (rows) vs predicted (columns)
ClassificationTable
##
## NO YES
## No 234 20
## Yes 38 105
How many customers were correctly classified by fit2 using 0.5 as the cut-off: (234 + 105)/397 = 0.8539, i.e. 85.39 percent.
# we verify the calculation from the classification table
Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata)
Pseudo_R2_fit2
## [1] 0.8539043
We correctly classified 85.4% of the individuals.
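As a supplementary check (not part of the original analysis), sensitivity and specificity can be read off the same table:
# sensitivity: share of actual purchasers classified as YES
ClassificationTable["Yes", "YES"] / sum(ClassificationTable["Yes", ]) # 105/143, about 0.73
# specificity: share of actual non-purchasers classified as NO
ClassificationTable["No", "NO"] / sum(ClassificationTable["No", ])    # 234/254, about 0.92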