Objectives To create a logistic regression model from Starbucks Malaysia customer data. This will result on a prediction on which variables that strongly affect the customer loyalty and how accurate this predictive model can be.
This data is about the result of Starbucks satisfactory survey conducted from 100 Starbucks customers in Malaysia. While the year of the survey was not mentioned in the Kaggle (check the references for link), I am assuming that it was conducted around 2019 or 2020, since the data was uploaded on 2020.
Things included in this data: - Demographic info about customers – gender, age range, employment status, income range (in Malaysian ringgit)
- Their current behavior in buying Starbucks
- Facilities and features of Starbucks that contribute to the behavior
survey <- read.csv("Starbucks satisfactory survey encode cleaned.csv") # For the original/uncoded result, please see the Kaggle link on references
glimpse(survey)Observations: 113
Variables: 33
$ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ gender <int> 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1…
$ age <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ status <int> 0, 0, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2…
$ income <int> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0…
$ visitNo <int> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 3…
$ method <int> 0, 2, 0, 2, 2, 0, 0, 0, 1, 2, 0, 0, 2, 2, 2, 2…
$ timeSpend <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1…
$ location <int> 0, 1, 2, 2, 1, 2, 0, 2, 2, 2, 2, 2, 1, 1, 0, 2…
$ membershipCard <int> 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0…
$ itemPurchaseCoffee <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itempurchaseCold <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchasePastries <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchaseJuices <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchaseSandwiches <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchaseOthers <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ spendPurchase <int> 1, 1, 1, 1, 2, 1, 2, 1, 3, 2, 1, 1, 2, 1, 1, 1…
$ productRate <int> 4, 4, 4, 2, 3, 4, 5, 4, 5, 4, 4, 3, 4, 4, 5, 4…
$ priceRate <int> 3, 3, 3, 1, 3, 3, 5, 2, 4, 3, 1, 2, 3, 3, 2, 1…
$ promoRate <int> 5, 4, 4, 4, 4, 5, 5, 3, 4, 3, 4, 4, 2, 4, 5, 5…
$ ambianceRate <int> 5, 4, 4, 3, 2, 5, 5, 3, 4, 4, 5, 4, 4, 4, 5, 4…
$ wifiRate <int> 4, 4, 4, 3, 2, 4, 3, 3, 4, 3, 3, 3, 4, 4, 5, 3…
$ serviceRate <int> 4, 5, 4, 3, 3, 5, 5, 3, 4, 3, 3, 4, 3, 4, 5, 4…
$ chooseRate <int> 3, 2, 3, 3, 3, 4, 5, 3, 4, 4, 4, 4, 4, 3, 2, 4…
$ promoMethodApp <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodSoc <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodEmail <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodDeal <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodFriend <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodDisplay <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodBillboard <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodOthers <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ loyal <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1…
Id :gender : 0 - Male 1 - Femaleage : 0 (Below 20), 1 (From 20 to 29), 2 (From 30 to 39), 3 (40 and above)status : 0 (Student), 1 (Self-Employed), 2 (Employed), 3 (Housewife)income : 0 (Less than RM25,000), 1 (RM25,000 – RM50,000), 2 (RM50,000 – RM100,000), 3 (RM100,000 – RM150,000), 4 (More than RM150,000)visitNo : 0 - Daily, 1 - Weekly, 3 - Monthly, 4 - Nevermethod : 0 Dine In, 1 Drive-thru, 2 Take away, 3 Never 4 OtherstimeSpend: 0 (Below 30 mins), 1 (30 mins to 1h), 2 (1h to 2h), 3 (2h to 3h), 4 (More than 3h)location : 0 (Within 1km), 1 (1km to 3km), 2 (More than 3km)membershipCard / itemPurchaseCoffee / itempurchaseCold / itemPurchasePastries / itemPurchaseJuices / itemPurchaseSandwiches / itemPurchaseOthers : 0 Yes 1 NospendPurchase : 0 (Zero) ,1 (Less than RM20), 2 (RM 20 to RM40), 3 (More than RM40)productRate / priceRate/ promoRate / ambianceRate / wifiRate / serviceRate / chooseRate : Scaled 1-5 (1 Very Bad, 5 Excellent)promoMethodApp / promoMethodSoc / promoMethodEmail / promoMethodDeal / promoMethodFriend / promoMethodDisplay / promoMethodBillboard / promoMethodOthers : 0 Yes 1 Noloyal : 0 Yes 1 NoSince this is quite long, feel free to jump to the next section. Although in the later section we only use a handful of variables from these selections, I will still convert every variables into their correct class for future uses.
survey <- survey %>%
mutate_if(is.integer, as.factor) %>%
mutate(gender = factor(gender, levels = c(0,1), labels = c("Male", "Female")),
age = factor(age, levels = c(0,1,2,3), labels = c("<20", "20-29", "30-39", ">40")),
status = factor(status, levels = c(0,1,2,3), labels = c("Student", "Self-Employed", "Employed", "Housewife")),
income = factor(income, levels = c(0,1,2,3,4), labels = c("<RM25,000", "RM25,000 – RM50,000", "RM50,000 – RM100,000", "RM100,000 – RM150,000", ">RM150,000")),
visitNo = factor(visitNo, levels = c(0,1,3,4), labels = c("Daily", "Weekly", "Monthly", "Never")),
method = factor(method, levels = c(0,1,2,5), labels = c("Dine In", "Drive-thru", "Take away", "Others")),
timeSpend = factor(timeSpend, levels = c(0,1,2,3,4), labels = c("Below 30 mins", "30 mins to 1h", "1h to 2h", "2h to 3h", "More than 3h")),
location = factor(location, levels = c(0,1,2), labels = c("Within 1km", "1km to 3km", "More than 3km")),
membershipCard = factor(membershipCard, levels = c(0,1), labels = c("Yes", "No")),
itemPurchaseCoffee = factor(itemPurchaseCoffee, levels = c(0,1), labels = c("Yes", "No")),
itempurchaseCold = factor(itempurchaseCold, levels = c(0,1), labels = c("Yes", "No")),
itemPurchasePastries = factor(itemPurchasePastries, levels = c(0,1), labels = c("Yes", "No")),
itemPurchaseJuices = factor(itemPurchaseJuices, levels = c(0,1), labels = c("Yes", "No")),
itemPurchaseSandwiches = factor(itemPurchaseSandwiches, levels = c(0,1), labels = c("Yes", "No")),
itemPurchaseOthers = factor(itemPurchaseOthers, levels = c(0,1), labels = c("Yes", "No")),
spendPurchase = factor(spendPurchase, levels = c(0,1,2,3), labels = c("Zero", "Less than RM20", "RM 20 to RM40", "More than RM40")),
promoMethodApp = factor(promoMethodApp, levels = c(0,1), labels = c("Yes", "No")),
promoMethodSoc = factor(promoMethodSoc, levels = c(0,1), labels = c("Yes", "No")),
promoMethodEmail = factor(promoMethodEmail, levels = c(0,1), labels = c("Yes", "No")),
promoMethodDeal = factor(promoMethodDeal, levels = c(0,1), labels = c("Yes", "No")),
promoMethodDisplay = factor(promoMethodDisplay, levels = c(0,1), labels = c("Yes", "No")),
promoMethodFriend = factor(promoMethodFriend, levels = c(0,1), labels = c("Yes", "No")),
promoMethodBillboard = factor(promoMethodBillboard, levels = c(0,1), labels = c("Yes", "No")),
promoMethodOthers = factor(promoMethodOthers, levels = c(0,1), labels = c("Yes", "No")),
loyal = factor(loyal, levels = c(0,1), labels = c("Yes", "No"))) %>%
select(-Id)
glimpse(survey)Observations: 113
Variables: 32
$ gender <fct> Female, Female, Male, Female, Male, Female, Fe…
$ age <fct> 20-29, 20-29, 20-29, 20-29, 20-29, 20-29, 20-2…
$ status <fct> Student, Student, Employed, Student, Student, …
$ income <fct> "<RM25,000", "<RM25,000", "<RM25,000", "<RM25,…
$ visitNo <fct> Monthly, Monthly, NA, Monthly, NA, Monthly, Mo…
$ method <fct> Dine In, Take away, Dine In, Take away, Take a…
$ timeSpend <fct> 30 mins to 1h, Below 30 mins, 30 mins to 1h, B…
$ location <fct> Within 1km, 1km to 3km, More than 3km, More th…
$ membershipCard <fct> Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, No, …
$ itemPurchaseCoffee <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itempurchaseCold <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchasePastries <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchaseJuices <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchaseSandwiches <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchaseOthers <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ spendPurchase <fct> Less than RM20, Less than RM20, Less than RM20…
$ productRate <fct> 4, 4, 4, 2, 3, 4, 5, 4, 5, 4, 4, 3, 4, 4, 5, 4…
$ priceRate <fct> 3, 3, 3, 1, 3, 3, 5, 2, 4, 3, 1, 2, 3, 3, 2, 1…
$ promoRate <fct> 5, 4, 4, 4, 4, 5, 5, 3, 4, 3, 4, 4, 2, 4, 5, 5…
$ ambianceRate <fct> 5, 4, 4, 3, 2, 5, 5, 3, 4, 4, 5, 4, 4, 4, 5, 4…
$ wifiRate <fct> 4, 4, 4, 3, 2, 4, 3, 3, 4, 3, 3, 3, 4, 4, 5, 3…
$ serviceRate <fct> 4, 5, 4, 3, 3, 5, 5, 3, 4, 3, 3, 4, 3, 4, 5, 4…
$ chooseRate <fct> 3, 2, 3, 3, 3, 4, 5, 3, 4, 4, 4, 4, 4, 3, 2, 4…
$ promoMethodApp <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodSoc <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodEmail <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodDeal <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodFriend <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodDisplay <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodBillboard <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodOthers <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ loyal <fct> Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Ye…
Id gender age
0 0 0
status income visitNo
0 0 26
method timeSpend location
0 0 0
membershipCard itemPurchaseCoffee itempurchaseCold
0 0 0
itemPurchasePastries itemPurchaseJuices itemPurchaseSandwiches
0 0 0
itemPurchaseOthers spendPurchase productRate
0 0 0
priceRate promoRate ambianceRate
0 0 0
wifiRate serviceRate chooseRate
0 0 0
promoMethodApp promoMethodSoc promoMethodEmail
0 0 0
promoMethodDeal promoMethodFriend promoMethodDisplay
0 0 0
promoMethodBillboard promoMethodOthers loyal
0 0 0
Logistic regression is appropriate to use with categorical/factor-type of data (and since this is survey, there will be a lot of “yes” or “no” answers).
Using glm() function.
model1 <- glm(formula = loyal~age+income+gender+membershipCard+timeSpend+method, data = survey, family = "binomial")
summary(model1)
Call:
glm(formula = loyal ~ age + income + gender + membershipCard +
timeSpend + method, family = "binomial", data = survey)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3915 -0.6751 -0.3061 -0.1056 2.5619
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6101 1.1844 -2.204 0.02754 *
age20-29 -0.4270 0.8518 -0.501 0.61622
age30-39 -1.1608 1.3790 -0.842 0.39989
age>40 -2.1266 1.9961 -1.065 0.28669
incomeRM25,000 – RM50,000 0.4150 0.8936 0.464 0.64234
incomeRM50,000 – RM100,000 -0.2726 1.2909 -0.211 0.83274
incomeRM100,000 – RM150,000 3.5143 2.0791 1.690 0.09096 .
income>RM150,000 2.4572 1.7389 1.413 0.15763
genderFemale 0.2975 0.6513 0.457 0.64787
membershipCardNo 2.4775 0.8008 3.094 0.00198 **
timeSpend30 mins to 1h 0.7526 0.7090 1.061 0.28851
timeSpend1h to 2h -1.4398 1.4595 -0.986 0.32390
timeSpend2h to 3h -16.2523 3956.1806 -0.004 0.99672
timeSpendMore than 3h 21.0805 2354.5121 0.009 0.99286
methodDrive-thru -1.5278 1.0146 -1.506 0.13212
[ reached getOption("max.print") -- omitted 2 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 114.191 on 112 degrees of freedom
Residual deviance: 84.701 on 96 degrees of freedom
AIC: 118.7
Number of Fisher Scoring iterations: 16
From the variables I have chosen, it seems that there are a lot of insignificant variables that may not affect target variable (loyal). We will then use model fitting using stepwise method.
Start: AIC=118.7
loyal ~ age + income + gender + membershipCard + timeSpend +
method
Df Deviance AIC
- age 3 86.133 114.13
- income 4 89.128 115.13
- method 3 87.728 115.73
- gender 1 84.912 116.91
<none> 84.701 118.70
- timeSpend 4 96.422 122.42
- membershipCard 1 98.182 130.18
Step: AIC=114.13
loyal ~ income + gender + membershipCard + timeSpend + method
Df Deviance AIC
- income 4 89.296 109.30
- method 3 89.364 111.36
- gender 1 86.517 112.52
<none> 86.133 114.13
- timeSpend 4 98.668 118.67
- membershipCard 1 99.129 125.13
Step: AIC=109.3
loyal ~ gender + membershipCard + timeSpend + method
Df Deviance AIC
- method 3 92.454 106.45
- gender 1 89.377 107.38
<none> 89.296 109.30
- timeSpend 4 99.858 111.86
- membershipCard 1 103.091 121.09
Step: AIC=106.45
loyal ~ gender + membershipCard + timeSpend
Df Deviance AIC
- gender 1 92.549 104.55
<none> 92.454 106.45
- timeSpend 4 102.240 108.24
- membershipCard 1 105.656 117.66
Step: AIC=104.55
loyal ~ membershipCard + timeSpend
Df Deviance AIC
<none> 92.549 104.55
- timeSpend 4 102.344 106.34
- membershipCard 1 105.754 115.75
Call:
glm(formula = loyal ~ membershipCard + timeSpend, family = "binomial",
data = survey)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0961 -0.8501 -0.3404 -0.2299 2.3989
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.8195 0.6059 -4.653 3.27e-06 ***
membershipCardNo 1.9878 0.6148 3.233 0.00122 **
timeSpend30 mins to 1h 0.6376 0.5680 1.123 0.26164
timeSpend1h to 2h -0.8008 1.1291 -0.709 0.47819
timeSpend2h to 3h -13.7466 2399.5448 -0.006 0.99543
timeSpendMore than 3h 18.6653 1569.8286 0.012 0.99051
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 114.191 on 112 degrees of freedom
Residual deviance: 92.549 on 107 degrees of freedom
AIC: 104.55
Number of Fisher Scoring iterations: 15
test_prob <- survey_test %>%
mutate(probability = predict(object = model2, newdata = ., type = "response"))
test_prob Id gender age status income visitNo method timeSpend
1 4 Female 20-29 Student <RM25,000 Monthly Take away Below 30 mins
2 13 Female 20-29 Student <RM25,000 Weekly Take away Below 30 mins
location membershipCard itemPurchaseCoffee itempurchaseCold
1 More than 3km No No No
2 1km to 3km Yes No No
itemPurchasePastries itemPurchaseJuices itemPurchaseSandwiches
1 No No No
2 No No No
itemPurchaseOthers spendPurchase productRate priceRate promoRate
1 No Less than RM20 2 1 4
2 No RM 20 to RM40 4 3 2
ambianceRate wifiRate serviceRate chooseRate promoMethodApp promoMethodSoc
1 3 3 3 3 No No
2 4 4 3 4 No No
promoMethodEmail promoMethodDeal promoMethodFriend promoMethodDisplay
1 No No No No
2 No No No No
promoMethodBillboard promoMethodOthers loyal probability
1 No No No 0.30328247
2 No No Yes 0.05627987
[ reached 'max' / getOption("max.print") -- omitted 32 rows ]
survey_test$probability <- predict(model2, newdata = survey_test, type = "response")
survey_test$probability [1] 3.032825e-01 5.627987e-02 1.013838e-01 5.627987e-02 5.627987e-02
[6] 5.627987e-02 1.013838e-01 1.013838e-01 1.013838e-01 3.032825e-01
[11] 3.032825e-01 5.627987e-02 1.013838e-01 5.627987e-02 1.013838e-01
[16] 1.013838e-01 3.032825e-01 1.013838e-01 1.013838e-01 3.032825e-01
[21] 2.607762e-02 3.032825e-01 3.032825e-01 3.032825e-01 3.032825e-01
[26] 1.013838e-01 6.389220e-08 1.013838e-01 5.627987e-02 5.627987e-02
[31] 5.627987e-02 1.013838e-01 4.516107e-01 3.032825e-01
Judging from the numbers above, our range of probability is very low (everything below 0.5). If we put our loyalty threshold to 0.5, it may result in an inaccurate prediction. Which is why I will try to put a lower threshold of 0.3 instead:
Id gender age status income visitNo method timeSpend
4 4 Female 20-29 Student <RM25,000 Monthly Take away Below 30 mins
13 13 Female 20-29 Student <RM25,000 Weekly Take away Below 30 mins
location membershipCard itemPurchaseCoffee itempurchaseCold
4 More than 3km No No No
13 1km to 3km Yes No No
itemPurchasePastries itemPurchaseJuices itemPurchaseSandwiches
4 No No No
13 No No No
itemPurchaseOthers spendPurchase productRate priceRate promoRate
4 No Less than RM20 2 1 4
13 No RM 20 to RM40 4 3 2
ambianceRate wifiRate serviceRate chooseRate promoMethodApp promoMethodSoc
4 3 3 3 3 No No
13 4 4 3 4 No No
promoMethodEmail promoMethodDeal promoMethodFriend promoMethodDisplay
4 No No No No
13 No No No No
promoMethodBillboard promoMethodOthers loyal probability prediction
4 No No No 0.30328247 Yes
13 No No Yes 0.05627987 No
[ reached 'max' / getOption("max.print") -- omitted 32 rows ]
Obviously, from this result alone I will have to re-adjust/re-do the logistic regression model. However, I will evaluate the model that I’m using either way.
confusionMatrix(data = as.factor(survey_test$prediction),
reference = as.factor(survey_test$loyal),
positive = "Yes")Confusion Matrix and Statistics
Reference
Prediction Yes No
Yes 8 3
No 20 3
Accuracy : 0.3235
95% CI : (0.1739, 0.5053)
No Information Rate : 0.8235
P-Value [Acc > NIR] : 1.0000000
Kappa : -0.1014
Mcnemar's Test P-Value : 0.0008492
Sensitivity : 0.2857
Specificity : 0.5000
Pos Pred Value : 0.7273
Neg Pred Value : 0.1304
Prevalence : 0.8235
Detection Rate : 0.2353
Detection Prevalence : 0.3235
Balanced Accuracy : 0.3929
'Positive' Class : Yes
No surprise there, the accuracy result is only around 41%. Which is probably why we can go back to the modelling process and re-select the variables to use in predicting customer loyalty.