Objectives To create a logistic regression model from Starbucks Malaysia customer data. This will result on a prediction on which variables that strongly affect the customer loyalty and how accurate this predictive model can be.

Read Data

This data is about the result of Starbucks satisfactory survey conducted from 100 Starbucks customers in Malaysia. While the year of the survey was not mentioned in the Kaggle (check the references for link), I am assuming that it was conducted around 2019 or 2020, since the data was uploaded on 2020.

Things included in this data: - Demographic info about customers – gender, age range, employment status, income range (in Malaysian ringgit)
- Their current behavior in buying Starbucks
- Facilities and features of Starbucks that contribute to the behavior

Data

Observations: 113
Variables: 33
$ Id                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ gender                 <int> 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1…
$ age                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ status                 <int> 0, 0, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2…
$ income                 <int> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0…
$ visitNo                <int> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 3…
$ method                 <int> 0, 2, 0, 2, 2, 0, 0, 0, 1, 2, 0, 0, 2, 2, 2, 2…
$ timeSpend              <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1…
$ location               <int> 0, 1, 2, 2, 1, 2, 0, 2, 2, 2, 2, 2, 1, 1, 0, 2…
$ membershipCard         <int> 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0…
$ itemPurchaseCoffee     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itempurchaseCold       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchasePastries   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchaseJuices     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchaseSandwiches <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ itemPurchaseOthers     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ spendPurchase          <int> 1, 1, 1, 1, 2, 1, 2, 1, 3, 2, 1, 1, 2, 1, 1, 1…
$ productRate            <int> 4, 4, 4, 2, 3, 4, 5, 4, 5, 4, 4, 3, 4, 4, 5, 4…
$ priceRate              <int> 3, 3, 3, 1, 3, 3, 5, 2, 4, 3, 1, 2, 3, 3, 2, 1…
$ promoRate              <int> 5, 4, 4, 4, 4, 5, 5, 3, 4, 3, 4, 4, 2, 4, 5, 5…
$ ambianceRate           <int> 5, 4, 4, 3, 2, 5, 5, 3, 4, 4, 5, 4, 4, 4, 5, 4…
$ wifiRate               <int> 4, 4, 4, 3, 2, 4, 3, 3, 4, 3, 3, 3, 4, 4, 5, 3…
$ serviceRate            <int> 4, 5, 4, 3, 3, 5, 5, 3, 4, 3, 3, 4, 3, 4, 5, 4…
$ chooseRate             <int> 3, 2, 3, 3, 3, 4, 5, 3, 4, 4, 4, 4, 4, 3, 2, 4…
$ promoMethodApp         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodSoc         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodEmail       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodDeal        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodFriend      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodDisplay     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodBillboard   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ promoMethodOthers      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ loyal                  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1…
  • Id :
  • gender : 0 - Male 1 - Female
  • age : 0 (Below 20), 1 (From 20 to 29), 2 (From 30 to 39), 3 (40 and above)
  • status : 0 (Student), 1 (Self-Employed), 2 (Employed), 3 (Housewife)
  • income : 0 (Less than RM25,000), 1 (RM25,000 – RM50,000), 2 (RM50,000 – RM100,000), 3 (RM100,000 – RM150,000), 4 (More than RM150,000)
  • visitNo : 0 - Daily, 1 - Weekly, 3 - Monthly, 4 - Never
  • method : 0 Dine In, 1 Drive-thru, 2 Take away, 3 Never 4 Others
  • timeSpend: 0 (Below 30 mins), 1 (30 mins to 1h), 2 (1h to 2h), 3 (2h to 3h), 4 (More than 3h)
  • location : 0 (Within 1km), 1 (1km to 3km), 2 (More than 3km)
  • membershipCard / itemPurchaseCoffee / itempurchaseCold / itemPurchasePastries / itemPurchaseJuices / itemPurchaseSandwiches / itemPurchaseOthers : 0 Yes 1 No
  • spendPurchase : 0 (Zero) ,1 (Less than RM20), 2 (RM 20 to RM40), 3 (More than RM40)
  • productRate / priceRate/ promoRate / ambianceRate / wifiRate / serviceRate / chooseRate : Scaled 1-5 (1 Very Bad, 5 Excellent)
  • promoMethodApp / promoMethodSoc / promoMethodEmail / promoMethodDeal / promoMethodFriend / promoMethodDisplay / promoMethodBillboard / promoMethodOthers : 0 Yes 1 No
  • loyal : 0 Yes 1 No

Exploratory Data Analysis

Data Manipulation

Since this is quite long, feel free to jump to the next section. Although in the later section we only use a handful of variables from these selections, I will still convert every variables into their correct class for future uses.

survey <- survey %>% 
  mutate_if(is.integer, as.factor) %>% 
  mutate(gender = factor(gender, levels = c(0,1), labels = c("Male", "Female")),
         age = factor(age, levels = c(0,1,2,3), labels = c("<20", "20-29", "30-39", ">40")),
         status = factor(status, levels = c(0,1,2,3), labels = c("Student", "Self-Employed", "Employed", "Housewife")),
         income = factor(income, levels = c(0,1,2,3,4), labels = c("<RM25,000", "RM25,000 – RM50,000", "RM50,000 – RM100,000", "RM100,000 – RM150,000", ">RM150,000")),
         visitNo = factor(visitNo, levels = c(0,1,3,4), labels = c("Daily", "Weekly", "Monthly", "Never")),
         method = factor(method, levels = c(0,1,2,5), labels = c("Dine In", "Drive-thru", "Take away", "Others")),
         timeSpend = factor(timeSpend, levels = c(0,1,2,3,4), labels = c("Below 30 mins", "30 mins to 1h", "1h to 2h", "2h to 3h", "More than 3h")),
         location = factor(location, levels = c(0,1,2), labels = c("Within 1km", "1km to 3km", "More than 3km")),
         membershipCard = factor(membershipCard, levels = c(0,1), labels = c("Yes", "No")),
         itemPurchaseCoffee = factor(itemPurchaseCoffee, levels = c(0,1), labels = c("Yes", "No")),
         itempurchaseCold = factor(itempurchaseCold, levels = c(0,1), labels = c("Yes", "No")),
         itemPurchasePastries = factor(itemPurchasePastries, levels = c(0,1), labels = c("Yes", "No")),
         itemPurchaseJuices = factor(itemPurchaseJuices, levels = c(0,1), labels = c("Yes", "No")),
         itemPurchaseSandwiches = factor(itemPurchaseSandwiches, levels = c(0,1), labels = c("Yes", "No")),
         itemPurchaseOthers = factor(itemPurchaseOthers, levels = c(0,1), labels = c("Yes", "No")),
         spendPurchase = factor(spendPurchase, levels = c(0,1,2,3), labels = c("Zero", "Less than RM20", "RM 20 to RM40", "More than RM40")),
         promoMethodApp = factor(promoMethodApp, levels = c(0,1), labels = c("Yes", "No")),
         promoMethodSoc = factor(promoMethodSoc, levels = c(0,1), labels = c("Yes", "No")),
         promoMethodEmail = factor(promoMethodEmail, levels = c(0,1), labels = c("Yes", "No")),
         promoMethodDeal = factor(promoMethodDeal, levels = c(0,1), labels = c("Yes", "No")),
         promoMethodDisplay = factor(promoMethodDisplay, levels = c(0,1), labels = c("Yes", "No")),
         promoMethodFriend = factor(promoMethodFriend, levels = c(0,1), labels = c("Yes", "No")),
         promoMethodBillboard = factor(promoMethodBillboard, levels = c(0,1), labels = c("Yes", "No")),
         promoMethodOthers = factor(promoMethodOthers, levels = c(0,1), labels = c("Yes", "No")),
         loyal = factor(loyal, levels = c(0,1), labels = c("Yes", "No"))) %>% 
  select(-Id)
         

glimpse(survey)
Observations: 113
Variables: 32
$ gender                 <fct> Female, Female, Male, Female, Male, Female, Fe…
$ age                    <fct> 20-29, 20-29, 20-29, 20-29, 20-29, 20-29, 20-2…
$ status                 <fct> Student, Student, Employed, Student, Student, …
$ income                 <fct> "<RM25,000", "<RM25,000", "<RM25,000", "<RM25,…
$ visitNo                <fct> Monthly, Monthly, NA, Monthly, NA, Monthly, Mo…
$ method                 <fct> Dine In, Take away, Dine In, Take away, Take a…
$ timeSpend              <fct> 30 mins to 1h, Below 30 mins, 30 mins to 1h, B…
$ location               <fct> Within 1km, 1km to 3km, More than 3km, More th…
$ membershipCard         <fct> Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, No, …
$ itemPurchaseCoffee     <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itempurchaseCold       <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchasePastries   <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchaseJuices     <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchaseSandwiches <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ itemPurchaseOthers     <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ spendPurchase          <fct> Less than RM20, Less than RM20, Less than RM20…
$ productRate            <fct> 4, 4, 4, 2, 3, 4, 5, 4, 5, 4, 4, 3, 4, 4, 5, 4…
$ priceRate              <fct> 3, 3, 3, 1, 3, 3, 5, 2, 4, 3, 1, 2, 3, 3, 2, 1…
$ promoRate              <fct> 5, 4, 4, 4, 4, 5, 5, 3, 4, 3, 4, 4, 2, 4, 5, 5…
$ ambianceRate           <fct> 5, 4, 4, 3, 2, 5, 5, 3, 4, 4, 5, 4, 4, 4, 5, 4…
$ wifiRate               <fct> 4, 4, 4, 3, 2, 4, 3, 3, 4, 3, 3, 3, 4, 4, 5, 3…
$ serviceRate            <fct> 4, 5, 4, 3, 3, 5, 5, 3, 4, 3, 3, 4, 3, 4, 5, 4…
$ chooseRate             <fct> 3, 2, 3, 3, 3, 4, 5, 3, 4, 4, 4, 4, 4, 3, 2, 4…
$ promoMethodApp         <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodSoc         <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodEmail       <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodDeal        <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodFriend      <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodDisplay     <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodBillboard   <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ promoMethodOthers      <fct> No, No, No, No, No, No, No, No, No, No, No, No…
$ loyal                  <fct> Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Ye…
                    Id                 gender                    age 
                     0                      0                      0 
                status                 income                visitNo 
                     0                      0                     26 
                method              timeSpend               location 
                     0                      0                      0 
        membershipCard     itemPurchaseCoffee       itempurchaseCold 
                     0                      0                      0 
  itemPurchasePastries     itemPurchaseJuices itemPurchaseSandwiches 
                     0                      0                      0 
    itemPurchaseOthers          spendPurchase            productRate 
                     0                      0                      0 
             priceRate              promoRate           ambianceRate 
                     0                      0                      0 
              wifiRate            serviceRate             chooseRate 
                     0                      0                      0 
        promoMethodApp         promoMethodSoc       promoMethodEmail 
                     0                      0                      0 
       promoMethodDeal      promoMethodFriend     promoMethodDisplay 
                     0                      0                      0 
  promoMethodBillboard      promoMethodOthers                  loyal 
                     0                      0                      0 

Logistic Regression

Logistic regression is appropriate to use with categorical/factor-type of data (and since this is survey, there will be a lot of “yes” or “no” answers).

Model 1

Using glm() function.


Call:
glm(formula = loyal ~ age + income + gender + membershipCard + 
    timeSpend + method, family = "binomial", data = survey)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3915  -0.6751  -0.3061  -0.1056   2.5619  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)   
(Intercept)                   -2.6101     1.1844  -2.204  0.02754 * 
age20-29                      -0.4270     0.8518  -0.501  0.61622   
age30-39                      -1.1608     1.3790  -0.842  0.39989   
age>40                        -2.1266     1.9961  -1.065  0.28669   
incomeRM25,000 – RM50,000      0.4150     0.8936   0.464  0.64234   
incomeRM50,000 – RM100,000    -0.2726     1.2909  -0.211  0.83274   
incomeRM100,000 – RM150,000    3.5143     2.0791   1.690  0.09096 . 
income>RM150,000               2.4572     1.7389   1.413  0.15763   
genderFemale                   0.2975     0.6513   0.457  0.64787   
membershipCardNo               2.4775     0.8008   3.094  0.00198 **
timeSpend30 mins to 1h         0.7526     0.7090   1.061  0.28851   
timeSpend1h to 2h             -1.4398     1.4595  -0.986  0.32390   
timeSpend2h to 3h            -16.2523  3956.1806  -0.004  0.99672   
timeSpendMore than 3h         21.0805  2354.5121   0.009  0.99286   
methodDrive-thru              -1.5278     1.0146  -1.506  0.13212   
 [ reached getOption("max.print") -- omitted 2 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 114.191  on 112  degrees of freedom
Residual deviance:  84.701  on  96  degrees of freedom
AIC: 118.7

Number of Fisher Scoring iterations: 16

From the variables I have chosen, it seems that there are a lot of insignificant variables that may not affect target variable (loyal). We will then use model fitting using stepwise method.

Model Fitting & Model 2

Start:  AIC=118.7
loyal ~ age + income + gender + membershipCard + timeSpend + 
    method

                 Df Deviance    AIC
- age             3   86.133 114.13
- income          4   89.128 115.13
- method          3   87.728 115.73
- gender          1   84.912 116.91
<none>                84.701 118.70
- timeSpend       4   96.422 122.42
- membershipCard  1   98.182 130.18

Step:  AIC=114.13
loyal ~ income + gender + membershipCard + timeSpend + method

                 Df Deviance    AIC
- income          4   89.296 109.30
- method          3   89.364 111.36
- gender          1   86.517 112.52
<none>                86.133 114.13
- timeSpend       4   98.668 118.67
- membershipCard  1   99.129 125.13

Step:  AIC=109.3
loyal ~ gender + membershipCard + timeSpend + method

                 Df Deviance    AIC
- method          3   92.454 106.45
- gender          1   89.377 107.38
<none>                89.296 109.30
- timeSpend       4   99.858 111.86
- membershipCard  1  103.091 121.09

Step:  AIC=106.45
loyal ~ gender + membershipCard + timeSpend

                 Df Deviance    AIC
- gender          1   92.549 104.55
<none>                92.454 106.45
- timeSpend       4  102.240 108.24
- membershipCard  1  105.656 117.66

Step:  AIC=104.55
loyal ~ membershipCard + timeSpend

                 Df Deviance    AIC
<none>                92.549 104.55
- timeSpend       4  102.344 106.34
- membershipCard  1  105.754 115.75

Call:
glm(formula = loyal ~ membershipCard + timeSpend, family = "binomial", 
    data = survey)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0961  -0.8501  -0.3404  -0.2299   2.3989  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -2.8195     0.6059  -4.653 3.27e-06 ***
membershipCardNo          1.9878     0.6148   3.233  0.00122 ** 
timeSpend30 mins to 1h    0.6376     0.5680   1.123  0.26164    
timeSpend1h to 2h        -0.8008     1.1291  -0.709  0.47819    
timeSpend2h to 3h       -13.7466  2399.5448  -0.006  0.99543    
timeSpendMore than 3h    18.6653  1569.8286   0.012  0.99051    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 114.191  on 112  degrees of freedom
Residual deviance:  92.549  on 107  degrees of freedom
AIC: 104.55

Number of Fisher Scoring iterations: 15

Predict to Data Test

  Id gender   age  status    income visitNo    method     timeSpend
1  4 Female 20-29 Student <RM25,000 Monthly Take away Below 30 mins
2 13 Female 20-29 Student <RM25,000  Weekly Take away Below 30 mins
       location membershipCard itemPurchaseCoffee itempurchaseCold
1 More than 3km             No                 No               No
2    1km to 3km            Yes                 No               No
  itemPurchasePastries itemPurchaseJuices itemPurchaseSandwiches
1                   No                 No                     No
2                   No                 No                     No
  itemPurchaseOthers  spendPurchase productRate priceRate promoRate
1                 No Less than RM20           2         1         4
2                 No  RM 20 to RM40           4         3         2
  ambianceRate wifiRate serviceRate chooseRate promoMethodApp promoMethodSoc
1            3        3           3          3             No             No
2            4        4           3          4             No             No
  promoMethodEmail promoMethodDeal promoMethodFriend promoMethodDisplay
1               No              No                No                 No
2               No              No                No                 No
  promoMethodBillboard promoMethodOthers loyal probability
1                   No                No    No  0.30328247
2                   No                No   Yes  0.05627987
 [ reached 'max' / getOption("max.print") -- omitted 32 rows ]
 [1] 3.032825e-01 5.627987e-02 1.013838e-01 5.627987e-02 5.627987e-02
 [6] 5.627987e-02 1.013838e-01 1.013838e-01 1.013838e-01 3.032825e-01
[11] 3.032825e-01 5.627987e-02 1.013838e-01 5.627987e-02 1.013838e-01
[16] 1.013838e-01 3.032825e-01 1.013838e-01 1.013838e-01 3.032825e-01
[21] 2.607762e-02 3.032825e-01 3.032825e-01 3.032825e-01 3.032825e-01
[26] 1.013838e-01 6.389220e-08 1.013838e-01 5.627987e-02 5.627987e-02
[31] 5.627987e-02 1.013838e-01 4.516107e-01 3.032825e-01

Judging from the numbers above, our range of probability is very low (everything below 0.5). If we put our loyalty threshold to 0.5, it may result in an inaccurate prediction. Which is why I will try to put a lower threshold of 0.3 instead:

   Id gender   age  status    income visitNo    method     timeSpend
4   4 Female 20-29 Student <RM25,000 Monthly Take away Below 30 mins
13 13 Female 20-29 Student <RM25,000  Weekly Take away Below 30 mins
        location membershipCard itemPurchaseCoffee itempurchaseCold
4  More than 3km             No                 No               No
13    1km to 3km            Yes                 No               No
   itemPurchasePastries itemPurchaseJuices itemPurchaseSandwiches
4                    No                 No                     No
13                   No                 No                     No
   itemPurchaseOthers  spendPurchase productRate priceRate promoRate
4                  No Less than RM20           2         1         4
13                 No  RM 20 to RM40           4         3         2
   ambianceRate wifiRate serviceRate chooseRate promoMethodApp promoMethodSoc
4             3        3           3          3             No             No
13            4        4           3          4             No             No
   promoMethodEmail promoMethodDeal promoMethodFriend promoMethodDisplay
4                No              No                No                 No
13               No              No                No                 No
   promoMethodBillboard promoMethodOthers loyal probability prediction
4                    No                No    No  0.30328247        Yes
13                   No                No   Yes  0.05627987         No
 [ reached 'max' / getOption("max.print") -- omitted 32 rows ]

Obviously, from this result alone I will have to re-adjust/re-do the logistic regression model. However, I will evaluate the model that I’m using either way.

Model Evaluation

Confusion Matrix and Statistics

          Reference
Prediction Yes No
       Yes   8  3
       No   20  3
                                          
               Accuracy : 0.3235          
                 95% CI : (0.1739, 0.5053)
    No Information Rate : 0.8235          
    P-Value [Acc > NIR] : 1.0000000       
                                          
                  Kappa : -0.1014         
                                          
 Mcnemar's Test P-Value : 0.0008492       
                                          
            Sensitivity : 0.2857          
            Specificity : 0.5000          
         Pos Pred Value : 0.7273          
         Neg Pred Value : 0.1304          
             Prevalence : 0.8235          
         Detection Rate : 0.2353          
   Detection Prevalence : 0.3235          
      Balanced Accuracy : 0.3929          
                                          
       'Positive' Class : Yes             
                                          

No surprise there, the accuracy result is only around 41%. Which is probably why we can go back to the modelling process and re-select the variables to use in predicting customer loyalty.

References