Introduction

The goal of this project is to analyze people’s coffee preferences using association rule mining. Two separate datasets are analyzed, each with its own algorithm: Apriori and cSPADE. In the first part, the results of an anonymized survey conducted during a coffee tasting event are examined with Apriori. In the second part, sequential pattern mining with cSPADE is used to analyze coffee sales from a vending machine.

R Setup

The following R packages were used to perform the analysis:

library(arules) # Apriori algorithm
library(arulesSequences) # cSPADE algorithm
library(arulesViz) # data visualization
library(tidyverse) # data preprocessing and cleaning

Apriori Algorithm

The goal of this part of the project is to profile people who drink coffee in order to raise their energy levels through caffeine. The Apriori algorithm is used.

Data Overview

The first dataset was found on Kaggle and includes data on coffee drinking habits and preferences, collected during a virtual coffee tasting event. A total of 4042 participants took part in the survey, which contained over 100 questions. The data was not preprocessed or cleaned before the survey results were published.

coffee_survey <- read.csv("GACTT_RESULTS_ANONYMIZED_v2.csv") # reading the data
names(coffee_survey)
##   [1] "Submission.ID"                                                                                   
##   [2] "What.is.your.age."                                                                               
##   [3] "How.many.cups.of.coffee.do.you.typically.drink.per.day."                                         
##   [4] "Where.do.you.typically.drink.coffee."                                                            
##   [5] "Where.do.you.typically.drink.coffee...At.home."                                                  
##   [6] "Where.do.you.typically.drink.coffee...At.the.office."                                            
##   [7] "Where.do.you.typically.drink.coffee...On.the.go."                                                
##   [8] "Where.do.you.typically.drink.coffee...At.a.cafe."                                                
##   [9] "Where.do.you.typically.drink.coffee...None.of.these."                                            
##  [10] "How.do.you.brew.coffee.at.home."                                                                 
##  [11] "How.do.you.brew.coffee.at.home...Pour.over."                                                     
##  [12] "How.do.you.brew.coffee.at.home...French.press."                                                  
##  [13] "How.do.you.brew.coffee.at.home...Espresso."                                                      
##  [14] "How.do.you.brew.coffee.at.home...Coffee.brewing.machine..e.g..Mr..Coffee.."                      
##  [15] "How.do.you.brew.coffee.at.home...Pod.capsule.machine..e.g..Keurig.Nespresso.."                   
##  [16] "How.do.you.brew.coffee.at.home...Instant.coffee."                                                
##  [17] "How.do.you.brew.coffee.at.home...Bean.to.cup.machine."                                           
##  [18] "How.do.you.brew.coffee.at.home...Cold.brew."                                                     
##  [19] "How.do.you.brew.coffee.at.home...Coffee.extract..e.g..Cometeer.."                                
##  [20] "How.do.you.brew.coffee.at.home...Other."                                                         
##  [21] "How.else.do.you.brew.coffee.at.home."                                                            
##  [22] "On.the.go..where.do.you.typically.purchase.coffee."                                              
##  [23] "On.the.go..where.do.you.typically.purchase.coffee...National.chain..e.g..Starbucks..Dunkin.."    
##  [24] "On.the.go..where.do.you.typically.purchase.coffee...Local.cafe."                                 
##  [25] "On.the.go..where.do.you.typically.purchase.coffee...Drive.thru."                                 
##  [26] "On.the.go..where.do.you.typically.purchase.coffee...Specialty.coffee.shop."                      
##  [27] "On.the.go..where.do.you.typically.purchase.coffee...Deli.or.supermarket."                        
##  [28] "On.the.go..where.do.you.typically.purchase.coffee...Other."                                      
##  [29] "Where.else.do.you.purchase.coffee."                                                              
##  [30] "What.is.your.favorite.coffee.drink."                                                             
##  [31] "Please.specify.what.your.favorite.coffee.drink.is"                                               
##  [32] "Do.you.usually.add.anything.to.your.coffee."                                                     
##  [33] "Do.you.usually.add.anything.to.your.coffee...No...just.black."                                   
##  [34] "Do.you.usually.add.anything.to.your.coffee...Milk..dairy.alternative..or.coffee.creamer."        
##  [35] "Do.you.usually.add.anything.to.your.coffee...Sugar.or.sweetener."                                
##  [36] "Do.you.usually.add.anything.to.your.coffee...Flavor.syrup."                                      
##  [37] "Do.you.usually.add.anything.to.your.coffee...Other."                                             
##  [38] "What.else.do.you.add.to.your.coffee."                                                            
##  [39] "What.kind.of.dairy.do.you.add."                                                                  
##  [40] "What.kind.of.dairy.do.you.add...Whole.milk."                                                     
##  [41] "What.kind.of.dairy.do.you.add...Skim.milk."                                                      
##  [42] "What.kind.of.dairy.do.you.add...Half.and.half."                                                  
##  [43] "What.kind.of.dairy.do.you.add...Coffee.creamer."                                                 
##  [44] "What.kind.of.dairy.do.you.add...Flavored.coffee.creamer."                                        
##  [45] "What.kind.of.dairy.do.you.add...Oat.milk."                                                       
##  [46] "What.kind.of.dairy.do.you.add...Almond.milk."                                                    
##  [47] "What.kind.of.dairy.do.you.add...Soy.milk."                                                       
##  [48] "What.kind.of.dairy.do.you.add...Other."                                                          
##  [49] "What.kind.of.sugar.or.sweetener.do.you.add."                                                     
##  [50] "What.kind.of.sugar.or.sweetener.do.you.add...Granulated.Sugar."                                  
##  [51] "What.kind.of.sugar.or.sweetener.do.you.add...Artificial.Sweeteners..e.g...Splenda.."             
##  [52] "What.kind.of.sugar.or.sweetener.do.you.add...Honey."                                             
##  [53] "What.kind.of.sugar.or.sweetener.do.you.add...Maple.Syrup."                                       
##  [54] "What.kind.of.sugar.or.sweetener.do.you.add...Stevia."                                            
##  [55] "What.kind.of.sugar.or.sweetener.do.you.add...Agave.Nectar."                                      
##  [56] "What.kind.of.sugar.or.sweetener.do.you.add...Brown.Sugar."                                       
##  [57] "What.kind.of.sugar.or.sweetener.do.you.add...Raw.Sugar..Turbinado.."                             
##  [58] "What.kind.of.flavorings.do.you.add."                                                             
##  [59] "What.kind.of.flavorings.do.you.add...Vanilla.Syrup."                                             
##  [60] "What.kind.of.flavorings.do.you.add...Caramel.Syrup."                                             
##  [61] "What.kind.of.flavorings.do.you.add...Hazelnut.Syrup."                                            
##  [62] "What.kind.of.flavorings.do.you.add...Cinnamon..Ground.or.Stick.."                                
##  [63] "What.kind.of.flavorings.do.you.add...Peppermint.Syrup."                                          
##  [64] "What.kind.of.flavorings.do.you.add...Other."                                                     
##  [65] "What.other.flavoring.do.you.use."                                                                
##  [66] "Before.today.s.tasting..which.of.the.following.best.described.what.kind.of.coffee.you.like."     
##  [67] "How.strong.do.you.like.your.coffee."                                                             
##  [68] "What.roast.level.of.coffee.do.you.prefer."                                                       
##  [69] "How.much.caffeine.do.you.like.in.your.coffee."                                                   
##  [70] "Lastly..how.would.you.rate.your.own.coffee.expertise."                                           
##  [71] "Coffee.A...Bitterness"                                                                           
##  [72] "Coffee.A...Acidity"                                                                              
##  [73] "Coffee.A...Personal.Preference"                                                                  
##  [74] "Coffee.A...Notes"                                                                                
##  [75] "Coffee.B...Bitterness"                                                                           
##  [76] "Coffee.B...Acidity"                                                                              
##  [77] "Coffee.B...Personal.Preference"                                                                  
##  [78] "Coffee.B...Notes"                                                                                
##  [79] "Coffee.C...Bitterness"                                                                           
##  [80] "Coffee.C...Acidity"                                                                              
##  [81] "Coffee.C...Personal.Preference"                                                                  
##  [82] "Coffee.C...Notes"                                                                                
##  [83] "Coffee.D...Bitterness"                                                                           
##  [84] "Coffee.D...Acidity"                                                                              
##  [85] "Coffee.D...Personal.Preference"                                                                  
##  [86] "Coffee.D...Notes"                                                                                
##  [87] "Between.Coffee.A..Coffee.B..and.Coffee.C.which.did.you.prefer."                                  
##  [88] "Between.Coffee.A.and.Coffee.D..which.did.you.prefer."                                            
##  [89] "Lastly..what.was.your.favorite.overall.coffee."                                                  
##  [90] "Do.you.work.from.home.or.in.person."                                                             
##  [91] "In.total..much.money.do.you.typically.spend.on.coffee.in.a.month."                               
##  [92] "Why.do.you.drink.coffee."                                                                        
##  [93] "Why.do.you.drink.coffee...It.tastes.good."                                                       
##  [94] "Why.do.you.drink.coffee...I.need.the.caffeine."                                                  
##  [95] "Why.do.you.drink.coffee...I.need.the.ritual."                                                    
##  [96] "Why.do.you.drink.coffee...It.makes.me.go.to.the.bathroom."                                       
##  [97] "Why.do.you.drink.coffee...Other."                                                                
##  [98] "Other.reason.for.drinking.coffee"                                                                
##  [99] "Do.you.like.the.taste.of.coffee."                                                                
## [100] "Do.you.know.where.your.coffee.comes.from."                                                       
## [101] "What.is.the.most.you.ve.ever.paid.for.a.cup.of.coffee."                                          
## [102] "What.is.the.most.you.d.ever.be.willing.to.pay.for.a.cup.of.coffee."                              
## [103] "Do.you.feel.like.you.re.getting.good.value.for.your.money.when.you.buy.coffee.at.a.cafe."        
## [104] "Approximately.how.much.have.you.spent.on.coffee.equipment.in.the.past.5.years."                  
## [105] "Do.you.feel.like.you.re.getting.good.value.for.your.money.with.regards.to.your.coffee.equipment."
## [106] "Gender"                                                                                          
## [107] "Gender..please.specify."                                                                         
## [108] "Education.Level"                                                                                 
## [109] "Ethnicity.Race"                                                                                  
## [110] "Ethnicity.Race..please.specify."                                                                 
## [111] "Employment.Status"                                                                               
## [112] "Number.of.Children"                                                                              
## [113] "Political.Affiliation"

Data Preprocessing

There are 113 columns in total. However, many of them are redundant or focus on a single subject in too much detail, so they can be removed. As a result, only 29 columns are kept for data cleaning and the final analysis. Column names are also shortened to improve readability.

coffee_survey2 <- tibble(
  age = as.factor(coffee_survey$What.is.your.age.), 
  cups_daily = as.factor(coffee_survey$How.many.cups.of.coffee.do.you.typically.drink.per.day.),
  drink_home = as.factor(coffee_survey$Where.do.you.typically.drink.coffee...At.home.),
  drink_at_office = as.factor(coffee_survey$Where.do.you.typically.drink.coffee...At.the.office.),
  drink_cafe = as.factor(coffee_survey$Where.do.you.typically.drink.coffee...At.a.cafe.),
  brew_pour_over = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Pour.over.),
  brew_french_press = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...French.press.),
  brew_coffee_machine = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Coffee.brewing.machine..e.g..Mr..Coffee..),
  brew_coffee_pods = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Pod.capsule.machine..e.g..Keurig.Nespresso..),
  brew_instant_coffee = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Instant.coffee.),
  brew_cold_brew = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Cold.brew.),
  purchase_national_chain = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...National.chain..e.g..Starbucks..Dunkin..),
  purchase_local_cafe = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...Local.cafe.),
  purchase_specialty_shop = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...Specialty.coffee.shop.),
  purchase_deli_supermarket = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...Deli.or.supermarket.),
  favorite_drink = as.factor(coffee_survey$What.is.your.favorite.coffee.drink.),
  coffee_black = as.factor(coffee_survey$Do.you.usually.add.anything.to.your.coffee...No...just.black.),
  coffee_milk_creamer = as.factor(coffee_survey$Do.you.usually.add.anything.to.your.coffee...Milk..dairy.alternative..or.coffee.creamer.),
  coffee_sugar_sweetener = as.factor(coffee_survey$Do.you.usually.add.anything.to.your.coffee...Sugar.or.sweetener.),
  coffee_strength = as.factor(coffee_survey$How.strong.do.you.like.your.coffee.),
  coffee_roast = as.factor(coffee_survey$What.roast.level.of.coffee.do.you.prefer.),
  caffeine_level = as.factor(coffee_survey$How.much.caffeine.do.you.like.in.your.coffee.),
  drink_for_taste = as.factor(coffee_survey$Why.do.you.drink.coffee...It.tastes.good.),
  drink_for_caffeine = as.factor(coffee_survey$Why.do.you.drink.coffee...I.need.the.caffeine.),
  like_coffee_taste = as.factor(coffee_survey$Do.you.like.the.taste.of.coffee.),
  gender = as.factor(coffee_survey$Gender), 
  education = as.factor(coffee_survey$Education.Level),
  employment_status = as.factor(coffee_survey$Employment.Status),
  political_affiliation = as.factor(coffee_survey$Political.Affiliation)
)

Let’s now take a look inside the data:

str(coffee_survey2)
## tibble [4,042 × 29] (S3: tbl_df/tbl/data.frame)
##  $ age                      : Factor w/ 7 levels "<18 years old",..: 3 4 4 5 4 7 3 NA 4 NA ...
##  $ cups_daily               : Factor w/ 6 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA 5 NA ...
##  $ drink_home               : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA 1 NA ...
##  $ drink_at_office          : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA 1 NA ...
##  $ drink_cafe               : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA 2 NA ...
##  $ brew_pour_over           : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 2 1 2 NA NA NA ...
##  $ brew_french_press        : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 1 2 2 NA NA NA ...
##  $ brew_coffee_machine      : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 2 1 1 1 NA NA NA ...
##  $ brew_coffee_pods         : Factor w/ 2 levels "FALSE","TRUE": NA 2 1 1 1 2 2 NA NA NA ...
##  $ brew_instant_coffee      : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 1 1 2 NA NA NA ...
##  $ brew_cold_brew           : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 1 1 1 NA NA NA ...
##  $ purchase_national_chain  : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA NA NA ...
##  $ purchase_local_cafe      : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA NA NA ...
##  $ purchase_specialty_shop  : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA NA NA ...
##  $ purchase_deli_supermarket: Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 1 NA NA NA ...
##  $ favorite_drink           : Factor w/ 11 levels "Americano","Blended drink (e.g. Frappuccino)",..: 11 7 11 7 8 7 10 NA 11 NA ...
##  $ coffee_black             : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 1 2 NA 2 1 ...
##  $ coffee_milk_creamer      : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 NA 1 1 ...
##  $ coffee_sugar_sweetener   : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 1 1 1 NA 1 1 ...
##  $ coffee_strength          : Factor w/ 5 levels "Medium","Somewhat light",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ coffee_roast             : Factor w/ 7 levels "Blonde","Dark",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ caffeine_level           : Factor w/ 3 levels "Decaf","Full caffeine",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ drink_for_taste          : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA NA NA NA NA ...
##  $ drink_for_caffeine       : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA NA NA NA NA ...
##  $ like_coffee_taste        : Factor w/ 2 levels "No","Yes": NA NA NA NA NA NA NA NA NA NA ...
##  $ gender                   : Factor w/ 5 levels "Female","Male",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ education                : Factor w/ 6 levels "Bachelor's degree",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ employment_status        : Factor w/ 6 levels "Employed full-time",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ political_affiliation    : Factor w/ 4 levels "Democrat","Independent",..: NA NA NA NA NA NA NA NA NA NA ...

Numerous missing values (NA) appear in the dataset, likely because responding to certain questions was optional. Rows containing any missing value are dropped, which leaves 541 of the original 4042 responses.

coffee_survey2 <- coffee_survey2 %>% drop_na()
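
Because dropping every incomplete row can remove a large share of the data, it may help to check where the missing values sit before calling drop_na(). The following is a minimal sketch on a made-up toy data frame (the column names and values are illustrative assumptions, not the survey data):

```r
library(tidyr) # drop_na()

# Toy stand-in for the survey data; names and values are made up for illustration.
toy <- data.frame(
  age    = c("25-34", NA, "35-44", "25-34"),
  gender = c("Male", "Female", NA, "Female"),
  cups   = c("2", "1", "3", "2")
)

# Missing values per column: shows which questions were skipped most often.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that are complete in every column.
n_complete <- sum(complete.cases(toy))

# drop_na() without column arguments keeps exactly these complete rows.
toy_clean <- drop_na(toy)
print(nrow(toy_clean))
```

If a few columns account for most of the missing values, excluding those columns instead of the rows would be one way to keep more responses, at the cost of a narrower profile.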

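Next, factor labels are renamed so that they carry information about both the variable and its level. This matters because each cell value later becomes an item, and a bare TRUE or FALSE would be indistinguishable across columns.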

coffee_survey2 <- coffee_survey2 %>%
  mutate(
    cups_daily = ifelse(cups_daily == "Less than 1", "Less than 1 coffee cup daily", 
    ifelse(cups_daily == "1", "1 coffee cup daily",  
    ifelse(cups_daily == "2", "2 coffee cups daily",  
    ifelse(cups_daily == "3", "3 coffee cups daily",  
    ifelse(cups_daily == "4", "4 coffee cups daily",  
    "More than 4 coffee cups daily"))))),
    drink_home = ifelse(drink_home == "TRUE", "Drinks coffee at home", "Does not drink coffee at home"),
    drink_at_office = ifelse(drink_at_office == "TRUE", "Drinks coffee at the office", "Does not drink coffee at the office"),
    drink_cafe = ifelse(drink_cafe == "TRUE", "Drinks coffee at a cafe", "Does not drink coffee at a cafe"),
    brew_pour_over = ifelse(brew_pour_over == "TRUE", "Brews coffee using pour-over", "Does not brew coffee using pour-over"),
    brew_french_press = ifelse(brew_french_press == "TRUE", "Brews coffee using French press", "Does not brew coffee using French press"),
    brew_coffee_machine = ifelse(brew_coffee_machine == "TRUE", "Brews coffee using a coffee machine", "Does not brew coffee using a coffee machine"),
    brew_coffee_pods = ifelse(brew_coffee_pods == "TRUE", "Brews coffee using pods", "Does not brew coffee using pods"),
    brew_instant_coffee = ifelse(brew_instant_coffee == "TRUE", "Brews coffee using instant coffee", "Does not brew coffee using instant coffee"),
    brew_cold_brew = ifelse(brew_cold_brew == "TRUE", "Brews cold brew coffee", "Does not brew cold brew coffee"),
    purchase_national_chain = ifelse(purchase_national_chain == "TRUE", "Buys coffee from a national chain (i.e. Starbucks)", "Does not buy coffee from a national chain (i.e. Starbucks)"),
    purchase_local_cafe = ifelse(purchase_local_cafe == "TRUE", "Buys coffee from a local cafe", "Does not buy coffee from a local cafe"),
    purchase_specialty_shop = ifelse(purchase_specialty_shop == "TRUE", "Buys coffee from a specialty shop", "Does not buy coffee from a specialty shop"),
    purchase_deli_supermarket = ifelse(purchase_deli_supermarket == "TRUE", "Buys coffee from a deli or supermarket", "Does not buy coffee from a deli or supermarket"),
    coffee_black = ifelse(coffee_black == "TRUE", "Drinks black coffee", "Does not drink black coffee"),
    coffee_milk_creamer = ifelse(coffee_milk_creamer == "TRUE", "Adds milk or creamer to coffee", "Does not add milk or creamer to coffee"),
    coffee_sugar_sweetener = ifelse(coffee_sugar_sweetener == "TRUE", "Adds sugar or sweetener to coffee", "Does not add sugar or sweetener to coffee"),
    drink_for_taste = ifelse(drink_for_taste == "TRUE", "Drinks coffee for the taste", "Does not drink coffee for the taste"),
    drink_for_caffeine = ifelse(drink_for_caffeine == "TRUE", "Drinks coffee for the caffeine", "Does not drink coffee for the caffeine"),
    like_coffee_taste = ifelse(like_coffee_taste == "Yes", "Likes the taste of coffee", "Does not like the taste of coffee"),
    gender = ifelse(gender == "Male", "Male", 
    ifelse(gender == "Female", "Female", "Other"))
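  ) %>%
  mutate(across(where(is.character), as.factor)) # ifelse() returns character vectors; convert back to factors, as in the str() output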
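
Nested ifelse() calls become hard to read as the number of levels grows. As a possible alternative, the same kind of relabeling can be sketched with dplyr's case_when() and if_else(). The toy data below is made up, and this is only one way to express the mapping, not the code used to produce the results in this report:

```r
library(dplyr)

# Toy data; the values mirror survey levels, the rows themselves are made up.
toy <- tibble(
  cups_daily = factor(c("Less than 1", "1", "2", "More than 4")),
  gender     = factor(c("Male", "Female", "Non-binary", "Male"))
)

toy_labeled <- toy %>%
  mutate(
    # Conditions are checked top to bottom; the first match wins.
    cups_daily = case_when(
      cups_daily == "Less than 1"      ~ "Less than 1 coffee cup daily",
      cups_daily == "1"                ~ "1 coffee cup daily",
      cups_daily %in% c("2", "3", "4") ~ paste(cups_daily, "coffee cups daily"),
      TRUE                             ~ "More than 4 coffee cups daily"
    ),
    # Everything that is neither "Male" nor "Female" is collapsed into "Other".
    gender = if_else(gender %in% c("Male", "Female"), as.character(gender), "Other")
  ) %>%
  # The relabeled columns are character vectors; convert them back to factors.
  mutate(across(everything(), as.factor))

print(toy_labeled)
```

Each condition sits next to its label, so a wrong column name in a condition is easier to spot than in a long chain of ifelse() calls.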

Let’s display the structure of the data once again, to ensure proper formatting:

str(coffee_survey2)
## tibble [541 × 29] (S3: tbl_df/tbl/data.frame)
##  $ age                      : Factor w/ 7 levels "<18 years old",..: 6 4 4 4 4 5 4 5 4 7 ...
##  $ cups_daily               : Factor w/ 6 levels "1 coffee cup daily",..: 2 2 2 2 1 1 3 1 5 4 ...
##  $ drink_home               : Factor w/ 2 levels "Does not drink coffee at home",..: 2 2 1 1 2 1 2 1 1 2 ...
##  $ drink_at_office          : Factor w/ 2 levels "Does not drink coffee at the office",..: 2 2 1 1 2 1 2 1 1 2 ...
##  $ drink_cafe               : Factor w/ 2 levels "Does not drink coffee at a cafe",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ brew_pour_over           : Factor w/ 2 levels "Brews coffee using pour-over",..: 1 1 1 1 2 1 2 2 1 1 ...
##  $ brew_french_press        : Factor w/ 2 levels "Brews coffee using French press",..: 1 2 1 1 2 2 2 2 2 1 ...
##  $ brew_coffee_machine      : Factor w/ 2 levels "Brews coffee using a coffee machine",..: 2 2 1 1 2 2 1 2 2 2 ...
##  $ brew_coffee_pods         : Factor w/ 2 levels "Brews coffee using pods",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ brew_instant_coffee      : Factor w/ 2 levels "Brews coffee using instant coffee",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ brew_cold_brew           : Factor w/ 2 levels "Brews cold brew coffee",..: 2 2 2 1 2 2 2 2 2 2 ...
##  $ purchase_national_chain  : Factor w/ 2 levels "Buys coffee from a national chain (i.e. Starbucks)",..: 2 2 1 2 1 2 1 2 1 2 ...
##  $ purchase_local_cafe      : Factor w/ 2 levels "Buys coffee from a local cafe",..: 1 2 1 1 2 1 1 1 2 1 ...
##  $ purchase_specialty_shop  : Factor w/ 2 levels "Buys coffee from a specialty shop",..: 1 1 1 2 2 2 2 1 2 1 ...
##  $ purchase_deli_supermarket: Factor w/ 2 levels "Buys coffee from a deli or supermarket",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ favorite_drink           : Factor w/ 11 levels "Americano","Blended drink (e.g. Frappuccino)",..: 11 3 10 10 6 10 1 8 8 1 ...
##  $ coffee_black             : Factor w/ 2 levels "Does not drink black coffee",..: 2 2 1 2 2 2 2 1 1 2 ...
##  $ coffee_milk_creamer      : Factor w/ 2 levels "Adds milk or creamer to coffee",..: 2 2 1 2 2 2 2 1 1 2 ...
##  $ coffee_sugar_sweetener   : Factor w/ 2 levels "Adds sugar or sweetener to coffee",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ coffee_strength          : Factor w/ 5 levels "Medium","Somewhat light",..: 1 3 3 4 4 3 3 3 5 1 ...
##  $ coffee_roast             : Factor w/ 7 levels "Blonde","Dark",..: 5 5 5 5 6 6 5 2 6 3 ...
##  $ caffeine_level           : Factor w/ 3 levels "Decaf","Full caffeine",..: 2 2 2 2 2 2 2 3 1 2 ...
##  $ drink_for_taste          : Factor w/ 2 levels "Does not drink coffee for the taste",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ drink_for_caffeine       : Factor w/ 2 levels "Does not drink coffee for the caffeine",..: 1 1 2 1 1 2 2 2 1 1 ...
##  $ like_coffee_taste        : Factor w/ 2 levels "Does not like the taste of coffee",..: 2 2 2 2 2 2 2 2 1 2 ...
##  $ gender                   : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 1 2 1 1 2 ...
##  $ education                : Factor w/ 6 levels "Bachelor's degree",..: 5 1 1 5 1 2 1 5 5 6 ...
##  $ employment_status        : Factor w/ 6 levels "Employed full-time",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ political_affiliation    : Factor w/ 4 levels "Democrat","Independent",..: 3 1 1 2 1 3 1 1 3 4 ...

As the last preprocessing step, the data is saved to a separate .csv file.

write.csv(coffee_survey2, "coffee_survey2.csv", row.names = FALSE)
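
Two details of the .csv round trip are worth illustrating: write.csv() writes a header line, and in basket format every cell value becomes an item regardless of its column. The sketch below uses a made-up toy data frame and a temporary file (both are assumptions for illustration) to compare reading with and without the header line, and to show direct coercion of a data frame as an alternative:

```r
library(arules)

# Toy data frame of factors; column names and values are illustrative only.
toy <- data.frame(
  gender          = factor(c("Male", "Female", "Male")),
  coffee_strength = factor(c("Medium", "Very strong", "Weak")),
  coffee_roast    = factor(c("Light", "Medium", "Dark"))
)

tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# skip = 0 reads the header line as one more transaction.
with_header <- read.transactions(tmp, format = "basket", sep = ",", skip = 0)
# skip = 1 ignores the header line.
without_header <- read.transactions(tmp, format = "basket", sep = ",", skip = 1)

print(length(with_header))    # number of transactions, header included
print(length(without_header)) # number of transactions, header excluded

# In basket format "Medium" is a single item, whichever column it came from.
print(itemLabels(without_header))

# Coercing the data frame directly creates "variable=level" items instead,
# so strength "Medium" and roast "Medium" stay distinct.
direct <- as(toy, "transactions")
print(itemLabels(direct))

unlink(tmp)
```

With direct coercion the manual relabeling step is not strictly needed, although the descriptive labels used in this report are arguably easier to read in the resulting rules.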

Rules

The final data file is first loaded as a transactions object.

trans <- read.transactions("coffee_survey2.csv", format = "basket", sep = ",", skip = 0)
summary(trans)
## transactions as itemMatrix in sparse format with
##  542 rows (elements/itemsets/transactions) and
##  124 columns (items) and a density of 0.2326211 
## 
## most frequent items:
##                      Likes the taste of coffee 
##                                            533 
##                    Drinks coffee for the taste 
##                                            526 
##      Does not brew coffee using instant coffee 
##                                            513 
## Does not buy coffee from a deli or supermarket 
##                                            504 
##                                  Full caffeine 
##                                            503 
##                                        (Other) 
##                                          13055 
## 
## element (itemset/transaction) length distribution:
## sizes
##  28  29 
##  84 458 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   29.00   29.00   28.85   29.00   29.00 
## 
## includes extended item information - examples:
##               labels
## 1      <18 years old
## 2      >65 years old
## 3 1 coffee cup daily

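The object contains 542 transactions and 124 items describing people’s coffee preferences and habits. That is one transaction more than the 541 cleaned survey responses: with skip=0, the header line of the .csv file is read as a transaction as well, so the 29 column names are counted among the 124 items (setting skip=1 would exclude it). Since each column name occurs in only one transaction, the frequency filter applied next removes these items again. Of the 542 transactions, 458 contain 29 items and the remaining 84 contain 28.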
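
The density and mean size reported by summary(trans) follow directly from the length distribution. As a quick check, using only numbers copied from the output above, the arithmetic can be reproduced like this:

```r
# Values copied from the summary(trans) output.
n_transactions <- 542
n_items        <- 124
size_counts    <- c("28" = 84, "29" = 458) # transaction length -> count

# Total number of (transaction, item) pairs that are actually present.
filled_cells <- sum(as.numeric(names(size_counts)) * size_counts)

# Mean transaction length and share of filled cells in the 542 x 124 matrix.
mean_size <- filled_cells / n_transactions
density   <- filled_cells / (n_transactions * n_items)

print(filled_cells)        # 15634
print(round(mean_size, 2)) # 28.85
print(round(density, 7))   # 0.2326211
```

A density of about 0.23 means that, on average, each transaction contains roughly 23% of all possible items, which is expected here because every response contributes one item per retained column.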

Rare items, those with a relative frequency (support) of 5% or less, are then removed, and the frequencies of the remaining items are displayed in ascending order.

trans <- trans[, itemFrequency(trans) > 0.05]
sort(itemFrequency(trans, type = "relative"))
##                          Brews coffee using instant coffee 
##                                                 0.05166052 
##                               Less than 1 coffee cup daily 
##                                                 0.05166052 
##                                             Somewhat light 
##                                                 0.05166052 
##                                                  Half caff 
##                                                 0.05350554 
##                                                Iced coffee 
##                                                 0.05350554 
##                                            45-54 years old 
##                                                 0.05719557 
##                                                 Republican 
##                                                 0.05719557 
##                                                   Espresso 
##                                                 0.06088561 
##                     Buys coffee from a deli or supermarket 
##                                                 0.06826568 
##                                                    Cortado 
##                                                 0.07564576 
##                                         Employed part-time 
##                                                 0.07564576 
##                                                 Cappuccino 
##                                                 0.07933579 
##                           Doctorate or professional degree 
##                                                 0.07933579 
##                                                  Americano 
##                                                 0.08302583 
##                                        Regular drip coffee 
##                                                 0.08302583 
##                                                    Student 
##                                                 0.09778598 
##                                                       Dark 
##                                                 0.10332103 
##                                                Very strong 
##                                                 0.10516605 
##                                            18-24 years old 
##                                                 0.11623616 
##                                    Brews coffee using pods 
##                                                 0.13468635 
##                                        3 coffee cups daily 
##                                                 0.14206642 
##                         Some college or associate's degree 
##                                                 0.15682657 
##                                                Independent 
##                                                 0.16051661 
##                          Adds sugar or sweetener to coffee 
##                                                 0.17158672 
##                                                      Latte 
##                                                 0.20848708 
##                                            Master's degree 
##                                                 0.22324723 
##                                             No affiliation 
##                                                 0.24169742 
##                                            35-44 years old 
##                                                 0.24538745 
##                                     Brews cold brew coffee 
##                                                 0.25276753 
##                            Brews coffee using French press 
##                                                 0.25830258 
##                                                   Pourover 
##                                                 0.26014760 
##                        Brews coffee using a coffee machine 
##                                                 0.26568266 
##                                         1 coffee cup daily 
##                                                 0.28966790 
##                                                     Female 
##                                                 0.28966790 
##                     Does not drink coffee for the caffeine 
##                                                 0.29335793 
##                  Does not buy coffee from a specialty shop 
##                                                 0.35793358 
##                       Does not brew coffee using pour-over 
##                                                 0.36162362 
##                                Does not drink black coffee 
##                                                 0.39114391 
##                              Does not drink coffee at home 
##                                                 0.42804428 
##                        Does not drink coffee at the office 
##                                                 0.42804428 
##                      Does not buy coffee from a local cafe 
##                                                 0.44095941 
##                                                      Light 
##                                                 0.44833948 
##                                        2 coffee cups daily 
##                                                 0.45202952 
##         Buys coffee from a national chain (i.e. Starbucks) 
##                                                 0.45387454 
##                                            Somewhat strong 
##                                                 0.48708487 
##                                          Bachelor's degree 
##                                                 0.48892989 
##                            Does not drink coffee at a cafe 
##                                                 0.49446494 
##                             Adds milk or creamer to coffee 
##                                                 0.49815498 
##                     Does not add milk or creamer to coffee 
##                                                 0.50000000 
##                                    Drinks coffee at a cafe 
##                                                 0.50369004 
##                                                   Democrat 
##                                                 0.53874539 
##                                            25-34 years old 
##                                                 0.54243542 
## Does not buy coffee from a national chain (i.e. Starbucks) 
##                                                 0.54428044 
##                              Buys coffee from a local cafe 
##                                                 0.55719557 
##                                      Drinks coffee at home 
##                                                 0.57011070 
##                                Drinks coffee at the office 
##                                                 0.57011070 
##                                                     Medium 
##                                                 0.59778598 
##                                        Drinks black coffee 
##                                                 0.60701107 
##                               Brews coffee using pour-over 
##                                                 0.63653137 
##                          Buys coffee from a specialty shop 
##                                                 0.64022140 
##                                                       Male 
##                                                 0.66051661 
##                             Drinks coffee for the caffeine 
##                                                 0.70479705 
##                Does not brew coffee using a coffee machine 
##                                                 0.73247232 
##                    Does not brew coffee using French press 
##                                                 0.73985240 
##                             Does not brew cold brew coffee 
##                                                 0.74538745 
##                                         Employed full-time 
##                                                 0.76937269 
##                  Does not add sugar or sweetener to coffee 
##                                                 0.82656827 
##                            Does not brew coffee using pods 
##                                                 0.86346863 
##                                              Full caffeine 
##                                                 0.92804428 
##             Does not buy coffee from a deli or supermarket 
##                                                 0.92988930 
##                  Does not brew coffee using instant coffee 
##                                                 0.94649446 
##                                Drinks coffee for the taste 
##                                                 0.97047970 
##                                  Likes the taste of coffee 
##                                                 0.98339483

The items “Likes the taste of coffee” and “Drinks coffee for the taste” occur most frequently in the dataset. This is not surprising: the survey was carried out during a coffee tasting event, so it was most likely attended by coffee lovers.

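In the next step, rules are created. The goal is to find the profile of people who drink coffee for the caffeine, so the right-hand side of every rule is fixed to “Drinks coffee for the caffeine”. The initial minimum support is 0.1 and the initial minimum confidence is 0.8.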

rules<-apriori(data=trans, parameter=list(supp=0.1, conf=0.8), appearance=list(default="lhs", rhs="Drinks coffee for the caffeine"), control=list(verbose=F)) 
summary(rules) 
## set of 10809 rules
## 
## rule length distribution (lhs + rhs):sizes
##    3    4    5    6    7    8 
##   19  256 1148 2694 3641 3051 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   7.000   6.743   8.000   8.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1015   Min.   :0.8000   Min.   :0.1107   Min.   :1.135  
##  1st Qu.:0.1052   1st Qu.:0.8061   1st Qu.:0.1292   1st Qu.:1.144  
##  Median :0.1125   Median :0.8143   Median :0.1384   Median :1.155  
##  Mean   :0.1188   Mean   :0.8207   Mean   :0.1450   Mean   :1.164  
##  3rd Qu.:0.1255   3rd Qu.:0.8286   3rd Qu.:0.1550   3rd Qu.:1.176  
##  Max.   :0.2583   Max.   :0.9403   Max.   :0.3229   Max.   :1.334  
##      count       
##  Min.   : 55.00  
##  1st Qu.: 57.00  
##  Median : 61.00  
##  Mean   : 64.41  
##  3rd Qu.: 68.00  
##  Max.   :140.00  
## 
## mining info:
##   data ntransactions support confidence
##  trans           542     0.1        0.8
##                                                                                                                                                                      call
##  apriori(data = trans, parameter = list(supp = 0.1, conf = 0.8), appearance = list(default = "lhs", rhs = "Drinks coffee for the caffeine"), control = list(verbose = F))

Initially, 10809 rules are found. The majority of the rules contain a relatively high number of items, between 6 and 8. To reduce the number of rules, the minimum confidence is raised to 0.87, and redundant, statistically insignificant, and non-maximal rules are removed.

rules<-apriori(data=trans, parameter=list(supp=0.1, conf=0.87), appearance=list(default="lhs", rhs="Drinks coffee for the caffeine"), control=list(verbose=F)) 
rules.clean<-rules[!is.redundant(rules)] 
rules.clean<-rules.clean[is.significant(rules.clean, trans)]
rules.clean<-rules.clean[is.maximal(rules.clean)]
summary(rules.clean)
## set of 31 rules
## 
## rule length distribution (lhs + rhs):sizes
## 4 5 6 7 8 
## 2 9 9 9 2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       4       5       6       6       7       8 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1015   Min.   :0.8701   Min.   :0.1107   Min.   :1.235  
##  1st Qu.:0.1015   1st Qu.:0.8731   1st Qu.:0.1162   1st Qu.:1.239  
##  Median :0.1052   Median :0.8806   Median :0.1181   Median :1.249  
##  Mean   :0.1079   Mean   :0.8859   Mean   :0.1218   Mean   :1.257  
##  3rd Qu.:0.1116   3rd Qu.:0.8898   3rd Qu.:0.1236   3rd Qu.:1.262  
##  Max.   :0.1255   Max.   :0.9403   Max.   :0.1439   Max.   :1.334  
##      count      
##  Min.   :55.00  
##  1st Qu.:55.00  
##  Median :57.00  
##  Mean   :58.48  
##  3rd Qu.:60.50  
##  Max.   :68.00  
## 
## mining info:
##   data ntransactions support confidence
##  trans           542     0.1       0.87
##                                                                                                                                                                       call
##  apriori(data = trans, parameter = list(supp = 0.1, conf = 0.87), appearance = list(default = "lhs", rhs = "Drinks coffee for the caffeine"), control = list(verbose = F))

The number of rules is reduced to 31.

Results Analysis

The interactive table below displays the obtained results.

inspectDT(rules.clean)

Next, the rules with the highest confidence are analyzed to understand which factors are associated with drinking coffee for the caffeine.

rules.clean.byconf<-sort(rules.clean, by="confidence", decreasing=TRUE)
inspect(head(rules.clean.byconf))
##     lhs                                                      rhs                                support confidence  coverage     lift count
## [1] {25-34 years old,                                                                                                                      
##      Buys coffee from a national chain (i.e. Starbucks),                                                                                   
##      Drinks coffee at home,                                                                                                                
##      Employed full-time}                                  => {Drinks coffee for the caffeine} 0.1162362  0.9402985 0.1236162 1.334141    63
## [2] {25-34 years old,                                                                                                                      
##      Buys coffee from a national chain (i.e. Starbucks),                                                                                   
##      Drinks coffee at the office,                                                                                                          
##      Employed full-time}                                  => {Drinks coffee for the caffeine} 0.1162362  0.9402985 0.1236162 1.334141    63
## [3] {Does not buy coffee from a deli or supermarket,                                                                                       
##      Does not buy coffee from a specialty shop,                                                                                            
##      Female,                                                                                                                               
##      Full caffeine}                                       => {Drinks coffee for the caffeine} 0.1014760  0.9166667 0.1107011 1.300611    55
## [4] {Does not brew coffee using instant coffee,                                                                                            
##      Does not buy coffee from a specialty shop,                                                                                            
##      Female,                                                                                                                               
##      Full caffeine}                                       => {Drinks coffee for the caffeine} 0.1051661  0.9047619 0.1162362 1.283720    57
## [5] {2 coffee cups daily,                                                                                                                  
##      Does not brew coffee using French press,                                                                                              
##      Does not brew coffee using instant coffee,                                                                                            
##      Does not buy coffee from a deli or supermarket,                                                                                       
##      Does not drink coffee at a cafe,                                                                                                      
##      Drinks coffee for the taste,                                                                                                          
##      Full caffeine}                                       => {Drinks coffee for the caffeine} 0.1254613  0.8947368 0.1402214 1.269496    68
## [6] {25-34 years old,                                                                                                                      
##      Adds milk or creamer to coffee,                                                                                                       
##      Does not brew coffee using instant coffee,                                                                                            
##      Drinks coffee at home,                                                                                                                
##      Employed full-time,                                                                                                                   
##      Full caffeine}                                       => {Drinks coffee for the caffeine} 0.1070111  0.8923077 0.1199262 1.266049    58
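As a sanity check, the quality measures of rule [1] can be reproduced by hand from the figures printed above. This is only an illustrative sketch: the variable names are made up for this example, and the support of the right-hand side (0.70479705) is taken from the item frequency output for “Drinks coffee for the caffeine”.

```r
# Values copied from the output above (variable names are illustrative)
n_transactions <- 542         # number of transactions in `trans`
rule_count     <- 63          # transactions containing both lhs and rhs of rule [1]
rule_coverage  <- 0.1236162   # share of transactions containing the lhs
rhs_support    <- 0.70479705  # frequency of "Drinks coffee for the caffeine"

rule_support    <- rule_count / n_transactions    # share containing lhs and rhs
rule_confidence <- rule_support / rule_coverage   # share of lhs transactions that also contain rhs
rule_lift       <- rule_confidence / rhs_support  # confidence relative to the rhs base rate

round(c(support = rule_support, confidence = rule_confidence, lift = rule_lift), 4)
```

A lift of about 1.33 means that the right-hand side occurs roughly a third more often among transactions matching the left-hand side than in the dataset as a whole. Because about 70% of all respondents already drink coffee for the caffeine, the lift of any rule with this right-hand side cannot be much higher than 1.4.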

Judging by the highest-confidence rules, people who drink coffee for the caffeine tend to be young Zillennials (25-34 years old) who work full-time, and two of the top rules describe women. They prefer full-caffeine coffee and buy it from a national chain (such as Starbucks) rather than from a specialty shop. One of the rules also involves drinking 2 cups per day, which, fortunately, is not above the recommended norm.

cSPADE Algorithm

The second part of the project focuses on sequential pattern mining, that is, on finding patterns in the order of purchases from a coffee vending machine. Analyzing events in a sequence means adding a time component to the analysis. Rather than just analyzing a basket of coffee orders, the cSPADE algorithm makes it possible to determine whether a past drink choice influenced a future purchase. Thus, it allows for modeling the evolution of the customer’s choices.

Data overview

The second dataset was downloaded from Kaggle, and contains coffee sales from a vending machine. The following columns will be used for sequential pattern mining:

  • datetime: Datetime of purchase.
  • card: Anonymous card number of the client.
  • coffee_name: Purchased coffee type.

These variables are sufficient to perform the analysis. The card number can be treated as a customer ID, allowing for unique identification of customers. Datetime allows for ordering purchases in the form of a sequence. Lastly, coffee_name identifies the products the customer has purchased.

online_retail <- read.csv("index.csv")
online_retail <- online_retail[complete.cases(online_retail), ]

Data Preprocessing

As a first step, the data is loaded and rows with missing values are removed (see the code above). The structure of the resulting data frame is as follows:

str(online_retail)
## 'data.frame':    2838 obs. of  6 variables:
##  $ date       : chr  "2024-03-01" "2024-03-01" "2024-03-01" "2024-03-01" ...
##  $ datetime   : chr  "2024-03-01 10:15:50.520" "2024-03-01 12:19:22.539" "2024-03-01 12:20:18.089" "2024-03-01 13:46:33.006" ...
##  $ cash_type  : chr  "card" "card" "card" "card" ...
##  $ card       : chr  "ANON-0000-0000-0001" "ANON-0000-0000-0002" "ANON-0000-0000-0002" "ANON-0000-0000-0003" ...
##  $ money      : num  38.7 38.7 38.7 28.9 38.7 33.8 38.7 33.8 38.7 33.8 ...
##  $ coffee_name: chr  "Latte" "Hot Chocolate" "Hot Chocolate" "Americano" ...

Next, the data is transformed to match the format required by the arulesSequences package. In order to use cSPADE, the data needs to be reformatted into a transaction matrix that contains the following attributes:

  • sequenceID: Customer identifier of a sequence of events (here - card).
  • eventID: The order of purchase within the same sequence (does not exist in the table yet).
  • transactionID: A running index over all purchases, ordered by card and datetime (does not exist in the table yet).
  • items: The product purchased by the customer (in this case - coffee_name).

Repeat purchases of the same product by the same customer are removed, as keeping these records would not return any meaningful results. Only the first purchase of each product per customer is retained.

online_retail_seq <- online_retail %>% 
  group_by(card) %>% 
  arrange(datetime) %>% 
  distinct(card, coffee_name, .keep_all = TRUE)

Next, eventID and transactionID are added. Columns coffee_name and card are renamed to items and sequenceID, respectively.

# adding event IDs

online_retail_seq <- online_retail_seq %>%  
  group_by(card) %>% 
  arrange(datetime) %>% 
  mutate(eventID = row_number()) %>%
  ungroup()

# adding transaction IDs
online_retail_seq <- online_retail_seq %>%  
  arrange(card, datetime) %>% 
  mutate(transactionID = row_number()) %>%
  ungroup()

# renaming columns
online_retail_seq <- online_retail_seq %>% rename(items = coffee_name)
online_retail_seq <- online_retail_seq %>% rename(sequenceID = card)
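The effect of the preprocessing steps above can be illustrated on a small, made-up purchase log. The sketch below uses base R instead of the dplyr pipeline so that it runs without additional packages. The data frame toy and its values are hypothetical, and only the column names follow the vending machine dataset.

```r
# Toy purchase log (made-up values, same column names as the vending machine data)
toy <- data.frame(
  card = c("A", "A", "A", "B", "B"),
  datetime = c("2024-03-01 10:00:00", "2024-03-02 09:00:00",
               "2024-03-03 08:00:00", "2024-03-01 11:00:00",
               "2024-03-04 12:00:00"),
  coffee_name = c("Latte", "Latte", "Cocoa", "Americano", "Latte"),
  stringsAsFactors = FALSE
)

# Order by customer and time, then keep the first purchase of each drink per customer
toy <- toy[order(toy$card, toy$datetime), ]
toy <- toy[!duplicated(toy[, c("card", "coffee_name")]), ]

# eventID restarts at 1 for every customer; transactionID runs over the whole table
toy$eventID <- ave(seq_len(nrow(toy)), toy$card, FUN = seq_along)
toy$transactionID <- seq_len(nrow(toy))

print(toy)
```

Customer A's second Latte is dropped, so each customer ends up with a sequence of distinct drinks ordered by time of first purchase.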

The columns needed to perform sequential pattern mining are saved as a separate data frame. The columns are also converted into factors, as required by the arulesSequences package.

sessions <- online_retail_seq[, c("sequenceID", "items", "eventID")] # select by name (columns 4, 6, 7)

# converting data to factors
sessions$sequenceID <- as.factor(sessions$sequenceID)
sessions$eventID <- as.factor(sessions$eventID)
sessions$items <- as.factor(sessions$items)

Now, the data can be converted into a transactions object.

transactions <-  as(sessions %>% transmute(items = items), "transactions")
transactionInfo(transactions)$sequenceID <- sessions$sequenceID
transactionInfo(transactions)$eventID <- sessions$eventID
itemLabels(transactions) <- str_replace_all(itemLabels(transactions), "items=", "")

Results Analysis

Let’s run the cSPADE algorithm. The minimum support is set to 0.02, which restricts the results to sequences that appear for at least 2% of customers.

itemsets <- cspade(transactions, 
                   parameter = list(support = 0.02), 
                   control = list(verbose = FALSE))
inspect(itemsets)
##     items                      support 
##   1 <{Americano}>           0.16310160 
##   2 <{Americano with Milk}> 0.27985740 
##   3 <{Cappuccino}>          0.21836007 
##   4 <{Cocoa}>               0.08823529 
##   5 <{Cortado}>             0.09001783 
##   6 <{Espresso}>            0.06506239 
##   7 <{Hot Chocolate}>       0.13279857 
##   8 <{Latte}>               0.30748663 
##   9 <{Americano with Milk},  
##      {Latte}>               0.02406417 
##  10 <{Latte},                
##      {Cocoa}>               0.02049911 
##  11 <{Americano with Milk},  
##      {Cappuccino}>          0.03030303 
##  12 <{Latte},                
##      {Cappuccino}>          0.03208556 
##  13 <{Latte},                
##      {Americano with Milk}> 0.02228164 
## 

The sequences with the highest support are all one-item sequences. Latte is the most popular choice, purchased by over 30% of customers. Several two-item sequences can also be found. For example:

  • Americano with Milk -> Cappuccino: about 3% of customers first ordered Americano with Milk, and then Cappuccino.
  • Latte -> Cappuccino: about 3% of customers first ordered Latte, and then Cappuccino.
  • Americano with Milk -> Latte: about 2.4% of customers first ordered Americano with Milk, and then Latte.

It seems that customers are unlikely to modify their original choice of drink.
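To make the support values above concrete, the sketch below computes the support of a two-item sequence by brute force on made-up data. It only illustrates the definition of sequence support (the share of customers whose purchase history contains the pattern in the given order, not necessarily consecutively). It is not how cSPADE works internally, and the names toy_sequences, contains_in_order, and sequence_support are hypothetical.

```r
# Made-up purchase sequences, one vector per customer, in chronological order
toy_sequences <- list(
  c1 = c("Latte", "Cappuccino"),
  c2 = c("Americano with Milk", "Latte", "Cappuccino"),
  c3 = c("Cappuccino", "Latte"),
  c4 = c("Espresso"),
  c5 = c("Latte")
)

# TRUE if `first` occurs somewhere before `second` in one customer's sequence.
# Assumes each drink appears at most once per customer, as after the deduplication step.
contains_in_order <- function(purchases, first, second) {
  pos_first  <- match(first, purchases)
  pos_second <- match(second, purchases)
  !is.na(pos_first) && !is.na(pos_second) && pos_first < pos_second
}

# Support = share of customers whose sequence contains the pattern
sequence_support <- function(sequences, first, second) {
  mean(vapply(sequences, contains_in_order, logical(1),
              first = first, second = second))
}

sequence_support(toy_sequences, "Latte", "Cappuccino")  # 2 of 5 customers match
```

In this toy example, customers c1 and c2 order Latte before Cappuccino, while c3 orders them in the opposite order and is therefore not counted.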