The goal of the project is to analyze people’s preferences for coffee using association rule mining. Two separate datasets and two algorithms, Apriori and cSPADE, are used to perform the analysis. In the first part, the results of an anonymized survey conducted during a coffee tasting event are examined. In the second part, sequential pattern mining is used to analyze coffee sales from a vending machine.
The following R packages were used to perform the analysis:
library(arules) # Apriori algorithm
library(arulesSequences) # cSPADE algorithm
library(arulesViz) # data visualization
library(tidyverse) # data preprocessing and cleaning
The goal of this part of the project is to analyze the profile of people who drink coffee to increase their energy levels through caffeine. The Apriori algorithm is used.
The first dataset was found on Kaggle, and includes data on coffee drinking habits and preferences, collected during a virtual coffee tasting event. 4042 participants took part in the survey, in which they had to answer over 100 questions. The data was not preprocessed or cleaned prior to posting the survey results.
coffee_survey <- read.csv("GACTT_RESULTS_ANONYMIZED_v2.csv") # reading the data
names(coffee_survey)
## [1] "Submission.ID"
## [2] "What.is.your.age."
## [3] "How.many.cups.of.coffee.do.you.typically.drink.per.day."
## [4] "Where.do.you.typically.drink.coffee."
## [5] "Where.do.you.typically.drink.coffee...At.home."
## [6] "Where.do.you.typically.drink.coffee...At.the.office."
## [7] "Where.do.you.typically.drink.coffee...On.the.go."
## [8] "Where.do.you.typically.drink.coffee...At.a.cafe."
## [9] "Where.do.you.typically.drink.coffee...None.of.these."
## [10] "How.do.you.brew.coffee.at.home."
## [11] "How.do.you.brew.coffee.at.home...Pour.over."
## [12] "How.do.you.brew.coffee.at.home...French.press."
## [13] "How.do.you.brew.coffee.at.home...Espresso."
## [14] "How.do.you.brew.coffee.at.home...Coffee.brewing.machine..e.g..Mr..Coffee.."
## [15] "How.do.you.brew.coffee.at.home...Pod.capsule.machine..e.g..Keurig.Nespresso.."
## [16] "How.do.you.brew.coffee.at.home...Instant.coffee."
## [17] "How.do.you.brew.coffee.at.home...Bean.to.cup.machine."
## [18] "How.do.you.brew.coffee.at.home...Cold.brew."
## [19] "How.do.you.brew.coffee.at.home...Coffee.extract..e.g..Cometeer.."
## [20] "How.do.you.brew.coffee.at.home...Other."
## [21] "How.else.do.you.brew.coffee.at.home."
## [22] "On.the.go..where.do.you.typically.purchase.coffee."
## [23] "On.the.go..where.do.you.typically.purchase.coffee...National.chain..e.g..Starbucks..Dunkin.."
## [24] "On.the.go..where.do.you.typically.purchase.coffee...Local.cafe."
## [25] "On.the.go..where.do.you.typically.purchase.coffee...Drive.thru."
## [26] "On.the.go..where.do.you.typically.purchase.coffee...Specialty.coffee.shop."
## [27] "On.the.go..where.do.you.typically.purchase.coffee...Deli.or.supermarket."
## [28] "On.the.go..where.do.you.typically.purchase.coffee...Other."
## [29] "Where.else.do.you.purchase.coffee."
## [30] "What.is.your.favorite.coffee.drink."
## [31] "Please.specify.what.your.favorite.coffee.drink.is"
## [32] "Do.you.usually.add.anything.to.your.coffee."
## [33] "Do.you.usually.add.anything.to.your.coffee...No...just.black."
## [34] "Do.you.usually.add.anything.to.your.coffee...Milk..dairy.alternative..or.coffee.creamer."
## [35] "Do.you.usually.add.anything.to.your.coffee...Sugar.or.sweetener."
## [36] "Do.you.usually.add.anything.to.your.coffee...Flavor.syrup."
## [37] "Do.you.usually.add.anything.to.your.coffee...Other."
## [38] "What.else.do.you.add.to.your.coffee."
## [39] "What.kind.of.dairy.do.you.add."
## [40] "What.kind.of.dairy.do.you.add...Whole.milk."
## [41] "What.kind.of.dairy.do.you.add...Skim.milk."
## [42] "What.kind.of.dairy.do.you.add...Half.and.half."
## [43] "What.kind.of.dairy.do.you.add...Coffee.creamer."
## [44] "What.kind.of.dairy.do.you.add...Flavored.coffee.creamer."
## [45] "What.kind.of.dairy.do.you.add...Oat.milk."
## [46] "What.kind.of.dairy.do.you.add...Almond.milk."
## [47] "What.kind.of.dairy.do.you.add...Soy.milk."
## [48] "What.kind.of.dairy.do.you.add...Other."
## [49] "What.kind.of.sugar.or.sweetener.do.you.add."
## [50] "What.kind.of.sugar.or.sweetener.do.you.add...Granulated.Sugar."
## [51] "What.kind.of.sugar.or.sweetener.do.you.add...Artificial.Sweeteners..e.g...Splenda.."
## [52] "What.kind.of.sugar.or.sweetener.do.you.add...Honey."
## [53] "What.kind.of.sugar.or.sweetener.do.you.add...Maple.Syrup."
## [54] "What.kind.of.sugar.or.sweetener.do.you.add...Stevia."
## [55] "What.kind.of.sugar.or.sweetener.do.you.add...Agave.Nectar."
## [56] "What.kind.of.sugar.or.sweetener.do.you.add...Brown.Sugar."
## [57] "What.kind.of.sugar.or.sweetener.do.you.add...Raw.Sugar..Turbinado.."
## [58] "What.kind.of.flavorings.do.you.add."
## [59] "What.kind.of.flavorings.do.you.add...Vanilla.Syrup."
## [60] "What.kind.of.flavorings.do.you.add...Caramel.Syrup."
## [61] "What.kind.of.flavorings.do.you.add...Hazelnut.Syrup."
## [62] "What.kind.of.flavorings.do.you.add...Cinnamon..Ground.or.Stick.."
## [63] "What.kind.of.flavorings.do.you.add...Peppermint.Syrup."
## [64] "What.kind.of.flavorings.do.you.add...Other."
## [65] "What.other.flavoring.do.you.use."
## [66] "Before.today.s.tasting..which.of.the.following.best.described.what.kind.of.coffee.you.like."
## [67] "How.strong.do.you.like.your.coffee."
## [68] "What.roast.level.of.coffee.do.you.prefer."
## [69] "How.much.caffeine.do.you.like.in.your.coffee."
## [70] "Lastly..how.would.you.rate.your.own.coffee.expertise."
## [71] "Coffee.A...Bitterness"
## [72] "Coffee.A...Acidity"
## [73] "Coffee.A...Personal.Preference"
## [74] "Coffee.A...Notes"
## [75] "Coffee.B...Bitterness"
## [76] "Coffee.B...Acidity"
## [77] "Coffee.B...Personal.Preference"
## [78] "Coffee.B...Notes"
## [79] "Coffee.C...Bitterness"
## [80] "Coffee.C...Acidity"
## [81] "Coffee.C...Personal.Preference"
## [82] "Coffee.C...Notes"
## [83] "Coffee.D...Bitterness"
## [84] "Coffee.D...Acidity"
## [85] "Coffee.D...Personal.Preference"
## [86] "Coffee.D...Notes"
## [87] "Between.Coffee.A..Coffee.B..and.Coffee.C.which.did.you.prefer."
## [88] "Between.Coffee.A.and.Coffee.D..which.did.you.prefer."
## [89] "Lastly..what.was.your.favorite.overall.coffee."
## [90] "Do.you.work.from.home.or.in.person."
## [91] "In.total..much.money.do.you.typically.spend.on.coffee.in.a.month."
## [92] "Why.do.you.drink.coffee."
## [93] "Why.do.you.drink.coffee...It.tastes.good."
## [94] "Why.do.you.drink.coffee...I.need.the.caffeine."
## [95] "Why.do.you.drink.coffee...I.need.the.ritual."
## [96] "Why.do.you.drink.coffee...It.makes.me.go.to.the.bathroom."
## [97] "Why.do.you.drink.coffee...Other."
## [98] "Other.reason.for.drinking.coffee"
## [99] "Do.you.like.the.taste.of.coffee."
## [100] "Do.you.know.where.your.coffee.comes.from."
## [101] "What.is.the.most.you.ve.ever.paid.for.a.cup.of.coffee."
## [102] "What.is.the.most.you.d.ever.be.willing.to.pay.for.a.cup.of.coffee."
## [103] "Do.you.feel.like.you.re.getting.good.value.for.your.money.when.you.buy.coffee.at.a.cafe."
## [104] "Approximately.how.much.have.you.spent.on.coffee.equipment.in.the.past.5.years."
## [105] "Do.you.feel.like.you.re.getting.good.value.for.your.money.with.regards.to.your.coffee.equipment."
## [106] "Gender"
## [107] "Gender..please.specify."
## [108] "Education.Level"
## [109] "Ethnicity.Race"
## [110] "Ethnicity.Race..please.specify."
## [111] "Employment.Status"
## [112] "Number.of.Children"
## [113] "Political.Affiliation"
There are 113 columns in total. However, many attributes are redundant or focus on a single topic in too much detail, so they can be removed. As a result, only 29 columns are kept for data cleaning and the final analysis. Column names are also shortened to improve readability.
coffee_survey2 <- tibble(
age = as.factor(coffee_survey$What.is.your.age.),
cups_daily = as.factor(coffee_survey$How.many.cups.of.coffee.do.you.typically.drink.per.day.),
drink_home = as.factor(coffee_survey$Where.do.you.typically.drink.coffee...At.home.),
drink_at_office = as.factor(coffee_survey$Where.do.you.typically.drink.coffee...At.the.office.),
drink_cafe = as.factor(coffee_survey$Where.do.you.typically.drink.coffee...At.a.cafe.),
brew_pour_over = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Pour.over.),
brew_french_press = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...French.press.),
brew_coffee_machine = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Coffee.brewing.machine..e.g..Mr..Coffee..),
brew_coffee_pods = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Pod.capsule.machine..e.g..Keurig.Nespresso..),
brew_instant_coffee = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Instant.coffee.),
brew_cold_brew = as.factor(coffee_survey$How.do.you.brew.coffee.at.home...Cold.brew.),
purchase_national_chain = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...National.chain..e.g..Starbucks..Dunkin..),
purchase_local_cafe = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...Local.cafe.),
purchase_specialty_shop = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...Specialty.coffee.shop.),
purchase_deli_supermarket = as.factor(coffee_survey$On.the.go..where.do.you.typically.purchase.coffee...Deli.or.supermarket.),
favorite_drink = as.factor(coffee_survey$What.is.your.favorite.coffee.drink.),
coffee_black = as.factor(coffee_survey$Do.you.usually.add.anything.to.your.coffee...No...just.black.),
coffee_milk_creamer = as.factor(coffee_survey$Do.you.usually.add.anything.to.your.coffee...Milk..dairy.alternative..or.coffee.creamer.),
coffee_sugar_sweetener = as.factor(coffee_survey$Do.you.usually.add.anything.to.your.coffee...Sugar.or.sweetener.),
coffee_strength = as.factor(coffee_survey$How.strong.do.you.like.your.coffee.),
coffee_roast = as.factor(coffee_survey$What.roast.level.of.coffee.do.you.prefer.),
caffeine_level = as.factor(coffee_survey$How.much.caffeine.do.you.like.in.your.coffee.),
drink_for_taste = as.factor(coffee_survey$Why.do.you.drink.coffee...It.tastes.good.),
drink_for_caffeine = as.factor(coffee_survey$Why.do.you.drink.coffee...I.need.the.caffeine.),
like_coffee_taste = as.factor(coffee_survey$Do.you.like.the.taste.of.coffee.),
gender = as.factor(coffee_survey$Gender),
education = as.factor(coffee_survey$Education.Level),
employment_status = as.factor(coffee_survey$Employment.Status),
political_affiliation = as.factor(coffee_survey$Political.Affiliation)
)
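The same selection and renaming can also be written with a lookup vector instead of one line per column. The sketch below is only an illustration on a made-up data frame; `toy_survey`, `rename_map`, and `toy_clean` are hypothetical names that do not appear in the original analysis.

```r
# Toy stand-in for the raw survey (hypothetical values, real column names)
toy_survey <- data.frame(
  What.is.your.age. = c("25-34 years old", "35-44 years old"),
  Gender = c("Male", "Female"),
  Submission.ID = c("a1", "b2"),
  stringsAsFactors = FALSE
)

# new_name = old_name pairs; only the listed columns are kept
rename_map <- c(age = "What.is.your.age.", gender = "Gender")

toy_clean <- toy_survey[, rename_map, drop = FALSE]
names(toy_clean) <- names(rename_map)

# Convert every kept column to a factor, as the tibble() call above does
toy_clean[] <- lapply(toy_clean, as.factor)
str(toy_clean)
```

Keeping the mapping in one vector makes it easy to see at a glance which of the 113 columns survive and under which name.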
Let’s now take a look inside the data:
str(coffee_survey2)
## tibble [4,042 × 29] (S3: tbl_df/tbl/data.frame)
## $ age : Factor w/ 7 levels "<18 years old",..: 3 4 4 5 4 7 3 NA 4 NA ...
## $ cups_daily : Factor w/ 6 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA 5 NA ...
## $ drink_home : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA 1 NA ...
## $ drink_at_office : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA 1 NA ...
## $ drink_cafe : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA 2 NA ...
## $ brew_pour_over : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 2 1 2 NA NA NA ...
## $ brew_french_press : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 1 2 2 NA NA NA ...
## $ brew_coffee_machine : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 2 1 1 1 NA NA NA ...
## $ brew_coffee_pods : Factor w/ 2 levels "FALSE","TRUE": NA 2 1 1 1 2 2 NA NA NA ...
## $ brew_instant_coffee : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 1 1 2 NA NA NA ...
## $ brew_cold_brew : Factor w/ 2 levels "FALSE","TRUE": NA 1 1 1 1 1 1 NA NA NA ...
## $ purchase_national_chain : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA NA NA ...
## $ purchase_local_cafe : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA NA NA ...
## $ purchase_specialty_shop : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 2 NA NA NA ...
## $ purchase_deli_supermarket: Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA 1 NA NA NA ...
## $ favorite_drink : Factor w/ 11 levels "Americano","Blended drink (e.g. Frappuccino)",..: 11 7 11 7 8 7 10 NA 11 NA ...
## $ coffee_black : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 1 2 NA 2 1 ...
## $ coffee_milk_creamer : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 NA 1 1 ...
## $ coffee_sugar_sweetener : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 1 1 1 NA 1 1 ...
## $ coffee_strength : Factor w/ 5 levels "Medium","Somewhat light",..: NA NA NA NA NA NA NA NA NA NA ...
## $ coffee_roast : Factor w/ 7 levels "Blonde","Dark",..: NA NA NA NA NA NA NA NA NA NA ...
## $ caffeine_level : Factor w/ 3 levels "Decaf","Full caffeine",..: NA NA NA NA NA NA NA NA NA NA ...
## $ drink_for_taste : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA NA NA NA NA ...
## $ drink_for_caffeine : Factor w/ 2 levels "FALSE","TRUE": NA NA NA NA NA NA NA NA NA NA ...
## $ like_coffee_taste : Factor w/ 2 levels "No","Yes": NA NA NA NA NA NA NA NA NA NA ...
## $ gender : Factor w/ 5 levels "Female","Male",..: NA NA NA NA NA NA NA NA NA NA ...
## $ education : Factor w/ 6 levels "Bachelor's degree",..: NA NA NA NA NA NA NA NA NA NA ...
## $ employment_status : Factor w/ 6 levels "Employed full-time",..: NA NA NA NA NA NA NA NA NA NA ...
## $ political_affiliation : Factor w/ 4 levels "Democrat","Independent",..: NA NA NA NA NA NA NA NA NA NA ...
Numerous missing values (NA) appear in the dataset, likely because answering certain questions was optional. Rows with missing values are dropped.
coffee_survey2 <- coffee_survey2 %>% drop_na()
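Dropping every incomplete row is a strict filter: according to the two str() outputs in this report, the data shrinks from 4,042 to 541 rows. Before doing so, it can help to check which columns contribute the most missing values. The following base R sketch shows one way to do this; the data frame `toy` is made up for illustration.

```r
# Hypothetical miniature of the survey with missing answers
toy <- data.frame(
  age = c("25-34", NA, "35-44", "25-34"),
  cups_daily = c("2", "1", NA, "3"),
  gender = c("Male", "Female", NA, "Male"),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that survive a drop_na()-style filter (base R equivalent)
toy_complete <- toy[complete.cases(toy), ]
cat("rows kept:", nrow(toy_complete), "of", nrow(toy), "\n")
```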
Next, factor labels are renamed to ensure they contain information about both the variable and its level.
coffee_survey2 <- coffee_survey2 %>%
mutate(
cups_daily = ifelse(cups_daily == "Less than 1", "Less than 1 coffee cup daily",
ifelse(cups_daily == "1", "1 coffee cup daily",
ifelse(cups_daily == "2", "2 coffee cups daily",
ifelse(cups_daily == "3", "3 coffee cups daily",
ifelse(cups_daily == "4", "4 coffee cups daily",
"More than 4 coffee cups daily"))))),
drink_home = ifelse(drink_home == "TRUE", "Drinks coffee at home", "Does not drink coffee at home"), # the condition must test drink_home, not drink_at_office
drink_at_office = ifelse(drink_at_office == "TRUE", "Drinks coffee at the office", "Does not drink coffee at the office"),
drink_cafe = ifelse(drink_cafe == "TRUE", "Drinks coffee at a cafe", "Does not drink coffee at a cafe"),
brew_pour_over = ifelse(brew_pour_over == "TRUE", "Brews coffee using pour-over", "Does not brew coffee using pour-over"),
brew_french_press = ifelse(brew_french_press == "TRUE", "Brews coffee using French press", "Does not brew coffee using French press"),
brew_coffee_machine = ifelse(brew_coffee_machine == "TRUE", "Brews coffee using a coffee machine", "Does not brew coffee using a coffee machine"),
brew_coffee_pods = ifelse(brew_coffee_pods == "TRUE", "Brews coffee using pods", "Does not brew coffee using pods"),
brew_instant_coffee = ifelse(brew_instant_coffee == "TRUE", "Brews coffee using instant coffee", "Does not brew coffee using instant coffee"),
brew_cold_brew = ifelse(brew_cold_brew == "TRUE", "Brews cold brew coffee", "Does not brew cold brew coffee"),
purchase_national_chain = ifelse(purchase_national_chain == "TRUE", "Buys coffee from a national chain (i.e. Starbucks)", "Does not buy coffee from a national chain (i.e. Starbucks)"),
purchase_local_cafe = ifelse(purchase_local_cafe == "TRUE", "Buys coffee from a local cafe", "Does not buy coffee from a local cafe"),
purchase_specialty_shop = ifelse(purchase_specialty_shop == "TRUE", "Buys coffee from a specialty shop", "Does not buy coffee from a specialty shop"),
purchase_deli_supermarket = ifelse(purchase_deli_supermarket == "TRUE", "Buys coffee from a deli or supermarket", "Does not buy coffee from a deli or supermarket"),
coffee_black = ifelse(coffee_black == "TRUE", "Drinks black coffee", "Does not drink black coffee"),
coffee_milk_creamer = ifelse(coffee_milk_creamer == "TRUE", "Adds milk or creamer to coffee", "Does not add milk or creamer to coffee"),
coffee_sugar_sweetener = ifelse(coffee_sugar_sweetener == "TRUE", "Adds sugar or sweetener to coffee", "Does not add sugar or sweetener to coffee"),
drink_for_taste = ifelse(drink_for_taste == "TRUE", "Drinks coffee for the taste", "Does not drink coffee for the taste"),
drink_for_caffeine = ifelse(drink_for_caffeine == "TRUE", "Drinks coffee for the caffeine", "Does not drink coffee for the caffeine"),
like_coffee_taste = ifelse(like_coffee_taste == "Yes", "Likes the taste of coffee", "Does not like the taste of coffee"),
gender = ifelse(gender == "Male", "Male",
ifelse(gender == "Female", "Female", "Other"))
)
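The relabelling above repeats the same ifelse() pattern many times. A minimal sketch of a more compact alternative is shown below, assuming a small helper function and a named lookup vector; `label_flag` and `cups_labels` are hypothetical names introduced only for this example.

```r
# Hypothetical helper: turn a TRUE/FALSE flag into a descriptive item label
label_flag <- function(x, yes, no) {
  ifelse(as.character(x) == "TRUE", yes, no)
}

flags <- factor(c("TRUE", "FALSE", "TRUE"))
home_labels <- label_flag(flags,
                          "Drinks coffee at home",
                          "Does not drink coffee at home")
print(home_labels)

# Multi-level variables can be relabelled with a named lookup vector
cups <- c("Less than 1", "2", "More than 4")
cups_labels <- c(
  "Less than 1" = "Less than 1 coffee cup daily",
  "1" = "1 coffee cup daily",
  "2" = "2 coffee cups daily",
  "3" = "3 coffee cups daily",
  "4" = "4 coffee cups daily",
  "More than 4" = "More than 4 coffee cups daily"
)
cups_relabelled <- unname(cups_labels[cups])
print(cups_relabelled)
```

Passing the column to the helper once, as an argument, also makes it harder to test the wrong column by accident.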
Let’s display the structure of the data once again, to ensure proper formatting:
str(coffee_survey2)
## tibble [541 × 29] (S3: tbl_df/tbl/data.frame)
## $ age : Factor w/ 7 levels "<18 years old",..: 6 4 4 4 4 5 4 5 4 7 ...
## $ cups_daily : Factor w/ 6 levels "1 coffee cup daily",..: 2 2 2 2 1 1 3 1 5 4 ...
## $ drink_home : Factor w/ 2 levels "Does not drink coffee at home",..: 2 2 1 1 2 1 2 1 1 2 ...
## $ drink_at_office : Factor w/ 2 levels "Does not drink coffee at the office",..: 2 2 1 1 2 1 2 1 1 2 ...
## $ drink_cafe : Factor w/ 2 levels "Does not drink coffee at a cafe",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ brew_pour_over : Factor w/ 2 levels "Brews coffee using pour-over",..: 1 1 1 1 2 1 2 2 1 1 ...
## $ brew_french_press : Factor w/ 2 levels "Brews coffee using French press",..: 1 2 1 1 2 2 2 2 2 1 ...
## $ brew_coffee_machine : Factor w/ 2 levels "Brews coffee using a coffee machine",..: 2 2 1 1 2 2 1 2 2 2 ...
## $ brew_coffee_pods : Factor w/ 2 levels "Brews coffee using pods",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ brew_instant_coffee : Factor w/ 2 levels "Brews coffee using instant coffee",..: 2 2 2 2 2 2 2 1 2 2 ...
## $ brew_cold_brew : Factor w/ 2 levels "Brews cold brew coffee",..: 2 2 2 1 2 2 2 2 2 2 ...
## $ purchase_national_chain : Factor w/ 2 levels "Buys coffee from a national chain (i.e. Starbucks)",..: 2 2 1 2 1 2 1 2 1 2 ...
## $ purchase_local_cafe : Factor w/ 2 levels "Buys coffee from a local cafe",..: 1 2 1 1 2 1 1 1 2 1 ...
## $ purchase_specialty_shop : Factor w/ 2 levels "Buys coffee from a specialty shop",..: 1 1 1 2 2 2 2 1 2 1 ...
## $ purchase_deli_supermarket: Factor w/ 2 levels "Buys coffee from a deli or supermarket",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ favorite_drink : Factor w/ 11 levels "Americano","Blended drink (e.g. Frappuccino)",..: 11 3 10 10 6 10 1 8 8 1 ...
## $ coffee_black : Factor w/ 2 levels "Does not drink black coffee",..: 2 2 1 2 2 2 2 1 1 2 ...
## $ coffee_milk_creamer : Factor w/ 2 levels "Adds milk or creamer to coffee",..: 2 2 1 2 2 2 2 1 1 2 ...
## $ coffee_sugar_sweetener : Factor w/ 2 levels "Adds sugar or sweetener to coffee",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ coffee_strength : Factor w/ 5 levels "Medium","Somewhat light",..: 1 3 3 4 4 3 3 3 5 1 ...
## $ coffee_roast : Factor w/ 7 levels "Blonde","Dark",..: 5 5 5 5 6 6 5 2 6 3 ...
## $ caffeine_level : Factor w/ 3 levels "Decaf","Full caffeine",..: 2 2 2 2 2 2 2 3 1 2 ...
## $ drink_for_taste : Factor w/ 2 levels "Does not drink coffee for the taste",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ drink_for_caffeine : Factor w/ 2 levels "Does not drink coffee for the caffeine",..: 1 1 2 1 1 2 2 2 1 1 ...
## $ like_coffee_taste : Factor w/ 2 levels "Does not like the taste of coffee",..: 2 2 2 2 2 2 2 2 1 2 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 1 2 1 1 2 ...
## $ education : Factor w/ 6 levels "Bachelor's degree",..: 5 1 1 5 1 2 1 5 5 6 ...
## $ employment_status : Factor w/ 6 levels "Employed full-time",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ political_affiliation : Factor w/ 4 levels "Democrat","Independent",..: 3 1 1 2 1 3 1 1 3 4 ...
As a last step of preprocessing, the data is saved into a separate .csv file.
write.csv(coffee_survey2, "coffee_survey2.csv", row.names = FALSE)
The final data file is first loaded as a transactions object.
trans<-read.transactions("coffee_survey2.csv", format="basket", sep=",", skip=0)
summary(trans)
## transactions as itemMatrix in sparse format with
## 542 rows (elements/itemsets/transactions) and
## 124 columns (items) and a density of 0.2326211
##
## most frequent items:
## Likes the taste of coffee
## 533
## Drinks coffee for the taste
## 526
## Does not brew coffee using instant coffee
## 513
## Does not buy coffee from a deli or supermarket
## 504
## Full caffeine
## 503
## (Other)
## 13055
##
## element (itemset/transaction) length distribution:
## sizes
## 28 29
## 84 458
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.00 29.00 29.00 28.85 29.00 29.00
##
## includes extended item information - examples:
## labels
## 1 <18 years old
## 2 >65 years old
## 3 1 coffee cup daily
The transactions object contains 542 rows and 124 items describing people’s coffee preferences and habits. This is one row more than the 541 cleaned survey responses, most likely because the header line of the CSV file is read as a transaction when skip=0 is used; skip=1 would exclude it. Each transaction contains 28 or 29 items.
Rare items (relative frequency of 5% or less) are then removed, and the frequency of the remaining items is displayed.
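The difference between the 541 rows of the cleaned tibble and the 542 reported transactions can be checked with a small base R experiment. The sketch below uses a made-up two-row file, not the survey data, and only shows that a CSV written with write.csv() has one line more than the data has rows.

```r
# Hypothetical two-row data set, written the same way as coffee_survey2.csv
toy <- data.frame(gender = c("Male", "Female"),
                  roast = c("Light", "Dark"),
                  stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# The file has one line per row plus the header line
csv_lines <- readLines(path)
cat("data rows:", nrow(toy), "- lines in file:", length(csv_lines), "\n")

# A reader that treats every line as a basket would see the column names
# as the first "transaction" unless the header line is skipped
print(csv_lines[1])
```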
trans<-trans[, itemFrequency(trans)>0.05]
sort(itemFrequency(trans, type="relative"))
## Brews coffee using instant coffee
## 0.05166052
## Less than 1 coffee cup daily
## 0.05166052
## Somewhat light
## 0.05166052
## Half caff
## 0.05350554
## Iced coffee
## 0.05350554
## 45-54 years old
## 0.05719557
## Republican
## 0.05719557
## Espresso
## 0.06088561
## Buys coffee from a deli or supermarket
## 0.06826568
## Cortado
## 0.07564576
## Employed part-time
## 0.07564576
## Cappuccino
## 0.07933579
## Doctorate or professional degree
## 0.07933579
## Americano
## 0.08302583
## Regular drip coffee
## 0.08302583
## Student
## 0.09778598
## Dark
## 0.10332103
## Very strong
## 0.10516605
## 18-24 years old
## 0.11623616
## Brews coffee using pods
## 0.13468635
## 3 coffee cups daily
## 0.14206642
## Some college or associate's degree
## 0.15682657
## Independent
## 0.16051661
## Adds sugar or sweetener to coffee
## 0.17158672
## Latte
## 0.20848708
## Master's degree
## 0.22324723
## No affiliation
## 0.24169742
## 35-44 years old
## 0.24538745
## Brews cold brew coffee
## 0.25276753
## Brews coffee using French press
## 0.25830258
## Pourover
## 0.26014760
## Brews coffee using a coffee machine
## 0.26568266
## 1 coffee cup daily
## 0.28966790
## Female
## 0.28966790
## Does not drink coffee for the caffeine
## 0.29335793
## Does not buy coffee from a specialty shop
## 0.35793358
## Does not brew coffee using pour-over
## 0.36162362
## Does not drink black coffee
## 0.39114391
## Does not drink coffee at home
## 0.42804428
## Does not drink coffee at the office
## 0.42804428
## Does not buy coffee from a local cafe
## 0.44095941
## Light
## 0.44833948
## 2 coffee cups daily
## 0.45202952
## Buys coffee from a national chain (i.e. Starbucks)
## 0.45387454
## Somewhat strong
## 0.48708487
## Bachelor's degree
## 0.48892989
## Does not drink coffee at a cafe
## 0.49446494
## Adds milk or creamer to coffee
## 0.49815498
## Does not add milk or creamer to coffee
## 0.50000000
## Drinks coffee at a cafe
## 0.50369004
## Democrat
## 0.53874539
## 25-34 years old
## 0.54243542
## Does not buy coffee from a national chain (i.e. Starbucks)
## 0.54428044
## Buys coffee from a local cafe
## 0.55719557
## Drinks coffee at home
## 0.57011070
## Drinks coffee at the office
## 0.57011070
## Medium
## 0.59778598
## Drinks black coffee
## 0.60701107
## Brews coffee using pour-over
## 0.63653137
## Buys coffee from a specialty shop
## 0.64022140
## Male
## 0.66051661
## Drinks coffee for the caffeine
## 0.70479705
## Does not brew coffee using a coffee machine
## 0.73247232
## Does not brew coffee using French press
## 0.73985240
## Does not brew cold brew coffee
## 0.74538745
## Employed full-time
## 0.76937269
## Does not add sugar or sweetener to coffee
## 0.82656827
## Does not brew coffee using pods
## 0.86346863
## Full caffeine
## 0.92804428
## Does not buy coffee from a deli or supermarket
## 0.92988930
## Does not brew coffee using instant coffee
## 0.94649446
## Drinks coffee for the taste
## 0.97047970
## Likes the taste of coffee
## 0.98339483
The items “Likes the taste of coffee” and “Drinks coffee for the taste” occur most frequently in the dataset. This is not surprising: the survey was carried out during a coffee tasting event, so it was most likely attended by coffee lovers.
In the next step, rules are created. The goal is to find the profile of people who drink coffee for the caffeine, so the right-hand side of every rule is fixed to “Drinks coffee for the caffeine”. The initial minimum support is 0.1 and the minimum confidence is 0.8.
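The frequency filter used above (keeping items with a relative frequency greater than 0.05) can be illustrated without any package. The following is a minimal sketch on a made-up incidence matrix; `incidence`, `item_freq`, and `kept` are hypothetical names.

```r
# Hypothetical incidence matrix: rows are responses, columns are items
incidence <- matrix(FALSE, nrow = 40, ncol = 3,
                    dimnames = list(NULL, c("Latte", "Espresso", "Weak")))
incidence[1:10, "Latte"] <- TRUE    # 10 of 40 responses
incidence[1:3, "Espresso"] <- TRUE  # 3 of 40 responses
incidence[1, "Weak"] <- TRUE        # 1 of 40 responses

# Relative item frequency is the column mean of the logical matrix
item_freq <- colMeans(incidence)
print(sort(item_freq))

# Keep only items that occur in more than 5% of the responses
kept <- incidence[, item_freq > 0.05, drop = FALSE]
print(colnames(kept))
```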
rules<-apriori(data=trans, parameter=list(supp=0.1, conf=0.8), appearance=list(default="lhs", rhs="Drinks coffee for the caffeine"), control=list(verbose=F))
summary(rules)
## set of 10809 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6 7 8
## 19 256 1148 2694 3641 3051
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 6.000 7.000 6.743 8.000 8.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1015 Min. :0.8000 Min. :0.1107 Min. :1.135
## 1st Qu.:0.1052 1st Qu.:0.8061 1st Qu.:0.1292 1st Qu.:1.144
## Median :0.1125 Median :0.8143 Median :0.1384 Median :1.155
## Mean :0.1188 Mean :0.8207 Mean :0.1450 Mean :1.164
## 3rd Qu.:0.1255 3rd Qu.:0.8286 3rd Qu.:0.1550 3rd Qu.:1.176
## Max. :0.2583 Max. :0.9403 Max. :0.3229 Max. :1.334
## count
## Min. : 55.00
## 1st Qu.: 57.00
## Median : 61.00
## Mean : 64.41
## 3rd Qu.: 68.00
## Max. :140.00
##
## mining info:
## data ntransactions support confidence
## trans 542 0.1 0.8
## call
## apriori(data = trans, parameter = list(supp = 0.1, conf = 0.8), appearance = list(default = "lhs", rhs = "Drinks coffee for the caffeine"), control = list(verbose = F))
Initially, 10,809 rules are found. Most of them contain a relatively high number of items, between 6 and 8. To reduce the number of rules, redundant, insignificant, and non-maximal rules are removed. The minimum confidence is also raised to 0.87.
rules<-apriori(data=trans, parameter=list(supp=0.1, conf=0.87), appearance=list(default="lhs", rhs="Drinks coffee for the caffeine"), control=list(verbose=F))
rules.clean<-rules[!is.redundant(rules)]
rules.clean<-rules.clean[is.significant(rules.clean, trans)]
rules.clean<-rules.clean[is.maximal(rules.clean)]
summary(rules.clean)
## set of 31 rules
##
## rule length distribution (lhs + rhs):sizes
## 4 5 6 7 8
## 2 9 9 9 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 5 6 6 7 8
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1015 Min. :0.8701 Min. :0.1107 Min. :1.235
## 1st Qu.:0.1015 1st Qu.:0.8731 1st Qu.:0.1162 1st Qu.:1.239
## Median :0.1052 Median :0.8806 Median :0.1181 Median :1.249
## Mean :0.1079 Mean :0.8859 Mean :0.1218 Mean :1.257
## 3rd Qu.:0.1116 3rd Qu.:0.8898 3rd Qu.:0.1236 3rd Qu.:1.262
## Max. :0.1255 Max. :0.9403 Max. :0.1439 Max. :1.334
## count
## Min. :55.00
## 1st Qu.:55.00
## Median :57.00
## Mean :58.48
## 3rd Qu.:60.50
## Max. :68.00
##
## mining info:
## data ntransactions support confidence
## trans 542 0.1 0.87
## call
## apriori(data = trans, parameter = list(supp = 0.1, conf = 0.87), appearance = list(default = "lhs", rhs = "Drinks coffee for the caffeine"), control = list(verbose = F))
The number of rules is reduced to 31.
The interactive table below displays the obtained results.
inspectDT(rules.clean)
Next, the rules with the highest confidence are analyzed to understand which factors are associated with drinking coffee for the caffeine.
rules.clean.byconf<-sort(rules.clean, by="confidence", decreasing=TRUE)
inspect(head(rules.clean.byconf))
## lhs rhs support confidence coverage lift count
## [1] {25-34 years old,
## Buys coffee from a national chain (i.e. Starbucks),
## Drinks coffee at home,
## Employed full-time} => {Drinks coffee for the caffeine} 0.1162362 0.9402985 0.1236162 1.334141 63
## [2] {25-34 years old,
## Buys coffee from a national chain (i.e. Starbucks),
## Drinks coffee at the office,
## Employed full-time} => {Drinks coffee for the caffeine} 0.1162362 0.9402985 0.1236162 1.334141 63
## [3] {Does not buy coffee from a deli or supermarket,
## Does not buy coffee from a specialty shop,
## Female,
## Full caffeine} => {Drinks coffee for the caffeine} 0.1014760 0.9166667 0.1107011 1.300611 55
## [4] {Does not brew coffee using instant coffee,
## Does not buy coffee from a specialty shop,
## Female,
## Full caffeine} => {Drinks coffee for the caffeine} 0.1051661 0.9047619 0.1162362 1.283720 57
## [5] {2 coffee cups daily,
## Does not brew coffee using French press,
## Does not brew coffee using instant coffee,
## Does not buy coffee from a deli or supermarket,
## Does not drink coffee at a cafe,
## Drinks coffee for the taste,
## Full caffeine} => {Drinks coffee for the caffeine} 0.1254613 0.8947368 0.1402214 1.269496 68
## [6] {25-34 years old,
## Adds milk or creamer to coffee,
## Does not brew coffee using instant coffee,
## Drinks coffee at home,
## Employed full-time,
## Full caffeine} => {Drinks coffee for the caffeine} 0.1070111 0.8923077 0.1199262 1.266049 58
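The highest-confidence rules suggest that respondents who drink coffee for the caffeine are typically 25-34 years old, employed full-time, and buy coffee from a national chain (such as Starbucks). A second group of rules describes women who prefer full-caffeine coffee and do not buy from specialty coffee shops. Drinking 2 cups per day also appears among the antecedents, which, fortunately, is not above the recommended norm. These rules describe associations rather than causes, and because about 70% of all respondents drink coffee for the caffeine, the lift of the top rules stays modest (about 1.27 to 1.33).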
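The quality measures reported for each rule are linked by simple arithmetic. As a worked example, the sketch below recomputes support, confidence, and lift for rule [1] from the numbers printed in this report (542 transactions, a count of 63, a coverage of 0.1236162, and a relative frequency of 0.70479705 for “Drinks coffee for the caffeine”). The variable names are chosen only for this illustration.

```r
# Numbers taken from the outputs shown in this report
n_total   <- 542         # number of transactions
n_lhs_rhs <- 63          # "count" of rule [1]: transactions matching LHS and RHS
coverage  <- 0.1236162   # support of the LHS alone, as reported for rule [1]
supp_rhs  <- 0.70479705  # relative frequency of the RHS item

support    <- n_lhs_rhs / n_total   # share of transactions matching the whole rule
confidence <- support / coverage    # share of LHS transactions that also match the RHS
lift       <- confidence / supp_rhs # confidence relative to the RHS base rate

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

A lift above 1 means the rule's antecedent raises the probability of the consequent relative to its base rate in the whole sample.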
The second part of the project focuses on sequential pattern mining and on finding patterns in sequential orders from a coffee vending machine. Analyzing events in a sequence means adding a time component to the analysis. Rather than just analyzing a basket of coffee orders, the cSPADE algorithm makes it possible to determine whether a past drink choice is followed by a particular future purchase. Thus, it allows for modeling the evolution of the customer’s choices.
The second dataset was downloaded from Kaggle and contains coffee sales from a vending machine. The following columns will be used for sequential pattern mining: card, datetime, and coffee_name.
These variables are sufficient to perform the analysis. The card number can be treated as a customer ID, allowing for unique identification of customers. The datetime column allows for ordering purchases in the form of a sequence. Lastly, coffee_name identifies the products the customer has purchased.
As a first step, the data is loaded and rows with missing values are removed.
online_retail <- read.csv("index.csv")
online_retail <- online_retail[complete.cases(online_retail), ]
str(online_retail)
## 'data.frame': 2838 obs. of 6 variables:
## $ date : chr "2024-03-01" "2024-03-01" "2024-03-01" "2024-03-01" ...
## $ datetime : chr "2024-03-01 10:15:50.520" "2024-03-01 12:19:22.539" "2024-03-01 12:20:18.089" "2024-03-01 13:46:33.006" ...
## $ cash_type : chr "card" "card" "card" "card" ...
## $ card : chr "ANON-0000-0000-0001" "ANON-0000-0000-0002" "ANON-0000-0000-0002" "ANON-0000-0000-0003" ...
## $ money : num 38.7 38.7 38.7 28.9 38.7 33.8 38.7 33.8 38.7 33.8 ...
## $ coffee_name: chr "Latte" "Hot Chocolate" "Hot Chocolate" "Americano" ...
Next, the data is transformed into the format required by the arulesSequences package. To use cSPADE, the data needs to be converted into a transactions object that contains the following attributes: items, sequenceID, and eventID.
Repeated purchases of the same product by the same customer are removed, as keeping these records would not return any meaningful results. Only the first purchase of each product by each customer is retained.
online_retail_seq <- online_retail %>%
group_by(card) %>%
arrange(datetime) %>%
distinct(card, coffee_name, .keep_all = TRUE)
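The effect of this de-duplication step can be reproduced in base R on a few made-up purchases. The sketch below is an illustration only; the data frame `purchases` is hypothetical and `duplicated()` stands in for the distinct() call above.

```r
# Hypothetical purchases for two cards
purchases <- data.frame(
  card = c("A", "A", "A", "B", "B"),
  datetime = c("2024-03-01 10:00:00", "2024-03-02 09:00:00",
               "2024-03-03 11:00:00", "2024-03-01 12:00:00",
               "2024-03-04 08:00:00"),
  coffee_name = c("Latte", "Latte", "Cappuccino", "Cocoa", "Cocoa"),
  stringsAsFactors = FALSE
)

# Sort chronologically, then keep the first row of each card/product pair
purchases <- purchases[order(purchases$datetime), ]
first_only <- purchases[!duplicated(purchases[, c("card", "coffee_name")]), ]
print(first_only)
```

One consequence of this step is that a sequence in which the same drink is bought twice, such as Latte followed by Latte, can no longer appear in the mined patterns.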
Next, eventID and transactionID are added. Columns card and coffee_name are renamed to sequenceID and items, respectively.
# adding event IDs
online_retail_seq <- online_retail_seq %>%
group_by(card) %>%
arrange(datetime) %>%
mutate(eventID = row_number()) %>%
ungroup()
# adding transaction IDs
online_retail_seq <- online_retail_seq %>%
arrange(card, datetime) %>%
mutate(transactionID = row_number()) %>%
ungroup()
# renaming columns
online_retail_seq <- online_retail_seq %>% rename(items = coffee_name)
online_retail_seq <- online_retail_seq %>% rename(sequenceID = card)
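The role of eventID is to number each customer's purchases in chronological order, starting again at 1 for every customer. A base R sketch of the same idea, on made-up data and with `ave()` standing in for the grouped row_number() call, could look like this:

```r
# Hypothetical de-duplicated purchases, already sorted by customer and time
seq_data <- data.frame(
  sequenceID = c("A", "A", "B", "B", "B"),
  items = c("Latte", "Cappuccino", "Cocoa", "Latte", "Espresso"),
  stringsAsFactors = FALSE
)

# eventID restarts at 1 for every customer and counts purchases in order
seq_data$eventID <- ave(seq_along(seq_data$sequenceID),
                        seq_data$sequenceID,
                        FUN = seq_along)
print(seq_data)
```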
The columns needed for sequential pattern mining are saved as a separate data frame. The columns are also converted into factors, as required by the arulesSequences package.
sessions <- online_retail_seq[, c("sequenceID", "items", "eventID")] # select by name rather than by position
# converting data to factors
sessions$sequenceID <- as.factor(sessions$sequenceID)
sessions$eventID <- as.factor(sessions$eventID)
sessions$items <- as.factor(sessions$items)
Now, the data can be converted into a transactions object.
transactions <- as(sessions %>% transmute(items = items), "transactions")
transactionInfo(transactions)$sequenceID <- sessions$sequenceID
transactionInfo(transactions)$eventID <- sessions$eventID
itemLabels(transactions) <- str_replace_all(itemLabels(transactions), "items=", "")
Let’s run the cSPADE algorithm. The minimum support is set to 0.02, which restricts the results to sequences that occur for at least 2% of customers.
itemsets <- cspade(transactions,
parameter = list(support = 0.02),
control = list(verbose = FALSE))
inspect(itemsets)
## items support
## 1 <{Americano}> 0.16310160
## 2 <{Americano with Milk}> 0.27985740
## 3 <{Cappuccino}> 0.21836007
## 4 <{Cocoa}> 0.08823529
## 5 <{Cortado}> 0.09001783
## 6 <{Espresso}> 0.06506239
## 7 <{Hot Chocolate}> 0.13279857
## 8 <{Latte}> 0.30748663
## 9 <{Americano with Milk},
## {Latte}> 0.02406417
## 10 <{Latte},
## {Cocoa}> 0.02049911
## 11 <{Americano with Milk},
## {Cappuccino}> 0.03030303
## 12 <{Latte},
## {Cappuccino}> 0.03208556
## 13 <{Latte},
## {Americano with Milk}> 0.02228164
##
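All sequences with the highest support are one-item sequences. Latte is the most popular choice, purchased by over 30% of customers. Several two-item sequences are also found. For example, about 3.2% of customers bought a Latte and later a Cappuccino, and about 3.0% bought an Americano with Milk followed by a Cappuccino.
Because no two-item sequence reaches a support above roughly 3%, it seems that customers are unlikely to modify their original choice of drink.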
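To make the meaning of sequence support concrete, the sketch below computes it by hand for a single two-item sequence. It assumes that support is the share of customers whose purchase history contains the first drink at some point before the second one. The purchase histories and the helper `contains_sequence` are made up for illustration and are not part of the cSPADE implementation.

```r
# Hypothetical ordered purchase histories, one vector per customer
histories <- list(
  c1 = c("Latte", "Cappuccino"),
  c2 = c("Cappuccino", "Latte"),
  c3 = c("Latte", "Cocoa", "Cappuccino"),
  c4 = c("Americano")
)

# TRUE if `first` occurs somewhere before `second` in the history
contains_sequence <- function(history, first, second) {
  pos_first <- which(history == first)
  pos_second <- which(history == second)
  length(pos_first) > 0 && length(pos_second) > 0 &&
    min(pos_first) < max(pos_second)
}

hits <- vapply(histories, contains_sequence, logical(1),
               first = "Latte", second = "Cappuccino")

# Support of <{Latte}, {Cappuccino}>: share of customers with a match
sequence_support <- mean(hits)
print(sequence_support)
```

Note that order matters: customer c2 bought both drinks but in the opposite order, so that history does not count towards the support of this sequence.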