This notebook uses Starbucks Customer Data and corresponding Task from Kaggle.
portfolio.csv : Information about the promotional offers that are possible to receive, and basic information about each one including the promotional type, duration of the promotion, reward, and how the promotion was distributed to customers
profile.csv : Dimensional data about each person, including their age, salary, and gender. There is one unique customer for each record.
transcript.csv : Records show the different steps of promotional offers that a customer received. The different values of receiving a promotion are receiving, viewing, and completing. You also see the different transactions that a person made in the time since he became a customer. With all records, you see the day that they interacted with Starbucks and the amount that it is worth.
portfolio.csv
| variable | description |
|---|---|
| # | id |
| reward | Reward given for completing an offer |
| channels | (list of strings) |
| difficulty | Minimum required spend to complete an offer |
| duration | Unknown |
| offer_type | Type of offer ie BOGO, discount, informational |
| id | Offer id |
profile.csv
| variable | description |
|---|---|
| # | id |
| gender | Gender of the customer (note some entries contain ‘O’ for other rather than M or F) |
| age | Age of the customer |
| id | Customer id |
| became_member_on | Date when customer created an app account |
| income | Customer’s income |
transcript.csv
| variable | description |
|---|---|
| # | id |
| person | customer id |
| event | record description (ie transaction, offer received, offer viewed, etc.) |
| value | (dict of strings) - either an offer id or transaction amount depending on the record |
| time | time in hours. The data begins at time t=0 |
Overview of Starbucks population
* What is gender distribution?
* What is the income distribution?
* When do people typically become a member?
* What is the average purchase distribution?
Reactions to different promotions
* Do people react to different promotions differently?
* Does the reward of the promotion make people react differently?
* Does it make sense to offer certain rewards?
* Would Starbucks save money overall if they offer a certain reward?
Clustering
* How many clusters should Starbucks use?
library(tidyverse)
library(lubridate)
library(patchwork)
library(colorspace)
library(scales)
library(ggstatsplot)
library(plotly)
library(factoextra)
library(NbClust)
library(dendextend)
library(ggdendro)
theme_set(theme_bw(base_size=10))
portfolio = read.csv("Starbucks Customer Data/portfolio.csv", stringsAsFactors = T)
profile = read.csv("Starbucks Customer Data/profile.csv",stringsAsFactors = T)
transcript = read.csv("Starbucks Customer Data/transcript.csv",stringsAsFactors = T)
str(portfolio)
'data.frame': 10 obs. of 7 variables:
$ X : int 0 1 2 3 4 5 6 7 8 9
$ reward : int 10 10 0 5 5 3 2 0 5 2
$ channels : Factor w/ 4 levels "['email', 'mobile', 'social']",..: 1 2 3 3 4 2 2 1 2 3
$ difficulty: int 10 10 0 5 20 7 10 0 5 10
$ duration : int 7 5 4 7 10 7 10 3 5 7
$ offer_type: Factor w/ 3 levels "bogo","discount",..: 1 1 3 1 2 2 2 3 1 2
$ id : Factor w/ 10 levels "0b1e1539f2cc45b7b9fa7c272da2e1d7",..: 8 5 4 7 1 2 10 6 9 3
profile$became_member_on = ymd(profile$became_member_on)
str(profile)
'data.frame': 17000 obs. of 6 variables:
$ X : int 0 1 2 3 4 5 6 7 8 9 ...
$ gender : Factor w/ 4 levels "","F","M","O": 1 2 1 2 1 3 1 1 3 1 ...
$ age : int 118 55 118 75 118 68 118 118 65 118 ...
$ id : Factor w/ 17000 levels "0009655768c64bdeb2e877511632db8f",..: 6962 399 3747 7997 10736 15044 9525 6940 3729 9147 ...
$ became_member_on: Date, format: "2017-02-12" "2017-07-15" "2018-07-12" ...
$ income : num NA 112000 NA 100000 NA 70000 NA NA 53000 NA ...
str(transcript)
'data.frame': 306534 obs. of 5 variables:
$ X : int 0 1 2 3 4 5 6 7 8 9 ...
$ person: Factor w/ 17000 levels "0009655768c64bdeb2e877511632db8f",..: 7997 10736 15044 9525 6940 3729 13109 3060 11411 3275 ...
$ event : Factor w/ 4 levels "offer completed",..: 2 2 2 2 2 2 2 2 2 2 ...
$ value : Factor w/ 5121 levels "{'amount': 0.05}",..: 5110 5104 5106 5113 5108 5112 5105 5107 5104 5104 ...
$ time : int 0 0 0 0 0 0 0 0 0 0 ...
summary(portfolio)
X reward channels difficulty duration
Min. :0.00 Min. : 0.0 ['email', 'mobile', 'social'] :2 Min. : 0.0 Min. : 3.0
1st Qu.:2.25 1st Qu.: 2.0 ['web', 'email', 'mobile', 'social']:4 1st Qu.: 5.0 1st Qu.: 5.0
Median :4.50 Median : 4.0 ['web', 'email', 'mobile'] :3 Median : 8.5 Median : 7.0
Mean :4.50 Mean : 4.2 ['web', 'email'] :1 Mean : 7.7 Mean : 6.5
3rd Qu.:6.75 3rd Qu.: 5.0 3rd Qu.:10.0 3rd Qu.: 7.0
Max. :9.00 Max. :10.0 Max. :20.0 Max. :10.0
offer_type id
bogo :4 0b1e1539f2cc45b7b9fa7c272da2e1d7:1
discount :4 2298d6c36e964ae4a3e7e9706d1fb8c2:1
informational:2 2906b810c7d4411798c6938adc9daaa5:1
3f207df678b143eea3cee63160fa8bed:1
4d5c57ea9a6940dd891ad53e9dbe8da0:1
5a8bc65990b245e5a138643cd4eb9837:1
(Other) :4
summary(profile)
X gender age id became_member_on
Min. : 0 :2175 Min. : 18.00 0009655768c64bdeb2e877511632db8f: 1 Min. :2013-07-29
1st Qu.: 4250 F:6129 1st Qu.: 45.00 00116118485d4dfda04fdbaba9a87b5c: 1 1st Qu.:2016-05-26
Median : 8500 M:8484 Median : 58.00 0011e0d4e6b944f998e987f904e8c1e5: 1 Median :2017-08-02
Mean : 8500 O: 212 Mean : 62.53 0020c2b971eb4e9188eac86d93036a77: 1 Mean :2017-02-23
3rd Qu.:12749 3rd Qu.: 73.00 0020ccbbb6d84e358d3414a3ff76cffd: 1 3rd Qu.:2017-12-30
Max. :16999 Max. :118.00 003d66b6608740288d6cc97a6903f4f0: 1 Max. :2018-07-26
(Other) :16994
income
Min. : 30000
1st Qu.: 49000
Median : 64000
Mean : 65405
3rd Qu.: 80000
Max. :120000
NA's :2175
summary(transcript)
X person event
Min. : 0 94de646f7b6041228ca7dec82adb97d2: 51 offer completed: 33579
1st Qu.: 76633 8dbfa485249f409aa223a2130f40634a: 49 offer received : 76277
Median :153266 5e60c6aa3b834e44b822ea43a3efea26: 48 offer viewed : 57725
Mean :153266 79d9d4f86aca4bed9290350fb43817c2: 48 transaction :138953
3rd Qu.:229900 d0a80415b84c4df4908b8403b19765e3: 48
Max. :306533 28681c16026943e68f26feaccab0907f: 46
(Other) :306244
value time
{'offer id': '2298d6c36e964ae4a3e7e9706d1fb8c2'}: 14983 Min. : 0.0
{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}: 14924 1st Qu.:186.0
{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}: 14891 Median :408.0
{'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'}: 14835 Mean :366.4
{'offer id': 'ae264e3637204a6fb9bb56bc8210ddfd'}: 14374 3rd Qu.:528.0
{'offer id': '5a8bc65990b245e5a138643cd4eb9837'}: 14305 Max. :714.0
(Other) :218222
# profile data
profile %>% summarise(across(everything(), ~mean(!is.na(.)))) %>%
gather() %>%
mutate(key= fct_reorder(key, value))
# portfolio data
portfolio %>% summarise(across(everything(), ~mean(!is.na(.)))) %>%
gather() %>%
mutate(key= fct_reorder(key, value))
# transcript data
transcript %>% summarise(across(everything(), ~mean(!is.na(.)))) %>%
gather() %>%
mutate(key= fct_reorder(key, value))
dim(profile)
[1] 17000 6
n_distinct(profile$id)
[1] 17000
p1 = profile %>% ggplot(aes(x=age)) + geom_boxplot(color="#2B3A67") + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p2 = profile %>% ggplot(aes(x=income)) + geom_boxplot(color="#66999B") + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p3 = profile %>% group_by(became_member_on) %>% tally() %>%
ggplot(aes(x=became_member_on, y=n)) +
geom_line(alpha=0.9, color="#496A81")
p4 = profile %>% ggplot(aes(x=gender)) + geom_bar(alpha=0.9, fill="#B3AF8F")
(p1 | p3) / (p2 | p4) + plot_annotation(subtitle="Profile")
# clean profile data
profile %>%
filter(age<=100) %>% #drop obs with age >100
summary() #no missing values in gender and income
X gender age id
Min. : 1 : 0 Min. : 18.00 0009655768c64bdeb2e877511632db8f: 1
1st Qu.: 4274 F:6124 1st Qu.: 42.00 0011e0d4e6b944f998e987f904e8c1e5: 1
Median : 8490 M:8484 Median : 55.00 0020c2b971eb4e9188eac86d93036a77: 1
Mean : 8496 O: 212 Mean : 54.38 0020ccbbb6d84e358d3414a3ff76cffd: 1
3rd Qu.:12729 3rd Qu.: 66.00 003d66b6608740288d6cc97a6903f4f0: 1
Max. :16999 Max. :100.00 00426fe3ffde4c6b9cb9ad6d077a13ea: 1
(Other) :14814
became_member_on income
Min. :2013-07-29 Min. : 30000
1st Qu.:2016-05-20 1st Qu.: 49000
Median :2017-08-02 Median : 64000
Mean :2017-02-18 Mean : 65404
3rd Qu.:2017-12-30 3rd Qu.: 80000
Max. :2018-07-26 Max. :120000
# get components of become_member_on date
profile_cln = profile %>%
filter(age<=100) %>%
mutate(year= year(became_member_on),
month= month(became_member_on),
day= day(became_member_on),
wday = wday(became_member_on, label=TRUE))
p4 = profile_cln %>%
group_by(gender) %>% tally() %>%
mutate(prop=round(n/sum(n)*100,1),
gender_long=recode(gender,"F"="Female", "M"="Male","O"="Other")) %>%
mutate(perc = paste0("(",prop,"%",")")) %>%
ggplot(aes(x=gender_long, y=n, fill=gender_long)) +
geom_col(show.legend = F) +
geom_text(aes(label=paste(n, perc)),vjust=-0.5, size=2.7) +
scale_fill_manual(values=c("#496A81","#FFC482","#B3AF8F")) +
labs(y="count", subtitle="Gender distribution", x="gender")
p5 = profile_cln %>%
ggplot(aes(x=income)) +
geom_histogram(binwidth = 1000, fill="#89b0ae", alpha=0.9) +
geom_vline(aes(xintercept=mean(income)),
linetype="dashed", size=1) +
labs(subtitle="Income distribution")
p6 = profile_cln %>%
ggplot(aes(x=income)) +
geom_boxplot(color="#15616d") +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
panel.grid.minor=element_blank(),
panel.grid.major.y=element_blank())
p4 | (p5/p6) + plot_layout(nrow = 2, heights = c(4, 1))
# became_member_on
p7 = profile_cln %>%
group_by(year, month) %>%
summarise(n=n_distinct(id)) %>%
mutate(year_month=paste0(year,'-',month,'-',"1")) %>%
mutate(date2 = ymd(year_month)) %>%
ggplot(aes(x=date2, y=n, fill=n)) +
geom_col(alpha=0.9, show.legend=F) +
scale_fill_continuous_sequential(palette="batlow") +
labs(y="count", fill="", x="date")
p8 = profile_cln %>% ggplot(aes(y=forcats::fct_rev(fct_infreq(factor(year))))) + geom_bar(fill="#2B3A67") + labs(y="year")
p9 = profile_cln %>% ggplot(aes(y=forcats::fct_rev(fct_infreq(factor(month))))) + geom_bar(fill="#496A81") + labs(y="month")
p10 = profile_cln %>% ggplot(aes(y=forcats::fct_rev(fct_infreq(factor(wday))))) + geom_bar(fill="#66999B") + labs(y="day of the week")
p7 / (p8 + p9 + p10) + plot_annotation(subtitle="became_member_on")
# age
p11= profile_cln %>%
ggplot(aes(x=age)) +
geom_histogram(binwidth = 1, fill="#89b0ae", alpha=0.9) +
geom_vline(aes(xintercept=mean(age)),
linetype="dashed", size=1) +
labs(subtitle="Age distribution")
p12= profile_cln %>%
ggplot(aes(x=age)) +
geom_boxplot(color="#15616d") +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
panel.grid.minor=element_blank(),
panel.grid.major.y=element_blank())
# age group
profile_cln$age_group = cut(profile_cln$age,
breaks = c(0, 20, 40, 60, 80, 100),
labels = c("0-20", "21-40", "41-60","61-80","81-100")) #recode to categorical
p13= profile_cln %>%
group_by(age_group) %>% tally() %>% mutate(prop=round(n/sum(n)*100,1)) %>%
mutate(perc = paste0("(",prop,"%",")")) %>%
ggplot(aes(x=age_group, y=n, fill=age_group)) +
geom_col(show.legend = F) +
geom_text(aes(label=paste(n, perc)),vjust=-0.5, size=2.7) +
scale_fill_manual(values=c("#2B3A67","#496A81","#66999B", "#B3AF8F","#FFC482")) +
labs(y="count", subtitle="Age group")
(p11/p12)+ plot_layout(nrow = 2, heights = c(4, 1)) | p13
NA
p16a =portfolio %>% ggplot(aes(y=channels)) + geom_bar(fill="#FFC482")
p16b =portfolio %>% ggplot(aes(x=duration)) + geom_bar(fill="#496A81")
p16c =portfolio %>% ggplot(aes(x=offer_type)) + geom_bar(fill="#2B3A67")
p16d =portfolio %>% ggplot(aes(x=difficulty)) + geom_boxplot(fill="#66999B", alpha=0.9, outlier.colour = "red")
p16e =portfolio %>% ggplot(aes(x=reward)) + geom_bar(fill="#B3AF8F")
(p16c | p16a) /
(p16d | p16e | p16b) + plot_annotation(subtitle="Portfolio")
# channel
p17 = portfolio %>%
mutate(email= str_detect(channels, "email"),
mobile= str_detect(channels, "mobile"),
social= str_detect(channels, "social"),
web= str_detect(channels, "web")) %>%
pivot_longer(email:web) %>%
group_by(name, value) %>% tally() %>%
ggplot(aes(y=fct_rev(name), x=n, fill=(factor(value)))) +
geom_col() +
scale_fill_manual(values=c("grey","#496A81")) +
scale_x_continuous(breaks=seq(0,10,2)) +
labs(x= "count of promotion", y="channel component", fill="value")
# channel and duration
p18 = portfolio %>% ggplot(aes(x=factor(duration), label=channels)) +
geom_bar(aes(fill=channels), width=.8, position=position_dodge2(width=0.3,preserve = "single")) +
scale_fill_manual(values=c("#2B3A67","#66999B", "#B3AF8F","#FFC482")) +
labs(x="duration", y= "count of promotion", subtitle="Duration and channel combination")
(p17/ p18)
n_distinct(transcript$person)
[1] 17000
n_distinct(transcript$value)
[1] 5121
n_distinct(transcript$time)
[1] 120
summary(transcript)
X person event
Min. : 0 94de646f7b6041228ca7dec82adb97d2: 51 offer completed: 33579
1st Qu.: 76633 8dbfa485249f409aa223a2130f40634a: 49 offer received : 76277
Median :153266 5e60c6aa3b834e44b822ea43a3efea26: 48 offer viewed : 57725
Mean :153266 79d9d4f86aca4bed9290350fb43817c2: 48 transaction :138953
3rd Qu.:229900 d0a80415b84c4df4908b8403b19765e3: 48
Max. :306533 28681c16026943e68f26feaccab0907f: 46
(Other) :306244
value time
{'offer id': '2298d6c36e964ae4a3e7e9706d1fb8c2'}: 14983 Min. : 0.0
{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}: 14924 1st Qu.:186.0
{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}: 14891 Median :408.0
{'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'}: 14835 Mean :366.4
{'offer id': 'ae264e3637204a6fb9bb56bc8210ddfd'}: 14374 3rd Qu.:528.0
{'offer id': '5a8bc65990b245e5a138643cd4eb9837'}: 14305 Max. :714.0
(Other) :218222
transcript %>% group_by(person) %>% tally(sort=T) %>% top_n(5)
transcript %>% group_by(person) %>% tally() %>% arrange(n) %>% slice(1:5)
# event type
p14 = transcript %>%
group_by(event) %>% tally() %>% mutate(prop=round(n/sum(n)*100,1)) %>%
mutate(perc = paste0("(",prop,"%",")")) %>%
ggplot(aes(x=event, y=n/1000, fill=event)) +
geom_col(show.legend = F) +
geom_text(aes(label=paste(n, perc)),vjust=-0.5, size=2.7) +
scale_fill_manual(values=c("#2B3A67","#496A81","#66999B", "#B3AF8F","#FFC482")) +
labs(y="count", subtitle="Event distribution") +
scale_y_continuous(labels=unit_format(unit = "K", sep = ""))
# time intervals
transcript$time_group = cut(transcript$time,
breaks = c(0,100,200,300,400,500,600,700,800),
labels = c("0-100", "101-200", "201-300","301-400",
"401-500","501-600","601-700","701-800"))
p15 = transcript %>%
group_by(time_group) %>% tally() %>% mutate(prop=round(n/sum(n)*100,1)) %>%
mutate(perc = paste0("(",prop,"%",")")) %>%
ggplot(aes(x=time_group, y=n/1000, fill=I(ifelse(n==max(n),"#FFC482","slategrey")))) +
geom_col(show.legend = F) +
geom_text(aes(label=paste(n, perc)),vjust=-0.5, size=2.7) +
labs(y="count", x="time_group (in days)",subtitle="Time group") +
scale_y_continuous(labels=unit_format(unit = "K", sep = ""))
# record count per person
p16 = transcript %>% group_by(person) %>% tally(sort=T) %>%
ggplot(aes(x=n)) +
geom_histogram(binwidth=1,fill="#89b0ae", alpha=0.9) +
geom_vline(aes(xintercept=mean(n)),
linetype="dashed", size=1) +
labs(subtitle="Records count/person distribution", x="record_count")
p17 = transcript %>% group_by(person) %>% tally(sort=T) %>%
ggplot(aes(x=n)) +
geom_boxplot(color="#15616d") +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
panel.grid.minor=element_blank(),
panel.grid.major.y=element_blank()) +
labs(x="record_count")
(p16/p17)+ plot_layout(nrow = 2, heights = c(4, 1)) | p14
p15
summary(transcript$event)
offer completed offer received offer viewed transaction
33579 76277 57725 138953
# subsets
viewed = transcript %>% filter(event=="offer viewed")
completed = transcript %>% filter(event=="offer completed")
received = transcript %>% filter(event=="offer received")
# received and viewed offer
received %>% mutate(received_viewed=ifelse(received$person %in% viewed$person, "yes","no")) %>%
group_by(received_viewed) %>% tally() %>% mutate("%"=round(n/sum(n)*100,1))
# viewed and completed offer
viewed %>% mutate(viewed_completed=ifelse(viewed$person %in% completed$person, "yes","no")) %>%
group_by(viewed_completed) %>% tally() %>% mutate("%"=round(n/sum(n)*100,1))
# received and completed offer
received %>% mutate(recieved_completed=ifelse(received$person %in% completed$person, "yes","no")) %>%
group_by(recieved_completed) %>% tally() %>% mutate("%"=round(n/sum(n)*100,1))
# prepare df for joining
df1 = transcript %>% mutate(promo_id= substr(value,15,46)) %>% select(-X) #extract promo string
df2 = portfolio %>% mutate(promo_id=id, promo_ref=paste("Promo",X)) %>% select(-X, -id) #get promo ref
df3 = profile_cln %>% rename(person=id) %>% select(-X)
dim(df1)
[1] 306534 5
dim(df2)
[1] 10 7
dim(df3)
[1] 14820 9
df4 = df1 %>% left_join(df2, by="promo_id")
dim(df4) # check for obs number
[1] 306534 11
df4 %>% filter(event!="transaction") %>% group_by(promo_id, promo_ref) %>% tally(sort=T) %>% ungroup() %>%
mutate(promo_ref2=paste0("(",promo_ref,")"),
promo_long=paste(promo_id, promo_ref2),
prop=round(n/sum(n)*100,1),
perc = paste0("(",prop,"%",")")) %>%
ggplot(aes(y=reorder(promo_long,n), x=n, fill=I(ifelse(n==max(n),"#496A81","grey")))) +
geom_col() +
geom_text(aes(label=paste(n, perc), color=I(ifelse(n==max(n),"white","black"))),hjust=1.2, size=3) +
scale_y_discrete(labels = function(x) str_wrap(x, width = 10)) +
theme(plot.title.position = "plot") +
scale_x_continuous(expand=c(0,500)) +
labs(y="promotion_id", x="count", subtitle="Count plot of promotion ID in transcript data",fill="count")
# offer response by promo_id
df4 %>% filter(event!="transaction") %>%
group_by(promo_ref, event) %>% tally() %>%
ggplot(aes(y=fct_rev(promo_ref), x=n, color=event)) +
geom_line(aes(group=promo_ref), color="grey") +
geom_point(size=2.5, alpha=0.9) +
theme(panel.grid.minor=element_blank()) +
labs(y="promotion reference", x="count", subtitle= "Offer response") +
scale_color_manual(values=c("#023e8a","#2a9d8f","#ffb703"))
# percentage
df4 %>% filter(event!="transaction") %>%
group_by(promo_ref, event) %>% tally() %>% ungroup() %>%
pivot_wider(names_from=event, values_from=n) %>% #pivot wider to calculate perc
mutate(viewed_recieved = round(`offer viewed`/`offer received`,3),
completed_recieved = round(`offer completed`/`offer received`,3),
completed_viewed = round(`offer completed`/`offer viewed`,3)) %>%
select(-`offer completed`,-`offer received`,-`offer viewed`) %>%
pivot_longer(cols=!promo_ref) -> df4_table
data_line = df4_table %>% group_by(name) %>% summarise(mean_x=mean(value,na.rm=T))
data_line
df4_table %>%
ggplot(aes(y=fct_rev(promo_ref), x=value, fill=name)) +
geom_vline(data=data_line, aes(xintercept=mean_x), color="#ffb703", linetype="dashed") +
geom_col(show.legend = F, width=0.8) +
geom_text(aes(label=value*100), color="white", size=2.7, hjust=1.5) +
facet_wrap(~factor(name, levels=c("viewed_recieved","completed_recieved","completed_viewed", ordered=T)), ncol = 3) +
scale_x_continuous(labels=scales::percent_format()) +
scale_fill_manual(values=c("#2B3A67","#496A81","#66999B")) +
theme(strip.background =element_rect(fill=NA)) +
labs(x="percentage", y="promotion reference", subtitle= "Offer response percentage")
df4 %>% select(person, event) %>%
group_by(person) %>% count(event) %>% ungroup() ->tdf
ggstatsplot::ggbetweenstats(data=tdf,
x=event,
y=n,
messages=FALSE,
results.subtitle=FALSE,
pairwise.comparisons = FALSE)
# clustering
# data preparation
cdf= df4 %>% select(person, event) %>%
group_by(person) %>% count(event) %>% ungroup() %>%
pivot_wider(names_from=event, values_from=n) %>%
replace(is.na(.), 0) %>% #replace all NAs with 0
select(-person) %>%
clean_names()
dim(cdf)
[1] 17000 4
# drop outliers
cdf2 = cdf %>%
# get zscore
mutate(zscore_oc =(offer_completed- mean(offer_completed))/ sd(offer_completed),
zscore_or =(offer_received- mean(offer_received))/ sd(offer_received),
zscore_ov =(offer_viewed- mean(offer_viewed))/ sd(offer_viewed),
zscore_t =(transaction- mean(transaction))/ sd(transaction)) %>%
# drop outliers
filter(between(zscore_oc,-3,3)) %>%
filter(between(zscore_or,-3,3)) %>%
filter(between(zscore_ov,-3,3)) %>%
filter(between(zscore_t,-3,3)) %>%
# select variables
select(offer_completed, offer_received, offer_viewed, transaction)
dim(cdf2)
[1] 16792 4
# correlation
set.seed(123)
c1 = ggcorrmat(data = cdf,
car.vars=c("offer completed":"transaction"),
title="Correlation")
# correlation after dropping outliers
set.seed(123)
c2 = ggcorrmat(data = cdf2,
car.vars=c("offer completed":"transaction"),
title="Correlation after dropping outliers")
# combine plot
c1 + c2
# scale df without outliers
cdf2_scaled = scale(cdf2)
head(cdf2_scaled)
offer_completed offer_received offer_viewed transaction
[1,] 0.6407481 0.4778884 0.4662897 -0.01035126
[2,] -1.2302422 -2.3778776 -1.0868068 -1.03281189
[3,] 0.6407481 0.4778884 1.2428379 -0.62382764
[4,] 0.6407481 0.4778884 -0.3102586 -0.01035126
[5,] 0.6407481 -0.4740336 0.4662897 0.80761725
[6,] 0.6407481 0.4778884 0.4662897 2.03457001
# elbow method
set.seed(123)
fviz_nbclust(cdf2_scaled,kmeans,method="wss") + ggtitle("Elbow method")
fviz_nbclust(cdf2_scaled, kmeans, method = "silhouette", k.max = 20) + ggtitle("Silhouette method")
res.nbclust <- NbClust(cdf_scaled, distance = "euclidean",
min.nc = 2, max.nc = 9,
method = "complete", index ="all")
factoextra::fviz_nbclust(res.nbclust) + ggtitle("NbClust method")
# hclust
set.seed(1234)
h1= hclust(dist(cdf2_scaled))
plot(h1)
#color 3 clusters
dend_obj=as.dendrogram(h1)
dend3= color_branches(dend_obj,k=3)
plot(dend3)
#color 4 clusters
dend4= color_branches(dend_obj,k=4)
plot(dend4)
# 4 clusters
set.seed(123)
km4 = kmeans(cdf2_scaled, centers=4, nstart =5)
km4
K-means clustering with 4 clusters of sizes 4395, 5206, 3687, 3504
Cluster means:
offer_completed offer_received offer_viewed transaction
1 -0.6465954 0.5268382 0.5320180 -0.6553739
2 -0.5755514 -0.9357322 -0.9897010 -0.5382363
3 0.3602940 -0.1469153 -0.1173326 1.0980770
4 1.2870157 0.8840310 0.9265895 0.4662718
Clustering vector:
[1] 4 2 4 3 3 3 3 2 4 1 2 1 2 4 1 3 1 1 4 1 3 1 1 1 4 2 4 3 2 2 1 3 4 2 4 1 2 1 3 3 1 2 1 4 4
[46] 4 1 3 1 2 3 3 1 1 3 2 2 2 4 3 4 2 1 1 1 1 3 2 3 1 4 4 1 3 4 2 1 2 2 4 1 3 4 3 2 3 4 4 3 1
[91] 4 2 2 1 4 2 3 3 2 2 2 2 4 4 4 2 1 2 2 3 1 1 1 2 2 2 2 4 4 1 1 3 3 2 4 2 4 4 3 3 2 4 4 2 1
[136] 1 3 4 1 3 2 3 3 1 1 3 1 1 1 1 2 2 3 2 2 2 4 3 2 2 4 4 3 3 3 3 3 2 2 4 1 2 2 2 4 3 3 1 2 2
[181] 3 3 2 2 1 1 3 1 3 3 1 4 4 4 2 2 1 2 2 1 1 3 3 2 4 1 1 1 2 2 2 3 3 4 1 2 2 1 2 1 1 2 2 1 4
[226] 4 3 1 2 2 2 1 2 2 3 2 2 4 2 3 3 2 1 3 2 3 4 4 1 4 3 2 2 1 1 4 2 1 2 1 3 1 3 1 2 3 1 1 1 4
[271] 2 1 1 2 1 2 1 3 1 2 1 2 2 2 4 4 1 1 3 2 2 2 2 1 3 1 4 2 2 4 3 4 4 2 2 4 4 2 1 4 3 2 2 2 4
[316] 4 1 3 2 3 3 4 2 1 2 4 3 4 1 3 3 1 1 2 2 2 2 2 1 3 2 1 3 3 2 3 2 2 2 2 2 1 1 1 2 4 1 3 3 2
[361] 4 2 4 3 2 4 3 2 2 3 1 3 2 1 4 3 3 4 4 3 3 3 1 4 1 2 2 4 3 1 4 4 2 1 3 2 4 3 1 4 2 2 3 3 4
[406] 4 2 4 3 2 1 4 4 3 4 3 3 1 3 4 1 2 3 2 2 2 2 1 4 3 4 3 3 2 1 4 2 4 2 4 1 1 2 3 3 1 1 3 3 1
[451] 2 3 1 4 1 3 4 4 1 2 2 3 1 4 1 3 1 1 3 2 3 2 4 1 1 2 4 3 4 2 3 3 4 1 2 2 2 2 2 1 2 3 3 3 4
[496] 3 4 3 4 3 1 4 2 3 4 4 1 1 3 3 1 1 3 2 1 1 4 2 4 1 1 2 1 4 2 4 2 1 4 2 4 2 2 3 4 3 4 1 4 3
[541] 1 1 3 3 2 2 1 4 4 3 3 2 2 1 4 2 3 4 3 1 3 4 3 2 1 4 3 1 2 1 3 4 1 1 3 2 1 2 1 2 3 4 1 1 4
[586] 1 2 4 2 1 3 2 2 4 3 2 2 3 4 1 2 3 1 2 1 2 3 1 4 4 4 4 3 2 4 1 3 2 2 4 4 3 3 1 1 1 3 2 1 2
[631] 2 2 2 2 3 1 2 3 2 1 4 4 4 1 1 3 2 4 4 2 3 4 1 4 4 3 2 1 4 2 4 2 3 4 2 2 3 4 2 2 1 1 2 2 1
[676] 3 3 3 4 3 2 1 4 1 4 2 3 2 1 2 1 3 3 2 1 4 1 4 4 4 3 4 2 2 2 4 1 2 4 2 1 3 2 4 3 2 3 1 4 2
[721] 2 1 1 1 3 3 4 1 2 1 2 2 2 4 1 3 3 2 4 4 1 3 2 4 2 2 2 4 1 2 3 3 2 3 3 1 1 4 1 4 4 1 2 4 1
[766] 3 2 2 3 3 1 2 3 3 2 4 2 3 1 3 2 1 1 2 2 1 3 4 1 2 3 2 3 3 4 1 1 2 1 2 2 4 2 3 3 1 2 2 1 2
[811] 2 2 1 4 2 1 4 1 2 2 1 3 4 2 1 1 4 2 3 4 2 2 1 4 2 4 4 2 2 3 2 3 2 3 1 3 4 3 1 2 4 1 1 4 3
[856] 2 4 2 4 2 4 2 4 2 4 1 4 1 2 2 2 1 4 1 2 1 3 1 3 4 2 1 1 2 2 1 3 1 3 2 2 2 2 3 4 2 3 2 1 4
[901] 4 1 2 4 3 2 2 1 1 2 1 1 2 2 3 1 1 4 2 1 2 3 1 1 2 3 4 1 4 2 4 2 1 3 3 2 4 4 2 3 2 2 4 1 2
[946] 2 4 1 2 1 4 1 4 2 3 2 2 2 2 4 1 4 4 2 2 4 4 1 2 4 3 1 4 2 2 1 1 2 2 1 2 1 3 4 3 2 2 4 3 4
[991] 1 4 1 2 3 2 2 2 3 3
[ reached getOption("max.print") -- omitted 15792 entries ]
Within cluster sum of squares by cluster:
[1] 7349.483 9346.092 7504.223 6517.442
(between_SS / total_SS = 54.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
fviz_cluster(km4, data=cdf2_scaled, labelsize=0) #cluster plot
with(cdf2,pairs(cdf2_scaled,col=(1:4)[km4$cluster])) #pair plot
# 3 clusters
set.seed(123)
km3 = kmeans(cdf2_scaled, centers=3, nstart =5)
km3
K-means clustering with 3 clusters of sizes 6111, 5849, 4832
Cluster means:
offer_completed offer_received offer_viewed transaction
1 -0.4707424 -0.9087929 -0.9382575 -0.3410322
2 0.9680941 0.5101128 0.5285569 0.8600251
3 -0.5765056 0.5318675 0.5468051 -0.6097349
Clustering vector:
[1] 2 1 3 2 2 2 2 1 2 3 1 3 1 2 3 2 3 3 2 3 3 3 3 3 2 1 3 2 1 1 3 2 2 1 2 3 1 3 2 2 3 1 3 2 2
[46] 2 3 2 3 1 2 1 3 3 2 1 1 1 2 1 2 1 3 3 3 3 2 1 2 3 2 2 3 1 2 1 3 1 1 2 3 1 2 1 1 2 2 2 2 3
[91] 2 1 1 3 2 1 2 2 1 1 1 1 2 2 2 1 3 1 1 1 3 3 3 1 1 1 1 2 2 3 3 2 1 1 2 1 2 2 1 2 1 2 2 1 3
[136] 3 2 2 3 2 1 2 2 3 3 2 3 3 3 3 1 1 2 1 1 1 2 2 1 1 2 3 1 2 1 1 3 1 1 2 3 1 1 1 2 2 2 3 1 1
[181] 2 2 1 1 3 3 2 3 2 2 3 2 2 2 1 1 3 1 1 3 3 2 2 1 2 3 3 3 1 1 1 2 2 2 3 1 1 3 1 3 3 1 1 3 2
[226] 2 2 3 1 1 1 3 1 1 3 1 1 2 1 3 2 1 3 2 1 1 2 2 3 2 2 1 1 3 3 2 1 3 1 3 1 3 3 3 1 2 3 3 3 2
[271] 1 3 3 1 3 1 3 1 3 1 3 3 1 1 2 2 3 3 2 1 1 1 1 3 2 3 2 1 1 2 2 2 3 1 1 2 2 1 3 3 1 1 1 1 3
[316] 3 3 1 1 2 2 2 1 3 1 2 2 2 3 2 1 3 3 1 1 1 1 1 3 2 1 3 2 3 1 2 1 1 1 1 1 3 3 3 1 2 3 2 2 1
[361] 2 1 2 2 1 2 2 1 1 1 3 2 1 3 2 2 1 2 2 2 2 2 3 2 3 1 1 2 2 3 2 2 1 3 2 1 2 2 3 2 1 1 1 1 2
[406] 2 1 2 1 1 3 2 2 2 2 1 1 3 1 2 3 1 2 1 1 1 1 3 2 2 2 2 2 1 3 2 1 2 1 2 3 3 1 2 1 3 3 2 2 2
[451] 1 2 3 2 3 3 2 2 3 1 1 1 3 2 3 1 3 3 1 1 1 1 2 3 3 1 2 3 2 1 2 2 2 3 1 1 1 1 1 3 1 2 2 2 2
[496] 2 2 1 2 2 3 2 1 1 2 2 3 3 1 2 3 3 2 1 3 3 2 1 2 3 3 1 3 2 1 2 1 3 2 1 2 1 1 1 3 1 2 3 2 2
[541] 3 3 1 1 1 1 3 2 2 2 2 1 1 3 2 1 1 2 2 3 1 2 1 1 3 2 1 3 1 3 2 2 3 3 2 1 3 1 3 1 1 2 3 3 2
[586] 3 1 3 1 3 2 1 1 2 1 1 1 2 2 3 1 1 3 1 3 1 2 3 2 2 2 2 2 1 2 3 2 1 1 2 3 2 2 3 3 3 1 1 3 1
[631] 1 1 1 1 2 3 1 2 1 3 2 2 2 3 3 1 1 2 2 1 3 2 3 2 2 1 1 3 3 1 3 1 2 2 1 1 2 2 1 1 3 3 1 1 3
[676] 2 2 2 3 2 1 3 2 3 2 1 2 1 3 1 3 2 2 1 3 2 3 2 2 2 1 2 1 1 1 2 3 1 2 1 3 1 1 2 2 1 2 3 2 1
[721] 1 3 3 3 2 2 2 3 1 3 1 1 1 2 3 2 2 1 2 2 3 2 1 2 1 1 1 2 3 1 3 2 1 2 1 3 3 2 3 2 2 3 1 2 3
[766] 2 1 1 1 2 3 1 2 2 1 2 1 2 3 2 1 3 3 1 1 3 2 2 3 1 2 1 1 1 2 3 3 1 3 1 1 2 1 2 3 3 1 1 3 1
[811] 1 1 3 2 1 3 2 3 1 1 3 2 2 1 3 3 3 1 2 2 1 1 3 2 1 2 2 1 1 2 1 2 1 2 3 2 2 1 3 1 2 3 3 2 1
[856] 1 2 1 2 1 2 1 2 1 2 3 2 3 1 1 1 3 2 3 1 3 2 3 2 2 1 3 3 1 1 3 2 3 2 1 1 1 1 2 2 1 2 1 3 2
[901] 3 3 1 2 2 1 1 3 3 1 3 3 1 1 2 3 3 3 1 3 1 2 3 3 1 1 2 3 2 1 2 1 3 2 2 1 2 2 1 2 1 1 2 3 1
[946] 1 2 3 1 3 2 3 2 1 2 1 1 1 1 2 3 2 2 1 1 2 2 3 1 2 1 3 3 1 1 3 3 1 1 3 1 3 1 2 2 1 1 2 1 3
[991] 3 2 3 1 2 1 1 1 2 1
[ reached getOption("max.print") -- omitted 15792 entries ]
Within cluster sum of squares by cluster:
[1] 12330.933 14467.088 8696.321
(between_SS / total_SS = 47.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
fviz_cluster(km3, data=cdf2_scaled, labelsize=0) #cluster plot
with(cdf2,pairs(cdf2_scaled,col=(1:3)[km3$cluster])) #pair plot
k-means 3 clusters
* Cluster means:
+ offer completed: c2 > c1 > c3
+ offer received: c3 > c2 > c1
+ offer viewed: c3 > c2 > c1
+ transaction: c2 > c1 > c3
cdf2$cluster_id = as.factor(km4$cluster)
# summary
by(cdf2,cdf2$cluster_id,summary)
cdf2$cluster_id: 1
offer_completed offer_received offer_viewed transaction cluster_id
Min. :0.0000 Min. :4.000 Min. :1.000 Min. : 0.000 1:4395
1st Qu.:0.0000 1st Qu.:5.000 1st Qu.:4.000 1st Qu.: 3.000 2: 0
Median :1.0000 Median :5.000 Median :4.000 Median : 5.000 3: 0
Mean :0.9358 Mean :5.051 Mean :4.085 Mean : 4.846 4: 0
3rd Qu.:2.0000 3rd Qu.:6.000 3rd Qu.:5.000 3rd Qu.: 7.000
Max. :3.0000 Max. :6.000 Max. :6.000 Max. :17.000
------------------------------------------------------------------------
cdf2$cluster_id: 2
offer_completed offer_received offer_viewed transaction cluster_id
Min. :0.00 Min. :2.000 Min. :0.000 Min. : 0.000 1: 0
1st Qu.:0.00 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 3.000 2:5206
Median :1.00 Median :4.000 Median :2.000 Median : 5.000 3: 0
Mean :1.05 Mean :3.515 Mean :2.125 Mean : 5.419 4: 0
3rd Qu.:2.00 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.: 7.000
Max. :4.00 Max. :6.000 Max. :3.000 Max. :16.000
------------------------------------------------------------------------
cdf2$cluster_id: 3
offer_completed offer_received offer_viewed transaction cluster_id
Min. :0.00 Min. :2.000 Min. :0.000 Min. : 5.00 1: 0
1st Qu.:2.00 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:10.00 2: 0
Median :3.00 Median :4.000 Median :3.000 Median :13.00 3:3687
Mean :2.55 Mean :4.344 Mean :3.248 Mean :13.42 4: 0
3rd Qu.:3.00 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:16.00
Max. :5.00 Max. :6.000 Max. :6.000 Max. :23.00
------------------------------------------------------------------------
cdf2$cluster_id: 4
offer_completed offer_received offer_viewed transaction cluster_id
Min. :0.000 Min. :4.000 Min. :1.000 Min. : 1.00 1: 0
1st Qu.:3.000 1st Qu.:5.000 1st Qu.:4.000 1st Qu.: 7.00 2: 0
Median :4.000 Median :5.000 Median :5.000 Median :10.00 3: 0
Mean :4.036 Mean :5.427 Mean :4.593 Mean :10.33 4:3504
3rd Qu.:5.000 3rd Qu.:6.000 3rd Qu.:5.000 3rd Qu.:13.00
Max. :6.000 Max. :6.000 Max. :6.000 Max. :23.00
# pivot longer
cdf2_long = cdf2 %>% pivot_longer(!cluster_id)
# distribution by cluster id
cdf2_long %>% ggplot(aes(x= cluster_id, y=value, color=cluster_id, fill=cluster_id)) +
geom_half_violin(side="r") +
geom_half_boxplot(fill=NA, outlier.colour = "red", outlier.shape=21, outlier.size = 0.5) +
facet_wrap(~name, ncol=2, scales="free") +
scale_fill_manual(values=c("#2B3A67","#66999B", "#B3AF8F","#FFC482")) +
scale_color_manual(values=c("#2B3A67","#66999B", "#B3AF8F","#FFC482")) +
theme(legend.position = "none",
panel.grid.minor = element_blank(),
strip.background=element_rect(fill="slategrey"),
strip.text=element_text(color="white"))
cdf2$cluster_id = as.factor(km3$cluster)
# summary
by(cdf2,cdf2$cluster_id,summary)
cdf2$cluster_id: 1
offer_completed offer_received offer_viewed transaction cluster_id
Min. :0.000 Min. :2.000 Min. :0.000 Min. : 0.000 1:6111
1st Qu.:0.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 4.000 2: 0
Median :1.000 Median :4.000 Median :2.000 Median : 6.000 3: 0
Mean :1.218 Mean :3.543 Mean :2.191 Mean : 6.383
3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.: 8.000
Max. :4.000 Max. :6.000 Max. :3.000 Max. :22.000
------------------------------------------------------------------------
cdf2$cluster_id: 2
offer_completed offer_received offer_viewed transaction cluster_id
Min. :0.000 Min. :3.000 Min. :1.00 Min. : 3.00 1: 0
1st Qu.:3.000 1st Qu.:5.000 1st Qu.:3.00 1st Qu.: 9.00 2:5849
Median :4.000 Median :5.000 Median :4.00 Median :12.00 3: 0
Mean :3.525 Mean :5.034 Mean :4.08 Mean :12.26
3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.:5.00 3rd Qu.:15.00
Max. :6.000 Max. :6.000 Max. :6.00 Max. :23.00
------------------------------------------------------------------------
cdf2$cluster_id: 3
offer_completed offer_received offer_viewed transaction cluster_id
Min. :0.000 Min. :4.000 Min. :1.000 Min. : 0.000 1: 0
1st Qu.:0.000 1st Qu.:5.000 1st Qu.:4.000 1st Qu.: 3.000 2: 0
Median :1.000 Median :5.000 Median :4.000 Median : 5.000 3:4832
Mean :1.048 Mean :5.057 Mean :4.104 Mean : 5.069
3rd Qu.:2.000 3rd Qu.:6.000 3rd Qu.:5.000 3rd Qu.: 7.000
Max. :4.000 Max. :6.000 Max. :6.000 Max. :16.000
# pivot longer
cdf2_long = cdf2 %>% pivot_longer(!cluster_id)
# distribution by cluster id
cdf2_long %>% ggplot(aes(x= cluster_id, y=value, color=cluster_id, fill=cluster_id)) +
geom_half_violin(side="r") +
geom_half_boxplot(fill=NA, outlier.colour = "red", outlier.shape=21, outlier.size = 0.5) +
facet_wrap(~name, ncol=2, scales="free") +
scale_fill_manual(values=c("#2B3A67","#66999B", "#B3AF8F")) +
scale_color_manual(values=c("#2B3A67","#66999B", "#B3AF8F")) +
theme(legend.position = "none",
panel.grid.minor = element_blank(),
strip.background=element_rect(fill="slategrey"),
strip.text=element_text(color="white"))
The clusters are more unique in 3-groups than 4-groups, especially in offer_received and transaction variables. Hence, it is more suitable to cluster customers into 3 groups than 4 groups.