Github : Ulabox : Baskets Clustering and Analysis

Introduction

Ulabox adalah salah satu perusahaan di Spanyol yang bergerak dibidang online supermarket. Pada tahun 2017, Ulabox mendapatkan €1 juta per-bulan dan berhasil mendapatkan 95% kepuasan pelanggan. Ulabox melayani 8 kategori produk pada tahun 2017, dan tujuh dari delapan kategori disajikan di seluruh Spanyol, kecuali Kategori produk Fresh hanya disajikan di kota yaitu Madrid & Barcelona.

Goal

Goal dari projek ini yaitu melakukan clustering untuk mengetahui seorang customer itu tergolong customer yang cenderung membeli produk kategori apa dan mencari tau habbit jam melakukan order pada setiap clusternya.

Data Preprocessing

Data Inspection

Data yang digunakan mencakup 30k order dari 10k customer pada tahun 2017. Setiap baris data merepresentasikan keranjang belanja customer dalam 1 kali order. Sumber data bisa didapatkan dari sini The Ulabox Online Supermarket Dataset 2017. Berikut informasi data pada dataset:

Order: merupakan order id
Customer: merupakan ID Customer.
Total Items: jumlah item yang dibeli pada order terkait.
Discount: Total diskon yang didapat customer pada setiap order.
Weekday: Hari ketika customer melakukan order. format data 1-7 merepresentasikan hari senin-minggu.
Hour: Jam ketika customer melakukan order. format 0-23 merepresentasikan 24 jam/hari.
Categories: terdapat 8 kolom kategori produk yang diorder yaitu Food, Fresh, Drinks, Home, Beauty, Health, Baby, Pets. Setiap kolom mereprensentasikan persentese biaya yang dikeluarkan pada kategori produk terkait dalam 1 kali order.

Berikut 10 data teratas:

ulabox <- read_csv("data_input/ulabox_orders_2017.csv")

head(ulabox,10) %>% 
kable("html", escape = F, align = "c") %>%
kable_styling(c("striped", "hover", "condensed", "responsive"), full_width = T) %>%
row_spec(0, color = "white", background = "#222629")

customer	order	total_items	discount%	weekday	hour	Food%	Fresh%	Drinks%	Home%	Beauty%	Health%
0	0	45	23.03	4	13	9.46	87.06	3.48	0.00	0.00	0.00
0	1	38	1.22	5	13	15.87	75.80	6.22	2.12	0.00	0.00
0	2	51	18.08	4	13	16.88	56.75	3.37	16.48	6.53	0.00
1	3	57	16.51	1	12	28.81	35.99	11.78	4.62	2.87	15.92
1	4	53	18.31	2	11	24.13	60.38	7.78	7.72	0.00	0.00
1	5	8	23.89	4	13	0.00	100.00	0.00	0.00	0.00	0.00
1	6	35	17.26	1	10	13.01	51.84	29.36	5.79	0.00	0.00
1	7	12	6.61	4	8	17.21	67.93	14.86	0.00	0.00	0.00
1	8	35	22.29	1	12	15.50	61.65	22.85	0.00	0.00	0.00
1	9	44	20.39	2	12	15.29	44.11	12.28	8.90	15.53	3.90

Data Structure

glimpse(ulabox)

#> Observations: 30,000
#> Variables: 14
#> $ customer    <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ order       <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
#> $ total_items <dbl> 45, 38, 51, 57, 53, 8, 35, 12, 35, 44, 17, 7, 30, 35, 3...
#> $ `discount%` <dbl> 23.03, 1.22, 18.08, 16.51, 18.31, 23.89, 17.26, 6.61, 2...
#> $ weekday     <dbl> 4, 5, 4, 1, 2, 4, 1, 4, 1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ hour        <dbl> 13, 13, 13, 12, 11, 13, 10, 8, 12, 12, 9, 8, 9, 11, 10,...
#> $ `Food%`     <dbl> 9.46, 15.87, 16.88, 28.81, 24.13, 0.00, 13.01, 17.21, 1...
#> $ `Fresh%`    <dbl> 87.06, 75.80, 56.75, 35.99, 60.38, 100.00, 51.84, 67.93...
#> $ `Drinks%`   <dbl> 3.48, 6.22, 3.37, 11.78, 7.78, 0.00, 29.36, 14.86, 22.8...
#> $ `Home%`     <dbl> 0.00, 2.12, 16.48, 4.62, 7.72, 0.00, 5.79, 0.00, 0.00, ...
#> $ `Beauty%`   <dbl> 0.00, 0.00, 6.53, 2.87, 0.00, 0.00, 0.00, 0.00, 0.00, 1...
#> $ `Health%`   <dbl> 0.00, 0.00, 0.00, 15.92, 0.00, 0.00, 0.00, 0.00, 0.00, ...
#> $ `Baby%`     <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
#> $ `Pets%`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

format data sudah sesuai, namun saya pribadi lebih suka weekday dan hour direpresentasikan dalam tipe faktor. Selain itu nama kolom tidak mengandung karakter % dan bersifat lowercase karena mempermudah dalam memproses data.

ulabox$weekday <- sapply(as.factor(ulabox$weekday),switch,
                         "1"="monday",
                         "2"="tuesday",
                         "3"="wednesday",
                         "4"="thrusday",
                         "5"="friday",
                         "6"="saturday",
                         "7"="sunday")
ulabox$weekday <- as.factor(ulabox$weekday)
ulabox$weekday <- ordered(ulabox$weekday, levels=c("monday","tuesday","wednesday","thrusday","friday","saturday","sunday"))

ulabox$hour <- as.factor(ulabox$hour)
ulabox$hour <- ordered(ulabox$hour, levels=c(0:23))

colnames(ulabox) <- sapply(strsplit(tolower(names(ulabox)), "%"), `[[`, 1)
glimpse(ulabox)

#> Observations: 30,000
#> Variables: 14
#> $ customer    <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ order       <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
#> $ total_items <dbl> 45, 38, 51, 57, 53, 8, 35, 12, 35, 44, 17, 7, 30, 35, 3...
#> $ discount    <dbl> 23.03, 1.22, 18.08, 16.51, 18.31, 23.89, 17.26, 6.61, 2...
#> $ weekday     <ord> thrusday, friday, thrusday, monday, tuesday, thrusday, ...
#> $ hour        <ord> 13, 13, 13, 12, 11, 13, 10, 8, 12, 12, 9, 8, 9, 11, 10,...
#> $ food        <dbl> 9.46, 15.87, 16.88, 28.81, 24.13, 0.00, 13.01, 17.21, 1...
#> $ fresh       <dbl> 87.06, 75.80, 56.75, 35.99, 60.38, 100.00, 51.84, 67.93...
#> $ drinks      <dbl> 3.48, 6.22, 3.37, 11.78, 7.78, 0.00, 29.36, 14.86, 22.8...
#> $ home        <dbl> 0.00, 2.12, 16.48, 4.62, 7.72, 0.00, 5.79, 0.00, 0.00, ...
#> $ beauty      <dbl> 0.00, 0.00, 6.53, 2.87, 0.00, 0.00, 0.00, 0.00, 0.00, 1...
#> $ health      <dbl> 0.00, 0.00, 0.00, 15.92, 0.00, 0.00, 0.00, 0.00, 0.00, ...
#> $ baby        <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
#> $ pets        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Missing Value

colSums(is.na(ulabox))

#>    customer       order total_items    discount     weekday        hour 
#>           0           0           0           0           0           0 
#>        food       fresh      drinks        home      beauty      health 
#>           0           0           0           0           0           0 
#>        baby        pets 
#>           0           0

Tidak terdapat missing value.

Duplicate Value

data.frame(
  "nrow.data" = nrow(ulabox),
  "nrow.uniqe"= ulabox %>% distinct() %>% nrow()
)

#>   nrow.data nrow.uniqe
#> 1     30000      30000

Tidak terdapat duplicate value.

Exploratory Data Analysis

Sesuai yang informasikan dari The Ulabox Online Supermarket Dataset 2017 terdapat customer yang merupakan Influencer, dan customer kategori ini mendapatkan diskon 100% atau tidak melakukan pembayaran. Mari kita cek proporsinya.

data.frame(
  total_influencer = ulabox %>% filter(discount==100) %>% select(customer) %>% distinct() %>% nrow(),
  total_customer = ulabox %>% filter(discount!=100) %>% select(customer) %>% distinct() %>% nrow()
) %>% gather("kategori","total") %>% 
mutate(
  prop = total/sum(total)*100
)

#>           kategori total     prop
#> 1 total_influencer   321  3.12835
#> 2   total_customer  9940 96.87165

Terdapat 321 (3.13%) customer kategori influencer. Customer kategori ini tidak perlu dilakukan clustering karena sebagai influencer, lagipula jumlah datanya cukup kecil sehingga dapat kita keluarkan. berikut hasilnya:

ulabox <- ulabox %>%  filter(discount!=100)

data.frame(
  total_influencer = ulabox %>% filter(discount==100) %>% select(customer) %>% distinct() %>% nrow(),
  total_customer = ulabox %>% filter(!discount==100) %>% select(customer) %>% distinct() %>% nrow()
) %>% gather("kategori","total") %>% 
mutate(
  prop = total/sum(total)*100
)

#>           kategori total prop
#> 1 total_influencer     0    0
#> 2   total_customer  9940  100

Top Selling Product

plot_top_selling <- ulabox %>%
  select (food, fresh, drinks, home, beauty, health, baby, pets) %>% 
  summarise_if(is.numeric, sum) %>% 
  gather(key = "category", value = "total", food, fresh, drinks, home, beauty, health, baby, pets) %>% 
  mutate(
    total = round(total/30000,2),
    category = as.factor(category),
    category = reorder(category,total)
  ) %>% 
  ggplot(aes(category,total,fill=category,group=category))+
  geom_bar(aes(fill=category),stat = "identity")+
  geom_text(aes(label=paste0(total,"%"), y=total+1),color="white",size=3)+
  labs(
    title = "Top Selling Product Category in 2017",
    x="Product Category",
    y="Selling (%)",
    fill=""
  )+
  my_theme_fill()+
  my_plot_theme(10)

ggplotly(plot_top_selling,tooltip = NULL) %>% 
  layout(showlegend=FALSE)

Produk kategori Food, Drinks, Fresh paling laris ditahun 2017 dengan mendominasi 65.8% penjualan. Chart diatas memberikan informasi unik, meskipun produk kategori Fresh hanya dijual di Madrid & Barcelona, tapi produk kategori ini berada diurutan ke-3 dengan mendominasi 20% penjualan di tahun 2017. Mari kita lihat penjualaan hariannya.

avg_daily <- ulabox %>% 
  select (weekday, food,fresh, drinks, home, beauty, health, baby, pets) %>% 
  group_by(weekday) %>% 
  summarise_if(is.numeric, mean) %>% 
  ungroup() %>% 
   gather(key = "category", value = "total", food, fresh, drinks, home, beauty, health, baby, pets) %>% 
  mutate(total = round(total,2),
         category = as.factor(category),
    category = reorder(category,total))
  
  
#serror <- function(x) sqrt(var(x)/length(x))   

plot_avg_daily <-  ggplot(avg_daily,aes(weekday,total,fill=category,group=category)) +
  geom_bar(stat="identity")+
  geom_text(aes(label=ifelse(round(total,2)>10,paste0(round(total,2),"%"),"")),
            position = position_stack(vjust=0.5), size=2.5, color="white")+
  labs(
    title = "Average Daily Sales per Category",
    x = "",
    y = ""
  )+
  coord_flip()+
  my_theme_fill()+
  my_plot_theme(11)
  

ggplotly(plot_avg_daily, tooltip = NULL) %>% 
  layout(title = list(text = paste0('Average Daily Sales per Category',
                                    '<br>',
                                    '<sup>',
                                    'Label just displayed on selling > 10%',
                                    '</sup>')))

Rata-rata penjualan setiap harinya juga menunjukan produk kategori Food, Drinks, Fresh paling laris dan setiap produk mendominasi penjualan setiap harinya. Rata-rata penjualan masing-masing kategori hampir mirip setiap harinya dan kurang menunjukan adanya efek seasonal mingguan dari kebiasaan customer membeli produk sehingga clustering kita bisa fokus ke jam ketika customer membeli.

Loyal Customer based on Order Frequency

loyal_freq <- ulabox %>% 
  group_by(customer) %>% 
  summarise(freq=n()) %>% 
  ungroup() %>% 
  mutate(
    popup=glue("Customer ID: {customer}
               Freq : {freq}")
  ) %>% 
  arrange(freq)

getmodus <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}


plot_loyal_freq <- ggplot(loyal_freq,aes(customer,freq))+
  geom_point(aes(color=freq, size=freq, text=popup), alpha=0.7)+
  geom_hline(yintercept=getmodus(loyal_freq$freq),linetype="dashed", color = "yellow",size=0.5,text="adad")+
  labs(title = "Loyal Customer based on Order Frequency", 
       x = "Customer", 
       y = "Order Frequency",
       color = "Order Frequency")+
  my_plot_theme(10)+
  my_theme_gradientn(10)+
  theme(
    axis.text.x = element_blank()
  )+
  annotate(geom="text", 
           x=length(loyal_freq$customer)-1/10*length(loyal_freq$customer), 
           y=getmodus(loyal_freq$freq)+1.5,size=3,
           label=paste0("Modus: ",(tabulate(match(loyal_freq$freq, unique(loyal_freq$freq))) %>% sort(decreasing = T) %>% .[1]),
                        "/",length(loyal_freq$customer), " cust"),
           color="yellow")

ggplotly(plot_loyal_freq, tooltip="text")

Chart diatas menunjukan Frekuensi Order dari masing-masing customer. Terdapat 1 Customer yang melakukan order 52 kali ditahun 2017. Selain itu, garis kuning menunjukan customer dengan frekuensi order 1 kali paling banyak dengan jumlah 4239 dari 9940 customer.

Loyal Customer based on Total Order Items

loyal_cust_item <-  ulabox %>% 
  group_by(customer) %>% 
  summarise(total_item_order=sum(total_items)) %>% 
  ungroup() %>% 
  mutate(
    popup=glue("Customer ID: {customer}
               Total Item Order : {total_item_order}")
  ) %>% 
  arrange(total_item_order) 


getmodus <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

plot_loyal_cust <- ggplot(loyal_cust_item, aes(customer,total_item_order))+
  geom_point(aes(color=total_item_order,text=popup),alpha=0.8, size=2.5)+
  geom_hline(yintercept=getmodus(loyal_cust_item$total_item_order),linetype="dashed", color = "yellow",size=0.5)+
  labs(title = "Loyal Customer based on Total Order Items", 
       x = "Customer", 
       y = "Total Item Order",
       color = "Total Item")+
  my_plot_theme(10)+
  my_theme_gradientn(10)+
  theme(
    axis.text.x = element_blank()
  )+
  annotate(geom="text", 
           x=length(loyal_cust_item$customer)-1/10*length(loyal_cust_item$customer), 
           y=getmodus(loyal_cust_item$total_item_order)+50,size=3,
           label=paste0("Modus: ",(tabulate(match(loyal_cust_item$total_item_order, 
                                                  unique(loyal_cust_item$total_item_order))) %>% 
                                     sort(decreasing = T) %>% .[1]),"/",length(loyal_cust_item$customer), " cust"),
           color="yellow")

ggplotly(plot_loyal_cust)

Chart diatas menunjukan jumlah item yang di order masing-masing customer cukup bervaratif. Meskipun begitu. proporsi customer yang hanya meng-order 1 item adalah yang paling besar dengan jumlah 304 dari 9940 customer.

Clustering & Analysis

Berdasarkan analisa sebleumnya, maka memungkinkan kita untuk melakukan clustering karena kategori produk yang diorder customer cukup bervariatif.

Data Transformation

Tujuan dari projek ini yaitu melakukan clustering untuk mengetahui seorang customer itu tergolong customer yang cenderung membeli produk kategori apa dan mencari tau habbit jam melakukan order pada setiap clusternya. Untuk itu kita bisa fokus ke data persentase pembelian pada setiap kategori produk, sehingga kita bisa mentransformasi datanya menjadi sebgai berikut:

food : rata-rata persentase pembelian produk kategori food
fresh : rata-rata persentase pembelian produk kategori fresh
drinks : rata-rata persentase pembelian produk kategori drinks
home : rata-rata persentase pembelian produk kategori home
beauty : rata-rata persentase pembelian produk kategori beauty
health : rata-rata persentase pembelian produk kategori health
baby : rata-rata persentase pembelian produk kategori baby
pets : rata-rata persentase pembelian produk kategori pets

Saya memilih menggunakan rata-rata karena data kita menunjukan 1 customer bisa melakukan order beberapa kali. Berikut hasil transformasinya:

ub_data <- ulabox

ub_data <- ub_data %>% 
  group_by(customer) %>% 
  summarise(
    food = mean(food),
    fresh = mean(fresh),
    drinks = mean(drinks),
    home = mean(home),
    beauty = mean(beauty),
    health = mean(health),
    baby = mean(baby),
    pets = mean(pets)
  ) %>% 
  ungroup() %>% arrange(customer)


head(ub_data,10) %>% 
kable("html", escape = F, align = "c") %>%
kable_styling(c("striped", "hover", "condensed", "responsive"), full_width = T) %>%
row_spec(0, color = "white", background = "#222629")

customer	food	fresh	drinks	home	beauty	health	baby	pets
0	14.070000	73.20333	4.356667	6.200000	2.1766667	0.000	0.0000000	0.000000
1	17.762000	52.90900	17.761000	3.207500	2.3145000	4.352	1.6950000	0.000000
2	24.100000	22.29000	38.690000	14.920000	0.0000000	0.000	0.0000000	0.000000
3	23.825652	51.28087	8.220870	14.773478	0.0000000	0.000	1.8986957	0.000000
4	24.841379	51.08241	10.291035	13.035172	0.6837931	0.000	0.0655172	0.000000
5	6.840000	0.00000	24.020000	26.870000	10.2100000	0.000	32.0600000	0.000000
6	39.315385	23.76692	19.721539	5.633077	2.1884615	0.000	6.6176923	2.756923
7	23.700000	52.54000	13.930000	0.000000	9.8300000	0.000	0.0000000	0.000000
8	24.091250	37.56625	22.392500	12.033750	3.9175000	0.000	0.0000000	0.000000
9	9.296667	64.84000	14.153333	3.257500	2.4391667	5.390	0.6250000	0.000000

Optimal K for Clustering

Pada dasarnya kita tidak mengetahui berapa jumlah cluster pada data ini, maka kita perlu mencari jumlah cluster yang optimal. Untuk itu kita bisa menggunakan Elbow Method, Silhouette Method, Gap Statistic atau pendekatan lainnya. Untuk case ini kita akan mencoba Elbow Method dan Silhouette Method. Pada setiap metode kita perlu menentukan nilai K-Maksimum, dalam case ini saya memilih 15. Berikut hasilnya:

Elbow Method
Dalam menggunakan metode Elbow jumlah cluster terbaik dengan cara melihat persentase hasil perbandingan antara jumlah cluster yang akan membentuk siku pada suatu titik atau jumlah perubahan tinggi titik yang tidak signifikan. Untuk mendapatkan perbandingannya adalah dengan menghitung SSE (Sum of Square Error) dari masing-masing nilai cluster. Jika dilihat dari visualisasi dibawah, perubahan tinggi titik yang tidak signifikan ada pada K = 7.

fviz_nbclust(ub_data[,-c(1)], kmeans, method = "wss", k.max = 15) + labs(subtitle = "Elbow method")

Silhouette Method
Metode Silhouette mengukur koefisien Silhouette dengan menghitung rata-rata jarak setiap data terhadap semua data pada cluster yang sama kemudian menghitung rata-rata jarak setiap data dengan semua data pada cluster lain. Kemudian data akan dikelompokan kembali berdasarkan jarak minimum. Jika dalam perulangan didapati posisi centorid berubah, maka data akan dikelompokan ulang. Berdasarkan visualisasi dibawah, nilai K optimal dapat dilihat pada titik tertinggi. Berdasarkan visualisasi dibawah maka K-Optimum = 3.

fviz_nbclust(ub_data[,-c(1)], kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")

Kedua metode diatas menunjukan nilai K-Optimum yang berbeda, disisi lain kita tahu bahwa terdapat 8 kategori produk maka ada kemungkinan setiap kategori memiliki pasarnya sendiri. maka mari kita putuskan untuk membandingkan menggunakan K=3, K=7 dan K=8.

K-Means Clustering

Metode K-Means akan melakukan Clustering berdasarkan jarak. Salah satu hal penting yang perlu dilakukan sebelum menggunakan metode k-means yaitu memastikan seluruh data sudah dalam skala yang sama. Dalam case ini kita tidak perlu melakukan scalling karna setiap variabel berisikan nilai dari skala 0-100%. Kebaikan hasil clustering dapat dilihat dari 3 nilai:

Within Sum of Squares ($withinss): jarak tiap observasi ke centroid untuk tiap cluster
Total Sum of Squares ($totss): jarak tiap observasi ke global sample mean (rata-rata data keseluruhan).
Between Sum of Squares ($betweenss): jarak centroid tiap cluster ke global sample mean

Clustering yang baik akan menghasilkan withinss semakin rendah dan hasil betweenss/totss mendekati 1 atau 100%

Berikut hasil Clustering menggunakan K = 3

df_cust_cat <- ub_data
set.seed(119)
k3_optimum <- 3
k3_cust<-kmeans(df_cust_cat[,-c(1)], k3_optimum)

k3_cust

#> K-means clustering with 3 clusters of sizes 1972, 4066, 3902
#> 
#> Cluster means:
#>        food     fresh    drinks      home    beauty    health      baby
#> 1  4.144657  1.517635  5.801302  5.662340 20.625002 0.7561837 60.510503
#> 2 34.636697 29.981821 15.450952  9.671546  5.051911 1.5630036  2.503607
#> 3 17.194045  4.710507 39.169165 25.547770  6.455837 1.2501758  3.658397
#>        pets
#> 1 0.2926920
#> 2 0.9973163
#> 3 1.8626793
#> 
#> Clustering vector:
#>  [1] 2 2 3 2 2 3 2 2 2 2 3 1 3 2 2 3 2 2 2 2 2 3 2 2 2 2 3 2 3 3 2 2 3 2 3 3 3 3
#> [39] 3 2 2 2 2 3 2 3 3 3 2 3 3 2 3 2 2 2 2 3 3 3 2 2 3 1 3 1 3 2 2 2 2 2 1 3 2
#>  [ reached getOption("max.print") -- omitted 9865 entries ]
#> 
#> Within cluster sum of squares by cluster:
#> [1] 5868160 5533611 6792775
#>  (between_SS / total_SS =  38.1 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"

Informasi diatas merupakan hasil Clustering menggunakan K=3. Dapat dilihat hasil Between Sum of Squares/Total Sum of Squares sebesar 35%*.

Berikut hasil Clustering menggunakan K = 7

df_cust_cat <- ub_data

k7_optimum <- 7
set.seed(119)
k7_cust<-kmeans(df_cust_cat[,-c(1)], k7_optimum)

k7_cust

#> K-means clustering with 7 clusters of sizes 822, 2157, 3257, 1355, 1087, 819, 443
#> 
#> Cluster means:
#>        food      fresh     drinks      home    beauty    health       baby
#> 1 72.634893  3.5745393 11.4990336  6.516010  3.122583 0.8492414  1.1049016
#> 2 21.616916 46.7465372 14.9544180  8.367141  4.074205 1.0169853  2.2737306
#> 3 25.755703  9.9926348 27.7172784 17.795447  7.156089 2.1387442  6.9275840
#> 4  3.721255  1.1629981  5.6275472  5.009523  2.258572 0.7522967 81.0078906
#> 5 10.980113  3.2666825 71.6755813  8.340212  2.657171 0.6297949  1.6576544
#> 6  9.819353  2.1107969 11.1817915 63.298947  6.835147 0.8427081  4.5286517
#> 7  1.935915  0.2420034  0.9552731  2.525132 90.734421 1.0673194  0.2914409
#>        pets
#> 1 0.5621508
#> 2 0.8903947
#> 3 2.2598641
#> 4 0.2614330
#> 5 0.6328756
#> 6 1.2670735
#> 7 0.1749300
#> 
#> Clustering vector:
#>  [1] 2 2 3 2 2 3 3 2 2 2 6 3 3 2 3 6 2 2 2 2 3 3 3 2 1 3 6 3 6 3 1 3 3 3 5 3 3 6
#> [39] 5 3 2 2 3 6 1 5 3 3 2 6 5 2 6 2 2 2 2 5 5 5 2 2 6 4 3 4 6 2 1 3 2 2 4 1 2
#>  [ reached getOption("max.print") -- omitted 9865 entries ]
#> 
#> Within cluster sum of squares by cluster:
#> [1]  579756.3 1164556.4 2881670.6  864062.1  725046.3  799289.6  176151.1
#>  (between_SS / total_SS =  75.5 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"

Informasi diatas merupakan hasil Clustering menggunakan K=7. Dapat dilihat hasil Between Sum of Squares/Total Sum of Squares sebesar 75.5%

Berikut hasil Clustering menggunakan K = 8

df_cust_cat <- ub_data

k8_optimum <- 8
set.seed(119)
k8_cust<-kmeans(df_cust_cat[,-c(1)], k8_optimum)

k8_cust

#> K-means clustering with 8 clusters of sizes 747, 2002, 1751, 1331, 561, 733, 433, 2382
#> 
#> Cluster means:
#>        food      fresh     drinks      home    beauty    health       baby
#> 1 74.891264  3.0963573 11.1558610  5.832302  2.814554 0.8010432  0.8853915
#> 2 21.301708 48.1155018 14.8937688  7.966844  3.860182 0.9725023  2.0007819
#> 3 21.463596  7.4038248 45.4815345 15.180979  4.802523 1.0927013  3.5163991
#> 4  3.525836  1.0754350  5.5502680  4.875917  2.150576 0.7337031 81.6859535
#> 5  5.761327  1.4679575 85.6695209  3.997770  1.433126 0.2395933  1.0011833
#> 6  8.971923  1.8457169 10.6814909 65.955252  6.570576 0.6699657  4.2468895
#> 7  1.479994  0.2090012  0.8867275  2.279642 91.679635 1.0017610  0.1905889
#> 8 27.196657 11.7450098 19.1651569 18.986109  8.389825 2.6154853  8.5961851
#>        pets
#> 1 0.4288583
#> 2 0.8362807
#> 3 0.9201566
#> 4 0.2100219
#> 5 0.2391613
#> 6 0.9894541
#> 7 0.1511316
#> 8 2.9765943
#> 
#> Clustering vector:
#>  [1] 2 2 3 2 2 8 8 2 2 2 6 8 8 2 8 6 2 2 2 2 8 8 8 2 1 8 6 3 6 8 1 8 3 8 3 8 8 6
#> [39] 3 8 2 2 8 6 1 3 8 3 2 6 3 2 6 2 2 2 2 3 5 3 2 2 6 4 8 4 6 2 1 8 2 2 4 3 2
#>  [ reached getOption("max.print") -- omitted 9865 entries ]
#> 
#> Within cluster sum of squares by cluster:
#> [1]  484841.1 1013157.1 1068915.9  811257.3  184177.6  655817.8  148013.5
#> [8] 2265444.1
#>  (between_SS / total_SS =  77.4 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"

Informasi diatas merupakan hasil Clustering menggunakan K=8. Dapat dilihat hasil Between Sum of Squares/Total Sum of Squares sebesar 78.4%. Mari kita coba visualisasikan setiap cluster.

cust_cluster <- as.matrix(df_cust_cat[,-c(1)])

hm.palette <-colorRampPalette(rev(brewer.pal(10, 'RdYlGn')),space='Lab')

cluster <- c(1: k3_optimum)
center_df <- data.frame(cluster, k3_cust$centers)

# Reshape the data
center_reshape <- gather(center_df, features, values, food : pets)
#head(center_reshape)

#plot data
ggplot(data = center_reshape, aes(x = features, y = cluster)) +
  scale_y_continuous(breaks = seq(1, k3_optimum, by = 1)) +
  geom_tile(aes(fill = values)) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(90)) +
  labs(
    title = "Clustering using K=3"
  )+
  my_plot_theme(10)

cust_cluster <- as.matrix(df_cust_cat[,-c(1)])

hm.palette <-colorRampPalette(rev(brewer.pal(10, 'RdYlGn')),space='Lab')

cluster <- c(1: k7_optimum)
center_df <- data.frame(cluster, k7_cust$centers)

# Reshape the data
center_reshape <- gather(center_df, features, values, food : pets)
#head(center_reshape)

#plot data
ggplot(data = center_reshape, aes(x = features, y = cluster)) +
  scale_y_continuous(breaks = seq(1, k7_optimum, by = 1)) +
  geom_tile(aes(fill = values)) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(90)) +
  labs(
    title = "Clustering using K=7"
  )+
  my_plot_theme(10)

cust_cluster <- as.matrix(df_cust_cat[,-c(1)])

hm.palette <-colorRampPalette(rev(brewer.pal(10, 'RdYlGn')),space='Lab')

cluster <- c(1: k8_optimum)
center_df <- data.frame(cluster, k8_cust$centers)

# Reshape the data
center_reshape <- gather(center_df, features, values, food : pets)
#head(center_reshape)

#plot data
ggplot(data = center_reshape, aes(x = features, y = cluster)) +
  scale_y_continuous(breaks = seq(1, k8_optimum, by = 1)) +
  geom_tile(aes(fill = values)) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(90)) +
  labs(
    title = "Clustering using K=8"
  )+
  my_plot_theme(10)

Berdasarkan visualisasi diatas, kita akan memilih untuk mengelompokan customer ke 7 cluster karena cenderung memiliki pola yang unik/beridentitas dibandingkan dengan 8 cluster.

Cluster Profiling & Analysis

Berdasarkan visualisasi pada chart Clustering using K=7, maka berikut ini deskripsi dan label yang dapat kita berikan:

cluster 1 : Customer yang dominan membeli produk kategori Home, kita beri label = home decorator
cluster 2 : Customer yang dominan membeli produk kategori Food, kita beri label = coocking enthusiast
cluster 3 : Customer ini dominan membeli produk Fresh, namun juga sering membeli produk kategori Drinks,Food dan Home, kita beri label = family
cluster 4 : Customer ini sangat dominan membeli produk Beauty, kita beri label = beauty conscious
cluster 5 : Customer ini sangat dominan membeli produk Drinks, kita beri label = teenager
cluster 6 : Customer ini sangat dominan membeli produk Baby, kita beri label = new parents
cluster 7 : Customer ini memiliki pola pembelian yang hampir mirip dengan cluster 3, namun customer ini lebih cenderung membeli produk katgori food, drinks dan home. Terlihat seperti belanja bulanan, maka kita beri label = regular shopper

ub_data <- ulabox

ub_data <- ub_data %>% 
  group_by(customer) %>% 
  summarise(
    min_hour = as.numeric(min(hour)),
    max_hour = as.numeric(max(hour)),
    modus_hour = as.numeric(getmodus(hour)),
    food = mean(food),
    fresh = mean(fresh),
    drinks = mean(drinks),
    home = mean(home),
    beauty = mean(beauty),
    health = mean(health),
    baby = mean(baby),
    pets = mean(pets)
  ) %>% 
  ungroup() %>% arrange(customer)


ub_cust_cluster <- cbind(ub_data,cluster=k7_cust$cluster)
ub_cust_cluster$cluster <-  sapply(as.factor(ub_cust_cluster$cluster),switch,
                         "1"="home decorator",
                         "2"="cooking enthusiast",
                         "3"="family",
                         "4"="beauty conscious",
                         "5"="teenager",
                         "6"="new parents",
                         "7"="regular shopper")

data.frame(table(ub_cust_cluster$cluster)) %>% 
  mutate(
    Var1 = reorder(Var1,Freq)
  ) %>% 
  ggplot(aes(Var1,Freq))+
  geom_bar(stat="identity",aes(fill=Var1), show.legend = FALSE)+
  geom_text(aes(label=Freq, y=Freq+150),size=3, color="white")+
  labs(
    title = "Total Customer by Cluster",
    x = "Cluster",
    y = "Total Customer"
  )+
  my_theme_fill()+
  my_plot_theme(10)

Chart diatas adalah menunjukan informasi total customer berdasarkan hasil clustering yang sudah dilakukan. Customer kategori Beauty Concious paling sedikit dan customer kategori reguler shopper paling banyak. Oke, mari kita lihat pola frekuensi order dari setiap cluster.

cust_time <- ub_cust_cluster %>% select(customer,cluster) %>% 
  left_join(ulabox,by=c("customer","customer"))

hours <- data.frame(hour=c(0:23)) %>% mutate(hour=ordered(hour,0:23))
serror <- function(x) sqrt(var(x)/length(x)) 

# regular shopper
c_regular_shopper <- cust_time %>% filter(cluster=="regular shopper") %>%
  group_by(hour) %>% 
  summarise(
    freq_order = n(),
    total_items = sum(total_items)
  ) %>% 
  ungroup() %>% 
  right_join(hours, by=c("hour","hour")) %>% 
  mutate(
    freq_order = ifelse(is.na(freq_order),0,freq_order),
    total_items = ifelse(is.na(total_items),0,total_items),
    hour = reorder(hour,0:23)
  )

ggplot(c_regular_shopper, aes(x=hour, y=freq_order, fill = hour)) +
  geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
  geom_errorbar(aes(ymin = freq_order - serror(freq_order), 
                    ymax = freq_order + serror(freq_order), 
                    color = hour), 
                    width = .2) + 
  scale_y_continuous(breaks = 0:nlevels(hour)) +
  labs(
    title = "Hourly Order for Regular Shopper"
  )+
 theme_minimal() +
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, size=10, face="bold"),
        axis.text.y = element_blank(),
        axis.text.x=element_text(face="bold"))+
  coord_polar()+
  my_theme_fill()+
  my_theme_color() -> polar7


# new parents
c_new_parents <- cust_time %>% filter(cluster=="new parents") %>%
  group_by(hour) %>% 
  summarise(
    freq_order = n(),
    total_items = sum(total_items)
  ) %>% 
  ungroup() %>% 
  right_join(hours, by=c("hour","hour")) %>% 
  mutate(
    freq_order = ifelse(is.na(freq_order),0,freq_order),
    total_items = ifelse(is.na(total_items),0,total_items),
    hour = reorder(hour,0:23)
  )

ggplot(c_new_parents, aes(x=hour, y=freq_order, fill = hour)) +
  geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
  geom_errorbar(aes(ymin = freq_order - serror(freq_order), 
                    ymax = freq_order + serror(freq_order), 
                    color = hour), 
                    width = .2) + 
  scale_y_continuous(breaks = 0:nlevels(hour)) +
  labs(
    title = "Hourly Order for New Parents"
  )+
 theme_minimal() +
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, size=10, face="bold"),
        axis.text.y = element_blank(),
        axis.text.x=element_text(face="bold"))+
  coord_polar()+
  my_theme_fill()+
  my_theme_color() -> polar5


# home decorator
c_home_decorator <- cust_time %>% filter(cluster=="home decorator") %>%
  group_by(hour) %>% 
  summarise(
    freq_order = n(),
    total_items = sum(total_items)
  ) %>% 
  ungroup() %>% 
  right_join(hours, by=c("hour","hour")) %>% 
  mutate(
    freq_order = ifelse(is.na(freq_order),0,freq_order),
    total_items = ifelse(is.na(total_items),0,total_items),
    hour = reorder(hour,0:23)
  )

ggplot(c_home_decorator, aes(x=hour, y=freq_order, fill = hour)) +
  geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
  geom_errorbar(aes(ymin = freq_order - serror(freq_order), 
                    ymax = freq_order + serror(freq_order), 
                    color = hour), 
                    width = .2) + 
  scale_y_continuous(breaks = 0:nlevels(hour)) +
  labs(
    title = "Hourly Order for Home Decorator"
  )+
 theme_minimal() +
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, size=10, face="bold"),
        axis.text.y = element_blank(),
        axis.text.x=element_text(face="bold"))+
  coord_polar()+
  my_theme_fill()+
  my_theme_color() -> polar2


# teenager
c_teenager <- cust_time %>% filter(cluster=="teenager") %>%
  group_by(hour) %>% 
  summarise(
    freq_order = n(),
    total_items = sum(total_items)
  ) %>% 
  ungroup() %>% 
  right_join(hours, by=c("hour","hour")) %>% 
  mutate(
    freq_order = ifelse(is.na(freq_order),0,freq_order),
    total_items = ifelse(is.na(total_items),0,total_items),
    hour = reorder(hour,0:23)
  )

ggplot(c_teenager, aes(x=hour, y=freq_order, fill = hour)) +
  geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
  geom_errorbar(aes(ymin = freq_order - serror(freq_order), 
                    ymax = freq_order + serror(freq_order), 
                    color = hour), 
                    width = .2) + 
  scale_y_continuous(breaks = 0:nlevels(hour)) +
  labs(
    title = "Hourly Order for Teenager"
  )+
 theme_minimal() +
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, size=10, face="bold"),
        axis.text.y = element_blank(),
        axis.text.x=element_text(face="bold"))+
  coord_polar()+
  my_theme_fill()+
  my_theme_color() -> polar4


# beauty conscious
c_beauty_conscious <- cust_time %>% filter(cluster=="beauty conscious") %>%
  group_by(hour) %>% 
  summarise(
    freq_order = n(),
    total_items = sum(total_items)
  ) %>% 
  ungroup() %>% 
  right_join(hours, by=c("hour","hour")) %>% 
  mutate(
    freq_order = ifelse(is.na(freq_order),0,freq_order),
    total_items = ifelse(is.na(total_items),0,total_items),
    hour = reorder(hour,0:23)
  )

ggplot(c_beauty_conscious, aes(x=hour, y=freq_order, fill = hour)) +
  geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
  geom_errorbar(aes(ymin = freq_order - serror(freq_order), 
                    ymax = freq_order + serror(freq_order), 
                    color = hour), 
                    width = .2) + 
  scale_y_continuous(breaks = 0:nlevels(hour)) +
  labs(
    title = "Hourly Order for Beauty Conscious"
  )+
 theme_minimal() +
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, size=10, face="bold"),
        axis.text.y = element_blank(),
        axis.text.x=element_text(face="bold"))+
  coord_polar()+
  my_theme_fill()+
  my_theme_color() -> polar1



# cooking enthusiast
c_cooking_enthusiast <- cust_time %>% filter(cluster=="cooking enthusiast") %>%
  group_by(hour) %>% 
  summarise(
    freq_order = n(),
    total_items = sum(total_items)
  ) %>% 
  ungroup() %>% 
  right_join(hours, by=c("hour","hour")) %>% 
  mutate(
    freq_order = ifelse(is.na(freq_order),0,freq_order),
    total_items = ifelse(is.na(total_items),0,total_items),
    hour = reorder(hour,0:23)
  )

ggplot(c_cooking_enthusiast, aes(x=hour, y=freq_order, fill = hour)) +
  geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
  geom_errorbar(aes(ymin = freq_order - serror(freq_order), 
                    ymax = freq_order + serror(freq_order), 
                    color = hour), 
                    width = .2) + 
  scale_y_continuous(breaks = 0:nlevels(hour)) +
  labs(
    title = "Hourly Order for Cooking Enthusiast"
  )+
 theme_minimal() +
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, size=10, face="bold"),
        axis.text.y = element_blank(),
        axis.text.x=element_text(face="bold"))+
  coord_polar()+
  my_theme_fill()+
  my_theme_color() -> polar3



# family
c_family <- cust_time %>% filter(cluster=="family") %>%
  group_by(hour) %>% 
  summarise(
    freq_order = n(),
    total_items = sum(total_items)
  ) %>% 
  ungroup() %>% 
  right_join(hours, by=c("hour","hour")) %>% 
  mutate(
    freq_order = ifelse(is.na(freq_order),0,freq_order),
    total_items = ifelse(is.na(total_items),0,total_items),
    hour = reorder(hour,0:23)
  )

ggplot(c_family, aes(x=hour, y=freq_order, fill = hour)) +
  geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
  geom_errorbar(aes(ymin = freq_order - serror(freq_order), 
                    ymax = freq_order + serror(freq_order), 
                    color = hour), 
                    width = .2) + 
  scale_y_continuous(breaks = 0:nlevels(hour)) +
  labs(
    title = "Hourly Order for Family"
  )+
 theme_minimal() +
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, size=10, face="bold"),
        axis.text.y = element_blank(),
        axis.text.x=element_text(face="bold"))+
  coord_polar()+
  my_theme_fill()+
  my_theme_color() -> polar6



grid.arrange(polar1,polar2,polar3,polar4,polar5,polar6,polar7, ncol = 3)

Sebelumnya perlu diketahui, chart diatas bersifat individual sehingga kita tidak dapat membandingkan frekuensi order antar cluster dengan melihat tinggi rendahnya barplot. Namun, dari chart diatas dapat kita simpulkan bahwa frekuensi order minimum terjadi dari pukul 01.00 sampai 06.00. Selain itu kita juga mendapat informasi habbit setiap cluster dalam melakukan order berdasarkan jam per-harinya sebagai berikut:

Cluster Beauty Concious: cenderung melakukan order pada pukul 08.00 dan 17.00.
Cluster home decorator: cenderung melakukan transaksi dari pukul 08.00 sampai 00.00, namun frekuensi order paling tinggi terjadi dari pukul 10.00 sampai 12.00.
Cluster cooking enthusiast: cenderung melakukan transaksi dari pukul 09.00 sampai 00.00 dan frekuensi order paling tinggi terjadi pada pukul 12.00.
Cluster teenager: cenderung melakukan transaksi dari pukul 09.00 sampai 00.00 dan frekuensi order paling tinggi terjadi pada pukul 10.00 sampai 13.00 dan dari pukul 18.00 sampai 22.00.
Cluster new parents: cenderung melakukan transaksi dari pukul 08.00 sampai 00.00, namun frekuensi order paling tinggi terjadi pada pukul 10.00 sampai 11.00 dan 22.00 sampai 23.00.
Cluster family: cenderung melakukan transaksi dari pukul 08.00 sampai 00.00 dan frekuensi order paling tinggi terjadi pada pukul 10.00 sampai 13.00 dan pukul 20.00 sampai 22.00.
Cluster regular shopper: cenderung melakukan transaksi dari pukul 08.00 sampai 00.00 dan frekuensi order paling tinggi terjadi dari pukul 21.00 sampai 22.00.

Summary

Setelah dilakukan Clustering diketahui terdapat 7 Cluster Customer berdasarkan data keranjang belanja. Rata-rata customer ulabox melakukan order dari rentang waktu pukul 08.00 sampai dengan 00.00, namun setiap cluster memiliki habbit melakukan order pada jam yang cenderung berbeda. Hasil analisis yang dilakukan dapat membantu Ulabox untuk menentukan strategi penjualan pada masing-masing cluster. Salah satunya yang dapat kita sarankan yaitu menjalankan strategi promo dan diskon secara spesifik terhadap masing-masing cluster pada jam-jam yang memiliki frekuensi order tinggi.

Reference

Martinez, j. (2017) Ulabox orders with categories’ partials 2017

Hari, H. (2017) Customer segmentation analysis, Clustering

Arga,A. (2019) K-Means Clustering

Fauziyyah, N.A. (2019) Unsupervised Learning: Wheat Seed Analysis

Secun,S. (2017) Ulabox orders 2017- Clustering and regression