Customer Market Optimization
library(tidyverse)
library(lubridate)
library(scales)
library(readxl)
library(zoo)
#visualization
library(plotly)
library(glue)
library(RColorBrewer)
library(gridExtra)
library(sparkline)
#data model
library(rfm)
library(GGally)
#Segmentation
library(dbscan) # DBScan
library(factoextra) # K-Means
#recomendation
library(recommenderlab)Project Background
sumber: Statistik E-Commerce 2019, Badan Pusat Statistik Indonesia
Berdasarkan artikel yang dipublish oleh www.xendit.co , hasil riset Accenture mengatakan bahwa 40% konsumen meninggalkan suatu website bisnis dan melakukan pembelian di website atau toko online lain karena merasa kebingungan akibat terlalu banyaknya pilihan saat akan membuat keputusan pembelian.
Data Description
data yang digunakan adalah data transaksi yang terjadi antara 01/12/2010 sampai 09/12/2011 pada salah satu perusahaan online retail yang berbasis di Inggris. Banyak pelanggan dari perusahaan adalah grosir.
Sumber Data Online Retail Data Set
Data Preparation
Read Data
Total Data awal 541909 data dab 8 variabel.
df_input <- read_excel("data_input/online_retail.xlsx", trim_ws = TRUE)
data.frame("total.data" = dim(df_input)[1],
"total.variabel" = dim(df_input)[2])#> total.data total.variabel
#> 1 541909 8
10 Data teratas
#> # A tibble: 10 x 8
#> InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
#> <chr> <chr> <chr> <dbl> <dttm> <dbl>
#> 1 536365 85123A WHITE HANG~ 6 2010-12-01 08:26:00 2.55
#> 2 536365 71053 WHITE META~ 6 2010-12-01 08:26:00 3.39
#> 3 536365 84406B CREAM CUPI~ 8 2010-12-01 08:26:00 2.75
#> 4 536365 84029G KNITTED UN~ 6 2010-12-01 08:26:00 3.39
#> 5 536365 84029E RED WOOLLY~ 6 2010-12-01 08:26:00 3.39
#> 6 536365 22752 SET 7 BABU~ 2 2010-12-01 08:26:00 7.65
#> 7 536365 21730 GLASS STAR~ 6 2010-12-01 08:26:00 4.25
#> 8 536366 22633 HAND WARME~ 6 2010-12-01 08:28:00 1.85
#> 9 536366 22632 HAND WARME~ 6 2010-12-01 08:28:00 1.85
#> 10 536367 84879 ASSORTED C~ 32 2010-12-01 08:34:00 1.69
#> # ... with 2 more variables: CustomerID <dbl>, Country <chr>
Saya merasa lebih mudah jika seluruh nama kolom menggunakan lowercase dan setiap kata dipisahkan karakter underscore.
colnames(df_input) <- tolower(colnames(df_input))
df_input <- df_input %>%
rename("invoice_no" = invoiceno,
"stock_code" = stockcode,
"invoice_date" = invoicedate,
"unit_price" = unitprice,
"customer_id" = customerid)
colnames(df_input)#> [1] "invoice_no" "stock_code" "description" "quantity" "invoice_date"
#> [6] "unit_price" "customer_id" "country"
Variable Description
Berdasar informasi dari Online Retail Data Set berikut deskripsi datanya:
| Variable | Description |
|---|---|
| invoice_no | Nomor invoice terdiri dari 6 digit angkat yang bersifat unik untuk setiap transaksi. Jika diawali huruf C maka mengindikasikan transaksi berstatus Cancel |
| stock_code | Kode Produk yang bersifat unik setiap produk, terdiri dari 5 digit angka yang bersifat unik untuk setiap produk. |
| description | Nama Produk |
| quantity | Jumlah produk yang dibeli |
| invoice_date | Tanggal dan waktu transaksi |
| unit_price | harga produk per unit |
| customer_id | ID Customer yang terdiri dari 5 digit angka yang bersifat unik untuk setiap customer. |
| country | Negara customer |
Data Preprocessing
Data Structure
Berikut Struktur Datanya:
#> Observations: 541,909
#> Variables: 8
#> $ invoice_no <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ stock_code <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ unit_price <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ customer_id <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom", ...
data.frame(
invoice_unik = df_input$invoice_no %>% unique() %>% length(),
stock_code_unik = df_input$stock_code %>% unique() %>% length(),
description_unik = df_input$description %>% unique() %>% length(),
country_unik = df_input$country %>% unique() %>% length(),
customer_unik = df_input$customer_id %>% unique() %>% length()
)#> invoice_unik stock_code_unik description_unik country_unik customer_unik
#> 1 25900 4070 4212 38 4373
Dari data diatas country hanya berjumlah 38, sehingga mari kita ubah tipe datanya menjadi factor
Data Cleansing
Data Range Adjustment
Data ini berisikan transaksi antara 01/12/2010 sampai 09/12/2011. Data transaksi Desember 2011 tidak full 1 bulan, sehinnga saya memilih untuk melakukan analisa dari 01/12/2010 sampai 30/11/2011.
Cancelled Transaction
Sesuai deskripsi variabel, jika data invoice_no diawali huruf C maka mengindikasikan transaksi berstatus Cancel. Data transaksi yang dibatalkan tidak sesuai untuk kebutuhan analisis ini, maka perlu kita remove.
#> # A tibble: 1 x 1
#> total_cancelled_transaction
#> <int>
#> 1 8928
Terdapat 9288 transaksi yang dicancel. Data ini tidak kita gunakan maka data dapat kita exclude.
Invalid Invoice
Sesuai deskripsi, Invoice No yang valid memiliki 6 digit angka.
#> # A tibble: 3 x 8
#> invoice_no stock_code description quantity invoice_date unit_price
#> <chr> <chr> <chr> <dbl> <dttm> <dbl>
#> 1 A563185 B Adjust bad~ 1 2011-08-12 14:50:00 11062.
#> 2 A563186 B Adjust bad~ 1 2011-08-12 14:51:00 -11062.
#> 3 A563187 B Adjust bad~ 1 2011-08-12 14:52:00 -11062.
#> # ... with 2 more variables: customer_id <dbl>, country <fct>
Terdapat 3 data transaksi yang tidak valid, maka data ini dapat di exclude.
Invalid Quantity
Apakah ada data yang memiliki quantity<=0 ? Berapa banyak?
#> [1] 1306
Terdapat 1306 data transaksi yang memiliki quantity<=0, sehingga data ini perlu dibuang karena tidak valid jika suatu transaksi tidak memiliki quantity.
Invalid Product
Sesuai deskripsi variabel, stock code selalu diawali dengan 5 digit angka. Untuk hal ini, kita akan substring 5 digit pertama dari stock code kemudian kita convert ke tipe data numerik. Jika stock code tersebut menjadi NA Value, maka stock tersebut tidak valid.
# Create Stock Validation
# substring first 5 digit of stock code then mutate to numeric
# if there are any NA Values of stock code validation, then that product isn't valid
stock_valid <- df_input %>% mutate(
stock_code_validation = substr(stock_code,start = 1, stop = 5)
) %>% mutate(
stock_code_validation = as.numeric(stock_code_validation)
) %>% select(stock_code,stock_code_validation,description) %>% distinct()
df_input %>% filter(description %in% (stock_valid %>% filter(is.na(stock_code_validation)) %>% .$description))#> # A tibble: 2,888 x 8
#> invoice_no stock_code description quantity invoice_date unit_price
#> <chr> <chr> <chr> <dbl> <dttm> <dbl>
#> 1 536370 POST POSTAGE 3 2010-12-01 08:45:00 18
#> 2 536403 POST POSTAGE 1 2010-12-01 11:27:00 15
#> 3 536414 22139 <NA> 56 2010-12-01 11:52:00 0
#> 4 536527 POST POSTAGE 1 2010-12-01 13:04:00 18
#> 5 536540 C2 CARRIAGE 1 2010-12-01 14:05:00 50
#> 6 536544 DOT DOTCOM POS~ 1 2010-12-01 14:32:00 570.
#> 7 536545 21134 <NA> 1 2010-12-01 14:32:00 0
#> 8 536546 22145 <NA> 1 2010-12-01 14:33:00 0
#> 9 536547 37509 <NA> 1 2010-12-01 14:33:00 0
#> 10 536549 85226A <NA> 1 2010-12-01 14:34:00 0
#> # ... with 2,878 more rows, and 2 more variables: customer_id <dbl>,
#> # country <fct>
Terdapat 2888 data transaksi yang memiliki stock code tidak valid. Jika dilihat dari deskripsi data tersebut merupakan ongkos kirim, bank charges, NA Values dan lainnya. Seluruh data transaksi diatas dapat kita exclude karena tidak sesuai untuk kebutuhan analisis ini.
invalid_stock <- stock_valid %>% filter(is.na(stock_code_validation)) %>% .$description
df_input <- df_input %>% filter(!description %in% invalid_stock)Mengacu pada artikel yang dipublish oleh Diego Usai pada Mar 14, 2019 terkait Market Basket Analysis menggunakan data online retail ini, ia berhasil menemukan 50 deskripsi yang terkesan diinputkan secara manual yaitu:
# Additional adjustment codes to remove
descr <- c( "check", "check?", "?", "??", "damaged", "found",
"adjustment", "Amazon", "AMAZON", "amazon adjust",
"Amazon Adjustment", "amazon sales", "Found", "FOUND",
"found box", "Found by jackie ","Found in w/hse","dotcom",
"dotcom adjust", "allocate stock for dotcom orders ta", "FBA",
"Dotcomgiftshop Gift Voucher £100.00", "on cargo order",
"wrongly sold (22719) barcode", "wrongly marked 23343",
"dotcomstock", "rcvd be air temp fix for dotcom sit",
"Manual", "John Lewis", "had been put aside",
"for online retail orders", "taig adjust", "amazon",
"incorrectly credited C550456 see 47", "returned",
"wrongly coded 20713", "came coded as 20713",
"add stock to allocate online orders", "Adjust bad debt",
"alan hodge cant mamage this section", "website fixed",
"did a credit and did not tick ret", "michel oops",
"incorrectly credited C550456 see 47", "mailout", "test",
"Sale error", "Lighthouse Trading zero invc incorr", "SAMPLES",
"Marked as 23343", "wrongly coded 23343","Adjustment",
"rcvd be air temp fix for dotcom sit", "Had been put aside." )
descr#> [1] "check" "check?"
#> [3] "?" "??"
#> [5] "damaged" "found"
#> [7] "adjustment" "Amazon"
#> [9] "AMAZON" "amazon adjust"
#> [11] "Amazon Adjustment" "amazon sales"
#> [13] "Found" "FOUND"
#> [15] "found box" "Found by jackie "
#> [17] "Found in w/hse" "dotcom"
#> [19] "dotcom adjust" "allocate stock for dotcom orders ta"
#> [21] "FBA" "Dotcomgiftshop Gift Voucher £100.00"
#> [23] "on cargo order" "wrongly sold (22719) barcode"
#> [25] "wrongly marked 23343" "dotcomstock"
#> [27] "rcvd be air temp fix for dotcom sit" "Manual"
#> [29] "John Lewis" "had been put aside"
#> [31] "for online retail orders" "taig adjust"
#> [33] "amazon" "incorrectly credited C550456 see 47"
#> [35] "returned" "wrongly coded 20713"
#> [37] "came coded as 20713" "add stock to allocate online orders"
#> [39] "Adjust bad debt" "alan hodge cant mamage this section"
#> [41] "website fixed" "did a credit and did not tick ret"
#> [43] "michel oops" "incorrectly credited C550456 see 47"
#> [45] "mailout" "test"
#> [47] "Sale error" "Lighthouse Trading zero invc incorr"
#> [49] "SAMPLES" "Marked as 23343"
#> [51] "wrongly coded 23343" "Adjustment"
#> [53] "rcvd be air temp fix for dotcom sit" "Had been put aside."
Saya setuju dengan hasil temuannya, oleh karena itu data transaksi yang memiliki deskripsi diatas perlu di remove.
Duplicated StockCode-Description
Proses ini penting karena data stock code dan description harus bersifat unik supaya matriks produk untuk sistem rekomendasi bisa valid.
df_product <- df_input %>% select(stock_code,description) %>% distinct()
df_product <- df_product %>%
mutate(stock_code_lowercase = tolower(stock_code),
description_lowercase = tolower(description))data_frame(
stock_code_unik = df_product$stock_code %>% unique() %>% length(),
stock_code_unik_lowercase = df_product$stock_code_lowercase %>% unique() %>% length(),
description_unik = df_product$description %>% unique() %>% length(),
description_unik_lower = df_product$description_lowercase %>% unique() %>% length()
)#> # A tibble: 1 x 4
#> stock_code_unik stock_code_unik_lowerca~ description_unik description_unik_lo~
#> <int> <int> <int> <int>
#> 1 3898 3790 3993 3993
Jumlah stock code unik dan produk unik tidak sama, mari kita cek dari sisi stock code dahulu.
stock_code_check_dupli <- df_product %>% select(stock_code,stock_code_lowercase) %>%
distinct() %>%
group_by(stock_code_lowercase) %>%
summarise(freq = n()) %>%
ungroup() %>%
filter(freq>1)Data diatas merupakan data stock code yang duplikat apabila kita ubah menjadi lowecase. Mari kita cek apakah benar duplikat.
df_product %>% filter(stock_code_lowercase %in% stock_code_check_dupli$stock_code_lowercase) %>%
arrange(stock_code_lowercase)#> # A tibble: 227 x 4
#> stock_code description stock_code_lowerca~ description_lowercase
#> <chr> <chr> <chr> <chr>
#> 1 15056BL EDWARDIAN PARASOL BLA~ 15056bl edwardian parasol black
#> 2 15056bl EDWARDIAN PARASOL BLA~ 15056bl edwardian parasol black
#> 3 15056N EDWARDIAN PARASOL NAT~ 15056n edwardian parasol natu~
#> 4 15056n EDWARDIAN PARASOL NAT~ 15056n edwardian parasol natu~
#> 5 15056P EDWARDIAN PARASOL PINK 15056p edwardian parasol pink
#> 6 15056p EDWARDIAN PARASOL PINK 15056p edwardian parasol pink
#> 7 15060B FAIRY CAKE DESIGN UMB~ 15060b fairy cake design umbr~
#> 8 15060b FAIRY CAKE DESIGN UMB~ 15060b fairy cake design umbr~
#> 9 18098C PORCELAIN BUTTERFLY O~ 18098c porcelain butterfly oi~
#> 10 18098c PORCELAIN BUTTERFLY O~ 18098c porcelain butterfly oi~
#> # ... with 217 more rows
Ternyata terdapat data stock code yang duplicated karena efek case sensitive. Oleh karena itu, seluruh stock_code akan kita convert menjadi UPPERCASE untuk mengilahkan efek case sensitive.
df_input <- df_input %>% mutate(
stock_code = toupper(stock_code)
)
data_frame(
jumlah_stock_code_unik = df_input$stock_code %>% unique() %>% length(),
stock_code_unik_lowercase = df_input$stock_code %>% tolower() %>% unique() %>% length(),
description_unik = df_input$description %>% unique() %>% length(),
description_unik_lower = df_input$description %>% tolower() %>% unique() %>% length()
)#> # A tibble: 1 x 4
#> jumlah_stock_code_u~ stock_code_unik_low~ description_unik description_unik_l~
#> <int> <int> <int> <int>
#> 1 3790 3790 3993 3993
Oke, data stock_code sudah clean. Namun, seharusnya jumlah stock_code dan description berjumlah sama karena bersifat unik. Hal ini mengindikasikan duplikat data. Mari kita cek.
description_check <- df_input %>% select(stock_code,description) %>%
distinct() %>%
group_by(stock_code) %>%
summarise(freq = n()) %>%
ungroup() %>%
filter(freq>1)
description_check#> # A tibble: 212 x 2
#> stock_code freq
#> <chr> <int>
#> 1 16156L 2
#> 2 17107D 3
#> 3 20622 2
#> 4 20725 2
#> 5 20914 2
#> 6 21109 2
#> 7 21112 2
#> 8 21175 2
#> 9 21192 2
#> 10 21232 2
#> # ... with 202 more rows
Data diatas adalah data stock code yang memiliki description > 1. Mari kita sampling datanya.
df_input %>% filter(stock_code %in% description_check$stock_code) %>%
select(stock_code,description) %>%
distinct() %>%
arrange(stock_code,description)#> # A tibble: 443 x 2
#> stock_code description
#> <chr> <chr>
#> 1 16156L WRAP CAROUSEL
#> 2 16156L WRAP, CAROUSEL
#> 3 17107D FLOWER FAIRY 5 DRAWER LINERS
#> 4 17107D FLOWER FAIRY 5 SUMMER DRAW LINERS
#> 5 17107D FLOWER FAIRY,5 SUMMER B'DRAW LINERS
#> 6 20622 VIP PASSPORT COVER
#> 7 20622 VIPPASSPORT COVER
#> 8 20725 LUNCH BAG RED RETROSPOT
#> 9 20725 LUNCH BAG RED SPOTTY
#> 10 20914 SET/5 RED RETROSPOT LID GLASS BOWLS
#> # ... with 433 more rows
Dari hasil pengecekan diatas dapat kita simpulkan bahwa terdapat kesalahan pada deskripsi yang berupa tanda baca, spasi hingga kesalahan penulisan deskripsi produk. Untuk itu kita akan generate deskripsi produk menggunakan data pertamanya.
df_description <- df_input %>% select(stock_code,description) %>%
filter(stock_code %in% description_check$stock_code) %>%
distinct() %>%
group_by(stock_code) %>%
slice(1) %>%
ungroup()
df_input <- df_input %>% left_join(df_description, by=c("stock_code","stock_code")) %>%
mutate(description = ifelse(is.na(description.y),description.x,description.y)) %>%
select(-c(description.y,description.x))
data_frame(
jumlah_stock_code_unik = df_input$stock_code %>% unique() %>% length(),
jumlah_description_unik = df_input$description %>% unique() %>% length()
)#> # A tibble: 1 x 2
#> jumlah_stock_code_unik jumlah_description_unik
#> <int> <int>
#> 1 3790 3764
Oke, selisih nya sudah mulai berkurang. Jumlah diatas mengindikasikan terdapat stock_code yang memiliki description sama.
df_description <- df_input %>% select(stock_code,description) %>%
distinct() %>%
group_by(description) %>%
summarise(freq = n()) %>%
ungroup() %>%
filter(freq>1)
df_description#> # A tibble: 24 x 2
#> description freq
#> <chr> <int>
#> 1 BATHROOM METAL SIGN 2
#> 2 COLOURING PENCILS BROWN TUBE 2
#> 3 COLUMBIAN CANDLE RECTANGLE 2
#> 4 COLUMBIAN CANDLE ROUND 3
#> 5 EAU DE NILE JEWELLED PHOTOFRAME 2
#> 6 FRENCH FLORAL CUSHION COVER 2
#> 7 FRENCH LATTICE CUSHION COVER 2
#> 8 FRENCH PAISLEY CUSHION COVER 2
#> 9 FROSTED WHITE BASE 2
#> 10 HEART T-LIGHT HOLDER 2
#> # ... with 14 more rows
Terdapat 24 stock code yang memiliki deskripsi sama.
df_input %>% select(stock_code,description) %>%
distinct() %>%
filter(description %in% c(df_description$description)) %>%
arrange(description)#> # A tibble: 50 x 2
#> stock_code description
#> <chr> <chr>
#> 1 82580 BATHROOM METAL SIGN
#> 2 21171 BATHROOM METAL SIGN
#> 3 10133 COLOURING PENCILS BROWN TUBE
#> 4 10135 COLOURING PENCILS BROWN TUBE
#> 5 72133 COLUMBIAN CANDLE RECTANGLE
#> 6 72131 COLUMBIAN CANDLE RECTANGLE
#> 7 72127 COLUMBIAN CANDLE ROUND
#> 8 72130 COLUMBIAN CANDLE ROUND
#> 9 72128 COLUMBIAN CANDLE ROUND
#> 10 85023B EAU DE NILE JEWELLED PHOTOFRAME
#> # ... with 40 more rows
Untuk case ini kita bisa berasumsi bahwa produk tersebut sama. Sehinnga, kita akan generate setiap description menggunakan stock code pertama.
df_product_unik <- df_input %>% select(stock_code,description) %>%
filter(description %in% c(df_description$description)) %>%
distinct() %>%
group_by(description) %>%
slice(1) %>%
ungroup()
df_input <- df_input %>% left_join(df_product_unik, by=c("description","description")) %>%
mutate(stock_code = ifelse(is.na(stock_code.y),stock_code.x,stock_code.y)) %>%
select(-c("stock_code.x","stock_code.y")) %>%
select(invoice_no,invoice_date,customer_id,country,stock_code,description,quantity,unit_price)
data_frame(
jumlah_stock_code_unik = df_input$stock_code %>% unique() %>% length(),
jumlah_description_unik = df_input$description %>% unique() %>% length(),
jumlah_code_description_unik = df_input %>% select(stock_code,description) %>% distinct() %>% nrow()
)#> # A tibble: 1 x 3
#> jumlah_stock_code_unik jumlah_description_unik jumlah_code_description_unik
#> <int> <int> <int>
#> 1 3764 3764 3764
Oke, setiap stock sudah bersifat unik. mari kita lanjut cek data country.
Invalid Country
data_frame(
customer_id_unik = df_input %>% select(customer_id,country) %>% distinct() %>% nrow(),
customer_country_unik = df_input %>% select(customer_id) %>% distinct() %>% nrow()
)#> # A tibble: 1 x 2
#> customer_id_unik customer_country_unik
#> <int> <int>
#> 1 4311 4295
Berdasar data diatas terdapat 1 customer yang memiliki 2 negara. Mungkin bisa karena customer tersebut pindah, oleh karena itu kita bisa ambil negara customer berdasarkan negara terakhir ia melakukan transaksi.
df_master_customer <- df_input %>%
arrange(desc(invoice_date,customer_id)) %>%
select(customer_id, country) %>%
group_by(customer_id) %>%
slice(1)
df_input <- df_input %>% select(-country) %>%
left_join(df_master_customer, by = c("customer_id","customer_id"))
data_frame(
customer_id_unik = df_input %>% select(customer_id,country) %>% distinct() %>% nrow(),
customer_country_unik = df_input %>% select(customer_id) %>% distinct() %>% nrow()
)#> # A tibble: 1 x 2
#> customer_id_unik customer_country_unik
#> <int> <int>
#> 1 4295 4295
Oke, setiap data customer dan negara sudah sesuai. mari kita lanjut cek data duplikat.
Duplicated Data
data.frame(
jumlah_data = df_input %>% nrow(),
jumlah_data_unik = df_input %>% distinct() %>% nrow()
)#> jumlah_data jumlah_data_unik
#> 1 503105 498077
Dataset ini memiliki data yang duplikat, untuk itu perlu kita remove.
#> [1] 498077
Missing Values
#> invoice_no invoice_date customer_id stock_code description quantity
#> 0 0 123903 0 0 0
#> unit_price country
#> 0 0
Dapat dilihat terdapat 123938 Missing Value pada customer_id. Projek ini ditujukan untuk melakukan segmentasi pelanggan dan membuat personalisasi rekomendasi produk. Segmentasi pelanggan jelas harus mengetahui siapa pelangannya, sehingga data transaksi yang tidak memiliki data customer_id perlu di-exclude. Kemudian untuk membangun sistem rekomendasi, kita bisa mengabaikan customer nya dan fokus pada produk yang dibeli, sehingga dapat menggunakan data invoice_no dan stock_code.
Oleh karena itu dapat kita putuskan untuk membagi data ini menjadi 3 dataset, yaitu:
1. df_general_transaction : dataset data original.
1. df_customer_transaction : dataset ini digungakan untuk segmentasi pelanggan, sehingga harus mengexclude missing values.
2. df_product_recomm : dataset untuk membangun sistem rekomendasi produk yang hanya terdiri dari customer_id, invoice_no, stock_code dan description.
Feature Extraction
Kita bisa mengekstrak data total amount per transaksi berdasarkan quantity * unit_price mempermudah proses analisis selanjutnya.
df_general_transaction <- df_general_transaction %>% mutate(total_amount = quantity * unit_price) %>%
select(invoice_no,invoice_date,customer_id,country,stock_code,description,quantity,unit_price,total_amount)
df_customer_transaction <- df_customer_transaction %>% mutate(total_amount = quantity * unit_price) %>%
select(invoice_no,invoice_date,customer_id,country,stock_code,description,quantity,unit_price,total_amount)
head(df_general_transaction,5)#> # A tibble: 5 x 9
#> invoice_no invoice_date customer_id country stock_code description
#> <chr> <dttm> <dbl> <fct> <chr> <chr>
#> 1 536365 2010-12-01 08:26:00 17850 United~ 85123A WHITE HANG~
#> 2 536365 2010-12-01 08:26:00 17850 United~ 71053 WHITE META~
#> 3 536365 2010-12-01 08:26:00 17850 United~ 84406B CREAM CUPI~
#> 4 536365 2010-12-01 08:26:00 17850 United~ 84029G KNITTED UN~
#> 5 536365 2010-12-01 08:26:00 17850 United~ 84029E RED WOOLLY~
#> # ... with 3 more variables: quantity <dbl>, unit_price <dbl>,
#> # total_amount <dbl>
Data Summary
df_general_transaction
#> Observations: 498,077
#> Variables: 9
#> $ invoice_no <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ customer_id <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country <fct> United Kingdom, United Kingdom, United Kingdom, United...
#> $ stock_code <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ unit_price <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ total_amount <dbl> 15.30, 20.34, 22.00, 20.34, 20.34, 15.30, 25.50, 11.10...
df_customer_transaction
#> Observations: 374,174
#> Variables: 9
#> $ invoice_no <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ customer_id <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country <fct> United Kingdom, United Kingdom, United Kingdom, United...
#> $ stock_code <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ unit_price <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ total_amount <dbl> 15.30, 20.34, 22.00, 20.34, 20.34, 15.30, 25.50, 11.10...
Data ini akan digunakan untuk melakukan segmentasi berdasarkan Recency, Frequency dan Monetary. Untuk itu perlu kita cek apakah terdapat data yang memiliki unit_price <= 0 ? Berapa banyak?
#> [1] 33
Terdapat 33 data transaksi yang memiliki unit_price <= 0. Bisa jadi karena efek promo, diskon atau kesalahan input. Namun karena kita tidak mengetahui data aslinya, data ini dapat kita remove karena jumlahnya juga tidak banyak.
df_product_recomm
#> Observations: 498,077
#> Variables: 4
#> $ customer_id <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
#> $ invoice_no <chr> "536365", "536365", "536365", "536365", "536365", "5363...
#> $ stock_code <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752...
#> $ description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANT...
wd <- as.character(getwd())
saveRDS(object=df_product_recomm, file=paste(paste(wd,"/data_clean/",sep = ""),"df_product_recomm.rds",sep=""))kita ekstrak seluruh produk nya untuk menjadi master produk.
df_master_product <- df_product_recomm %>% select(stock_code,description) %>% distinct() %>% arrange(stock_code)
glimpse(df_master_product)#> Observations: 3,764
#> Variables: 2
#> $ stock_code <chr> "10002", "10080", "10120", "10123C", "10124A", "10124G"...
#> $ description <chr> "INFLATABLE POLITICAL GLOBE", "GROOVY CACTUS INFLATABLE...
Exploratory Data Analysis
Transaction
Berapa Total Order Amount yang dilakukan?
#> # A tibble: 1 x 1
#> total_order_amount
#> <dbl>
#> 1 8224194.
Bagaimana Total Order Amount tiap bulan?
plot_monthly_transaction <- df_customer_transaction %>%
mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
yearmonth = ymd(yearmonth),
ym = as.yearmon(invoice_date)) %>%
group_by(ym,yearmonth) %>%
summarise(total_order_amount = sum(total_amount)) %>%
ungroup() %>%
mutate(
normalisasi = (total_order_amount-min(total_order_amount))/(max(total_order_amount)-min(total_order_amount)),
popup=glue("Year-Month : {ym}
Total Amount : {total_order_amount}")
) %>%
ggplot(aes(yearmonth,normalisasi))+
geom_area(fill="green",alpha=0.7)+
geom_line(size=0.7,color="#181818") +
labs(
title = "Total Transaction per Month",
x = "Month-Year",
y = NULL
)+
geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
scale_x_date(breaks=date_breaks('1 months'),
labels=date_format('%b %y'))+
theme(
axis.text.y = element_blank()
)
ggplotly(plot_monthly_transaction, tooltip = "text")Frequency Order
Berapa Total Transaksi yang terjadi?
#> [1] 17656
Bagaimana total transaksi yang terjadi tiap bulan?
plot_monthly_frequency <- df_customer_transaction %>% select(invoice_date,invoice_no) %>%
distinct() %>%
mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
yearmonth = ymd(yearmonth),
ym = as.yearmon(invoice_date)) %>%
group_by(ym,yearmonth) %>%
summarise(total_frequency = n()) %>%
ungroup() %>%
mutate(
normalisasi = (total_frequency-min(total_frequency))/(max(total_frequency)-min(total_frequency)),
popup=glue("Year-Month : {ym}
Total Transaction : {total_frequency}")
) %>%
ggplot(aes(yearmonth,normalisasi))+
geom_area(fill="yellow",alpha=0.7)+
geom_line(size=0.7,color="#181818") +
labs(
title = "Total Transaction per Month",
x = "Month-Year",
y = NULL
)+
geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
scale_x_date(breaks=date_breaks('1 months'),
labels=date_format('%b %y'))+
theme(
axis.text.y = element_blank()
)
ggplotly(plot_monthly_frequency, tooltip = "text")Customer
Berapa total customer yang ada?
#> [1] 4293
Bagaimana jumlah customer yang melakukan transaksi tiap bulan?
plot_monthly_customer <- df_customer_transaction %>% select(invoice_date,customer_id) %>%
mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
yearmonth = ymd(yearmonth),
ym = as.yearmon(invoice_date)) %>%
select(customer_id,ym,yearmonth) %>%
distinct() %>%
group_by(ym,yearmonth) %>%
summarise(total_customer = n()) %>%
ungroup() %>%
mutate(
normalisasi = (total_customer-min(total_customer))/(max(total_customer)-min(total_customer)),
popup=glue("Year-Month : {ym}
Total Customer : {total_customer}")
) %>%
ggplot(aes(yearmonth,normalisasi))+
geom_area(fill="blue",alpha=0.7)+
geom_line(size=0.7,color="#181818") +
labs(
title = "Total Customer per Month",
x = "Month-Year",
y = NULL
)+
geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
scale_x_date(breaks=date_breaks('1 months'),
labels=date_format('%b %y'))+
theme(
axis.text.y = element_blank()
)
ggplotly(plot_monthly_customer, tooltip = "text")Pertumbuhan customer baru tiap bulan?
plot_monthly_new_customer <- df_customer_transaction %>%
group_by(customer_id) %>%
summarise(first_order = min(invoice_date)) %>%
ungroup() %>%
mutate( yearmonth = format(first_order, format="%Y-%m-1"),
yearmonth = ymd(yearmonth),
ym = as.yearmon(first_order)) %>%
group_by(ym,yearmonth) %>%
summarise(total_new_customer = n()) %>%
ungroup() %>%
mutate(
normalisasi = (total_new_customer-min(total_new_customer))/(max(total_new_customer)-min(total_new_customer)),
popup=glue("Year-Month : {ym}
Total New Customer : {total_new_customer} ({round((total_new_customer/sum(total_new_customer))*100,1)}%)")
) %>%
ggplot(aes(yearmonth,normalisasi))+
geom_area(fill="blue",alpha=0.7)+
geom_line(size=0.7,color="#181818") +
labs(
title = "Growth of New Customer",
x = "Month-Year",
y = NULL
)+
geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
scale_x_date(breaks=date_breaks('1 months'),
labels=date_format('%b %y'))+
theme(
axis.text.y = element_blank()
)
ggplotly(plot_monthly_new_customer, tooltip = "text")Order Habbit
Hari apa customer sering berbelanja?
plot_order_wday <- df_customer_transaction %>% select(invoice_date,invoice_no) %>%
distinct() %>%
mutate(wday = wday(invoice_date,week_start = getOption("lubridate.week.start", 1))) %>%
group_by(wday) %>%
summarise(total_transaction = n()) %>%
ungroup() %>%
mutate(popup = glue("Weekday: {wday}
Total Transaction: {total_transaction}")) %>%
ggplot(aes(wday,total_transaction))+
geom_bar(stat="identity", aes(fill=total_transaction, text=popup), show.legend = FALSE)+
labs(
title = "Day of Most Order Frequency",
x = "Hour of Day",
y = NULL)+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
ggplotly(plot_order_wday, tooltip="text")Jam berapa customer sering berbelanja?
plot_order_time <- data_frame(hour_of_day = c(1:24)) %>% left_join(
df_customer_transaction %>% select(invoice_date,invoice_no) %>%
distinct() %>%
mutate(hour_of_day = hour(invoice_date)) %>%
group_by(hour_of_day) %>%
summarise(total_transaction = n()) %>%
ungroup(), by=c("hour_of_day","hour_of_day")) %>%
mutate(total_transaction = replace_na(total_transaction,0)) %>%
mutate(popup = glue("Hour of Day : {hour_of_day}
Total Transaction: {total_transaction}")) %>%
ggplot(aes(hour_of_day,total_transaction))+
geom_bar(stat="identity", aes(fill=total_transaction, text=popup), show.legend = FALSE)+
labs(
title = "Time of Most Order Frequency",
x = "Hour of Day",
y = NULL)
ggplotly(plot_order_time, tooltip="text")RFM Segmentation
Metode analisis Recency, Frequency, Monetary Value (RFM) adalah salah satu metode analisis perilaku pelanggan berdasarkan histori transaksinya. Output dari analisis RFM dapat digunakan untuk melakukan segmentasi pelanggan. Metode analisis RFM dapat menjelaskan:
- Seberapa baru pelanggan melakukan transaksi?
- Seberapa sering pelanggan melaukan transaksi?
- Seberapa besar transaksi yang sudah dilakukan pelanggan?
Sesuai artikel yang dipublish oleh www.marketeers.com, GO-JEK adalah salah satu perusahaan yang menggunakan metode analisis RFM dalam menentukan segmentasi pelanggan. GO-JEK membagi segmentasi pelanggan ke dalam empat kelas, yaitu Gold, Silver, Bronze, dan Non-Profit. Segmentasi Gold memiliki kualifikasi konsumen dengan high monetary, high frequency, dan high recency. Silver memiliki kualifikasi konsumen dengan tingkat monetary yang tinggi, frequency yang rendah, dan recency yang tinggi. Kategori Bronze terdiri dari konsumen dengan tingkat monetary rendah, frequency, dan recency yang tinggi. Sementara segmentasi Non-Profit memiliki kualifikasi konsumen dengan kualifikasi monetary, frequency, dan recency yang rendah.
Dalam case ini, dataset berisikan data transaksi dari 1 Desember 2010 sampai 30 November 2011, sehingga berikut perlakukan RFM Value untuk case ini:
Recency : Selisih antara hari terakhir pelanggan melakukan transaksi dan hari melakukan analisis. Dalam case ini, hari analisis menggunakan data hari terakhir transaksi + 1. Frequency : Jumlah transaksi yang dilakukan oleh pelanggan dari 1 Desember 2010 sampai 30 November 2011. Monetary : Jumlah total order amount yang sudah dikeluarkan pelanggan dari 1 Desember 2010 sampai 30 November 2011.
Generate RFM Value
Berikut sample data RFM Value:
analysis_date = date(max(df_customer_transaction$invoice_date))+days(1)
df_customer_rfm <- df_customer_transaction %>%
group_by(customer_id,country) %>%
summarise(
first_order = as.Date(min(invoice_date)),
last_order = as.Date(max(invoice_date)),
min_amount_order = min(total_amount),
med_amount_order = median(total_amount),
max_amount_order = max(total_amount),
recency = as.integer(analysis_date-date(last_order)),
monetary = sum(total_amount))
df_customer_rfm <- df_customer_transaction %>% select(customer_id,invoice_no) %>%
distinct() %>%
group_by(customer_id) %>%
summarise(frequency = n()) %>%
ungroup() %>%
left_join(df_customer_rfm,by=c("customer_id","customer_id"))
# Baskets size
df_customer_rfm <- df_customer_transaction %>% group_by(customer_id,invoice_no) %>%
summarise(freq = n()) %>% ungroup() %>%
group_by(customer_id) %>%
summarise(
min_baskets = min(freq),
max_baskets = max(freq),
avg_baskets = round(median(freq),0)
) %>% ungroup() %>%
left_join(df_customer_rfm, by = c("customer_id","customer_id")) %>%
select(customer_id, country, first_order,last_order,min_amount_order,med_amount_order,max_amount_order,
min_baskets,max_baskets,avg_baskets,recency, frequency, monetary)
head(df_customer_rfm,10)#> # A tibble: 10 x 13
#> customer_id country first_order last_order min_amount_order med_amount_order
#> <dbl> <fct> <date> <date> <dbl> <dbl>
#> 1 12346 United~ 2011-01-18 2011-01-18 77184. 77184.
#> 2 12347 Iceland 2010-12-07 2011-10-31 5.04 17
#> 3 12348 Finland 2010-12-16 2011-09-25 13.2 41.8
#> 4 12349 Italy 2011-11-21 2011-11-21 6.64 17.7
#> 5 12350 Norway 2011-02-02 2011-02-02 8.5 18.8
#> 6 12352 Norway 2011-02-16 2011-11-03 9.9 17.4
#> 7 12353 Bahrain 2011-05-19 2011-05-19 11.6 18.8
#> 8 12354 Spain 2011-04-21 2011-04-21 8.5 17.0
#> 9 12355 Bahrain 2011-05-09 2011-05-09 17.7 25.5
#> 10 12356 Portug~ 2011-01-18 2011-11-17 3.75 34.8
#> # ... with 7 more variables: max_amount_order <dbl>, min_baskets <int>,
#> # max_baskets <int>, avg_baskets <dbl>, recency <int>, frequency <int>,
#> # monetary <dbl>
Recency Segmentation
Data ini merupakan histori transksi selama 1 tahun, untuk itu bisa kita buat aturan sebagai berikut:
- Active : Recency <= 30
- Warm : 30 < Recency <= 90
- Cold : 90 < Recency <= 180
- Inactive : Recency > 180
recencny_adjust <- data.frame(
active = 30,
warm = 90,
cold = 180,
inactive = Inf
)
df_customer_rfm <- df_customer_rfm %>% mutate(
recency_segment = case_when(recency <= recencny_adjust$active ~ "active",
recency > recencny_adjust$active & recency <= recencny_adjust$warm ~ "warm",
recency > recencny_adjust$warm & recency <= recencny_adjust$cold ~ "cold",
TRUE ~ "inactive"),
recency_segment = factor(recency_segment, levels=c(names(recencny_adjust)))
)recency_segmentation <- df_customer_rfm %>% group_by(recency) %>% summarise(freq=n()) %>% ungroup()
plot_recency_segmentation <- ggplot(recency_segmentation,aes(recency,freq))+
#geom_line()+
geom_vline(xintercept = recencny_adjust$active,linetype="dotted",color = "black", size=1,alpha=0.5)+
geom_rect(aes(xmin = 0, xmax = recencny_adjust$active,
ymin = 0, ymax = Inf),fill = "green", alpha = 0.002)+
geom_vline(xintercept = recencny_adjust$warm,linetype="dotted",color = "black", size=1,alpha=0.5)+
geom_rect(aes(xmin = recencny_adjust$active, xmax = recencny_adjust$warm,
ymin = 0, ymax = Inf),fill = "orange", alpha = 0.002)+
geom_vline(xintercept = recencny_adjust$cold,linetype="dotted",color = "black", size=1, alpha=0.5)+
geom_rect(aes(xmin = recencny_adjust$warm, xmax = recencny_adjust$cold,
ymin = 0, ymax = Inf),fill = "blue", alpha = 0.002)+
#geom_vline(xintercept = recencny_adjust$cold,linetype="dotted",color = "black", size=1)+
geom_rect(aes(xmin = recencny_adjust$cold, xmax = Inf,
ymin = 0, ymax = Inf),fill = "red", alpha = 0.8)+
geom_area(fill="black",alpha=0.2)+
geom_point(aes(x=recencny_adjust$active, y=10), colour="black", size=3)+
annotate("text", x = recencny_adjust$active, y = 10+5,
color = "black", size=3, label=paste0(recencny_adjust$active," days"))+
geom_point(aes(x=recencny_adjust$warm, y=10), colour="black", size=3)+
annotate("text", x = recencny_adjust$warm, y = 10+5,
color = "black", size=3, label=paste0(recencny_adjust$warm," days"))+
geom_point(aes(x=recencny_adjust$cold, y=10), colour="black", size=3)+
annotate("text", x = recencny_adjust$cold, y = 10+5,
color = "black", size=3, label=paste0(recencny_adjust$cold," days"))+
geom_point(aes(x=recencny_adjust$cold, y=10), colour="black", size=3)+
annotate("text", x = recencny_adjust$cold, y = 10+5,
color = "black", size=3, label=paste0(recencny_adjust$cold," days"))+
annotate("text", x = (0+recencny_adjust$active)/2, y = max(recency_segmentation$freq)-5,
color = "black", size=4, label="Active")+
annotate("text", x = median(recencny_adjust$active:recencny_adjust$warm), y = max(recency_segmentation$freq)-5,
color = "black", size=4, label="Warm")+
annotate("text", x = median(recencny_adjust$warm:recencny_adjust$cold), y = max(recency_segmentation$freq)-5,
color = "black", size=4, label="Cold")+
annotate("text", x = median(recencny_adjust$cold:max(recency_segmentation$recency)), y = max(recency_segmentation$freq)-5,
color = "black", size=4, label="Inactive")+
labs(
title="Distribution of Recency Segementation",
x = "Recency (days since last transaction)",
y = "Total Customer",
subtitle = paste0("Data from ", min(df_customer_rfm$first_order)," untill ", max(df_customer_rfm$last_order))
)+
theme_minimal()
ggplotly(plot_recency_segmentation)plot_cust_recency <- df_customer_rfm %>% group_by(recency_segment) %>%
summarise(
total_customer = n(),
total_transaction = sum(frequency),
total_monetary = sum(monetary),
percent_monetary = round((total_monetary/sum(df_customer_rfm$monetary))*100,1)
) %>%
ungroup() %>%
mutate(
popup = glue("Recency Segment : {recency_segment}
Total Customer : {comma(total_customer)}
Total Trasanction: {total_transaction}
Total Monetary: BRL {comma(total_monetary)} ({percent_monetary}%)")
) %>%
ggplot(aes(recency_segment,total_monetary))+
geom_bar(stat = "identity", aes(fill=recency_segment, text=popup), show.legend = FALSE)+
geom_text(aes(label=paste0(comma(total_customer)," (",round((total_customer/sum(total_customer))*100,1),"%) Cust"),
y=total_monetary+450000),size=3,color="black")+
geom_text(aes(label=paste0(round((total_monetary/sum(total_monetary))*100,1),"% Revenue"),y=total_monetary+150000), size=3)+
labs(
title = "Total Revenue based Recency Segmentation",
x = "Recency Segment",
y = "Total Revenue",
subtitle = paste0("Data from ", min(df_customer_rfm$first_order)," untill ", max(df_customer_rfm$last_order))
) +
theme_minimal()
ggplotly(plot_cust_recency,tooltip="text") %>%
layout(showlegend=FALSE)Chart diatas menunjukan hanya 38.7% dari total customer kita yang masuk segmen Active dengan Revenue yang dihasilkan sebesar 70.7%, maka tentunya perusahaan harus mempertahankan customer pada segmen ini untuk dapat tetap aktif.
Namun, bagaimana jika terdapat High value customer yang tidak masuk ke segmen active customer? Segmen diatas dapat dibagi lagi berdasarkan nilai transaksi customer. Mengelompokan berdasarkan nilai customer sangat penting karena perusahaan dapat menawarkan promosi yang sesuai dengan kebiasaan pengeluaran customer. Mengetahui setiap nilai customer juga membantu bisnis menciptakan pemasaran bertarget untuk pelanggan bernilai tinggi. Misalnya, pada segmen warm, cold dan inactive terdapat customer yang memberikan revenue besar bagi perusahaan, maka customer ini dapat menjadi target utama pemasaran. Untuk itu kita perlu breakdown segmen ini berdasarkan Frekuensi Order dan total pengeluaran customer.
Frequency/Monetary Segmentation
Correlaction Checking
Jika dilihat frequency dan monetary memiliki korelasi yang cukup kuat, dalam artian semakin tinggi korelasi maka semakin besar nilai monetary.
Outlier Checking
Mari kita cek dahulu sebaran data dan outliernya.
par(mfrow=c(2,2),cex=0.8)
#hist(cust_rfm$log_recency,main="log of recency")
hist(df_customer_rfm$frequency,main="frequency",breaks=200)
hist(df_customer_rfm$monetary,main="monetary", breaks=10000)
boxplot(df_customer_rfm$frequency,main="frequency")
boxplot(df_customer_rfm$monetary, main="monetary")Dari visualisasi diatas, nilai frequency dan monetary diatas memiliki outlier yang cukup banyak. Mari kita cek lebih detail lewat visualisasi yang lebih jelas.
ggplotly(
df_customer_rfm %>%
ggplot(aes(frequency,monetary))+
geom_jitter()+
geom_smooth(method = lm)+
labs(
title = "Frequency vs Monetary of all data"
)+
scale_x_continuous(breaks = seq(from = 0, to = 200, by = 10))+
scale_y_continuous(breaks = seq(from = 0, to = 270000, by = 10000)))Dapat dilihat lebih jelas terdapat beberapa customer yang sangat jauh dari pusat data dan bisa kita sebut outlier. Jika dizoom, bisa dilihat lebih jelas data cukup berpusat dari pada frequency <= 38 dan monetary <= 15000. Mari kita cek:
frequency_adjust = 30
monetary_adjust = 12000
df_customer_rfm_adjust <- df_customer_rfm %>%
filter(frequency <= frequency_adjust, monetary <=monetary_adjust)
txt_freqmon_range <- paste0("Range Frequency <= ",frequency_adjust, " and Monetary <=", monetary_adjust)
txt_data_total <- paste0("(",round((df_customer_rfm_adjust %>% nrow())/(df_customer_rfm %>% nrow())*100,1),"% Customer)")
ggplotly(
df_customer_rfm_adjust %>%
ggplot(aes(frequency,monetary))+
geom_jitter()+
geom_smooth(method = lm)+
labs(
title = "Frequency vs Monetary"
)) %>%
layout(title = list(text = paste0('Frequency vs Monetary',
'<br>',
'<sup>',
paste0(txt_freqmon_range," ",txt_data_total),
'</sup>')))79 (98.2%) customer berada pada range frequency <= 30 dan monetary <= 15000. Metode clustering cukup sensitif dengan jarak, sehingga kita perlu mengexclude data yang terlalu jauh kemudian melakukan scalling. Untuk itu kita akan melakukan clustering terhadap customer pada range ini dan data lainnya kita analisa terpisah.
Scalling
norm_minmax <- function(x){
return ((x - min(x))/(max(x) - min(x)))
}
df_customer_rfm_adjust$mm_frequency <- norm_minmax(df_customer_rfm_adjust$frequency)
df_customer_rfm_adjust$mm_monetary <- norm_minmax(df_customer_rfm_adjust$monetary)
par(mfrow=c(2,2),cex=0.9)
#hist(cust_rfm$mm_recency,main="1 - minmax of log_recency")
hist(df_customer_rfm_adjust$mm_frequency,main="minmax of frequency")
hist(df_customer_rfm_adjust$mm_monetary,main="minmax of monetary")
#boxplot(cust_rfm$mm_recency, main="1 - minmax of log_recency")
boxplot(df_customer_rfm_adjust$mm_frequency,main="minmax of frequency")
boxplot(df_customer_rfm_adjust$mm_monetary, main="minmax of monetary")Optimal K For Clustering
Proses Clustering akan menggunakan nilai dari min-max normalization.
Elbow Method
fviz_nbclust(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], kmeans, method = "wss", k.max = 15) +
labs(subtitle = "Elbow method")Berdasarkan plot penentuan Optimal-K diatas, Elbow Method menunjukan penuruan mulai stabil pada K=3. Namun, jika dilihat pada K=5, penurunan cenderung statis hingga K=15. Untuk itu mari kita coba menggunakan nilai K=5.
Silhouette Method
fviz_nbclust(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], kmeans, "silhouette", k.max = 15) +
labs(subtitle = "Silhouette method")Silhouette Method menunjukan bahwa optimal K=2, namun jika dilihat nilai K=3 dan K=5 juga cukup optimum. Kita ingin mengclusteringkan customer lebih spesifik, tentunya setiap cluster harus memiliki identitas. Jika menggunakan 2 cluster maka cluster cenderung kurang memiliki identitas, maka mari kita coba melakukan clustering menggunakan K=3 dan K=5.
Clustering
Clustering using K=3
K-Means
Hasil K-Means dengan 3 clusters terbagi menjadi 2624:396:1024 dengan nilai between_SS / total_SS = 73.4 %
set.seed(2020, sample.kind = "Rounding")
kmeans_mm_k3 <- kmeans(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")],centers = 3, iter.max = 20)
kmeans_mm_k3$size#> [1] 3141 865 208
#> [1] 75.1
DBScan
Hasil DBScan dengan 3 clusters terbagi menjadi 1477:720:364 dengan noise sebanyak 1654. Jumlah noise terlalu banyak.
kNNdistplot(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], k = 3)
abline(h = 0.02, col = "red")db_clust_k3 <- dbscan(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], eps = 0.02, minPts = 150)
db_clust_k3#> DBSCAN clustering for 4214 objects.
#> Parameters: eps = 0.02, minPts = 150
#> The clustering contains 3 cluster(s) and 1654 noise points.
#>
#> 0 1 2 3
#> 1654 1476 720 364
#>
#> Available fields: cluster, eps, minPts
K-Means vs DBScan using K=3
# K-Means
cust_fm_mm_k3 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
fm_segment=kmeans_mm_k3$cluster) %>%
mutate(
fm_segment=as.factor(fm_segment))
perfom <- round((kmeans_mm_k3$betweenss/kmeans_mm_k3$totss)*100,1)
cluster_rfm_mm_profile_k3 <- cust_fm_mm_k3 %>% group_by(fm_segment) %>%
summarise(
min_recency = min(recency),
max_recency = max(recency),
avg_recency = round(mean(recency),2),
min_frequency = min(frequency),
max_frequency = max(frequency),
avg_frequency = round(mean(frequency),2),
med_frequency = round(median(frequency),2),
min_monetary = min(monetary),
max_monetary = max(monetary),
avg_monetary = round(mean(monetary),2),
med_monetary = round(median(monetary),2),
total_monetary = sum(monetary),
total_customer = n()
) %>% ungroup()
cluster_rfm_mm_profile_k3 %>%
gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
min_monetary,max_monetary,avg_monetary,med_monetary) %>%
ggplot(aes(x = fm_segment, y = features)) +
#scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
geom_tile(aes(fill = values), show.legend = FALSE) +
geom_text(aes(label=round(values,2)), size=3, color="white")+
coord_equal() +
scale_fill_gradientn(colours = hm.palette(27)) +
labs(
title = "K-Means Clustering for 3 Segment",
subtitle = paste("betweenss/totts:",perfom,"%"),
x = "Cluster of Frequency/Monetary"
) -> plot_kmeans_k3
# DBScan
dbscan_mm_k3 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
fm_segment=db_clust_k3$cluster) %>%
mutate(fm_segment=as.factor(fm_segment))
dbscan_mm_k3_profile <- dbscan_mm_k3 %>% group_by(fm_segment) %>%
summarise(
min_recency = min(recency),
max_recency = max(recency),
avg_recency = round(mean(recency),2),
min_frequency = min(frequency),
max_frequency = max(frequency),
avg_frequency = round(mean(frequency),2),
med_frequency = round(median(frequency),2),
min_monetary = min(monetary),
max_monetary = max(monetary),
avg_monetary = round(mean(monetary),2),
med_monetary = round(median(monetary),2),
total_monetary = sum(monetary),
total_customer = n()
) %>% ungroup()
dbscan_mm_k3_profile %>%
gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
min_monetary,max_monetary,avg_monetary,med_monetary) %>%
ggplot(aes(x = fm_segment, y = features)) +
#scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
geom_tile(aes(fill = values), show.legend = FALSE) +
geom_text(aes(label=round(values,2)), size=3, color="white")+
coord_equal() +
scale_fill_gradientn(colours = hm.palette(27)) +
labs(
title = "DBScan Clustering for 3 segment, with 569 noise",
x = "Cluster of Frequency/Monetary"
) -> plot_dbscan_k3
grid.arrange(plot_kmeans_k3, plot_dbscan_k3, nrow = 1)Clustering using K=5
K-Means
Hasil K-Means dengan 5 clusters terbagi menjadi 2304, 151, 1041, 176, 372 dengan nilai between_SS / total_SS = 84.1 %
set.seed(2020, sample.kind = "Rounding")
kmeans_mm_k5 <- kmeans(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")],centers = 5, iter.max = 20)
kmeans_mm_k5$size#> [1] 2306 1048 538 219 103
#> [1] 83.6
DBScan
Hasil DBScan dengan 5 cluster terbagi menjadi 321:1520:773:460:174 dengan noise sebanyak 967. Jumlah noise terlalu banyak.
kNNdistplot(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], k = 5)
abline(h = 0.025, col = "red")db_clust_k5 <- dbscan(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], eps = 0.025, minPts = 50)
db_clust_k5#> DBSCAN clustering for 4214 objects.
#> Parameters: eps = 0.025, minPts = 50
#> The clustering contains 5 cluster(s) and 967 noise points.
#>
#> 0 1 2 3 4 5
#> 967 322 1519 773 460 173
#>
#> Available fields: cluster, eps, minPts
K-Means vs DBScan using K=5
# K-Means
cust_fm_mm_k5 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
fm_segment=kmeans_mm_k5$cluster) %>%
mutate(
fm_segment=as.factor(fm_segment))
perfom <- round((kmeans_mm_k5$betweenss/kmeans_mm_k5$totss)*100,1)
cluster_rfm_mm_profile_k5 <- cust_fm_mm_k5 %>% group_by(fm_segment) %>%
summarise(
min_recency = min(recency),
max_recency = max(recency),
avg_recency = round(mean(recency),2),
min_frequency = min(frequency),
max_frequency = max(frequency),
avg_frequency = round(mean(frequency),2),
med_frequency = round(median(frequency),2),
min_monetary = min(monetary),
max_monetary = max(monetary),
avg_monetary = round(mean(monetary),2),
med_monetary = round(median(monetary),2),
total_monetary = sum(monetary),
total_customer = n()
) %>% ungroup()
cluster_rfm_mm_profile_k5 %>%
gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
min_monetary,max_monetary,avg_monetary,med_monetary) %>%
ggplot(aes(x = fm_segment, y = features)) +
#scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
geom_tile(aes(fill = values), show.legend = FALSE) +
geom_text(aes(label=round(values,2)), size=3, color="white")+
coord_equal() +
scale_fill_gradientn(colours = hm.palette(27)) +
labs(
title = "K-Means Clustering for 5 Segment",
subtitle = paste("betweenss/totts:",perfom,"%"),
x = "Cluster of Frequency/Monetary"
) -> plot_kmeans_k5
# DBScan
dbscan_mm_k5 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
fm_segment=db_clust_k5$cluster) %>%
mutate(fm_segment=as.factor(fm_segment))
dbscan_mm_k5_profile <- dbscan_mm_k5 %>% group_by(fm_segment) %>%
summarise(
min_recency = min(recency),
max_recency = max(recency),
avg_recency = round(mean(recency),2),
min_frequency = min(frequency),
max_frequency = max(frequency),
avg_frequency = round(mean(frequency),2),
med_frequency = round(median(frequency),2),
min_monetary = min(monetary),
max_monetary = max(monetary),
avg_monetary = round(mean(monetary),2),
med_monetary = round(median(monetary),2),
total_monetary = sum(monetary),
total_customer = n()
) %>% ungroup()
dbscan_mm_k5_profile %>%
gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
min_monetary,max_monetary,avg_monetary,med_monetary) %>%
ggplot(aes(x = fm_segment, y = features)) +
#scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
geom_tile(aes(fill = values), show.legend = FALSE) +
geom_text(aes(label=round(values,2)), size=3, color="white")+
coord_equal() +
scale_fill_gradientn(colours = hm.palette(27)) +
labs(
title = "DBScan Clustering for 5 segment, exclude 345 noise",
x = "Cluster of Frequency/Monetary"
) -> plot_dbscan_k5
grid.arrange(plot_kmeans_k5, plot_dbscan_k5, nrow = 1)FM Segmentation Profiling
cluster_fm_summary <- cluster_rfm_mm_profile_k5 %>% mutate(
popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,1)})%
Total Monetary: BRL {comma(total_monetary)} ({comma((total_monetary/sum(total_monetary))*100)}%)
Frequency Range: {min_frequency} - {max_frequency}
Monetary Range: BRL {min_monetary} - {max_monetary}
Avg. Frequency: {avg_frequency} Order
Avg. Monetary: BRL {avg_monetary}"),
fm_segment = as.factor(fm_segment)
) %>%
ggplot(aes(fm_segment,avg_frequency)) +
geom_bar(aes(fill=fm_segment, text=popup),stat="identity", show.legend = FALSE)+
geom_text(aes(label=paste0('Avg: ',round(avg_frequency,0)),y=avg_frequency+0.5),size=3, color="black")+
#geom_text(aes(label=paste0(round((total_customer/sum(total_customer))*100,1),"% cust"),
# y=total_monetary+220000),size=3, color="black")+
labs(
title="Profiling: Average Frequency per Cluster using K=5",
x="Cluster",
y="Average Frequency"
)+theme_minimal()
ggplotly(cluster_fm_summary,tooltip="text") %>%
layout(showlegend=FALSE) cluster_fm_summary <- cluster_rfm_mm_profile_k5 %>% mutate(
popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,1)})%
Total Monetary: BRL {comma(total_monetary)} ({comma((total_monetary/sum(total_monetary))*100)}%)
Frequency Range: {min_frequency} - {max_frequency}
Monetary Range: BRL {min_monetary} - {max_monetary}
Avg. Frequency: {avg_frequency} Order
Avg. Monetary: BRL {avg_monetary}"),
fm_segment = as.factor(fm_segment)
) %>%
ggplot(aes(fm_segment,avg_monetary)) +
geom_bar(aes(fill=fm_segment, text=popup),stat="identity", show.legend = FALSE)+
geom_text(aes(label=paste0("BRL ",comma(avg_monetary,1)),y=avg_monetary+150),size=3, color="black")+
#geom_text(aes(label=paste0(round((total_customer/sum(total_customer))*100,1),"% cust"),
# y=total_monetary+220000),size=3, color="black")+
labs(
title="Profiling: Average Monetary per Cluster using K=5",
x="Cluster",
y="Average Monetary"
)+theme_minimal()
ggplotly(cluster_fm_summary,tooltip="text") %>%
layout(showlegend=FALSE)Berdasarkan visualisasi diatas maka, setiap cluster bisa artikan sebagai berikut:
- Cluster 1: Medium Frequency, Medium Monetary. code: VLF-VLM. Kita sebut Very Low Value Customer.
- Cluster 2: Very Low Frequency, Very Low Monetary. LF-LM. Kita sebut Low Value Customer
- Cluster 3: Very High Frequency, Very High Monetary. code: MF-MM. Kita sebut Medium Value Customer
- Cluster 4: Low Frequency, Low Monetary. code: HF-HM. Kita sebut High Value Customer
- Cluster 5: High Frequency, High Monetary. code: VHF-VHM. Kita sebut Very High Value Customer
Berikut ini sebaran data hasil clusteringnya:
fm_segment_result <- cust_fm_mm_k5 %>%
mutate(
fm_segment_code = case_when(fm_segment == "1" ~ "VLF-VLM",
fm_segment == "2" ~ "LF-LM",
fm_segment == "3" ~ "MF-MM",
fm_segment == "4" ~ "HF-HM",
TRUE ~ "VHF-VHM"),
fm_segment_code = factor(fm_segment_code, levels=c("VLF-VLM","LF-LM","MF-MM","HF-HM","VHF-VHM")),
fm_segment_desc = case_when(fm_segment == "1" ~ "Very Low Frequency, Very Low Monetary",
fm_segment == "2" ~ "Low Frequency, Low Monetary",
fm_segment == "3" ~ "Medium Frequency, Medium Monetary",
fm_segment == "4" ~ "High Frequency, High Monetary",
TRUE ~ "Very High Frequency, Very High Monetary"),
fm_segment_desc = factor(fm_segment_desc,levels=c("Very Low Frequency, Very Low Monetary",
"Low Frequency, Low Monetary",
"Medium Frequency, Medium Monetary",
"High Frequency, High Monetary",
"Very High Frequency, Very High Monetary")),
fm_segment_name = case_when(fm_segment == "1" ~ "Very Low Value",
fm_segment == "2" ~ "Low Value",
fm_segment == "3" ~ "Medium Value",
fm_segment == "4" ~ "High Value",
TRUE ~ "Very High Value"),
fm_segment_name = factor(fm_segment_name, levels=c("Very Low Value","Low Value","Medium Value",
"High Value","Very High Value"))) %>%
select(customer_id,fm_segment_code,fm_segment_desc,fm_segment_name)
rfm_segment_result <- df_customer_rfm_adjust %>%
#filter(frequency<400, monetary<6000) %>%
left_join(fm_segment_result, by=c("customer_id","customer_id")) %>%
select(-c(mm_frequency,mm_monetary))
head(rfm_segment_result,10)#> # A tibble: 10 x 17
#> customer_id country first_order last_order min_amount_order med_amount_order
#> <dbl> <fct> <date> <date> <dbl> <dbl>
#> 1 12347 Iceland 2010-12-07 2011-10-31 5.04 17
#> 2 12348 Finland 2010-12-16 2011-09-25 13.2 41.8
#> 3 12349 Italy 2011-11-21 2011-11-21 6.64 17.7
#> 4 12350 Norway 2011-02-02 2011-02-02 8.5 18.8
#> 5 12352 Norway 2011-02-16 2011-11-03 9.9 17.4
#> 6 12353 Bahrain 2011-05-19 2011-05-19 11.6 18.8
#> 7 12354 Spain 2011-04-21 2011-04-21 8.5 17.0
#> 8 12355 Bahrain 2011-05-09 2011-05-09 17.7 25.5
#> 9 12356 Portug~ 2011-01-18 2011-11-17 3.75 34.8
#> 10 12357 Switze~ 2011-11-06 2011-11-06 10.1 35.4
#> # ... with 11 more variables: max_amount_order <dbl>, min_baskets <int>,
#> # max_baskets <int>, avg_baskets <dbl>, recency <int>, frequency <int>,
#> # monetary <dbl>, recency_segment <fct>, fm_segment_code <fct>,
#> # fm_segment_desc <fct>, fm_segment_name <fct>
plot_ly(rfm_segment_result, x = ~recency, y = ~frequency, z = ~monetary, color = ~fm_segment_name,
#colors=c("#207c06","#799922","#beb448","#ffd178","#f3a252","#e5703b","#d33333"),
hoverinfo = 'text', text = ~paste("Customer ID: ", customer_id,
"<br>Segment : ",fm_segment_name,
"<br>Segment Description: <b>",fm_segment_desc,"</b>",
"<br>Recency: <b>",recency,"</b>",
"<br>Frequency: <b>", frequency,"</b>",
"<br>Monetary: <b>BRL ",monetary,"</b>")) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Recency'),
yaxis = list(title = 'Frequency'),
zaxis = list(title = 'Monetary'))) %>%
layout(title="Customer Value Segmentation")RFM Segmentation Summary
plot_fm_total_monetary <- rfm_segment_result %>% group_by(fm_segment_name) %>%
summarise(total_monetary = sum(monetary),
total_customer = n(),
avg_frequency = median(frequency),
avg_monetary = median(monetary)) %>%
ungroup() %>%
mutate(popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,1)})%
Total Monetary: GBP {comma(total_monetary)} ({comma((total_monetary/sum(total_monetary))*100)}%)
Avg. Frequency: {avg_frequency} Order
Avg. Monetary: GBP {avg_monetary}")) %>%
ggplot(aes(fm_segment_name,total_monetary)) +
geom_bar(aes(fill=fm_segment_name, text=popup),stat="identity", show.legend = FALSE)+
geom_text(aes(label=paste0("GBP ",comma(total_monetary),'(',round(total_monetary/sum(total_monetary)*100,1),'%)'),
y=total_monetary+70000),size=3, color="black")+
geom_text(aes(label=paste0(round((total_customer/sum(total_customer))*100,1),"% cust"),
y=total_monetary+140000),size=3, color="black")+
labs(
title="Total Monetary per FM Segment",
x="Cluster",
y="Total Monetary"
)+theme_minimal()
ggplotly(plot_fm_total_monetary,tooltip="text") %>%
layout(showlegend=FALSE)plot_rfm_summary <- rfm_segment_result %>%
group_by(recency_segment, fm_segment_name) %>%
summarise(total_monetary=sum(monetary),
total_customer=n()) %>%
ungroup() %>%
mutate(
popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,2)}%)
Total Monetary : GBP {total_monetary} ({comma(round((total_monetary/sum(total_monetary))*100,2))}%)")
) %>%
ggplot(aes(x=fm_segment_name,total_monetary))+
geom_text(aes(label=paste0("M: ",round((total_monetary/sum(total_monetary))*100,2),"%"),
y=total_monetary+35000),size=3, color="black")+
geom_text(aes(label=paste0("C: ",round((total_customer/sum(total_customer))*100,2),"%"),
y=total_monetary+100000),size=3, color="black")+
# geom_bar(aes(fill=fm_segment),stat="identity", show.legend = FALSE)+
geom_rect(aes(xmin = 0, xmax = Inf,
ymin = 0, ymax = Inf),fill = "red", alpha = 0.02)+
geom_bar(aes(fill=fm_segment_name, text=popup),stat="identity", show.legend = FALSE)+
labs(
title="RFM Segment Summary",
x="Customer Value Segment",
y="Total Monetary"
)+
facet_wrap(~ recency_segment)+
theme(
legend.position = "none",
axis.text.y = element_blank(),
plot.title = element_text(hjust = 0.5),
panel.spacing = unit(1, "lines"))
ggplotly(plot_rfm_summary, tooltip = "text", height = 640)%>%
layout(showlegend=FALSE) %>%
#layout(margin = list(l = 50, r = 30, b = 50, t = 90)) %>%
layout(title = list(text = paste0('RFM Segmentation Summary',
'<br>',
'<sup>',
'</sup>')))Sebelumnya terdapat 79 data customer yang kita exclude dari proses clustering karena outlier. Customer ini jelas dapat kita kategorikan sebagai Special Customer karena memiliki frequency dan Monetary yang terlalu jauh dari customer lainnya.
cust_segment_result <- rbind(
rfm_segment_result,
df_customer_rfm %>% filter(!customer_id %in% c(df_customer_rfm_adjust$customer_id %>% unique())) %>%
mutate(
fm_segment_code = "SPC",
fm_segment_desc = "Special Customer",
fm_segment_name = "Special Customer",
fm_segment_code = factor(fm_segment_code, levels=c("VLF-VLM","LF-LM","MM-MM","HF-HM","VHF-VHM","SPC")),
fm_segment_desc = factor(fm_segment_desc,levels=c("Very Low Frequency, Very Low Monetary",
"Low Frequency, Low Monetary",
"Medium Frequency, Medium Monetary",
"High Frequency, High Monetary",
"Very High Frequency, Very High Monetary",
"Special Customer")),
fm_segment_name = factor(fm_segment_name, levels=c("Very Low Value","Low Value","Medium Value",
"High Value","Very High Value","Special Customer"))))plot_tile_rfm_segmentation <- cust_segment_result %>%
group_by(recency_segment,fm_segment_name,fm_segment_desc) %>%
summarise(freq = n(),
total_monetary=sum(monetary),
avg_monetary = round(mean(monetary)),
avg_frequency = median(frequency)) %>%
ungroup() %>%
mutate(
popup=glue("Receny Segment : {toupper(recency_segment)}
Value Segment : {toupper(fm_segment_name)}
Total Customer: {freq}
Total Monetary : {comma(total_monetary)}
Avg. Monetary : GBP {comma(avg_monetary)}
Most Frequency : {avg_frequency}")
) %>%
mutate(percent_freq = round((freq/sum(freq))*100,2)) %>%
ggplot(aes(recency_segment,fm_segment_name))+
geom_tile(aes(fill = avg_monetary, text=popup), colour = "white", show.legend = TRUE) +
geom_text(aes(label=paste0(percent_freq,"%"),text=popup), size=3, color="white", show.legend = FALSE)+
labs(
y= NULL,
x= "Days since last purchase",
fill = "Avg. Monetary"
)+
scale_fill_gradient(low="red",na.value = "#C0C0C0", high="green")+
theme(legend.title=element_text(size=9),
legend.position = "bottom")
ggplotly(plot_tile_rfm_segmentation, tooltip="text") %>%
#layout(showlegend=FALSE) %>%
#layout(margin = list(l = 50, r = 30, b = 50, t = 90)) %>%
layout(legend=list(orientation = "h", y=-0.1))%>%
layout(title = list(text = paste0('RFM Segmentation Distribution',
'<br>',
'<sup> Customer Distribution based on RFM Segmentation',
'</sup>')))Segment Analysis
Daily Order Behaviour
rfm_order_habit_day <- df_customer_transaction %>%
left_join(cust_segment_result, by=c("customer_id","customer_id")) %>%
select(invoice_date,invoice_no,recency_segment,fm_segment_name) %>%
distinct() %>%
mutate(wday = wday(invoice_date,week_start = getOption("lubridate.week.start", 1))) %>%
group_by(wday,fm_segment_name) %>%
summarise(total_transaction = n()) %>%
ungroup() %>%
mutate(popup = glue("Weekday: {wday}
Total Transaction: {total_transaction}"))
rfm_order_habit_day %>% filter(fm_segment_name=="Special Customer") %>%
ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
labs(
title = "Special Customer",
subtitle = "Daily Order Frequency",
x = "Day",
y = NULL)+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_day_1
rfm_order_habit_day %>% filter(fm_segment_name=="Very High Value") %>%
ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
labs(
title = "Very High Value Customer",
subtitle = "Daily Order Frequency",
x = "Day",
y = NULL)+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_day_2
rfm_order_habit_day %>% filter(fm_segment_name=="High Value") %>%
ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
labs(
title = "High Value Customer",
subtitle = "Daily Order Frequency",
x = "Day",
y = NULL)+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_day_3
rfm_order_habit_day %>% filter(fm_segment_name=="Medium Value") %>%
ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
labs(
title = "Medium Value Customer",
subtitle = "Daily Order Frequency",
x = "Day",
y = NULL)+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_day_4
rfm_order_habit_day %>% filter(fm_segment_name=="Low Value") %>%
ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
labs(
title = "Low Value Customer",
subtitle = "Daily Order Frequency",
x = "Day",
y = NULL)+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_day_5
rfm_order_habit_day %>% filter(fm_segment_name=="Very Low Value") %>%
ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
labs(
title = "Very Low Value Customer",
subtitle = "Daily Order Frequency",
x = "Day",
y = NULL)+
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_day_6
grid.arrange(plot_rfm_day_1, plot_rfm_day_2, plot_rfm_day_3,
plot_rfm_day_4, plot_rfm_day_5, plot_rfm_day_6,
ncol = 3)Hourly Order Behaviour
serror <- function(x) sqrt(var(x)/length(x))
time_range = data_frame(hour_of_day = c(0:23))
rfm_order_habit_time <- df_customer_transaction %>%
left_join(cust_segment_result, by=c("customer_id","customer_id")) %>%
select(invoice_date,invoice_no,recency_segment,fm_segment_name) %>%
distinct() %>%
mutate(hour_of_day = hour(invoice_date)) %>%
group_by(hour_of_day,fm_segment_name) %>%
summarise(total_transaction = n()) %>%
ungroup() %>%
mutate(popup = glue("Hour of Day : {hour_of_day}
Total Transaction: {total_transaction}"))
data_frame(hour_of_day = c(0:23)) %>%
left_join(
rfm_order_habit_time %>% filter(fm_segment_name=="Special Customer"),
by=c("hour_of_day","hour_of_day")) %>%
mutate(hour_of_day = as.factor(hour_of_day)) %>%
ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
geom_errorbar(aes(ymin = total_transaction - serror(total_transaction),
ymax = total_transaction + serror(total_transaction),
color = hour_of_day),
width = .2) +
labs(
title = "Special Customer",
subtitle = "Hourly Order Frequency",
x = "Hour of Day",
y = NULL)+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_time_1
data_frame(hour_of_day = c(0:23)) %>%
left_join(
rfm_order_habit_time %>% filter(fm_segment_name=="Very High Value"),
by=c("hour_of_day","hour_of_day")) %>%
mutate(hour_of_day = as.factor(hour_of_day)) %>%
ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
geom_errorbar(aes(ymin = total_transaction - serror(total_transaction),
ymax = total_transaction + serror(total_transaction),
color = hour_of_day),
width = .2) +
labs(
title = "Very High Value Customer",
subtitle = "Hourly Order Frequency",
x = "Hour of Day",
y = NULL)+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_time_2
data_frame(hour_of_day = c(0:23)) %>%
left_join(
rfm_order_habit_time %>% filter(fm_segment_name=="High Value"),
by=c("hour_of_day","hour_of_day")) %>%
mutate(hour_of_day = as.factor(hour_of_day)) %>%
ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
geom_errorbar(aes(ymin = total_transaction - serror(total_transaction),
ymax = total_transaction + serror(total_transaction),
color = hour_of_day),
width = .2) +
labs(
title = "High Value Customer",
subtitle = "Hourly Order Frequency",
x = "Hour of Day",
y = NULL)+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_time_3
data_frame(hour_of_day = c(0:23)) %>%
left_join(
rfm_order_habit_time %>% filter(fm_segment_name=="Medium Value"),
by=c("hour_of_day","hour_of_day")) %>%
mutate(hour_of_day = as.factor(hour_of_day)) %>%
ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
geom_errorbar(aes(ymin = total_transaction - serror(total_transaction),
ymax = total_transaction + serror(total_transaction),
color = hour_of_day),
width = .2) +
labs(
title = "Medium Value Customer",
subtitle = "Hourly Order Frequency",
x = "Hour of Day",
y = NULL)+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_time_4
data_frame(hour_of_day = c(0:23)) %>%
left_join(
rfm_order_habit_time %>% filter(fm_segment_name=="Low Value"),
by=c("hour_of_day","hour_of_day")) %>%
mutate(hour_of_day = as.factor(hour_of_day)) %>%
ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
geom_errorbar(aes(ymin = total_transaction - serror(total_transaction),
ymax = total_transaction + serror(total_transaction),
color = hour_of_day),
width = .2) +
labs(
title = "Low Value Customer",
subtitle = "Hourly Order Frequency",
x = "Hour of Day",
y = NULL)+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_time_5
data_frame(hour_of_day = c(0:23)) %>%
left_join(
rfm_order_habit_time %>% filter(fm_segment_name=="Very Low Value"),
by=c("hour_of_day","hour_of_day")) %>%
mutate(hour_of_day = as.factor(hour_of_day)) %>%
ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
geom_bar(width=1, stat="identity", show.legend = FALSE)+
geom_errorbar(aes(ymin = total_transaction - serror(total_transaction),
ymax = total_transaction + serror(total_transaction),
color = hour_of_day),
width = .2) +
labs(
title = "Very Low Value Customer",
subtitle = "Hourly Order Frequency",
x = "Hour of Day",
y = NULL)+
theme_minimal()+
theme(axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5,size=11, face="bold"),
plot.subtitle = element_text(hjust = 0.5,size=10),
axis.text.y = element_blank(),
axis.text.x=element_text(size=9, face="bold"))+
coord_polar() -> plot_rfm_time_6
grid.arrange(plot_rfm_time_1,plot_rfm_time_2,plot_rfm_time_3,
plot_rfm_time_4,plot_rfm_time_5,plot_rfm_time_6,
ncol = 3)Recommender System
Sistem rekomendasi dibuat menggunakan data histori transaksi setiap invoice_no.
df_product_recomm <- readRDS("data_clean/df_product_recomm.rds")
sapply(df_product_recomm[ ,c('invoice_no','stock_code')], function(x) length(unique(x)))#> invoice_no stock_code
#> 18975 3764
#> customer_id stock_code
#> 4295 3764
Rating Matrix
Data yang kita miliki merupakan data histori transaksi, sehingga kita bisa menggunakan binaryRatingsMatrix dengan ketentuan Jika produk dibeli maka di set 1 dan jika tidak maka di set 0. Kita akan mencoba menggunakan matriks atas data Customer-Produk dan Invoice-Produk.
Rating Matrix Customer-Product
Kelemahan menggunakan rating Customer-Product pada data ini yaitu cukup banyak data dibuang karena terdapat missing values pada customer_id.
# drop NA Values of customer_id
df_custprod <- df_product_recomm %>% select(customer_id,stock_code)
df_custprod <- drop_na(df_custprod)
# Remove duplicate product per customer
df_custprod <- df_custprod %>% mutate(identifier = paste0(customer_id,"-",stock_code))
df_custprod <- df_custprod[!duplicated(df_custprod$identifier), ] %>% select(-identifier)
rating_matrix_custprod <- df_custprod %>%
select(customer_id, stock_code) %>%
mutate(value = 1) %>%
spread(stock_code, value, fill = 0) %>%
select(-customer_id) %>%
as.matrix() %>%
as("binaryRatingMatrix")
rating_matrix_custprod#> 4294 x 3625 rating matrix of class 'binaryRatingMatrix' with 257567 ratings.
Rating Matrix Invoice-Product
# Remove duplicate product per product
df_invprod <- df_product_recomm %>% select(invoice_no,stock_code)
df_invprod <- df_product_recomm %>% mutate(identifier = paste0(invoice_no,"-",stock_code))
df_invprod <- df_invprod[!duplicated(df_invprod$identifier), ] %>% select(-identifier)
rating_matrix_invprod <- df_invprod %>%
select(invoice_no, stock_code) %>%
mutate(value = 1) %>%
spread(stock_code, value, fill = 0) %>%
select(-invoice_no) %>%
as.matrix() %>%
as("binaryRatingMatrix")
rating_matrix_invprod#> 18975 x 3764 rating matrix of class 'binaryRatingMatrix' with 492975 ratings.
Dapat dilihat jumlah matriks yang terbentuk pada rating matrix Customer-Product dan rating matrix Invoice-Product cukup besar. Meskipun begitu, mari tetap kita coba dan bandingkan hasil akhirnya.
Model Validation with K-Fold
Train dan test set menggunakan Fold dengan K=5 dan proporsi setiap fold terdiri dari 80% train sets dan 20% test sets.
Modelling
Algoritma yang tersedia untuk binaryRatingMatrix menggunakan library Recommender Lab
recommender_models <- recommenderRegistry$get_entries(dataType = "binaryRatingMatrix")
summary(recommender_models)#> Length Class Mode
#> ALS_implicit_binaryRatingMatrix 6 recommender_method list
#> AR_binaryRatingMatrix 6 recommender_method list
#> IBCF_binaryRatingMatrix 6 recommender_method list
#> POPULAR_binaryRatingMatrix 6 recommender_method list
#> RANDOM_binaryRatingMatrix 6 recommender_method list
#> RERECOMMEND_binaryRatingMatrix 6 recommender_method list
#> UBCF_binaryRatingMatrix 6 recommender_method list
Mari kita coba seluruh algoritma diatas dan bandingkan hasilnya:
algorithms_binary <- list(
"association_rules" = list(name = "AR", param = list(support = 0.05, confidence = 0.05)),
"popular" = list(name = "POPULAR", param = NULL),
"random" = list(name = "RANDOM", param = NULL),
"ibcf" = list(name = "IBCF", param = list(k = 5)), #JACCARD
#"ibcf_20" = list(name = "IBCF", param = list(k = 20)),
#"ibcf_50" = list(name = "IBCF", param = list(k = 50)),
#"ibcf_cosine" = list(name = "IBCF", param = list(method = "Cosine", k = 5)),
"ibcf_pearson_5"= list(name = "IBCF", param = list(method = "Pearson", k = 5)),
"ibcf_pearson_50"= list(name = "IBCF", param = list(method = "Pearson", k = 50)),
"ibcf_pearson_100"= list(name = "IBCF", param = list(method = "Pearson", k = 100)),
#"ubcf_25" = list(name = "UBCF", param = list(method = "Cosine", nn = 25)),
#"ubcf_50" = list(name = "UBCF", param = list(method = "Cosine", nn = 50)),
"ubcf_cosine_100" = list(name = "UBCF", param = list(method = "Cosine", nn = 100)),
#"ubcf_200" = list(name = "UBCF", param = list(method = "Cosine", nn = 200))
"ubcf_pearson_100" = list(name = "UBCF", param = list(method = "Pearson", nn = 100))
)
memory.limit(size=56000)
start <- Sys.time()
results_custprod <- recommenderlab::evaluate(scheme_custprod,
algorithms_binary,
type = "topNList",
n = c(1, 3, 5, 10, 15, 20))
results_invprod <- recommenderlab::evaluate(scheme_invprod,
algorithms_binary,
type = "topNList",
n = c(1, 3, 5, 10, 15, 20))
end <- Sys.time()
cat('runtime', end - start)
wd <- as.character(getwd())
saveRDS(object=results_custprod, file=paste(paste(wd,"/modelling/recommendation_development/",sep = ""),
"results_custprod.rds",sep=""))
saveRDS(object=results_invprod, file=paste(paste(wd,"/modelling/recommendation_development/",sep = ""),
"results_invprod.rds",sep=""))
results_custprod
results_invprodresults_custprod <- readRDS("modelling/recommendation_development/results_custprod.rds")
results_invprod <- readRDS("modelling/recommendation_development/results_invprod.rds")
results_custprod#> List of evaluation results for 9 recommenders:
#> Evaluation results for 5 folds/samples using method 'AR'.
#> Evaluation results for 5 folds/samples using method 'POPULAR'.
#> Evaluation results for 5 folds/samples using method 'RANDOM'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> List of evaluation results for 8 recommenders:
#> Evaluation results for 5 folds/samples using method 'POPULAR'.
#> Evaluation results for 5 folds/samples using method 'RANDOM'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
Model Evaluation
Eval: Rating Matrix Customer-Product
avg_conf_matr_personal <- function(results_custprod) {
tmp <- results_custprod %>%
getConfusionMatrix() %>%
as.list()
as.data.frame( Reduce("+",tmp) / length(tmp)) %>%
mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
select('n', 'precision', 'recall', 'TPR', 'FPR')
}
results_tbl_personal <- results_custprod %>%
map(avg_conf_matr_personal) %>%
# Turning into an unnested tibble
enframe() %>%
# Unnesting to have all variables on same level
unnest()
#results_tbl_personal
results_tbl_personal %>%
ggplot(aes(FPR, TPR, colour = fct_reorder2(as.factor(name), FPR, TPR))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "ROC curves using Matrix Customer-Product",
colour = "Model") +
theme_grey(base_size = 14)results_tbl_personal %>%
ggplot(aes(recall, precision,
colour = fct_reorder2(as.factor(name), precision, recall))) +
geom_line() +
geom_label(aes(label = n)) +
geom_text(aes(label = method_dis(name,n)),hjust=-1, size=3) +
labs(title = "Precision-Recall curves using Matrix Customer-Product",
colour = "Model") +
theme_grey(base_size = 14)Eval: Rating Matrix Invoice-Product
avg_conf_matr_baskets <- function(results_invprod) {
tmp <- results_invprod %>%
getConfusionMatrix() %>%
as.list()
as.data.frame( Reduce("+",tmp) / length(tmp)) %>%
mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
select('n', 'precision', 'recall', 'TPR', 'FPR')
}
results_tbl_baskets <- results_invprod %>%
map(avg_conf_matr_baskets) %>%
# Turning into an unnested tibble
enframe() %>%
# Unnesting to have all variables on same level
unnest()
#results_tbl_baskets
results_tbl_baskets %>%
ggplot(aes(FPR, TPR, colour = fct_reorder2(as.factor(name), FPR, TPR))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "ROC curves using Matrix Invoice-Product",
colour = "Model") +
theme_grey(base_size = 14)results_tbl_baskets %>%
ggplot(aes(recall, precision,
colour = fct_reorder2(as.factor(name), precision, recall))) +
geom_line() +
geom_label(aes(label = n)) +
geom_text(aes(label = method_dis(name,n)),hjust=-1, size=3) +
labs(title = "Precision-Recall curves using Matrix Invoice-Product",
colour = "Model") +
theme_grey(base_size = 14)Model Selection
Berdasarkan hasil ROC Curves dan Precision-Recall Curves diatas, Model rekomendasi menggunakan Matrix Invoice-Product lebih baik karena memiliki nilai Precision dan Recall yang lebih tinggi. Kemudian, algoritma IBCF dan UBCF menunjukan performa yang lebih baik dari algortima lainnya. Pada kurva ROC kita menilai dari algoritma yang memiliki nilai FPR dan TPR tertinggi. Jika pada kurva Precision-Recall kita bisa memlihat yang memiliki precision yang tinggi karena kita ingin produk yang direkomendasikan sesuai dengan yang diinginkan customer.
Berdasar kurva Precision-Recall, jika hanya merekomendasikan 1 produk maka yang terbaik adalah algoritma UBCF, namun tidak selisih jauh dengan algoritma IBCF. Namun, model UBCF menunjukan performa yang lebih rendah dari IBCF jika semakin banyak produk yang hendak direkomendasikan. Sehingga untuk case ini, kita akan memilih model IBCF untuk membangun sistem rekomendasi.
Prediction
model_ibcf_recomm <- readRDS("modelling/recommendation_development/model_ibcf_recomm.rds")
# Data invoice-product
df_invprod_predict <- df_product_recomm %>% select(invoice_no,stock_code)
df_invprod_predict <- df_invprod_predict %>% mutate(identifier = paste0(invoice_no,"-",stock_code))
df_invprod_predict <- df_invprod_predict[!duplicated(df_invprod_predict$identifier), ] %>% select(-identifier)Recommendation based History Order
Rekomendasi ini dihasilkan berdasarkan histori produk yang dibeli oleh customer. Misalkan kita simulasi untuk customer_id == 17850. Histori produk yang dibeli sebagai berikut:
df_master_product <- readRDS("data_clean/df_master_product.rds")
cust_id <- "17850"
cust_hist_product <- df_customer_transaction %>% filter(customer_id == cust_id) %>% .$stock_code %>% unique()
data.frame(stock_code = cust_hist_product) %>% left_join(df_master_product, by=c("stock_code","stock_code"))#> stock_code description
#> 1 85123A WHITE HANGING HEART T-LIGHT HOLDER
#> 2 71053 WHITE METAL LANTERN
#> 3 84406B CREAM CUPID HEARTS COAT HANGER
#> 4 84029G KNITTED UNION FLAG HOT WATER BOTTLE
#> 5 84029E RED WOOLLY HOTTIE WHITE HEART.
#> 6 22752 SET 7 BABUSHKA NESTING BOXES
#> 7 21730 GLASS STAR FROSTED T-LIGHT HOLDER
#> 8 22633 HAND WARMER UNION JACK
#> 9 22632 HAND WARMER RED POLKA DOT
#> 10 20679 EDWARDIAN PARASOL RED
#> 11 37370 RETRO COFFEE MUGS ASSORTED
#> 12 21871 SAVE THE PLANET MUG
#> 13 21071 VINTAGE BILLBOARD DRINK ME MUG
#> 14 21068 VINTAGE BILLBOARD LOVE/HATE MUG
#> 15 82483 WOOD 2 DRAWER CABINET WHITE FINISH
#> 16 82486 WOOD S/3 CABINET ANT WHITE FINISH
#> 17 82482 WOODEN PICTURE FRAME WHITE FINISH
#> 18 82494L WOODEN FRAME ANTIQUE WHITE
#> 19 15056BL EDWARDIAN PARASOL BLACK
#> 20 22803 IVORY EMBROIDERED QUILT
#> 21 22411 JUMBO SHOPPER VINTAGE RED PAISLEY
Kemudian siapkan matriks nya:
rating_matrix_histprod <- df_invprod_predict %>%
select(stock_code) %>%
unique() %>%
mutate(value = as.numeric(stock_code %in% cust_hist_product)) %>%
spread(stock_code, value) %>%
as.matrix() %>%
as("binaryRatingMatrix")
rating_matrix_histprod#> 1 x 3764 rating matrix of class 'binaryRatingMatrix' with 21 ratings.
Berikut Top produk yang bisa direkomendasikan kepada customer_id == 17850 :
# Predict
predict_result_histprod <- predict(model_ibcf_recomm, newdata = rating_matrix_histprod, n= 20, type="topNList")
result_cust_recomm = NULL
if((length(predict_result_histprod@items$`1`))>0){
pred_prod_cust <- data.frame(
customer_id = cust_id,
recommendation_based="History Order",
stock_code = getList(predict_result_histprod)$`1`,
recom_rate = predict_result_histprod@ratings$`1`
)
pred_prod_cust <- pred_prod_cust %>%
left_join(df_master_product, by=c("stock_code","stock_code")) %>%
select(customer_id,recommendation_based,stock_code,description,recom_rate)
result_cust_recomm <- rbind(result_cust_recomm,pred_prod_cust)
}
result_cust_recomm #> customer_id recommendation_based stock_code
#> 1 17850 History Order 21931
#> 2 17850 History Order 22866
#> 3 17850 History Order 85099B
#> 4 17850 History Order 15056N
#> 5 17850 History Order 21929
#> 6 17850 History Order 21928
#> 7 17850 History Order 22865
#> 8 17850 History Order 22386
#> 9 17850 History Order 85099C
#> 10 17850 History Order 20712
#> 11 17850 History Order 23439
#> 12 17850 History Order 22867
#> 13 17850 History Order 21877
#> 14 17850 History Order 20711
#> 15 17850 History Order 21733
#> 16 17850 History Order 15056P
#> 17 17850 History Order 21069
#> 18 17850 History Order 21479
#> 19 17850 History Order 21485
#> 20 17850 History Order 22114
#> description recom_rate
#> 1 JUMBO STORAGE BAG SUKI 0.2878999
#> 2 HAND WARMER SCOTTY DOG DESIGN 0.2747569
#> 3 JUMBO BAG RED RETROSPOT 0.2684497
#> 4 EDWARDIAN PARASOL NATURAL 0.2583729
#> 5 JUMBO BAG PINK VINTAGE PAISLEY 0.2560500
#> 6 JUMBO BAG SCANDINAVIAN PAISLEY 0.2504039
#> 7 HAND WARMER OWL DESIGN 0.2481908
#> 8 JUMBO BAG PINK POLKADOT 0.2481848
#> 9 JUMBO BAG BAROQUE BLACK WHITE 0.2464097
#> 10 JUMBO BAG WOODLAND ANIMALS 0.2456964
#> 11 HAND WARMER RED LOVE HEART 0.2340989
#> 12 HAND WARMER BIRD DESIGN 0.2112100
#> 13 HOME SWEET HOME MUG 0.2097130
#> 14 JUMBO BAG TOYS 0.2000000
#> 15 RED HANGING HEART T-LIGHT HOLDER 0.1939300
#> 16 EDWARDIAN PARASOL PINK 0.1819672
#> 17 VINTAGE BILLBOARD TEA MUG 0.1818182
#> 18 WHITE SKULL HOT WATER BOTTLE 0.1808874
#> 19 RETROSPOT HEART HOT WATER BOTTLE 0.1652174
#> 20 HOT WATER BOTTLE TEA AND SYMPATHY 0.1526391
Recommendation based on Spesific Product
Rekomendasi ini bisa terapkan berdasarkan data keranjang belanja, produk yang sedang dilihat oleh customer atau produk yang dibeli customer dalam periode waktu tertentu. Misalkan kita simulasikan jika didalam keranjang belanja customer terdapat produk dengan stock_code 84029E, 21730
cart_stock_code <- c("84029E", "21730")
df_master_product %>% filter(stock_code %in% cart_stock_code) %>%
select(stock_code,description)#> # A tibble: 2 x 2
#> stock_code description
#> <chr> <chr>
#> 1 21730 GLASS STAR FROSTED T-LIGHT HOLDER
#> 2 84029E RED WOOLLY HOTTIE WHITE HEART.
Kemudian siapkan matrix-nya:
rating_matrix_cart <- df_invprod_predict %>%
select(stock_code) %>%
unique() %>%
mutate(value = as.numeric(stock_code %in% cart_stock_code)) %>%
spread(stock_code, value) %>%
as.matrix() %>%
as("binaryRatingMatrix")
rating_matrix_cart#> 1 x 3764 rating matrix of class 'binaryRatingMatrix' with 2 ratings.
Berikut Top produk yang bisa direkomendasikan kepada customer:
# Predict
predict_result_cart <- predict(model_ibcf_recomm, newdata = rating_matrix_cart, n= 20, type="topNList")
result_cust_recomm = NULL
if((length(predict_result_cart@items$`1`))>0){
pred_prod_cust <- data.frame(
customer_id = cust_id,
recommendation_based="Customer Cart",
stock_code = getList(predict_result_cart)$`1`,
recom_rate = predict_result_cart@ratings$`1`
)
pred_prod_cust <- pred_prod_cust %>%
left_join(df_master_product, by=c("stock_code","stock_code")) %>%
select(customer_id,recommendation_based,stock_code,description,recom_rate)
result_cust_recomm <- rbind(result_cust_recomm,pred_prod_cust)
}
result_cust_recomm #> customer_id recommendation_based stock_code
#> 1 17850 Customer Cart 84029G
#> 2 17850 Customer Cart 21479
#> 3 17850 Customer Cart 21485
#> 4 17850 Customer Cart 22114
#> 5 17850 Customer Cart 21484
#> 6 17850 Customer Cart 21481
#> 7 17850 Customer Cart 71053
#> 8 17850 Customer Cart 22803
#> description recom_rate
#> 1 KNITTED UNION FLAG HOT WATER BOTTLE 0.20248668
#> 2 WHITE SKULL HOT WATER BOTTLE 0.18088737
#> 3 RETROSPOT HEART HOT WATER BOTTLE 0.16521739
#> 4 HOT WATER BOTTLE TEA AND SYMPATHY 0.15263909
#> 5 CHICK GREY HOT WATER BOTTLE 0.14466546
#> 6 FAWN BLUE HOT WATER BOTTLE 0.14285714
#> 7 WHITE METAL LANTERN 0.07079646
#> 8 IVORY EMBROIDERED QUILT 0.06837607