library(tidyverse)
library(lubridate)
library(scales)
library(readxl)
library(zoo)


#visualization
library(plotly)
library(glue)
library(RColorBrewer)
library(gridExtra)
library(sparkline)

#data model
library(rfm)
library(GGally)

#Segmentation
library(dbscan) # DBScan
library(factoextra) # K-Means

#recomendation
library(recommenderlab)

Project Background

sumber: Statistik E-Commerce 2019, Badan Pusat Statistik Indonesia

Berdasarkan artikel yang dipublish oleh www.xendit.co , hasil riset Accenture mengatakan bahwa 40% konsumen meninggalkan suatu website bisnis dan melakukan pembelian di website atau toko online lain karena merasa kebingungan akibat terlalu banyaknya pilihan saat akan membuat keputusan pembelian.

Data Description

data yang digunakan adalah data transaksi yang terjadi antara 01/12/2010 sampai 09/12/2011 pada salah satu perusahaan online retail yang berbasis di Inggris. Banyak pelanggan dari perusahaan adalah grosir.

Sumber Data Online Retail Data Set

Data Preparation

Read Data

Total Data awal 541909 data dab 8 variabel.

df_input <- read_excel("data_input/online_retail.xlsx", trim_ws = TRUE)
data.frame("total.data" = dim(df_input)[1],
           "total.variabel" = dim(df_input)[2])

#>   total.data total.variabel
#> 1     541909              8

10 Data teratas

head(df_input,10)

#> # A tibble: 10 x 8
#>    InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
#>    <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
#>  1 536365    85123A    WHITE HANG~        6 2010-12-01 08:26:00      2.55
#>  2 536365    71053     WHITE META~        6 2010-12-01 08:26:00      3.39
#>  3 536365    84406B    CREAM CUPI~        8 2010-12-01 08:26:00      2.75
#>  4 536365    84029G    KNITTED UN~        6 2010-12-01 08:26:00      3.39
#>  5 536365    84029E    RED WOOLLY~        6 2010-12-01 08:26:00      3.39
#>  6 536365    22752     SET 7 BABU~        2 2010-12-01 08:26:00      7.65
#>  7 536365    21730     GLASS STAR~        6 2010-12-01 08:26:00      4.25
#>  8 536366    22633     HAND WARME~        6 2010-12-01 08:28:00      1.85
#>  9 536366    22632     HAND WARME~        6 2010-12-01 08:28:00      1.85
#> 10 536367    84879     ASSORTED C~       32 2010-12-01 08:34:00      1.69
#> # ... with 2 more variables: CustomerID <dbl>, Country <chr>

Saya merasa lebih mudah jika seluruh nama kolom menggunakan lowercase dan setiap kata dipisahkan karakter underscore.

colnames(df_input) <- tolower(colnames(df_input))
df_input <-   df_input %>% 
    rename("invoice_no" = invoiceno,
           "stock_code" = stockcode,
           "invoice_date" = invoicedate,
           "unit_price" = unitprice,
           "customer_id" = customerid)
colnames(df_input)

#> [1] "invoice_no"   "stock_code"   "description"  "quantity"     "invoice_date"
#> [6] "unit_price"   "customer_id"  "country"

Variable Description

Berdasar informasi dari Online Retail Data Set berikut deskripsi datanya:

Variable	Description
invoice_no	Nomor invoice terdiri dari 6 digit angkat yang bersifat unik untuk setiap transaksi. Jika diawali huruf C maka mengindikasikan transaksi berstatus Cancel
stock_code	Kode Produk yang bersifat unik setiap produk, terdiri dari 5 digit angka yang bersifat unik untuk setiap produk.
description	Nama Produk
quantity	Jumlah produk yang dibeli
invoice_date	Tanggal dan waktu transaksi
unit_price	harga produk per unit
customer_id	ID Customer yang terdiri dari 5 digit angka yang bersifat unik untuk setiap customer.
country	Negara customer

Data Preprocessing

Data Structure

Berikut Struktur Datanya:

glimpse(df_input)

#> Observations: 541,909
#> Variables: 8
#> $ invoice_no   <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ stock_code   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description  <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity     <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ unit_price   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ customer_id  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country      <chr> "United Kingdom", "United Kingdom", "United Kingdom", ...

data.frame(
  invoice_unik = df_input$invoice_no %>% unique() %>% length(),
  stock_code_unik = df_input$stock_code %>% unique() %>% length(),
  description_unik = df_input$description %>% unique() %>% length(),
  country_unik = df_input$country %>% unique() %>% length(),
  customer_unik = df_input$customer_id %>% unique() %>% length()
)

#>   invoice_unik stock_code_unik description_unik country_unik customer_unik
#> 1        25900            4070             4212           38          4373

Dari data diatas country hanya berjumlah 38, sehingga mari kita ubah tipe datanya menjadi factor

df_input$country <- as.factor(df_input$country)

Data Cleansing

Data Range Adjustment

Data ini berisikan transaksi antara 01/12/2010 sampai 09/12/2011. Data transaksi Desember 2011 tidak full 1 bulan, sehinnga saya memilih untuk melakukan analisa dari 01/12/2010 sampai 30/11/2011.

df_input <- df_input %>% filter(as.Date(invoice_date) > ymd("2010-11-30"), as.Date(invoice_date) < ymd("2011-12-01"))

Cancelled Transaction

Sesuai deskripsi variabel, jika data invoice_no diawali huruf C maka mengindikasikan transaksi berstatus Cancel. Data transaksi yang dibatalkan tidak sesuai untuk kebutuhan analisis ini, maka perlu kita remove.

df_input %>% filter(grepl("C", df_input$invoice_no)) %>% summarise(total_cancelled_transaction = n())

#> # A tibble: 1 x 1
#>   total_cancelled_transaction
#>                         <int>
#> 1                        8928

Terdapat 9288 transaksi yang dicancel. Data ini tidak kita gunakan maka data dapat kita exclude.

df_input <- df_input %>% filter(!grepl("C", df_input$invoice_no))

Invalid Invoice

Sesuai deskripsi, Invoice No yang valid memiliki 6 digit angka.

df_input %>% filter(nchar(invoice_no)>6)

#> # A tibble: 3 x 8
#>   invoice_no stock_code description quantity invoice_date        unit_price
#>   <chr>      <chr>      <chr>          <dbl> <dttm>                   <dbl>
#> 1 A563185    B          Adjust bad~        1 2011-08-12 14:50:00     11062.
#> 2 A563186    B          Adjust bad~        1 2011-08-12 14:51:00    -11062.
#> 3 A563187    B          Adjust bad~        1 2011-08-12 14:52:00    -11062.
#> # ... with 2 more variables: customer_id <dbl>, country <fct>

Terdapat 3 data transaksi yang tidak valid, maka data ini dapat di exclude.

df_input <- df_input %>% filter(!(nchar(invoice_no)>6))

Invalid Quantity

Apakah ada data yang memiliki quantity<=0 ? Berapa banyak?

df_input %>% filter(quantity<=0) %>% nrow()

#> [1] 1306

Terdapat 1306 data transaksi yang memiliki quantity<=0, sehingga data ini perlu dibuang karena tidak valid jika suatu transaksi tidak memiliki quantity.

df_input <- df_input %>% filter(quantity>0)

Invalid Product

Sesuai deskripsi variabel, stock code selalu diawali dengan 5 digit angka. Untuk hal ini, kita akan substring 5 digit pertama dari stock code kemudian kita convert ke tipe data numerik. Jika stock code tersebut menjadi NA Value, maka stock tersebut tidak valid.

# Create Stock Validation
# substring first 5 digit of stock code then mutate to numeric
# if there are any NA Values of stock code validation, then that product isn't valid
stock_valid <- df_input %>% mutate(
  stock_code_validation = substr(stock_code,start = 1, stop = 5)
) %>% mutate(
  stock_code_validation = as.numeric(stock_code_validation)
) %>% select(stock_code,stock_code_validation,description) %>% distinct()


df_input %>% filter(description %in% (stock_valid %>% filter(is.na(stock_code_validation)) %>% .$description))

#> # A tibble: 2,888 x 8
#>    invoice_no stock_code description quantity invoice_date        unit_price
#>    <chr>      <chr>      <chr>          <dbl> <dttm>                   <dbl>
#>  1 536370     POST       POSTAGE            3 2010-12-01 08:45:00        18 
#>  2 536403     POST       POSTAGE            1 2010-12-01 11:27:00        15 
#>  3 536414     22139      <NA>              56 2010-12-01 11:52:00         0 
#>  4 536527     POST       POSTAGE            1 2010-12-01 13:04:00        18 
#>  5 536540     C2         CARRIAGE           1 2010-12-01 14:05:00        50 
#>  6 536544     DOT        DOTCOM POS~        1 2010-12-01 14:32:00       570.
#>  7 536545     21134      <NA>               1 2010-12-01 14:32:00         0 
#>  8 536546     22145      <NA>               1 2010-12-01 14:33:00         0 
#>  9 536547     37509      <NA>               1 2010-12-01 14:33:00         0 
#> 10 536549     85226A     <NA>               1 2010-12-01 14:34:00         0 
#> # ... with 2,878 more rows, and 2 more variables: customer_id <dbl>,
#> #   country <fct>

Terdapat 2888 data transaksi yang memiliki stock code tidak valid. Jika dilihat dari deskripsi data tersebut merupakan ongkos kirim, bank charges, NA Values dan lainnya. Seluruh data transaksi diatas dapat kita exclude karena tidak sesuai untuk kebutuhan analisis ini.

invalid_stock <- stock_valid %>% filter(is.na(stock_code_validation)) %>% .$description
df_input <- df_input %>% filter(!description %in% invalid_stock)

Mengacu pada artikel yang dipublish oleh Diego Usai pada Mar 14, 2019 terkait Market Basket Analysis menggunakan data online retail ini, ia berhasil menemukan 50 deskripsi yang terkesan diinputkan secara manual yaitu:

# Additional adjustment codes to remove
descr <- c( "check", "check?", "?", "??", "damaged", "found", 
            "adjustment", "Amazon", "AMAZON", "amazon adjust", 
            "Amazon Adjustment", "amazon sales", "Found", "FOUND",
            "found box", "Found by jackie ","Found in w/hse","dotcom", 
            "dotcom adjust", "allocate stock for dotcom orders ta", "FBA", 
            "Dotcomgiftshop Gift Voucher £100.00", "on cargo order",
            "wrongly sold (22719) barcode", "wrongly marked 23343",
            "dotcomstock", "rcvd be air temp fix for dotcom sit", 
            "Manual", "John Lewis", "had been put aside", 
            "for online retail orders", "taig adjust", "amazon", 
            "incorrectly credited C550456 see 47", "returned", 
            "wrongly coded 20713", "came coded as 20713", 
            "add stock to allocate online orders", "Adjust bad debt", 
            "alan hodge cant mamage this section", "website fixed",
            "did  a credit  and did not tick ret", "michel oops",
            "incorrectly credited C550456 see 47", "mailout", "test",
            "Sale error",  "Lighthouse Trading zero invc incorr", "SAMPLES",
            "Marked as 23343", "wrongly coded 23343","Adjustment", 
            "rcvd be air temp fix for dotcom sit", "Had been put aside." )
descr

#>  [1] "check"                               "check?"                             
#>  [3] "?"                                   "??"                                 
#>  [5] "damaged"                             "found"                              
#>  [7] "adjustment"                          "Amazon"                             
#>  [9] "AMAZON"                              "amazon adjust"                      
#> [11] "Amazon Adjustment"                   "amazon sales"                       
#> [13] "Found"                               "FOUND"                              
#> [15] "found box"                           "Found by jackie "                   
#> [17] "Found in w/hse"                      "dotcom"                             
#> [19] "dotcom adjust"                       "allocate stock for dotcom orders ta"
#> [21] "FBA"                                 "Dotcomgiftshop Gift Voucher £100.00"
#> [23] "on cargo order"                      "wrongly sold (22719) barcode"       
#> [25] "wrongly marked 23343"                "dotcomstock"                        
#> [27] "rcvd be air temp fix for dotcom sit" "Manual"                             
#> [29] "John Lewis"                          "had been put aside"                 
#> [31] "for online retail orders"            "taig adjust"                        
#> [33] "amazon"                              "incorrectly credited C550456 see 47"
#> [35] "returned"                            "wrongly coded 20713"                
#> [37] "came coded as 20713"                 "add stock to allocate online orders"
#> [39] "Adjust bad debt"                     "alan hodge cant mamage this section"
#> [41] "website fixed"                       "did  a credit  and did not tick ret"
#> [43] "michel oops"                         "incorrectly credited C550456 see 47"
#> [45] "mailout"                             "test"                               
#> [47] "Sale error"                          "Lighthouse Trading zero invc incorr"
#> [49] "SAMPLES"                             "Marked as 23343"                    
#> [51] "wrongly coded 23343"                 "Adjustment"                         
#> [53] "rcvd be air temp fix for dotcom sit" "Had been put aside."

Saya setuju dengan hasil temuannya, oleh karena itu data transaksi yang memiliki deskripsi diatas perlu di remove.

df_input <- df_input %>% 
  filter(!description %in% descr)

Duplicated StockCode-Description

Proses ini penting karena data stock code dan description harus bersifat unik supaya matriks produk untuk sistem rekomendasi bisa valid.

df_product <- df_input %>% select(stock_code,description) %>% distinct()
df_product <- df_product %>% 
  mutate(stock_code_lowercase = tolower(stock_code),
         description_lowercase = tolower(description))

data_frame(
  stock_code_unik = df_product$stock_code %>% unique() %>% length(),
  stock_code_unik_lowercase = df_product$stock_code_lowercase %>% unique() %>% length(),
  description_unik = df_product$description %>% unique() %>% length(),
  description_unik_lower = df_product$description_lowercase %>% unique() %>% length()
)

#> # A tibble: 1 x 4
#>   stock_code_unik stock_code_unik_lowerca~ description_unik description_unik_lo~
#>             <int>                    <int>            <int>                <int>
#> 1            3898                     3790             3993                 3993

Jumlah stock code unik dan produk unik tidak sama, mari kita cek dari sisi stock code dahulu.

stock_code_check_dupli <- df_product %>% select(stock_code,stock_code_lowercase) %>% 
  distinct() %>% 
  group_by(stock_code_lowercase) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  filter(freq>1)

Data diatas merupakan data stock code yang duplikat apabila kita ubah menjadi lowecase. Mari kita cek apakah benar duplikat.

df_product %>% filter(stock_code_lowercase %in% stock_code_check_dupli$stock_code_lowercase) %>% 
  arrange(stock_code_lowercase)

#> # A tibble: 227 x 4
#>    stock_code description            stock_code_lowerca~ description_lowercase  
#>    <chr>      <chr>                  <chr>               <chr>                  
#>  1 15056BL    EDWARDIAN PARASOL BLA~ 15056bl             edwardian parasol black
#>  2 15056bl    EDWARDIAN PARASOL BLA~ 15056bl             edwardian parasol black
#>  3 15056N     EDWARDIAN PARASOL NAT~ 15056n              edwardian parasol natu~
#>  4 15056n     EDWARDIAN PARASOL NAT~ 15056n              edwardian parasol natu~
#>  5 15056P     EDWARDIAN PARASOL PINK 15056p              edwardian parasol pink 
#>  6 15056p     EDWARDIAN PARASOL PINK 15056p              edwardian parasol pink 
#>  7 15060B     FAIRY CAKE DESIGN UMB~ 15060b              fairy cake design umbr~
#>  8 15060b     FAIRY CAKE DESIGN UMB~ 15060b              fairy cake design umbr~
#>  9 18098C     PORCELAIN BUTTERFLY O~ 18098c              porcelain butterfly oi~
#> 10 18098c     PORCELAIN BUTTERFLY O~ 18098c              porcelain butterfly oi~
#> # ... with 217 more rows

Ternyata terdapat data stock code yang duplicated karena efek case sensitive. Oleh karena itu, seluruh stock_code akan kita convert menjadi UPPERCASE untuk mengilahkan efek case sensitive.

df_input <- df_input %>% mutate(
  stock_code = toupper(stock_code)
)

data_frame(
  jumlah_stock_code_unik = df_input$stock_code %>% unique() %>% length(),
  stock_code_unik_lowercase = df_input$stock_code %>% tolower() %>% unique() %>% length(),
  description_unik = df_input$description %>% unique() %>% length(),
  description_unik_lower = df_input$description %>% tolower() %>% unique() %>% length()
)

#> # A tibble: 1 x 4
#>   jumlah_stock_code_u~ stock_code_unik_low~ description_unik description_unik_l~
#>                  <int>                <int>            <int>               <int>
#> 1                 3790                 3790             3993                3993

Oke, data stock_code sudah clean. Namun, seharusnya jumlah stock_code dan description berjumlah sama karena bersifat unik. Hal ini mengindikasikan duplikat data. Mari kita cek.

description_check <- df_input %>% select(stock_code,description) %>% 
  distinct() %>% 
  group_by(stock_code) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  filter(freq>1)

description_check

#> # A tibble: 212 x 2
#>    stock_code  freq
#>    <chr>      <int>
#>  1 16156L         2
#>  2 17107D         3
#>  3 20622          2
#>  4 20725          2
#>  5 20914          2
#>  6 21109          2
#>  7 21112          2
#>  8 21175          2
#>  9 21192          2
#> 10 21232          2
#> # ... with 202 more rows

Data diatas adalah data stock code yang memiliki description > 1. Mari kita sampling datanya.

df_input %>% filter(stock_code %in% description_check$stock_code) %>% 
  select(stock_code,description) %>% 
  distinct() %>% 
  arrange(stock_code,description)

#> # A tibble: 443 x 2
#>    stock_code description                        
#>    <chr>      <chr>                              
#>  1 16156L     WRAP CAROUSEL                      
#>  2 16156L     WRAP, CAROUSEL                     
#>  3 17107D     FLOWER FAIRY 5 DRAWER LINERS       
#>  4 17107D     FLOWER FAIRY 5 SUMMER DRAW LINERS  
#>  5 17107D     FLOWER FAIRY,5 SUMMER B'DRAW LINERS
#>  6 20622      VIP PASSPORT COVER                 
#>  7 20622      VIPPASSPORT COVER                  
#>  8 20725      LUNCH BAG RED RETROSPOT            
#>  9 20725      LUNCH BAG RED SPOTTY               
#> 10 20914      SET/5 RED RETROSPOT LID GLASS BOWLS
#> # ... with 433 more rows

Dari hasil pengecekan diatas dapat kita simpulkan bahwa terdapat kesalahan pada deskripsi yang berupa tanda baca, spasi hingga kesalahan penulisan deskripsi produk. Untuk itu kita akan generate deskripsi produk menggunakan data pertamanya.

df_description <- df_input %>% select(stock_code,description) %>% 
  filter(stock_code %in% description_check$stock_code) %>% 
  distinct() %>% 
  group_by(stock_code) %>% 
  slice(1) %>% 
  ungroup()

df_input <- df_input %>% left_join(df_description, by=c("stock_code","stock_code")) %>% 
  mutate(description = ifelse(is.na(description.y),description.x,description.y)) %>% 
  select(-c(description.y,description.x))


data_frame(
  jumlah_stock_code_unik = df_input$stock_code %>% unique() %>% length(),
  jumlah_description_unik = df_input$description %>% unique() %>% length()
)

#> # A tibble: 1 x 2
#>   jumlah_stock_code_unik jumlah_description_unik
#>                    <int>                   <int>
#> 1                   3790                    3764

Oke, selisih nya sudah mulai berkurang. Jumlah diatas mengindikasikan terdapat stock_code yang memiliki description sama.

df_description <- df_input %>% select(stock_code,description) %>% 
  distinct() %>% 
  group_by(description) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  filter(freq>1)

df_description

#> # A tibble: 24 x 2
#>    description                      freq
#>    <chr>                           <int>
#>  1 BATHROOM METAL SIGN                 2
#>  2 COLOURING PENCILS BROWN TUBE        2
#>  3 COLUMBIAN CANDLE RECTANGLE          2
#>  4 COLUMBIAN CANDLE ROUND              3
#>  5 EAU DE NILE JEWELLED PHOTOFRAME     2
#>  6 FRENCH FLORAL CUSHION COVER         2
#>  7 FRENCH LATTICE CUSHION COVER        2
#>  8 FRENCH PAISLEY CUSHION COVER        2
#>  9 FROSTED WHITE BASE                  2
#> 10 HEART T-LIGHT HOLDER                2
#> # ... with 14 more rows

Terdapat 24 stock code yang memiliki deskripsi sama.

df_input %>% select(stock_code,description) %>% 
  distinct() %>% 
  filter(description %in% c(df_description$description)) %>% 
  arrange(description)

#> # A tibble: 50 x 2
#>    stock_code description                    
#>    <chr>      <chr>                          
#>  1 82580      BATHROOM METAL SIGN            
#>  2 21171      BATHROOM METAL SIGN            
#>  3 10133      COLOURING PENCILS BROWN TUBE   
#>  4 10135      COLOURING PENCILS BROWN TUBE   
#>  5 72133      COLUMBIAN CANDLE RECTANGLE     
#>  6 72131      COLUMBIAN CANDLE RECTANGLE     
#>  7 72127      COLUMBIAN CANDLE ROUND         
#>  8 72130      COLUMBIAN CANDLE ROUND         
#>  9 72128      COLUMBIAN CANDLE ROUND         
#> 10 85023B     EAU DE NILE JEWELLED PHOTOFRAME
#> # ... with 40 more rows

Untuk case ini kita bisa berasumsi bahwa produk tersebut sama. Sehinnga, kita akan generate setiap description menggunakan stock code pertama.

df_product_unik <- df_input %>% select(stock_code,description) %>% 
  filter(description %in% c(df_description$description)) %>% 
  distinct() %>% 
  group_by(description) %>% 
  slice(1) %>% 
  ungroup()

df_input <- df_input %>% left_join(df_product_unik, by=c("description","description")) %>% 
  mutate(stock_code = ifelse(is.na(stock_code.y),stock_code.x,stock_code.y)) %>% 
  select(-c("stock_code.x","stock_code.y")) %>% 
  select(invoice_no,invoice_date,customer_id,country,stock_code,description,quantity,unit_price)


data_frame(
  jumlah_stock_code_unik = df_input$stock_code %>% unique() %>% length(),
  jumlah_description_unik = df_input$description %>% unique() %>% length(),
  jumlah_code_description_unik = df_input %>% select(stock_code,description) %>% distinct() %>% nrow()
)

#> # A tibble: 1 x 3
#>   jumlah_stock_code_unik jumlah_description_unik jumlah_code_description_unik
#>                    <int>                   <int>                        <int>
#> 1                   3764                    3764                         3764

Oke, setiap stock sudah bersifat unik. mari kita lanjut cek data country.

Invalid Country

data_frame(
  customer_id_unik = df_input %>% select(customer_id,country) %>% distinct() %>% nrow(),
  customer_country_unik = df_input %>% select(customer_id) %>% distinct() %>% nrow()
)

#> # A tibble: 1 x 2
#>   customer_id_unik customer_country_unik
#>              <int>                 <int>
#> 1             4311                  4295

Berdasar data diatas terdapat 1 customer yang memiliki 2 negara. Mungkin bisa karena customer tersebut pindah, oleh karena itu kita bisa ambil negara customer berdasarkan negara terakhir ia melakukan transaksi.

df_master_customer <- df_input %>% 
  arrange(desc(invoice_date,customer_id)) %>% 
  select(customer_id, country) %>% 
  group_by(customer_id) %>% 
  slice(1)

df_input <- df_input %>% select(-country) %>% 
  left_join(df_master_customer, by = c("customer_id","customer_id"))


data_frame(
  customer_id_unik = df_input %>% select(customer_id,country) %>% distinct() %>% nrow(),
  customer_country_unik = df_input %>% select(customer_id) %>% distinct() %>% nrow()
)

#> # A tibble: 1 x 2
#>   customer_id_unik customer_country_unik
#>              <int>                 <int>
#> 1             4295                  4295

Oke, setiap data customer dan negara sudah sesuai. mari kita lanjut cek data duplikat.

Duplicated Data

data.frame(
  jumlah_data = df_input %>% nrow(),
  jumlah_data_unik = df_input %>% distinct() %>% nrow()
)

#>   jumlah_data jumlah_data_unik
#> 1      503105           498077

Dataset ini memiliki data yang duplikat, untuk itu perlu kita remove.

df_input <- df_input %>% distinct()

# total data
df_input %>% nrow()

#> [1] 498077

Missing Values

colSums(is.na(df_input))

#>   invoice_no invoice_date  customer_id   stock_code  description     quantity 
#>            0            0       123903            0            0            0 
#>   unit_price      country 
#>            0            0

Dapat dilihat terdapat 123938 Missing Value pada customer_id. Projek ini ditujukan untuk melakukan segmentasi pelanggan dan membuat personalisasi rekomendasi produk. Segmentasi pelanggan jelas harus mengetahui siapa pelangannya, sehingga data transaksi yang tidak memiliki data customer_id perlu di-exclude. Kemudian untuk membangun sistem rekomendasi, kita bisa mengabaikan customer nya dan fokus pada produk yang dibeli, sehingga dapat menggunakan data invoice_no dan stock_code.

Oleh karena itu dapat kita putuskan untuk membagi data ini menjadi 3 dataset, yaitu:
1. df_general_transaction : dataset data original.
1. df_customer_transaction : dataset ini digungakan untuk segmentasi pelanggan, sehingga harus mengexclude missing values.
2. df_product_recomm : dataset untuk membangun sistem rekomendasi produk yang hanya terdiri dari customer_id, invoice_no, stock_code dan description.

df_general_transaction <- df_input
df_customer_transaction <- drop_na(df_input)
df_product_recomm <- df_input %>% select(customer_id,invoice_no,stock_code,description)

Feature Extraction

Kita bisa mengekstrak data total amount per transaksi berdasarkan quantity * unit_price mempermudah proses analisis selanjutnya.

df_general_transaction <- df_general_transaction %>% mutate(total_amount = quantity * unit_price) %>% 
  select(invoice_no,invoice_date,customer_id,country,stock_code,description,quantity,unit_price,total_amount)

df_customer_transaction <- df_customer_transaction %>% mutate(total_amount = quantity * unit_price) %>% 
  select(invoice_no,invoice_date,customer_id,country,stock_code,description,quantity,unit_price,total_amount)

head(df_general_transaction,5)

#> # A tibble: 5 x 9
#>   invoice_no invoice_date        customer_id country stock_code description
#>   <chr>      <dttm>                    <dbl> <fct>   <chr>      <chr>      
#> 1 536365     2010-12-01 08:26:00       17850 United~ 85123A     WHITE HANG~
#> 2 536365     2010-12-01 08:26:00       17850 United~ 71053      WHITE META~
#> 3 536365     2010-12-01 08:26:00       17850 United~ 84406B     CREAM CUPI~
#> 4 536365     2010-12-01 08:26:00       17850 United~ 84029G     KNITTED UN~
#> 5 536365     2010-12-01 08:26:00       17850 United~ 84029E     RED WOOLLY~
#> # ... with 3 more variables: quantity <dbl>, unit_price <dbl>,
#> #   total_amount <dbl>

Data Summary

df_general_transaction

#summary(df_transaction)
glimpse(df_general_transaction)

#> Observations: 498,077
#> Variables: 9
#> $ invoice_no   <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ customer_id  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country      <fct> United Kingdom, United Kingdom, United Kingdom, United...
#> $ stock_code   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description  <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity     <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ unit_price   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ total_amount <dbl> 15.30, 20.34, 22.00, 20.34, 20.34, 15.30, 25.50, 11.10...

wd <-  as.character(getwd())
saveRDS(object=df_general_transaction, file=paste(paste(wd,"/data_clean/",sep = ""),"df_general_transaction.rds",sep=""))

df_customer_transaction

#summary(df_customer_segment)
glimpse(df_customer_transaction)

#> Observations: 374,174
#> Variables: 9
#> $ invoice_no   <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ customer_id  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country      <fct> United Kingdom, United Kingdom, United Kingdom, United...
#> $ stock_code   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description  <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity     <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ unit_price   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ total_amount <dbl> 15.30, 20.34, 22.00, 20.34, 20.34, 15.30, 25.50, 11.10...

Data ini akan digunakan untuk melakukan segmentasi berdasarkan Recency, Frequency dan Monetary. Untuk itu perlu kita cek apakah terdapat data yang memiliki unit_price <= 0 ? Berapa banyak?

df_customer_transaction %>% filter(unit_price<=0) %>% nrow()

#> [1] 33

Terdapat 33 data transaksi yang memiliki unit_price <= 0. Bisa jadi karena efek promo, diskon atau kesalahan input. Namun karena kita tidak mengetahui data aslinya, data ini dapat kita remove karena jumlahnya juga tidak banyak.

df_customer_transaction <- df_customer_transaction %>% filter(unit_price>0)
wd <-  as.character(getwd())
saveRDS(object=df_customer_transaction, file=paste(paste(wd,"/data_clean/",sep = ""),"df_customer_transaction.rds",sep=""))

df_product_recomm

glimpse(df_product_recomm)

#> Observations: 498,077
#> Variables: 4
#> $ customer_id <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
#> $ invoice_no  <chr> "536365", "536365", "536365", "536365", "536365", "5363...
#> $ stock_code  <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752...
#> $ description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANT...

wd <-  as.character(getwd())
saveRDS(object=df_product_recomm, file=paste(paste(wd,"/data_clean/",sep = ""),"df_product_recomm.rds",sep=""))

kita ekstrak seluruh produk nya untuk menjadi master produk.

df_master_product <- df_product_recomm %>% select(stock_code,description) %>% distinct() %>% arrange(stock_code)
glimpse(df_master_product)

#> Observations: 3,764
#> Variables: 2
#> $ stock_code  <chr> "10002", "10080", "10120", "10123C", "10124A", "10124G"...
#> $ description <chr> "INFLATABLE POLITICAL GLOBE", "GROOVY CACTUS INFLATABLE...

wd <-  as.character(getwd())
saveRDS(object=df_master_product, file=paste(paste(wd,"/data_clean/",sep = ""),"df_master_product.rds",sep=""))

Exploratory Data Analysis

df_customer_transaction <- readRDS("data_clean/df_customer_transaction.rds")

Transaction

Berapa Total Order Amount yang dilakukan?

df_customer_transaction %>% summarise(total_order_amount = sum(total_amount))

#> # A tibble: 1 x 1
#>   total_order_amount
#>                <dbl>
#> 1           8224194.

Bagaimana Total Order Amount tiap bulan?

plot_monthly_transaction <- df_customer_transaction %>% 
  mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
          yearmonth = ymd(yearmonth),
          ym = as.yearmon(invoice_date)) %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_order_amount = sum(total_amount)) %>% 
  ungroup() %>% 
  mutate(
     normalisasi = (total_order_amount-min(total_order_amount))/(max(total_order_amount)-min(total_order_amount)),
     popup=glue("Year-Month : {ym}
                 Total Amount : {total_order_amount}")
  ) %>% 
  ggplot(aes(yearmonth,normalisasi))+
   geom_area(fill="green",alpha=0.7)+
  geom_line(size=0.7,color="#181818") +
  labs(
    title = "Total Transaction per Month",
    x = "Month-Year",
    y = NULL
  )+
  geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  theme(
    axis.text.y = element_blank()
  )

ggplotly(plot_monthly_transaction, tooltip = "text")

Frequency Order

Berapa Total Transaksi yang terjadi?

df_customer_transaction %>% 
  select(invoice_date,invoice_no) %>% 
  distinct() %>% nrow()

#> [1] 17656

Bagaimana total transaksi yang terjadi tiap bulan?

plot_monthly_frequency <- df_customer_transaction %>% select(invoice_date,invoice_no) %>% 
  distinct() %>% 
  mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
          yearmonth = ymd(yearmonth),
          ym = as.yearmon(invoice_date)) %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_frequency = n()) %>% 
  ungroup() %>% 
  mutate(
     normalisasi = (total_frequency-min(total_frequency))/(max(total_frequency)-min(total_frequency)),
     popup=glue("Year-Month : {ym}
                 Total Transaction : {total_frequency}")
  ) %>% 
  ggplot(aes(yearmonth,normalisasi))+
   geom_area(fill="yellow",alpha=0.7)+
  geom_line(size=0.7,color="#181818") +
  labs(
    title = "Total Transaction per Month",
    x = "Month-Year",
    y = NULL
  )+
  geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  theme(
    axis.text.y = element_blank()
  )

ggplotly(plot_monthly_frequency, tooltip = "text")

Customer

Berapa total customer yang ada?

df_customer_transaction %>% select(customer_id) %>% distinct() %>% nrow()

#> [1] 4293

Bagaimana jumlah customer yang melakukan transaksi tiap bulan?

plot_monthly_customer <- df_customer_transaction %>% select(invoice_date,customer_id) %>% 
  mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
          yearmonth = ymd(yearmonth),
          ym = as.yearmon(invoice_date)) %>% 
  select(customer_id,ym,yearmonth) %>% 
  distinct() %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_customer = n()) %>% 
  ungroup() %>% 
  mutate(
     normalisasi = (total_customer-min(total_customer))/(max(total_customer)-min(total_customer)),
     popup=glue("Year-Month : {ym}
                 Total Customer : {total_customer}")
  ) %>% 
  ggplot(aes(yearmonth,normalisasi))+
   geom_area(fill="blue",alpha=0.7)+
  geom_line(size=0.7,color="#181818") +
  labs(
    title = "Total Customer per Month",
    x = "Month-Year",
    y = NULL
  )+
  geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  theme(
    axis.text.y = element_blank()
  )

ggplotly(plot_monthly_customer, tooltip = "text")

Pertumbuhan customer baru tiap bulan?

plot_monthly_new_customer <- df_customer_transaction %>%
  group_by(customer_id) %>% 
  summarise(first_order = min(invoice_date)) %>% 
  ungroup() %>% 
  mutate( yearmonth = format(first_order, format="%Y-%m-1"),
          yearmonth = ymd(yearmonth),
          ym = as.yearmon(first_order)) %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_new_customer = n()) %>% 
  ungroup() %>% 
  mutate(
     normalisasi = (total_new_customer-min(total_new_customer))/(max(total_new_customer)-min(total_new_customer)),
     popup=glue("Year-Month : {ym}
                 Total New Customer : {total_new_customer} ({round((total_new_customer/sum(total_new_customer))*100,1)}%)")
  ) %>% 
  ggplot(aes(yearmonth,normalisasi))+
   geom_area(fill="blue",alpha=0.7)+
  geom_line(size=0.7,color="#181818") +
  labs(
    title = "Growth of New Customer",
    x = "Month-Year",
    y = NULL
  )+
  geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  theme(
    axis.text.y = element_blank()
  )


ggplotly(plot_monthly_new_customer, tooltip = "text")

Order Habbit

Hari apa customer sering berbelanja?

plot_order_wday <- df_customer_transaction %>% select(invoice_date,invoice_no) %>% 
  distinct() %>% 
  mutate(wday = wday(invoice_date,week_start = getOption("lubridate.week.start", 1))) %>% 
  group_by(wday) %>% 
  summarise(total_transaction = n()) %>% 
  ungroup() %>% 
  mutate(popup = glue("Weekday: {wday}
                      Total Transaction: {total_transaction}")) %>% 
  ggplot(aes(wday,total_transaction))+
  geom_bar(stat="identity", aes(fill=total_transaction, text=popup), show.legend = FALSE)+
  labs(
    title = "Day of Most Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) 

ggplotly(plot_order_wday, tooltip="text")

Jam berapa customer sering berbelanja?

 plot_order_time <- data_frame(hour_of_day = c(1:24)) %>% left_join(
  df_customer_transaction %>% select(invoice_date,invoice_no) %>% 
  distinct() %>% 
  mutate(hour_of_day = hour(invoice_date)) %>% 
  group_by(hour_of_day) %>% 
  summarise(total_transaction = n()) %>% 
  ungroup(), by=c("hour_of_day","hour_of_day")) %>% 
  mutate(total_transaction = replace_na(total_transaction,0)) %>% 
  mutate(popup = glue("Hour of Day : {hour_of_day}
                      Total Transaction: {total_transaction}")) %>% 
  ggplot(aes(hour_of_day,total_transaction))+
  geom_bar(stat="identity", aes(fill=total_transaction, text=popup), show.legend = FALSE)+
  labs(
    title = "Time of Most Order Frequency",
    x = "Hour of Day", 
    y = NULL)
  

ggplotly(plot_order_time, tooltip="text")

RFM Segmentation

Metode analisis Recency, Frequency, Monetary Value (RFM) adalah salah satu metode analisis perilaku pelanggan berdasarkan histori transaksinya. Output dari analisis RFM dapat digunakan untuk melakukan segmentasi pelanggan. Metode analisis RFM dapat menjelaskan:

Seberapa baru pelanggan melakukan transaksi?
Seberapa sering pelanggan melaukan transaksi?
Seberapa besar transaksi yang sudah dilakukan pelanggan?

Sesuai artikel yang dipublish oleh www.marketeers.com, GO-JEK adalah salah satu perusahaan yang menggunakan metode analisis RFM dalam menentukan segmentasi pelanggan. GO-JEK membagi segmentasi pelanggan ke dalam empat kelas, yaitu Gold, Silver, Bronze, dan Non-Profit. Segmentasi Gold memiliki kualifikasi konsumen dengan high monetary, high frequency, dan high recency. Silver memiliki kualifikasi konsumen dengan tingkat monetary yang tinggi, frequency yang rendah, dan recency yang tinggi. Kategori Bronze terdiri dari konsumen dengan tingkat monetary rendah, frequency, dan recency yang tinggi. Sementara segmentasi Non-Profit memiliki kualifikasi konsumen dengan kualifikasi monetary, frequency, dan recency yang rendah.

Dalam case ini, dataset berisikan data transaksi dari 1 Desember 2010 sampai 30 November 2011, sehingga berikut perlakukan RFM Value untuk case ini:

Recency : Selisih antara hari terakhir pelanggan melakukan transaksi dan hari melakukan analisis. Dalam case ini, hari analisis menggunakan data hari terakhir transaksi + 1. Frequency : Jumlah transaksi yang dilakukan oleh pelanggan dari 1 Desember 2010 sampai 30 November 2011. Monetary : Jumlah total order amount yang sudah dikeluarkan pelanggan dari 1 Desember 2010 sampai 30 November 2011.

Generate RFM Value

Berikut sample data RFM Value:

analysis_date = date(max(df_customer_transaction$invoice_date))+days(1)


df_customer_rfm <- df_customer_transaction %>% 
  group_by(customer_id,country) %>% 
  summarise(
         first_order = as.Date(min(invoice_date)),
         last_order = as.Date(max(invoice_date)),
         min_amount_order = min(total_amount),
         med_amount_order = median(total_amount),
         max_amount_order = max(total_amount),
         recency =  as.integer(analysis_date-date(last_order)),
         monetary = sum(total_amount))


df_customer_rfm <- df_customer_transaction %>% select(customer_id,invoice_no) %>% 
  distinct() %>% 
  group_by(customer_id) %>% 
  summarise(frequency = n()) %>% 
  ungroup() %>% 
  left_join(df_customer_rfm,by=c("customer_id","customer_id"))


# Baskets size
df_customer_rfm <- df_customer_transaction %>% group_by(customer_id,invoice_no) %>% 
  summarise(freq = n()) %>% ungroup() %>% 
  group_by(customer_id) %>% 
  summarise(
    min_baskets = min(freq),
    max_baskets = max(freq),
    avg_baskets = round(median(freq),0)
  ) %>% ungroup() %>% 
  left_join(df_customer_rfm, by = c("customer_id","customer_id")) %>% 
  select(customer_id, country, first_order,last_order,min_amount_order,med_amount_order,max_amount_order,
         min_baskets,max_baskets,avg_baskets,recency, frequency, monetary)


head(df_customer_rfm,10)

#> # A tibble: 10 x 13
#>    customer_id country first_order last_order min_amount_order med_amount_order
#>          <dbl> <fct>   <date>      <date>                <dbl>            <dbl>
#>  1       12346 United~ 2011-01-18  2011-01-18         77184.            77184. 
#>  2       12347 Iceland 2010-12-07  2011-10-31             5.04             17  
#>  3       12348 Finland 2010-12-16  2011-09-25            13.2              41.8
#>  4       12349 Italy   2011-11-21  2011-11-21             6.64             17.7
#>  5       12350 Norway  2011-02-02  2011-02-02             8.5              18.8
#>  6       12352 Norway  2011-02-16  2011-11-03             9.9              17.4
#>  7       12353 Bahrain 2011-05-19  2011-05-19            11.6              18.8
#>  8       12354 Spain   2011-04-21  2011-04-21             8.5              17.0
#>  9       12355 Bahrain 2011-05-09  2011-05-09            17.7              25.5
#> 10       12356 Portug~ 2011-01-18  2011-11-17             3.75             34.8
#> # ... with 7 more variables: max_amount_order <dbl>, min_baskets <int>,
#> #   max_baskets <int>, avg_baskets <dbl>, recency <int>, frequency <int>,
#> #   monetary <dbl>

Recency Segmentation

Data ini merupakan histori transksi selama 1 tahun, untuk itu bisa kita buat aturan sebagai berikut:

Active : Recency <= 30
Warm : 30 < Recency <= 90
Cold : 90 < Recency <= 180
Inactive : Recency > 180

recencny_adjust <- data.frame(
  active = 30,
  warm = 90,
  cold = 180,
  inactive = Inf
)


df_customer_rfm <- df_customer_rfm %>% mutate(
  recency_segment = case_when(recency <= recencny_adjust$active ~ "active",
                              recency > recencny_adjust$active & recency <= recencny_adjust$warm ~ "warm",
                              recency > recencny_adjust$warm & recency <= recencny_adjust$cold ~ "cold",
                              TRUE ~ "inactive"),
  recency_segment = factor(recency_segment, levels=c(names(recencny_adjust)))
)

recency_segmentation <- df_customer_rfm %>% group_by(recency) %>% summarise(freq=n()) %>% ungroup()


plot_recency_segmentation <- ggplot(recency_segmentation,aes(recency,freq))+
  #geom_line()+
 
  geom_vline(xintercept = recencny_adjust$active,linetype="dotted",color = "black", size=1,alpha=0.5)+
  geom_rect(aes(xmin = 0, xmax = recencny_adjust$active,
                ymin = 0, ymax = Inf),fill = "green", alpha = 0.002)+
  
  geom_vline(xintercept = recencny_adjust$warm,linetype="dotted",color = "black", size=1,alpha=0.5)+
  geom_rect(aes(xmin = recencny_adjust$active, xmax = recencny_adjust$warm, 
                ymin = 0, ymax = Inf),fill = "orange", alpha = 0.002)+
  
  geom_vline(xintercept = recencny_adjust$cold,linetype="dotted",color = "black", size=1, alpha=0.5)+
  geom_rect(aes(xmin = recencny_adjust$warm, xmax = recencny_adjust$cold, 
                ymin = 0, ymax = Inf),fill = "blue", alpha = 0.002)+
  
  #geom_vline(xintercept = recencny_adjust$cold,linetype="dotted",color = "black", size=1)+
   geom_rect(aes(xmin = recencny_adjust$cold, xmax = Inf, 
                ymin = 0, ymax = Inf),fill = "red", alpha = 0.8)+
  
  geom_area(fill="black",alpha=0.2)+
  geom_point(aes(x=recencny_adjust$active, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$active, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$active," days"))+
  
  geom_point(aes(x=recencny_adjust$warm, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$warm, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$warm," days"))+
  
  geom_point(aes(x=recencny_adjust$cold, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$cold, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$cold," days"))+
  
  geom_point(aes(x=recencny_adjust$cold, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$cold, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$cold," days"))+

  annotate("text", x = (0+recencny_adjust$active)/2, y = max(recency_segmentation$freq)-5, 
           color = "black", size=4, label="Active")+
  annotate("text", x = median(recencny_adjust$active:recencny_adjust$warm), y = max(recency_segmentation$freq)-5, 
           color = "black", size=4, label="Warm")+
  annotate("text", x = median(recencny_adjust$warm:recencny_adjust$cold), y = max(recency_segmentation$freq)-5, 
           color = "black", size=4, label="Cold")+
  annotate("text", x = median(recencny_adjust$cold:max(recency_segmentation$recency)), y = max(recency_segmentation$freq)-5, 
          color = "black", size=4, label="Inactive")+
  labs(
    title="Distribution of Recency Segementation",
    x = "Recency (days since last transaction)",
    y = "Total Customer",
    subtitle = paste0("Data from ", min(df_customer_rfm$first_order)," untill ",  max(df_customer_rfm$last_order))
  )+
  theme_minimal()

ggplotly(plot_recency_segmentation)

plot_cust_recency <- df_customer_rfm %>% group_by(recency_segment) %>% 
  summarise(
    total_customer = n(),
    total_transaction = sum(frequency),
    total_monetary = sum(monetary),
    percent_monetary = round((total_monetary/sum(df_customer_rfm$monetary))*100,1)
  ) %>% 
  ungroup() %>% 
  mutate(
    popup = glue("Recency Segment : {recency_segment}
                 Total Customer : {comma(total_customer)}
                 Total Trasanction: {total_transaction}
                 Total Monetary: BRL {comma(total_monetary)} ({percent_monetary}%)")
  ) %>% 
  ggplot(aes(recency_segment,total_monetary))+
  geom_bar(stat = "identity", aes(fill=recency_segment, text=popup), show.legend = FALSE)+
  geom_text(aes(label=paste0(comma(total_customer)," (",round((total_customer/sum(total_customer))*100,1),"%) Cust"),
                y=total_monetary+450000),size=3,color="black")+
  geom_text(aes(label=paste0(round((total_monetary/sum(total_monetary))*100,1),"% Revenue"),y=total_monetary+150000), size=3)+
  labs(
    title = "Total Revenue based Recency Segmentation",
    x = "Recency Segment",
    y = "Total Revenue",
    subtitle = paste0("Data from ", min(df_customer_rfm$first_order)," untill ",  max(df_customer_rfm$last_order))
  ) +
  theme_minimal()


ggplotly(plot_cust_recency,tooltip="text") %>% 
  layout(showlegend=FALSE)

Chart diatas menunjukan hanya 38.7% dari total customer kita yang masuk segmen Active dengan Revenue yang dihasilkan sebesar 70.7%, maka tentunya perusahaan harus mempertahankan customer pada segmen ini untuk dapat tetap aktif.

Namun, bagaimana jika terdapat High value customer yang tidak masuk ke segmen active customer? Segmen diatas dapat dibagi lagi berdasarkan nilai transaksi customer. Mengelompokan berdasarkan nilai customer sangat penting karena perusahaan dapat menawarkan promosi yang sesuai dengan kebiasaan pengeluaran customer. Mengetahui setiap nilai customer juga membantu bisnis menciptakan pemasaran bertarget untuk pelanggan bernilai tinggi. Misalnya, pada segmen warm, cold dan inactive terdapat customer yang memberikan revenue besar bagi perusahaan, maka customer ini dapat menjadi target utama pemasaran. Untuk itu kita perlu breakdown segmen ini berdasarkan Frekuensi Order dan total pengeluaran customer.

Frequency/Monetary Segmentation

Correlaction Checking

Jika dilihat frequency dan monetary memiliki korelasi yang cukup kuat, dalam artian semakin tinggi korelasi maka semakin besar nilai monetary.

ggcorr(df_customer_rfm %>% select(recency,frequency,monetary), label = TRUE, label_size = 4, hjust=0.2)

Outlier Checking

Mari kita cek dahulu sebaran data dan outliernya.

par(mfrow=c(2,2),cex=0.8)
#hist(cust_rfm$log_recency,main="log of recency")
hist(df_customer_rfm$frequency,main="frequency",breaks=200)
hist(df_customer_rfm$monetary,main="monetary", breaks=10000)
boxplot(df_customer_rfm$frequency,main="frequency")
boxplot(df_customer_rfm$monetary, main="monetary")

Dari visualisasi diatas, nilai frequency dan monetary diatas memiliki outlier yang cukup banyak. Mari kita cek lebih detail lewat visualisasi yang lebih jelas.

ggplotly(
  df_customer_rfm %>% 
  ggplot(aes(frequency,monetary))+
  geom_jitter()+
  geom_smooth(method = lm)+
  labs(
    title = "Frequency vs Monetary of all data"
  )+
  scale_x_continuous(breaks = seq(from = 0, to = 200, by = 10))+
  scale_y_continuous(breaks = seq(from = 0, to = 270000, by = 10000)))

Dapat dilihat lebih jelas terdapat beberapa customer yang sangat jauh dari pusat data dan bisa kita sebut outlier. Jika dizoom, bisa dilihat lebih jelas data cukup berpusat dari pada frequency <= 38 dan monetary <= 15000. Mari kita cek:

frequency_adjust = 30
monetary_adjust = 12000

df_customer_rfm_adjust <-  df_customer_rfm %>% 
    filter(frequency <= frequency_adjust, monetary <=monetary_adjust)

txt_freqmon_range <- paste0("Range Frequency <= ",frequency_adjust, " and Monetary <=", monetary_adjust)
txt_data_total <- paste0("(",round((df_customer_rfm_adjust %>% nrow())/(df_customer_rfm %>% nrow())*100,1),"% Customer)")

ggplotly(
  df_customer_rfm_adjust %>% 
  ggplot(aes(frequency,monetary))+
  geom_jitter()+
  geom_smooth(method = lm)+
  labs(
    title = "Frequency vs Monetary"
  )) %>% 
  layout(title = list(text = paste0('Frequency vs Monetary',
                                    '<br>',
                                    '<sup>',
                                    paste0(txt_freqmon_range,"  ",txt_data_total),
                                    '</sup>')))

79 (98.2%) customer berada pada range frequency <= 30 dan monetary <= 15000. Metode clustering cukup sensitif dengan jarak, sehingga kita perlu mengexclude data yang terlalu jauh kemudian melakukan scalling. Untuk itu kita akan melakukan clustering terhadap customer pada range ini dan data lainnya kita analisa terpisah.

Scalling

norm_minmax <- function(x){
  return ((x - min(x))/(max(x) - min(x)))
}

df_customer_rfm_adjust$mm_frequency <- norm_minmax(df_customer_rfm_adjust$frequency)
df_customer_rfm_adjust$mm_monetary <- norm_minmax(df_customer_rfm_adjust$monetary)


par(mfrow=c(2,2),cex=0.9)
#hist(cust_rfm$mm_recency,main="1 - minmax of log_recency")
hist(df_customer_rfm_adjust$mm_frequency,main="minmax of frequency")
hist(df_customer_rfm_adjust$mm_monetary,main="minmax of monetary")

#boxplot(cust_rfm$mm_recency, main="1 - minmax of log_recency")
boxplot(df_customer_rfm_adjust$mm_frequency,main="minmax of frequency")
boxplot(df_customer_rfm_adjust$mm_monetary, main="minmax of monetary")

Optimal K For Clustering

Proses Clustering akan menggunakan nilai dari min-max normalization.

Elbow Method

fviz_nbclust(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], kmeans, method = "wss", k.max = 15) + 
  labs(subtitle = "Elbow method")

Berdasarkan plot penentuan Optimal-K diatas, Elbow Method menunjukan penuruan mulai stabil pada K=3. Namun, jika dilihat pada K=5, penurunan cenderung statis hingga K=15. Untuk itu mari kita coba menggunakan nilai K=5.

Silhouette Method

fviz_nbclust(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], kmeans, "silhouette", k.max = 15) + 
  labs(subtitle = "Silhouette method")

Silhouette Method menunjukan bahwa optimal K=2, namun jika dilihat nilai K=3 dan K=5 juga cukup optimum. Kita ingin mengclusteringkan customer lebih spesifik, tentunya setiap cluster harus memiliki identitas. Jika menggunakan 2 cluster maka cluster cenderung kurang memiliki identitas, maka mari kita coba melakukan clustering menggunakan K=3 dan K=5.

Clustering

hm.palette <-colorRampPalette(rev(brewer.pal(10, 'RdYlGn')),space='Lab')

Clustering using K=3

K-Means

Hasil K-Means dengan 3 clusters terbagi menjadi 2624:396:1024 dengan nilai between_SS / total_SS = 73.4 %

set.seed(2020, sample.kind = "Rounding")

kmeans_mm_k3 <- kmeans(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")],centers = 3, iter.max = 20)
kmeans_mm_k3$size

#> [1] 3141  865  208

round((kmeans_mm_k3$betweenss/kmeans_mm_k3$totss)*100,1)

#> [1] 75.1

DBScan

Hasil DBScan dengan 3 clusters terbagi menjadi 1477:720:364 dengan noise sebanyak 1654. Jumlah noise terlalu banyak.

kNNdistplot(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], k = 3)
abline(h = 0.02, col = "red")

db_clust_k3 <- dbscan(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], eps = 0.02, minPts = 150)
db_clust_k3

#> DBSCAN clustering for 4214 objects.
#> Parameters: eps = 0.02, minPts = 150
#> The clustering contains 3 cluster(s) and 1654 noise points.
#> 
#>    0    1    2    3 
#> 1654 1476  720  364 
#> 
#> Available fields: cluster, eps, minPts

K-Means vs DBScan using K=3

# K-Means
cust_fm_mm_k3 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=kmeans_mm_k3$cluster) %>%
   mutate(
    fm_segment=as.factor(fm_segment))

perfom <- round((kmeans_mm_k3$betweenss/kmeans_mm_k3$totss)*100,1)

cluster_rfm_mm_profile_k3 <- cust_fm_mm_k3 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()



cluster_rfm_mm_profile_k3 %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "K-Means Clustering for 3 Segment",
    subtitle = paste("betweenss/totts:",perfom,"%"),
    x = "Cluster of Frequency/Monetary"
  ) -> plot_kmeans_k3



# DBScan 
dbscan_mm_k3 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=db_clust_k3$cluster) %>%
  mutate(fm_segment=as.factor(fm_segment))


dbscan_mm_k3_profile <- dbscan_mm_k3 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()


dbscan_mm_k3_profile %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "DBScan Clustering for 3 segment, with 569  noise",
    x = "Cluster of Frequency/Monetary"
  ) -> plot_dbscan_k3


grid.arrange(plot_kmeans_k3, plot_dbscan_k3, nrow = 1)

Clustering using K=5

K-Means

Hasil K-Means dengan 5 clusters terbagi menjadi 2304, 151, 1041, 176, 372 dengan nilai between_SS / total_SS = 84.1 %

set.seed(2020, sample.kind = "Rounding")

kmeans_mm_k5 <- kmeans(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")],centers = 5, iter.max = 20)
kmeans_mm_k5$size

#> [1] 2306 1048  538  219  103

round((kmeans_mm_k5$betweenss/kmeans_mm_k5$totss)*100,1)

#> [1] 83.6

DBScan

Hasil DBScan dengan 5 cluster terbagi menjadi 321:1520:773:460:174 dengan noise sebanyak 967. Jumlah noise terlalu banyak.

kNNdistplot(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], k = 5)
abline(h = 0.025, col = "red")

db_clust_k5 <- dbscan(df_customer_rfm_adjust[,c("mm_frequency","mm_monetary")], eps = 0.025, minPts = 50)
db_clust_k5

#> DBSCAN clustering for 4214 objects.
#> Parameters: eps = 0.025, minPts = 50
#> The clustering contains 5 cluster(s) and 967 noise points.
#> 
#>    0    1    2    3    4    5 
#>  967  322 1519  773  460  173 
#> 
#> Available fields: cluster, eps, minPts

K-Means vs DBScan using K=5

# K-Means
cust_fm_mm_k5 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=kmeans_mm_k5$cluster) %>%
   mutate(
    fm_segment=as.factor(fm_segment))

perfom <- round((kmeans_mm_k5$betweenss/kmeans_mm_k5$totss)*100,1)


cluster_rfm_mm_profile_k5 <- cust_fm_mm_k5 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()



cluster_rfm_mm_profile_k5 %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "K-Means Clustering for 5 Segment",
    subtitle = paste("betweenss/totts:",perfom,"%"),
    x = "Cluster of Frequency/Monetary"
  ) -> plot_kmeans_k5



# DBScan 
dbscan_mm_k5 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=db_clust_k5$cluster) %>%
  mutate(fm_segment=as.factor(fm_segment))


dbscan_mm_k5_profile <- dbscan_mm_k5 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()


dbscan_mm_k5_profile %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "DBScan Clustering for 5 segment, exclude 345 noise",
    x = "Cluster of Frequency/Monetary"
  ) -> plot_dbscan_k5


grid.arrange(plot_kmeans_k5, plot_dbscan_k5, nrow = 1)

FM Segmentation Profiling

cluster_fm_summary <- cluster_rfm_mm_profile_k5 %>% mutate(
   popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,1)})%
               Total Monetary: BRL {comma(total_monetary)} ({comma((total_monetary/sum(total_monetary))*100)}%)
               Frequency Range: {min_frequency} - {max_frequency}
               Monetary Range: BRL {min_monetary} - {max_monetary}
               Avg. Frequency: {avg_frequency} Order
               Avg. Monetary: BRL {avg_monetary}"),
  fm_segment = as.factor(fm_segment)
) %>% 
ggplot(aes(fm_segment,avg_frequency)) +
  geom_bar(aes(fill=fm_segment, text=popup),stat="identity", show.legend = FALSE)+
  geom_text(aes(label=paste0('Avg: ',round(avg_frequency,0)),y=avg_frequency+0.5),size=3, color="black")+  
  #geom_text(aes(label=paste0(round((total_customer/sum(total_customer))*100,1),"% cust"),
  #              y=total_monetary+220000),size=3, color="black")+
  labs(
    title="Profiling: Average Frequency per Cluster using K=5",
    x="Cluster",
    y="Average Frequency"
  )+theme_minimal()


ggplotly(cluster_fm_summary,tooltip="text") %>% 
  layout(showlegend=FALSE)

cluster_fm_summary <- cluster_rfm_mm_profile_k5 %>% mutate(
    popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,1)})%
               Total Monetary: BRL {comma(total_monetary)} ({comma((total_monetary/sum(total_monetary))*100)}%)
               Frequency Range: {min_frequency} - {max_frequency}
               Monetary Range: BRL {min_monetary} - {max_monetary}
               Avg. Frequency: {avg_frequency} Order
               Avg. Monetary: BRL {avg_monetary}"),
  fm_segment = as.factor(fm_segment)
) %>% 
ggplot(aes(fm_segment,avg_monetary)) +
  geom_bar(aes(fill=fm_segment, text=popup),stat="identity", show.legend = FALSE)+
  geom_text(aes(label=paste0("BRL ",comma(avg_monetary,1)),y=avg_monetary+150),size=3, color="black")+  
  #geom_text(aes(label=paste0(round((total_customer/sum(total_customer))*100,1),"% cust"),
  #              y=total_monetary+220000),size=3, color="black")+
  labs(
    title="Profiling: Average Monetary per Cluster using K=5",
    x="Cluster",
    y="Average Monetary"
  )+theme_minimal()


ggplotly(cluster_fm_summary,tooltip="text") %>% 
  layout(showlegend=FALSE)

Berdasarkan visualisasi diatas maka, setiap cluster bisa artikan sebagai berikut:

Cluster 1: Medium Frequency, Medium Monetary. code: VLF-VLM. Kita sebut Very Low Value Customer.
Cluster 2: Very Low Frequency, Very Low Monetary. LF-LM. Kita sebut Low Value Customer
Cluster 3: Very High Frequency, Very High Monetary. code: MF-MM. Kita sebut Medium Value Customer
Cluster 4: Low Frequency, Low Monetary. code: HF-HM. Kita sebut High Value Customer
Cluster 5: High Frequency, High Monetary. code: VHF-VHM. Kita sebut Very High Value Customer

Berikut ini sebaran data hasil clusteringnya:

fm_segment_result <- cust_fm_mm_k5 %>% 
  mutate(
   fm_segment_code = case_when(fm_segment == "1" ~ "VLF-VLM",
                        fm_segment == "2" ~ "LF-LM",
                        fm_segment == "3" ~ "MF-MM",
                        fm_segment == "4" ~ "HF-HM",
                        TRUE ~ "VHF-VHM"),
   fm_segment_code = factor(fm_segment_code, levels=c("VLF-VLM","LF-LM","MF-MM","HF-HM","VHF-VHM")),
   fm_segment_desc = case_when(fm_segment == "1" ~ "Very Low Frequency, Very Low Monetary",
                        fm_segment == "2" ~ "Low Frequency, Low Monetary",
                        fm_segment == "3" ~ "Medium Frequency, Medium Monetary",
                        fm_segment == "4" ~ "High Frequency, High Monetary",
                        TRUE ~ "Very High Frequency, Very High Monetary"),
   fm_segment_desc = factor(fm_segment_desc,levels=c("Very Low Frequency, Very Low Monetary",
                                                      "Low Frequency, Low Monetary",
                                                      "Medium Frequency, Medium Monetary",
                                                      "High Frequency, High Monetary",
                                                      "Very High Frequency, Very High Monetary")),
   fm_segment_name = case_when(fm_segment == "1" ~ "Very Low Value",
                        fm_segment == "2" ~ "Low Value",
                        fm_segment == "3" ~ "Medium Value",
                        fm_segment == "4" ~ "High Value",
                        TRUE ~ "Very High Value"),
   fm_segment_name = factor(fm_segment_name, levels=c("Very Low Value","Low Value","Medium Value",
                                                      "High Value","Very High Value"))) %>% 
  select(customer_id,fm_segment_code,fm_segment_desc,fm_segment_name)

rfm_segment_result <- df_customer_rfm_adjust %>% 
  #filter(frequency<400, monetary<6000) %>% 
  left_join(fm_segment_result, by=c("customer_id","customer_id")) %>% 
  select(-c(mm_frequency,mm_monetary))


head(rfm_segment_result,10)

#> # A tibble: 10 x 17
#>    customer_id country first_order last_order min_amount_order med_amount_order
#>          <dbl> <fct>   <date>      <date>                <dbl>            <dbl>
#>  1       12347 Iceland 2010-12-07  2011-10-31             5.04             17  
#>  2       12348 Finland 2010-12-16  2011-09-25            13.2              41.8
#>  3       12349 Italy   2011-11-21  2011-11-21             6.64             17.7
#>  4       12350 Norway  2011-02-02  2011-02-02             8.5              18.8
#>  5       12352 Norway  2011-02-16  2011-11-03             9.9              17.4
#>  6       12353 Bahrain 2011-05-19  2011-05-19            11.6              18.8
#>  7       12354 Spain   2011-04-21  2011-04-21             8.5              17.0
#>  8       12355 Bahrain 2011-05-09  2011-05-09            17.7              25.5
#>  9       12356 Portug~ 2011-01-18  2011-11-17             3.75             34.8
#> 10       12357 Switze~ 2011-11-06  2011-11-06            10.1              35.4
#> # ... with 11 more variables: max_amount_order <dbl>, min_baskets <int>,
#> #   max_baskets <int>, avg_baskets <dbl>, recency <int>, frequency <int>,
#> #   monetary <dbl>, recency_segment <fct>, fm_segment_code <fct>,
#> #   fm_segment_desc <fct>, fm_segment_name <fct>

plot_ly(rfm_segment_result, x = ~recency, y = ~frequency, z = ~monetary, color = ~fm_segment_name,
        #colors=c("#207c06","#799922","#beb448","#ffd178","#f3a252","#e5703b","#d33333"),
        hoverinfo = 'text', text = ~paste("Customer ID: ", customer_id,
                                          "<br>Segment : ",fm_segment_name,
                                          "<br>Segment Description: <b>",fm_segment_desc,"</b>",
                                          "<br>Recency: <b>",recency,"</b>",
                                          "<br>Frequency: <b>", frequency,"</b>",
                                          "<br>Monetary: <b>BRL ",monetary,"</b>")) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Recency'),
                     yaxis = list(title = 'Frequency'),
                     zaxis = list(title = 'Monetary'))) %>% 
  layout(title="Customer Value Segmentation")

RFM Segmentation Summary

plot_fm_total_monetary <- rfm_segment_result %>% group_by(fm_segment_name) %>% 
  summarise(total_monetary = sum(monetary),
            total_customer = n(),
            avg_frequency = median(frequency),
            avg_monetary = median(monetary)) %>% 
  ungroup() %>% 
  mutate(popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,1)})%
               Total Monetary: GBP {comma(total_monetary)}  ({comma((total_monetary/sum(total_monetary))*100)}%)
               Avg. Frequency: {avg_frequency} Order
               Avg. Monetary: GBP {avg_monetary}")) %>% 
  ggplot(aes(fm_segment_name,total_monetary)) +
  geom_bar(aes(fill=fm_segment_name, text=popup),stat="identity", show.legend = FALSE)+
  geom_text(aes(label=paste0("GBP ",comma(total_monetary),'(',round(total_monetary/sum(total_monetary)*100,1),'%)'),
                y=total_monetary+70000),size=3, color="black")+  
  geom_text(aes(label=paste0(round((total_customer/sum(total_customer))*100,1),"% cust"),
                y=total_monetary+140000),size=3, color="black")+
  labs(
    title="Total Monetary per FM Segment",
    x="Cluster",
    y="Total Monetary"
  )+theme_minimal()


ggplotly(plot_fm_total_monetary,tooltip="text") %>% 
  layout(showlegend=FALSE)

plot_rfm_summary <- rfm_segment_result %>% 
  group_by(recency_segment, fm_segment_name) %>% 
  summarise(total_monetary=sum(monetary),
            total_customer=n()) %>% 
  ungroup() %>% 
  mutate(
    popup = glue("Total Customer: {total_customer} ({round((total_customer/sum(total_customer))*100,2)}%)
                 Total Monetary : GBP {total_monetary} ({comma(round((total_monetary/sum(total_monetary))*100,2))}%)")
  ) %>% 
  ggplot(aes(x=fm_segment_name,total_monetary))+
  geom_text(aes(label=paste0("M: ",round((total_monetary/sum(total_monetary))*100,2),"%"),
                y=total_monetary+35000),size=3, color="black")+
  geom_text(aes(label=paste0("C: ",round((total_customer/sum(total_customer))*100,2),"%"),
                y=total_monetary+100000),size=3, color="black")+
  # geom_bar(aes(fill=fm_segment),stat="identity", show.legend = FALSE)+
  geom_rect(aes(xmin = 0, xmax = Inf,
                ymin = 0, ymax = Inf),fill = "red", alpha = 0.02)+
  geom_bar(aes(fill=fm_segment_name, text=popup),stat="identity", show.legend = FALSE)+
  labs(
    title="RFM Segment Summary",
    x="Customer Value Segment",
    y="Total Monetary"
  )+
  facet_wrap(~ recency_segment)+
  theme(
    legend.position = "none", 
    axis.text.y = element_blank(),
    plot.title = element_text(hjust = 0.5),
    panel.spacing = unit(1, "lines"))

ggplotly(plot_rfm_summary, tooltip = "text", height = 640)%>% 
  layout(showlegend=FALSE) %>% 
  #layout(margin = list(l = 50, r = 30, b = 50, t = 90)) %>% 
  layout(title = list(text = paste0('RFM Segmentation Summary',
                                    '<br>',
                                    '<sup>',
                                    '</sup>')))

Sebelumnya terdapat 79 data customer yang kita exclude dari proses clustering karena outlier. Customer ini jelas dapat kita kategorikan sebagai Special Customer karena memiliki frequency dan Monetary yang terlalu jauh dari customer lainnya.

cust_segment_result <- rbind(
  rfm_segment_result,
  df_customer_rfm %>% filter(!customer_id %in% c(df_customer_rfm_adjust$customer_id %>% unique())) %>% 
  mutate(
    fm_segment_code = "SPC",
    fm_segment_desc = "Special Customer",
    fm_segment_name = "Special Customer",
    fm_segment_code = factor(fm_segment_code, levels=c("VLF-VLM","LF-LM","MM-MM","HF-HM","VHF-VHM","SPC")),
     fm_segment_desc = factor(fm_segment_desc,levels=c("Very Low Frequency, Very Low Monetary",
                                                      "Low Frequency, Low Monetary",
                                                      "Medium Frequency, Medium Monetary",
                                                      "High Frequency, High Monetary",
                                                      "Very High Frequency, Very High Monetary",
                                                      "Special Customer")),
    fm_segment_name = factor(fm_segment_name, levels=c("Very Low Value","Low Value","Medium Value",
                                                      "High Value","Very High Value","Special Customer"))))

plot_tile_rfm_segmentation <- cust_segment_result %>% 
  group_by(recency_segment,fm_segment_name,fm_segment_desc) %>% 
  summarise(freq = n(),
            total_monetary=sum(monetary),
            avg_monetary = round(mean(monetary)),
            avg_frequency = median(frequency)) %>% 
  ungroup() %>% 
  mutate(
      popup=glue("Receny Segment : {toupper(recency_segment)}
                 Value Segment : {toupper(fm_segment_name)}
                 Total Customer: {freq}
                 Total Monetary : {comma(total_monetary)}
                 Avg. Monetary  : GBP {comma(avg_monetary)}
                 Most Frequency : {avg_frequency}")
  ) %>% 
  mutate(percent_freq = round((freq/sum(freq))*100,2)) %>% 
  ggplot(aes(recency_segment,fm_segment_name))+
  geom_tile(aes(fill = avg_monetary, text=popup), colour = "white", show.legend = TRUE) +
  geom_text(aes(label=paste0(percent_freq,"%"),text=popup), size=3, color="white", show.legend = FALSE)+
  labs(
      y= NULL,
      x= "Days since last purchase",
      fill = "Avg. Monetary"
  )+
  scale_fill_gradient(low="red",na.value = "#C0C0C0", high="green")+
  theme(legend.title=element_text(size=9),
        legend.position = "bottom")

ggplotly(plot_tile_rfm_segmentation, tooltip="text") %>% 
  #layout(showlegend=FALSE) %>% 
  #layout(margin = list(l = 50, r = 30, b = 50, t = 90)) %>% 
  layout(legend=list(orientation = "h", y=-0.1))%>% 
  layout(title = list(text = paste0('RFM Segmentation Distribution',
                                    '<br>',
                                    '<sup> Customer Distribution based on RFM Segmentation',
                                    '</sup>')))

Segment Analysis

Daily Order Behaviour

 rfm_order_habit_day <-  df_customer_transaction %>% 
  left_join(cust_segment_result, by=c("customer_id","customer_id")) %>% 
  select(invoice_date,invoice_no,recency_segment,fm_segment_name) %>%
  distinct() %>% 
  mutate(wday = wday(invoice_date,week_start = getOption("lubridate.week.start", 1))) %>% 
  group_by(wday,fm_segment_name) %>% 
  summarise(total_transaction = n()) %>% 
  ungroup() %>% 
  mutate(popup = glue("Weekday: {wday}
                      Total Transaction: {total_transaction}"))



rfm_order_habit_day %>% filter(fm_segment_name=="Special Customer") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Special Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_1



rfm_order_habit_day %>% filter(fm_segment_name=="Very High Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Very High Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_2


rfm_order_habit_day %>% filter(fm_segment_name=="High Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "High Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_3


rfm_order_habit_day %>% filter(fm_segment_name=="Medium Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Medium Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_4


rfm_order_habit_day %>% filter(fm_segment_name=="Low Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Low Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_5


rfm_order_habit_day %>% filter(fm_segment_name=="Very Low Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Very Low Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_6


 grid.arrange(plot_rfm_day_1, plot_rfm_day_2, plot_rfm_day_3,
               plot_rfm_day_4, plot_rfm_day_5, plot_rfm_day_6,
               ncol = 3)

Hourly Order Behaviour

serror <- function(x) sqrt(var(x)/length(x)) 
time_range = data_frame(hour_of_day = c(0:23))


rfm_order_habit_time <- df_customer_transaction %>% 
  left_join(cust_segment_result, by=c("customer_id","customer_id")) %>% 
  select(invoice_date,invoice_no,recency_segment,fm_segment_name) %>% 
  distinct() %>%
  mutate(hour_of_day = hour(invoice_date)) %>% 
  group_by(hour_of_day,fm_segment_name) %>% 
  summarise(total_transaction = n()) %>% 
  ungroup() %>% 
  mutate(popup = glue("Hour of Day : {hour_of_day}
                       Total Transaction: {total_transaction}")) 


  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Special Customer"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Special Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_1
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Very High Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Very High Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_2
  
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="High Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "High Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_3
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Medium Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Medium Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_4
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Low Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Low Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_5
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Very Low Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Very Low Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_6
  
  
  grid.arrange(plot_rfm_time_1,plot_rfm_time_2,plot_rfm_time_3,
               plot_rfm_time_4,plot_rfm_time_5,plot_rfm_time_6,
               ncol = 3)

Recommender System

Sistem rekomendasi dibuat menggunakan data histori transaksi setiap invoice_no.

df_product_recomm <- readRDS("data_clean/df_product_recomm.rds")

sapply(df_product_recomm[ ,c('invoice_no','stock_code')], function(x) length(unique(x)))

#> invoice_no stock_code 
#>      18975       3764

sapply(df_product_recomm[ ,c('customer_id','stock_code')], function(x) length(unique(x)))

#> customer_id  stock_code 
#>        4295        3764

Rating Matrix

Data yang kita miliki merupakan data histori transaksi, sehingga kita bisa menggunakan binaryRatingsMatrix dengan ketentuan Jika produk dibeli maka di set 1 dan jika tidak maka di set 0. Kita akan mencoba menggunakan matriks atas data Customer-Produk dan Invoice-Produk.

Rating Matrix Customer-Product

Kelemahan menggunakan rating Customer-Product pada data ini yaitu cukup banyak data dibuang karena terdapat missing values pada customer_id.

# drop NA Values of customer_id
df_custprod <- df_product_recomm %>% select(customer_id,stock_code)
df_custprod <- drop_na(df_custprod)

# Remove duplicate product per customer
df_custprod <- df_custprod %>% mutate(identifier = paste0(customer_id,"-",stock_code))
df_custprod <- df_custprod[!duplicated(df_custprod$identifier), ] %>% select(-identifier)

rating_matrix_custprod <- df_custprod %>%
  select(customer_id, stock_code) %>% 
  mutate(value = 1) %>%
  spread(stock_code, value, fill = 0) %>%
  select(-customer_id) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")

rating_matrix_custprod

#> 4294 x 3625 rating matrix of class 'binaryRatingMatrix' with 257567 ratings.

Rating Matrix Invoice-Product

# Remove duplicate product per product
df_invprod <- df_product_recomm %>% select(invoice_no,stock_code)
df_invprod <- df_product_recomm %>% mutate(identifier = paste0(invoice_no,"-",stock_code))
df_invprod <- df_invprod[!duplicated(df_invprod$identifier), ] %>% select(-identifier)

rating_matrix_invprod <- df_invprod %>%
  select(invoice_no, stock_code) %>% 
  mutate(value = 1) %>%
  spread(stock_code, value, fill = 0) %>%
  select(-invoice_no) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")

rating_matrix_invprod

#> 18975 x 3764 rating matrix of class 'binaryRatingMatrix' with 492975 ratings.

Dapat dilihat jumlah matriks yang terbentuk pada rating matrix Customer-Product dan rating matrix Invoice-Product cukup besar. Meskipun begitu, mari tetap kita coba dan bandingkan hasil akhirnya.

Model Validation with K-Fold

Train dan test set menggunakan Fold dengan K=5 dan proporsi setiap fold terdiri dari 80% train sets dan 20% test sets.

# rating_matrix_custprod
scheme_custprod <- rating_matrix_custprod %>% 
  evaluationScheme(method = "cross-validation",
                   k      = 5, 
                   train  = 0.8,  
                   given  = -1)


# rating_matrix_invprod
scheme_invprod <- rating_matrix_invprod %>% 
  evaluationScheme(method = "cross-validation",
                   k      = 5, 
                   train  = 0.8,  
                   given  = -1)

Modelling

Algoritma yang tersedia untuk binaryRatingMatrix menggunakan library Recommender Lab

recommender_models <- recommenderRegistry$get_entries(dataType = "binaryRatingMatrix")
summary(recommender_models)

#>                                 Length Class              Mode
#> ALS_implicit_binaryRatingMatrix 6      recommender_method list
#> AR_binaryRatingMatrix           6      recommender_method list
#> IBCF_binaryRatingMatrix         6      recommender_method list
#> POPULAR_binaryRatingMatrix      6      recommender_method list
#> RANDOM_binaryRatingMatrix       6      recommender_method list
#> RERECOMMEND_binaryRatingMatrix  6      recommender_method list
#> UBCF_binaryRatingMatrix         6      recommender_method list

Mari kita coba seluruh algoritma diatas dan bandingkan hasilnya:

algorithms_binary <- list(
  "association_rules" = list(name  = "AR", param = list(support = 0.05, confidence = 0.05)),
  "popular"     = list(name  = "POPULAR", param = NULL),
  "random"      = list(name  = "RANDOM",  param = NULL),
  "ibcf"        = list(name  = "IBCF", param = list(k = 5)), #JACCARD
  #"ibcf_20"     = list(name  = "IBCF", param = list(k = 20)),
  #"ibcf_50"     = list(name  = "IBCF", param = list(k = 50)),
  #"ibcf_cosine" = list(name  = "IBCF", param = list(method = "Cosine", k = 5)),
  "ibcf_pearson_5"= list(name  = "IBCF", param = list(method = "Pearson", k = 5)),
  "ibcf_pearson_50"= list(name  = "IBCF", param = list(method = "Pearson", k = 50)),
  "ibcf_pearson_100"= list(name  = "IBCF", param = list(method = "Pearson", k = 100)),
  #"ubcf_25"     = list(name  = "UBCF", param = list(method = "Cosine", nn = 25)),
  #"ubcf_50"     = list(name  = "UBCF", param = list(method = "Cosine", nn = 50)),
  "ubcf_cosine_100"    = list(name  = "UBCF", param = list(method = "Cosine", nn = 100)),
  #"ubcf_200"    = list(name  = "UBCF", param = list(method = "Cosine", nn = 200))
  "ubcf_pearson_100"    = list(name  = "UBCF", param = list(method = "Pearson", nn = 100))
  )

memory.limit(size=56000)
start <- Sys.time()
results_custprod <- recommenderlab::evaluate(scheme_custprod, 
                                    algorithms_binary, 
                                    type  = "topNList", 
                                    n     = c(1, 3, 5, 10, 15, 20))

results_invprod <- recommenderlab::evaluate(scheme_invprod, 
                                    algorithms_binary, 
                                    type  = "topNList", 
                                    n     = c(1, 3, 5, 10, 15, 20))

end <- Sys.time()
cat('runtime', end - start)

wd <-  as.character(getwd())
saveRDS(object=results_custprod, file=paste(paste(wd,"/modelling/recommendation_development/",sep = ""),
                                            "results_custprod.rds",sep=""))
saveRDS(object=results_invprod, file=paste(paste(wd,"/modelling/recommendation_development/",sep = ""),
                                           "results_invprod.rds",sep=""))

results_custprod
results_invprod

results_custprod <- readRDS("modelling/recommendation_development/results_custprod.rds")
results_invprod <- readRDS("modelling/recommendation_development/results_invprod.rds")
results_custprod

#> List of evaluation results for 9 recommenders:
#> Evaluation results for 5 folds/samples using method 'AR'.
#> Evaluation results for 5 folds/samples using method 'POPULAR'.
#> Evaluation results for 5 folds/samples using method 'RANDOM'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.

results_invprod

#> List of evaluation results for 8 recommenders:
#> Evaluation results for 5 folds/samples using method 'POPULAR'.
#> Evaluation results for 5 folds/samples using method 'RANDOM'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.

Model Evaluation

method_dis = function(n,x){
  if(x==1){
    sprintf(paste0(n))
  }else{
     sprintf(x)
  }
}

Eval: Rating Matrix Customer-Product

avg_conf_matr_personal <- function(results_custprod) {
  tmp <- results_custprod %>%
    getConfusionMatrix()  %>%  
    as.list() 
    as.data.frame( Reduce("+",tmp) / length(tmp)) %>% 
    mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
    select('n', 'precision', 'recall', 'TPR', 'FPR') 
}

results_tbl_personal <- results_custprod %>%
  map(avg_conf_matr_personal) %>% 
# Turning into an unnested tibble
  enframe() %>%
# Unnesting to have all variables on same level
  unnest()

#results_tbl_personal

results_tbl_personal %>%
  ggplot(aes(FPR, TPR, colour = fct_reorder2(as.factor(name), FPR, TPR))) +
  geom_line() +
  geom_label(aes(label = n))  +
  labs(title = "ROC curves using Matrix Customer-Product",
       colour = "Model") +
  theme_grey(base_size = 14)

results_tbl_personal %>%
  ggplot(aes(recall, precision, 
             colour = fct_reorder2(as.factor(name),  precision, recall))) +
  geom_line() +
  geom_label(aes(label = n))  +
  geom_text(aes(label = method_dis(name,n)),hjust=-1, size=3)  +
  labs(title = "Precision-Recall curves using Matrix Customer-Product",
       colour = "Model") +
  theme_grey(base_size = 14)

Eval: Rating Matrix Invoice-Product

avg_conf_matr_baskets <- function(results_invprod) {
  tmp <- results_invprod %>%
    getConfusionMatrix()  %>%  
    as.list() 
    as.data.frame( Reduce("+",tmp) / length(tmp)) %>% 
    mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
    select('n', 'precision', 'recall', 'TPR', 'FPR') 
}

results_tbl_baskets <- results_invprod %>%
  map(avg_conf_matr_baskets) %>% 
# Turning into an unnested tibble
  enframe() %>%
# Unnesting to have all variables on same level
  unnest()

#results_tbl_baskets

results_tbl_baskets %>%
  ggplot(aes(FPR, TPR, colour = fct_reorder2(as.factor(name), FPR, TPR))) +
  geom_line() +
  geom_label(aes(label = n))  +
  labs(title = "ROC curves using Matrix Invoice-Product",
       colour = "Model") +
  theme_grey(base_size = 14)

results_tbl_baskets %>%
  ggplot(aes(recall, precision, 
             colour = fct_reorder2(as.factor(name),  precision, recall))) +
  geom_line() +
  geom_label(aes(label = n))  +
  geom_text(aes(label = method_dis(name,n)),hjust=-1, size=3)  +
  labs(title = "Precision-Recall curves using Matrix Invoice-Product",
       colour = "Model") +
  theme_grey(base_size = 14)

Model Selection

Berdasarkan hasil ROC Curves dan Precision-Recall Curves diatas, Model rekomendasi menggunakan Matrix Invoice-Product lebih baik karena memiliki nilai Precision dan Recall yang lebih tinggi. Kemudian, algoritma IBCF dan UBCF menunjukan performa yang lebih baik dari algortima lainnya. Pada kurva ROC kita menilai dari algoritma yang memiliki nilai FPR dan TPR tertinggi. Jika pada kurva Precision-Recall kita bisa memlihat yang memiliki precision yang tinggi karena kita ingin produk yang direkomendasikan sesuai dengan yang diinginkan customer.

Berdasar kurva Precision-Recall, jika hanya merekomendasikan 1 produk maka yang terbaik adalah algoritma UBCF, namun tidak selisih jauh dengan algoritma IBCF. Namun, model UBCF menunjukan performa yang lebih rendah dari IBCF jika semakin banyak produk yang hendak direkomendasikan. Sehingga untuk case ini, kita akan memilih model IBCF untuk membangun sistem rekomendasi.

model_ibcf_recomm <- Recommender(getData(scheme_invprod, 'train'), method = "IBCF",  param = list(k = 5))

saveRDS(object=model_ibcf_recomm, file=paste(paste(wd,"/modelling/recommendation_development/",sep = ""),
                                           "model_ibcf_recomm.rds",sep=""))

Prediction

model_ibcf_recomm <- readRDS("modelling/recommendation_development/model_ibcf_recomm.rds")

# Data invoice-product
df_invprod_predict <- df_product_recomm %>% select(invoice_no,stock_code)
df_invprod_predict <- df_invprod_predict %>% mutate(identifier = paste0(invoice_no,"-",stock_code))
df_invprod_predict <- df_invprod_predict[!duplicated(df_invprod_predict$identifier), ] %>% select(-identifier)

Recommendation based History Order

Rekomendasi ini dihasilkan berdasarkan histori produk yang dibeli oleh customer. Misalkan kita simulasi untuk customer_id == 17850. Histori produk yang dibeli sebagai berikut:

df_master_product <- readRDS("data_clean/df_master_product.rds")

cust_id <- "17850"
cust_hist_product <- df_customer_transaction %>% filter(customer_id == cust_id) %>% .$stock_code %>% unique()
data.frame(stock_code = cust_hist_product) %>% left_join(df_master_product, by=c("stock_code","stock_code"))

#>    stock_code                         description
#> 1      85123A  WHITE HANGING HEART T-LIGHT HOLDER
#> 2       71053                 WHITE METAL LANTERN
#> 3      84406B      CREAM CUPID HEARTS COAT HANGER
#> 4      84029G KNITTED UNION FLAG HOT WATER BOTTLE
#> 5      84029E      RED WOOLLY HOTTIE WHITE HEART.
#> 6       22752        SET 7 BABUSHKA NESTING BOXES
#> 7       21730   GLASS STAR FROSTED T-LIGHT HOLDER
#> 8       22633              HAND WARMER UNION JACK
#> 9       22632           HAND WARMER RED POLKA DOT
#> 10      20679               EDWARDIAN PARASOL RED
#> 11      37370          RETRO COFFEE MUGS ASSORTED
#> 12      21871                 SAVE THE PLANET MUG
#> 13      21071      VINTAGE BILLBOARD DRINK ME MUG
#> 14      21068     VINTAGE BILLBOARD LOVE/HATE MUG
#> 15      82483  WOOD 2 DRAWER CABINET WHITE FINISH
#> 16      82486   WOOD S/3 CABINET ANT WHITE FINISH
#> 17      82482   WOODEN PICTURE FRAME WHITE FINISH
#> 18     82494L          WOODEN FRAME ANTIQUE WHITE
#> 19    15056BL             EDWARDIAN PARASOL BLACK
#> 20      22803             IVORY EMBROIDERED QUILT
#> 21      22411   JUMBO SHOPPER VINTAGE RED PAISLEY

Kemudian siapkan matriks nya:

rating_matrix_histprod <- df_invprod_predict %>%
  select(stock_code) %>% 
  unique() %>%
  mutate(value = as.numeric(stock_code %in% cust_hist_product)) %>%
  spread(stock_code, value) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")

rating_matrix_histprod

#> 1 x 3764 rating matrix of class 'binaryRatingMatrix' with 21 ratings.

Berikut Top produk yang bisa direkomendasikan kepada customer_id == 17850 :

# Predict
predict_result_histprod <- predict(model_ibcf_recomm, newdata = rating_matrix_histprod, n= 20, type="topNList")

result_cust_recomm = NULL

if((length(predict_result_histprod@items$`1`))>0){
  pred_prod_cust <- data.frame(
      customer_id = cust_id,
      recommendation_based="History Order",
      stock_code = getList(predict_result_histprod)$`1`,
      recom_rate = predict_result_histprod@ratings$`1`
    )
 pred_prod_cust <- pred_prod_cust %>% 
   left_join(df_master_product, by=c("stock_code","stock_code")) %>% 
   select(customer_id,recommendation_based,stock_code,description,recom_rate)
 
 result_cust_recomm <-  rbind(result_cust_recomm,pred_prod_cust)
}

result_cust_recomm

#>    customer_id recommendation_based stock_code
#> 1        17850        History Order      21931
#> 2        17850        History Order      22866
#> 3        17850        History Order     85099B
#> 4        17850        History Order     15056N
#> 5        17850        History Order      21929
#> 6        17850        History Order      21928
#> 7        17850        History Order      22865
#> 8        17850        History Order      22386
#> 9        17850        History Order     85099C
#> 10       17850        History Order      20712
#> 11       17850        History Order      23439
#> 12       17850        History Order      22867
#> 13       17850        History Order      21877
#> 14       17850        History Order      20711
#> 15       17850        History Order      21733
#> 16       17850        History Order     15056P
#> 17       17850        History Order      21069
#> 18       17850        History Order      21479
#> 19       17850        History Order      21485
#> 20       17850        History Order      22114
#>                          description recom_rate
#> 1             JUMBO STORAGE BAG SUKI  0.2878999
#> 2      HAND WARMER SCOTTY DOG DESIGN  0.2747569
#> 3            JUMBO BAG RED RETROSPOT  0.2684497
#> 4          EDWARDIAN PARASOL NATURAL  0.2583729
#> 5     JUMBO BAG PINK VINTAGE PAISLEY  0.2560500
#> 6     JUMBO BAG SCANDINAVIAN PAISLEY  0.2504039
#> 7             HAND WARMER OWL DESIGN  0.2481908
#> 8            JUMBO BAG PINK POLKADOT  0.2481848
#> 9     JUMBO  BAG BAROQUE BLACK WHITE  0.2464097
#> 10        JUMBO BAG WOODLAND ANIMALS  0.2456964
#> 11        HAND WARMER RED LOVE HEART  0.2340989
#> 12           HAND WARMER BIRD DESIGN  0.2112100
#> 13               HOME SWEET HOME MUG  0.2097130
#> 14                    JUMBO BAG TOYS  0.2000000
#> 15  RED HANGING HEART T-LIGHT HOLDER  0.1939300
#> 16            EDWARDIAN PARASOL PINK  0.1819672
#> 17         VINTAGE BILLBOARD TEA MUG  0.1818182
#> 18      WHITE SKULL HOT WATER BOTTLE  0.1808874
#> 19  RETROSPOT HEART HOT WATER BOTTLE  0.1652174
#> 20 HOT WATER BOTTLE TEA AND SYMPATHY  0.1526391

Recommendation based on Spesific Product

Rekomendasi ini bisa terapkan berdasarkan data keranjang belanja, produk yang sedang dilihat oleh customer atau produk yang dibeli customer dalam periode waktu tertentu. Misalkan kita simulasikan jika didalam keranjang belanja customer terdapat produk dengan stock_code 84029E, 21730

cart_stock_code <- c("84029E", "21730")
df_master_product %>% filter(stock_code %in% cart_stock_code) %>% 
   select(stock_code,description)

#> # A tibble: 2 x 2
#>   stock_code description                      
#>   <chr>      <chr>                            
#> 1 21730      GLASS STAR FROSTED T-LIGHT HOLDER
#> 2 84029E     RED WOOLLY HOTTIE WHITE HEART.

Kemudian siapkan matrix-nya:

rating_matrix_cart <- df_invprod_predict %>%
  select(stock_code) %>% 
  unique() %>%
  mutate(value = as.numeric(stock_code %in% cart_stock_code)) %>%
  spread(stock_code, value) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")

rating_matrix_cart

#> 1 x 3764 rating matrix of class 'binaryRatingMatrix' with 2 ratings.

Berikut Top produk yang bisa direkomendasikan kepada customer:

# Predict
predict_result_cart <- predict(model_ibcf_recomm, newdata = rating_matrix_cart, n= 20, type="topNList")

result_cust_recomm = NULL

if((length(predict_result_cart@items$`1`))>0){
  pred_prod_cust <- data.frame(
      customer_id = cust_id,
      recommendation_based="Customer Cart",
      stock_code = getList(predict_result_cart)$`1`,
      recom_rate = predict_result_cart@ratings$`1`
    )
 pred_prod_cust <- pred_prod_cust %>% 
   left_join(df_master_product, by=c("stock_code","stock_code")) %>% 
   select(customer_id,recommendation_based,stock_code,description,recom_rate)
 
 result_cust_recomm <-  rbind(result_cust_recomm,pred_prod_cust)
}

result_cust_recomm

#>   customer_id recommendation_based stock_code
#> 1       17850        Customer Cart     84029G
#> 2       17850        Customer Cart      21479
#> 3       17850        Customer Cart      21485
#> 4       17850        Customer Cart      22114
#> 5       17850        Customer Cart      21484
#> 6       17850        Customer Cart      21481
#> 7       17850        Customer Cart      71053
#> 8       17850        Customer Cart      22803
#>                           description recom_rate
#> 1 KNITTED UNION FLAG HOT WATER BOTTLE 0.20248668
#> 2        WHITE SKULL HOT WATER BOTTLE 0.18088737
#> 3    RETROSPOT HEART HOT WATER BOTTLE 0.16521739
#> 4   HOT WATER BOTTLE TEA AND SYMPATHY 0.15263909
#> 5         CHICK GREY HOT WATER BOTTLE 0.14466546
#> 6          FAWN BLUE HOT WATER BOTTLE 0.14285714
#> 7                 WHITE METAL LANTERN 0.07079646
#> 8             IVORY EMBROIDERED QUILT 0.06837607

Customer Market Optimization

Customer Market Optimization

Project Background

Data Description

Data Preparation

Read Data

Variable Description

Data Preprocessing

Data Structure

Data Cleansing

Data Range Adjustment

Cancelled Transaction

Invalid Invoice

Invalid Quantity

Invalid Product

Duplicated StockCode-Description

Invalid Country

Duplicated Data

Missing Values

Feature Extraction

Data Summary

df_general_transaction

df_customer_transaction

df_product_recomm

Exploratory Data Analysis

Transaction

Frequency Order

Customer

Order Habbit

RFM Segmentation

Generate RFM Value

Recency Segmentation

Frequency/Monetary Segmentation

Correlaction Checking

Outlier Checking

Scalling

Optimal K For Clustering

Elbow Method

Silhouette Method

Clustering

Clustering using K=3

K-Means

DBScan

K-Means vs DBScan using K=3

Clustering using K=5

K-Means

DBScan

K-Means vs DBScan using K=5

FM Segmentation Profiling

RFM Segmentation Summary

Segment Analysis

Daily Order Behaviour

Hourly Order Behaviour

Recommender System

Rating Matrix

Rating Matrix Customer-Product

Rating Matrix Invoice-Product

Model Validation with K-Fold

Modelling

Model Evaluation

Eval: Rating Matrix Customer-Product

Eval: Rating Matrix Invoice-Product

Model Selection

Prediction

Recommendation based History Order

Recommendation based on Spesific Product