Project Background

sumber: Statistik E-Commerce 2019, Badan Pusat Statistik Indonesia

Berdasarkan artikel yang dipublish oleh www.xendit.co , hasil riset Accenture mengatakan bahwa 40% konsumen meninggalkan suatu website bisnis dan melakukan pembelian di website atau toko online lain karena merasa kebingungan akibat terlalu banyaknya pilihan saat akan membuat keputusan pembelian.

Data Description

data yang digunakan adalah data transaksi yang terjadi antara 01/12/2010 sampai 09/12/2011 pada salah satu perusahaan online retail yang berbasis di Inggris. Banyak pelanggan dari perusahaan adalah grosir.

Sumber Data Online Retail Data Set

Data Preparation

Read Data

Total Data awal 541909 data dab 8 variabel.

#>   total.data total.variabel
#> 1     541909              8

10 Data teratas

#> # A tibble: 10 x 8
#>    InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
#>    <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
#>  1 536365    85123A    WHITE HANG~        6 2010-12-01 08:26:00      2.55
#>  2 536365    71053     WHITE META~        6 2010-12-01 08:26:00      3.39
#>  3 536365    84406B    CREAM CUPI~        8 2010-12-01 08:26:00      2.75
#>  4 536365    84029G    KNITTED UN~        6 2010-12-01 08:26:00      3.39
#>  5 536365    84029E    RED WOOLLY~        6 2010-12-01 08:26:00      3.39
#>  6 536365    22752     SET 7 BABU~        2 2010-12-01 08:26:00      7.65
#>  7 536365    21730     GLASS STAR~        6 2010-12-01 08:26:00      4.25
#>  8 536366    22633     HAND WARME~        6 2010-12-01 08:28:00      1.85
#>  9 536366    22632     HAND WARME~        6 2010-12-01 08:28:00      1.85
#> 10 536367    84879     ASSORTED C~       32 2010-12-01 08:34:00      1.69
#> # ... with 2 more variables: CustomerID <dbl>, Country <chr>

Saya merasa lebih mudah jika seluruh nama kolom menggunakan lowercase dan setiap kata dipisahkan karakter underscore.

#> [1] "invoice_no"   "stock_code"   "description"  "quantity"     "invoice_date"
#> [6] "unit_price"   "customer_id"  "country"

Variable Description

Berdasar informasi dari Online Retail Data Set berikut deskripsi datanya:

Variable Description
invoice_no Nomor invoice terdiri dari 6 digit angkat yang bersifat unik untuk setiap transaksi. Jika diawali huruf C maka mengindikasikan transaksi berstatus Cancel
stock_code Kode Produk yang bersifat unik setiap produk, terdiri dari 5 digit angka yang bersifat unik untuk setiap produk.
description Nama Produk
quantity Jumlah produk yang dibeli
invoice_date Tanggal dan waktu transaksi
unit_price harga produk per unit
customer_id ID Customer yang terdiri dari 5 digit angka yang bersifat unik untuk setiap customer.
country Negara customer

Data Preprocessing

Data Structure

Berikut Struktur Datanya:

#> Observations: 541,909
#> Variables: 8
#> $ invoice_no   <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ stock_code   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description  <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity     <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ unit_price   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ customer_id  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country      <chr> "United Kingdom", "United Kingdom", "United Kingdom", ...
#>   invoice_unik stock_code_unik description_unik country_unik customer_unik
#> 1        25900            4070             4212           38          4373

Dari data diatas country hanya berjumlah 38, sehingga mari kita ubah tipe datanya menjadi factor

Data Cleansing

Data Range Adjustment

Data ini berisikan transaksi antara 01/12/2010 sampai 09/12/2011. Data transaksi Desember 2011 tidak full 1 bulan, sehinnga saya memilih untuk melakukan analisa dari 01/12/2010 sampai 30/11/2011.

Cancelled Transaction

Sesuai deskripsi variabel, jika data invoice_no diawali huruf C maka mengindikasikan transaksi berstatus Cancel. Data transaksi yang dibatalkan tidak sesuai untuk kebutuhan analisis ini, maka perlu kita remove.

#> # A tibble: 1 x 1
#>   total_cancelled_transaction
#>                         <int>
#> 1                        8928

Terdapat 9288 transaksi yang dicancel. Data ini tidak kita gunakan maka data dapat kita exclude.

Invalid Invoice

Sesuai deskripsi, Invoice No yang valid memiliki 6 digit angka.

#> # A tibble: 3 x 8
#>   invoice_no stock_code description quantity invoice_date        unit_price
#>   <chr>      <chr>      <chr>          <dbl> <dttm>                   <dbl>
#> 1 A563185    B          Adjust bad~        1 2011-08-12 14:50:00     11062.
#> 2 A563186    B          Adjust bad~        1 2011-08-12 14:51:00    -11062.
#> 3 A563187    B          Adjust bad~        1 2011-08-12 14:52:00    -11062.
#> # ... with 2 more variables: customer_id <dbl>, country <fct>

Terdapat 3 data transaksi yang tidak valid, maka data ini dapat di exclude.

Invalid Quantity

Apakah ada data yang memiliki quantity<=0 ? Berapa banyak?

#> [1] 1306

Terdapat 1306 data transaksi yang memiliki quantity<=0, sehingga data ini perlu dibuang karena tidak valid jika suatu transaksi tidak memiliki quantity.

Invalid Product

Sesuai deskripsi variabel, stock code selalu diawali dengan 5 digit angka. Untuk hal ini, kita akan substring 5 digit pertama dari stock code kemudian kita convert ke tipe data numerik. Jika stock code tersebut menjadi NA Value, maka stock tersebut tidak valid.

#> # A tibble: 2,888 x 8
#>    invoice_no stock_code description quantity invoice_date        unit_price
#>    <chr>      <chr>      <chr>          <dbl> <dttm>                   <dbl>
#>  1 536370     POST       POSTAGE            3 2010-12-01 08:45:00        18 
#>  2 536403     POST       POSTAGE            1 2010-12-01 11:27:00        15 
#>  3 536414     22139      <NA>              56 2010-12-01 11:52:00         0 
#>  4 536527     POST       POSTAGE            1 2010-12-01 13:04:00        18 
#>  5 536540     C2         CARRIAGE           1 2010-12-01 14:05:00        50 
#>  6 536544     DOT        DOTCOM POS~        1 2010-12-01 14:32:00       570.
#>  7 536545     21134      <NA>               1 2010-12-01 14:32:00         0 
#>  8 536546     22145      <NA>               1 2010-12-01 14:33:00         0 
#>  9 536547     37509      <NA>               1 2010-12-01 14:33:00         0 
#> 10 536549     85226A     <NA>               1 2010-12-01 14:34:00         0 
#> # ... with 2,878 more rows, and 2 more variables: customer_id <dbl>,
#> #   country <fct>

Terdapat 2888 data transaksi yang memiliki stock code tidak valid. Jika dilihat dari deskripsi data tersebut merupakan ongkos kirim, bank charges, NA Values dan lainnya. Seluruh data transaksi diatas dapat kita exclude karena tidak sesuai untuk kebutuhan analisis ini.

Mengacu pada artikel yang dipublish oleh Diego Usai pada Mar 14, 2019 terkait Market Basket Analysis menggunakan data online retail ini, ia berhasil menemukan 50 deskripsi yang terkesan diinputkan secara manual yaitu:

#>  [1] "check"                               "check?"                             
#>  [3] "?"                                   "??"                                 
#>  [5] "damaged"                             "found"                              
#>  [7] "adjustment"                          "Amazon"                             
#>  [9] "AMAZON"                              "amazon adjust"                      
#> [11] "Amazon Adjustment"                   "amazon sales"                       
#> [13] "Found"                               "FOUND"                              
#> [15] "found box"                           "Found by jackie "                   
#> [17] "Found in w/hse"                      "dotcom"                             
#> [19] "dotcom adjust"                       "allocate stock for dotcom orders ta"
#> [21] "FBA"                                 "Dotcomgiftshop Gift Voucher £100.00"
#> [23] "on cargo order"                      "wrongly sold (22719) barcode"       
#> [25] "wrongly marked 23343"                "dotcomstock"                        
#> [27] "rcvd be air temp fix for dotcom sit" "Manual"                             
#> [29] "John Lewis"                          "had been put aside"                 
#> [31] "for online retail orders"            "taig adjust"                        
#> [33] "amazon"                              "incorrectly credited C550456 see 47"
#> [35] "returned"                            "wrongly coded 20713"                
#> [37] "came coded as 20713"                 "add stock to allocate online orders"
#> [39] "Adjust bad debt"                     "alan hodge cant mamage this section"
#> [41] "website fixed"                       "did  a credit  and did not tick ret"
#> [43] "michel oops"                         "incorrectly credited C550456 see 47"
#> [45] "mailout"                             "test"                               
#> [47] "Sale error"                          "Lighthouse Trading zero invc incorr"
#> [49] "SAMPLES"                             "Marked as 23343"                    
#> [51] "wrongly coded 23343"                 "Adjustment"                         
#> [53] "rcvd be air temp fix for dotcom sit" "Had been put aside."

Saya setuju dengan hasil temuannya, oleh karena itu data transaksi yang memiliki deskripsi diatas perlu di remove.

Duplicated StockCode-Description

Proses ini penting karena data stock code dan description harus bersifat unik supaya matriks produk untuk sistem rekomendasi bisa valid.

#> # A tibble: 1 x 4
#>   stock_code_unik stock_code_unik_lowerca~ description_unik description_unik_lo~
#>             <int>                    <int>            <int>                <int>
#> 1            3898                     3790             3993                 3993

Jumlah stock code unik dan produk unik tidak sama, mari kita cek dari sisi stock code dahulu.

Data diatas merupakan data stock code yang duplikat apabila kita ubah menjadi lowecase. Mari kita cek apakah benar duplikat.

#> # A tibble: 227 x 4
#>    stock_code description            stock_code_lowerca~ description_lowercase  
#>    <chr>      <chr>                  <chr>               <chr>                  
#>  1 15056BL    EDWARDIAN PARASOL BLA~ 15056bl             edwardian parasol black
#>  2 15056bl    EDWARDIAN PARASOL BLA~ 15056bl             edwardian parasol black
#>  3 15056N     EDWARDIAN PARASOL NAT~ 15056n              edwardian parasol natu~
#>  4 15056n     EDWARDIAN PARASOL NAT~ 15056n              edwardian parasol natu~
#>  5 15056P     EDWARDIAN PARASOL PINK 15056p              edwardian parasol pink 
#>  6 15056p     EDWARDIAN PARASOL PINK 15056p              edwardian parasol pink 
#>  7 15060B     FAIRY CAKE DESIGN UMB~ 15060b              fairy cake design umbr~
#>  8 15060b     FAIRY CAKE DESIGN UMB~ 15060b              fairy cake design umbr~
#>  9 18098C     PORCELAIN BUTTERFLY O~ 18098c              porcelain butterfly oi~
#> 10 18098c     PORCELAIN BUTTERFLY O~ 18098c              porcelain butterfly oi~
#> # ... with 217 more rows

Ternyata terdapat data stock code yang duplicated karena efek case sensitive. Oleh karena itu, seluruh stock_code akan kita convert menjadi UPPERCASE untuk mengilahkan efek case sensitive.

#> # A tibble: 1 x 4
#>   jumlah_stock_code_u~ stock_code_unik_low~ description_unik description_unik_l~
#>                  <int>                <int>            <int>               <int>
#> 1                 3790                 3790             3993                3993

Oke, data stock_code sudah clean. Namun, seharusnya jumlah stock_code dan description berjumlah sama karena bersifat unik. Hal ini mengindikasikan duplikat data. Mari kita cek.

#> # A tibble: 212 x 2
#>    stock_code  freq
#>    <chr>      <int>
#>  1 16156L         2
#>  2 17107D         3
#>  3 20622          2
#>  4 20725          2
#>  5 20914          2
#>  6 21109          2
#>  7 21112          2
#>  8 21175          2
#>  9 21192          2
#> 10 21232          2
#> # ... with 202 more rows

Data diatas adalah data stock code yang memiliki description > 1. Mari kita sampling datanya.

#> # A tibble: 443 x 2
#>    stock_code description                        
#>    <chr>      <chr>                              
#>  1 16156L     WRAP CAROUSEL                      
#>  2 16156L     WRAP, CAROUSEL                     
#>  3 17107D     FLOWER FAIRY 5 DRAWER LINERS       
#>  4 17107D     FLOWER FAIRY 5 SUMMER DRAW LINERS  
#>  5 17107D     FLOWER FAIRY,5 SUMMER B'DRAW LINERS
#>  6 20622      VIP PASSPORT COVER                 
#>  7 20622      VIPPASSPORT COVER                  
#>  8 20725      LUNCH BAG RED RETROSPOT            
#>  9 20725      LUNCH BAG RED SPOTTY               
#> 10 20914      SET/5 RED RETROSPOT LID GLASS BOWLS
#> # ... with 433 more rows

Dari hasil pengecekan diatas dapat kita simpulkan bahwa terdapat kesalahan pada deskripsi yang berupa tanda baca, spasi hingga kesalahan penulisan deskripsi produk. Untuk itu kita akan generate deskripsi produk menggunakan data pertamanya.

#> # A tibble: 1 x 2
#>   jumlah_stock_code_unik jumlah_description_unik
#>                    <int>                   <int>
#> 1                   3790                    3764

Oke, selisih nya sudah mulai berkurang. Jumlah diatas mengindikasikan terdapat stock_code yang memiliki description sama.

#> # A tibble: 24 x 2
#>    description                      freq
#>    <chr>                           <int>
#>  1 BATHROOM METAL SIGN                 2
#>  2 COLOURING PENCILS BROWN TUBE        2
#>  3 COLUMBIAN CANDLE RECTANGLE          2
#>  4 COLUMBIAN CANDLE ROUND              3
#>  5 EAU DE NILE JEWELLED PHOTOFRAME     2
#>  6 FRENCH FLORAL CUSHION COVER         2
#>  7 FRENCH LATTICE CUSHION COVER        2
#>  8 FRENCH PAISLEY CUSHION COVER        2
#>  9 FROSTED WHITE BASE                  2
#> 10 HEART T-LIGHT HOLDER                2
#> # ... with 14 more rows

Terdapat 24 stock code yang memiliki deskripsi sama.

#> # A tibble: 50 x 2
#>    stock_code description                    
#>    <chr>      <chr>                          
#>  1 82580      BATHROOM METAL SIGN            
#>  2 21171      BATHROOM METAL SIGN            
#>  3 10133      COLOURING PENCILS BROWN TUBE   
#>  4 10135      COLOURING PENCILS BROWN TUBE   
#>  5 72133      COLUMBIAN CANDLE RECTANGLE     
#>  6 72131      COLUMBIAN CANDLE RECTANGLE     
#>  7 72127      COLUMBIAN CANDLE ROUND         
#>  8 72130      COLUMBIAN CANDLE ROUND         
#>  9 72128      COLUMBIAN CANDLE ROUND         
#> 10 85023B     EAU DE NILE JEWELLED PHOTOFRAME
#> # ... with 40 more rows

Untuk case ini kita bisa berasumsi bahwa produk tersebut sama. Sehinnga, kita akan generate setiap description menggunakan stock code pertama.

#> # A tibble: 1 x 3
#>   jumlah_stock_code_unik jumlah_description_unik jumlah_code_description_unik
#>                    <int>                   <int>                        <int>
#> 1                   3764                    3764                         3764

Oke, setiap stock sudah bersifat unik. mari kita lanjut cek data country.

Invalid Country

#> # A tibble: 1 x 2
#>   customer_id_unik customer_country_unik
#>              <int>                 <int>
#> 1             4311                  4295

Berdasar data diatas terdapat 1 customer yang memiliki 2 negara. Mungkin bisa karena customer tersebut pindah, oleh karena itu kita bisa ambil negara customer berdasarkan negara terakhir ia melakukan transaksi.

#> # A tibble: 1 x 2
#>   customer_id_unik customer_country_unik
#>              <int>                 <int>
#> 1             4295                  4295

Oke, setiap data customer dan negara sudah sesuai. mari kita lanjut cek data duplikat.

Duplicated Data

#>   jumlah_data jumlah_data_unik
#> 1      503105           498077

Dataset ini memiliki data yang duplikat, untuk itu perlu kita remove.

#> [1] 498077

Missing Values

#>   invoice_no invoice_date  customer_id   stock_code  description     quantity 
#>            0            0       123903            0            0            0 
#>   unit_price      country 
#>            0            0

Dapat dilihat terdapat 123938 Missing Value pada customer_id. Projek ini ditujukan untuk melakukan segmentasi pelanggan dan membuat personalisasi rekomendasi produk. Segmentasi pelanggan jelas harus mengetahui siapa pelangannya, sehingga data transaksi yang tidak memiliki data customer_id perlu di-exclude. Kemudian untuk membangun sistem rekomendasi, kita bisa mengabaikan customer nya dan fokus pada produk yang dibeli, sehingga dapat menggunakan data invoice_no dan stock_code.

Oleh karena itu dapat kita putuskan untuk membagi data ini menjadi 3 dataset, yaitu:
1. df_general_transaction : dataset data original.
1. df_customer_transaction : dataset ini digungakan untuk segmentasi pelanggan, sehingga harus mengexclude missing values.
2. df_product_recomm : dataset untuk membangun sistem rekomendasi produk yang hanya terdiri dari customer_id, invoice_no, stock_code dan description.

Feature Extraction

Kita bisa mengekstrak data total amount per transaksi berdasarkan quantity * unit_price mempermudah proses analisis selanjutnya.

#> # A tibble: 5 x 9
#>   invoice_no invoice_date        customer_id country stock_code description
#>   <chr>      <dttm>                    <dbl> <fct>   <chr>      <chr>      
#> 1 536365     2010-12-01 08:26:00       17850 United~ 85123A     WHITE HANG~
#> 2 536365     2010-12-01 08:26:00       17850 United~ 71053      WHITE META~
#> 3 536365     2010-12-01 08:26:00       17850 United~ 84406B     CREAM CUPI~
#> 4 536365     2010-12-01 08:26:00       17850 United~ 84029G     KNITTED UN~
#> 5 536365     2010-12-01 08:26:00       17850 United~ 84029E     RED WOOLLY~
#> # ... with 3 more variables: quantity <dbl>, unit_price <dbl>,
#> #   total_amount <dbl>

Data Summary

df_general_transaction

#> Observations: 498,077
#> Variables: 9
#> $ invoice_no   <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ customer_id  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country      <fct> United Kingdom, United Kingdom, United Kingdom, United...
#> $ stock_code   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description  <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity     <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ unit_price   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ total_amount <dbl> 15.30, 20.34, 22.00, 20.34, 20.34, 15.30, 25.50, 11.10...

df_customer_transaction

#> Observations: 374,174
#> Variables: 9
#> $ invoice_no   <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ customer_id  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country      <fct> United Kingdom, United Kingdom, United Kingdom, United...
#> $ stock_code   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description  <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity     <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ unit_price   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ total_amount <dbl> 15.30, 20.34, 22.00, 20.34, 20.34, 15.30, 25.50, 11.10...

Data ini akan digunakan untuk melakukan segmentasi berdasarkan Recency, Frequency dan Monetary. Untuk itu perlu kita cek apakah terdapat data yang memiliki unit_price <= 0 ? Berapa banyak?

#> [1] 33

Terdapat 33 data transaksi yang memiliki unit_price <= 0. Bisa jadi karena efek promo, diskon atau kesalahan input. Namun karena kita tidak mengetahui data aslinya, data ini dapat kita remove karena jumlahnya juga tidak banyak.

df_product_recomm

#> Observations: 498,077
#> Variables: 4
#> $ customer_id <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
#> $ invoice_no  <chr> "536365", "536365", "536365", "536365", "536365", "5363...
#> $ stock_code  <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752...
#> $ description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANT...

kita ekstrak seluruh produk nya untuk menjadi master produk.

#> Observations: 3,764
#> Variables: 2
#> $ stock_code  <chr> "10002", "10080", "10120", "10123C", "10124A", "10124G"...
#> $ description <chr> "INFLATABLE POLITICAL GLOBE", "GROOVY CACTUS INFLATABLE...

Exploratory Data Analysis

Transaction

Berapa Total Order Amount yang dilakukan?

#> # A tibble: 1 x 1
#>   total_order_amount
#>                <dbl>
#> 1           8224194.

Bagaimana Total Order Amount tiap bulan?

Frequency Order

Berapa Total Transaksi yang terjadi?

#> [1] 17656

Bagaimana total transaksi yang terjadi tiap bulan?

Customer

Berapa total customer yang ada?

#> [1] 4293

Bagaimana jumlah customer yang melakukan transaksi tiap bulan?

Pertumbuhan customer baru tiap bulan?

Order Habbit

Hari apa customer sering berbelanja?

Jam berapa customer sering berbelanja?




RFM Segmentation

Metode analisis Recency, Frequency, Monetary Value (RFM) adalah salah satu metode analisis perilaku pelanggan berdasarkan histori transaksinya. Output dari analisis RFM dapat digunakan untuk melakukan segmentasi pelanggan. Metode analisis RFM dapat menjelaskan:

  1. Seberapa baru pelanggan melakukan transaksi?
  2. Seberapa sering pelanggan melaukan transaksi?
  3. Seberapa besar transaksi yang sudah dilakukan pelanggan?

Sesuai artikel yang dipublish oleh www.marketeers.com, GO-JEK adalah salah satu perusahaan yang menggunakan metode analisis RFM dalam menentukan segmentasi pelanggan. GO-JEK membagi segmentasi pelanggan ke dalam empat kelas, yaitu Gold, Silver, Bronze, dan Non-Profit. Segmentasi Gold memiliki kualifikasi konsumen dengan high monetary, high frequency, dan high recency. Silver memiliki kualifikasi konsumen dengan tingkat monetary yang tinggi, frequency yang rendah, dan recency yang tinggi. Kategori Bronze terdiri dari konsumen dengan tingkat monetary rendah, frequency, dan recency yang tinggi. Sementara segmentasi Non-Profit memiliki kualifikasi konsumen dengan kualifikasi monetary, frequency, dan recency yang rendah.

Dalam case ini, dataset berisikan data transaksi dari 1 Desember 2010 sampai 30 November 2011, sehingga berikut perlakukan RFM Value untuk case ini:

Recency : Selisih antara hari terakhir pelanggan melakukan transaksi dan hari melakukan analisis. Dalam case ini, hari analisis menggunakan data hari terakhir transaksi + 1. Frequency : Jumlah transaksi yang dilakukan oleh pelanggan dari 1 Desember 2010 sampai 30 November 2011. Monetary : Jumlah total order amount yang sudah dikeluarkan pelanggan dari 1 Desember 2010 sampai 30 November 2011.

Generate RFM Value

Berikut sample data RFM Value:

#> # A tibble: 10 x 13
#>    customer_id country first_order last_order min_amount_order med_amount_order
#>          <dbl> <fct>   <date>      <date>                <dbl>            <dbl>
#>  1       12346 United~ 2011-01-18  2011-01-18         77184.            77184. 
#>  2       12347 Iceland 2010-12-07  2011-10-31             5.04             17  
#>  3       12348 Finland 2010-12-16  2011-09-25            13.2              41.8
#>  4       12349 Italy   2011-11-21  2011-11-21             6.64             17.7
#>  5       12350 Norway  2011-02-02  2011-02-02             8.5              18.8
#>  6       12352 Norway  2011-02-16  2011-11-03             9.9              17.4
#>  7       12353 Bahrain 2011-05-19  2011-05-19            11.6              18.8
#>  8       12354 Spain   2011-04-21  2011-04-21             8.5              17.0
#>  9       12355 Bahrain 2011-05-09  2011-05-09            17.7              25.5
#> 10       12356 Portug~ 2011-01-18  2011-11-17             3.75             34.8
#> # ... with 7 more variables: max_amount_order <dbl>, min_baskets <int>,
#> #   max_baskets <int>, avg_baskets <dbl>, recency <int>, frequency <int>,
#> #   monetary <dbl>

Recency Segmentation

Data ini merupakan histori transksi selama 1 tahun, untuk itu bisa kita buat aturan sebagai berikut:

  1. Active : Recency <= 30
  2. Warm : 30 < Recency <= 90
  3. Cold : 90 < Recency <= 180
  4. Inactive : Recency > 180
recency_segmentation <- df_customer_rfm %>% group_by(recency) %>% summarise(freq=n()) %>% ungroup()


plot_recency_segmentation <- ggplot(recency_segmentation,aes(recency,freq))+
  #geom_line()+
 
  geom_vline(xintercept = recencny_adjust$active,linetype="dotted",color = "black", size=1,alpha=0.5)+
  geom_rect(aes(xmin = 0, xmax = recencny_adjust$active,
                ymin = 0, ymax = Inf),fill = "green", alpha = 0.002)+
  
  geom_vline(xintercept = recencny_adjust$warm,linetype="dotted",color = "black", size=1,alpha=0.5)+
  geom_rect(aes(xmin = recencny_adjust$active, xmax = recencny_adjust$warm, 
                ymin = 0, ymax = Inf),fill = "orange", alpha = 0.002)+
  
  geom_vline(xintercept = recencny_adjust$cold,linetype="dotted",color = "black", size=1, alpha=0.5)+
  geom_rect(aes(xmin = recencny_adjust$warm, xmax = recencny_adjust$cold, 
                ymin = 0, ymax = Inf),fill = "blue", alpha = 0.002)+
  
  #geom_vline(xintercept = recencny_adjust$cold,linetype="dotted",color = "black", size=1)+
   geom_rect(aes(xmin = recencny_adjust$cold, xmax = Inf, 
                ymin = 0, ymax = Inf),fill = "red", alpha = 0.8)+
  
  geom_area(fill="black",alpha=0.2)+
  geom_point(aes(x=recencny_adjust$active, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$active, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$active," days"))+
  
  geom_point(aes(x=recencny_adjust$warm, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$warm, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$warm," days"))+
  
  geom_point(aes(x=recencny_adjust$cold, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$cold, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$cold," days"))+
  
  geom_point(aes(x=recencny_adjust$cold, y=10), colour="black", size=3)+
  annotate("text", x = recencny_adjust$cold, y = 10+5, 
        color = "black", size=3, label=paste0(recencny_adjust$cold," days"))+

  annotate("text", x = (0+recencny_adjust$active)/2, y = max(recency_segmentation$freq)-5, 
           color = "black", size=4, label="Active")+
  annotate("text", x = median(recencny_adjust$active:recencny_adjust$warm), y = max(recency_segmentation$freq)-5, 
           color = "black", size=4, label="Warm")+
  annotate("text", x = median(recencny_adjust$warm:recencny_adjust$cold), y = max(recency_segmentation$freq)-5, 
           color = "black", size=4, label="Cold")+
  annotate("text", x = median(recencny_adjust$cold:max(recency_segmentation$recency)), y = max(recency_segmentation$freq)-5, 
          color = "black", size=4, label="Inactive")+
  labs(
    title="Distribution of Recency Segementation",
    x = "Recency (days since last transaction)",
    y = "Total Customer",
    subtitle = paste0("Data from ", min(df_customer_rfm$first_order)," untill ",  max(df_customer_rfm$last_order))
  )+
  theme_minimal()

ggplotly(plot_recency_segmentation)

Chart diatas menunjukan hanya 38.7% dari total customer kita yang masuk segmen Active dengan Revenue yang dihasilkan sebesar 70.7%, maka tentunya perusahaan harus mempertahankan customer pada segmen ini untuk dapat tetap aktif.

Namun, bagaimana jika terdapat High value customer yang tidak masuk ke segmen active customer? Segmen diatas dapat dibagi lagi berdasarkan nilai transaksi customer. Mengelompokan berdasarkan nilai customer sangat penting karena perusahaan dapat menawarkan promosi yang sesuai dengan kebiasaan pengeluaran customer. Mengetahui setiap nilai customer juga membantu bisnis menciptakan pemasaran bertarget untuk pelanggan bernilai tinggi. Misalnya, pada segmen warm, cold dan inactive terdapat customer yang memberikan revenue besar bagi perusahaan, maka customer ini dapat menjadi target utama pemasaran. Untuk itu kita perlu breakdown segmen ini berdasarkan Frekuensi Order dan total pengeluaran customer.

Frequency/Monetary Segmentation

Correlaction Checking

Jika dilihat frequency dan monetary memiliki korelasi yang cukup kuat, dalam artian semakin tinggi korelasi maka semakin besar nilai monetary.

Outlier Checking

Mari kita cek dahulu sebaran data dan outliernya.

Dari visualisasi diatas, nilai frequency dan monetary diatas memiliki outlier yang cukup banyak. Mari kita cek lebih detail lewat visualisasi yang lebih jelas.

Dapat dilihat lebih jelas terdapat beberapa customer yang sangat jauh dari pusat data dan bisa kita sebut outlier. Jika dizoom, bisa dilihat lebih jelas data cukup berpusat dari pada frequency <= 38 dan monetary <= 15000. Mari kita cek:

79 (98.2%) customer berada pada range frequency <= 30 dan monetary <= 15000. Metode clustering cukup sensitif dengan jarak, sehingga kita perlu mengexclude data yang terlalu jauh kemudian melakukan scalling. Untuk itu kita akan melakukan clustering terhadap customer pada range ini dan data lainnya kita analisa terpisah.

Optimal K For Clustering

Proses Clustering akan menggunakan nilai dari min-max normalization.

Elbow Method

Berdasarkan plot penentuan Optimal-K diatas, Elbow Method menunjukan penuruan mulai stabil pada K=3. Namun, jika dilihat pada K=5, penurunan cenderung statis hingga K=15. Untuk itu mari kita coba menggunakan nilai K=5.

Silhouette Method

Silhouette Method menunjukan bahwa optimal K=2, namun jika dilihat nilai K=3 dan K=5 juga cukup optimum. Kita ingin mengclusteringkan customer lebih spesifik, tentunya setiap cluster harus memiliki identitas. Jika menggunakan 2 cluster maka cluster cenderung kurang memiliki identitas, maka mari kita coba melakukan clustering menggunakan K=3 dan K=5.

Clustering

Clustering using K=3

K-Means

Hasil K-Means dengan 3 clusters terbagi menjadi 2624:396:1024 dengan nilai between_SS / total_SS = 73.4 %

#> [1] 3141  865  208
#> [1] 75.1


DBScan

Hasil DBScan dengan 3 clusters terbagi menjadi 1477:720:364 dengan noise sebanyak 1654. Jumlah noise terlalu banyak.

#> DBSCAN clustering for 4214 objects.
#> Parameters: eps = 0.02, minPts = 150
#> The clustering contains 3 cluster(s) and 1654 noise points.
#> 
#>    0    1    2    3 
#> 1654 1476  720  364 
#> 
#> Available fields: cluster, eps, minPts
K-Means vs DBScan using K=3
# K-Means
cust_fm_mm_k3 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=kmeans_mm_k3$cluster) %>%
   mutate(
    fm_segment=as.factor(fm_segment))

perfom <- round((kmeans_mm_k3$betweenss/kmeans_mm_k3$totss)*100,1)

cluster_rfm_mm_profile_k3 <- cust_fm_mm_k3 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()



cluster_rfm_mm_profile_k3 %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "K-Means Clustering for 3 Segment",
    subtitle = paste("betweenss/totts:",perfom,"%"),
    x = "Cluster of Frequency/Monetary"
  ) -> plot_kmeans_k3



# DBScan 
dbscan_mm_k3 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=db_clust_k3$cluster) %>%
  mutate(fm_segment=as.factor(fm_segment))


dbscan_mm_k3_profile <- dbscan_mm_k3 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()


dbscan_mm_k3_profile %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "DBScan Clustering for 3 segment, with 569  noise",
    x = "Cluster of Frequency/Monetary"
  ) -> plot_dbscan_k3


grid.arrange(plot_kmeans_k3, plot_dbscan_k3, nrow = 1)

Clustering using K=5

K-Means

Hasil K-Means dengan 5 clusters terbagi menjadi 2304, 151, 1041, 176, 372 dengan nilai between_SS / total_SS = 84.1 %

#> [1] 2306 1048  538  219  103
#> [1] 83.6
DBScan

Hasil DBScan dengan 5 cluster terbagi menjadi 321:1520:773:460:174 dengan noise sebanyak 967. Jumlah noise terlalu banyak.

#> DBSCAN clustering for 4214 objects.
#> Parameters: eps = 0.025, minPts = 50
#> The clustering contains 5 cluster(s) and 967 noise points.
#> 
#>    0    1    2    3    4    5 
#>  967  322 1519  773  460  173 
#> 
#> Available fields: cluster, eps, minPts
K-Means vs DBScan using K=5
# K-Means
cust_fm_mm_k5 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=kmeans_mm_k5$cluster) %>%
   mutate(
    fm_segment=as.factor(fm_segment))

perfom <- round((kmeans_mm_k5$betweenss/kmeans_mm_k5$totss)*100,1)


cluster_rfm_mm_profile_k5 <- cust_fm_mm_k5 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()



cluster_rfm_mm_profile_k5 %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "K-Means Clustering for 5 Segment",
    subtitle = paste("betweenss/totts:",perfom,"%"),
    x = "Cluster of Frequency/Monetary"
  ) -> plot_kmeans_k5



# DBScan 
dbscan_mm_k5 <- cbind(df_customer_rfm_adjust[,c("customer_id","recency","recency_segment","frequency","monetary")],
                     fm_segment=db_clust_k5$cluster) %>%
  mutate(fm_segment=as.factor(fm_segment))


dbscan_mm_k5_profile <- dbscan_mm_k5 %>% group_by(fm_segment) %>% 
  summarise(
    min_recency = min(recency),
    max_recency = max(recency),
    avg_recency = round(mean(recency),2),
    min_frequency = min(frequency),
    max_frequency = max(frequency),
    avg_frequency = round(mean(frequency),2),
    med_frequency = round(median(frequency),2),
    min_monetary = min(monetary),
    max_monetary = max(monetary),
    avg_monetary = round(mean(monetary),2),
    med_monetary = round(median(monetary),2),
    total_monetary = sum(monetary),
    total_customer = n()
  ) %>% ungroup()


dbscan_mm_k5_profile %>% 
  gather("features","values",max_frequency,min_frequency,avg_frequency,med_frequency,
         min_monetary,max_monetary,avg_monetary,med_monetary) %>% 
  ggplot(aes(x = fm_segment, y = features)) +
  #scale_x_continuous(breaks = seq(min(cluster), max(cluster), by = 1)) +
  geom_tile(aes(fill = values), show.legend = FALSE) +
  geom_text(aes(label=round(values,2)), size=3, color="white")+
  coord_equal() +
  scale_fill_gradientn(colours = hm.palette(27)) +
  labs(
    title = "DBScan Clustering for 5 segment, exclude 345 noise",
    x = "Cluster of Frequency/Monetary"
  ) -> plot_dbscan_k5


grid.arrange(plot_kmeans_k5, plot_dbscan_k5, nrow = 1)

FM Segmentation Profiling


Berdasarkan visualisasi diatas maka, setiap cluster bisa artikan sebagai berikut:

  1. Cluster 1: Medium Frequency, Medium Monetary. code: VLF-VLM. Kita sebut Very Low Value Customer.
  2. Cluster 2: Very Low Frequency, Very Low Monetary. LF-LM. Kita sebut Low Value Customer
  3. Cluster 3: Very High Frequency, Very High Monetary. code: MF-MM. Kita sebut Medium Value Customer
  4. Cluster 4: Low Frequency, Low Monetary. code: HF-HM. Kita sebut High Value Customer
  5. Cluster 5: High Frequency, High Monetary. code: VHF-VHM. Kita sebut Very High Value Customer

Berikut ini sebaran data hasil clusteringnya:

#> # A tibble: 10 x 17
#>    customer_id country first_order last_order min_amount_order med_amount_order
#>          <dbl> <fct>   <date>      <date>                <dbl>            <dbl>
#>  1       12347 Iceland 2010-12-07  2011-10-31             5.04             17  
#>  2       12348 Finland 2010-12-16  2011-09-25            13.2              41.8
#>  3       12349 Italy   2011-11-21  2011-11-21             6.64             17.7
#>  4       12350 Norway  2011-02-02  2011-02-02             8.5              18.8
#>  5       12352 Norway  2011-02-16  2011-11-03             9.9              17.4
#>  6       12353 Bahrain 2011-05-19  2011-05-19            11.6              18.8
#>  7       12354 Spain   2011-04-21  2011-04-21             8.5              17.0
#>  8       12355 Bahrain 2011-05-09  2011-05-09            17.7              25.5
#>  9       12356 Portug~ 2011-01-18  2011-11-17             3.75             34.8
#> 10       12357 Switze~ 2011-11-06  2011-11-06            10.1              35.4
#> # ... with 11 more variables: max_amount_order <dbl>, min_baskets <int>,
#> #   max_baskets <int>, avg_baskets <dbl>, recency <int>, frequency <int>,
#> #   monetary <dbl>, recency_segment <fct>, fm_segment_code <fct>,
#> #   fm_segment_desc <fct>, fm_segment_name <fct>

RFM Segmentation Summary

Sebelumnya terdapat 79 data customer yang kita exclude dari proses clustering karena outlier. Customer ini jelas dapat kita kategorikan sebagai Special Customer karena memiliki frequency dan Monetary yang terlalu jauh dari customer lainnya.

Segment Analysis

Daily Order Behaviour

 rfm_order_habit_day <-  df_customer_transaction %>% 
  left_join(cust_segment_result, by=c("customer_id","customer_id")) %>% 
  select(invoice_date,invoice_no,recency_segment,fm_segment_name) %>%
  distinct() %>% 
  mutate(wday = wday(invoice_date,week_start = getOption("lubridate.week.start", 1))) %>% 
  group_by(wday,fm_segment_name) %>% 
  summarise(total_transaction = n()) %>% 
  ungroup() %>% 
  mutate(popup = glue("Weekday: {wday}
                      Total Transaction: {total_transaction}"))



rfm_order_habit_day %>% filter(fm_segment_name=="Special Customer") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Special Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_1



rfm_order_habit_day %>% filter(fm_segment_name=="Very High Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Very High Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_2


rfm_order_habit_day %>% filter(fm_segment_name=="High Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "High Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_3


rfm_order_habit_day %>% filter(fm_segment_name=="Medium Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Medium Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_4


rfm_order_habit_day %>% filter(fm_segment_name=="Low Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Low Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_5


rfm_order_habit_day %>% filter(fm_segment_name=="Very Low Value") %>% 
  ggplot(aes(wday,total_transaction,fill=total_transaction, ))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  labs(
    title = "Very Low Value Customer",
    subtitle = "Daily Order Frequency",
    x = "Day", 
    y = NULL)+
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_day_6


 grid.arrange(plot_rfm_day_1, plot_rfm_day_2, plot_rfm_day_3,
               plot_rfm_day_4, plot_rfm_day_5, plot_rfm_day_6,
               ncol = 3)

Hourly Order Behaviour

serror <- function(x) sqrt(var(x)/length(x)) 
time_range = data_frame(hour_of_day = c(0:23))


rfm_order_habit_time <- df_customer_transaction %>% 
  left_join(cust_segment_result, by=c("customer_id","customer_id")) %>% 
  select(invoice_date,invoice_no,recency_segment,fm_segment_name) %>% 
  distinct() %>%
  mutate(hour_of_day = hour(invoice_date)) %>% 
  group_by(hour_of_day,fm_segment_name) %>% 
  summarise(total_transaction = n()) %>% 
  ungroup() %>% 
  mutate(popup = glue("Hour of Day : {hour_of_day}
                       Total Transaction: {total_transaction}")) 


  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Special Customer"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Special Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_1
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Very High Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Very High Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_2
  
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="High Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "High Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_3
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Medium Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Medium Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_4
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Low Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Low Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_5
  
  
  
  data_frame(hour_of_day = c(0:23)) %>% 
    left_join(
      rfm_order_habit_time %>% filter(fm_segment_name=="Very Low Value"),
      by=c("hour_of_day","hour_of_day")) %>%
    mutate(hour_of_day = as.factor(hour_of_day)) %>% 
  ggplot(aes(x=hour_of_day,y=total_transaction, fill=total_transaction))+
  geom_bar(width=1, stat="identity", show.legend = FALSE)+
  geom_errorbar(aes(ymin = total_transaction - serror(total_transaction), 
                    ymax = total_transaction + serror(total_transaction), 
                    color = hour_of_day), 
                    width = .2) + 
  labs(
    title = "Very Low Value Customer",
    subtitle = "Hourly Order Frequency",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5,size=11, face="bold"),
    plot.subtitle = element_text(hjust = 0.5,size=10),
    axis.text.y = element_blank(),
    axis.text.x=element_text(size=9, face="bold"))+
  coord_polar() -> plot_rfm_time_6
  
  
  grid.arrange(plot_rfm_time_1,plot_rfm_time_2,plot_rfm_time_3,
               plot_rfm_time_4,plot_rfm_time_5,plot_rfm_time_6,
               ncol = 3)




Recommender System

Sistem rekomendasi dibuat menggunakan data histori transaksi setiap invoice_no.

#> invoice_no stock_code 
#>      18975       3764
#> customer_id  stock_code 
#>        4295        3764

Rating Matrix

Data yang kita miliki merupakan data histori transaksi, sehingga kita bisa menggunakan binaryRatingsMatrix dengan ketentuan Jika produk dibeli maka di set 1 dan jika tidak maka di set 0. Kita akan mencoba menggunakan matriks atas data Customer-Produk dan Invoice-Produk.

Modelling

Algoritma yang tersedia untuk binaryRatingMatrix menggunakan library Recommender Lab

#>                                 Length Class              Mode
#> ALS_implicit_binaryRatingMatrix 6      recommender_method list
#> AR_binaryRatingMatrix           6      recommender_method list
#> IBCF_binaryRatingMatrix         6      recommender_method list
#> POPULAR_binaryRatingMatrix      6      recommender_method list
#> RANDOM_binaryRatingMatrix       6      recommender_method list
#> RERECOMMEND_binaryRatingMatrix  6      recommender_method list
#> UBCF_binaryRatingMatrix         6      recommender_method list

Mari kita coba seluruh algoritma diatas dan bandingkan hasilnya:

algorithms_binary <- list(
  "association_rules" = list(name  = "AR", param = list(support = 0.05, confidence = 0.05)),
  "popular"     = list(name  = "POPULAR", param = NULL),
  "random"      = list(name  = "RANDOM",  param = NULL),
  "ibcf"        = list(name  = "IBCF", param = list(k = 5)), #JACCARD
  #"ibcf_20"     = list(name  = "IBCF", param = list(k = 20)),
  #"ibcf_50"     = list(name  = "IBCF", param = list(k = 50)),
  #"ibcf_cosine" = list(name  = "IBCF", param = list(method = "Cosine", k = 5)),
  "ibcf_pearson_5"= list(name  = "IBCF", param = list(method = "Pearson", k = 5)),
  "ibcf_pearson_50"= list(name  = "IBCF", param = list(method = "Pearson", k = 50)),
  "ibcf_pearson_100"= list(name  = "IBCF", param = list(method = "Pearson", k = 100)),
  #"ubcf_25"     = list(name  = "UBCF", param = list(method = "Cosine", nn = 25)),
  #"ubcf_50"     = list(name  = "UBCF", param = list(method = "Cosine", nn = 50)),
  "ubcf_cosine_100"    = list(name  = "UBCF", param = list(method = "Cosine", nn = 100)),
  #"ubcf_200"    = list(name  = "UBCF", param = list(method = "Cosine", nn = 200))
  "ubcf_pearson_100"    = list(name  = "UBCF", param = list(method = "Pearson", nn = 100))
  )

memory.limit(size=56000)
start <- Sys.time()
results_custprod <- recommenderlab::evaluate(scheme_custprod, 
                                    algorithms_binary, 
                                    type  = "topNList", 
                                    n     = c(1, 3, 5, 10, 15, 20))

results_invprod <- recommenderlab::evaluate(scheme_invprod, 
                                    algorithms_binary, 
                                    type  = "topNList", 
                                    n     = c(1, 3, 5, 10, 15, 20))

end <- Sys.time()
cat('runtime', end - start)

wd <-  as.character(getwd())
saveRDS(object=results_custprod, file=paste(paste(wd,"/modelling/recommendation_development/",sep = ""),
                                            "results_custprod.rds",sep=""))
saveRDS(object=results_invprod, file=paste(paste(wd,"/modelling/recommendation_development/",sep = ""),
                                           "results_invprod.rds",sep=""))

results_custprod
results_invprod
#> List of evaluation results for 9 recommenders:
#> Evaluation results for 5 folds/samples using method 'AR'.
#> Evaluation results for 5 folds/samples using method 'POPULAR'.
#> Evaluation results for 5 folds/samples using method 'RANDOM'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> List of evaluation results for 8 recommenders:
#> Evaluation results for 5 folds/samples using method 'POPULAR'.
#> Evaluation results for 5 folds/samples using method 'RANDOM'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'IBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.
#> Evaluation results for 5 folds/samples using method 'UBCF'.

Model Evaluation

Model Selection

Berdasarkan hasil ROC Curves dan Precision-Recall Curves diatas, Model rekomendasi menggunakan Matrix Invoice-Product lebih baik karena memiliki nilai Precision dan Recall yang lebih tinggi. Kemudian, algoritma IBCF dan UBCF menunjukan performa yang lebih baik dari algortima lainnya. Pada kurva ROC kita menilai dari algoritma yang memiliki nilai FPR dan TPR tertinggi. Jika pada kurva Precision-Recall kita bisa memlihat yang memiliki precision yang tinggi karena kita ingin produk yang direkomendasikan sesuai dengan yang diinginkan customer.

Berdasar kurva Precision-Recall, jika hanya merekomendasikan 1 produk maka yang terbaik adalah algoritma UBCF, namun tidak selisih jauh dengan algoritma IBCF. Namun, model UBCF menunjukan performa yang lebih rendah dari IBCF jika semakin banyak produk yang hendak direkomendasikan. Sehingga untuk case ini, kita akan memilih model IBCF untuk membangun sistem rekomendasi.

Prediction

Recommendation based History Order

Rekomendasi ini dihasilkan berdasarkan histori produk yang dibeli oleh customer. Misalkan kita simulasi untuk customer_id == 17850. Histori produk yang dibeli sebagai berikut:

#>    stock_code                         description
#> 1      85123A  WHITE HANGING HEART T-LIGHT HOLDER
#> 2       71053                 WHITE METAL LANTERN
#> 3      84406B      CREAM CUPID HEARTS COAT HANGER
#> 4      84029G KNITTED UNION FLAG HOT WATER BOTTLE
#> 5      84029E      RED WOOLLY HOTTIE WHITE HEART.
#> 6       22752        SET 7 BABUSHKA NESTING BOXES
#> 7       21730   GLASS STAR FROSTED T-LIGHT HOLDER
#> 8       22633              HAND WARMER UNION JACK
#> 9       22632           HAND WARMER RED POLKA DOT
#> 10      20679               EDWARDIAN PARASOL RED
#> 11      37370          RETRO COFFEE MUGS ASSORTED
#> 12      21871                 SAVE THE PLANET MUG
#> 13      21071      VINTAGE BILLBOARD DRINK ME MUG
#> 14      21068     VINTAGE BILLBOARD LOVE/HATE MUG
#> 15      82483  WOOD 2 DRAWER CABINET WHITE FINISH
#> 16      82486   WOOD S/3 CABINET ANT WHITE FINISH
#> 17      82482   WOODEN PICTURE FRAME WHITE FINISH
#> 18     82494L          WOODEN FRAME ANTIQUE WHITE
#> 19    15056BL             EDWARDIAN PARASOL BLACK
#> 20      22803             IVORY EMBROIDERED QUILT
#> 21      22411   JUMBO SHOPPER VINTAGE RED PAISLEY

Kemudian siapkan matriks nya:

#> 1 x 3764 rating matrix of class 'binaryRatingMatrix' with 21 ratings.

Berikut Top produk yang bisa direkomendasikan kepada customer_id == 17850 :

#>    customer_id recommendation_based stock_code
#> 1        17850        History Order      21931
#> 2        17850        History Order      22866
#> 3        17850        History Order     85099B
#> 4        17850        History Order     15056N
#> 5        17850        History Order      21929
#> 6        17850        History Order      21928
#> 7        17850        History Order      22865
#> 8        17850        History Order      22386
#> 9        17850        History Order     85099C
#> 10       17850        History Order      20712
#> 11       17850        History Order      23439
#> 12       17850        History Order      22867
#> 13       17850        History Order      21877
#> 14       17850        History Order      20711
#> 15       17850        History Order      21733
#> 16       17850        History Order     15056P
#> 17       17850        History Order      21069
#> 18       17850        History Order      21479
#> 19       17850        History Order      21485
#> 20       17850        History Order      22114
#>                          description recom_rate
#> 1             JUMBO STORAGE BAG SUKI  0.2878999
#> 2      HAND WARMER SCOTTY DOG DESIGN  0.2747569
#> 3            JUMBO BAG RED RETROSPOT  0.2684497
#> 4          EDWARDIAN PARASOL NATURAL  0.2583729
#> 5     JUMBO BAG PINK VINTAGE PAISLEY  0.2560500
#> 6     JUMBO BAG SCANDINAVIAN PAISLEY  0.2504039
#> 7             HAND WARMER OWL DESIGN  0.2481908
#> 8            JUMBO BAG PINK POLKADOT  0.2481848
#> 9     JUMBO  BAG BAROQUE BLACK WHITE  0.2464097
#> 10        JUMBO BAG WOODLAND ANIMALS  0.2456964
#> 11        HAND WARMER RED LOVE HEART  0.2340989
#> 12           HAND WARMER BIRD DESIGN  0.2112100
#> 13               HOME SWEET HOME MUG  0.2097130
#> 14                    JUMBO BAG TOYS  0.2000000
#> 15  RED HANGING HEART T-LIGHT HOLDER  0.1939300
#> 16            EDWARDIAN PARASOL PINK  0.1819672
#> 17         VINTAGE BILLBOARD TEA MUG  0.1818182
#> 18      WHITE SKULL HOT WATER BOTTLE  0.1808874
#> 19  RETROSPOT HEART HOT WATER BOTTLE  0.1652174
#> 20 HOT WATER BOTTLE TEA AND SYMPATHY  0.1526391

Recommendation based on Spesific Product

Rekomendasi ini bisa terapkan berdasarkan data keranjang belanja, produk yang sedang dilihat oleh customer atau produk yang dibeli customer dalam periode waktu tertentu. Misalkan kita simulasikan jika didalam keranjang belanja customer terdapat produk dengan stock_code 84029E, 21730

#> # A tibble: 2 x 2
#>   stock_code description                      
#>   <chr>      <chr>                            
#> 1 21730      GLASS STAR FROSTED T-LIGHT HOLDER
#> 2 84029E     RED WOOLLY HOTTIE WHITE HEART.

Kemudian siapkan matrix-nya:

#> 1 x 3764 rating matrix of class 'binaryRatingMatrix' with 2 ratings.

Berikut Top produk yang bisa direkomendasikan kepada customer:

#>   customer_id recommendation_based stock_code
#> 1       17850        Customer Cart     84029G
#> 2       17850        Customer Cart      21479
#> 3       17850        Customer Cart      21485
#> 4       17850        Customer Cart      22114
#> 5       17850        Customer Cart      21484
#> 6       17850        Customer Cart      21481
#> 7       17850        Customer Cart      71053
#> 8       17850        Customer Cart      22803
#>                           description recom_rate
#> 1 KNITTED UNION FLAG HOT WATER BOTTLE 0.20248668
#> 2        WHITE SKULL HOT WATER BOTTLE 0.18088737
#> 3    RETROSPOT HEART HOT WATER BOTTLE 0.16521739
#> 4   HOT WATER BOTTLE TEA AND SYMPATHY 0.15263909
#> 5         CHICK GREY HOT WATER BOTTLE 0.14466546
#> 6          FAWN BLUE HOT WATER BOTTLE 0.14285714
#> 7                 WHITE METAL LANTERN 0.07079646
#> 8             IVORY EMBROIDERED QUILT 0.06837607