Email : gabrielerichsonmrp@gmail.com
Linkedin : www.linkedin.com/in/gabrielerichson
Github : www.github.com/gabrielerichsonmrp

"Demand Forecasting

# Wrangling
library(tidyverse)
library(lubridate)
library(scales)
library(readxl)
library(zoo)


# Visualization
library(plotly)
library(paletti)
library(treemap)
library(glue)
library(gridExtra)

1 Project Description

In Part 1, we prepare the data and run an exploratory data analysis to surface problems and gather general insights. The project is split into 3 parts and 1 dashboard:

Part 1 : Data Preparation and Exploratory Analysis
Part 2 : Customer Analysis and Segmentation
Part 3 : Product Personalization
Shiny Dashboard : CUSTMARKETS

I strongly recommend reading the parts in order, because each one builds on the previous.

2 Data Background

This project uses a dataset of transactions made between 01/12/2010 and 30/11/2011 at a UK-based online retail company (the raw file extends to 09/12/2011; Section 3.2.9 trims the incomplete final month). The dataset was uploaded by Dr. Daqing Chen. The original dataset can be downloaded here: Online Retail II Dataset

2.1 Read Data

The dataset consists of 541,909 observations and 8 variables. This is the raw row count; the data still needs cleansing.

df_input <- read_excel("data_input/online_retail.xlsx", trim_ws = TRUE)
data.frame("total.data" = dim(df_input)[1],
           "total.variabel" = dim(df_input)[2])

First 10 rows

head(df_input,10)

I find the data easier to work with when column names are lowercase with words separated by underscores. Here is how to rename them, and the result:

colnames(df_input) <- tolower(colnames(df_input))
df_input <-   df_input %>% 
    rename("invoice_no" = invoiceno,
           "stock_code" = stockcode,
           "invoice_date" = invoicedate,
           "unit_price" = unitprice,
           "customer_id" = customerid)
colnames(df_input)

#> [1] "invoice_no"   "stock_code"   "description"  "quantity"     "invoice_date"
#> [6] "unit_price"   "customer_id"  "country"
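With more columns, renaming each one by hand would get tedious. Below is a minimal base R sketch of the same idea. It assumes the raw headers are CamelCase (InvoiceNo, StockCode, and so on), and `to_snake_case` is a hypothetical helper, not part of the original workflow.

```r
# Assumed raw column names (CamelCase), for illustration only.
raw_names <- c("InvoiceNo", "StockCode", "Description", "Quantity",
               "InvoiceDate", "UnitPrice", "CustomerID", "Country")

# Hypothetical helper: put "_" between a lowercase letter or digit and the
# uppercase letter that follows it, then lowercase the whole name.
to_snake_case <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

snake_names <- to_snake_case(raw_names)
print(snake_names)
```

On a real data frame this could be applied as `colnames(df_input) <- to_snake_case(colnames(df_input))`, provided the headers follow the assumed pattern.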

2.2 Attribute Information

Based on the information from the Online Retail Data Set, each attribute is described below:

Variable	Description
invoice_no	Transaction number: a 6-digit number unique to each transaction. A number that starts with the letter C marks a cancelled transaction.
stock_code	Product code: a 5-digit number unique to each product.
description	Product name.
quantity	Quantity of each product bought per transaction.
invoice_date	Transaction date and time.
unit_price	Unit price of each product in each transaction.
customer_id	Customer ID: a 5-digit number unique to each customer.
country	Customer's country at the time of the transaction.

3 Data Preparation

3.1 Data Structure

The data structure is as follows:

glimpse(df_input)

#> Observations: 541,909
#> Variables: 8
#> $ invoice_no   <chr> "536365", "536365", "536365", "536365", "536365", "536...
#> $ stock_code   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "2275...
#> $ description  <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LAN...
#> $ quantity     <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3,...
#> $ invoice_date <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 ...
#> $ unit_price   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, ...
#> $ customer_id  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850...
#> $ country      <chr> "United Kingdom", "United Kingdom", "United Kingdom", ...

data.frame(
  invoice_unique = df_input$invoice_no %>% unique() %>% length(),
  stock_code_unique = df_input$stock_code %>% unique() %>% length(),
  description_unique = df_input$description %>% unique() %>% length(),
  country_unique = df_input$country %>% unique() %>% length(),
  customer_unique = df_input$customer_id %>% unique() %>% length()
)

There are only 38 distinct customer countries, so the column can be converted to a factor.

df_input$country <- as.factor(df_input$country)

3.2 Data Cleansing

3.2.1 Cancelled Transaction

According to the variable information, an InvoiceNo that starts with the letter C marks a cancelled transaction. Let's check:

df_input %>% filter(grepl("^C", invoice_no)) %>% summarise(total_cancelled_transaction = n())

There are 9288 rows that belong to cancelled transactions. We do not need cancelled transactions for this case, so we remove them.

df_input <- df_input %>% filter(!grepl("^C", invoice_no))
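The cancellation rule is about a prefix, so it helps to see how a pattern anchored with `^` differs from one that matches the letter C anywhere in the string. A small sketch on made-up invoice numbers (only "536365" appears in the data shown above; the others are invented for illustration):

```r
# Made-up invoice numbers; "54C123" is an invented edge case.
invoice_no <- c("536365", "C536379", "A563185", "54C123")

matches_anywhere <- grepl("C", invoice_no)    # "C" at any position
matches_prefix   <- grepl("^C", invoice_no)   # "C" as the first character only

print(data.frame(invoice_no, matches_anywhere, matches_prefix))
```

`startsWith(invoice_no, "C")` expresses the same prefix test in base R without a regular expression.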

3.2.2 Invalid Invoice

According to the variable information, a valid InvoiceNo consists of 6 digits. Let's check:

df_input %>% filter(nchar(invoice_no)>6)

There are 3 invalid invoices. Let's remove them.

df_input <- df_input %>% 
  filter(nchar(invoice_no)<=6) %>% 
  mutate(invoice_no_check  = as.integer(invoice_no)) %>% 
  filter(!is.na(invoice_no_check))
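The "6 digits" rule can also be written as a single regular expression. This is a sketch on made-up values, and it is slightly stricter than the filter above: it also rejects numeric strings shorter than 6 characters, which the length check plus `as.integer()` would let through.

```r
# Made-up invoice numbers for illustration.
invoice_no <- c("536365", "A563185", "5363", "C536379")

# Valid only if the whole string is exactly six digits.
is_valid_invoice <- grepl("^[0-9]{6}$", invoice_no)

print(data.frame(invoice_no, is_valid_invoice))
```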

3.2.3 Invalid Quantity

Are there transactions with quantity <= 0?

df_input %>% filter(quantity<=0) %>% nrow()

#> [1] 1336

We have 1336 transactions with quantity <= 0. We can treat them as invalid, because a transaction cannot have no quantity. Let's remove them.

df_input <- df_input %>% filter(quantity>0)

3.2.4 Invalid Unit Price

Are there transactions with unit_price <= 0?

df_input %>% filter(unit_price<=0) %>% nrow()

#> [1] 1179

The data has 1179 transactions with unit_price <= 0. The dataset documentation does not say whether these represent discounts, promotions, free items, or something else, so we exclude them.

df_input <- df_input %>% filter(unit_price>0)

3.2.5 Invalid Products by Stock Code

According to the variable information, a valid stock code consists of 5 digits, yet the first 10 rows already show stock codes that contain letters. Let's check.

df_input <- df_input %>% mutate(stock_code = toupper(stock_code))
df_input <- df_input %>% mutate(stock_code_nchar = nchar(stock_code))

# nchar stock code
df_input %>%
  group_by(stock_code_nchar) %>% 
  summarise(row_count = n())

The table above lists the character lengths found in stock_code. It shows that stock codes are not limited to 5 digits. Let's clean them up.

First, check the stock codes with 1 to 4 characters.

df_input %>% filter(stock_code_nchar %in% c(1,2,3,4)) %>% 
  select(stock_code,description) %>% 
   distinct() %>% arrange(stock_code)

The products above look invalid, so we remove them.

df_input <- df_input %>% filter(!stock_code_nchar %in% c(1,2,3,4))

Second, check the stock codes with 5 characters.

df_input %>% filter(stock_code_nchar %in% c(5)) %>% 
  select(stock_code,description) %>% 
   distinct() %>% arrange(stock_code)

The rows above are valid products.

Third, check the stock codes with 6 characters.

df_input %>% filter(stock_code_nchar %in% c(6)) %>% 
  select(stock_code,description) %>% 
   distinct() %>% arrange(stock_code)

The rows above are products whose stock_code has 6 characters: 5 digits plus 1 letter. The letter appears to encode a variant of the product. For example, stock codes 15056N and 15056P are the same product in different colors. For this case we treat such variants as distinct products, so all rows above are valid products.

Fourth, check the stock codes with more than 6 characters.

df_input %>% filter(nchar(stock_code)>6) %>% 
  select(stock_code,description) %>% 
   distinct() %>% arrange(stock_code)

The rows above are stock codes with more than 6 characters. Judging by their names, several are not real products: AMAZONFEE, BANK CHARGES, GIFT_0001_10, GIFT_0001_20, GIFT_0001_30, GIFT_0001_40 and GIFT_0001_50. Let's remove them.

df_input <- df_input %>% 
  filter(!stock_code %in% c("AMAZONFEE", "BANK CHARGES", "GIFT_0001_10", "GIFT_0001_20",
                            "GIFT_0001_30", "GIFT_0001_40", "GIFT_0001_50"))

df_input <- df_input %>% select(-invoice_no_check,-stock_code_nchar)

At this point the stock_code data is clean. Let's continue.
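The length-by-length inspection above can be complemented by grouping stock codes by shape. The sketch below is illustrative only: the two patterns (5 digits, or 5 digits plus one letter) are assumptions drawn from the checks above, "M" is a made-up short code, and anything else is only flagged for manual review, not declared invalid.

```r
# Stock codes mentioned in the text, plus "M" as a made-up short code.
stock_code <- c("71053", "85123A", "15056N", "15056P", "M",
                "AMAZONFEE", "BANK CHARGES", "GIFT_0001_10")

# Assumed shapes: exactly 5 digits, or 5 digits followed by one uppercase letter.
code_shape <- ifelse(grepl("^[0-9]{5}$", stock_code), "5 digits",
              ifelse(grepl("^[0-9]{5}[A-Z]$", stock_code), "5 digits + letter",
                     "needs review"))

print(data.frame(stock_code, code_shape))
print(table(code_shape))
```

Flagging instead of deleting keeps the decision with the analyst, which matches how the section removes codes only after looking at their descriptions.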

3.2.6 Invalid Product by Description

According to an article on Market Basket Analysis that Diego Usai published on 14 March 2019 using this online retail data, he found 50 descriptions that look manually entered. See the code for the descriptions treated as invalid. Let's check:

descr <- c( "check", "check?", "?", "??", "damaged", "found", 
            "adjustment", "Amazon", "AMAZON", "amazon adjust", 
            "Amazon Adjustment", "amazon sales", "Found", "FOUND",
            "found box", "Found by jackie ","Found in w/hse","dotcom", 
            "dotcom adjust", "allocate stock for dotcom orders ta", "FBA", 
            "Dotcomgiftshop Gift Voucher £100.00", "on cargo order",
            "wrongly sold (22719) barcode", "wrongly marked 23343",
            "dotcomstock", "rcvd be air temp fix for dotcom sit", 
            "Manual", "John Lewis", "had been put aside", 
            "for online retail orders", "taig adjust", "amazon", 
            "incorrectly credited C550456 see 47", "returned", 
            "wrongly coded 20713", "came coded as 20713", 
            "add stock to allocate online orders", "Adjust bad debt", 
            "alan hodge cant mamage this section", "website fixed",
            "did  a credit  and did not tick ret", "michel oops",
            "incorrectly credited C550456 see 47", "mailout", "test",
            "Sale error",  "Lighthouse Trading zero invc incorr", "SAMPLES",
            "Marked as 23343", "wrongly coded 23343","Adjustment", 
            "rcvd be air temp fix for dotcom sit", "Had been put aside." )

df_input %>% 
  filter(description %in% descr) %>% arrange(stock_code)

Only 1 product looks invalid: the one above is a voucher. Let's remove it.

df_input <- df_input %>% 
  filter(!description %in% descr)

3.2.7 Duplicated StockCode-Description

Each product should have a unique stock_code and description. Let's check whether any stock_code has several descriptions, or vice versa.

tibble(
  stock_code_unique = df_input$stock_code %>% unique() %>% length(),
  description_unique = df_input$description %>% unique() %>% length(),
  stock_description_unique = df_input %>% select(stock_code,description) %>% distinct() %>% nrow()
)

The numbers above indicate duplication, because the unique counts of stock_code, description, and the stock_code-description pair should all be equal. Let's check:

df_products <- df_input %>% 
  arrange(desc(invoice_date)) %>% # newest first, so the latest description of each stock code comes first
  select(stock_code,description) %>% 
  distinct()

df_products %>% 
  group_by(stock_code) %>% 
  summarise(description_count=n()) %>% 
  ungroup() %>% 
  filter(description_count>1)

There are 212 stock codes with more than 1 description. For this case, we align each product's description with the description used in that product's latest transaction. Let's clean it up.

df_products <- df_products %>% 
  group_by(stock_code) %>% 
  slice(1)

df_input <- df_input %>% left_join(df_products, by = "stock_code") %>% 
  mutate(description = description.y) %>% 
  select(-description.x, -description.y)

tibble(
  stock_code_unique = df_input$stock_code %>% unique() %>% length(),
  description_unique = df_input$description %>% unique() %>% length(),
  stock_description_unique = df_input %>% select(stock_code,description) %>% distinct() %>% nrow()
)

After this cleanup, the numbers above show that some identical descriptions still map to different stock codes. Let's adjust them using the stock_code from the latest transaction.

df_description <- df_input %>% 
  arrange(desc(invoice_date)) %>% # newest first, so the latest stock code of each description comes first
  select(stock_code,description) %>% 
  distinct()

df_description <- df_description %>% 
  group_by(description) %>% 
  slice(1)

df_input <- df_input %>% left_join(df_description, by = "description") %>% 
  mutate(stock_code = stock_code.y) %>% 
  select(-stock_code.x, -stock_code.y)

tibble(
  stock_code_unique = df_input$stock_code %>% unique() %>% length(),
  description_unique = df_input$description %>% unique() %>% length(),
  stock_description_unique = df_input %>% select(stock_code,description) %>% distinct() %>% nrow()
)

At this point every product in the data is unique. Let's continue.
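The "latest value wins" rule used in this section is easier to follow on a tiny table. The sketch below uses made-up products and `distinct(..., .keep_all = TRUE)` in place of `group_by()` plus `slice(1)`; it illustrates the idea and is not the exact code used on the real data.

```r
library(dplyr)

# Made-up transactions: stock code 11111 was renamed over time.
toy <- tibble(
  stock_code   = c("11111", "11111", "22222"),
  description  = c("RED MUG", "RED MUG LARGE", "BLUE PLATE"),
  invoice_date = as.Date(c("2011-01-05", "2011-06-10", "2011-03-01"))
)

# Sort newest first, then keep the first row seen for each stock code.
latest_description <- toy %>%
  arrange(desc(invoice_date)) %>%
  distinct(stock_code, .keep_all = TRUE) %>%
  select(stock_code, description)

# Overwrite every row's description with the latest one for its stock code.
toy_clean <- toy %>%
  select(-description) %>%
  left_join(latest_description, by = "stock_code")

print(toy_clean)
```

After the join, both rows of stock code 11111 carry the most recent description, so the number of unique stock codes equals the number of unique code-description pairs.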

3.2.8 Invalid Customer-Country

According to the variable information, country is the name of the country where the customer is located, so each customer should have exactly 1 country.

tibble(
  customer_unik = df_input %>% select(customer_id) %>% distinct() %>% nrow(),
  customer_country_unik = df_input %>% select(customer_id,country) %>% distinct() %>% nrow()
)

The numbers above show that some customers have 2 countries. For this case, we assume such customers moved, so we assign each customer the country of their latest transaction. Let's clean it up.

# note: rows with a missing customer_id form a single NA group here
df_country <- df_input %>% 
  arrange(desc(invoice_date)) %>% # newest first, so slice(1) keeps the country of the latest transaction
  select(customer_id, country) %>% 
  group_by(customer_id) %>% 
  slice(1)

df_input <- df_input %>% select(-country) %>% 
  left_join(df_country, by = "customer_id")


tibble(
  customer_unik = df_input %>% select(customer_id) %>% distinct() %>% nrow(),
  customer_country_unik = df_input %>% select(customer_id,country) %>% distinct() %>% nrow()
)

At this point every customer has exactly 1 country.
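One thing to watch in this step: as far as I understand dplyr's defaults, `group_by(customer_id)` treats all missing IDs as one group and `left_join()` matches NA keys to each other, so every row without a customer_id ends up with the same country. The sketch below is a variation on made-up data, not the author's method: it applies the "latest country" rule only to known customers and leaves anonymous rows untouched.

```r
library(dplyr)

# Made-up rows: customer 1 changed country, two rows have no customer_id.
toy <- tibble(
  customer_id  = c(1, 1, NA, NA),
  country      = c("France", "Spain", "United Kingdom", "Germany"),
  invoice_date = as.Date(c("2011-01-01", "2011-05-01", "2011-02-01", "2011-03-01"))
)

# Latest country per known customer (rows with a missing ID are excluded).
latest_country <- toy %>%
  filter(!is.na(customer_id)) %>%
  arrange(desc(invoice_date)) %>%
  distinct(customer_id, .keep_all = TRUE) %>%
  select(customer_id, latest_country = country)

# Use the latest country when there is one, otherwise keep the original value.
toy_clean <- toy %>%
  left_join(latest_country, by = "customer_id") %>%
  mutate(country = coalesce(latest_country, country)) %>%
  select(-latest_country)

print(toy_clean)
```

Whether anonymous rows should keep their own country depends on how the country column is used later; for customer segmentation those rows are dropped anyway.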

3.2.9 Data Periodical Cut-Off

Our transaction data runs from 2010-12-01 to 2011-12-09. Because December 2011 is incomplete, I decided to drop December 2011 from this analysis, which leaves one full year of transactions from December 2010 through November 2011.

df_input <- df_input %>% filter(as.Date(invoice_date) < ymd("2011-12-01"))

3.3 Duplicated Data

Are there duplicated rows?

data.frame(
  row_of_data = df_input %>% nrow(),
  row_of_unique.data = df_input %>% distinct() %>% nrow()
)

The numbers above show that there are duplicated rows. We remove them because they carry no additional information.

df_input <- df_input %>% distinct()

3.4 Missing Values

Are there missing values?

colSums(is.na(df_input))

#>   invoice_no     quantity invoice_date   unit_price  customer_id  description 
#>            0            0            0            0       123562            0 
#>   stock_code      country 
#>            0            0

WOW! We have 123562 rows with a missing customer_id. For this case we split the dataset into 2 parts:

df_customer_transaction : We cannot segment customers when we do not know who the customer is, so in this dataset every row without a customer_id is removed.
df_product_personalized : This dataset is used to build the recommender system. To build one, we can use either a customer-product matrix or an invoice-product matrix, and the two give different results. With a customer-product matrix, we recommend products based on the similarity of customers' purchase histories, and the recommendations can be used at any time. With an invoice-product matrix, we recommend products based on what was bought together in each transaction, regardless of who the customer is, and the result can only be used while a customer is shopping or when we know the contents of their basket.
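To make the two matrix types concrete, here is a minimal base R sketch on made-up rows. `to_binary_matrix` is a hypothetical helper, and the product codes P1 to P3 are invented. Invoice 100003 has no customer_id, so it disappears from the customer-product matrix but still contributes to the invoice-product matrix.

```r
# Made-up transaction lines; invoice 100003 has no known customer.
toy <- data.frame(
  customer_id = c(1, 1, 2, NA, NA),
  invoice_no  = c("100001", "100001", "100002", "100003", "100003"),
  stock_code  = c("P1", "P2", "P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 if the row key ever appears with the product, else 0.
to_binary_matrix <- function(rows, cols) {
  counts <- table(rows, cols)      # lines with a missing key are not counted
  (unclass(counts) > 0) * 1L
}

customer_product <- to_binary_matrix(toy$customer_id, toy$stock_code)
invoice_product  <- to_binary_matrix(toy$invoice_no, toy$stock_code)

print(customer_product)
print(invoice_product)
```

The customer-product matrix has one row per known customer, while the invoice-product matrix has one row per basket, which is why the second one can keep the rows with a missing customer_id.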

4 Feature Extraction

For this project we need the total value of each transaction line for the analysis, so we derive it as quantity * unit_price.

df_input <- df_input %>% mutate(total_amount = quantity * unit_price) %>% 
  select(invoice_no,invoice_date,customer_id,country,stock_code,description,quantity,unit_price,total_amount)

5 Clean Datasets

df_customer_transaction <- drop_na(df_input)
df_product_personalized <- df_input %>% select(customer_id,invoice_no,stock_code,description)

# save the clean datasets under data_clean/ in the working directory
saveRDS(object = df_customer_transaction, file = file.path(getwd(), "data_clean", "df_customer_transaction.rds"))
saveRDS(object = df_product_personalized, file = file.path(getwd(), "data_clean", "df_product_personalized.rds"))
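`saveRDS()` fails if the target folder does not exist, so it can help to create it first. The sketch below is a self-contained round trip on a made-up one-row table; it writes to a temporary folder (an assumption made so that the example does not touch the project directory) and reads the file back.

```r
# Write to a temporary folder so the example leaves the project untouched.
out_dir <- file.path(tempdir(), "data_clean")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Made-up one-row dataset.
toy <- data.frame(invoice_no = "100001", total_amount = 15.3)

path <- file.path(out_dir, "toy.rds")
saveRDS(toy, file = path)

# Reading the file back should give an identical object.
restored <- readRDS(path)
print(identical(toy, restored))
```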

Here are the datasets we will use.

df_customer_transaction : 374,141 observations and 9 attributes

df_customer_transaction

df_product_personalized : 497,703 observations and 4 attributes

df_product_personalized

6 Exploratory Data Analysis

Before segmenting customers and building the recommender model, let's explore df_customer_transaction to understand the customers' transactions.

6.1 Monthly Transaction Frequency

df_monthly_transactions <- df_customer_transaction %>%
  select(invoice_date,invoice_no) %>% 
      distinct() %>% 
      mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
              yearmonth = ymd(yearmonth),
              ym = as.yearmon(invoice_date)) %>% 
      group_by(ym,yearmonth) %>% 
      summarise(total_transaction = n()) %>% 
      ungroup() %>% 
      mutate(
        popup=glue("{ym}
                   {comma(total_transaction)}")
      )

df_max <- df_monthly_transactions %>% arrange(total_transaction) %>% tail(1)
df_min <- df_monthly_transactions %>% arrange(total_transaction) %>% head(1)
      
plot_monthly_transactions <- df_monthly_transactions %>% 
  ggplot(aes(x=yearmonth,total_transaction))+
  geom_line(size=1)+
  geom_point(size=2, aes(text=popup))+
  geom_point(data=df_max, aes(x=yearmonth, y=total_transaction,text=popup), colour="#eaec42", size=3)+
  geom_point(data=df_min, aes(x=yearmonth, y=total_transaction,text=popup), colour="#eaec42", size=3)+
  labs(
    title = "Total Transaction per Month",
    x = "Month-Year",
    y = NULL
  )+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  my_plot_theme(10)


ggplotly(plot_monthly_transactions,tooltip="text") %>% 
  config(displayModeBar = F)

The lowest number of transactions occurred in January 2011 and the highest in November 2011. Overall, transactions tend to increase.
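One detail of the counting is worth illustrating: an invoice has many lines, so it has to be counted once per month, not once per line. The base R sketch below uses made-up invoices and counts distinct invoice numbers per month. It also shows that counting distinct (invoice, timestamp) pairs gives a larger number if the lines of one invoice carry different timestamps; whether that happens in the real data is not checked here.

```r
# Made-up lines: invoice 100001 has two lines stamped one minute apart.
toy <- data.frame(
  invoice_no   = c("100001", "100001", "100002", "100003"),
  invoice_date = as.POSIXct(c("2011-01-05 10:00:00", "2011-01-05 10:01:00",
                              "2011-01-20 09:30:00", "2011-02-02 14:15:00"),
                            tz = "UTC"),
  stringsAsFactors = FALSE
)

# First day of the month of each line, as text.
month_start <- format(toy$invoice_date, "%Y-%m-01", tz = "UTC")

# Distinct invoice numbers per month.
invoices_per_month <- tapply(toy$invoice_no, month_start,
                             function(x) length(unique(x)))

# Distinct (invoice, timestamp) pairs per month, for comparison.
is_first_pair <- !duplicated(toy[, c("invoice_no", "invoice_date")])
pairs_per_month <- table(month_start[is_first_pair])

print(invoices_per_month)
print(pairs_per_month)
```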

6.2 Monthly Transaction Value

df_monthly_value <- df_customer_transaction %>% 
  mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
          yearmonth = ymd(yearmonth),
          ym = as.yearmon(invoice_date)) %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_order_amount = sum(total_amount)) %>% 
  ungroup() %>% 
  mutate(
    popup=glue("{ym}
               £ {comma(total_order_amount)}")
  )

df_max <- df_monthly_value %>% arrange(total_order_amount) %>% tail(1)
df_min <- df_monthly_value %>% arrange(total_order_amount) %>% head(1)

plot_monthly_value <- df_monthly_value %>% 
  ggplot(aes(x=yearmonth,total_order_amount))+
  geom_line(size=1)+
  geom_point(size=2,aes(text=popup))+
  geom_point(data=df_max, aes(x=yearmonth, y=total_order_amount, text=popup), colour="#eaec42", size=3)+
  geom_point(data=df_min, aes(x=yearmonth, y=total_order_amount, text=popup), colour="#eaec42", size=3)+
  labs(
    title = "Total Transaction Value per Month",
    x = "Month-Year",
    y = NULL
  )+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  my_plot_theme(10)

ggplotly(plot_monthly_value, tooltip="text") %>% 
  #layout(showlegend=FALSE, margin = list(l = 1, r = 1, b = 1, t = 12)) %>%  
  config(displayModeBar = F, scrollzoom = F)

The lowest monthly transaction value occurred in February 2011 and the highest in November 2011. The transaction value also tends to increase.

6.3 Monthly Customer

 df_total_customer <- df_customer_transaction %>% select(invoice_date,customer_id) %>% 
  mutate( yearmonth = format(invoice_date, format="%Y-%m-1"),
          yearmonth = ymd(yearmonth),
          ym = as.yearmon(invoice_date)) %>% 
  select(customer_id,ym,yearmonth) %>% 
  distinct() %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_customer = n()) %>% 
  ungroup() %>% 
  mutate(
    popup=glue("{ym}
               {comma(total_customer)}")
  )

df_max <- df_total_customer %>% arrange(total_customer) %>% tail(1)
df_min <- df_total_customer %>% arrange(total_customer) %>% head(1)

plot_customer <- df_total_customer %>% 
  ggplot(aes(x=yearmonth,total_customer))+
  geom_line(size=1)+
  geom_point(size=2, aes(text=popup))+
  geom_point(data=df_max, aes(x=yearmonth, y=total_customer, text=popup), colour="#eaec42", size=3)+
  geom_point(data=df_min, aes(x=yearmonth, y=total_customer, text=popup), colour="#eaec42", size=3)+
  labs(
    title = "Total Customer per Month",
    x = "Month-Year",
    y = NULL
  )+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  my_plot_theme(10)

ggplotly(plot_customer, tooltip="text") %>% 
  config(displayModeBar = F, scrollzoom = F)

The chart above shows the number of customers who transacted each month: lowest in January 2011 and highest in November 2011. The monthly charts of total transactions, total transaction value, and total active customers all trend upward and look alike. What about the growth of new customers?

6.4 Growth of New Customer

df_growth_customer <- df_customer_transaction %>% group_by(customer_id) %>% 
  summarise(first_order = min(invoice_date)) %>% 
  ungroup() %>% 
  mutate(yearmonth = format(first_order, format="%Y-%m-1"),
         yearmonth = ymd(yearmonth),
         ym = as.yearmon(first_order)) %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_new_customer = n()) %>% 
  ungroup() %>% 
  mutate(
  popup=glue("Year-Month : {ym}
              Total Customer : {total_new_customer}"))


df_min <- df_growth_customer %>% arrange(total_new_customer) %>% head(1)
df_max <- df_growth_customer %>% arrange(total_new_customer) %>% tail(1)

plot_growth_customer <- df_growth_customer %>% 
  ggplot(aes(x=yearmonth,total_new_customer))+
  geom_line(size=1)+
  geom_point(size=2, aes(text=popup))+
  geom_point(data=df_max, aes(x=yearmonth, y=total_new_customer, text=popup), colour="#eaec42", size=3)+
  geom_point(data=df_min, aes(x=yearmonth, y=total_new_customer, text=popup), colour="#eaec42", size=3)+
  labs(
    title = "Total Growth of New Customer per Month",
    x = "Month-Year",
    y = NULL
  )+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  my_plot_theme(10)

ggplotly(plot_growth_customer, tooltip="text") %>% 
  config(displayModeBar = F, scrollzoom = F)

OK, here we find a problem. The chart above shows that the number of new customers tends to decline, even though the 3 previous charts show transactions trending upward. This indicates that some customers buy repeatedly; in other words, the dataset contains loyal and non-loyal customers, and we need to know which is which. With that knowledge we can segment customers and approach each segment differently, to optimize customer value and make marketing spend more effective. We will segment customers in Part 2. For now, let's keep exploring the data.
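The difference between active customers and new customers can be seen on a handful of made-up orders. In this sketch, a customer is "new" in the month of their first order and "active" in any month in which they order; the month is kept as a "YYYY-MM" string for simplicity.

```r
library(dplyr)

# Made-up orders for three customers.
toy <- tibble(
  customer_id  = c(1, 1, 2, 3, 3),
  invoice_date = as.Date(c("2011-01-10", "2011-03-02", "2011-01-25",
                           "2011-02-14", "2011-03-20"))
)

# New customers: count each customer once, in the month of their first order.
new_customers <- toy %>%
  group_by(customer_id) %>%
  summarise(first_order = min(invoice_date), .groups = "drop") %>%
  mutate(month = format(first_order, "%Y-%m")) %>%
  count(month, name = "total_new_customer")

# Active customers: count each customer once per month in which they ordered.
active_customers <- toy %>%
  mutate(month = format(invoice_date, "%Y-%m")) %>%
  distinct(customer_id, month) %>%
  count(month, name = "total_customer")

print(new_customers)
print(active_customers)
```

In this toy data, March has 2 active customers and no new ones, so all March activity comes from repeat buyers. That is the same pattern the charts suggest for the real data.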

6.5 Order Habit

When do customers shop most often?

# By day of month
df_monthly_habbit <- df_customer_transaction %>% select(invoice_date,invoice_no) %>% 
  distinct() %>%
  mutate(month = month(invoice_date),
         day = day(invoice_date)) %>% 
  group_by(month,day) %>% 
  summarise(total = n()) %>% 
  ungroup() %>% 
  group_by(day) %>% 
  summarise(avg_monthly_trans = as.integer(median(total))) %>% 
  ungroup() %>% 
  mutate(popup = glue("Date : {day}
                       Total Transaction: {avg_monthly_trans}")) 

df_monthly_habbit %>% 
  ggplot(aes(x=as.factor(day), y=avg_monthly_trans)) +
  geom_bar(stat="identity", aes(fill=avg_monthly_trans),show.legend = FALSE)+
  labs(title = "Monthly Order Habit",
    x = "Date", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5,size=12, face="bold"),
        plot.subtitle = element_text(hjust = 0.5,size=10),
        axis.text.y = element_blank(),
        axis.text.x=element_text(size=11, face="bold"))+
  scale_fill_gradient(low=my_theme_hex("col2") ,na.value = "#C0C0C0", high=my_theme_hex("col7"))+
  #my_theme_fill()+
  coord_polar() -> polar1


# Daily
df_wday_habbit <- df_customer_transaction %>% select(invoice_no,invoice_date) %>% 
  distinct() %>% 
  mutate(month = month(invoice_date),
        wday = wday(invoice_date,week_start = getOption("lubridate.week.start", 1))) %>% 
  group_by(month,wday) %>% 
  summarise(total = n()) %>% 
  ungroup() %>% 
  group_by(wday) %>% 
  summarise(avg_wday_trans = as.integer(median(total))) %>% 
  ungroup() %>% 
  mutate(popup = glue("wday : {wday}
                     Total Transaction: {avg_wday_trans}")) 


df_wday_habbit %>% 
    ggplot(aes(wday,avg_wday_trans))+
    geom_bar(width=1, stat="identity", show.legend = FALSE, aes(fill=avg_wday_trans))+
    labs(
      title = "Daily Order Habit",
      x = "Day", 
      y = NULL)+
    scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                       labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
    theme_minimal()+
    theme(axis.title = element_blank(),
          legend.position = "none",
          plot.title = element_text(hjust = 0.5,size=12, face="bold"),
          plot.subtitle = element_text(hjust = 0.5,size=10),
          axis.text.y = element_blank(),
          axis.text.x=element_text(size=11, face="bold"))+
    scale_fill_gradient(low=my_theme_hex("col2") ,na.value = "#C0C0C0", high=my_theme_hex("col7"))+
    coord_polar() -> polar2



#Hourly
 
df_hourly_habbit <- df_customer_transaction %>% select(invoice_date,invoice_no) %>% 
  distinct() %>%
  mutate(day = day(invoice_date),
         hour = hour(invoice_date)) %>% 
  group_by(day,hour) %>% 
  summarise(total = n()) %>% 
  ungroup() %>% 
  group_by(hour) %>% 
  summarise(avg_hourly_trans = as.integer(median(total))) %>% 
  ungroup() %>% 
  mutate(popup = glue("Hour of Day : {hour}
                       Total Transaction: {avg_hourly_trans}")) 

# all 24 hours, so hours without transactions still appear on the chart
time_range <- tibble(hour = 0:23)

time_range %>% 
  left_join(
    df_hourly_habbit, by = "hour") %>%
  mutate(hour = as.factor(hour)) %>% 
  ggplot(aes(x=hour,y=avg_hourly_trans))+
  geom_bar(stat="identity",show.legend = FALSE, aes(fill=avg_hourly_trans))+
  labs(title = "Hourly Order Habit",
    x = "Hour of Day", 
    y = NULL)+
  theme_minimal()+
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5,size=12, face="bold"),
        plot.subtitle = element_text(hjust = 0.5,size=10),
        axis.text.y = element_blank(),
        axis.text.x=element_text(size=11, face="bold"))+
  scale_fill_gradient(low=my_theme_hex("col2") ,na.value = "#C0C0C0", high=my_theme_hex("col7"))+
  coord_polar() -> polar3


grid.arrange(polar1,polar2,polar3, ncol = 3)

The darker the color in the charts above, the more transactions were made. Based on these charts, we can make the campaign strategy more effective in general by timing it to when customers shop the most.
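The habit charts are built in two steps: count transactions per month and per period (day, weekday, or hour), then take the median of those counts across months. The base R sketch below uses made-up dates and weekdays only. Every month and weekday pair occurs in the toy data, so the cross-table has no zero cells (unlike a `group_by()` count, a cross-table would include them).

```r
# Made-up transaction dates: January 2011 has an unusually busy run of Mondays.
d <- as.Date(c("2011-01-03", "2011-01-10", "2011-01-17",   # Mondays in January
               "2011-02-07", "2011-03-07",                 # Mondays in Feb, Mar
               "2011-01-04", "2011-02-01", "2011-03-01"))  # Tuesdays

# Step 1: transactions per month and weekday ("%u": 1 = Monday ... 7 = Sunday).
counts <- table(month = format(d, "%Y-%m"), weekday = format(d, "%u"))

# Step 2: typical count per weekday across months.
median_per_weekday <- apply(counts, 2, median)
mean_per_weekday   <- apply(counts, 2, mean)

print(counts)
print(median_per_weekday)
print(mean_per_weekday)
```

The median for Monday stays at the usual level even though January was busy, while the mean is pulled upward; this is one reason to prefer the median for a "typical" habit.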

6.6 Most Popular Product

plot_most_frequency <- df_customer_transaction %>% group_by(stock_code,description) %>% 
  summarise(frequency = n()) %>% 
  ungroup() %>% 
  arrange(desc(frequency)) %>% 
  head(10) %>% 
  mutate(description = as.factor(description),
         description = reorder(description,frequency)) %>% 
  ggplot(aes(x=description, y=frequency))+
  geom_bar(stat="identity", aes(fill=description, text=frequency), show.legend = FALSE)+
  labs(title="Top 10 Products by Order Frequency",
        x=NULL)+
  coord_flip()+
  my_plot_theme(9)+
  my_theme_fill()
 
ggplotly(plot_most_frequency, tooltip="text") %>% 
  layout(showlegend=FALSE) %>% 
  config(displayModeBar = F, scrollzoom = F)

plot_most_quantity <- df_customer_transaction %>% 
  group_by(stock_code,description) %>% 
  summarise(total_quantity = sum(quantity)) %>% 
  ungroup() %>% 
  arrange(desc(total_quantity)) %>% 
  head(10) %>% 
  mutate(description = as.factor(description),
         description = reorder(description,total_quantity)) %>% 
  ggplot(aes(x=description, y=total_quantity))+
  geom_bar(stat="identity", aes(fill=description, text=total_quantity), show.legend = FALSE)+
  labs(title="Top 10 Products by Quantity Ordered",
        x=NULL)+
  coord_flip()+
  my_plot_theme(9)+
  my_theme_fill()


ggplotly(plot_most_quantity, tooltip = "text") %>% 
  layout(showlegend=FALSE) %>% 
  config(displayModeBar = F, scrollzoom = F)

plot_most_customer <- df_customer_transaction %>%
  select(customer_id, stock_code,description) %>% 
  distinct() %>% 
  group_by(stock_code,description) %>% 
  summarise(total_customer = n()) %>% 
  ungroup() %>% 
  arrange(desc(total_customer)) %>% 
  head(10) %>% 
  mutate(description = as.factor(description),
         description = reorder(description,total_customer)) %>% 
  ggplot(aes(x=description, y=total_customer))+
  geom_bar(stat="identity", aes(fill=description, text=total_customer), show.legend = FALSE)+
  labs(title="Top 10 Products by Number of Customers",
        x=NULL)+
  coord_flip()+
  my_plot_theme(9)+
  my_theme_fill()

ggplotly(plot_most_customer, tooltip = "text") %>% 
  layout(showlegend=FALSE) %>% 
   config(displayModeBar = F, scrollzoom = F)

plot_most_profit <- df_customer_transaction %>% 
  group_by(stock_code,description) %>% 
  summarise(total_amount = sum(total_amount)) %>% 
  ungroup() %>% 
  arrange(desc(total_amount)) %>% 
  head(10) %>% 
  mutate(description = as.factor(description),
         description = reorder(description,total_amount)) %>% 
  ggplot(aes(x=description, y=total_amount))+
  geom_bar(stat="identity", aes(fill=description, text=paste0("GBP ",total_amount)), show.legend = FALSE)+
  labs(title="Top 10 Products by Order Value",
        x=NULL)+
  coord_flip()+
  my_plot_theme(9)+
  my_theme_fill()


ggplotly(plot_most_profit, tooltip = "text") %>% 
  layout(showlegend=FALSE) %>% 
  config(displayModeBar = F, scrollzoom = F)

6.7 Unique Items Purchased per Customer

getmodus <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

df_item_different <- df_customer_transaction %>%
  select(customer_id,stock_code,description) %>% 
  distinct() %>% 
  group_by(customer_id) %>% 
  summarise(item = n()) %>% 
  ungroup() %>% 
  mutate( customer_id = customer_id,
          popup=glue("customer id : {customer_id}
                     Unique Item : {item}"))

plot_df_item_different <- df_item_different %>% 
ggplot(aes(x=customer_id,y=item)) +
  geom_point(aes(color=item, size=item, text=popup), show.legend = FALSE)+
  geom_hline(yintercept=getmodus(df_item_different$item),linetype="dashed", color = "black",size=0.5,text=getmodus(df_item_different$item))+
  annotate(geom="text", 
    x=max(df_item_different$customer_id)-2/10*length(df_item_different$customer_id), 
    y=getmodus(df_item_different$item)+100,size=3,
    label=paste0("Modus: ",getmodus(df_item_different$item),
                ", Total: ",round(as.numeric(((tabulate(match(df_item_different$item, unique(df_item_different$item))) %>% 
                                                 sort(decreasing = T) %>% .[1])/length(df_item_different$customer_id))*100),2), "% cust"),
   color="black")+
  labs(title="Number of unique items purchased by each customer",
       x="Customer",
       y="Total Unique Item")+
  my_plot_theme(10)+
  theme(
    axis.text.x = element_blank()
  )+
  scale_color_gradient(low=my_theme_hex("col2") ,na.value = "#C0C0C0", high=my_theme_hex("col7"))


ggplotly(plot_df_item_different, tooltip="text") %>% 
  layout(showlegend=FALSE)

As we can see, only 2.14% of customers bought just one type of item. This is good news: the more customers buy several different items, the more feasible it becomes to build a recommender system.
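To make the mode and the percentage figures concrete, here is a minimal sketch on made-up data. The vector `items` is hypothetical and only stands in for `df_item_different$item`; the helper is the same `getmodus` used above.

```r
# Toy data: number of unique items bought by 8 hypothetical customers
items <- c(1, 3, 3, 7, 3, 12, 1, 25)

# Same helper as in the analysis: returns the most frequent value
getmodus <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

mode_items    <- getmodus(items)                 # most common item count
share_at_mode <- mean(items == mode_items) * 100 # % of customers at the mode
share_single  <- mean(items == 1) * 100          # % who bought only one item type

cat("Mode:", mode_items, "\n")
cat("Customers at the mode:", share_at_mode, "%\n")
cat("Customers with one item type:", share_single, "%\n")
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is a compact way to get the same share that the plot annotation computes with `tabulate()`.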

6.8 Basket Size

How many items does each customer typically put in a shopping basket? The code below sums the quantity per invoice and then takes the median basket for each customer.

getmodus <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

df_basket <- df_customer_transaction %>% 
  group_by(customer_id, invoice_no) %>% 
  summarise(baskets = sum(quantity)) %>% 
  ungroup() %>% 
  group_by(customer_id) %>% 
  summarise(
    freq = n(),
    baskets = median(baskets))%>% 
  ungroup() %>% 
  mutate( customer_id = customer_id,
          popup=glue("customer id : {customer_id}
                      Trans. Frequency : {freq}
                      Median basket : {baskets}"))

plot_baskets <- df_basket %>% 
  ggplot(aes(x=customer_id,y=baskets)) +
  geom_point(aes(color=baskets, size=baskets, text=popup), show.legend = FALSE)+
  geom_hline(yintercept=getmodus(df_basket$baskets),linetype="dashed", color = "black",size=0.5)+
  annotate(geom="text", 
   x=max(df_basket$customer_id)-2/10*length(df_basket$customer_id), 
   y=getmodus(df_basket$baskets)+2000,size=3,
   label=paste0("Mode: ",getmodus(df_basket$baskets),
                ", Total: ",round(as.numeric(((tabulate(match(df_basket$baskets, unique(df_basket$baskets))) %>% 
                                                 sort(decreasing = T) %>% .[1])/length(df_basket$customer_id))*100),2), "% cust"),
   color="black")+
  labs(title="Median basket size for each customer",
       x="Customer",
       y="Median Basket Size")+
  my_plot_theme(10)+
  theme(
    axis.text.x = element_blank()
  )+
  scale_color_gradient(low=my_theme_hex("col2") ,na.value = "#C0C0C0", high=my_theme_hex("col7"))
   
ggplotly(plot_baskets, tooltip="text") %>% 
  layout(showlegend=FALSE)
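The two-step aggregation used for `df_basket` can be checked on a small example. This is only a sketch: the data frame `toy` is invented, and `mean_basket` is added purely for comparison with the median used in the analysis.

```r
library(dplyr)

# Hypothetical transactions: one row per invoice line
toy <- data.frame(
  customer_id = c(1, 1, 1, 1, 2, 2),
  invoice_no  = c("A", "A", "B", "C", "D", "D"),
  quantity    = c(2, 3, 10, 1, 4, 6)
)

basket_summary <- toy %>%
  # Step 1: one row per basket (invoice), with its total quantity
  group_by(customer_id, invoice_no) %>%
  summarise(basket = sum(quantity), .groups = "drop") %>%
  # Step 2: one row per customer, summarising that customer's baskets
  group_by(customer_id) %>%
  summarise(freq          = n(),
            median_basket = median(basket),
            mean_basket   = mean(basket),
            .groups = "drop")

print(basket_summary)
```

Customer 1 has baskets of 5, 10 and 1 items, so the median is 5 while the mean is about 5.33. The median is less sensitive to a single unusually large order, which is one plausible reason to prefer it here.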

6.9 Best Countries

Which countries do most of our customers come from? The treemap below sizes each country by the total quantity purchased.

treemap(df_customer_transaction,
        index      = c("country"),
        vSize      = "quantity",
        title      = "",
        palette    = my_color,
        border.col = "grey40")

Customers are spread across several countries. Most of them are in the UK, followed by the Netherlands, EIRE, Germany, France and Australia. With this information we can optimize our channels in specific countries, and we can also use it as a basis for developing strategy.
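Because the treemap uses `vSize = "quantity"`, it ranks countries by units purchased rather than by number of customers. A minimal sketch, on invented data, of how the two measures could be compared side by side (the data frame `toy` and the column names `customers` and `total_quantity` are assumptions for illustration):

```r
library(dplyr)

# Hypothetical transactions with a country per customer
toy <- data.frame(
  customer_id = c(1, 1, 2, 3, 3, 4),
  country     = c("United Kingdom", "United Kingdom", "United Kingdom",
                  "Netherlands", "Netherlands", "Germany"),
  quantity    = c(5, 3, 2, 100, 50, 4)
)

country_summary <- toy %>%
  group_by(country) %>%
  summarise(customers      = n_distinct(customer_id), # distinct buyers
            total_quantity = sum(quantity),           # units purchased
            .groups = "drop") %>%
  arrange(desc(customers))

print(country_summary)
```

In this toy data the United Kingdom has the most customers, while the Netherlands has the largest quantity, so the two rankings need not agree.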

7 Summary

We have now completed the data preparation and exploratory analysis for Part 1. Based on the analysis above, we found that this company has both loyal and non-loyal customers. We therefore need to segment the customers so that the company or its marketing team can take a different strategic approach to each segment. The segmentation results can be used to optimize marketing costs and campaign timing. In addition, we can apply a recommendation system to personalize products for each customer. Imagine: if we know which products a customer likes and which segment that customer belongs to, we can build personalized campaigns to increase customer value, for example “offering different discounts or promotions, according to each customer's value, on the products that customer likes”.

Why does this matter?

According to research by Statistics Indonesia (Badan Pusat Statistik, BPS), 25.1% of all e-commerce businesses in Indonesia only started operating in 2019. Based on this, I assume that competition for customers will keep getting tougher. The main problem, then, is not how to acquire new customers but how to optimize the value of the customers we already have.

Here is the data:

df_background_ecom <- data.frame(
  year=c("<2010","2010-2016","2017-2018","2019"),
  total = c(1.53,28.06, 45.30, 25.11)) %>% 
  mutate(year = factor(year, levels = c("<2010","2010-2016","2017-2018","2019")))

plot_ly(df_background_ecom, labels=~year, values=~total,type="pie",
        textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        hoverinfo = 'text',
        text = ~paste(year,":",total,"%"),
        marker = list(colors = c(my_theme_hex("col4"),my_theme_hex("col5"),my_theme_hex("col6"),my_theme_hex("col7")),
                      line = list(color = '#FFFFFF', width = 1)),
        #The 'pull' attribute can also be used to create space between the sectors
        showlegend = FALSE) %>% 
  layout(xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)) %>% 
  config(displayModeBar = F) %>% 
  layout(font=list(size = 11)) %>% 
  layout(title="E-Commerce Businesses by Year of Starting in Indonesia")

What do you think: how would you react if you got a discount on the products you like?

See you in Part 2: Customer Analysis and Segmentation. Cheers!

Part 1: Data Preparation and Exploratory Analysis

Utilizing RFM Analysis and Product Personalization to Optimize Customer Value

by Gabriel Erichson

10 July 2020