E-Commerce Data Exploration

Status: In Progress

1 Introduction

Olist merupakan salah satu platform e-commerce terbesar di Brazil. Berdiri sejak tahun 2015, Olist merupakan marketplace platform terintegrasi yang mengkhususkan layanan pada bidang logistic dan permodalan.

Kali ini saya akan menganalisa transaksi penjualan di marketplace Olist pada periode 2016 hingga 2018. Sumber daya dan penjelasan lengkap terkait data dapat dilihat pada tautan Kaggle: Olist dataset


2 Data Inspection

Skema data yang disediakan seperti gambar berikut.

Dari sekian ragam data, data review orders tidak saya gunakan.

2.1 Import Library and dataset

  • import library
# for wrangling data
library(tidyr)
library(dplyr)
library(readr)
library(lubridate)
library(zoo)
library(glue)
library(scales)

# for visualize data
library(ggplot2)
library(plotly)
library(knitr)

options(scipen = 999)
  • load dataset
#master data
raw_customer <- read.csv("dataset/olist_customers_dataset.csv")
raw_seller <- read.csv("dataset/olist_sellers_dataset.csv")
raw_product <- read.csv("dataset/olist_products_dataset.csv")
raw_product_cat <- read.csv("dataset/product_category_name_translation.csv")
raw_geoloc <- read.csv("dataset/olist_geolocation_dataset.csv")
#trx data
raw_orders <- read.csv("dataset/olist_orders_dataset.csv")
raw_order_items <- read.csv("dataset/olist_order_items_dataset.csv")
raw_order_payments <- read.csv("dataset/olist_order_payments_dataset.csv")

2.2 Customers

Berikut penjelasan terkait data customers yang akan digunakan:

Column Name Description
customer_id key to the orders dataset. Each order has a unique customer_id
customer_unique_id unique identifier of a customer.
customer_city customer city name

Berikut sample data dari customers

raw_customer %>% 
  select(-c(customer_zip_code_prefix, customer_state)) %>% 
  head(.)
#>                        customer_id               customer_unique_id
#> 1 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0
#> 2 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3
#> 3 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e
#> 4 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c
#> 5 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066
#> 6 879864dab9bc3047522c92c82e1212b8 4c93744516667ad3b8f1fb645a3116a4
#>           customer_city
#> 1                franca
#> 2 sao bernardo do campo
#> 3             sao paulo
#> 4       mogi das cruzes
#> 5              campinas
#> 6        jaragua do sul

Struktur data customer:

glimpse(raw_customer)
#> Rows: 99,441
#> Columns: 5
#> $ customer_id              <chr> "06b8999e2fba1a1fbc88172c00ba8bc7", "18955e83~
#> $ customer_unique_id       <chr> "861eff4711a542e4b93843c6dd7febb0", "290c77bc~
#> $ customer_zip_code_prefix <int> 14409, 9790, 1151, 8775, 13056, 89254, 4534, ~
#> $ customer_city            <chr> "franca", "sao bernardo do campo", "sao paulo~
#> $ customer_state           <chr> "SP", "SP", "SP", "SP", "SP", "SC", "SP", "MG~

2.3 Sellers

Berikut penjelasan terkait data sellers yang akan digunakan:

Column Name Description
seller_id seller unique identifier
seller_city seller city name

Berikut sample data dari sellers:

raw_seller %>% 
  select(!c(seller_zip_code_prefix, seller_state)) %>% 
  head(.)
#>                          seller_id       seller_city
#> 1 3442f8959a84dea7ee197c632cb2df15          campinas
#> 2 d1b65fc7debc3361ea86b5f14c68d2e2        mogi guacu
#> 3 ce3ad9de960102d0677a81f5d0bb7b2d    rio de janeiro
#> 4 c0f3eea2e14555b6faeea3dd58c1b1c3         sao paulo
#> 5 51a04a8a6bdcb23deccc82b0b80742cf braganca paulista
#> 6 c240c4061717ac1806ae6ee72be3533b    rio de janeiro

Struktur data sellers

glimpse(raw_seller)
#> Rows: 3,095
#> Columns: 4
#> $ seller_id              <chr> "3442f8959a84dea7ee197c632cb2df15", "d1b65fc7de~
#> $ seller_zip_code_prefix <int> 13023, 13844, 20031, 4195, 12914, 20920, 55325,~
#> $ seller_city            <chr> "campinas", "mogi guacu", "rio de janeiro", "sa~
#> $ seller_state           <chr> "SP", "SP", "RJ", "SP", "SP", "RJ", "PE", "SP",~

2.4 Products

Berikut penjelasan terkait data products yang akan digunakan:

Column Name Description
product_id unique product identifier
product_category_name root category of product (in Portuguese)
product_name_length number of characters extracted from the product name
product_description_length number of characters extracted from the product description.
product_photos_qty number of product published photos
product_weight_g product weight measured in grams
product_length_cm product length measured in centimeters
product_height_cm product height measured in centimeters
product_width_cm product width measured in centimeters

Berikut sample data dari products:

raw_product %>% 
  head(.)
#>                         product_id product_category_name product_name_lenght
#> 1 1e9e8ef04dbcff4541ed26657ea517e5            perfumaria                  40
#> 2 3aa071139cb16b67ca9e5dea641aaa2f                 artes                  44
#> 3 96bd76ec8810374ed1b65e291975717f         esporte_lazer                  46
#> 4 cef67bcfe19066a932b7673e239eb23d                 bebes                  27
#> 5 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas                  37
#> 6 41d3672d4792049fa1779bb35283ed13 instrumentos_musicais                  60
#>   product_description_lenght product_photos_qty product_weight_g
#> 1                        287                  1              225
#> 2                        276                  1             1000
#> 3                        250                  1              154
#> 4                        261                  1              371
#> 5                        402                  4              625
#> 6                        745                  1              200
#>   product_length_cm product_height_cm product_width_cm
#> 1                16                10               14
#> 2                30                18               20
#> 3                18                 9               15
#> 4                26                 4               26
#> 5                20                17               13
#> 6                38                 5               11

Struktur data products :

glimpse(raw_product)
#> Rows: 32,951
#> Columns: 9
#> $ product_id                 <chr> "1e9e8ef04dbcff4541ed26657ea517e5", "3aa071~
#> $ product_category_name      <chr> "perfumaria", "artes", "esporte_lazer", "be~
#> $ product_name_lenght        <int> 40, 44, 46, 27, 37, 60, 56, 56, 57, 36, 54,~
#> $ product_description_lenght <int> 287, 276, 250, 261, 402, 745, 1272, 184, 16~
#> $ product_photos_qty         <int> 1, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 4, 3, 2, 4~
#> $ product_weight_g           <int> 225, 1000, 154, 371, 625, 200, 18350, 900, ~
#> $ product_length_cm          <int> 16, 30, 18, 26, 20, 38, 70, 40, 27, 17, 16,~
#> $ product_height_cm          <int> 10, 18, 9, 4, 17, 5, 24, 8, 13, 10, 10, 19,~
#> $ product_width_cm           <int> 14, 20, 15, 26, 13, 11, 44, 40, 17, 12, 16,~

Sedangkan translasi product category dalam bahasa inggris berada pada product_category.

raw_product_cat %>% head(.)
#>   ï..product_category_name product_category_name_english
#> 1             beleza_saude                 health_beauty
#> 2   informatica_acessorios         computers_accessories
#> 3               automotivo                          auto
#> 4          cama_mesa_banho                bed_bath_table
#> 5         moveis_decoracao               furniture_decor
#> 6            esporte_lazer                sports_leisure

2.5 Order

Data transaksi penjualan tersimpan pada dataframe orders dan order_items, dimana orders menyimpan ringkasan transaksi seperti status pemesanan, kode customer, waktu pemesanan, dan lain-lain.

Berikut penjelasan terkait data orders yang akan digunakan:

Column Name Description
order_id unique identifier of the order.
customer_id key to the customer dataset. Each order has a unique customer_id
order_status Reference to the order status (created, processing, shipped, delivered, approved, invoiced, canceled, and unavailable).
order_purchase_timestamp Shows the purchase timestamp.
order_approved_at Shows the payment approval timestamp.
order_delivered_carrier_date Shows the order posting timestamp. When it was handled to the logistic partner.
order_delivered_customer_date Shows the actual order delivery date to the customer.
order_estimated_delivery_date Shows the estimated delivery date that was informed to customer at the purchase moment.

Berikut sample data dari orders:

raw_orders %>% head(.)
#>                           order_id                      customer_id
#> 1 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d
#> 2 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef
#> 3 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089
#> 4 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82
#> 5 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c
#> 6 a4591c265e18cb1dcee52889e2d8acc3 503740e9ca751ccdda7ba28e9ab8f608
#>   order_status order_purchase_timestamp   order_approved_at
#> 1    delivered      2017-10-02 10:56:33 2017-10-02 11:07:15
#> 2    delivered      2018-07-24 20:41:37 2018-07-26 03:24:27
#> 3    delivered      2018-08-08 08:38:49 2018-08-08 08:55:23
#> 4    delivered      2017-11-18 19:28:06 2017-11-18 19:45:59
#> 5    delivered      2018-02-13 21:18:39 2018-02-13 22:20:29
#> 6    delivered      2017-07-09 21:57:05 2017-07-09 22:10:13
#>   order_delivered_carrier_date order_delivered_customer_date
#> 1          2017-10-04 19:55:00           2017-10-10 21:25:13
#> 2          2018-07-26 14:31:00           2018-08-07 15:27:45
#> 3          2018-08-08 13:50:00           2018-08-17 18:06:29
#> 4          2017-11-22 13:39:59           2017-12-02 00:28:42
#> 5          2018-02-14 19:46:34           2018-02-16 18:17:02
#> 6          2017-07-11 14:58:04           2017-07-26 10:57:55
#>   order_estimated_delivery_date
#> 1           2017-10-18 00:00:00
#> 2           2018-08-13 00:00:00
#> 3           2018-09-04 00:00:00
#> 4           2017-12-15 00:00:00
#> 5           2018-02-26 00:00:00
#> 6           2017-08-01 00:00:00

Struktur data orders

glimpse(raw_orders)
#> Rows: 99,441
#> Columns: 8
#> $ order_id                      <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c~
#> $ customer_id                   <chr> "9ef432eb6251297304e76186b10a928d", "b08~
#> $ order_status                  <chr> "delivered", "delivered", "delivered", "~
#> $ order_purchase_timestamp      <chr> "2017-10-02 10:56:33", "2018-07-24 20:41~
#> $ order_approved_at             <chr> "2017-10-02 11:07:15", "2018-07-26 03:24~
#> $ order_delivered_carrier_date  <chr> "2017-10-04 19:55:00", "2018-07-26 14:31~
#> $ order_delivered_customer_date <chr> "2017-10-10 21:25:13", "2018-08-07 15:27~
#> $ order_estimated_delivery_date <chr> "2017-10-18 00:00:00", "2018-08-13 00:00~

Sedangkan detail transaksi seperti item yang dibeli dan penjual berada pada order_items.

Berikut penjelasan terkait data order_items yang akan digunakan:

Column Name Description
order_id unique identifier of the order.
order_item_id sequential number identifying number of items included in the same order.
product_id products unique identifier
seller_id seller unique identifier
shipping_limit_date Shows the seller shipping limit date for handling the order over to the logistic partner
price item price
freight_value item freight value item (if an order has more than one item the freight value is splitted between items)

Berikut sample data dari order_items:

raw_order_items %>% head(.)
#>                           order_id order_item_id
#> 1 00010242fe8c5a6d1ba2dd792cb16214             1
#> 2 00018f77f2f0320c557190d7a144bdd3             1
#> 3 000229ec398224ef6ca0657da4fc703e             1
#> 4 00024acbcdf0a6daa1e931b038114c75             1
#> 5 00042b26cf59d7ce69dfabb4e55b4fd9             1
#> 6 00048cc3ae777c65dbb7d2a0634bc1ea             1
#>                         product_id                        seller_id
#> 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202
#> 2 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36
#> 3 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d
#> 4 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4
#> 5 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87
#> 6 ef92defde845ab8450f9d70c526ef70f 6426d21aca402a131fc0a5d0960a3c90
#>   shipping_limit_date  price freight_value
#> 1 2017-09-19 09:45:35  58.90         13.29
#> 2 2017-05-03 11:05:13 239.90         19.93
#> 3 2018-01-18 14:48:30 199.00         17.87
#> 4 2018-08-15 10:10:18  12.99         12.79
#> 5 2017-02-13 13:57:51 199.90         18.14
#> 6 2017-05-23 03:55:27  21.90         12.69

Struktur data order_items

glimpse(raw_order_items)
#> Rows: 112,650
#> Columns: 7
#> $ order_id            <chr> "00010242fe8c5a6d1ba2dd792cb16214", "00018f77f2f03~
#> $ order_item_id       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,~
#> $ product_id          <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f2d52b80218~
#> $ seller_id           <chr> "48436dade18ac8b2bce089ec2a041202", "dd7ddc04e1b6c~
#> $ shipping_limit_date <chr> "2017-09-19 09:45:35", "2017-05-03 11:05:13", "201~
#> $ price               <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21.90, 19.90~
#> $ freight_value       <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69, 11.85, 7~

2.6 Order Payments

Data detail terkait nilai transaksi pembayaran, metode pembayaran, dan jenis pembayaran tersimpan di order payments

Berikut penjelasan terkait data order payments yang akan digunakan:

colnames(raw_order_payments)
#> [1] "order_id"             "payment_sequential"   "payment_type"        
#> [4] "payment_installments" "payment_value"
Column Name Description
order_id unique identifier of the order.
payment_sequential a customer may pay an order with more than one payment method. If he does so, a sequence will be created to
payment_type method of payment chosen by the customer.
payment_installments number of installments chosen by the customer.
payment_value transaction value.

Berikut sample data dari order payments:

raw_order_payments %>% head(.)
#>                           order_id payment_sequential payment_type
#> 1 b81ef226f3fe1789b1e8b2acac839d17                  1  credit_card
#> 2 a9810da82917af2d9aefd1278f1dcfa0                  1  credit_card
#> 3 25e8ea4e93396b6fa0d3dd708e76c1bd                  1  credit_card
#> 4 ba78997921bbcdc1373bb41e913ab953                  1  credit_card
#> 5 42fdf880ba16b47b59251dd489d4441a                  1  credit_card
#> 6 298fcdf1f73eb413e4d26d01b25bc1cd                  1  credit_card
#>   payment_installments payment_value
#> 1                    8         99.33
#> 2                    1         24.39
#> 3                    1         65.71
#> 4                    8        107.78
#> 5                    2        128.45
#> 6                    2         96.12

Struktur data order payments

glimpse(raw_order_payments)
#> Rows: 103,886
#> Columns: 5
#> $ order_id             <chr> "b81ef226f3fe1789b1e8b2acac839d17", "a9810da82917~
#> $ payment_sequential   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ payment_type         <chr> "credit_card", "credit_card", "credit_card", "cre~
#> $ payment_installments <int> 8, 1, 1, 8, 2, 2, 1, 3, 6, 1, 8, 1, 1, 5, 4, 10, ~
#> $ payment_value        <dbl> 99.33, 24.39, 65.71, 107.78, 128.45, 96.12, 81.16~

# Test purpose
# raw_product %>% filter(product_id == 'a41e356c76fab66334f36de622ecbd3a')
# colSums(is.na(raw_product))
# 
# raw_product %>% filter(product_id == 'a41e356c76fab66334f36de622ecbd3a')
# raw_product %>% filter(product_category_name == "")

3 Data Wrangling

Pada tahapan ini saya akan melakukan transformasi data yang diperlukan agar mudah untuk diolah seperti merubah tipe data dan rename kolom

# factorize column `customer_city` and remove unnecessary columns
clean_customer <- raw_customer %>% 
  select(-c(customer_zip_code_prefix, customer_state)) %>% 
  mutate(customer_city = as.factor(customer_city))

# factorize column `seller_city` and remove unnecessary columns
clean_seller <- raw_seller %>% 
  select(-c(seller_zip_code_prefix, seller_state)) %>%  
  mutate(seller_city = as.factor(seller_city))

# factorize column 
clean_orderpayments <- raw_order_payments %>% 
  mutate(payment_type = as.factor(payment_type))

# rename column `ï..product_category_name` to `product_category_name`
raw_product_cat <- raw_product_cat %>% 
  rename(product_category_name = ï..product_category_name)

#JOIN `product` & 'product_category' to get fetch product category name in English for `product` dataframe
clean_product <- raw_product %>% 
  select(c(product_id, product_category_name)) %>% 
  left_join(raw_product_cat, by = "product_category_name") %>%
# factorize column `product_category_name` & `product_category_name_english`
  mutate(product_category_name = as.factor(product_category_name),
         product_category_name_english = as.factor(product_category_name_english))
  
# # 74 unique values --> set as factor
# clean_product %>% 
#   select(product_category_name_english) %>% 
#   summarise_all(~n_distinct(.)) %>% 
#   pivot_longer(., everything(), names_to = "columns", values_to = "count_unique_values")

# Set columns related to date and time info to DateTime format + set `order_status` as factor
clean_orders <- raw_orders %>% 
  mutate(order_purchase_timestamp = ymd_hms(order_purchase_timestamp,tz="America/Sao_Paulo"),
                       order_approved_at = ymd_hms(order_approved_at, tz="America/Sao_Paulo"),
                       order_delivered_carrier_date = ymd_hms(order_delivered_carrier_date, tz="America/Sao_Paulo"),
                       order_delivered_customer_date = ymd_hms(order_delivered_customer_date, tz="America/Sao_Paulo"),
                       order_estimated_delivery_date = ymd_hms(order_estimated_delivery_date, tz="America/Sao_Paulo"),
         order_status = as.factor(order_status))

# Set columns related to date and time info to DateTime format
clean_order_items <- raw_order_items %>% 
  mutate(shipping_limit_date = ymd_hms(shipping_limit_date, tz="America/Sao_Paulo"))

check if any missing value :

  • customer
colSums(is.na(clean_customer))
#>        customer_id customer_unique_id      customer_city 
#>                  0                  0                  0
  • seller
colSums(is.na(clean_seller))
#>   seller_id seller_city 
#>           0           0
  • product
colSums(is.na(clean_product))
#>                    product_id         product_category_name 
#>                             0                             0 
#> product_category_name_english 
#>                           623
  • Orders
colSums(is.na(clean_orders))
#>                      order_id                   customer_id 
#>                             0                             0 
#>                  order_status      order_purchase_timestamp 
#>                             0                             0 
#>             order_approved_at  order_delivered_carrier_date 
#>                           160                          1783 
#> order_delivered_customer_date order_estimated_delivery_date 
#>                          2965                             0
  • Order Items
colSums(is.na(clean_order_items))
#>            order_id       order_item_id          product_id           seller_id 
#>                   0                   0                   0                   0 
#> shipping_limit_date               price       freight_value 
#>                   0                   0                   0
  • Order Payments
colSums(is.na(clean_orderpayments))
#>             order_id   payment_sequential         payment_type 
#>                    0                    0                    0 
#> payment_installments        payment_value 
#>                    0                    0

Check if any duplicates :

  • Customer
data.frame(
  data = length(clean_customer$customer_id),
  unique_data = length(unique(clean_customer$customer_id))
)
#>    data unique_data
#> 1 99441       99441
  • Seller
data.frame(
  data = length(clean_seller$seller_id),
  unique_data = length(unique(clean_seller$seller_id))
)
#>   data unique_data
#> 1 3095        3095
  • Product
data.frame(
  data = length(clean_product$product_id),
  unique_data = length(unique(clean_product$product_id))
)
#>    data unique_data
#> 1 32951       32951
  • Order
data.frame(
  data = length(clean_orders$order_id),
  unique_data = length(unique(clean_orders$order_id))
)
#>    data unique_data
#> 1 99441       99441

Tidak terdapat missing value pada data customer dan seller. Tetapi pada data product dan orders terdapat beberapa missing value. Kita telaah case missing value satu-persatu, mulai dari data product.

3.1 inspect missing value product

Berikut sample data product yang memiliki missing value pada kolom product_category_name_english.

#xplore missing value english based on prod cat name
clean_product %>% filter(is.na(product_category_name_english)) %>% 
  group_by(product_category_name) %>% count(.)
#> # A tibble: 3 x 2
#> # Groups:   product_category_name [3]
#>   product_category_name                               n
#>   <fct>                                           <int>
#> 1 ""                                                610
#> 2 "pc_gamer"                                          3
#> 3 "portateis_cozinha_e_preparadores_de_alimentos"    10

Terdapat tiga nilai product_category_name yang tidak memiliki translasi bahasa Inggris. Salah satu nilai product_category_name ternyata berupa string kosong.

Selain itu terdapat 623 rows data product.product_category_name yang memiliki missing value. Data kolom product_category_name_english NULL karena key dari kolom product_category_name berisi string kosong atau key tidak ada di product category.

Karena case missing value ini kecil, maka semua nilai product_category_name_english dan product_category_name akan diisi dengan nilai undefined.

# temporarily return factor data type to character
clean_product <- clean_product %>% 
  mutate(product_category_name = as.character(product_category_name),
         product_category_name_english = as.character(product_category_name_english)) %>%

  # replace any NULL value in prior columns to "undefined"
  mutate(product_category_name_english = replace_na(product_category_name_english,"undefined"))


replace_to_undefined <- function(x){
    if(x == "portateis_cozinha_e_preparadores_de_alimentos" | x == "pc_gamer" | x == ""){
      x <- "undefined"
    }  
}

clean_product$product_category_name_english[clean_product$product_category_name == "portateis_cozinha_e_preparadores_de_alimentos" | clean_product$product_category_name == "pc_gamer" | clean_product$product_category_name == ""] <- "undefined"


# return character data type to factor
clean_product <- clean_product %>% 
  mutate(product_category_name = as.factor(product_category_name),
         product_category_name_english = as.factor(product_category_name_english))
  • Cek kembali hasil transformasi:
colSums(is.na(clean_product))
#>                    product_id         product_category_name 
#>                             0                             0 
#> product_category_name_english 
#>                             0

sudah tidak terdapat Missing Value.


3.2 inspect missing value orders

Untuk data orders kita cek case missing value pada kolom order_approved_at berdasarkan nilai order_status

clean_orders %>% 
  filter(is.na(order_approved_at)) %>%
  group_by(order_status) %>% 
  count(.)
#> # A tibble: 3 x 2
#> # Groups:   order_status [3]
#>   order_status     n
#>   <fct>        <int>
#> 1 canceled       141
#> 2 created          5
#> 3 delivered       14

Dari pola data diatas bisa disimpulkan bahwa nilai NA di kolom order_approved_at terjadi saat order_status belum approved.

clean_orders %>% 
  filter(is.na(order_delivered_carrier_date)) %>%
  group_by(order_status) %>% 
  count(.)
#> # A tibble: 7 x 2
#> # Groups:   order_status [7]
#>   order_status     n
#>   <fct>        <int>
#> 1 approved         2
#> 2 canceled       550
#> 3 created          5
#> 4 delivered        2
#> 5 invoiced       314
#> 6 processing     301
#> 7 unavailable    609

Dari pola data diatas bisa disimpulkan bahwa nilai NA di kolom order_delivered_carrier_date terjadi saat order_status belum dikirim oleh penjual ke customer (order_status != “Shipped”)

clean_orders %>% 
  filter(is.na(order_delivered_customer_date)) %>%
  group_by(order_status) %>% 
  count(.)
#> # A tibble: 8 x 2
#> # Groups:   order_status [8]
#>   order_status     n
#>   <fct>        <int>
#> 1 approved         2
#> 2 canceled       619
#> 3 created          5
#> 4 delivered        8
#> 5 invoiced       314
#> 6 processing     301
#> 7 shipped       1107
#> 8 unavailable    609

Case tersebut terjadi di semua nilai order_status, namun jika membaca penjelasan di deskripsi terkait kolom tersebut kemungkinan terjadi karena beberapa transaksi dimana barang sudah diterima customer namun tanggal aktual barang diterima customer tidak terekam/update ke sistem.

  • save clean data

4 Exploratory Data Analysis

<!-- * General Metrics & Orders breakdown: Order Growth, Order Status Summary per Year -->
<!-- * Customers: MAU, Customer per state (Customer per state & Order Volume per Cust State) -->
<!-- * Products: Top Products Category, Top Product -->

4.1 Summary Number of Customers and Orders

sum_total_sales <- clean_orderpayments %>% 
  left_join(clean_orders, by = "order_id") %>% 
  filter(!order_status %in% c("canceled","unavailable")) %>%
  summarise(total_sales = sum(payment_value)) 

sum_cust_order <- clean_order_items %>%
  left_join(clean_orders, by="order_id") %>% 
  left_join(clean_product, by="product_id") %>% 
  left_join(clean_orderpayments, by="order_id", "order_item_id = payment_sequential") %>% 
  filter(!order_status %in% c("canceled","unavailable")) %>%
  summarise(total_unique_customer = length(unique(customer_id)),
            total_unique_order = length(unique(order_id)))

summary_trx <- cbind(sum_total_sales, sum_cust_order)
summary_trx <- summary_trx %>% mutate(sales_per_customer = total_sales / total_unique_customer,
                       sales_per_order = total_sales / total_unique_order)


kable(summary_trx)
total_sales total_unique_customer total_unique_order sales_per_customer sales_per_order
15739137 98199 98199 160.278 160.278

Terdapat 98,199 transaksi jual-beli yang berlangsung di Olist selama kurun periode 2016-2018 dengan volume transaksi diperkirakan mencapai R$15,739,137. Jika nilai transaksi dihitung per customer maka tiap customer rata-rata menghabiskan 160 Dollar Brazil. Nilai yang sama juga diperoleh untuk nilai rata-rata per transaksinya.

4.2 Order Growth

– In Progress


4.3 Highest Sales per Product Category

## Product Category join order join order_item
df_orditem_prod <- clean_order_items %>% 
  left_join(clean_orders, by="order_id") %>% 
  left_join(clean_product, by="product_id") %>% 
  filter(!order_status %in% c("canceled","unavailable"))

#_of_Orders
total_orders <- df_orditem_prod %>%
  filter(!order_status %in% c("canceled","unavailable")) %>% 
  group_by(product_category_name_english) %>% 
  summarise(total_orders = length(unique(order_id))) %>% 
  arrange(desc(total_orders))

#_of_products
total_products <- clean_product %>% 
  left_join(clean_order_items, by="product_id") %>% 
  left_join(clean_orders, by="order_id") %>% 
  filter(!order_status %in% c("canceled","unavailable")) %>% 
  group_by(product_category_name_english) %>% 
  summarise(total_products = length(unique(product_id))) %>% 
  arrange(desc(total_products))

#_of_sellers
total_sellers <- df_orditem_prod %>%
  filter(!order_status %in% c("canceled","unavailable")) %>% 
  group_by(product_category_name_english) %>% 
  summarise(total_sellers = length(unique(seller_id))) %>% 
  arrange(desc(total_sellers))

summary_orderprodcat <- total_orders %>%
  left_join(total_products, by="product_category_name_english") %>%
  left_join(total_sellers, by="product_category_name_english") %>% 
  mutate(
    OrderperNbrProd = round(total_orders/total_products,2),
    popup = glue("Product Category : {product_category_name_english},
    Total Orders : {total_orders}, 
    Total Products : {total_products},
    Total Sellers : {total_sellers},
    Ratio #order per #product : {OrderperNbrProd} "
    )
  )

plot_summary_orderprodcat <- ggplot(summary_orderprodcat,aes(x=total_products, y=total_orders)) +
  geom_point(aes(text=popup, col=product_category_name_english, size=OrderperNbrProd),
              # stat = "identity",
              alpha=0.6, shape="circle",
              show.legend = FALSE) +
  theme(legend.position = "none") +
  labs(
    title = "Sales by Product Category",
    x = "Total Products",
    y = "Total Orders"
  )
ggplotly(plot_summary_orderprodcat, tooltip = "text") %>% 
  layout(title)
# colnames(clean_order_items)
# colnames(clean_product)
# colnames(raw_product)
# colnames(raw_product_cat)

Join Order Items dengan Products

df_orditem_prod <-  clean_order_items %>%
  left_join(clean_orders, by= "order_id") %>% 
  filter(!order_status %in% c("canceled","unavailable")) %>% 
  left_join(clean_product, by = "product_id")
plot_top10_prod_cat <- df_orditem_prod %>%
  mutate(year=year(order_purchase_timestamp)) %>% 
  filter(!order_status %in% c("canceled","unavailable")) %>% 
  group_by(product_category_name_english,year) %>% 
  summarise(count = n()) %>% 
  arrange(year,desc(count)) %>% 
  group_by(year) %>% 
  slice(1:10) %>% 
  ungroup() %>% 
  mutate(
    product_category_name = gsub("_"," ",product_category_name_english),
    popup = glue("Year : {year}
                 Product Category : {product_category_name_english}
                 Ordered : {count}"
    ),
    #product_category_name = str_sub(product_category_name,1,10),
    product_category_name = reorder(product_category_name, count)
    
  ) %>% 
  #filter(year==2017) %>% 
  ggplot(aes(x=product_category_name,y=as.integer(count), 
             group=year, fill=product_category_name)) +
  geom_bar(stat = "identity", show.legend = FALSE, aes(text=popup))+
  labs(
    title = "Top 10 Product Category per Year",
    y="Order Quantity",
    x=""
  )+
  coord_flip()+
  facet_grid(~year, scales = "free")

ggplotly(plot_top10_prod_cat, tooltip="text") %>%
layout(showlegend=FALSE,
       margin = list(l = 50, r = 30, b = 50, t = 90)) %>% 
layout(title = list(text = paste0('Top 10 Product Category per Year',
                                    '<br>',
                                    '<sup>',
                                    'Best Product based on Order Quantity from 2016-2018',
                                    '</sup>')))

Kedua chart diatas menginformasikan 10 jenis produk yang paling sering diorder di Olist dalam kurun periode 2016-2018. Kategori produk yang konsisten masuk kategori top 10 product category dari 2016-2018 adalah Health Beauty, Funiture Decor, Sports Leisure, Computers Accessories dan Housewares. Jika dilihat secara keseluruhan, kategori produk Bed Bath Table merupakan kategori produk yang paling banyak diorder, dimana trend mengalami peningkatan mulai dari tahun 2017 sampai 2018. Product Categories computer accesories meskipun jumlah Product hanya di peringkat ke 7 terbanyak namun total Order nya bisa mengalahkan total Order dari Product yang memiliki jumlah Product yang lebih banyak darinya seperti furniture decor, houseware, auto. Terlihat dari nilai ratio Total Order per Total Product nilai nya mengungguli product category lainnya.

4.4 Highest Customer per city

– in progress


4.5 Payment Type

plot_paymenttype_freq <- clean_orderpayments %>% 
  select(payment_type) %>%
  count(payment_type) %>%
  mutate(percentage = round((n/sum(n))*100,2),
         popup=paste("Payment Type :", payment_type,"<br>",
           "Frequencies :", label=comma(n),
           "Percentage : ", percentage, '%'
          )
         ) %>% 
  ggplot(aes(x=reorder(payment_type,-n), y=n)) +
  geom_col(aes(fill=payment_type, text=popup), show.legend = F)+
  labs(title = "Highest Payment Type Use in Olist", subtitle = "per transaction frequencies from 2016-2018",x = NULL,y = NULL)

ggplotly(plot_paymenttype_freq, tooltip="text") %>% 
  layout(showlegend=FALSE,
         title = list(text = paste0("Highest Payment Type Use in Olist",
                                    "<br>",
                                    "<sup>",
                                    "by transaction frequencies from 2016-2018",
                                    "</sup>")
                      )
         )
plot_paymenttype_value <- clean_orderpayments %>% 
  group_by(payment_type) %>% 
  summarise(values=sum(payment_value)) %>% 
  mutate(percentage = round((values/sum(values))*100,2),
         popup=paste("Payment Type :", payment_type,"<br>",
                     "Values :", "R$",label=comma(values)
            )
         ) %>% 
  ggplot(aes(x=reorder(payment_type,-values), y=values)) +
  geom_col(aes(fill=payment_type, text=popup), show.legend = F)+
  labs(title = "Highest Payment Type in Olist", subtitle = "by transaction Volume from 2016-2018",x = NULL,y = NULL)

ggplotly(plot_paymenttype_value, tooltip="text") %>% 
  layout(showlegend=FALSE,
         title = list(text = paste0("Highest Payment Type in Olist",
                                    "<br>",
                                    "<sup>",
                                    "by transaction Volume from 2016-2018",
                                    "</sup>")
                      )
         )

Berdasarkan metode pembayarannya, terbanyak menggunakan Credit Card sebanyak 73% dari keseluruhan transaksi di Olist.


5 Conclusion & Recommendation

Dari hasil eksplorasi diatas kita mendapat gambaran umum terkait transaksi yang terjadi di Olist pada kurun periode 2016-2018, dimana secara overall trend transaksi mengalami peningkatan dari tahun ke tahun. Untuk meningkatkan convertion rate Olist mungkin bisa mengintensifkan strategi promosi ke wilayah yang memiliki basis customer terbesar seperti di wilayah ___. Strategi promo juga bisa dilakukan dengan vendor credit card untuk meningkatkan convertion rate customer Olist dengan metode pembayaran menggunakan credit card.