E-Commerce Data Exploration
Status: In Progress
1 Introduction
Olist merupakan salah satu platform e-commerce terbesar di Brazil. Berdiri sejak tahun 2015, Olist merupakan marketplace platform terintegrasi yang mengkhususkan layanan pada bidang logistic dan permodalan.
Kali ini saya akan menganalisa transaksi penjualan di marketplace Olist pada periode 2016 hingga 2018. Sumber daya dan penjelasan lengkap terkait data dapat dilihat pada tautan Kaggle: Olist dataset
2 Data Inspection
Skema data yang disediakan seperti gambar berikut.
Dari sekian ragam data, data review orders tidak saya gunakan.
2.1 Import Library and dataset
- import library
# for wrangling data
library(tidyr)
library(dplyr)
library(readr)
library(lubridate)
library(zoo)
library(glue)
library(scales)
# for visualize data
library(ggplot2)
library(plotly)
library(knitr)
options(scipen = 999)- load dataset
#master data
raw_customer <- read.csv("dataset/olist_customers_dataset.csv")
raw_seller <- read.csv("dataset/olist_sellers_dataset.csv")
raw_product <- read.csv("dataset/olist_products_dataset.csv")
raw_product_cat <- read.csv("dataset/product_category_name_translation.csv")
raw_geoloc <- read.csv("dataset/olist_geolocation_dataset.csv")
#trx data
raw_orders <- read.csv("dataset/olist_orders_dataset.csv")
raw_order_items <- read.csv("dataset/olist_order_items_dataset.csv")
raw_order_payments <- read.csv("dataset/olist_order_payments_dataset.csv")2.2 Customers
Berikut penjelasan terkait data customers yang akan digunakan:
| Column Name | Description |
|---|---|
| customer_id | key to the orders dataset. Each order has a unique customer_id |
| customer_unique_id | unique identifier of a customer. |
| customer_city | customer city name |
Berikut sample data dari customers
raw_customer %>%
select(-c(customer_zip_code_prefix, customer_state)) %>%
head(.)#> customer_id customer_unique_id
#> 1 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0
#> 2 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3
#> 3 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e
#> 4 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c
#> 5 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066
#> 6 879864dab9bc3047522c92c82e1212b8 4c93744516667ad3b8f1fb645a3116a4
#> customer_city
#> 1 franca
#> 2 sao bernardo do campo
#> 3 sao paulo
#> 4 mogi das cruzes
#> 5 campinas
#> 6 jaragua do sul
Struktur data customer:
glimpse(raw_customer)#> Rows: 99,441
#> Columns: 5
#> $ customer_id <chr> "06b8999e2fba1a1fbc88172c00ba8bc7", "18955e83~
#> $ customer_unique_id <chr> "861eff4711a542e4b93843c6dd7febb0", "290c77bc~
#> $ customer_zip_code_prefix <int> 14409, 9790, 1151, 8775, 13056, 89254, 4534, ~
#> $ customer_city <chr> "franca", "sao bernardo do campo", "sao paulo~
#> $ customer_state <chr> "SP", "SP", "SP", "SP", "SP", "SC", "SP", "MG~
2.3 Sellers
Berikut penjelasan terkait data sellers yang akan digunakan:
| Column Name | Description |
|---|---|
| seller_id | seller unique identifier |
| seller_city | seller city name |
Berikut sample data dari sellers:
raw_seller %>%
select(!c(seller_zip_code_prefix, seller_state)) %>%
head(.)#> seller_id seller_city
#> 1 3442f8959a84dea7ee197c632cb2df15 campinas
#> 2 d1b65fc7debc3361ea86b5f14c68d2e2 mogi guacu
#> 3 ce3ad9de960102d0677a81f5d0bb7b2d rio de janeiro
#> 4 c0f3eea2e14555b6faeea3dd58c1b1c3 sao paulo
#> 5 51a04a8a6bdcb23deccc82b0b80742cf braganca paulista
#> 6 c240c4061717ac1806ae6ee72be3533b rio de janeiro
Struktur data sellers
glimpse(raw_seller)#> Rows: 3,095
#> Columns: 4
#> $ seller_id <chr> "3442f8959a84dea7ee197c632cb2df15", "d1b65fc7de~
#> $ seller_zip_code_prefix <int> 13023, 13844, 20031, 4195, 12914, 20920, 55325,~
#> $ seller_city <chr> "campinas", "mogi guacu", "rio de janeiro", "sa~
#> $ seller_state <chr> "SP", "SP", "RJ", "SP", "SP", "RJ", "PE", "SP",~
2.4 Products
Berikut penjelasan terkait data products yang akan digunakan:
| Column Name | Description |
|---|---|
| product_id | unique product identifier |
| product_category_name | root category of product (in Portuguese) |
| product_name_length | number of characters extracted from the product name |
| product_description_length | number of characters extracted from the product description. |
| product_photos_qty | number of product published photos |
| product_weight_g | product weight measured in grams |
| product_length_cm | product length measured in centimeters |
| product_height_cm | product height measured in centimeters |
| product_width_cm | product width measured in centimeters |
Berikut sample data dari products:
raw_product %>%
head(.)#> product_id product_category_name product_name_lenght
#> 1 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40
#> 2 3aa071139cb16b67ca9e5dea641aaa2f artes 44
#> 3 96bd76ec8810374ed1b65e291975717f esporte_lazer 46
#> 4 cef67bcfe19066a932b7673e239eb23d bebes 27
#> 5 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37
#> 6 41d3672d4792049fa1779bb35283ed13 instrumentos_musicais 60
#> product_description_lenght product_photos_qty product_weight_g
#> 1 287 1 225
#> 2 276 1 1000
#> 3 250 1 154
#> 4 261 1 371
#> 5 402 4 625
#> 6 745 1 200
#> product_length_cm product_height_cm product_width_cm
#> 1 16 10 14
#> 2 30 18 20
#> 3 18 9 15
#> 4 26 4 26
#> 5 20 17 13
#> 6 38 5 11
Struktur data products :
glimpse(raw_product)#> Rows: 32,951
#> Columns: 9
#> $ product_id <chr> "1e9e8ef04dbcff4541ed26657ea517e5", "3aa071~
#> $ product_category_name <chr> "perfumaria", "artes", "esporte_lazer", "be~
#> $ product_name_lenght <int> 40, 44, 46, 27, 37, 60, 56, 56, 57, 36, 54,~
#> $ product_description_lenght <int> 287, 276, 250, 261, 402, 745, 1272, 184, 16~
#> $ product_photos_qty <int> 1, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 4, 3, 2, 4~
#> $ product_weight_g <int> 225, 1000, 154, 371, 625, 200, 18350, 900, ~
#> $ product_length_cm <int> 16, 30, 18, 26, 20, 38, 70, 40, 27, 17, 16,~
#> $ product_height_cm <int> 10, 18, 9, 4, 17, 5, 24, 8, 13, 10, 10, 19,~
#> $ product_width_cm <int> 14, 20, 15, 26, 13, 11, 44, 40, 17, 12, 16,~
Sedangkan translasi product category dalam bahasa inggris berada pada product_category.
raw_product_cat %>% head(.)#> ï..product_category_name product_category_name_english
#> 1 beleza_saude health_beauty
#> 2 informatica_acessorios computers_accessories
#> 3 automotivo auto
#> 4 cama_mesa_banho bed_bath_table
#> 5 moveis_decoracao furniture_decor
#> 6 esporte_lazer sports_leisure
2.5 Order
Data transaksi penjualan tersimpan pada dataframe orders dan order_items, dimana orders menyimpan ringkasan transaksi seperti status pemesanan, kode customer, waktu pemesanan, dan lain-lain.
Berikut penjelasan terkait data orders yang akan digunakan:
| Column Name | Description |
|---|---|
| order_id | unique identifier of the order. |
| customer_id | key to the customer dataset. Each order has a unique customer_id |
| order_status | Reference to the order status (created, processing, shipped, delivered, approved, invoiced, canceled, and unavailable). |
| order_purchase_timestamp | Shows the purchase timestamp. |
| order_approved_at | Shows the payment approval timestamp. |
| order_delivered_carrier_date | Shows the order posting timestamp. When it was handled to the logistic partner. |
| order_delivered_customer_date | Shows the actual order delivery date to the customer. |
| order_estimated_delivery_date | Shows the estimated delivery date that was informed to customer at the purchase moment. |
Berikut sample data dari orders:
raw_orders %>% head(.)#> order_id customer_id
#> 1 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d
#> 2 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef
#> 3 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089
#> 4 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82
#> 5 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c
#> 6 a4591c265e18cb1dcee52889e2d8acc3 503740e9ca751ccdda7ba28e9ab8f608
#> order_status order_purchase_timestamp order_approved_at
#> 1 delivered 2017-10-02 10:56:33 2017-10-02 11:07:15
#> 2 delivered 2018-07-24 20:41:37 2018-07-26 03:24:27
#> 3 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23
#> 4 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59
#> 5 delivered 2018-02-13 21:18:39 2018-02-13 22:20:29
#> 6 delivered 2017-07-09 21:57:05 2017-07-09 22:10:13
#> order_delivered_carrier_date order_delivered_customer_date
#> 1 2017-10-04 19:55:00 2017-10-10 21:25:13
#> 2 2018-07-26 14:31:00 2018-08-07 15:27:45
#> 3 2018-08-08 13:50:00 2018-08-17 18:06:29
#> 4 2017-11-22 13:39:59 2017-12-02 00:28:42
#> 5 2018-02-14 19:46:34 2018-02-16 18:17:02
#> 6 2017-07-11 14:58:04 2017-07-26 10:57:55
#> order_estimated_delivery_date
#> 1 2017-10-18 00:00:00
#> 2 2018-08-13 00:00:00
#> 3 2018-09-04 00:00:00
#> 4 2017-12-15 00:00:00
#> 5 2018-02-26 00:00:00
#> 6 2017-08-01 00:00:00
Struktur data orders
glimpse(raw_orders)#> Rows: 99,441
#> Columns: 8
#> $ order_id <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c~
#> $ customer_id <chr> "9ef432eb6251297304e76186b10a928d", "b08~
#> $ order_status <chr> "delivered", "delivered", "delivered", "~
#> $ order_purchase_timestamp <chr> "2017-10-02 10:56:33", "2018-07-24 20:41~
#> $ order_approved_at <chr> "2017-10-02 11:07:15", "2018-07-26 03:24~
#> $ order_delivered_carrier_date <chr> "2017-10-04 19:55:00", "2018-07-26 14:31~
#> $ order_delivered_customer_date <chr> "2017-10-10 21:25:13", "2018-08-07 15:27~
#> $ order_estimated_delivery_date <chr> "2017-10-18 00:00:00", "2018-08-13 00:00~
Sedangkan detail transaksi seperti item yang dibeli dan penjual berada pada order_items.
Berikut penjelasan terkait data order_items yang akan digunakan:
| Column Name | Description |
|---|---|
| order_id | unique identifier of the order. |
| order_item_id | sequential number identifying number of items included in the same order. |
| product_id | products unique identifier |
| seller_id | seller unique identifier |
| shipping_limit_date | Shows the seller shipping limit date for handling the order over to the logistic partner |
| price | item price |
| freight_value | item freight value item (if an order has more than one item the freight value is splitted between items) |
Berikut sample data dari order_items:
raw_order_items %>% head(.)#> order_id order_item_id
#> 1 00010242fe8c5a6d1ba2dd792cb16214 1
#> 2 00018f77f2f0320c557190d7a144bdd3 1
#> 3 000229ec398224ef6ca0657da4fc703e 1
#> 4 00024acbcdf0a6daa1e931b038114c75 1
#> 5 00042b26cf59d7ce69dfabb4e55b4fd9 1
#> 6 00048cc3ae777c65dbb7d2a0634bc1ea 1
#> product_id seller_id
#> 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202
#> 2 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36
#> 3 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d
#> 4 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4
#> 5 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87
#> 6 ef92defde845ab8450f9d70c526ef70f 6426d21aca402a131fc0a5d0960a3c90
#> shipping_limit_date price freight_value
#> 1 2017-09-19 09:45:35 58.90 13.29
#> 2 2017-05-03 11:05:13 239.90 19.93
#> 3 2018-01-18 14:48:30 199.00 17.87
#> 4 2018-08-15 10:10:18 12.99 12.79
#> 5 2017-02-13 13:57:51 199.90 18.14
#> 6 2017-05-23 03:55:27 21.90 12.69
Struktur data order_items
glimpse(raw_order_items)#> Rows: 112,650
#> Columns: 7
#> $ order_id <chr> "00010242fe8c5a6d1ba2dd792cb16214", "00018f77f2f03~
#> $ order_item_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,~
#> $ product_id <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f2d52b80218~
#> $ seller_id <chr> "48436dade18ac8b2bce089ec2a041202", "dd7ddc04e1b6c~
#> $ shipping_limit_date <chr> "2017-09-19 09:45:35", "2017-05-03 11:05:13", "201~
#> $ price <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21.90, 19.90~
#> $ freight_value <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69, 11.85, 7~
2.6 Order Payments
Data detail terkait nilai transaksi pembayaran, metode pembayaran, dan jenis pembayaran tersimpan di order payments
Berikut penjelasan terkait data order payments yang akan digunakan:
colnames(raw_order_payments)#> [1] "order_id" "payment_sequential" "payment_type"
#> [4] "payment_installments" "payment_value"
| Column Name | Description |
|---|---|
| order_id | unique identifier of the order. |
| payment_sequential | a customer may pay an order with more than one payment method. If he does so, a sequence will be created to |
| payment_type | method of payment chosen by the customer. |
| payment_installments | number of installments chosen by the customer. |
| payment_value | transaction value. |
Berikut sample data dari order payments:
raw_order_payments %>% head(.)#> order_id payment_sequential payment_type
#> 1 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card
#> 2 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card
#> 3 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card
#> 4 ba78997921bbcdc1373bb41e913ab953 1 credit_card
#> 5 42fdf880ba16b47b59251dd489d4441a 1 credit_card
#> 6 298fcdf1f73eb413e4d26d01b25bc1cd 1 credit_card
#> payment_installments payment_value
#> 1 8 99.33
#> 2 1 24.39
#> 3 1 65.71
#> 4 8 107.78
#> 5 2 128.45
#> 6 2 96.12
Struktur data order payments
glimpse(raw_order_payments)#> Rows: 103,886
#> Columns: 5
#> $ order_id <chr> "b81ef226f3fe1789b1e8b2acac839d17", "a9810da82917~
#> $ payment_sequential <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ payment_type <chr> "credit_card", "credit_card", "credit_card", "cre~
#> $ payment_installments <int> 8, 1, 1, 8, 2, 2, 1, 3, 6, 1, 8, 1, 1, 5, 4, 10, ~
#> $ payment_value <dbl> 99.33, 24.39, 65.71, 107.78, 128.45, 96.12, 81.16~
# Test purpose
# raw_product %>% filter(product_id == 'a41e356c76fab66334f36de622ecbd3a')
# colSums(is.na(raw_product))
#
# raw_product %>% filter(product_id == 'a41e356c76fab66334f36de622ecbd3a')
# raw_product %>% filter(product_category_name == "")3 Data Wrangling
Pada tahapan ini saya akan melakukan transformasi data yang diperlukan agar mudah untuk diolah seperti merubah tipe data dan rename kolom
# factorize column `customer_city` and remove unnecessary columns
clean_customer <- raw_customer %>%
select(-c(customer_zip_code_prefix, customer_state)) %>%
mutate(customer_city = as.factor(customer_city))
# factorize column `seller_city` and remove unnecessary columns
clean_seller <- raw_seller %>%
select(-c(seller_zip_code_prefix, seller_state)) %>%
mutate(seller_city = as.factor(seller_city))
# factorize column
clean_orderpayments <- raw_order_payments %>%
mutate(payment_type = as.factor(payment_type))
# rename column `ï..product_category_name` to `product_category_name`
raw_product_cat <- raw_product_cat %>%
rename(product_category_name = ï..product_category_name)
#JOIN `product` & 'product_category' to get fetch product category name in English for `product` dataframe
clean_product <- raw_product %>%
select(c(product_id, product_category_name)) %>%
left_join(raw_product_cat, by = "product_category_name") %>%
# factorize column `product_category_name` & `product_category_name_english`
mutate(product_category_name = as.factor(product_category_name),
product_category_name_english = as.factor(product_category_name_english))
# # 74 unique values --> set as factor
# clean_product %>%
# select(product_category_name_english) %>%
# summarise_all(~n_distinct(.)) %>%
# pivot_longer(., everything(), names_to = "columns", values_to = "count_unique_values")
# Set columns related to date and time info to DateTime format + set `order_status` as factor
clean_orders <- raw_orders %>%
mutate(order_purchase_timestamp = ymd_hms(order_purchase_timestamp,tz="America/Sao_Paulo"),
order_approved_at = ymd_hms(order_approved_at, tz="America/Sao_Paulo"),
order_delivered_carrier_date = ymd_hms(order_delivered_carrier_date, tz="America/Sao_Paulo"),
order_delivered_customer_date = ymd_hms(order_delivered_customer_date, tz="America/Sao_Paulo"),
order_estimated_delivery_date = ymd_hms(order_estimated_delivery_date, tz="America/Sao_Paulo"),
order_status = as.factor(order_status))
# Set columns related to date and time info to DateTime format
clean_order_items <- raw_order_items %>%
mutate(shipping_limit_date = ymd_hms(shipping_limit_date, tz="America/Sao_Paulo"))check if any missing value :
customer
colSums(is.na(clean_customer))#> customer_id customer_unique_id customer_city
#> 0 0 0
seller
colSums(is.na(clean_seller))#> seller_id seller_city
#> 0 0
product
colSums(is.na(clean_product))#> product_id product_category_name
#> 0 0
#> product_category_name_english
#> 623
Orders
colSums(is.na(clean_orders))#> order_id customer_id
#> 0 0
#> order_status order_purchase_timestamp
#> 0 0
#> order_approved_at order_delivered_carrier_date
#> 160 1783
#> order_delivered_customer_date order_estimated_delivery_date
#> 2965 0
Order Items
colSums(is.na(clean_order_items))#> order_id order_item_id product_id seller_id
#> 0 0 0 0
#> shipping_limit_date price freight_value
#> 0 0 0
Order Payments
colSums(is.na(clean_orderpayments))#> order_id payment_sequential payment_type
#> 0 0 0
#> payment_installments payment_value
#> 0 0
Check if any duplicates :
Customer
data.frame(
data = length(clean_customer$customer_id),
unique_data = length(unique(clean_customer$customer_id))
)#> data unique_data
#> 1 99441 99441
Seller
data.frame(
data = length(clean_seller$seller_id),
unique_data = length(unique(clean_seller$seller_id))
)#> data unique_data
#> 1 3095 3095
Product
data.frame(
data = length(clean_product$product_id),
unique_data = length(unique(clean_product$product_id))
)#> data unique_data
#> 1 32951 32951
Order
data.frame(
data = length(clean_orders$order_id),
unique_data = length(unique(clean_orders$order_id))
)#> data unique_data
#> 1 99441 99441
Tidak terdapat missing value pada data customer dan seller. Tetapi pada data product dan orders terdapat beberapa missing value. Kita telaah case missing value satu-persatu, mulai dari data product.
3.1 inspect missing value product
Berikut sample data product yang memiliki missing value pada kolom product_category_name_english.
#xplore missing value english based on prod cat name
clean_product %>% filter(is.na(product_category_name_english)) %>%
group_by(product_category_name) %>% count(.)#> # A tibble: 3 x 2
#> # Groups: product_category_name [3]
#> product_category_name n
#> <fct> <int>
#> 1 "" 610
#> 2 "pc_gamer" 3
#> 3 "portateis_cozinha_e_preparadores_de_alimentos" 10
Terdapat tiga nilai product_category_name yang tidak memiliki translasi bahasa Inggris. Salah satu nilai product_category_name ternyata berupa string kosong.
Selain itu terdapat 623 rows data product.product_category_name
yang memiliki missing value. Data kolom product_category_name_english NULL karena key dari kolom product_category_name berisi string kosong atau key tidak ada di product category.
Karena case missing value ini kecil, maka semua nilai product_category_name_english dan product_category_name akan diisi dengan nilai undefined.
# temporarily return factor data type to character
clean_product <- clean_product %>%
mutate(product_category_name = as.character(product_category_name),
product_category_name_english = as.character(product_category_name_english)) %>%
# replace any NULL value in prior columns to "undefined"
mutate(product_category_name_english = replace_na(product_category_name_english,"undefined"))
replace_to_undefined <- function(x){
if(x == "portateis_cozinha_e_preparadores_de_alimentos" | x == "pc_gamer" | x == ""){
x <- "undefined"
}
}
clean_product$product_category_name_english[clean_product$product_category_name == "portateis_cozinha_e_preparadores_de_alimentos" | clean_product$product_category_name == "pc_gamer" | clean_product$product_category_name == ""] <- "undefined"
# return character data type to factor
clean_product <- clean_product %>%
mutate(product_category_name = as.factor(product_category_name),
product_category_name_english = as.factor(product_category_name_english))- Cek kembali hasil transformasi:
colSums(is.na(clean_product))#> product_id product_category_name
#> 0 0
#> product_category_name_english
#> 0
sudah tidak terdapat Missing Value.
3.2 inspect missing value orders
Untuk data orders kita cek case missing value pada kolom order_approved_at berdasarkan nilai order_status
clean_orders %>%
filter(is.na(order_approved_at)) %>%
group_by(order_status) %>%
count(.)#> # A tibble: 3 x 2
#> # Groups: order_status [3]
#> order_status n
#> <fct> <int>
#> 1 canceled 141
#> 2 created 5
#> 3 delivered 14
Dari pola data diatas bisa disimpulkan bahwa nilai NA di kolom order_approved_at terjadi saat order_status belum approved.
clean_orders %>%
filter(is.na(order_delivered_carrier_date)) %>%
group_by(order_status) %>%
count(.)#> # A tibble: 7 x 2
#> # Groups: order_status [7]
#> order_status n
#> <fct> <int>
#> 1 approved 2
#> 2 canceled 550
#> 3 created 5
#> 4 delivered 2
#> 5 invoiced 314
#> 6 processing 301
#> 7 unavailable 609
Dari pola data diatas bisa disimpulkan bahwa nilai NA di kolom order_delivered_carrier_date terjadi saat order_status belum dikirim oleh penjual ke customer (order_status != “Shipped”)
clean_orders %>%
filter(is.na(order_delivered_customer_date)) %>%
group_by(order_status) %>%
count(.)#> # A tibble: 8 x 2
#> # Groups: order_status [8]
#> order_status n
#> <fct> <int>
#> 1 approved 2
#> 2 canceled 619
#> 3 created 5
#> 4 delivered 8
#> 5 invoiced 314
#> 6 processing 301
#> 7 shipped 1107
#> 8 unavailable 609
Case tersebut terjadi di semua nilai order_status, namun jika membaca penjelasan di deskripsi terkait kolom tersebut kemungkinan terjadi karena beberapa transaksi dimana barang sudah diterima customer namun tanggal aktual barang diterima customer tidak terekam/update ke sistem.
- save clean data
4 Exploratory Data Analysis
<!-- * General Metrics & Orders breakdown: Order Growth, Order Status Summary per Year -->
<!-- * Customers: MAU, Customer per state (Customer per state & Order Volume per Cust State) -->
<!-- * Products: Top Products Category, Top Product -->
4.1 Summary Number of Customers and Orders
sum_total_sales <- clean_orderpayments %>%
left_join(clean_orders, by = "order_id") %>%
filter(!order_status %in% c("canceled","unavailable")) %>%
summarise(total_sales = sum(payment_value))
sum_cust_order <- clean_order_items %>%
left_join(clean_orders, by="order_id") %>%
left_join(clean_product, by="product_id") %>%
left_join(clean_orderpayments, by="order_id", "order_item_id = payment_sequential") %>%
filter(!order_status %in% c("canceled","unavailable")) %>%
summarise(total_unique_customer = length(unique(customer_id)),
total_unique_order = length(unique(order_id)))
summary_trx <- cbind(sum_total_sales, sum_cust_order)
summary_trx <- summary_trx %>% mutate(sales_per_customer = total_sales / total_unique_customer,
sales_per_order = total_sales / total_unique_order)
kable(summary_trx)| total_sales | total_unique_customer | total_unique_order | sales_per_customer | sales_per_order |
|---|---|---|---|---|
| 15739137 | 98199 | 98199 | 160.278 | 160.278 |
Terdapat 98,199 transaksi jual-beli yang berlangsung di Olist selama kurun periode 2016-2018 dengan volume transaksi diperkirakan mencapai R$15,739,137. Jika nilai transaksi dihitung per customer maka tiap customer rata-rata menghabiskan 160 Dollar Brazil. Nilai yang sama juga diperoleh untuk nilai rata-rata per transaksinya.
4.2 Order Growth
– In Progress
4.3 Highest Sales per Product Category
## Product Category join order join order_item
df_orditem_prod <- clean_order_items %>%
left_join(clean_orders, by="order_id") %>%
left_join(clean_product, by="product_id") %>%
filter(!order_status %in% c("canceled","unavailable"))
#_of_Orders
total_orders <- df_orditem_prod %>%
filter(!order_status %in% c("canceled","unavailable")) %>%
group_by(product_category_name_english) %>%
summarise(total_orders = length(unique(order_id))) %>%
arrange(desc(total_orders))
#_of_products
total_products <- clean_product %>%
left_join(clean_order_items, by="product_id") %>%
left_join(clean_orders, by="order_id") %>%
filter(!order_status %in% c("canceled","unavailable")) %>%
group_by(product_category_name_english) %>%
summarise(total_products = length(unique(product_id))) %>%
arrange(desc(total_products))
#_of_sellers
total_sellers <- df_orditem_prod %>%
filter(!order_status %in% c("canceled","unavailable")) %>%
group_by(product_category_name_english) %>%
summarise(total_sellers = length(unique(seller_id))) %>%
arrange(desc(total_sellers))
summary_orderprodcat <- total_orders %>%
left_join(total_products, by="product_category_name_english") %>%
left_join(total_sellers, by="product_category_name_english") %>%
mutate(
OrderperNbrProd = round(total_orders/total_products,2),
popup = glue("Product Category : {product_category_name_english},
Total Orders : {total_orders},
Total Products : {total_products},
Total Sellers : {total_sellers},
Ratio #order per #product : {OrderperNbrProd} "
)
)
plot_summary_orderprodcat <- ggplot(summary_orderprodcat,aes(x=total_products, y=total_orders)) +
geom_point(aes(text=popup, col=product_category_name_english, size=OrderperNbrProd),
# stat = "identity",
alpha=0.6, shape="circle",
show.legend = FALSE) +
theme(legend.position = "none") +
labs(
title = "Sales by Product Category",
x = "Total Products",
y = "Total Orders"
)
ggplotly(plot_summary_orderprodcat, tooltip = "text") %>%
layout(title)# colnames(clean_order_items)
# colnames(clean_product)
# colnames(raw_product)
# colnames(raw_product_cat)Join Order Items dengan Products
df_orditem_prod <- clean_order_items %>%
left_join(clean_orders, by= "order_id") %>%
filter(!order_status %in% c("canceled","unavailable")) %>%
left_join(clean_product, by = "product_id")plot_top10_prod_cat <- df_orditem_prod %>%
mutate(year=year(order_purchase_timestamp)) %>%
filter(!order_status %in% c("canceled","unavailable")) %>%
group_by(product_category_name_english,year) %>%
summarise(count = n()) %>%
arrange(year,desc(count)) %>%
group_by(year) %>%
slice(1:10) %>%
ungroup() %>%
mutate(
product_category_name = gsub("_"," ",product_category_name_english),
popup = glue("Year : {year}
Product Category : {product_category_name_english}
Ordered : {count}"
),
#product_category_name = str_sub(product_category_name,1,10),
product_category_name = reorder(product_category_name, count)
) %>%
#filter(year==2017) %>%
ggplot(aes(x=product_category_name,y=as.integer(count),
group=year, fill=product_category_name)) +
geom_bar(stat = "identity", show.legend = FALSE, aes(text=popup))+
labs(
title = "Top 10 Product Category per Year",
y="Order Quantity",
x=""
)+
coord_flip()+
facet_grid(~year, scales = "free")
ggplotly(plot_top10_prod_cat, tooltip="text") %>%
layout(showlegend=FALSE,
margin = list(l = 50, r = 30, b = 50, t = 90)) %>%
layout(title = list(text = paste0('Top 10 Product Category per Year',
'<br>',
'<sup>',
'Best Product based on Order Quantity from 2016-2018',
'</sup>')))Kedua chart diatas menginformasikan 10 jenis produk yang paling sering diorder di Olist dalam kurun periode 2016-2018. Kategori produk yang konsisten masuk kategori top 10 product category dari 2016-2018 adalah Health Beauty, Funiture Decor, Sports Leisure, Computers Accessories dan Housewares. Jika dilihat secara keseluruhan, kategori produk Bed Bath Table merupakan kategori produk yang paling banyak diorder, dimana trend mengalami peningkatan mulai dari tahun 2017 sampai 2018. Product Categories computer accesories meskipun jumlah Product hanya di peringkat ke 7 terbanyak namun total Order nya bisa mengalahkan total Order dari Product yang memiliki jumlah Product yang lebih banyak darinya seperti furniture decor, houseware, auto. Terlihat dari nilai ratio Total Order per Total Product nilai nya mengungguli product category lainnya.
4.4 Highest Customer per city
– in progress
4.5 Payment Type
plot_paymenttype_freq <- clean_orderpayments %>%
select(payment_type) %>%
count(payment_type) %>%
mutate(percentage = round((n/sum(n))*100,2),
popup=paste("Payment Type :", payment_type,"<br>",
"Frequencies :", label=comma(n),
"Percentage : ", percentage, '%'
)
) %>%
ggplot(aes(x=reorder(payment_type,-n), y=n)) +
geom_col(aes(fill=payment_type, text=popup), show.legend = F)+
labs(title = "Highest Payment Type Use in Olist", subtitle = "per transaction frequencies from 2016-2018",x = NULL,y = NULL)
ggplotly(plot_paymenttype_freq, tooltip="text") %>%
layout(showlegend=FALSE,
title = list(text = paste0("Highest Payment Type Use in Olist",
"<br>",
"<sup>",
"by transaction frequencies from 2016-2018",
"</sup>")
)
)plot_paymenttype_value <- clean_orderpayments %>%
group_by(payment_type) %>%
summarise(values=sum(payment_value)) %>%
mutate(percentage = round((values/sum(values))*100,2),
popup=paste("Payment Type :", payment_type,"<br>",
"Values :", "R$",label=comma(values)
)
) %>%
ggplot(aes(x=reorder(payment_type,-values), y=values)) +
geom_col(aes(fill=payment_type, text=popup), show.legend = F)+
labs(title = "Highest Payment Type in Olist", subtitle = "by transaction Volume from 2016-2018",x = NULL,y = NULL)
ggplotly(plot_paymenttype_value, tooltip="text") %>%
layout(showlegend=FALSE,
title = list(text = paste0("Highest Payment Type in Olist",
"<br>",
"<sup>",
"by transaction Volume from 2016-2018",
"</sup>")
)
)Berdasarkan metode pembayarannya, terbanyak menggunakan Credit Card sebanyak 73% dari keseluruhan transaksi di Olist.
5 Conclusion & Recommendation
Dari hasil eksplorasi diatas kita mendapat gambaran umum terkait transaksi yang terjadi di Olist pada kurun periode 2016-2018, dimana secara overall trend transaksi mengalami peningkatan dari tahun ke tahun. Untuk meningkatkan convertion rate Olist mungkin bisa mengintensifkan strategi promosi ke wilayah yang memiliki basis customer terbesar seperti di wilayah ___. Strategi promo juga bisa dilakukan dengan vendor credit card untuk meningkatkan convertion rate customer Olist dengan metode pembayaran menggunakan credit card.