Email:
RPubs: https://rpubs.com/nikitaindriyani/


library(rfm)
library(treemap)
library(gridExtra)
library(glue)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(scales)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(ggplot2)
library(tidyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:glue':
## 
##     collapse
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readxl)

1 Background

Suppose you are the manager of a Data Science project and want to analyze customer behavior using transaction data from a UK-based online retailer (the dataset covers the period from 01/12/2010 to 09/12/2011). Many of the company's customers are known to be wholesalers. In addition, note the following points about the data:

Variable      Description
invoice_no    Invoice number: a unique 6-digit number for each transaction. A leading letter C marks the transaction as cancelled.
stock_code    Product code: a unique 5-digit number for each product (product name description).
quantity      Number of units purchased
invoice_date  Date and time of the transaction
unit_price    Price per unit of the product
customer_id   Customer ID: a unique 5-digit number for each customer.
country       Customer's country

You can download the data used in this case from Google Classroom, or via the Retail.xlsx and Retail.rds links.

2 Task 1

Import both files into your RStudio session according to their file types (which import process do you think is better?).

data_input_1 <- readRDS("Retail.rds")
head(data_input_1)
## # A tibble: 6 x 8
##   InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
## 1 536365    85123A    WHITE HANG~        6 2010-12-01 08:26:00      2.55
## 2 536365    71053     WHITE META~        6 2010-12-01 08:26:00      3.39
## 3 536365    84406B    CREAM CUPI~        8 2010-12-01 08:26:00      2.75
## 4 536365    84029G    KNITTED UN~        6 2010-12-01 08:26:00      3.39
## 5 536365    84029E    RED WOOLLY~        6 2010-12-01 08:26:00      3.39
## 6 536365    22752     SET 7 BABU~        2 2010-12-01 08:26:00      7.65
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>
data_input_2 <- read_excel("Retail.xlsx")
head(data_input_2)
## # A tibble: 6 x 8
##   InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
## 1 536365    85123A    WHITE HANG~        6 2010-12-01 08:26:00      2.55
## 2 536365    71053     WHITE META~        6 2010-12-01 08:26:00      3.39
## 3 536365    84406B    CREAM CUPI~        8 2010-12-01 08:26:00      2.75
## 4 536365    84029G    KNITTED UN~        6 2010-12-01 08:26:00      3.39
## 5 536365    84029E    RED WOOLLY~        6 2010-12-01 08:26:00      3.39
## 6 536365    22752     SET 7 BABU~        2 2010-12-01 08:26:00      7.65
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>

Your argument: I prefer the .rds file because it is easy to use: readRDS() restores the saved R object directly.
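A minimal sketch of that convenience, assuming a small made-up table (toy_retail and the temporary file paths are hypothetical and not part of the assignment data). It shows that a saveRDS()/readRDS() round trip keeps column types, while a text format such as CSV has to re-infer them on import.

```r
# Toy data standing in for the retail table (hypothetical values).
toy_retail <- data.frame(
  InvoiceNo   = c("536365", "C536379"),
  InvoiceDate = as.POSIXct(c("2010-12-01 08:26:00", "2010-12-01 09:41:00"),
                           tz = "UTC"),
  Quantity    = c(6, -1),
  stringsAsFactors = FALSE
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy_retail, rds_path)
write.csv(toy_retail, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# The .rds copy matches the original object, date-time class included.
rds_identical <- isTRUE(all.equal(from_rds, toy_retail))

# The CSV copy comes back with the date-time column as plain text.
csv_date_class <- class(from_csv$InvoiceDate)
```

With Excel files, read_excel() also infers types on import, so checking the result with glimpse() after loading remains worthwhile.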

3 Task 2

Rename the variables in the data so that they are easier for readers to understand.

names(data_input_2) <- c(
  "Invoice_No",
  "Kode_Stock",
  "Deskripsi",
  "Kuantitas",
  "Tanggal_Invoice",
  "Harga_per_Barang",
  "ID_Customer",
  "Negara"
)
data_input_2
## # A tibble: 541,909 x 8
##    Invoice_No Kode_Stock Deskripsi Kuantitas Tanggal_Invoice    
##    <chr>      <chr>      <chr>         <dbl> <dttm>             
##  1 536365     85123A     WHITE HA~         6 2010-12-01 08:26:00
##  2 536365     71053      WHITE ME~         6 2010-12-01 08:26:00
##  3 536365     84406B     CREAM CU~         8 2010-12-01 08:26:00
##  4 536365     84029G     KNITTED ~         6 2010-12-01 08:26:00
##  5 536365     84029E     RED WOOL~         6 2010-12-01 08:26:00
##  6 536365     22752      SET 7 BA~         2 2010-12-01 08:26:00
##  7 536365     21730      GLASS ST~         6 2010-12-01 08:26:00
##  8 536366     22633      HAND WAR~         6 2010-12-01 08:28:00
##  9 536366     22632      HAND WAR~         6 2010-12-01 08:28:00
## 10 536367     84879      ASSORTED~        32 2010-12-01 08:34:00
## # ... with 541,899 more rows, and 3 more variables: Harga_per_Barang <dbl>,
## #   ID_Customer <dbl>, Negara <chr>

Your argument: I find it easier when all column names are in Indonesian.
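Assigning a full vector of names depends on the column order staying the same. As a hedged alternative, the sketch below renames columns through an old-to-new lookup, so the order no longer matters. The table toy and the lookup new_names are hypothetical, for illustration only.

```r
# Hypothetical two-row table with some of the original column names.
toy <- data.frame(
  InvoiceNo = c("536365", "536366"),
  Quantity  = c(6, 8),
  Country   = c("United Kingdom", "France"),
  stringsAsFactors = FALSE
)

# Lookup: old name -> new name. Columns not listed keep their name.
new_names <- c(InvoiceNo = "Invoice_No",
               Quantity  = "Kuantitas",
               Country   = "Negara")

# Rename only the columns that appear in the lookup.
hits <- names(toy) %in% names(new_names)
names(toy)[hits] <- unname(new_names[names(toy)[hits]])
```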

4 Task 3

Inspect the structure of the data and change data types where needed (if anything needs changing).

glimpse(data_input_1)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo   <chr> "536365", "536365", "536365", "536365", "536365", "5363...
## $ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752...
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANT...
## $ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, ...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 0...
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1...
## $ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
## $ Country     <chr> "United Kingdom", "United Kingdom", "United Kingdom", "...
# data_input_1 still has the original column names; the renamed columns
# (Kode_Stock, Deskripsi, ...) exist only in data_input_2.
data.frame(
  invoice_Nomor = data_input_1$InvoiceNo %>% unique() %>% length(),
  stock_code_Barang = data_input_1$StockCode %>% unique() %>% length(),
  description_Barang = data_input_1$Description %>% unique() %>% length(),
  Nama_Negara = data_input_1$Country %>% unique() %>% length(),
  customer = data_input_1$CustomerID %>% unique() %>% length()
)

Your argument: This shows the structure of the data clearly, including the type of each column.
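The per-column unique counts can also be computed in one pass. This is a minimal base R sketch on a hypothetical three-row table (toy is made up), not the author's code.

```r
# Hypothetical sample with the original column names.
toy <- data.frame(
  InvoiceNo  = c("536365", "536365", "536366"),
  StockCode  = c("85123A", "71053", "85123A"),
  CustomerID = c(17850, 13047, NA),
  stringsAsFactors = FALSE
)

# Number of distinct values in every column; NA counts as one value.
n_unique <- vapply(toy, function(col) length(unique(col)), integer(1))
```

Because the column names are taken from the data itself, a misspelled or renamed column cannot silently produce a count of 0.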

5 Task 4

Data cleaning, also called data scrubbing, is the process of assessing data quality and then modifying the data. As manager you may also correct or delete records. The steps that may be needed in this project are:

5.1 Cancelled transactions

data_input_1 %>% filter(grepl("C", data_input_1$InvoiceNo)) %>% summarise(total_cancelled_transaction = n())
## # A tibble: 1 x 1
##   total_cancelled_transaction
##                         <int>
## 1                        9288
data_input_1 <- data_input_1 %>% filter(!grepl("C", data_input_1$InvoiceNo))

Your argument: This shows that 9,288 transaction rows belong to cancelled invoices.

5.2 Invalid invoices

data_input_1 %>% filter(nchar(InvoiceNo)>6)
## # A tibble: 3 x 8
##   InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
## 1 A563185   B         Adjust bad~        1 2011-08-12 14:50:00    11062.
## 2 A563186   B         Adjust bad~        1 2011-08-12 14:51:00   -11062.
## 3 A563187   B         Adjust bad~        1 2011-08-12 14:52:00   -11062.
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>

Your argument: According to the description, a valid invoice number has 6 digits.

data_input_1 <- data_input_1 %>% filter(!(nchar(InvoiceNo)>6))
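The two invoice checks (a leading C for cancellations, more than 6 characters for other invalid numbers) can also be expressed as anchored patterns. The sketch below is an illustration on made-up invoice numbers and swaps in regular expressions for the nchar() test used above.

```r
# Hypothetical invoice numbers covering each case.
invoice_no <- c("536365", "C536379", "A563185", "581587")

# Exactly six digits: a valid invoice number per the data description.
is_valid <- grepl("^[0-9]{6}$", invoice_no)

# A leading C followed by six digits: a cancelled transaction.
is_cancelled <- grepl("^C[0-9]{6}$", invoice_no)

status <- ifelse(is_valid, "valid",
                 ifelse(is_cancelled, "cancelled", "other"))
```

Anchoring with ^ and $ ensures that a C or a digit elsewhere in the string cannot trigger a match.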

5.3 Invalid quantities

data_input_1 %>% filter(Quantity<=0) %>% nrow()
## [1] 1336

Your argument: There are 1,336 transaction rows with quantity <= 0.

data_input_1 <- data_input_1 %>% filter(Quantity>0)

5.4 Invalid unit prices

data_input_1 %>% filter(UnitPrice<=0) %>% nrow()
## [1] 1179

Your argument: There are 1,179 transaction rows with UnitPrice <= 0.

data_input_1 <- data_input_1 %>% filter(UnitPrice>0)

5.5 Invalid products

stock_valid <- data_input_1 %>% mutate(
  stock_code_validation = substr(StockCode,start = 1, stop = 5)
) %>% mutate(
  stock_code_validation = as.numeric(stock_code_validation)
) %>% select(StockCode,stock_code_validation,Description) %>% distinct()
## Warning: Problem with `mutate()` input `stock_code_validation`.
## i NAs introduced by coercion
## i Input `stock_code_validation` is `as.numeric(stock_code_validation)`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
data_input_1 %>% filter(Description %in% (stock_valid %>% filter(is.na(stock_code_validation)) %>% .$Description))
## # A tibble: 2,378 x 8
##    InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
##    <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
##  1 536370    POST      POSTAGE            3 2010-12-01 08:45:00     18   
##  2 536403    POST      POSTAGE            1 2010-12-01 11:27:00     15   
##  3 536527    POST      POSTAGE            1 2010-12-01 13:04:00     18   
##  4 536540    C2        CARRIAGE           1 2010-12-01 14:05:00     50   
##  5 536544    DOT       DOTCOM POS~        1 2010-12-01 14:32:00    570.  
##  6 536569    M         Manual             1 2010-12-01 15:35:00      1.25
##  7 536569    M         Manual             1 2010-12-01 15:35:00     19.0 
##  8 536592    DOT       DOTCOM POS~        1 2010-12-01 17:06:00    607.  
##  9 536779    BANK CHA~ Bank Charg~        1 2010-12-02 15:08:00     15   
## 10 536840    POST      POSTAGE            1 2010-12-02 18:27:00     18   
## # ... with 2,368 more rows, and 2 more variables: CustomerID <dbl>,
## #   Country <chr>

Your argument: There are 2,378 transaction rows with an invalid stock code.
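The check above relies on as.numeric() returning NA, which is what produces the coercion warnings. As a hedged alternative (a swapped-in technique, not the code used in this report), a pattern test on the first five characters flags the same kind of codes without warnings. The codes below are examples taken from the output above.

```r
# Example stock codes: two regular products and several service codes.
stock_code <- c("85123A", "71053", "POST", "C2", "DOT", "M")

# A valid code starts with five digits; a letter suffix is allowed.
has_numeric_prefix <- grepl("^[0-9]{5}", stock_code)

invalid_codes <- stock_code[!has_numeric_prefix]
```

Note that the report then filters on the Description of these codes rather than on the code itself, which also removes rows that share the description under a different code.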

invalid_stock <- stock_valid %>% filter(is.na(stock_code_validation)) %>% .$Description
data_input_1 <- data_input_1 %>% filter(!Description %in% invalid_stock)
# Additional adjustment codes to remove
descr <- c( "check", "check?", "?", "??", "damaged", "found", 
            "adjustment", "Amazon", "AMAZON", "amazon adjust", 
            "Amazon Adjustment", "amazon sales", "Found", "FOUND",
            "found box", "Found by jackie ","Found in w/hse","dotcom", 
            "dotcom adjust", "allocate stock for dotcom orders ta", "FBA", 
            "Dotcomgiftshop Gift Voucher \u00a3100.00", "on cargo order",
            "wrongly sold (22719) barcode", "wrongly marked 23343",
            "dotcomstock", "rcvd be air temp fix for dotcom sit", 
            "Manual", "John Lewis", "had been put aside", 
            "for online retail orders", "taig adjust", "amazon", 
            "incorrectly credited C550456 see 47", "returned", 
            "wrongly coded 20713", "came coded as 20713", 
            "add stock to allocate online orders", "Adjust bad debt", 
            "alan hodge cant mamage this section", "website fixed",
            "did  a credit  and did not tick ret", "michel oops",
            "incorrectly credited C550456 see 47", "mailout", "test",
            "Sale error",  "Lighthouse Trading zero invc incorr", "SAMPLES",
            "Marked as 23343", "wrongly coded 23343","Adjustment", 
            "rcvd be air temp fix for dotcom sit", "Had been put aside." )
descr
##  [1] "check"                                     
##  [2] "check?"                                    
##  [3] "?"                                         
##  [4] "??"                                        
##  [5] "damaged"                                   
##  [6] "found"                                     
##  [7] "adjustment"                                
##  [8] "Amazon"                                    
##  [9] "AMAZON"                                    
## [10] "amazon adjust"                             
## [11] "Amazon Adjustment"                         
## [12] "amazon sales"                              
## [13] "Found"                                     
## [14] "FOUND"                                     
## [15] "found box"                                 
## [16] "Found by jackie "                          
## [17] "Found in w/hse"                            
## [18] "dotcom"                                    
## [19] "dotcom adjust"                             
## [20] "allocate stock for dotcom orders ta"       
## [21] "FBA"                                       
## [22] "Dotcomgiftshop Gift Voucher <U+00A3>100.00"
## [23] "on cargo order"                            
## [24] "wrongly sold (22719) barcode"              
## [25] "wrongly marked 23343"                      
## [26] "dotcomstock"                               
## [27] "rcvd be air temp fix for dotcom sit"       
## [28] "Manual"                                    
## [29] "John Lewis"                                
## [30] "had been put aside"                        
## [31] "for online retail orders"                  
## [32] "taig adjust"                               
## [33] "amazon"                                    
## [34] "incorrectly credited C550456 see 47"       
## [35] "returned"                                  
## [36] "wrongly coded 20713"                       
## [37] "came coded as 20713"                       
## [38] "add stock to allocate online orders"       
## [39] "Adjust bad debt"                           
## [40] "alan hodge cant mamage this section"       
## [41] "website fixed"                             
## [42] "did  a credit  and did not tick ret"       
## [43] "michel oops"                               
## [44] "incorrectly credited C550456 see 47"       
## [45] "mailout"                                   
## [46] "test"                                      
## [47] "Sale error"                                
## [48] "Lighthouse Trading zero invc incorr"       
## [49] "SAMPLES"                                   
## [50] "Marked as 23343"                           
## [51] "wrongly coded 23343"                       
## [52] "Adjustment"                                
## [53] "rcvd be air temp fix for dotcom sit"       
## [54] "Had been put aside."

I agree with these findings, so transactions carrying any of the descriptions above need to be removed.

data_input_1 <- data_input_1 %>% 
  filter(!Description %in% descr)

5.6 Duplicate descriptions and stock codes

df_product <- data_input_1 %>% select(StockCode,Description) %>% distinct()
df_product <- df_product %>% 
  mutate(stock_code_lowercase = tolower(StockCode),
         description_lowercase = tolower(Description))
tibble(
  stock_code_unik = df_product$StockCode %>% unique() %>% length(),
  stock_code_unik_lowercase = df_product$stock_code_lowercase %>% unique() %>% length(),
  description_unik = df_product$Description %>% unique() %>% length(),
  description_unik_lower = df_product$description_lowercase %>% unique() %>% length()
)
## # A tibble: 1 x 4
##   stock_code_unik stock_code_unik_lowerca~ description_unik description_unik_lo~
##             <int>                    <int>            <int>                <int>
## 1            3900                     3791             3994                 3994

The number of unique stock codes differs from the number of unique descriptions, so let's check the stock codes first.

stock_code_check_dupli <- df_product %>% select(StockCode,stock_code_lowercase) %>% 
  distinct() %>% 
  group_by(stock_code_lowercase) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  filter(freq>1)
## `summarise()` ungrouping output (override with `.groups` argument)

stock_code_check_dupli holds the stock codes that collide once they are converted to lowercase. Let's check whether they really are duplicates.

df_product %>% filter(stock_code_lowercase %in% stock_code_check_dupli$stock_code_lowercase) %>% 
  arrange(stock_code_lowercase)
## # A tibble: 230 x 4
##    StockCode Description             stock_code_lowerc~ description_lowercase   
##    <chr>     <chr>                   <chr>              <chr>                   
##  1 15056BL   EDWARDIAN PARASOL BLACK 15056bl            edwardian parasol black 
##  2 15056bl   EDWARDIAN PARASOL BLACK 15056bl            edwardian parasol black 
##  3 15056N    EDWARDIAN PARASOL NATU~ 15056n             edwardian parasol natur~
##  4 15056n    EDWARDIAN PARASOL NATU~ 15056n             edwardian parasol natur~
##  5 15056P    EDWARDIAN PARASOL PINK  15056p             edwardian parasol pink  
##  6 15056p    EDWARDIAN PARASOL PINK  15056p             edwardian parasol pink  
##  7 15060B    FAIRY CAKE DESIGN UMBR~ 15060b             fairy cake design umbre~
##  8 15060b    FAIRY CAKE DESIGN UMBR~ 15060b             fairy cake design umbre~
##  9 18098C    PORCELAIN BUTTERFLY OI~ 18098c             porcelain butterfly oil~
## 10 18098c    PORCELAIN BUTTERFLY OI~ 18098c             porcelain butterfly oil~
## # ... with 220 more rows

It turns out that some stock codes are duplicated only because of case sensitivity. We therefore convert every stock_code to UPPERCASE to eliminate the case-sensitivity effect.

data_input_1 <- data_input_1 %>% mutate(
  StockCode = toupper(StockCode)
)
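A small illustration of the effect, using a few of the stock codes listed above plus one code without a lowercase twin (the vector is a made-up sample, not the full data):

```r
# Sample stock codes: two pairs differ only by letter case.
stock_code <- c("15056BL", "15056bl", "15056N", "15056n", "22752")

n_raw   <- length(unique(stock_code))           # counts case variants separately
n_upper <- length(unique(toupper(stock_code)))  # counts them once

# Number of codes that were duplicates purely because of case.
collapsed <- n_raw - n_upper
```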

tibble(
  jumlah_stock_code_unik = data_input_1$StockCode %>% unique() %>% length(),
  stock_code_unik_lowercase = data_input_1$StockCode %>% tolower() %>% unique() %>% length(),
  description_unik = data_input_1$Description %>% unique() %>% length(),
  description_unik_lower = data_input_1$Description %>% tolower() %>% unique() %>% length()
)
## # A tibble: 1 x 4
##   jumlah_stock_code_u~ stock_code_unik_low~ description_unik description_unik_l~
##                  <int>                <int>            <int>               <int>
## 1                 3791                 3791             3994                3994

The stock_code data is now clean. However, the number of stock codes and the number of descriptions should be equal, since each is supposed to be unique. The mismatch indicates duplicated data, so let's check.

description_check <- data_input_1 %>% select(StockCode,Description) %>% 
  distinct() %>% 
  group_by(StockCode) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  filter(freq>1)
## `summarise()` ungrouping output (override with `.groups` argument)
description_check
## # A tibble: 212 x 2
##    StockCode  freq
##    <chr>     <int>
##  1 16156L        2
##  2 17107D        3
##  3 20622         2
##  4 20725         2
##  5 20914         2
##  6 21109         2
##  7 21112         2
##  8 21175         2
##  9 21232         2
## 10 21243         2
## # ... with 202 more rows

The data above lists the stock codes that have more than one description. Let's look at a sample.

data_input_1 %>% filter(StockCode %in% description_check$StockCode) %>% 
  select(StockCode,Description) %>% 
  distinct() %>% 
  arrange(StockCode,Description)
## # A tibble: 443 x 2
##    StockCode Description                        
##    <chr>     <chr>                              
##  1 16156L    WRAP CAROUSEL                      
##  2 16156L    WRAP, CAROUSEL                     
##  3 17107D    FLOWER FAIRY 5 DRAWER LINERS       
##  4 17107D    FLOWER FAIRY 5 SUMMER DRAW LINERS  
##  5 17107D    FLOWER FAIRY,5 SUMMER B'DRAW LINERS
##  6 20622     VIP PASSPORT COVER                 
##  7 20622     VIPPASSPORT COVER                  
##  8 20725     LUNCH BAG RED RETROSPOT            
##  9 20725     LUNCH BAG RED SPOTTY               
## 10 20914     SET/5 RED RETROSPOT LID GLASS BOWLS
## # ... with 433 more rows

From the check above we can conclude that the descriptions contain errors in punctuation, spacing, and even spelling. We therefore set each product's description to the first description recorded for its stock code.

df_description <- data_input_1 %>% select(StockCode,Description) %>% 
  filter(StockCode %in% description_check$StockCode) %>% 
  distinct() %>% 
  group_by(StockCode) %>% 
  slice(1) %>% 
  ungroup()

data_input_1 <- data_input_1 %>% left_join(df_description, by=c("StockCode")) %>% 
  mutate(Description = ifelse(is.na(Description.y),Description.x,Description.y)) %>% 
  select(-c(Description.y,Description.x))
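The "first description wins" rule can be sketched in base R as follows. This is an illustration on a hypothetical five-row table (toy is made up, with values borrowed from the sample above) and uses match() instead of the left_join() in the report.

```r
# Hypothetical rows: two stock codes have two spellings each.
toy <- data.frame(
  StockCode   = c("16156L", "16156L", "20622", "20622", "20725"),
  Description = c("WRAP CAROUSEL", "WRAP, CAROUSEL",
                  "VIP PASSPORT COVER", "VIPPASSPORT COVER",
                  "LUNCH BAG RED RETROSPOT"),
  stringsAsFactors = FALSE
)

# The first description seen for each stock code becomes the canonical one.
canonical <- toy[!duplicated(toy$StockCode), ]

# Overwrite every row's description with the canonical one for its code.
toy$Description <- canonical$Description[match(toy$StockCode, canonical$StockCode)]

n_codes        <- length(unique(toy$StockCode))
n_descriptions <- length(unique(toy$Description))
```

The first spelling is not necessarily the correct one; choosing the most frequent description per code would be another reasonable rule.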


tibble(
  jumlah_stock_code_unik = data_input_1$StockCode %>% unique() %>% length(),
  jumlah_description_unik = data_input_1$Description %>% unique() %>% length()
)
## # A tibble: 1 x 2
##   jumlah_stock_code_unik jumlah_description_unik
##                    <int>                   <int>
## 1                   3791                    3765

The gap is starting to narrow. The counts above indicate that some stock codes share the same description.

df_description <- data_input_1 %>% select(StockCode,Description) %>% 
  distinct() %>% 
  group_by(Description) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  filter(freq>1)
## `summarise()` ungrouping output (override with `.groups` argument)
df_description
## # A tibble: 24 x 2
##    Description                      freq
##    <chr>                           <int>
##  1 BATHROOM METAL SIGN                 2
##  2 COLOURING PENCILS BROWN TUBE        2
##  3 COLUMBIAN CANDLE RECTANGLE          2
##  4 COLUMBIAN CANDLE ROUND              3
##  5 EAU DE NILE JEWELLED PHOTOFRAME     2
##  6 FRENCH FLORAL CUSHION COVER         2
##  7 FRENCH LATTICE CUSHION COVER        2
##  8 FRENCH PAISLEY CUSHION COVER        2
##  9 FROSTED WHITE BASE                  2
## 10 HEART T-LIGHT HOLDER                2
## # ... with 14 more rows

There are 24 descriptions that are shared by more than one stock code.

data_input_1 %>% select(StockCode,Description) %>% 
  distinct() %>% 
  filter(Description %in% c(df_description$Description)) %>% 
  arrange(Description)
## # A tibble: 50 x 2
##    StockCode Description                    
##    <chr>     <chr>                          
##  1 82580     BATHROOM METAL SIGN            
##  2 21171     BATHROOM METAL SIGN            
##  3 10133     COLOURING PENCILS BROWN TUBE   
##  4 10135     COLOURING PENCILS BROWN TUBE   
##  5 72133     COLUMBIAN CANDLE RECTANGLE     
##  6 72131     COLUMBIAN CANDLE RECTANGLE     
##  7 72127     COLUMBIAN CANDLE ROUND         
##  8 72130     COLUMBIAN CANDLE ROUND         
##  9 72128     COLUMBIAN CANDLE ROUND         
## 10 85023B    EAU DE NILE JEWELLED PHOTOFRAME
## # ... with 40 more rows

In this case we can assume that the products are the same, so we assign each description the first stock code recorded for it.

df_product_unik <- data_input_1 %>% select(StockCode,Description) %>% 
  filter(Description %in% c(df_description$Description)) %>% 
  distinct() %>% 
  group_by(Description) %>% 
  slice(1) %>% 
  ungroup()

data_input_1 <- data_input_1 %>% left_join(df_product_unik, by=c("Description")) %>% 
  mutate(StockCode = ifelse(is.na(StockCode.y),StockCode.x,StockCode.y)) %>% 
  select(-c("StockCode.x","StockCode.y")) %>% 
  select(InvoiceNo,InvoiceDate,CustomerID,Country,StockCode,Description,Quantity,UnitPrice)


tibble(
  jumlah_stock_code_unik = data_input_1$StockCode %>% unique() %>% length(),
  jumlah_description_unik = data_input_1$Description %>% unique() %>% length(),
  jumlah_code_description_unik = data_input_1 %>% select(StockCode,Description) %>% distinct() %>% nrow()
)
## # A tibble: 1 x 3
##   jumlah_stock_code_unik jumlah_description_unik jumlah_code_description_unik
##                    <int>                   <int>                        <int>
## 1                   3765                    3765                         3765

Your argument: Every stock code is now unique and maps to exactly one description.

5.7 Invalid countries

tibble(
  customer_country_unik = data_input_1 %>% select(CustomerID,Country) %>% distinct() %>% nrow(),
  customer_id_unik = data_input_1 %>% select(CustomerID) %>% distinct() %>% nrow()
)
## # A tibble: 1 x 2
##   customer_country_unik customer_id_unik
##                   <int>            <int>
## 1                  4351             4335

Based on the data above, there are 16 more customer-country pairs (4,351) than customers (4,335), so some customers are recorded under more than one country. This may be because a customer moved, so we take each customer's country from their most recent transaction.

df_master_customer <- data_input_1 %>% 
  arrange(desc(InvoiceDate)) %>% 
  select(CustomerID, Country) %>% 
  group_by(CustomerID) %>% 
  slice(1)

data_input_1 <- data_input_1 %>% select(-Country) %>% 
  left_join(df_master_customer, by = c("CustomerID"))


tibble(
  customer_country_unik = data_input_1 %>% select(CustomerID,Country) %>% distinct() %>% nrow(),
  customer_id_unik = data_input_1 %>% select(CustomerID) %>% distinct() %>% nrow())
## # A tibble: 1 x 2
##   customer_country_unik customer_id_unik
##                   <int>            <int>
## 1                  4335             4335

Your argument: every customer now maps to a single country.
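The "country of the most recent transaction" rule can be sketched in base R as below. The customers, countries, and dates are hypothetical, and order() with !duplicated() stands in for the arrange() and slice(1) steps used above.

```r
# Hypothetical transactions: customer 12345 appears under two countries.
toy <- data.frame(
  CustomerID  = c(12345, 12345, 67890),
  Country     = c("Belgium", "Spain", "France"),
  InvoiceDate = as.POSIXct(c("2011-01-10 10:00:00", "2011-06-02 09:30:00",
                             "2011-03-15 14:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort newest first, then keep the first (latest) row per customer.
latest_first <- toy[order(toy$InvoiceDate, decreasing = TRUE), ]
master <- latest_first[!duplicated(latest_first$CustomerID),
                       c("CustomerID", "Country")]

# Replace each row's country with the customer's latest country.
toy$Country <- master$Country[match(toy$CustomerID, master$CustomerID)]
```

One caveat, stated as an observation rather than a fix: rows with a missing CustomerID form a single group under this rule, so they all receive the same country.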

5.8 Period cut-off

data_input_1 <- data_input_1 %>% filter(as.Date(InvoiceDate) > ymd("2010-11-30"), as.Date(InvoiceDate) < ymd("2011-12-01"))
tail(data_input_1,10)
## # A tibble: 10 x 8
##    InvoiceNo InvoiceDate         CustomerID StockCode Description Quantity
##    <chr>     <dttm>                   <dbl> <chr>     <chr>          <dbl>
##  1 579885    2011-11-30 17:37:00      15444 22118     JOY WOODEN~        2
##  2 579885    2011-11-30 17:37:00      15444 21287     SCENTED VE~       12
##  3 579885    2011-11-30 17:37:00      15444 23035     DRAWER KNO~        6
##  4 579885    2011-11-30 17:37:00      15444 23240     SET OF 4 K~        1
##  5 579885    2011-11-30 17:37:00      15444 84882     GREEN WIRE~        2
##  6 579885    2011-11-30 17:37:00      15444 85034C    3 ROSE MOR~        4
##  7 579885    2011-11-30 17:37:00      15444 21742     LARGE ROUN~        2
##  8 579885    2011-11-30 17:37:00      15444 23084     RABBIT NIG~        6
##  9 579885    2011-11-30 17:37:00      15444 21257     VICTORIAN ~        1
## 10 579885    2011-11-30 17:37:00      15444 21259     VICTORIAN ~        1
## # ... with 2 more variables: UnitPrice <dbl>, Country <chr>

Your argument: The data contains transactions from 01/12/2010 to 09/12/2011. December 2011 is not a full month, so I chose to analyze the period from 01/12/2010 to 30/11/2011.
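The same cut-off can be written as a half-open interval on the timestamps themselves, which avoids converting to dates first. This is a minimal sketch on made-up timestamps, assuming everything is in UTC; the variable names are hypothetical.

```r
# Hypothetical invoice timestamps around the cut-off.
invoice_date <- as.POSIXct(c("2010-12-01 08:26:00", "2011-11-30 17:37:00",
                             "2011-12-01 08:00:00", "2011-12-09 12:50:00"),
                           tz = "UTC")

# Keep [2010-12-01, 2011-12-01): twelve full months.
start <- as.POSIXct("2010-12-01", tz = "UTC")
end   <- as.POSIXct("2011-12-01", tz = "UTC")

keep       <- invoice_date >= start & invoice_date < end
kept_dates <- invoice_date[keep]
```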

5.9 Duplicate data

data.frame(
  jumlah_data = data_input_1 %>% nrow(),
  jumlah_data_unik = data_input_1 %>% distinct() %>% nrow()
)
##   jumlah_data jumlah_data_unik
## 1      502695           497672

The dataset contains duplicate rows, so we need to remove them.

data_input_1 <- data_input_1 %>% distinct()

# total data
data_input_1 %>% nrow()
## [1] 497672
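To see which rows distinct() would drop before dropping them, base R's duplicated() marks every repeat of an earlier row. The table below is a hypothetical example.

```r
# Hypothetical rows: the second row repeats the first exactly.
toy <- data.frame(
  InvoiceNo = c("536365", "536365", "536365"),
  StockCode = c("85123A", "85123A", "71053"),
  Quantity  = c(6, 6, 6),
  stringsAsFactors = FALSE
)

# TRUE for the second and later copies of a row, FALSE for the first copy.
is_dup  <- duplicated(toy)
deduped <- toy[!is_dup, ]
```

Treating identical rows as duplicates is an assumption: the same item entered twice on one invoice would look the same as a duplicated record.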

5.10 Feature extraction

data_input_1 <- data_input_1 %>% mutate(total_amount = Quantity * UnitPrice) %>% 
  select(InvoiceNo,InvoiceDate,CustomerID,Country,StockCode,Description,Quantity,UnitPrice,total_amount)

head(data_input_1)
## # A tibble: 6 x 9
##   InvoiceNo InvoiceDate         CustomerID Country StockCode Description
##   <chr>     <dttm>                   <dbl> <chr>   <chr>     <chr>      
## 1 536365    2010-12-01 08:26:00      17850 United~ 85123A    WHITE HANG~
## 2 536365    2010-12-01 08:26:00      17850 United~ 71053     WHITE META~
## 3 536365    2010-12-01 08:26:00      17850 United~ 84406B    CREAM CUPI~
## 4 536365    2010-12-01 08:26:00      17850 United~ 84029G    KNITTED UN~
## 5 536365    2010-12-01 08:26:00      17850 United~ 84029E    RED WOOLLY~
## 6 536365    2010-12-01 08:26:00      17850 United~ 22752     SET 7 BABU~
## # ... with 3 more variables: Quantity <dbl>, UnitPrice <dbl>,
## #   total_amount <dbl>

Your argument: extracting the total amount per transaction line as quantity * unit_price simplifies the subsequent analysis.

5.11 Missing values

colSums(is.na(data_input_1))
##    InvoiceNo  InvoiceDate   CustomerID      Country    StockCode  Description 
##            0            0       123531            0            0            0 
##     Quantity    UnitPrice total_amount 
##            0            0            0

This project aims to segment customers and to personalize product recommendations. Customer segmentation clearly requires knowing who the customer is, so transactions without a customer_id must be excluded. To build the recommendation system, on the other hand, we can ignore the customer and focus on the products purchased, so invoice_no and stock_code are sufficient.

We therefore split the data into 3 datasets:
1. df_general_transaction: the full dataset as it stands after cleaning.
2. df_customer_transaction: used for customer segmentation, so rows with missing values are excluded.
3. df_product_recomm: used to build the product recommendation system; it contains only customer_id, invoice_no, stock_code and description.

df_general_transaction <- data_input_1
df_customer_transaction <- drop_na(data_input_1)
df_product_recomm <- data_input_1 %>% select(CustomerID,InvoiceNo,StockCode,Description)
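drop_na() without column arguments removes a row when any column is missing. In this data only CustomerID has missing values, so the result is the same, but restricting the check to that column states the intent more explicitly. Below is a base R sketch on a hypothetical table (the values are made up).

```r
# Hypothetical rows: one lacks a customer, another lacks a description.
toy <- data.frame(
  InvoiceNo   = c("536365", "536366", "536367"),
  CustomerID  = c(17850, NA, 13047),
  Description = c("ITEM A", "ITEM B", NA),
  stringsAsFactors = FALSE
)

# Drops a row if ANY column is NA (what drop_na(toy) would do).
any_na_dropped <- toy[complete.cases(toy), ]

# Drops a row only if CustomerID is NA.
customer_na_dropped <- toy[!is.na(toy$CustomerID), ]
```

With tidyr the column-specific form is drop_na(data_input_1, CustomerID).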

6 Task 5

Save the cleaned data to a folder in .json, .xml, or .rds format.

saveRDS(data_input_1, "C:/Nikita/Tugas/Data Structutes and Algorithms/UAS/uas.rds")

7 Task 6

Import the data you saved in Task 5 (choose just one file type). Then carry out Exploratory Data Analysis, using the visualizations you have learned, to answer each of the following questions:

• Use a bar chart to show how many customers made transactions each month.

Customer_per_Bulan   <- data_input_1                               %>%
                        select(InvoiceDate, CustomerID) %>%
                        distinct()                            %>%
                        mutate(yearmonth = format(InvoiceDate, format = "%y-%b-1"),
                               yearmonth = ymd(yearmonth))    %>%
                        group_by(yearmonth)                   %>%
                        summarise(total= n())                 %>%
                        ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
Customer_per_Bulan
## # A tibble: 12 x 2
##    yearmonth  total
##    <date>     <int>
##  1 2010-12-01  1529
##  2 2011-01-01  1081
##  3 2011-02-01  1089
##  4 2011-03-01  1424
##  5 2011-04-01  1231
##  6 2011-05-01  1646
##  7 2011-06-01  1488
##  8 2011-07-01  1423
##  9 2011-08-01  1335
## 10 2011-09-01  1796
## 11 2011-10-01  1987
## 12 2011-11-01  2738
ggplot(Customer_per_Bulan,
       aes(x = yearmonth, y = total))                                            +
       geom_bar(width=24, fill = rainbow(12), color="azure4", stat= "identity" ) +
       theme_minimal()                                                           +
       labs(
            x     = "Bulan"   ,
            y     = "Customer",
            title = "Transaksi Pelanggan Setiap Bulan"
           )                                                         + 
       theme(axis.text.x   = element_text(angle = -45, hjust = -.2)) +
       scale_x_date(breaks = date_breaks('1 month'),
       labels              = date_format("%b %y"))

Your argument: The number of customers making transactions per month is highest in November 2011 (2,738).
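One caveat about the count above: distinct() is applied to (InvoiceDate, CustomerID) pairs before the month is derived, so a customer who appears at several timestamps within a month is counted once per timestamp. If the goal is unique customers per month, one possible approach is to derive the month first and deduplicate afterwards. The sketch below uses a hypothetical toy table `toy`, not the report's data.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: customer 1 orders twice in November and once in December.
toy <- tibble(
  InvoiceDate = ymd_hms(c("2011-11-01 09:00:00", "2011-11-15 14:30:00",
                          "2011-11-20 10:00:00", "2011-12-02 11:00:00")),
  CustomerID  = c(1, 1, 2, 1)
)

customers_per_month <- toy %>%
  # Truncate each timestamp to the first day of its month.
  mutate(yearmonth = as.Date(floor_date(InvoiceDate, "month"))) %>%
  # Keep one row per customer per month.
  distinct(yearmonth, CustomerID) %>%
  # Count the remaining rows, i.e. unique customers, per month.
  count(yearmonth, name = "total")

customers_per_month
```

In this toy example November has two unique customers even though it has three order timestamps.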

• Use an interactive line chart to show how the number of new customers grows each month.

plot_monthly_new_customer <- data_input_1 %>%
  group_by(CustomerID) %>% 
  summarise(first_order = min(InvoiceDate)) %>% 
  ungroup() %>% 
  mutate( yearmonth = format(first_order, format="%Y-%m-1"),
          yearmonth = ymd(yearmonth),
          ym = as.yearmon(first_order)) %>% 
  group_by(ym,yearmonth) %>% 
  summarise(total_new_customer = n()) %>% 
  ungroup() %>% 
  mutate(
     normalisasi = (total_new_customer-min(total_new_customer))/(max(total_new_customer)-min(total_new_customer)),
     popup=glue("Year-Month : {ym}
                 Total New Customer : {total_new_customer} ({round((total_new_customer/sum(total_new_customer))*100,1)}%)")
  ) %>% 
  ggplot(aes(yearmonth,normalisasi))+
   geom_area(fill="blue",alpha=0.7)+
  geom_line(size=0.7,color="#181818") +
  labs(
    title = "Growth of New Customer",
    x = "Month-Year",
    y = NULL
  )+
  geom_point(color="#181818", size = 2, alpha = 0.9, aes(text=popup))+
  scale_x_date(breaks=date_breaks('1 months'),
     labels=date_format('%b %y'))+
  theme(
    axis.text.y = element_blank()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'ym' (override with `.groups` argument)
## Warning: Ignoring unknown aesthetics: text
ggplotly(plot_monthly_new_customer, tooltip = "text")

Your argument: The number of new customers per month is highest in December 2010 and lowest in August 2011.
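The `normalisasi` column in the pipeline above is a min-max rescaling: the month with the fewest new customers maps to 0 and the month with the most maps to 1. A small sketch with made-up counts (the helper name `min_max` and the values are illustrative assumptions):

```r
# Min-max rescaling to the range [0, 1].
# Note: if all values are equal, the denominator is 0 and the result is NaN.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

new_customers <- c(120, 45, 80)   # hypothetical monthly counts
scaled <- min_max(new_customers)
scaled
```

Since the y-axis then shows only relative height, the tooltip text is what carries the actual counts.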

• Use radar charts to analyze order timing (monthly, daily, and hourly).

# Monthly: transactions by day of the month
Bulanan <- data_input_1 %>% select(InvoiceDate,InvoiceNo) %>% 
  distinct() %>%
  mutate(month = month(InvoiceDate),
         day = day(InvoiceDate)) %>% 
  group_by(month,day) %>% 
  summarise(total = n()) %>% 
  ungroup() %>% 
  group_by(day) %>% 
  summarise(avg_monthly_trans = as.integer(median(total))) %>% 
  ungroup() %>% 
  mutate(popup = glue("Date : {day}
                       Total Transaction: {avg_monthly_trans}")) 
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
Bulanan %>% 
  ggplot(aes(x=as.factor(day), y=avg_monthly_trans)) +
  geom_bar(stat="identity", aes(fill=avg_monthly_trans),show.legend = FALSE)+
  labs(title = "Transaksi perbulan",
    x = "Tanggal", 
    y = NULL)+
  theme_minimal()+
  scale_fill_gradient(low = "#FFF8DC", high="#00008B")+
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5,size=12, face="bold"),
        plot.subtitle = element_text(hjust = 0.5,size=10),
        axis.text.y = element_blank(),
        axis.text.x=element_text(size=11, face="bold"))+
  coord_polar() -> Bulan

# Daily: transactions by day of the week
Harian <- data_input_1 %>% select(InvoiceNo,InvoiceDate) %>% 
  distinct() %>% 
  mutate(month = month(InvoiceDate),
        wday = wday(InvoiceDate,week_start = getOption("lubridate.week.start", 1))) %>% 
  group_by(month,wday) %>% 
  summarise(total = n()) %>% 
  ungroup() %>% 
  group_by(wday) %>% 
  summarise(avg_wday_trans = as.integer(median(total))) %>% 
  ungroup() %>% 
  mutate(popup = glue("wday : {wday}
                     Total Transaction: {avg_wday_trans}")) 
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
Harian %>% 
    ggplot(aes(wday,avg_wday_trans))+
    geom_bar(width=1, stat="identity", show.legend = FALSE, aes(fill=avg_wday_trans))+
    labs(
      title = "Transaksi Harian",
      x = "Hari", 
      y = NULL)+
    scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                       labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))+
    theme_minimal()+
    scale_fill_gradient(low = "#adff2f", high="#006400")+
    theme(axis.title = element_blank(),
          legend.position = "none",
          plot.title = element_text(hjust = 0.5,size=12, face="bold"),
          plot.subtitle = element_text(hjust = 0.5,size=10),
          axis.text.y = element_blank(),
          axis.text.x=element_text(size=11, face="bold"))+
    coord_polar() -> Hari



# Hourly: transactions by hour of the day
 
Jaman <- data_input_1 %>% select(InvoiceDate,InvoiceNo) %>% 
  distinct() %>%
  mutate(day = day(InvoiceDate),
         hour = hour(InvoiceDate)) %>% 
  group_by(day,hour) %>% 
  summarise(total = n()) %>% 
  ungroup() %>% 
  group_by(hour) %>% 
  summarise(avg_hourly_trans = as.integer(median(total))) %>% 
  ungroup() %>% 
  mutate(popup = glue("Hour of Day : {hour}
                       Total Transaction: {avg_hourly_trans}")) 
## `summarise()` regrouping output by 'day' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
# All 24 hours, so that hours without transactions still get a slot on the chart
time_range <- tibble(hour = 0:23)

time_range %>% 
  left_join(Jaman ,by="hour") %>%
  mutate(hour = as.factor(hour)) %>% 
  ggplot(aes(x=hour,y=avg_hourly_trans))+
  geom_bar(stat="identity",show.legend = FALSE, aes(fill=avg_hourly_trans))+
  labs(title = "Transaksi Perjam",
    x = "Jam", 
    y = NULL)+
  theme_minimal()+
  scale_fill_gradient(low = "#FFFF00", high="#FF4500")+
  theme(axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5,size=12, face="bold"),
        plot.subtitle = element_text(hjust = 0.5,size=10),
        axis.text.y = element_blank(),
        axis.text.x=element_text(size=11, face="bold"))+
  coord_polar() -> Jam

gridExtra::grid.arrange(Bulan,Hari,Jam, ncol = 3)
## Warning: Removed 9 rows containing missing values (position_stack).

Your argument: In these charts, the darker the color, the more transactions occurred. By day of the month, transactions peak on the 30th. By day of the week, transactions peak on Thursday. By hour of the day, transactions peak at 12 noon.
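The weekday chart relies on a two-stage aggregation: count invoices per month and weekday, then take the median across months for each weekday. The sketch below reproduces that idea on a hypothetical handful of invoice dates (the table `invoices` is an assumption for illustration), with `week_start = 1` so that Monday is 1 and Thursday is 4.

```r
library(dplyr)
library(lubridate)

# Hypothetical invoice dates: two on Thu 3 Nov 2011, one on Fri 4 Nov 2011,
# and four on Thu 1 Dec 2011.
invoices <- tibble(
  InvoiceDate = as.Date(c("2011-11-03", "2011-11-03", "2011-11-04",
                          rep("2011-12-01", 4)))
)

by_weekday <- invoices %>%
  mutate(month = month(InvoiceDate),
         wday  = wday(InvoiceDate, week_start = 1)) %>%   # Monday = 1
  count(month, wday, name = "total") %>%                   # invoices per month and weekday
  group_by(wday) %>%
  # as.integer() truncates, so a median of 2.5 would be reported as 2.
  summarise(avg_wday_trans = as.integer(median(total)), .groups = "drop")

by_weekday
```

Here Thursday has monthly counts of 2 and 4, giving a median of 3, while Friday has a single count of 1.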

• Use an interactive bar chart to show the transaction frequency for each month.

Transaksi_Bulan <- data_input_1 %>%
  select(InvoiceDate,InvoiceNo) %>% 
      distinct() %>% 
      mutate( yearmonth = format(InvoiceDate, format = "%y - %m - 1"),
              yearmonth = ymd(yearmonth)) %>% 
      group_by(yearmonth) %>% 
      summarise(total_transaksi = n()) %>% 
      ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
freq_bulanan <- plot_ly(Transaksi_Bulan, 
                x = ~yearmonth, 
                y = ~total_transaksi, 
                type = 'bar',
                marker = list(color = rainbow(12),
                         line = list(color = "black",
                                     width = 1.5)))
freq_bulanan <- freq_bulanan %>% layout(title = "Frekuensi Transaksi Perbulan",
                xaxis = list(title = "Bulan",
                             type = "date",
                             tickformat = "%b %y"),
                yaxis = list(title = "Total Transaksi"))
freq_bulanan
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Your argument: Transaction frequency was highest in November 2011, with 2754 transactions, and lowest in January 2011, with 1088 transactions.

• Use an interactive horizontal bar chart to show the top 10 most popular products by transaction frequency.

plot_most_frequency <- data_input_1 %>% group_by(StockCode,Description) %>% 
  summarise(frequency = n()) %>% 
  ungroup() %>% 
  arrange(desc(frequency)) %>% 
  head(10) %>% 
  mutate(Description = as.factor(Description),
         Description = reorder(Description,frequency)) %>% 
  ggplot(aes(x=Description, y=frequency))+
  geom_bar(stat="identity", aes(fill=Description, text=frequency), show.legend = FALSE)+
  labs(title="Most 10 Popular Product by Frequency Order",
        x=NULL)+
  coord_flip()
## `summarise()` regrouping output by 'StockCode' (override with `.groups` argument)
## Warning: Ignoring unknown aesthetics: text
ggplotly(plot_most_frequency, tooltip="text") %>% 
  layout(showlegend=FALSE) %>% 
  config(displayModeBar = FALSE, scrollZoom = FALSE)

Your argument: By transaction frequency, the most purchased product is White Hanging Heart T-Light Holder, and the lowest of the ten shown is Lunch Bag Black Skull and Suki Design.

• Use an interactive horizontal bar chart to show the 10 most popular products by quantity ordered.

plot_most_quantity <- data_input_1 %>% 
  group_by(StockCode,Description) %>% 
  summarise(total_quantity = sum(Quantity)) %>% 
  ungroup() %>% 
  arrange(desc(total_quantity)) %>% 
  head(10) %>% 
  mutate(Description = as.factor(Description),
         Description = reorder(Description,total_quantity)) %>% 
  ggplot(aes(x=Description, y=total_quantity))+
  geom_bar(stat="identity", aes(fill=Description, text=total_quantity), show.legend = FALSE)+
  labs(title="Most 10 Popular Product by Quantity Order",
        x=NULL)+
  coord_flip()
## `summarise()` regrouping output by 'StockCode' (override with `.groups` argument)
## Warning: Ignoring unknown aesthetics: text
ggplotly(plot_most_quantity, tooltip = "text") %>% 
  layout(showlegend=FALSE) %>% 
  config(displayModeBar = FALSE, scrollZoom = FALSE)

Your argument: By quantity ordered, the most popular product is Medium Ceramic Top Storage Jar, and the lowest of the ten shown is Pack of 60 Pink Paisley Cake Cases.

• Use an interactive horizontal bar chart to show the 10 most popular products by total number of customers.

plot_most_customer <- data_input_1 %>%
  select(CustomerID, StockCode,Description) %>% 
  distinct() %>% 
  group_by(StockCode,Description) %>% 
  summarise(total_customer = n()) %>% 
  ungroup() %>% 
  arrange(desc(total_customer)) %>% 
  head(10) %>% 
  mutate(Description = as.factor(Description),
         Description = reorder(Description,total_customer)) %>% 
  ggplot(aes(x=Description, y=total_customer))+
  geom_bar(stat="identity", aes(fill=Description, text=total_customer), show.legend = FALSE)+
  labs(title="Most 10 Popular Product by Total Customer",
        x=NULL)+
  coord_flip()
## `summarise()` regrouping output by 'StockCode' (override with `.groups` argument)
## Warning: Ignoring unknown aesthetics: text
ggplotly(plot_most_customer, tooltip = "text") %>% 
  layout(showlegend=FALSE) %>% 
  config(displayModeBar = FALSE, scrollZoom = FALSE)

Your argument: By total number of customers, the most popular product is Regency Cakestand 3 Tier, and the lowest of the ten shown is Baking Set 9 Piece Retrospot.

• Use an interactive horizontal bar chart to show the 10 most popular products by monetary value.

plot_most_profit <- data_input_1 %>% 
  group_by(StockCode,Description) %>% 
  summarise(total_amount = sum(total_amount)) %>% 
  ungroup() %>% 
  arrange(desc(total_amount)) %>% 
  head(10) %>% 
  mutate(Description = as.factor(Description),
         Description = reorder(Description,total_amount)) %>% 
  ggplot(aes(x=Description, y=total_amount))+
  geom_bar(stat="identity", aes(fill=Description, text=paste0("GBP ",total_amount)), show.legend = FALSE)+
  labs(title="Most 10 Popular Product by Order Value",
        x=NULL)+
  coord_flip()
## `summarise()` regrouping output by 'StockCode' (override with `.groups` argument)
## Warning: Ignoring unknown aesthetics: text
ggplotly(plot_most_profit, tooltip = "text") %>% 
  layout(showlegend=FALSE) %>% 
  config(displayModeBar = FALSE, scrollZoom = FALSE)

Your argument: By monetary value, the most popular product is Regency Cakestand 3 Tier, and the lowest of the ten shown is Paper Chain Kit 50’s Christmas.
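The quantity and monetary-value rankings above share the same shape: group by product, sum a column, sort, and keep the top rows. As a sketch only, that pattern could be wrapped in a small helper. The function name `top_products` and the toy table are hypothetical and not part of the original analysis; `.groups = "drop"` assumes dplyr 1.0.0 or later and also silences the `summarise()` regrouping messages.

```r
library(dplyr)

# Hypothetical helper: rank products by the sum of one numeric column.
top_products <- function(df, value, n = 10) {
  df %>%
    group_by(StockCode, Description) %>%
    summarise(total = sum({{ value }}), .groups = "drop") %>%
    arrange(desc(total)) %>%
    head(n)
}

# Toy data with the same column names as the transaction table.
toy <- tibble(
  StockCode   = c("A1", "A1", "B2", "C3"),
  Description = c("JAR", "JAR", "BAG", "CUP"),
  Quantity    = c(5, 7, 20, 1)
)

top2 <- top_products(toy, Quantity, n = 2)
top2
```

On the real data the same helper would be called with `Quantity` or `total_amount`; the frequency and customer rankings count rows rather than summing, so they would need a slightly different summary.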

• Use time series analysis to determine whether sales by monetary value are increasing or decreasing.

income_bulan <- data_input_1 %>%
                   group_by(Description, total_amount) %>%
                   summarise(awal = min(InvoiceDate)) %>%
                   ungroup() %>%
                  mutate(ym = format(awal, format = "%y-%m-1"),
                         ym = ymd(ym)) %>%
                   group_by(ym) %>%
                   summarise(income = sum(total_amount)) %>%
                   ungroup() %>%
                  mutate(popup = glue("Bulan : {ym}
                                       Income : {income}"))
## `summarise()` regrouping output by 'Description' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
income_bulan_plot <- income_bulan %>%
  ggplot(aes(x = ym,
             y = income,
             group = 1)) +
  geom_line(size = 1, colour="red") +
  labs( title = "Perubahan Income Perbulan",
        x = "Bulan",
        y = "Income")+
  scale_x_date(breaks = date_breaks(width = "1 month"),
               labels = date_format("%b %y")) +
  scale_y_continuous(label = scales::format_format(big.mark = ",",
                     decimal.mark = ".",
                     scientific = F))

income_harian <- df_customer_transaction %>%
                   group_by(Description, total_amount) %>%
                   summarise(awal = min(InvoiceDate)) %>%
                   ungroup() %>%
                  mutate(ym = format(awal, format = "%y-%m-%d"),
                         ym = ymd(ym)) %>%
                   group_by(ym) %>%
                   summarise(income = sum(total_amount)) %>%
                   ungroup() %>%
                  mutate(popup = glue("Bulan : {ym}
                                       Income : {income}"))
## `summarise()` regrouping output by 'Description' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
income_hari_plot <- income_harian %>%
  ggplot(aes(x = ym,
             y = income,
             group = 1)) +
  geom_line(size = 1, colour="green") +
  labs( title = "Perubahan Income Perbulan - Perhari",
        x = "Bulan",
        y = "Income")+
  scale_x_date(breaks = date_breaks(width = "1 month"),
               labels = date_format("%b %y")) +
  scale_y_continuous(label = scales::format_format(big.mark = ",",
                     decimal.mark = ".",
                     scientific = F))

hasil <- gridExtra::grid.arrange(income_bulan_plot, income_hari_plot, ncol = 1)
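Note that the two pipelines above first group by Description and total_amount and keep only the earliest InvoiceDate, so rows that repeat the same product and amount contribute to the income once. If every transaction row is meant to count, a more direct aggregation is possible. The sketch below shows that alternative on a hypothetical toy table; it is not the computation behind the plots above.

```r
library(dplyr)
library(lubridate)

# Hypothetical transaction rows with an amount per row.
toy <- tibble(
  InvoiceDate  = as.Date(c("2011-10-05", "2011-10-20", "2011-11-02")),
  total_amount = c(10.5, 4.5, 30)
)

monthly_income <- toy %>%
  mutate(ym = floor_date(InvoiceDate, "month")) %>%   # first day of each month
  group_by(ym) %>%
  summarise(income = sum(total_amount), .groups = "drop")

monthly_income
```

For a daily series, the same pipeline could group on the date itself instead of the month.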

• Use a tree map to visualize where most consumers are, by country.

treemap(data_input_1,
        index      = c("Country"),
        vSize      = "Quantity",
        title      = "",
        border.col = "grey40")

Your argument: The largest share of consumers is in the United Kingdom.
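The tree map above sizes each country by Quantity, that is, by units ordered rather than by number of consumers. If the intent is to compare consumer counts, one option is to count distinct customers per country first. A minimal sketch with hypothetical toy data:

```r
library(dplyr)

# Hypothetical rows: two UK customers with small orders, one French customer with a large one.
toy <- tibble(
  Country    = c("United Kingdom", "United Kingdom", "United Kingdom", "France"),
  CustomerID = c(1, 1, 2, 3),
  Quantity   = c(1, 2, 3, 100)
)

customers_by_country <- toy %>%
  group_by(Country) %>%
  summarise(customers = n_distinct(CustomerID), .groups = "drop") %>%
  arrange(desc(customers))

customers_by_country
```

The resulting table could then be passed to treemap() with index = "Country" and vSize = "customers". In this toy example France dominates by quantity but the United Kingdom leads by customer count, which is why the choice of vSize matters.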

8 Task 7

Give your views and opinions on the case you worked on above (what you would do as a Manager to grow the online retail business, based on the findings of your analysis).

Your argument:

Effective ways to grow the online retail business:
- Improve product quality
- Improve service quality
- Offer product discounts
- Run promotions regularly
- Introduce the newest product innovations