Project Context


Why EDA?


Data Pre-processing?

library(tidyverse)
## ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ─ Conflicts ────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(knitr)

SalesTable <- read_csv('SalesTable.csv')
## Parsed with column specification:
## cols(
##   Agency = col_integer(),
##   Product_ID = col_integer(),
##   Client_ID = col_integer(),
##   Sales_Amount = col_integer(),
##   Sales = col_integer()
## )
ClientTable <- read_csv('ClientTable.csv')
## Parsed with column specification:
## cols(
##   Client_ID = col_integer(),
##   Client_Name = col_character()
## )
ProductTable <- read_csv('ProductTable.csv')
## Parsed with column specification:
## cols(
##   Product_ID = col_integer(),
##   Product_Name = col_character()
## )
SalesTableNew <- SalesTable %>%
  inner_join(ClientTable, by = 'Client_ID') %>%
  inner_join(ProductTable, by = 'Product_ID')

kable(SalesTableNew[1:10,])
Agency Product_ID Client_ID Sales_Amount Sales Client_Name Product_Name
101 1004 2003 465 78971 CC D
101 1007 2003 182 23053 CC G
101 1008 2003 757 61897 CC H
101 1016 2003 714 69885 CC P
101 1017 2003 321 37015 CC Q
101 1018 2003 56 6141 CC R
101 1002 2003 541 54350 CC B
101 1004 2003 475 56077 CC D
101 1007 2003 222 40973 CC G
101 1008 2003 1868 208044 CC H
SalesTableNew$Agency <- as.factor(SalesTableNew$Agency)
SalesTableNew$Product_ID <- as.factor(SalesTableNew$Product_ID)
SalesTableNew$Client_ID <- as.factor(SalesTableNew$Client_ID)

Propose Hypothesis

Product unit price * Sales

  • Question : What is the relationship between product unit price & sales amount?
  • Chart purpose & type : Correlation, Scatterplot
  • Fact : Sales mostly generated by products which price at ~150 NTD
  • Hypothesis : The average product unit price is lower/similar/highter compared to the industry benchmark
  • Further Data Verification : Product unit price industry benchmark
SalesTablePrice <- SalesTableNew %>%
  mutate( Unit_Price = Sales / Sales_Amount)

ggplot(data = SalesTablePrice,
             aes( x = Unit_Price,
                  y = Sales_Amount))+
  geom_point(color = 'red',
             alpha = 0.5) + theme_bw()
## Warning: Removed 7 rows containing missing values (geom_point).

Sales pattern * Client

  • Question : What’s different in Sales between different clients?
  • Chart Purpose & Type : Distribution , Boxplot
  • Fact : LL’s avg. product sales (total sales by pdct) are obviously higher than other clients
  • Hypothesis : LL’s products sales is much more healthier
  • Further Data Verification : How many pdcts by clients have
ggplot(SalesTableNew)+geom_boxplot( aes( x = factor(Client_Name),
                                         y = Sales,
                                         colour = Client_Name))+
  labs( x = 'Client',
        title = 'Sales Distribution by Client') + theme_bw()