Project Context
- Explore the data to discover potential sales management issues and build relevant hypothesis
- Create data visualizations/dashboards to facilitate issue/hypo discussion with team managers/memebers who aren’t failiar with data analysis
Why EDA?
- Explore data to form initial hypothesis
Data Pre-processing?
- Read data and library
library(tidyverse)## ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ─ Conflicts ────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(knitr)
SalesTable <- read_csv('SalesTable.csv')## Parsed with column specification:
## cols(
## Agency = col_integer(),
## Product_ID = col_integer(),
## Client_ID = col_integer(),
## Sales_Amount = col_integer(),
## Sales = col_integer()
## )
ClientTable <- read_csv('ClientTable.csv')## Parsed with column specification:
## cols(
## Client_ID = col_integer(),
## Client_Name = col_character()
## )
ProductTable <- read_csv('ProductTable.csv')## Parsed with column specification:
## cols(
## Product_ID = col_integer(),
## Product_Name = col_character()
## )
- Join all the data and look into the data structure
SalesTableNew <- SalesTable %>%
inner_join(ClientTable, by = 'Client_ID') %>%
inner_join(ProductTable, by = 'Product_ID')
kable(SalesTableNew[1:10,])| Agency | Product_ID | Client_ID | Sales_Amount | Sales | Client_Name | Product_Name |
|---|---|---|---|---|---|---|
| 101 | 1004 | 2003 | 465 | 78971 | CC | D |
| 101 | 1007 | 2003 | 182 | 23053 | CC | G |
| 101 | 1008 | 2003 | 757 | 61897 | CC | H |
| 101 | 1016 | 2003 | 714 | 69885 | CC | P |
| 101 | 1017 | 2003 | 321 | 37015 | CC | Q |
| 101 | 1018 | 2003 | 56 | 6141 | CC | R |
| 101 | 1002 | 2003 | 541 | 54350 | CC | B |
| 101 | 1004 | 2003 | 475 | 56077 | CC | D |
| 101 | 1007 | 2003 | 222 | 40973 | CC | G |
| 101 | 1008 | 2003 | 1868 | 208044 | CC | H |
- Change data type for the following analysis : Turn the number code from “interger” into “factor”
SalesTableNew$Agency <- as.factor(SalesTableNew$Agency)
SalesTableNew$Product_ID <- as.factor(SalesTableNew$Product_ID)
SalesTableNew$Client_ID <- as.factor(SalesTableNew$Client_ID)Propose Hypothesis
Product unit price * Sales
- Question : What is the relationship between product unit price & sales amount?
- Chart purpose & type : Correlation, Scatterplot
- Fact : Sales mostly generated by products which price at ~150 NTD
- Hypothesis : The average product unit price is lower/similar/highter compared to the industry benchmark
- Further Data Verification : Product unit price industry benchmark
SalesTablePrice <- SalesTableNew %>%
mutate( Unit_Price = Sales / Sales_Amount)
ggplot(data = SalesTablePrice,
aes( x = Unit_Price,
y = Sales_Amount))+
geom_point(color = 'red',
alpha = 0.5) + theme_bw()## Warning: Removed 7 rows containing missing values (geom_point).
Sales pattern * Client
- Question : What’s different in Sales between different clients?
- Chart Purpose & Type : Distribution , Boxplot
- Fact : LL’s avg. product sales (total sales by pdct) are obviously higher than other clients
- Hypothesis : LL’s products sales is much more healthier
- Further Data Verification : How many pdcts by clients have
ggplot(SalesTableNew)+geom_boxplot( aes( x = factor(Client_Name),
y = Sales,
colour = Client_Name))+
labs( x = 'Client',
title = 'Sales Distribution by Client') + theme_bw()