LBB2 - Retail Dataset Visualization

1. Case Description

Brief Introduction

Hi My Name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my second RPubs Document as part of Learn By Building Assignment at Algoritma Data Science School.

Data Explanation

The data stored a record of a transactions of a retail store specifying several variables. Please take a look at the following glossary for your referrence:
- Order.ID: Id of order.
- Order.Date: Date of order.
- Ship.Date: Date of shipment.
- Ship.Mode: Type of shipment.
- Customer.ID: Id of customer.
- Segment : Customer’s segment.
- Product.ID: Id of product.
- Category : Product category with 3 levels (“Furniture”, “Office Supplies”, “Technology”)
- Sub.Category: more specific product category
- Product.Name: Name of product that was sold.
- Sales: How much earning from each sale.
- Quantity: Quantity of item sold.
- Discount: How much discount was given for each sale.
- Profit: How much can a company earn from each sale.

Objectives

Show stages of data pre-processing that has been done before making visualizations.
Provide suitable plot for analysis with aesthetics consideration (labels of the x/y axis, Title/Subtitle, Legend, etc).
Provide insights from visualization.

2. Setup Environment

Before we start develop the analysis, let’s start with environment setup.

# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

options(scipen=9999)

After we done with environment setup, let’s call several important library that will be used during analysis such as ‘lubridate’, ‘ggplot2’, ‘dplyr’.

library(lubridate)
library(ggplot2)
library(dplyr)

3. Data Preprocessing

3.1 Data Input & Structure

Before we generate visualization of our dataset, we need to understand what’s inside the dataset. Let’s start with reading retail.csv in our folder using read.csv() function, save it into an object named retail_df, and display first 6 rows of the data.

retail_df <- read.csv("retail.csv")
head(retail_df)

we can also check the last 6 rows of the dataset using ‘tail’ function.

tail(retail_df)

Based on the output above, we now that our data has dimension of 9,994 rows and 14 columns. We can also quickly check the dimension using ‘dim’ function.

dim(retail_df)

#> [1] 9994   14

Next we will check the data structure and the data types. We need to make sure our columns have the correct data types so we can optimize the visualization and also the insights.

str(retail_df)

#> 'data.frame':    9994 obs. of  14 variables:
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : chr  "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
#>  $ Ship.Date   : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Retail dataset has 10 ‘chr’, 3 ‘num’, and 1 ‘int’ data types, but we know that some columns have wrong data types. That’s why we need to do adjustment on ‘Order.Date’, ‘Ship.Date’, ‘Ship.Mode’, ‘Segment’, ‘Category’, ‘Sub.Category’ to get the correct data types.

retail_df$Order.Date<- mdy(retail_df$Order.Date)
retail_df$Ship.Date <- mdy(retail_df$Ship.Date)
retail_df$Ship.Mode <- as.factor(retail_df$Ship.Mode)
retail_df$Segment <- as.factor(retail_df$Segment)
retail_df$Category <- as.factor(retail_df$Category)
retail_df$Sub.Category <- as.factor(retail_df$Sub.Category)
str(retail_df)

#> 'data.frame':    9994 obs. of  14 variables:
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Good, now our dataset has the correct data types for each column.

3.2. Missing Data

Even though the dataset already has the correct data types, we still need to check for any missing data and conduct necessary data treatment for those missing data.

anyNA(retail_df)

#> [1] FALSE

colSums(is.na(retail_df))

#>     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID      Segment 
#>            0            0            0            0            0            0 
#>   Product.ID     Category Sub.Category Product.Name        Sales     Quantity 
#>            0            0            0            0            0            0 
#>     Discount       Profit 
#>            0            0

Based on the output above, the dataset doesn’t have any missing values. That means we don’t need to do any missing value data treatment on the dataset.

3.3. Subsetting and Practical Statistics

We will extract ‘Month’,‘Quarter’,and ‘year’ columns based on ‘Order.Date’ to help us get insight on which timeframe the store generate most sales or profit.

retail_df$Order.Month <- month(retail_df$Order.Date,label=TRUE, abbr = FALSE)
retail_df$Order.Quarter <- quarter(retail_df$Order.Date, type='year.quarter', fiscal_start = 1)
retail_df$Order.Year <- year(retail_df$Order.Date)
retail_df$Order.Quarter <- as.factor(retail_df$Order.Quarter)
retail_df$Order.Year <- as.factor(retail_df$Order.Year)
str(retail_df)

#> 'data.frame':    9994 obs. of  17 variables:
#>  $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date   : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date    : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode    : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment      : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category     : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category : Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales        : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount     : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit       : num  41.91 219.58 6.87 -383.03 2.52 ...
#>  $ Order.Month  : Ord.factor w/ 12 levels "January"<"February"<..: 11 11 6 10 10 6 6 6 6 6 ...
#>  $ Order.Quarter: Factor w/ 16 levels "2014.1","2014.2",..: 12 12 10 8 8 2 2 2 2 2 ...
#>  $ Order.Year   : Factor w/ 4 levels "2014","2015",..: 3 3 3 2 2 1 1 1 1 1 ...

We also need to add sales/quantity and profit/quantity columns in order to make objective comparison for each transaction at the retail store.

retail_df$SperQ <- retail_df$Sales / retail_df$Quantity
retail_df$PperQ <- retail_df$Profit / retail_df$Quantity
str(retail_df)

#> 'data.frame':    9994 obs. of  19 variables:
#>  $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date   : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date    : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode    : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment      : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category     : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category : Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales        : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount     : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit       : num  41.91 219.58 6.87 -383.03 2.52 ...
#>  $ Order.Month  : Ord.factor w/ 12 levels "January"<"February"<..: 11 11 6 10 10 6 6 6 6 6 ...
#>  $ Order.Quarter: Factor w/ 16 levels "2014.1","2014.2",..: 12 12 10 8 8 2 2 2 2 2 ...
#>  $ Order.Year   : Factor w/ 4 levels "2014","2015",..: 3 3 3 2 2 1 1 1 1 1 ...
#>  $ SperQ        : num  130.98 243.98 7.31 191.52 11.18 ...
#>  $ PperQ        : num  20.96 73.19 3.44 -76.61 1.26 ...

The last data transformation that we need to do is add ‘Discount_Cat’ columns to categorize ‘Discount’ columns into: ‘0%’,‘0-20%’,‘>=20%’

pw <- function(x){ 
    
    if(x == 0){
      x <- "0%"
    }else if(x > 0 & x < 0.2){
      x <- "0-20%"
    }else{
      x <- ">=20%"
    }
}
retail_df$Discount_Cat <- sapply(retail_df$Discount, pw)
retail_df$Discount_Cat <- factor(retail_df$Discount_Cat,levels=c('0%','0-20%','>=20%'))
str(retail_df)

#> 'data.frame':    9994 obs. of  20 variables:
#>  $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date   : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date    : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode    : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment      : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category     : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category : Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales        : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount     : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit       : num  41.91 219.58 6.87 -383.03 2.52 ...
#>  $ Order.Month  : Ord.factor w/ 12 levels "January"<"February"<..: 11 11 6 10 10 6 6 6 6 6 ...
#>  $ Order.Quarter: Factor w/ 16 levels "2014.1","2014.2",..: 12 12 10 8 8 2 2 2 2 2 ...
#>  $ Order.Year   : Factor w/ 4 levels "2014","2015",..: 3 3 3 2 2 1 1 1 1 1 ...
#>  $ SperQ        : num  130.98 243.98 7.31 191.52 11.18 ...
#>  $ PperQ        : num  20.96 73.19 3.44 -76.61 1.26 ...
#>  $ Discount_Cat : Factor w/ 3 levels "0%","0-20%",">=20%": 1 1 1 3 3 1 1 3 3 1 ...

Now the dataset has grow into 20 columns instead of initial 14 columns. we can conduct quick check of column names inside the dataset.

names(retail_df)

#>  [1] "Order.ID"      "Order.Date"    "Ship.Date"     "Ship.Mode"    
#>  [5] "Customer.ID"   "Segment"       "Product.ID"    "Category"     
#>  [9] "Sub.Category"  "Product.Name"  "Sales"         "Quantity"     
#> [13] "Discount"      "Profit"        "Order.Month"   "Order.Quarter"
#> [17] "Order.Year"    "SperQ"         "PperQ"         "Discount_Cat"

dim(retail_df)

#> [1] 9994   20

Cool!! The dataset is looking incredible, let’s check the statistical summary of our dataset.

summary(retail_df)

#>    Order.ID           Order.Date           Ship.Date         
#>  Length:9994        Min.   :2014-01-03   Min.   :2014-01-07  
#>  Class :character   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
#>  Mode  :character   Median :2016-06-26   Median :2016-06-29  
#>                     Mean   :2016-04-30   Mean   :2016-05-03  
#>                     3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
#>                     Max.   :2017-12-30   Max.   :2018-01-05  
#>                                                              
#>           Ship.Mode    Customer.ID               Segment      Product.ID       
#>  First Class   :1538   Length:9994        Consumer   :5191   Length:9994       
#>  Same Day      : 543   Class :character   Corporate  :3020   Class :character  
#>  Second Class  :1945   Mode  :character   Home Office:1783   Mode  :character  
#>  Standard Class:5968                                                           
#>                                                                                
#>                                                                                
#>                                                                                
#>             Category         Sub.Category  Product.Name      
#>  Furniture      :2121   Binders    :1523   Length:9994       
#>  Office Supplies:6026   Paper      :1370   Class :character  
#>  Technology     :1847   Furnishings: 957   Mode  :character  
#>                         Phones     : 889                     
#>                         Storage    : 846                     
#>                         Art        : 796                     
#>                         (Other)    :3613                     
#>      Sales              Quantity        Discount          Profit         
#>  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
#>  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
#>                                                                          
#>     Order.Month   Order.Quarter  Order.Year      SperQ         
#>  November :1471   2017.4 :1219   2014:1993   Min.   :   0.336  
#>  December :1408   2016.4 : 918   2015:2102   1st Qu.:   5.470  
#>  September:1383   2017.3 : 903   2016:2587   Median :  16.270  
#>  October  : 819   2015.4 : 806   2017:3312   Mean   :  60.920  
#>  May      : 735   2014.4 : 755               3rd Qu.:  63.940  
#>  June     : 717   2016.3 : 740               Max.   :3773.080  
#>  (Other)  :3461   (Other):4653                                 
#>      PperQ            Discount_Cat
#>  Min.   :-1319.9956   0%   :4798  
#>  1st Qu.:    0.7228   0-20%: 146  
#>  Median :    2.7670   >=20%:5050  
#>  Mean   :    7.7994               
#>  3rd Qu.:    8.7032               
#>  Max.   : 1679.9952               
#>

Based on the summary above, we may collect some insights from the dataset:
1. First order date was occured in Jan 2014, In 4 year, the company can booked 9994 transactions.
2. The most favorable Ship mode by customers is Standard Class.
3. There are 3 Categories of Product: Furniture, Office Supplies, Technology. Office Supplies has the highest transaction.
4. There are 3 Buyers Segment at the store: Consumer, Corporate, Home Office. Consumer Segment has the highest transaction.
5. The Retail Store tends to not give discount in their successful transaction (~50%).
6. In 4 years of operation, the store makes significant improvement on their number of transaction.
7. The highest store transaction happened at Q4 2017.
8. Sales range is around 0.444 - 22,638.480 and profit range is around -6,599.978 - 8,399.976.
9. Sales per Quantity range is around 0.336 - 3,773.080 and Profit per Quantity range is around -1,319.9956 - 3,773.080.
10. Buyers in average bought 3.79 product in their every shopping session.

4. Study Case

4.1 Box Plot Analysis

ggplot(retail_df, aes(Segment,Profit))+
  geom_boxplot(aes(fill = Segment))+
  scale_y_log10()+
  labs(title = 'Profit by Segment', x='Segment', y='Profit', 
       fill = 'Segment', subtitle = 'red line indicate average Profit')+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_hline(yintercept = mean(retail_df$Profit), color = 'red',
             linetype = 5)

ggplot(retail_df, aes(Segment,Sales))+
  geom_boxplot(aes(fill = Segment))+
  scale_y_log10()+
  labs(title = 'Sales by Segment', x='Segment', y='Sales', 
       fill = 'Segment', subtitle = 'red line indicate average sales')+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_hline(yintercept = mean(retail_df$Sales), color = 'red',
             linetype = 5)

ggplot(retail_df, aes(Category,Profit))+
  geom_boxplot(aes(fill = Category))+
  scale_y_log10()+
  labs(title = 'Profit by Category', x='Category', y='Profit', 
       fill = 'Category', subtitle = 'red line indicate average profit')+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_hline(yintercept = mean(retail_df$Profit), color = 'red',
             linetype = 5)

ggplot(retail_df, aes(Category,Sales))+
  geom_boxplot(aes(fill = Category))+
  scale_y_log10()+
  labs(title = 'Sales by Category', x='Category', y='Sales', 
       fill = 'Category', subtitle = 'red line indicate average sales')+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_hline(yintercept = mean(retail_df$Sales), color = 'red',
             linetype = 5)

ggplot(retail_df,aes(Category,SperQ)) +
   geom_jitter(aes(col= retail_df$Discount_Cat)) +
   geom_boxplot(alpha=0.5) +
   scale_y_log10()+
   labs(title = "Sales per Quantity by Category", x= "Category", y= "Sales per Quantity", col = "Discount Category") +
   theme(plot.title = element_text(hjust = 0.5))

Insights:
1. Each Segment has similar mean profit value and have outlier in upper and lower whisker.
2. Each Segment has similar mean sales value and have outlier in upper whisker.
3. Office Supplies Category has the lowest mean profit and sales value compare to other categories even though we know that Office Supply has the largest transaction at the store.
4. Based on Sales per Quantity by Category plot, looks like Office Supplies suffers low Sales per Quantity due to higher discount applied on the transaction.

4.2 Bar Chart Analysis

#create new group by table for Category in each quarter
retail_dfQ_grpcat <- retail_df  %>% group_by(Order.Quarter,Category) %>% summarise(Total_Quantity = sum(Quantity),
                                                                                  Total_Sales = sum(Sales),
                                                                                  Total_Profit = sum(Profit),
                                                                                  .groups = 'drop')

#Generate Bar Chart to indicate Total Quantity Sold per Quarter by Category
ggplot(retail_dfQ_grpcat, aes(Order.Quarter,Total_Quantity,colour=Category))+
  geom_col(aes(fill = Category))+
  labs(title = 'Historical Quarterly Product Sold by Category', x='Order Date',
       y='Total Quantity', fill = 'Category')+
  theme(plot.title = element_text(vjust=0.5, hjust = 0.5), 
        axis.text.x = element_text(angle=90, hjust=0.5, vjust=0.5))

xtabs(Quantity~ Category+Order.Quarter, retail_df)

#>                  Order.Quarter
#> Category          2014.1 2014.2 2014.3 2014.4 2015.1 2015.2 2015.3 2015.4
#>   Furniture          224    300    429    670    244    345    479    707
#>   Office Supplies    642    948   1333   1646    573    956   1348   1838
#>   Technology         162    275    397    555    173    303    414    599
#>                  Order.Quarter
#> Category          2016.1 2016.2 2016.3 2016.4 2017.1 2017.2 2017.3 2017.4
#>   Furniture          332    455    558    848    280    535    612   1010
#>   Office Supplies    656   1419   1750   2121   1174   1578   2118   2806
#>   Technology         255    366    459    618    391    438    654    880

#create new group by table for Sub Category
retail_dfQ_grpsubcat <- retail_df  %>% group_by(Sub.Category) %>% summarise(Total_Quantity = sum(Quantity),
                                                                            .groups = 'drop')
retail_dfQ_grpsubcat <- retail_dfQ_grpsubcat[order(retail_dfQ_grpsubcat$Total_Quantity, decreasing = TRUE),]

#Generate bar chart to capture sub category ranking based on total quantity
ggplot(data = retail_dfQ_grpsubcat, mapping = aes(x=  Total_Quantity, y= reorder(Sub.Category,Total_Quantity))) +
  geom_col(aes(fill = Total_Quantity))  + # untuk membuat barplot
  labs(
    title = "Sub Category Rank Based on Total Quantity",
    x = "Total Quantity",
    y = "Sub Category"
  ) +
  scale_fill_gradient(low = "#8ecae6", high = "#023047") + # untuk mewarnai plot dengan teknik gradasi
  geom_text(mapping = aes(label = Total_Quantity), # menambahkan informasi label masing-masing kategori
            col = "black", # memberikan warna pada text
            nudge_x = 200) + # menggeser text berdasarkan sumbu x
  theme_classic() +
  theme(plot.title = element_text(vjust=0.5, hjust = 0.5),
        legend.position = "none") # untuk menghilangkan legend.

Insights:
1. Based on Historical Quarterly Product Sold by Category, we can see that the store always has positive total quantity of product sold growth over the time since 2014.
2. Lowest total product sold always happened at Q1 while the highest product sold happened at Q4. Office Supplies always be the number 1 ranking in term of total product sold at the retail store.
3. Based on Sub Category Rank Based on Total Quantity, Binders has the highest Total Quantity and Copiers has the lowest Total Quantity at the retail store.

4.3 Line Chart Analysis

Since we know that the largest total quantity happened at Q4 2017, we want to dig deeper on daily total sales profile for each category.

#create new group by table for Category
retail_df_grpcat <- retail_df  %>% group_by(Order.Date,Category) %>% summarise(Total_Quantity = sum(Quantity),
                                                                  Total_Sales = sum(Sales),
                                                                  Total_Profit = sum(Profit),
                                                                  .groups = 'drop')
#Extract date information and change to correct data type
retail_df_grpcat$Order.Month <- month(retail_df_grpcat$Order.Date,label=TRUE, abbr = FALSE)
retail_df_grpcat$Order.Quarter <- quarter(retail_df_grpcat$Order.Date, type='year.quarter', fiscal_start = 1)
retail_df_grpcat$Order.Year <- year(retail_df_grpcat$Order.Date)
retail_df_grpcat$Order.Quarter <- as.factor(retail_df_grpcat$Order.Quarter)
retail_df_grpcat$Order.Year <- as.factor(retail_df_grpcat$Order.Year)

#Generate Line Plot Historical Total Sales per Category
ggplot(retail_df_grpcat[(retail_df_grpcat$Order.Quarter=='2017.4') ,],
                         aes(Order.Date,Total_Sales,colour=Category))+
  geom_line(aes(fill = Category))+
  geom_point(aes(fill = Category))+
  labs(title = 'Historical Total Sales by Category', x='Order Date',
       y='Total Sales', fill = 'Category')+
  theme(plot.title = element_text(hjust = 0.5))

Insights:
1. Based on daily chart above, Furniture sales is more stable than Office Supplies and Technology.
2. The Highest Total Sales Happened at October 2017 in Technology Category.

4.3 Violin Plot Analysis

ggplot(retail_df,aes(Category,SperQ))+
   scale_y_log10()+
   geom_violin(aes(fill = Segment)) +
   labs (title = "Sales per Quantity Violin Plot in Buyer Segment", x = "Category", y = "Sales per Quantity", fill ="Segment")+
   theme(plot.title = element_text(hjust = 0.5))

Insights:
1. Office Supplies Category has the lowest Sales per Quantity value in the most populated area compare to Furniture and Technology.
2. Furniture and Technology highest population range is around 10-1000, Office SUpplies around 1-10

5. Final Conclusion

In 4 years (2014-2017), the company have generate 9994 transactions and also shows significant growth on their number of unit sold at the store. The store sells 3 main category of products: Furniture, Office Supplies, and Technology. There are also 3 Buyers Segment at the store: Consumer, Corporate, and Home Office. Office Supplies Category and Consumer Segment have the most highest transaction. Meanwhile, The customers like to use Standard Class ship mode to deliver their order.

The store tends to not give discount in their successful transaction (~50%). I think this is the ideal way to do their business because when they gives a lot of discount, it will increase their risk to generate. Based on Box Plot Analysis, high amount of discount have tendency to generate much lower sales and can lead to loss of profit. My suggestion is for the store to conduct exercise on maximum discount that can be applied for each product category to minimize their loss.

A work by Ade Anggi N S

adeanggins@gmail.com