Exploratory Data Analysis

Introduction

This data contains of sales record that occurred in 2014-2017 from a retail company. We will explore the retail data and provide business recommendations to improve its growth and profitability.

Set Up Library

First, we must set up the library that we needed.

library(lubridate)
library(dplyr)

Data Preparation

Input Data

Input our data and put it into ‘retail’ object. We use stringsAsFactors = True, so all columns with character data type will be converted into factors.

retail <- read.csv("retail.csv", stringsAsFactors = T)

Overview our data:

head(retail)

tail(retail)

Data Structure

Check the number of columns and rows.

dim(retail)

## [1] 9994   15

Retail data contains 9,994 rows and 15 columns.

View all columns and the data types.

glimpse(retail)

## Rows: 9,994
## Columns: 15
## $ Row.ID       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17~
## $ Order.ID     <fct> CA-2016-152156, CA-2016-152156, CA-2016-138688, US-2015-1~
## $ Order.Date   <fct> 11/8/16, 11/8/16, 6/12/16, 10/11/15, 10/11/15, 6/9/14, 6/~
## $ Ship.Date    <fct> 11/11/16, 11/11/16, 6/16/16, 10/18/15, 10/18/15, 6/14/14,~
## $ Ship.Mode    <fct> Second Class, Second Class, Second Class, Standard Class,~
## $ Customer.ID  <fct> CG-12520, CG-12520, DV-13045, SO-20335, SO-20335, BH-1171~
## $ Segment      <fct> Consumer, Consumer, Corporate, Consumer, Consumer, Consum~
## $ Product.ID   <fct> FUR-BO-10001798, FUR-CH-10000454, OFF-LA-10000240, FUR-TA~
## $ Category     <fct> Furniture, Furniture, Office Supplies, Furniture, Office ~
## $ Sub.Category <fct> Bookcases, Chairs, Labels, Tables, Storage, Furnishings, ~
## $ Product.Name <fct> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric U~
## $ Sales        <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600, ~
## $ Quantity     <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2, ~
## $ Discount     <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.0~
## $ Profit       <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1.~

Some of data types are not in the correct type. Later we need to convert into correct type.

Pre-processing Data

Check the missing value.

colSums(is.na(retail))

##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0

No missing value found!

Next, we must convert data types into correct type. The data type of Order.Date and Ship.Date should be Date, then we will delete Row.ID which is not needed.

retail <- retail %>% 
  select(-c(Row.ID)) %>% 
  mutate(Order.Date = mdy(Order.Date),
         Ship.Date = mdy(Ship.Date))

Let’s take a look again!

glimpse(retail)

## Rows: 9,994
## Columns: 14
## $ Order.ID     <fct> CA-2016-152156, CA-2016-152156, CA-2016-138688, US-2015-1~
## $ Order.Date   <date> 2016-11-08, 2016-11-08, 2016-06-12, 2015-10-11, 2015-10-~
## $ Ship.Date    <date> 2016-11-11, 2016-11-11, 2016-06-16, 2015-10-18, 2015-10-~
## $ Ship.Mode    <fct> Second Class, Second Class, Second Class, Standard Class,~
## $ Customer.ID  <fct> CG-12520, CG-12520, DV-13045, SO-20335, SO-20335, BH-1171~
## $ Segment      <fct> Consumer, Consumer, Corporate, Consumer, Consumer, Consum~
## $ Product.ID   <fct> FUR-BO-10001798, FUR-CH-10000454, OFF-LA-10000240, FUR-TA~
## $ Category     <fct> Furniture, Furniture, Office Supplies, Furniture, Office ~
## $ Sub.Category <fct> Bookcases, Chairs, Labels, Tables, Storage, Furnishings, ~
## $ Product.Name <fct> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric U~
## $ Sales        <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600, ~
## $ Quantity     <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2, ~
## $ Discount     <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.0~
## $ Profit       <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1.~

Now the data type of all columns are correct. We can go on to the next step.

Data Explanation

Summary of all columns.

summary(retail)

##            Order.ID      Order.Date           Ship.Date         
##  CA-2017-100111:  14   Min.   :2014-01-03   Min.   :2014-01-07  
##  CA-2017-157987:  12   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
##  CA-2016-165330:  11   Median :2016-06-26   Median :2016-06-29  
##  US-2016-108504:  11   Mean   :2016-04-30   Mean   :2016-05-03  
##  CA-2015-131338:  10   3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
##  CA-2016-105732:  10   Max.   :2017-12-30   Max.   :2018-01-05  
##  (Other)       :9926                                            
##           Ship.Mode      Customer.ID          Segment    
##  First Class   :1538   WB-21850:  37   Consumer   :5191  
##  Same Day      : 543   JL-15835:  34   Corporate  :3020  
##  Second Class  :1945   MA-17560:  34   Home Office:1783  
##  Standard Class:5968   PP-18955:  34                     
##                        CK-12205:  32                     
##                        EH-13765:  32                     
##                        (Other) :9791                     
##            Product.ID              Category         Sub.Category 
##  OFF-PA-10001970:  19   Furniture      :2121   Binders    :1523  
##  TEC-AC-10003832:  18   Office Supplies:6026   Paper      :1370  
##  FUR-FU-10004270:  16   Technology     :1847   Furnishings: 957  
##  FUR-CH-10001146:  15                          Phones     : 889  
##  FUR-CH-10002647:  15                          Storage    : 846  
##  TEC-AC-10002049:  15                          Art        : 796  
##  (Other)        :9896                          (Other)    :3613  
##                      Product.Name      Sales              Quantity    
##  Staple envelope           :  48   Min.   :    0.444   Min.   : 1.00  
##  Easy-staple paper         :  46   1st Qu.:   17.280   1st Qu.: 2.00  
##  Staples                   :  46   Median :   54.490   Median : 3.00  
##  Avery Non-Stick Binders   :  20   Mean   :  229.858   Mean   : 3.79  
##  Staples in misc. colors   :  19   3rd Qu.:  209.940   3rd Qu.: 5.00  
##  KI Adjustable-Height Table:  18   Max.   :22638.480   Max.   :14.00  
##  (Other)                   :9797                                      
##     Discount          Profit         
##  Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :0.2000   Median :    8.666  
##  Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :0.8000   Max.   : 8399.976  
##

Summary :

The transaction occured in January 2014 to December 2017.
Standart Class was the most used for shipping mode, followed by Second Class, First Class, and Same Day.
Most buyers coming from Consumer Segment, followed by Corporate Segment and Home Office Segment.
Office Supplies was the most purchased category, followed by Furniture and Technology.
Based on Sub category, Binders was the most purchased.
Sales average is 229.858, maximum Sales is 22,638.480, and minimum Sales is 0.444.
Quantity average is 3.79, maximum Quantity is 14, and minimum Quantity is 1.
Max amount of discount is 80%, meanwhile the average number of discount is 15.62%.
Profit average is 28.657, maximum Profit is 8,399.976, and minimum Profit is -6,599.978.

It can be seen from the summary that Sales and Profit data are quite spread out, so we can check more detail with a boxplot.

boxplot(retail$Sales, retail$Profit, names = c("Sales", "Profit"))

There are quite a lot outliers in Sales at the top, while Profit has outliers in both of the top and bottom of the data.

Business Case

How much sales does the company make?

sum(retail$Sales)

## [1] 2297201

The amount of sales that the company make is 2,297,201.

How much profit does the company gain?

sum(retail$Profit)

## [1] 286397

The amount of profit that the company gains is 286,397.

How much the total quantity sold?

sum(retail$Quantity)

## [1] 37873

The company sold 37,873 items.

When is the transaction with the highest sales?

retail %>% 
  select(Order.Date, Sales) %>% 
  arrange(-Sales) %>% 
  top_n(1)

The highest sales occured on March 18th, 2014 with total of sales is 22,638.48.

When is the transaction with the highest profit?

retail %>% 
  select(Order.Date, Profit) %>% 
  arrange(-Profit) %>% 
  top_n(1)

The highest profit occured on October 2nd, 2016 with total of profit is 8,399.976.

When is the transaction with the highest quantity?

retail %>% 
  select(Order.Date, Quantity) %>% 
  arrange(-Quantity) %>% 
  top_n(1)

There are 29 transactions with the highest quantity (14 items) in 2014-2017.

How are sales, profit, and quantity in each segment?

retail %>% 
  select(Sales, Profit, Quantity, Segment) %>% 
  group_by(Segment) %>% 
  summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Sales)

The order from the highest to lowest number of sales, profit, and quantity based on Segment are Consumer Segment, followed by Corporate Segment, then Home Office Segments.

How are sales, profit, and quantity in each Ship Mode?

retail %>% 
  select(Sales, Profit, Quantity, Ship.Mode) %>% 
  group_by(Ship.Mode) %>% 
  summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Sales)

The order from the highest to lowest number of sales, profit, and quantity based on Ship Mode are Standard Class, followed by Second Class, Frist Class, and Same Day. Standard class has 60.2% share of all Ship Mode.

How are sales, profit, and quantity in each Category?

retail %>% 
  select(Sales, Profit, Quantity, Category) %>% 
  group_by(Category) %>% 
  summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Sales)

Technology has the highest Sales and Profit, but the lowest Quantity compared to others.
Furniture has the second highest Sales, but the lowest Profit compared to others.
Office supplies has the lowest Sales, but the Profit is quite high (second place), and the Quantity is the highest compared to others.

Let’s explore more about Sub Category!

retail %>% 
  select(Sales, Profit, Quantity, Category, Sub.Category) %>% 
  group_by(Category, Sub.Category) %>% 
  summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>% 
  arrange(-Total_Profit)

Its show that Technology with Sub Category: Copiers; Phones; and Accessories have the highest Profit. Meanwhile Office Supplies - Supplies, Furniture - Bookcases, and Furniture - Tables have minus Profit or gain loss.

How about the amount of discounts? Is that the causes of minus profit in some sub categories?

Let’s see how the discount spread.

boxplot(retail$Discount)

There are 3 outliers at the top. We want to know how much amount of discount in the top 3 biggest discount.

head(sort(unique(retail$Discount),decreasing = T),3)

## [1] 0.8 0.7 0.6

Quite surprising, the discount are 60%, 70%, and 80%. Subset the data and put it into ‘retail_discount’ object, so we can explore more about it.

retail_discount <- retail %>% 
  filter(Discount == 0.8 | Discount == 0.7 | Discount == 0.6)
nrow(retail_discount)

## [1] 856

There are 856 transactions recorded with very big discount.

retail_discount %>% 
  select(Sales, Profit, Category, Sub.Category) %>% 
  group_by(Category, Sub.Category) %>%
  summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit))  %>% 
  arrange(-Total_Sales)

Group by category dan sub category to all transactions that have big discount. It shows that all categories have minus profit or gain loss.
Based on previous question, we know that Office Supplies - Supplies, Furniture - Bookcases, and Furniture - Tables have minus profit. It turns out that Furniture - Bookcases is one of categories that gives big discount.
The Office Supplies - Supplies and Furniture - Table never give big discount.
Meanwhile for other categories that give big discount, although the profit is minus, it still gains profit after being calculated with all items in the same category.

Business Recommendation

Based on our exploratory above, we still have an opportunity to improve the market share from Corporate segment and Home Office segment.
We must maintain the quality of the Standard Class shipping mode because most buyers using that method and it also has the highest of total sales, profit, and quantity.
We could improve sales from Technology because the biggest profit comes from it even though the amount of quantity is the lowest compare to others.
Office Supplies has the highest sold quantity, therefore we have to make sure the stock items always available.
Create new strategy to sell Office Supplies - Supplies, Furniture – Bookcases, and Furniture - Tables that gain loss profit. Especially for Furniture - Bookcases, the discount given was very big (more than 50%). We must consider again to give a proper discount, so later can gain profit.

Exploratory Data Analysis - Retail Data

Safira Widya Putri

2022-05-18