This data contains of sales record that occurred in 2014-2017 from a retail company. We will explore the retail data and provide business recommendations to improve its growth and profitability.
First, we must set up the library that we needed.
library(lubridate)
library(dplyr)Input our data and put it into ‘retail’ object. We use
stringsAsFactors = True, so all columns with character data
type will be converted into factors.
retail <- read.csv("retail.csv", stringsAsFactors = T)Overview our data:
head(retail)tail(retail)Check the number of columns and rows.
dim(retail)## [1] 9994 15
Retail data contains 9,994 rows and 15 columns.
View all columns and the data types.
glimpse(retail)## Rows: 9,994
## Columns: 15
## $ Row.ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17~
## $ Order.ID <fct> CA-2016-152156, CA-2016-152156, CA-2016-138688, US-2015-1~
## $ Order.Date <fct> 11/8/16, 11/8/16, 6/12/16, 10/11/15, 10/11/15, 6/9/14, 6/~
## $ Ship.Date <fct> 11/11/16, 11/11/16, 6/16/16, 10/18/15, 10/18/15, 6/14/14,~
## $ Ship.Mode <fct> Second Class, Second Class, Second Class, Standard Class,~
## $ Customer.ID <fct> CG-12520, CG-12520, DV-13045, SO-20335, SO-20335, BH-1171~
## $ Segment <fct> Consumer, Consumer, Corporate, Consumer, Consumer, Consum~
## $ Product.ID <fct> FUR-BO-10001798, FUR-CH-10000454, OFF-LA-10000240, FUR-TA~
## $ Category <fct> Furniture, Furniture, Office Supplies, Furniture, Office ~
## $ Sub.Category <fct> Bookcases, Chairs, Labels, Tables, Storage, Furnishings, ~
## $ Product.Name <fct> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric U~
## $ Sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600, ~
## $ Quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2, ~
## $ Discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.0~
## $ Profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1.~
Some of data types are not in the correct type. Later we need to convert into correct type.
Check the missing value.
colSums(is.na(retail))## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 0 0 0 0 0 0
## Segment Product.ID Category Sub.Category Product.Name Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 0 0
No missing value found!
Next, we must convert data types into correct type. The data type of Order.Date and Ship.Date should be Date, then we will delete Row.ID which is not needed.
retail <- retail %>%
select(-c(Row.ID)) %>%
mutate(Order.Date = mdy(Order.Date),
Ship.Date = mdy(Ship.Date))Let’s take a look again!
glimpse(retail)## Rows: 9,994
## Columns: 14
## $ Order.ID <fct> CA-2016-152156, CA-2016-152156, CA-2016-138688, US-2015-1~
## $ Order.Date <date> 2016-11-08, 2016-11-08, 2016-06-12, 2015-10-11, 2015-10-~
## $ Ship.Date <date> 2016-11-11, 2016-11-11, 2016-06-16, 2015-10-18, 2015-10-~
## $ Ship.Mode <fct> Second Class, Second Class, Second Class, Standard Class,~
## $ Customer.ID <fct> CG-12520, CG-12520, DV-13045, SO-20335, SO-20335, BH-1171~
## $ Segment <fct> Consumer, Consumer, Corporate, Consumer, Consumer, Consum~
## $ Product.ID <fct> FUR-BO-10001798, FUR-CH-10000454, OFF-LA-10000240, FUR-TA~
## $ Category <fct> Furniture, Furniture, Office Supplies, Furniture, Office ~
## $ Sub.Category <fct> Bookcases, Chairs, Labels, Tables, Storage, Furnishings, ~
## $ Product.Name <fct> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric U~
## $ Sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600, ~
## $ Quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2, ~
## $ Discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.0~
## $ Profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1.~
Now the data type of all columns are correct. We can go on to the next step.
Summary of all columns.
summary(retail)## Order.ID Order.Date Ship.Date
## CA-2017-100111: 14 Min. :2014-01-03 Min. :2014-01-07
## CA-2017-157987: 12 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## CA-2016-165330: 11 Median :2016-06-26 Median :2016-06-29
## US-2016-108504: 11 Mean :2016-04-30 Mean :2016-05-03
## CA-2015-131338: 10 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## CA-2016-105732: 10 Max. :2017-12-30 Max. :2018-01-05
## (Other) :9926
## Ship.Mode Customer.ID Segment
## First Class :1538 WB-21850: 37 Consumer :5191
## Same Day : 543 JL-15835: 34 Corporate :3020
## Second Class :1945 MA-17560: 34 Home Office:1783
## Standard Class:5968 PP-18955: 34
## CK-12205: 32
## EH-13765: 32
## (Other) :9791
## Product.ID Category Sub.Category
## OFF-PA-10001970: 19 Furniture :2121 Binders :1523
## TEC-AC-10003832: 18 Office Supplies:6026 Paper :1370
## FUR-FU-10004270: 16 Technology :1847 Furnishings: 957
## FUR-CH-10001146: 15 Phones : 889
## FUR-CH-10002647: 15 Storage : 846
## TEC-AC-10002049: 15 Art : 796
## (Other) :9896 (Other) :3613
## Product.Name Sales Quantity
## Staple envelope : 48 Min. : 0.444 Min. : 1.00
## Easy-staple paper : 46 1st Qu.: 17.280 1st Qu.: 2.00
## Staples : 46 Median : 54.490 Median : 3.00
## Avery Non-Stick Binders : 20 Mean : 229.858 Mean : 3.79
## Staples in misc. colors : 19 3rd Qu.: 209.940 3rd Qu.: 5.00
## KI Adjustable-Height Table: 18 Max. :22638.480 Max. :14.00
## (Other) :9797
## Discount Profit
## Min. :0.0000 Min. :-6599.978
## 1st Qu.:0.0000 1st Qu.: 1.729
## Median :0.2000 Median : 8.666
## Mean :0.1562 Mean : 28.657
## 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :0.8000 Max. : 8399.976
##
Summary :
It can be seen from the summary that Sales and Profit data are quite spread out, so we can check more detail with a boxplot.
boxplot(retail$Sales, retail$Profit, names = c("Sales", "Profit"))There are quite a lot outliers in Sales at the top, while Profit has outliers in both of the top and bottom of the data.
sum(retail$Sales)## [1] 2297201
The amount of sales that the company make is 2,297,201.
sum(retail$Profit)## [1] 286397
The amount of profit that the company gains is 286,397.
sum(retail$Quantity)## [1] 37873
The company sold 37,873 items.
retail %>%
select(Order.Date, Sales) %>%
arrange(-Sales) %>%
top_n(1)The highest sales occured on March 18th, 2014 with total of sales is 22,638.48.
retail %>%
select(Order.Date, Profit) %>%
arrange(-Profit) %>%
top_n(1)The highest profit occured on October 2nd, 2016 with total of profit is 8,399.976.
retail %>%
select(Order.Date, Quantity) %>%
arrange(-Quantity) %>%
top_n(1)There are 29 transactions with the highest quantity (14 items) in 2014-2017.
retail %>%
select(Sales, Profit, Quantity, Segment) %>%
group_by(Segment) %>%
summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Sales)The order from the highest to lowest number of sales, profit, and quantity based on Segment are Consumer Segment, followed by Corporate Segment, then Home Office Segments.
retail %>%
select(Sales, Profit, Quantity, Ship.Mode) %>%
group_by(Ship.Mode) %>%
summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Sales)The order from the highest to lowest number of sales, profit, and quantity based on Ship Mode are Standard Class, followed by Second Class, Frist Class, and Same Day. Standard class has 60.2% share of all Ship Mode.
retail %>%
select(Sales, Profit, Quantity, Category) %>%
group_by(Category) %>%
summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Sales)Let’s explore more about Sub Category!
retail %>%
select(Sales, Profit, Quantity, Category, Sub.Category) %>%
group_by(Category, Sub.Category) %>%
summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit), Total_Quantity = sum(Quantity)) %>%
arrange(-Total_Profit)Its show that Technology with Sub Category: Copiers; Phones; and Accessories have the highest Profit. Meanwhile Office Supplies - Supplies, Furniture - Bookcases, and Furniture - Tables have minus Profit or gain loss.
Let’s see how the discount spread.
boxplot(retail$Discount)There are 3 outliers at the top. We want to know how much amount of discount in the top 3 biggest discount.
head(sort(unique(retail$Discount),decreasing = T),3)## [1] 0.8 0.7 0.6
Quite surprising, the discount are 60%, 70%, and 80%. Subset the data and put it into ‘retail_discount’ object, so we can explore more about it.
retail_discount <- retail %>%
filter(Discount == 0.8 | Discount == 0.7 | Discount == 0.6)
nrow(retail_discount)## [1] 856
There are 856 transactions recorded with very big discount.
retail_discount %>%
select(Sales, Profit, Category, Sub.Category) %>%
group_by(Category, Sub.Category) %>%
summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit)) %>%
arrange(-Total_Sales)Group by category dan sub category to all transactions that have big discount. It shows that all categories have minus profit or gain loss.
Based on previous question, we know that Office Supplies - Supplies, Furniture - Bookcases, and Furniture - Tables have minus profit. It turns out that Furniture - Bookcases is one of categories that gives big discount.
The Office Supplies - Supplies and Furniture - Table never give big discount.
Meanwhile for other categories that give big discount, although the profit is minus, it still gains profit after being calculated with all items in the same category.