Retail ilustration. Source: textmaster.com
The data which will be explored in this project is based from a retail company named PT. Lighthouse. PT. Lighthouse is a retail company that sells 3 types of goods, namely Office Supplies, Technology, and Furniture. This dataset contains details of the transactions that occurred from 2014 to 2017, such as order date, segments of buyers, quantity, etc.
There are several questions that will be explored through this dataset:
- How long did it take in average for an order to be shipped since the order was made?
- What was the most used shipping mode in each segment of buyers?
- How much profit did PT. Lighthouse gain for 4 years? Did the amount of profit gained show a particular trend?
- Which transaction resulted in the highest amount of profit and the highest amount of loss respectively?
- Which category was the most profitable?
- Which subcategory among the less profitable category gained the lowest profit? Was there a particular trend of gained profit each year?
- Which category and subcategory gave the highest amount of discount in average? (Top 5)
- How many items were sold?
- Which category was the best seller in terms of quantity?
- Which subcategory was the best seller in terms of quantity? (Top 5) - Which segment of buyers purchased the most items?
- What was the best seller category from each segment of buyer?
- What was the best seller subcategory from each segment of buyer?
Before getting started with the raw data, load necessary library first.
library(lubridate) # for converting data to datetime type
After loading necessary library, import the data.
<- read.csv("retail.csv")
retail head(retail)
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 2 2 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 3 3 CA-2016-138688 6/12/16 6/16/16 Second Class DV-13045
## 4 4 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 5 5 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 6 6 CA-2014-115812 6/9/14 6/14/14 Standard Class BH-11710
## Segment Product.ID Category Sub.Category
## 1 Consumer FUR-BO-10001798 Furniture Bookcases
## 2 Consumer FUR-CH-10000454 Furniture Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies Labels
## 4 Consumer FUR-TA-10000577 Furniture Tables
## 5 Consumer OFF-ST-10000760 Office Supplies Storage
## 6 Consumer FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
dim(retail)
## [1] 9994 15
This data consists of 9994 rows and 15 variables. Let’s check if the data contains any missing value.
anyNA(retail)
## [1] FALSE
Thankfully, there are no missing values. Next, check for the data type of each variable.
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Several variables do not have the correct data type. Order.Date
and Ship.Date
are categorized as character when they should be datetime
, while Ship.Mode
, Segment
, Category
, and Sub.Category
should be categorized as factor. Convert the datatype into the correct ones.
c("Order.Date", "Ship.Date")] <- lapply(retail[, c("Order.Date", "Ship.Date")], mdy)
retail[, c("Ship.Mode", "Segment", "Category", "Sub.Category")] <- lapply(retail[, c("Ship.Mode", "Segment", "Category", "Sub.Category")], as.factor) retail[,
To make sure all variables have already converted into the desired data type, check for the data type once more.
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
After all variables have the correct data type, we can check for the summary of each variable.
summary(retail)
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Min. :2014-01-03 Min. :2014-01-07
## 1st Qu.:2499 Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## Median :4998 Mode :character Median :2016-06-26 Median :2016-06-29
## Mean :4998 Mean :2016-04-30 Mean :2016-05-03
## 3rd Qu.:7496 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :9994 Max. :2017-12-30 Max. :2018-01-05
##
## Ship.Mode Customer.ID Segment Product.ID
## First Class :1538 Length:9994 Consumer :5191 Length:9994
## Same Day : 543 Class :character Corporate :3020 Class :character
## Second Class :1945 Mode :character Home Office:1783 Mode :character
## Standard Class:5968
##
##
##
## Category Sub.Category Product.Name
## Furniture :2121 Binders :1523 Length:9994
## Office Supplies:6026 Paper :1370 Class :character
## Technology :1847 Furnishings: 957 Mode :character
## Phones : 889
## Storage : 846
## Art : 796
## (Other) :3613
## Sales Quantity Discount Profit
## Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
##
From the summary, it can be concluded that: * For 4 years, starting from January 3, 2014 until December 30, 2017, there were 9994 orders.
* The majority of the orders were shipped using Standard Class
mode.
* Half of the buyers were from the Consumer
segment.
* Office Supplies
was the most popular category, while Other
was the best seller sub category.
* The highest Quantity
of order was 14.
* The average Discount
given was around 0.15%.
* The maximum Profit
gained was 8399 dollar, while the lowest Profit
was -6599.978 dollar.
The data is now ready to use. Let’s explore the data by answering these business questions.
1. How long did it take in average for an order to be shipped since the order was made?
To answer this question, we need to count the duration between Order.Date
and Ship.Date
$duration <- difftime(retail$Ship.Date, retail$Order.Date, units = "days")
retailmean(retail$duration)
## Time difference of 3.958175 days
Answer: In average, it took 3.96 days for an order to be shipped.
2. What was the most favorable shipping mode in each segment of buyers?
<- table(retail$Ship.Mode, retail$Segment)
cust_ship cust_ship
##
## Consumer Corporate Home Office
## First Class 769 485 284
## Same Day 317 114 112
## Second Class 1020 609 316
## Standard Class 3085 1812 1071
Answer: Standard Class was the most used shipping mode in all of buyer segments.
3. How much profit did PT. Lighthouse gain for 4 years? Did the amount of profit gained show a particular trend?
Since year
variabel isn’t available yet, we need to extract the year from the Order.Date
$Year <- year(retail$Order.Date)
retail<- sum(retail$Profit)
profit profit
## [1] 286397
<- as.data.frame(aggregate(Profit ~ Year, retail, sum))
profit_Year profit_Year
## Year Profit
## 1 2014 49543.97
## 2 2015 61618.60
## 3 2016 81795.17
## 4 2017 93439.27
Answer: For 4 years, PT. Lighthouse gained 286,397 dollars worth of profit. There was a positive trend of profit gained each year.
4. Which transaction resulted in the highest amount of profit and the highest amount of loss respectively?
$Profit == max(retail$Profit) | retail$Profit == min(retail$Profit), c("Order.Date", "Segment", "Category",
retail[retail"Sub.Category", "Product.Name", "Discount", "Profit")]
## Order.Date Segment Category Sub.Category
## 6827 2016-10-02 Corporate Technology Copiers
## 7773 2016-11-25 Consumer Technology Machines
## Product.Name Discount Profit
## 6827 Canon imageCLASS 2200 Advanced Copier 0.0 8399.976
## 7773 Cubify CubeX 3D Printer Double Head Print 0.7 -6599.978
Answer: PT. Lighthouse had both its highest profit and highest amount loss in 2016 from Copiers and Machines sub category, respectively. A high amount of loss could be explained by huge amount of discount given for Cubify CubeX 3D Printer Double Head Print, i.e. 70%.
5. Which category was the most profitable?
# Profit per Category
<- as.data.frame(aggregate(Profit ~ Category, retail, sum))
profit_cat profit_cat
## Category Profit
## 1 Furniture 18451.27
## 2 Office Supplies 122490.80
## 3 Technology 145454.95
Answer: Technology was the category with the most profit. Meanwhile, furniture gained the lowest profit.
6. Which subcategory among Furniture gained the lowest profit? Was there a particular trend of gained profit each year?
# Profit per Sub Category untuk Category Furniture per Year
<- retail[retail$Category == "Furniture", ]
furniture <- as.data.frame(aggregate(Profit ~ Year + Sub.Category, furniture, sum))
profit_furniture_sub_Year profit_furniture_sub_Year
## Year Sub.Category Profit
## 1 2014 Bookcases -346.1707
## 2 2015 Bookcases -2755.2302
## 3 2016 Bookcases 212.4710
## 4 2017 Bookcases -583.6261
## 5 2014 Chairs 6955.0470
## 6 2015 Chairs 6228.4161
## 7 2016 Chairs 5763.1539
## 8 2017 Chairs 7643.5493
## 9 2014 Furnishings 1972.8919
## 10 2015 Furnishings 3051.8189
## 11 2016 Furnishings 3935.2700
## 12 2017 Furnishings 4099.1628
## 13 2014 Tables -3124.0427
## 14 2015 Tables -3509.8019
## 15 2016 Tables -2950.9418
## 16 2017 Tables -8140.6947
Answer: Among 4 subcategories in Furniture, Bookcases and Tables consistently suffered from loss throughout 4 years period.
7. Which category and subcategory gave the highest amount of discount in average? (Top 5)
# The average discount per category
<- as.data.frame(aggregate(Discount ~ Category + Sub.Category, retail, mean))
disc_cat <- disc_cat[order(disc_cat$Discount, decreasing = T), ]
disc_cat head(disc_cat, 5)
## Category Sub.Category Discount
## 4 Office Supplies Binders 0.3722915
## 12 Technology Machines 0.3060870
## 17 Furniture Tables 0.2612853
## 5 Furniture Bookcases 0.2111404
## 6 Furniture Chairs 0.1701783
Answer: Binders from Office Supplies, Machines from Technology category, and Tables, Bookcases, and Chairs from Furnitures gave the highest amount of discount in average.
8. How many items were sold?
sum(retail$Quantity)
## [1] 37873
Answer: For 4 years, 37,873 items were sold in total.
9. Which category was the best seller in terms of quantity?
<- as.data.frame(aggregate(Quantity ~ Category, retail, sum))
cat_qty cat_qty
## Category Quantity
## 1 Furniture 8028
## 2 Office Supplies 22906
## 3 Technology 6939
Answer: Office supplies was the best seller category.
10. Which subcategory was the best seller in terms of quantity?
<- as.data.frame(aggregate(Quantity ~ Sub.Category, retail, sum))
subcat_qty <- subcat_qty[order(subcat_qty$Quantity, decreasing = T), ]
subcat_qty head(subcat_qty,5)
## Sub.Category Quantity
## 4 Binders 5974
## 13 Paper 5178
## 10 Furnishings 3563
## 14 Phones 3289
## 15 Storage 3158
Answer: Binders, Paper, Furnishings, Phones, and Storage were the most popular subcategories.
11. Which segment of buyers purchased the most items?
# Kuantitas barang yang dibeli oleh masing2 segment
<- as.data.frame(aggregate(Quantity ~ Segment, retail, sum))
qty_segment qty_segment
## Segment Quantity
## 1 Consumer 19521
## 2 Corporate 11608
## 3 Home Office 6744
Answer: Buyer segment that bought the highest amount of goods was Consumer.
12. What was the best seller category from each segment of buyer?
<- as.data.frame(aggregate(Quantity ~ Segment + Category, retail, sum))
segment_qty <- segment_qty[order(segment_qty$Quantity, decreasing = T), ]
segment_qty segment_qty
## Segment Category Quantity
## 4 Consumer Office Supplies 11758
## 5 Corporate Office Supplies 7018
## 1 Consumer Furniture 4166
## 6 Home Office Office Supplies 4130
## 7 Consumer Technology 3597
## 2 Corporate Furniture 2495
## 8 Corporate Technology 2095
## 3 Home Office Furniture 1367
## 9 Home Office Technology 1247
Answer: Office Supplies was the best seller category in each segment of buyer.
13. What was the best seller subcategory from each segment of buyer?
<- as.data.frame(aggregate(Quantity ~ Sub.Category + Segment, retail, sum))
segment_sub <- segment_sub[order(segment_sub$Quantity, decreasing = T), ]
segment_sub head(segment_sub, 15)
## Sub.Category Segment Quantity
## 4 Binders Consumer 3015
## 13 Paper Consumer 2602
## 21 Binders Corporate 1848
## 10 Furnishings Consumer 1834
## 14 Phones Consumer 1685
## 3 Art Consumer 1625
## 15 Storage Consumer 1619
## 1 Accessories Consumer 1578
## 30 Paper Corporate 1555
## 6 Chairs Consumer 1234
## 38 Binders Home Office 1111
## 27 Furnishings Corporate 1086
## 47 Paper Home Office 1021
## 31 Phones Corporate 1003
## 32 Storage Corporate 1000
Answer: Binders were the most bought subcategory in each buyers segment.
For a period of 4 years, PT. Lighthouse has sold 37,873 items, with majority of the transactions came from the Office Supplies in terms of quantity. Binders were the most popular subcategory in every segment of buyers. Buyers came from 3 different segments, i.e. Consumer which made up majority of buyers, Corporate, and Home Office. Among all segment of buyers, Standard Mode was the most used mode of shipping. In average, it took around 3.96 days for an order to be shipped.
PT. Lighthouse managed to gain 286,397 dollars worth of profit in a period of 4 years and there was an increasing trend of profit gained each year. Technology was the category with the highest total of profit. But, ironically, it contributed to PT. Lighthouse’s highest ever amount of loss in a transaction that occurred in 2016, because of the huge amount of discount given at that transaction. Furniture was the category with the lowest amount of profit and 2 subcategories, namely Bookcases and Tables from mentioned category consistently suffered from loss almost every year.
In average, percentage of discount given for every transaction was 15%, with Binders from Office Supplies, Machines from Technology category, and Tables, Bookcases, and Chairs from Furnitures gave the highest amount of discount in average.
From the answered business questions, there are several things that could be improved:
1. Furniture was the category with the lowest amount of total profit, even 2 subcategories in it consistently suffered from loss throughout the years. But on the other hand, it gave relatively high amount of discount in average. It is necessary to adjust the amount of the discount given so that losses could be minimized.
2. Office supplies is the most sold category, especially Binders. It’s necessary to maintain its stock.
3. Promotional efforts are needed to attract buyers from other segments.
4. Technology is the category that generates the highest amount of profit, despite having the lowest amount of orders among other categories. To maximize the amount of profit that can be gained, promotional efforts for this category are needed in order to attract more buyers.