1 Introduction

1.1 Data Explanation

Retail ilustration. Source: textmaster.com

The data which will be explored in this project is based from a retail company named PT. Lighthouse. PT. Lighthouse is a retail company that sells 3 types of goods, namely Office Supplies, Technology, and Furniture. This dataset contains details of the transactions that occurred from 2014 to 2017, such as order date, segments of buyers, quantity, etc.

1.2 Business Questions

There are several questions that will be explored through this dataset:
- How long did it take in average for an order to be shipped since the order was made?
- What was the most used shipping mode in each segment of buyers?
- How much profit did PT. Lighthouse gain for 4 years? Did the amount of profit gained show a particular trend?
- Which transaction resulted in the highest amount of profit and the highest amount of loss respectively?
- Which category was the most profitable?
- Which subcategory among the less profitable category gained the lowest profit? Was there a particular trend of gained profit each year?
- Which category and subcategory gave the highest amount of discount in average? (Top 5)
- How many items were sold?
- Which category was the best seller in terms of quantity?
- Which subcategory was the best seller in terms of quantity? (Top 5) - Which segment of buyers purchased the most items?
- What was the best seller category from each segment of buyer?
- What was the best seller subcategory from each segment of buyer?


2 Preparation

Before getting started with the raw data, load necessary library first.

library(lubridate) # for converting data to datetime type

3 Data Wrangling

3.1 Data Importing and Inspection

After loading necessary library, import the data.

retail <- read.csv("retail.csv")
head(retail)
##   Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 1      1 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 2      2 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 3      3 CA-2016-138688    6/12/16   6/16/16   Second Class    DV-13045
## 4      4 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 5      5 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 6      6 CA-2014-115812     6/9/14   6/14/14 Standard Class    BH-11710
##     Segment      Product.ID        Category Sub.Category
## 1  Consumer FUR-BO-10001798       Furniture    Bookcases
## 2  Consumer FUR-CH-10000454       Furniture       Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies       Labels
## 4  Consumer FUR-TA-10000577       Furniture       Tables
## 5  Consumer OFF-ST-10000760 Office Supplies      Storage
## 6  Consumer FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694
dim(retail)
## [1] 9994   15

This data consists of 9994 rows and 15 variables. Let’s check if the data contains any missing value.

anyNA(retail)
## [1] FALSE

Thankfully, there are no missing values. Next, check for the data type of each variable.

str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Several variables do not have the correct data type. Order.Date and Ship.Date are categorized as character when they should be datetime, while Ship.Mode, Segment, Category, and Sub.Category should be categorized as factor. Convert the datatype into the correct ones.

retail[, c("Order.Date", "Ship.Date")] <- lapply(retail[, c("Order.Date", "Ship.Date")], mdy)
retail[, c("Ship.Mode", "Segment", "Category", "Sub.Category")] <- lapply(retail[, c("Ship.Mode", "Segment", "Category", "Sub.Category")], as.factor)

To make sure all variables have already converted into the desired data type, check for the data type once more.

str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

3.2 Data Summary

After all variables have the correct data type, we can check for the summary of each variable.

summary(retail)
##      Row.ID       Order.ID           Order.Date           Ship.Date         
##  Min.   :   1   Length:9994        Min.   :2014-01-03   Min.   :2014-01-07  
##  1st Qu.:2499   Class :character   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
##  Median :4998   Mode  :character   Median :2016-06-26   Median :2016-06-29  
##  Mean   :4998                      Mean   :2016-04-30   Mean   :2016-05-03  
##  3rd Qu.:7496                      3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
##  Max.   :9994                      Max.   :2017-12-30   Max.   :2018-01-05  
##                                                                             
##           Ship.Mode    Customer.ID               Segment      Product.ID       
##  First Class   :1538   Length:9994        Consumer   :5191   Length:9994       
##  Same Day      : 543   Class :character   Corporate  :3020   Class :character  
##  Second Class  :1945   Mode  :character   Home Office:1783   Mode  :character  
##  Standard Class:5968                                                           
##                                                                                
##                                                                                
##                                                                                
##             Category         Sub.Category  Product.Name      
##  Furniture      :2121   Binders    :1523   Length:9994       
##  Office Supplies:6026   Paper      :1370   Class :character  
##  Technology     :1847   Furnishings: 957   Mode  :character  
##                         Phones     : 889                     
##                         Storage    : 846                     
##                         Art        : 796                     
##                         (Other)    :3613                     
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
## 

From the summary, it can be concluded that: * For 4 years, starting from January 3, 2014 until December 30, 2017, there were 9994 orders.
* The majority of the orders were shipped using Standard Class mode.
* Half of the buyers were from the Consumer segment.
* Office Supplies was the most popular category, while Other was the best seller sub category.
* The highest Quantity of order was 14.
* The average Discount given was around 0.15%.
* The maximum Profit gained was 8399 dollar, while the lowest Profit was -6599.978 dollar.


4 Data Manipulation and Exploration

The data is now ready to use. Let’s explore the data by answering these business questions.

1. How long did it take in average for an order to be shipped since the order was made?

To answer this question, we need to count the duration between Order.Date and Ship.Date

retail$duration <- difftime(retail$Ship.Date, retail$Order.Date, units = "days")
mean(retail$duration)
## Time difference of 3.958175 days

Answer: In average, it took 3.96 days for an order to be shipped.

2. What was the most favorable shipping mode in each segment of buyers?

cust_ship <- table(retail$Ship.Mode, retail$Segment)
cust_ship
##                 
##                  Consumer Corporate Home Office
##   First Class         769       485         284
##   Same Day            317       114         112
##   Second Class       1020       609         316
##   Standard Class     3085      1812        1071

Answer: Standard Class was the most used shipping mode in all of buyer segments.

3. How much profit did PT. Lighthouse gain for 4 years? Did the amount of profit gained show a particular trend?

Since year variabel isn’t available yet, we need to extract the year from the Order.Date

retail$Year <- year(retail$Order.Date)
profit <- sum(retail$Profit)
profit
## [1] 286397
profit_Year <- as.data.frame(aggregate(Profit ~ Year, retail, sum))
profit_Year
##   Year   Profit
## 1 2014 49543.97
## 2 2015 61618.60
## 3 2016 81795.17
## 4 2017 93439.27

Answer: For 4 years, PT. Lighthouse gained 286,397 dollars worth of profit. There was a positive trend of profit gained each year.

4. Which transaction resulted in the highest amount of profit and the highest amount of loss respectively?

retail[retail$Profit == max(retail$Profit) | retail$Profit == min(retail$Profit), c("Order.Date", "Segment", "Category",
                                                                                    "Sub.Category", "Product.Name", "Discount", "Profit")]
##      Order.Date   Segment   Category Sub.Category
## 6827 2016-10-02 Corporate Technology      Copiers
## 7773 2016-11-25  Consumer Technology     Machines
##                                   Product.Name Discount    Profit
## 6827     Canon imageCLASS 2200 Advanced Copier      0.0  8399.976
## 7773 Cubify CubeX 3D Printer Double Head Print      0.7 -6599.978

Answer: PT. Lighthouse had both its highest profit and highest amount loss in 2016 from Copiers and Machines sub category, respectively. A high amount of loss could be explained by huge amount of discount given for Cubify CubeX 3D Printer Double Head Print, i.e. 70%.

5. Which category was the most profitable?

# Profit per Category
profit_cat <- as.data.frame(aggregate(Profit ~ Category, retail, sum))
profit_cat
##          Category    Profit
## 1       Furniture  18451.27
## 2 Office Supplies 122490.80
## 3      Technology 145454.95

Answer: Technology was the category with the most profit. Meanwhile, furniture gained the lowest profit.

6. Which subcategory among Furniture gained the lowest profit? Was there a particular trend of gained profit each year?

# Profit per Sub Category untuk Category Furniture per Year
furniture <- retail[retail$Category == "Furniture", ]
profit_furniture_sub_Year <- as.data.frame(aggregate(Profit ~ Year + Sub.Category, furniture, sum))
profit_furniture_sub_Year
##    Year Sub.Category     Profit
## 1  2014    Bookcases  -346.1707
## 2  2015    Bookcases -2755.2302
## 3  2016    Bookcases   212.4710
## 4  2017    Bookcases  -583.6261
## 5  2014       Chairs  6955.0470
## 6  2015       Chairs  6228.4161
## 7  2016       Chairs  5763.1539
## 8  2017       Chairs  7643.5493
## 9  2014  Furnishings  1972.8919
## 10 2015  Furnishings  3051.8189
## 11 2016  Furnishings  3935.2700
## 12 2017  Furnishings  4099.1628
## 13 2014       Tables -3124.0427
## 14 2015       Tables -3509.8019
## 15 2016       Tables -2950.9418
## 16 2017       Tables -8140.6947

Answer: Among 4 subcategories in Furniture, Bookcases and Tables consistently suffered from loss throughout 4 years period.

7. Which category and subcategory gave the highest amount of discount in average? (Top 5)

# The average discount per category
disc_cat <- as.data.frame(aggregate(Discount ~ Category + Sub.Category, retail, mean))
disc_cat <- disc_cat[order(disc_cat$Discount, decreasing = T), ]
head(disc_cat, 5)
##           Category Sub.Category  Discount
## 4  Office Supplies      Binders 0.3722915
## 12      Technology     Machines 0.3060870
## 17       Furniture       Tables 0.2612853
## 5        Furniture    Bookcases 0.2111404
## 6        Furniture       Chairs 0.1701783

Answer: Binders from Office Supplies, Machines from Technology category, and Tables, Bookcases, and Chairs from Furnitures gave the highest amount of discount in average.

8. How many items were sold?

sum(retail$Quantity)
## [1] 37873

Answer: For 4 years, 37,873 items were sold in total.

9. Which category was the best seller in terms of quantity?

cat_qty <- as.data.frame(aggregate(Quantity ~ Category, retail, sum))
cat_qty
##          Category Quantity
## 1       Furniture     8028
## 2 Office Supplies    22906
## 3      Technology     6939

Answer: Office supplies was the best seller category.

10. Which subcategory was the best seller in terms of quantity?

subcat_qty <- as.data.frame(aggregate(Quantity ~ Sub.Category, retail, sum))
subcat_qty <- subcat_qty[order(subcat_qty$Quantity, decreasing = T), ]
head(subcat_qty,5)
##    Sub.Category Quantity
## 4       Binders     5974
## 13        Paper     5178
## 10  Furnishings     3563
## 14       Phones     3289
## 15      Storage     3158

Answer: Binders, Paper, Furnishings, Phones, and Storage were the most popular subcategories.

11. Which segment of buyers purchased the most items?

# Kuantitas barang yang dibeli oleh masing2 segment
qty_segment <- as.data.frame(aggregate(Quantity ~ Segment, retail, sum))
qty_segment
##       Segment Quantity
## 1    Consumer    19521
## 2   Corporate    11608
## 3 Home Office     6744

Answer: Buyer segment that bought the highest amount of goods was Consumer.

12. What was the best seller category from each segment of buyer?

segment_qty <- as.data.frame(aggregate(Quantity ~ Segment + Category, retail, sum))
segment_qty <- segment_qty[order(segment_qty$Quantity, decreasing = T), ]
segment_qty
##       Segment        Category Quantity
## 4    Consumer Office Supplies    11758
## 5   Corporate Office Supplies     7018
## 1    Consumer       Furniture     4166
## 6 Home Office Office Supplies     4130
## 7    Consumer      Technology     3597
## 2   Corporate       Furniture     2495
## 8   Corporate      Technology     2095
## 3 Home Office       Furniture     1367
## 9 Home Office      Technology     1247

Answer: Office Supplies was the best seller category in each segment of buyer.

13. What was the best seller subcategory from each segment of buyer?

segment_sub <- as.data.frame(aggregate(Quantity ~ Sub.Category + Segment, retail, sum))
segment_sub <- segment_sub[order(segment_sub$Quantity, decreasing = T), ]
head(segment_sub, 15)
##    Sub.Category     Segment Quantity
## 4       Binders    Consumer     3015
## 13        Paper    Consumer     2602
## 21      Binders   Corporate     1848
## 10  Furnishings    Consumer     1834
## 14       Phones    Consumer     1685
## 3           Art    Consumer     1625
## 15      Storage    Consumer     1619
## 1   Accessories    Consumer     1578
## 30        Paper   Corporate     1555
## 6        Chairs    Consumer     1234
## 38      Binders Home Office     1111
## 27  Furnishings   Corporate     1086
## 47        Paper Home Office     1021
## 31       Phones   Corporate     1003
## 32      Storage   Corporate     1000

Answer: Binders were the most bought subcategory in each buyers segment.


5 Conclusions and Business Recommendation

For a period of 4 years, PT. Lighthouse has sold 37,873 items, with majority of the transactions came from the Office Supplies in terms of quantity. Binders were the most popular subcategory in every segment of buyers. Buyers came from 3 different segments, i.e. Consumer which made up majority of buyers, Corporate, and Home Office. Among all segment of buyers, Standard Mode was the most used mode of shipping. In average, it took around 3.96 days for an order to be shipped.

PT. Lighthouse managed to gain 286,397 dollars worth of profit in a period of 4 years and there was an increasing trend of profit gained each year. Technology was the category with the highest total of profit. But, ironically, it contributed to PT. Lighthouse’s highest ever amount of loss in a transaction that occurred in 2016, because of the huge amount of discount given at that transaction. Furniture was the category with the lowest amount of profit and 2 subcategories, namely Bookcases and Tables from mentioned category consistently suffered from loss almost every year.

In average, percentage of discount given for every transaction was 15%, with Binders from Office Supplies, Machines from Technology category, and Tables, Bookcases, and Chairs from Furnitures gave the highest amount of discount in average.

From the answered business questions, there are several things that could be improved:
1. Furniture was the category with the lowest amount of total profit, even 2 subcategories in it consistently suffered from loss throughout the years. But on the other hand, it gave relatively high amount of discount in average. It is necessary to adjust the amount of the discount given so that losses could be minimized.
2. Office supplies is the most sold category, especially Binders. It’s necessary to maintain its stock.
3. Promotional efforts are needed to attract buyers from other segments.
4. Technology is the category that generates the highest amount of profit, despite having the lowest amount of orders among other categories. To maximize the amount of profit that can be gained, promotional efforts for this category are needed in order to attract more buyers.