1 Introduction

Nice to meet you all! My name is Jose Octavian Leandro. In this session, I will conduct a retail data analysis using a dataset containing information about product sales.

The dataset used is retail data from a fictional company, which includes details about product sales. This dataset covers sales data from 2014 to 2017 and consists of 9,994 rows and 15 columns.

library(ggplot2)
library(readr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

2 Read Data

Displaying Data Structure

str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

From the inspection we have conducted, there are 9,994 rows and 15 columns. These columns contain information about:

Row.ID: ID of the data row

Order.ID: Order ID

Order.Date: Order date

Ship.Date: Shipping date

Ship.Mode: Shipping mode

Customer.ID: Customer ID

Segment: Customer segmentation

Product.ID: Product ID

Category: Product category

Sub.Category: Product subcategory

Product.Name: Product name

Sales: Sales revenue

Quantity: Quantity of products

Discount: Discount applied

Profit: Profit earned

2.1 Data Cleaning

From the data inspection results, we can see that the Order.Date and Ship.Date columns have a character data type. We will change the data type of these columns to Date.

Additionally, we will convert the data type of the Ship.Mode, Segment, and Category columns to factor. Finally, we will modify the levels in the Ship.Mode column to “Standard Class”, “Second Class”, “First Class”, and “Same Day”.

# explicit coercion kolom Order.Date 
retail$Order.Date <- as.Date(retail$Order.Date, format = "%m/%d/%y")
# explicit coercion kolom Ship.Date 
retail$Ship.Date <- as.Date(retail$Ship.Date, format = "%m/%d/%y")


retail$Ship.Mode <- as.factor(retail$Ship.Mode)

retail$Segment <-  as.factor(retail$Segment)

retail$Category <-  as.factor(retail$Category)

levels(retail$Ship.Mode) <- c("Standard Class", "Second Class", "First Class", "Same Day")

We will perform explicit coercion on the Order.Date and Ship.Date columns. We will change their data type to Date.

Additionally, we will also perform explicit coercion on the Ship.Mode, Segment, and Category columns, converting their data type to factor.

Finally, we will modify the levels in the Ship.Mode column to “Standard Class”, “Second Class”, “First Class”, and “Same Day”.

colSums(is.na(retail))
##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0

We will check for missing values using colSums on the dataset.

From the results above, we can see that there are no missing values in our dataset.

3 Explonatory Data Analysis

# Data Exploration
summary(retail)
##      Row.ID          Order.ID           Order.Date           Ship.Date         
##  Min.   :   1.00   Length:9994        Min.   :2014-01-03   Min.   :2014-01-07  
##  1st Qu.:2499.25   Class :character   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
##  Median :4997.50   Mode  :character   Median :2016-06-26   Median :2016-06-29  
##  Mean   :4997.50                      Mean   :2016-04-30   Mean   :2016-05-03  
##  3rd Qu.:7495.75                      3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
##  Max.   :9994.00                      Max.   :2017-12-30   Max.   :2018-01-05  
##           Ship.Mode    Customer.ID               Segment      Product.ID       
##  Standard Class:1538   Length:9994        Consumer   :5191   Length:9994       
##  Second Class  : 543   Class :character   Corporate  :3020   Class :character  
##  First Class   :1945   Mode  :character   Home Office:1783   Mode  :character  
##  Same Day      :5968                                                           
##                                                                                
##                                                                                
##             Category    Sub.Category       Product.Name      
##  Furniture      :2121   Length:9994        Length:9994       
##  Office Supplies:6026   Class :character   Class :character  
##  Technology     :1847   Mode  :character   Mode  :character  
##                                                              
##                                                              
##                                                              
##      Sales                   Quantity                 Discount             
##  Min.   :    0.44400000   Min.   : 1.00000000000   Min.   :0.000000000000  
##  1st Qu.:   17.28000000   1st Qu.: 2.00000000000   1st Qu.:0.000000000000  
##  Median :   54.49000000   Median : 3.00000000000   Median :0.200000000000  
##  Mean   :  229.85800083   Mean   : 3.78957374425   Mean   :0.156202721633  
##  3rd Qu.:  209.94000000   3rd Qu.: 5.00000000000   3rd Qu.:0.200000000000  
##  Max.   :22638.48000000   Max.   :14.00000000000   Max.   :0.800000000000  
##      Profit                
##  Min.   :-6599.9780000000  
##  1st Qu.:    1.7287500000  
##  Median :    8.6665000000  
##  Mean   :   28.6568963078  
##  3rd Qu.:   29.3640000000  
##  Max.   : 8399.9760000000

3.1 Summary Data

From the summary above, we can see an overview of our dataset. We can observe information regarding:

Average sales (Sales): 229.858

Average quantity of products (Quantity): 3.789

Average discount (Discount): 0.156

Average profit (Profit): 28.656

Highest sales (Sales): 22,638.48

Highest quantity of products (Quantity): 14

Highest discount (Discount): 0.8

Highest profit (Profit): 8,399.98

First order date: 2014-01-03, last order date: 2017-12-30

Total Sales

sum(retail$Sales)
## [1] 2297200.8603

Total Profit

sum(retail$Profit)
## [1] 286397.0217

Total Quantity

sum(retail$Quantity)
## [1] 37873

3.2 Sales

Viewing the total sales for each product category.

sales_category <- aggregate(
  Sales ~ Category
  , data = retail
  , FUN = sum
)
order_sales_category <- sales_category[order(sales_category$Sales, decreasing = TRUE),]
order_sales_category

We can also analyze the total sales for each product category per year.

sales_category_year <- aggregate(
  Sales ~ Category + format(Order.Date, "%Y")
  , data = retail
  , FUN = sum
)
order_sales_category_year <- sales_category_year[order(sales_category_year$Sales, decreasing = TRUE),]
names(order_sales_category_year) <- c("Category", "Year", "Sales")
order_sales_category_year

Then, we visualize it using ggplot2.

ggplot(order_sales_category_year, aes(x = Year, y = Sales, fill = Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Total Sales for Each Product Category",
    x = "Year",
    y = "Total Sales" ,
    fill = "Category") +
  theme_minimal() +
  scale_fill_manual(values = c("#E07A5F", "#F1E3D3", "#1B4D3E"))

From the visualization, we can observe that:

  1. Total sales in the Office Supplies category increased significantly in 2017.

  2. The Technology category consistently had the highest sales each year.

  3. Sales increased every year, except for Office Supplies in 2015.

  4. The highest sales were recorded in the Technology category in 2017, reaching 271,730.81.

3.3 Profit

First, we will look at the total profit for each product category.

profit_category <- aggregate(
  Profit ~ Category
  , data = retail
  , FUN = sum
)
order_profit_category <- profit_category[order(profit_category$Profit, decreasing = TRUE),]
order_profit_category

Next, we will look at the total profit for each product category per year.

profit_category_year <- aggregate(
  Profit ~ Category + format(Order.Date, "%Y")
  , data = retail
  , FUN = sum
)
order_profit_category_year <- profit_category_year[order(profit_category_year$Profit, decreasing = TRUE),]
names(order_profit_category_year) <- c("Category", "Year", "Profit")
order_profit_category_year

Then, we visualize the data using ggplot2.

ggplot(order_profit_category_year, aes(x = Year, y = Profit, fill = Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Total Profit for Each Product Category ",
    x = "Year",
    y = "Total Profit" ,
    fill = "Category") +
  theme_minimal() +
  scale_fill_manual(values = c("#00879E", "#FFAB5B", "#FFF2DB"))

From the visualization, we can observe that:

  1. The Furniture category has the lowest profit each year.

  2. The lowest profit in the Furniture category was in 2015, reaching only 3,015.20.

  3. The Technology category consistently has the highest profit each year.

  4. The highest profit in the Technology category was in 2017, reaching 50,684.26.

Since we can see that Furniture has the lowest profit, we will now conduct a further analysis.

furniture <- retail[retail$Category == "Furniture",]
loss <- furniture[furniture$Profit < 0,]
order_loss <- loss[order(loss$Profit, decreasing = FALSE),]
order_loss[, c("Category", "Sub.Category", "Sales", "Quantity", "Discount", "Profit")]

From the table above, we can see that the Tables subcategory has the lowest total profit compared to other subcategories.

3.4 Do Profit and Sales Have a Strong Relationship?

From the visualization above, we can observe that profit and sales have a fairly strong positive relationship. This means that as sales increase, profit also tends to rise.

This can be confirmed by calculating the correlation between profit and sales.

cor(retail$Profit, retail$Sales)
## [1] 0.479064349737706

From the results above, we can see that the correlation between profit and sales is 0.48, which indicates a fairly strong positive relationship between profit and sales.

3.5 Segment

Next, we will look at the total sales for each customer segment.

sales_segment <- aggregate(
  Sales ~ Segment
  , data = retail
  , FUN = sum
)

order_sales_segment <- sales_segment[order(sales_segment$Sales, decreasing = TRUE),]
order_sales_segment

The Consumer segment, with total sales of 1,161,401.35, has the highest sales compared to Corporate and Home Office. Now, we will analyze the total profit for each customer segment.

profit_segment <- aggregate(
  Profit ~ Segment
  , data = retail
  , FUN = sum
)
order_profit_segment <- profit_segment[order(profit_segment$Profit, decreasing = TRUE),]
order_profit_segment

The Consumer segment, with a total profit of 134,119.21, has the highest profit compared to Corporate and Home Office. This further confirms that sales and profit have a fairly strong relationship.

4 Conclusion

This retail company has a total sales of 2,297,200.86 and a total profit of 286,397.02. Based on the analysis conducted, it can be concluded that sales in the Office Supplies category saw a significant increase in 2017, while the Technology category consistently had the highest total sales each year, peaking at 271,730.81 in 2017. Overall, sales increased every year.

From a profitability perspective, the Furniture category had the lowest profit each year, hitting its lowest point in 2015 at 3,015.20, whereas the Technology category consistently recorded the highest profit, with the highest profit in 2017 at 50,684.26. Additionally, the Tables subcategory had the lowest total profit compared to other subcategories.

Regarding customer segments, the Consumer segment dominated total sales with 1,161,401.35 in revenue and 134,119.21 in profit, outperforming the Corporate and Home Office segments. Furthermore, there is a positive correlation of 0.48 between profit and sales, indicating that an increase in sales tends to be accompanied by an increase in profit.

Recommendations

  1. The company should focus on selling products from the Technology category, as it consistently has the highest sales and profit each year.

  2. The company should evaluate products in the Furniture category, especially the Tables subcategory, which has the lowest total profit compared to other subcategories.

  3. The company should prioritize the Consumer segment, as it generates the highest total sales and profit compared to Corporate and Home Office.