Data Analysis on Retail Business

In this ‘Learning by Building’ project, I am working with data from the retail.csv file, which contains sales records for a company that sells furniture, office supplies, and technology products. The goal of this analysis is to support the company’s growth and long-term sustainability by identifying areas for improvement. This includes analyzing key aspects such as market segments, target audience, sales performance, profit, and more.

Data Input

Setting working directory to the correct location:

setwd("/Users/jeanice/Desktop/Algoritma/Data Visualization in R/2_practical_statistics-main")

Load the retail file

retail <- read.csv("data_input/retail.csv")

Data Inspection

head(retail)
dim(retail)
## [1] 9994   17
names(retail)
##  [1] "Order.Date"    "Ship.Date"     "Ship.Mode"     "Customer.ID"  
##  [5] "Segment"       "Product.ID"    "Category"      "Sub.Category" 
##  [9] "Product.Name"  "Sales"         "Quantity"      "Discount"     
## [13] "Profit"        "Ship.Duration" "Month"         "Day"          
## [17] "IsWeekend"

From the data inspection, we can find that the retail data has 9994 rows and 17 columns (as shown above).

Data Cleansing & Data Coercion

str(retail)
## 'data.frame':    9994 obs. of  17 variables:
##  $ Order.Date   : chr  "2016-11-08" "2016-11-08" "2016-06-12" "2015-10-11" ...
##  $ Ship.Date    : chr  "2016-11-11" "2016-11-11" "2016-06-16" "2015-10-18" ...
##  $ Ship.Mode    : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment      : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category     : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category : chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales        : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount     : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit       : num  41.91 219.58 6.87 -383.03 2.52 ...
##  $ Ship.Duration: int  3 3 4 7 7 5 5 5 5 5 ...
##  $ Month        : chr  "November" "November" "June" "October" ...
##  $ Day          : chr  "Tuesday" "Tuesday" "Sunday" "Sunday" ...
##  $ IsWeekend    : chr  "Weekday" "Weekday" "Weekend" "Weekend" ...

Some of the data has invalid data type, therefore we should change them into the right datatypes first.

library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.3
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
retail$Order.Date <- ymd(retail$Order.Date)
retail$Ship.Date <- ymd(retail$Ship.Date)
retail$Ship.Mode <- factor(retail$Ship.Mode, levels = c("Same Day", "First Class", "Second Class", "Standard Class"))
retail$Segment <- factor(retail$Segment, levels = c("Consumer", "Corporate", "Home Office"))
retail$Category <- factor(retail$Category, levels = c("Furniture", "Office Supplies", "Technology"))
retail$Sub.Category <- factor(retail$Sub.Category, levels = c("Bookcases","Chairs","Labels","Tables", "Storage", "Furnishings", "Art", "Phones", "Binders", "Appliances", "Paper", "Accessories", "Envelopes", "Fasteners", "Supplies","Machines", "Copiers"))
str(retail)
## 'data.frame':    9994 obs. of  17 variables:
##  $ Order.Date   : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date    : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode    : Factor w/ 4 levels "Same Day","First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment      : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category     : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category : Factor w/ 17 levels "Bookcases","Chairs",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales        : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount     : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit       : num  41.91 219.58 6.87 -383.03 2.52 ...
##  $ Ship.Duration: int  3 3 4 7 7 5 5 5 5 5 ...
##  $ Month        : chr  "November" "November" "June" "October" ...
##  $ Day          : chr  "Tuesday" "Tuesday" "Sunday" "Sunday" ...
##  $ IsWeekend    : chr  "Weekday" "Weekday" "Weekend" "Weekend" ...

Now each column has changed into the right data type.

#Checking missing value
colSums(is.na(retail))
##    Order.Date     Ship.Date     Ship.Mode   Customer.ID       Segment 
##             0             0             0             0             0 
##    Product.ID      Category  Sub.Category  Product.Name         Sales 
##             0             0             0             0             0 
##      Quantity      Discount        Profit Ship.Duration         Month 
##             0             0             0             0             0 
##           Day     IsWeekend 
##             0             0

There is no missing value.

Data Explanation

summary(retail)
##    Order.Date           Ship.Date                   Ship.Mode   
##  Min.   :2014-01-03   Min.   :2014-01-07   Same Day      : 543  
##  1st Qu.:2015-05-23   1st Qu.:2015-05-27   First Class   :1538  
##  Median :2016-06-26   Median :2016-06-29   Second Class  :1945  
##  Mean   :2016-04-30   Mean   :2016-05-03   Standard Class:5968  
##  3rd Qu.:2017-05-14   3rd Qu.:2017-05-18                        
##  Max.   :2017-12-30   Max.   :2018-01-05                        
##                                                                 
##  Customer.ID               Segment      Product.ID       
##  Length:9994        Consumer   :5191   Length:9994       
##  Class :character   Corporate  :3020   Class :character  
##  Mode  :character   Home Office:1783   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##             Category         Sub.Category  Product.Name      
##  Furniture      :2121   Binders    :1523   Length:9994       
##  Office Supplies:6026   Paper      :1370   Class :character  
##  Technology     :1847   Furnishings: 957   Mode  :character  
##                         Phones     : 889                     
##                         Storage    : 846                     
##                         Art        : 796                     
##                         (Other)    :3613                     
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
##                                                                          
##  Ship.Duration      Month               Day             IsWeekend        
##  Min.   :0.000   Length:9994        Length:9994        Length:9994       
##  1st Qu.:3.000   Class :character   Class :character   Class :character  
##  Median :4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.958                                                           
##  3rd Qu.:5.000                                                           
##  Max.   :7.000                                                           
## 

The information we got are:

  • The first order was in January 3rd, 2014. Meanwhile, the last order from this file was in December 30th, 2017.
  • Most products shipped using standard class
  • The biggest buyers are direct consumers
  • Office supplies were the most sold products, and the least sold were technology
  • Binders and paper were the top sold products
  • The average sales was 229.858. The highest sales reached 22638.480 and the minimum only at 0.444
  • Minimum order quantity was 1pc, and the maximum was 14pcs
  • Average discount given was 15%, the highest surprisingly was at 80%
  • During the 3 years period, the loss accounted at 6599.978 and max profit gained at 8399.976
retail

Question 1: Who is our biggest customer? The biggest customer is direct consumer

table(retail$Segment)
## 
##    Consumer   Corporate Home Office 
##        5191        3020        1783
# Summarize total profit per segment
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
profit_by_segment <- retail %>%
  group_by(Segment) %>%
  summarise(Total_Profit = sum(Profit)) %>%
  arrange(desc(Total_Profit))  # Arrange in descending order

ggplot(profit_by_segment, aes(x = reorder(Segment, Total_Profit), y = Total_Profit, fill = Segment)) +
  geom_col() +
  coord_flip() +  # Flip for better readability
  labs(title = "Total Profit by Segment",
       x = "Segment",
       y = "Total Profit ($)") +
  theme_minimal() +
  theme(legend.position = "none")  # Hide legend as Segment is already on x-axis

xtabs(formula = Profit ~ Sub.Category+Segment,
      data= retail) 
##              Segment
## Sub.Category    Consumer  Corporate Home Office
##   Bookcases   -4435.6382   638.4502    324.6320
##   Chairs      13235.3319  8344.6565   5010.1779
##   Labels       3075.9884  1760.8273    709.4383
##   Tables      -9728.0378 -4906.4986  -3090.9447
##   Storage      7104.2004  9131.0247   5043.6013
##   Furnishings  7919.4227  3508.2077   1631.5132
##   Art          3454.3011  2004.6477   1068.8382
##   Phones      23837.1147 11766.2196   8912.3963
##   Binders     17995.5972  6377.3201   5848.8460
##   Appliances   6981.9282  7429.8952   3726.1820
##   Paper       15534.6436 10361.5468   8157.3789
##   Accessories 20735.9225 12707.4805   8493.2327
##   Envelopes    3264.4126  2571.2290   1128.5351
##   Fasteners     576.8008   251.9030    120.8144
##   Supplies    -1657.5513   338.9264    129.5254
##   Machines     2141.0618   703.0190    540.6761
##   Copiers     24083.7106 18990.2789  12543.8354

This visualization presents the profit generated by each segment, along with details of the sub-categories sold within each segment. The results can help the company strategically focus on selling specific products to targeted markets to ensure sustained business growth. For example, bookcases may be more profitable when sold exclusively to corporate and home office segments. Additionally, given that table sales have resulted in losses across all segments, the company should consider discontinuing them to minimize financial setbacks.

Question 2: Does the number of sales align with the profit?

ggplot(data= retail, 
       mapping= aes(x = Sales, y = Profit, fill = Category)) +
  geom_boxplot(mapping = aes(colour = Category)) +
  labs(title = "Profit Distribution by Category", x = "Sales", y = "Profit")

Does Sales Align with Profit?

  • Not always. While higher sales generally result in higher profits, there are multiple cases where sales are high, but the profit is negative or low.
  • All categories seems to have inconsistent relationship, with several transactions resulting in a loss. However, technology has the highest profit and biggest loss as well.

Question 3: Which product(s) generate most profit?

profit_contributor <- aggregate(x = Profit ~ Sub.Category,
          data = retail,
          FUN = 'sum')
profit_contributor[order(profit_contributor$Profit, decreasing= T),]
#subsetting data to show products with profit > 0 
profit_contributor <- filter(profit_contributor, Profit > 0)

#bar chart
ggplot(data = profit_contributor,
       mapping = aes(y= reorder(Sub.Category, Profit), x= Profit))+
  geom_col()

Top three profit generated by these products: Copiers, Phones and Accessories.

Question 4: What sub-category contribute to negative profit?

library(dplyr)
# Filter only negative profit values
loss_data <- retail %>% 
  filter(Profit < 0) %>% 
  group_by(Sub.Category) %>% 
  summarise(Total_Loss = sum(Profit)) %>% 
  arrange(Total_Loss)  # Arrange from highest loss to lowest

# Create the bar chart
ggplot(loss_data, aes(x = reorder(Sub.Category, Total_Loss), y = Total_Loss)) +
  geom_col() +
  coord_flip() +  # Flips the chart for better readability
  labs(title = "Total Profit Loss by Sub-Category",
       x = "Sub-Category",
       y = "Total Loss ($)") +
  theme_minimal() +
  theme(legend.position = "none")  # Hide legend since fill is just for color variation

These products contribute to profit losses, and the company may need to reassess whether to continue selling them. While sub-categories like binders, phones, and accessories are among the top profit contributors, certain product IDs within these categories may still be generating losses.

Question 5: How much discount does the company gives? Did it affect the business?

disc_given <-  as.data.frame(table(retail$Discount))
colnames(disc_given) <- c("Discount", "Frequency")
disc_given
#bar chart
ggplot(data=disc_given,
       mapping = aes(x= Discount,
                     y= Frequency))+
  geom_col()

#analyzing discount above 40%

# Prepare data
discounted_sales <- retail %>%
  filter(Discount > 0.40) %>%
  group_by(Sub.Category) %>%
  summarise(
    Count = n(),                     
    Total_Sales = sum(Sales, na.rm = TRUE), 
    Total_Profit = sum(Profit, na.rm = TRUE)
  ) %>%
  arrange(desc(Count))

# Create bar chart
ggplot(discounted_sales, aes(x = reorder(Sub.Category, Count), y = Count, fill = Total_Profit > 0)) +
  geom_col() +
  coord_flip() + # Flip to make it more readable
  labs(title = "Frequency of Discounts Above 40% by Sub-Category",
       x = "Sub-Category",
       y = "Count of Discounts Given",
       fill = "Profitable?") +
  scale_fill_manual(values = c("red", "green"), labels = c("Loss", "Profit")) + 
  theme_minimal()

The bar chart shows which sub-categories receive discount above 40% most frequently. Moreover, high discounts on these products leads to financial loses. Although high discount might indicate slow-moving inventory, the company should analyze if these products should remain in stock or replacing these items with better-performing ones. This analysis also indicates that the company should re-evaluate pricing strategies.

Question 6: Does the company experiencing profit or loss?

sum(retail$Profit)
## [1] 286397

Overall, the company remains profitable despite some losses (profit was at 286397). Moving forward, it may be necessary to reassess pricing strategies and profit margins for the least profitable or loss-generating subcategories. A potential solution could be to discontinue unprofitable products and focus on those that drive the highest sales and profitability.

Explanation and Business Recommendation

Between 2014 and 2017, the company generated a total profit of $286,397 from selling products across three categories: Furniture, Office Supplies, and Technology. Among these, Technology contributed both the highest profit ($145,454.95) and the highest losses (over -$4,000), indicating the need for a closer evaluation of its pricing and cost structures.

While consumers accounted for the highest sales, the data reveals that higher sales do not always translate to higher profits. Across all three categories, there were instances of negative profit, suggesting inefficiencies in pricing, cost management, or discount strategies.

Key Recommendations:

Expand Target Segments – To drive sales growth, the company should consider targeting corporate and office segments, especially for high-demand items such as copiers, paper, phones, and accessories. Reassess Pricing and Discounts – Review selling prices and profit margins on products with negative profits. Additionally, analyze the impact of discounts to ensure they do not further contribute to losses. Optimize Inventory Management – Identify products that should remain in stock and those that should be discontinued to minimize unprofitable inventory and improve overall profitability.