Retail Sales Data Analysis

1. Introduction

Objective

The primary aim of this project is to perform an in-depth analysis of retail sales data. By examining various aspects, we seek to derive actionable insights that could assist in informed decision-making.

2. Input Data

company <- read.csv("retail.csv")

2.1 Data Inspection

head(company)
##   Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 1      1 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 2      2 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 3      3 CA-2016-138688    6/12/16   6/16/16   Second Class    DV-13045
## 4      4 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 5      5 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 6      6 CA-2014-115812     6/9/14   6/14/14 Standard Class    BH-11710
##     Segment      Product.ID        Category Sub.Category
## 1  Consumer FUR-BO-10001798       Furniture    Bookcases
## 2  Consumer FUR-CH-10000454       Furniture       Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies       Labels
## 4  Consumer FUR-TA-10000577       Furniture       Tables
## 5  Consumer OFF-ST-10000760 Office Supplies      Storage
## 6  Consumer FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694
tail(company)
##      Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 9989   9989 CA-2017-163629   11/17/17  11/21/17 Standard Class    RA-19885
## 9990   9990 CA-2014-110422    1/21/14   1/23/14   Second Class    TB-21400
## 9991   9991 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9992   9992 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9993   9993 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9994   9994 CA-2017-119914     5/4/17    5/9/17   Second Class    CC-12220
##        Segment      Product.ID        Category Sub.Category
## 9989 Corporate TEC-PH-10004006      Technology       Phones
## 9990  Consumer FUR-FU-10001889       Furniture  Furnishings
## 9991  Consumer FUR-FU-10000747       Furniture  Furnishings
## 9992  Consumer TEC-PH-10003645      Technology       Phones
## 9993  Consumer OFF-PA-10004041 Office Supplies        Paper
## 9994  Consumer OFF-AP-10002684 Office Supplies   Appliances
##                                                                   Product.Name
## 9989                                           Panasonic KX - TS880B Telephone
## 9990                                                    Ultra Door Pull Handle
## 9991                        Tenex B1-RE Series Chair Mats for Low Pile Carpets
## 9992                                                     Aastra 57i VoIP phone
## 9993                         It's Hot Message Books with Stickers, 2 3/4" x 5"
## 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
##        Sales Quantity Discount  Profit
## 9989 206.100        5      0.0 55.6470
## 9990  25.248        3      0.2  4.1028
## 9991  91.960        2      0.0 15.6332
## 9992 258.576        2      0.2 19.3932
## 9993  29.600        4      0.0 13.3200
## 9994 243.160        2      0.0 72.9480
dim(company)
## [1] 9994   15
names(company)
##  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
##  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
## [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"

Fields in the Dataset

  • Order ID: The unique identifier for each order
  • Order Date: The date when the order was placed
  • Ship Date: The date when the order was shipped
  • Ship Mode: The mode of shipping (e.g., Second Class, Standard Class)
  • Customer ID: The unique identifier for each customer
  • Segment: The segment to which the customer belongs (e.g., Consumer, Corporate)
  • Product ID: The unique identifier for each product
  • Category: The broad category to which the product belongs (e.g., Furniture, Office Supplies)
  • Sub-Category: The sub-category of the product (e.g., Bookcases, Chairs)
  • Product Name: The name of the product
  • Sales: The sales amount for the transaction
  • Quantity: The quantity of the product sold
  • Discount: The discount applied to the transaction
  • Profit: The profit generated from the transaction

2.2 Data Cleaning and Preprocessing

Check data type for each column

str(company)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...
# Converting columns to appropriate data types

company$Row.ID <- as.character(company$Row.ID)
company$Order.ID <- as.character(company$Order.ID)
company$Order.Date <- as.Date(company$Order.Date, "%m/%d/%y")
company$Ship.Date <- as.Date(company$Ship.Date, "%m/%d/%y")
company$Customer.ID <- as.character(company$Customer.ID)
company$Product.ID <- as.character(company$Product.ID)
company$Product.Name <- as.character(company$Product.Name)

# Inspecting the structure of the dataframe after changes

str(company)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : chr  "1" "2" "3" "4" ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...
colSums(is.na(company))
##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0
anyNA(company)
## [1] FALSE

3. Data Explanation

summary (company)
##     Row.ID            Order.ID           Order.Date        
##  Length:9994        Length:9994        Min.   :2014-01-03  
##  Class :character   Class :character   1st Qu.:2015-05-23  
##  Mode  :character   Mode  :character   Median :2016-06-26  
##                                        Mean   :2016-04-30  
##                                        3rd Qu.:2017-05-14  
##                                        Max.   :2017-12-30  
##    Ship.Date           Ship.Mode         Customer.ID          Segment         
##  Min.   :2014-01-07   Length:9994        Length:9994        Length:9994       
##  1st Qu.:2015-05-27   Class :character   Class :character   Class :character  
##  Median :2016-06-29   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016-05-03                                                           
##  3rd Qu.:2017-05-18                                                           
##  Max.   :2018-01-05                                                           
##   Product.ID          Category         Sub.Category       Product.Name      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976

Quick Stats

We thoroughly inspected the company dataset, which consists of 9,994 observations across various retail metrics. Key columns were converted to appropriate data types, such as character and Date, to align with the nature of the data they represent.

  • The dataset spans from January 3, 2014, to December 30, 2017, based on the Order.Date field.
  • Sales range from a minimum of $0.444 to a maximum of $22,638.480.
  • Discounts range from 0% to 80%.
  • Profit ranges from a minimum loss of $-6,599.978 to a maximum gain of $8,399.976.
  • The most frequent quantity of items sold in an order is 3.
aggregate(Profit~Segment,company,mean)
##       Segment   Profit
## 1    Consumer 25.83687
## 2   Corporate 30.45667
## 3 Home Office 33.81866
aggregate(Profit~Segment,company,var)
##       Segment   Profit
## 1    Consumer 58851.86
## 2   Corporate 53817.63
## 3 Home Office 45105.89
aggregate(Profit~Segment,company,sd)
##       Segment   Profit
## 1    Consumer 242.5940
## 2   Corporate 231.9863
## 3 Home Office 212.3815
boxplot(company$Profit)

Key Findings

Mean Profit

  • Consumer: The mean profit for the Consumer segment is approximately $25.84.
  • Corporate: The Corporate segment has a slightly higher mean profit of around $30.46.
  • Home Office: The highest mean profit is observed in the Home Office segment at about $33.82.

Profit Variability

  • Consumer: The variance and standard deviation for the Consumer segment are 58,851.86 and 242.5940, respectively, indicating the highest variability among the three segments.
  • Corporate: The Corporate segment shows a variance of 53,817.63 and a standard deviation of 231.9863.
  • Home Office: With a variance of 45,105.89 and a standard deviation of 212.3815, the Home Office segment appears to have the least variability in profits.

Conclusions and Implications

  1. The Home Office segment not only has the highest mean profit but also the least variability, making it the most stable and profitable segment.
  2. The Consumer segment, despite having the lowest mean profit, shows the highest variability, indicating a need for further investigation and possibly a different business strategy.
  3. The Corporate segment falls in between the other two in terms of both mean profit and variability, suggesting it could be an area where strategic changes could yield significant improvements.

4. Data Manipulation and Transformation

1. Which Product Category Generates the Most Revenue?

# Aggregate Sales by Product Category
category_sales <- aggregate(Sales ~ Category, company, sum)
# Sort by Sales in Descending Order
category_sales <- category_sales[order(-category_sales$Sales),]

category_sales
##          Category    Sales
## 3      Technology 836154.0
## 1       Furniture 741999.8
## 2 Office Supplies 719047.0

Answer: The Technology category stands out as the most lucrative, suggesting that products in this category are either higher-priced or higher in demand compared to other categories.

2. What Is the Seasonal Trend in Sales?

# Convert Order.Date to month-year format
company$Order.MonthYear <- format(company$Order.Date, "%Y-%m")
# Aggregate Sales by Month-Year
monthly_sales <- aggregate(Sales ~ Order.MonthYear, company, sum)

monthly_sales
##    Order.MonthYear      Sales
## 1          2014-01  14236.895
## 2          2014-02   4519.892
## 3          2014-03  55691.009
## 4          2014-04  28295.345
## 5          2014-05  23648.287
## 6          2014-06  34595.128
## 7          2014-07  33946.393
## 8          2014-08  27909.468
## 9          2014-09  81777.351
## 10         2014-10  31453.393
## 11         2014-11  78628.717
## 12         2014-12  69545.621
## 13         2015-01  18174.076
## 14         2015-02  11951.411
## 15         2015-03  38726.252
## 16         2015-04  34195.209
## 17         2015-05  30131.686
## 18         2015-06  24797.292
## 19         2015-07  28765.325
## 20         2015-08  36898.332
## 21         2015-09  64595.918
## 22         2015-10  31404.924
## 23         2015-11  75972.564
## 24         2015-12  74919.521
## 25         2016-01  18542.491
## 26         2016-02  22978.815
## 27         2016-03  51715.875
## 28         2016-04  38750.039
## 29         2016-05  56987.728
## 30         2016-06  40344.534
## 31         2016-07  39261.963
## 32         2016-08  31115.374
## 33         2016-09  73410.025
## 34         2016-10  59687.745
## 35         2016-11  79411.966
## 36         2016-12  96999.043
## 37         2017-01  43971.374
## 38         2017-02  20301.133
## 39         2017-03  58872.353
## 40         2017-04  36521.536
## 41         2017-05  44261.110
## 42         2017-06  52981.726
## 43         2017-07  45264.416
## 44         2017-08  63120.888
## 45         2017-09  87866.652
## 46         2017-10  77776.923
## 47         2017-11 118447.825
## 48         2017-12  83829.319

3. Which Customer Segment Is Most Profitable?

# Aggregate Profit by Customer Segment
segment_profit <- aggregate(Profit ~ Segment, company, mean)

segment_profit
##       Segment   Profit
## 1    Consumer 25.83687
## 2   Corporate 30.45667
## 3 Home Office 33.81866

Answer: The Home Office segment not only has the highest average profit but also leads by a noticeable margin compared to the Corporate and Consumer segments.

4. Are Discounts Affecting Profits?

correlation <- cor(company$Discount, company$Profit)

correlation
## [1] -0.2194875

Answer: The correlation is negative, indicating that as discounts increase, profits tend to decrease.

5. How Do Sales Vary by Category and Sub-Category?

library(ggplot2)

# Create a cross-tabulation of Sales by Category and Sub-Category
sales_xtabs <- xtabs(Sales ~ Category + Sub.Category, data=company)

# Generate the heatmap
ggplot(as.data.frame(as.table(sales_xtabs)), aes(x=Category, y=Sub.Category)) +
  geom_tile(aes(fill=Freq), color="white") +
  scale_fill_gradient(low="white", high="red") +
  theme_minimal() +
  labs(title="Heatmap of Sales by Category and Sub-Category",
       x="Category",
       y="Sub-Category",
       fill="Sales")

Answer:

6. How Does Product Quantity Vary Across Different Customer Segments and Product Categories?

# Create a cross-tabulation of Quantity by Segment and Category
quantity_xtabs <- xtabs(Quantity ~ Segment + Category, data=company)

# Generate the plot
plot(quantity_xtabs, main="Cross-Tabulation of Quantity by Segment and Category", xlab="Segment", ylab="Category")

7. Is There a Relationship Between Discount and Quantity?

# Scatter plot to explore the relationship between Discount and Quantity
plot(company$Discount, company$Quantity, main="Scatter Plot of Discount vs Quantity", xlab="Discount", ylab="Quantity")

8. What Are the Trends in Profit Over Time?

# Convert Order.Date to a Date object and sort
company$Order.Date <- as.Date(company$Order.Date)
sorted_company <- company[order(company$Order.Date),]

# Plot Profit over time
plot(sorted_company$Order.Date, sorted_company$Profit, type="l", main="Profit Trends Over Time", xlab="Time", ylab="Profit")

9. What Is the Average Profit per Product?

# Aggregate Profit by Product Name
product_profit <- aggregate(Profit ~ Product.Name, company, mean)

# Generate a plot (showing top 10 profitable products for readability)
top_products <- head(product_profit[order(-product_profit$Profit),], 10)
barplot(top_products$Profit, names.arg=top_products$Product.Name, las=2, main="Top 10 Profitable Products", xlab="Product", ylab="Average Profit", cex.names = 0.4)

5. Explanatory Text & Business Recommendations

Overview

We have conducted a comprehensive analysis of the ‘company’ dataset, focusing on key business questions related to revenue generation, customer segmentation, seasonality, and discounts. Our findings offer several actionable insights that can significantly influence business strategies.

Key Findings

Business Recommendations