Superstore Sales Data Analysis

introduction:The Superstore dataset is a real-world retail sales dataset (U.S.A) containing information on customer orders, including products, sales, profit, discounts, shipping methods, and customer segments across different regions.

source:Kaggle

Variables in the Superstore Dataset

Order.Date: The date when a customer placed an order.

Sales: The total amount of money generated from a sale.

Profit: The net earnings made from a sale after deducting costs.

Category: The broad product group to which an item belongs, such as Technology, Furniture, or Office Supplies.

Sub.Category: A more detailed classification of products within each category, e.g., “Phones”, “Chairs”, “Binders”.

Segment: The type of customer group making the purchase, such as Consumer, Corporate, or Home Office.

Ship.Mode: The delivery method used to ship the product to the customer.

Discount: The percentage reduction applied to the product’s original price.

Product.Name: The name of the product sold, such as “Stapler” or “iPhone”, used to track which items are most purchased, most profitable, or losing money.

Aim

The aim of this project is to analyze the Superstore dataset to understand sales performance, profitability, and customer purchasing patterns.

Research Questions

1.How do sales and profit vary and yearly?

2.Are sales increasing or decreasing over time?

3.Which category contributes most to overall sales and profit, and how do they differ in performance?

4.which products do customers buy the most and Which specific Sub-Categories are losing money?

5.Which customer segment places the highest number of orders?

6.Which region contributes the most to total sales, and what percentage of total revenue does it represent?

7.Which states contribute the most to total sales?

8.How do discount rates differ across product categories in the Superstore dataset?

Data Cleaning

Load Packages

library(ggplot2) #For visualization
library(dplyr) #for data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate) #for date
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr) #for cleaning data

Importing the Dataset

superstore<-read.csv("C:/Users/Hello/Downloads/Sample_Superstore.csv")
View(superstore)
# viewing the first six of my data
head(superstore)
##   Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
## 1      1 CA-2016-152156  11/8/2016 11/11/2016   Second Class    CG-12520
## 2      2 CA-2016-152156  11/8/2016 11/11/2016   Second Class    CG-12520
## 3      3 CA-2016-138688  6/12/2016  6/16/2016   Second Class    DV-13045
## 4      4 US-2015-108966 10/11/2015 10/18/2015 Standard Class    SO-20335
## 5      5 US-2015-108966 10/11/2015 10/18/2015 Standard Class    SO-20335
## 6      6 CA-2014-115812   6/9/2014  6/14/2014 Standard Class    BH-11710
##     Customer.Name   Segment       Country            City      State
## 1     Claire Gute  Consumer United States       Henderson   Kentucky
## 2     Claire Gute  Consumer United States       Henderson   Kentucky
## 3 Darrin Van Huff Corporate United States     Los Angeles California
## 4  Sean O'Donnell  Consumer United States Fort Lauderdale    Florida
## 5  Sean O'Donnell  Consumer United States Fort Lauderdale    Florida
## 6 Brosina Hoffman  Consumer United States     Los Angeles California
##   Postal.Code Region      Product.ID        Category Sub.Category
## 1       42420  South FUR-BO-10001798       Furniture    Bookcases
## 2       42420  South FUR-CH-10000454       Furniture       Chairs
## 3       90036   West OFF-LA-10000240 Office Supplies       Labels
## 4       33311  South FUR-TA-10000577       Furniture       Tables
## 5       33311  South OFF-ST-10000760 Office Supplies      Storage
## 6       90032   West FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694

I Imported my data using read.csv

Exploratory Data Analysis

This process is know as cleaning of the data check for the structure ,summary ,missing value ,coverting data type,handling outliers and so on…….

Checking the structure and the summary of the data

#str(superstore)
glimpse(superstore) # also like structure
## Rows: 9,994
## Columns: 21
## $ Row.ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Order.ID      <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U…
## $ Order.Date    <chr> "11/8/2016", "11/8/2016", "6/12/2016", "10/11/2015", "10…
## $ Ship.Date     <chr> "11/11/2016", "11/11/2016", "6/16/2016", "10/18/2015", "…
## $ Ship.Mode     <chr> "Second Class", "Second Class", "Second Class", "Standar…
## $ Customer.ID   <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033…
## $ Customer.Name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O…
## $ Segment       <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum…
## $ Country       <chr> "United States", "United States", "United States", "Unit…
## $ City          <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal…
## $ State         <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid…
## $ Postal.Code   <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, …
## $ Region        <chr> "South", "South", "West", "South", "South", "West", "Wes…
## $ Product.ID    <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",…
## $ Category      <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"…
## $ Sub.Category  <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F…
## $ Product.Name  <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric …
## $ Sales         <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,…
## $ Quantity      <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,…
## $ Discount      <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.…
## $ Profit        <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1…
summary(superstore)
##      Row.ID       Order.ID          Order.Date         Ship.Date        
##  Min.   :   1   Length:9994        Length:9994        Length:9994       
##  1st Qu.:2499   Class :character   Class :character   Class :character  
##  Median :4998   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :4998                                                           
##  3rd Qu.:7496                                                           
##  Max.   :9994                                                           
##   Ship.Mode         Customer.ID        Customer.Name        Segment         
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Country              City              State            Postal.Code   
##  Length:9994        Length:9994        Length:9994        Min.   : 1040  
##  Class :character   Class :character   Class :character   1st Qu.:23223  
##  Mode  :character   Mode  :character   Mode  :character   Median :56431  
##                                                           Mean   :55190  
##                                                           3rd Qu.:90008  
##                                                           Max.   :99301  
##     Region           Product.ID          Category         Sub.Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit         
##  Min.   :-6599.978  
##  1st Qu.:    1.729  
##  Median :    8.666  
##  Mean   :   28.657  
##  3rd Qu.:   29.364  
##  Max.   : 8399.976
#superstore <- distinct(superstore)
dim(superstore) #checking for the dimension
## [1] 9994   21

Checking for missing values

colSums(is.na(superstore))
##        Row.ID      Order.ID    Order.Date     Ship.Date     Ship.Mode 
##             0             0             0             0             0 
##   Customer.ID Customer.Name       Segment       Country          City 
##             0             0             0             0             0 
##         State   Postal.Code        Region    Product.ID      Category 
##             0             0             0             0             0 
##  Sub.Category  Product.Name         Sales      Quantity      Discount 
##             0             0             0             0             0 
##        Profit 
##             0

Converting Data Types

Changing date to date format and converting some categorical variables to factor so that R will not the treat the variables has a character string but rather has a factor ,when a variable is a factor you can control the other at which it appears on your plots (e.g from lowest to highest).It also makes grouping easy.

Date format and Ordering

superstore$Order.Date<-as.Date(superstore$Order.Date,format = "%m/%d/%Y")

superstore$Ship.Date<-as.Date(superstore$Ship.Date,format="%m/%d/%Y")

#Arranging date in ascending order
superstore<-superstore%>%
  arrange(Order.Date)

#viewing order
head(superstore$Order.Date)
## [1] "2014-01-03" "2014-01-04" "2014-01-04" "2014-01-04" "2014-01-05"
## [6] "2014-01-06"

Once I got the dates sorted, I noticed the same date appearing a lot. I double-checked the rows and realized they aren’t actually duplicates. it’s just that some orders have multiple items listed separately, or different people happened to buy things on the same day.

Repeated dates mean multiple sales occurred on the same day, not an error.

Converting some Columns to Factors

#names(superstore)
superstore$Category <- as.factor(superstore$Category)

superstore$Sub.Category <- as.factor(superstore$Sub.Category)

superstore$Region <- as.factor(superstore$Region)

superstore$Segment <- as.factor(superstore$Segment)

superstore$Product.Name<-as.factor(superstore$Product.Name)

ANSWER TO QUESTIONS

Question 1: How do sales and profit vary yearly?

yearly_sales <- superstore %>%
  mutate(Year = year(Order.Date)) %>%
  group_by(Year) %>%
  summarise(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Total_Orders = n()
  ) %>%
  arrange(Year)

head(yearly_sales)
## # A tibble: 4 × 4
##    Year Total_Sales Total_Profit Total_Orders
##   <dbl>       <dbl>        <dbl>        <int>
## 1  2014     484247.       49544.         1993
## 2  2015     470533.       61619.         2102
## 3  2016     609206.       81795.         2587
## 4  2017     733215.       93439.         3312

visualizing Total Sales by Year

ggplot(yearly_sales, aes(x = factor(Year), y = Total_Sales, fill = factor(Year))) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Total_Sales, 0)), vjust = -0.3, size = 3.5) +
  labs(
    title = "Total Sales by Year",
    x = "Year",
    y = "Total Sales"
  ) +
  theme_minimal()

This showing that that 2017 has the highest sales.

Question 2: are sales increasing or decreasing over time

ggplot(yearly_sales, aes(x = Year, y = Total_Sales)) +
geom_line(color = "red", size = 1) +
geom_point(color="black",size=3)+  
labs(
title = "Yearly Sales Trend Over Time",
x = "Order Date",
y = "Total Sales"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Over time the sales sky rockted when it got to 2015 and went higher

Question 3:Which category contributes most to overall sales and profit, and how do they differ in performance?

# Sales and Profit by Product Category

category_summary <- superstore %>%
  group_by(Category) %>%
  summarise(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Total_Orders = n()
  ) %>%
  arrange(desc(Total_Sales))

head(category_summary)
## # A tibble: 3 × 4
##   Category        Total_Sales Total_Profit Total_Orders
##   <fct>                 <dbl>        <dbl>        <int>
## 1 Technology          836154.      145455.         1847
## 2 Furniture           742000.       18451.         2121
## 3 Office Supplies     719047.      122491.         6026

Visualize Sales by Category

ggplot(category_summary, aes(x = reorder(Category, -Total_Sales), y = Total_Sales, fill = Category)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Total_Sales, 0)), vjust = -0.3, size = 3) +
  labs(
    title = "Total Sales by Product Category",
    x = "Product Category",
    y = "Total Sales"
  ) +
  theme_minimal()

this shows that technology was the most sales product and the total sale of across was in million

Visualize Profit by Category

ggplot(category_summary, aes(x = reorder(Category, -Total_Profit), y = Total_Profit, fill = Category)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Total_Profit, 0)), vjust = -0.3, size = 3.5) +
  labs(
    title = "Total Profit by Product Category",
    x = "Product Category",
    y = "Total Profit"
  ) +
  theme_minimal()

This shows that Technology generated the most profit simply because it had the highest sales volume. Since these are the most expensive items in the dataset, they naturally account for the largest portion of the total profit for the business.

Question 4:Which products do customers buy the most?

product_sales <- superstore %>%
group_by(Product.Name) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
  
arrange(desc(Total_Quantity)) %>%
head(10)  # Top 10 products

ggplot(product_sales, aes(x = reorder(Product.Name, Total_Quantity), y = Total_Quantity, fill = Product.Name)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Top 10 Most Purchased Products",
x = "Product Name",
y = "Total Quantity Sold") +
theme_minimal()

The top 10 list is dominated by Office Supplies because these are low-cost, essential items that customers buy frequently. While these products lead in quantity, Technology still generates more total sales revenue because the price per item is much higher.

Question :Which specific Sub-Categories are losing money?

subcat_profit <- superstore %>%
  group_by(Sub.Category) %>%
  summarise(Total_Profit = sum(Profit, na.rm = TRUE)) %>%
  arrange(desc(Total_Profit))

visualizing

ggplot(subcat_profit,
       aes(reorder(Sub.Category, Total_Profit),
           Total_Profit,
           fill = Total_Profit > 0)) +
  geom_col() +
  coord_flip() +
  labs(title = "Profitability by Sub-Category",
       x = "Sub-Category", y = "Total Profit") +
  scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "darkgreen"),
                    guide = "none") +
  theme_minimal()

The losses recorded by Tables, Bookcases, and Supplies can be attributed to heavy discounting and low profit margins, as observed in the dataset. Additionally, the bulky nature of furniture items suggests higher handling and shipping costs, which further reduces profitability.

Question 5:Which customer segment places the highest number of orders?

segment_order_counts <- superstore %>%
  group_by(Segment) %>%
  summarise(Total_Order_Lines = n(), 
            Unique_Orders = n_distinct(Order.ID)) %>%
  arrange(desc(Total_Order_Lines))

segment_order_counts
## # A tibble: 3 × 3
##   Segment     Total_Order_Lines Unique_Orders
##   <fct>                   <int>         <int>
## 1 Consumer                 5191          2586
## 2 Corporate                3020          1514
## 3 Home Office              1783           909
segment_category_counts <- superstore %>%
  group_by(Segment, Category) %>%
  summarise(Items_Ordered = n(), .groups = "drop") %>%
  arrange(Segment, desc(Items_Ordered))

segment_category_counts
## # A tibble: 9 × 3
##   Segment     Category        Items_Ordered
##   <fct>       <fct>                   <int>
## 1 Consumer    Office Supplies          3127
## 2 Consumer    Furniture                1113
## 3 Consumer    Technology                951
## 4 Corporate   Office Supplies          1820
## 5 Corporate   Furniture                 646
## 6 Corporate   Technology                554
## 7 Home Office Office Supplies          1079
## 8 Home Office Furniture                 362
## 9 Home Office Technology                342

Visualizing

ggplot(segment_category_counts, aes(x = Segment, y = Items_Ordered, fill = Category)) +
  # Fixed: Use position_dodge to separate the bars
  geom_col(position = position_dodge(width = 0.9)) + 
  # Fixed: Added position_dodge to geom_text so labels align with the bars
  geom_text(aes(label = Items_Ordered), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, 
            size = 3.5) +
  labs(
    title = "Order Volume by Segment and Category",
    subtitle = "Office Supplies is the most frequently ordered category across all segments",
    x = "Customer Segment",
    y = "Number of Items Ordered"
  ) +
  theme_minimal()

Consumers place the most orders due to frequent, smaller personal/home purchases. Office Supplies are the top items across all segments because they are essential and affordable. Furniture and Technology are bought less often due to higher cost

Question 6:Which region contributes the most to total sales, and what percentage of total revenue does it represent?

# Summarize total sales per region and calculate percentage
region_sales <- superstore %>%
  group_by(Region) %>%
  summarise(Total_Sales = sum(Sales, na.rm = TRUE)) %>%
  mutate(Percentage = round((Total_Sales / sum(Total_Sales)) * 100, 1))

# View the summarized data
region_sales
## # A tibble: 4 × 3
##   Region  Total_Sales Percentage
##   <fct>         <dbl>      <dbl>
## 1 Central     501240.       21.8
## 2 East        678781.       29.5
## 3 South       391722.       17.1
## 4 West        725458.       31.6

visualizing of region sales

# Pie chart with percentage labels
ggplot(region_sales, aes(x = "", y = Total_Sales, fill = Region)) +
  geom_col(width = 1, color = "white") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), color = "white", size = 5) +
  labs(title = "Percentage of sales  by Region", fill = "Region") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The West region contributes the highest share of total sales because it records a high volume of customer orders, strong demand across all product categories, efficient shipping and logistics, and relatively higher average sales per order compared to other regions.

Question 7: Which states contribute the most to total sale?

# Summarize total sales per state and calculate percentage
state_sales <- superstore %>%
  group_by(State) %>%
  summarise(Total_Sales = sum(Sales, na.rm = TRUE)) %>%
  mutate(Percentage = round((Total_Sales / sum(Total_Sales)) * 100, 2)) %>%
  arrange(desc(Total_Sales))

# View top states
head(state_sales, 10)
## # A tibble: 10 × 3
##    State        Total_Sales Percentage
##    <chr>              <dbl>      <dbl>
##  1 California       457688.      19.9 
##  2 New York         310876.      13.5 
##  3 Texas            170188.       7.41
##  4 Washington       138641.       6.04
##  5 Pennsylvania     116512.       5.07
##  6 Florida           89474.       3.89
##  7 Illinois          80166.       3.49
##  8 Ohio              78258.       3.41
##  9 Michigan          76270.       3.32
## 10 Virginia          70637.       3.07
# Bar chart for top 10 states
top_states <- state_sales %>% slice_max(Total_Sales, n = 10)

ggplot(top_states, aes(x = reorder(State, Total_Sales), 
                        y = Total_Sales, fill = State)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Top 10 States by Total Sales",
    x = "State",
    y = "Total Sales"
  ) +
  theme_minimal()

California emerges as the top-performing state due to its large customer base, the highest number of orders, strong demand for high-value products such as technology and office supplies, and a high concentration of business and commercial activities.

Question 8:How do discount rates differ across product categories in the Superstore dataset?

ggplot(superstore, aes(x = Category, y = Discount, fill = Category)) +
  geom_boxplot() +
  labs(title = "Distribution of Discounts by Category",
       x = "Category",
       y = "Discount Rate") +
  theme_minimal()

This boxplot shows that discounting strategies vary by product category. Office Supplies are the most unpredictable, with most items sold at full price (median 0%) but occasional outliers up to 80%, indicating unusually high discounts. Furniture generally has a consistent discount around 20%, though some items are outliers at higher discounts. Technology products also have typical discounts near 20% but avoid extreme discounts, reflecting tighter control over high-cost items.

Recommendations

Based on the results of the analysis:

The store should focus more on products and categories that consistently generate profit.

Products that regularly cause losses should be reviewed, improved, or possibly removed.

Discount policies should be carefully managed so that sales growth does not lead to profit loss.

More attention should be given to regions that perform well, while strategies should be developed to improve weaker regions.

Business decisions should focus not only on increasing sales volume but also on increasing profitability.

Conclusion

After analyzing the (U.S.A) store data, I found that while sales are strong, profits are affected by factors like product type, region, and discount levels. These insights can help make better business decisions and improve overall performance.