superstore data analysis

Superstore Sales Data Analysis

introduction:The Superstore dataset is a real-world retail sales dataset containing information on customer orders, including products, sales, profit, discounts, shipping methods, and customer segments across different regions.

source:Kaggle

Aim

The aim of this project is to analyze the Superstore dataset to understand sales performance, profitability, and customer purchasing patterns.

Research Questions

1.How do sales and profit vary and yearly?

2.Are sales increasing or decreasing over time?

3.Which category contributes most to overall sales and profit, and how do they differ in performance?

4.which products do customers buy the most and Which specific Sub-Categories are losing money?

5.Which customer segment places the highest number of orders?

6.Which region contributes the most to total sales, and what percentage of total revenue does it represent?

7.Which states contribute the most to total sales, and what percentage of total revenue do they represent?

8.How do discount rates differ across product categories in the Superstore dataset?

9.Does shipping mode influence sales performance?

Data Cleaning

Load Packages

library(ggplot2) #For visualization
library(dplyr) #for data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate) #for date

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidyr) #for cleaning data

Importing the Dataset

superstore<-read.csv("C:/Users/Hello/Downloads/Sample_Superstore.csv")
View(superstore)
# viewing the first six of my data
head(superstore)

##   Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
## 1      1 CA-2016-152156  11/8/2016 11/11/2016   Second Class    CG-12520
## 2      2 CA-2016-152156  11/8/2016 11/11/2016   Second Class    CG-12520
## 3      3 CA-2016-138688  6/12/2016  6/16/2016   Second Class    DV-13045
## 4      4 US-2015-108966 10/11/2015 10/18/2015 Standard Class    SO-20335
## 5      5 US-2015-108966 10/11/2015 10/18/2015 Standard Class    SO-20335
## 6      6 CA-2014-115812   6/9/2014  6/14/2014 Standard Class    BH-11710
##     Customer.Name   Segment       Country            City      State
## 1     Claire Gute  Consumer United States       Henderson   Kentucky
## 2     Claire Gute  Consumer United States       Henderson   Kentucky
## 3 Darrin Van Huff Corporate United States     Los Angeles California
## 4  Sean O'Donnell  Consumer United States Fort Lauderdale    Florida
## 5  Sean O'Donnell  Consumer United States Fort Lauderdale    Florida
## 6 Brosina Hoffman  Consumer United States     Los Angeles California
##   Postal.Code Region      Product.ID        Category Sub.Category
## 1       42420  South FUR-BO-10001798       Furniture    Bookcases
## 2       42420  South FUR-CH-10000454       Furniture       Chairs
## 3       90036   West OFF-LA-10000240 Office Supplies       Labels
## 4       33311  South FUR-TA-10000577       Furniture       Tables
## 5       33311  South OFF-ST-10000760 Office Supplies      Storage
## 6       90032   West FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694

I Imported my data using read.csv

Exploratory Data Analysis

This process is know as cleaning of the data check for the structure ,summary ,missing value ,coverting data type,handling outliers and so on…….

Checking the structure and the summary of the data

#str(superstore)
glimpse(superstore) # also like structure

## Rows: 9,994
## Columns: 21
## $ Row.ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Order.ID      <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U…
## $ Order.Date    <chr> "11/8/2016", "11/8/2016", "6/12/2016", "10/11/2015", "10…
## $ Ship.Date     <chr> "11/11/2016", "11/11/2016", "6/16/2016", "10/18/2015", "…
## $ Ship.Mode     <chr> "Second Class", "Second Class", "Second Class", "Standar…
## $ Customer.ID   <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033…
## $ Customer.Name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O…
## $ Segment       <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum…
## $ Country       <chr> "United States", "United States", "United States", "Unit…
## $ City          <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal…
## $ State         <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid…
## $ Postal.Code   <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, …
## $ Region        <chr> "South", "South", "West", "South", "South", "West", "Wes…
## $ Product.ID    <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",…
## $ Category      <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"…
## $ Sub.Category  <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F…
## $ Product.Name  <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric …
## $ Sales         <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,…
## $ Quantity      <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,…
## $ Discount      <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.…
## $ Profit        <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1…

summary(superstore)

##      Row.ID       Order.ID          Order.Date         Ship.Date        
##  Min.   :   1   Length:9994        Length:9994        Length:9994       
##  1st Qu.:2499   Class :character   Class :character   Class :character  
##  Median :4998   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :4998                                                           
##  3rd Qu.:7496                                                           
##  Max.   :9994                                                           
##   Ship.Mode         Customer.ID        Customer.Name        Segment         
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Country              City              State            Postal.Code   
##  Length:9994        Length:9994        Length:9994        Min.   : 1040  
##  Class :character   Class :character   Class :character   1st Qu.:23223  
##  Mode  :character   Mode  :character   Mode  :character   Median :56431  
##                                                           Mean   :55190  
##                                                           3rd Qu.:90008  
##                                                           Max.   :99301  
##     Region           Product.ID          Category         Sub.Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit         
##  Min.   :-6599.978  
##  1st Qu.:    1.729  
##  Median :    8.666  
##  Mean   :   28.657  
##  3rd Qu.:   29.364  
##  Max.   : 8399.976

#superstore <- distinct(superstore)
dim(superstore) #checking for the dimension

## [1] 9994   21

Checking for missing values

colSums(is.na(superstore))

##        Row.ID      Order.ID    Order.Date     Ship.Date     Ship.Mode 
##             0             0             0             0             0 
##   Customer.ID Customer.Name       Segment       Country          City 
##             0             0             0             0             0 
##         State   Postal.Code        Region    Product.ID      Category 
##             0             0             0             0             0 
##  Sub.Category  Product.Name         Sales      Quantity      Discount 
##             0             0             0             0             0 
##        Profit 
##             0

Converting Data Types

Changing date to date format and converting some categorical variables to factor so that R will not the treat the variables has a character string but rather has a factor ,when a variable is a factor you can control the other at which it appears on your plots (e.g from lowest to highest).It also makes grouping easy.

Date format and Ordering

superstore$Order.Date<-as.Date(superstore$Order.Date,format = "%m/%d/%Y")

superstore$Ship.Date<-as.Date(superstore$Ship.Date,format="%m/%d/%Y")

#Arranging date in ascending order
superstore<-superstore%>%
  arrange(Order.Date)

#viewing order
head(superstore$Order.Date)

## [1] "2014-01-03" "2014-01-04" "2014-01-04" "2014-01-04" "2014-01-05"
## [6] "2014-01-06"

Once I got the dates sorted, I noticed the same date appearing a lot. I double-checked the rows and realized they aren’t actually duplicates. it’s just that some orders have multiple items listed separately, or different people happened to buy things on the same day.

Repeated dates mean multiple sales occurred on the same day, not an error.

Converting some Columns to Factors

#names(superstore)
superstore$Category <- as.factor(superstore$Category)

superstore$Sub.Category <- as.factor(superstore$Sub.Category)

superstore$Region <- as.factor(superstore$Region)

superstore$Segment <- as.factor(superstore$Segment)

superstore$Product.Name<-as.factor(superstore$Product.Name)

ANSWER TO QUESTIONS

Question 1: How do sales and profit vary yearly?

yearly_sales <- superstore %>%
  mutate(Year = year(Order.Date)) %>%
  group_by(Year) %>%
  summarise(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Total_Orders = n()
  ) %>%
  arrange(Year)

head(yearly_sales)

## # A tibble: 4 × 4
##    Year Total_Sales Total_Profit Total_Orders
##   <dbl>       <dbl>        <dbl>        <int>
## 1  2014     484247.       49544.         1993
## 2  2015     470533.       61619.         2102
## 3  2016     609206.       81795.         2587
## 4  2017     733215.       93439.         3312

visualizing Total Sales by Year

ggplot(yearly_sales, aes(x = factor(Year), y = Total_Sales, fill = factor(Year))) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Total_Sales, 0)), vjust = -0.3, size = 3.5) +
  labs(
    title = "Total Sales by Year",
    x = "Year",
    y = "Total Sales"
  ) +
  theme_minimal()

This showing that that 2017 has the highest sales.

Question 2: are sales increasing or decreasing over time

ggplot(yearly_sales, aes(x = Year, y = Total_Sales)) +
geom_line(color = "red", size = 1) +
geom_point(color="black",size=3)+  
labs(
title = "Yearly Sales Trend Over Time",
x = "Order Date",
y = "Total Sales"
) +
theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Over time the sales sky rockted when it got to 2015 and went higher

Question 3:Which category contributes most to overall sales and profit, and how do they differ in performance?

# Sales and Profit by Product Category

category_summary <- superstore %>%
  group_by(Category) %>%
  summarise(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Total_Orders = n()
  ) %>%
  arrange(desc(Total_Sales))

head(category_summary)

## # A tibble: 3 × 4
##   Category        Total_Sales Total_Profit Total_Orders
##   <fct>                 <dbl>        <dbl>        <int>
## 1 Technology          836154.      145455.         1847
## 2 Furniture           742000.       18451.         2121
## 3 Office Supplies     719047.      122491.         6026

Visualize Sales by Category

ggplot(category_summary, aes(x = reorder(Category, -Total_Sales), y = Total_Sales, fill = Category)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Total_Sales, 0)), vjust = -0.3, size = 3) +
  labs(
    title = "Total Sales by Product Category",
    x = "Product Category",
    y = "Total Sales"
  ) +
  theme_minimal()

this shows that technology was the most bought product and the total sale of across was in million

Visualize Profit by Category

ggplot(category_summary, aes(x = reorder(Category, -Total_Profit), y = Total_Profit, fill = Category)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Total_Profit, 0)), vjust = -0.3, size = 3.5) +
  labs(
    title = "Total Profit by Product Category",
    x = "Product Category",
    y = "Total Profit"
  ) +
  theme_minimal()

This shows that Technology generated the most profit simply because it had the highest sales volume. Since these are the most expensive items in the dataset, they naturally account for the largest portion of the total profit for the business.

Question 4:Which products do customers buy the most?

product_sales <- superstore %>%
group_by(Product.Name) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
  
arrange(desc(Total_Quantity)) %>%
head(10)  # Top 10 products

ggplot(product_sales, aes(x = reorder(Product.Name, Total_Quantity), y = Total_Quantity, fill = Product.Name)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Top 10 Most Purchased Products",
x = "Product Name",
y = "Total Quantity Sold") +
theme_minimal()

The top 10 list is dominated by Office Supplies because these are low-cost, essential items that customers buy frequently. While these products lead in quantity, Technology still generates more total sales revenue because the price per item is much higher.

Question :Which specific Sub-Categories are losing money?

subcat_profit <- superstore %>%
  group_by(Sub.Category) %>%
  summarise(Total_Profit = sum(Profit, na.rm = TRUE)) %>%
  arrange(desc(Total_Profit))

visualizing

ggplot(subcat_profit,
       aes(reorder(Sub.Category, Total_Profit),
           Total_Profit,
           fill = Total_Profit > 0)) +
  geom_col() +
  coord_flip() +
  labs(title = "Profitability by Sub-Category",
       x = "Sub-Category", y = "Total Profit") +
  scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "darkgreen"),
                    guide = "none") +
  theme_minimal()

The losses recorded by Tables, Bookcases, and Supplies can be attributed to heavy discounting and low profit margins, as observed in the dataset. Additionally, the bulky nature of furniture items suggests higher handling and shipping costs, which further reduces profitability.

Question 5:Which customer segment places the highest number of orders?

segment_order_counts <- superstore %>%
  group_by(Segment) %>%
  summarise(Total_Order_Lines = n(), 
            Unique_Orders = n_distinct(Order.ID)) %>%
  arrange(desc(Total_Order_Lines))

segment_category_counts <- superstore %>%
  group_by(Segment, Category) %>%
  summarise(Items_Ordered = n(), .groups = "drop") %>%
  arrange(Segment, desc(Items_Ordered))

Visualizing

ggplot(segment_category_counts, aes(x = Segment, y = Items_Ordered, fill = Category)) +
  # Fixed: Use position_dodge to separate the bars
  geom_col(position = position_dodge(width = 0.9)) + 
  # Fixed: Added position_dodge to geom_text so labels align with the bars
  geom_text(aes(label = Items_Ordered), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, 
            size = 3.5) +
  labs(
    title = "Order Volume by Segment and Category",
    subtitle = "Office Supplies is the most frequently ordered category across all segments",
    x = "Customer Segment",
    y = "Number of Items Ordered"
  ) +
  theme_minimal()

The Consumer segment places the most orders because individuals make frequent, smaller purchases for personal or home use. Office Supplies top across all segments since they are essential, affordable, and regularly needed. Furniture and Technology come later because they are expensive and bought less often.

Question 6:Which region contributes the most to total sales, and what percentage of total revenue does it represent?

# Summarize total sales per region and calculate percentage
region_sales <- superstore %>%
  group_by(Region) %>%
  summarise(Total_Sales = sum(Sales, na.rm = TRUE)) %>%
  mutate(Percentage = round((Total_Sales / sum(Total_Sales)) * 100, 1))

# View the summarized data
region_sales

## # A tibble: 4 × 3
##   Region  Total_Sales Percentage
##   <fct>         <dbl>      <dbl>
## 1 Central     501240.       21.8
## 2 East        678781.       29.5
## 3 South       391722.       17.1
## 4 West        725458.       31.6

visualizing of region sales

# Pie chart with percentage labels
ggplot(region_sales, aes(x = "", y = Total_Sales, fill = Region)) +
  geom_col(width = 1, color = "white") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), color = "white", size = 5) +
  labs(title = "Percentage of sales  by Region", fill = "Region") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The West region contributes the highest share of total sales because it records a high volume of customer orders, strong demand across all product categories, efficient shipping and logistics, and relatively higher average sales per order compared to other regions.

Question 7: Which states contribute the most to total sales, and what percentage of total revenue do they represent?

# Summarize total sales per state and calculate percentage
state_sales <- superstore %>%
  group_by(State) %>%
  summarise(Total_Sales = sum(Sales, na.rm = TRUE)) %>%
  mutate(Percentage = round((Total_Sales / sum(Total_Sales)) * 100, 2)) %>%
  arrange(desc(Total_Sales))

# View top states
head(state_sales, 10)

## # A tibble: 10 × 3
##    State        Total_Sales Percentage
##    <chr>              <dbl>      <dbl>
##  1 California       457688.      19.9 
##  2 New York         310876.      13.5 
##  3 Texas            170188.       7.41
##  4 Washington       138641.       6.04
##  5 Pennsylvania     116512.       5.07
##  6 Florida           89474.       3.89
##  7 Illinois          80166.       3.49
##  8 Ohio              78258.       3.41
##  9 Michigan          76270.       3.32
## 10 Virginia          70637.       3.07

# Bar chart for top 10 states
top_states <- state_sales %>% slice_max(Total_Sales, n = 10)

ggplot(top_states, aes(x = reorder(State, Total_Sales), 
                        y = Total_Sales, fill = State)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Top 10 States by Total Sales",
    x = "State",
    y = "Total Sales"
  ) +
  theme_minimal()

California emerges as the top-performing state due to its large customer base, the highest number of orders, strong demand for high-value products such as technology and office supplies, and a high concentration of business and commercial activities.

Question 8:How do discount rates differ across product categories in the Superstore dataset?

ggplot(superstore, aes(x = Category, y = Discount, fill = Category)) +
  geom_boxplot() +
  labs(title = "Distribution of Discounts by Category",
       x = "Category",
       y = "Discount Rate") +
  theme_minimal()

This boxplot suggests that their discounting strategy might change depending on the product category. Office Supplies appear the most unpredictable; while most customers pay full price (median of 0%), they might occasionally offer very large discounts of up to 80% on items like binders, possibly to clear excess stock. Furniture generally shows a consistent 20% discount, which they might use to encourage sales of heavier, slower-moving inventory. Technology also tends to have around a 20% discount, but they might avoid the extreme 80% cuts seen in office supplies, likely because deep discounts on high-cost electronics could be too risky for the business.

Question 9: Does shipping mode influence sales performance?

ship_category_orders <- superstore %>%
  group_by(Ship.Mode, Category) %>%
  summarise(Total_Orders = n())

## `summarise()` has grouped output by 'Ship.Mode'. You can override using the
## `.groups` argument.

ggplot(ship_category_orders, 
       aes(x = Category, y = Total_Orders, fill = Ship.Mode)) +
  geom_col(position = "dodge") +
  labs(
    title = "Number of Orders by Shipping Mode and Product Category",
    x = "Product Category",
    y = "Number of Orders",
    fill = "Shipping Mode"
  ) +
  theme_minimal()

Shipping mode seems to influence sales performance mainly through customer preference and product type. Standard Class records the highest orders across all categories, which suggests customers might prefer it because it offers a reasonable balance between cost and delivery time. Faster options like First Class and Same Day are used less, likely because of their higher shipping costs

Recommendations

Based on the results of the analysis:

The Superstore should focus more on products and categories that consistently generate profit.

Products that regularly cause losses should be reviewed, improved, or possibly removed.

Discount policies should be carefully managed so that sales growth does not lead to profit loss.

More attention should be given to regions that perform well, while strategies should be developed to improve weaker regions.

Business decisions should focus not only on increasing sales volume but also on increasing profitability.

Conclusion

This project analyzed the Superstore dataset(U.S.A) to gain insights into sales performance, profitability, customer purchasing behavior, and operational efficiency. The study revealed that while the Superstore achieves strong sales, profitability is influenced by factors such as product category, region, discount levels, and shipping modes. By applying data analytics techniques and visualizations, the analysis provided actionable insights that can help management make informed business decisions, improve profitability, and enhance overall performance.