introduction:The Superstore dataset is a real-world retail sales dataset containing information on customer orders, including products, sales, profit, discounts, shipping methods, and customer segments across different regions.
source:Kaggle
The aim of this project is to analyze the Superstore dataset to understand sales performance, profitability, and customer purchasing patterns.
1.How do sales and profit vary and yearly?
2.Are sales increasing or decreasing over time?
3.Which category contributes most to overall sales and profit, and how do they differ in performance?
4.which products do customers buy the most and Which specific Sub-Categories are losing money?
5.Which customer segment places the highest number of orders?
6.Which region contributes the most to total sales, and what percentage of total revenue does it represent?
7.Which states contribute the most to total sales, and what percentage of total revenue do they represent?
8.How do discount rates differ across product categories in the Superstore dataset?
9.Does shipping mode influence sales performance?
library(ggplot2) #For visualization
library(dplyr) #for data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate) #for date
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr) #for cleaning data
superstore<-read.csv("C:/Users/Hello/Downloads/Sample_Superstore.csv")
View(superstore)
# viewing the first six of my data
head(superstore)
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520
## 2 2 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520
## 3 3 CA-2016-138688 6/12/2016 6/16/2016 Second Class DV-13045
## 4 4 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335
## 5 5 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335
## 6 6 CA-2014-115812 6/9/2014 6/14/2014 Standard Class BH-11710
## Customer.Name Segment Country City State
## 1 Claire Gute Consumer United States Henderson Kentucky
## 2 Claire Gute Consumer United States Henderson Kentucky
## 3 Darrin Van Huff Corporate United States Los Angeles California
## 4 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 5 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 6 Brosina Hoffman Consumer United States Los Angeles California
## Postal.Code Region Product.ID Category Sub.Category
## 1 42420 South FUR-BO-10001798 Furniture Bookcases
## 2 42420 South FUR-CH-10000454 Furniture Chairs
## 3 90036 West OFF-LA-10000240 Office Supplies Labels
## 4 33311 South FUR-TA-10000577 Furniture Tables
## 5 33311 South OFF-ST-10000760 Office Supplies Storage
## 6 90032 West FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
I Imported my data using read.csv
This process is know as cleaning of the data check for the structure ,summary ,missing value ,coverting data type,handling outliers and so on…….
#str(superstore)
glimpse(superstore) # also like structure
## Rows: 9,994
## Columns: 21
## $ Row.ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Order.ID <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U…
## $ Order.Date <chr> "11/8/2016", "11/8/2016", "6/12/2016", "10/11/2015", "10…
## $ Ship.Date <chr> "11/11/2016", "11/11/2016", "6/16/2016", "10/18/2015", "…
## $ Ship.Mode <chr> "Second Class", "Second Class", "Second Class", "Standar…
## $ Customer.ID <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033…
## $ Customer.Name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O…
## $ Segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum…
## $ Country <chr> "United States", "United States", "United States", "Unit…
## $ City <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal…
## $ State <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid…
## $ Postal.Code <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, …
## $ Region <chr> "South", "South", "West", "South", "South", "West", "Wes…
## $ Product.ID <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",…
## $ Category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"…
## $ Sub.Category <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F…
## $ Product.Name <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric …
## $ Sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,…
## $ Quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,…
## $ Discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.…
## $ Profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1…
summary(superstore)
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Length:9994 Length:9994
## 1st Qu.:2499 Class :character Class :character Class :character
## Median :4998 Mode :character Mode :character Mode :character
## Mean :4998
## 3rd Qu.:7496
## Max. :9994
## Ship.Mode Customer.ID Customer.Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Country City State Postal.Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56431
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
#superstore <- distinct(superstore)
dim(superstore) #checking for the dimension
## [1] 9994 21
colSums(is.na(superstore))
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode
## 0 0 0 0 0
## Customer.ID Customer.Name Segment Country City
## 0 0 0 0 0
## State Postal.Code Region Product.ID Category
## 0 0 0 0 0
## Sub.Category Product.Name Sales Quantity Discount
## 0 0 0 0 0
## Profit
## 0
Changing date to date format and converting some categorical variables to factor so that R will not the treat the variables has a character string but rather has a factor ,when a variable is a factor you can control the other at which it appears on your plots (e.g from lowest to highest).It also makes grouping easy.
superstore$Order.Date<-as.Date(superstore$Order.Date,format = "%m/%d/%Y")
superstore$Ship.Date<-as.Date(superstore$Ship.Date,format="%m/%d/%Y")
#Arranging date in ascending order
superstore<-superstore%>%
arrange(Order.Date)
#viewing order
head(superstore$Order.Date)
## [1] "2014-01-03" "2014-01-04" "2014-01-04" "2014-01-04" "2014-01-05"
## [6] "2014-01-06"
Once I got the dates sorted, I noticed the same date appearing a lot. I double-checked the rows and realized they aren’t actually duplicates. it’s just that some orders have multiple items listed separately, or different people happened to buy things on the same day.
Repeated dates mean multiple sales occurred on the same day, not an error.
#names(superstore)
superstore$Category <- as.factor(superstore$Category)
superstore$Sub.Category <- as.factor(superstore$Sub.Category)
superstore$Region <- as.factor(superstore$Region)
superstore$Segment <- as.factor(superstore$Segment)
superstore$Product.Name<-as.factor(superstore$Product.Name)
Question 1: How do sales and profit vary yearly?
yearly_sales <- superstore %>%
mutate(Year = year(Order.Date)) %>%
group_by(Year) %>%
summarise(
Total_Sales = sum(Sales, na.rm = TRUE),
Total_Profit = sum(Profit, na.rm = TRUE),
Total_Orders = n()
) %>%
arrange(Year)
head(yearly_sales)
## # A tibble: 4 × 4
## Year Total_Sales Total_Profit Total_Orders
## <dbl> <dbl> <dbl> <int>
## 1 2014 484247. 49544. 1993
## 2 2015 470533. 61619. 2102
## 3 2016 609206. 81795. 2587
## 4 2017 733215. 93439. 3312
visualizing Total Sales by Year
ggplot(yearly_sales, aes(x = factor(Year), y = Total_Sales, fill = factor(Year))) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = round(Total_Sales, 0)), vjust = -0.3, size = 3.5) +
labs(
title = "Total Sales by Year",
x = "Year",
y = "Total Sales"
) +
theme_minimal()
This showing that that 2017 has the highest sales.
Question 2: are sales increasing or decreasing over time
ggplot(yearly_sales, aes(x = Year, y = Total_Sales)) +
geom_line(color = "red", size = 1) +
geom_point(color="black",size=3)+
labs(
title = "Yearly Sales Trend Over Time",
x = "Order Date",
y = "Total Sales"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Over time the sales sky rockted when it got to 2015 and went higher
Question 3:Which category contributes most to overall sales and profit, and how do they differ in performance?
# Sales and Profit by Product Category
category_summary <- superstore %>%
group_by(Category) %>%
summarise(
Total_Sales = sum(Sales, na.rm = TRUE),
Total_Profit = sum(Profit, na.rm = TRUE),
Total_Orders = n()
) %>%
arrange(desc(Total_Sales))
head(category_summary)
## # A tibble: 3 × 4
## Category Total_Sales Total_Profit Total_Orders
## <fct> <dbl> <dbl> <int>
## 1 Technology 836154. 145455. 1847
## 2 Furniture 742000. 18451. 2121
## 3 Office Supplies 719047. 122491. 6026
ggplot(category_summary, aes(x = reorder(Category, -Total_Sales), y = Total_Sales, fill = Category)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = round(Total_Sales, 0)), vjust = -0.3, size = 3) +
labs(
title = "Total Sales by Product Category",
x = "Product Category",
y = "Total Sales"
) +
theme_minimal()
this shows that technology was the most bought product and the total sale of across was in million
ggplot(category_summary, aes(x = reorder(Category, -Total_Profit), y = Total_Profit, fill = Category)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = round(Total_Profit, 0)), vjust = -0.3, size = 3.5) +
labs(
title = "Total Profit by Product Category",
x = "Product Category",
y = "Total Profit"
) +
theme_minimal()
This shows that Technology generated the most profit simply because it had the highest sales volume. Since these are the most expensive items in the dataset, they naturally account for the largest portion of the total profit for the business.
Question 4:Which products do customers buy the most?
product_sales <- superstore %>%
group_by(Product.Name) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
arrange(desc(Total_Quantity)) %>%
head(10) # Top 10 products
ggplot(product_sales, aes(x = reorder(Product.Name, Total_Quantity), y = Total_Quantity, fill = Product.Name)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Top 10 Most Purchased Products",
x = "Product Name",
y = "Total Quantity Sold") +
theme_minimal()
The top 10 list is dominated by Office Supplies because these are low-cost, essential items that customers buy frequently. While these products lead in quantity, Technology still generates more total sales revenue because the price per item is much higher.
Question :Which specific Sub-Categories are losing money?
subcat_profit <- superstore %>%
group_by(Sub.Category) %>%
summarise(Total_Profit = sum(Profit, na.rm = TRUE)) %>%
arrange(desc(Total_Profit))
ggplot(subcat_profit,
aes(reorder(Sub.Category, Total_Profit),
Total_Profit,
fill = Total_Profit > 0)) +
geom_col() +
coord_flip() +
labs(title = "Profitability by Sub-Category",
x = "Sub-Category", y = "Total Profit") +
scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "darkgreen"),
guide = "none") +
theme_minimal()
The losses recorded by Tables, Bookcases, and Supplies can be attributed to heavy discounting and low profit margins, as observed in the dataset. Additionally, the bulky nature of furniture items suggests higher handling and shipping costs, which further reduces profitability.
Question 5:Which customer segment places the highest number of orders?
segment_order_counts <- superstore %>%
group_by(Segment) %>%
summarise(Total_Order_Lines = n(),
Unique_Orders = n_distinct(Order.ID)) %>%
arrange(desc(Total_Order_Lines))
segment_category_counts <- superstore %>%
group_by(Segment, Category) %>%
summarise(Items_Ordered = n(), .groups = "drop") %>%
arrange(Segment, desc(Items_Ordered))
ggplot(segment_category_counts, aes(x = Segment, y = Items_Ordered, fill = Category)) +
# Fixed: Use position_dodge to separate the bars
geom_col(position = position_dodge(width = 0.9)) +
# Fixed: Added position_dodge to geom_text so labels align with the bars
geom_text(aes(label = Items_Ordered),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3.5) +
labs(
title = "Order Volume by Segment and Category",
subtitle = "Office Supplies is the most frequently ordered category across all segments",
x = "Customer Segment",
y = "Number of Items Ordered"
) +
theme_minimal()
The Consumer segment places the most orders because individuals make frequent, smaller purchases for personal or home use. Office Supplies top across all segments since they are essential, affordable, and regularly needed. Furniture and Technology come later because they are expensive and bought less often.
Question 6:Which region contributes the most to total sales, and what percentage of total revenue does it represent?
# Summarize total sales per region and calculate percentage
region_sales <- superstore %>%
group_by(Region) %>%
summarise(Total_Sales = sum(Sales, na.rm = TRUE)) %>%
mutate(Percentage = round((Total_Sales / sum(Total_Sales)) * 100, 1))
# View the summarized data
region_sales
## # A tibble: 4 × 3
## Region Total_Sales Percentage
## <fct> <dbl> <dbl>
## 1 Central 501240. 21.8
## 2 East 678781. 29.5
## 3 South 391722. 17.1
## 4 West 725458. 31.6
visualizing of region sales
# Pie chart with percentage labels
ggplot(region_sales, aes(x = "", y = Total_Sales, fill = Region)) +
geom_col(width = 1, color = "white") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(Percentage, "%")),
position = position_stack(vjust = 0.5), color = "white", size = 5) +
labs(title = "Percentage of sales by Region", fill = "Region") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The West region contributes the highest share of total sales because it
records a high volume of customer orders, strong demand across all
product categories, efficient shipping and logistics, and relatively
higher average sales per order compared to other regions.
Question 7: Which states contribute the most to total sales, and what percentage of total revenue do they represent?
# Summarize total sales per state and calculate percentage
state_sales <- superstore %>%
group_by(State) %>%
summarise(Total_Sales = sum(Sales, na.rm = TRUE)) %>%
mutate(Percentage = round((Total_Sales / sum(Total_Sales)) * 100, 2)) %>%
arrange(desc(Total_Sales))
# View top states
head(state_sales, 10)
## # A tibble: 10 × 3
## State Total_Sales Percentage
## <chr> <dbl> <dbl>
## 1 California 457688. 19.9
## 2 New York 310876. 13.5
## 3 Texas 170188. 7.41
## 4 Washington 138641. 6.04
## 5 Pennsylvania 116512. 5.07
## 6 Florida 89474. 3.89
## 7 Illinois 80166. 3.49
## 8 Ohio 78258. 3.41
## 9 Michigan 76270. 3.32
## 10 Virginia 70637. 3.07
# Bar chart for top 10 states
top_states <- state_sales %>% slice_max(Total_Sales, n = 10)
ggplot(top_states, aes(x = reorder(State, Total_Sales),
y = Total_Sales, fill = State)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(
title = "Top 10 States by Total Sales",
x = "State",
y = "Total Sales"
) +
theme_minimal()
California emerges as the top-performing state due to its large customer base, the highest number of orders, strong demand for high-value products such as technology and office supplies, and a high concentration of business and commercial activities.
Question 8:How do discount rates differ across product categories in the Superstore dataset?
ggplot(superstore, aes(x = Category, y = Discount, fill = Category)) +
geom_boxplot() +
labs(title = "Distribution of Discounts by Category",
x = "Category",
y = "Discount Rate") +
theme_minimal()
This boxplot suggests that their discounting strategy might change depending on the product category. Office Supplies appear the most unpredictable; while most customers pay full price (median of 0%), they might occasionally offer very large discounts of up to 80% on items like binders, possibly to clear excess stock. Furniture generally shows a consistent 20% discount, which they might use to encourage sales of heavier, slower-moving inventory. Technology also tends to have around a 20% discount, but they might avoid the extreme 80% cuts seen in office supplies, likely because deep discounts on high-cost electronics could be too risky for the business.
Question 9: Does shipping mode influence sales performance?
ship_category_orders <- superstore %>%
group_by(Ship.Mode, Category) %>%
summarise(Total_Orders = n())
## `summarise()` has grouped output by 'Ship.Mode'. You can override using the
## `.groups` argument.
ggplot(ship_category_orders,
aes(x = Category, y = Total_Orders, fill = Ship.Mode)) +
geom_col(position = "dodge") +
labs(
title = "Number of Orders by Shipping Mode and Product Category",
x = "Product Category",
y = "Number of Orders",
fill = "Shipping Mode"
) +
theme_minimal()
Shipping mode seems to influence sales performance mainly through customer preference and product type. Standard Class records the highest orders across all categories, which suggests customers might prefer it because it offers a reasonable balance between cost and delivery time. Faster options like First Class and Same Day are used less, likely because of their higher shipping costs
The Superstore should focus more on products and categories that consistently generate profit.
Products that regularly cause losses should be reviewed, improved, or possibly removed.
Discount policies should be carefully managed so that sales growth does not lead to profit loss.
More attention should be given to regions that perform well, while strategies should be developed to improve weaker regions.
Business decisions should focus not only on increasing sales volume but also on increasing profitability.
This project analyzed the Superstore dataset(U.S.A) to gain insights into sales performance, profitability, customer purchasing behavior, and operational efficiency. The study revealed that while the Superstore achieves strong sales, profitability is influenced by factors such as product category, region, discount levels, and shipping modes. By applying data analytics techniques and visualizations, the analysis provided actionable insights that can help management make informed business decisions, improve profitability, and enhance overall performance.