SuperStore Sales Report

Author

Happiness Ndanu

Problem Statement

This analysis aims to uncover trends in sales performance across product categories, regions, and customer segments.

Introduction

The aim of analyzing this dataset is to generate insights into sales patterns at Superstore. By looking into various variables such as sales region, product category, and customer segments, we will be in a position to pin-point areas of high sales and take note of areas of improvement.

Target Audience

This report is designed for Superstore management and sales teams who are seeking data-driven recommendations to make informed decision-making. Insights derived from the analysis will support strategies for increasing sales, optimizing shipping operations, and improving customer satisfaction.

Dataset Summary Table

Table 1.0: A summary of different variables in the Superstore dataset and the respective descriptions.
Variable	Description
Row ID	A unique identifier for each row in the dataset
Order ID	A unique identifier for each order
Order Date	Date the order was placed
Ship Date	Date the order was shipped
Ship Mode	Mode of shipping
Customer ID	Unique identifier for each customer
Customer Name	Full name of the customer
Segment	Customer segment (e.g., Consumer, Corporate, Home Office)
Country	Country where the order was placed
CIty	City where the customer is located
State	State where the customer is located
Postal Code	Postal code of the customer’s location
Region	Region where the order was placed (e.g., East, West)
Product ID	Unique identifier for each product
Category	Product category (e.g., Furniture, Technology)
Sub-Category	Sub-category of the product (e.g., Chairs, Phones)
Product Name	Full name of the product
Sales	Total sales made by each order

Having a clear picture of what the dataset contains, the next steps will be loading necessary libraries that we will use, data cleaning and then conducting Exploratory Data Analysis.

Libraries

To kick off, let; load the necessary libraries needed for this analysis.

Click to show/hide code

library(tidyverse)
library(readr)
library(janitor)
library(scales)  
library(ggpie)
library(ggpubr)
library(ggforce)
library(htmltools)
library(plotly)
library(knitr)
library(gganimate)
library(scales)
library(kableExtra)
 library(DT)
library(lubridate)

Next, import the data set needed for this analysis.

Click to show/hide code

sales <- read_csv("D:/projects/consumer_sales/train.csv") %>% 
  clean_names()

Data Cleaning

Before going deep into the analysis, let’s start by cleaning the data.

Let’s check the data types in the columns

Click to show/hide code

glimpse(sales)

Rows: 9,800
Columns: 18
$ row_id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ order_id      <chr> "CA-2017-152156", "CA-2017-152156", "CA-2017-138688", "U…
$ order_date    <chr> "08/11/2017", "08/11/2017", "12/06/2017", "11/10/2016", …
$ ship_date     <chr> "11/11/2017", "11/11/2017", "16/06/2017", "18/10/2016", …
$ ship_mode     <chr> "Second Class", "Second Class", "Second Class", "Standar…
$ customer_id   <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033…
$ customer_name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O…
$ segment       <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum…
$ country       <chr> "United States", "United States", "United States", "Unit…
$ city          <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal…
$ state         <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid…
$ postal_code   <dbl> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, …
$ region        <chr> "South", "South", "West", "South", "South", "West", "Wes…
$ product_id    <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",…
$ category      <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"…
$ sub_category  <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F…
$ product_name  <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric …
$ sales         <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,…

Based on the assessment, several columns are not in the right data type. ie. postal code, order date and ship date. To change this, I will convert the postal code to an integer, and the ship date as well as the order date to dates.

Click to show/hide code

sales$postal_code <- as.integer(sales$postal_code)
sales$order_date <- as.Date(sales$order_date, format = "%d/%m/%y")
sales$ship_date <- as.Date(sales$ship_date, format = "%d/%m/%y")

Next, let’s look out for missing values
Click to show/hide code
```
missing_values <- sum(anyNA(sales)) %>% print()
```
```
[1] 1
```
Next step is to eliminate the missing values since there is no significant caused by the one omission
Click to show/hide code
```
sales <- na.omit(sales)
```
Now that that’s done, lets look out for duplicates in the dataset
Click to show/hide code
```
duplicate <- sum(duplicated(sales)) %>% print()
```
```
[1] 0
```
Now that my data is all clean, let the EDA begin!

Exploratory Data Analysis

1.0 How much revenue did each state make?

Click to show/hide code

state_totalsale <- sales %>% 
  select(state, sales) %>% 
  group_by(state) %>% 
  summarise(total_sale=sum(sales)) %>%  arrange(desc(total_sale)) %>% print()

# A tibble: 48 × 2
   state        total_sale
   <chr>             <dbl>
 1 California      446306.
 2 New York        306361.
 3 Texas           168573.
 4 Washington      135207.
 5 Pennsylvania    116277.
 6 Florida          88437.
 7 Illinois         79237.
 8 Michigan         76136.
 9 Ohio             75130.
10 Virginia         70637.
# ℹ 38 more rows

Click to show/hide code

# Visualize the total sales by state
ggplot(state_totalsale, aes(x = reorder(state, -total_sale), y = total_sale)) +  
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Total Sales by State", 
       x = "State", 
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Click to show/hide code

#visualize top ten states with highest revenue
# Get the top 10 states by total sales
top_10_states <- state_totalsale %>%
  slice_max(total_sale, n = 10)

Click to show/hide code

# Visualize the top 10 states
ggplot(top_10_states, aes(x = reorder(state, -total_sale), y = total_sale)) +
  geom_bar(stat = "identity", fill=c("#B45D58","#B8647D","#A875A0","#848BB8","#529EBD","#27AEAD","#41B88F","#77BE6C","#AFBD50","#E6B64C")) +
  geom_text(aes(label = dollar(total_sale)), hjust = 0.5, color = "black", size = 3)+
  labs(title = "Top 10 States by Revenue Made", 
       x = "State", 
       y = "Total Sales") +
  scale_y_continuous(labels = scales::comma_format()) +  
  coord_flip() + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note

Observation: California had the highest sales whereas North Dakota came in last with the least number sales made.

2.0 How much revenue did each city make?

Click to show/hide code

city_totalsale <- sales %>% 
  select(city, sales) %>% 
  group_by(city) %>% 
  summarise(total_sale=sum(sales)) %>%  arrange(desc(total_sale)) %>% print()

# A tibble: 529 × 2
   city          total_sale
   <chr>              <dbl>
 1 New York City    252463.
 2 Los Angeles      173420.
 3 Seattle          116106.
 4 San Francisco    109041.
 5 Philadelphia     108842.
 6 Houston           63956.
 7 Chicago           47820.
 8 San Diego         47521.
 9 Jacksonville      44713.
10 Detroit           42447.
# ℹ 519 more rows

Click to show/hide code

#visualize top ten cities with highest revenue
# Get the top 10 cities by total sales
top_10_cities <- city_totalsale %>%
  slice_max(total_sale, n = 10)

# Visualize the top 10 cities
ggplot(top_10_cities, aes(x = reorder(city, total_sale), y = total_sale)) +
  geom_bar(stat = "identity", fill = c("#724352", "#78566F", "#746C8A", "#67839F", "#559BAA", "#4DB1AA", "#5FC5A0", "#86D68F", "#B8E47E", "#F0ED73")

) +
  geom_text(aes(label = dollar(total_sale)), hjust = 0.6, color = "black", size = 5)+
  labs(title = "Top 10 Cities by Revenue Made", 
       x = "Cities", 
       y = "Total Sales") +
  scale_y_continuous(labels = scales::comma_format()) + 
  coord_flip() + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 2))
# Get the bottom 10 cities by total sales
bottom_10_cities <- city_totalsale %>%
  slice_min(total_sale, n = 10)

# Visualize the bottom 10 cities
ggplot(bottom_10_cities, aes(x = reorder(city, total_sale), y = total_sale)) +
  geom_bar(stat = "identity", fill = c("#724352", "#78566F", "#746C8A", "#67839F", "#559BAA", "#4DB1AA", "#5FC5A0", "#86D68F", "#B8E47E", "#F0ED73")) +
  geom_text(aes(label = scales::dollar(total_sale)), hjust = 0.6, color = "black", size = 5) +
  labs(title = "Bottom 10 Cities by Revenue Made", 
       x = "Cities", 
       y = "Total Sales") +
  scale_y_continuous(labels = scales::comma_format()) + 
  coord_flip() + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 2))

Note

Observation: NewYork City made the highest sales whereas Abilene city had the least sales making a record of as low as $1.39. The margin between the two cities is quite high hence more investigations need to be done to figure out the reason behind this.

3.0 How much revenue did each region make?

Click to show/hide code

region_totalsale <- sales %>% 
  select(region, sales) %>% 
  group_by(region) %>% 
  summarise(total_sale=sum(sales)) %>%  arrange(desc(total_sale)) %>% print()

# A tibble: 4 × 2
  region  total_sale
  <chr>        <dbl>
1 West       710220.
2 East       660589.
3 Central    492647.
4 South      389151.

Click to show/hide code

ggplot(region_totalsale, aes(x = reorder(region, total_sale), y = total_sale)) +
  geom_bar(stat = "identity", fill = c("#2E6167", "#3F8569", "#7FA25A", "#D5B657")
) +
  geom_text(aes(label = dollar(total_sale)), hjust = 0.5, color = "black", size = 3)+
  labs(title = "Total revenue made by different regions", 
       x = "Regions", 
       y = "Total Sales") +
  scale_y_continuous(labels = scales::comma_format()) +  
  coord_flip() + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note

Observation: The West and East regions of the United States recorded highest sales. The difference in the sales margin between the two regions is small compared to the other two regions(Central and South regions)

4.0 Do different shipping modes contribute to the sales made in the various regions?

Click to show/hide code

regionshipping_sales <- sales %>% 
  select(region,ship_mode, sales) %>% 
  group_by(region, ship_mode) %>% 
  summarise(total_sale=sum(sales))%>% print()

# A tibble: 16 × 3
# Groups:   region [4]
   region  ship_mode      total_sale
   <chr>   <chr>               <dbl>
 1 Central First Class        58057.
 2 Central Same Day           20106.
 3 Central Second Class       97606.
 4 Central Standard Class    316878.
 5 East    First Class       110437.
 6 East    Same Day           43316.
 7 East    Second Class      114393.
 8 East    Standard Class    392443.
 9 South   First Class        49046.
10 South   Same Day           21017.
11 South   Second Class       93435.
12 South   Standard Class    225654.
13 West    First Class       128032.
14 West    Same Day           40779.
15 West    Second Class      143766.
16 West    Standard Class    397642.

Click to show/hide code

# Reorder ship_mode by total_sale within each region
regionshipping_sales <- regionshipping_sales %>%
  group_by(region) %>%
  mutate(ship_mode = fct_reorder(ship_mode, total_sale)) %>%
  ungroup()

# Next, visualize the plot
ggplot(regionshipping_sales, aes(x = region, y = total_sale, fill = ship_mode)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = scales::dollar(total_sale)), 
            position = position_dodge(width = 1.0), 
            vjust = -0.5,  
            color = "black", 
            size = 2) +   
  labs(title = "Revenue Made by Shipping Modes in Various Regions", 
       x = "Region", 
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal()

Note

Observation: The Standard Class mode of shipping recorded the highest number of sales across all regions whereas the Same Day shipping mode was least preferred across all regions

5.0 How much revenue did each shipping mode make?

Click to show/hide code

shippingmode_totalsale <- sales %>% 
  select(ship_mode, sales) %>% 
  group_by(ship_mode) %>% 
  summarise(total_sale=sum(sales)) %>%  arrange(desc(total_sale)) %>% print()

# A tibble: 4 × 2
  ship_mode      total_sale
  <chr>               <dbl>
1 Standard Class   1332617.
2 Second Class      449199.
3 First Class       345572.
4 Same Day          125219.

6.0 How much revenue did each postal code make?

Click to show/hide code

postal_totalsale <- sales %>%
  select(postal_code, sales) %>%
  group_by(postal_code) %>% 
  summarise(total_sale=sum(sales)) %>% 
  arrange(desc(total_sale)) 

 datatable(postal_totalsale, 
           options = list(pageLength = 10, 
                          lengthMenu = c(10, 25, 50)), 
           colnames = c("Postal Code", "Total Sales in $"))

7.0 How much revenue did each category make?

Click to show/hide code

category_totalsale <- sales %>% 
  select(category, sales) %>% 
  group_by(category) %>% 
  summarise(total_sale=sum(sales)) %>%  arrange(desc(total_sale)) %>% print()

# A tibble: 3 × 2
  category        total_sale
  <chr>                <dbl>
1 Technology         825856.
2 Furniture          723538.
3 Office Supplies    703213.

Click to show/hide code

# Create a pie chart directly with plotly
interactive_pie <- plot_ly(category_totalsale, labels = ~category, values = ~total_sale, type = 'pie',
                           textinfo = 'label+percent', hoverinfo = 'label+value',
                           textposition = 'inside',
                           marker = list(colors = c("#00AFBB", "#E7B800", "#FC4E07"), 
                                         line = list(color = 'white', width = 2))) %>%
  layout(title = "Total Sales Revenue by Category",
         showlegend = TRUE)
interactive_pie

Note

Observation: The Technology product category made the highest sales followed closely by Office Supplies and Furniture categories, respectively.

8.0 How much revenue did each segment make?

Click to show/hide code

segment_totalsale <- sales %>% 
  select(segment, sales) %>% 
  group_by(segment) %>% 
  summarise(total_sale=sum(sales)) %>%  arrange(desc(total_sale)) 
#print table
kable(segment_totalsale, format = "markdown", col.names = c("Segment", "Total Sales in $"))

Segment	Total Sales in $
Consumer	1146708.2
Corporate	682211.8
Home Office	423687.4

Note

Observation: The consumer segment recorded the highest sales whereas the Home office segment had the least sales

9.0 How much revenue did each segment make in various product categories?

Click to show/hide code

segcate <- sales %>% 
  select(segment, category, sales) %>% 
  group_by(segment, category) %>%
  summarise(total_sales = sum(sales)) %>%
  ungroup() %>%
  print()

# A tibble: 9 × 3
  segment     category        total_sales
  <chr>       <chr>                 <dbl>
1 Consumer    Furniture           386981.
2 Consumer    Office Supplies     358715.
3 Consumer    Technology          401012.
4 Corporate   Furniture           215917.
5 Corporate   Office Supplies     222558.
6 Corporate   Technology          243737.
7 Home Office Furniture           120641.
8 Home Office Office Supplies     121939.
9 Home Office Technology          181108.

Click to show/hide code

# Lets reorder product categories by total_sales within each segment
categorysegment_sales <- segcate %>%
  group_by(category) %>%
  mutate(segment = fct_reorder(segment, total_sales)) %>%
  ungroup()

# Visualization
ggplot(categorysegment_sales, aes(x = category, y = total_sales, fill = segment)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = scales::dollar(total_sales)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5,  
            color = "black", 
            size = 3) +   
  labs(title = "Revenue Made by Product Categories under various Segments", 
       x = "Product Category", 
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal()

Note

Observation: According to the findings, it is safe to say that individual buyers (consumers) contribute to the highest sales in all product categories whereas Home Office (small businesses or independent professionals operating out of a home-based office), contribute to the least sales

10.0 How much revenue did each product sub-categories make?

Click to show/hide code

subcategory_totalsale <- sales %>% 
  select(sub_category, sales) %>% 
  group_by(sub_category) %>% 
  summarise(total_sale=sum(sales)) %>%  arrange(desc(total_sale)) %>% print()

# A tibble: 17 × 2
   sub_category total_sale
   <chr>             <dbl>
 1 Phones          326488.
 2 Chairs          322108.
 3 Storage         217779.
 4 Tables          202811.
 5 Binders         200029.
 6 Machines        189239.
 7 Accessories     163882.
 8 Copiers         146248.
 9 Bookcases       109408.
10 Appliances      104075.
11 Furnishings      89212.
12 Paper            76736.
13 Supplies         46420.
14 Art              26697.
15 Envelopes        16126.
16 Labels           12348.
17 Fasteners         3002.

Click to show/hide code

ggplot(subcategory_totalsale, aes(x = reorder(sub_category, total_sale), y = total_sale)) +
  geom_bar(stat = "identity", fill = c("#2E7D32", "#388E3C", "#4CAF50", "#81C784", "#C8E6C9", 
                                       "#FFB74D", "#FF9800", "#F57C00", "#FF5722", "#E57373", 
                                       "#EF5350", "#D32F2F", "#BA68C8", "#9C27B0", "#7B1FA2", 
                                       "#9575CD", "#BBDEFB")
  )+
  geom_text(aes(label = dollar(total_sale)), hjust = 0.5, color = "black", size = 3)+
  labs(title = "Total revenue made by different product sub-categories", 
       x = "Product Sub-Categories", 
       y = "Total Sales") +
  scale_y_continuous(labels = scales::comma_format()) + 
  coord_flip() + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note

Observation: Phones and chairs had the highest sales whereas fasteners made the least sales

11.0 How much revenue did each segment make via various product sub-categories? (Top 5 subcategories)

Click to show/hide code

# Summarize sales by segment and sub_category
segsubcate <- sales %>% 
  select(segment, sub_category, sales) %>% 
  group_by(segment, sub_category) %>%
  summarise(total_sales = sum(sales), .groups = 'drop') 

# Get the top 5 subcategories by total sales for each segment
top_5_subcate_by_segment <- segsubcate %>%
  group_by(segment) %>%  # Group by segment
  slice_max(total_sales, n = 5, with_ties = FALSE) %>%  #This will help us get the 5 subcategories for each segment
  ungroup() %>% 
  print()

# A tibble: 15 × 3
   segment     sub_category total_sales
   <chr>       <chr>              <dbl>
 1 Consumer    Chairs           170459.
 2 Consumer    Phones           169231.
 3 Consumer    Binders          117092.
 4 Consumer    Tables            99848.
 5 Consumer    Storage           98783.
 6 Corporate   Chairs            95203.
 7 Corporate   Phones            90343.
 8 Corporate   Storage           76411.
 9 Corporate   Tables            67133.
10 Corporate   Machines          60277.
11 Home Office Phones            66914.
12 Home Office Chairs            56445.
13 Home Office Machines          49419.
14 Home Office Storage           42585.
15 Home Office Tables            35829.

Click to show/hide code

# Next, let's the top 5 subcategories by total sales for each segment
ggplot(top_5_subcate_by_segment, aes(x = reorder(segment, total_sales), y = total_sales, fill = sub_category)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = scales::dollar(total_sales)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5,  
            color = "black", 
            size = 3) +   
  labs(title = "Top 5 Product Sub-Categories by Total Sales in Each Segment", 
       x = "Product Segment", 
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Click to show/hide code

# Lets look into the bottom subcategories by total sales for each segment
bottom_subcate_by_segment <- segsubcate %>%
  group_by(segment) %>%  
  slice_min(total_sales, n = 5, with_ties = FALSE) %>%  
  ungroup() %>%  
  print()

# A tibble: 15 × 3
   segment     sub_category total_sales
   <chr>       <chr>              <dbl>
 1 Consumer    Fasteners          1675.
 2 Consumer    Labels             6611.
 3 Consumer    Envelopes          7630.
 4 Consumer    Art               14062.
 5 Consumer    Supplies          25531.
 6 Corporate   Fasteners           783.
 7 Corporate   Labels             4062.
 8 Corporate   Envelopes          5798.
 9 Corporate   Art                8518.
10 Corporate   Supplies          19400.
11 Home Office Fasteners           544.
12 Home Office Supplies           1489.
13 Home Office Labels             1675.
14 Home Office Envelopes          2699.
15 Home Office Art                4117.

Click to show/hide code

# Lets plot the bottom subcategories by total sales for each segment
ggplot(bottom_subcate_by_segment, aes(x = reorder(segment, total_sales), y = total_sales, fill = sub_category)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = scales::dollar(total_sales)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5,  
            color = "black", 
            size = 3) +   
  labs(title = "Product Sub-Categories with the least Total Sales in Each Segment", 
       x = "Product Segment", 
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note

Observation: According to this:-

Phones made the highest sales whereas Supplies had the least sales in the Home Office segment
Chairs made the highest sales whereas Fasteners had the least sales in both the Corporate and Consumer segments

12.0 Which phones contributed to the highest sales?

Click to show/hide code

# Let's calculate total sales for each phone and get the top 10 phones with the highest sales
phone_totalsale <- sales %>% 
  select(sub_category, product_name, sales) %>% 
  filter(sub_category == "Phones") %>%
  group_by(product_name) %>% 
  summarise(total_sale = sum(sales)) %>%  
  slice_max(total_sale, n = 10) %>% 
  arrange(desc(total_sale)) 
#print table
kable(phone_totalsale, format = "markdown", col.names = c("Type of Phone", "Total Sales in $"))

Type of Phone	Total Sales in $
Samsung Galaxy Mega 6.3	13943.668
Apple iPhone 5	12996.600
Wilson Electronics DB Pro Signal Booster	8878.400
Mitel MiVoice 5330e IP Phone	7699.720
Samsung Galaxy S III - 16GB - pebble blue (T-Mobile)	7139.796
Cisco Unified IP Phone 7945G VoIP phone	6751.602
ClearOne CHATAttach 160 - speaker phone	6695.892
Polycom VVX 310 VoIP phone	6227.654
Samsung Galaxy S4	6134.702
Samsung Galaxy S4 Mini	6109.870

Note

Observation: Samsung phones seem to appear more often on the top ten best selling phones hence contributing to the highest sales

13.0 Is there anything in common with the chairs that made the highest sales?

Let’s calculate look into the chairs sub category since it is a big contributor to the high sales

Click to show/hide code

chair_totalsale <- sales %>% 
  select(sub_category, product_name, sales) %>% 
  filter(sub_category == "Chairs") %>%
  group_by(product_name) %>% 
  summarise(total_sale = sum(sales)) %>%  
  slice_max(total_sale, n = 10) %>% 
  arrange(desc(total_sale)) 
#print table
kable(chair_totalsale, format = "markdown", col.names = c("Type of Chair", "Total Sales in $"))

Type of Chair	Total Sales in $
HON 5400 Series Task Chairs for Big and Tall	21870.576
Global Troy Executive Leather Low-Back Tilter	12975.382
SAFCO Arco Folding Chair	11572.780
Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back	10637.528
GuestStacker Chair with Chrome Finish Legs	9070.944
Office Star - Professional Matrix Back Chair with 2-to-1 Synchro Tilt and Mesh Fabric Seat	8774.500
Global Deluxe High-Back Manager’s Chair	8665.194
Hon 4070 Series Pagoda Armless Upholstered Stacking Chairs	8430.997
Global Commerce Series High-Back Swivel/Tilt Chairs	7979.440
Hon 4700 Series Mobuis Mid-Back Task Chairs with Adjustable Arms	7582.374

Note

Observation: The chairs listed with the highest sales do share common themes of ergonomic design, comfort, versatility, aesthetic appeal, durability, functionality, and adaptability to meet the needs of various consumers and work environments. These characteristics make them suitable choices for modern office settings where both style and practicality are valued.

14.0 months vs sales

Click to show/hide code

monthly_sales <- sales %>%
  mutate(month_name = month(ymd(order_date), label = TRUE, abbr = TRUE),#this will help us get the month the order was placed
         year = year(ymd(order_date))) %>%                              
  group_by(year, month_name) %>%
  summarise(total_sales = sum(sales)) %>%
  arrange(month_name)

print(monthly_sales)

# A tibble: 12 × 3
# Groups:   year [1]
    year month_name total_sales
   <dbl> <ord>            <dbl>
 1  2020 Jan             91982.
 2  2020 Feb             59371.
 3  2020 Mar            197574.
 4  2020 Apr            134988.
 5  2020 May            154087.
 6  2020 Jun            145838.
 7  2020 Jul            145536.
 8  2020 Aug            157316.
 9  2020 Sep            300103.
10  2020 Oct            199496.
11  2020 Nov            345042.
12  2020 Dec            321275.

Click to show/hide code

monthly_sales_aggregated <- monthly_sales %>%
  group_by(month_name) %>%
  summarise(total_sales = sum(total_sales))

# Display the result
print(monthly_sales_aggregated)

# A tibble: 12 × 2
   month_name total_sales
   <ord>            <dbl>
 1 Jan             91982.
 2 Feb             59371.
 3 Mar            197574.
 4 Apr            134988.
 5 May            154087.
 6 Jun            145838.
 7 Jul            145536.
 8 Aug            157316.
 9 Sep            300103.
10 Oct            199496.
11 Nov            345042.
12 Dec            321275.

Click to show/hide code

monthly_sales_aggregated$month_name <- factor(monthly_sales_aggregated$month_name, 
levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),ordered = TRUE)

#next, plot it out!
ggplot(monthly_sales_aggregated, aes(x = month_name, y = total_sales, fill = month_name)) +
  geom_bar(stat = "identity") +
  labs(title = "Monthly Sales", 
       x = "Month", 
       y = "Total Sales") +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
 scale_fill_manual(values = c("#FFC1A1", "#FFDDC1", "#FFD700", "#E1C699", "#A4C2A4", "#76B4BD", "#77AADD", "#A8C0E0", "#B1A8DF", "#C9B8E8", "#D9B3C2", "#F4A582"))

Note

Observation: According to the visualization, it seems like the second half of the year recorded the highest sales (From September to December)

Report Findings

Sales Trends:
- The monthly sales data shows significant variations across different months, with certain months (such as November, September & December) displaying notably high sales.
- There is a potential seasonal trend that could be leveraged for marketing campaigns and inventory planning.
Category Performance:
- Certain product categories (like Technology) consistently outperform others (like Office Supplies) in terms of sales.
- This suggests that customers may have a higher demand for technology products, possibly indicating a shift in consumer preferences.
Segment Insights:
- Different customer segments have varying preferences for product categories, which can inform targeted marketing strategies.
- The Consumer segment shows strong performance, suggesting a focus on household and everyday items.
Shipping Modes:
- Analysis of sales by shipping mode reveals that certain shipping options contribute significantly to total sales.
- Understanding customer preferences for shipping modes can help optimize logistics and enhance customer satisfaction.
Product Subcategories:
- The performance of subcategories reveals opportunities for cross-selling products that are frequently bought together.
- Subcategories with lower sales could be targeted for promotions or enhanced marketing efforts.

Recommendations

Targeted Marketing Campaigns:
- Leverage seasonal trends by launching marketing campaigns during peak sales months to maximize visibility and sales.
- Use insights from customer segments to create tailored marketing messages that resonate with specific groups.
Inventory Management:
- Stock high-demand categories and products more heavily to ensure availability during peak sales periods, particularly for popular months.
- Consider reducing stock for underperforming categories or adjusting inventory based on seasonal sales patterns.
Enhance Shipping Options:
- Optimize shipping strategies by analyzing the impact of shipping modes on sales.
- Consider offering promotions for faster shipping options or bundling shipping costs to increase customer satisfaction and sales.
Focus on Low-Performing Products:
- Investigate the reasons behind lower sales in specific subcategories and develop strategies to enhance their appeal (e.g., discounts, promotions).
- Explore coupling underperforming products with more popular items to encourage cross-selling.
Regular Monitoring:
- Continuously monitor sales trends, customer preferences, and inventory levels to adapt quickly to changes in the market.
- Implement a dashboard for real-time sales analysis, which can help in making data-driven decisions.
Customer Feedback Loop:
- Engage with customers through surveys or feedback forms to understand their preferences better and improve product offerings.
- Use customer feedback to refine marketing strategies and product development.

Hypothesis Testing

In this section, I will apply hypothesis testing to assess whether there is a significant difference in sales across different columns in my data-set

Do different product segments perform significantly different in making sales?

Null hypothesis: There is no significant difference in average sales between different segments

Alternative hypothesis: There is a significant difference in average sales between different segments

Click to show/hide code

segment_anova <- aov(sales~segment, data = sales)
summary(segment_anova)

              Df    Sum Sq Mean Sq F value Pr(>F)
segment        2 4.180e+05  209015   0.535  0.586
Residuals   9786 3.827e+09  391040

Based on the ANOVA results, the p-value is 0.586, which is greater than our significance level threshold of 0.05. As a result, I will fail to reject the null hypothesis. This suggests that there is no statistically significant difference in sales between the different segments.

Do different shipping modes perform significantly different in making sales?

Null hypothesis: There is no significant difference in average sales between different shipping modes

Alternative hypothesis: There is a significant difference in average sales between different shipping modes

Click to show/hide code

shipping_anova <- aov(sales~ship_mode, data = sales)
summary(shipping_anova)

              Df    Sum Sq Mean Sq F value Pr(>F)
ship_mode      3 1.067e+05   35581   0.091  0.965
Residuals   9785 3.827e+09  391112

Based on the ANOVA results, the p-value is 0.965, which is greater than our significance level threshold of 0.05. As a result, I will fail to reject the null hypothesis. This suggests that there is no statistically significant difference in sales between the shipping modes.

Do different cities perform significantly different in making sales?

Null hypothesis: There is no significant difference in average sales between different cities

Alternative hypothesis: There is a significant difference in average sales between different cities

Click to show/hide code

cities_anova <- aov(sales~city, data = sales)
summary(cities_anova)

              Df    Sum Sq Mean Sq F value Pr(>F)
city         528 1.362e+08  257950   0.647      1
Residuals   9260 3.691e+09  398589

Based on the ANOVA results, the p-value is 1, which is greater than our significance level threshold of 0.05. As a result, I will fail to reject the null hypothesis. This suggests that there is no statistically significant difference in sales between the cities.

Do different product categories perform significantly different in making sales?

Null hypothesis: There is no significant difference in average sales between different product categories

Alternative hypothesis: There is a significant difference in average sales between different product categories

Click to show/hide code

categories_anova <- aov(sales~category, data = sales)
summary(categories_anova)

              Df    Sum Sq  Mean Sq F value Pr(>F)    
category       2 1.944e+08 97199561   261.8 <2e-16 ***
Residuals   9786 3.633e+09   371218                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA results, the p-value is <2e-16(0.000000000000002) , which is lesser than our significance level threshold of 0.05. As a result, I will reject the null hypothesis. This suggests that there is a strong statistically significant difference in sales between the different product categories.

Do different product sub-categories perform significantly different in making sales?

Null hypothesis: There is no significant difference in average sales between different product sub-categories

Alternative hypothesis: There is a significant difference in average sales between different product sub-categories

Click to show/hide code

subcategories_anova <- aov(sales~sub_category, data = sales)
summary(subcategories_anova)

               Df    Sum Sq  Mean Sq F value Pr(>F)    
sub_category   16 7.646e+08 47787152   152.5 <2e-16 ***
Residuals    9772 3.063e+09   313399                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Do different product states perform significantly different in making sales?

Null hypothesis: There is no significant difference in average sales between different states

Alternative hypothesis: There is a significant difference in average sales between different states

Click to show/hide code

state_anova <- aov(sales~state, data = sales)
summary(state_anova)

              Df    Sum Sq Mean Sq F value  Pr(>F)   
state         47 3.111e+07  661830   1.698 0.00205 **
Residuals   9741 3.796e+09  389696                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA results, the p-value is 0.00205 , which is lesser than our significance level threshold of 0.05. As a result, I will reject the null hypothesis. This suggests that there is a strong statistically significant difference in sales between the different states.