Product Sales By Month

This dataset contains the sales for different products by month and region. The raw dataset is untidy due to its wide format, providing different columns by month. If new data were to be entered we would need to alter its structure to add more months. Additionally in its current format we would need to give further consideration when adding data after a full year has passed and we begin to repeat months.

Loading the data

First we pull in the data in from its CSV file.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

sales_raw <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/product_sales_by_month.csv", 
                      check.names = FALSE)

head(sales_raw)
##   Product Name Region Jan Sales Feb Sales Mar Sales Apr Sales May Sales
## 1    Product A  North       100       110       120       130       140
## 2    Product A  South       200       210       220       230       240
## 3    Product A   East       300       310       320       330       340
## 4    Product B  North       150       160       170       180       190
## 5    Product B  South       250       260       270       280       290
## 6    Product B   East       350       360       370       380       390
##   Jun Sales
## 1       150
## 2       250
## 3       350
## 4       200
## 5       300
## 6       400

Re-arranging the data

We can pivot this longer to move the months into rows.

sales_tidy <- pivot_longer(data = sales_raw,
                           cols = 3:last_col(),
                           names_to = "Month",
                           names_pattern = "(\\w+) Sales",
                           values_to = "Sales") |>
  rename(Product_Name = `Product Name`)

head(sales_tidy)
## # A tibble: 6 × 4
##   Product_Name Region Month Sales
##   <chr>        <chr>  <chr> <int>
## 1 Product A    North  Jan     100
## 2 Product A    North  Feb     110
## 3 Product A    North  Mar     120
## 4 Product A    North  Apr     130
## 5 Product A    North  May     140
## 6 Product A    North  Jun     150

We now have a more manageable dataframe that we would be able to use when adding in new information without needing to constantly change its structure.

Data Analysis

Some analysis that we may want to explore:

  • Which product has the highest and lowest sales?
  • What trend do we see in sales totals across the months?
  • Which region is seeing the highest and lowest sales?

Lets first look into the overall sales performance of the products.

sales_tidy |>
  group_by(Product_Name) |>
  summarise(Total_Sales = sum(Sales, na.rm = TRUE),
            Average_Sales = mean(Sales),
            Max_Sales = max(Sales),
            Min_Sales = min(Sales),
            .groups = "drop") 
## # A tibble: 3 × 5
##   Product_Name Total_Sales Average_Sales Max_Sales Min_Sales
##   <chr>              <int>         <dbl>     <int>     <int>
## 1 Product A           4050          225        350       100
## 2 Product B           4950          275        400       150
## 3 Product C           2025          112.       175        50

Product Sales Distribution

sales_tidy %>%
  group_by(Product_Name) %>%
  summarise(Total_Sales = sum(Sales)) %>%
  ggplot(aes(x = Product_Name, 
                          y = Total_Sales, 
                          fill = Product_Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Product Sales Distribution", x = "Product", y = "Total Sales") +
  theme_minimal()

Product B had roughly 5000 total sales, Product A has a little over 4000 sales, and Product C has 2000 total sales.

Takeaway: - Product B has the highest overall sales - Product C has the lowest overall sales

Now lets put region into the mix. First we can look at the performance of each region overall, and then products by region.

# Sales performance by region
sales_tidy |>
  group_by(Region) |>
  summarise(Total_Sales = sum(Sales, na.rm = TRUE),
            Average_Sales = mean(Sales),
            Max_Sales = max(Sales),
            Min_Sales = min(Sales),
            .groups = "drop") 
## # A tibble: 3 × 5
##   Region Total_Sales Average_Sales Max_Sales Min_Sales
##   <chr>        <int>         <dbl>     <int>     <int>
## 1 East          5175          288.       400       150
## 2 North         2175          121.       200        50
## 3 South         3675          204.       300       100
sales_tidy %>%
  group_by(Region) %>%
  summarise(Total_Sales = sum(Sales)) %>%
  ggplot(aes(x = Region, 
                          y = Total_Sales, 
                          fill = Region)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Regional Sales Performance", x = "Region", y = "Total Sales") +
  theme_minimal()

Takeaway: - The East region has the highest performing sales - The North region has the lowest performing sales

# Sales performance by product and region
sales_tidy |>
  group_by(Region, Product_Name) |>
  summarise(Total_Sales = sum(Sales, na.rm = TRUE),
            Average_Sales = mean(Sales),
            Max_Sales = max(Sales),
            Min_Sales = min(Sales),
            .groups = "drop") 
## # A tibble: 9 × 6
##   Region Product_Name Total_Sales Average_Sales Max_Sales Min_Sales
##   <chr>  <chr>              <int>         <dbl>     <int>     <int>
## 1 East   Product A           1950         325         350       300
## 2 East   Product B           2250         375         400       350
## 3 East   Product C            975         162.        175       150
## 4 North  Product A            750         125         150       100
## 5 North  Product B           1050         175         200       150
## 6 North  Product C            375          62.5        75        50
## 7 South  Product A           1350         225         250       200
## 8 South  Product B           1650         275         300       250
## 9 South  Product C            675         112.        125       100

Takeaway: - Across all regions Product B has the best performance in each region and Product C has the worst performance in each region

Lets start taking a look at how each product is performing month to month

sales_tidy |>
  group_by(Product_Name, Month) |>
  summarise(Total_Sales = sum(Sales, na.rm = TRUE), 
            .groups = "drop") |>
  ggplot(aes(x = factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")), 
             y = Total_Sales,
             color = Product_Name,
             group = Product_Name)) +
  geom_line() +
  geom_point() +
  labs(title = "Monthly Sales by Product",
       x = "Month",
       y = "Total Sales",
       color = "Product")

Takeaway: - We see that there are overall upward trends in sales for each of the three products across the months.

ggplot(sales_tidy, aes(x=Region, y= Sales, color = Product_Name)) + geom_point(postion = "jitter")
## Warning in geom_point(postion = "jitter"): Ignoring unknown parameters:
## `postion`

Here we see that product A and B are relately close in sales for all 3 regions but for product C, the difference between the other two is greater. We can see the largest difference in East region.

sales_tidy |>
  group_by(Product_Name, Month, Region) |>
  summarise(Total_Sales = sum(Sales, na.rm = TRUE), 
            .groups = "drop") |>
  ggplot(aes(x = factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")), 
             y = Total_Sales,
             color = Product_Name,
             group = Product_Name)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Region) +
  labs(title = "Monthly Sales by Product",
       x = "Month",
       y = "Total Sales",
       color = "Product")

Takeaway: - Regardless of whether the Product us the highest performing or lowest performing, each of the products have an upward trend - Each region also has an upward trend in sales - As we expected we see that all three of the products perform best in the East region and have lower performances in the North