This dataset contains the sales for different products by month and region. The raw dataset is untidy due to its wide format, providing different columns by month. If new data were to be entered we would need to alter its structure to add more months. Additionally in its current format we would need to give further consideration when adding data after a full year has passed and we begin to repeat months.
First we pull in the data in from its CSV file.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
sales_raw <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/product_sales_by_month.csv",
check.names = FALSE)
head(sales_raw)
## Product Name Region Jan Sales Feb Sales Mar Sales Apr Sales May Sales
## 1 Product A North 100 110 120 130 140
## 2 Product A South 200 210 220 230 240
## 3 Product A East 300 310 320 330 340
## 4 Product B North 150 160 170 180 190
## 5 Product B South 250 260 270 280 290
## 6 Product B East 350 360 370 380 390
## Jun Sales
## 1 150
## 2 250
## 3 350
## 4 200
## 5 300
## 6 400
We can pivot this longer to move the months into rows.
sales_tidy <- pivot_longer(data = sales_raw,
cols = 3:last_col(),
names_to = "Month",
names_pattern = "(\\w+) Sales",
values_to = "Sales") |>
rename(Product_Name = `Product Name`)
head(sales_tidy)
## # A tibble: 6 × 4
## Product_Name Region Month Sales
## <chr> <chr> <chr> <int>
## 1 Product A North Jan 100
## 2 Product A North Feb 110
## 3 Product A North Mar 120
## 4 Product A North Apr 130
## 5 Product A North May 140
## 6 Product A North Jun 150
We now have a more manageable dataframe that we would be able to use when adding in new information without needing to constantly change its structure.
Some analysis that we may want to explore:
Lets first look into the overall sales performance of the products.
sales_tidy |>
group_by(Product_Name) |>
summarise(Total_Sales = sum(Sales, na.rm = TRUE),
Average_Sales = mean(Sales),
Max_Sales = max(Sales),
Min_Sales = min(Sales),
.groups = "drop")
## # A tibble: 3 × 5
## Product_Name Total_Sales Average_Sales Max_Sales Min_Sales
## <chr> <int> <dbl> <int> <int>
## 1 Product A 4050 225 350 100
## 2 Product B 4950 275 400 150
## 3 Product C 2025 112. 175 50
Product Sales Distribution
sales_tidy %>%
group_by(Product_Name) %>%
summarise(Total_Sales = sum(Sales)) %>%
ggplot(aes(x = Product_Name,
y = Total_Sales,
fill = Product_Name)) +
geom_bar(stat = "identity") +
labs(title = "Product Sales Distribution", x = "Product", y = "Total Sales") +
theme_minimal()
Product B had roughly 5000 total sales, Product A has a little over 4000
sales, and Product C has 2000 total sales.
Takeaway: - Product B has the highest overall sales - Product C has the lowest overall sales
Now lets put region into the mix. First we can look at the performance of each region overall, and then products by region.
# Sales performance by region
sales_tidy |>
group_by(Region) |>
summarise(Total_Sales = sum(Sales, na.rm = TRUE),
Average_Sales = mean(Sales),
Max_Sales = max(Sales),
Min_Sales = min(Sales),
.groups = "drop")
## # A tibble: 3 × 5
## Region Total_Sales Average_Sales Max_Sales Min_Sales
## <chr> <int> <dbl> <int> <int>
## 1 East 5175 288. 400 150
## 2 North 2175 121. 200 50
## 3 South 3675 204. 300 100
sales_tidy %>%
group_by(Region) %>%
summarise(Total_Sales = sum(Sales)) %>%
ggplot(aes(x = Region,
y = Total_Sales,
fill = Region)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Regional Sales Performance", x = "Region", y = "Total Sales") +
theme_minimal()
Takeaway: - The East region has the highest performing sales - The North region has the lowest performing sales
# Sales performance by product and region
sales_tidy |>
group_by(Region, Product_Name) |>
summarise(Total_Sales = sum(Sales, na.rm = TRUE),
Average_Sales = mean(Sales),
Max_Sales = max(Sales),
Min_Sales = min(Sales),
.groups = "drop")
## # A tibble: 9 × 6
## Region Product_Name Total_Sales Average_Sales Max_Sales Min_Sales
## <chr> <chr> <int> <dbl> <int> <int>
## 1 East Product A 1950 325 350 300
## 2 East Product B 2250 375 400 350
## 3 East Product C 975 162. 175 150
## 4 North Product A 750 125 150 100
## 5 North Product B 1050 175 200 150
## 6 North Product C 375 62.5 75 50
## 7 South Product A 1350 225 250 200
## 8 South Product B 1650 275 300 250
## 9 South Product C 675 112. 125 100
Takeaway: - Across all regions Product B has the best performance in each region and Product C has the worst performance in each region
Lets start taking a look at how each product is performing month to month
sales_tidy |>
group_by(Product_Name, Month) |>
summarise(Total_Sales = sum(Sales, na.rm = TRUE),
.groups = "drop") |>
ggplot(aes(x = factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")),
y = Total_Sales,
color = Product_Name,
group = Product_Name)) +
geom_line() +
geom_point() +
labs(title = "Monthly Sales by Product",
x = "Month",
y = "Total Sales",
color = "Product")
Takeaway: - We see that there are overall upward trends in sales for
each of the three products across the months.
ggplot(sales_tidy, aes(x=Region, y= Sales, color = Product_Name)) + geom_point(postion = "jitter")
## Warning in geom_point(postion = "jitter"): Ignoring unknown parameters:
## `postion`
Here we see that product A and B are relately close in sales for all 3
regions but for product C, the difference between the other two is
greater. We can see the largest difference in East region.
sales_tidy |>
group_by(Product_Name, Month, Region) |>
summarise(Total_Sales = sum(Sales, na.rm = TRUE),
.groups = "drop") |>
ggplot(aes(x = factor(Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")),
y = Total_Sales,
color = Product_Name,
group = Product_Name)) +
geom_line() +
geom_point() +
facet_wrap(~ Region) +
labs(title = "Monthly Sales by Product",
x = "Month",
y = "Total Sales",
color = "Product")
Takeaway: - Regardless of whether the Product us the highest performing or lowest performing, each of the products have an upward trend - Each region also has an upward trend in sales - As we expected we see that all three of the products perform best in the East region and have lower performances in the North