The goal of this project analysis for Regork is to identify the most profitable strategy for growth in the high-income market segment. My hypothesis is that this economic class contributes higher purchasing volumes than every other segment. Therefore, targeting growth in this segment will lead to higher profits overall.
Firstly, I will Identify the packages and data sets used throughout my project analysis.
library(ggplot2)
library(tidyverse)
library(dplyr)
library(completejourney)
library(ggthemes)
library(gridExtra)
transactions <- get_transactions()
promotions <- get_promotions()
I have joined gridExtra as it will allow me to display multiple plots in a way that works best with my visualizations.
Now I will join the necessary variables to the transactions data set. For this project I am going to be looking at how sales transactions are correlated with purchased products, brands, and certain promotions.
transactions <- transactions %>%
left_join(products %>% select(product_id, brand, product_type), by = "product_id")
transactions <- transactions %>%
left_join(demographics %>% select(household_id, income), by = "household_id")
head(transactions)
## # A tibble: 6 × 14
## househ…¹ store…² baske…³ produ…⁴ quant…⁵ sales…⁶ retai…⁷ coupo…⁸ coupo…⁹ week
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 900 330 311985… 1095275 1 0.5 0 0 0 1
## 2 900 330 311985… 9878513 1 0.99 0.1 0 0 1
## 3 1228 406 311986… 1041453 1 1.43 0.15 0 0 1
## 4 906 319 311987… 1020156 1 1.5 0.29 0 0 1
## 5 906 319 311987… 1053875 2 2.78 0.8 0 0 1
## 6 906 319 311987… 1060312 1 5.49 0.5 0 0 1
## # … with 4 more variables: transaction_timestamp <dttm>, brand <fct>,
## # product_type <chr>, income <ord>, and abbreviated variable names
## # ¹household_id, ²store_id, ³basket_id, ⁴product_id, ⁵quantity, ⁶sales_value,
## # ⁷retail_disc, ⁸coupon_disc, ⁹coupon_match_disc
Now that I have completed adding the necessary variables to my data frame, I am going to confirm that my hypothesis is true.
To display the correlation that high economic spending is correlated with a greater income, I derived average spending per household and visualized it within each income segment.
income_sales <- transactions %>%
filter(!is.na(income)) %>%
group_by(income) %>%
summarize(
total_sales = sum(sales_value, na.rm = TRUE),
household_count = n_distinct(household_id)
) %>%
mutate(avg_sales_per_household = total_sales / household_count) %>%
arrange(factor(income, levels = c("Under 15K", "15-24K", "25-34K", "35-49K",
"50-74K", "75-99K", "100-124K", "125-149K",
"150-174K", "175-199K", "200-249K", "250K+")))
ggplot(income_sales, aes(x = reorder(income, avg_sales_per_household), y = avg_sales_per_household, fill = income)) +
geom_bar(stat = "identity", color = "black", show.legend = FALSE) +
coord_flip() +
labs(title = "Average Sales Value per Household by Income Category",
x = "Income Category",
y = "Avg Sales per Household ($)") +
theme_minimal()
income_sales <- transactions %>%
filter(!is.na(income)) %>%
group_by(income) %>%
summarize(
total_sales = sum(sales_value, na.rm = TRUE),
household_count = n_distinct(household_id)
) %>%
mutate(avg_sales_per_household = total_sales / household_count) %>%
arrange(factor(income, levels = c("Under 15K", "15-24K", "25-34K", "35-49K",
"50-74K", "75-99K", "100-124K", "125-149K",
"150-174K", "175-199K", "200-249K", "250K+")))
ggplot(income_sales, aes(x = income, y = avg_sales_per_household)) +
geom_point(size = 4, color = "blue", alpha = 0.7) +
geom_line(group = 1, color = "darkblue", linetype = "dashed") +
labs(title = "Average Sales per Household by Income Category",
x = "Income Category",
y = "Avg Sales per Household ($)") +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold")
)
My hypothesis that higher income is correlated with higher spending is confirmed with $250K+ being a much higher spending than each other of the income segment.
Furthermore, I notice a dip in spending within the segment of 200-249K income. It is interesting that those with an average household spending of under 15k spend more than that segment. This is an important takeaway from my data that will help me with my analysis going forward.
Since there is a significant dip in spending among the category of 200-249k, it does not make since to target this segment in my analysis. Since we are looking at the highest spending segment, I am going to continue my analysis by filtering out data to focus on the 250K+ income sector.
transactions_250K <- transactions %>%
filter(income == "250K+")
transactions_X_250K <- transactions %>%
filter(income != "250K+" | is.na(income))
Now that I have 2 data frames filtered for and without 250K income, I can begin looking for growth opportunities in my data. The first question I ask is how Regork is performing in each quarter for this economic segment.
To complete this segment I first filter the weeks of my data into 4 separate variables (Q1, Q2, Q3, Q4) as weeks (1-13,14-26,27-39,40-52).I assign a value of 1 in these variables so it can be easily mutated with sales value and other statistics to better represent purchasing power for each segment.
transactions_X_250K <- transactions_X_250K %>%
mutate(
Q1 = ifelse(week >= 1 & week <= 13, 1, 0),
Q2 = ifelse(week >= 14 & week <= 26, 1, 0),
Q3 = ifelse(week >= 27 & week <= 39, 1, 0),
Q4 = ifelse(week >= 40 & week <= 52, 1, 0)
)
transactions_250K <- transactions_250K %>%
mutate(
Q1 = ifelse(week >= 1 & week <= 13, 1, 0),
Q2 = ifelse(week >= 14 & week <= 26, 1, 0),
Q3 = ifelse(week >= 27 & week <= 39, 1, 0),
Q4 = ifelse(week >= 40 & week <= 52, 1, 0)
)
ggplot(transactions_X_250K, aes(x = case_when(
Q1 == 1 ~ "Q1",
Q2 == 1 ~ "Q2",
Q3 == 1 ~ "Q3",
Q4 == 1 ~ "Q4"
), weight = sales_value)) +
geom_bar(fill = "red", color = "black", width = 0.7) + # Blue with black outline
labs(title = "Quarterly Sales Value Distribution (Below 250K)",
x = "Quarter",
y = "Total Sales Value ($)") +
theme_minimal(base_size = 16) + # Increase base text size
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 18), # Center title
axis.text.x = element_text(size = 14, face = "bold"), # Bigger, bolder x-axis labels
axis.text.y = element_text(size = 12),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.grid.major = element_blank(), # Remove major grid lines
panel.grid.minor = element_blank() # Remove minor grid lines
)
## Quarter Performance for salary above 250K
ggplot(transactions_250K, aes(x = case_when(
Q1 == 1 ~ "Q1",
Q2 == 1 ~ "Q2",
Q3 == 1 ~ "Q3",
Q4 == 1 ~ "Q4"
), weight = sales_value)) +
geom_bar(fill = "blue", color = "black", width = 0.7) + # Blue with black outline
labs(title = "Quarterly Sales Value Distribution (250K+)",
x = "Quarter",
y = "Total Sales Value ($)") +
theme_minimal(base_size = 16) + # Increase base text size
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 18), # Center title
axis.text.x = element_text(size = 14, face = "bold"), # Bigger, bolder x-axis labels
axis.text.y = element_text(size = 12),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
panel.grid.major = element_blank(), # Remove major grid lines
panel.grid.minor = element_blank() # Remove minor grid lines
)
After displaying the performance for each group one quarter that stood out to me was Q3. For the segment below 250K, it seems that Q1 is the lowest performing with quarters 2-4 on par with each other.
However, for the segment below 250K, there is a significant dip in performance in Q3. Because of this dip in quarter 3, I am going to focus on certain products or promotions that may be beneficial in tracking performance for this sector To begin looking at this section, I will compare data specifically for quarter 3. The factor I am going to consider is the products during quarter 3. My idea is to identify a top performing product from quarter 3 in the high income segment that we can increase marketing to give higher sales and revenue.
transactions_250K_Q3 <- transactions_250K %>%
filter(Q3 == 1)
transactions_X_250K_Q3 <- transactions_X_250K %>%
filter(Q3 == 1)
get_top_5_products <- function(df) {
df %>%
group_by(product_type) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
slice_head(n = 5) # Keep only top 5 products
}
# Filter top 5 products for each dataset
top_products_X_250K_Q3 <- get_top_5_products(transactions_X_250K_Q3)
top_products_250K_Q3 <- get_top_5_products(transactions_250K_Q3)
plot1 <- ggplot(top_products_X_250K_Q3, aes(x = reorder(product_type, count), y = count, fill = product_type)) +
geom_bar(stat = "identity", width = 0.6) +
labs(title = "Top 5 Products (Below 250K) - Q3",
x = "Product Type",
y = NULL) +
coord_flip() +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 12, margin = margin(t = 5, b = 5)),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold"),
plot.margin = margin(t = 10, r = 10, b = 10, l = 10),
legend.position = "none"
)
plot2 <- ggplot(top_products_250K_Q3, aes(x = reorder(product_type, count), y = count, fill = product_type)) +
geom_bar(stat = "identity", width = 0.6) +
labs(title = "Top 5 Products (250K+) - Q3",
x = "Product Type",
y = NULL) +
coord_flip() +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 12, margin = margin(t = 5, b = 5)),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold"),
plot.margin = margin(t = 10, r = 10, b = 10, l = 10),
legend.position = "none"
)
grid.arrange(plot1, plot2, ncol = 2)
With this Correlation one area that interests me is the RTS SOUP: CHUNKY/HOMESTYLE. From the visualization we can see that this is a product that is number 2 in demand during quarter 3 for the 250K economic sector.I strongly suggest that Regork pushes advertising and marketing for this product in Q2/Q3 as it could lead to higher returns for Regork. A price increase would be something Regork could heavily consider additionally since those with higher salaries are more likely to spend on these products. Another factor I looked into this segment was if high income earners are more likely to purchase based upon the brand of which there product comes from. This data is shown in a pie chart below.
brand_spending_250K <- transactions_250K %>%
group_by(brand) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE))
brand_spending_X_250K <- transactions_X_250K %>%
group_by(brand) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE))
pie_250K <- ggplot(brand_spending_250K, aes(x = "", y = total_sales, fill = brand)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Brand Spending Distribution (250K+)", fill = "Brand") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank())
pie_X_250K <- ggplot(brand_spending_X_250K, aes(x = "", y = total_sales, fill = brand)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Brand Spending Distribution (Below 250K)", fill = "Brand") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank())
grid.arrange(pie_250K, pie_X_250K, ncol = 2)
Since we can see that the brand distribution among those earning 250K+ is significantly lower for private spending, an area for opportunity for Q3 can be an increase in marketing for nationally branded products. This could drive steady revenue for this economic sector throughout the quarter.
Overall my goal for Regork was to find the best way to increase revenue. I believe my data shows for findings that will be crucial to meeting this goa.
From my multiple charts I was able to deduct that the higher economic sector are the highest spenders per household. Additionally there is an opportunity for growth during the third quarter as this sector is greatly under performing against the rest of market.
Since soup greatly over performs in this market, this can be an area the Regork continues to grow. I would recommend a pricing increase in these products since the data illustrates that this sector are the highest spenders and likely to pay more for these goods.
Lastly, It is shown that consumers in the higher economic class are more inclined to pay for nationally branded products. Increasing the supply and advertisement of these products will lead to an increase in revenue.