Goal: How can we learn from top performing stores to adjust operations to be more profitable?
We were tasked with the responsibility of bettering Regork by identifying a potential area of growth. We approached this by asking ourselves the question: how can we learn from our top performing stores to adjust operations to be more profitable.
In order to tackle this question we used the complete journey data set to examine the differences in the customer profile and transactions profile of the 25 Top performing stores versus all other stores. First, we identified the top 25 five performing sales based on total sales. Next, in the customer profile section we focused specifically on using the demographics data merged with transactions data to visualize how demographic factors influenced sales at these top-performing stores compared to all stores. We looked into sales distribution by income level, age group, household size, marital status, and household composition, home ownership, and number of kids. Then, in the the transactions profile section we used the products data in the transactions and products data sets to visualize how the customer’s purchasing habits at the top-twenty five stores differed from that of all stores. We looked into sales distribution by hour, department, product category, manufacturer, and package size. From these findings we were able to develop our strategic, data driven recommendation.
We recommend that our client implement targeted initiatives to engage demographics where the top 25 stores outperform all others in sales, while also introducing strategies to bridge the sales gap observed in the differing purchasing behaviors of customers between top-performing stores and all other stores. This analysis will greatly benefit Regork because it will allow the company to optimize targeted markets, enhance product strategy, and all store performances as a whole.
We used multiple packages to assist in our analysis and report:
library(completejourney)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(dplyr)
library(DT)
library(scales)
We used multiple data sets to assist in our analysis and report:
transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
## household_id store_id basket_id product_id quantity sales_value retail_disc
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 900 330 31198570044 1095275 1 0.5 0
## 2 900 330 31198570047 9878513 1 0.99 0.1
## 3 1228 406 31198655051 1041453 1 1.43 0.15
## 4 906 319 31198705046 1020156 1 1.5 0.29
## 5 906 319 31198705046 1053875 2 2.78 0.8
## 6 906 319 31198705046 1060312 1 5.49 0.5
## 7 906 319 31198705046 1075313 1 1.5 0.29
## 8 1058 381 31198676055 985893 1 1.88 0.21
## 9 1058 381 31198676055 988791 1 1.5 1.29
## 10 1058 381 31198676055 9297106 1 2.69 0
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## # transaction_timestamp <dttm>
demographics
## # A tibble: 801 × 8
## household_id age income home_ownership marital_status household_size
## <chr> <ord> <ord> <ord> <ord> <ord>
## 1 1 65+ 35-49K Homeowner Married 2
## 2 1001 45-54 50-74K Homeowner Unmarried 1
## 3 1003 35-44 25-34K <NA> Unmarried 1
## 4 1004 25-34 15-24K <NA> Unmarried 1
## 5 101 45-54 Under 15K Homeowner Married 4
## 6 1012 35-44 35-49K <NA> Married 5+
## 7 1014 45-54 15-24K <NA> Married 4
## 8 1015 45-54 50-74K Homeowner Unmarried 1
## 9 1018 45-54 35-49K Homeowner Married 5+
## 10 1020 45-54 25-34K Homeowner Married 2
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category product_type
## <chr> <chr> <chr> <fct> <chr> <chr>
## 1 25671 2 GROCERY Natio… FRZN ICE ICE - CRUSH…
## 2 26081 2 MISCELLANEOUS Natio… <NA> <NA>
## 3 26093 69 PASTRY Priva… BREAD BREAD:ITALI…
## 4 26190 69 GROCERY Priva… FRUIT - SHELF S… APPLE SAUCE
## 5 26355 69 GROCERY Priva… COOKIES/CONES SPECIALTY C…
## 6 26426 69 GROCERY Priva… SPICES & EXTRAC… SPICES & SE…
## 7 26540 69 GROCERY Priva… COOKIES/CONES TRAY PACK/C…
## 8 26601 69 DRUG GM Priva… VITAMINS VITAMIN - M…
## 9 26636 69 PASTRY Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691 16 GROCERY Priva… PNT BTR/JELLY/J… HONEY
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
# Create data table
store_sales <- transactions %>%
group_by(store_id) %>%
summarize(
total_sales = sum(sales_value),
total_qty = sum(quantity)
) %>%
arrange(desc(total_sales))
# Limit to the top 25 stores
top_stores <- head(store_sales, 25)
# Store as a list for future filter values
store_list <- top_stores$store_id
Based on the total sales of each store, our team was able to identify the Top 25 stores out of the total 457 stores in operations. This designation allowed us to compare Top Stores to Other Stores and identify variables that separated the two. These variables help indicate key differentiators that make Top Stores more profitable and successful.
# Transform the data table to be viewable on Rmd
datatable((top_stores), colnames = c("Store ID", "Total Sales", "Total Quantity"),
options = list(
order = list(list(2, 'desc')), # Sort by the third column in descending order
pageLength = 10 # Display 10 entries per page
)) %>%
formatCurrency(columns = c('total_sales'),
currency = "$",
digits = 2,
before = TRUE) %>%
formatCurrency(columns = 'total_qty',
currency = "",
interval = 3,
mark = ",",
digits = 0)
# Total Sales and Transactions for Top Stores
top_stores %>%
summarise(sum(total_sales),
sum(total_qty))
# Total Sales and Transactions for All Stores
store_sales %>%
summarise(sum(total_sales),
sum(total_qty))
# Summary Percentages
total_stores_num <- n_distinct(store_sales$store_id)
other_stores_num <- n_distinct(store_sales$store_id) - 25
top_stores_num <- n_distinct(top_stores$store_id)
prct_sales <- 1809251/4596040
prct_qty <- 98723109/152923108
prct_stores <- top_stores_num/total_stores_num
print(percent(prct_stores, accuracy = 0.01))
print(percent(prct_sales, accuracy = 0.01))
print(percent(prct_qty, accuracy = 0.01))
# Create percentages data frame
prct_data <- data.frame(
class = c("Top Stores", "Other Stores", "All Stores"),
stores = c(25, 457, 25+457),
sales = c(1809251, 4596040-1809251, 4596040),
transactions = c(98723109, 152923108-98723109, 152923108)
)
# Reshape the data to long format
df_long <- prct_data %>%
pivot_longer(cols = -class, names_to = "Column", values_to = "Value") %>%
group_by(Column) %>%
mutate(Total = Value[class == "All Stores"]) %>%
filter(class != "All Stores") %>%
mutate(Percentage = Value / Total * 100)
# Create the plot
ggplot(df_long, aes(x = "", y = Percentage, fill = class)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
facet_wrap(~factor(Column, levels = c("stores", "sales", "transactions")),
ncol = 3,
labeller = as_labeller(c(stores = "Stores",
sales = "Sales",
transactions = "Transactions"))) +
geom_text(aes(label = paste0(round(Percentage, 2), "%")),
position = position_stack(vjust = 0.5),
color = "white") +
theme_void() +
ggtitle('Top 25 vs Other Stores Percentage of Store, Sales, and Transactions') +
theme(
legend.position = "bottom",
plot.title = element_text(size = 15, face = 'bold', hjust = 0.5, margin = margin(b = 20)),
strip.text = element_text(size = 13)
) +
scale_fill_manual(name = "Store Classification",
values = c("Top Stores" = "darkorchid3",
"Other Stores" = "cadetblue3"))
The ideal customer for a high performing Regork Store is a household with two individuals who are married without kids, in the age group 45-54, making more than $75K a year, and who are homeowners.
Regork can improve their overall sales by expanding into markets that have more people sharing the demographics of their “ideal customer”. Analyzing the demographics of potential markets and comparing them to the data we’ve found can allow Regork to open new high-performing stores in key markets across the country.
# Create a data frame that combines transactions and demographics, and designates if the store from the transaction is classified as a Top Store or not.
combined_df <- transactions %>%
inner_join(demographics) %>%
mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores'))
Here is a sample version of the dataset used showing relevant transaction and demographic information for the top 50 sales values. The store class column was created by our team and allows us to identify stores that are designated as a Top 25 Store.
# Created filtered data frame as a sample of larger data used
df_filtered <- combined_df %>%
select(store_id, store_class, sales_value, household_id, age, income,
home_ownership, marital_status, household_size, household_comp, kids_count) %>%
arrange(desc(sales_value)) %>%
slice_head(n = 50)
# Transform the data table to be viewable on Rmd
datatable(df_filtered %>%
select(store_id, store_class, sales_value, household_id, income, age,
household_size, marital_status, household_comp, home_ownership, kids_count),
options = list(
scrollX = TRUE,
columnDefs = list(list(
targets = c(0, 1), # store_id and store_class
className = 'dt-center'
)),
order = list(list(2, 'desc')), # Sort by the third column (sales_value) in descending order
pageLength = 10, # Display 10 entries per page
lengthMenu = list(c(10, 25, 50), c('10', '25', '50')), # Allow user to choose 10, 25, or 50 entries
dom = 'lftip' # Show length changing, filtering, table, information, and pagination
),
colnames = c('Store ID', 'Store Class', 'Sales Value', 'Hshld ID', 'Income', 'Age',
'Hshld Size', 'Marital Status', 'Household Composition',
'Home Ownership', '# of Kids'),
rownames = FALSE) %>%
formatCurrency('sales_value', currency = "$", digits = 2) %>%
formatStyle(
'store_class',
target = 'row',
backgroundColor = styleEqual(
c("Top Stores", "Other Stores"),
c('#e5cdf1', '#dcf1f3')
)
) %>%
formatStyle('store_class', fontWeight = 'bold')
Note that transactions without any demographic data have been excluded from this data set and total calculations.
# Calculate overall total sales for top stores and for other stores
combined_df %>%
group_by(store_class) %>%
summarise(total_sales = sum(sales_value))
Variables that do seem to impact the performance of stores are:
Top stores generate a larger percentage of their sales from upper-middle income brackets, particularly the $75K–99K income group, compared to other stores. They also outperform other stores in all higher-income brackets greater than $75k, indicating a stronger appeal to affluent customers. In contrast, other stores have a more balanced distribution across lower-income brackets (e.g., under $15K, $15K–24K and $25-34K), suggesting they cater more to lower-income demographics. This highlights that top stores dominate the upper-middle and high-income segments, while other stores serve a broader range of lower and middle-income customers.
# Group data frame by income and classification, then add a column for the percentage of total sales
income_dist <- combined_df %>%
group_by(income, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))
# Plot graph comparing sales distribution by income of top stores versus other stores
income_dist %>%
ggplot(aes(x = income, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
geom_col(position = 'dodge') +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Income Level',
subtitle = 'The percentage of total sales for each income level of top stores compared to other stores') +
xlab('Income Level') +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.3, by = .05),
labels = scales::percent)
Top performing stores cater to markets with higher income levels.
# Create a list of high incomes
incomes <- unique(combined_df$income)
high_income_list <- sort(incomes, decreasing = TRUE)[1:7]
# Calculate total percent of overall sales for top stores
top_store_top_income <- income_dist %>%
filter(income %in% high_income_list & store_class == 'Top Stores')
sum(top_store_top_income$sales_percent)
# Calculate total percent of overall sales for other stores
all_store_top_income <- income_dist %>%
filter(income %in% high_income_list & store_class != 'Top Stores')
sum(all_store_top_income$sales_percent)
Variables that do not seem to impact the performance of stores are:
Most of the age groups have similar sales distributions when you compare top stores to other stores. Top stores do have slightly higher percentages for the age groups 24-35 and 45-54.
# Group data frame by age and classification, then add a column for the percentage of total sales
age_dist <- combined_df %>%
group_by(age, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))
# Plot graph comparing sales distribution by age groups of top stores versus other stores
age_dist %>%
ggplot(aes(x = age, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) + geom_col(position = 'dodge') +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Age Group',
subtitle = 'The percentage of total sales for each age group of top stores compared to other stores') +
xlab('Age Group') +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.5, by = .05),
labels = scales::percent)
Household size distribution of Top Stores is similar to Other Stores. There is a slightly higher percentage of households that are 1 person or 4 people.
# Group data frame by household size and classification, then add a column for the percentage of total sales
h_s_dist <- combined_df %>%
group_by(household_size, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))
# Plot graph comparing sales distribution by household size of top stores versus other stores
h_s_dist %>%
ggplot(aes(x = household_size, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) + geom_col(position = 'dodge') +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Household Size',
subtitle = 'The percentage of total sales for each household size of top stores compared to other stores') +
xlab('Household Size') +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.5, by = .05),
labels = scales::percent)
Marital Status has no impact on if a store will preform highly. The percentage of sales from Married households is almost equal for Top Stores versus Other Stores, and Unmarried household percentages are within 2% of each other.
# Group data frame by marital status and classification, then add a column for the percentage of total sales
m_s_dist <- combined_df %>%
group_by(marital_status, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))
# Plot graph comparing sales distribution by marital status of top stores versus other stores
m_s_dist %>%
filter(!is.na(marital_status)) %>%
ggplot(aes(x = marital_status, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) + geom_col(position = 'dodge') +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Marital Status',
subtitle = 'The percentage of total sales based on marital status for top stores compared to other stores') +
xlab('Marital Status') +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.5, by = .05),
labels = scales::percent)
Household composition is similar for both Top Stores and Other Stores. There is about 5% higher level of sales in Top Stores from households that have only one adult. This suggests, single people without kids are more common in markets of Top Stores.
# Group data frame by household composition and classification, then add a column for the percentage of total sales
h_c_dist <- combined_df %>%
group_by(household_comp, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))
# Plot graph comparing sales distribution by household composition of top stores versus other stores
h_c_dist %>%
filter(!is.na(household_comp)) %>%
ggplot(aes(x = household_comp, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) + geom_col(position = 'dodge') +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Household Composition',
subtitle = 'The percentage of total sales based on household composition for top stores\ncompared to other stores') +
xlab('Household Composition') +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.5, by = .05),
labels = scales::percent)
Homeownership levels are about the same for both Top Stores and Other Stores. There is about 5% higher percentage of sales in Top Stores from households that are homeowners. This means there are more homeowners in markets of Top Stores than markets of Other Stores.
# Group data frame by home ownership status and classification, then add a column for the percentage of total sales
h_o_dist <- combined_df %>%
group_by(home_ownership, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))
# Plot graph comparing sales distribution by home ownership status of top stores versus other stores
h_o_dist %>%
filter(!is.na(home_ownership)) %>%
ggplot(aes(x = home_ownership, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) + geom_col(position = 'dodge') +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Home Ownership',
subtitle = 'The percentage of total sales based on home ownership status for top stores\ncompared to other stores') +
xlab('Home Ownership') +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.7, by = .1),
labels = scales::percent)
Kid count has little impact on sales distributions when you compare top stores to other stores. Top stores do have slightly higher percentages for families with 0 or 2 kids.
# Group data frame by number of kids in family and classification, then add a column for the percentage of total sales
kids_dist <- combined_df %>%
group_by(kids_count, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))
# Plot graph comparing sales distribution by number of kids in family of top stores versus other stores
kids_dist %>%
filter(!is.na(kids_count)) %>%
ggplot(aes(x = kids_count, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) + geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Number of Kids',
subtitle = 'The percentage of total sales based on number of kids in family for top stores\ncompared to other stores') +
xlab('Number of Kids') +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.65, by = .1),
labels = scales::percent)
There are a few transaction details specific to top performing stores. Transaction time and product category had varied sales distributions when comparing Top Stores to Other Stores. Transaction details like Department, Manufacturer, and Package Size did not expose any differentiators of profitability.
Regork can optimize their sales strategies for Other Stores by highlighting successful tactics from Top Stores. This will enable them to increase customer engagement, boost transaction values, and ultimately improve profitability across their entire store network. A few ways Regork can do this is to:
# Create a data frame that combines transactions and products, and designates if the store from the transaction is classified as a Top Store or not.
all_store_sales <- transactions %>%
left_join(products, by = "product_id") %>%
mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores'))
Here is a sample version of the dataset used showing relevant transaction and product information for the top 50 sales values. The store class column was created by our team and allows us to identify stores that are designated as a Top 25 Store. We also created a transaction hour data point to capture what hour a transaction was made.
# Created filtered data frame as a sample of larger data used
df_filtered2 <- all_store_sales %>%
mutate(hour = hour(transaction_timestamp)) %>%
select(store_id, store_class, sales_value, product_id, hour, product_category, department, manufacturer_id, package_size) %>%
arrange(desc(sales_value)) %>%
slice_head(n = 50)
# Transform the data table to be viewable on Rmd
datatable(df_filtered2,
options = list(
scrollX = TRUE,
columnDefs = list(list(
targets = c(0, 1), # Store_id and store_class
className = 'dt-center'
)),
order = list(list(2, 'desc')), # Sort by the third column (sales_value) in descending order
pageLength = 10, # Display 10 entries per page
lengthMenu = list(c(10, 25, 50), c('10', '25', '50')), # Allow user to choose 10, 25, or 50 entries
dom = 'lftip' # Show length changing, filtering, table, information, and pagination
),
colnames = c('Store ID', 'Store Class', 'Sales Value', 'Product ID', 'Purchase Hour',
'Product Category', 'Department', 'Manufacturer ID', 'Package Size'),
rownames = FALSE) %>%
formatCurrency('sales_value', currency = "$", digits = 2) %>%
formatStyle(
'store_class',
target = 'row',
backgroundColor = styleEqual(
c("Top Stores", "Other Stores"),
c('#e5cdf1', '#dcf1f3')
)
) %>%
formatStyle('store_class', fontWeight = 'bold')
Variables that do seem to impact the performance of stores are:
A notable spike in transaction amounts occurs at 5 a.m. for Top Stores, where the average transaction value exceeds $6, significantly higher than at any other time of day. In contrast, Other Stores show a much steadier trend, with average transaction amounts hovering around $3 to $4 throughout the day. After 7 a.m., the transaction values for Top Stores gradually decline but remain consistently higher than those of Other Stores across all hours. This suggests that Top Stores experience a unique pattern of high-value transactions during early morning hours, while Other Stores maintain a more uniform transaction value throughout the day.
# Create a column for the hour of the transaction
sales_hr <- transactions %>%
mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores')) %>%
mutate(hour = hour(transaction_timestamp)) %>%
group_by(hour, store_class) %>%
summarize(avg_sales = mean(sales_value))
# Plot to show average sales by hour for top stores v. other stores
ggplot(sales_hr, aes(x = hour, y = avg_sales, color = store_class, group = store_class)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(title = "Transaction Amount Average by Hour",
subtitle = "Average value of each transaction by hour for Other Stores vs. Top Stores",
x = "Hour of Day",
y = "Average Transaction Amount",
color = "Store Class") +
scale_x_continuous(breaks = seq(0, 23, by = 1)) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(name = "Store Classification", values = c("Other Stores" = "cadetblue3", "Top Stores" = "darkorchid3")) +
theme_minimal() +
theme(axis.text.x = element_text(hjust = .5, size = 10),
axis.title.x = element_text(margin = margin(t = 10)))
Top Stores allocate a significantly higher percentage of their sales to Coupon/Misc Items which stands out as their dominant category, accounting for over 10% of total sales. In contrast, Other Stores show a more balanced distribution across categories, with no single category dominating to the same extent. Categories such as Beef and Soft Drinks contribute a higher percentage of sales in Other Stores compared to Top Stores.
Top Stores’ significantly higher percentage of sales for Coupon/Misc Items likely reflects their strategic use of coupons and discounts to drive customer engagement and boost sales. Coupons create a psychological incentive for shoppers, making them feel they are receiving exclusive deals, which can lead to impromptu purchases or larger transactions. Additionally, Top Stores may use coupons to promote specific products, clear inventory, or introduce new items, aligning with their broader business goals of increasing foot traffic and customer loyalty. This focus on coupon-driven sales could also indicate a reliance on targeted promotions to differentiate themselves from competitors and maximize revenue.
# Group data frame by product category and classification, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.5% of sales
p_c_dist <- all_store_sales %>%
group_by(product_category, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
filter(sales_percent > 0.015)
# Plot graph comparing sales distribution by category of top stores versus other stores
p_c_dist %>%
ggplot(aes(x = product_category, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Product Category',
subtitle = 'The percentage of total sales for each of the most common product category of\ntop storescompared to other stores') +
xlab('Product Categories') +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 0.3, by = .05),
labels = scales::percent)
Variables that do not seem to impact the performance of stores are:
The shape of the curves for both store classes is similar, indicating that time is not a significant differentiator of top stores for hourly sales. Both lines show a gradual increase in sales starting early in the morning, peaking around midday to early afternoon, and then tapering off into the evening. This similarity suggests that customer shopping behavior, in terms of timing, is consistent across both Top Stores and Other Stores.
# Create a column for the hour of the transaction
sales_hr <- transactions %>%
mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores')) %>%
mutate(hour = hour(transaction_timestamp)) %>%
group_by(hour, store_class) %>%
summarize(total_sales = sum(sales_value))
# Plot to show total sales by hour for top stores v. other stores
ggplot(sales_hr, aes(x = hour, y = total_sales, color = store_class, group = store_class)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(title = "Sales Total by Hour",
subtitle = "Total sales by hour for Other Stores vs. Top Stores",
x = "Hour of Day",
y = "Total Sales",
color = "Store Class") +
scale_x_continuous(breaks = seq(0, 23, by = 1)) +
scale_y_continuous(labels = scales::label_dollar(),
breaks = seq(0, 300000, by = 50000)) +
scale_color_manual(name = "Store Classification", values = c("Other Stores" = "cadetblue3", "Top Stores" = "darkorchid3")) +
theme_minimal() +
theme(axis.text.x = element_text(hjust = .5, size = 10),
axis.title.x = element_text(margin = margin(t = 10)))
Both store types generate the majority of their sales from the Grocery department, but Other Stores have a higher percentage of sales in this category compared to Top Stores. Meanwhile, Top Stores have a slightly higher share of sales in the Fuel department. Top Stores have about 12% of sales attributed to a gas station connected with their store.
# Group data frame by department, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.0% of sales.
d_dist <- all_store_sales %>%
group_by(department, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
filter(sales_percent > 0.01)
# Plot of graph comparing the sales distribution by department of top store versus all other stores.
d_dist %>%
ggplot(aes(x = department, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Department',
subtitle = 'The percentage of total sales for each of the most common departments of\ntop stores compared to other stores') +
xlab('Departments') +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 1, by = .1),
labels = scales::percent)
Manufacturer 69 overwhelmingly leads in sales, contributing over 30% in Top Stores and slightly less in Other Stores, highlighting its market dominance. Manufacturer 2 has a moderate share, with a slightly higher percentage in Other Stores than in Top Stores. The remaining manufacturers (103, 1208, 1251, 317, 544, 673, and 764) have minimal contributions to total sales, with no significant differences between the two store classifications. This suggests that sales are highly concentrated among a few key manufacturers, particularly Manufacturer 69, and manufacturer choice is not a significant differentiator of top stores.
# Group data frame by manufacturer and classification, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.0% of sales.
m_dist <- all_store_sales %>%
group_by(manufacturer_id, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
arrange(desc(total_sales)) %>%
filter(sales_percent > 0.01)
# Plot graph comparing sales distribution by manufacturer of top stores versus other stores
m_dist %>%
ggplot(aes(x = manufacturer_id, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Manufacturer',
subtitle = 'The percentage of total sales for each of the most common manufacturers of top stores\ncompared to other stores') +
xlab('Manufacturer') +
theme(axis.text.x = element_text(angle = 0, vjust = .5, hjust = .5)) +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 1, by = .05),
labels = scales::percent)
The 12 OZ package size is the most significant contributor to sales, with a slightly higher share in Other Stores compared to Top Stores. Other package sizes, have minimal and comparable contributions across both store types. Overall package size is not a significant differentiator of top stores.
# Group data frame by package size and classification, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.5% of sales.
p_s_dist <- all_store_sales %>%
filter(!is.na(package_size)) %>%
group_by(package_size, store_class) %>%
summarize(total_sales = sum(sales_value)) %>%
mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
arrange(desc(total_sales)) %>%
filter(sales_percent > 0.015)
# Plot graph comparing sales distribution by package size of top stores versus other stores
p_s_dist %>%
ggplot(aes(x = package_size, y = sales_percent, fill = store_class)) +
theme_minimal() +
scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
geom_col(position = 'dodge') +
ggtitle('Sales Distribution by Package Size',
subtitle = 'The percentage of total sales for each of the most common package size purchased of\ntop stores compared to other stores') +
xlab('Manufacturer') +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = .75)) +
scale_y_continuous(name = 'Percent of Sales',
breaks = seq(0, 1, by = .05),
labels = scales::percent)
In order to answer this question comprehensively, we took a broad approach. We utilized the complete journey dataset to analyze the similarities and differences in customer profiles and transaction profiles between the top 25 performing stores and all other stores.
After identifying our top 25 stores through filtration based on the percentage of total sales, we analyzed the customer profile. In this section, we focused on demographic data merged with transaction data to understand how demographic factors influenced sales at these top-performing stores compared to all stores. First, we examined income levels by analyzing sales distribution across different income brackets to determine which income levels contributed most to sales. Second, we segmented customers by age to identify the most prevalent age groups among top-performing stores. Third, we assessed the number of individuals in a household to determine whether larger or smaller households were more frequent shoppers. We then compared sales data among single, married, and other marital statuses. Additionally, we used household composition data to understand the makeup of households, such as single-parent families and couples without children. Next, we differentiated between homeowners and renters to assess whether homeownership status impacted purchasing behavior. Finally, we analyzed how the presence of children in a household influenced shopping patterns.
After completing the customer profile, we formed a transaction profile. We merged the product and transaction datasets to visualize how purchasing habits at the top 25 stores differed from those at all stores. First, we identified peak shopping hours and compared them between top-performing and other stores by examining sales distribution by hour. Second, we analyzed which departments (e.g., groceries, electronics) had higher sales in top-performing stores. Next, we broke down sales by product category to determine which types of products were more popular. We also examined sales distribution by manufacturer to identify any brand preferences. Finally, we analyzed whether customers at top-performing stores preferred larger or smaller package sizes.
From our analysis, there are three main ways that Regork can adjust operations in lower performing stores to be more profitable. The three key insights are as follows:
Assuming that Regrok implements the above three insights, customers will see changes in their shopping experiences. Firstly, with the company targeting households with higher incomes, prices may increase and the shopping experience may have more of a high-end feel. Secondly, store hours may be narrowed down based on lower sales values during certain times of day. In other words, Regork being open for 24 hours a day may not be the most profitable- although it would require further analysis to determine this. With that, customers may receive additional coupons that are only redeemable at times of day where Regork is trying to boost sales, such as 6 am. Lastly, additional coupons and other changes to the customer shopping experience would be dependent on which stores the customer shops at. From the analysis in this report, it is in Regork’s best interest to target many of these changes at the lower 432 stores when it comes to sales performance because the top 25 stores are relatively successful.
Our group made every effort to ensure our analysis was as thorough and reliable as possible. However, like any analysis, there were certain limitations that could be improved upon or expanded in future research.
One key limitation was our sample size. We believe that access to a larger dataset would have allowed us to identify more trends and draw stronger conclusions. Additionally, some of the sales data from low-performing stores may have skewed our findings. A few of the lowest-performing stores recorded total sales of less than $10, which likely impacted our results. Moving forward, we would refine our approach by removing outliers on both the low and high ends to enhance data accuracy.
Another significant limitation was incomplete data. Several variables, such as package size and marital status, contained missing values labeled as “NA.” This lack of complete information restricted our ability to conduct a more comprehensive analysis. Addressing these data gaps in future studies would provide a more accurate and meaningful interpretation of consumer behavior.