Goal: How can we learn from top performing stores to adjust operations to be more profitable?

Introduction

Business Problem

We were tasked with the responsibility of bettering Regork by identifying a potential area of growth. We approached this by asking ourselves the question: how can we learn from our top performing stores to adjust operations to be more profitable.

Approach and Methodology

In order to tackle this question we used the complete journey data set to examine the differences in the customer profile and transactions profile of the 25 Top performing stores versus all other stores. First, we identified the top 25 five performing sales based on total sales. Next, in the customer profile section we focused specifically on using the demographics data merged with transactions data to visualize how demographic factors influenced sales at these top-performing stores compared to all stores. We looked into sales distribution by income level, age group, household size, marital status, and household composition, home ownership, and number of kids. Then, in the the transactions profile section we used the products data in the transactions and products data sets to visualize how the customer’s purchasing habits at the top-twenty five stores differed from that of all stores. We looked into sales distribution by hour, department, product category, manufacturer, and package size. From these findings we were able to develop our strategic, data driven recommendation.

Proposed Solution and Business Impact

We recommend that our client implement targeted initiatives to engage demographics where the top 25 stores outperform all others in sales, while also introducing strategies to bridge the sales gap observed in the differing purchasing behaviors of customers between top-performing stores and all other stores. This analysis will greatly benefit Regork because it will allow the company to optimize targeted markets, enhance product strategy, and all store performances as a whole.

Set Up and Data

Packages Used

We used multiple packages to assist in our analysis and report:

  • completejourney - data sets with various information about transactions
  • tidyverse - group of packages for data manipulation, analysis and visualization
  • tidyr - set of tools to reshape and tidy data for efficient analysis
  • ggplot2 - provides graphing and data visualization capabilities
  • dplyr - a tool for working with data frame like objects
  • DT - allows data tables to be easily viewed on Rmd
  • scales - formats numbers and customizes data visualization scales and labels
library(completejourney)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(dplyr)
library(DT)
library(scales)

Datasets Used

We used multiple data sets to assist in our analysis and report:

  • transactions - retail purchase data for purchases over one year Transactions
    • household_id
    • store_id
    • basket_id
    • product_id
    • quantity
    • sales_value
    • retail_disc
    • coupon_disc
    • coupon_match_disc
    • week
    • transaction_timestamp
  • demographics - household information including age, income, family size, and more
    • household_id
    • age
    • income
    • home_ownership
    • marital_status
    • household_size
    • household_comp
    • kids_count
  • products - product specifications including category, brand, manufacturer, and more
    • product_id
    • manufacturer_id
    • department
    • brand
    • product_category
    • product_type
    • package_size
transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
##    household_id store_id basket_id   product_id quantity sales_value retail_disc
##    <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
##  1 900          330      31198570044 1095275           1        0.5         0   
##  2 900          330      31198570047 9878513           1        0.99        0.1 
##  3 1228         406      31198655051 1041453           1        1.43        0.15
##  4 906          319      31198705046 1020156           1        1.5         0.29
##  5 906          319      31198705046 1053875           2        2.78        0.8 
##  6 906          319      31198705046 1060312           1        5.49        0.5 
##  7 906          319      31198705046 1075313           1        1.5         0.29
##  8 1058         381      31198676055 985893            1        1.88        0.21
##  9 1058         381      31198676055 988791            1        1.5         1.29
## 10 1058         381      31198676055 9297106           1        2.69        0   
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>
demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>

Top Stores

# Create data table
store_sales <- transactions %>%
  group_by(store_id) %>%
  summarize(
    total_sales = sum(sales_value),
    total_qty = sum(quantity)
  ) %>%
  arrange(desc(total_sales))

# Limit to the top 25 stores
top_stores <- head(store_sales, 25)

# Store as a list for future filter values
store_list <- top_stores$store_id

Top 25 Stores

Based on the total sales of each store, our team was able to identify the Top 25 stores out of the total 457 stores in operations. This designation allowed us to compare Top Stores to Other Stores and identify variables that separated the two. These variables help indicate key differentiators that make Top Stores more profitable and successful.

# Transform the data table to be viewable on Rmd
datatable((top_stores), colnames = c("Store ID", "Total Sales", "Total Quantity"), 
          options = list(
          order = list(list(2, 'desc')),  # Sort by the third column in descending order
          pageLength = 10  # Display 10 entries per page
          )) %>%
  formatCurrency(columns = c('total_sales'), 
                 currency = "$", 
                 digits = 2, 
                 before = TRUE) %>%
  formatCurrency(columns = 'total_qty', 
                 currency = "", 
                 interval = 3, 
                 mark = ",", 
                 digits = 0)

Comparing Top Stores to Other Stores

  • There are 457 stores in total and the top 25 stores are only 5.47% of the stores in operations.
  • The top 25 stores account for $1,809,251 in sales or 39.37% of all sales.
  • The top 25 stores account for 98,723,109 transactions or 64.56% of all transactions.
# Total Sales and Transactions for Top Stores
top_stores %>%
  summarise(sum(total_sales),
            sum(total_qty))

# Total Sales and Transactions for All Stores
store_sales %>%
  summarise(sum(total_sales),
            sum(total_qty))

# Summary Percentages
total_stores_num <- n_distinct(store_sales$store_id)
other_stores_num <- n_distinct(store_sales$store_id) - 25
top_stores_num <- n_distinct(top_stores$store_id)

prct_sales <- 1809251/4596040
prct_qty <- 98723109/152923108
prct_stores <- top_stores_num/total_stores_num

print(percent(prct_stores, accuracy = 0.01))
print(percent(prct_sales, accuracy = 0.01))
print(percent(prct_qty, accuracy = 0.01))
# Create percentages data frame
prct_data <- data.frame(
  class = c("Top Stores", "Other Stores", "All Stores"),
  stores = c(25, 457, 25+457),
  sales = c(1809251, 4596040-1809251, 4596040),
  transactions = c(98723109, 152923108-98723109, 152923108)
)

# Reshape the data to long format
df_long <- prct_data %>%
  pivot_longer(cols = -class, names_to = "Column", values_to = "Value") %>%
  group_by(Column) %>%
  mutate(Total = Value[class == "All Stores"]) %>%
  filter(class != "All Stores") %>%
  mutate(Percentage = Value / Total * 100)

# Create the plot
ggplot(df_long, aes(x = "", y = Percentage, fill = class)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  facet_wrap(~factor(Column, levels = c("stores", "sales", "transactions")), 
             ncol = 3,
             labeller = as_labeller(c(stores = "Stores", 
                                      sales = "Sales", 
                                      transactions = "Transactions"))) +
  geom_text(aes(label = paste0(round(Percentage, 2), "%")), 
            position = position_stack(vjust = 0.5),
            color = "white") +
  theme_void() +
  ggtitle('Top 25 vs Other Stores Percentage of Store, Sales, and Transactions') +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 15, face = 'bold', hjust = 0.5, margin = margin(b = 20)),
    strip.text = element_text(size = 13)
  ) +
  scale_fill_manual(name = "Store Classification", 
                    values = c("Top Stores" = "darkorchid3", 
                               "Other Stores" = "cadetblue3"))

Customer Profile

Our Findings

How can the demographics of top performing stores tell us who an ideal customer is?

The ideal customer for a high performing Regork Store is a household with two individuals who are married without kids, in the age group 45-54, making more than $75K a year, and who are homeowners.

  • The most significant demographic variable that differentiates Top Stores is the income bracket of customers. Top Stores have have a customer base of upper-middle class individuals.
  • Other demographics have similar sales distributions at Top Stores, compared to Other Stores, meaning they are not significant indicators of a high performing store.

Regork can improve their overall sales by expanding into markets that have more people sharing the demographics of their “ideal customer”. Analyzing the demographics of potential markets and comparing them to the data we’ve found can allow Regork to open new high-performing stores in key markets across the country.

  • Ideal markets for expansion will have an average income of at least $75K a year, even better would be markets with an average income of at least $100K.
  • While other demographics are not a differentiator of Top Stores, they still show who Regork’s most common customer’s are. Ideal markets would have a high concentration of people ages 45-54, households of 1 or 2 people, and people who are homeowners.

Dataset Used

# Create a data frame that combines transactions and demographics, and designates if the store from the transaction is classified as a Top Store or not.
combined_df <- transactions %>%
  inner_join(demographics) %>%
  mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores'))

Here is a sample version of the dataset used showing relevant transaction and demographic information for the top 50 sales values. The store class column was created by our team and allows us to identify stores that are designated as a Top 25 Store.

# Created filtered data frame as a sample of larger data used
df_filtered <- combined_df %>%
  select(store_id, store_class, sales_value, household_id, age, income, 
         home_ownership, marital_status, household_size, household_comp, kids_count) %>%
  arrange(desc(sales_value)) %>%
  slice_head(n = 50)

# Transform the data table to be viewable on Rmd
datatable(df_filtered %>% 
            select(store_id, store_class, sales_value, household_id, income, age,
                   household_size, marital_status, household_comp, home_ownership, kids_count),
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = c(0, 1),  # store_id and store_class
              className = 'dt-center'
            )),
            order = list(list(2, 'desc')),  # Sort by the third column (sales_value) in descending order
            pageLength = 10,  # Display 10 entries per page
            lengthMenu = list(c(10, 25, 50), c('10', '25', '50')),  # Allow user to choose 10, 25, or 50 entries
            dom = 'lftip'  # Show length changing, filtering, table, information, and pagination
          ),
          colnames = c('Store ID', 'Store Class', 'Sales Value', 'Hshld ID', 'Income', 'Age',
                       'Hshld Size', 'Marital Status', 'Household Composition', 
                       'Home Ownership', '# of Kids'),
          rownames = FALSE) %>%
  formatCurrency('sales_value', currency = "$", digits = 2) %>%
  formatStyle(
    'store_class',
    target = 'row',
    backgroundColor = styleEqual(
      c("Top Stores", "Other Stores"), 
      c('#e5cdf1', '#dcf1f3')
    )
  ) %>%
  formatStyle('store_class', fontWeight = 'bold')

Relevent Totals

  • Total sales from Top Stores: $1,082,050
  • Total sales from Other Stores: $1,542,929

Note that transactions without any demographic data have been excluded from this data set and total calculations.

# Calculate overall total sales for top stores and for other stores
combined_df %>%
  group_by(store_class) %>%
  summarise(total_sales = sum(sales_value))

Significant Demographics

Variables that do seem to impact the performance of stores are:

  • Income Level

Income Level Impact on Top Stores

Top stores generate a larger percentage of their sales from upper-middle income brackets, particularly the $75K–99K income group, compared to other stores. They also outperform other stores in all higher-income brackets greater than $75k, indicating a stronger appeal to affluent customers. In contrast, other stores have a more balanced distribution across lower-income brackets (e.g., under $15K, $15K–24K and $25-34K), suggesting they cater more to lower-income demographics. This highlights that top stores dominate the upper-middle and high-income segments, while other stores serve a broader range of lower and middle-income customers.

# Group data frame by income and classification, then add a column for the percentage of total sales
income_dist <- combined_df %>%
  group_by(income, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))

# Plot graph comparing sales distribution by income of top stores versus other stores 
income_dist %>%
  ggplot(aes(x = income, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +  
  geom_col(position = 'dodge') +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Income Level',
            subtitle = 'The percentage of total sales for each income level of top stores compared to other stores') +
  xlab('Income Level') +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + 
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.3, by = .05),
                       labels = scales::percent)

Overall Sales from High Incomes

Top performing stores cater to markets with higher income levels.

  • In Top Stores, customers making more than $75k make up 40.45% of overall sales
  • In Other Stores, customers making more than $75k make up only 30.46% of overall sales
# Create a list of high incomes
incomes <- unique(combined_df$income)
high_income_list <- sort(incomes, decreasing = TRUE)[1:7]

# Calculate total percent of overall sales for top stores
top_store_top_income <- income_dist %>%
  filter(income %in% high_income_list & store_class == 'Top Stores')
sum(top_store_top_income$sales_percent)

# Calculate total percent of overall sales for other stores
all_store_top_income <- income_dist %>%
  filter(income %in% high_income_list & store_class != 'Top Stores')
sum(all_store_top_income$sales_percent)

Non-Significant Demographics

Variables that do not seem to impact the performance of stores are:

  • Age
  • Household Size
  • Marital Status
  • Household Composition
  • Home Ownership
  • Number of Kids

Age Impact on Top Stores

Most of the age groups have similar sales distributions when you compare top stores to other stores. Top stores do have slightly higher percentages for the age groups 24-35 and 45-54.

# Group data frame by age and classification, then add a column for the percentage of total sales
age_dist <- combined_df %>%
  group_by(age, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))

# Plot graph comparing sales distribution by age groups of top stores versus other stores 
age_dist %>%
  ggplot(aes(x = age, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +  geom_col(position = 'dodge') +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Age Group',
            subtitle = 'The percentage of total sales for each age group of top stores compared to other stores') +
  xlab('Age Group') +
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.5, by = .05),
                       labels = scales::percent)

Household Size Impact on Top Stores

Household size distribution of Top Stores is similar to Other Stores. There is a slightly higher percentage of households that are 1 person or 4 people.

# Group data frame by household size and classification, then add a column for the percentage of total sales
h_s_dist <- combined_df %>%
  group_by(household_size, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))

# Plot graph comparing sales distribution by household size of top stores versus other stores 
h_s_dist %>%
  ggplot(aes(x = household_size, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +  geom_col(position = 'dodge') +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Household Size',
            subtitle = 'The percentage of total sales for each household size of top stores compared to other stores') +
  xlab('Household Size') +
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.5, by = .05),
                       labels = scales::percent)

Marital Status Impact on Top Stores

Marital Status has no impact on if a store will preform highly. The percentage of sales from Married households is almost equal for Top Stores versus Other Stores, and Unmarried household percentages are within 2% of each other.

# Group data frame by marital status and classification, then add a column for the percentage of total sales
m_s_dist <- combined_df %>%
  group_by(marital_status, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))

# Plot graph comparing sales distribution by marital status of top stores versus other stores 
m_s_dist %>%
  filter(!is.na(marital_status)) %>%
  ggplot(aes(x = marital_status, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +  geom_col(position = 'dodge') +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Marital Status',
            subtitle = 'The percentage of total sales based on marital status for top stores compared to other stores') +
  xlab('Marital Status') +
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.5, by = .05),
                       labels = scales::percent)

Household Composition Impact on Top Stores

Household composition is similar for both Top Stores and Other Stores. There is about 5% higher level of sales in Top Stores from households that have only one adult. This suggests, single people without kids are more common in markets of Top Stores.

# Group data frame by household composition and classification, then add a column for the percentage of total sales
h_c_dist <- combined_df %>%
  group_by(household_comp, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))

# Plot graph comparing sales distribution by household composition of top stores versus other stores 
h_c_dist %>%
  filter(!is.na(household_comp)) %>%
  ggplot(aes(x = household_comp, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +  geom_col(position = 'dodge') +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Household Composition',
            subtitle = 'The percentage of total sales based on household composition for top stores\ncompared to other stores') +
  xlab('Household Composition') +
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.5, by = .05),
                       labels = scales::percent)

Home Ownership Impact on Top Stores

Homeownership levels are about the same for both Top Stores and Other Stores. There is about 5% higher percentage of sales in Top Stores from households that are homeowners. This means there are more homeowners in markets of Top Stores than markets of Other Stores.

# Group data frame by home ownership status and classification, then add a column for the percentage of total sales
h_o_dist <- combined_df %>%
  group_by(home_ownership, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))

# Plot graph comparing sales distribution by home ownership status of top stores versus other stores 
h_o_dist %>%
  filter(!is.na(home_ownership)) %>%
  ggplot(aes(x = home_ownership, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +  geom_col(position = 'dodge') +  
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Home Ownership',
            subtitle = 'The percentage of total sales based on home ownership status for top stores\ncompared to other stores') +
  xlab('Home Ownership') +
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.7, by = .1),
                       labels = scales::percent)

Number of Kids Impact on Sales Distribution

Kid count has little impact on sales distributions when you compare top stores to other stores. Top stores do have slightly higher percentages for families with 0 or 2 kids.

# Group data frame by number of kids in family and classification, then add a column for the percentage of total sales
kids_dist <- combined_df %>%
  group_by(kids_count, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1082050, 1542929))

# Plot graph comparing sales distribution by number of kids in family of top stores versus other stores 
kids_dist %>%
  filter(!is.na(kids_count)) %>%
  ggplot(aes(x = kids_count, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Number of Kids',
            subtitle = 'The percentage of total sales based on number of kids in family for top stores\ncompared to other stores') +
  xlab('Number of Kids') +
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.65, by = .1),
                       labels = scales::percent)

Transaction Profile

Our Findings

How can the transaction history and product offerings of top performing stores tell us what store changes will increase revenue?

There are a few transaction details specific to top performing stores. Transaction time and product category had varied sales distributions when comparing Top Stores to Other Stores. Transaction details like Department, Manufacturer, and Package Size did not expose any differentiators of profitability.

  • Top Stores experience a notable spike in average transaction values exceeding $6 at *5 a.m.**, followed by a gradual decline, while Other Stores maintain a more consistent average of $3 to $4 throughout the day.
  • Top Stores heavily emphasize Coupon/Misc Items, which account for over 10% of their sales, while Other Stores display a more balanced distribution across categories, with higher percentages in areas like Beef and Soft Drinks.

Regork can optimize their sales strategies for Other Stores by highlighting successful tactics from Top Stores. This will enable them to increase customer engagement, boost transaction values, and ultimately improve profitability across their entire store network. A few ways Regork can do this is to:

  • Implement a strategic coupon marketing program, focusing on high-value early morning promotions and loyalty rewards, to drive customer engagement and increase average transaction values across Other Stores
  • Develop a targeted discount strategy that includes timed offers, first-time customer incentives, and bundled deals to boost sales volume and attract new customers while encouraging larger purchases in Other Stores

Dataset

# Create a data frame that combines transactions and products, and designates if the store from the transaction is classified as a Top Store or not.
all_store_sales <- transactions %>%
  left_join(products, by = "product_id") %>%
  mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores'))

Here is a sample version of the dataset used showing relevant transaction and product information for the top 50 sales values. The store class column was created by our team and allows us to identify stores that are designated as a Top 25 Store. We also created a transaction hour data point to capture what hour a transaction was made.

# Created filtered data frame as a sample of larger data used
df_filtered2 <- all_store_sales %>%
  mutate(hour = hour(transaction_timestamp)) %>%
  select(store_id, store_class, sales_value, product_id, hour, product_category, department, manufacturer_id, package_size) %>%
  arrange(desc(sales_value)) %>%
  slice_head(n = 50)

# Transform the data table to be viewable on Rmd
datatable(df_filtered2,
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = c(0, 1),  # Store_id and store_class
              className = 'dt-center'
            )),
            order = list(list(2, 'desc')),  # Sort by the third column (sales_value) in descending order
            pageLength = 10,  # Display 10 entries per page
            lengthMenu = list(c(10, 25, 50), c('10', '25', '50')),  # Allow user to choose 10, 25, or 50 entries
            dom = 'lftip'  # Show length changing, filtering, table, information, and pagination
          ),
          colnames = c('Store ID', 'Store Class', 'Sales Value', 'Product ID', 'Purchase Hour', 
                       'Product Category', 'Department', 'Manufacturer ID', 'Package Size'),
          rownames = FALSE) %>%
  formatCurrency('sales_value', currency = "$", digits = 2) %>%
  formatStyle(
    'store_class',
    target = 'row',
    backgroundColor = styleEqual(
      c("Top Stores", "Other Stores"), 
      c('#e5cdf1', '#dcf1f3')
    )
  ) %>%
  formatStyle('store_class', fontWeight = 'bold')

Significant Transaction Details

Variables that do seem to impact the performance of stores are:

  • Time (for Transaction Value)
  • Product Category

Time Impact on Transaction Value at Top Stores

A notable spike in transaction amounts occurs at 5 a.m. for Top Stores, where the average transaction value exceeds $6, significantly higher than at any other time of day. In contrast, Other Stores show a much steadier trend, with average transaction amounts hovering around $3 to $4 throughout the day. After 7 a.m., the transaction values for Top Stores gradually decline but remain consistently higher than those of Other Stores across all hours. This suggests that Top Stores experience a unique pattern of high-value transactions during early morning hours, while Other Stores maintain a more uniform transaction value throughout the day.

# Create a column for the hour of the transaction 
sales_hr <- transactions %>%
  mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores')) %>%
  mutate(hour = hour(transaction_timestamp)) %>%
  group_by(hour, store_class) %>%
  summarize(avg_sales = mean(sales_value))

# Plot to show average sales by hour for top stores v. other stores 
ggplot(sales_hr, aes(x = hour, y = avg_sales, color = store_class, group = store_class)) +
  geom_line(size = 1) + 
  geom_point(size = 3) + 
  labs(title = "Transaction Amount Average by Hour",
  subtitle = "Average value of each transaction by hour for Other Stores vs. Top Stores",
       x = "Hour of Day",
       y = "Average Transaction Amount",
       color = "Store Class") +
  scale_x_continuous(breaks = seq(0, 23, by = 1)) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(name = "Store Classification", values = c("Other Stores" = "cadetblue3", "Top Stores" = "darkorchid3")) +
  theme_minimal() +
    theme(axis.text.x = element_text(hjust = .5, size = 10),
        axis.title.x = element_text(margin = margin(t = 10)))

Product Category Impact on Top Stores

Top Stores allocate a significantly higher percentage of their sales to Coupon/Misc Items which stands out as their dominant category, accounting for over 10% of total sales. In contrast, Other Stores show a more balanced distribution across categories, with no single category dominating to the same extent. Categories such as Beef and Soft Drinks contribute a higher percentage of sales in Other Stores compared to Top Stores.

Top Stores’ significantly higher percentage of sales for Coupon/Misc Items likely reflects their strategic use of coupons and discounts to drive customer engagement and boost sales. Coupons create a psychological incentive for shoppers, making them feel they are receiving exclusive deals, which can lead to impromptu purchases or larger transactions. Additionally, Top Stores may use coupons to promote specific products, clear inventory, or introduce new items, aligning with their broader business goals of increasing foot traffic and customer loyalty. This focus on coupon-driven sales could also indicate a reliance on targeted promotions to differentiate themselves from competitors and maximize revenue.

# Group data frame by product category and classification, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.5% of sales
p_c_dist <- all_store_sales %>%
  group_by(product_category, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
  filter(sales_percent > 0.015)

# Plot graph comparing sales distribution by category of top stores versus other stores 
p_c_dist %>%
  ggplot(aes(x = product_category, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Product Category',
            subtitle = 'The percentage of total sales for each of the most common product category of\ntop storescompared to other stores') +
  xlab('Product Categories') +
        theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + 
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 0.3, by = .05),
                       labels = scales::percent)

Non-Significant Transaction Details

Variables that do not seem to impact the performance of stores are:

  • Time (for Total Sales)
  • Department
  • Manufacturer
  • Package Size

Time Impact on Sales Total at Top Stores

The shape of the curves for both store classes is similar, indicating that time is not a significant differentiator of top stores for hourly sales. Both lines show a gradual increase in sales starting early in the morning, peaking around midday to early afternoon, and then tapering off into the evening. This similarity suggests that customer shopping behavior, in terms of timing, is consistent across both Top Stores and Other Stores.

# Create a column for the hour of the transaction 
sales_hr <- transactions %>%
  mutate(store_class = dplyr::case_when(store_id %in% store_list ~ 'Top Stores', TRUE ~ 'Other Stores')) %>%
  mutate(hour = hour(transaction_timestamp)) %>%
  group_by(hour, store_class) %>%
  summarize(total_sales = sum(sales_value))

# Plot to show total sales by hour for top stores v. other stores 
ggplot(sales_hr, aes(x = hour, y = total_sales, color = store_class, group = store_class)) +
  geom_line(size = 1) + 
  geom_point(size = 3) + 
  labs(title = "Sales Total by Hour",
  subtitle = "Total sales by hour for Other Stores vs. Top Stores",
       x = "Hour of Day",
       y = "Total Sales",
       color = "Store Class") +
  scale_x_continuous(breaks = seq(0, 23, by = 1)) +
  scale_y_continuous(labels = scales::label_dollar(), 
                   breaks = seq(0, 300000, by = 50000)) +
  scale_color_manual(name = "Store Classification", values = c("Other Stores" = "cadetblue3", "Top Stores" = "darkorchid3")) +
  theme_minimal() +
    theme(axis.text.x = element_text(hjust = .5, size = 10),
        axis.title.x = element_text(margin = margin(t = 10)))

Department Impact on Top Stores

Both store types generate the majority of their sales from the Grocery department, but Other Stores have a higher percentage of sales in this category compared to Top Stores. Meanwhile, Top Stores have a slightly higher share of sales in the Fuel department. Top Stores have about 12% of sales attributed to a gas station connected with their store.

# Group data frame by department, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.0% of sales.
d_dist <- all_store_sales %>%
  group_by(department, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
  filter(sales_percent > 0.01)

# Plot of graph comparing the sales distribution by department of top store versus all other stores.
d_dist %>%
  ggplot(aes(x = department, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Department',
            subtitle = 'The percentage of total sales for each of the most common departments of\ntop stores compared to other stores') +
  xlab('Departments') +
        theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 1, by = .1),
                       labels = scales::percent)

Manufacturer Impact on Top Stores

Manufacturer 69 overwhelmingly leads in sales, contributing over 30% in Top Stores and slightly less in Other Stores, highlighting its market dominance. Manufacturer 2 has a moderate share, with a slightly higher percentage in Other Stores than in Top Stores. The remaining manufacturers (103, 1208, 1251, 317, 544, 673, and 764) have minimal contributions to total sales, with no significant differences between the two store classifications. This suggests that sales are highly concentrated among a few key manufacturers, particularly Manufacturer 69, and manufacturer choice is not a significant differentiator of top stores.

# Group data frame by manufacturer and classification, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.0% of sales.
m_dist <- all_store_sales %>%
  group_by(manufacturer_id, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
  arrange(desc(total_sales)) %>%
  filter(sales_percent > 0.01)

# Plot graph comparing sales distribution by manufacturer of top stores versus other stores
m_dist %>%
  ggplot(aes(x = manufacturer_id, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Manufacturer',
            subtitle = 'The percentage of total sales for each of the most common manufacturers of top stores\ncompared to other stores') +
  xlab('Manufacturer') +
        theme(axis.text.x = element_text(angle = 0, vjust = .5, hjust = .5)) + 
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 1, by = .05),
                       labels = scales::percent)

Package Size Impact on Top Stores

The 12 OZ package size is the most significant contributor to sales, with a slightly higher share in Other Stores compared to Top Stores. Other package sizes, have minimal and comparable contributions across both store types. Overall package size is not a significant differentiator of top stores.

# Group data frame by package size and classification, then add a column for the percentage of total sales, and filtering it for categories with greater than 1.5% of sales.
p_s_dist <- all_store_sales %>%
  filter(!is.na(package_size)) %>%
  group_by(package_size, store_class) %>%
  summarize(total_sales = sum(sales_value)) %>%  
  mutate(sales_percent = total_sales/if_else(store_class == 'Top Stores', 1809251, 2786788)) %>%
  arrange(desc(total_sales)) %>%
  filter(sales_percent > 0.015)

# Plot graph comparing sales distribution by package size of top stores versus other stores
p_s_dist %>%
  ggplot(aes(x = package_size, y = sales_percent, fill = store_class)) +
  theme_minimal() +
  scale_fill_manual(name = "Store Classification", values = c("Top Stores" = "darkorchid3", "Other Stores" = "cadetblue3")) +
  geom_col(position = 'dodge') +
  ggtitle('Sales Distribution by Package Size',
            subtitle = 'The percentage of total sales for each of the most common package size purchased of\ntop stores compared to other stores') +
  xlab('Manufacturer') +
        theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = .75)) + 
  scale_y_continuous(name = 'Percent of Sales', 
                       breaks = seq(0, 1, by = .05),
                       labels = scales::percent)

Conclusions

Methodology Review

In order to answer this question comprehensively, we took a broad approach. We utilized the complete journey dataset to analyze the similarities and differences in customer profiles and transaction profiles between the top 25 performing stores and all other stores.

After identifying our top 25 stores through filtration based on the percentage of total sales, we analyzed the customer profile. In this section, we focused on demographic data merged with transaction data to understand how demographic factors influenced sales at these top-performing stores compared to all stores. First, we examined income levels by analyzing sales distribution across different income brackets to determine which income levels contributed most to sales. Second, we segmented customers by age to identify the most prevalent age groups among top-performing stores. Third, we assessed the number of individuals in a household to determine whether larger or smaller households were more frequent shoppers. We then compared sales data among single, married, and other marital statuses. Additionally, we used household composition data to understand the makeup of households, such as single-parent families and couples without children. Next, we differentiated between homeowners and renters to assess whether homeownership status impacted purchasing behavior. Finally, we analyzed how the presence of children in a household influenced shopping patterns.

After completing the customer profile, we formed a transaction profile. We merged the product and transaction datasets to visualize how purchasing habits at the top 25 stores differed from those at all stores. First, we identified peak shopping hours and compared them between top-performing and other stores by examining sales distribution by hour. Second, we analyzed which departments (e.g., groceries, electronics) had higher sales in top-performing stores. Next, we broke down sales by product category to determine which types of products were more popular. We also examined sales distribution by manufacturer to identify any brand preferences. Finally, we analyzed whether customers at top-performing stores preferred larger or smaller package sizes.

Key Insights

From our analysis, there are three main ways that Regork can adjust operations in lower performing stores to be more profitable. The three key insights are as follows:

  1. Regork should capitalize on the opportunity to target higher household incomes (above 74k/year).
  2. Regork should offer incentives around 6 and 7 am to ensure that average transaction amounts do not plateau after the 5am peak. One way to do this would be through coupons that are only redeemable during certain hours.
  3. Regork should focus couponing efforts on the bottom 432 of stores because the coupons have been successful at the top 25 of stores.

Customer Implications

Assuming that Regrok implements the above three insights, customers will see changes in their shopping experiences. Firstly, with the company targeting households with higher incomes, prices may increase and the shopping experience may have more of a high-end feel. Secondly, store hours may be narrowed down based on lower sales values during certain times of day. In other words, Regork being open for 24 hours a day may not be the most profitable- although it would require further analysis to determine this. With that, customers may receive additional coupons that are only redeemable at times of day where Regork is trying to boost sales, such as 6 am. Lastly, additional coupons and other changes to the customer shopping experience would be dependent on which stores the customer shops at. From the analysis in this report, it is in Regork’s best interest to target many of these changes at the lower 432 stores when it comes to sales performance because the top 25 stores are relatively successful.

Analysis Limitations

Our group made every effort to ensure our analysis was as thorough and reliable as possible. However, like any analysis, there were certain limitations that could be improved upon or expanded in future research.

One key limitation was our sample size. We believe that access to a larger dataset would have allowed us to identify more trends and draw stronger conclusions. Additionally, some of the sales data from low-performing stores may have skewed our findings. A few of the lowest-performing stores recorded total sales of less than $10, which likely impacted our results. Moving forward, we would refine our approach by removing outliers on both the low and high ends to enhance data accuracy.

Another significant limitation was incomplete data. Several variables, such as package size and marital status, contained missing values labeled as “NA.” This lack of complete information restricted our ability to conduct a more comprehensive analysis. Addressing these data gaps in future studies would provide a more accurate and meaningful interpretation of consumer behavior.