ABS Buying Patterns Project

Author

A Warsaw

Purpose

To determine a better methodology for observing the buying patterns between different locations, I will be observing the inventory of all locations for the following:

  • Trends in inventory performance between all stores

  • Determining if the current methodology which Alcohol and Beverage Services of Montgomery County (or ABS for short) uses for observing buying patterns is the best method

  • Using models to optimize the inventory management process either through ABS’ current methodology, or a newly introduced model

To do so, I would like to observe the data set provided to me by ABS to locate any potential trends, determine best practices, and also engineer new variables if needed for modeling purposes.

Stores to be Observed

For my observations, I will be researching information for the following locations:

  • Aspen Hill (store 28)
  • Burtonsville (store 8)
  • Cabin John (store 12)
  • Clarksburg Village (store 3)
  • Cloverly (store 10)
  • Darnestown (store 20)
  • Downtown Rockville (store 27)
  • Fallsgrove (store 21)
  • Flower (store 7)
  • Gaithersburg Square (store 30)
  • Goshen Crossing (store 17)
  • Hampden Lane (store 23)
  • Kensington (store 14)
  • King Farm (store 26)
  • Kingsview (store 6)
  • Leisure World (store 13)
  • Montrose (store 24)
  • Muddy Branch (store 9)
  • Olney (store 18)
  • Poolesville (store 29)
  • Potomac (store 15)
  • Seneca Meadows (store 16)
  • Silver Spring (store 2)
  • Walnut Hill (store 11)
  • Westbard (store 4)
  • Wheaton (store 5)
  • White Oak Town Center (store 22)

It is also important to note that there are 3 size designations for all 27 stores (Size A, B, and C) which is used to categorize the stores by their corresponding size:

  • Size A is for stores with greater than 6000 sq ft
  • Size B is for stores between 4000 and 6000 sq ft
  • Size C is for stores below 4000 sq ft

These designations are also categorized by their sales performance (based on a scale created in 2022) from tier 1 to 3 where:

  • Tier 1 is for stores performing greater than $8 million in the SY
  • Tier 2 is for stores performing between $5 and $8 million in the SY
  • Tier 3 is for stores who perform less than $5 million in the SY

So all 27 stores are identified by both a Size and Tier. I have listed below every store with their corresponding Tiers and Rank in descending order:

Size ‘A’:

- Seneca Meadows (Store #16) A1

- Muddy Branch (Store #09) A1

- Damestown (Store #20) A1

- Leisure World (Store #13) A2

- Kingsview (Store #06) A2

- Goshen Crossing (Store #17) A2

- Downtown Rockville (Store #27) A2

- Clarksburg Village (Store #03) A2

- Aspen Hill (Store #28) A2

- Gaithersburg (Store #30) A3

- Cloverly (Store #10) A3

Size ‘B’:

- Hampden Lane (Store #23) B1

- Kensington (Store #14) B1

- Montrose (Store #24) B1

- Westbard (Store #04) B1

- Silver Spring (Store #02) B2

- Walnut Hill (Store #11) B2

- Burtonsvile (Store #08) B3

- Cabin John (Store #12) B3

- White Oak Town Center (Store #22) B3

Size ‘C’:

- Potomac (Store #15) C1

- Olney (Store #18) C2

- Wheaton (Store #05) C2

- Fallsgrove (Store #21) C3

- Flower (Store #07) C3

- King Farm (Store #26) C3

- Poolesville (Store #29) C3

About the Data Set

There are a total of 18 variables and roughly 12 million observations provided in the transactions data set provided by ABS, which I will provide a breakdown of below. However, there will also be additional variables that I will also include for this data set and will be using throughout this project which will be mentioned here.

Variables originally included in the data set:

  • transdate: This is the physical date that the transaction took place. ABS has provided transactions starting from July 1, 2023 to June 30, 2025 (roughly 2 years worth of data)
  • transactionid: This is a unique identifier for each individual transaction (likely to not be used for this project)
  • store: This is the store number where the transaction took place
  • storename: This is the formatted store name where the transaction took place
  • itemid: This is the unique identifier for products. Item dimension attributes do not change. If an item’s pack unit or bottles per case changes a new code is created.
  • description: This is the product description
  • netamount: This is the total cost of the line item / product purchased (discounts included) without tax
  • lineqty: This is quantity sold of each individual line item in a transaction. So if 3 bottles of wine were sold, this would be 3. If 2 6-packs were sold, this would be 2
  • packunit: This is the selling unit for the line item. This is the unit as how the product was sold. This would either be Btl (single bottle) or 4pk, 6pk, etc. Some lines may only show 04 or 06 and this relates to 4pk or 6pk. This is more to show the calculation of the totalqty column, which represents how many actual bottles were sold to a customer
  • totalqty: This is the total quantity of items sold for that line item. (Line quantity * pack unit)
  • itemtag: This is the warehouse designation inventory tag. Some items may be NULL as tags are updated on a daily basis and at the time this data set was taken offline, it may have converted some TAGS to NULL. (note that I will only be observing itemtags that identify as “ST”, which is Standard Stock)
  • tagdesc: This is the formatted description of the warehouse inventory tags
  • classificationdepartment: This is the department the product belongs too (Level 1). The only departments are BEER, WINE, LIQ, MISC.
  • classifictiontype – This is the group type the product belongs too (Level 2). This references the type of product within the department. i.e.: LIQ > WHISKEY
  • classificationcategory – This is the group category the product belongs too (Level 3). This further references the category within the type of items. i.e.: LIQ > WHISKEY > IRISH WHISKEY.
  • bottlespercase – This is the total number of bottles in a case. This is important for our warehouse partners when calculating overall case volume of an item. totalqty / bottlespercase = Case(s) sold
  • size: This is the size of the product. There are standard sizes of many products and often one size of a product will outperform another for a variety of reasons. i.e. 750 mL vs 1.75 L
  • custaccount: This designates if the transaction was for a retail/public customer or a licensee (This will only be used to filter out all licensee transactions as I will only be observing public customers)

Variables added into the data set:

  • transdate2: This is used as a separately formatted date variable for visualization purposes upon cleaning the data set
  • year: Another created variable used for observational purposes to organize the data set specifically by year
  • storesizecat: This is added for observing the data by the store size categorization created by ABS, only “A”, “B”, or “C” will populate the rows corresponding to its respective store in a given transaction
  • storetiercat: This is added for observing the data by the stores financial performance based on ABS’ parameters, only showing “1”, “2”, or “3” in the rows corresponding to its respective store in a given transaction
  • cost: raw cost of a given transaction (not including tax) based on netamount / lineqty

Data Preparation

Before beginning my analysis, I will be preparing my libraries, loading in my data set, and also doing all necessary cleaning below.

library(tidyverse)
library(DBI)
library(RSQLite)
library(nnet)
library(caret)
abs_observed <- read_csv("TransactionData_UTF8.txt")
#Setup for the storesizecat and storetiercat variables
stores_A <- c(03, 06, 09, 10, 13, 16, 17, 20, 27, 28, 30)
stores_B <- c(02, 04, 08, 11, 12, 14, 22, 23, 24)
stores_C <- c(05, 07, 15, 18, 21, 26, 29)
tier1 <- c(04, 09, 14, 15, 16, 20, 23, 24)
tier2 <- c(02, 03, 05, 06, 11, 13, 17, 18, 27, 28)
tier3 <- c(07, 08, 10, 12, 21, 22, 26, 29, 30)

names(abs_observed) <- tolower(names(abs_observed)) #making all variables lowercase
#cleaning the store variable so that R recognizes it as intended
abs_observed$store <- as.character(as.integer(abs_observed$store))
abs_observed$store <- as.numeric(abs_observed$store)

abs_observed <- abs_observed |>
  filter(custaccount != "Licensee") |> #removing licensee transactions
  filter(itemtag == "ST") |> #removing all non-Standard stock transactions
  mutate(storesizecat = case_when(
  store %in% stores_A ~ "A",
  store %in% stores_B ~ "B",
  store %in% stores_C ~ "C",
  TRUE ~ NA_character_)) |>
  mutate(storetiercat = case_when(
  store %in% tier1 ~ "1",
  store %in% tier2 ~ "2",
  store %in% tier3 ~ "3")) |>
  filter(lineqty > 0 ) |> # filtering out transactions that were either voided or refunded
  filter(netamount > 0) #filtering out transactions that were either voided or refunded

# Creating cost, transdate2, and year variables
abs_observed$cost <- abs_observed$netamount / abs_observed$lineqty
abs_observed$transdate2 <- as.Date(abs_observed$transdate)
abs_observed$year <- format(abs_observed$transdate2, "%Y")
options(scipen = 999) #converting all visualizations from scientific notation to standard (as it likely will show up in scientific notation)

Observing the Data Set by the Three Levels of Classifications

There are three variables mentioned earlier (classificationdepartment, classifcationtype, and classificationcategory) that I would like to start with observing, as I would like to see the distribution of transactions based on those levels for all 27 locations. This should help give me a general idea of how different inventory are performing between the locations to determine if there are any notable differences.

to do so, I will create a new tibble named abs_classdeptcount where I will include a new variable count which counts the frequency of the corresponding classification department transactions in the main data set

abs_classdeptcount <- abs_observed |>
  group_by(store, classificationdepartment) |>
  summarise(count = n(), .groups = "drop") |>
  arrange(store)
ggplot(abs_classdeptcount, aes(x = store, y = count,  fill = classificationdepartment)) +
  geom_col() +
  labs(title = "Inventory Performance of All Stores by Classification Dept (2023-2025)",
       x = "Store",
       y = "Frequency",
       fill = "Classification Department",
       caption = "Source: Montgomery County ABS") +
  theme_bw() +
  coord_flip() +
  scale_x_continuous(breaks = seq(min(abs_classdeptcount$store), max(abs_classdeptcount$store), by = 1)) #To show ticks for every store

From this it is clear that the liquor classification department dominates among all stores, with wine coming in second, beer in third, and miscellaneous last. Though this is not a major surprise, it does give a good basis on how generally every store (for the most part) follows the same popularity distribution of liquor > Wine > Beer > Misc. There are a few outliers, like store 29 (Poolesville), which appears to only hold liquor sales. This may speak to why it is on the lowest scale of ABS’ method of observing buying patterns (that being that this store is catagorized as C3). Or it could be that due to the store’s previous low performance with inventory of other departments ABS has only made liquor available in said location. It is possible, though, that with further observation of the stores by classification will give more context.

Before doing any deeper research, I would like to see the performance of the top two performing classification departments: Liquor and Wine, both individually for all 27 locations and collectively to see if the top two performing departments share the same distribution patterns on the individual and collective level. This is to determine if there are any other anomalies among the stores that need to be noted. This will be done by creating another two tibbles, abs_liquor and abs_wine, where each will also include the count variable to accurately count the frequency for all transactions.

abs_liquor <- abs_observed |>
  group_by(store, classificationdepartment, classificationtype) |>
  filter(classificationdepartment == "LIQ") |>
  summarise(count = n(), .groups = "drop") |>
  arrange(store)

abs_wine <- abs_observed |>
  group_by(store, classificationdepartment, classificationtype) |>
  filter(classificationdepartment == "WINE") |>
  summarise(count = n(), .groups = "drop") |>
  arrange(store)

ABS Liquor Distribution Results

abs_liquor |>
  filter(store == 2) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Silver Spring Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 3) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Clarksburg Village Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 4) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Westbard Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 5) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Wheaton Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 6) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kingsview Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 7) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Flower Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 8) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Burtonsville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 9) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Muddy Branch Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 10) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cloverly Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 11) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Walnut Hill Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 12) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cabin John Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 13) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Leisure World Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 14) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kensington Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 15) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Potomac Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 16) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Seneca Meadows Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 17) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Goshen Crossing Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 18) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Olney Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 20) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Darnestown Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 21) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Fallsgrove Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 22) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "White Oak Town Center Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 23) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Hampden Lane Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 24) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Montrose Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 26) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "King Farm Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 27) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Downtown Rockville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 28) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Aspen Hill Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 29) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Poolesville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 30) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Gaithersburg Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

ggplot(abs_liquor, aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "ABS Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

Generally, when observing liquor by classification type, the distributions all seem to be about the same, with slight differentiation. Thus it is safe to say that there is very little variation between all stores for the top performing classification department.

ABS Wine Distribution Results

abs_wine |>
  filter(store == 2) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Silver Spring Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 3) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Clarksburg Village WIne Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 4) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Westbard Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 5) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Wheaton Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 6) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kingsview Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 7) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Flower Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 8) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Burtonsville Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 9) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Muddy Branch Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 10) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cloverly Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 11) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Walnut Hill Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 12) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cabin John Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 13) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Leisure World Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 14) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kensington Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 15) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Potomac Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 16) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Seneca Meadows Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 17) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Goshen Crossing Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 18) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Olney Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 20) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Darnestown Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 21) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Fallsgrove Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 22) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "White Oak Town Center Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 23) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Hampden Lane Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 24) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Montrose Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 26) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "King Farm Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 27) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Downtown Rockville Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 28) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Aspen Hill Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 29) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Poolesville Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 7.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 30) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Gaithersburg Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

ggplot(abs_wine |>
         filter(count > 5000), aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "ABS Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 6.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

The distribution of the wine department by classification type definitely shows a variation based on location. When creating the visualizations I quickly realized that the general distribution of all stores collectively showed a large variety of options by classification type, resulting in a very messy visualization (as you can see from observing the visualizations of each individual store). To fix this, I decided to only show the top 13 performing options by filtering the count to show only options that appear more than 5000 times in the abs_observed data set. Outside of that, the biggest outlier is still poolesville, as they only offer sake, which performs very poorly (only being purchased roughly 25 times between 2023-2025).

Due to the wide variety of performance for the wine department, it is clear that not all departments are equal, and thus more observations are needed to determine how ABS’ inventory is performing in all locations.

Using SQL Queries to observe the stores by size and classification

Now I will be using SQL queries to help identify specifics regarding performance in all stores by all three classification levels. To make it easier to observe the results, I will be categorizing the results using the store size method that ABS uses for observing their stores. This will be done by organizing three different tibbles named TierA_Stores, TierB_Stores, and TierC_Stores, which pulls all information from the main data set into the respective tibble based on the corresponding store’s size category (by using the storesizecat variable). I will then be creating a database for all relevant queries named abstier.db.

TierA_Stores <- abs_observed |>
  dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
  group_by(transdate) |>
  filter(store %in% c(03, 06, 09, 10, 13, 16, 17, 20, 27, 28, 30))

TierB_Stores <- abs_observed |>
  dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
  group_by(transdate) |>
  filter(store %in% c(02, 04, 08, 11, 12, 14, 22, 23, 24))

TierC_Stores <- abs_observed |>
  dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
  group_by(transdate) |>
  filter(store %in% c(05, 07, 15, 18, 21, 26, 29))
abstier.db <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(abstier.db, TierA_Stores)
copy_to(abstier.db, TierB_Stores)
copy_to(abstier.db, TierC_Stores)
copy_to(abstier.db, abs_observed) # Added the main data set just in case

Tier A Stores

First this is the performance by Classification Department:

SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationdepartment quantity
2023 LIQ 711252
2023 WINE 357449
2023 BEER 51992
2023 MISC 17172
2024 LIQ 1393916
2024 WINE 627742
2024 BEER 158378
2024 MISC 38941
2025 LIQ 645536
2025 WINE 271803

Next by Classification Type:

SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationtype quantity
2023 VODKA 180113
2023 TEQUILA 84485
2023 WHISKEY 67596
2023 BOURBON 59527
2023 CORDIALS 53882
2023 CABERNET SAUVIGNON 52833
2023 RUM 51154
2023 RED BLEND 42182
2023 CHARDONNAY 41930
2023 SCOTCH 38718

Finally by Classification Category:

SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationcategory quantity
2023 DOMESTIC VODKA 117579
2023 AMERICAN RED 105677
2023 TEQUILA 81571
2023 AMERICAN WHITE 76593
2023 STRAIGHT BOURBON WHISKEY 56354
2023 IMPORTED RUM 48381
2023 IMPORTED VODKA 42311
2023 IMPORTED CORDIALS 35515
2023 DOMESTIC VODKA FLAVORS 35248
2023 BLENDED WHISKEY 32742

Tier B Stores

First this is the performance by Classification Department:

SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationdepartment quantity
2023 LIQ 548087
2023 WINE 332712
2023 BEER 47278
2023 MISC 15939
2024 LIQ 1051785
2024 WINE 545938
2024 BEER 135093
2024 MISC 38252
2025 LIQ 482284
2025 WINE 258699

Next by Classification Type:

SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationtype quantity
2023 VODKA 143294
2023 TEQUILA 63304
2023 BOURBON 49720
2023 WHISKEY 47040
2023 CORDIALS 44563
2023 CABERNET SAUVIGNON 42766
2023 CHARDONNAY 41574
2023 SAUVIGNON BLANC 36426
2023 RUM 36368
2023 RED BLEND 34551

Finally by Classification Category:

SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationcategory quantity
2023 DOMESTIC VODKA 97461
2023 AMERICAN RED 89257
2023 AMERICAN WHITE 72842
2023 TEQUILA 60804
2023 STRAIGHT BOURBON WHISKEY 47967
2023 IMPORTED RUM 34516
2023 IMPORTED VODKA 33468
2023 IMPORTED CORDIALS 32014
2023 SPARKLING 27497
2023 BLENDED WHISKEY 22969

Tier C Stores

First this is the performance by Classification Department:

SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationdepartment quantity
2023 LIQ 329124
2023 WINE 155622
2023 BEER 18487
2023 MISC 7638
2024 LIQ 649271
2024 WINE 270456
2024 BEER 57553
2024 MISC 18528
2025 LIQ 235017
2025 WINE 103997

Next by Classification Type:

SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationtype quantity
2023 VODKA 90682
2023 TEQUILA 35247
2023 WHISKEY 33792
2023 BOURBON 30237
2023 CORDIALS 24567
2023 CABERNET SAUVIGNON 21665
2023 RUM 21559
2023 CHARDONNAY 20113
2023 SCOTCH 17745
2023 RED BLEND 16432

Finally by Classification Category:

SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC
Displaying records 1 - 10
year classificationcategory quantity
2023 DOMESTIC VODKA 61392
2023 AMERICAN RED 42911
2023 AMERICAN WHITE 34710
2023 TEQUILA 33912
2023 STRAIGHT BOURBON WHISKEY 28977
2023 IMPORTED VODKA 21135
2023 IMPORTED RUM 20500
2023 BLENDED WHISKEY 17630
2023 IMPORTED CORDIALS 16807
2023 DOMESTIC VODKA FLAVORS 14104

Results Overview

Generally, all tiers appear to follow the same trends throughout all years and all levels of classifications. This further proves that inventory performance likely has little to do with store size as tier A, B, and C stores all share generally the same results from 2023-2025. To give a quick rundown of the results I will report the top 5 and lowest 5 performing items on all three levels of classification for all three tiers of store size:

Classification Department

  • Tier A Stores
    • Top to lowest performing (2023-2025)
      • Liquor
      • Wine
      • Beer
      • Miscellaneous
      • Donations
      • DLC (only in 2024 & 2025)
      • Non Inventory
      • Gift Card (only in 2023)
  • Tier B Stores
    • Top to lowest performing (2023-2025)
      • Liquor
      • Wine
      • Beer
      • Misc
      • Donations
      • Non Inventory
      • Gift Card (only in 2023)
      • DLC (only in 2023 & 2025)
  • Tier C Stores
    • Top to lowest performing (2023-2025)
      • Liquor
      • Wine
      • Beer
      • Misc
      • Donations
      • Non Inventory
      • Gift Card (only in 2023)
      • DLC (only in 2023 & 2025)

Classification Type

  • Tier A Stores
    • Top 5 Performing:
      • 2023
        • Vodka
        • NA(Values with no data)
        • Tequila
        • Whiskey
        • Bourbon
      • 2024
        • Vodka
        • Tequila
        • Whiskey
        • Bourbon
        • Cordials
      • 2025 (thus far)
        • Vodka
        • Tequila
        • Whiskey
        • Bourbon
        • Cordials
    • Lowest 5 Performing:
      • 2023
        • Soave
        • Pinotage
        • St. Emillion
        • Petite Syrah
        • IPA
      • 2024
        • Petite Syrah
        • Grenache
        • Vermentino
        • Bordeaux
        • Agave
      • 2025
        • Bordeaux
        • Torrontes
        • Other South African Red
        • Flavor Gin
        • Pinotage
        • (directly above is petite syrah, grenache, vermantino, and Agave)
  • Tier B Stores
    • Top 5 Performing:
      • 2023
        • Vodka
        • NA(Values with no data)
        • Tequila
        • Bourbon
        • Cordials
      • 2024
        • Vodka
        • Tequila
        • Bourbon
        • Whiskey
        • Cordials
      • 2025
        • Vodka
        • Tequila
        • Bourbon
        • Whiskey
        • Cordials
    • Lowest 5 Performing:
      • 2023
        • Liebfraumilch
        • Bianco
        • St. Emillion
        • Grenache
        • Torrontes
      • 2024
        • Vermentino
        • Other South African Red
        • Torrontes
        • Madeira
        • Bardolino
      • 2025
        • Bardolino
        • Other South African Red
        • Flavor Gin
        • Grenache
        • Primitivo
  • Tier C Stores
    • Top 5 Performing:
      • 2023
        • Vodka
        • NA(Values with no data)
        • Tequila
        • Whiskey
        • Bourbon
      • 2024
        • Vodka
        • Tequila
        • Whiskey
        • Bourbon
        • Cordials
      • 2025
        • Vodka
        • Tequila
        • Bourbon
        • Whiskey
        • Cordials
    • Lowest 5 Performing:
      • 2023
        • Soave
        • Gamay
        • Pinotage
        • Miscast
        • Bianco
      • 2024
        • Pinotage
        • Torrontes
        • Petite Syrah
        • Bardolino
        • Liebfraumilch
      • 2025
        • Vermentino
        • Graves
        • Bardolino
        • Petite Syrah
        • Grenache

Classification Category

  • Tier A Stores
    • Top 5 performing:
      • 2023
        • Domestic Vodka
        • American Red
        • Tequila
        • American White
        • Straight Bourbon Whiskey
      • 2024
        • Domestic Vodka
        • American Red
        • Tequila
        • American White
        • Straight Bourbon Whiskey
      • 2025
        • Domestic Vodka
        • Tequila
        • American Red
        • American White
        • Straight Bourbon Whiskey
    • Lowest 5 Performing:
      • 2023
        • French Red Wine
        • 1% Mixers
        • Malt Based RTD
        • Indian Whiskey
        • Greek
      • 2024
        • Indian Whiskey
        • Sake
        • Alsatian Wine
        • Misc Agave Spirit
        • Greek
      • 2025
        • Georgian Wine
        • Greek
        • Indian Whisky
        • Grappa
        • Armagnac
  • Tier B Stores
    • Top 5 Performing:
      • 2023
        • Domestic Vodka
        • American Red
        • American White
        • Tequila
        • Straight Bourbon Whiskey
      • 2024
        • Domestic Vodka
        • American Red
        • Tequila
        • American White
        • Straight Bourbon Whiskey
      • 2025
        • Domestic Vodka
        • American Red
        • American White
        • Tequila
        • Straight Bourbon Whiskey
    • Lowest 5 Performing:
      • 2023
        • Store Supplies
        • Grappa
        • Indian Whisky
        • Greek
        • 1% Mixers
      • 2024
        • Wine Based Seltzers
        • Grappa
        • Greek
        • Indian Whisky
        • 1% Mixers
        • 2025
        • Greek
        • Domestic Dessert Wine
        • 1% Mixers
        • Georgian Wine
        • Armagnac
  • Tier C Stores
    • Top 5 Performance:
      • 2023
        • Domestic Vodka
        • American Red
        • Straight Bourbon Whiskey
        • Tequila
        • American White
      • 2024
        • Domestic Vodka
        • American Red
        • Tequila
        • Straight Bourbon Whiskey
        • American White
      • 2025
        • Domestic Vodka
        • American Red
        • Straight Bourbon Whiskey
        • Tequila
        • American White
    • Lowest 5 Performance:
      • 2023
        • 1% Mixers
        • Indian Whisky
        • Domestic Dessert Wine
        • Store Supplies
        • Grappa
      • 2024
        • Malt Based RTD
        • Domestic Dessert Wine
        • 1% Mixers
        • Sake
        • Grappa
      • 2025
        • Store Supplies
        • Armagnac
        • Misc Agave Spirit
        • Georgian Wine
        • 1% Mixers

Observing the Data Set by Item ID

Now that I have determined that the inventory generally performs the same among all stores, regardless of store size, it further proves that ABS’ buying patterns methodology can be upgraded. My idea is for ABS to observe the data by the itemid, specifically for standard stock to prevent the variable becoming unusable as when switching from one itemtag to another, ABS must change the itemid even if it is the same item, so as to have a separate item id for its new tag designation. This is also why it was necessary for me to filter out all results in the main data set to show only standard stock, as this is the variable I will be mainly focusing on for this project. To give an idea of how necessary filtering the data is, before filtering out all transactions that were not standard stock, there were over 12 million observations, now there are roughly 9.7 million. That means that approximately 3 million transactions occur with items under a different item tag, which could likely be the same item under the standard stock, but with a different item id.

To start off, I will be creating a new category for use to assist in my analysis named priority which I will further explain below:

  • priority: This is an engineered variable for observing the itemid variable by frequency of transactions to determine if it is a “High”, “Medium”, or “Low” priority item. High priority merchandise are frequently purchased items in the upper 25 percentile of all transactions globally. Medium priority are items which populate the middle 50 percentile of all transactions. Low priority items populate the lower 25 percentile of all transactions. The percentile may be subject to changing according to ABS’ desired parameters, or other unexpected reasoning. (This variable may be subject to be moved to a tibble dedicated to observing itemid’s by priority)

For now I will be preparing other tibbles for creating a statistical model based on the variables lineqty and netamount to create the following variables:

  • total_netamount: Created by taking the sum of netamount of one individual itemid collectively grouped by year and store
  • total_lineqty: Created by taking the sum of lineqty of one individual itemid collectively grouped by year and store
  • avg_cost: Created by taking the mean of netamount / lineqty of one individual itemid collectively grouped by year and store
  • n_transactions: Created to count the frequency a given itemid appears in transactions per store and year
  • priority: Refer to notes on variable above, created from total_netamount

These variables will be created in two separate tibbles. The first tibble, abs_var will be for creating majority of the variables, with the exception of priority. And the second tibble item_summary will be used for creating the priority variable for use from here on out.

#Updated to show priority for itemid globally for each year for predictive model
abs_var <- abs_observed |>
  group_by(year, itemid, description) |>
  summarise(total_netamount = sum(netamount, na.rm = TRUE),
            total_lineqty   = sum(lineqty, na.rm = TRUE),
            avg_cost = mean(netamount / lineqty, na.rm = TRUE),
            n_transactions  = n(),
            .groups = "drop")
item_summary <- abs_var |>
  group_by(year) |>
  mutate(q25 = quantile(n_transactions, 0.25, na.rm = TRUE),
         q75 = quantile(n_transactions, 0.75, na.rm = TRUE),
         priority = case_when(n_transactions >= q75 ~ "High",
                              n_transactions <= q25 ~ "Low",
                              TRUE ~ "Medium")) |>
  ungroup() |>
  mutate(priority = factor(priority, levels = c("High", "Medium", "Low")))

item_summary |> count(year, priority)
# A tibble: 9 × 3
  year  priority     n
  <chr> <fct>    <int>
1 2023  High       964
2 2023  Medium    1926
3 2023  Low        965
4 2024  High       730
5 2024  Medium    1452
6 2024  Low        731
7 2025  High       708
8 2025  Medium    1413
9 2025  Low        708

Additional Modeling Method Testing

Before moving forward, I want to address the concern with my original multinomial model priority_model. Currently there are some concerns regarding it’s statistics, including a high AIC which indicates that there is too much noise in the data set itself despite its moderately high accuracy. As that is not something I would like to move forward with, I will be using a different method to observe my model_data, by first sectioning the data so that 80% of it is used for training the model and the remaining 20% for testing the model. I will also be creating a custom tibble used for observing the accuracy and F1 statistics of the models I will be creating. The goal is to determine the best modeling method to use for my final step. For this section I will be using the following modeling methods:

  • Multinomial (using a different method)
  • Linear Discriminate Analysis
  • Decision Tree
  • Random Forest
  • K nearest Neighbors
  • Artificial Neural Network
# Modifying variables to ensure storesizecat and storetiercat are seen as categorical and year is seen as numerical
model_data <- model_data |>
  mutate(next_priority = factor(next_priority, levels = c("High", "Medium", "Low")),
         storesizecat = factor(storesizecat),
         storetiercat = factor(storetiercat),
         year = as.numeric(year))

set.seed(111)
train_index <- caret::createDataPartition(model_data$next_priority, p = 0.80, list = FALSE) #Splitting the data so that 80% of the data goes to training and 20% goes to testing and to ensure priority stays balanced between both sets

train_data <- model_data[train_index, ] #training data
test_data <- model_data[-train_index, ] #testing data

ctrl <- caret::trainControl(method = "cv", #fold cross validation aka it is repeatedly split internally during training
                            number = 5, # number of folds
                            classProbs = TRUE, #enables probability outputs
                            summaryFunction = caret::multiClassSummary, #Gives statistics such as accuracy
                            savePredictions = "final") #stores the best model's predictions for diagnostics

priority_formula <- next_priority ~ total_netamount + total_lineqty + avg_cost + n_transactions + storetiercat + storesizecat + year #variables used for predicting the priority 
#custom F1 metric function for predictions
multiclass_f1 <- function(cm_table){
  
  classes <- colnames(cm_table)
  
  per_class <- map_dfr(classes, function(cls){
    tp <- cm_table[cls, cls] # correct predictions
    fp <- sum(cm_table[, cls]) - tp #predicted the class but it was something else
    fn <- sum(cm_table[cls, ]) - tp #actual class but it predicted something else
    
    precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp)) #to determine how consistent the model is right 
    recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn)) #to determine out of all cases of a given priority, how often was caught by the model
    f1 <- ifelse(is.na(precision + recall) | (precision + recall == 0),
             NA, 2 * precision * recall / (precision + recall))
    tibble(class = cls, precision = precision, recall = recall, f1 = f1, support = sum(cm_table[cls, ]))})
  
macro_f1 <- mean(per_class$f1, na.rm = TRUE) #mean of F1 across classes
weighted_f1 <- weighted.mean(per_class$f1, w = per_class$support, na.rm = TRUE) # Weighted by number of items in a priority 

list(per_class = per_class, macro_f1 = macro_f1, weighted_f1 = weighted_f1)}

Linear Discriminate Model

set.seed(111)
lda_model <- caret::train(priority_formula,
                          data = train_data,
                          method = "lda",
                          trControl = ctrl,
                          metric = "Accuracy")

lda_model
Linear Discriminant Analysis 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results:

  logLoss    AUC       prAUC      Accuracy   Kappa      Mean_F1  
  0.7604346  0.793622  0.6295548  0.6629443  0.2330129  0.4051539
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.4252191         0.7300844         0.8368055            0.880875           
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.8368055       0.4252191    0.2209814            0.5776517             
lda_preds <- predict(lda_model, newdata = test_data)

lda_cm <- caret::confusionMatrix(lda_preds, test_data$next_priority)

lda_cm
Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    2226      0     0
    Medium  5974  16334  3405
    Low        0      4    39

Overall Statistics
                                               
               Accuracy : 0.6647               
                 95% CI : (0.6591, 0.6702)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.2378               
                                               
 Mcnemar's Test P-Value : NA                   

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity              0.27146        0.9998   0.011324
Specificity              1.00000        0.1945   0.999837
Pos Pred Value           1.00000        0.6352   0.906977
Neg Pred Value           0.76805        0.9982   0.878127
Prevalence               0.29305        0.5839   0.123079
Detection Rate           0.07955        0.5837   0.001394
Detection Prevalence     0.07955        0.9189   0.001537
Balanced Accuracy        0.63573        0.5971   0.505581
lda_f1 <- multiclass_f1(lda_cm$table)
lda_f1$per_class
# A tibble: 3 × 5
  class  precision recall     f1 support
  <chr>      <dbl>  <dbl>  <dbl>   <int>
1 High      0.271   1     0.427     2226
2 Medium    1.000   0.635 0.777    25713
3 Low       0.0113  0.907 0.0224      43
lda_f1$macro_f1
[1] 0.4087481
lda_f1$weighted_f1
[1] 0.7478752

Decision Tree Model

set.seed(111)
tree_model <- caret::train(priority_formula,
                           data = train_data,
                           method = "rpart",
                           trControl = ctrl,
                           tuneLength = 10,
                           metric = "Accuracy")
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
tree_model
CART 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  cp           logLoss    AUC        prAUC       Accuracy   Kappa     
  0.001481322  0.4354266  0.9147499  0.73971388  0.8467936  0.71523805
  0.001803349  0.4378278  0.9142193  0.64646083  0.8444976  0.71072783
  0.002747960  0.4459648  0.9110985  0.50114303  0.8411028  0.70459271
  0.002919708  0.4579212  0.9078631  0.48411659  0.8388693  0.70118405
  0.004765994  0.4693555  0.9048111  0.46759224  0.8364482  0.69746297
  0.009102619  0.4776502  0.9021292  0.46846395  0.8313828  0.68922625
  0.013310434  0.4839012  0.9003058  0.45370560  0.8256116  0.67706074
  0.025611851  0.5712307  0.8317907  0.35802463  0.7985152  0.60592765
  0.031129240  0.5849573  0.8223117  0.34124267  0.7910287  0.58511089
  0.405646200  0.8853990  0.5425396  0.01557607  0.6175079  0.09622994
  Mean_F1    Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value
  0.7900016  0.7668787         0.8968017         0.8232757          
  0.7879582  0.7644492         0.8950686         0.8218208          
  0.7850300  0.7618582         0.8931263         0.8185093          
  0.7832358  0.7611753         0.8925148         0.8151634          
  0.7814110  0.7604512         0.8917266         0.8119275          
  0.7757190  0.7590782         0.8897687         0.7986896          
  0.7605247  0.7421887         0.8855146         0.7863430          
  0.6992291  0.6502300         0.8569148         0.7600769          
  0.6768049  0.6190085         0.8482932         0.7448422          
        NaN  0.3740659         0.6950683               NaN          
  Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall  Mean_Detection_Rate
  0.9122454            0.8232757       0.7668787    0.2822645          
  0.9109123            0.8218208       0.7644492    0.2814992          
  0.9085086            0.8185093       0.7618582    0.2803676          
  0.9065078            0.8151634       0.7611753    0.2796231          
  0.9043786            0.8119275       0.7604512    0.2788161          
  0.9012861            0.7986896       0.7590782    0.2771276          
  0.8987560            0.7863430       0.7421887    0.2752039          
  0.8922560            0.7600769       0.6502300    0.2661717          
  0.8915336            0.7448422       0.6190085    0.2636762          
  0.8812209                  NaN       0.3740659    0.2058360          
  Mean_Balanced_Accuracy
  0.8318402             
  0.8297589             
  0.8274922             
  0.8268450             
  0.8260889             
  0.8244234             
  0.8138517             
  0.7535724             
  0.7336509             
  0.5345671             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.001481322.
tree_preds <- predict(tree_model, newdata = test_data)

tree_cm <- caret::confusionMatrix(tree_preds, test_data$next_priority)

tree_cm
Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6804    698    45
    Medium  1359  14983  1446
    Low       37    657  1953

Overall Statistics
                                               
               Accuracy : 0.8484               
                 95% CI : (0.8441, 0.8526)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.7183               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.8298        0.9171    0.56707
Specificity               0.9624        0.7591    0.97172
Pos Pred Value            0.9016        0.8423    0.73782
Neg Pred Value            0.9317        0.8671    0.94115
Prevalence                0.2930        0.5839    0.12308
Detection Rate            0.2432        0.5355    0.06979
Detection Prevalence      0.2697        0.6357    0.09460
Balanced Accuracy         0.8961        0.8381    0.76940
tree_f1 <- multiclass_f1(tree_cm$table)
tree_f1$per_class
# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.830  0.902 0.864    7547
2 Medium     0.917  0.842 0.878   17788
3 Low        0.567  0.738 0.641    2647
tree_f1$macro_f1
[1] 0.7945125
tree_f1$weighted_f1
[1] 0.8519378

Random Forest Model

set.seed(111)
rf_model <- caret::train(priority_formula,
                         data = train_data,
                         method = "rf",
                         trControl = ctrl,
                         tuneLength = 5,
                         metric = "Accuracy",
                         importance = TRUE)

rf_model
Random Forest 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  mtry  logLoss      AUC        prAUC       Accuracy   Kappa      Mean_F1  
  2     0.192744639  0.9898836  0.90367954  0.9043722  0.8206985  0.8690667
  3     0.038045728  0.9998340  0.72523216  0.9978648  0.9961729  0.9969848
  5     0.007977311  0.9998431  0.20016075  0.9982043  0.9967822  0.9975785
  7     0.007635474  0.9998474  0.06603375  0.9981775  0.9967341  0.9975610
  9     0.007561591  0.9998613  0.06039125  0.9981686  0.9967181  0.9975514
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.8371475         0.9292035         0.9146889            0.9541546          
  0.9964763         0.9987484         0.9974983            0.9988661          
  0.9974036         0.9990141         0.9977556            0.9990055          
  0.9973322         0.9989843         0.9977925            0.9989873          
  0.9973079         0.9989713         0.9977974            0.9989839          
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.9146889       0.8371475    0.3014574            0.8831755             
  0.9974983       0.9964763    0.3326216            0.9976124             
  0.9977556       0.9974036    0.3327348            0.9982088             
  0.9977925       0.9973322    0.3327258            0.9981582             
  0.9977974       0.9973079    0.3327229            0.9981396             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 5.
rf_preds <- predict(rf_model, newdata = test_data)

rf_cm <- caret::confusionMatrix(rf_preds, test_data$next_priority)
rf_cm
Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    8185     11    10
    Medium     9  16327     3
    Low        6      0  3431

Overall Statistics
                                             
               Accuracy : 0.9986             
                 95% CI : (0.9981, 0.999)    
    No Information Rate : 0.5839             
    P-Value [Acc > NIR] : <0.0000000000000002
                                             
                  Kappa : 0.9975             
                                             
 Mcnemar's Test P-Value : 0.2407             

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.9982        0.9993     0.9962
Specificity               0.9989        0.9990     0.9998
Pos Pred Value            0.9974        0.9993     0.9983
Neg Pred Value            0.9992        0.9991     0.9995
Prevalence                0.2930        0.5839     0.1231
Detection Rate            0.2925        0.5835     0.1226
Detection Prevalence      0.2933        0.5839     0.1228
Balanced Accuracy         0.9986        0.9991     0.9980
rf_f1 <- multiclass_f1(rf_cm$table)
rf_f1$per_class
# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.998  0.997 0.998    8206
2 Medium     0.999  0.999 0.999   16339
3 Low        0.996  0.998 0.997    3437
rf_f1$macro_f1
[1] 0.9981135
rf_f1$weighted_f1
[1] 0.9986063

K Nearest Neighbors Model

set.seed(111)
knn_model <- caret::train(priority_formula,
                          data = train_data,
                          method = "knn",
                          trControl = ctrl,
                          preProcess = c("center","scale"),
                          tuneLength = 10,
                          metric = "Accuracy")

knn_model
k-Nearest Neighbors 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

Pre-processing: centered (9), scaled (9) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  k   logLoss    AUC        prAUC      Accuracy   Kappa      Mean_F1  
   5  0.7340195  0.9507491  0.3189370  0.8413619  0.7112888  0.7934107
   7  0.6273057  0.9496576  0.3826737  0.8439348  0.7142589  0.7947029
   9  0.5654628  0.9479928  0.4297978  0.8460700  0.7165959  0.7957863
  11  0.5405543  0.9463873  0.4655751  0.8444083  0.7126375  0.7922833
  13  0.5189061  0.9450991  0.4930885  0.8417639  0.7073103  0.7882549
  15  0.5029809  0.9438835  0.5153264  0.8394679  0.7026519  0.7853978
  17  0.4925583  0.9426782  0.5330345  0.8385120  0.7003138  0.7836291
  19  0.4808587  0.9419760  0.5494355  0.8378241  0.6985476  0.7821388
  21  0.4740448  0.9409573  0.5622290  0.8372523  0.6970106  0.7806741
  23  0.4643513  0.9402343  0.5728135  0.8358050  0.6940929  0.7780040
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.7831788         0.8991537         0.8052842            0.9051869          
  0.7807881         0.8990267         0.8116760            0.9076913          
  0.7785105         0.8986980         0.8178480            0.9099250          
  0.7733203         0.8968131         0.8171109            0.9094447          
  0.7686327         0.8947709         0.8142771            0.9080061          
  0.7651757         0.8929018         0.8125122            0.9067140          
  0.7625163         0.8917729         0.8123252            0.9064704          
  0.7602521         0.8908968         0.8122101            0.9063856          
  0.7579942         0.8901283         0.8123072            0.9063424          
  0.7550832         0.8890861         0.8102218            0.9055729          
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.8052842       0.7831788    0.2804540            0.8411663             
  0.8116760       0.7807881    0.2813116            0.8399074             
  0.8178480       0.7785105    0.2820233            0.8386043             
  0.8171109       0.7733203    0.2814694            0.8350667             
  0.8142771       0.7686327    0.2805880            0.8317018             
  0.8125122       0.7651757    0.2798226            0.8290387             
  0.8123252       0.7625163    0.2795040            0.8271446             
  0.8122101       0.7602521    0.2792747            0.8255744             
  0.8123072       0.7579942    0.2790841            0.8240613             
  0.8102218       0.7550832    0.2786017            0.8220846             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
knn_preds <- predict(knn_model, newdata = test_data)

knn_cm <- caret::confusionMatrix(knn_preds, test_data$next_priority)

knn_cm
Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6929    783    49
    Medium  1229  14758  1250
    Low       42    797  2145

Overall Statistics
                                               
               Accuracy : 0.8517               
                 95% CI : (0.8475, 0.8558)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.7283               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.8450        0.9033    0.62282
Specificity               0.9579        0.7871    0.96581
Pos Pred Value            0.8928        0.8562    0.71883
Neg Pred Value            0.9371        0.8530    0.94804
Prevalence                0.2930        0.5839    0.12308
Detection Rate            0.2476        0.5274    0.07666
Detection Prevalence      0.2774        0.6160    0.10664
Balanced Accuracy         0.9015        0.8452    0.79432
knn_f1 <- multiclass_f1(knn_cm$table)
knn_f1$per_class
# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.845  0.893 0.868    7761
2 Medium     0.903  0.856 0.879   17237
3 Low        0.623  0.719 0.667    2984
knn_f1$macro_f1
[1] 0.8049135
knn_f1$weighted_f1
[1] 0.8535158

Multinomial Model (updated)

set.seed(111)
multinom_model <- nnet::multinom(priority_formula,
                                 data = train_data,
                                 MaxNWts = 20000,
                                 trace = FALSE)

multinom_preds <- predict(multinom_model, newdata = test_data)

multinom_cm <- caret::confusionMatrix(data = multinom_preds,
                                      reference = test_data$next_priority)

multinom_cm
Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6508    588    26
    Medium  1668  15358  2094
    Low       24    392  1324

Overall Statistics
                                               
               Accuracy : 0.8287               
                 95% CI : (0.8243, 0.8331)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.6699               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.7937        0.9400    0.38444
Specificity               0.9690        0.6769    0.98305
Pos Pred Value            0.9138        0.8032    0.76092
Neg Pred Value            0.9189        0.8894    0.91921
Prevalence                0.2930        0.5839    0.12308
Detection Rate            0.2326        0.5489    0.04732
Detection Prevalence      0.2545        0.6833    0.06218
Balanced Accuracy         0.8813        0.8085    0.68374
multinom_f1 <- multiclass_f1(multinom_cm$table)

multinom_f1$per_class
# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.794  0.914 0.849    7122
2 Medium     0.940  0.803 0.866   19120
3 Low        0.384  0.761 0.511    1740
multinom_f1$macro_f1
[1] 0.7421881
multinom_f1$weighted_f1
[1] 0.8398932

Artificial Neural Networks Model

set.seed(111)
ann_model <- caret::train(priority_formula,
                          data = train_data,
                          method = "avNNet",
                          trControl = ctrl,
                          preProcess = c("center", "scale"),
                          tuneLength = 5,
                          metric = "Accuracy",
                          trace = FALSE)
Warning: executing %dopar% sequentially: no parallel backend registered
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
ann_model
Model Averaged Neural Network 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

Pre-processing: centered (9), scaled (9) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  size  decay   logLoss    AUC        prAUC      Accuracy   Kappa    
  1     0.0000  0.6176581  0.9049980  0.7546039  0.7458059  0.4474598
  1     0.0001  0.5613458  0.9058845  0.7469962  0.7919488  0.5729839
  1     0.0010  0.5179014  0.9056283  0.7533280  0.7926278  0.5761921
  1     0.0100  0.5072165  0.9035008  0.7496246  0.7930119  0.5773725
  1     0.1000  0.5016587  0.9018635  0.7536386  0.7927797  0.5770050
  3     0.0000  0.7779042  0.9227665  0.8014519  0.7016731  0.3239037
  3     0.0001  0.4715130  0.9257318  0.8211841  0.8036163  0.6055512
  3     0.0010  0.4885967  0.9232982  0.8111446  0.8055907  0.6091597
  3     0.0100  0.4562472  0.9246343  0.8190856  0.8121751  0.6289616
  3     0.1000  0.4399294  0.9268412  0.8210443  0.8240034  0.6635185
  5     0.0000  0.5715859  0.9260465  0.8235875  0.7165818  0.3644351
  5     0.0001  0.5326066  0.9238333  0.8155376  0.7923240  0.5731236
  5     0.0010  0.4790989  0.9261988  0.8219890  0.8016866  0.5985397
  5     0.0100  0.4879355  0.9155957  0.8094544  0.8037506  0.6057537
  5     0.1000  0.4386123  0.9268583  0.8222660  0.8248075  0.6631698
  7     0.0000  3.1272382  0.8315843  0.5388359  0.7161083  0.3620200
  7     0.0001  0.5058773  0.9254289  0.8214351  0.7961834  0.5849631
  7     0.0010  0.4741548  0.9259822  0.8228672  0.8089590  0.6181997
  7     0.0100  0.4845290  0.9239439  0.8194451  0.8052781  0.6085089
  7     0.1000  0.4377404  0.9269499  0.8242323  0.8213230  0.6545071
  9     0.0000  0.5902360  0.9279246  0.8266170  0.7093113  0.3456493
  9     0.0001  0.4723693  0.9268482  0.8257063  0.8055371  0.6097869
  9     0.0010  0.5274488  0.9250983  0.8211098  0.8030892  0.6026031
  9     0.0100  0.5541231  0.9246532  0.8189026  0.7914396  0.5699787
  9     0.1000  0.4375539  0.9264588  0.8237556  0.8257186  0.6643101
  Mean_F1    Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value
        NaN  0.5253518         0.8010929               NaN          
        NaN  0.5801674         0.8394973               NaN          
        NaN  0.5827158         0.8410891               NaN          
        NaN  0.5834969         0.8415969               NaN          
        NaN  0.5834047         0.8415109               NaN          
  0.5668880  0.4702489         0.7626250         0.8994180          
  0.7272789  0.6265631         0.8523603         0.8104085          
  0.7478253  0.6317033         0.8532679         0.8163673          
  0.6988604  0.6520406         0.8610782         0.8146565          
  0.7354400  0.7096176         0.8762062         0.8057536          
  0.6356535  0.5070706         0.7776899         0.8415329          
  0.6385979  0.5912921         0.8389422         0.8627749          
  0.6854471  0.6134490         0.8490263         0.8235319          
  0.6640821  0.6214594         0.8521732         0.8095659          
  0.7338529  0.7025485         0.8749126         0.8137255          
  0.6859209  0.5077951         0.7763796         0.8269554          
  0.6463849  0.6005616         0.8442247         0.8200272          
  0.6598055  0.6398534         0.8560602         0.8322109          
  0.6872903  0.6263372         0.8525618         0.8203071          
  0.7142122  0.6935852         0.8720959         0.8166047          
  0.6545936  0.4972616         0.7710541         0.8058017          
  0.6580863  0.6270549         0.8531208         0.8356631          
  0.7285627  0.6225377         0.8506885         0.8240830          
  0.5823591  0.5802767         0.8378046         0.8617526          
  0.7414287  0.7069993         0.8745436         0.8204610          
  Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall  Mean_Detection_Rate
  0.9033792                  NaN       0.5253518    0.2486020          
  0.9053240                  NaN       0.5801674    0.2639829          
  0.9029975                  NaN       0.5827158    0.2642093          
  0.9027354                  NaN       0.5834969    0.2643373          
  0.9023622                  NaN       0.5834047    0.2642599          
  0.9093985            0.8994180       0.4702489    0.2338910          
  0.9055395            0.8104085       0.6265631    0.2678721          
  0.9074200            0.8163673       0.6317033    0.2685302          
  0.9046267            0.8146565       0.6520406    0.2707250          
  0.9044323            0.8057536       0.7096176    0.2746678          
  0.9074934            0.8415329       0.5070706    0.2388606          
  0.9077125            0.8627749       0.5912921    0.2641080          
  0.9065712            0.8235319       0.6134490    0.2672289          
  0.9046238            0.8095659       0.6214594    0.2679169          
  0.9060151            0.8137255       0.7025485    0.2749358          
  0.9112310            0.8269554       0.5077951    0.2387028          
  0.9053056            0.8200272       0.6005616    0.2653945          
  0.9077764            0.8322109       0.6398534    0.2696530          
  0.9067044            0.8203071       0.6263372    0.2684260          
  0.9048774            0.8166047       0.6935852    0.2737743          
  0.9056787            0.8058017       0.4972616    0.2364371          
  0.9056277            0.8356631       0.6270549    0.2685124          
  0.9062835            0.8240830       0.6225377    0.2676964          
  0.9083646            0.8617526       0.5802767    0.2638132          
  0.9073756            0.8204610       0.7069993    0.2752395          
  Mean_Balanced_Accuracy
  0.6632224             
  0.7098324             
  0.7119024             
  0.7125469             
  0.7124578             
  0.6164369             
  0.7394617             
  0.7424856             
  0.7565594             
  0.7929119             
  0.6423803             
  0.7151172             
  0.7312376             
  0.7368163             
  0.7887305             
  0.6420873             
  0.7223932             
  0.7479568             
  0.7394495             
  0.7828405             
  0.6341578             
  0.7400878             
  0.7366131             
  0.7090407             
  0.7907715             

Tuning parameter 'bag' was held constant at a value of FALSE
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 9, decay = 0.1 and bag = FALSE.
# Predictions on test set
ann_preds <- predict(ann_model, newdata = test_data)

# Confusion matrix
ann_cm <- caret::confusionMatrix(ann_preds, test_data$next_priority)

ann_cm
Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6424    474    27
    Medium  1776  15864  3413
    Low        0      0     4

Overall Statistics
                                               
               Accuracy : 0.7967               
                 95% CI : (0.7919, 0.8014)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.5835               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.7834        0.9710  0.0011614
Specificity               0.9747        0.5544  1.0000000
Pos Pred Value            0.9277        0.7535  1.0000000
Neg Pred Value            0.9157        0.9316  0.8770463
Prevalence                0.2930        0.5839  0.1230791
Detection Rate            0.2296        0.5669  0.0001429
Detection Prevalence      0.2475        0.7524  0.0001429
Balanced Accuracy         0.8790        0.7627  0.5005807
# F1 scores
ann_f1 <- multiclass_f1(ann_cm$table)

ann_f1$per_class
# A tibble: 3 × 5
  class  precision recall      f1 support
  <chr>      <dbl>  <dbl>   <dbl>   <int>
1 High     0.783    0.928 0.849      6925
2 Medium   0.971    0.754 0.849     21053
3 Low      0.00116  1     0.00232       4
ann_f1$macro_f1
[1] 0.5667737
ann_f1$weighted_f1
[1] 0.8486502

Model Results

#Creating a tibble to organize the resulting statistics from each model to compare
model_results <- tibble(model = c("Multinomial", "LDA", "Decision Tree", "Random Forest", "KNN", "ANN"),
                        accuracy = c(multinom_cm$overall["Accuracy"],
                                     lda_cm$overall["Accuracy"],
                                     tree_cm$overall["Accuracy"],
                                     rf_cm$overall["Accuracy"],
                                     knn_cm$overall["Accuracy"],
                                     ann_cm$overall["Accuracy"]),
                        macro_f1 = c(multinom_f1$macro_f1,
                                     lda_f1$macro_f1,
                                     tree_f1$macro_f1,
                                     rf_f1$macro_f1,
                                     knn_f1$macro_f1,
                                     ann_f1$macro_f1),
                        weighted_f1 = c(multinom_f1$weighted_f1,
                                        lda_f1$weighted_f1,
                                        tree_f1$weighted_f1,
                                        rf_f1$weighted_f1,
                                        knn_f1$weighted_f1,
                                        ann_f1$weighted_f1))

model_results
# A tibble: 6 × 4
  model         accuracy macro_f1 weighted_f1
  <chr>            <dbl>    <dbl>       <dbl>
1 Multinomial      0.829    0.742       0.840
2 LDA              0.665    0.409       0.748
3 Decision Tree    0.848    0.795       0.852
4 Random Forest    0.999    0.998       0.999
5 KNN              0.852    0.805       0.854
6 ANN              0.797    0.567       0.849

According to the tibble I created named model_results , the best model for predicting priority (when using the 7 mentioned variables to observe it) would be the Random Forest Model with an extremely high accuracy of 99.86% and a weighted F1 of 0.998 (very close to 1 which indicates high accuracy as well). So going forward I will only be utilizing the random forest model for the final step.

Inventory Management using the Random Forest Model

As it is becoming clear that inventory velocity appears to be unaffected by other categories, there are still other factors to consider such as:

  • How to prioritize what merchandise goes where when each individual storefront has its’ own parameters surrounding number of merchandise available to customers in store at any given moment?
  • How to handle already existing inventory that projects to move from one priority ranking to another?
  • How to handle new inventory?

For resolving this, I will start by listing the average number of SKUs each individual store holds in their storefront at one time based on present day.

  • Silver Spring (Store #2): 3546 SKUs
  • Clarksburg Village (Store #3): 4146 SKUs
  • Westbard (Store #4): 3773 SKUs
  • Wheaton (Store #5): 2878 SKUs
  • Kingsview (Store #6): 3454 SKUs
  • Flower (Store #7): 3386 SKUs
  • Burtonsville (Store #8): 3877 SKUs
  • Muddy Branch (Store #9): 4299 SKUs
  • Cloverly (Store #10): 3533 SKUs
  • Walnut Hill (Store #11): 3437 SKUs
  • Cabin John (Store #12): 3903 SKUs
  • Leisure World (Store #13): 3115 SKUs
  • Kensington (Store #14): 3630 SKUs
  • Potomac (Store #15): 3371 SKUs
  • Seneca Meadows (Store #16): 4060 SKUs
  • Goshen Crossing (Store #17): 3692 SKUs
  • Olney (Store #18): 2918 SKUs
  • Darnestown (Store #20): 3238 SKUs
  • Fallsgrove (Store #21): 2871 SKUs
  • White Oak Town Center (Store #22): 3118 SKUs
  • Hampden Lane (Store #23): 3404 SKUs
  • Montrose (Store #24): 3674 SKUs
  • King Farm (Store #26): 3341 SKUs
  • Downtown Rockville (Store #27): 4154 SKUs
  • Aspen Hill (Store #28): 4066 SKUs
  • Poolesville (Store #29): 1872 SKUs
  • Gaithersburg (Store #30): 3950 SKUs

As the size of the store no longer matters due to now looking specifically at the current average number of SKUs available in an individual storefront, it is best to only consider the ABS provided store classification group of financial performance (which is under the variable storetiercat). I have determined the following parameters for each tier (1 to 3) of store:

  • Tier 1 stores should be at minimum 55% high priority items, 30% medium priority items, and 15% low priority items
  • Tier 2 stores should be at minimum 75% high priority items, 20% medium priority items, and 5% low priority items
  • Tier 3 stores should be at minimum 85% high priority items, 15% medium priority items, and 0% low priority items

This is a general basis to start from that can always be modified based on ABS’ needs, however for the sake of making a baseline I will be referring to this outline. The reasoning for the outline is based on the idea that:

  • Tier 3 stores should not hold any merchandise that could be considered risky, prioritizing only merchandise that regularly perform either very well or generally well
  • Tier 2 stores should focus mainly on holding mostly high and medium priority merchandise, only considering low priority in these categories if the store has a high quantity of SKUs available
  • Tier 1 stores have more flexibility, giving more option to experiment with new inventory (which would theoretically start at low priority) along with keeping available select top performing low priority items as long as additional storage is available at a given store

First step is to build a prediction frame for the random forest model to use for predicting the inventory performance of the next year. (For this use case I will be using 2025 as the main data set does not have a complete record of that year to produce predictions for 2026). I will name this prediction_frame

# Take latest year per item/store as basis
prediction_frame <- model_data |>
  group_by(store, itemid) |>
  slice_tail(n = 1) |>  # use most recent stats
  ungroup() |>
  mutate(year = 2025)  # prediction year

Now that the information has been gathered, I will use my random forest model to predict the priority of items for 2025

prediction_frame$pred_priority <- predict(rf_model, newdata = prediction_frame)

Now, given that every store has a storefront SKU capacity, I need to create a new tibble which includes said information for each given store. I will name this sku_capacity

sku_capacity <- tribble(~store, ~sku_limit,
                        2, 3546,
                        3, 4146,
                        4, 3773,
                        5, 2878,
                        6, 3454,
                        7, 3386,
                        8, 3877,
                        9, 4299,
                        10, 3533,
                        11, 3437,
                        12, 3903,
                        13, 3115,
                        14, 3630,
                        15, 3371,
                        16, 4060,
                        17, 3692,
                        18, 2918,
                        20, 3238,
                        21, 2871,
                        22, 3118,
                        23, 3404,
                        24, 3674,
                        26, 3341,
                        27, 4154,
                        28, 4066,
                        29, 1872,
                        30, 3950)

Next, I will also be creating a tibble which includes all the rules mentioned previously regarding how I recommend organizing the stocking of items by priority for each store based on store tier. I will name this priority_mix

priority_mix <- tribble(~storetiercat, ~high_pct, ~med_pct, ~low_pct,
                        "1", 0.55, 0.30, 0.15,
                        "2", 0.75, 0.20, 0.05,
                        "3", 0.85, 0.15, 0.00)

Now to create an allocation function which will help with allocating the inventory based on priority, SKU count, and the rules created in priority_mix to help with producing a final table which provides all recommended inventory to stock for each store based on the model’s predictions.

allocate_items <- function(store_id) {
  
  store_df <- prediction_frame |> filter(store == store_id)
  tcat <- unique(store_df$storetiercat)
  scount <- sku_capacity |> filter(store == store_id) |> pull(sku_limit)
  mix <- priority_mix |> filter(storetiercat == tcat)

  high_needed   <- ceiling(scount * mix$high_pct)
  medium_needed <- ceiling(scount * mix$med_pct)
  low_needed    <- ceiling(scount * mix$low_pct)

  store_df <- store_df |>
    arrange(desc(n_transactions))  # best items first

  high_items <- store_df |> filter(pred_priority == "High") |> slice_head(n = high_needed)
  med_items  <- store_df |> filter(pred_priority == "Medium") |> slice_head(n = medium_needed)
  low_items  <- store_df |> filter(pred_priority == "Low") |> slice_head(n = low_needed)

  tibble(store = store_id,
         storetiercat = tcat,
         sku_limit = scount,
         high_needed = high_needed,
         medium_needed = medium_needed,
         low_needed = low_needed,
         recommended_high = list(high_items$itemid),
         recommended_medium = list(med_items$itemid),
         recommended_low = list(low_items$itemid),
         year = 2025)}

Finally, I will create the table named final_recommendtions which provides what I would recommend to be stocked in each store in 2025 based on the model’s predictions.

final_recommendations <- map_dfr(unique(sku_capacity$store), allocate_items)

Now, I would like to produce a csv file from final_recommendations which extracts all information provided in the table in a cleanly formatted spreadsheet. To do so, I will first need to create a function to reformat the data provided from final_recommendations to create one final tibble named final_recommendations_long which has all recommended priorities for all stores prepared to be written into a csv file.

clean_list <- function(vec){vec |> stringr::str_trim()}
final_recommendations_parsed <- final_recommendations |>
  mutate(recommended_high   = map(recommended_high,   clean_list),
         recommended_medium = map(recommended_medium, clean_list),
         recommended_low    = map(recommended_low,    clean_list))
final_recommendations_long <- bind_rows(final_recommendations_parsed |>
    dplyr::select(store, storetiercat, sku_limit, year, recommended_high) |>
    unnest_longer(recommended_high, values_to = "itemid") |>
    mutate(priority = "High"),
  
  final_recommendations_parsed |>
    dplyr::select(store, storetiercat, sku_limit, year, recommended_medium) |>
    unnest_longer(recommended_medium, values_to = "itemid") |>
    mutate(priority = "Medium"),
  
  final_recommendations_parsed |>
    dplyr::select(store, storetiercat, sku_limit, year, recommended_low) |>
    unnest_longer(recommended_low, values_to = "itemid") |>
    mutate(priority = "Low")) |>
  filter(itemid != "") |>
  arrange(store, match(priority, c("High","Medium","Low")), itemid)

Now that I have cleaned the data provided by the model in preparation for the final produced spreadsheet, I would like to also include a validation summary as well to help summarize the following:

  • Priority counts per store
  • Fill-rate vs. SKU limit
  • Compliance with Tier rules

This will help with reporting and also may be useful to the ABS team. I will be naming the tibble validation_table

priority_summary <- final_recommendations_long |>
  count(store, storetiercat, priority, name = "priority_count") |>
  pivot_wider(
    names_from = priority,
    values_from = priority_count,
    values_fill = 0)

priority_summary <- priority_summary |>
  left_join(
    final_recommendations |>
      dplyr::select(store, sku_limit, year),
    by = "store") |>
  mutate(total_recommended = High + Medium + Low,
         fill_rate = total_recommended / sku_limit)

quota_rules <- tibble(storetiercat = factor(c("1", "2", "3"),
                                            levels = levels(priority_summary$storetiercat)),
                      min_high   = c(0.55, 0.75, 0.85),
                      min_medium = c(0.30, 0.20, 0.15),
                      min_low    = c(0.15, 0.05, 0.00))


priority_summary <- priority_summary |>
  left_join(quota_rules, by = "storetiercat")

priority_summary <- priority_summary |>
  mutate(pct_high   = High / total_recommended,
         pct_medium = Medium / total_recommended,
         pct_low    = Low / total_recommended,
         meets_high   = pct_high   >= min_high,
         meets_medium = pct_medium >= min_medium,
         meets_low    = pct_low    >= min_low,
         quota_pass = meets_high & meets_medium & meets_low)

validation_table <- priority_summary |>
  dplyr::select(store, storetiercat, sku_limit, total_recommended, fill_rate, pct_high, pct_medium, pct_low, meets_high, meets_medium, meets_low, quota_pass) |>
  arrange(store)

Now that I have produced a prepared table, final_recommendations_long, along with a table to observe the quotas, validation_table, I would like to join the two and then produce a csv file as my final product for ABS.

First I need to join the two tables:

recommendations_with_validation <- final_recommendations_long |>
  left_join(validation_table |>
      dplyr::select(store, total_recommended, fill_rate, pct_high, pct_medium, pct_low, meets_high,  meets_medium, meets_low, quota_pass),
      by = "store")

And finally to produce the csv file:

readr::write_csv(recommendations_with_validation, "ABS_Item_Recommendations_2025.csv")

Final Remarks

As of current, this is the final produced product I can offer, however there are a number of concerns that would need to be addressed, and I would have addressed given more time:

  • The further I got into creating this product, the more I realized that my current ranking system for the priority of items is flawed. This becomes more apparent when creating a quota that the high priority items will never be able to fulfill for any store due to the limitations of the quantity. To resolve this I would:

    • Change the definition of high priority items by changing it from the top 25% performing items organization wide to either top 33%, top 40%, or even consider letting a random forest model help with predicting the priority of items based on performance (which personally I’d recommend the most).
    • Re calibrate the quotas I have created for all store tiers so that the amount of high priority items offer a more realistic percentage for each, making it more possible to fulfill.
  • Include the quantity of an item I would recommend to each individual store, using the random forest model to assist with predicting the best amount based on SKU limitations

  • Adjust the method that the model observes the data from a yearly perspective to a monthly perspective to get a more accurate perspective on inventory fulfillment.

    • Possibly even make it so that the random forest model would produce a new recommended list before every inventory fulfillment cycle so that each individual store could determine what it would need in their inventory based on inventory performance.

Overall, I am certain that by using a random forest model to help with determining how Alcohol and Beverage Services of Montgomery County fulfill their inventory between all 27 locations, they will ultimately see more profit as there will be less inventory going to waste. My work is solely meant to lay the groundwork in creating the inventory ranking system, while also acknowledging how it can be improved in the future.