ABS Buying Patterns Project

Author

A Warsaw

Purpose

To determine a better methodology for observing the buying patterns between different locations, I will be observing the inventory of all locations for the following:

Trends in inventory performance between all stores
Determining if the current methodology which Alcohol and Beverage Services of Montgomery County (or ABS for short) uses for observing buying patterns is the best method
Using models to optimize the inventory management process either through ABS’ current methodology, or a newly introduced model

To do so, I would like to observe the data set provided to me by ABS to locate any potential trends, determine best practices, and also engineer new variables if needed for modeling purposes.

Stores to be Observed

For my observations, I will be researching information for the following locations:

Aspen Hill (store 28)
Burtonsville (store 8)
Cabin John (store 12)
Clarksburg Village (store 3)
Cloverly (store 10)
Darnestown (store 20)
Downtown Rockville (store 27)
Fallsgrove (store 21)
Flower (store 7)
Gaithersburg Square (store 30)
Goshen Crossing (store 17)
Hampden Lane (store 23)
Kensington (store 14)
King Farm (store 26)
Kingsview (store 6)
Leisure World (store 13)
Montrose (store 24)
Muddy Branch (store 9)
Olney (store 18)
Poolesville (store 29)
Potomac (store 15)
Seneca Meadows (store 16)
Silver Spring (store 2)
Walnut Hill (store 11)
Westbard (store 4)
Wheaton (store 5)
White Oak Town Center (store 22)

It is also important to note that there are 3 size designations for all 27 stores (Size A, B, and C) which is used to categorize the stores by their corresponding size:

Size A is for stores with greater than 6000 sq ft
Size B is for stores between 4000 and 6000 sq ft
Size C is for stores below 4000 sq ft

These designations are also categorized by their sales performance (based on a scale created in 2022) from tier 1 to 3 where:

Tier 1 is for stores performing greater than $8 million in the SY
Tier 2 is for stores performing between $5 and $8 million in the SY
Tier 3 is for stores who perform less than $5 million in the SY

So all 27 stores are identified by both a Size and Tier. I have listed below every store with their corresponding Tiers and Rank in descending order:

Size ‘A’:

- Seneca Meadows (Store #16) A1

- Muddy Branch (Store #09) A1

- Damestown (Store #20) A1

- Leisure World (Store #13) A2

- Kingsview (Store #06) A2

- Goshen Crossing (Store #17) A2

- Downtown Rockville (Store #27) A2

- Clarksburg Village (Store #03) A2

- Aspen Hill (Store #28) A2

- Gaithersburg (Store #30) A3

- Cloverly (Store #10) A3

Size ‘B’:

- Hampden Lane (Store #23) B1

- Kensington (Store #14) B1

- Montrose (Store #24) B1

- Westbard (Store #04) B1

- Silver Spring (Store #02) B2

- Walnut Hill (Store #11) B2

- Burtonsvile (Store #08) B3

- Cabin John (Store #12) B3

- White Oak Town Center (Store #22) B3

Size ‘C’:

- Potomac (Store #15) C1

- Olney (Store #18) C2

- Wheaton (Store #05) C2

- Fallsgrove (Store #21) C3

- Flower (Store #07) C3

- King Farm (Store #26) C3

- Poolesville (Store #29) C3

About the Data Set

There are a total of 18 variables and roughly 12 million observations provided in the transactions data set provided by ABS, which I will provide a breakdown of below. However, there will also be additional variables that I will also include for this data set and will be using throughout this project which will be mentioned here.

Variables originally included in the data set:

transdate: This is the physical date that the transaction took place. ABS has provided transactions starting from July 1, 2023 to June 30, 2025 (roughly 2 years worth of data)
transactionid: This is a unique identifier for each individual transaction (likely to not be used for this project)
store: This is the store number where the transaction took place
storename: This is the formatted store name where the transaction took place
itemid: This is the unique identifier for products. Item dimension attributes do not change. If an item’s pack unit or bottles per case changes a new code is created.
description: This is the product description
netamount: This is the total cost of the line item / product purchased (discounts included) without tax
lineqty: This is quantity sold of each individual line item in a transaction. So if 3 bottles of wine were sold, this would be 3. If 2 6-packs were sold, this would be 2
packunit: This is the selling unit for the line item. This is the unit as how the product was sold. This would either be Btl (single bottle) or 4pk, 6pk, etc. Some lines may only show 04 or 06 and this relates to 4pk or 6pk. This is more to show the calculation of the totalqty column, which represents how many actual bottles were sold to a customer
totalqty: This is the total quantity of items sold for that line item. (Line quantity * pack unit)
itemtag: This is the warehouse designation inventory tag. Some items may be NULL as tags are updated on a daily basis and at the time this data set was taken offline, it may have converted some TAGS to NULL. (note that I will only be observing itemtags that identify as “ST”, which is Standard Stock)
tagdesc: This is the formatted description of the warehouse inventory tags
classificationdepartment: This is the department the product belongs too (Level 1). The only departments are BEER, WINE, LIQ, MISC.
classifictiontype – This is the group type the product belongs too (Level 2). This references the type of product within the department. i.e.: LIQ > WHISKEY
classificationcategory – This is the group category the product belongs too (Level 3). This further references the category within the type of items. i.e.: LIQ > WHISKEY > IRISH WHISKEY.
bottlespercase – This is the total number of bottles in a case. This is important for our warehouse partners when calculating overall case volume of an item. totalqty / bottlespercase = Case(s) sold
size: This is the size of the product. There are standard sizes of many products and often one size of a product will outperform another for a variety of reasons. i.e. 750 mL vs 1.75 L
custaccount: This designates if the transaction was for a retail/public customer or a licensee (This will only be used to filter out all licensee transactions as I will only be observing public customers)

Variables added into the data set:

transdate2: This is used as a separately formatted date variable for visualization purposes upon cleaning the data set
year: Another created variable used for observational purposes to organize the data set specifically by year
storesizecat: This is added for observing the data by the store size categorization created by ABS, only “A”, “B”, or “C” will populate the rows corresponding to its respective store in a given transaction
storetiercat: This is added for observing the data by the stores financial performance based on ABS’ parameters, only showing “1”, “2”, or “3” in the rows corresponding to its respective store in a given transaction
cost: raw cost of a given transaction (not including tax) based on netamount / lineqty

Data Preparation

Before beginning my analysis, I will be preparing my libraries, loading in my data set, and also doing all necessary cleaning below.

library(tidyverse)
library(DBI)
library(RSQLite)
library(nnet)
library(caret)

abs_observed <- read_csv("TransactionData_UTF8.txt")

#Setup for the storesizecat and storetiercat variables
stores_A <- c(03, 06, 09, 10, 13, 16, 17, 20, 27, 28, 30)
stores_B <- c(02, 04, 08, 11, 12, 14, 22, 23, 24)
stores_C <- c(05, 07, 15, 18, 21, 26, 29)
tier1 <- c(04, 09, 14, 15, 16, 20, 23, 24)
tier2 <- c(02, 03, 05, 06, 11, 13, 17, 18, 27, 28)
tier3 <- c(07, 08, 10, 12, 21, 22, 26, 29, 30)

names(abs_observed) <- tolower(names(abs_observed)) #making all variables lowercase
#cleaning the store variable so that R recognizes it as intended
abs_observed$store <- as.character(as.integer(abs_observed$store))
abs_observed$store <- as.numeric(abs_observed$store)

abs_observed <- abs_observed |>
  filter(custaccount != "Licensee") |> #removing licensee transactions
  filter(itemtag == "ST") |> #removing all non-Standard stock transactions
  mutate(storesizecat = case_when(
  store %in% stores_A ~ "A",
  store %in% stores_B ~ "B",
  store %in% stores_C ~ "C",
  TRUE ~ NA_character_)) |>
  mutate(storetiercat = case_when(
  store %in% tier1 ~ "1",
  store %in% tier2 ~ "2",
  store %in% tier3 ~ "3")) |>
  filter(lineqty > 0 ) |> # filtering out transactions that were either voided or refunded
  filter(netamount > 0) #filtering out transactions that were either voided or refunded

# Creating cost, transdate2, and year variables
abs_observed$cost <- abs_observed$netamount / abs_observed$lineqty
abs_observed$transdate2 <- as.Date(abs_observed$transdate)
abs_observed$year <- format(abs_observed$transdate2, "%Y")

options(scipen = 999) #converting all visualizations from scientific notation to standard (as it likely will show up in scientific notation)

Observing the Data Set by the Three Levels of Classifications

There are three variables mentioned earlier (classificationdepartment, classifcationtype, and classificationcategory) that I would like to start with observing, as I would like to see the distribution of transactions based on those levels for all 27 locations. This should help give me a general idea of how different inventory are performing between the locations to determine if there are any notable differences.

to do so, I will create a new tibble named abs_classdeptcount where I will include a new variable count which counts the frequency of the corresponding classification department transactions in the main data set

abs_classdeptcount <- abs_observed |>
  group_by(store, classificationdepartment) |>
  summarise(count = n(), .groups = "drop") |>
  arrange(store)

ggplot(abs_classdeptcount, aes(x = store, y = count,  fill = classificationdepartment)) +
  geom_col() +
  labs(title = "Inventory Performance of All Stores by Classification Dept (2023-2025)",
       x = "Store",
       y = "Frequency",
       fill = "Classification Department",
       caption = "Source: Montgomery County ABS") +
  theme_bw() +
  coord_flip() +
  scale_x_continuous(breaks = seq(min(abs_classdeptcount$store), max(abs_classdeptcount$store), by = 1)) #To show ticks for every store

From this it is clear that the liquor classification department dominates among all stores, with wine coming in second, beer in third, and miscellaneous last. Though this is not a major surprise, it does give a good basis on how generally every store (for the most part) follows the same popularity distribution of liquor > Wine > Beer > Misc. There are a few outliers, like store 29 (Poolesville), which appears to only hold liquor sales. This may speak to why it is on the lowest scale of ABS’ method of observing buying patterns (that being that this store is catagorized as C3). Or it could be that due to the store’s previous low performance with inventory of other departments ABS has only made liquor available in said location. It is possible, though, that with further observation of the stores by classification will give more context.

Before doing any deeper research, I would like to see the performance of the top two performing classification departments: Liquor and Wine, both individually for all 27 locations and collectively to see if the top two performing departments share the same distribution patterns on the individual and collective level. This is to determine if there are any other anomalies among the stores that need to be noted. This will be done by creating another two tibbles, abs_liquor and abs_wine, where each will also include the count variable to accurately count the frequency for all transactions.

abs_liquor <- abs_observed |>
  group_by(store, classificationdepartment, classificationtype) |>
  filter(classificationdepartment == "LIQ") |>
  summarise(count = n(), .groups = "drop") |>
  arrange(store)

abs_wine <- abs_observed |>
  group_by(store, classificationdepartment, classificationtype) |>
  filter(classificationdepartment == "WINE") |>
  summarise(count = n(), .groups = "drop") |>
  arrange(store)

ABS Liquor Distribution Results

abs_liquor |>
  filter(store == 2) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Silver Spring Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 3) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Clarksburg Village Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 4) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Westbard Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 5) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Wheaton Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 6) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kingsview Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 7) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Flower Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 8) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Burtonsville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 9) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Muddy Branch Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 10) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cloverly Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 11) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Walnut Hill Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 12) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cabin John Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 13) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Leisure World Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 14) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kensington Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 15) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Potomac Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 16) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Seneca Meadows Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 17) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Goshen Crossing Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 18) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Olney Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 20) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Darnestown Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 21) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Fallsgrove Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 22) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "White Oak Town Center Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 23) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Hampden Lane Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 24) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Montrose Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 26) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "King Farm Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 27) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Downtown Rockville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 28) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Aspen Hill Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 29) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Poolesville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_liquor |>
  filter(store == 30) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Gaithersburg Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

ggplot(abs_liquor, aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "ABS Liquor Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

Generally, when observing liquor by classification type, the distributions all seem to be about the same, with slight differentiation. Thus it is safe to say that there is very little variation between all stores for the top performing classification department.

ABS Wine Distribution Results

abs_wine |>
  filter(store == 2) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Silver Spring Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 3) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Clarksburg Village WIne Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 4) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Westbard Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 5) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Wheaton Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 6) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kingsview Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 7) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Flower Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 8) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Burtonsville Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 9) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Muddy Branch Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 10) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cloverly Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 11) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Walnut Hill Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 12) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Cabin John Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 13) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Leisure World Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 14) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Kensington Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 15) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Potomac Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 16) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Seneca Meadows Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 17) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Goshen Crossing Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 18) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Olney Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 20) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Darnestown Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 21) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Fallsgrove Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 22) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "White Oak Town Center Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 23) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Hampden Lane Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 24) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Montrose Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 26) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "King Farm Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 27) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Downtown Rockville Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 28) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Aspen Hill Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 29) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Poolesville Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 7.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

abs_wine |>
  filter(store == 30) |>
  ggplot(aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "Gaithersburg Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

ggplot(abs_wine |>
         filter(count > 5000), aes(x = classificationtype, y = count)) +
  geom_col(position = "dodge") +
  labs(x = "Classification Type",
       y = "Frequency",
       caption = "Source: Montgomery County ABS",
       title = "ABS Wine Purchase Frequency from 2023-2025 by Classification Type") +
  theme(axis.text.x = element_text(angle = 90, size = 6.25, vjust = 0.50, hjust = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 0))

The distribution of the wine department by classification type definitely shows a variation based on location. When creating the visualizations I quickly realized that the general distribution of all stores collectively showed a large variety of options by classification type, resulting in a very messy visualization (as you can see from observing the visualizations of each individual store). To fix this, I decided to only show the top 13 performing options by filtering the count to show only options that appear more than 5000 times in the abs_observed data set. Outside of that, the biggest outlier is still poolesville, as they only offer sake, which performs very poorly (only being purchased roughly 25 times between 2023-2025).

Due to the wide variety of performance for the wine department, it is clear that not all departments are equal, and thus more observations are needed to determine how ABS’ inventory is performing in all locations.

Using SQL Queries to observe the stores by size and classification

Now I will be using SQL queries to help identify specifics regarding performance in all stores by all three classification levels. To make it easier to observe the results, I will be categorizing the results using the store size method that ABS uses for observing their stores. This will be done by organizing three different tibbles named TierA_Stores, TierB_Stores, and TierC_Stores, which pulls all information from the main data set into the respective tibble based on the corresponding store’s size category (by using the storesizecat variable). I will then be creating a database for all relevant queries named abstier.db.

TierA_Stores <- abs_observed |>
  dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
  group_by(transdate) |>
  filter(store %in% c(03, 06, 09, 10, 13, 16, 17, 20, 27, 28, 30))

TierB_Stores <- abs_observed |>
  dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
  group_by(transdate) |>
  filter(store %in% c(02, 04, 08, 11, 12, 14, 22, 23, 24))

TierC_Stores <- abs_observed |>
  dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
  group_by(transdate) |>
  filter(store %in% c(05, 07, 15, 18, 21, 26, 29))

abstier.db <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(abstier.db, TierA_Stores)
copy_to(abstier.db, TierB_Stores)
copy_to(abstier.db, TierC_Stores)
copy_to(abstier.db, abs_observed) # Added the main data set just in case

Tier A Stores

First this is the performance by Classification Department:

SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationdepartment	quantity
2023	LIQ	711252
2023	WINE	357449
2023	BEER	51992
2023	MISC	17172
2024	LIQ	1393916
2024	WINE	627742
2024	BEER	158378
2024	MISC	38941
2025	LIQ	645536
2025	WINE	271803

Next by Classification Type:

SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationtype	quantity
2023	VODKA	180113
2023	TEQUILA	84485
2023	WHISKEY	67596
2023	BOURBON	59527
2023	CORDIALS	53882
2023	CABERNET SAUVIGNON	52833
2023	RUM	51154
2023	RED BLEND	42182
2023	CHARDONNAY	41930
2023	SCOTCH	38718

Finally by Classification Category:

SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationcategory	quantity
2023	DOMESTIC VODKA	117579
2023	AMERICAN RED	105677
2023	TEQUILA	81571
2023	AMERICAN WHITE	76593
2023	STRAIGHT BOURBON WHISKEY	56354
2023	IMPORTED RUM	48381
2023	IMPORTED VODKA	42311
2023	IMPORTED CORDIALS	35515
2023	DOMESTIC VODKA FLAVORS	35248
2023	BLENDED WHISKEY	32742

Tier B Stores

First this is the performance by Classification Department:

SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationdepartment	quantity
2023	LIQ	548087
2023	WINE	332712
2023	BEER	47278
2023	MISC	15939
2024	LIQ	1051785
2024	WINE	545938
2024	BEER	135093
2024	MISC	38252
2025	LIQ	482284
2025	WINE	258699

Next by Classification Type:

SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationtype	quantity
2023	VODKA	143294
2023	TEQUILA	63304
2023	BOURBON	49720
2023	WHISKEY	47040
2023	CORDIALS	44563
2023	CABERNET SAUVIGNON	42766
2023	CHARDONNAY	41574
2023	SAUVIGNON BLANC	36426
2023	RUM	36368
2023	RED BLEND	34551

Finally by Classification Category:

SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationcategory	quantity
2023	DOMESTIC VODKA	97461
2023	AMERICAN RED	89257
2023	AMERICAN WHITE	72842
2023	TEQUILA	60804
2023	STRAIGHT BOURBON WHISKEY	47967
2023	IMPORTED RUM	34516
2023	IMPORTED VODKA	33468
2023	IMPORTED CORDIALS	32014
2023	SPARKLING	27497
2023	BLENDED WHISKEY	22969

Tier C Stores

First this is the performance by Classification Department:

SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationdepartment	quantity
2023	LIQ	329124
2023	WINE	155622
2023	BEER	18487
2023	MISC	7638
2024	LIQ	649271
2024	WINE	270456
2024	BEER	57553
2024	MISC	18528
2025	LIQ	235017
2025	WINE	103997

Next by Classification Type:

SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationtype	quantity
2023	VODKA	90682
2023	TEQUILA	35247
2023	WHISKEY	33792
2023	BOURBON	30237
2023	CORDIALS	24567
2023	CABERNET SAUVIGNON	21665
2023	RUM	21559
2023	CHARDONNAY	20113
2023	SCOTCH	17745
2023	RED BLEND	16432

Finally by Classification Category:

SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC

Displaying records 1 - 10
year	classificationcategory	quantity
2023	DOMESTIC VODKA	61392
2023	AMERICAN RED	42911
2023	AMERICAN WHITE	34710
2023	TEQUILA	33912
2023	STRAIGHT BOURBON WHISKEY	28977
2023	IMPORTED VODKA	21135
2023	IMPORTED RUM	20500
2023	BLENDED WHISKEY	17630
2023	IMPORTED CORDIALS	16807
2023	DOMESTIC VODKA FLAVORS	14104

Results Overview

Generally, all tiers appear to follow the same trends throughout all years and all levels of classifications. This further proves that inventory performance likely has little to do with store size as tier A, B, and C stores all share generally the same results from 2023-2025. To give a quick rundown of the results I will report the top 5 and lowest 5 performing items on all three levels of classification for all three tiers of store size:

Classification Department

Tier A Stores
- Top to lowest performing (2023-2025)
  - Liquor
  - Wine
  - Beer
  - Miscellaneous
  - Donations
  - DLC (only in 2024 & 2025)
  - Non Inventory
  - Gift Card (only in 2023)
Tier B Stores
- Top to lowest performing (2023-2025)
  - Liquor
  - Wine
  - Beer
  - Misc
  - Donations
  - Non Inventory
  - Gift Card (only in 2023)
  - DLC (only in 2023 & 2025)
Tier C Stores
- Top to lowest performing (2023-2025)
  - Liquor
  - Wine
  - Beer
  - Misc
  - Donations
  - Non Inventory
  - Gift Card (only in 2023)
  - DLC (only in 2023 & 2025)

Classification Type

Tier A Stores
- Top 5 Performing:
  - 2023
    - Vodka
    - NA(Values with no data)
    - Tequila
    - Whiskey
    - Bourbon
  - 2024
    - Vodka
    - Tequila
    - Whiskey
    - Bourbon
    - Cordials
  - 2025 (thus far)
    - Vodka
    - Tequila
    - Whiskey
    - Bourbon
    - Cordials
- Lowest 5 Performing:
  - 2023
    - Soave
    - Pinotage
    - St. Emillion
    - Petite Syrah
    - IPA
  - 2024
    - Petite Syrah
    - Grenache
    - Vermentino
    - Bordeaux
    - Agave
  - 2025
    - Bordeaux
    - Torrontes
    - Other South African Red
    - Flavor Gin
    - Pinotage
    - (directly above is petite syrah, grenache, vermantino, and Agave)
Tier B Stores
- Top 5 Performing:
  - 2023
    - Vodka
    - NA(Values with no data)
    - Tequila
    - Bourbon
    - Cordials
  - 2024
    - Vodka
    - Tequila
    - Bourbon
    - Whiskey
    - Cordials
  - 2025
    - Vodka
    - Tequila
    - Bourbon
    - Whiskey
    - Cordials
- Lowest 5 Performing:
  - 2023
    - Liebfraumilch
    - Bianco
    - St. Emillion
    - Grenache
    - Torrontes
  - 2024
    - Vermentino
    - Other South African Red
    - Torrontes
    - Madeira
    - Bardolino
  - 2025
    - Bardolino
    - Other South African Red
    - Flavor Gin
    - Grenache
    - Primitivo
Tier C Stores
- Top 5 Performing:
  - 2023
    - Vodka
    - NA(Values with no data)
    - Tequila
    - Whiskey
    - Bourbon
  - 2024
    - Vodka
    - Tequila
    - Whiskey
    - Bourbon
    - Cordials
  - 2025
    - Vodka
    - Tequila
    - Bourbon
    - Whiskey
    - Cordials
- Lowest 5 Performing:
  - 2023
    - Soave
    - Gamay
    - Pinotage
    - Miscast
    - Bianco
  - 2024
    - Pinotage
    - Torrontes
    - Petite Syrah
    - Bardolino
    - Liebfraumilch
  - 2025
    - Vermentino
    - Graves
    - Bardolino
    - Petite Syrah
    - Grenache

Classification Category

Tier A Stores
- Top 5 performing:
  - 2023
    - Domestic Vodka
    - American Red
    - Tequila
    - American White
    - Straight Bourbon Whiskey
  - 2024
    - Domestic Vodka
    - American Red
    - Tequila
    - American White
    - Straight Bourbon Whiskey
  - 2025
    - Domestic Vodka
    - Tequila
    - American Red
    - American White
    - Straight Bourbon Whiskey
- Lowest 5 Performing:
  - 2023
    - French Red Wine
    - 1% Mixers
    - Malt Based RTD
    - Indian Whiskey
    - Greek
  - 2024
    - Indian Whiskey
    - Sake
    - Alsatian Wine
    - Misc Agave Spirit
    - Greek
  - 2025
    - Georgian Wine
    - Greek
    - Indian Whisky
    - Grappa
    - Armagnac
Tier B Stores
- Top 5 Performing:
  - 2023
    - Domestic Vodka
    - American Red
    - American White
    - Tequila
    - Straight Bourbon Whiskey
  - 2024
    - Domestic Vodka
    - American Red
    - Tequila
    - American White
    - Straight Bourbon Whiskey
  - 2025
    - Domestic Vodka
    - American Red
    - American White
    - Tequila
    - Straight Bourbon Whiskey
- Lowest 5 Performing:
  - 2023
    - Store Supplies
    - Grappa
    - Indian Whisky
    - Greek
    - 1% Mixers
  - 2024
    - Wine Based Seltzers
    - Grappa
    - Greek
    - Indian Whisky
    - 1% Mixers
    - 2025
    - Greek
    - Domestic Dessert Wine
    - 1% Mixers
    - Georgian Wine
    - Armagnac
Tier C Stores
- Top 5 Performance:
  - 2023
    - Domestic Vodka
    - American Red
    - Straight Bourbon Whiskey
    - Tequila
    - American White
  - 2024
    - Domestic Vodka
    - American Red
    - Tequila
    - Straight Bourbon Whiskey
    - American White
  - 2025
    - Domestic Vodka
    - American Red
    - Straight Bourbon Whiskey
    - Tequila
    - American White
- Lowest 5 Performance:
  - 2023
    - 1% Mixers
    - Indian Whisky
    - Domestic Dessert Wine
    - Store Supplies
    - Grappa
  - 2024
    - Malt Based RTD
    - Domestic Dessert Wine
    - 1% Mixers
    - Sake
    - Grappa
  - 2025
    - Store Supplies
    - Armagnac
    - Misc Agave Spirit
    - Georgian Wine
    - 1% Mixers

Observing the Data Set by Item ID

Now that I have determined that the inventory generally performs the same among all stores, regardless of store size, it further proves that ABS’ buying patterns methodology can be upgraded. My idea is for ABS to observe the data by the itemid, specifically for standard stock to prevent the variable becoming unusable as when switching from one itemtag to another, ABS must change the itemid even if it is the same item, so as to have a separate item id for its new tag designation. This is also why it was necessary for me to filter out all results in the main data set to show only standard stock, as this is the variable I will be mainly focusing on for this project. To give an idea of how necessary filtering the data is, before filtering out all transactions that were not standard stock, there were over 12 million observations, now there are roughly 9.7 million. That means that approximately 3 million transactions occur with items under a different item tag, which could likely be the same item under the standard stock, but with a different item id.

To start off, I will be creating a new category for use to assist in my analysis named priority which I will further explain below:

priority: This is an engineered variable for observing the itemid variable by frequency of transactions to determine if it is a “High”, “Medium”, or “Low” priority item. High priority merchandise are frequently purchased items in the upper 25 percentile of all transactions globally. Medium priority are items which populate the middle 50 percentile of all transactions. Low priority items populate the lower 25 percentile of all transactions. The percentile may be subject to changing according to ABS’ desired parameters, or other unexpected reasoning. (This variable may be subject to be moved to a tibble dedicated to observing itemid’s by priority)

For now I will be preparing other tibbles for creating a statistical model based on the variables lineqty and netamount to create the following variables:

total_netamount: Created by taking the sum of netamount of one individual itemid collectively grouped by year and store
total_lineqty: Created by taking the sum of lineqty of one individual itemid collectively grouped by year and store
avg_cost: Created by taking the mean of netamount / lineqty of one individual itemid collectively grouped by year and store
n_transactions: Created to count the frequency a given itemid appears in transactions per store and year
priority: Refer to notes on variable above, created from total_netamount

These variables will be created in two separate tibbles. The first tibble, abs_var will be for creating majority of the variables, with the exception of priority. And the second tibble item_summary will be used for creating the priority variable for use from here on out.

#Updated to show priority for itemid globally for each year for predictive model
abs_var <- abs_observed |>
  group_by(year, itemid, description) |>
  summarise(total_netamount = sum(netamount, na.rm = TRUE),
            total_lineqty   = sum(lineqty, na.rm = TRUE),
            avg_cost = mean(netamount / lineqty, na.rm = TRUE),
            n_transactions  = n(),
            .groups = "drop")
item_summary <- abs_var |>
  group_by(year) |>
  mutate(q25 = quantile(n_transactions, 0.25, na.rm = TRUE),
         q75 = quantile(n_transactions, 0.75, na.rm = TRUE),
         priority = case_when(n_transactions >= q75 ~ "High",
                              n_transactions <= q25 ~ "Low",
                              TRUE ~ "Medium")) |>
  ungroup() |>
  mutate(priority = factor(priority, levels = c("High", "Medium", "Low")))

item_summary |> count(year, priority)

# A tibble: 9 × 3
  year  priority     n
  <chr> <fct>    <int>
1 2023  High       964
2 2023  Medium    1926
3 2023  Low        965
4 2024  High       730
5 2024  Medium    1452
6 2024  Low        731
7 2025  High       708
8 2025  Medium    1413
9 2025  Low        708

Modeling Item Trends from 2023-2025 by Priority

I now want to create a multinomial model based on the newly engineered priority variable which will use the information from the data set, separated by year, to both learn how to determine what priority any given item should be classified as, and also predict what priority an item will have for the following year based on the performance provided by transaction data. I will do this by creating a new tibble named model_data which collects all needed transaction information from item_summary along with two new variables included, ‘next_priority’ and next_year’, for predictions of item priority for the following year.

#Create lagged dataset to model transitions using 2023 and 2024 transactions (predicting next year's priority)

item_transitions <- item_summary |>
  arrange(itemid, year) |>
  group_by(itemid) |>
  mutate(next_priority = lead(priority),
         next_year = lead(year)) |>
  filter(!is.na(next_priority)) |> # Removes 2025 year data as no data follows
  ungroup()

# Join store characteristics for more context

model_data <- item_transitions |>
  left_join(abs_observed |>
              select(itemid, store, storesizecat, storetiercat) |>
              distinct(), by = "itemid")

priority_model <- nnet::multinom(next_priority ~ total_netamount + total_lineqty + avg_cost + n_transactions + storetiercat + storesizecat + year, data = model_data)

# weights:  33 (20 variable)
initial  value 153713.436982 
iter  10 value 99370.097887
iter  20 value 83547.094840
iter  30 value 61582.220365
iter  40 value 61452.302486
iter  50 value 61426.708135
iter  60 value 61422.128167
final  value 61419.782406 
converged

summary(priority_model)

Call:
nnet::multinom(formula = next_priority ~ total_netamount + total_lineqty + 
    avg_cost + n_transactions + storetiercat + storesizecat + 
    year, data = model_data)

Coefficients:
       (Intercept) total_netamount total_lineqty   avg_cost n_transactions
Medium    3.339898  -0.00001890630  0.0001452922 0.03281650   -0.004026737
Low       2.950450  -0.00007892468  0.0003871200 0.05288098   -0.009980445
       storetiercat2 storetiercat3 storesizecatB storesizecatC year2024
Medium    0.00731104  -0.015739368   0.005882785   -0.01392901 4.194745
Low      -0.04390216  -0.001928148  -0.054497280   -0.11212288 7.009737

Std. Errors:
         (Intercept) total_netamount total_lineqty     avg_cost n_transactions
Medium 0.00004014928 0.0000008099377  0.0000275512 0.0008968979  0.00004118825
Low    0.00003012254 0.0000025138220  0.0001355416 0.0010823724  0.00017918395
       storetiercat2  storetiercat3 storesizecatB  storesizecatC       year2024
Medium 0.00001514294 0.000013241549 0.00001347795 0.000010382601 0.000019345275
Low    0.00001066761 0.000009807617 0.00001044328 0.000006683589 0.000004312398

Residual Deviance: 122839.6 
AIC: 122879.6

The purpose of this is for the model to track and learn the trends that create a high priority item versus a medium or low priority item using the variables provided. When the variable offers a positive coefficient tied to either medium or low, then it understands that the corresponding variable is likely to produce the given priority (medium or low, or both) when the given variable increases in number/amount. And when a variable offers negative coefficients for either medium or low, then it is stating that this variable is less likely to produce an item of said priority when the variable increases in number/amount. In the case of when both coefficients are negative, it is stating that high priority items are more likely produced when the corresponding variable increases in number/amount.

According to that breakdown, the following results are:

total_netamount: Produced negative coefficients for both medium and low, more likely to produce high priority items as net sales increase which is logical (proving the model is off to a good start)
total_lineqty: Produced positive coefficients for both, which means that higher line quantity increases the odds of an item being medium or low priority vs high priority. This is an interesting observation according to the model, as it appears counter-intuitive, however it simply means that quantity alone is not a good enough predictor for whether an item will be high priority or not as high priority items have stronger net sales vs high volumes.
avg_cost: Produced positive coefficients for both, meaning that higher cost items tend to become medium to low priority items. This is logical as higher cost items are less likely to be purchased.
n_transactions: Produced negative coefficients for both, which comes to no surprise as the number of transactions a given item has directly correlates to the priority of an item, so high number = high priority
storetiercat: Between two different storetiercat’s (storetiercat2 and storetiercat3), showed mixed results with very little variation. This further proves that inventory velocity has very little to do with store tier (in other words a store’s financial performance) as priority is mostly driven by item performance itself and is consistent across all store tiers.
Storesizecat: The same as storetiercat, it offers very little variation meaning that an item’s velocity is more independent than ABS’ current buying patterns methodology, further ensuring the usefulness of the priority variable.
year: The year variable shows both positive coefficients for medium and low showcasing that based on the data set (and based on the 2023 to 2024 performance), items are performing more poorly than the previous year (which is a part of ABS team’s concern) as items are more likely to be medium or low priority as time progresses.

In summary, the model shows that the total number of transactions and net sales amount are the strongest predictors of priority movement between years. Items with higher transaction counts and stronger net sales are much more likely to enter the high priority tier. Quantity sold on its own is not a reliable predictor and items with high physical volume can still fall into the medium or low tiers when net revenue does not match. Store size and store tier have only a minor influence which suggests that performance is driven by the product itself rather than the location. There is also a strong year effect that indicates priority classifications in 2024 were harder to reach compared to 2023 which may reflect broader changes in sales patterns.

Next, I will be testing the model’s performance to see how well it is able to predict a given item’s priority level based on what the model was able to analyze.

# Evaluate model performance

preds <- predict(priority_model, newdata = model_data)
confusion <- table(Predicted = preds, Actual = model_data$next_priority)
confusion

         Actual
Predicted  High Medium   Low
   High   32685   3167   101
   Medium  8186  76481 10718
   Low      132   2044  6402

# Overall accuracy

accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy

[1] 0.8259813

According to the results, the model predicted that 35,953 items would be high priority, 95,385 items would be medium priority, and 8578 items would be low priority. Based on the given predictions, the model has an accuracy of 82.598% or approximately 82.6%, showing that the model is relatively strong at predicting an items performance. Or in other words, more than eight times out of ten the model will accurately predict an items priority level in the next year to give better perspective. To give a breakdown of the models predictions:

The model accurately predicted that 32,685 items would be high priority
- 3167 items that the model predicted to be high priority turned out to be medium priority
- 101 items that the model predicted to be high priority turned out to be low priority
The model accurately predicted 76481 items to be medium priority
- 8186 items that the model predicted to be medium priority turned out to be high priority
- 10718 items that the model predicted to be medium priority turned out to be low priority
The model accurately predicted 6402 items to be in the low priority
- 132 items that the model predicted to be low priority turned out to be high priority
- 2044 items that the model predicted to be low priority turned out to be medium priority

In summary, the model achieved an accuracy of roughly 82 percent which shows that the predictors used for understanding future priority transitions are strong. The confusion matrix shows that the model performs very well when identifying high priority items and rarely mistakes low performers for strong ones. Most of the mis-classifications occur within the medium tier which is expected since it represents the largest portion of items and shares characteristics with both high and low tiers. The strong performance between the high and low categories suggests that the model can safely support decisions involving replenishment and inventory risk. This confirms that the quartile-based priority structure is statistically meaningful and can be used confidently when observing how items move across years.

However, there are resounding concerns regarding other statistics, including a very high AIC score, which indicates that there are potential reliability concerns regarding how the multinomial model observes the variable.

Additional Modeling Method Testing

Before moving forward, I want to address the concern with my original multinomial model priority_model. Currently there are some concerns regarding it’s statistics, including a high AIC which indicates that there is too much noise in the data set itself despite its moderately high accuracy. As that is not something I would like to move forward with, I will be using a different method to observe my model_data, by first sectioning the data so that 80% of it is used for training the model and the remaining 20% for testing the model. I will also be creating a custom tibble used for observing the accuracy and F1 statistics of the models I will be creating. The goal is to determine the best modeling method to use for my final step. For this section I will be using the following modeling methods:

Multinomial (using a different method)
Linear Discriminate Analysis
Decision Tree
Random Forest
K nearest Neighbors
Artificial Neural Network

# Modifying variables to ensure storesizecat and storetiercat are seen as categorical and year is seen as numerical
model_data <- model_data |>
  mutate(next_priority = factor(next_priority, levels = c("High", "Medium", "Low")),
         storesizecat = factor(storesizecat),
         storetiercat = factor(storetiercat),
         year = as.numeric(year))

set.seed(111)
train_index <- caret::createDataPartition(model_data$next_priority, p = 0.80, list = FALSE) #Splitting the data so that 80% of the data goes to training and 20% goes to testing and to ensure priority stays balanced between both sets

train_data <- model_data[train_index, ] #training data
test_data <- model_data[-train_index, ] #testing data

ctrl <- caret::trainControl(method = "cv", #fold cross validation aka it is repeatedly split internally during training
                            number = 5, # number of folds
                            classProbs = TRUE, #enables probability outputs
                            summaryFunction = caret::multiClassSummary, #Gives statistics such as accuracy
                            savePredictions = "final") #stores the best model's predictions for diagnostics

priority_formula <- next_priority ~ total_netamount + total_lineqty + avg_cost + n_transactions + storetiercat + storesizecat + year #variables used for predicting the priority

#custom F1 metric function for predictions
multiclass_f1 <- function(cm_table){
  
  classes <- colnames(cm_table)
  
  per_class <- map_dfr(classes, function(cls){
    tp <- cm_table[cls, cls] # correct predictions
    fp <- sum(cm_table[, cls]) - tp #predicted the class but it was something else
    fn <- sum(cm_table[cls, ]) - tp #actual class but it predicted something else
    
    precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp)) #to determine how consistent the model is right 
    recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn)) #to determine out of all cases of a given priority, how often was caught by the model
    f1 <- ifelse(is.na(precision + recall) | (precision + recall == 0),
             NA, 2 * precision * recall / (precision + recall))
    tibble(class = cls, precision = precision, recall = recall, f1 = f1, support = sum(cm_table[cls, ]))})
  
macro_f1 <- mean(per_class$f1, na.rm = TRUE) #mean of F1 across classes
weighted_f1 <- weighted.mean(per_class$f1, w = per_class$support, na.rm = TRUE) # Weighted by number of items in a priority 

list(per_class = per_class, macro_f1 = macro_f1, weighted_f1 = weighted_f1)}

Linear Discriminate Model

set.seed(111)
lda_model <- caret::train(priority_formula,
                          data = train_data,
                          method = "lda",
                          trControl = ctrl,
                          metric = "Accuracy")

lda_model

Linear Discriminant Analysis 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results:

  logLoss    AUC       prAUC      Accuracy   Kappa      Mean_F1  
  0.7604346  0.793622  0.6295548  0.6629443  0.2330129  0.4051539
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.4252191         0.7300844         0.8368055            0.880875           
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.8368055       0.4252191    0.2209814            0.5776517

lda_preds <- predict(lda_model, newdata = test_data)

lda_cm <- caret::confusionMatrix(lda_preds, test_data$next_priority)

lda_cm

Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    2226      0     0
    Medium  5974  16334  3405
    Low        0      4    39

Overall Statistics
                                               
               Accuracy : 0.6647               
                 95% CI : (0.6591, 0.6702)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.2378               
                                               
 Mcnemar's Test P-Value : NA                   

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity              0.27146        0.9998   0.011324
Specificity              1.00000        0.1945   0.999837
Pos Pred Value           1.00000        0.6352   0.906977
Neg Pred Value           0.76805        0.9982   0.878127
Prevalence               0.29305        0.5839   0.123079
Detection Rate           0.07955        0.5837   0.001394
Detection Prevalence     0.07955        0.9189   0.001537
Balanced Accuracy        0.63573        0.5971   0.505581

lda_f1 <- multiclass_f1(lda_cm$table)
lda_f1$per_class

# A tibble: 3 × 5
  class  precision recall     f1 support
  <chr>      <dbl>  <dbl>  <dbl>   <int>
1 High      0.271   1     0.427     2226
2 Medium    1.000   0.635 0.777    25713
3 Low       0.0113  0.907 0.0224      43

lda_f1$macro_f1

[1] 0.4087481

lda_f1$weighted_f1

[1] 0.7478752

Decision Tree Model

set.seed(111)
tree_model <- caret::train(priority_formula,
                           data = train_data,
                           method = "rpart",
                           trControl = ctrl,
                           tuneLength = 10,
                           metric = "Accuracy")

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.

tree_model

CART 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  cp           logLoss    AUC        prAUC       Accuracy   Kappa     
  0.001481322  0.4354266  0.9147499  0.73971388  0.8467936  0.71523805
  0.001803349  0.4378278  0.9142193  0.64646083  0.8444976  0.71072783
  0.002747960  0.4459648  0.9110985  0.50114303  0.8411028  0.70459271
  0.002919708  0.4579212  0.9078631  0.48411659  0.8388693  0.70118405
  0.004765994  0.4693555  0.9048111  0.46759224  0.8364482  0.69746297
  0.009102619  0.4776502  0.9021292  0.46846395  0.8313828  0.68922625
  0.013310434  0.4839012  0.9003058  0.45370560  0.8256116  0.67706074
  0.025611851  0.5712307  0.8317907  0.35802463  0.7985152  0.60592765
  0.031129240  0.5849573  0.8223117  0.34124267  0.7910287  0.58511089
  0.405646200  0.8853990  0.5425396  0.01557607  0.6175079  0.09622994
  Mean_F1    Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value
  0.7900016  0.7668787         0.8968017         0.8232757          
  0.7879582  0.7644492         0.8950686         0.8218208          
  0.7850300  0.7618582         0.8931263         0.8185093          
  0.7832358  0.7611753         0.8925148         0.8151634          
  0.7814110  0.7604512         0.8917266         0.8119275          
  0.7757190  0.7590782         0.8897687         0.7986896          
  0.7605247  0.7421887         0.8855146         0.7863430          
  0.6992291  0.6502300         0.8569148         0.7600769          
  0.6768049  0.6190085         0.8482932         0.7448422          
        NaN  0.3740659         0.6950683               NaN          
  Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall  Mean_Detection_Rate
  0.9122454            0.8232757       0.7668787    0.2822645          
  0.9109123            0.8218208       0.7644492    0.2814992          
  0.9085086            0.8185093       0.7618582    0.2803676          
  0.9065078            0.8151634       0.7611753    0.2796231          
  0.9043786            0.8119275       0.7604512    0.2788161          
  0.9012861            0.7986896       0.7590782    0.2771276          
  0.8987560            0.7863430       0.7421887    0.2752039          
  0.8922560            0.7600769       0.6502300    0.2661717          
  0.8915336            0.7448422       0.6190085    0.2636762          
  0.8812209                  NaN       0.3740659    0.2058360          
  Mean_Balanced_Accuracy
  0.8318402             
  0.8297589             
  0.8274922             
  0.8268450             
  0.8260889             
  0.8244234             
  0.8138517             
  0.7535724             
  0.7336509             
  0.5345671             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.001481322.

tree_preds <- predict(tree_model, newdata = test_data)

tree_cm <- caret::confusionMatrix(tree_preds, test_data$next_priority)

tree_cm

Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6804    698    45
    Medium  1359  14983  1446
    Low       37    657  1953

Overall Statistics
                                               
               Accuracy : 0.8484               
                 95% CI : (0.8441, 0.8526)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.7183               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.8298        0.9171    0.56707
Specificity               0.9624        0.7591    0.97172
Pos Pred Value            0.9016        0.8423    0.73782
Neg Pred Value            0.9317        0.8671    0.94115
Prevalence                0.2930        0.5839    0.12308
Detection Rate            0.2432        0.5355    0.06979
Detection Prevalence      0.2697        0.6357    0.09460
Balanced Accuracy         0.8961        0.8381    0.76940

tree_f1 <- multiclass_f1(tree_cm$table)
tree_f1$per_class

# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.830  0.902 0.864    7547
2 Medium     0.917  0.842 0.878   17788
3 Low        0.567  0.738 0.641    2647

tree_f1$macro_f1

[1] 0.7945125

tree_f1$weighted_f1

[1] 0.8519378

Random Forest Model

set.seed(111)
rf_model <- caret::train(priority_formula,
                         data = train_data,
                         method = "rf",
                         trControl = ctrl,
                         tuneLength = 5,
                         metric = "Accuracy",
                         importance = TRUE)

rf_model

Random Forest 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  mtry  logLoss      AUC        prAUC       Accuracy   Kappa      Mean_F1  
  2     0.192744639  0.9898836  0.90367954  0.9043722  0.8206985  0.8690667
  3     0.038045728  0.9998340  0.72523216  0.9978648  0.9961729  0.9969848
  5     0.007977311  0.9998431  0.20016075  0.9982043  0.9967822  0.9975785
  7     0.007635474  0.9998474  0.06603375  0.9981775  0.9967341  0.9975610
  9     0.007561591  0.9998613  0.06039125  0.9981686  0.9967181  0.9975514
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.8371475         0.9292035         0.9146889            0.9541546          
  0.9964763         0.9987484         0.9974983            0.9988661          
  0.9974036         0.9990141         0.9977556            0.9990055          
  0.9973322         0.9989843         0.9977925            0.9989873          
  0.9973079         0.9989713         0.9977974            0.9989839          
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.9146889       0.8371475    0.3014574            0.8831755             
  0.9974983       0.9964763    0.3326216            0.9976124             
  0.9977556       0.9974036    0.3327348            0.9982088             
  0.9977925       0.9973322    0.3327258            0.9981582             
  0.9977974       0.9973079    0.3327229            0.9981396             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 5.

rf_preds <- predict(rf_model, newdata = test_data)

rf_cm <- caret::confusionMatrix(rf_preds, test_data$next_priority)
rf_cm

Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    8185     11    10
    Medium     9  16327     3
    Low        6      0  3431

Overall Statistics
                                             
               Accuracy : 0.9986             
                 95% CI : (0.9981, 0.999)    
    No Information Rate : 0.5839             
    P-Value [Acc > NIR] : <0.0000000000000002
                                             
                  Kappa : 0.9975             
                                             
 Mcnemar's Test P-Value : 0.2407             

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.9982        0.9993     0.9962
Specificity               0.9989        0.9990     0.9998
Pos Pred Value            0.9974        0.9993     0.9983
Neg Pred Value            0.9992        0.9991     0.9995
Prevalence                0.2930        0.5839     0.1231
Detection Rate            0.2925        0.5835     0.1226
Detection Prevalence      0.2933        0.5839     0.1228
Balanced Accuracy         0.9986        0.9991     0.9980

rf_f1 <- multiclass_f1(rf_cm$table)
rf_f1$per_class

# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.998  0.997 0.998    8206
2 Medium     0.999  0.999 0.999   16339
3 Low        0.996  0.998 0.997    3437

rf_f1$macro_f1

[1] 0.9981135

rf_f1$weighted_f1

[1] 0.9986063

K Nearest Neighbors Model

set.seed(111)
knn_model <- caret::train(priority_formula,
                          data = train_data,
                          method = "knn",
                          trControl = ctrl,
                          preProcess = c("center","scale"),
                          tuneLength = 10,
                          metric = "Accuracy")

knn_model

k-Nearest Neighbors 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

Pre-processing: centered (9), scaled (9) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  k   logLoss    AUC        prAUC      Accuracy   Kappa      Mean_F1  
   5  0.7340195  0.9507491  0.3189370  0.8413619  0.7112888  0.7934107
   7  0.6273057  0.9496576  0.3826737  0.8439348  0.7142589  0.7947029
   9  0.5654628  0.9479928  0.4297978  0.8460700  0.7165959  0.7957863
  11  0.5405543  0.9463873  0.4655751  0.8444083  0.7126375  0.7922833
  13  0.5189061  0.9450991  0.4930885  0.8417639  0.7073103  0.7882549
  15  0.5029809  0.9438835  0.5153264  0.8394679  0.7026519  0.7853978
  17  0.4925583  0.9426782  0.5330345  0.8385120  0.7003138  0.7836291
  19  0.4808587  0.9419760  0.5494355  0.8378241  0.6985476  0.7821388
  21  0.4740448  0.9409573  0.5622290  0.8372523  0.6970106  0.7806741
  23  0.4643513  0.9402343  0.5728135  0.8358050  0.6940929  0.7780040
  Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
  0.7831788         0.8991537         0.8052842            0.9051869          
  0.7807881         0.8990267         0.8116760            0.9076913          
  0.7785105         0.8986980         0.8178480            0.9099250          
  0.7733203         0.8968131         0.8171109            0.9094447          
  0.7686327         0.8947709         0.8142771            0.9080061          
  0.7651757         0.8929018         0.8125122            0.9067140          
  0.7625163         0.8917729         0.8123252            0.9064704          
  0.7602521         0.8908968         0.8122101            0.9063856          
  0.7579942         0.8901283         0.8123072            0.9063424          
  0.7550832         0.8890861         0.8102218            0.9055729          
  Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.8052842       0.7831788    0.2804540            0.8411663             
  0.8116760       0.7807881    0.2813116            0.8399074             
  0.8178480       0.7785105    0.2820233            0.8386043             
  0.8171109       0.7733203    0.2814694            0.8350667             
  0.8142771       0.7686327    0.2805880            0.8317018             
  0.8125122       0.7651757    0.2798226            0.8290387             
  0.8123252       0.7625163    0.2795040            0.8271446             
  0.8122101       0.7602521    0.2792747            0.8255744             
  0.8123072       0.7579942    0.2790841            0.8240613             
  0.8102218       0.7550832    0.2786017            0.8220846             

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.

knn_preds <- predict(knn_model, newdata = test_data)

knn_cm <- caret::confusionMatrix(knn_preds, test_data$next_priority)

knn_cm

Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6929    783    49
    Medium  1229  14758  1250
    Low       42    797  2145

Overall Statistics
                                               
               Accuracy : 0.8517               
                 95% CI : (0.8475, 0.8558)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.7283               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.8450        0.9033    0.62282
Specificity               0.9579        0.7871    0.96581
Pos Pred Value            0.8928        0.8562    0.71883
Neg Pred Value            0.9371        0.8530    0.94804
Prevalence                0.2930        0.5839    0.12308
Detection Rate            0.2476        0.5274    0.07666
Detection Prevalence      0.2774        0.6160    0.10664
Balanced Accuracy         0.9015        0.8452    0.79432

knn_f1 <- multiclass_f1(knn_cm$table)
knn_f1$per_class

# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.845  0.893 0.868    7761
2 Medium     0.903  0.856 0.879   17237
3 Low        0.623  0.719 0.667    2984

knn_f1$macro_f1

[1] 0.8049135

knn_f1$weighted_f1

[1] 0.8535158

Multinomial Model (updated)

set.seed(111)
multinom_model <- nnet::multinom(priority_formula,
                                 data = train_data,
                                 MaxNWts = 20000,
                                 trace = FALSE)

multinom_preds <- predict(multinom_model, newdata = test_data)

multinom_cm <- caret::confusionMatrix(data = multinom_preds,
                                      reference = test_data$next_priority)

multinom_cm

Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6508    588    26
    Medium  1668  15358  2094
    Low       24    392  1324

Overall Statistics
                                               
               Accuracy : 0.8287               
                 95% CI : (0.8243, 0.8331)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.6699               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.7937        0.9400    0.38444
Specificity               0.9690        0.6769    0.98305
Pos Pred Value            0.9138        0.8032    0.76092
Neg Pred Value            0.9189        0.8894    0.91921
Prevalence                0.2930        0.5839    0.12308
Detection Rate            0.2326        0.5489    0.04732
Detection Prevalence      0.2545        0.6833    0.06218
Balanced Accuracy         0.8813        0.8085    0.68374

multinom_f1 <- multiclass_f1(multinom_cm$table)

multinom_f1$per_class

# A tibble: 3 × 5
  class  precision recall    f1 support
  <chr>      <dbl>  <dbl> <dbl>   <int>
1 High       0.794  0.914 0.849    7122
2 Medium     0.940  0.803 0.866   19120
3 Low        0.384  0.761 0.511    1740

multinom_f1$macro_f1

[1] 0.7421881

multinom_f1$weighted_f1

[1] 0.8398932

Artificial Neural Networks Model

set.seed(111)
ann_model <- caret::train(priority_formula,
                          data = train_data,
                          method = "avNNet",
                          trControl = ctrl,
                          preProcess = c("center", "scale"),
                          tuneLength = 5,
                          metric = "Accuracy",
                          trace = FALSE)

Warning: executing %dopar% sequentially: no parallel backend registered

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.

ann_model

Model Averaged Neural Network 

111934 samples
     7 predictor
     3 classes: 'High', 'Medium', 'Low' 

Pre-processing: centered (9), scaled (9) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547 
Resampling results across tuning parameters:

  size  decay   logLoss    AUC        prAUC      Accuracy   Kappa    
  1     0.0000  0.6176581  0.9049980  0.7546039  0.7458059  0.4474598
  1     0.0001  0.5613458  0.9058845  0.7469962  0.7919488  0.5729839
  1     0.0010  0.5179014  0.9056283  0.7533280  0.7926278  0.5761921
  1     0.0100  0.5072165  0.9035008  0.7496246  0.7930119  0.5773725
  1     0.1000  0.5016587  0.9018635  0.7536386  0.7927797  0.5770050
  3     0.0000  0.7779042  0.9227665  0.8014519  0.7016731  0.3239037
  3     0.0001  0.4715130  0.9257318  0.8211841  0.8036163  0.6055512
  3     0.0010  0.4885967  0.9232982  0.8111446  0.8055907  0.6091597
  3     0.0100  0.4562472  0.9246343  0.8190856  0.8121751  0.6289616
  3     0.1000  0.4399294  0.9268412  0.8210443  0.8240034  0.6635185
  5     0.0000  0.5715859  0.9260465  0.8235875  0.7165818  0.3644351
  5     0.0001  0.5326066  0.9238333  0.8155376  0.7923240  0.5731236
  5     0.0010  0.4790989  0.9261988  0.8219890  0.8016866  0.5985397
  5     0.0100  0.4879355  0.9155957  0.8094544  0.8037506  0.6057537
  5     0.1000  0.4386123  0.9268583  0.8222660  0.8248075  0.6631698
  7     0.0000  3.1272382  0.8315843  0.5388359  0.7161083  0.3620200
  7     0.0001  0.5058773  0.9254289  0.8214351  0.7961834  0.5849631
  7     0.0010  0.4741548  0.9259822  0.8228672  0.8089590  0.6181997
  7     0.0100  0.4845290  0.9239439  0.8194451  0.8052781  0.6085089
  7     0.1000  0.4377404  0.9269499  0.8242323  0.8213230  0.6545071
  9     0.0000  0.5902360  0.9279246  0.8266170  0.7093113  0.3456493
  9     0.0001  0.4723693  0.9268482  0.8257063  0.8055371  0.6097869
  9     0.0010  0.5274488  0.9250983  0.8211098  0.8030892  0.6026031
  9     0.0100  0.5541231  0.9246532  0.8189026  0.7914396  0.5699787
  9     0.1000  0.4375539  0.9264588  0.8237556  0.8257186  0.6643101
  Mean_F1    Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value
        NaN  0.5253518         0.8010929               NaN          
        NaN  0.5801674         0.8394973               NaN          
        NaN  0.5827158         0.8410891               NaN          
        NaN  0.5834969         0.8415969               NaN          
        NaN  0.5834047         0.8415109               NaN          
  0.5668880  0.4702489         0.7626250         0.8994180          
  0.7272789  0.6265631         0.8523603         0.8104085          
  0.7478253  0.6317033         0.8532679         0.8163673          
  0.6988604  0.6520406         0.8610782         0.8146565          
  0.7354400  0.7096176         0.8762062         0.8057536          
  0.6356535  0.5070706         0.7776899         0.8415329          
  0.6385979  0.5912921         0.8389422         0.8627749          
  0.6854471  0.6134490         0.8490263         0.8235319          
  0.6640821  0.6214594         0.8521732         0.8095659          
  0.7338529  0.7025485         0.8749126         0.8137255          
  0.6859209  0.5077951         0.7763796         0.8269554          
  0.6463849  0.6005616         0.8442247         0.8200272          
  0.6598055  0.6398534         0.8560602         0.8322109          
  0.6872903  0.6263372         0.8525618         0.8203071          
  0.7142122  0.6935852         0.8720959         0.8166047          
  0.6545936  0.4972616         0.7710541         0.8058017          
  0.6580863  0.6270549         0.8531208         0.8356631          
  0.7285627  0.6225377         0.8506885         0.8240830          
  0.5823591  0.5802767         0.8378046         0.8617526          
  0.7414287  0.7069993         0.8745436         0.8204610          
  Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall  Mean_Detection_Rate
  0.9033792                  NaN       0.5253518    0.2486020          
  0.9053240                  NaN       0.5801674    0.2639829          
  0.9029975                  NaN       0.5827158    0.2642093          
  0.9027354                  NaN       0.5834969    0.2643373          
  0.9023622                  NaN       0.5834047    0.2642599          
  0.9093985            0.8994180       0.4702489    0.2338910          
  0.9055395            0.8104085       0.6265631    0.2678721          
  0.9074200            0.8163673       0.6317033    0.2685302          
  0.9046267            0.8146565       0.6520406    0.2707250          
  0.9044323            0.8057536       0.7096176    0.2746678          
  0.9074934            0.8415329       0.5070706    0.2388606          
  0.9077125            0.8627749       0.5912921    0.2641080          
  0.9065712            0.8235319       0.6134490    0.2672289          
  0.9046238            0.8095659       0.6214594    0.2679169          
  0.9060151            0.8137255       0.7025485    0.2749358          
  0.9112310            0.8269554       0.5077951    0.2387028          
  0.9053056            0.8200272       0.6005616    0.2653945          
  0.9077764            0.8322109       0.6398534    0.2696530          
  0.9067044            0.8203071       0.6263372    0.2684260          
  0.9048774            0.8166047       0.6935852    0.2737743          
  0.9056787            0.8058017       0.4972616    0.2364371          
  0.9056277            0.8356631       0.6270549    0.2685124          
  0.9062835            0.8240830       0.6225377    0.2676964          
  0.9083646            0.8617526       0.5802767    0.2638132          
  0.9073756            0.8204610       0.7069993    0.2752395          
  Mean_Balanced_Accuracy
  0.6632224             
  0.7098324             
  0.7119024             
  0.7125469             
  0.7124578             
  0.6164369             
  0.7394617             
  0.7424856             
  0.7565594             
  0.7929119             
  0.6423803             
  0.7151172             
  0.7312376             
  0.7368163             
  0.7887305             
  0.6420873             
  0.7223932             
  0.7479568             
  0.7394495             
  0.7828405             
  0.6341578             
  0.7400878             
  0.7366131             
  0.7090407             
  0.7907715             

Tuning parameter 'bag' was held constant at a value of FALSE
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 9, decay = 0.1 and bag = FALSE.

# Predictions on test set
ann_preds <- predict(ann_model, newdata = test_data)

# Confusion matrix
ann_cm <- caret::confusionMatrix(ann_preds, test_data$next_priority)

ann_cm

Confusion Matrix and Statistics

          Reference
Prediction  High Medium   Low
    High    6424    474    27
    Medium  1776  15864  3413
    Low        0      0     4

Overall Statistics
                                               
               Accuracy : 0.7967               
                 95% CI : (0.7919, 0.8014)     
    No Information Rate : 0.5839               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.5835               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022

Statistics by Class:

                     Class: High Class: Medium Class: Low
Sensitivity               0.7834        0.9710  0.0011614
Specificity               0.9747        0.5544  1.0000000
Pos Pred Value            0.9277        0.7535  1.0000000
Neg Pred Value            0.9157        0.9316  0.8770463
Prevalence                0.2930        0.5839  0.1230791
Detection Rate            0.2296        0.5669  0.0001429
Detection Prevalence      0.2475        0.7524  0.0001429
Balanced Accuracy         0.8790        0.7627  0.5005807

# F1 scores
ann_f1 <- multiclass_f1(ann_cm$table)

ann_f1$per_class

# A tibble: 3 × 5
  class  precision recall      f1 support
  <chr>      <dbl>  <dbl>   <dbl>   <int>
1 High     0.783    0.928 0.849      6925
2 Medium   0.971    0.754 0.849     21053
3 Low      0.00116  1     0.00232       4

ann_f1$macro_f1

[1] 0.5667737

ann_f1$weighted_f1

[1] 0.8486502

Model Results

#Creating a tibble to organize the resulting statistics from each model to compare
model_results <- tibble(model = c("Multinomial", "LDA", "Decision Tree", "Random Forest", "KNN", "ANN"),
                        accuracy = c(multinom_cm$overall["Accuracy"],
                                     lda_cm$overall["Accuracy"],
                                     tree_cm$overall["Accuracy"],
                                     rf_cm$overall["Accuracy"],
                                     knn_cm$overall["Accuracy"],
                                     ann_cm$overall["Accuracy"]),
                        macro_f1 = c(multinom_f1$macro_f1,
                                     lda_f1$macro_f1,
                                     tree_f1$macro_f1,
                                     rf_f1$macro_f1,
                                     knn_f1$macro_f1,
                                     ann_f1$macro_f1),
                        weighted_f1 = c(multinom_f1$weighted_f1,
                                        lda_f1$weighted_f1,
                                        tree_f1$weighted_f1,
                                        rf_f1$weighted_f1,
                                        knn_f1$weighted_f1,
                                        ann_f1$weighted_f1))

model_results

# A tibble: 6 × 4
  model         accuracy macro_f1 weighted_f1
  <chr>            <dbl>    <dbl>       <dbl>
1 Multinomial      0.829    0.742       0.840
2 LDA              0.665    0.409       0.748
3 Decision Tree    0.848    0.795       0.852
4 Random Forest    0.999    0.998       0.999
5 KNN              0.852    0.805       0.854
6 ANN              0.797    0.567       0.849

According to the tibble I created named model_results , the best model for predicting priority (when using the 7 mentioned variables to observe it) would be the Random Forest Model with an extremely high accuracy of 99.86% and a weighted F1 of 0.998 (very close to 1 which indicates high accuracy as well). So going forward I will only be utilizing the random forest model for the final step.

Inventory Management using the Random Forest Model

As it is becoming clear that inventory velocity appears to be unaffected by other categories, there are still other factors to consider such as:

How to prioritize what merchandise goes where when each individual storefront has its’ own parameters surrounding number of merchandise available to customers in store at any given moment?
How to handle already existing inventory that projects to move from one priority ranking to another?
How to handle new inventory?

For resolving this, I will start by listing the average number of SKUs each individual store holds in their storefront at one time based on present day.

Silver Spring (Store #2): 3546 SKUs
Clarksburg Village (Store #3): 4146 SKUs
Westbard (Store #4): 3773 SKUs
Wheaton (Store #5): 2878 SKUs
Kingsview (Store #6): 3454 SKUs
Flower (Store #7): 3386 SKUs
Burtonsville (Store #8): 3877 SKUs
Muddy Branch (Store #9): 4299 SKUs
Cloverly (Store #10): 3533 SKUs
Walnut Hill (Store #11): 3437 SKUs
Cabin John (Store #12): 3903 SKUs
Leisure World (Store #13): 3115 SKUs
Kensington (Store #14): 3630 SKUs
Potomac (Store #15): 3371 SKUs
Seneca Meadows (Store #16): 4060 SKUs
Goshen Crossing (Store #17): 3692 SKUs
Olney (Store #18): 2918 SKUs
Darnestown (Store #20): 3238 SKUs
Fallsgrove (Store #21): 2871 SKUs
White Oak Town Center (Store #22): 3118 SKUs
Hampden Lane (Store #23): 3404 SKUs
Montrose (Store #24): 3674 SKUs
King Farm (Store #26): 3341 SKUs
Downtown Rockville (Store #27): 4154 SKUs
Aspen Hill (Store #28): 4066 SKUs
Poolesville (Store #29): 1872 SKUs
Gaithersburg (Store #30): 3950 SKUs

As the size of the store no longer matters due to now looking specifically at the current average number of SKUs available in an individual storefront, it is best to only consider the ABS provided store classification group of financial performance (which is under the variable storetiercat). I have determined the following parameters for each tier (1 to 3) of store:

Tier 1 stores should be at minimum 55% high priority items, 30% medium priority items, and 15% low priority items
Tier 2 stores should be at minimum 75% high priority items, 20% medium priority items, and 5% low priority items
Tier 3 stores should be at minimum 85% high priority items, 15% medium priority items, and 0% low priority items

This is a general basis to start from that can always be modified based on ABS’ needs, however for the sake of making a baseline I will be referring to this outline. The reasoning for the outline is based on the idea that:

Tier 3 stores should not hold any merchandise that could be considered risky, prioritizing only merchandise that regularly perform either very well or generally well
Tier 2 stores should focus mainly on holding mostly high and medium priority merchandise, only considering low priority in these categories if the store has a high quantity of SKUs available
Tier 1 stores have more flexibility, giving more option to experiment with new inventory (which would theoretically start at low priority) along with keeping available select top performing low priority items as long as additional storage is available at a given store

First step is to build a prediction frame for the random forest model to use for predicting the inventory performance of the next year. (For this use case I will be using 2025 as the main data set does not have a complete record of that year to produce predictions for 2026). I will name this prediction_frame

# Take latest year per item/store as basis
prediction_frame <- model_data |>
  group_by(store, itemid) |>
  slice_tail(n = 1) |>  # use most recent stats
  ungroup() |>
  mutate(year = 2025)  # prediction year

Now that the information has been gathered, I will use my random forest model to predict the priority of items for 2025

prediction_frame$pred_priority <- predict(rf_model, newdata = prediction_frame)

Now, given that every store has a storefront SKU capacity, I need to create a new tibble which includes said information for each given store. I will name this sku_capacity

sku_capacity <- tribble(~store, ~sku_limit,
                        2, 3546,
                        3, 4146,
                        4, 3773,
                        5, 2878,
                        6, 3454,
                        7, 3386,
                        8, 3877,
                        9, 4299,
                        10, 3533,
                        11, 3437,
                        12, 3903,
                        13, 3115,
                        14, 3630,
                        15, 3371,
                        16, 4060,
                        17, 3692,
                        18, 2918,
                        20, 3238,
                        21, 2871,
                        22, 3118,
                        23, 3404,
                        24, 3674,
                        26, 3341,
                        27, 4154,
                        28, 4066,
                        29, 1872,
                        30, 3950)

Next, I will also be creating a tibble which includes all the rules mentioned previously regarding how I recommend organizing the stocking of items by priority for each store based on store tier. I will name this priority_mix

priority_mix <- tribble(~storetiercat, ~high_pct, ~med_pct, ~low_pct,
                        "1", 0.55, 0.30, 0.15,
                        "2", 0.75, 0.20, 0.05,
                        "3", 0.85, 0.15, 0.00)

Now to create an allocation function which will help with allocating the inventory based on priority, SKU count, and the rules created in priority_mix to help with producing a final table which provides all recommended inventory to stock for each store based on the model’s predictions.

allocate_items <- function(store_id) {
  
  store_df <- prediction_frame |> filter(store == store_id)
  tcat <- unique(store_df$storetiercat)
  scount <- sku_capacity |> filter(store == store_id) |> pull(sku_limit)
  mix <- priority_mix |> filter(storetiercat == tcat)

  high_needed   <- ceiling(scount * mix$high_pct)
  medium_needed <- ceiling(scount * mix$med_pct)
  low_needed    <- ceiling(scount * mix$low_pct)

  store_df <- store_df |>
    arrange(desc(n_transactions))  # best items first

  high_items <- store_df |> filter(pred_priority == "High") |> slice_head(n = high_needed)
  med_items  <- store_df |> filter(pred_priority == "Medium") |> slice_head(n = medium_needed)
  low_items  <- store_df |> filter(pred_priority == "Low") |> slice_head(n = low_needed)

  tibble(store = store_id,
         storetiercat = tcat,
         sku_limit = scount,
         high_needed = high_needed,
         medium_needed = medium_needed,
         low_needed = low_needed,
         recommended_high = list(high_items$itemid),
         recommended_medium = list(med_items$itemid),
         recommended_low = list(low_items$itemid),
         year = 2025)}

Finally, I will create the table named final_recommendtions which provides what I would recommend to be stocked in each store in 2025 based on the model’s predictions.

final_recommendations <- map_dfr(unique(sku_capacity$store), allocate_items)

Now, I would like to produce a csv file from final_recommendations which extracts all information provided in the table in a cleanly formatted spreadsheet. To do so, I will first need to create a function to reformat the data provided from final_recommendations to create one final tibble named final_recommendations_long which has all recommended priorities for all stores prepared to be written into a csv file.

clean_list <- function(vec){vec |> stringr::str_trim()}

final_recommendations_parsed <- final_recommendations |>
  mutate(recommended_high   = map(recommended_high,   clean_list),
         recommended_medium = map(recommended_medium, clean_list),
         recommended_low    = map(recommended_low,    clean_list))

final_recommendations_long <- bind_rows(final_recommendations_parsed |>
    dplyr::select(store, storetiercat, sku_limit, year, recommended_high) |>
    unnest_longer(recommended_high, values_to = "itemid") |>
    mutate(priority = "High"),
  
  final_recommendations_parsed |>
    dplyr::select(store, storetiercat, sku_limit, year, recommended_medium) |>
    unnest_longer(recommended_medium, values_to = "itemid") |>
    mutate(priority = "Medium"),
  
  final_recommendations_parsed |>
    dplyr::select(store, storetiercat, sku_limit, year, recommended_low) |>
    unnest_longer(recommended_low, values_to = "itemid") |>
    mutate(priority = "Low")) |>
  filter(itemid != "") |>
  arrange(store, match(priority, c("High","Medium","Low")), itemid)

Now that I have cleaned the data provided by the model in preparation for the final produced spreadsheet, I would like to also include a validation summary as well to help summarize the following:

Priority counts per store
Fill-rate vs. SKU limit
Compliance with Tier rules

This will help with reporting and also may be useful to the ABS team. I will be naming the tibble validation_table

priority_summary <- final_recommendations_long |>
  count(store, storetiercat, priority, name = "priority_count") |>
  pivot_wider(
    names_from = priority,
    values_from = priority_count,
    values_fill = 0)

priority_summary <- priority_summary |>
  left_join(
    final_recommendations |>
      dplyr::select(store, sku_limit, year),
    by = "store") |>
  mutate(total_recommended = High + Medium + Low,
         fill_rate = total_recommended / sku_limit)

quota_rules <- tibble(storetiercat = factor(c("1", "2", "3"),
                                            levels = levels(priority_summary$storetiercat)),
                      min_high   = c(0.55, 0.75, 0.85),
                      min_medium = c(0.30, 0.20, 0.15),
                      min_low    = c(0.15, 0.05, 0.00))


priority_summary <- priority_summary |>
  left_join(quota_rules, by = "storetiercat")

priority_summary <- priority_summary |>
  mutate(pct_high   = High / total_recommended,
         pct_medium = Medium / total_recommended,
         pct_low    = Low / total_recommended,
         meets_high   = pct_high   >= min_high,
         meets_medium = pct_medium >= min_medium,
         meets_low    = pct_low    >= min_low,
         quota_pass = meets_high & meets_medium & meets_low)

validation_table <- priority_summary |>
  dplyr::select(store, storetiercat, sku_limit, total_recommended, fill_rate, pct_high, pct_medium, pct_low, meets_high, meets_medium, meets_low, quota_pass) |>
  arrange(store)

Now that I have produced a prepared table, final_recommendations_long, along with a table to observe the quotas, validation_table, I would like to join the two and then produce a csv file as my final product for ABS.

First I need to join the two tables:

recommendations_with_validation <- final_recommendations_long |>
  left_join(validation_table |>
      dplyr::select(store, total_recommended, fill_rate, pct_high, pct_medium, pct_low, meets_high,  meets_medium, meets_low, quota_pass),
      by = "store")

And finally to produce the csv file:

readr::write_csv(recommendations_with_validation, "ABS_Item_Recommendations_2025.csv")

Final Remarks

As of current, this is the final produced product I can offer, however there are a number of concerns that would need to be addressed, and I would have addressed given more time:

The further I got into creating this product, the more I realized that my current ranking system for the priority of items is flawed. This becomes more apparent when creating a quota that the high priority items will never be able to fulfill for any store due to the limitations of the quantity. To resolve this I would:
- Change the definition of high priority items by changing it from the top 25% performing items organization wide to either top 33%, top 40%, or even consider letting a random forest model help with predicting the priority of items based on performance (which personally I’d recommend the most).
- Re calibrate the quotas I have created for all store tiers so that the amount of high priority items offer a more realistic percentage for each, making it more possible to fulfill.
Include the quantity of an item I would recommend to each individual store, using the random forest model to assist with predicting the best amount based on SKU limitations
Adjust the method that the model observes the data from a yearly perspective to a monthly perspective to get a more accurate perspective on inventory fulfillment.
- Possibly even make it so that the random forest model would produce a new recommended list before every inventory fulfillment cycle so that each individual store could determine what it would need in their inventory based on inventory performance.

Overall, I am certain that by using a random forest model to help with determining how Alcohol and Beverage Services of Montgomery County fulfill their inventory between all 27 locations, they will ultimately see more profit as there will be less inventory going to waste. My work is solely meant to lay the groundwork in creating the inventory ranking system, while also acknowledging how it can be improved in the future.