library(tidyverse)
library(DBI)
library(RSQLite)
library(nnet)
library(caret)ABS Buying Patterns Project
Purpose
To determine a better methodology for observing the buying patterns between different locations, I will be observing the inventory of all locations for the following:
Trends in inventory performance between all stores
Determining if the current methodology which Alcohol and Beverage Services of Montgomery County (or ABS for short) uses for observing buying patterns is the best method
Using models to optimize the inventory management process either through ABS’ current methodology, or a newly introduced model
To do so, I would like to observe the data set provided to me by ABS to locate any potential trends, determine best practices, and also engineer new variables if needed for modeling purposes.
Stores to be Observed
For my observations, I will be researching information for the following locations:
- Aspen Hill (store 28)
- Burtonsville (store 8)
- Cabin John (store 12)
- Clarksburg Village (store 3)
- Cloverly (store 10)
- Darnestown (store 20)
- Downtown Rockville (store 27)
- Fallsgrove (store 21)
- Flower (store 7)
- Gaithersburg Square (store 30)
- Goshen Crossing (store 17)
- Hampden Lane (store 23)
- Kensington (store 14)
- King Farm (store 26)
- Kingsview (store 6)
- Leisure World (store 13)
- Montrose (store 24)
- Muddy Branch (store 9)
- Olney (store 18)
- Poolesville (store 29)
- Potomac (store 15)
- Seneca Meadows (store 16)
- Silver Spring (store 2)
- Walnut Hill (store 11)
- Westbard (store 4)
- Wheaton (store 5)
- White Oak Town Center (store 22)
It is also important to note that there are 3 size designations for all 27 stores (Size A, B, and C) which is used to categorize the stores by their corresponding size:
- Size A is for stores with greater than 6000 sq ft
- Size B is for stores between 4000 and 6000 sq ft
- Size C is for stores below 4000 sq ft
These designations are also categorized by their sales performance (based on a scale created in 2022) from tier 1 to 3 where:
- Tier 1 is for stores performing greater than $8 million in the SY
- Tier 2 is for stores performing between $5 and $8 million in the SY
- Tier 3 is for stores who perform less than $5 million in the SY
So all 27 stores are identified by both a Size and Tier. I have listed below every store with their corresponding Tiers and Rank in descending order:
Size ‘A’:
- Seneca Meadows (Store #16) A1
- Muddy Branch (Store #09) A1
- Damestown (Store #20) A1
- Leisure World (Store #13) A2
- Kingsview (Store #06) A2
- Goshen Crossing (Store #17) A2
- Downtown Rockville (Store #27) A2
- Clarksburg Village (Store #03) A2
- Aspen Hill (Store #28) A2
- Gaithersburg (Store #30) A3
- Cloverly (Store #10) A3
Size ‘B’:
- Hampden Lane (Store #23) B1
- Kensington (Store #14) B1
- Montrose (Store #24) B1
- Westbard (Store #04) B1
- Silver Spring (Store #02) B2
- Walnut Hill (Store #11) B2
- Burtonsvile (Store #08) B3
- Cabin John (Store #12) B3
- White Oak Town Center (Store #22) B3
Size ‘C’:
- Potomac (Store #15) C1
- Olney (Store #18) C2
- Wheaton (Store #05) C2
- Fallsgrove (Store #21) C3
- Flower (Store #07) C3
- King Farm (Store #26) C3
- Poolesville (Store #29) C3
About the Data Set
There are a total of 18 variables and roughly 12 million observations provided in the transactions data set provided by ABS, which I will provide a breakdown of below. However, there will also be additional variables that I will also include for this data set and will be using throughout this project which will be mentioned here.
Variables originally included in the data set:
- transdate: This is the physical date that the transaction took place. ABS has provided transactions starting from July 1, 2023 to June 30, 2025 (roughly 2 years worth of data)
- transactionid: This is a unique identifier for each individual transaction (likely to not be used for this project)
- store: This is the store number where the transaction took place
- storename: This is the formatted store name where the transaction took place
- itemid: This is the unique identifier for products. Item dimension attributes do not change. If an item’s pack unit or bottles per case changes a new code is created.
- description: This is the product description
- netamount: This is the total cost of the line item / product purchased (discounts included) without tax
- lineqty: This is quantity sold of each individual line item in a transaction. So if 3 bottles of wine were sold, this would be 3. If 2 6-packs were sold, this would be 2
- packunit: This is the selling unit for the line item. This is the unit as how the product was sold. This would either be Btl (single bottle) or 4pk, 6pk, etc. Some lines may only show 04 or 06 and this relates to 4pk or 6pk. This is more to show the calculation of the totalqty column, which represents how many actual bottles were sold to a customer
- totalqty: This is the total quantity of items sold for that line item. (Line quantity * pack unit)
- itemtag: This is the warehouse designation inventory tag. Some items may be NULL as tags are updated on a daily basis and at the time this data set was taken offline, it may have converted some TAGS to NULL. (note that I will only be observing itemtags that identify as “ST”, which is Standard Stock)
- tagdesc: This is the formatted description of the warehouse inventory tags
- classificationdepartment: This is the department the product belongs too (Level 1). The only departments are BEER, WINE, LIQ, MISC.
- classifictiontype – This is the group type the product belongs too (Level 2). This references the type of product within the department. i.e.: LIQ > WHISKEY
- classificationcategory – This is the group category the product belongs too (Level 3). This further references the category within the type of items. i.e.: LIQ > WHISKEY > IRISH WHISKEY.
- bottlespercase – This is the total number of bottles in a case. This is important for our warehouse partners when calculating overall case volume of an item. totalqty / bottlespercase = Case(s) sold
- size: This is the size of the product. There are standard sizes of many products and often one size of a product will outperform another for a variety of reasons. i.e. 750 mL vs 1.75 L
- custaccount: This designates if the transaction was for a retail/public customer or a licensee (This will only be used to filter out all licensee transactions as I will only be observing public customers)
Variables added into the data set:
- transdate2: This is used as a separately formatted date variable for visualization purposes upon cleaning the data set
- year: Another created variable used for observational purposes to organize the data set specifically by year
- storesizecat: This is added for observing the data by the store size categorization created by ABS, only “A”, “B”, or “C” will populate the rows corresponding to its respective store in a given transaction
- storetiercat: This is added for observing the data by the stores financial performance based on ABS’ parameters, only showing “1”, “2”, or “3” in the rows corresponding to its respective store in a given transaction
- cost: raw cost of a given transaction (not including tax) based on netamount / lineqty
Data Preparation
Before beginning my analysis, I will be preparing my libraries, loading in my data set, and also doing all necessary cleaning below.
abs_observed <- read_csv("TransactionData_UTF8.txt")#Setup for the storesizecat and storetiercat variables
stores_A <- c(03, 06, 09, 10, 13, 16, 17, 20, 27, 28, 30)
stores_B <- c(02, 04, 08, 11, 12, 14, 22, 23, 24)
stores_C <- c(05, 07, 15, 18, 21, 26, 29)
tier1 <- c(04, 09, 14, 15, 16, 20, 23, 24)
tier2 <- c(02, 03, 05, 06, 11, 13, 17, 18, 27, 28)
tier3 <- c(07, 08, 10, 12, 21, 22, 26, 29, 30)
names(abs_observed) <- tolower(names(abs_observed)) #making all variables lowercase
#cleaning the store variable so that R recognizes it as intended
abs_observed$store <- as.character(as.integer(abs_observed$store))
abs_observed$store <- as.numeric(abs_observed$store)
abs_observed <- abs_observed |>
filter(custaccount != "Licensee") |> #removing licensee transactions
filter(itemtag == "ST") |> #removing all non-Standard stock transactions
mutate(storesizecat = case_when(
store %in% stores_A ~ "A",
store %in% stores_B ~ "B",
store %in% stores_C ~ "C",
TRUE ~ NA_character_)) |>
mutate(storetiercat = case_when(
store %in% tier1 ~ "1",
store %in% tier2 ~ "2",
store %in% tier3 ~ "3")) |>
filter(lineqty > 0 ) |> # filtering out transactions that were either voided or refunded
filter(netamount > 0) #filtering out transactions that were either voided or refunded
# Creating cost, transdate2, and year variables
abs_observed$cost <- abs_observed$netamount / abs_observed$lineqty
abs_observed$transdate2 <- as.Date(abs_observed$transdate)
abs_observed$year <- format(abs_observed$transdate2, "%Y")options(scipen = 999) #converting all visualizations from scientific notation to standard (as it likely will show up in scientific notation)Observing the Data Set by the Three Levels of Classifications
There are three variables mentioned earlier (classificationdepartment, classifcationtype, and classificationcategory) that I would like to start with observing, as I would like to see the distribution of transactions based on those levels for all 27 locations. This should help give me a general idea of how different inventory are performing between the locations to determine if there are any notable differences.
to do so, I will create a new tibble named abs_classdeptcount where I will include a new variable count which counts the frequency of the corresponding classification department transactions in the main data set
abs_classdeptcount <- abs_observed |>
group_by(store, classificationdepartment) |>
summarise(count = n(), .groups = "drop") |>
arrange(store)ggplot(abs_classdeptcount, aes(x = store, y = count, fill = classificationdepartment)) +
geom_col() +
labs(title = "Inventory Performance of All Stores by Classification Dept (2023-2025)",
x = "Store",
y = "Frequency",
fill = "Classification Department",
caption = "Source: Montgomery County ABS") +
theme_bw() +
coord_flip() +
scale_x_continuous(breaks = seq(min(abs_classdeptcount$store), max(abs_classdeptcount$store), by = 1)) #To show ticks for every storeFrom this it is clear that the liquor classification department dominates among all stores, with wine coming in second, beer in third, and miscellaneous last. Though this is not a major surprise, it does give a good basis on how generally every store (for the most part) follows the same popularity distribution of liquor > Wine > Beer > Misc. There are a few outliers, like store 29 (Poolesville), which appears to only hold liquor sales. This may speak to why it is on the lowest scale of ABS’ method of observing buying patterns (that being that this store is catagorized as C3). Or it could be that due to the store’s previous low performance with inventory of other departments ABS has only made liquor available in said location. It is possible, though, that with further observation of the stores by classification will give more context.
Before doing any deeper research, I would like to see the performance of the top two performing classification departments: Liquor and Wine, both individually for all 27 locations and collectively to see if the top two performing departments share the same distribution patterns on the individual and collective level. This is to determine if there are any other anomalies among the stores that need to be noted. This will be done by creating another two tibbles, abs_liquor and abs_wine, where each will also include the count variable to accurately count the frequency for all transactions.
abs_liquor <- abs_observed |>
group_by(store, classificationdepartment, classificationtype) |>
filter(classificationdepartment == "LIQ") |>
summarise(count = n(), .groups = "drop") |>
arrange(store)
abs_wine <- abs_observed |>
group_by(store, classificationdepartment, classificationtype) |>
filter(classificationdepartment == "WINE") |>
summarise(count = n(), .groups = "drop") |>
arrange(store)ABS Liquor Distribution Results
abs_liquor |>
filter(store == 2) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Silver Spring Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 3) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Clarksburg Village Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 4) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Westbard Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 5) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Wheaton Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 6) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Kingsview Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 7) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Flower Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 8) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Burtonsville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 9) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Muddy Branch Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 10) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Cloverly Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 11) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Walnut Hill Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 12) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Cabin John Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 13) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Leisure World Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 14) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Kensington Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 15) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Potomac Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 16) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Seneca Meadows Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 17) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Goshen Crossing Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 18) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Olney Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 20) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Darnestown Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 21) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Fallsgrove Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 22) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "White Oak Town Center Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 23) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Hampden Lane Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 24) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Montrose Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 26) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "King Farm Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 27) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Downtown Rockville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 28) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Aspen Hill Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 29) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Poolesville Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_liquor |>
filter(store == 30) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Gaithersburg Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))ggplot(abs_liquor, aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "ABS Liquor Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))Generally, when observing liquor by classification type, the distributions all seem to be about the same, with slight differentiation. Thus it is safe to say that there is very little variation between all stores for the top performing classification department.
ABS Wine Distribution Results
abs_wine |>
filter(store == 2) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Silver Spring Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 3) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Clarksburg Village WIne Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 4) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Westbard Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 5) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Wheaton Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 6) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Kingsview Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 7) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Flower Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 8) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Burtonsville Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 9) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Muddy Branch Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 10) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Cloverly Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 11) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Walnut Hill Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 12) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Cabin John Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 13) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Leisure World Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 14) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Kensington Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 15) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Potomac Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 16) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Seneca Meadows Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 17) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Goshen Crossing Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 18) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Olney Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 20) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Darnestown Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 21) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Fallsgrove Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 22) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "White Oak Town Center Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 23) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Hampden Lane Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 24) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Montrose Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 26) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "King Farm Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 27) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Downtown Rockville Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 28) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Aspen Hill Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 29) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Poolesville Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 7.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))abs_wine |>
filter(store == 30) |>
ggplot(aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "Gaithersburg Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 5.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))ggplot(abs_wine |>
filter(count > 5000), aes(x = classificationtype, y = count)) +
geom_col(position = "dodge") +
labs(x = "Classification Type",
y = "Frequency",
caption = "Source: Montgomery County ABS",
title = "ABS Wine Purchase Frequency from 2023-2025 by Classification Type") +
theme(axis.text.x = element_text(angle = 90, size = 6.25, vjust = 0.50, hjust = 1)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 0))The distribution of the wine department by classification type definitely shows a variation based on location. When creating the visualizations I quickly realized that the general distribution of all stores collectively showed a large variety of options by classification type, resulting in a very messy visualization (as you can see from observing the visualizations of each individual store). To fix this, I decided to only show the top 13 performing options by filtering the count to show only options that appear more than 5000 times in the abs_observed data set. Outside of that, the biggest outlier is still poolesville, as they only offer sake, which performs very poorly (only being purchased roughly 25 times between 2023-2025).
Due to the wide variety of performance for the wine department, it is clear that not all departments are equal, and thus more observations are needed to determine how ABS’ inventory is performing in all locations.
Using SQL Queries to observe the stores by size and classification
Now I will be using SQL queries to help identify specifics regarding performance in all stores by all three classification levels. To make it easier to observe the results, I will be categorizing the results using the store size method that ABS uses for observing their stores. This will be done by organizing three different tibbles named TierA_Stores, TierB_Stores, and TierC_Stores, which pulls all information from the main data set into the respective tibble based on the corresponding store’s size category (by using the storesizecat variable). I will then be creating a database for all relevant queries named abstier.db.
TierA_Stores <- abs_observed |>
dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
group_by(transdate) |>
filter(store %in% c(03, 06, 09, 10, 13, 16, 17, 20, 27, 28, 30))
TierB_Stores <- abs_observed |>
dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
group_by(transdate) |>
filter(store %in% c(02, 04, 08, 11, 12, 14, 22, 23, 24))
TierC_Stores <- abs_observed |>
dplyr::select(transdate, year, store, storename, itemid, description, netamount, lineqty, packunit, totalqty, itemtag, tagdesc, classificationdepartment, classificationtype, classificationcategory, bottlespercase, size, custaccount) |>
group_by(transdate) |>
filter(store %in% c(05, 07, 15, 18, 21, 26, 29))abstier.db <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(abstier.db, TierA_Stores)
copy_to(abstier.db, TierB_Stores)
copy_to(abstier.db, TierC_Stores)
copy_to(abstier.db, abs_observed) # Added the main data set just in caseTier A Stores
First this is the performance by Classification Department:
SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC| year | classificationdepartment | quantity |
|---|---|---|
| 2023 | LIQ | 711252 |
| 2023 | WINE | 357449 |
| 2023 | BEER | 51992 |
| 2023 | MISC | 17172 |
| 2024 | LIQ | 1393916 |
| 2024 | WINE | 627742 |
| 2024 | BEER | 158378 |
| 2024 | MISC | 38941 |
| 2025 | LIQ | 645536 |
| 2025 | WINE | 271803 |
Next by Classification Type:
SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC| year | classificationtype | quantity |
|---|---|---|
| 2023 | VODKA | 180113 |
| 2023 | TEQUILA | 84485 |
| 2023 | WHISKEY | 67596 |
| 2023 | BOURBON | 59527 |
| 2023 | CORDIALS | 53882 |
| 2023 | CABERNET SAUVIGNON | 52833 |
| 2023 | RUM | 51154 |
| 2023 | RED BLEND | 42182 |
| 2023 | CHARDONNAY | 41930 |
| 2023 | SCOTCH | 38718 |
Finally by Classification Category:
SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierA_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC| year | classificationcategory | quantity |
|---|---|---|
| 2023 | DOMESTIC VODKA | 117579 |
| 2023 | AMERICAN RED | 105677 |
| 2023 | TEQUILA | 81571 |
| 2023 | AMERICAN WHITE | 76593 |
| 2023 | STRAIGHT BOURBON WHISKEY | 56354 |
| 2023 | IMPORTED RUM | 48381 |
| 2023 | IMPORTED VODKA | 42311 |
| 2023 | IMPORTED CORDIALS | 35515 |
| 2023 | DOMESTIC VODKA FLAVORS | 35248 |
| 2023 | BLENDED WHISKEY | 32742 |
Tier B Stores
First this is the performance by Classification Department:
SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC| year | classificationdepartment | quantity |
|---|---|---|
| 2023 | LIQ | 548087 |
| 2023 | WINE | 332712 |
| 2023 | BEER | 47278 |
| 2023 | MISC | 15939 |
| 2024 | LIQ | 1051785 |
| 2024 | WINE | 545938 |
| 2024 | BEER | 135093 |
| 2024 | MISC | 38252 |
| 2025 | LIQ | 482284 |
| 2025 | WINE | 258699 |
Next by Classification Type:
SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC| year | classificationtype | quantity |
|---|---|---|
| 2023 | VODKA | 143294 |
| 2023 | TEQUILA | 63304 |
| 2023 | BOURBON | 49720 |
| 2023 | WHISKEY | 47040 |
| 2023 | CORDIALS | 44563 |
| 2023 | CABERNET SAUVIGNON | 42766 |
| 2023 | CHARDONNAY | 41574 |
| 2023 | SAUVIGNON BLANC | 36426 |
| 2023 | RUM | 36368 |
| 2023 | RED BLEND | 34551 |
Finally by Classification Category:
SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierB_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC| year | classificationcategory | quantity |
|---|---|---|
| 2023 | DOMESTIC VODKA | 97461 |
| 2023 | AMERICAN RED | 89257 |
| 2023 | AMERICAN WHITE | 72842 |
| 2023 | TEQUILA | 60804 |
| 2023 | STRAIGHT BOURBON WHISKEY | 47967 |
| 2023 | IMPORTED RUM | 34516 |
| 2023 | IMPORTED VODKA | 33468 |
| 2023 | IMPORTED CORDIALS | 32014 |
| 2023 | SPARKLING | 27497 |
| 2023 | BLENDED WHISKEY | 22969 |
Tier C Stores
First this is the performance by Classification Department:
SELECT year, classificationdepartment, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationdepartment
ORDER BY year, quantity DESC| year | classificationdepartment | quantity |
|---|---|---|
| 2023 | LIQ | 329124 |
| 2023 | WINE | 155622 |
| 2023 | BEER | 18487 |
| 2023 | MISC | 7638 |
| 2024 | LIQ | 649271 |
| 2024 | WINE | 270456 |
| 2024 | BEER | 57553 |
| 2024 | MISC | 18528 |
| 2025 | LIQ | 235017 |
| 2025 | WINE | 103997 |
Next by Classification Type:
SELECT year, classificationtype, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationtype
ORDER BY year, quantity DESC| year | classificationtype | quantity |
|---|---|---|
| 2023 | VODKA | 90682 |
| 2023 | TEQUILA | 35247 |
| 2023 | WHISKEY | 33792 |
| 2023 | BOURBON | 30237 |
| 2023 | CORDIALS | 24567 |
| 2023 | CABERNET SAUVIGNON | 21665 |
| 2023 | RUM | 21559 |
| 2023 | CHARDONNAY | 20113 |
| 2023 | SCOTCH | 17745 |
| 2023 | RED BLEND | 16432 |
Finally by Classification Category:
SELECT year, classificationcategory, COUNT(*) AS quantity
FROM TierC_Stores
GROUP BY year, classificationcategory
ORDER BY year, quantity DESC| year | classificationcategory | quantity |
|---|---|---|
| 2023 | DOMESTIC VODKA | 61392 |
| 2023 | AMERICAN RED | 42911 |
| 2023 | AMERICAN WHITE | 34710 |
| 2023 | TEQUILA | 33912 |
| 2023 | STRAIGHT BOURBON WHISKEY | 28977 |
| 2023 | IMPORTED VODKA | 21135 |
| 2023 | IMPORTED RUM | 20500 |
| 2023 | BLENDED WHISKEY | 17630 |
| 2023 | IMPORTED CORDIALS | 16807 |
| 2023 | DOMESTIC VODKA FLAVORS | 14104 |
Results Overview
Generally, all tiers appear to follow the same trends throughout all years and all levels of classifications. This further proves that inventory performance likely has little to do with store size as tier A, B, and C stores all share generally the same results from 2023-2025. To give a quick rundown of the results I will report the top 5 and lowest 5 performing items on all three levels of classification for all three tiers of store size:
Classification Department
- Tier A Stores
- Top to lowest performing (2023-2025)
- Liquor
- Wine
- Beer
- Miscellaneous
- Donations
- DLC (only in 2024 & 2025)
- Non Inventory
- Gift Card (only in 2023)
- Top to lowest performing (2023-2025)
- Tier B Stores
- Top to lowest performing (2023-2025)
- Liquor
- Wine
- Beer
- Misc
- Donations
- Non Inventory
- Gift Card (only in 2023)
- DLC (only in 2023 & 2025)
- Top to lowest performing (2023-2025)
- Tier C Stores
- Top to lowest performing (2023-2025)
- Liquor
- Wine
- Beer
- Misc
- Donations
- Non Inventory
- Gift Card (only in 2023)
- DLC (only in 2023 & 2025)
- Top to lowest performing (2023-2025)
Classification Type
- Tier A Stores
- Top 5 Performing:
- 2023
- Vodka
- NA(Values with no data)
- Tequila
- Whiskey
- Bourbon
- 2024
- Vodka
- Tequila
- Whiskey
- Bourbon
- Cordials
- 2025 (thus far)
- Vodka
- Tequila
- Whiskey
- Bourbon
- Cordials
- 2023
- Lowest 5 Performing:
- 2023
- Soave
- Pinotage
- St. Emillion
- Petite Syrah
- IPA
- 2024
- Petite Syrah
- Grenache
- Vermentino
- Bordeaux
- Agave
- 2025
- Bordeaux
- Torrontes
- Other South African Red
- Flavor Gin
- Pinotage
- (directly above is petite syrah, grenache, vermantino, and Agave)
- 2023
- Top 5 Performing:
- Tier B Stores
- Top 5 Performing:
- 2023
- Vodka
- NA(Values with no data)
- Tequila
- Bourbon
- Cordials
- 2024
- Vodka
- Tequila
- Bourbon
- Whiskey
- Cordials
- 2025
- Vodka
- Tequila
- Bourbon
- Whiskey
- Cordials
- 2023
- Lowest 5 Performing:
- 2023
- Liebfraumilch
- Bianco
- St. Emillion
- Grenache
- Torrontes
- 2024
- Vermentino
- Other South African Red
- Torrontes
- Madeira
- Bardolino
- 2025
- Bardolino
- Other South African Red
- Flavor Gin
- Grenache
- Primitivo
- 2023
- Top 5 Performing:
- Tier C Stores
- Top 5 Performing:
- 2023
- Vodka
- NA(Values with no data)
- Tequila
- Whiskey
- Bourbon
- 2024
- Vodka
- Tequila
- Whiskey
- Bourbon
- Cordials
- 2025
- Vodka
- Tequila
- Bourbon
- Whiskey
- Cordials
- 2023
- Lowest 5 Performing:
- 2023
- Soave
- Gamay
- Pinotage
- Miscast
- Bianco
- 2024
- Pinotage
- Torrontes
- Petite Syrah
- Bardolino
- Liebfraumilch
- 2025
- Vermentino
- Graves
- Bardolino
- Petite Syrah
- Grenache
- 2023
- Top 5 Performing:
Classification Category
- Tier A Stores
- Top 5 performing:
- 2023
- Domestic Vodka
- American Red
- Tequila
- American White
- Straight Bourbon Whiskey
- 2024
- Domestic Vodka
- American Red
- Tequila
- American White
- Straight Bourbon Whiskey
- 2025
- Domestic Vodka
- Tequila
- American Red
- American White
- Straight Bourbon Whiskey
- 2023
- Lowest 5 Performing:
- 2023
- French Red Wine
- 1% Mixers
- Malt Based RTD
- Indian Whiskey
- Greek
- 2024
- Indian Whiskey
- Sake
- Alsatian Wine
- Misc Agave Spirit
- Greek
- 2025
- Georgian Wine
- Greek
- Indian Whisky
- Grappa
- Armagnac
- 2023
- Top 5 performing:
- Tier B Stores
- Top 5 Performing:
- 2023
- Domestic Vodka
- American Red
- American White
- Tequila
- Straight Bourbon Whiskey
- 2024
- Domestic Vodka
- American Red
- Tequila
- American White
- Straight Bourbon Whiskey
- 2025
- Domestic Vodka
- American Red
- American White
- Tequila
- Straight Bourbon Whiskey
- 2023
- Lowest 5 Performing:
- 2023
- Store Supplies
- Grappa
- Indian Whisky
- Greek
- 1% Mixers
- 2024
- Wine Based Seltzers
- Grappa
- Greek
- Indian Whisky
- 1% Mixers
- 2025
- Greek
- Domestic Dessert Wine
- 1% Mixers
- Georgian Wine
- Armagnac
- 2023
- Top 5 Performing:
- Tier C Stores
- Top 5 Performance:
- 2023
- Domestic Vodka
- American Red
- Straight Bourbon Whiskey
- Tequila
- American White
- 2024
- Domestic Vodka
- American Red
- Tequila
- Straight Bourbon Whiskey
- American White
- 2025
- Domestic Vodka
- American Red
- Straight Bourbon Whiskey
- Tequila
- American White
- 2023
- Lowest 5 Performance:
- 2023
- 1% Mixers
- Indian Whisky
- Domestic Dessert Wine
- Store Supplies
- Grappa
- 2024
- Malt Based RTD
- Domestic Dessert Wine
- 1% Mixers
- Sake
- Grappa
- 2025
- Store Supplies
- Armagnac
- Misc Agave Spirit
- Georgian Wine
- 1% Mixers
- 2023
- Top 5 Performance:
Observing the Data Set by Item ID
Now that I have determined that the inventory generally performs the same among all stores, regardless of store size, it further proves that ABS’ buying patterns methodology can be upgraded. My idea is for ABS to observe the data by the itemid, specifically for standard stock to prevent the variable becoming unusable as when switching from one itemtag to another, ABS must change the itemid even if it is the same item, so as to have a separate item id for its new tag designation. This is also why it was necessary for me to filter out all results in the main data set to show only standard stock, as this is the variable I will be mainly focusing on for this project. To give an idea of how necessary filtering the data is, before filtering out all transactions that were not standard stock, there were over 12 million observations, now there are roughly 9.7 million. That means that approximately 3 million transactions occur with items under a different item tag, which could likely be the same item under the standard stock, but with a different item id.
To start off, I will be creating a new category for use to assist in my analysis named priority which I will further explain below:
- priority: This is an engineered variable for observing the itemid variable by frequency of transactions to determine if it is a “High”, “Medium”, or “Low” priority item. High priority merchandise are frequently purchased items in the upper 25 percentile of all transactions globally. Medium priority are items which populate the middle 50 percentile of all transactions. Low priority items populate the lower 25 percentile of all transactions. The percentile may be subject to changing according to ABS’ desired parameters, or other unexpected reasoning. (This variable may be subject to be moved to a tibble dedicated to observing itemid’s by priority)
For now I will be preparing other tibbles for creating a statistical model based on the variables lineqty and netamount to create the following variables:
- total_netamount: Created by taking the sum of netamount of one individual itemid collectively grouped by year and store
- total_lineqty: Created by taking the sum of lineqty of one individual itemid collectively grouped by year and store
- avg_cost: Created by taking the mean of netamount / lineqty of one individual itemid collectively grouped by year and store
- n_transactions: Created to count the frequency a given itemid appears in transactions per store and year
- priority: Refer to notes on variable above, created from total_netamount
These variables will be created in two separate tibbles. The first tibble, abs_var will be for creating majority of the variables, with the exception of priority. And the second tibble item_summary will be used for creating the priority variable for use from here on out.
#Updated to show priority for itemid globally for each year for predictive model
abs_var <- abs_observed |>
group_by(year, itemid, description) |>
summarise(total_netamount = sum(netamount, na.rm = TRUE),
total_lineqty = sum(lineqty, na.rm = TRUE),
avg_cost = mean(netamount / lineqty, na.rm = TRUE),
n_transactions = n(),
.groups = "drop")
item_summary <- abs_var |>
group_by(year) |>
mutate(q25 = quantile(n_transactions, 0.25, na.rm = TRUE),
q75 = quantile(n_transactions, 0.75, na.rm = TRUE),
priority = case_when(n_transactions >= q75 ~ "High",
n_transactions <= q25 ~ "Low",
TRUE ~ "Medium")) |>
ungroup() |>
mutate(priority = factor(priority, levels = c("High", "Medium", "Low")))
item_summary |> count(year, priority)# A tibble: 9 × 3
year priority n
<chr> <fct> <int>
1 2023 High 964
2 2023 Medium 1926
3 2023 Low 965
4 2024 High 730
5 2024 Medium 1452
6 2024 Low 731
7 2025 High 708
8 2025 Medium 1413
9 2025 Low 708
Modeling Item Trends from 2023-2025 by Priority
I now want to create a multinomial model based on the newly engineered priority variable which will use the information from the data set, separated by year, to both learn how to determine what priority any given item should be classified as, and also predict what priority an item will have for the following year based on the performance provided by transaction data. I will do this by creating a new tibble named model_data which collects all needed transaction information from item_summary along with two new variables included, ‘next_priority’ and next_year’, for predictions of item priority for the following year.
#Create lagged dataset to model transitions using 2023 and 2024 transactions (predicting next year's priority)
item_transitions <- item_summary |>
arrange(itemid, year) |>
group_by(itemid) |>
mutate(next_priority = lead(priority),
next_year = lead(year)) |>
filter(!is.na(next_priority)) |> # Removes 2025 year data as no data follows
ungroup()
# Join store characteristics for more context
model_data <- item_transitions |>
left_join(abs_observed |>
select(itemid, store, storesizecat, storetiercat) |>
distinct(), by = "itemid")priority_model <- nnet::multinom(next_priority ~ total_netamount + total_lineqty + avg_cost + n_transactions + storetiercat + storesizecat + year, data = model_data)# weights: 33 (20 variable)
initial value 153713.436982
iter 10 value 99370.097887
iter 20 value 83547.094840
iter 30 value 61582.220365
iter 40 value 61452.302486
iter 50 value 61426.708135
iter 60 value 61422.128167
final value 61419.782406
converged
summary(priority_model)Call:
nnet::multinom(formula = next_priority ~ total_netamount + total_lineqty +
avg_cost + n_transactions + storetiercat + storesizecat +
year, data = model_data)
Coefficients:
(Intercept) total_netamount total_lineqty avg_cost n_transactions
Medium 3.339898 -0.00001890630 0.0001452922 0.03281650 -0.004026737
Low 2.950450 -0.00007892468 0.0003871200 0.05288098 -0.009980445
storetiercat2 storetiercat3 storesizecatB storesizecatC year2024
Medium 0.00731104 -0.015739368 0.005882785 -0.01392901 4.194745
Low -0.04390216 -0.001928148 -0.054497280 -0.11212288 7.009737
Std. Errors:
(Intercept) total_netamount total_lineqty avg_cost n_transactions
Medium 0.00004014928 0.0000008099377 0.0000275512 0.0008968979 0.00004118825
Low 0.00003012254 0.0000025138220 0.0001355416 0.0010823724 0.00017918395
storetiercat2 storetiercat3 storesizecatB storesizecatC year2024
Medium 0.00001514294 0.000013241549 0.00001347795 0.000010382601 0.000019345275
Low 0.00001066761 0.000009807617 0.00001044328 0.000006683589 0.000004312398
Residual Deviance: 122839.6
AIC: 122879.6
The purpose of this is for the model to track and learn the trends that create a high priority item versus a medium or low priority item using the variables provided. When the variable offers a positive coefficient tied to either medium or low, then it understands that the corresponding variable is likely to produce the given priority (medium or low, or both) when the given variable increases in number/amount. And when a variable offers negative coefficients for either medium or low, then it is stating that this variable is less likely to produce an item of said priority when the variable increases in number/amount. In the case of when both coefficients are negative, it is stating that high priority items are more likely produced when the corresponding variable increases in number/amount.
According to that breakdown, the following results are:
- total_netamount: Produced negative coefficients for both medium and low, more likely to produce high priority items as net sales increase which is logical (proving the model is off to a good start)
- total_lineqty: Produced positive coefficients for both, which means that higher line quantity increases the odds of an item being medium or low priority vs high priority. This is an interesting observation according to the model, as it appears counter-intuitive, however it simply means that quantity alone is not a good enough predictor for whether an item will be high priority or not as high priority items have stronger net sales vs high volumes.
- avg_cost: Produced positive coefficients for both, meaning that higher cost items tend to become medium to low priority items. This is logical as higher cost items are less likely to be purchased.
- n_transactions: Produced negative coefficients for both, which comes to no surprise as the number of transactions a given item has directly correlates to the priority of an item, so high number = high priority
- storetiercat: Between two different storetiercat’s (storetiercat2 and storetiercat3), showed mixed results with very little variation. This further proves that inventory velocity has very little to do with store tier (in other words a store’s financial performance) as priority is mostly driven by item performance itself and is consistent across all store tiers.
- Storesizecat: The same as storetiercat, it offers very little variation meaning that an item’s velocity is more independent than ABS’ current buying patterns methodology, further ensuring the usefulness of the priority variable.
- year: The year variable shows both positive coefficients for medium and low showcasing that based on the data set (and based on the 2023 to 2024 performance), items are performing more poorly than the previous year (which is a part of ABS team’s concern) as items are more likely to be medium or low priority as time progresses.
In summary, the model shows that the total number of transactions and net sales amount are the strongest predictors of priority movement between years. Items with higher transaction counts and stronger net sales are much more likely to enter the high priority tier. Quantity sold on its own is not a reliable predictor and items with high physical volume can still fall into the medium or low tiers when net revenue does not match. Store size and store tier have only a minor influence which suggests that performance is driven by the product itself rather than the location. There is also a strong year effect that indicates priority classifications in 2024 were harder to reach compared to 2023 which may reflect broader changes in sales patterns.
Next, I will be testing the model’s performance to see how well it is able to predict a given item’s priority level based on what the model was able to analyze.
# Evaluate model performance
preds <- predict(priority_model, newdata = model_data)
confusion <- table(Predicted = preds, Actual = model_data$next_priority)
confusion Actual
Predicted High Medium Low
High 32685 3167 101
Medium 8186 76481 10718
Low 132 2044 6402
# Overall accuracy
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy[1] 0.8259813
According to the results, the model predicted that 35,953 items would be high priority, 95,385 items would be medium priority, and 8578 items would be low priority. Based on the given predictions, the model has an accuracy of 82.598% or approximately 82.6%, showing that the model is relatively strong at predicting an items performance. Or in other words, more than eight times out of ten the model will accurately predict an items priority level in the next year to give better perspective. To give a breakdown of the models predictions:
- The model accurately predicted that 32,685 items would be high priority
- 3167 items that the model predicted to be high priority turned out to be medium priority
- 101 items that the model predicted to be high priority turned out to be low priority
- The model accurately predicted 76481 items to be medium priority
- 8186 items that the model predicted to be medium priority turned out to be high priority
- 10718 items that the model predicted to be medium priority turned out to be low priority
- The model accurately predicted 6402 items to be in the low priority
- 132 items that the model predicted to be low priority turned out to be high priority
- 2044 items that the model predicted to be low priority turned out to be medium priority
In summary, the model achieved an accuracy of roughly 82 percent which shows that the predictors used for understanding future priority transitions are strong. The confusion matrix shows that the model performs very well when identifying high priority items and rarely mistakes low performers for strong ones. Most of the mis-classifications occur within the medium tier which is expected since it represents the largest portion of items and shares characteristics with both high and low tiers. The strong performance between the high and low categories suggests that the model can safely support decisions involving replenishment and inventory risk. This confirms that the quartile-based priority structure is statistically meaningful and can be used confidently when observing how items move across years.
However, there are resounding concerns regarding other statistics, including a very high AIC score, which indicates that there are potential reliability concerns regarding how the multinomial model observes the variable.
Additional Modeling Method Testing
Before moving forward, I want to address the concern with my original multinomial model priority_model. Currently there are some concerns regarding it’s statistics, including a high AIC which indicates that there is too much noise in the data set itself despite its moderately high accuracy. As that is not something I would like to move forward with, I will be using a different method to observe my model_data, by first sectioning the data so that 80% of it is used for training the model and the remaining 20% for testing the model. I will also be creating a custom tibble used for observing the accuracy and F1 statistics of the models I will be creating. The goal is to determine the best modeling method to use for my final step. For this section I will be using the following modeling methods:
- Multinomial (using a different method)
- Linear Discriminate Analysis
- Decision Tree
- Random Forest
- K nearest Neighbors
- Artificial Neural Network
# Modifying variables to ensure storesizecat and storetiercat are seen as categorical and year is seen as numerical
model_data <- model_data |>
mutate(next_priority = factor(next_priority, levels = c("High", "Medium", "Low")),
storesizecat = factor(storesizecat),
storetiercat = factor(storetiercat),
year = as.numeric(year))
set.seed(111)
train_index <- caret::createDataPartition(model_data$next_priority, p = 0.80, list = FALSE) #Splitting the data so that 80% of the data goes to training and 20% goes to testing and to ensure priority stays balanced between both sets
train_data <- model_data[train_index, ] #training data
test_data <- model_data[-train_index, ] #testing data
ctrl <- caret::trainControl(method = "cv", #fold cross validation aka it is repeatedly split internally during training
number = 5, # number of folds
classProbs = TRUE, #enables probability outputs
summaryFunction = caret::multiClassSummary, #Gives statistics such as accuracy
savePredictions = "final") #stores the best model's predictions for diagnostics
priority_formula <- next_priority ~ total_netamount + total_lineqty + avg_cost + n_transactions + storetiercat + storesizecat + year #variables used for predicting the priority #custom F1 metric function for predictions
multiclass_f1 <- function(cm_table){
classes <- colnames(cm_table)
per_class <- map_dfr(classes, function(cls){
tp <- cm_table[cls, cls] # correct predictions
fp <- sum(cm_table[, cls]) - tp #predicted the class but it was something else
fn <- sum(cm_table[cls, ]) - tp #actual class but it predicted something else
precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp)) #to determine how consistent the model is right
recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn)) #to determine out of all cases of a given priority, how often was caught by the model
f1 <- ifelse(is.na(precision + recall) | (precision + recall == 0),
NA, 2 * precision * recall / (precision + recall))
tibble(class = cls, precision = precision, recall = recall, f1 = f1, support = sum(cm_table[cls, ]))})
macro_f1 <- mean(per_class$f1, na.rm = TRUE) #mean of F1 across classes
weighted_f1 <- weighted.mean(per_class$f1, w = per_class$support, na.rm = TRUE) # Weighted by number of items in a priority
list(per_class = per_class, macro_f1 = macro_f1, weighted_f1 = weighted_f1)}Linear Discriminate Model
set.seed(111)
lda_model <- caret::train(priority_formula,
data = train_data,
method = "lda",
trControl = ctrl,
metric = "Accuracy")
lda_modelLinear Discriminant Analysis
111934 samples
7 predictor
3 classes: 'High', 'Medium', 'Low'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547
Resampling results:
logLoss AUC prAUC Accuracy Kappa Mean_F1
0.7604346 0.793622 0.6295548 0.6629443 0.2330129 0.4051539
Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
0.4252191 0.7300844 0.8368055 0.880875
Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
0.8368055 0.4252191 0.2209814 0.5776517
lda_preds <- predict(lda_model, newdata = test_data)
lda_cm <- caret::confusionMatrix(lda_preds, test_data$next_priority)
lda_cmConfusion Matrix and Statistics
Reference
Prediction High Medium Low
High 2226 0 0
Medium 5974 16334 3405
Low 0 4 39
Overall Statistics
Accuracy : 0.6647
95% CI : (0.6591, 0.6702)
No Information Rate : 0.5839
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.2378
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Class: Medium Class: Low
Sensitivity 0.27146 0.9998 0.011324
Specificity 1.00000 0.1945 0.999837
Pos Pred Value 1.00000 0.6352 0.906977
Neg Pred Value 0.76805 0.9982 0.878127
Prevalence 0.29305 0.5839 0.123079
Detection Rate 0.07955 0.5837 0.001394
Detection Prevalence 0.07955 0.9189 0.001537
Balanced Accuracy 0.63573 0.5971 0.505581
lda_f1 <- multiclass_f1(lda_cm$table)
lda_f1$per_class# A tibble: 3 × 5
class precision recall f1 support
<chr> <dbl> <dbl> <dbl> <int>
1 High 0.271 1 0.427 2226
2 Medium 1.000 0.635 0.777 25713
3 Low 0.0113 0.907 0.0224 43
lda_f1$macro_f1[1] 0.4087481
lda_f1$weighted_f1[1] 0.7478752
Decision Tree Model
set.seed(111)
tree_model <- caret::train(priority_formula,
data = train_data,
method = "rpart",
trControl = ctrl,
tuneLength = 10,
metric = "Accuracy")Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
tree_modelCART
111934 samples
7 predictor
3 classes: 'High', 'Medium', 'Low'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547
Resampling results across tuning parameters:
cp logLoss AUC prAUC Accuracy Kappa
0.001481322 0.4354266 0.9147499 0.73971388 0.8467936 0.71523805
0.001803349 0.4378278 0.9142193 0.64646083 0.8444976 0.71072783
0.002747960 0.4459648 0.9110985 0.50114303 0.8411028 0.70459271
0.002919708 0.4579212 0.9078631 0.48411659 0.8388693 0.70118405
0.004765994 0.4693555 0.9048111 0.46759224 0.8364482 0.69746297
0.009102619 0.4776502 0.9021292 0.46846395 0.8313828 0.68922625
0.013310434 0.4839012 0.9003058 0.45370560 0.8256116 0.67706074
0.025611851 0.5712307 0.8317907 0.35802463 0.7985152 0.60592765
0.031129240 0.5849573 0.8223117 0.34124267 0.7910287 0.58511089
0.405646200 0.8853990 0.5425396 0.01557607 0.6175079 0.09622994
Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
0.7900016 0.7668787 0.8968017 0.8232757
0.7879582 0.7644492 0.8950686 0.8218208
0.7850300 0.7618582 0.8931263 0.8185093
0.7832358 0.7611753 0.8925148 0.8151634
0.7814110 0.7604512 0.8917266 0.8119275
0.7757190 0.7590782 0.8897687 0.7986896
0.7605247 0.7421887 0.8855146 0.7863430
0.6992291 0.6502300 0.8569148 0.7600769
0.6768049 0.6190085 0.8482932 0.7448422
NaN 0.3740659 0.6950683 NaN
Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
0.9122454 0.8232757 0.7668787 0.2822645
0.9109123 0.8218208 0.7644492 0.2814992
0.9085086 0.8185093 0.7618582 0.2803676
0.9065078 0.8151634 0.7611753 0.2796231
0.9043786 0.8119275 0.7604512 0.2788161
0.9012861 0.7986896 0.7590782 0.2771276
0.8987560 0.7863430 0.7421887 0.2752039
0.8922560 0.7600769 0.6502300 0.2661717
0.8915336 0.7448422 0.6190085 0.2636762
0.8812209 NaN 0.3740659 0.2058360
Mean_Balanced_Accuracy
0.8318402
0.8297589
0.8274922
0.8268450
0.8260889
0.8244234
0.8138517
0.7535724
0.7336509
0.5345671
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.001481322.
tree_preds <- predict(tree_model, newdata = test_data)
tree_cm <- caret::confusionMatrix(tree_preds, test_data$next_priority)
tree_cmConfusion Matrix and Statistics
Reference
Prediction High Medium Low
High 6804 698 45
Medium 1359 14983 1446
Low 37 657 1953
Overall Statistics
Accuracy : 0.8484
95% CI : (0.8441, 0.8526)
No Information Rate : 0.5839
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.7183
Mcnemar's Test P-Value : < 0.00000000000000022
Statistics by Class:
Class: High Class: Medium Class: Low
Sensitivity 0.8298 0.9171 0.56707
Specificity 0.9624 0.7591 0.97172
Pos Pred Value 0.9016 0.8423 0.73782
Neg Pred Value 0.9317 0.8671 0.94115
Prevalence 0.2930 0.5839 0.12308
Detection Rate 0.2432 0.5355 0.06979
Detection Prevalence 0.2697 0.6357 0.09460
Balanced Accuracy 0.8961 0.8381 0.76940
tree_f1 <- multiclass_f1(tree_cm$table)
tree_f1$per_class# A tibble: 3 × 5
class precision recall f1 support
<chr> <dbl> <dbl> <dbl> <int>
1 High 0.830 0.902 0.864 7547
2 Medium 0.917 0.842 0.878 17788
3 Low 0.567 0.738 0.641 2647
tree_f1$macro_f1[1] 0.7945125
tree_f1$weighted_f1[1] 0.8519378
Random Forest Model
set.seed(111)
rf_model <- caret::train(priority_formula,
data = train_data,
method = "rf",
trControl = ctrl,
tuneLength = 5,
metric = "Accuracy",
importance = TRUE)
rf_modelRandom Forest
111934 samples
7 predictor
3 classes: 'High', 'Medium', 'Low'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547
Resampling results across tuning parameters:
mtry logLoss AUC prAUC Accuracy Kappa Mean_F1
2 0.192744639 0.9898836 0.90367954 0.9043722 0.8206985 0.8690667
3 0.038045728 0.9998340 0.72523216 0.9978648 0.9961729 0.9969848
5 0.007977311 0.9998431 0.20016075 0.9982043 0.9967822 0.9975785
7 0.007635474 0.9998474 0.06603375 0.9981775 0.9967341 0.9975610
9 0.007561591 0.9998613 0.06039125 0.9981686 0.9967181 0.9975514
Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
0.8371475 0.9292035 0.9146889 0.9541546
0.9964763 0.9987484 0.9974983 0.9988661
0.9974036 0.9990141 0.9977556 0.9990055
0.9973322 0.9989843 0.9977925 0.9989873
0.9973079 0.9989713 0.9977974 0.9989839
Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
0.9146889 0.8371475 0.3014574 0.8831755
0.9974983 0.9964763 0.3326216 0.9976124
0.9977556 0.9974036 0.3327348 0.9982088
0.9977925 0.9973322 0.3327258 0.9981582
0.9977974 0.9973079 0.3327229 0.9981396
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 5.
rf_preds <- predict(rf_model, newdata = test_data)
rf_cm <- caret::confusionMatrix(rf_preds, test_data$next_priority)
rf_cmConfusion Matrix and Statistics
Reference
Prediction High Medium Low
High 8185 11 10
Medium 9 16327 3
Low 6 0 3431
Overall Statistics
Accuracy : 0.9986
95% CI : (0.9981, 0.999)
No Information Rate : 0.5839
P-Value [Acc > NIR] : <0.0000000000000002
Kappa : 0.9975
Mcnemar's Test P-Value : 0.2407
Statistics by Class:
Class: High Class: Medium Class: Low
Sensitivity 0.9982 0.9993 0.9962
Specificity 0.9989 0.9990 0.9998
Pos Pred Value 0.9974 0.9993 0.9983
Neg Pred Value 0.9992 0.9991 0.9995
Prevalence 0.2930 0.5839 0.1231
Detection Rate 0.2925 0.5835 0.1226
Detection Prevalence 0.2933 0.5839 0.1228
Balanced Accuracy 0.9986 0.9991 0.9980
rf_f1 <- multiclass_f1(rf_cm$table)
rf_f1$per_class# A tibble: 3 × 5
class precision recall f1 support
<chr> <dbl> <dbl> <dbl> <int>
1 High 0.998 0.997 0.998 8206
2 Medium 0.999 0.999 0.999 16339
3 Low 0.996 0.998 0.997 3437
rf_f1$macro_f1[1] 0.9981135
rf_f1$weighted_f1[1] 0.9986063
K Nearest Neighbors Model
set.seed(111)
knn_model <- caret::train(priority_formula,
data = train_data,
method = "knn",
trControl = ctrl,
preProcess = c("center","scale"),
tuneLength = 10,
metric = "Accuracy")
knn_modelk-Nearest Neighbors
111934 samples
7 predictor
3 classes: 'High', 'Medium', 'Low'
Pre-processing: centered (9), scaled (9)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547
Resampling results across tuning parameters:
k logLoss AUC prAUC Accuracy Kappa Mean_F1
5 0.7340195 0.9507491 0.3189370 0.8413619 0.7112888 0.7934107
7 0.6273057 0.9496576 0.3826737 0.8439348 0.7142589 0.7947029
9 0.5654628 0.9479928 0.4297978 0.8460700 0.7165959 0.7957863
11 0.5405543 0.9463873 0.4655751 0.8444083 0.7126375 0.7922833
13 0.5189061 0.9450991 0.4930885 0.8417639 0.7073103 0.7882549
15 0.5029809 0.9438835 0.5153264 0.8394679 0.7026519 0.7853978
17 0.4925583 0.9426782 0.5330345 0.8385120 0.7003138 0.7836291
19 0.4808587 0.9419760 0.5494355 0.8378241 0.6985476 0.7821388
21 0.4740448 0.9409573 0.5622290 0.8372523 0.6970106 0.7806741
23 0.4643513 0.9402343 0.5728135 0.8358050 0.6940929 0.7780040
Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
0.7831788 0.8991537 0.8052842 0.9051869
0.7807881 0.8990267 0.8116760 0.9076913
0.7785105 0.8986980 0.8178480 0.9099250
0.7733203 0.8968131 0.8171109 0.9094447
0.7686327 0.8947709 0.8142771 0.9080061
0.7651757 0.8929018 0.8125122 0.9067140
0.7625163 0.8917729 0.8123252 0.9064704
0.7602521 0.8908968 0.8122101 0.9063856
0.7579942 0.8901283 0.8123072 0.9063424
0.7550832 0.8890861 0.8102218 0.9055729
Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
0.8052842 0.7831788 0.2804540 0.8411663
0.8116760 0.7807881 0.2813116 0.8399074
0.8178480 0.7785105 0.2820233 0.8386043
0.8171109 0.7733203 0.2814694 0.8350667
0.8142771 0.7686327 0.2805880 0.8317018
0.8125122 0.7651757 0.2798226 0.8290387
0.8123252 0.7625163 0.2795040 0.8271446
0.8122101 0.7602521 0.2792747 0.8255744
0.8123072 0.7579942 0.2790841 0.8240613
0.8102218 0.7550832 0.2786017 0.8220846
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
knn_preds <- predict(knn_model, newdata = test_data)
knn_cm <- caret::confusionMatrix(knn_preds, test_data$next_priority)
knn_cmConfusion Matrix and Statistics
Reference
Prediction High Medium Low
High 6929 783 49
Medium 1229 14758 1250
Low 42 797 2145
Overall Statistics
Accuracy : 0.8517
95% CI : (0.8475, 0.8558)
No Information Rate : 0.5839
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.7283
Mcnemar's Test P-Value : < 0.00000000000000022
Statistics by Class:
Class: High Class: Medium Class: Low
Sensitivity 0.8450 0.9033 0.62282
Specificity 0.9579 0.7871 0.96581
Pos Pred Value 0.8928 0.8562 0.71883
Neg Pred Value 0.9371 0.8530 0.94804
Prevalence 0.2930 0.5839 0.12308
Detection Rate 0.2476 0.5274 0.07666
Detection Prevalence 0.2774 0.6160 0.10664
Balanced Accuracy 0.9015 0.8452 0.79432
knn_f1 <- multiclass_f1(knn_cm$table)
knn_f1$per_class# A tibble: 3 × 5
class precision recall f1 support
<chr> <dbl> <dbl> <dbl> <int>
1 High 0.845 0.893 0.868 7761
2 Medium 0.903 0.856 0.879 17237
3 Low 0.623 0.719 0.667 2984
knn_f1$macro_f1[1] 0.8049135
knn_f1$weighted_f1[1] 0.8535158
Multinomial Model (updated)
set.seed(111)
multinom_model <- nnet::multinom(priority_formula,
data = train_data,
MaxNWts = 20000,
trace = FALSE)
multinom_preds <- predict(multinom_model, newdata = test_data)
multinom_cm <- caret::confusionMatrix(data = multinom_preds,
reference = test_data$next_priority)
multinom_cmConfusion Matrix and Statistics
Reference
Prediction High Medium Low
High 6508 588 26
Medium 1668 15358 2094
Low 24 392 1324
Overall Statistics
Accuracy : 0.8287
95% CI : (0.8243, 0.8331)
No Information Rate : 0.5839
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.6699
Mcnemar's Test P-Value : < 0.00000000000000022
Statistics by Class:
Class: High Class: Medium Class: Low
Sensitivity 0.7937 0.9400 0.38444
Specificity 0.9690 0.6769 0.98305
Pos Pred Value 0.9138 0.8032 0.76092
Neg Pred Value 0.9189 0.8894 0.91921
Prevalence 0.2930 0.5839 0.12308
Detection Rate 0.2326 0.5489 0.04732
Detection Prevalence 0.2545 0.6833 0.06218
Balanced Accuracy 0.8813 0.8085 0.68374
multinom_f1 <- multiclass_f1(multinom_cm$table)
multinom_f1$per_class# A tibble: 3 × 5
class precision recall f1 support
<chr> <dbl> <dbl> <dbl> <int>
1 High 0.794 0.914 0.849 7122
2 Medium 0.940 0.803 0.866 19120
3 Low 0.384 0.761 0.511 1740
multinom_f1$macro_f1[1] 0.7421881
multinom_f1$weighted_f1[1] 0.8398932
Artificial Neural Networks Model
set.seed(111)
ann_model <- caret::train(priority_formula,
data = train_data,
method = "avNNet",
trControl = ctrl,
preProcess = c("center", "scale"),
tuneLength = 5,
metric = "Accuracy",
trace = FALSE)Warning: executing %dopar% sequentially: no parallel backend registered
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
ann_modelModel Averaged Neural Network
111934 samples
7 predictor
3 classes: 'High', 'Medium', 'Low'
Pre-processing: centered (9), scaled (9)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 89547, 89548, 89547, 89547, 89547
Resampling results across tuning parameters:
size decay logLoss AUC prAUC Accuracy Kappa
1 0.0000 0.6176581 0.9049980 0.7546039 0.7458059 0.4474598
1 0.0001 0.5613458 0.9058845 0.7469962 0.7919488 0.5729839
1 0.0010 0.5179014 0.9056283 0.7533280 0.7926278 0.5761921
1 0.0100 0.5072165 0.9035008 0.7496246 0.7930119 0.5773725
1 0.1000 0.5016587 0.9018635 0.7536386 0.7927797 0.5770050
3 0.0000 0.7779042 0.9227665 0.8014519 0.7016731 0.3239037
3 0.0001 0.4715130 0.9257318 0.8211841 0.8036163 0.6055512
3 0.0010 0.4885967 0.9232982 0.8111446 0.8055907 0.6091597
3 0.0100 0.4562472 0.9246343 0.8190856 0.8121751 0.6289616
3 0.1000 0.4399294 0.9268412 0.8210443 0.8240034 0.6635185
5 0.0000 0.5715859 0.9260465 0.8235875 0.7165818 0.3644351
5 0.0001 0.5326066 0.9238333 0.8155376 0.7923240 0.5731236
5 0.0010 0.4790989 0.9261988 0.8219890 0.8016866 0.5985397
5 0.0100 0.4879355 0.9155957 0.8094544 0.8037506 0.6057537
5 0.1000 0.4386123 0.9268583 0.8222660 0.8248075 0.6631698
7 0.0000 3.1272382 0.8315843 0.5388359 0.7161083 0.3620200
7 0.0001 0.5058773 0.9254289 0.8214351 0.7961834 0.5849631
7 0.0010 0.4741548 0.9259822 0.8228672 0.8089590 0.6181997
7 0.0100 0.4845290 0.9239439 0.8194451 0.8052781 0.6085089
7 0.1000 0.4377404 0.9269499 0.8242323 0.8213230 0.6545071
9 0.0000 0.5902360 0.9279246 0.8266170 0.7093113 0.3456493
9 0.0001 0.4723693 0.9268482 0.8257063 0.8055371 0.6097869
9 0.0010 0.5274488 0.9250983 0.8211098 0.8030892 0.6026031
9 0.0100 0.5541231 0.9246532 0.8189026 0.7914396 0.5699787
9 0.1000 0.4375539 0.9264588 0.8237556 0.8257186 0.6643101
Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
NaN 0.5253518 0.8010929 NaN
NaN 0.5801674 0.8394973 NaN
NaN 0.5827158 0.8410891 NaN
NaN 0.5834969 0.8415969 NaN
NaN 0.5834047 0.8415109 NaN
0.5668880 0.4702489 0.7626250 0.8994180
0.7272789 0.6265631 0.8523603 0.8104085
0.7478253 0.6317033 0.8532679 0.8163673
0.6988604 0.6520406 0.8610782 0.8146565
0.7354400 0.7096176 0.8762062 0.8057536
0.6356535 0.5070706 0.7776899 0.8415329
0.6385979 0.5912921 0.8389422 0.8627749
0.6854471 0.6134490 0.8490263 0.8235319
0.6640821 0.6214594 0.8521732 0.8095659
0.7338529 0.7025485 0.8749126 0.8137255
0.6859209 0.5077951 0.7763796 0.8269554
0.6463849 0.6005616 0.8442247 0.8200272
0.6598055 0.6398534 0.8560602 0.8322109
0.6872903 0.6263372 0.8525618 0.8203071
0.7142122 0.6935852 0.8720959 0.8166047
0.6545936 0.4972616 0.7710541 0.8058017
0.6580863 0.6270549 0.8531208 0.8356631
0.7285627 0.6225377 0.8506885 0.8240830
0.5823591 0.5802767 0.8378046 0.8617526
0.7414287 0.7069993 0.8745436 0.8204610
Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
0.9033792 NaN 0.5253518 0.2486020
0.9053240 NaN 0.5801674 0.2639829
0.9029975 NaN 0.5827158 0.2642093
0.9027354 NaN 0.5834969 0.2643373
0.9023622 NaN 0.5834047 0.2642599
0.9093985 0.8994180 0.4702489 0.2338910
0.9055395 0.8104085 0.6265631 0.2678721
0.9074200 0.8163673 0.6317033 0.2685302
0.9046267 0.8146565 0.6520406 0.2707250
0.9044323 0.8057536 0.7096176 0.2746678
0.9074934 0.8415329 0.5070706 0.2388606
0.9077125 0.8627749 0.5912921 0.2641080
0.9065712 0.8235319 0.6134490 0.2672289
0.9046238 0.8095659 0.6214594 0.2679169
0.9060151 0.8137255 0.7025485 0.2749358
0.9112310 0.8269554 0.5077951 0.2387028
0.9053056 0.8200272 0.6005616 0.2653945
0.9077764 0.8322109 0.6398534 0.2696530
0.9067044 0.8203071 0.6263372 0.2684260
0.9048774 0.8166047 0.6935852 0.2737743
0.9056787 0.8058017 0.4972616 0.2364371
0.9056277 0.8356631 0.6270549 0.2685124
0.9062835 0.8240830 0.6225377 0.2676964
0.9083646 0.8617526 0.5802767 0.2638132
0.9073756 0.8204610 0.7069993 0.2752395
Mean_Balanced_Accuracy
0.6632224
0.7098324
0.7119024
0.7125469
0.7124578
0.6164369
0.7394617
0.7424856
0.7565594
0.7929119
0.6423803
0.7151172
0.7312376
0.7368163
0.7887305
0.6420873
0.7223932
0.7479568
0.7394495
0.7828405
0.6341578
0.7400878
0.7366131
0.7090407
0.7907715
Tuning parameter 'bag' was held constant at a value of FALSE
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 9, decay = 0.1 and bag = FALSE.
# Predictions on test set
ann_preds <- predict(ann_model, newdata = test_data)
# Confusion matrix
ann_cm <- caret::confusionMatrix(ann_preds, test_data$next_priority)
ann_cmConfusion Matrix and Statistics
Reference
Prediction High Medium Low
High 6424 474 27
Medium 1776 15864 3413
Low 0 0 4
Overall Statistics
Accuracy : 0.7967
95% CI : (0.7919, 0.8014)
No Information Rate : 0.5839
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5835
Mcnemar's Test P-Value : < 0.00000000000000022
Statistics by Class:
Class: High Class: Medium Class: Low
Sensitivity 0.7834 0.9710 0.0011614
Specificity 0.9747 0.5544 1.0000000
Pos Pred Value 0.9277 0.7535 1.0000000
Neg Pred Value 0.9157 0.9316 0.8770463
Prevalence 0.2930 0.5839 0.1230791
Detection Rate 0.2296 0.5669 0.0001429
Detection Prevalence 0.2475 0.7524 0.0001429
Balanced Accuracy 0.8790 0.7627 0.5005807
# F1 scores
ann_f1 <- multiclass_f1(ann_cm$table)
ann_f1$per_class# A tibble: 3 × 5
class precision recall f1 support
<chr> <dbl> <dbl> <dbl> <int>
1 High 0.783 0.928 0.849 6925
2 Medium 0.971 0.754 0.849 21053
3 Low 0.00116 1 0.00232 4
ann_f1$macro_f1[1] 0.5667737
ann_f1$weighted_f1[1] 0.8486502
Model Results
#Creating a tibble to organize the resulting statistics from each model to compare
model_results <- tibble(model = c("Multinomial", "LDA", "Decision Tree", "Random Forest", "KNN", "ANN"),
accuracy = c(multinom_cm$overall["Accuracy"],
lda_cm$overall["Accuracy"],
tree_cm$overall["Accuracy"],
rf_cm$overall["Accuracy"],
knn_cm$overall["Accuracy"],
ann_cm$overall["Accuracy"]),
macro_f1 = c(multinom_f1$macro_f1,
lda_f1$macro_f1,
tree_f1$macro_f1,
rf_f1$macro_f1,
knn_f1$macro_f1,
ann_f1$macro_f1),
weighted_f1 = c(multinom_f1$weighted_f1,
lda_f1$weighted_f1,
tree_f1$weighted_f1,
rf_f1$weighted_f1,
knn_f1$weighted_f1,
ann_f1$weighted_f1))
model_results# A tibble: 6 × 4
model accuracy macro_f1 weighted_f1
<chr> <dbl> <dbl> <dbl>
1 Multinomial 0.829 0.742 0.840
2 LDA 0.665 0.409 0.748
3 Decision Tree 0.848 0.795 0.852
4 Random Forest 0.999 0.998 0.999
5 KNN 0.852 0.805 0.854
6 ANN 0.797 0.567 0.849
According to the tibble I created named model_results , the best model for predicting priority (when using the 7 mentioned variables to observe it) would be the Random Forest Model with an extremely high accuracy of 99.86% and a weighted F1 of 0.998 (very close to 1 which indicates high accuracy as well). So going forward I will only be utilizing the random forest model for the final step.
Inventory Management using the Random Forest Model
As it is becoming clear that inventory velocity appears to be unaffected by other categories, there are still other factors to consider such as:
- How to prioritize what merchandise goes where when each individual storefront has its’ own parameters surrounding number of merchandise available to customers in store at any given moment?
- How to handle already existing inventory that projects to move from one priority ranking to another?
- How to handle new inventory?
For resolving this, I will start by listing the average number of SKUs each individual store holds in their storefront at one time based on present day.
- Silver Spring (Store #2): 3546 SKUs
- Clarksburg Village (Store #3): 4146 SKUs
- Westbard (Store #4): 3773 SKUs
- Wheaton (Store #5): 2878 SKUs
- Kingsview (Store #6): 3454 SKUs
- Flower (Store #7): 3386 SKUs
- Burtonsville (Store #8): 3877 SKUs
- Muddy Branch (Store #9): 4299 SKUs
- Cloverly (Store #10): 3533 SKUs
- Walnut Hill (Store #11): 3437 SKUs
- Cabin John (Store #12): 3903 SKUs
- Leisure World (Store #13): 3115 SKUs
- Kensington (Store #14): 3630 SKUs
- Potomac (Store #15): 3371 SKUs
- Seneca Meadows (Store #16): 4060 SKUs
- Goshen Crossing (Store #17): 3692 SKUs
- Olney (Store #18): 2918 SKUs
- Darnestown (Store #20): 3238 SKUs
- Fallsgrove (Store #21): 2871 SKUs
- White Oak Town Center (Store #22): 3118 SKUs
- Hampden Lane (Store #23): 3404 SKUs
- Montrose (Store #24): 3674 SKUs
- King Farm (Store #26): 3341 SKUs
- Downtown Rockville (Store #27): 4154 SKUs
- Aspen Hill (Store #28): 4066 SKUs
- Poolesville (Store #29): 1872 SKUs
- Gaithersburg (Store #30): 3950 SKUs
As the size of the store no longer matters due to now looking specifically at the current average number of SKUs available in an individual storefront, it is best to only consider the ABS provided store classification group of financial performance (which is under the variable storetiercat). I have determined the following parameters for each tier (1 to 3) of store:
- Tier 1 stores should be at minimum 55% high priority items, 30% medium priority items, and 15% low priority items
- Tier 2 stores should be at minimum 75% high priority items, 20% medium priority items, and 5% low priority items
- Tier 3 stores should be at minimum 85% high priority items, 15% medium priority items, and 0% low priority items
This is a general basis to start from that can always be modified based on ABS’ needs, however for the sake of making a baseline I will be referring to this outline. The reasoning for the outline is based on the idea that:
- Tier 3 stores should not hold any merchandise that could be considered risky, prioritizing only merchandise that regularly perform either very well or generally well
- Tier 2 stores should focus mainly on holding mostly high and medium priority merchandise, only considering low priority in these categories if the store has a high quantity of SKUs available
- Tier 1 stores have more flexibility, giving more option to experiment with new inventory (which would theoretically start at low priority) along with keeping available select top performing low priority items as long as additional storage is available at a given store
First step is to build a prediction frame for the random forest model to use for predicting the inventory performance of the next year. (For this use case I will be using 2025 as the main data set does not have a complete record of that year to produce predictions for 2026). I will name this prediction_frame
# Take latest year per item/store as basis
prediction_frame <- model_data |>
group_by(store, itemid) |>
slice_tail(n = 1) |> # use most recent stats
ungroup() |>
mutate(year = 2025) # prediction yearNow that the information has been gathered, I will use my random forest model to predict the priority of items for 2025
prediction_frame$pred_priority <- predict(rf_model, newdata = prediction_frame)Now, given that every store has a storefront SKU capacity, I need to create a new tibble which includes said information for each given store. I will name this sku_capacity
sku_capacity <- tribble(~store, ~sku_limit,
2, 3546,
3, 4146,
4, 3773,
5, 2878,
6, 3454,
7, 3386,
8, 3877,
9, 4299,
10, 3533,
11, 3437,
12, 3903,
13, 3115,
14, 3630,
15, 3371,
16, 4060,
17, 3692,
18, 2918,
20, 3238,
21, 2871,
22, 3118,
23, 3404,
24, 3674,
26, 3341,
27, 4154,
28, 4066,
29, 1872,
30, 3950)Next, I will also be creating a tibble which includes all the rules mentioned previously regarding how I recommend organizing the stocking of items by priority for each store based on store tier. I will name this priority_mix
priority_mix <- tribble(~storetiercat, ~high_pct, ~med_pct, ~low_pct,
"1", 0.55, 0.30, 0.15,
"2", 0.75, 0.20, 0.05,
"3", 0.85, 0.15, 0.00)Now to create an allocation function which will help with allocating the inventory based on priority, SKU count, and the rules created in priority_mix to help with producing a final table which provides all recommended inventory to stock for each store based on the model’s predictions.
allocate_items <- function(store_id) {
store_df <- prediction_frame |> filter(store == store_id)
tcat <- unique(store_df$storetiercat)
scount <- sku_capacity |> filter(store == store_id) |> pull(sku_limit)
mix <- priority_mix |> filter(storetiercat == tcat)
high_needed <- ceiling(scount * mix$high_pct)
medium_needed <- ceiling(scount * mix$med_pct)
low_needed <- ceiling(scount * mix$low_pct)
store_df <- store_df |>
arrange(desc(n_transactions)) # best items first
high_items <- store_df |> filter(pred_priority == "High") |> slice_head(n = high_needed)
med_items <- store_df |> filter(pred_priority == "Medium") |> slice_head(n = medium_needed)
low_items <- store_df |> filter(pred_priority == "Low") |> slice_head(n = low_needed)
tibble(store = store_id,
storetiercat = tcat,
sku_limit = scount,
high_needed = high_needed,
medium_needed = medium_needed,
low_needed = low_needed,
recommended_high = list(high_items$itemid),
recommended_medium = list(med_items$itemid),
recommended_low = list(low_items$itemid),
year = 2025)}Finally, I will create the table named final_recommendtions which provides what I would recommend to be stocked in each store in 2025 based on the model’s predictions.
final_recommendations <- map_dfr(unique(sku_capacity$store), allocate_items)Now, I would like to produce a csv file from final_recommendations which extracts all information provided in the table in a cleanly formatted spreadsheet. To do so, I will first need to create a function to reformat the data provided from final_recommendations to create one final tibble named final_recommendations_long which has all recommended priorities for all stores prepared to be written into a csv file.
clean_list <- function(vec){vec |> stringr::str_trim()}final_recommendations_parsed <- final_recommendations |>
mutate(recommended_high = map(recommended_high, clean_list),
recommended_medium = map(recommended_medium, clean_list),
recommended_low = map(recommended_low, clean_list))final_recommendations_long <- bind_rows(final_recommendations_parsed |>
dplyr::select(store, storetiercat, sku_limit, year, recommended_high) |>
unnest_longer(recommended_high, values_to = "itemid") |>
mutate(priority = "High"),
final_recommendations_parsed |>
dplyr::select(store, storetiercat, sku_limit, year, recommended_medium) |>
unnest_longer(recommended_medium, values_to = "itemid") |>
mutate(priority = "Medium"),
final_recommendations_parsed |>
dplyr::select(store, storetiercat, sku_limit, year, recommended_low) |>
unnest_longer(recommended_low, values_to = "itemid") |>
mutate(priority = "Low")) |>
filter(itemid != "") |>
arrange(store, match(priority, c("High","Medium","Low")), itemid)Now that I have cleaned the data provided by the model in preparation for the final produced spreadsheet, I would like to also include a validation summary as well to help summarize the following:
- Priority counts per store
- Fill-rate vs. SKU limit
- Compliance with Tier rules
This will help with reporting and also may be useful to the ABS team. I will be naming the tibble validation_table
priority_summary <- final_recommendations_long |>
count(store, storetiercat, priority, name = "priority_count") |>
pivot_wider(
names_from = priority,
values_from = priority_count,
values_fill = 0)
priority_summary <- priority_summary |>
left_join(
final_recommendations |>
dplyr::select(store, sku_limit, year),
by = "store") |>
mutate(total_recommended = High + Medium + Low,
fill_rate = total_recommended / sku_limit)
quota_rules <- tibble(storetiercat = factor(c("1", "2", "3"),
levels = levels(priority_summary$storetiercat)),
min_high = c(0.55, 0.75, 0.85),
min_medium = c(0.30, 0.20, 0.15),
min_low = c(0.15, 0.05, 0.00))
priority_summary <- priority_summary |>
left_join(quota_rules, by = "storetiercat")
priority_summary <- priority_summary |>
mutate(pct_high = High / total_recommended,
pct_medium = Medium / total_recommended,
pct_low = Low / total_recommended,
meets_high = pct_high >= min_high,
meets_medium = pct_medium >= min_medium,
meets_low = pct_low >= min_low,
quota_pass = meets_high & meets_medium & meets_low)
validation_table <- priority_summary |>
dplyr::select(store, storetiercat, sku_limit, total_recommended, fill_rate, pct_high, pct_medium, pct_low, meets_high, meets_medium, meets_low, quota_pass) |>
arrange(store)Now that I have produced a prepared table, final_recommendations_long, along with a table to observe the quotas, validation_table, I would like to join the two and then produce a csv file as my final product for ABS.
First I need to join the two tables:
recommendations_with_validation <- final_recommendations_long |>
left_join(validation_table |>
dplyr::select(store, total_recommended, fill_rate, pct_high, pct_medium, pct_low, meets_high, meets_medium, meets_low, quota_pass),
by = "store")And finally to produce the csv file:
readr::write_csv(recommendations_with_validation, "ABS_Item_Recommendations_2025.csv")Final Remarks
As of current, this is the final produced product I can offer, however there are a number of concerns that would need to be addressed, and I would have addressed given more time:
The further I got into creating this product, the more I realized that my current ranking system for the priority of items is flawed. This becomes more apparent when creating a quota that the high priority items will never be able to fulfill for any store due to the limitations of the quantity. To resolve this I would:
- Change the definition of high priority items by changing it from the top 25% performing items organization wide to either top 33%, top 40%, or even consider letting a random forest model help with predicting the priority of items based on performance (which personally I’d recommend the most).
- Re calibrate the quotas I have created for all store tiers so that the amount of high priority items offer a more realistic percentage for each, making it more possible to fulfill.
Include the quantity of an item I would recommend to each individual store, using the random forest model to assist with predicting the best amount based on SKU limitations
Adjust the method that the model observes the data from a yearly perspective to a monthly perspective to get a more accurate perspective on inventory fulfillment.
- Possibly even make it so that the random forest model would produce a new recommended list before every inventory fulfillment cycle so that each individual store could determine what it would need in their inventory based on inventory performance.
Overall, I am certain that by using a random forest model to help with determining how Alcohol and Beverage Services of Montgomery County fulfill their inventory between all 27 locations, they will ultimately see more profit as there will be less inventory going to waste. My work is solely meant to lay the groundwork in creating the inventory ranking system, while also acknowledging how it can be improved in the future.