Final Project BAIS 462, By: Olivia W.

Project: Hypothesis -Do higher priced liquor products generate more revenue despite selling fewer bottles, and do consumers rate higher priced products better?

My name is Olivia, and I am a senior studying Marketing and Business Analytics & Information Systems. After graduation, I will be joining 84.51° as a consultant and analyst. The spirits industry has grown significantly in recent years, with consumers becoming increasingly selective about the brands and styles they purchase. This analysis focuses on two of the most popular spirit categories (whiskey and vodka) and explores what factors contribute to higher ratings among enthusiasts. Specifically, I am interested in whether the type of spirit, its origin, or its price point influences how it is rated on platforms like Distiller, one of the most popular spirits review websites.

As someone interested in consumer behavior and the beverage industry, I find it compelling that spirits can vary so dramatically in price while sometimes receiving similar ratings to much cheaper alternatives. This raises the question: do consumers actually rate more expensive spirits higher, or is price simply a marketing tool? Using data scraped from Distiller’s curated lists of top-rated whiskeys and vodkas, this analysis aims to describe the rating landscape across both categories and identify any patterns that emerge between spirit characteristics and user ratings.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(knitr)
iowa_liquor <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/wasserstromo_xavier_edu/IQBw4EI-SKD8SpV7cFBsku5BAd_H7cvhd8yNRlRBXyRlCUU?download=1")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 12591077 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Invoice/Item Number, Date, Store Name, Address, City, Store Locati...
dbl  (9): Store Number, Zip Code, Category, Item Number, Pack, Bottle Volume...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vodka_distiller_reviews <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/wasserstromo_xavier_edu/IQB5BFrPTEKsSb0P1WnWYdTmAfksRmbuPtk19zS_aPgAgvo?download=1")
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): vodka_name, vodka_description
dbl (1): vodka_rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
whiskey_distiller_reviews <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/wasserstromo_xavier_edu/IQAySM_sWpbqSbRfuKeI5TvSAWaRtiOPqSFUpD38PR5W4Jc?download=1")
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): whiskey_name, whiskey_description
dbl (1): whiskey_rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary Stats for Iowa Liquor Dataset

summary(iowa_liquor)
 Invoice/Item Number     Date            Store Number   Store Name       
 Length:12591077     Length:12591077    Min.   :2106   Length:12591077   
 Class :character    Class :character   1st Qu.:2604   Class :character  
 Mode  :character    Mode  :character   Median :3704   Mode  :character  
                                        Mean   :3565                     
                                        3rd Qu.:4304                     
                                        Max.   :9932                     
                                                                         
   Address              City              Zip Code     Store Location    
 Length:12591077    Length:12591077    Min.   :50002   Length:12591077   
 Class :character   Class :character   1st Qu.:50316   Class :character  
 Mode  :character   Mode  :character   Median :51101   Mode  :character  
                                       Mean   :51268                     
                                       3rd Qu.:52310                     
                                       Max.   :56201                     
                                       NA's   :10360                     
 County Number         County             Category       Category Name     
 Length:12591077    Length:12591077    Min.   : 101220   Length:12591077   
 Class :character   Class :character   1st Qu.:1012210   Class :character  
 Mode  :character   Mode  :character   Median :1031200   Mode  :character  
                                       Mean   :1044710                     
                                       3rd Qu.:1062310                     
                                       Max.   :1901200                     
                                       NA's   :8020                        
 Vendor Number      Vendor Name         Item Number     Item Description  
 Length:12591077    Length:12591077    Min.   :   101   Length:12591077   
 Class :character   Class :character   1st Qu.: 27056   Class :character  
 Mode  :character   Mode  :character   Median : 38177   Mode  :character  
                                       Mean   : 46037                     
                                       3rd Qu.: 63755                     
                                       Max.   :999275                     
                                                                          
      Pack        Bottle Volume (ml) State Bottle Cost  State Bottle Retail
 Min.   :  1.00   Min.   :     0.0   Length:12591077    Length:12591077    
 1st Qu.:  6.00   1st Qu.:   750.0   Class :character   Class :character   
 Median : 12.00   Median :   750.0   Mode  :character   Mode  :character   
 Mean   : 12.23   Mean   :   928.9                                         
 3rd Qu.: 12.00   3rd Qu.:  1000.0                                         
 Max.   :600.00   Max.   :378000.0                                         
                                                                           
  Bottles Sold      Sale (Dollars)     Volume Sold (Liters)
 Min.   :    0.00   Length:12591077    Min.   :    0.000   
 1st Qu.:    2.00   Class :character   1st Qu.:    1.500   
 Median :    4.00   Mode  :character   Median :    3.000   
 Mean   :    8.14                      Mean   :    7.489   
 3rd Qu.:   12.00                      3rd Qu.:    9.000   
 Max.   :15000.00                      Max.   :15000.000   
                                                           
 Volume Sold (Gallons)
 Min.   :   0.000     
 1st Qu.:   0.400     
 Median :   0.790     
 Mean   :   1.977     
 3rd Qu.:   2.380     
 Max.   :3962.580     
                      

Data Dictionary:

  • Invoice/Item Number

  • Date

  • Store Number

  • Store Name

  • Address

  • City

  • Zip Code

  • Store Location

  • County Number

  • County

  • Category

  • Category Name

  • Vendor Number

  • Vendor Name

  • Item Number

  • Item Description

  • Pack

  • Bottle Volume (ml)

  • State Bottle Cost

  • State Bottle Retail

  • Bottles Sold

  • Sale (Dollars)

  • Volume Sold (Liters)

  • Volume Sold (Gallons)

iowa_liquor <- iowa_liquor %>%
  mutate(`State Bottle Retail` = as.numeric(str_replace(`State Bottle Retail`, "\\$", "")),
         `State Bottle Cost` = as.numeric(str_replace(`State Bottle Cost`, "\\$", "")),
         `Sale (Dollars)` = as.numeric(str_replace(`Sale (Dollars)`, "\\$", "")))

Visualization 1: Average Bottles Sold by Price Tier

iowa_liquor %>%
  filter(!is.na(`State Bottle Retail`), !is.na(`Bottles Sold`)) %>%
  mutate(price_tier = ifelse(`State Bottle Retail` < 10, "Under $10",
                      ifelse(`State Bottle Retail` < 25, "$10-$25",
                      ifelse(`State Bottle Retail` < 50, "$25-$50", "Over $50")))) %>%
  group_by(price_tier) %>%
  summarize(avg_bottles = mean(`Bottles Sold`)) %>%
  ggplot(aes(x = factor(price_tier), y = avg_bottles)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Average Bottles Sold by Price Tier",
       x = "Price Tier",
       y = "Average Bottles Sold")

Visualization 2: Total Revenue by Price Tier

iowa_liquor %>%
  filter(!is.na(`State Bottle Retail`), !is.na(`Sale (Dollars)`)) %>%
  mutate(price_tier = ifelse(`State Bottle Retail` < 10, "Under $10",
                      ifelse(`State Bottle Retail` < 25, "$10-$25",
                      ifelse(`State Bottle Retail` < 50, "$25-$50", "Over $50")))) %>%
  group_by(price_tier) %>%
  summarize(total_revenue = sum(`Sale (Dollars)`)) %>%
  ggplot(aes(x = factor(price_tier), y = total_revenue)) +
  geom_bar(stat = "identity", fill = "pink") +
  labs(title = "Total Revenue by Price Tier",
       x = "Price Tier",
       y = "Total Revenue (USD)")

iowa_liquor <- iowa_liquor %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

Visualization 3: Total Liquor Sales Over Time by Price Tier

iowa_liquor %>%
  filter(!is.na(`State Bottle Retail`), !is.na(`Sale (Dollars)`)) %>%
  mutate(price_tier = ifelse(`State Bottle Retail` < 10, "Under $10",
                      ifelse(`State Bottle Retail` < 25, "$10-$25",
                      ifelse(`State Bottle Retail` < 50, "$25-$50", "Over $50"))),
         month = floor_date(Date, "month")) %>%
  group_by(month, price_tier) %>%
  summarize(total_sales = sum(`Sale (Dollars)`, na.rm = TRUE)) %>%
  ggplot(aes(x = month, y = total_sales, color = price_tier)) +
  geom_line() +
  labs(title = "Total Liquor Sales Over Time by Price Tier",
       x = "Month",
       y = "Total Sales (USD)",
       color = "Price Tier")
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.

names(iowa_liquor)
 [1] "Invoice/Item Number"   "Date"                  "Store Number"         
 [4] "Store Name"            "Address"               "City"                 
 [7] "Zip Code"              "Store Location"        "County Number"        
[10] "County"                "Category"              "Category Name"        
[13] "Vendor Number"         "Vendor Name"           "Item Number"          
[16] "Item Description"      "Pack"                  "Bottle Volume (ml)"   
[19] "State Bottle Cost"     "State Bottle Retail"   "Bottles Sold"         
[22] "Sale (Dollars)"        "Volume Sold (Liters)"  "Volume Sold (Gallons)"

Visualization 4: Distribution of Bottles Sold by Price Tier

iowa_liquor %>%
  filter(!is.na(`State Bottle Retail`),
         !is.na(`Bottles Sold`),
         `Bottles Sold` <= 100) %>%
  mutate(price_tier = ifelse(`State Bottle Retail` < 10, "Under $10",
                      ifelse(`State Bottle Retail` < 25, "$10-$25",
                      ifelse(`State Bottle Retail` < 50, "$25-$50", "Over $50")))) %>%
  ggplot(aes(x = factor(price_tier), y = `Bottles Sold`)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Distribution of Bottles Sold by Price Tier",
       x = "Price Tier",
       y = "Bottles Sold")

Visualization 5: Top 10 Best Selling Products

iowa_liquor %>%
  filter(!is.na(`Item Description`), 
         !is.na(`Bottles Sold`), 
         !is.na(`State Bottle Retail`)) %>%
  group_by(`Item Description`) %>%
  summarize(total_bottles = sum(`Bottles Sold`),
            avg_price = mean(`State Bottle Retail`)) %>%
  arrange(desc(total_bottles)) %>%
  slice_head(n = 10) %>%
  mutate(price_tier = ifelse(avg_price < 10, "Under $10",
                      ifelse(avg_price < 25, "$10-$25",
                      ifelse(avg_price < 50, "$25-$50", "Over $50")))) %>%
  ggplot(aes(x = reorder(`Item Description`, total_bottles), y = total_bottles, fill = price_tier)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Best Selling Products",
       x = "Product",
       y = "Total Bottles Sold",
       fill = "Price Tier") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Vodka Star Ratings from Reviews Bar Chart

vodka_distiller_reviews %>% 
  ggplot(aes(x = vodka_name, y = vodka_rating)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Whiskey Star Ratings from Reviews Bar Chart

whiskey_distiller_reviews %>% 
  ggplot(aes(x = whiskey_name, y = whiskey_rating)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

When looking at whiskeys and vodkas, more expensive spirits tend to receive higher ratings. Premium whiskeys, often aged longer and produced in smaller batches, consistently score higher than budget options. The same pattern holds for vodkas, where higher-end bottles tend to outperform cheaper alternatives in user reviews.