Canadian Whiskey in America’s Heartland - Exploring and Visualizing Iowa Alcohol Sales

Purpose

The purpose of this exercise was to explore Alcohol Sales data in the State of Iowa for 2018 to see which brands were most popular, and which stores were selling most of the Top 2 selling brands. Going deeper, I was curious if some of these top performing stores were located near colleges. Some colleges tend to be large consumers of alcohol.

Research Questions

Which counties are the top 10 in sales?
What are the top 10 most valuable products in terms of sales?
What are the most popular liqour brands?
What is the spatial distirubtion of top selling locations of the top 2 selling brands?
Are these stores located near college campuses?

This report will be organized in the following sections: 1. Data Cleaning and Pre-Processin 2. Exploratory Data Analysis (EDA) 3. Mapping

Data

Iowa Liquor Sales data was obtained from Kaggle. A more current version of the data can be obtained directly from the State of Iowa. This will provide more context and explanations of the Metadata.

1. Data Cleaning and Pre-Processing

# packages --------------------------------------------------------------
library(tidyverse)
library(lubridate)
library(Hmisc)

Bring in the data

retail_dataimport <- read_csv("./Iowa_Liquor_Sales.csv")

Create a fresh copy of the data so it can be reverted back to easily. Get a sense of the variables in the dataset and their structure.

retail <- retail_dataimport

glimpse(retail)

## Observations: 12,591,077
## Variables: 24
## $ `Invoice/Item Number`   <chr> "S29198800001", "S29195400002", "S29050300001…
## $ Date                    <chr> "11/20/2015", "11/21/2015", "11/16/2015", "11…
## $ `Store Number`          <dbl> 2191, 2205, 3549, 2513, 3942, 3650, 2538, 394…
## $ `Store Name`            <chr> "Keokuk Spirits", "Ding's Honk And Holler", "…
## $ Address                 <chr> "1013 MAIN", "900 E WASHINGTON", "1414 48TH S…
## $ City                    <chr> "KEOKUK", "CLARINDA", "FORT MADISON", "IOWA C…
## $ `Zip Code`              <chr> "52632", "51632", "52627", "52240", "52342", …
## $ `Store Location`        <chr> "1013 MAIN\r\nKEOKUK 52632\r\n(40.39978, -91.…
## $ `County Number`         <chr> "56", "73", "56", "52", "86", "47", "07", "86…
## $ County                  <chr> "Lee", "Page", "Lee", "Johnson", "Tama", "Ida…
## $ Category                <dbl> NA, NA, NA, NA, NA, NA, 1701100, NA, 1701100,…
## $ `Category Name`         <chr> NA, NA, NA, NA, NA, NA, "DECANTERS & SPECIALT…
## $ `Vendor Number`         <dbl> 255, 255, 130, 65, 130, 65, 962, 65, 65, 130,…
## $ `Vendor Name`           <chr> "Wilson Daniels Ltd.", "Wilson Daniels Ltd.",…
## $ `Item Number`           <dbl> 297, 297, 249, 237, 249, 237, 238, 237, 173, …
## $ `Item Description`      <chr> "Templeton Rye w/Flask", "Templeton Rye w/Fla…
## $ Pack                    <dbl> 6, 6, 20, 3, 20, 3, 6, 3, 12, 20, 20, 3, 3, 6…
## $ `Bottle Volume (ml)`    <dbl> 750, 750, 150, 1750, 150, 1750, 1500, 1750, 7…
## $ `State Bottle Cost`     <chr> "$18.09", "$18.09", "$6.40", "$35.55", "$6.40…
## $ `State Bottle Retail`   <chr> "$27.14", "$27.14", "$9.60", "$53.34", "$9.60…
## $ `Bottles Sold`          <dbl> 6, 12, 2, 3, 2, 1, 6, 2, 4, 2, 20, 2, 3, 1, 2…
## $ `Sale (Dollars)`        <chr> "$162.84", "$325.68", "$19.20", "$160.02", "$…
## $ `Volume Sold (Liters)`  <dbl> 4.50, 9.00, 0.30, 5.25, 0.30, 1.75, 9.00, 3.5…
## $ `Volume Sold (Gallons)` <dbl> 1.19, 2.38, 0.08, 1.39, 0.08, 0.46, 2.38, 0.9…

Cleaning - Check for missing data.

Identify which columns have missing values and how many.

retail %>% 
  map(.,~sum(is.na(.)))

## $`Invoice/Item Number`
## [1] 0
## 
## $Date
## [1] 0
## 
## $`Store Number`
## [1] 0
## 
## $`Store Name`
## [1] 0
## 
## $Address
## [1] 2376
## 
## $City
## [1] 2375
## 
## $`Zip Code`
## [1] 2420
## 
## $`Store Location`
## [1] 2375
## 
## $`County Number`
## [1] 79178
## 
## $County
## [1] 79178
## 
## $Category
## [1] 8020
## 
## $`Category Name`
## [1] 16086
## 
## $`Vendor Number`
## [1] 3
## 
## $`Vendor Name`
## [1] 1
## 
## $`Item Number`
## [1] 0
## 
## $`Item Description`
## [1] 0
## 
## $Pack
## [1] 0
## 
## $`Bottle Volume (ml)`
## [1] 0
## 
## $`State Bottle Cost`
## [1] 10
## 
## $`State Bottle Retail`
## [1] 10
## 
## $`Bottles Sold`
## [1] 0
## 
## $`Sale (Dollars)`
## [1] 10
## 
## $`Volume Sold (Liters)`
## [1] 0
## 
## $`Volume Sold (Gallons)`
## [1] 0

#We ignore the entire row(ie observation), if any column has a missing value
retail <- retail[complete.cases(retail), ]

# Check to see if we have any NA's

retail %>% 
  map(., ~sum(is.na(.)))

## $`Invoice/Item Number`
## [1] 0
## 
## $Date
## [1] 0
## 
## $`Store Number`
## [1] 0
## 
## $`Store Name`
## [1] 0
## 
## $Address
## [1] 0
## 
## $City
## [1] 0
## 
## $`Zip Code`
## [1] 0
## 
## $`Store Location`
## [1] 0
## 
## $`County Number`
## [1] 0
## 
## $County
## [1] 0
## 
## $Category
## [1] 0
## 
## $`Category Name`
## [1] 0
## 
## $`Vendor Number`
## [1] 0
## 
## $`Vendor Name`
## [1] 0
## 
## $`Item Number`
## [1] 0
## 
## $`Item Description`
## [1] 0
## 
## $Pack
## [1] 0
## 
## $`Bottle Volume (ml)`
## [1] 0
## 
## $`State Bottle Cost`
## [1] 0
## 
## $`State Bottle Retail`
## [1] 0
## 
## $`Bottles Sold`
## [1] 0
## 
## $`Sale (Dollars)`
## [1] 0
## 
## $`Volume Sold (Liters)`
## [1] 0
## 
## $`Volume Sold (Gallons)`
## [1] 0

## Boom yes! We have considerably reduced misssing values.

Inspect the dataset

glimpse(retail)

## Observations: 12,495,974
## Variables: 24
## $ `Invoice/Item Number`   <chr> "S28865700001", "S29339300091", "S28866900001…
## $ Date                    <chr> "11/09/2015", "11/30/2015", "11/11/2015", "11…
## $ `Store Number`          <dbl> 2538, 2662, 3650, 3723, 2642, 3842, 2539, 460…
## $ `Store Name`            <chr> "Hy-Vee Food Store #3 / Waterloo", "Hy-Vee Wi…
## $ Address                 <chr> "1422 FLAMMANG DR", "522 MULBERRY, SUITE A", …
## $ City                    <chr> "WATERLOO", "MUSCATINE", "HOLSTEIN", "ONAWA",…
## $ `Zip Code`              <chr> "50702", "52761", "51025", "51040", "50219", …
## $ `Store Location`        <chr> "1422 FLAMMANG DR\r\nWATERLOO 50702\r\n(42.45…
## $ `County Number`         <chr> "07", "70", "47", "67", "63", "55", "42", "50…
## $ County                  <chr> "Black Hawk", "Muscatine", "Ida", "Monona", "…
## $ Category                <dbl> 1701100, 1701100, 1701100, 1081200, 1701100, …
## $ `Category Name`         <chr> "DECANTERS & SPECIALTY PACKAGES", "DECANTERS …
## $ `Vendor Number`         <dbl> 962, 65, 962, 305, 962, 962, 962, 962, 65, 96…
## $ `Vendor Name`           <chr> "Duggan's Distillers Products Corp", "Jim Bea…
## $ `Item Number`           <dbl> 238, 173, 238, 258, 238, 238, 238, 238, 173, …
## $ `Item Description`      <chr> "Forbidden Secret Coffee Pack", "Laphroaig w/…
## $ Pack                    <dbl> 6, 12, 6, 1, 6, 6, 6, 6, 12, 6, 1, 1, 12, 1, …
## $ `Bottle Volume (ml)`    <dbl> 1500, 750, 1500, 6000, 1500, 1500, 1500, 1500…
## $ `State Bottle Cost`     <chr> "$11.62", "$19.58", "$11.62", "$99.00", "$11.…
## $ `State Bottle Retail`   <chr> "$17.43", "$29.37", "$17.43", "$148.50", "$17…
## $ `Bottles Sold`          <dbl> 6, 4, 1, 1, 6, 3, 6, 2, 36, 12, 1, 1, 1, 2, 1…
## $ `Sale (Dollars)`        <chr> "$104.58", "$117.48", "$17.43", "$148.50", "$…
## $ `Volume Sold (Liters)`  <dbl> 9.00, 3.00, 1.50, 6.00, 9.00, 4.50, 9.00, 3.0…
## $ `Volume Sold (Gallons)` <dbl> 2.38, 0.79, 0.40, 1.59, 2.38, 1.19, 2.38, 0.7…

We need to change Date from “chr” to “mdy”. Then change Item_Description, County, Invoice Number, Category Name, to a Factor for analysis. We will also need to make some minor manipulations using gsub to remove underscores and slashes in variable names.

#Replace spaces in variable names with underscores.
names(retail) <- gsub("/", "_", names(retail))
names(retail) <- gsub(" ", "_", names(retail))

# Also remove the slashes between dates, they make formatting difficult.
retail$Date <- gsub("/","",retail$Date)

names(retail) <- gsub("[()]", "", names(retail))
retail$Sale_Dollars <- as.numeric(gsub("\\$", "", retail$Sale_Dollars))
retail$State_Bottle_Cost <- as.numeric(gsub("\\$", "", retail$State_Bottle_Cost))
retail$State_Bottle_Retail <- as.numeric(gsub("\\$", "", retail$State_Bottle_Retail))


# The last step is to convert county names to uppercase. They were recorded incorrectly with upper and lowercase.
retail <- retail %>% mutate_each(funs(toupper),County)

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

# Coerce more variables into factor form
retail_cleaned <- retail %>%
  mutate(Date = mdy(Date)) %>% #coerces InvoiceDate in a Date Time format
  mutate(Item_Description = factor(Item_Description, levels = unique(Item_Description))) %>% 
#Coerces Description as a factor with each item as individual level of a factor
  
  
  mutate(County = factor(County, levels = unique(County)))%>%
  mutate(Invoice_Item_Number = factor(Invoice_Item_Number, levels = unique(Invoice_Item_Number))) %>%
  mutate(Category_Name = factor(Category_Name, levels = unique(Category_Name)))

Inspect the data again to examine new variables and changed data types.

glimpse(retail_cleaned)

## Observations: 12,495,974
## Variables: 24
## $ Invoice_Item_Number <fct> S28865700001, S29339300091, S28866900001, S291343…
## $ Date                <date> 2015-11-09, 2015-11-30, 2015-11-11, 2015-11-18, …
## $ Store_Number        <dbl> 2538, 2662, 3650, 3723, 2642, 3842, 2539, 4604, 2…
## $ Store_Name          <chr> "Hy-Vee Food Store #3 / Waterloo", "Hy-Vee Wine &…
## $ Address             <chr> "1422 FLAMMANG DR", "522 MULBERRY, SUITE A", "118…
## $ City                <chr> "WATERLOO", "MUSCATINE", "HOLSTEIN", "ONAWA", "PE…
## $ Zip_Code            <chr> "50702", "52761", "51025", "51040", "50219", "505…
## $ Store_Location      <chr> "1422 FLAMMANG DR\r\nWATERLOO 50702\r\n(42.459938…
## $ County_Number       <chr> "07", "70", "47", "67", "63", "55", "42", "50", "…
## $ County              <fct> BLACK HAWK, MUSCATINE, IDA, MONONA, MARION, KOSSU…
## $ Category            <dbl> 1701100, 1701100, 1701100, 1081200, 1701100, 1701…
## $ Category_Name       <fct> DECANTERS & SPECIALTY PACKAGES, DECANTERS & SPECI…
## $ Vendor_Number       <dbl> 962, 65, 962, 305, 962, 962, 962, 962, 65, 962, 3…
## $ Vendor_Name         <chr> "Duggan's Distillers Products Corp", "Jim Beam Br…
## $ Item_Number         <dbl> 238, 173, 238, 258, 238, 238, 238, 238, 173, 238,…
## $ Item_Description    <fct> Forbidden Secret Coffee Pack, Laphroaig w/ Whiske…
## $ Pack                <dbl> 6, 12, 6, 1, 6, 6, 6, 6, 12, 6, 1, 1, 12, 1, 6, 1…
## $ Bottle_Volume_ml    <dbl> 1500, 750, 1500, 6000, 1500, 1500, 1500, 1500, 75…
## $ State_Bottle_Cost   <dbl> 11.62, 19.58, 11.62, 99.00, 11.62, 11.62, 11.62, …
## $ State_Bottle_Retail <dbl> 17.43, 29.37, 17.43, 148.50, 17.43, 17.43, 17.43,…
## $ Bottles_Sold        <dbl> 6, 4, 1, 1, 6, 3, 6, 2, 36, 12, 1, 1, 1, 2, 1, 2,…
## $ Sale_Dollars        <dbl> 104.58, 117.48, 17.43, 148.50, 104.58, 52.29, 104…
## $ Volume_Sold_Liters  <dbl> 9.00, 3.00, 1.50, 6.00, 9.00, 4.50, 9.00, 3.00, 2…
## $ Volume_Sold_Gallons <dbl> 2.38, 0.79, 0.40, 1.59, 2.38, 1.19, 2.38, 0.79, 7…

The data has been satisfactorily cleaned and processed. Save as an .Rdata file to make it easy to import for the EDA.

save(retail_cleaned, file = "retail_cleaned.RData")

2. Exploratory Data Analysis (EDA) and Visualization

Questions

Now that the data has been cleaned and process let’s explore and visualize some of the trends we see in the data. Specifically, we should investigate:

Which counties are the top 10 in sales?
What are the top 10 most valuable products in terms of sales?
What are the most popular liquor brands?

# packages --------------------------------------------------------------

library(forcats)

Load the data and inpsect the data.

load("./retail_cleaned.RData")

Exploratory Data Analysis (EDA)

Summarize the data according to the counties that have the most sales and identify the Top 10 Best Selling Counties.

# Summarize the data by the counties that have the most sales.
county_mostsales<-retail_cleaned %>% 
  group_by(County) %>% 
  summarise(countywise_sales = sum(Sale_Dollars)/1000000) %>% 
  arrange(desc(countywise_sales))

## Identify Top 10 Counties
by_top10_counties <- county_mostsales %>% 
  top_n(n = 10, wt = countywise_sales)

Visualize the top 10 counties as per total sales.

by_top10_counties %>% 
    mutate(County = fct_reorder(County, countywise_sales)) %>% 
    ggplot(aes(County, countywise_sales))+
    geom_bar(stat = "identity",fill = "green4")+
  theme_minimal()+
  ggtitle('Top 10 Liquor Selling Counties')+
  coord_flip()+
  ylab('Total Sales ($M)')+
  theme(plot.title = element_text(hjust = 0.5))

So Polk county has the most sales by an order of magnitude. Polk County contains the state capital, Des Moines, and is the most populous county in Iowa according to World Population Review. Let’s remove Polk county and then look at the rest of the data in the top 10 counties.

county_mostsales_without_polk <-  county_mostsales[-1,]  
  
by_top10_counties_without_polk <- county_mostsales_without_polk %>% 
  top_n(n=10, wt = countywise_sales )

by_top10_counties_without_polk %>% 
  mutate(County = fct_reorder(County, countywise_sales)) %>% 
  ggplot(aes(County, countywise_sales))+
  geom_bar(stat = "identity",fill = "green4")+
  theme_minimal()+
  ggtitle('Top 10 Liquor Selling Counties')+
  coord_flip()+
  ylab('Total Sales ($M)')+
  theme(plot.title = element_text(hjust = 0.5))

So without Polk county, Linn county is the second leader in Liquor sales in the state of Iowa with $125M in sales for the reporting period. That is a lot of alcohol!

What are the most valued products? What are the Top 10 most valuable products?

mostvalued_product <- retail_cleaned %>% 
  group_by(Item_Description) %>%
  summarise(value_of_product = sum(Sale_Dollars)/1000000) %>% 
  arrange(desc(value_of_product)) 

top10_mostvalued_products <- mostvalued_product %>% 
  top_n(n = 10, wt = value_of_product)

The categorical variables Item_Description does not have an inherent order. Reorder it as an increasing count. Visualize Top 10 Best Selling Products.

top10_mostvalued_products %>% 
  mutate(Item_Description = fct_reorder(Item_Description, value_of_product)) %>% 
  
  ggplot(aes(Item_Description, value_of_product))+
   
    geom_bar(stat = "identity",fill = "green4")+
    theme_minimal()+
    ggtitle('Top 10 Most Popular Products')+
    coord_flip()+
    ylab('Total Sales ($M)')+
    theme(plot.title = element_text(hjust = 0.5))

Interesetingly Canadian Whisky is the most valued Liquor product in America’s Heartland. Followed by Jack Daniels, then Captain Morgan.

What are the best selling liquor brands in Iowa? What are the top 10 best brands?

# What are the best selling liquor brands in Iowa

mostsold_brand <- retail_cleaned %>% 
  group_by(Item_Description) %>% 
  summarise(total_units_sold = sum(Bottles_Sold)/1000000) %>% 
  arrange(desc(total_units_sold))

top10_mostsold_brand <- mostsold_brand %>% 
  top_n(n=10,wt=total_units_sold)

most_sold<- top10_mostsold_brand %>% 
  mutate(Item_Description = fct_reorder(Item_Description, total_units_sold)) %>% 
  ggplot(aes(Item_Description, total_units_sold))+
  geom_bar(stat = "identity",fill = "green4")+
  theme_minimal()+
  ggtitle('Top 10 Most Sold Brands')+
  coord_flip()+
  ylab('Total Sales ($M)')+
  theme(plot.title = element_text(hjust = 0.5))

most_sold

Black Velvet again. It is the most sold and accounts for the most value followed by Hawkeye Vodka (This seems logical given their passion for college football) and Captain Morgan Spiced Rum.

Which stores sold the most Black Velvet and Hawkeye Vodka? First we must remove empty spaces in Item_Description names.

blackvelvet <- retail_cleaned %>% 
   filter(Item_Description == "Black Velvet") %>% 
  group_by(Store_Number) %>% 
  summarise(allSold = sum(Bottles_Sold),
            totalSales = sum(Sale_Dollars),
            totalvolGal = sum(Volume_Sold_Gallons)) %>% 
  arrange(desc(allSold)) %>%
  top_n(n=10, wt = allSold)

Use store number for the join.Create data frame just with locations and pertinent summary information.

top10bvstores<- retail_cleaned %>% 
  select(Store_Number,Store_Name,Address,City,Zip_Code,Store_Location,
         County_Number,Category_Name,Vendor_Name,Item_Description) %>% 
  filter(Item_Description == "Black Velvet") %>% 
  arrange(desc(Store_Number)) %>% 
  distinct(Store_Number,.keep_all = TRUE)
  
bv_most_pop<- top10bvstores %>% 
  inner_join(blackvelvet,top10bvstores, by = "Store_Number")

Now find way to break out spatial coordinates to store locations. Turn string into data list, then dataframe.

bv_data_clean <- c("7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266\n(41.561342, -93.806489)",
                  "305 AIRPORT RD\nAMES 50010\n(42.001123, -93.61365)",               
                  "210 EAST TOWER PARK DR\nWATERLOO 50702\n(42.456362, -92.352552)",
                  "4201 S. YORK ST.\nSIOUX CITY 51106\n(42.433711, -96.370146)",          
                  "1101 73RD STREET\nWINDSOR HEIGHTS 50311\n(41.599172, -93.718027)",     
                  "1610 OKOBOJI AVENUE\nMILFORD 51351\n(43.331525, -95.149955)",          
                  "2605 BLAIRS FERRY RD NE\nCEDAR RAPIDS 52402\n(42.034737, -91.679406)", 
                  "1511 2ND AVE NORTH\nFORT DODGE 50501\n(42.508344, -94.177165)",        
                  "3221 SE 14TH ST\nDES MOINES 50320\n(41.554101, -93.596754)",           
                  "1201 12TH AVE SW\nLEMARS 51031\n(42.778257, -96.18335)"               
                  )      
     
bv_data_clean <- as.data.frame(bv_data_clean)

Separate out lat and long and bind them to original Black Velvet Top 10 Dataset.Now write it to a .csv. Repeat process for Hawkeye Vodka.

bvdc_latlon <-bv_data_clean%>%
  mutate(bv_data_clean = gsub('\n', '', bv_data_clean)) %>%
  extract(bv_data_clean, into = c('address', 'lat', 'lon'), 
          regex = '(.*)\\((.*),\\s+(.*)\\)', convert = TRUE) %>% 
          bind_cols(bv_most_pop) %>% 
          write.csv(.,file = "bvdc_latlon.csv")

Continuing the same process for Hawkeye Vodka

hawkeye <- retail_cleaned %>% 
  filter(Item_Description == "Hawkeye Vodka") %>% 
  group_by(Store_Number) %>% 
  summarise(allSold = sum(Bottles_Sold),
            totalSales = sum(Sale_Dollars),
            totalvolGal = sum(Volume_Sold_Gallons)) %>% 
  arrange(desc(allSold)) %>%
  top_n(n=10, wt = allSold)


top10hvstores<- retail_cleaned %>% 
  select(Store_Number,Store_Name,Address,City,Zip_Code,Store_Location,
         County_Number,Category_Name,Vendor_Name,Item_Description) %>% 
  filter(Item_Description == "Hawkeye Vodka") %>% 
  arrange(desc(Store_Number)) %>% 
  distinct(Store_Number,.keep_all = TRUE)

hv_most_pop<- top10hvstores %>% 
  inner_join(hawkeye,top10hvstores, by = "Store_Number")

hv_data_clean <- c( "1501 MICHIGAN AVE\nDES MOINES 50314\n(41.605561, -93.613738)",          
                    "1373 PIERCE ST\nSIOUX CITY 51105\n(42.504732, -96.405013)",             
                    "507 W 19th St\nSIOUX CITY 51103\n(42.510535, -96.420193)",              
                    "7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266\n(41.561342, -93.806489)", 
                    "1101 73RD STREET\nWINDSOR HEIGHTS 50311\n(41.599172, -93.718027)",      
                    "1511 2ND AVE NORTH\nFORT DODGE 50501\n(42.508344, -94.177165)",         
                    "3221 SE 14TH ST\nDES MOINES 50320\n(41.554101, -93.596754)",            
                    "4100 UNIVERSITY AVE\nDES MOINES 50311\n(41.600361, -93.673223)",        
                    "2001 BLAIRS FERRY ROAD NE\nCEDAR RAPIDS 52402\n(42.034799, -91.668909)",
                    "551 S ILLINOIS AVE\nMASON CITY 50401\n(43.14623, -93.17114)"           
                    )

hv_data_clean <- as.data.frame(hv_data_clean)

hvdc_latlon <-hv_data_clean%>%
  mutate(hv_data_clean = gsub('\n', '', hv_data_clean)) %>%
  extract(hv_data_clean, into = c('address', 'lat', 'lon'), 
          regex = '(.*)\\((.*),\\s+(.*)\\)', convert = TRUE) %>% 
  bind_cols(hv_most_pop) %>% 
  write.csv(.,file = "hvdc_latlon.csv")

Good. Now both datasets have been exported and can be imported for spatial manipulation in a subsequent analysis.

3. Mapping Top Selling Liquor Stores by Most Popular Brand

In th previous section we found that Black Velvet and Hawkeye Vodka were the top seling liquor brands in Iowa. It would be interesting to plot the spatial distribution of these stores to see if any illustrative spatial patterns emerge. In addition, we have a hypothesis we would like to test.

Hypothesis:

If stores are top sellers of Back Velvet or Hawkeye Vodka, then they will be located near college campuses.

# Install/Load Packages
library(leaflet)
library(leaflet.extras)
library(htmltools)

Data - Load data

In order to do this we must extract data from the IPEDS database containing information on all 4-year colleges within the U.S. The dataset can be found here :IPEDS.

ipeds <- read_csv("./IPEDS2018.csv")
ipeds <- as.data.frame(ipeds)
# Load in Top Selling stores data from previous exercise

hawkeye <- read_csv("./hvdc_latlon.csv")
blackvelvet <- read_csv("./bvdc_latlon.csv")

Inspect the data

glimpse(hawkeye)

## Observations: 10
## Variables: 17
## $ X1               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ address          <chr> "1501 MICHIGAN AVEDES MOINES 50314", "1373 PIERCE ST…
## $ lat              <dbl> 41.60556, 42.50473, 42.51053, 41.56134, 41.59917, 42…
## $ lon              <dbl> -93.61374, -96.40501, -96.42019, -93.80649, -93.7180…
## $ Store_Number     <dbl> 4829, 3814, 3420, 2644, 2633, 2626, 2552, 2544, 2517…
## $ Store_Name       <chr> "Central City 2", "Costco Wholesale #788", "Sam's Cl…
## $ Address          <chr> "1501 MICHIGAN AVE", "7205 MILLS CIVIC PKWY", "1101 …
## $ City             <chr> "DES MOINES", "WEST DES MOINES", "WINDSOR HEIGHTS", …
## $ Zip_Code         <dbl> 50314, 50266, 50311, 50501, 50320, 50311, 52404, 501…
## $ Store_Location   <chr> "1501 MICHIGAN AVE\r\r\nDES MOINES 50314\r\r\n(41.60…
## $ County_Number    <dbl> 77, 25, 77, 94, 77, 77, 57, 64, 50, 52
## $ Category_Name    <chr> "VODKA 80 PROOF", "VODKA 80 PROOF", "VODKA 80 PROOF"…
## $ Vendor_Name      <chr> "Luxco-St Louis", "Luxco-St Louis", "Luxco-St Louis"…
## $ Item_Description <chr> "Hawkeye Vodka", "Hawkeye Vodka", "Hawkeye Vodka", "…
## $ allSold          <dbl> 21250, 22299, 37571, 27916, 32506, 25323, 21471, 303…
## $ totalSales       <dbl> 134915.4, 232703.2, 355015.7, 211826.7, 203534.8, 17…
## $ totalvolGal      <dbl> 5666.97, 10308.76, 15375.67, 8847.08, 8702.11, 6539.…

glimpse(blackvelvet)

## Observations: 10
## Variables: 17
## $ X1               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ address          <chr> "7205 MILLS CIVIC PKWYWEST DES MOINES 50266", "305 A…
## $ lat              <dbl> 41.56134, 42.00112, 42.45636, 42.43371, 41.59917, 43…
## $ lon              <dbl> -93.80649, -93.61365, -92.35255, -96.37015, -93.7180…
## $ Store_Number     <dbl> 3814, 3524, 3494, 3447, 3420, 3385, 2633, 2604, 2595…
## $ Store_Name       <chr> "Costco Wholesale #788", "Sam's Club 6568 / Ames", "…
## $ Address          <chr> "7205 MILLS CIVIC PKWY", "305 AIRPORT RD", "210 EAST…
## $ City             <chr> "WEST DES MOINES", "AMES", "WATERLOO", "SIOUX CITY",…
## $ Zip_Code         <dbl> 50266, 50010, 50702, 51106, 50311, 52402, 50320, 510…
## $ Store_Location   <chr> "7205 MILLS CIVIC PKWY\r\r\nWEST DES MOINES 50266\r\…
## $ County_Number    <chr> "25", "85", "07", "97", "77", "57", "77", "75", "24"…
## $ Category_Name    <chr> "CANADIAN WHISKIES", "CANADIAN WHISKIES", "CANADIAN …
## $ Vendor_Name      <chr> "Constellation Wine Company, Inc.", "Constellation W…
## $ Item_Description <chr> "Black Velvet", "Black Velvet", "Black Velvet", "Bla…
## $ allSold          <dbl> 25855, 29562, 33056, 24249, 43064, 39366, 34168, 240…
## $ totalSales       <dbl> 372030.2, 414180.2, 432051.1, 313857.3, 586250.9, 49…
## $ totalvolGal      <dbl> 11866.18, 12710.64, 12922.11, 9322.21, 17903.93, 146…

glimpse(ipeds)

## Observations: 6,857
## Variables: 73
## $ UNITID   <dbl> 100654, 100663, 100690, 100706, 100724, 100733, 100751, 1007…
## $ INSTNM   <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ IALIAS   <chr> "AAMU", NA, "Southern Christian University |Regions Universi…
## $ ADDR     <chr> "4900 Meridian Street", "Administration Bldg Suite 1070", "1…
## $ CITY     <chr> "Normal", "Birmingham", "Montgomery", "Huntsville", "Montgom…
## $ STABBR   <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ ZIP      <chr> "35762", "35294-0110", "36117-3553", "35899", "36104-0271", …
## $ FIPS     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ OBEREG   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ CHFNM    <chr> "Dr. Andrew Hugine, Jr.", "Ray L. Watts", "Michael C.Turner"…
## $ CHFTITLE <chr> "President", "President", "President", "President", "Preside…
## $ GENTELE  <dbl> 2.563725e+09, 2.059344e+09, 3.343870e+13, 2.568246e+09, 3.34…
## $ EIN      <dbl> 636001109, 636005396, 237034324, 630520830, 636001101, 63600…
## $ DUNS     <chr> "197216455", "63690705", "126307792", "949687123", "40672685…
## $ OPEID    <dbl> 100200, 105200, 2503400, 105500, 100500, 800400, 105100, 100…
## $ OPEFLAG  <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, …
## $ WEBADDR  <chr> "www.aamu.edu/", "www.uab.edu", "www.amridgeuniversity.edu",…
## $ ADMINURL <chr> "www.aamu.edu/Admissions/Pages/default.aspx", "www.uab.edu/s…
## $ FAIDURL  <chr> "www.aamu.edu/admissions/fincialaid/pages/default.aspx", "ww…
## $ APPLURL  <chr> "https://www.aamu.edu/Admissions/UndergraduateAdmissions/Pag…
## $ NPRICURL <chr> "https://galileo.aamu.edu/NetPriceCalculator/npcalc.htm", "u…
## $ VETURL   <chr> NA, "www.uab.edu/students/veterans", "www.amridgeuniversity.…
## $ ATHURL   <chr> NA, "www.uab.edu/registrar/students", NA, "www.uah.edu/heoa"…
## $ DISAURL  <chr> "www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Se…
## $ SECTOR   <dbl> 1, 1, 2, 1, 1, 0, 1, 4, 1, 1, 1, 2, 4, 2, 3, 4, 4, 2, 4, 9, …
## $ ICLEVEL  <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 3, …
## $ CONTROL  <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 1, 2, 1, 3, …
## $ HLOFFER  <dbl> 9, 9, 9, 9, 9, 9, 9, 3, 7, 9, 9, 5, 3, 5, 9, 3, 3, 9, 3, 2, …
## $ UGOFFER  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ GROFFER  <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, …
## $ HDEGOFR1 <dbl> 12, 11, 12, 11, 11, 11, 11, 40, 20, 12, 11, 30, 40, 30, 13, …
## $ DEGGRANT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, …
## $ HBCU     <dbl> 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, …
## $ HOSPITAL <dbl> 2, 1, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, -2, -2, 2, -2, -2, 2, -…
## $ MEDICAL  <dbl> 2, 1, 2, 2, 2, -2, 2, 2, 2, 2, 1, 2, 2, -2, 2, 2, 2, 2, 2, 2…
## $ TRIBAL   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ LOCALE   <dbl> 12, 12, 12, 12, 12, 12, 12, 32, 31, 12, 13, 12, 41, 32, 12, …
## $ OPENPUBL <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ACT      <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
## $ NEWID    <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, …
## $ DEATHYR  <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 2018, -2…
## $ CLOSEDAT <chr> "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", "-2", …
## $ CYACTIVE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, …
## $ POSTSEC  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ PSEFLAG  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, …
## $ PSET4FLG <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, …
## $ RPTMTH   <dbl> 1, 1, 1, 1, 1, -2, 1, 1, 1, 1, 1, 1, 1, -2, 1, 1, 1, 1, 1, 2…
## $ INSTCAT  <dbl> 2, 2, 2, 2, 2, -2, 2, 4, 2, 2, 2, 2, 4, -2, 2, 4, 4, 2, 4, 6…
## $ C18BASIC <dbl> 18, 15, 20, 16, 19, -2, 15, 2, 22, 18, 15, 21, 1, 23, 20, 5,…
## $ C18IPUG  <dbl> 16, 17, 19, 17, 13, -2, 17, 2, 15, 16, 17, 9, 2, 5, 17, 3, 2…
## $ C18IPGRD <dbl> 17, 17, 18, 17, 13, -2, 15, 0, 0, 4, 14, 0, 0, 0, 8, 0, 0, 1…
## $ C18UGPRF <dbl> 10, 9, 5, 15, 10, -2, 14, 2, 5, 9, 14, 14, 2, 10, 7, 2, 2, 7…
## $ C18ENPRF <dbl> 4, 5, 5, 4, 3, -2, 4, 1, 3, 3, 4, 2, 1, 2, 3, 1, 1, 4, 1, -2…
## $ C18SZSET <dbl> 14, 15, 6, 12, 14, -2, 16, 2, 9, 13, 15, 11, 2, 8, 6, 2, 4, …
## $ C15BASIC <dbl> 18, 15, 20, 16, 19, -2, 16, 1, 22, 18, 16, 21, 1, 22, 22, 8,…
## $ CCBASIC  <dbl> 18, 15, 21, 15, 18, -2, 16, 2, 22, 18, 16, 21, 2, -2, 23, 2,…
## $ CARNEGIE <dbl> 16, 15, 51, 16, 21, -2, 15, 40, 32, 21, 15, 31, 40, -2, 40, …
## $ LANDGRNT <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ INSTSIZE <dbl> 3, 5, 1, 3, 2, -2, 5, 2, 2, 3, 5, 2, 2, -2, 1, 2, 3, 2, 2, 1…
## $ F1SYSTYP <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, -2, 1, 1, 1, 2, 1, 2,…
## $ F1SYSNAM <chr> "-2", "The University of Alabama System", "-2", "The Univers…
## $ F1SYSCOD <dbl> -2, 101050, -2, 101050, -2, 101050, 101050, 101030, -2, 1010…
## $ CBSA     <dbl> 26620, 13820, 33860, 26620, 33860, 46220, 46220, 10760, 2662…
## $ CBSATYPE <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, …
## $ CSA      <dbl> 290, 142, -2, 290, -2, -2, -2, 142, 290, -2, 194, 142, 194, …
## $ NECTA    <dbl> -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, …
## $ COUNTYCD <dbl> 1089, 1073, 1101, 1089, 1101, 1125, 1125, 1123, 1083, 1101, …
## $ COUNTYNM <chr> "Madison County", "Jefferson County", "Montgomery County", "…
## $ CNGDSTCD <dbl> 105, 107, 102, 105, 107, 107, 107, 103, 105, 102, 103, 107, …
## $ LONGITUD <dbl> -86.56850, -86.79935, -86.17401, -86.64045, -86.29568, -87.5…
## $ LATITUDE <dbl> 34.78337, 33.50570, 32.36261, 34.72456, 32.36432, 33.20701, …
## $ DFRCGID  <dbl> 119, 105, 137, 109, 127, -2, 103, 75, 147, 120, 104, 144, 70…
## $ DFRCUSCG <dbl> 1, 1, 2, 2, 1, -2, 1, 2, 1, 1, 1, 1, 2, -2, 1, 2, 1, 2, 2, 1…

IPEDS contain a lot of data we don’t really need. Subset data related to insitution name and location. I filtered for schools in Iowa that were four years or more. I wanted to try to remove community colleges and vocational schools as they are less likely to have a student population residing on or near campus.Close proximity to campus being the primary locus of the purchase and consumption of alcohol.

ipeds_iowa <- ipeds %>% 
      select(INSTNM,IALIAS,ADDR,CITY,STABBR,ICLEVEL,ZIP,LONGITUD,LATITUDE) %>% 
      dplyr::filter(STABBR == "IA", ICLEVEL == '1') %>% 
      rename(lon = LONGITUD, lat = LATITUDE)

Mapping

Let’s start fresh by just plotting colleges first and calling that our base map. First we need to generate some icons for universities. Some pretty neat icons can be found at Font Awesome.

college_icons <- awesomeIcons(icon='university',
                              library = 'fa',
                              markerColor = 'orange')
# This is our base map
map<- leaflet() %>% 
      addProviderTiles("CartoDB")

# Let's save out one for college to making plotting easier later.

college_map <- map

college_map

So now we have our base map. Now let’s add the icons we made.The clusterOptions argument clusters icons into regions so the map is more readable.

college_map <- college_map %>% 

  addAwesomeMarkers(data=ipeds_iowa,icon =college_icons, 
                   popup = ~INSTNM,group='Schools',
                   clusterOptions = markerClusterOptions())

college_map

Create a color pallette and plot the map.

pal <- colorFactor(palette = c("blue","red"),
                   domain = c("Black Velvet", "Hawkeye Vodka"))

#Add markers for Top 10 Black Velvet and Hawkeye Vodka selling stores.
liquor_collegemap <- college_map %>%
                    addCircleMarkers(data = blackvelvet,radius = 2,
                     color= ~pal(Item_Description),
                     group = "Black Velvet",
                      label = ~Store_Name) %>% 
  
                  addCircleMarkers(data = hawkeye,radius = 2,
                   color= ~pal(Item_Description),
                   group = "Hawkeye Vodka",
                   label = ~Store_Name) %>%
  
  #Add the ability to toggle layers on/off to help with visual analysis.
                  addLayersControl(
                    overlayGroups = c("Black Velvet",
                                      "Hawkeye Vodka")) %>% 
  #Add a legend to the lower right-hand side of the map.
                  addLegend("bottomright", colors= c("blue", "red"), 
                    labels=c("Black Velvet", "Hawkeye")) %>% 
  #Add a measuring tool to inspect distance between stores and schools.
                    addMeasure()
liquor_collegemap

Conclusions

From this mapping exercise we can refute our hypothesis that the best selling stores will be near college campuses. It appears the more likely explanation is that the best selling store tend to be located in larger population centers near major roads and receive more customer traffic. Use the map to explore spatial relationships and possible correlations between the top two selling brands (Hawkeye Vodka and Black Velvet) and other phenomena that may result in high sales of these two particular products.