Consumer Grocery Trends Based on Sales Data

#Data sets for the second half data, filtered by state# 

ByStateURL <- "https://www.ers.usda.gov/webdocs/DataFiles/100189/StateAndCategory.csv?v=5936.5"
download.file(ByStateURL, "U://Data Vis//Rfiles//GrocerySales_ByState.csv")
Statefile <-"U://Data Vis//Rfiles//GrocerySales_ByState.csv"
dfState <-fread(Statefile, na.strings=c(NA, ""))
dfState_2 <-fread(Statefile, na.strings=c(NA, ""))

#Filter to only have PERCENT CHANGE UNITS 1 YEAR and look at ALCOHOL as a more leisurely spend than eggs, which was my original deep dive#
dfState <- dfState[grep("Percent change units 1 year", dfState$Variable),]
dfState<- dfState[grep("Alcohol", dfState$Category),]

#Aggregate the data into just by year, averaging percent change for each month for the year into a new DF called dfstate_year #
xstate<-ymd(dfState$Date)                  
dfState$year <- year(xstate)
dfState$month<-months(xstate,abbreviate=TRUE)
dfState$year <- floor_date(as.Date(dfState$Date), unit="year")
dfState_year <- aggregate(Value ~ year + State, data=dfState, mean)

#Turn the year variable back into "year" # 
xstate_year<-ymd(dfState_year$year)
dfState_year$year <-year(xstate_year)

#Turn to a factor for the years when graphing so each year is a level#
dfState_year$year <- factor(dfState_year$year)

#Filter out 2019 to keep to the 2020s #
dfState_year <- dfState_year %>%
  filter(year!=2019)



## INSTEAD OF UNITS, look at SHARE# 
#The set up of data frames used

dfshare <- df %>%
  filter(Variable == 'Share')
#Filter out All foods as this is 100% of all the categories/subcategories
dfshare <- dfshare %>%
  filter(Category!="All foods")

xshare<-ymd(dfshare$Date)
dfshare$year <-year(xshare)
dfshare$month <-month(xshare)
dfshare$monthname<-months(xshare,abbreviate=TRUE)
dfshare$year <- factor(dfshare$year)

#Narrow down to Alcohol and order in descending order the Shares to see the top dates#
dfshare_alc <- dfshare[order(-Value),]
dfshare_alc <- dfshare_alc[dfshare_alc$Category == 'Alcohol',]
dfshare_bev <- dfshare 
dfshare_bev <- dfshare_bev[dfshare_bev$Category %in% c("Alcohol", "Beverages"),]


#ggplot bar for tab 5, will be plotly#

#Try plotly for interactive#
#Need to sort by date so Plotly knows # 
dfshare_alc_sorted <-dfshare_alc %>%
  arrange(Date)

Introduction

There’s no question that the 2020 pandemic caused numerous lifestyle changes. Some had lasted and become the norm (remote and hybrid work), while others faded out (sanitizing each grocery item before bringing them in the house). Eventually, we had a sense of safety again and returned to some normal habits as businesses opened back up.

I was curious how consumer behavior had shifted within their grocery shop across the past 3 years, without having restaurants to go to, being stuck inside, and now, dealing with inflation.

Dataset

Collected by the Economic Research Service and the USDA, the Weekly Retail Food Sales series include sales from a nationally representative sample of grocery stores and other retail food outlets. The national sales data are documented across 10 categories and 51 subcategories and reported on a weekly basis, beginning with the week ending October 6, 2019.

A secondary set is used as well which aggregates totals by 39 states; however, this data set is limited to the 10 categories and isn’t broken out by the 51 subcategories.

This data set collects an extensive amount of sales measurements, but the ones we’ll explore here include unit sales, percent change in unit sales from 1 year ago, and the share (category’s percentage of entirety of that weeks’ grocery sales, out of 100)

For more information on this data set, visit ers.usda.gov

Findings

For the most part, over the past three years, the average grocery basket has been generally broken out the same percentage between the various categories captured in this data set.

As alcohol is one of the larger categories by share, and by its’ nature is a unique commodity, I decided to focus on this category for the remainder of the tabs.

Tab 1

Share of grocery spend across categories from 2019-2022. The categories remain steady in their share for the most part. These are yearly averaged totals since they are reported on a weekly basis.

# lattice bar chart for grocery share across category# 
df_test <- aggregate(Value ~ year + Category, data=dfshare, mean)


ggplot(df_test, aes(x=Category, y=Value, fill=Category)) +
  geom_bar(stat="identity", position="dodge") +
  theme_classic() +
  theme(plot.title=element_text(hjust=0.5)) +
  labs(title="Share of Grocery Spend Across Categories, by Year",
       y= "Share", 
       fill="Category")+
  facet_wrap(~year, ncol=2, nrow=2, scales="free")+
  theme(panel.spacing=unit(1, "lines")) + 
  theme(panel.grid.minor.y = element_line(1)) + 
  scale_y_continuous(breaks = seq(0, 10, by=1)) + 
  geom_text(aes(label=paste0(round(Value,2)))
            ,vjust=1, size=2.75) + 
  theme(axis.text.x=element_blank())

Tab 2

Alcohol units by year, from 2020 through the end of 2022.

2020 saw the most sales, but tapered out as the years went on. This cause could be in inferred that no places were open to drink outside the house, so grocery stores sold more units for people to consume at home.

df_alc <- df[,c('Date','Category','Variable','Value')]
df_alc <- subset(df_alc, (Category=='Alcohol' & Variable=='Unit sales'))
Xalc<-ymd(df_alc$Date)
df_alc$year <-year(Xalc)
df_alc$month <-month(Xalc)
df_alc$monthname<-months(Xalc,abbreviate=TRUE)
# FLoor date to sum up all values within a weeks within a month into 1 respective month # 
df_alc$month <- floor_date(as.Date(df_alc$Date), unit="month")
df_alc <- aggregate(Value ~ year + month, data=df_alc, sum)
Xalc2 <- ymd(df_alc$month)
df_alc$month <-months(Xalc2, abbreviate=TRUE)
df_alc$year <- factor(df_alc$year)

mymonths <-c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
month_order <- factor(df_alc$month, level=mymonths)

df_alc2020s <- df_alc %>%
  filter(year!=2019)
df_alc2020s$month <-factor(df_alc2020s$month, level=mymonths)



plot_ly(df_alc2020s, labels = ~year, values = ~Value, sort = FALSE)%>%
  add_pie(hole=0.6) %>%
  layout(title="Alcohol Units Sold by Year") %>%
  layout(annotations=list(text=paste0("Total Units Sold: \n", scales::comma(sum(df_alc$Value))),
                          "showarrow"=F))

Tab 3

Driving further into the dates, the donut chart data is broken out by months and years. 2020 data starts on the outside, moving inward to 2022.

December 2022 saw the lowest amount of alcohol units sold over the past 3 years. This data is limited to the U.S., so we can infer some basic reasoning behind spikes in sales on a monthly basis based on holidays or events, such as stimulus check distribution (increasing some households discretionary spend).

fig <- plot_ly(hole=.7) %>%
  layout(title="Alcohol Units Sold by Month by Year(2020-2022)") %>%
  add_trace(data=df_alc2020s[df_alc2020s$year==2020,],
            labels= ~month,
            values = ~df_alc2020s[df_alc2020s$year==2020,"Value"],
            sort = FALSE,
            type="pie",
            textposition="inside",
            hovertemplate = "Year:2020<br>Month:%{label}<br>Percent:%{percent}<br>Units:%{value}<extra></extra>") %>%
  add_trace(data=df_alc2020s[df_alc2020s$year==2021,],
            labels= ~month,
            values = ~df_alc2020s[df_alc2020s$year==2021,"Value"],
            sort = FALSE,
            type="pie",
            textposition="inside",
            hovertemplate = "Year:2021<br>Month:%{label}<br>Percent:%{percent}<br>Units:%{value}<extra></extra>",
            domain = list(
              x = c(0.16,0.84),
              y = c(0.16,0.84))) %>%
  add_trace(data=df_alc2020s[df_alc2020s$year==2022,],
            labels= ~month,
            values = ~df_alc2020s[df_alc2020s$year==2022,"Value"],
            sort=FALSE,
            type="pie",
            textposition="inside",
            hovertemplate = "Year:2022<br>Month:%{label}<br>Percent:%{percent}<br>Units:%{value}<extra></extra>",
            domain = list(
              x = c(0.27,0.73),
              y = c(0.27,0.73)))
fig

Tab 4

The following is a heat map depicting the year over year percentage change of alcohol sales (units sold), sorted by state and year.

In 2020, a majority of the states saw increased alcohol sales from the previous year. Only a handful of states were able to bring their unit sales down to resemble 2019 sales.

Factors of influence could pertain to restaurants and bars opening back up, thus reducing at-home alcohol consumption.To further investigate, taking this geographical view by state and stacking it with layers of other data tracked in the same time frame could paint a more vivid picture of trends on the state level.

Some thought starters for data to layer in:

Population reported struggling with alcoholism
- Treated vs. untreated
- AA group counts
Restaurant and bar opens/closures
General population counts

Note: Some states and Washington D.C. were not accounted for within this data set.

breaks <- c(seq(-30, max(dfState_year$Value), by=5))

ggplot(dfState_year, aes(x=year, y=State, fill=Value)) + 
  geom_tile(color="black") +
  geom_text(aes(label=scales::percent(Value/100, accuracy=0.1)))+
  coord_equal(ratio=.2) +
  labs(title="Avg % Change in Alcohol Units Sold YoY Per Year, by State",
       x="Year",
       y="State",
       fill="Avg % Change\nof Units Sold YoY") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) +
  theme(legend.title.align=0.5) +
  scale_fill_gradient2(low="green", mid="lightyellow",high="red", breaks=breaks) +
  guides(fill=guide_legend(reverse=TRUE, orverride.aes=list(colour="white"))) + 
  scale_y_discrete(limits=rev)

Tab 5

The grocery share is out of 100, accumulated from all the categories within this data set. To read this, for example, out of all the grocery sales reported, 7% was made up of alcohol sales on March 15, 2020.

Hovering over the line will reveal the week ending date and the exact share for that week.

This view really allows us to infer spending habits based on seasonality and drinking holidays.

#Share of grocery spend on alcohol category 


plot_ly(data=dfshare_alc_sorted, x = ~Date, y=~Value, type = "scatter", mode = "lines",
        hovertemplate=paste("Date: %{x}<br>Value: %{y}<extra></extra>")) %>%
  layout(title= "Share of Grocery Spend on Alcohol Category (2019-2022)", 
         xaxis=list(title="Date", tickmode="linear", dtick="M2"),
         yaxis=list(title = "Share of Grocery Spend"))

Conclusion

Over this short period of time, consumers’ grocery baskets don’t fluctuate in how they’re broken out from a total share of basket view - at least not when viewed from the 10 categories. If comparing the share across subcategories, my hypothesis is we’d see more ups and down over time, as they’re more specific variables, therefore more sensitive to change.

Based off of these charts, I would say that alcohol had a great 2020 for direct-to-consumer purchases, however, they may be seeing their customer shopping patterns revert back to pre-pandemic habits by consuming at bars and restaurants.