Introduction

Business Question

As a data scientist, I have been tasked with identifying a potential area of growth for national grocery chain “Regork”. In my research. I chose to focus on the business question: Do the US Sports Championships(NFL, NBA, MLB, MLS) affect purchase patterns? To answer this question, I chose several different components to address:

-How do the top 10 product types purchased change during the week of each of the Sports Championships vs the year top 10?
-How much of a difference does each championship make? Are they of equal impact?
-Is each income demographic affected by each championship? Are there demographics that are not affected by this?

Methodology

To answer these questions, I used the “transactions”, “products”, and “demographics” data sets provided by the “completejourney” package I also made use of various data analytics tools/functionality provided in the “tidyverse” package. Answering each of these components involved different analytics techniques, including data transformation, joining, filtering, mutating, graphing, and more.

Importance

This report breaks down actual purchase pattern statistics into easily understandable segments and visualizations. This will allow for various insights into the purchasing patterns around sports championships, which could be used in the future for possible promotions to increase revenue and profits.

Packages Required

This anlaysis makes use of data provided by the “completejourney” package along with various data manipulation and analysis functions that are provided in the “tidyverse” package.

library(tidyverse)
library(completejourney)

transactions = get_transactions()

How do the top 10 product types purchased change during the week of each of the Sports Championships vs the year top 10?

Chamipionship Dates/Week Numbers

The first thing to do is identify the date of each sports championship, and which week it occurred in. For championships that occur over multiple days/weeks(NBA & MLB), I chose to focus on the week of the first game.

Super Bowl - Feb 5, 2017, week = 6 NBA Finals - Started Jun 1, 2017, week = 22 World Series - Started Oct 24, 2017, week = 43 MLS Cup Final - December 9, 2017, week = 50

Get the top 10 items total and for each championship

Get the top 10 items types for the whole year

totalItems = transactions %>%
  inner_join(products, by = "product_id") %>%
  group_by(product_type) %>%
  summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
  arrange(desc(totalSales)) %>%
  filter(!str_detect(product_type, "GASOLINE"))

head(totalItems, 10)
## # A tibble: 10 × 3
##    product_type                   totalQty totalSales
##    <chr>                             <dbl>      <dbl>
##  1 FLUID MILK WHITE ONLY             52443     91834.
##  2 SOFT DRINKS 12/18&15PK CAN CAR    31759     89657.
##  3 BEERALEMALT LIQUORS               11186     82692.
##  4 CIGARETTES                         9327     54350.
##  5 CHOICE BEEF                        7921     45684.
##  6 SHREDDED CHEESE                   20256     38214.
##  7 PRIMAL                             9727     36653.
##  8 PREMIUM                           11494     36018.
##  9 BABY DIAPERS                       2969     33840.
## 10 TOILET TISSUE                      9235     33618.

Get the top 10 items for week 6(Super Bowl)

superBowlItems = transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(week == 6) %>%
  group_by(product_type) %>%
  summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
  arrange(desc(totalSales)) %>%
  filter(!str_detect(product_type, "GASOLINE"))

head(superBowlItems, 10)
## # A tibble: 10 × 3
##    product_type                   totalQty totalSales
##    <chr>                             <dbl>      <dbl>
##  1 SOFT DRINKS 12/18&15PK CAN CAR     1368      3376.
##  2 BEERALEMALT LIQUORS                 302      2148.
##  3 FLUID MILK WHITE ONLY               775      1870.
##  4 CIGARETTES                          179      1341.
##  5 PRIMAL                              254      1080.
##  6 SHREDDED CHEESE                     404       942.
##  7 SNACKS/APPETIZERS                   330       923.
##  8 CHOICE BEEF                         142       849.
##  9 ENHANCED                            175       826.
## 10 POTATO CHIPS                        340       743.

Get the top 10 items for week 22(NBA Finals)

NBAItems = transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(week == 22) %>%
  group_by(product_type) %>%
  summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
  arrange(desc(totalSales)) %>%
  filter(!str_detect(product_type, "GASOLINE"))

head(NBAItems, 10)
## # A tibble: 10 × 3
##    product_type                   totalQty totalSales
##    <chr>                             <dbl>      <dbl>
##  1 SOFT DRINKS 12/18&15PK CAN CAR     1535      3928.
##  2 BEERALEMALT LIQUORS                 262      2104.
##  3 FLUID MILK WHITE ONLY               637      1462.
##  4 CIGARETTES                          162       864.
##  5 CHOICE BEEF                         110       758.
##  6 POTATO CHIPS                        366       757.
##  7 SELECT BEEF                          76       708.
##  8 PREMIUM                             201       646.
##  9 BABY DIAPERS                         53       638.
## 10 PRIMAL                              161       615.

Get the top 10 items for week 43(World Series)

MLBItems = transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(week == 43) %>%
  group_by(product_type) %>%
  summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
  arrange(desc(totalSales)) %>%
  filter(!str_detect(product_type, "GASOLINE"))

head(MLBItems, 10)
## # A tibble: 10 × 3
##    product_type                   totalQty totalSales
##    <chr>                             <dbl>      <dbl>
##  1 FLUID MILK WHITE ONLY               771      1819.
##  2 SOFT DRINKS 12/18&15PK CAN CAR      566      1654.
##  3 BEERALEMALT LIQUORS                 145      1118.
##  4 CIGARETTES                          164       882.
##  5 PRIMAL                              260       862.
##  6 PREMIUM                             299       847.
##  7 SHREDDED CHEESE                     370       762.
##  8 TICKETS                              14       687.
##  9 CHOICE BEEF                         109       675.
## 10 TOILET TISSUE                       182       670.

Get the top 10 items for week 50(MLS Cup Final)

MLSItems = transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(week == 50) %>%
  group_by(product_type) %>%
  summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
  arrange(desc(totalSales)) %>%
  filter(!str_detect(product_type, "GASOLINE"))

head(MLSItems, 10)
## # A tibble: 10 × 3
##    product_type                   totalQty totalSales
##    <chr>                             <dbl>      <dbl>
##  1 FLUID MILK WHITE ONLY              1281      1784.
##  2 BEERALEMALT LIQUORS                 204      1394.
##  3 CIGARETTES                          196      1188.
##  4 SOFT DRINKS 12/18&15PK CAN CAR      311      1077.
##  5 CHOICE BEEF                         170       953.
##  6 LEAN                                304       891.
##  7 SHREDDED CHEESE                     392       823.
##  8 SFT DRNK 2 LITER BTL CARB INCL      827       805.
##  9 FRZN BREADED PREPARED CHICK         157       727.
## 10 BABY DIAPERS                         67       716.

Compute the average weekly amounts for each item

avgTotalItems = totalItems %>%
  mutate(avgQty = round(totalQty/52, 2), avgSales = round(totalSales/52, 2)) %>%
  select(c(product_type, avgQty, avgSales))

head(avgTotalItems, 10)
## # A tibble: 10 × 3
##    product_type                   avgQty avgSales
##    <chr>                           <dbl>    <dbl>
##  1 FLUID MILK WHITE ONLY          1009.     1766.
##  2 SOFT DRINKS 12/18&15PK CAN CAR  611.     1724.
##  3 BEERALEMALT LIQUORS             215.     1590.
##  4 CIGARETTES                      179.     1045.
##  5 CHOICE BEEF                     152.      879.
##  6 SHREDDED CHEESE                 390.      735.
##  7 PRIMAL                          187.      705.
##  8 PREMIUM                         221.      693.
##  9 BABY DIAPERS                     57.1     651.
## 10 TOILET TISSUE                   178.      646.

Findings

In general, there is not much change in the top items purchased between the average week and any of the weeks of sports championships. There seems to be a set of staple products that are purchased in high amounts for every week including milk, soft drinks, liquors, beef. However, there are some items such as potato chips that are present in some of the championships top ten, but not the average top ten. There are also some items such as baby diapers and toilet tissue that are in the average top ten, but not in every championship top ten. There also appears to be some trends in the amount of each item purchased, which we will continue to investigate in the next section.

How much of a difference does each championship make? Are they of equal impact?

Graph the difference between the average Weekly Quantity/Sales and to each Championship Week for the top items

Super Bowl

mergeTotalItemsSuperBowl = inner_join(avgTotalItems, superBowlItems, by = "product_type") %>%
  mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
  select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsSuperBowl), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Quantity Difference from Average Week to Super Bowl Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

ggplot(head(mergeTotalItemsSuperBowl), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Sales Difference from Average Week to Super Bowl Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

NBA Finals

mergeTotalItemsNBAFinals = inner_join(avgTotalItems, NBAItems, by = "product_type") %>%
  mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
  select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsNBAFinals), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Quantity Difference from Average Week to NBA Finals Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

ggplot(head(mergeTotalItemsNBAFinals), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Sales Difference from Average Week to NBA Finals Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

World Series

mergeTotalItemsWorldSeries = inner_join(avgTotalItems, MLBItems, by = "product_type") %>%
  mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
  select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsWorldSeries), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Quantity Difference from Average Week to World Series Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

ggplot(head(mergeTotalItemsWorldSeries), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Sales Difference from Average Week to World Series Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

MLS Finals

mergeTotalItemsMLSFinals = inner_join(avgTotalItems, MLSItems, by = "product_type") %>%
  mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
  select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsMLSFinals), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Quantity Difference from Average Week to MLS Finals Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

ggplot(head(mergeTotalItemsMLSFinals), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  labs(title = "Sales Difference from Average Week to MLS Finals Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
  theme_minimal()

Findings

There does appear to be some trends in the quantity and sales value of the top items when comparing the week of each sports championship and the average week. During the week of the Super Bowl, there is a substantial rise in the amount of beer/liquor and soft drinks purchased. During the week of the NBA Finals, there is also a rise in the beer/liquor and soft drinks purchased. During the week of the World Series, there is a decrease in a variety of items purchased. Finally, during the week of the MLS Final, there is an increase in the quantity of milk purchased but not in the total sales value of milk purchased. Overall, there do appear to be trends, but I am only confident in the Super Bowl trends. The trends for the other championships I am weary to say whether the championships cause these trends or if there are other factors at play, such as overall seasonal trends.

Is each income demographic affected by each championship? Are there demographics that are not affected by this?

How is each income demographic affected

We will look at how each income demographics total quantity and sales for each championship week differs from average.

Weekly Average Split by Income

avgByIncome = transactions %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarize(Sales = round(sum(sales_value)/52,2), Qty = round(sum(quantity)/52, 2), category = "Weekly Average")

Super Bowl Week Split by Income

superBowlByIncome = transactions %>%
  filter(week == 6) %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "Super Bowl")

NBA Finals Week Split by Income

NBAFinalsByIncome = transactions %>%
  filter(week ==2) %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "NBA Finals")

World Series Week Split by Income

worldSeriesByIncome = transactions %>%
  filter(week == 43) %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "World Series")

MLS Finals Week Split by Income

MLSFinalsByIncome = transactions %>%
  filter(week == 50) %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "MLS Finals")

Group them all into one data frame to allow for easy graphing

incomeDF = bind_rows(avgByIncome, superBowlByIncome, NBAFinalsByIncome, worldSeriesByIncome, MLSFinalsByIncome)
incomeDF$category = as.factor(incomeDF$category)
ggplot(incomeDF, aes(x = income, y = Sales, color = category, group = category)) +
  geom_line(size = 1) + 
  geom_point(size = 3) +
  labs(title = "Sales vs Income", x = "Income Range", y = "Sales", color = "category") +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  theme_minimal()

ggplot(incomeDF, aes(x = income, y = Qty, color = category, group = category)) +
  geom_line(size = 1) + 
  geom_point(size = 3) +
  labs(title = "Quantity vs Income", x = "Income Range", y = "Quantity", color = "category") +
  scale_x_discrete(guide = guide_axis(n.dodge=3)) +
  theme_minimal()

Findings

Overall, there do appear to be some minor trends in the deomgraphic data. For example, it appears that the Super Bowl leads to more items purchased in the 100-124k and 175-199k income range groups, and the 125-149k income range purchase more items for the World Series and MLS Finals than their neighboring two income groups. When looking at the sales value, there does appear to be some separation in the Under 15k and 250k+ income groups, but I would like to see more data points in those ranges before making any conclusions. Another trend in the sales is that the 50-74k and 125-149k range spend the most during the NBA Finals week, whereas that is when the 35-49k spends the least.

Summary

At the beginning of this, I sought to answer the question: Do the US Sports Championships(NFL, NBA, MLB, MLS) affect purchase patterns?

My Process

To answer this question, I started by joining the transactions data set with the products data set and summarizing it to find the total sale value and quantity of each product. Then, I did the same thing but only looking at items purchased during the week of each championship. This gave me the top items that were purchased the week of each championship. I compared this to the overall top items to see if there were any noticeable trends. After looking at the top items, I wanted to compare the amount of each item that was purchased during each of the weeks. To do this, I combined each week data set with the top items for the year and the average sales value and quantity for each week. Then, I calculated the difference for each top item between each championship week and the average week. After this, I graphed it to easily visualize increases and decreases in the sales value and quantity of the top items. Finally, I wanted to look at how different demographics were affected by the presence of a championship. I joined the transaction data set with the demographics data set and separated the total items purchased for each championship week into the different income demographics. Then, I graphed these along with the average week to visualize any trends that were there.

My Findings

Overall, my analysis found minor correlation between the presence of a championship and purchasing patterns. The largest trend occurred during the week of the Super Bowl, where there were clear increases in the purchasing of soft drinks, beer/liquor, and snacks such as potato chips. The other championships appeared to have some differing patterns than average, but I am not certain that the changes in purchasing patterns were because of the championship. For Example, the week of the NBA Finals saw an increase in the amount of soft drinks purchased, but that was also the first week of June, so it is hard to discern if the increase in soft drinks is because of the NBA Finals or because of the start of the Summer season. The same thing occurs with the MLS Finals and the amount of milk purchased; one can not discern if the rise is because of the MLS Finals or because it is the start of December and holiday season. The only trends that I am confident in are the ones discussed above concerning the week of the Super Bowl.

My Recommendation

I would recommend making use of the increase in the purchase of soft drinks, beer/liquor, and snack items the week of the Super Bowl to increase revenue and profit. This could be done by simply raising the price of these items during that week or running promotions that include these items which would encourage customers to purchase more. The way that this gets implemented could be the subject of future analysis.

Limitations & Possible Future Analysis

As mentioned above, there were several issues that limited the amount of knowledge that one could draw from this analysis. For this analysis, I chose to only focus on specific weeks, which makes it difficult to determine if trends are specifically due to a championship or part of a larger trend. This is something that could be addressed in future analysis. Another issue is that this data set only includes data from 2017, because of this, it is difficult to discern if trends are consistent year-over-year or if they are simply the result of random variations. Again, once more data is available, this is something that could be analyzed in the future.