As a data scientist, I have been tasked with identifying a potential area of growth for national grocery chain “Regork”. In my research. I chose to focus on the business question: Do the US Sports Championships(NFL, NBA, MLB, MLS) affect purchase patterns? To answer this question, I chose several different components to address:
-How do the top 10 product types purchased change during the week of
each of the Sports Championships vs the year top 10?
-How much of a difference does each championship make? Are they of equal
impact?
-Is each income demographic affected by each championship? Are there
demographics that are not affected by this?
To answer these questions, I used the “transactions”, “products”, and “demographics” data sets provided by the “completejourney” package I also made use of various data analytics tools/functionality provided in the “tidyverse” package. Answering each of these components involved different analytics techniques, including data transformation, joining, filtering, mutating, graphing, and more.
This report breaks down actual purchase pattern statistics into easily understandable segments and visualizations. This will allow for various insights into the purchasing patterns around sports championships, which could be used in the future for possible promotions to increase revenue and profits.
This anlaysis makes use of data provided by the “completejourney” package along with various data manipulation and analysis functions that are provided in the “tidyverse” package.
library(tidyverse)
library(completejourney)
transactions = get_transactions()
The first thing to do is identify the date of each sports championship, and which week it occurred in. For championships that occur over multiple days/weeks(NBA & MLB), I chose to focus on the week of the first game.
Super Bowl - Feb 5, 2017, week = 6 NBA Finals - Started Jun 1, 2017, week = 22 World Series - Started Oct 24, 2017, week = 43 MLS Cup Final - December 9, 2017, week = 50
Get the top 10 items types for the whole year
totalItems = transactions %>%
inner_join(products, by = "product_id") %>%
group_by(product_type) %>%
summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
arrange(desc(totalSales)) %>%
filter(!str_detect(product_type, "GASOLINE"))
head(totalItems, 10)
## # A tibble: 10 × 3
## product_type totalQty totalSales
## <chr> <dbl> <dbl>
## 1 FLUID MILK WHITE ONLY 52443 91834.
## 2 SOFT DRINKS 12/18&15PK CAN CAR 31759 89657.
## 3 BEERALEMALT LIQUORS 11186 82692.
## 4 CIGARETTES 9327 54350.
## 5 CHOICE BEEF 7921 45684.
## 6 SHREDDED CHEESE 20256 38214.
## 7 PRIMAL 9727 36653.
## 8 PREMIUM 11494 36018.
## 9 BABY DIAPERS 2969 33840.
## 10 TOILET TISSUE 9235 33618.
Get the top 10 items for week 6(Super Bowl)
superBowlItems = transactions %>%
inner_join(products, by = "product_id") %>%
filter(week == 6) %>%
group_by(product_type) %>%
summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
arrange(desc(totalSales)) %>%
filter(!str_detect(product_type, "GASOLINE"))
head(superBowlItems, 10)
## # A tibble: 10 × 3
## product_type totalQty totalSales
## <chr> <dbl> <dbl>
## 1 SOFT DRINKS 12/18&15PK CAN CAR 1368 3376.
## 2 BEERALEMALT LIQUORS 302 2148.
## 3 FLUID MILK WHITE ONLY 775 1870.
## 4 CIGARETTES 179 1341.
## 5 PRIMAL 254 1080.
## 6 SHREDDED CHEESE 404 942.
## 7 SNACKS/APPETIZERS 330 923.
## 8 CHOICE BEEF 142 849.
## 9 ENHANCED 175 826.
## 10 POTATO CHIPS 340 743.
Get the top 10 items for week 22(NBA Finals)
NBAItems = transactions %>%
inner_join(products, by = "product_id") %>%
filter(week == 22) %>%
group_by(product_type) %>%
summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
arrange(desc(totalSales)) %>%
filter(!str_detect(product_type, "GASOLINE"))
head(NBAItems, 10)
## # A tibble: 10 × 3
## product_type totalQty totalSales
## <chr> <dbl> <dbl>
## 1 SOFT DRINKS 12/18&15PK CAN CAR 1535 3928.
## 2 BEERALEMALT LIQUORS 262 2104.
## 3 FLUID MILK WHITE ONLY 637 1462.
## 4 CIGARETTES 162 864.
## 5 CHOICE BEEF 110 758.
## 6 POTATO CHIPS 366 757.
## 7 SELECT BEEF 76 708.
## 8 PREMIUM 201 646.
## 9 BABY DIAPERS 53 638.
## 10 PRIMAL 161 615.
Get the top 10 items for week 43(World Series)
MLBItems = transactions %>%
inner_join(products, by = "product_id") %>%
filter(week == 43) %>%
group_by(product_type) %>%
summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
arrange(desc(totalSales)) %>%
filter(!str_detect(product_type, "GASOLINE"))
head(MLBItems, 10)
## # A tibble: 10 × 3
## product_type totalQty totalSales
## <chr> <dbl> <dbl>
## 1 FLUID MILK WHITE ONLY 771 1819.
## 2 SOFT DRINKS 12/18&15PK CAN CAR 566 1654.
## 3 BEERALEMALT LIQUORS 145 1118.
## 4 CIGARETTES 164 882.
## 5 PRIMAL 260 862.
## 6 PREMIUM 299 847.
## 7 SHREDDED CHEESE 370 762.
## 8 TICKETS 14 687.
## 9 CHOICE BEEF 109 675.
## 10 TOILET TISSUE 182 670.
Get the top 10 items for week 50(MLS Cup Final)
MLSItems = transactions %>%
inner_join(products, by = "product_id") %>%
filter(week == 50) %>%
group_by(product_type) %>%
summarize(totalQty = sum(quantity), totalSales = sum(sales_value)) %>%
arrange(desc(totalSales)) %>%
filter(!str_detect(product_type, "GASOLINE"))
head(MLSItems, 10)
## # A tibble: 10 × 3
## product_type totalQty totalSales
## <chr> <dbl> <dbl>
## 1 FLUID MILK WHITE ONLY 1281 1784.
## 2 BEERALEMALT LIQUORS 204 1394.
## 3 CIGARETTES 196 1188.
## 4 SOFT DRINKS 12/18&15PK CAN CAR 311 1077.
## 5 CHOICE BEEF 170 953.
## 6 LEAN 304 891.
## 7 SHREDDED CHEESE 392 823.
## 8 SFT DRNK 2 LITER BTL CARB INCL 827 805.
## 9 FRZN BREADED PREPARED CHICK 157 727.
## 10 BABY DIAPERS 67 716.
Compute the average weekly amounts for each item
avgTotalItems = totalItems %>%
mutate(avgQty = round(totalQty/52, 2), avgSales = round(totalSales/52, 2)) %>%
select(c(product_type, avgQty, avgSales))
head(avgTotalItems, 10)
## # A tibble: 10 × 3
## product_type avgQty avgSales
## <chr> <dbl> <dbl>
## 1 FLUID MILK WHITE ONLY 1009. 1766.
## 2 SOFT DRINKS 12/18&15PK CAN CAR 611. 1724.
## 3 BEERALEMALT LIQUORS 215. 1590.
## 4 CIGARETTES 179. 1045.
## 5 CHOICE BEEF 152. 879.
## 6 SHREDDED CHEESE 390. 735.
## 7 PRIMAL 187. 705.
## 8 PREMIUM 221. 693.
## 9 BABY DIAPERS 57.1 651.
## 10 TOILET TISSUE 178. 646.
In general, there is not much change in the top items purchased between the average week and any of the weeks of sports championships. There seems to be a set of staple products that are purchased in high amounts for every week including milk, soft drinks, liquors, beef. However, there are some items such as potato chips that are present in some of the championships top ten, but not the average top ten. There are also some items such as baby diapers and toilet tissue that are in the average top ten, but not in every championship top ten. There also appears to be some trends in the amount of each item purchased, which we will continue to investigate in the next section.
Super Bowl
mergeTotalItemsSuperBowl = inner_join(avgTotalItems, superBowlItems, by = "product_type") %>%
mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsSuperBowl), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Quantity Difference from Average Week to Super Bowl Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
theme_minimal()
ggplot(head(mergeTotalItemsSuperBowl), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Sales Difference from Average Week to Super Bowl Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
theme_minimal()
NBA Finals
mergeTotalItemsNBAFinals = inner_join(avgTotalItems, NBAItems, by = "product_type") %>%
mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsNBAFinals), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Quantity Difference from Average Week to NBA Finals Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
theme_minimal()
ggplot(head(mergeTotalItemsNBAFinals), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Sales Difference from Average Week to NBA Finals Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
theme_minimal()
World Series
mergeTotalItemsWorldSeries = inner_join(avgTotalItems, MLBItems, by = "product_type") %>%
mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsWorldSeries), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Quantity Difference from Average Week to World Series Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
theme_minimal()
ggplot(head(mergeTotalItemsWorldSeries), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Sales Difference from Average Week to World Series Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
theme_minimal()
MLS Finals
mergeTotalItemsMLSFinals = inner_join(avgTotalItems, MLSItems, by = "product_type") %>%
mutate(ChampMinusAvgQty = totalQty - avgQty, ChampMinusAvgSales = totalSales - avgSales) %>%
select(c(product_type, ChampMinusAvgQty, ChampMinusAvgSales))
ggplot(head(mergeTotalItemsMLSFinals), aes(x = product_type, y = ChampMinusAvgQty, fill = ChampMinusAvgQty > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Quantity Difference from Average Week to MLS Finals Week", y = "Quantity Change", x = "Item Type", fill = "Legend") +
theme_minimal()
ggplot(head(mergeTotalItemsMLSFinals), aes(x = product_type, y = ChampMinusAvgSales, fill = ChampMinusAvgSales > 0)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red", "blue"), labels = c("Decrease", "Increase")) +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
labs(title = "Sales Difference from Average Week to MLS Finals Week", y = "Sales Change", x = "Item Type", fill = "Legend") +
theme_minimal()
There does appear to be some trends in the quantity and sales value of the top items when comparing the week of each sports championship and the average week. During the week of the Super Bowl, there is a substantial rise in the amount of beer/liquor and soft drinks purchased. During the week of the NBA Finals, there is also a rise in the beer/liquor and soft drinks purchased. During the week of the World Series, there is a decrease in a variety of items purchased. Finally, during the week of the MLS Final, there is an increase in the quantity of milk purchased but not in the total sales value of milk purchased. Overall, there do appear to be trends, but I am only confident in the Super Bowl trends. The trends for the other championships I am weary to say whether the championships cause these trends or if there are other factors at play, such as overall seasonal trends.
We will look at how each income demographics total quantity and sales for each championship week differs from average.
Weekly Average Split by Income
avgByIncome = transactions %>%
inner_join(demographics, by = "household_id") %>%
group_by(income) %>%
summarize(Sales = round(sum(sales_value)/52,2), Qty = round(sum(quantity)/52, 2), category = "Weekly Average")
Super Bowl Week Split by Income
superBowlByIncome = transactions %>%
filter(week == 6) %>%
inner_join(demographics, by = "household_id") %>%
group_by(income) %>%
summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "Super Bowl")
NBA Finals Week Split by Income
NBAFinalsByIncome = transactions %>%
filter(week ==2) %>%
inner_join(demographics, by = "household_id") %>%
group_by(income) %>%
summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "NBA Finals")
World Series Week Split by Income
worldSeriesByIncome = transactions %>%
filter(week == 43) %>%
inner_join(demographics, by = "household_id") %>%
group_by(income) %>%
summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "World Series")
MLS Finals Week Split by Income
MLSFinalsByIncome = transactions %>%
filter(week == 50) %>%
inner_join(demographics, by = "household_id") %>%
group_by(income) %>%
summarize(Sales = sum(sales_value), Qty = sum(quantity), category = "MLS Finals")
Group them all into one data frame to allow for easy graphing
incomeDF = bind_rows(avgByIncome, superBowlByIncome, NBAFinalsByIncome, worldSeriesByIncome, MLSFinalsByIncome)
incomeDF$category = as.factor(incomeDF$category)
ggplot(incomeDF, aes(x = income, y = Sales, color = category, group = category)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(title = "Sales vs Income", x = "Income Range", y = "Sales", color = "category") +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
theme_minimal()
ggplot(incomeDF, aes(x = income, y = Qty, color = category, group = category)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(title = "Quantity vs Income", x = "Income Range", y = "Quantity", color = "category") +
scale_x_discrete(guide = guide_axis(n.dodge=3)) +
theme_minimal()
Overall, there do appear to be some minor trends in the deomgraphic data. For example, it appears that the Super Bowl leads to more items purchased in the 100-124k and 175-199k income range groups, and the 125-149k income range purchase more items for the World Series and MLS Finals than their neighboring two income groups. When looking at the sales value, there does appear to be some separation in the Under 15k and 250k+ income groups, but I would like to see more data points in those ranges before making any conclusions. Another trend in the sales is that the 50-74k and 125-149k range spend the most during the NBA Finals week, whereas that is when the 35-49k spends the least.
At the beginning of this, I sought to answer the question: Do the US Sports Championships(NFL, NBA, MLB, MLS) affect purchase patterns?
To answer this question, I started by joining the transactions data set with the products data set and summarizing it to find the total sale value and quantity of each product. Then, I did the same thing but only looking at items purchased during the week of each championship. This gave me the top items that were purchased the week of each championship. I compared this to the overall top items to see if there were any noticeable trends. After looking at the top items, I wanted to compare the amount of each item that was purchased during each of the weeks. To do this, I combined each week data set with the top items for the year and the average sales value and quantity for each week. Then, I calculated the difference for each top item between each championship week and the average week. After this, I graphed it to easily visualize increases and decreases in the sales value and quantity of the top items. Finally, I wanted to look at how different demographics were affected by the presence of a championship. I joined the transaction data set with the demographics data set and separated the total items purchased for each championship week into the different income demographics. Then, I graphed these along with the average week to visualize any trends that were there.
Overall, my analysis found minor correlation between the presence of a championship and purchasing patterns. The largest trend occurred during the week of the Super Bowl, where there were clear increases in the purchasing of soft drinks, beer/liquor, and snacks such as potato chips. The other championships appeared to have some differing patterns than average, but I am not certain that the changes in purchasing patterns were because of the championship. For Example, the week of the NBA Finals saw an increase in the amount of soft drinks purchased, but that was also the first week of June, so it is hard to discern if the increase in soft drinks is because of the NBA Finals or because of the start of the Summer season. The same thing occurs with the MLS Finals and the amount of milk purchased; one can not discern if the rise is because of the MLS Finals or because it is the start of December and holiday season. The only trends that I am confident in are the ones discussed above concerning the week of the Super Bowl.
I would recommend making use of the increase in the purchase of soft drinks, beer/liquor, and snack items the week of the Super Bowl to increase revenue and profit. This could be done by simply raising the price of these items during that week or running promotions that include these items which would encourage customers to purchase more. The way that this gets implemented could be the subject of future analysis.
As mentioned above, there were several issues that limited the amount of knowledge that one could draw from this analysis. For this analysis, I chose to only focus on specific weeks, which makes it difficult to determine if trends are specifically due to a championship or part of a larger trend. This is something that could be addressed in future analysis. Another issue is that this data set only includes data from 2017, because of this, it is difficult to discern if trends are consistent year-over-year or if they are simply the result of random variations. Again, once more data is available, this is something that could be analyzed in the future.