Mid-Term Project

Introduction

In this report, I will be taking a look into a certain demographic group, the age group, and identifying if they are responsible for buying large amounts of certain products. In other words, what products is each age group buying the most of? Is there particular products that are more popular among certain age groups?

I focused mainly on the demographics, transactions, and products data table in this report. By joining the three tables together and grouping them by age, product category, and the date the transaction took place, I was able to calculate the quantity of products purchased based on those groups. That way, I can see how many products from each category was purchased based on the date and age group.

This report will help identify and analyze the data behind the three tables, finding the top 5 products purchased by each age group. Once we know what products are more popular between certain demographic groups, we can use that information to create a plan to invest or market towards that group and hopefully capture a bigger audience and increase our revenues.

Packages Required

The following packages are used in this project:

library(completejourney)              # package contains the data sets used in this report
library(tidyverse)                    # package is for data manipulation
library(ggplot2)                      # package is for data visualization
library(lubridate)                    # package is for working with dates
library(DT)                           # package is for creating HTML widget to display data tables

Data Prep

To begin, the Complete Journey data sets contain a sample data set for the transactions and promotions data because of how large they are. So first I have to get the full transactions data set (since I will not be using the promotions data set, I will not be pulling it in).

transactions <- get_transactions()

Here is a preview of the data sets before I have made any changes to them:

datatable(head(transactions, 50), options = list(pageLength = 10))

datatable(head(demographics, 50), options = list(pageLength = 10))

datatable(head(products, 50), options = list(pageLength = 10))

Now, I join all three tables together and start grouping them by data, product category, and age. This will be the main data set used for visualizations.

prod_quantity <- demographics %>%
  left_join(select(transactions, household_id, product_id, quantity, transaction_timestamp), 
            by = "household_id") %>%
  inner_join(products, by = "product_id") %>%
  group_by(date = date(transaction_timestamp), age, product_category) %>%
  summarise(quantity_sold = sum(quantity))

datatable(head(prod_quantity, 100), options = list(pageLength = 10))

Exploratory Data Analysis

Before we start looking at the graphs, I just want to clarify that the amount of people in each age group is also different, meaning the amount of items that they buy will most likely be different. Groups with more people in them will most likely buy more in terms of items. So we should keep that in mind when looking at the visualizations and making judgement and decisions.

table(demographics['age'])

## age
## 19-24 25-34 35-44 45-54 55-64   65+ 
##    46   142   194   288    59    72

table(prod_quantity['age'])

## age
## 19-24 25-34 35-44 45-54 55-64   65+ 
## 20768 44283 52978 59492 25769 28835

I begin my analyse by finding the products that sold the most amount of items per age group:

prod_quantity %>%
  group_by(age, product_category) %>%
  summarise(total_quantity = sum(quantity_sold)) %>%
  arrange(desc(total_quantity)) %>%
  slice(1:5) %>%
  ggplot(aes(x = product_category, y = total_quantity, fill = product_category)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_blank(), text = element_text(size = 25)) +
  labs(x = "Product Category", y = "Total Quantity",
       title = "Top 5 Products Bought by Age Group",
       fill = "Product Category") +
  facet_wrap(~ age, nrow = 3)

As you can see, the amount of items sold from the Coupon/ Misc Items category is so high that it completely shadows over the other categories. Obviously everybody likes to buy items from this category, regardless of age group, but other than that it’s hard to get much useful information from this, so I have decided remove this category so that we can take a closer look on the other categories.

prod_quantity %>%
  filter(product_category != "COUPON/MISC ITEMS") %>%
  group_by(age, product_category) %>%
  summarise(total_quantity = sum(quantity_sold)) %>%
  arrange(desc(total_quantity)) %>%
  slice(1:5) %>%
  ggplot(aes(x = product_category, y = total_quantity, fill = product_category)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x=element_blank(), text = element_text(size = 25)) +
  labs(x = "Product Category", y = "Total Quantity",
       title = "Top 5 Products Bought by Age Group",
       fill = "Product Category") +
  facet_wrap(~ age, nrow = 3)

Looking at the graph now, one thing you’ll notice is in the 25-34 age group, large quantities of fuel was bought. So much that it became the top product for that age group, yet it didn’t make it into the top 5 for the other age groups. This is a very interesting find. Does it mean customers aged 25-34 suddenly started traveling a lot and thus demand for fuel rose? Or was there some sudden, bulk purchases of fuel that caused this sudden spike? To look further into it, I created a time series graph to see if I could find the reasoning behind it.

prod_quantity %>%
  filter(product_category == "FUEL") %>%
  group_by(date, age) %>%
  summarise(total_quantity = sum(quantity_sold)) %>%
  ggplot(aes(x = date, y = total_quantity, color = age)) +
  geom_line(alpha = 0.7) +
  scale_x_date(breaks = make_date(2017,1:12), labels = month.abb) +
  theme(text = element_text(size = 25)) +
  labs(x = "Month", y = "Total Quantity",
       title = "Fuel Bought by Age Group",
       color = "Age")

From this graph, we can see that the quantity of fuel sold skyrocketed all of a sudden in between October and November. Looking deeper into it, I filtered the data table to see which transaction it was that sold such large amounts of fuel.

datatable(head(prod_quantity %>%
  filter(product_category == "FUEL")%>%
  filter(quantity_sold > 100), 100), options = list(pageLength = 10))

We can see that it was two transactions that bought large amounts of fuel, and just those two transactions caused the fuel product category to reach the top product sold for that age group. This could’ve been an error when entering the data, a special occasion, or some other reason. But whatever the reason it was, I think it is safe to consider it an outlier. Since it is an outlier, I think the best solution is to remove it so that it doesn’t effect the overall results of the graph. So removing both fuel and misc items, we get:

prod_quantity %>%
  filter(product_category != "COUPON/MISC ITEMS") %>%
  filter(product_category != "FUEL") %>%
  group_by(age, product_category) %>%
  summarise(total_quantity = sum(quantity_sold)) %>%
  arrange(desc(total_quantity)) %>%
  slice(1:5) %>%
  ggplot(aes(x = product_category, y = total_quantity, fill = product_category)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x=element_blank(), text = element_text(size = 25)) +
  labs(x = "Product Category", y = "Total Quantity",
       title = "Top 5 Products Bought by Age Group",
       fill = "Product Category") +
  facet_wrap(~ age, nrow = 3)

Now when we look at the graph, we can see a few products that show up multiple times for each age group and a few products that show up only once or twice. For example, the Soft Drinks and Baked Bread/Buns/Rolls product category shows up in every age group. The Fluid Milk Category shows up in all but the 19-24 age group. The Bag Snacks category is only in the top 5 for the 45-54 age group. There is more, but the ones I listed will be the ones I will be taking a closer look at in this report.

I begin by creating another time series graph to see the quantities bought over time to see if there are any more noticeable outliers.

prod_quantity %>%
  filter(product_category == "BAG SNACKS" | product_category == "SOFT DRINKS" |
           product_category == "BAKED BREAD/BUNS/ROLLS" | product_category == "FLUID MILK PRODUCTS") %>%
  group_by(date, age, product_category) %>%
  summarise(total_quantity = sum(quantity_sold)) %>%
  ggplot(aes(x = date, y = total_quantity, color = age)) +
  geom_line(alpha = 0.7) +
  scale_x_date(breaks = make_date(2017,1:12), labels = month.abb) +
  theme(text = element_text(size = 25)) +
  labs(x = "Month", y = "Total Quantity",
       title = "Products Bought by Age Group",
       color = "Age") +
  facet_wrap(~ product_category, nrow = 4)

Looking at the four graphs above, I don’t think there are any obvious outliers, so I believe it is safe to continue with my analysis.

The first category I will analyze will be the Fluid Milk Products. This category of items are in the top 5 for all age groups except the 19-24 age group. Instead, the Fluid Milk Products have been mainly replaced by the Soup and Frozen Meat/Meat Dinners category, both of which don’t show up too often in the other age groups.

prod_quantity %>%
  filter(age == "19-24") %>%
  filter(product_category == "FLUID MILK PRODUCTS" | product_category == "SOUP" |
           product_category == "FRZN MEAT/MEAT DINNERS") %>%
  group_by(product_category, month = floor_date(date, 'month')) %>%
  summarise(quantity_per_month = sum(quantity_sold)) %>%
  ggplot(aes(x = month, y = quantity_per_month, fill = product_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_date(breaks = make_date(2017,1:12), labels = month.abb) +
  theme(text = element_text(size = 25)) +
  labs(x = "Month", y = "Total Quantity",
       title = "Products Bought per Month by the 19-24 Age Group",
       fill = "Product Category")

From the graph, we can see that the quantity of fluid milk products sold per month only surpasses both of the two categories on one occasions, in July.

prod_quantity %>%
  filter(product_category == "BAG SNACKS") %>%
  group_by(age, month = floor_date(date, 'month'), product_category) %>%
  summarise(quantity_per_month = sum(quantity_sold)) %>%
  ggplot(aes(x = month, y = quantity_per_month, fill = age)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_date(breaks = make_date(2017,1:12), labels = month.abb) +
  theme(text = element_text(size = 25)) +
  labs(x = "Month", y = "Total Quantity",
       title = "Bag Snacks Bought per Month",
       fill = "Age")

This graph shows that the age group 45-54 bought the most amounts of bag snacks for every month, much more than any other age group. This could mean that the people in this age group really love their bagged snacks, but like I said at the beginning, this age group also has the most amount of people in it(refer to top of page), so the sheer number of people might also be one of the reasons for this. To get a better idea of if it truly is the case for such high numbers, I will compare the product with other product categories within the same age group that aren’t in the top 5.

prod_quantity %>%
  filter(age == "45-54") %>%
  filter(product_category == "BAG SNACKS" | product_category == "SOUP" |
           product_category == "FRZN MEAT/MEAT DINNERS" | product_category == "VEGETABLES - SHELF STABLE") %>%
  group_by(product_category, month = floor_date(date, 'month')) %>%
  summarise(quantity_per_month = sum(quantity_sold)) %>%
  ggplot(aes(x = month, y = quantity_per_month, fill = product_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_date(breaks = make_date(2017,1:12), labels = month.abb) +
  theme(text = element_text(size = 25)) +
  labs(x = "Month", y = "Total Quantity",
       title = "Products Bought per Month by the 45-54 Age Group",
       fill = "Product Category")

We can see that the quantity of bag snacks sold is highest between April and September, where they are consistently the highest ones sold. For the other six months they aren’t the highest, but they also aren’t ever the lowest either. With this information, I think its safe to assume that bag snacks are a pretty popular category to buy from among this age group. At least, they do tend to be more popular compared to the other product categories.

Summary

My goal for this report was to analyze and find the most bought product category of each age group to see if I could identify an area where we could invest in to generate more revenue. By combining the transactions, demographics, and products table and grouping them by the appropriate variables, I was able to find the total quantity of products sold to each age group that day for each product category. From there, I was able to put that data into graphs and get a better visualization of it. By doing so, I was able to see and choose to remove the unwanted variables, such as the Coupons/Misc Items and Fuel category since they were having a detrimental effect on my analysis. Once the top 5 product categories for each age group as been found, I could look at the graph and identify the categories which are more popular among certain groups as well as product categories which are popular among all groups.

Some interesting finds are all age groups have their own product categories in which they buy the most from. There are common product categories that show up multiple times for each age group, such as the Fluid Milk products or Cheese products, but none of them show up for all age groups except the Soft Drinks and Baked Bread/Buns/Rolls product category. These two categories are the top 5 products sold for all age groups.

My recommendations for investment are the Bag Snacks and Fluid Milk product categories. The Bag Snacks are really liked by people in the age group 45-54 so we could create a marketing strategy to target that specific demographic group. The Fluid Milk products are in the top 5 for products sold for all age groups except the 19-24 age group. For this product category, I think that creating a marketing strategy targets all age groups instead of a specific age group would be a good idea as well. My reasoning for this is because there is already enough sales for fluid milk among the 19-24 age group, it’s just not as high when compared to other product categories and other age groups. So a strategy that captures the attention of all age groups will allow us to capture the entire market share regardless of age, hopefully increase sales for this product in the age groups that are already purchasing it and at the same time, increasing awareness of it to the 19-24 age group. Last but not least, the Soft Drinks and Baked Bread/Buns/Rolls product category. As mentioned above, these products have been bought the most regardless of age. Since that is the case, I believe it is a good idea to invest more money into these two categories to try and capture a larger share in this area. It would most likely cause a larger increase in revenue if successful since it applies to all age groups.

With that being said, there are some limitations to this analysis that would require further improvements. One that I would recommend future me or any teams that are expanding upon this analysis is to possibly look at more than just the age demographic. Including the income level, household size, or kids count into the analysis might be helpful in uncovering even more precise and accurate predictions. Another limitation is I only explored some of the product categories that had potential for improvement instead of all of them due to time limitations. If possible, I would recommend future analytics analyzing this to look at the other product categories that I didn’t have time to analyze, such as why the Yogurt product category shows up for the age groups 25-34 and 55-64 but not the others, why Cheese was in top 5 for the first 4 age groups but not the last two, etc.

Mid-term Project

Johnny Liu

2023-10-08