The business problem I am looking into is which age and income groups are spending the most. I want to see which products are being bought the most or not bought enough. After finding this data, I will make my recommendations on how to keep certain age groups buying certain items, or how to get different age and income groups involved.
Here I loaded the necessary libraries needed to view the data sets located in tidyverse and complete journey.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(completejourney)
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
transactions <- get_transactions()
demographics <- completejourney::demographics
Datasets are inspected to check the structure and to see what data is located in there. You can also view any missing or unique values in the dataset.
glimpse(transactions)
## Rows: 1,469,307
## Columns: 11
## $ household_id <chr> "900", "900", "1228", "906", "906", "906", "906"…
## $ store_id <chr> "330", "330", "406", "319", "319", "319", "319",…
## $ basket_id <chr> "31198570044", "31198570047", "31198655051", "31…
## $ product_id <chr> "1095275", "9878513", "1041453", "1020156", "105…
## $ quantity <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales_value <dbl> 0.50, 0.99, 1.43, 1.50, 2.78, 5.49, 1.50, 1.88, …
## $ retail_disc <dbl> 0.00, 0.10, 0.15, 0.29, 0.80, 0.50, 0.29, 0.21, …
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ transaction_timestamp <dttm> 2017-01-01 06:53:26, 2017-01-01 07:10:28, 2017-…
glimpse(demographics)
## Rows: 801
## Columns: 8
## $ household_id <chr> "1", "1001", "1003", "1004", "101", "1012", "1014", "10…
## $ age <ord> 65+, 45-54, 35-44, 25-34, 45-54, 35-44, 45-54, 45-54, 4…
## $ income <ord> 35-49K, 50-74K, 25-34K, 15-24K, Under 15K, 35-49K, 15-2…
## $ home_ownership <ord> Homeowner, Homeowner, NA, NA, Homeowner, NA, NA, Homeow…
## $ marital_status <ord> Married, Unmarried, Unmarried, Unmarried, Married, Marr…
## $ household_size <ord> 2, 1, 1, 1, 4, 5+, 4, 1, 5+, 2, 5+, 4, 2, 1, 5+, 1, 1, …
## $ household_comp <ord> 2 Adults No Kids, 1 Adult No Kids, 1 Adult No Kids, 1 A…
## $ kids_count <ord> 0, 0, 0, 0, 2, 3+, 2, 0, 3+, 0, 3+, 2, 0, 0, 3+, 0, 0, …
In this code chunk, I am merging transaction data with the demographic information based on household IDs to link purchases to customer profiles.
df <- transactions %>%
left_join(demographics, by = "household_id") %>%
select(household_id, age, income, sales_value) %>%
drop_na()
Group and summarize total spending by age and income groups.
demographic_spending <- df %>%
group_by(age, income) %>%
summarise(total_spent = sum(sales_value, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
glimpse(demographic_spending)
## Rows: 62
## Columns: 3
## $ age <ord> 19-24, 19-24, 19-24, 19-24, 19-24, 19-24, 19-24, 19-24, 25…
## $ income <ord> Under 15K, 15-24K, 25-34K, 35-49K, 50-74K, 75-99K, 200-249…
## $ total_spent <dbl> 30485.56, 7297.92, 9519.19, 28985.34, 28326.37, 7796.25, 1…
ggplot(demographic_spending, aes(x = age, y = total_spent, fill = as.factor(income))) +
geom_col(position = "dodge") +
labs(title = "Income Distrubution per Age Group", x = "Age Groups", y = "Count", fill = "Income Level") +
theme_minimal()
This bar chart displays total spending across age and income groups located in the data set. From the results of this chart, we can see that most of the clients are in the 45-54 age group. Regork is lacking in the 19-24 and 55+ age groups. In these age groups, the average income level ranges from 34-49k and 50-74k. We can use this information and make recommendations based on the given income level and targeted age groups. This also allows us to create recommendations to increase activity in the age groups that aren’t as involved.
Next, I loaded the “products” data set to be able to links products with demographics and household IDs.
data("products")
df <- transactions %>%
left_join(demographics, by = "household_id") %>%
select(household_id, age, income, sales_value, product_id) %>%
drop_na()
df_products <- df %>%
left_join(products, by = "product_id") %>%
select(household_id, age, income, sales_value, product_category) %>%
drop_na(product_category)
This section calculates the total sales value for each product category by age and income groups.
product_spending <- df_products %>%
group_by(age, income, product_category) %>%
summarise(total_spent = sum(sales_value, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'age', 'income'. You can override using the
## `.groups` argument.
This bar chart shows how different product categories are purchases across age and income groups.
ggplot(product_spending, aes(x = product_category, y = total_spent, fill = as.factor(income))) +
geom_col(position = "dodge") +
facet_wrap(~ age) +
labs(title = "Product Spending by Age and Income", x = "Product Category", y = "Total Spending", fill = "Income Level") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14),
strip.text = element_text(size = 12)) +
theme_minimal()
This graph highlights product spending patterns by age and income
groups.
df_products <- df_products %>%
mutate(income_group = case_when(
income <= 50 ~ "Low",
income > 50 & income <= 100 ~ "Middle",
income > 100 ~ "High"
))
top_5_products <- df_products %>% group_by(age, product_category) %>% summarise(total_spent = sum(sales_value, na.rm = TRUE)) %>% slice_max(order_by = total_spent, n = 5) %>% ungroup() %>% select(age, product_category)
df_top_products <- df_products %>% semi_join(top_5_products, by = c(“age”, “product_category”))
ggplot(df_top_products, aes(x = product_category, y = sales_value, fill = income_group)) + geom_col(position = “dodge”, stat = “summary”, fun = sum) + facet_wrap(~ age) + labs(title = “Top 5 Product Spending by Income Group and Age Group”, x = “Product Category”, y = “Total Spending”, fill = “Income Group”) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), axis.title = element_text(size = 12), plot.title = element_text(size = 14), strip.text = element_text(size = 12)) + theme_minimal()
Displays product spending for each income group across different age groups.
ggplot(df_products, aes(x = product_category, y = sales_value, fill = income_group)) +
geom_col(position = "dodge", stat = "summary", fun = sum) +
facet_wrap(~ age) +
labs(title = "Spending by Income Group and Product Category", x = "Product Category", y = "Total Spending", fill = "Income Group") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14),
strip.text = element_text(size = 12)) +
theme_minimal()
## Warning in geom_col(position = "dodge", stat = "summary", fun = sum): Ignoring
## unknown parameters: `stat` and `fun`
A bar plot that counts the number of products purchased by different age and income groups.
ggplot(df_products, aes(x = product_category, fill = income_group)) +
geom_bar(position = "dodge") +
facet_wrap(~ age) +
labs(title = "Count of Products Bought by Age Group and Income Level", x = "Product Category", y = "Count", fill = "Income Group") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14),
strip.text = element_text(size = 12)) +
theme_minimal()
Identifies the top 5 best-selling product categories for each age group.
top_products <- df_products %>%
group_by(age, product_category) %>%
summarise(total_spent = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_spent)) %>%
slice_max(order_by = total_spent, n = 5) %>%
ungroup()
## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
print(top_products, n = 30)
## # A tibble: 30 × 3
## age product_category total_spent
## <ord> <chr> <dbl>
## 1 19-24 COUPON/MISC ITEMS 7840.
## 2 19-24 SOFT DRINKS 6572.
## 3 19-24 BEEF 4398.
## 4 19-24 FROZEN PIZZA 3699.
## 5 19-24 CHEESE 3672.
## 6 25-34 COUPON/MISC ITEMS 43606.
## 7 25-34 SOFT DRINKS 15150.
## 8 25-34 BEEF 14737.
## 9 25-34 FLUID MILK PRODUCTS 11913.
## 10 25-34 CHEESE 11723.
## 11 35-44 COUPON/MISC ITEMS 72982.
## 12 35-44 SOFT DRINKS 27337.
## 13 35-44 BEEF 27124.
## 14 35-44 FLUID MILK PRODUCTS 17171.
## 15 35-44 CHEESE 16180.
## 16 45-54 COUPON/MISC ITEMS 89171.
## 17 45-54 SOFT DRINKS 38801.
## 18 45-54 BEEF 36485.
## 19 45-54 FLUID MILK PRODUCTS 25563.
## 20 45-54 CHEESE 21791.
## 21 55-64 COUPON/MISC ITEMS 18077.
## 22 55-64 BEEF 6956.
## 23 55-64 SOFT DRINKS 5744.
## 24 55-64 FRZN MEAT/MEAT DINNERS 3730.
## 25 55-64 CHEESE 3657.
## 26 65+ COUPON/MISC ITEMS 13810.
## 27 65+ BEEF 6190.
## 28 65+ SOFT DRINKS 6180.
## 29 65+ FLUID MILK PRODUCTS 4401.
## 30 65+ BAKED BREAD/BUNS/ROLLS 3834.
Creating a bar chart to display the top 5 products for each age group by total spending.
ggplot(top_products, aes(x = reorder(product_category, -total_spent), y = total_spent, fill = as.factor(age))) +
geom_col() +
facet_wrap(~ age, scales = "free_x") +
labs(title = " Top 5 Products by Age Group", x = "Product Cateogry", y = "Total Spending", fill = "Age Group") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8, vjust = 1),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14),
strip.text = element_text(size = 12)) +
theme_minimal()
In my analysis, I used the “transactions” and “demographics” data sets from the CompleteJourney package. After merging transactions and demographics, I summarized total spending by age and income groups. Next, I merged both data sets with the “products” data set and made multiple visualizations to find certain trends on product spending within age and income groups. After finding the spending results from the top 5 products in the age and income groups, I made my recommendations.
Based on the items listed in the tibble from earlier, the top 5 items were: Soft Drinks, Beef, Fluid Milk Products, Frozen Pizza, and Cheese. My first recommendation would be to target the younger age groups with promotions. This age group of 19-24 years old had the lowest overall spending out of all of the groups, meaning there is room to gather some more activity from them. I recommend that Regork should introduce either a student discount for college kids and maybe offer a bundle deal that includes frozen pizza and soft drinks. Coupons being created for certain products would also be a great idea, since this age groups heavily relies on discounts.
Regarding the 55+ age groups, they are a lower spending group, but they primarily buy Baked Breads, Fluid Milk Products, and Beef. I suggest that Regork increases their marketing towards this group on healthy options such as the convenient products like dairy and baked bread.
The highest spending groups is the middle-aged group that consists of ages 45-54. They are the leading buyers of almost all of the top 5 products listed above. I suggest that Regork keeps up on marketing to this group and offering new promotions for all of these products to ensure repeat purchases. For the higher income levels in this group, Regork can offer premium products at a higher price to create new top products specifically for this group. Premium products will double revenue and provide the store with plenty of options for all groups, no matter the income level or age.
In conclusion, the recommendations above will give Regork what they need to ensure that they are marketing to the right groups and offering sought after products to ensure that engagement improves among the age groups, and that they are maximizing revenue.