library(ggridges)
library(ggplot2)
library(completejourney)
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(stringr)
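# Load the full transactions data set
transactions <- get_transactions()

# Keep peanut butter / jelly / jam products, then attach their transactions
# and the demographics of the purchasing households
jams_data <- products %>%
  filter(str_detect(product_category, regex("PNT BTR/JELLY/JAMS", ignore_case = TRUE))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%  # keeps only households with demographic data
  mutate(month = month(transaction_timestamp))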
# geom_col() stacks every purchase row, so each bar's height is the total sales value
age_dist <- ggplot(jams_data, aes(x = age, y = sales_value)) +
  geom_col() +
  facet_wrap(~household_comp) +
  labs(x = "Age Group", y = "Sales Value", title = "Sales by Age Group and Household Comp")
print(age_dist)
First, I want to find out whether a specific age group or household composition buys the most jam. Apart from the 1 Adult Kids group, where sales are too low to show a clear pattern, age appears to be a factor in jam sales. Among 2 Adults Kids households, the 35-44 age group buys the most jam. The two remaining groups without kids show a consistent trend: people aged 45-54 seem to like jam the most. Across all household compositions, the 19-24 group shows the least interest in buying jam.
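Because the bars stack every purchase row, a taller bar can mean either more households in a group or more spending per household. Here is a minimal sketch of how the totals could be checked numerically. `summarise_sales_by_group()` is a hypothetical helper, the toy rows are made up for illustration (they are not completejourney data), and the column names are assumed to match `jams_data`.

```r
library(dplyr)

# Hypothetical helper: total sales, household count, and sales per household
# for each household composition / age group combination.
summarise_sales_by_group <- function(data) {
  data %>%
    group_by(household_comp, age) %>%
    summarise(
      total_sales = sum(sales_value),
      households  = n_distinct(household_id),
      .groups = "drop"
    ) %>%
    mutate(sales_per_household = total_sales / households) %>%
    arrange(desc(total_sales))
}

# Toy rows, made up for illustration only
toy_sales <- tibble(
  household_id   = c("1", "1", "2", "3"),
  household_comp = c("2 Adults Kids", "2 Adults Kids", "2 Adults Kids", "1 Adult No Kids"),
  age            = c("35-44", "35-44", "35-44", "45-54"),
  sales_value    = c(2.5, 3.0, 1.5, 1.99)
)

sales_summary <- summarise_sales_by_group(toy_sales)
print(sales_summary)
```

Calling `summarise_sales_by_group(jams_data)` on the real data should give the numbers behind each bar, and the `sales_per_household` column helps separate group size from how much each household spends.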
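# Map fill to the x value and use the gradient geom so the magma scale takes effect;
# without a fill mapping, scale_fill_viridis_c() and the fill label do nothing
avg_sales_value <- ggplot(jams_data, aes(x = sales_value, y = product_type, fill = after_stat(x))) +
  ggridges::geom_density_ridges_gradient() +
  scale_x_continuous(labels = scales::dollar) +
  scale_fill_viridis_c(option = "magma") +
  labs(title = "Distribution of Sales by Product Type",
       x = "Sales Value ($)",
       y = "Product Type",
       fill = "Sales")
print(avg_sales_value)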
## Picking joint bandwidth of 0.211
Next, I look more closely at the five product types to see how their sales values are distributed. Apple butter has the lowest sales values, while peanut butter seems to have the highest, with its density peaking at around $2.50. So I want to look deeper into whether price is the main factor that affects sales.
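Note that `sales_value` is the amount paid on a purchase row, not the price of one item, so a price comparison needs a per-unit figure. The sketch below shows one way this could be approximated, assuming that `sales_value / quantity` is a reasonable stand-in for unit price and that rows with a quantity of zero should be dropped. `unit_price_by_type()` is a hypothetical helper and the toy rows are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: approximate unit price and total quantity per product type.
# Assumption: sales_value / quantity approximates the price of a single item.
unit_price_by_type <- function(data) {
  data %>%
    filter(quantity > 0) %>%                        # avoid dividing by zero
    mutate(unit_price = sales_value / quantity) %>%
    group_by(product_type) %>%
    summarise(
      median_unit_price = median(unit_price),
      total_quantity    = sum(quantity),
      .groups = "drop"
    ) %>%
    arrange(desc(median_unit_price))
}

# Toy rows, made up for illustration only
toy_prices <- tibble(
  product_type = c("PEANUT BUTTER", "PEANUT BUTTER", "APPLE BUTTER", "APPLE BUTTER"),
  sales_value  = c(5.0, 2.5, 1.5, 0.0),
  quantity     = c(2, 1, 1, 0)
)

price_summary <- unit_price_by_type(toy_prices)
print(price_summary)
```

Running `unit_price_by_type(jams_data)` would put median unit price and total quantity side by side, which is the comparison needed to judge whether cheaper product types actually sell more.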
# Map fill to product_type so the Set2 palette takes effect;
# the legend is hidden because it would only repeat the x axis
flavor <- ggplot(jams_data, aes(x = product_type, y = quantity, fill = product_type)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~household_comp) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 10)) +
  labs(x = "Product Type", y = "Sales Quantity", title = "Sales Quantity by Flavors and Household Comp") +
  scale_fill_brewer(palette = "Set2")
print(flavor)
Against my expectation, even though Peanut Butter and Preserves Jam Marmalade have the highest sales values, they are also the products people bought most. It seems to be a national preference: whether or not a household has kids, the pattern in quantity sold stays consistent across groups. Go Peanut Butter!
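Whether the pattern really is consistent across household types is easier to judge from shares than from raw bar heights, since the groups differ in size. A minimal sketch of that check follows. `quantity_share_by_comp()` is a hypothetical helper, the toy rows are made up for illustration, and the column names are assumed to match `jams_data`.

```r
library(dplyr)

# Hypothetical helper: each product type's share of total quantity
# within every household composition.
quantity_share_by_comp <- function(data) {
  data %>%
    group_by(household_comp, product_type) %>%
    summarise(quantity = sum(quantity), .groups = "drop") %>%
    group_by(household_comp) %>%
    mutate(share = quantity / sum(quantity)) %>%
    ungroup() %>%
    arrange(household_comp, desc(share))
}

# Toy rows, made up for illustration only
toy_quantities <- tibble(
  household_comp = c("2 Adults Kids", "2 Adults Kids", "2 Adults Kids",
                     "1 Adult No Kids", "1 Adult No Kids"),
  product_type   = c("PEANUT BUTTER", "PEANUT BUTTER", "APPLE BUTTER",
                     "PEANUT BUTTER", "APPLE BUTTER"),
  quantity       = c(2, 1, 1, 3, 1)
)

share_summary <- quantity_share_by_comp(toy_quantities)
print(share_summary)
```

If the shares from `quantity_share_by_comp(jams_data)` are similar in every household composition, that would support the reading that the preference does not depend on having kids.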