1. Introduction

Summary of report and problem statement:

For this analysis, we are investigating trends in the sales values, as it relates to various demographics, over the past year from 2,469 households, provided by 84.51°, stored within the completejourney package.

Through our analysis, we are aiming to show the purchasing behavior among various demographics, including income range, age group, marital status, and household size. We hopr to provide insights on their impact on the products being bought, the departments where most purchases occur, and ultimately the total sales throughout the year. We hope to answer these questions to help the finance team to locate areas where money could be invested or divested throughout the product mix. On the other hand, we will provide our insights to the marketing team so they are able to better market products and departments to the audiences that are contributing more to the total sales.

Please use the various tabs to navigate through our analysis which includes information on the set up of our analysis and 7 visuals to help analyze the trends among different demographics. The summary tab includes information on key takeaways and implications for our findings.

2. Packages Required

The associated packages used in our analysis, including a short description:

  • completejourney: Retail shopping transactions for 2,469 households over one year
  • tidyverse: A collection of packages designed for data science
  • ggplot2: A system for ‘declaratively’ creating graphics
  • dplyr: Package used for data manipulation
  • lubridate: Provide tools that make it easier to parse and manipulate dates
  • DT: Package allows for exporting data into HTML output easily
  • stringr: Set of functions designed to work with strings
  • wordcloud: Package used for plotting a word cloud visual
library(tidyverse)
library(ggplot2)
library(dplyr)
library(lubridate)
library(completejourney)
library(DT)
library(stringr)
library(wordcloud)

3. Data Preperation

#Importing data sets needed for analysis
transactions <- get_transactions()
products <- products
demographics <- demographics

#Joining, mutating and cleaning data sets
tran_prod_dem <- transactions %>% 
  inner_join(products) %>% 
  inner_join(demographics) %>% 
  mutate(Transaction_date = date(transaction_timestamp))

glimpse(tran_prod_dem)
## Rows: 825,622
## Columns: 25
## $ household_id          <chr> "900", "900", "1228", "906", "906", "906", "906"…
## $ store_id              <chr> "330", "330", "406", "319", "319", "319", "319",…
## $ basket_id             <chr> "31198570044", "31198570047", "31198655051", "31…
## $ product_id            <chr> "1095275", "9878513", "1041453", "1020156", "105…
## $ quantity              <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales_value           <dbl> 0.50, 0.99, 1.43, 1.50, 2.78, 5.49, 1.50, 1.00, …
## $ retail_disc           <dbl> 0.00, 0.10, 0.15, 0.29, 0.80, 0.50, 0.29, 0.29, …
## $ coupon_disc           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ transaction_timestamp <dttm> 2017-01-01 06:53:26, 2017-01-01 07:10:28, 2017-…
## $ manufacturer_id       <chr> "2", "69", "69", "2142", "2326", "608", "2326", …
## $ department            <chr> "PASTRY", "GROCERY", "GROCERY", "GROCERY", "GROC…
## $ brand                 <fct> National, Private, Private, National, National, …
## $ product_category      <chr> "ROLLS", "FACIAL TISS/DNR NAPKIN", "BAG SNACKS",…
## $ product_type          <chr> "ROLLS: BAGELS", "FACIAL TISSUE & PAPER HANDKE",…
## $ package_size          <chr> "4 OZ", "85 CT", "11.5 OZ", "17.1 OZ", "5.0 OZ",…
## $ age                   <ord> 35-44, 35-44, 45-54, 55-64, 55-64, 55-64, 55-64,…
## $ income                <ord> 35-49K, 35-49K, 100-124K, Under 15K, Under 15K, …
## $ home_ownership        <ord> Homeowner, Homeowner, NA, Homeowner, Homeowner, …
## $ marital_status        <ord> Married, Married, Unmarried, Married, Married, M…
## $ household_size        <ord> 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ household_comp        <ord> 2 Adults No Kids, 2 Adults No Kids, 1 Adult No K…
## $ kids_count            <ord> 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Transaction_date      <date> 2017-01-01, 2017-01-01, 2017-01-01, 2017-01-01,…

4. Analysis

A. Word Cloud

#making a data frame of words and its frequency
cloud_topn <- tran_prod_dem %>% 
  group_by(department) %>%
  summarise(no_rows = NROW(department)) %>% 
  mutate(freq = (no_rows/sum(no_rows))*100) %>% 
  arrange(desc(no_rows)) 

cloud_topn

#building the word cloud
wordcloud(words = cloud_topn$department, freq = cloud_topn$no_rows, min.freq = 2,
          max.words=200, random.order=FALSE, rot.per=0.35,scale=c(4,.5), 
          colors=brewer.pal(8, "Dark2")) 

#making a data frame of words and its frequency
cloud_grocery <- tran_prod_dem %>% 
  filter(department == "GROCERY") %>% 
  group_by(product_category) %>%
  summarise(no_rows = length(department)) %>% 
  mutate(freq = (no_rows/sum(no_rows))*100) %>% 
  arrange(desc(no_rows)) %>% 
  filter(freq >= 1)

cloud_grocery

#building the word cloud
wordcloud(words = cloud_grocery$product_category, freq = cloud_grocery$no_rows, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,scale=c(1,.5), 
          colors=brewer.pal(8, "Dark2"))

Summary/Explanation:

When analyzing where investments should be made in relation to marketing, promotions, etc., we were curious to analyze the dispersion of frequency among the different departments. The grocery department took up a 6% relative frequency of all transactions. Where as, all other departments took 1.07% or less relative frequency of transactions.

Following the analysis of department transaction frequencies, we were curious to learn more about what products were being purchased within the Grocery department. Interestingly, we found soft drinks to take the lead at 6.67% of transactions, followed by fluid milk products at 5.24% and baked bread/buns/rolls at 5.06%.

B. Total Sales by Product Type and Household Size

tran_prod_dem %>%
  group_by(household_size, product_type) %>%
  summarize(total = sum(sales_value, na.rm = TRUE)) %>% 
  arrange(desc(total)) %>%
  slice(2:6) %>%
  ggplot() +
  geom_bar(aes(x = product_type, y = total, fill = household_size), stat = 'identity') +
  ylab("Total Sales") +
  xlab("Product Type") +
  theme(axis.text.x = element_text(size = 6, angle = 45, vjust = 1, hjust = 1))+
  labs(title = "Sales of Top Product Types by Household Size",
       x = "Product Type",
       y = "Total Sales",
       fill = "Household Size")+
    scale_y_continuous(name = "Total Sales", labels = scales::dollar) 

Summary/Explanation:

This graph represents the total sales by product type and household size. The graph indicates that fluid white milk is the product with the most sales among all households and shredded cheese being the product with the least sales among all households.

We can also see that household with 5 or more people also contribute the least amount to total sales, overall. Where as, households of size 1 or size 2 make more of an impact towards the sales value of these product types.

C. Total Sales Throughout 2017 by Age Group

tran_prod_dem %>%
  inner_join(products, by = "product_id") %>%
  mutate(month = month(transaction_timestamp, label = TRUE)) %>%
  group_by(Transaction_date, age) %>%
  summarize(total_sales = sum(sales_value, na.rm = FALSE)) %>%
  ggplot(aes(x = Transaction_date, y = total_sales, color = age)) +
  geom_line() +
  scale_y_continuous(name = "Total Sales", labels = scales::dollar) +
  labs(title = "Total Sales by Age Group",
       subtitle = "Comparing the total sales of all products over the span of 2017",
       x = "Month",
       y = "Total Sales",
       color = "Age")

Summary/Explanation:

This graph provides insights as to which age groups are contributing more or less to total sales throughout 2017, and the purchasing trends the groups are making. For example, we can see that around the holiday season, there is a spike in sales within the 45-54 age group. We also see this group being the main contributing group to total sales throughout the entire year.

On the other hand, we can see the age groups of 19-24, 55-64 and 65+ are contributing the least amount towards the total sales, with a fairly steady mean and variance throughout the year.

D. Total Sales by Income Range and Kid Count

demographics %>%
  group_by(kids_count, income) %>%
  inner_join(transactions, by = "household_id") %>%
  summarize(total_sales = sum(sales_value, na.rm = FALSE)) %>%
  ggplot(aes(x = kids_count, y = total_sales)) +
  geom_point(aes(color = income)) +
  scale_y_continuous(name = "Total Sales", labels = scales::dollar) +
  labs(title = "Total Sales of All Products per Income range per Kids count",
       subtitle = "Comparing the total sales of all products amongst income ranges and kids count",
       x = "Kids Count",
       y = "Total Sales",
       color = "Income")

Summary/Explanation:

The point graph above shows how income range and # of kids can be a determining factor in total sales. You would think that families in the higher income range would contribute to higher sales, however, it is the contrary. The highest income range of ‘250k’ is seen on the lower side of total sales than many of the other income ranges. However, as the graph progresses to the right, there is a decrease in the maximum total sales. This proves the idea that kids are expensive, so money has to be diversified more than the families with zero kids

E. Top 3 Products

tran_prod_dem %>%
  group_by(product_type) %>%
  filter(kids_count >= 1) %>%
  summarize(total_sales = sum(sales_value, na.rm = FALSE)) %>%
  arrange(desc(total_sales))
### GASOLINE-REG UNLEADED, FLUID MILK WHITE ONLY, SOFT DRINKS 12/18&15PK CAN CAR

tran_prod_dem %>%
  filter(str_detect(product_type, regex("FLUID MILK WHITE ONLY|SOFT DRINKS 12/18&15PK CAN CAR|GASOLINE-REG UNLEADED", ignore_case = TRUE))) %>%
  filter(kids_count >= 1) %>%
  mutate(month = month(transaction_timestamp, label = TRUE)) %>%
  group_by(product_type, month) %>%
  ggplot(aes(x = product_type, y = sales_value, fill = month)) +
  geom_col(position = 'dodge') +
  coord_cartesian(ylim = c(0, 250)) +
  scale_y_continuous(name = "Sales Value", labels = scales::dollar) +
  labs(title = "Comparing the Sales Value of the Top 3 Products of Families with kids", 
       subtitle = "Using the 2017 data provided this graph was created",
       x = "Top 3 Products",
       y = "Sales Value",
       fill = "Months")

Summary/Explanation:

The column graph shows that there isn’t a constant demand for all of the products throughout the year. The obvious outlier is in the month of July for the product Gasoline. On the other hand, milk has a consistent demand throughout the year with slight peaks or valleys.

F. Sales Value per Department

tran_prod_dem %>% 
  filter(marital_status != "NA") %>% 
  ggplot(aes(x = sales_value, y = department)) + geom_line(aes(color = marital_status)) +
  labs(title = "Sales Value by Department and Marital Status",
       subtitle = "Brands Labeled",
       x = "Sales Value",
       y = "Department",
       color = "Marital Status") +
    scale_x_continuous(name = "Total Sales", labels = scales::dollar) 

Summary/Explanation:

This graph represents the Sales Value per Department based on marital status (married vs unmarried). X-axis represents the sales value and Y-axis represents department, and then you see a purple and yellow color on each individual line. The purple represents married couples and yellow represents unmarried. You can see that most departments are relatively similar in sales value, however there are 2 that are significantly larger than the rest, the drug department and the miscellaneous department. However, with our aim to figure out where our sales come from in terms of demographics, we can see that the majority come from unmarried couples.

Overall, this allows us to understand that unmarried couples are choosing to spend their money more than married. This could be because those who are not married have less financial responsibilities. They are not having to potentially pay for kid expenses, or buying things for a spouse, etc.

5. Summary

Problem:

The problem we were trying to assess through the complete journey data was how the use of the top sales values and specific demographics to find the purchasing behaviors and trends and decide when and what products would be the best to invest in.

Addressing the Problem:

We addressed this issue by using the data provided for the transactions, demographics, and products to create graphs representing the information needed. We produced graphs showing the comparison of family total sales of ranges of income and number of kids, which products had the most sales value for families of 5 or more, and the sales value of the top 3 products per month. We also used graphs to determine the total sales of each month in 2017, and the sales value of each department and whether the brand was national or private. Another graph we used to address the problem in a more visual way was through word clouds that showed the frequency of the top departments and transactions. After creating and analyzing these graphs representing the relationships of demographics and transactions, we could conduct some insights about each

Insights:

We found that the highest-income families were surprisingly not the ones taking up more of the total sales. Through sales value per department, we found that there are multiple departments that don’t have very much sales value and that most departments have customers buying private brands. We also found that “Fluid White Milk” is a product that has the most sales for families of 5 or more, is one of the top sales over all in the products and is also very consistently bought throughout the year. It was also interesting to see how the total sales in 2017 were at a constant rate through most of the months of the year, even through the summer, and then significantly took a decline to less than $4,000 in sales in between November and December. It then took an even larger spike up to almost $16,000 at the end of the 2017 year. When looking at the different departments our Word Cloud showed that Grocery had the highest number of transactions with all other departments falling about 5% behind. This word made it easy to see which department would be more useful to invest in.

Implications:

From these insights we were able to conclude that even though it might seem like the higher income families would be part of more of the total sales, but it actually could just depend on the number of children in the family or just the amount of family members in general. We concluded that it would be good to invest in some of the departments with higher sales values shown in the “Sales Value per Department”, “Top 3 Products”, and “Sales for Households over 5”.

These show that departments such as Grocery, Drug GM, Miscellaneous, and Fuel have higher sales value and have more of a variety of items that more of the customers purchase. Even further looking into products the “Total Sales of All Products per Month” graph helps us to conclude that the time to best invest would be near the end of the year when total sales go up by a great amount most likely because of the holiday seasons and all the products customers are purchasing that time of year.

We are also suggesting that those departments and products with the best sales value should be used in new ways to market customers with more kids, certain incomes, and larger households. This can be done by using promotions that specifically cater to those top departments and how they relate to the demographics. It may also be a good idea to use the information showing the departments with little or no sales to focus on trying to bring these sales up and figure out ways to incorporate certain demographics in marketing to increase these sales in the lower departments.

Limitations of our analysis:

Some things that might be helpful to go more in depth with at a later time would be: which departments have the most purchases between each demographic. It also might have been better to find more statistics of the age demographic as this would help give a better understanding of what age groups we are dealing with and how to market to them in a supportive way. Our biggest improvement would be to dive a little deeper into certain demographics while we went into a few but mostly showed the relationships of a few of them with the sales values.