Introduction

Essential Business Question: The business question this report will attempt to solve is if we are doing the most possible business in every income demographic possible.

Thought Process: By Focusing on income levels we can see which provide the most value to our company and plan accordingly

Plan of Attack: Utilize product, transaction, and demographics data in order to provide a valuable analysis on how we can better our products for our customers and bring the company more profit.

Packages Required

Complete Journey- This gives us all of the Regork data needed for the project

GG PLOT 2- This package allows us to have awesome visualizations to help draw conclusions and make predictions about our data

DPLYR- This package allows us to format and use our data in advantageous ways

Data Tables

Transactions

Demographics

Products

transactions


demographics -> demographics

products -> products 

Most Important Data Used

# income_count to show number of customers per income level
income_count <- demographics %>%
  group_by(income) %>%
  summarize(num_customers = n()) %>%
  ungroup()

#plot it

ggplot(data = income_count, aes(x = income, y = num_customers)) +
  geom_col(fill = "skyblue") +
  labs(
    title = "Number of Customers Per Income Level",
    x = "Income Level",
    y = "Number of Customers"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Number of Customers Per Income Level

This chart is helping us visualize the number of customers we serve per income level. While we expect more customers close to the national average in the $35,000-$100,000 range, it appears we are severely low on 200-249K customers.While we have enough customers in the other categories, this report will focus on increasing our customer base in the 200-249k customer range. Now we should check out the transactions per household.

trans_per_household <- transactions %>%
  group_by(household_id) %>%
  summarize(num_transactions = n_distinct(basket_id))

# join with demo
trans_per_household_income <- trans_per_household %>%
  left_join(demographics, by = "household_id")

# look over the data/make sure it is looking good
head(trans_per_household_income)
## # A tibble: 6 × 9
##   household_id num_transactions age   income home_ownership marital_status
##   <chr>                   <int> <ord> <ord>  <ord>          <ord>         
## 1 1                          51 65+   35-49K Homeowner      Married       
## 2 10                          1 <NA>  <NA>   <NA>           <NA>          
## 3 100                        21 <NA>  <NA>   <NA>           <NA>          
## 4 1000                       85 <NA>  <NA>   <NA>           <NA>          
## 5 1001                       46 45-54 50-74K Homeowner      Unmarried     
## 6 1002                       16 <NA>  <NA>   <NA>           <NA>          
## # ℹ 3 more variables: household_size <ord>, household_comp <ord>,
## #   kids_count <ord>
#o see the avg # of transactions per household in each income level and filter out n/a
transactions_by_income <- trans_per_household_income %>%
  group_by(income) %>%
  summarize(avg_transactions = mean(num_transactions, na.rm = TRUE),
            count_households = n()) %>%
  ungroup()

#look over again
head(transactions_by_income)
## # A tibble: 6 × 3
##   income    avg_transactions count_households
##   <ord>                <dbl>            <int>
## 1 Under 15K            121.                61
## 2 15-24K                98.0               74
## 3 25-34K               115.                77
## 4 35-49K                99.2              172
## 5 50-74K                98.6              192
## 6 75-99K                89                 96
#plot it
ggplot(transactions_by_income, aes(x = income, y = avg_transactions)) +
  geom_col(fill = "skyblue") +
  labs(
    title = "Average Number of Transactions Per Household by Income",
    x = "Income Range",
    y = "Avg. Transactions per Household"
  ) +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Average Number of Transactions Per Household

This chart helps paint a clear picture that we need to increase our number of customers in the 200-249k range and we need to find a way to increase their transactions. This chart also showcases all of the null/N/A data we have that can take away from the accuracy. This data also helps show some great consistency with the other income groups, besides $200-$249k

transactions_per_basket <- transactions %>%
  group_by(household_id, basket_id) %>%
  summarize(
    total_spend = sum(sales_value, na.rm = TRUE),
    .groups = "drop"    
  )
trans_basket_income <- transactions_per_basket %>%
  left_join(demographics, by = "household_id")

spend_by_income <- trans_basket_income %>%
  group_by(income) %>%
  summarize(
    avg_spend_per_transaction = mean(total_spend, na.rm = TRUE),
    num_transactions = n(),  
    .groups = "drop"
  )

ggplot(spend_by_income, aes(x = income, y = avg_spend_per_transaction)) +
  geom_col(fill = "skyblue") +
  labs(
    title = "Avg Dollars Spent Per Transaction",
    x = "Income",
    y = "Avg $ Spend Per"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Average Dollars Spent Per Transaction

This helps show the importance of increasing our customer base in the 200-249k category. The 200-249k is among the highest spending per transaction. This provides our company a massive void to fill that can increase our customer base in this income rage.

#join/filter the data for the graph
transactions_200_249 <- transactions %>%
  left_join(demographics, by = "household_id") %>%
  left_join(products,  by = "product_id") %>%
  filter(income == "200-249K")
#group
top_sales_categories <- transactions_200_249 %>%
  group_by(product_category) %>%
  summarize(
    total_sales = sum(sales_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_sales)) %>%
  slice_head(n = 10)   

#plot it

ggplot(top_sales_categories, aes(
  x = reorder(product_category, total_sales),
  y = total_sales
)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(
    title = "Top 10 Products by Sales For $200K-$249K)",
    x = "Products",
    y = "Total Sales In Dollars"
  ) +
  theme_minimal(base_size = 14)

Top 10 Products For $200-$249k

This data helps us see what products we can capitalize on to increase growth from the $200-$249k range. We can see that they primarily buy groceries in their transactions. Considering their income, one thing we can do a better job at is offering higher quality beef products to suit their demands. In addition, we can do a better job at advertising that we have pet products available at our store.

Limitations of Analysis

There are a few major limitations in the analysis.

Conclusion:

The new tables and data visualizations paint a clear picture on a fast way to improve our company. My official recommendation would be to run advertising campaigns centered at the $200-$249k income group. We should feature the products these households have been proven to buy with us, especially since they spend a high amount per transaction. If we can get their market share on par with other similarly sized markets, it will lead to near instant growth.