Phase 1 : Ask

Scenario :

At Quantium, I had the opportunity to work as part of the Retail Analytics Team, where I tackled an exciting project aimed at analyzing customer purchasing behavior for the chips category. This project, commissioned by the client, Julia, the Category Manager for Chips, sought to uncover actionable insights that could drive the supermarket’s strategic plan for this category over the next six months.

Project Overview :

Upon receiving the project brief, my manager, Zilinka, outlined the key tasks and expectations, ensuring I was set up for success. The objective was to gain a deeper understanding of customer segmentation and purchasing behavior in the chips category, enabling strategic recommendations for optimizing sales and targeting the right customer segments.

Key Questions :

  • What are the current purchasing trends and behaviors?
  • How do customer segments differ in terms of chip purchasing behavior?
  • Which metrics best describe customer spending and what drives spending within each segment?
  • How can insights from the data inform strategies to target key customer segments effectively?

Deliverables :

  • Initial findings supported by summaries, charts, and metrics.
  • A strategic recommendation aligned with Julia’s needs.
  • Final work saved as a .pdf file for submission and review (a render sketch follows this list).
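
One way to produce that PDF is knitting the analysis from R Markdown; a minimal sketch, assuming the notebook lives in a hypothetical analysis.Rmd:

# Hypothetical file name; knits the notebook to PDF for submission
rmarkdown::render("analysis.Rmd", output_format = "pdf_document")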

Framework :

  • Following Google’s Data Analytics Framework, I approached the project through these six phases:

  1. Ask :
  • Define the problem and establish the goals of the analysis. This phase focuses on understanding what the stakeholders want to achieve and framing the right questions to address the business problem.
  2. Prepare :
  • Collect and organize data from various sources, ensuring it is reliable, useful, and ready for further analysis.
  3. Process :
  • Clean and transform the data to prepare it for analysis. This step ensures that the data is accurate, consistent, and free from errors.
  4. Analyze :
  • Use analytical tools and techniques to uncover insights, patterns, and trends in the data. This phase involves exploring the data and applying statistical or computational methods.
  5. Share :
  • Communicate the findings and insights in a clear and impactful way to stakeholders through reports, visualizations, or presentations.
  6. Act :
  • Use the insights to make data-driven decisions and take action to solve the problem or improve outcomes. This phase ensures the analysis adds value by enabling change or improvement.


Phase 2 : Prepare

Scope of Work :

Data Preparation:

  • Cleaned transaction and customer data by identifying and handling inconsistencies, outliers, and missing values, ensuring accuracy for analysis.

Analysis Focus:

  • Defined key metrics like total sales and sales drivers, and derived features such as pack size and brand name.
  • Conducted EDA to uncover trends, applied descriptive statistics, ANOVA, and Tukey HSD post-hoc tests for segment analysis.
  • Performed Chi-Squared tests to examine categorical relationships and conducted correlation analysis to identify key sales drivers.
  • Analyzed customer segments and purchasing patterns.

Tools and Approach:

  • Preferred tool: R, used to create visualizations and summarize key insights from data exploration and modeling.
  • Built Random Forest and Linear Regression models to predict sales, assess variable importance, and perform residual analysis.

Libraries:

library(tidyverse)     # dplyr, tidyr, ggplot2, readr, stringr, and friends
library(skimr)         # quick data summaries
library(corrplot)      # correlation matrix plots
library(forecast)      # ARIMA modeling and forecasting
library(zoo)           # ordered time-series objects
library(lubridate)     # date handling
library(janitor)       # data-cleaning helpers
library(readxl)        # Excel import
library(ggcorrplot)    # ggplot-based correlation heatmaps
library(ggmosaic)      # mosaic plots
library(rstatix)       # pipe-friendly statistical tests
library(ggpubr)        # publication-ready ggplot helpers
library(randomForest)  # random forest models
library(caret)         # model training and validation
library(broom)         # tidy model summaries


Phase 3 : Process

Data Import and Cleaning:
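
The datasets were loaded before cleaning. A minimal import sketch, assuming the standard Quantium task file names (the paths are assumptions); base read.csv matches the data.frame structure shown for purchase_data, while read_excel yields the tibble structure of transaction_data:

# File names are assumptions; adjust paths as needed
purchase_data <- read.csv("QVI_purchase_behaviour.csv")
transaction_data <- read_excel("QVI_transaction_data.xlsx")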

purchase_data:

head(purchase_data)
##   LYLTY_CARD_NBR              LIFESTAGE PREMIUM_CUSTOMER
## 1           1000  YOUNG SINGLES/COUPLES          Premium
## 2           1002  YOUNG SINGLES/COUPLES       Mainstream
## 3           1003         YOUNG FAMILIES           Budget
## 4           1004  OLDER SINGLES/COUPLES       Mainstream
## 5           1005 MIDAGE SINGLES/COUPLES       Mainstream
## 6           1007  YOUNG SINGLES/COUPLES           Budget

transaction_data:

head(transaction_data)
## # A tibble: 6 × 8
##    DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME    PROD_QTY TOT_SALES
##   <dbl>     <dbl>          <dbl>  <dbl>    <dbl> <chr>           <dbl>     <dbl>
## 1 43390         1           1000      1        5 Natural Chi…        2       6  
## 2 43599         1           1307    348       66 CCs Nacho C…        3       6.3
## 3 43605         1           1343    383       61 Smiths Crin…        2       2.9
## 4 43329         2           2373    974       69 Smiths Chip…        5      15  
## 5 43330         2           2426   1038      108 Kettle Tort…        3      13.8
## 6 43604         4           4074   2982       57 Old El Paso…        1       5.1

Summary of ‘purchase_data’:

summary(purchase_data)
##  LYLTY_CARD_NBR     LIFESTAGE         PREMIUM_CUSTOMER  
##  Min.   :   1000   Length:72637       Length:72637      
##  1st Qu.:  66202   Class :character   Class :character  
##  Median : 134040   Mode  :character   Mode  :character  
##  Mean   : 136186                                        
##  3rd Qu.: 203375                                        
##  Max.   :2373711

Checked Data Structure:

str(purchase_data)
## 'data.frame':    72637 obs. of  3 variables:
##  $ LYLTY_CARD_NBR  : int  1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
##  $ LIFESTAGE       : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "OLDER SINGLES/COUPLES" ...
##  $ PREMIUM_CUSTOMER: chr  "Premium" "Mainstream" "Budget" "Mainstream" ...

• The dataset consists of 72,637 observations and 3 variables.

Handled Missing Values:

skim_without_charts(purchase_data)
Data summary
Name purchase_data
Number of rows 72637
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
LIFESTAGE 0 1 8 22 0 7 0
PREMIUM_CUSTOMER 0 1 6 10 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
LYLTY_CARD_NBR 0 1 136185.9 89892.93 1000 66202 134040 203375 2373711

• The dataset is free of NA values and empty strings.

Checked for Duplicate Data Points:

purchase_data <- purchase_data %>% 
  distinct()
str(purchase_data)
## 'data.frame':    72637 obs. of  3 variables:
##  $ LYLTY_CARD_NBR  : int  1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
##  $ LIFESTAGE       : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "OLDER SINGLES/COUPLES" ...
##  $ PREMIUM_CUSTOMER: chr  "Premium" "Mainstream" "Budget" "Mainstream" ...

• Applying distinct() left the row count unchanged at 72,637, confirming the dataset has no duplicate rows.

Cross-Tabulated Customer Segments:

table(purchase_data$LIFESTAGE, purchase_data$PREMIUM_CUSTOMER)
##                         
##                          Budget Mainstream Premium
##   MIDAGE SINGLES/COUPLES   1504       3340    2431
##   NEW FAMILIES             1112        849     588
##   OLDER FAMILIES           4675       2831    2274
##   OLDER SINGLES/COUPLES    4929       4930    4750
##   RETIREES                 4454       6479    3872
##   YOUNG FAMILIES           4017       2728    2433
##   YOUNG SINGLES/COUPLES    3779       8088    2574

• The cross-tabulation of life stage and customer tier confirms every segment combination is populated, with Mainstream Young Singles/Couples the largest single group.

Summary of ‘transaction_data’:

summary(transaction_data)
##       DATE         STORE_NBR     LYLTY_CARD_NBR        TXN_ID       
##  Min.   :43282   Min.   :  1.0   Min.   :   1000   Min.   :      1  
##  1st Qu.:43373   1st Qu.: 70.0   1st Qu.:  70021   1st Qu.:  67602  
##  Median :43464   Median :130.0   Median : 130358   Median : 135138  
##  Mean   :43464   Mean   :135.1   Mean   : 135550   Mean   : 135158  
##  3rd Qu.:43555   3rd Qu.:203.0   3rd Qu.: 203094   3rd Qu.: 202701  
##  Max.   :43646   Max.   :272.0   Max.   :2373711   Max.   :2415841  
##     PROD_NBR       PROD_NAME            PROD_QTY         TOT_SALES      
##  Min.   :  1.00   Length:264836      Min.   :  1.000   Min.   :  1.500  
##  1st Qu.: 28.00   Class :character   1st Qu.:  2.000   1st Qu.:  5.400  
##  Median : 56.00   Mode  :character   Median :  2.000   Median :  7.400  
##  Mean   : 56.58                      Mean   :  1.907   Mean   :  7.304  
##  3rd Qu.: 85.00                      3rd Qu.:  2.000   3rd Qu.:  9.200  
##  Max.   :114.00                      Max.   :200.000   Max.   :650.000

Checked Data Structure:

str(transaction_data)
## tibble [264,836 × 8] (S3: tbl_df/tbl/data.frame)
##  $ DATE          : num [1:264836] 43390 43599 43605 43329 43330 ...
##  $ STORE_NBR     : num [1:264836] 1 1 1 2 2 4 4 4 5 7 ...
##  $ LYLTY_CARD_NBR: num [1:264836] 1000 1307 1343 2373 2426 ...
##  $ TXN_ID        : num [1:264836] 1 348 383 974 1038 ...
##  $ PROD_NBR      : num [1:264836] 5 66 61 69 108 57 16 24 42 52 ...
##  $ PROD_NAME     : chr [1:264836] "Natural Chip        Compny SeaSalt175g" "CCs Nacho Cheese    175g" "Smiths Crinkle Cut  Chips Chicken 170g" "Smiths Chip Thinly  S/Cream&Onion 175g" ...
##  $ PROD_QTY      : num [1:264836] 2 3 2 5 3 1 1 1 1 2 ...
##  $ TOT_SALES     : num [1:264836] 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...

• The dataset consists of 264,836 observations and 8 variables.

Handled Missing Values:

skim_without_charts(transaction_data)
Data summary
Name transaction_data
Number of rows 264836
Number of columns 8
_______________________
Column type frequency:
character 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PROD_NAME 0 1 17 40 0 114 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
DATE 0 1 43464.04 105.39 43282.0 43373.0 43464.0 43555.0 43646
STORE_NBR 0 1 135.08 76.78 1.0 70.0 130.0 203.0 272
LYLTY_CARD_NBR 0 1 135549.48 80579.98 1000.0 70021.0 130357.5 203094.2 2373711
TXN_ID 0 1 135158.31 78133.03 1.0 67601.5 135137.5 202701.2 2415841
PROD_NBR 0 1 56.58 32.83 1.0 28.0 56.0 85.0 114
PROD_QTY 0 1 1.91 0.64 1.0 2.0 2.0 2.0 200
TOT_SALES 0 1 7.30 3.08 1.5 5.4 7.4 9.2 650

• This dataset is also free of NA values and empty strings.

Checked for Duplicate Data Points:

transaction_data %>% 
  filter(duplicated(.)) %>% 
  print
## # A tibble: 1 × 8
##    DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME    PROD_QTY TOT_SALES
##   <dbl>     <dbl>          <dbl>  <dbl>    <dbl> <chr>           <dbl>     <dbl>
## 1 43374       107         107024 108462       45 Smiths Thin…        2         6

• A duplicate data point was identified.

Removed Duplicate Data:

transaction_data <- transaction_data %>% 
  distinct()
str(transaction_data)
## tibble [264,835 × 8] (S3: tbl_df/tbl/data.frame)
##  $ DATE          : num [1:264835] 43390 43599 43605 43329 43330 ...
##  $ STORE_NBR     : num [1:264835] 1 1 1 2 2 4 4 4 5 7 ...
##  $ LYLTY_CARD_NBR: num [1:264835] 1000 1307 1343 2373 2426 ...
##  $ TXN_ID        : num [1:264835] 1 348 383 974 1038 ...
##  $ PROD_NBR      : num [1:264835] 5 66 61 69 108 57 16 24 42 52 ...
##  $ PROD_NAME     : chr [1:264835] "Natural Chip        Compny SeaSalt175g" "CCs Nacho Cheese    175g" "Smiths Crinkle Cut  Chips Chicken 170g" "Smiths Chip Thinly  S/Cream&Onion 175g" ...
##  $ PROD_QTY      : num [1:264835] 2 3 2 5 3 1 1 1 1 2 ...
##  $ TOT_SALES     : num [1:264835] 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...

• Removed the duplicate observation, reducing the dataset to 264,835 unique records.

Formatted Excel Date (origin = “1899-12-30”, the origin Excel serial dates require because of Excel’s 1900 leap-year quirk):

transaction_data$DATE <- as.Date(transaction_data$DATE, origin = "1899-12-30")
str(transaction_data)
## tibble [264,835 × 8] (S3: tbl_df/tbl/data.frame)
##  $ DATE          : Date[1:264835], format: "2018-10-17" "2019-05-14" ...
##  $ STORE_NBR     : num [1:264835] 1 1 1 2 2 4 4 4 5 7 ...
##  $ LYLTY_CARD_NBR: num [1:264835] 1000 1307 1343 2373 2426 ...
##  $ TXN_ID        : num [1:264835] 1 348 383 974 1038 ...
##  $ PROD_NBR      : num [1:264835] 5 66 61 69 108 57 16 24 42 52 ...
##  $ PROD_NAME     : chr [1:264835] "Natural Chip        Compny SeaSalt175g" "CCs Nacho Cheese    175g" "Smiths Crinkle Cut  Chips Chicken 170g" "Smiths Chip Thinly  S/Cream&Onion 175g" ...
##  $ PROD_QTY      : num [1:264835] 2 3 2 5 3 1 1 1 1 2 ...
##  $ TOT_SALES     : num [1:264835] 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...

Merged Both Datasets into ‘merged_data’:

merged_data <- merge(purchase_data, transaction_data, by = "LYLTY_CARD_NBR")
str(merged_data)
## 'data.frame':    264835 obs. of  10 variables:
##  $ LYLTY_CARD_NBR  : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ LIFESTAGE       : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
##  $ PREMIUM_CUSTOMER: chr  "Premium" "Mainstream" "Budget" "Budget" ...
##  $ DATE            : Date, format: "2018-10-17" "2018-09-16" ...
##  $ STORE_NBR       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ TXN_ID          : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ PROD_NBR        : num  5 58 52 106 96 86 49 10 20 51 ...
##  $ PROD_NAME       : chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ PROD_QTY        : num  2 1 1 1 1 1 1 1 1 2 ...
##  $ TOT_SALES       : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...

• Merged the two datasets into a comprehensive data frame with 264,835 observations and 10 variables.

Renamed Column Names:

merged_data <- merged_data %>% 
  rename(card_id = LYLTY_CARD_NBR,
         life_stage = LIFESTAGE,
         tier = PREMIUM_CUSTOMER,
         date = DATE,
         store_id = STORE_NBR,
         txn_id = TXN_ID,
         product_id = PROD_NBR,
         product_name = PROD_NAME,
         quantity = PROD_QTY,
         revenue = TOT_SALES)

str(merged_data)
## 'data.frame':    264835 obs. of  10 variables:
##  $ card_id     : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ life_stage  : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
##  $ tier        : chr  "Premium" "Mainstream" "Budget" "Budget" ...
##  $ date        : Date, format: "2018-10-17" "2018-09-16" ...
##  $ store_id    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ txn_id      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_id  : num  5 58 52 106 96 86 49 10 20 51 ...
##  $ product_name: chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ quantity    : num  2 1 1 1 1 1 1 1 1 2 ...
##  $ revenue     : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...

Mutated ‘month_year’ Column:

merged_data <- merged_data %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) %>% 
  mutate(month_year = factor(format(date, "%b-%Y"), 
                             levels = c("Jul-2018", "Aug-2018", 
                                        "Sep-2018", "Oct-2018", 
                                        "Nov-2018", "Dec-2018", "Jan-2019", 
                                        "Feb-2019","Mar-2019", "Apr-2019", 
                                        "May-2019", "Jun-2019")))
str(merged_data)
## 'data.frame':    264835 obs. of  11 variables:
##  $ card_id     : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ life_stage  : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
##  $ tier        : chr  "Premium" "Mainstream" "Budget" "Budget" ...
##  $ date        : Date, format: "2018-10-17" "2018-09-16" ...
##  $ store_id    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ txn_id      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_id  : num  5 58 52 106 96 86 49 10 20 51 ...
##  $ product_name: chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ quantity    : num  2 1 1 1 1 1 1 1 1 2 ...
##  $ revenue     : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
##  $ month_year  : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...

Outlier Detection And Treatment:

merged_data %>% 
  ggplot(aes(quantity)) + geom_boxplot()

• The plot shows a strong right skew in quantity distribution, with potential outliers on the higher end.
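
Before removing anything, the extreme rows are worth inspecting directly; a quick check (a sketch):

# Inspect the extreme transactions before deciding to drop them
merged_data %>% 
  filter(quantity == 200)

Both 200-pack rows trace back to a single loyalty card, which suggests a commercial buyer rather than an ordinary retail customer, so excluding them is defensible.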

Filtered ‘merged_data’:

merged_data <- merged_data %>% 
  filter(quantity < 200)
str(merged_data)
## 'data.frame':    264833 obs. of  11 variables:
##  $ card_id     : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ life_stage  : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
##  $ tier        : chr  "Premium" "Mainstream" "Budget" "Budget" ...
##  $ date        : Date, format: "2018-10-17" "2018-09-16" ...
##  $ store_id    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ txn_id      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_id  : num  5 58 52 106 96 86 49 10 20 51 ...
##  $ product_name: chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ quantity    : num  2 1 1 1 1 1 1 1 1 2 ...
##  $ revenue     : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
##  $ month_year  : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...

• The merged dataset now consists of 264,833 observations and 11 variables (including the derived month_year).

Customer Segmentation By Revenue:

merged_data %>%
  ggplot(aes(tier,revenue, fill = tier)) +
  geom_boxplot() +
  geom_hline(yintercept = mean(merged_data$revenue), color = "purple", linetype = 2, linewidth = 1) +
  stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 2)), size = 4, vjust = -0.5) +
  theme_minimal() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold")) +
  labs(
    x = "Tiers",
    y = "Revenue",
    title = "Customer Segmentation by Revenue",
    caption = "Data Analyst : JP"
  )

• Outliers were identified and handled before creating this boxplot, so the tier comparison is not skewed by extreme values.

Enhanced Data Quality: Improved data accuracy and consistency by addressing missing values, outliers, and inconsistencies.

Standardized Data Formats: Ensured uniform data formats for further analysis and modeling.

Optimized Data Structure: Merged and transformed data into a suitable format for analysis.

Leveraged Data Cleaning Techniques: Employed data visualization techniques, such as box plots, to identify outliers in the quantity and revenue variables. Outliers were handled by filtering out the extreme transactions.



Phase 4 & 5 : Analyze and Share

Descriptive and Exploratory Data Analysis (EDA):

Univariate Analysis

For Numerical Variables:

merged_data %>% 
  select(where(is.numeric)) %>%
  skim() 
Data summary
Name Piped data
Number of rows 264833
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
card_id 0 1 135548.90 80580.03 1000.0 70021.0 130357.0 203094.0 2373711.0 ▇▁▁▁▁
store_id 0 1 135.08 76.78 1.0 70.0 130.0 203.0 272.0 ▆▇▇▇▇
txn_id 0 1 135157.72 78133.05 1.0 67600.0 135137.0 202700.0 2415841.0 ▇▁▁▁▁
product_id 0 1 56.58 32.83 1.0 28.0 56.0 85.0 114.0 ▇▇▇▇▇
quantity 0 1 1.91 0.34 1.0 2.0 2.0 2.0 5.0 ▁▇▁▁▁
revenue 0 1 7.30 2.53 1.5 5.4 7.4 9.2 29.5 ▆▇▁▁▁

• Suitable variables for histogram and density plots are revenue and quantity:

Histogram and Density Plot of ‘revenue’:

merged_data %>% 
  ggplot(aes(revenue)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
  geom_line(stat = "density", aes(y = after_stat(count)), color = "red") +
  theme_minimal() +
  theme(
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    legend.position = "none",
    plot.title = element_text(size = 16, face = "bold")) +
      labs(title = "Histogram and Density Distribution on Revenue",
           caption = "Data Analyst : JP")

• The plot shows most transactions generate low to moderate revenue, peaking near 5 with a long tail toward higher values. This distribution can guide revenue optimization and pricing strategies.

Histogram and Density Plot of ‘quantity’:

merged_data %>% 
  ggplot(aes(quantity)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
  geom_line(stat = "density", aes(y = after_stat(count)), color = "red") +
  theme_minimal() +
  theme(
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    legend.position = "none",
    plot.title = element_text(size = 16, face = "bold")) +
      labs(title = "Histogram and Density Distribution on Quantity",
           caption = "Data Analyst : JP")

• The plot shows most transactions involve small quantities, with a peak around 2 units and long tails. This insight can aid in optimizing inventory and pricing strategies.

Histogram and Density Plot of ‘store_id’:

merged_data %>% 
  ggplot(aes(store_id)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  geom_line(stat = "density", aes(y = after_stat(count)), color = "red") +
  theme_minimal() +
  theme(
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    legend.position = "none",
    plot.title = element_text(size = 16, face = "bold")) +
      labs(title = "Histogram and Density Distribution on Store IDs",
           caption = "Data Analyst : JP")

• Store ID distribution helps identify clusters or patterns in sales activity across stores.

For Categorical Variables:

merged_data %>%
  select(where(is.character)) %>%
  skim()
Data summary
Name Piped data
Number of rows 264833
Number of columns 3
_______________________
Column type frequency:
character 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
life_stage 0 1 8 22 0 7 0
tier 0 1 6 10 0 3 0
product_name 0 1 17 40 0 114 0

Tier Frequencies:

merged_data %>%
  count(tier) %>% 
  rename(counts = n)
##         tier counts
## 1     Budget  93157
## 2 Mainstream 101988
## 3    Premium  69688

Distribution of Member Tier:

merged_data %>%
  count(tier) %>%
  rename(counts = n) %>%
  mutate(percentage = counts / sum(counts) * 100) %>%
  ggplot(aes(x = "", y = counts, fill = tier)) +
  geom_bar(stat = "identity", color = 'black', width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(percentage, 2), "%")), position = position_stack(vjust = 0.5)) +
  theme_void() +
  labs(title = "Distribution of Member Types",
       fill = "Members Tier", 
       caption = "Data Analyst : JP")

• This pie chart shows the distribution of members across the three customer tiers.

‘life_stage’ With ‘tier’ Frequencies:

merged_data %>% 
  count(life_stage, tier) %>% 
  rename(counts = n) %>% 
  mutate(percentage  = counts / sum(counts) * 100)
##                life_stage       tier counts percentage
## 1  MIDAGE SINGLES/COUPLES     Budget   5020  1.8955342
## 2  MIDAGE SINGLES/COUPLES Mainstream  11874  4.4835802
## 3  MIDAGE SINGLES/COUPLES    Premium   8216  3.1023324
## 4            NEW FAMILIES     Budget   3005  1.1346773
## 5            NEW FAMILIES Mainstream   2325  0.8779117
## 6            NEW FAMILIES    Premium   1589  0.6000008
## 7          OLDER FAMILIES     Budget  23160  8.7451337
## 8          OLDER FAMILIES Mainstream  14244  5.3784838
## 9          OLDER FAMILIES    Premium  11190  4.2253042
## 10  OLDER SINGLES/COUPLES     Budget  18407  6.9504178
## 11  OLDER SINGLES/COUPLES Mainstream  18318  6.9168117
## 12  OLDER SINGLES/COUPLES    Premium  17753  6.7034697
## 13               RETIREES     Budget  15201  5.7398436
## 14               RETIREES Mainstream  21466  8.1054853
## 15               RETIREES    Premium  13096  4.9450031
## 16         YOUNG FAMILIES     Budget  19122  7.2203993
## 17         YOUNG FAMILIES Mainstream  12907  4.8736373
## 18         YOUNG FAMILIES    Premium  11563  4.3661477
## 19  YOUNG SINGLES/COUPLES     Budget   9242  3.4897464
## 20  YOUNG SINGLES/COUPLES Mainstream  20854  7.8743963
## 21  YOUNG SINGLES/COUPLES    Premium   6281  2.3716833

‘life_stage’ Frequency Breakdown:

merged_data %>% 
  count(life_stage, tier) %>% 
  rename(counts = n) %>% 
  mutate(percentage  = counts / sum(counts) * 100) %>% 
  ggplot(aes(life_stage, counts, fill = life_stage)) +
  geom_bar(stat = "identity")+
  theme_dark() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Life Stage",
       y = "Frquency",
       title = "Life Stage and Frquency Breakdown",
       caption = "Data Analyst : JP")

The bar chart reveals that Older Singles/Couples and Retirees have the highest frequency, while New Families are the least represented. This breakdown provides insights into the dominant life stages for targeted initiatives and product offerings.

Life Stage and Tier Frequency Breakdown:

merged_data %>%
  count(life_stage, tier) %>%
  ggplot(aes(x = tier, y = n, fill = life_stage)) +
  geom_bar(stat = "identity", color = 'black', position = "dodge") +
  theme_dark() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Tier",
       y = "Frquency",
       fill = "Life Stages",
       title = "Life Stage and Tier Frquency Breakdown",
       caption = "Data Analyst : JP")

• The bar chart highlights that Older Families dominate the Budget tier and Older Singles/Couples lead the Premium tier, while New Families have the lowest representation across all tiers. This distribution can inform targeted marketing and product development strategies.

Unique Customers and Average Transactions per Customer:

# Calculate total unique customers and average transactions per customer
customer_summary <- merged_data %>%
  group_by(card_id) %>%
  summarise(total_transactions = n()) %>%  
  summarise(total_customers = n(), 
            avg_transactions_per_customer = mean(total_transactions))

# View the result
print(customer_summary)
## # A tibble: 1 × 2
##   total_customers avg_transactions_per_customer
##             <int>                         <dbl>
## 1           72636                          3.65

Unique Customers and Average Revenue per Customer:

# Calculate total unique customers and average revenue per customer
customer_summary1 <- merged_data %>%
  group_by(card_id) %>%
  summarise(total_revenue = sum(revenue)) %>%  
  summarise(total_customers = n(), 
            avg_revenue_per_customer = mean(total_revenue))

# View the result
print(customer_summary1)
## # A tibble: 1 × 2
##   total_customers avg_revenue_per_customer
##             <int>                    <dbl>
## 1           72636                     26.6

Total Revenue Distribution Across Life Stages and Tiers:

revenue_summary <- merged_data %>%
  group_by(tier, life_stage) %>%
  summarise(
    total_revenue = sum(revenue, na.rm = TRUE),   # Total revenue
    unique_customers = n_distinct(card_id)       # Number of unique customers
  )
## `summarise()` has grouped output by 'tier'. You can override using the
## `.groups` argument.
# Plot total revenue by tier, filled by life_stage (number of unique customers)

revenue_summary %>% 
  mutate(percentage = (total_revenue / sum(total_revenue)) * 100) %>% 
  ggplot(aes(life_stage, total_revenue, fill = tier )) +
  geom_bar(stat = "identity", color = 'black', position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),  
            position = position_dodge(width = 0.9), vjust = -0.5, size = 2) +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title.y = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    plot.title = element_text(size = 16, face = "bold")
  ) +
  labs(
    x = "Life Stages",
    y = "Total Revenue",
    fill = "Tier",
    title = "Total Revenue by Tier and Life Stage (Unique Customers)",
    caption = "Data Analyst: JP"
  )

  • Recommendation: increase Premium tier customer acquisition and retention, especially in the Older Singles/Couples and Retirees segments.

Unique Customers Breakdown by Life Stage and Tier:

merged_data %>% 
  group_by(card_id) %>%              
  summarise(life_stage = first(life_stage), tier = first(tier)) %>%
  count(life_stage, tier) %>%        
  rename(counts = n) %>%              
  mutate(percentage = counts / sum(counts) * 100) %>% 
  ggplot(aes(x = life_stage, y = counts, fill = tier)) + 
  geom_bar(stat = "identity", color = 'black', position = "dodge") +     
  geom_text(aes(label = paste0(round(percentage, 1), "%")),  
            position = position_dodge(width = 0.9), vjust = -0.5, size = 2) +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold"),
        legend.title = element_text(face = "bold"),
        legend.position = "right") +
  labs(x = "Life Stage",
       y = "Frequency of Unique Customers",
       title = "Unique Customers Breakdown by Life Stage and Tier",
       fill = "Tier",  # Legend title
       caption = "Data Analyst : JP")

Bivariate Analysis

For Numerical-Numerical

merged_data %>% 
  select(where(is.numeric)) %>%
  skim_without_charts()
Data summary
Name Piped data
Number of rows 264833
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
card_id 0 1 135548.90 80580.03 1000.0 70021.0 130357.0 203094.0 2373711.0
store_id 0 1 135.08 76.78 1.0 70.0 130.0 203.0 272.0
txn_id 0 1 135157.72 78133.05 1.0 67600.0 135137.0 202700.0 2415841.0
product_id 0 1 56.58 32.83 1.0 28.0 56.0 85.0 114.0
quantity 0 1 1.91 0.34 1.0 2.0 2.0 2.0 5.0
revenue 0 1 7.30 2.53 1.5 5.4 7.4 9.2 29.5

Correlation Between Numerical Variables:

merged_data %>% 
  select_if(is.numeric) %>%
  cor() %>% 
  ggcorrplot(hc.order = TRUE, type = "lower", lab = TRUE) +
  theme_dark() +
  theme(
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold")) +
  labs(title = "Correlation between Numerical Variables",
       caption = "Data Analyst : JP")

• The only economically meaningful relationship here is between quantity and revenue; apparent correlations involving store, card, and transaction IDs reflect how those identifiers were assigned rather than customer behavior, so they should be interpreted with caution.
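
Because store, card, transaction, and product IDs are labels rather than magnitudes, a more informative check restricts the correlation to genuine measures; a sketch:

# IDs are nominal; correlate only the true numeric measures
merged_data %>% 
  select(quantity, revenue) %>% 
  cor()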

  • Scatter plots

Correlation between ‘store_id’ And ‘txn_id’:

merged_data %>%
  ggplot(aes(store_id, txn_id)) +
  geom_point(color = "red", size = 2) +
  theme_dark() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(title = "Correlation between Store IDs and Transaction IDs",
       x = "Store IDs",
       y = "Txn IDs",
       caption = "Data Analyst : JP")

Scatter plot showing the correlation between Store IDs and Transaction IDs. Note: correlation doesn’t imply causation; the relationship may be coincidental or influenced by other factors.

Correlation between ‘store_id’ And ‘card_id’:

merged_data %>%
  ggplot(aes(store_id, card_id)) +
  geom_point(color = "#f6522e", size = 2)+
  theme_dark() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(title = "Correlation between Store IDs and Card IDs",
       x = "Store IDs",
       y = "Card IDs",
       caption = "Data Analyst : JP")

Scatter plot showing the correlation between Store IDs and Card IDs. Note: correlation doesn’t imply causation; the relationship may be coincidental or influenced by other factors.

Correlation Between ‘quantity’ and ‘revenue’:

merged_data %>%
  ggplot(aes(quantity, revenue)) +
  geom_point(color = "#fb785b", size = 2) +
  theme_dark() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(title = "Correlation between Quantity and Revenue",
       x = "Quantity",
       y = "Revenue",
       caption = "Data Analyst : JP")

• This scatter plot visualizes the correlation between Quantity and Revenue.

For Numerical-Categorical

Total Quantities Sold by ‘store_id’:

merged_data %>% 
  group_by(store_id) %>% 
  summarise(total_quantity_sold = sum(quantity)) %>% 
  arrange(desc(total_quantity_sold))
## # A tibble: 272 × 2
##    store_id total_quantity_sold
##       <dbl>               <dbl>
##  1      226                4001
##  2       88                3718
##  3       93                3639
##  4      165                3602
##  5       43                3519
##  6      237                3515
##  7       40                3499
##  8      230                3476
##  9      213                3470
## 10       58                3463
## # ℹ 262 more rows

Top 10 Stores by Quantity Sold:

merged_data %>%
  group_by(store_id) %>%
  summarise(total_quantity_sold = sum(quantity)) %>%
  arrange(desc(total_quantity_sold)) %>%
  slice_max(total_quantity_sold, n = 10) %>%
  mutate(store_id = factor(store_id, levels = store_id)) %>% 
  ggplot(aes(store_id, total_quantity_sold, fill = store_id)) +
  geom_bar(stat = "identity", color = 'black') +
  theme_dark() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Store IDs",
       y = "Total Quantities Sold",
       title = "Top 10 Stores Sold The Most Quantity ",
       caption = "Data Analyst : JP")

• The chart shows wide variation in performance among the top 10 stores, with Store 226 leading with significantly more quantities sold.

Lowest-Performing Stores:

merged_data %>% 
  group_by(store_id) %>% 
  summarise(total_quantity_sold = sum(quantity)) %>% 
  arrange(total_quantity_sold)
## # A tibble: 272 × 2
##    store_id total_quantity_sold
##       <dbl>               <dbl>
##  1       11                   2
##  2       76                   2
##  3       92                   2
##  4      206                   2
##  5      211                   2
##  6      252                   2
##  7       31                   4
##  8      193                   4
##  9       85                   5
## 10      117                  47
## # ℹ 262 more rows

10 Lowest-Performing Stores:

merged_data %>% 
  group_by(store_id) %>% 
  summarise(total_quantity_sold = sum(quantity)) %>% 
  arrange(total_quantity_sold) %>%
  slice_min(total_quantity_sold, n = 10) %>%
  mutate(store_id = factor(store_id, levels = store_id)) %>% 
  ggplot(aes(store_id, total_quantity_sold, fill = store_id)) +
  geom_bar(stat = "identity", color = 'black') +
  theme_dark() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Store IDs",
       y = "Total Quantities Sold",
       title = "10 Lowest-Performing Stores",
       caption = "Data Analyst : JP") 

This bar chart illustrates the 10 stores with the lowest quantities sold. Store 117 stands out as the highest-performing store among these 10, with significantly more quantities sold than the others.

Extracted ‘brand’ Names:

merged_data <- merged_data %>% 
  mutate(brand = word(product_name, 1)) 
str(merged_data)
## 'data.frame':    264833 obs. of  12 variables:
##  $ card_id     : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ life_stage  : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
##  $ tier        : chr  "Premium" "Mainstream" "Budget" "Budget" ...
##  $ date        : Date, format: "2018-10-17" "2018-09-16" ...
##  $ store_id    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ txn_id      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_id  : num  5 58 52 106 96 86 49 10 20 51 ...
##  $ product_name: chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ quantity    : num  2 1 1 1 1 1 1 1 1 2 ...
##  $ revenue     : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
##  $ month_year  : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
##  $ brand       : chr  "Natural" "Red" "Grain" "Natural" ...
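
Taking only the first word of product_name leaves several aliases for the same brand (e.g. Dorito/Doritos, Smith/Smiths, Infzns/Infuzions, Snbts/Sunbites). A sketch of how these could be consolidated; the alias map is an assumption inferred from the extracted brand list, and it is stored in a separate column so the raw counts below are unchanged:

# Alias map is an assumption; kept as brand_clean to preserve raw values
merged_data <- merged_data %>% 
  mutate(brand_clean = recode(brand,
                              "Dorito" = "Doritos",
                              "Smith"  = "Smiths",
                              "Infzns" = "Infuzions",
                              "Snbts"  = "Sunbites",
                              "Red"    = "RRD",
                              "WW"     = "Woolworths"))

Counts for aliased brands in the tables that follow should therefore be read together.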

Top 5 Best-Selling Brands: Monthly Trends:

merged_data %>%
  group_by(month_year, brand) %>%
  summarise(total_quantity_sold = sum(quantity), .groups = 'drop') %>%
  arrange(month_year, desc(total_quantity_sold)) %>%
  group_by(month_year) %>%
  slice_head(n = 5) %>% 
  ggplot(aes(x = reorder(brand, -total_quantity_sold), y = total_quantity_sold, fill = brand)) +
  geom_bar(stat = "identity", color = 'black') +
  facet_wrap(~ month_year) +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Brand Name", y = "Total Quantity Sold", title = "Top 5 Products Sold Per Month over the Year",
       caption = "Data Analyst : JP")

Kettle leads in sales, followed by Doritos and Pringles with slight fluctuations. Smiths and Thins have lower sales, with overall minor variations across brands.

Time Series Analysis of Top 5 Product Sales:

merged_data %>%
  group_by(month_year, brand) %>%
  summarise(total_quantity_sold = sum(quantity), .groups = 'drop') %>%
  arrange(month_year, desc(total_quantity_sold)) %>%
  group_by(month_year) %>%
  slice_head(n = 5) %>% 
  ggplot(aes(month_year, total_quantity_sold, color = brand, group = brand)) +
  geom_line() +
  geom_point(color = 'black') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Month-Year", y = "Total Quantity Sold", title = "Trend In Top 5 Products Sold Per Month over Year",
       caption = "Data Analyst : JP")

• Line chart showing sales trends of the top 5 snack brands (Doritos, Kettle, Pringles, Smiths, Thins) from July 2018 to June 2019. Doritos and Kettle lead, Thins lags, and total sales fluctuate, suggesting seasonal patterns.

‘revenue’ Distribution Across ‘life_stage’:

merged_data %>%
  ggplot(aes(life_stage, revenue, fill = life_stage)) +
  geom_boxplot() +
  geom_hline(yintercept = mean(merged_data$revenue), color = "purple", linetype = 2, linewidth = 1) +
  stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 2)), size = 4, vjust = -0.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold")) +
  labs(
    x = "Life Stage",
    y = "Revenue",
    title = "Revenue Distribution Across Life Stage",
    caption = "Data Analyst : JP"
  )

• Revenue spread varies across life stages, providing useful data for targeted marketing strategies.

Monthly Revenue Performance Over the Year:

merged_data %>%
  ggplot(aes(month_year, revenue, fill = month_year)) +
  geom_boxplot() +
  geom_hline(yintercept = mean(merged_data$revenue), color = "purple", linetype = 2, linewidth = 1) +
  stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 2)), size = 4, vjust = -0.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold")) +
  labs(
    x = "Month-Year",
    y = "Revenue",
    title = "Monthly Revenue Performance Over the Year",
    caption = "Data Analyst : JP"
  )

• The box plot shows monthly revenue distribution. The median revenue varies across months, with wider boxes indicating greater revenue variability.

Chi-Square Test

For Categorical-Categorical

Contingency Table For ‘life_stage’ And ‘tier’:

lt_table <- table(merged_data$tier, merged_data$life_stage)
print(lt_table)
##             
##              MIDAGE SINGLES/COUPLES NEW FAMILIES OLDER FAMILIES
##   Budget                       5020         3005          23160
##   Mainstream                  11874         2325          14244
##   Premium                      8216         1589          11190
##             
##              OLDER SINGLES/COUPLES RETIREES YOUNG FAMILIES
##   Budget                     18407    15201          19122
##   Mainstream                 18318    21466          12907
##   Premium                    17753    13096          11563
##             
##              YOUNG SINGLES/COUPLES
##   Budget                      9242
##   Mainstream                 20854
##   Premium                     6281

Chi-Square Test For ‘life_stage’ And ‘tier’:

print(chisq.test(lt_table))
## 
##  Pearson's Chi-squared test
## 
## data:  lt_table
## X-squared = 15226, df = 12, p-value < 2.2e-16

• The Chi-Square test indicates a statistically significant association between life stage and tier (p < 2.2e-16).
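
With roughly 265,000 transactions, almost any association reaches p < 0.05, so an effect size is the more informative number; a sketch using rstatix (already loaded), equally applicable to the other contingency tables below:

# Cramér's V: values near 0 indicate a weak association, near 1 a strong one
cramer_v(lt_table)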

Heatmap For ‘life_stage’ And ‘tier’:

lt_table <- data.frame(lt_table)

lt_table %>% 
ggplot(aes(Var1, Var2, fill = Freq)) + 
  geom_tile(color = 'black') + 
  scale_fill_gradient(low = "white", high = "darkblue") +
  theme_minimal() +
  theme(
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(
    x = "Tier",
    y = "Life Stage",
    fill = "Frequency",
    title = "Customer Segmentation Heatmap",
    caption = "Data Analyst : JP"
  )

• The heatmap visualizes the customer distribution across life stages and tiers; color intensity indicates customer frequency.

Contingency Table For ‘life_stage’ And ‘brand’:

lb_table <- table(merged_data$life_stage, merged_data$brand)
print(lb_table)
##                         
##                          Burger  CCs Cheetos Cheezels Cobs Dorito Doritos
##   MIDAGE SINGLES/COUPLES    152  433     265      443  961    300    2361
##   NEW FAMILIES               40   93      62      129  288     91     693
##   OLDER FAMILIES            353  941     615      813 1624    533    4332
##   OLDER SINGLES/COUPLES     292  851     580      966 2036    684    5236
##   RETIREES                  256  741     491      867 1884    619    4852
##   YOUNG FAMILIES            293  898     550      771 1504    484    3950
##   YOUNG SINGLES/COUPLES     178  594     364      614 1396    472    3538
##                         
##                          French Grain GrnWves Infuzions Infzns Kettle Natural
##   MIDAGE SINGLES/COUPLES    117   613     119      1059    344   4055     550
##   NEW FAMILIES               32   183      30       300     93   1171     132
##   OLDER FAMILIES            283  1090     339      1926    570   6851    1254
##   OLDER SINGLES/COUPLES     286  1311     292      2327    635   8847    1198
##   RETIREES                  228  1214     252      2107    612   8194    1040
##   YOUNG FAMILIES            278   948     273      1775    440   6277    1107
##   YOUNG SINGLES/COUPLES     194   913     163      1563    450   5893     769
##                         
##                           NCC  Old Pringles  Red  RRD Smith Smiths Snbts
##   MIDAGE SINGLES/COUPLES  144  915     2389  535 1079   267   2689   132
##   NEW FAMILIES             26  245      698  137  251    78    681    31
##   OLDER FAMILIES          317 1589     4244 1270 2625   642   5801   337
##   OLDER SINGLES/COUPLES   268 1945     5307 1149 2175   557   5762   310
##   RETIREES                246 1839     4951 1037 1943   518   5095   291
##   YOUNG FAMILIES          260 1467     3829 1052 2344   563   5110   295
##   YOUNG SINGLES/COUPLES   158 1324     3684  705 1477   338   3721   180
##                         
##                          Sunbites Thins Tostitos Twisties Tyrrells Woolworths
##   MIDAGE SINGLES/COUPLES      117  1316      924      931      611        382
##   NEW FAMILIES                 30   378      277      233      187         90
##   OLDER FAMILIES              285  2475     1546     1644     1093        939
##   OLDER SINGLES/COUPLES       274  2969     2039     1949     1340        858
##   RETIREES                    244  2792     1850     1890     1259        741
##   YOUNG FAMILIES              301  2186     1467     1412      997        851
##   YOUNG SINGLES/COUPLES       181  1959     1368     1395      955        576
##                         
##                            WW
##   MIDAGE SINGLES/COUPLES  907
##   NEW FAMILIES            240
##   OLDER FAMILIES         2263
##   OLDER SINGLES/COUPLES  2035
##   RETIREES               1710
##   YOUNG FAMILIES         1910
##   YOUNG SINGLES/COUPLES  1255

Chi-Square Test For ‘life_stage’ And ‘brand’:

print(chisq.test(lb_table))
## 
##  Pearson's Chi-squared test
## 
## data:  lb_table
## X-squared = 1574.3, df = 168, p-value < 2.2e-16

• The Chi-Square test indicates a statistically significant association between life stage and brand (p < 2.2e-16).

Heatmap For ‘life_stage’ And ‘brand’:

lb_table <- data.frame(lb_table)

lb_table %>% 
ggplot(aes(Var2, Var1, fill = Freq)) + 
  geom_tile(color = 'black') + 
  scale_fill_gradient(low = "white", high = "darkblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(
    x = "Brands",
    y = "Life Stage",
    fill = "Frequency",
    title = "Brand Segmentation by Life Stage",
    caption = "Data Analyst : JP"
  )

• The heatmap visualizes the brand distribution across life stages; color intensity indicates purchase frequency.

Contingency Table For ‘tier’ and ‘brand’:

tb_table <- table(merged_data$tier, merged_data$brand)
print(tb_table)
##             
##              Burger   CCs Cheetos Cheezels  Cobs Dorito Doritos French Grain
##   Budget        579  1679    1051     1626  3274   1056    8762    539  2114
##   Mainstream    548  1631    1111     1735  3889   1253    9939    507  2516
##   Premium       437  1241     765     1242  2530    874    6261    372  1642
##             
##              GrnWves Infuzions Infzns Kettle Natural   NCC   Old Pringles   Red
##   Budget         542      3855   1067  14154    2246   539  3203     8620  2109
##   Mainstream     521      4290   1260  16423    2162   495  3725     9903  2156
##   Premium        405      2912    817  10711    1642   385  2396     6579  1620
##             
##                RRD Smith Smiths Snbts Sunbites Thins Tostitos Twisties Tyrrells
##   Budget      4371  1118  10430   610      536  4931     3236     3229     2195
##   Mainstream  4306  1055  10787   544      498  5436     3737     3785     2583
##   Premium     3217   790   7642   422      398  3708     2498     2440     1664
##             
##              Woolworths    WW
##   Budget           1605  3881
##   Mainstream       1607  3586
##   Premium          1225  2853

Chi-Square Test For ‘tier’ and ‘brand’:

print(chisq.test(tb_table))
## 
##  Pearson's Chi-squared test
## 
## data:  tb_table
## X-squared = 348.61, df = 56, p-value < 2.2e-16

• Pearson’s Chi-squared test indicates a statistically significant association between tier and brand (p < 2.2e-16).

Heatmap For ‘tier’ and ‘brand’:

tb_table <- data.frame(tb_table)

tb_table %>% 
ggplot(aes(Var1, Var2, fill = Freq)) + 
  geom_tile(color = 'black') + 
  scale_fill_gradient(low = "white", high = "darkblue") +
  theme_minimal() +
  theme(
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(
    x = "Tier",
    y = "Brands",
    fill = "Frequency",
    title = "Brand Segmentation by Tier",
    caption = "Data Analyst : JP"
  )

• The heatmap visualizes the brand distribution across tiers; color intensity indicates purchase frequency.

Forecasting

Projected Revenue for the Next 6 Months (180 Days):

zoo_data <- merged_data %>% 
  group_by(date) %>% 
  summarise(total_revenue = sum(revenue))
  

zoo_data <- zoo(zoo_data$total_revenue, order.by = as.Date(zoo_data$date))
fit <- auto.arima(zoo_data)

forecast(fit, h = 180) %>%
  plot(xlab = "Days", ylab = "Total Revenue",
       main = "Projected Revenue for the Next 6 Months (180 Days)")

• The plot shows projected revenue for the next 6 months, with the predicted trend line and confidence interval. It suggests stable revenue with some fluctuations.
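
In-sample fit can flatter a forecast; a minimal holdout check, assuming the last 30 days are reserved for validation:

# Refit on all but the last 30 days, then score against the holdout
n <- length(zoo_data)
fit_cv <- auto.arima(head(zoo_data, n - 30))
accuracy(forecast(fit_cv, h = 30), as.numeric(tail(zoo_data, 30)))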

Assessed Model Fit:

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,0) with non-zero mean
## Q* = 6.5198, df = 8, p-value = 0.5892
## 
## Model df: 2.   Total lags used: 10

• The residuals are randomly distributed with no obvious patterns. The histogram is approximately normal, and the Ljung-Box test (Q* = 6.52, p = 0.5892) shows no significant residual autocorrelation. The ARIMA(2,0,0) model fits the data well.

Model Summary for Forecasting:

summary(fit)
## Series: zoo_data 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2       mean
##       0.4304  0.2417  5314.2705
## s.e.  0.0507  0.0506    41.6029
## 
## sigma^2 = 69528:  log likelihood = -2544.56
## AIC=5097.12   AICc=5097.23   BIC=5112.72
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
## Training set 0.3734051 262.5937 203.2132 -0.24488 3.860994 0.8611807 -0.0114896

• The projected revenue for the next six months is stable, converging toward the series mean of roughly 5,314 per day.

• The confidence interval is relatively narrow, suggesting reasonably precise point forecasts.

Extracted Pack Size from Product Name:

merged_data <- merged_data %>% 
  mutate(pack_size = parse_number(product_name))

str(merged_data)
## 'data.frame':    264833 obs. of  13 variables:
##  $ card_id     : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ life_stage  : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
##  $ tier        : chr  "Premium" "Mainstream" "Budget" "Budget" ...
##  $ date        : Date, format: "2018-10-17" "2018-09-16" ...
##  $ store_id    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ txn_id      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_id  : num  5 58 52 106 96 86 49 10 20 51 ...
##  $ product_name: chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ quantity    : num  2 1 1 1 1 1 1 1 1 2 ...
##  $ revenue     : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
##  $ month_year  : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
##  $ brand       : chr  "Natural" "Red" "Grain" "Natural" ...
##  $ pack_size   : num  175 150 210 175 160 165 110 150 330 170 ...

Total Revenue by Pack Size:

merged_data %>%
  group_by(pack_size) %>%
  summarise(total_revenue = sum(revenue)) %>%
  arrange(desc(total_revenue)) %>%
  ggplot(aes(reorder(factor(pack_size), total_revenue), y = total_revenue, fill = pack_size)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::comma) +  # Remove scientific notation
  theme_dark() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Pack Size",
       y = "Total Revenue",
       title = "Total Revenue by Pack Size (Ordered by Revenue)",
       caption = "Data Analyst : JP")

• Pack sizes 175 and 150 yield the highest revenue, while sizes 125 and 220 generate the lowest, showing significant revenue variation across pack sizes.

Machine Learning

Transformed Categorical Variables for Model Fitting:

model_data <- merged_data %>% 
  mutate(pack_size = as.factor(pack_size),
         store_id = as.factor(store_id),
         brand = as.factor(brand),
         month_year = as.factor(month_year),
         product_id = as.factor(product_id),
         life_stage = as.factor(life_stage),
         tier = as.factor(tier))

str(model_data)
## 'data.frame':    264833 obs. of  13 variables:
##  $ card_id     : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ life_stage  : Factor w/ 7 levels "MIDAGE SINGLES/COUPLES",..: 7 7 6 6 4 1 7 7 2 7 ...
##  $ tier        : Factor w/ 3 levels "Budget","Mainstream",..: 3 2 1 1 2 2 1 1 3 2 ...
##  $ date        : Date, format: "2018-10-17" "2018-09-16" ...
##  $ store_id    : Factor w/ 272 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ txn_id      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_id  : Factor w/ 114 levels "1","2","3","4",..: 5 58 52 106 96 86 49 10 20 51 ...
##  $ product_name: chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ quantity    : num  2 1 1 1 1 1 1 1 1 2 ...
##  $ revenue     : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
##  $ month_year  : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
##  $ brand       : Factor w/ 29 levels "Burger","CCs",..: 14 18 9 14 29 3 11 19 7 7 ...
##  $ pack_size   : Factor w/ 21 levels "70","90","110",..: 11 7 15 11 8 9 3 7 20 10 ...

Data Preparation and Train/Test Splitting:

model_data1 <- merged_data %>% 
  mutate(pack_size = as.factor(pack_size),
         store_id = as.factor(store_id),
         brand = as.factor(brand),
         month_year = as.factor(month_year),
         product_id = as.factor(product_id),
         life_stage = as.factor(life_stage),
         tier = as.factor(tier))


set.seed(123)
sample_data1 <- sample(nrow(model_data1), size = floor(0.75 * nrow(model_data1)))
train_data1 <- model_data1[sample_data1, ]
test_data1 <- model_data1[-sample_data1, ]

Linear Regression Model Training and Evaluation:

# Trained the linear regression model on the train_data:

lm_model <- lm(revenue ~ (brand + quantity) ^ 2 + pack_size, data = train_data1)

# Made predictions on the test_data:

predictions <- predict(lm_model, newdata = test_data1)

# Calculated validation metrics on the test_data:
actuals1 <- test_data1$revenue
residuals1 <- actuals1 - predictions

# Calculated R-squared and RMSE:
rsquared1 <- 1 - sum(residuals1^2) / sum((actuals1 - mean(actuals1))^2)
rmse1 <- sqrt(mean(residuals1^2))

# Print results:

print(paste("R-squared:", rsquared1))
## [1] "R-squared: 0.979241536717138"
print(paste("RMSE:", rmse1))
## [1] "RMSE: 0.363925484510656"

• The Linear Regression model achieved an R-squared of 0.979 on the test set, explaining 97.9% of the variance in revenue, with a low RMSE of 0.364. Such a high figure is expected, since revenue is nearly a deterministic function of brand, pack size, and quantity, but it confirms the features were well chosen.
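
A single random split can be lucky; k-fold cross-validation with caret (already loaded) averages the metrics over several folds. A sketch using the same formula:

set.seed(123)
# 5-fold CV on the training data; reports fold-averaged RMSE and R-squared
cv_model <- train(revenue ~ (brand + quantity)^2 + pack_size,
                  data = train_data1,
                  method = "lm",
                  trControl = trainControl(method = "cv", number = 5))
print(cv_model$results)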

Linear Regression Model Summary:

# Tidy coefficients
tidy_summary1 <- tidy(lm_model)
print(tidy_summary1)
## # A tibble: 78 × 5
##    term           estimate std.error statistic   p.value
##    <chr>             <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)    1.28e-11    0.0503  2.55e-10 1.00     
##  2 brandCCs       5.18e+ 0    0.0602  8.61e+ 1 0        
##  3 brandCheetos   6.45e+ 0    0.0647  9.97e+ 1 0        
##  4 brandCheezels  3.58e+ 0    0.0612  5.85e+ 1 0        
##  5 brandCobs     -2.66e+ 0    0.0576 -4.61e+ 1 0        
##  6 brandDorito    2.96e- 1    0.0682  4.34e+ 0 0.0000141
##  7 brandDoritos   5.60e+ 0    0.0532  1.05e+ 2 0        
##  8 brandFrench    5.18e+ 0    0.0797  6.50e+ 1 0        
##  9 brandGrain     1.36e- 1    0.0590  2.31e+ 0 0.0209   
## 10 brandGrnWves  -1.63e-11    0.0752 -2.16e-10 1.00     
## # ℹ 68 more rows
glance_summary1 <- glance(lm_model)
print(glance_summary1)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik     AIC     BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1     0.980         0.980 0.361   136048.       0    70 -79588. 159320. 160055.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
anova(lm_model)
## Analysis of Variance Table
## 
## Response: revenue
##                    Df Sum Sq Mean Sq   F value    Pr(>F)    
## brand              28 648431   23158  177412.0 < 2.2e-16 ***
## quantity            1 333729  333729 2556652.3 < 2.2e-16 ***
## pack_size          13 240365   18490  141646.3 < 2.2e-16 ***
## brand:quantity     28  20598     736    5635.7 < 2.2e-16 ***
## Residuals      198553  25918       0                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

• These results demonstrate the strong predictive power of the selected features and the model's effectiveness in uncovering revenue drivers.

Variable Importance in Linear Regression Model:

# varImp() comes from the caret package (loaded above)
top_vars1 <- varImp(lm_model)
tidy_vars1 <- top_vars1 %>% head(20)
print(tidy_vars1)
##                     Overall
## brandCCs       8.609736e+01
## brandCheetos   9.971482e+01
## brandCheezels  5.847882e+01
## brandCobs      4.612246e+01
## brandDorito    4.342043e+00
## brandDoritos   1.053334e+02
## brandFrench    6.504406e+01
## brandGrain     2.310227e+00
## brandGrnWves   2.162047e-10
## brandInfuzions 3.792556e+01
## brandInfzns    3.905534e+01
## brandKettle    1.145380e+02
## brandNatural   8.937486e+01
## brandNCC       6.821387e+01
## brandOld       1.444717e+02
## brandPringles  5.680685e-01
## brandRed       1.102868e+02
## brandRRD       1.146858e+02
## brandSmith     9.967471e+01
## brandSmiths    7.650353e+01

• The variable importance table (top 20 terms shown) provides a comprehensive view of how the individual brand, pack-size, and interaction terms influence revenue.

Residual Analysis:

plot(lm_model)

• The model fits reasonably well, with residuals showing a largely random scatter. However, minor issues with variance and normality suggest exploring transformations or advanced models like Random Forest for improved accuracy.
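
• As a sketch of one such transformation (assuming all revenue values are positive, as they appear to be here), the model can be refit on the log scale and scored back on the original scale:

# Illustrative: refit with a log-transformed response to stabilize variance
lm_log <- lm(log(revenue) ~ (brand + quantity) ^ 2 + pack_size, data = train_data1)

# Back-transform predictions and compute RMSE on the revenue scale
log_preds <- exp(predict(lm_log, newdata = test_data1))
sqrt(mean((test_data1$revenue - log_preds)^2))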


ANOVA and Post Hoc Analysis (Tukey Test):

model_data2 <- merged_data %>% 
  mutate(pack_size = as.factor(pack_size),
         store_id = as.factor(store_id),
         brand = as.factor(brand),
         month_year = as.factor(month_year),
         product_id = as.factor(product_id),
         life_stage = as.factor(life_stage),
         tier = as.factor(tier),
         quantity = as.factor(quantity))



aov_model1 <- aov(revenue ~ brand + quantity + pack_size, data = model_data2)


tukey_summary <- tidy(TukeyHSD(aov_model1))
print(tukey_summary)
## # A tibble: 626 × 7
##    term  contrast         null.value estimate conf.low conf.high adj.p.value
##    <chr> <chr>                 <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
##  1 brand CCs-Burger                0   -0.395   -0.448    -0.342           0
##  2 brand Cheetos-Burger            0    1.40     1.34      1.46            0
##  3 brand Cheezels-Burger           0    4.33     4.28      4.38            0
##  4 brand Cobs-Burger               0    2.91     2.86      2.96            0
##  5 brand Dorito-Burger             0    7.90     7.85      7.96            0
##  6 brand Doritos-Burger            0    3.71     3.66      3.75            0
##  7 brand French-Burger             0    1.22     1.16      1.29            0
##  8 brand Grain-Burger              0    2.50     2.45      2.55            0
##  9 brand GrnWves-Burger            0    1.47     1.40      1.53            0
## 10 brand Infuzions-Burger          0    2.53     2.48      2.58            0
## # ℹ 616 more rows

TukeyHSD Pairwise Comparisons:

tukey_results <- TukeyHSD(aov_model1)
plot(tukey_results, las = 1, col = "blue", cex.axis = 0.8)

• Pack Size: Not all pack sizes have the same impact on revenue.

• Quantity: The effect of quantity on revenue varies across its levels.

• Brand: Different brands exhibit varying levels of influence on revenue.
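
• With 626 pairwise contrasts, it can be more informative to isolate the pairs that are not significantly different, i.e. levels with statistically indistinguishable revenue. A short sketch using the tidied results above:

# Contrasts whose revenue means are statistically indistinguishable (alpha = 0.05)
tukey_summary %>%
  filter(adj.p.value > 0.05) %>%
  arrange(term, adj.p.value)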

Data Splitting: Training and Testing Sets:

set.seed(123)
sample_data <- sample(nrow(model_data), size = floor(0.75 * nrow(model_data)))
train_data <- model_data[sample_data, ]
test_data <- model_data[-sample_data, ]

Random Forest Model Training and Performance Evaluation:

library(randomForest)  # randomForest() for the ensemble model

# Trained Random Forest model:

rf_model <- randomForest(revenue ~ brand + pack_size + quantity,
                         data = train_data, ntree = 100, importance = TRUE)


# Made predictions on the test_data:

rf_predictions <- predict(rf_model, newdata = test_data)

# Calculated validation metrics on the test_data:

actuals <- test_data$revenue
residuals <- actuals - rf_predictions

# Calculated R-squared and RMSE:
rsquared_rf <- 1 - sum(residuals^2) / sum((actuals - mean(actuals))^2)
rmse_rf <- sqrt(mean(residuals^2))

# Print results
cat("Random Forest R-squared:", rsquared_rf, "\n")
## Random Forest R-squared: 0.9790648
cat("Random Forest RMSE:", rmse_rf, "\n")
## Random Forest RMSE: 0.3654712

• The Random Forest model achieved an R-squared of 0.979, explaining 97.9% of the variance in revenue, with a low RMSE of 0.365, demonstrating accurate and reliable predictions.

Random Forest Model Summary:

print(rf_model)
## 
## Call:
##  randomForest(formula = revenue ~ brand + pack_size + quantity,      data = train_data, ntree = 100, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.1505496
##                     % Var explained: 97.64

• The random forest model with 100 trees explains 97.64% of the variance in revenue, with a mean squared residual of 0.15.
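
• To confirm that 100 trees are sufficient, the out-of-bag error can be plotted against the number of trees; a curve that has flattened indicates little gain from adding more. A quick sketch:

# OOB mean squared error as trees are added (a flat tail suggests 100 trees suffice)
plot(rf_model, main = "Random Forest OOB Error vs. Number of Trees")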

Feature Importance Analysis:

importance(rf_model)
##            %IncMSE IncNodePurity
## brand     45.34634      529829.2
## pack_size 30.25738      294077.5
## quantity  84.35584      322880.8

• Feature importance analysis identified quantity as the most significant predictor (%IncMSE: 84.36), followed by brand (45.35) and pack_size (30.26). These findings underscore the critical impact of quantity on revenue predictions.

Variable Importance Analysis with Random Forest:

varImp(rf_model)
##            Overall
## brand     45.34634
## pack_size 30.25738
## quantity  84.35584

• varImp() confirms the same ranking: quantity is the strongest driver of revenue, followed by brand and pack_size.

Visualizing Feature Importance with Random Forest:

varImpPlot(rf_model, main = "Random Forest Feature Importance")

• The plot confirms that quantity is the most important predictor, followed by brand and then pack_size.

Created a Data Frame for Actuals and Predictions:

results <- data.frame(
  Actuals = actuals,
  Predictions = rf_predictions
)

Scatterplot for Actuals vs Predictions:

results %>% 
  ggplot(aes(Actuals, Predictions)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = "Actual Revenue",
       y = "Predicted Revenue",
       title = "Actual vs. Predicted Revenue",
       caption = "Data Analyst : JP") +
  annotate(
    "text", x = min(actuals), y = max(rf_predictions),
    label = paste("R-squared:", round(rsquared_rf, 2), "\nRMSE:", round(rmse_rf, 2)),
    hjust = 0, vjust = 1, size = 4, color = "darkgreen"
  )

Model Performance as Data Frame:

model_performance <- data.frame(
  Model = c("Random Forest", "Linear Regression"),
  R_squared = c(rsquared_rf, rsquared1),
  RMSE = c(rmse_rf, rmse1))

Bar Chart for R-squared Comparison:

model_performance %>% 
  ggplot(aes(Model, R_squared, fill = Model)) +
  geom_bar(stat = "identity", color = "black") +
  theme_minimal() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold"),
        legend.position = "none") +
  labs(x = "Model",
       y = "R-squared",
       title = "R-squared Comparison Between Models",
       caption = "Data Analyst : JP")

Bar Chart for RMSE Comparison:

model_performance %>% 
  ggplot(aes(Model, RMSE, fill = Model)) +
  geom_bar(stat = "identity", color = "black") +
  theme_minimal() +
  theme(axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        plot.title = element_text(size = 16, face = "bold"),
        legend.position = "none") +
  labs(x = "Model",
       y = "RMSE",
       title = "RMSE Comparison Between Models",
       caption = "Data Analyst : JP")

Saving The Best Model:

• Both models performed almost identically on the test set (R-squared of 0.979 each); the Random Forest model was saved for downstream use because it captures non-linearities and interactions without manual feature engineering.

saveRDS(rf_model, file = "best_rf_model.rds")
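
• For later reuse (e.g., scoring new transactions), the saved model can be reloaded with readRDS(); a quick round-trip check:

# Reload the saved model and confirm it predicts as before
best_model <- readRDS("best_rf_model.rds")
head(predict(best_model, newdata = test_data))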


Trial Store Evaluation

Next, we are tasked with evaluating the performance of a trial conducted across three stores (Store 77, Store 86, and Store 88) to assess whether the trial had a significant impact on sales and customer behavior. Specifically, we aim to analyze whether the changes observed in the trial stores can be attributed to factors such as increased customer traffic, higher transaction frequency, or other underlying causes.

# Load data
qvi <- read.csv("QVI_data.csv")

# Check structure of data
str(qvi)
## 'data.frame':    264834 obs. of  12 variables:
##  $ LYLTY_CARD_NBR  : int  1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
##  $ DATE            : chr  "2018-10-17" "2018-09-16" "2019-03-07" "2019-03-08" ...
##  $ STORE_NBR       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ TXN_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PROD_NBR        : int  5 58 52 106 96 86 49 10 20 51 ...
##  $ PROD_NAME       : chr  "Natural Chip        Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour    Cream&Chives 210G" "Natural ChipCo      Hony Soy Chckn175g" ...
##  $ PROD_QTY        : int  2 1 1 1 1 1 1 1 1 2 ...
##  $ TOT_SALES       : num  6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
##  $ PACK_SIZE       : int  175 150 210 175 160 165 110 150 330 170 ...
##  $ BRAND           : chr  "NATURAL" "RRD" "GRNWVES" "NATURAL" ...
##  $ LIFESTAGE       : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
##  $ PREMIUM_CUSTOMER: chr  "Premium" "Mainstream" "Budget" "Budget" ...
# Remove duplicates if any
qvi <- qvi %>%
  distinct()

# Convert DATE to Date format if it's not already
qvi$DATE <- as.Date(qvi$DATE)

# Check for duplicates again and print them
qvi %>% 
  filter(duplicated(.)) %>% 
  print()
##  [1] LYLTY_CARD_NBR   DATE             STORE_NBR        TXN_ID          
##  [5] PROD_NBR         PROD_NAME        PROD_QTY         TOT_SALES       
##  [9] PACK_SIZE        BRAND            LIFESTAGE        PREMIUM_CUSTOMER
## <0 rows> (or 0-length row.names)
# Define trial store IDs
trial_stores <- c(77, 86, 88)

# Define the trial period
trial_start <- as.Date("2018-07-01")
trial_end <- as.Date("2019-06-30")

# Step 1: Daily sales, transaction counts, and average sales per transaction
# (a proxy for spend per customer) for the trial stores
trial_sales <- qvi %>%
  filter(STORE_NBR %in% trial_stores, DATE >= trial_start & DATE <= trial_end) %>%
  group_by(STORE_NBR, DATE) %>%
  summarise(
    trial_sales = sum(TOT_SALES),
    trial_txn_count = n(),
    avg_txn_per_customer = trial_sales / trial_txn_count,
    .groups = "drop"
  )

# Step 2: The same daily metrics for all non-trial stores (control candidates)
control_candidates <- qvi %>%
  filter(!STORE_NBR %in% trial_stores, DATE >= trial_start & DATE <= trial_end) %>%
  group_by(STORE_NBR, DATE) %>%
  summarise(
    store_sales = sum(TOT_SALES),
    txn_count = n(),
    avg_txn_per_customer = store_sales / txn_count,
    .groups = "drop"
  )

# Step 3: Identify the best control store for each trial store using correlation
find_best_control <- function(trial_store_id) {
  # Extract sales data for the selected trial store
  trial_store_sales <- trial_sales %>%
    filter(STORE_NBR == trial_store_id) %>%
    select(DATE, trial_sales)
  
  # Compare each candidate's daily sales with the trial store's via correlation
  correlations <- control_candidates %>%
    left_join(trial_store_sales, by = "DATE") %>%
    group_by(STORE_NBR) %>%
    summarise(
      correlation = {
        complete <- !is.na(store_sales) & !is.na(trial_sales)
        # Correlation is undefined with fewer than two overlapping days
        # or zero variance in either series
        if (sum(complete) > 1 &&
            sd(store_sales, na.rm = TRUE) > 0 &&
            sd(trial_sales, na.rm = TRUE) > 0) {
          cor(store_sales, trial_sales, use = "complete.obs")
        } else {
          NA_real_
        }
      },
      .groups = "drop"
    ) %>%
    arrange(desc(correlation))
  
  # Return the best matching control store, excluding any with NA correlation
  return(correlations %>% filter(!is.na(correlation)) %>% slice(1))
}

# Find the best control stores for trial stores 77, 86, and 88
best_control_77 <- find_best_control(77)
best_control_86 <- find_best_control(86)
best_control_88 <- find_best_control(88)

# Display the best control stores for each trial store
best_control_77
## # A tibble: 1 × 2
##   STORE_NBR correlation
##       <int>       <dbl>
## 1       193       0.374
best_control_86
## # A tibble: 1 × 2
##   STORE_NBR correlation
##       <int>       <dbl>
## 1        85           1
best_control_88
## # A tibble: 1 × 2
##   STORE_NBR correlation
##       <int>       <dbl>
## 1       204       0.191
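
• Correlation captures co-movement but not scale, so a control store can track the trial store's pattern while selling far more or less overall. As a complementary check, control sales can be scaled to the trial store's level before comparison; a sketch for Store 77 (variable names are illustrative):

# Scale the matched control store's sales to Store 77's overall level
control_77_sales <- control_candidates %>%
  filter(STORE_NBR == best_control_77$STORE_NBR)

scaling_factor_77 <- sum(trial_sales$trial_sales[trial_sales$STORE_NBR == 77]) /
  sum(control_77_sales$store_sales)

scaled_control_77 <- control_77_sales %>%
  mutate(scaled_sales = store_sales * scaling_factor_77)
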
# Step 4: Combine the trial store data with its best control store data
combined_data_77 <- qvi %>%
  filter(STORE_NBR %in% c(77, best_control_77$STORE_NBR)) %>%
  filter(as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end)

combined_data_86 <- qvi %>%
  filter(STORE_NBR %in% c(86, best_control_86$STORE_NBR)) %>%
  filter(as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end)

combined_data_88 <- qvi %>%
  filter(STORE_NBR %in% c(88, best_control_88$STORE_NBR)) %>%
  filter(as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end)

# Step 5: Perform the t-test for each trial store against its best control store
t_test_results <- function(trial_data) {
  t_test_result <- t.test(TOT_SALES ~ STORE_NBR, data = trial_data)
  return(t_test_result$p.value)
}

# Get t-test results for each trial store
p_value_77 <- t_test_results(combined_data_77)
p_value_86 <- t_test_results(combined_data_86)
p_value_88 <- t_test_results(combined_data_88)

# Display p-values for each trial store
p_value_77
## [1] 0.6068824
p_value_86
## [1] 0.3906579
p_value_88
## [1] 5.496461e-09
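
• At the 5% level, only Store 88 differs significantly from its matched control (p ≈ 5.5e-09); Stores 77 and 86 show no significant difference (p ≈ 0.61 and 0.39). Because transaction-level sales are right-skewed, a non-parametric test is a sensible complement to the t-test; a sketch using the Wilcoxon rank-sum test:

# Illustrative: non-parametric robustness check on the significant store
wilcox.test(TOT_SALES ~ STORE_NBR, data = combined_data_88)$p.value
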
# Aggregate trial_sales and control_candidates by STORE_NBR and DATE
# (both are already daily, so this is a safeguard against duplicate rows)
trial_sales_agg <- trial_sales %>%
  group_by(STORE_NBR, DATE) %>%
  summarise(
    trial_sales = sum(trial_sales),
    trial_txn_count = sum(trial_txn_count),
    avg_txn_per_customer_trial = mean(avg_txn_per_customer),
    .groups = "drop"
  )

control_candidates_agg <- control_candidates %>%
  group_by(STORE_NBR, DATE) %>%
  summarise(
    store_sales = sum(store_sales),
    txn_count = sum(txn_count),
    avg_txn_per_customer_control = mean(avg_txn_per_customer),
    .groups = "drop"
  )


# Step 6: Sales change calculation (including a correction for avg_txn_per_customer)
sales_change <- trial_sales_agg %>%
  left_join(control_candidates_agg, by = "DATE", suffix = c("_trial", "_control")) %>%
  # Joining on DATE alone pairs each trial-store day with every control store's
  # same-day figures, which is why trial rows repeat in the preview below
  mutate(
    change_in_customers = trial_txn_count - txn_count,
    change_in_avg_txn_per_customer = avg_txn_per_customer_trial - avg_txn_per_customer_control
  )

# Display sales change for validation
head(sales_change)
## # A tibble: 6 × 11
##   STORE_NBR_trial DATE       trial_sales trial_txn_count avg_txn_per_customer_…¹
##             <int> <date>           <dbl>           <int>                   <dbl>
## 1              77 2018-07-01        15.6               2                     7.8
## 2              77 2018-07-01        15.6               2                     7.8
## 3              77 2018-07-01        15.6               2                     7.8
## 4              77 2018-07-01        15.6               2                     7.8
## 5              77 2018-07-01        15.6               2                     7.8
## 6              77 2018-07-01        15.6               2                     7.8
## # ℹ abbreviated name: ¹​avg_txn_per_customer_trial
## # ℹ 6 more variables: STORE_NBR_control <int>, store_sales <dbl>,
## #   txn_count <int>, avg_txn_per_customer_control <dbl>,
## #   change_in_customers <int>, change_in_avg_txn_per_customer <dbl>
# Step 7: Visualization of sales trends for trial stores
ggplot(trial_sales, aes(x = DATE, y = trial_sales, color = factor(STORE_NBR))) +
  geom_line() +
  labs(title = "Sales Trends for Trial Stores", x = "Date", y = "Total Sales",
       color = "Store Number",
       caption = "Data Analyst : JP")



Phase 6 : Act

Data-Driven Decision Making:

Key Insights:

  • Model Accuracy:

    • The Random Forest model achieved an R-squared of 0.979, explaining 97.9% of the variance in revenue, with a low RMSE of 0.365, demonstrating accurate and reliable predictions.

  • Feature Importance:

    • Feature importance analysis identified quantity as the most significant predictor (%IncMSE: 84.36), followed by brand (45.35) and pack size (30.26), underscoring the critical impact of quantity on revenue predictions.

  • Customer Segmentation:

• Young Singles/Couples and Older Singles/Couples were the dominant customer groups, providing an opportunity to target these segments with tailored promotions.

  • Sales Drivers:

• Quantity, Brand, and Pack Size emerged as the most significant predictors of revenue, suggesting a focus on optimizing these attributes.

  • Seasonal Trends:

    • Sales exhibited fluctuations indicative of seasonal patterns, highlighting opportunities for seasonal campaigns.

  • Price Sensitivity:

    • Revenue distributions across budget tiers revealed price sensitivity among mid-tier customers, emphasizing the need for competitive pricing strategies.


  • Store Performance:

    • Significant variations in store performance were observed, with Store 226 leading sales and others lagging behind, calling for location-specific strategies.

    • Among the trial stores, Store 88 is the most successful, with consistently higher sales than the other two.

    • Store 86 shows moderate sales performance, with fluctuations throughout the period.

    • Store 77 appears to be underperforming, with generally lower sales than the other two trial stores.

    • The trial period analysis, including t-tests and correlation with control stores, highlights differences in sales performance and the areas that need attention; we recommend rolling out best practices from Store 88 to the other trial stores.


Strategic Recommendations:

  • Promotional Campaigns:

    • Focus promotions on top-performing brands like Kettle and Doritos during peak sales seasons.

    • Introduce bundle offers for popular pack sizes (e.g., 175g and 150g) to maximize revenue.

  • Targeted Marketing:

    • Develop loyalty programs and tailored advertisements for Older Singles/Couples and Retirees.

    • Leverage insights from life-stage segmentation to craft personalized campaigns for other growing customer groups.

  • Inventory Optimization:

    • Prioritize stocking high-performing pack sizes (175g, 150g) and brands in stores with higher sales volumes.

    • Adjust inventory in underperforming stores to minimize wastage and meet local demand.

  • Pricing Strategies:

    • Implement tier-based pricing strategies, especially for mid-tier customers, to capture revenue from price-sensitive segments.

    • Test price adjustments during lower sales months to encourage purchases.

  • Store-Specific Initiatives:

    • Allocate additional resources and marketing efforts to underperforming stores to drive sales growth.

    • Conduct store-level analyses to identify specific operational bottlenecks.

  • Continuous Monitoring:

    • Use predictive models to forecast future trends and refine strategies based on new data insights.

    • Regularly evaluate the impact of initiatives using key metrics like sales growth, market share, and customer retention.


Deliverables:

  • Final Report:

    • A comprehensive document including visualizations, customer segmentation insights, and actionable recommendations.

  • Presentation Deck:

    • Key findings and strategies presented for Julia and other stakeholders.

  • Data Submission:

    • All relevant analyses and models saved and documented for future reference.