Phase 3 : Process
Data Import and Cleaning:
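The chunks below assume a standard tidyverse workflow. A minimal setup sketch — the package list is inferred from the functions used later, and the file names are assumptions to be replaced with the project's actual paths:
library(tidyverse)     # dplyr, ggplot2, stringr, readr
library(readxl)        # read_excel() for the transaction workbook
library(skimr)         # skim(), skim_without_charts()
library(zoo)           # zoo() series used in forecasting
library(forecast)      # auto.arima(), forecast(), checkresiduals()
library(ggcorrplot)    # ggcorrplot() correlation heatmap
library(randomForest)  # randomForest()
library(caret)         # varImp()
library(broom)         # tidy(), glance()
# Assumed file names -- substitute the actual paths used in this project
purchase_data <- read.csv("QVI_purchase_behaviour.csv")
transaction_data <- read_excel("QVI_transaction_data.xlsx")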
purchase_data:
head(purchase_data)
## LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
## 1 1000 YOUNG SINGLES/COUPLES Premium
## 2 1002 YOUNG SINGLES/COUPLES Mainstream
## 3 1003 YOUNG FAMILIES Budget
## 4 1004 OLDER SINGLES/COUPLES Mainstream
## 5 1005 MIDAGE SINGLES/COUPLES Mainstream
## 6 1007 YOUNG SINGLES/COUPLES Budget
transaction_data:
head(transaction_data)
## # A tibble: 6 × 8
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 43390 1 1000 1 5 Natural Chi… 2 6
## 2 43599 1 1307 348 66 CCs Nacho C… 3 6.3
## 3 43605 1 1343 383 61 Smiths Crin… 2 2.9
## 4 43329 2 2373 974 69 Smiths Chip… 5 15
## 5 43330 2 2426 1038 108 Kettle Tort… 3 13.8
## 6 43604 4 4074 2982 57 Old El Paso… 1 5.1
Summary of ‘purchase_data’:
summary(purchase_data)
## LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
## Min. : 1000 Length:72637 Length:72637
## 1st Qu.: 66202 Class :character Class :character
## Median : 134040 Mode :character Mode :character
## Mean : 136186
## 3rd Qu.: 203375
## Max. :2373711
Checked Data Structure:
str(purchase_data)
## 'data.frame': 72637 obs. of 3 variables:
## $ LYLTY_CARD_NBR : int 1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
## $ LIFESTAGE : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "OLDER SINGLES/COUPLES" ...
## $ PREMIUM_CUSTOMER: chr "Premium" "Mainstream" "Budget" "Mainstream" ...
• The dataset consists of 72,637
observations and 3 variables.
Handled Missing Values:
skim_without_charts(purchase_data)
Data summary

  Name                     purchase_data
  Number of rows           72637
  Number of columns        3
  Column type frequency:
    character              2
    numeric                1
  Group variables          None

Variable type: character

  skim_variable     n_missing complete_rate min max empty n_unique whitespace
  LIFESTAGE                 0             1   8  22     0        7          0
  PREMIUM_CUSTOMER          0             1   6  10     0        3          0

Variable type: numeric

  skim_variable  n_missing complete_rate     mean       sd   p0   p25    p50    p75    p100
  LYLTY_CARD_NBR         0             1 136185.9 89892.93 1000 66202 134040 203375 2373711
• The dataset is free of NA values and
empty strings.
Checked for Duplicate Data Points:
purchase_data <- purchase_data %>%
distinct()
str(purchase_data)
## 'data.frame': 72637 obs. of 3 variables:
## $ LYLTY_CARD_NBR : int 1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
## $ LIFESTAGE : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "OLDER SINGLES/COUPLES" ...
## $ PREMIUM_CUSTOMER: chr "Premium" "Mainstream" "Budget" "Mainstream" ...
• The dataset has no duplicate rows based on all
variables.
Cross-Tabulated Customer Segments:
table(purchase_data$LIFESTAGE, purchase_data$PREMIUM_CUSTOMER)
##
## Budget Mainstream Premium
## MIDAGE SINGLES/COUPLES 1504 3340 2431
## NEW FAMILIES 1112 849 588
## OLDER FAMILIES 4675 2831 2274
## OLDER SINGLES/COUPLES 4929 4930 4750
## RETIREES 4454 6479 3872
## YOUNG FAMILIES 4017 2728 2433
## YOUNG SINGLES/COUPLES 3779 8088 2574
• Cross-tabulated LIFESTAGE against PREMIUM_CUSTOMER to confirm that every
segment combination is populated, verifying consistency and completeness;
row percentages are sketched below.
Summary of ‘transaction_data’:
summary(transaction_data)
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID
## Min. :43282 Min. : 1.0 Min. : 1000 Min. : 1
## 1st Qu.:43373 1st Qu.: 70.0 1st Qu.: 70021 1st Qu.: 67602
## Median :43464 Median :130.0 Median : 130358 Median : 135138
## Mean :43464 Mean :135.1 Mean : 135550 Mean : 135158
## 3rd Qu.:43555 3rd Qu.:203.0 3rd Qu.: 203094 3rd Qu.: 202701
## Max. :43646 Max. :272.0 Max. :2373711 Max. :2415841
## PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## Min. : 1.00 Length:264836 Min. : 1.000 Min. : 1.500
## 1st Qu.: 28.00 Class :character 1st Qu.: 2.000 1st Qu.: 5.400
## Median : 56.00 Mode :character Median : 2.000 Median : 7.400
## Mean : 56.58 Mean : 1.907 Mean : 7.304
## 3rd Qu.: 85.00 3rd Qu.: 2.000 3rd Qu.: 9.200
## Max. :114.00 Max. :200.000 Max. :650.000
Checked Data Structure:
str(transaction_data)
## tibble [264,836 × 8] (S3: tbl_df/tbl/data.frame)
## $ DATE : num [1:264836] 43390 43599 43605 43329 43330 ...
## $ STORE_NBR : num [1:264836] 1 1 1 2 2 4 4 4 5 7 ...
## $ LYLTY_CARD_NBR: num [1:264836] 1000 1307 1343 2373 2426 ...
## $ TXN_ID : num [1:264836] 1 348 383 974 1038 ...
## $ PROD_NBR : num [1:264836] 5 66 61 69 108 57 16 24 42 52 ...
## $ PROD_NAME : chr [1:264836] "Natural Chip Compny SeaSalt175g" "CCs Nacho Cheese 175g" "Smiths Crinkle Cut Chips Chicken 170g" "Smiths Chip Thinly S/Cream&Onion 175g" ...
## $ PROD_QTY : num [1:264836] 2 3 2 5 3 1 1 1 1 2 ...
## $ TOT_SALES : num [1:264836] 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...
• The dataset consists of 264,836
observations and 8 variables.
Handled Missing Values:
skim_without_charts(transaction_data)
Data summary

  Name                     transaction_data
  Number of rows           264836
  Number of columns        8
  Column type frequency:
    character              1
    numeric                7
  Group variables          None

Variable type: character

  skim_variable  n_missing complete_rate min max empty n_unique whitespace
  PROD_NAME              0             1  17  40     0      114          0

Variable type: numeric

  skim_variable  n_missing complete_rate      mean       sd      p0     p25      p50      p75    p100
  DATE                   0             1  43464.04   105.39 43282.0 43373.0  43464.0  43555.0   43646
  STORE_NBR              0             1    135.08    76.78     1.0    70.0    130.0    203.0     272
  LYLTY_CARD_NBR         0             1 135549.48 80579.98  1000.0 70021.0 130357.5 203094.2 2373711
  TXN_ID                 0             1 135158.31 78133.03     1.0 67601.5 135137.5 202701.2 2415841
  PROD_NBR               0             1     56.58    32.83     1.0    28.0     56.0     85.0     114
  PROD_QTY               0             1      1.91     0.64     1.0     2.0      2.0      2.0     200
  TOT_SALES              0             1      7.30     3.08     1.5     5.4      7.4      9.2     650
• The dataset is free of NA values and
empty strings.
Checked for Duplicate Data Points:
transaction_data %>%
filter(duplicated(.)) %>%
print()
## # A tibble: 1 × 8
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 43374 107 107024 108462 45 Smiths Thin… 2 6
• A duplicate data point was identified.
Removed Duplicate Data:
transaction_data <- transaction_data %>%
distinct()
str(transaction_data)
## tibble [264,835 × 8] (S3: tbl_df/tbl/data.frame)
## $ DATE : num [1:264835] 43390 43599 43605 43329 43330 ...
## $ STORE_NBR : num [1:264835] 1 1 1 2 2 4 4 4 5 7 ...
## $ LYLTY_CARD_NBR: num [1:264835] 1000 1307 1343 2373 2426 ...
## $ TXN_ID : num [1:264835] 1 348 383 974 1038 ...
## $ PROD_NBR : num [1:264835] 5 66 61 69 108 57 16 24 42 52 ...
## $ PROD_NAME : chr [1:264835] "Natural Chip Compny SeaSalt175g" "CCs Nacho Cheese 175g" "Smiths Crinkle Cut Chips Chicken 170g" "Smiths Chip Thinly S/Cream&Onion 175g" ...
## $ PROD_QTY : num [1:264835] 2 3 2 5 3 1 1 1 1 2 ...
## $ TOT_SALES : num [1:264835] 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...
• Removed the duplicate observation, reducing the
dataset to 264,835 unique records.
Formatted Excel Date (origin = “1899-12-30”):
transaction_data$DATE <- as.Date(transaction_data$DATE, origin = "1899-12-30")
str(transaction_data)
## tibble [264,835 × 8] (S3: tbl_df/tbl/data.frame)
## $ DATE : Date[1:264835], format: "2018-10-17" "2019-05-14" ...
## $ STORE_NBR : num [1:264835] 1 1 1 2 2 4 4 4 5 7 ...
## $ LYLTY_CARD_NBR: num [1:264835] 1000 1307 1343 2373 2426 ...
## $ TXN_ID : num [1:264835] 1 348 383 974 1038 ...
## $ PROD_NBR : num [1:264835] 5 66 61 69 108 57 16 24 42 52 ...
## $ PROD_NAME : chr [1:264835] "Natural Chip Compny SeaSalt175g" "CCs Nacho Cheese 175g" "Smiths Crinkle Cut Chips Chicken 170g" "Smiths Chip Thinly S/Cream&Onion 175g" ...
## $ PROD_QTY : num [1:264835] 2 3 2 5 3 1 1 1 1 2 ...
## $ TOT_SALES : num [1:264835] 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...
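A quick sanity check on the conversion: the serial bounds reported by summary() (43282 and 43646) should map to the 2018-07-01 to 2019-06-30 financial year:
# Both calls should agree on the 2018-07-01 .. 2019-06-30 window
range(transaction_data$DATE)
as.Date(c(43282, 43646), origin = "1899-12-30")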
Merged the Two Datasets as ‘merged_data’:
merged_data <- merge(purchase_data, transaction_data, by = "LYLTY_CARD_NBR")
str(merged_data)
## 'data.frame': 264835 obs. of 10 variables:
## $ LYLTY_CARD_NBR : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ LIFESTAGE : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
## $ PREMIUM_CUSTOMER: chr "Premium" "Mainstream" "Budget" "Budget" ...
## $ DATE : Date, format: "2018-10-17" "2018-09-16" ...
## $ STORE_NBR : num 1 1 1 1 1 1 1 1 1 1 ...
## $ TXN_ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ PROD_NBR : num 5 58 52 106 96 86 49 10 20 51 ...
## $ PROD_NAME : chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ PROD_QTY : num 2 1 1 1 1 1 1 1 1 2 ...
## $ TOT_SALES : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
• Merged the two datasets into a single comprehensive data frame with
264,835 observations and 10 variables; a quick merge-integrity check is
sketched below.
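Because merge() performs an inner join and silently drops unmatched rows, it is worth confirming that every transaction found a loyalty card; a hedged check:
# Transactions with no matching loyalty card (expect 0)
transaction_data %>%
  anti_join(purchase_data, by = "LYLTY_CARD_NBR") %>%
  nrow()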
Renamed Column Names:
merged_data <- merged_data %>%
rename(card_id = LYLTY_CARD_NBR,
life_stage = LIFESTAGE,
tier = PREMIUM_CUSTOMER,
date = DATE,
store_id = STORE_NBR,
txn_id = TXN_ID,
product_id = PROD_NBR,
product_name = PROD_NAME,
quantity = PROD_QTY,
revenue = TOT_SALES)
str(merged_data)
## 'data.frame': 264835 obs. of 10 variables:
## $ card_id : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ life_stage : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
## $ tier : chr "Premium" "Mainstream" "Budget" "Budget" ...
## $ date : Date, format: "2018-10-17" "2018-09-16" ...
## $ store_id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ txn_id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ product_id : num 5 58 52 106 96 86 49 10 20 51 ...
## $ product_name: chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ quantity : num 2 1 1 1 1 1 1 1 1 2 ...
## $ revenue : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
Mutated ‘month_year’ Column:
merged_data <- merged_data %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
mutate(month_year = factor(format(date, "%b-%Y"),
levels = c("Jul-2018", "Aug-2018",
"Sep-2018", "Oct-2018",
"Nov-2018", "Dec-2018", "Jan-2019",
"Feb-2019","Mar-2019", "Apr-2019",
"May-2019", "Jun-2019")))
str(merged_data)
## 'data.frame': 264835 obs. of 11 variables:
## $ card_id : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ life_stage : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
## $ tier : chr "Premium" "Mainstream" "Budget" "Budget" ...
## $ date : Date, format: "2018-10-17" "2018-09-16" ...
## $ store_id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ txn_id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ product_id : num 5 58 52 106 96 86 49 10 20 51 ...
## $ product_name: chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ quantity : num 2 1 1 1 1 1 1 1 1 2 ...
## $ revenue : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
## $ month_year : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
Outlier Detection And Treatment:
merged_data %>%
ggplot(aes(quantity)) + geom_boxplot()

• The plot shows a strong right skew in the quantity distribution, with
potential outliers at the high end; the extreme rows are inspected below
before filtering.
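Before removing anything, the flagged rows can be pulled out to confirm they are genuine anomalies rather than data-entry noise:
# Inspect the extreme transactions flagged by the box plot
merged_data %>%
  filter(quantity >= 200)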
Filtered ‘merged_data’:
merged_data <- merged_data %>%
filter(quantity < 200)
str(merged_data)
## 'data.frame': 264833 obs. of 11 variables:
## $ card_id : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ life_stage : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
## $ tier : chr "Premium" "Mainstream" "Budget" "Budget" ...
## $ date : Date, format: "2018-10-17" "2018-09-16" ...
## $ store_id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ txn_id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ product_id : num 5 58 52 106 96 86 49 10 20 51 ...
## $ product_name: chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ quantity : num 2 1 1 1 1 1 1 1 1 2 ...
## $ revenue : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
## $ month_year : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
• The merged dataset now consists of 264,833
observations and 11 variables.
Customer Segmentation By Revenue:
merged_data %>%
ggplot(aes(tier,revenue, fill = tier)) +
geom_boxplot() +
geom_hline(yintercept = mean(merged_data$revenue), color = "purple", linetype = 2, linewidth = 1) +
stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 2)), size = 4, vjust = -0.5) +
theme_minimal() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(
x = "Tiers",
y = "Revenue",
title = "Customer Segmentation by Revenue",
caption = "Data Analyst : JP"
)

• Outliers were identified and
handled appropriately to avoid skewing the distribution
before creating the boxplot.
• Enhanced Data Quality: Improved
data accuracy and consistency by addressing missing values, outliers,
and inconsistencies.
• Standardized Data Formats:
Ensured uniform data formats for further analysis and
modeling.
• Optimized Data Structure: Merged
and transformed data into a suitable format for
analysis.
• Leveraged Data Cleaning Techniques:
Employed box plots to identify outliers in the
quantity and revenue variables. The extreme quantity
outliers were handled by removal (filtering out
quantities of 200 and above).
Phase 4 & 5 : Analyze and Share
Descriptive and Exploratory Data Analysis
(EDA):
Univariate Analysis
For Numerical Variables:
merged_data %>%
select(where(is.numeric)) %>%
skim()
Data summary

  Name                     Piped data
  Number of rows           264833
  Number of columns        6
  Column type frequency:
    numeric                6
  Group variables          None

Variable type: numeric

  skim_variable  n_missing complete_rate      mean       sd     p0     p25      p50      p75      p100  hist
  card_id                0             1 135548.90 80580.03 1000.0 70021.0 130357.0 203094.0 2373711.0  ▇▁▁▁▁
  store_id               0             1    135.08    76.78    1.0    70.0    130.0    203.0     272.0  ▆▇▇▇▇
  txn_id                 0             1 135157.72 78133.05    1.0 67600.0 135137.0 202700.0 2415841.0  ▇▁▁▁▁
  product_id             0             1     56.58    32.83    1.0    28.0     56.0     85.0     114.0  ▇▇▇▇▇
  quantity               0             1      1.91     0.34    1.0     2.0      2.0      2.0       5.0  ▁▇▁▁▁
  revenue                0             1      7.30     2.53    1.5     5.4      7.4      9.2      29.5  ▆▇▁▁▁
• Suitable variables for histogram and density plots: revenue and
quantity.
Histogram and Density Plot of ‘revenue’:
merged_data %>%
ggplot(aes(revenue)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
geom_line(stat = "density", aes(y = after_stat(count)), color = "red") +
theme_minimal() +
theme(
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(title = "Histogram and Density Distribution on Revenue",
caption = "Data Analyst : JP")

• The plot shows most transactions
generate low to moderate revenue, peaking
around 5 with a long tail toward higher values. This distribution
can guide revenue optimization and pricing
strategies.
Histogram and Density Plot of ‘quantity’:
merged_data %>%
ggplot(aes(quantity)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
geom_line(stat = "density", aes(y = after_stat(count)), color = "red") +
theme_minimal() +
theme(
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(title = "Histogram and Density Distribution on Quantity",
caption = "Data Analyst : JP")

• The plot shows most transactions
involve small quantities, with a peak around 2 units
and long tails. This insight can aid in optimizing
inventory and pricing strategies.
Histogram and Density Plot of ‘store_id’:
merged_data %>%
ggplot(aes(store_id)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
geom_line(stat = "density", aes(y = after_stat(count)), color = "red") +
theme_minimal() +
theme(
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(title = "Histogram and Density Distribution on Store IDs",
caption = "Data Analyst : JP")

• Store ID distribution helps identify clusters or
patterns in sales activity across stores.
For Categorical Variables:
merged_data %>%
select(where(is.character)) %>%
skim()
Data summary

  Name                     Piped data
  Number of rows           264833
  Number of columns        3
  Column type frequency:
    character              3
  Group variables          None

Variable type: character

  skim_variable  n_missing complete_rate min max empty n_unique whitespace
  life_stage             0             1   8  22     0        7          0
  tier                   0             1   6  10     0        3          0
  product_name           0             1  17  40     0      114          0
Tier Frequencies:
merged_data %>%
count(tier) %>%
rename(counts = n)
## tier counts
## 1 Budget 93157
## 2 Mainstream 101988
## 3 Premium 69688
Distribution of Member Tier:
merged_data %>%
count(tier) %>%
rename(counts = n) %>%
mutate(percentage = counts / sum(counts) * 100) %>%
ggplot(aes(x = "", y = counts, fill = tier)) +
geom_bar(stat = "identity", color = 'black', width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = paste0(round(percentage, 2), "%")), position = position_stack(vjust = 0.5)) +
theme_void() +
labs(title = "Distribution of Member Types",
fill = "Members Tier",
caption = "Data Analyst : JP")

• This pie chart shows how members are distributed
across the three customer tiers (Budget, Mainstream, Premium).
‘life_stage’ With ‘tier’ Frequencies:
merged_data %>%
count(life_stage, tier) %>%
rename(counts = n) %>%
mutate(percentage = counts / sum(counts) * 100)
## life_stage tier counts percentage
## 1 MIDAGE SINGLES/COUPLES Budget 5020 1.8955342
## 2 MIDAGE SINGLES/COUPLES Mainstream 11874 4.4835802
## 3 MIDAGE SINGLES/COUPLES Premium 8216 3.1023324
## 4 NEW FAMILIES Budget 3005 1.1346773
## 5 NEW FAMILIES Mainstream 2325 0.8779117
## 6 NEW FAMILIES Premium 1589 0.6000008
## 7 OLDER FAMILIES Budget 23160 8.7451337
## 8 OLDER FAMILIES Mainstream 14244 5.3784838
## 9 OLDER FAMILIES Premium 11190 4.2253042
## 10 OLDER SINGLES/COUPLES Budget 18407 6.9504178
## 11 OLDER SINGLES/COUPLES Mainstream 18318 6.9168117
## 12 OLDER SINGLES/COUPLES Premium 17753 6.7034697
## 13 RETIREES Budget 15201 5.7398436
## 14 RETIREES Mainstream 21466 8.1054853
## 15 RETIREES Premium 13096 4.9450031
## 16 YOUNG FAMILIES Budget 19122 7.2203993
## 17 YOUNG FAMILIES Mainstream 12907 4.8736373
## 18 YOUNG FAMILIES Premium 11563 4.3661477
## 19 YOUNG SINGLES/COUPLES Budget 9242 3.4897464
## 20 YOUNG SINGLES/COUPLES Mainstream 20854 7.8743963
## 21 YOUNG SINGLES/COUPLES Premium 6281 2.3716833
‘life_stage’ Frequency Breakdown:
merged_data %>%
count(life_stage, tier) %>%
rename(counts = n) %>%
mutate(percentage = counts / sum(counts) * 100) %>%
ggplot(aes(life_stage, counts, fill = life_stage)) +
geom_bar(stat = "identity")+
theme_dark() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Life Stage",
y = "Frquency",
title = "Life Stage and Frquency Breakdown",
caption = "Data Analyst : JP")

• The bar chart reveals that Older Singles/Couples
and Retirees have the highest frequency, while New Families are the
least represented. This breakdown provides insights
into the dominant life stages for targeted initiatives and product
offerings.
Life Stage and Tier Frequency Breakdown:
merged_data %>%
count(life_stage, tier) %>%
ggplot(aes(x = tier, y = n, fill = life_stage)) +
geom_bar(stat = "identity", color = 'black', position = "dodge") +
theme_dark() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Tier",
y = "Frquency",
fill = "Life Stages",
title = "Life Stage and Tier Frquency Breakdown",
caption = "Data Analyst : JP")

• The bar chart highlights that Older Families dominate the Budget tier,
while Young Singles/Couples and Retirees lead the Mainstream tier;
New Families are the least represented across all tiers. This
distribution can inform targeted marketing and product
development strategies.
Unique Customers and Average Transactions per Customer:
# Calculate total unique customers and average transactions per customer
customer_summary <- merged_data %>%
group_by(card_id) %>%
summarise(total_transactions = n()) %>%
summarise(total_customers = n(),
avg_transactions_per_customer = mean(total_transactions))
# View the result
print(customer_summary)
## # A tibble: 1 × 2
## total_customers avg_transactions_per_customer
## <int> <dbl>
## 1 72636 3.65
Unique Customers and Average Revenue per Customer:
# Calculate total unique customers and average revenue per customer
customer_summary1 <- merged_data %>%
group_by(card_id) %>%
summarise(total_revenue = sum(revenue)) %>%
summarise(total_customers = n(),
avg_revenue_per_customer = mean(total_revenue))
# View the result
print(customer_summary1)
## # A tibble: 1 × 2
## total_customers avg_revenue_per_customer
## <int> <dbl>
## 1 72636 26.6
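The same per-customer logic extends to segments; a sketch of average revenue per customer by life stage and tier:
merged_data %>%
  group_by(life_stage, tier, card_id) %>%
  summarise(customer_revenue = sum(revenue), .groups = "drop") %>%
  group_by(life_stage, tier) %>%
  summarise(avg_revenue_per_customer = mean(customer_revenue),
            .groups = "drop") %>%
  arrange(desc(avg_revenue_per_customer))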
Total Revenue Distribution Across Life Stages and
Tiers:
revenue_summary <- merged_data %>%
group_by(tier, life_stage) %>%
summarise(
total_revenue = sum(revenue, na.rm = TRUE), # Total revenue
unique_customers = n_distinct(card_id) # Number of unique customers
)
## `summarise()` has grouped output by 'tier'. You can override using the
## `.groups` argument.
# Plot total revenue by tier, filled by life_stage (number of unique customers)
revenue_summary %>%
mutate(percentage = (total_revenue / sum(total_revenue)) * 100) %>%
ggplot(aes(life_stage, total_revenue, fill = tier )) +
geom_bar(stat = "identity", color = 'black', position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_dodge(width = 0.9), vjust = -0.5, size = 2) +
theme_dark() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")
) +
labs(
x = "Life Stages",
y = "Total Revenue",
fill = "Tier",
title = "Total Revenue by Tier and Life Stage (Unique Customers)",
caption = "Data Analyst: JP"
)

• Increase Premium tier customer acquisition and retention,
especially in the Older Singles/Couples and Retirees segments.
Unique Customers Breakdown by Life Stage and
Tier:
merged_data %>%
group_by(card_id) %>%
summarise(life_stage = first(life_stage), tier = first(tier)) %>%
count(life_stage, tier) %>%
rename(counts = n) %>%
mutate(percentage = counts / sum(counts) * 100) %>%
ggplot(aes(x = life_stage, y = counts, fill = tier)) +
geom_bar(stat = "identity", color = 'black', position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_dodge(width = 0.9), vjust = -0.5, size = 2) +
theme_dark() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold"),
legend.title = element_text(face = "bold"),
legend.position = "right") +
labs(x = "Life Stage",
y = "Frequency of Unique Customers",
title = "Unique Customers Breakdown by Life Stage and Tier",
fill = "Tier", # Legend title
caption = "Data Analyst : JP")

Bivariate Analysis
For Numerical-Numerical
merged_data %>%
select(where(is.numeric)) %>%
skim_without_charts()
Data summary

  Name                     Piped data
  Number of rows           264833
  Number of columns        6
  Column type frequency:
    numeric                6
  Group variables          None

Variable type: numeric

  skim_variable  n_missing complete_rate      mean       sd     p0     p25      p50      p75      p100
  card_id                0             1 135548.90 80580.03 1000.0 70021.0 130357.0 203094.0 2373711.0
  store_id               0             1    135.08    76.78    1.0    70.0    130.0    203.0     272.0
  txn_id                 0             1 135157.72 78133.05    1.0 67600.0 135137.0 202700.0 2415841.0
  product_id             0             1     56.58    32.83    1.0    28.0     56.0     85.0     114.0
  quantity               0             1      1.91     0.34    1.0     2.0      2.0      2.0       5.0
  revenue                0             1      7.30     2.53    1.5     5.4      7.4      9.2      29.5
Correlation Between Numerical Variables:
merged_data %>%
select_if(is.numeric) %>%
cor() %>%
ggcorrplot(hc.order = TRUE, type = "lower", lab = TRUE) +
theme_dark() +
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank(),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(title = "Correlation between Numerical Variables",
caption = "Data Analyst : JP")

• The matrix shows strong correlations between store IDs, card IDs, and
transaction IDs. These are largely structural: loyalty card and
transaction numbers appear to be assigned in store-number order, so the
correlations reflect ID design rather than customer behavior; a quick
check is sketched below.
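The structural link can be probed directly; a sketch checking what share of loyalty card numbers embed the store number as their thousands prefix:
# Share of rows where card_id %/% 1000 equals store_id
# (expected to be high, though not exact for every card)
merged_data %>%
  mutate(store_prefix = card_id %/% 1000) %>%
  summarise(share_matching = mean(store_prefix == store_id))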
Correlation between ‘store_id’ And ‘txn_id’:
merged_data %>%
ggplot(aes(store_id, txn_id)) +
geom_point(color = "red", size = 2) +
theme_dark() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(title = "Correlation between Store IDs and Transaction IDs",
x = "Store IDs",
y = "Txn IDs",
caption = "Data Analyst : JP")

• Scatter plot of Store IDs against Transaction IDs. The tight linear
band suggests transaction IDs are issued in store-number order, so the
correlation is an artifact of ID assignment rather than a behavioral
relationship.
Correlation between ‘store_id’ And ‘card_id’:
merged_data %>%
ggplot(aes(store_id, card_id)) +
geom_point(color = "#f6522e", size = 2)+
theme_dark() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(title = "Correlation between Store IDs and Card IDs",
x = "Store IDs",
y = "Card IDs",
caption = "Data Analyst : JP")

• Scatter plot of Store IDs against Card IDs. The pattern mirrors the
previous plot: card numbers embed the store number, so the correlation
is structural rather than behavioral.
Correlation Between ‘quantity’ and ‘revenue’:
merged_data %>%
ggplot(aes(quantity, revenue)) +
geom_point(color = "#fb785b", size = 2) +
theme_dark() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(title = "Correlation between Quantity and Revenue",
x = "Quantity",
y = "Revenue",
caption = "Data Analyst : JP")

• Revenue rises with quantity, as expected; the vertical spread at each
quantity level reflects differences in pack size and unit price.
For Numerical-Categorical
Total Quantity Sold by ‘store_id’:
merged_data %>%
group_by(store_id) %>%
summarise(total_quantity_sold = sum(quantity)) %>%
arrange(desc(total_quantity_sold))
## # A tibble: 272 × 2
## store_id total_quantity_sold
## <dbl> <dbl>
## 1 226 4001
## 2 88 3718
## 3 93 3639
## 4 165 3602
## 5 43 3519
## 6 237 3515
## 7 40 3499
## 8 230 3476
## 9 213 3470
## 10 58 3463
## # ℹ 262 more rows
Top 10 Stores by Quantity Sold:
merged_data %>%
group_by(store_id) %>%
summarise(total_quantity_sold = sum(quantity)) %>%
arrange(desc(total_quantity_sold)) %>%
slice_max(total_quantity_sold, n = 10) %>%
mutate(store_id = factor(store_id, levels = store_id)) %>%
ggplot(aes(store_id, total_quantity_sold, fill = store_id)) +
geom_bar(stat = "identity", color = 'black') +
theme_dark() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Store IDs",
y = "Total Quantities Sold",
title = "Top 10 Stores Sold The Most Quantity ",
caption = "Data Analyst : JP")

• The chart shows wide variation in performance
among the top 10 stores, with Store 226 leading with
significantly more quantities sold.
Lowest-Performing Stores:
merged_data %>%
group_by(store_id) %>%
summarise(total_quantity_sold = sum(quantity)) %>%
arrange(total_quantity_sold)
## # A tibble: 272 × 2
## store_id total_quantity_sold
## <dbl> <dbl>
## 1 11 2
## 2 76 2
## 3 92 2
## 4 206 2
## 5 211 2
## 6 252 2
## 7 31 4
## 8 193 4
## 9 85 5
## 10 117 47
## # ℹ 262 more rows
10 Lowest-Performing Stores:
merged_data %>%
group_by(store_id) %>%
summarise(total_quantity_sold = sum(quantity)) %>%
arrange(total_quantity_sold) %>%
slice_min(total_quantity_sold, n = 10) %>%
mutate(store_id = factor(store_id, levels = store_id)) %>%
ggplot(aes(store_id, total_quantity_sold, fill = store_id)) +
geom_bar(stat = "identity", color = 'black') +
theme_dark() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Store IDs",
y = "Total Quantities Sold",
title = "10 Lowest-Performing Stores",
caption = "Data Analyst : JP")

• This bar chart illustrates the 10 stores with the
lowest quantities sold. Store 117 stands
out as the highest-performing store among these 10, with
significantly more quantities sold than the others.
Extracted ‘brand’ Names:
merged_data <- merged_data %>%
mutate(brand = word(product_name, 1))
str(merged_data)
## 'data.frame': 264833 obs. of 12 variables:
## $ card_id : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ life_stage : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
## $ tier : chr "Premium" "Mainstream" "Budget" "Budget" ...
## $ date : Date, format: "2018-10-17" "2018-09-16" ...
## $ store_id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ txn_id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ product_id : num 5 58 52 106 96 86 49 10 20 51 ...
## $ product_name: chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ quantity : num 2 1 1 1 1 1 1 1 1 2 ...
## $ revenue : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
## $ month_year : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
## $ brand : chr "Natural" "Red" "Grain" "Natural" ...
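Note that the first-word heuristic leaves fragments and abbreviations of the same brand (‘Dorito’ vs ‘Doritos’, ‘Smith’ vs ‘Smiths’, ‘Red’ for Red Rock Deli alongside ‘RRD’, ‘WW’ alongside ‘Woolworths’), which appear as separate brands in the tables below. A hedged clean-up sketch, not applied in this analysis:
# Optional consolidation of brand fragments (assumed mappings -- verify before use)
brand_counts <- merged_data %>%
  mutate(brand = recode(brand,
                        "Dorito"  = "Doritos",
                        "Smith"   = "Smiths",
                        "Snbts"   = "Sunbites",
                        "Infzns"  = "Infuzions",
                        "Red"     = "RRD",
                        "WW"      = "Woolworths",
                        "GrnWves" = "Grain")) %>%
  count(brand, sort = TRUE)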
Top 5 Best-Selling Brands: Monthly Trends:
merged_data %>%
group_by(month_year, brand) %>%
summarise(total_quantity_sold = sum(quantity), .groups = 'drop') %>%
arrange(month_year, desc(total_quantity_sold)) %>%
group_by(month_year) %>%
slice_head(n = 5) %>%
ggplot(aes(x = reorder(brand, -total_quantity_sold), y = total_quantity_sold, fill = brand)) +
geom_bar(stat = "identity", color = 'black') +
facet_wrap(~ month_year) +
theme_dark() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Brand Name", y = "Total Quantity Sold", title = "Top 5 Products Sold Per Month over the Year",
caption = "Data Analyst : JP")

• Kettle leads in sales, followed by Doritos and
Pringles with slight fluctuations. Smiths and Thins have lower sales,
with overall minor variations across brands.
Time Series Analysis of Top 5 Brand Sales:
merged_data %>%
group_by(month_year, brand) %>%
summarise(total_quantity_sold = sum(quantity), .groups = 'drop') %>%
arrange(month_year, desc(total_quantity_sold)) %>%
group_by(month_year) %>%
slice_head(n = 5) %>%
ggplot(aes(month_year, total_quantity_sold, color = brand, group = brand)) +
geom_line() +
geom_point(color = 'black', ) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Month-Year", y = "Total Quantity Sold", title = "Trend In Top 5 Products Sold Per Month over Year",
caption = "Data Analyst : JP")

• Line chart showing sales trends of the top 5 snack brands (Doritos,
Kettle, Pringles, Smiths, Thins) from July 2018 to June 2019. Kettle
and Doritos lead, Thins lags, and total
sales fluctuate, suggesting seasonal patterns.
‘revenue’ Distribution Across ‘life_stage’:
merged_data %>%
ggplot(aes(life_stage, revenue, fill = life_stage)) +
geom_boxplot() +
geom_hline(yintercept = mean(merged_data$revenue), color = "purple", linetype = 2, linewidth = 1) +
stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 2)), size = 4, vjust = -0.5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(
x = "Life Stage",
y = "Revenue",
title = "Revenue Distribution Across Life Stage",
caption = "Data Analyst : JP"
)

• Revenue spread varies across life stages,
providing useful data for targeted
marketing strategies.
Monthly Revenue Performance Over the Year:
merged_data %>%
ggplot(aes(month_year, revenue, fill = month_year)) +
geom_boxplot() +
geom_hline(yintercept = mean(merged_data$revenue), color = "purple", linetype = 2, linewidth = 1) +
stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 2)), size = 4, vjust = -0.5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
legend.position = "none",
plot.title = element_text(size = 16, face = "bold")) +
labs(
x = "Month-Year",
y = "Revenue",
title = "Monthly Revenue Performance Over the Year",
caption = "Data Analyst : JP"
)

• The box plot shows monthly revenue distribution.
The median revenue varies across months, with wider
boxes indicating greater revenue variability.
Chi-Square Test
For Categorical-Categorical
Contingency Table For ‘life_stage’ And ‘tier’:
lt_table <- table(merged_data$tier, merged_data$life_stage)
print(lt_table)
##
## MIDAGE SINGLES/COUPLES NEW FAMILIES OLDER FAMILIES
## Budget 5020 3005 23160
## Mainstream 11874 2325 14244
## Premium 8216 1589 11190
##
## OLDER SINGLES/COUPLES RETIREES YOUNG FAMILIES
## Budget 18407 15201 19122
## Mainstream 18318 21466 12907
## Premium 17753 13096 11563
##
## YOUNG SINGLES/COUPLES
## Budget 9242
## Mainstream 20854
## Premium 6281
Chi-Square Test For ‘life_stage’ And ‘tier’:
print(chisq.test(lt_table))
##
## Pearson's Chi-squared test
##
## data: lt_table
## X-squared = 15226, df = 12, p-value < 2.2e-16
• The Chi-Square test indicates a statistically significant association
between life stage and tier (χ² = 15226, df = 12, p < 2.2e-16); an
effect-size check follows.
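With n = 264,833, even weak associations reach significance, so an effect size is a useful complement; a sketch computing Cramér's V from the same contingency table:
# Cramér's V: effect size for a chi-squared association (0 = none, 1 = perfect)
cramers_v <- function(tbl) {
  chi <- chisq.test(tbl)
  k   <- min(nrow(tbl), ncol(tbl)) - 1
  sqrt(as.numeric(chi$statistic) / (sum(tbl) * k))
}
cramers_v(lt_table)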
Heatmap For ‘life_stage’ And ‘tier’:
lt_table <- data.frame(lt_table)
lt_table %>%
ggplot(aes(Var1, Var2, fill = Freq)) +
geom_tile(color = 'black') +
scale_fill_gradient(low = "white", high = "darkblue") +
theme_minimal() +
theme(
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(
x = "Tier",
y = "Life Stage",
fill = "Frequency",
title = "Customer Segmentation Heatmap",
caption = "Data Analyst : JP"
)

• The heatmap visualizes customer distribution
across life stages and tiers; color intensity
indicates customer frequency.
Contingency Table For ‘life_stage’ And ‘brand’:
lb_table <- table(merged_data$life_stage, merged_data$brand)
print(lb_table)
##
## Burger CCs Cheetos Cheezels Cobs Dorito Doritos
## MIDAGE SINGLES/COUPLES 152 433 265 443 961 300 2361
## NEW FAMILIES 40 93 62 129 288 91 693
## OLDER FAMILIES 353 941 615 813 1624 533 4332
## OLDER SINGLES/COUPLES 292 851 580 966 2036 684 5236
## RETIREES 256 741 491 867 1884 619 4852
## YOUNG FAMILIES 293 898 550 771 1504 484 3950
## YOUNG SINGLES/COUPLES 178 594 364 614 1396 472 3538
##
## French Grain GrnWves Infuzions Infzns Kettle Natural
## MIDAGE SINGLES/COUPLES 117 613 119 1059 344 4055 550
## NEW FAMILIES 32 183 30 300 93 1171 132
## OLDER FAMILIES 283 1090 339 1926 570 6851 1254
## OLDER SINGLES/COUPLES 286 1311 292 2327 635 8847 1198
## RETIREES 228 1214 252 2107 612 8194 1040
## YOUNG FAMILIES 278 948 273 1775 440 6277 1107
## YOUNG SINGLES/COUPLES 194 913 163 1563 450 5893 769
##
## NCC Old Pringles Red RRD Smith Smiths Snbts
## MIDAGE SINGLES/COUPLES 144 915 2389 535 1079 267 2689 132
## NEW FAMILIES 26 245 698 137 251 78 681 31
## OLDER FAMILIES 317 1589 4244 1270 2625 642 5801 337
## OLDER SINGLES/COUPLES 268 1945 5307 1149 2175 557 5762 310
## RETIREES 246 1839 4951 1037 1943 518 5095 291
## YOUNG FAMILIES 260 1467 3829 1052 2344 563 5110 295
## YOUNG SINGLES/COUPLES 158 1324 3684 705 1477 338 3721 180
##
## Sunbites Thins Tostitos Twisties Tyrrells Woolworths
## MIDAGE SINGLES/COUPLES 117 1316 924 931 611 382
## NEW FAMILIES 30 378 277 233 187 90
## OLDER FAMILIES 285 2475 1546 1644 1093 939
## OLDER SINGLES/COUPLES 274 2969 2039 1949 1340 858
## RETIREES 244 2792 1850 1890 1259 741
## YOUNG FAMILIES 301 2186 1467 1412 997 851
## YOUNG SINGLES/COUPLES 181 1959 1368 1395 955 576
##
## WW
## MIDAGE SINGLES/COUPLES 907
## NEW FAMILIES 240
## OLDER FAMILIES 2263
## OLDER SINGLES/COUPLES 2035
## RETIREES 1710
## YOUNG FAMILIES 1910
## YOUNG SINGLES/COUPLES 1255
Chi-Square Test For ‘life_stage’ And ‘brand’:
print(chisq.test(lb_table))
##
## Pearson's Chi-squared test
##
## data: lb_table
## X-squared = 1574.3, df = 168, p-value < 2.2e-16
• The Chi-Square test indicates a statistically significant association
between life stage and brand (χ² = 1574.3, df = 168, p < 2.2e-16).
Heatmap For ‘life_stage’ And ‘brand’:
lb_table <- data.frame(lb_table)
lb_table %>%
ggplot(aes(Var2, Var1, fill = Freq)) +
geom_tile(color = 'black') +
scale_fill_gradient(low = "white", high = "darkblue") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(
x = "Brands",
y = "Life Stage",
fill = "Frequency",
title = "Brand Segmentation by Life Stage",
caption = "Data Analyst : JP"
)

• The heatmap visualizes brand distribution across
life stages; color intensity indicates
purchase frequency.
Contingency Table For ‘tier’ and ‘brand’:
tb_table <- table(merged_data$tier, merged_data$brand)
print(tb_table)
##
## Burger CCs Cheetos Cheezels Cobs Dorito Doritos French Grain
## Budget 579 1679 1051 1626 3274 1056 8762 539 2114
## Mainstream 548 1631 1111 1735 3889 1253 9939 507 2516
## Premium 437 1241 765 1242 2530 874 6261 372 1642
##
## GrnWves Infuzions Infzns Kettle Natural NCC Old Pringles Red
## Budget 542 3855 1067 14154 2246 539 3203 8620 2109
## Mainstream 521 4290 1260 16423 2162 495 3725 9903 2156
## Premium 405 2912 817 10711 1642 385 2396 6579 1620
##
## RRD Smith Smiths Snbts Sunbites Thins Tostitos Twisties Tyrrells
## Budget 4371 1118 10430 610 536 4931 3236 3229 2195
## Mainstream 4306 1055 10787 544 498 5436 3737 3785 2583
## Premium 3217 790 7642 422 398 3708 2498 2440 1664
##
## Woolworths WW
## Budget 1605 3881
## Mainstream 1607 3586
## Premium 1225 2853
Chi-Square Test For ‘tier’ and ‘brand’:
print(chisq.test(tb_table))
##
## Pearson's Chi-squared test
##
## data: tb_table
## X-squared = 348.61, df = 56, p-value < 2.2e-16
• Pearson’s Chi-squared test indicates a statistically significant
association between tier and brand (χ² = 348.61, df = 56, p < 2.2e-16).
Heatmap For ‘tier’ and ‘brand’:
tb_table <- data.frame(tb_table)
tb_table %>%
ggplot(aes(Var1, Var2, fill = Freq)) +
geom_tile(color = 'black') +
scale_fill_gradient(low = "white", high = "darkblue") +
theme_minimal() +
theme(
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(
x = "Tier",
y = "Brands",
fill = "Frequency",
title = "Brand Segmentation by Tier",
caption = "Data Analyst : JP"
)

• The heatmap visualizes brand distribution across
tiers; color intensity indicates purchase
frequency.
Forecasting
Projected Revenue for the Next 6 Months (180
Days):
zoo_data <- merged_data %>%
group_by(date) %>%
summarise(total_revenue = sum(revenue))
zoo_data <- zoo(zoo_data$total_revenue, order.by = as.Date(zoo_data$date))
fit <- auto.arima(zoo_data)
forecast(fit, h = 180) %>%
plot(xlab = "Days", ylab = "Total Revenue",
main = "Projected Revenue for the Next 6 Months (180 Days)")

• The plot shows projected revenue for the next 6
months, with the predicted trend line and confidence interval. It
suggests stable revenue with some fluctuations.
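The forecast object also exposes the numbers behind the plot; a sketch extracting point forecasts with their 80% and 95% intervals:
# Point forecasts and prediction intervals as a data frame
fc <- forecast(fit, h = 180)
head(as.data.frame(fc))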
Assessed Model Fit:
checkresiduals(fit)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,0) with non-zero mean
## Q* = 6.5198, df = 8, p-value = 0.5892
##
## Model df: 2. Total lags used: 10
• The residuals are randomly distributed with no obvious patterns. The
histogram is approximately normal, and the Ljung-Box test (Q* = 6.52,
p = 0.5892) confirms no significant autocorrelation. The ARIMA(2,0,0)
model fits the data well.
Model Summary for Forecasting:
summary(fit)
## Series: zoo_data
## ARIMA(2,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 mean
## 0.4304 0.2417 5314.2705
## s.e. 0.0507 0.0506 41.6029
##
## sigma^2 = 69528: log likelihood = -2544.56
## AIC=5097.12 AICc=5097.23 BIC=5112.72
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.3734051 262.5937 203.2132 -0.24488 3.860994 0.8611807 -0.0114896
• The projected revenue for the next six months
shows a consistent trend, with a slight upward
slope.
• The confidence interval is relatively narrow,
suggesting a high degree of accuracy.
Extracted Pack Size from Product Name:
merged_data <- merged_data %>%
mutate(pack_size = parse_number(product_name))
str(merged_data)
## 'data.frame': 264833 obs. of 13 variables:
## $ card_id : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ life_stage : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
## $ tier : chr "Premium" "Mainstream" "Budget" "Budget" ...
## $ date : Date, format: "2018-10-17" "2018-09-16" ...
## $ store_id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ txn_id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ product_id : num 5 58 52 106 96 86 49 10 20 51 ...
## $ product_name: chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ quantity : num 2 1 1 1 1 1 1 1 1 2 ...
## $ revenue : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
## $ month_year : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
## $ brand : chr "Natural" "Red" "Grain" "Natural" ...
## $ pack_size : num 175 150 210 175 160 165 110 150 330 170 ...
Total Revenue by Pack Size:
merged_data %>%
group_by(pack_size) %>%
summarise(total_revenue = sum(revenue)) %>%
arrange(desc(total_revenue)) %>%
ggplot(aes(reorder(factor(pack_size), total_revenue), y = total_revenue, fill = pack_size)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = scales::comma) + # Remove scientific notation
theme_dark() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Pack Size",
y = "Total Revenue",
title = "Total Revenue by Pack Size (Ordered by Revenue)",
caption = "Data Analyst : JP")

• Pack sizes 175 and 150 yield the highest revenue,
while sizes 125 and 220 generate the lowest,
showing significant revenue variation across pack
sizes.
Machine Learning
Transformed Categorical Variables for Model
Fitting:
model_data <- merged_data %>%
mutate(pack_size = as.factor(pack_size),
store_id = as.factor(store_id),
brand = as.factor(brand),
month_year = as.factor(month_year),
product_id = as.factor(product_id),
life_stage = as.factor(life_stage),
tier = as.factor(tier))
str(model_data)
## 'data.frame': 264833 obs. of 13 variables:
## $ card_id : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ life_stage : Factor w/ 7 levels "MIDAGE SINGLES/COUPLES",..: 7 7 6 6 4 1 7 7 2 7 ...
## $ tier : Factor w/ 3 levels "Budget","Mainstream",..: 3 2 1 1 2 2 1 1 3 2 ...
## $ date : Date, format: "2018-10-17" "2018-09-16" ...
## $ store_id : Factor w/ 272 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ txn_id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ product_id : Factor w/ 114 levels "1","2","3","4",..: 5 58 52 106 96 86 49 10 20 51 ...
## $ product_name: chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ quantity : num 2 1 1 1 1 1 1 1 1 2 ...
## $ revenue : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
## $ month_year : Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
## $ brand : Factor w/ 29 levels "Burger","CCs",..: 14 18 9 14 29 3 11 19 7 7 ...
## $ pack_size : Factor w/ 21 levels "70","90","110",..: 11 7 15 11 8 9 3 7 20 10 ...
Data Preparation: Splitting:
model_data1 <- merged_data %>%
mutate(pack_size = as.factor(pack_size),
store_id = as.factor(store_id),
brand = as.factor(brand),
month_year = as.factor(month_year),
product_id = as.factor(product_id),
life_stage = as.factor(life_stage),
tier = as.factor(tier))
set.seed(123)
sample_data1 <- sample(nrow(model_data1), size = floor(0.75 * nrow(model_data1)))
train_data1 <- model_data1[sample_data1, ]
test_data1 <- model_data1[-sample_data1, ]
Linear Regression Model Training and Evaluation:
# Trained the linear regression model on the train_data:
lm_model <- lm(revenue ~ (brand + quantity) ^ 2 + pack_size, data = train_data1)
# Made predictions on the test_data:
predictions <- predict(lm_model, newdata = test_data1)
# Calculated validation metrics on the test_data:
actuals1 <- test_data1$revenue
residuals1 <- actuals1 - predictions
# Calculated R-squared and RMSE:
rsquared1 <- 1 - sum(residuals1^2) / sum((actuals1 - mean(actuals1))^2)
rmse1 <- sqrt(mean(residuals1^2))
# Print results:
print(paste("R-squared:", rsquared1))
## [1] "R-squared: 0.979241536717138"
print(paste("RMSE:", rmse1))
## [1] "RMSE: 0.363925484510656"
• The Linear Regression model achieved an
R-squared of 0.979, explaining 97.9% of the variance in
revenue, with a low RMSE of 0.364, demonstrating
accurate and reliable predictions.
Linear Regression Model Summary:
# Tidy coefficients
tidy_summary1 <- tidy(lm_model)
print(tidy_summary1)
## # A tibble: 78 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.28e-11 0.0503 2.55e-10 1.00
## 2 brandCCs 5.18e+ 0 0.0602 8.61e+ 1 0
## 3 brandCheetos 6.45e+ 0 0.0647 9.97e+ 1 0
## 4 brandCheezels 3.58e+ 0 0.0612 5.85e+ 1 0
## 5 brandCobs -2.66e+ 0 0.0576 -4.61e+ 1 0
## 6 brandDorito 2.96e- 1 0.0682 4.34e+ 0 0.0000141
## 7 brandDoritos 5.60e+ 0 0.0532 1.05e+ 2 0
## 8 brandFrench 5.18e+ 0 0.0797 6.50e+ 1 0
## 9 brandGrain 1.36e- 1 0.0590 2.31e+ 0 0.0209
## 10 brandGrnWves -1.63e-11 0.0752 -2.16e-10 1.00
## # ℹ 68 more rows
glance_summary1 <- glance(lm_model)
print(glance_summary1)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.980 0.980 0.361 136048. 0 70 -79588. 159320. 160055.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
anova(lm_model)
## Analysis of Variance Table
##
## Response: revenue
## Df Sum Sq Mean Sq F value Pr(>F)
## brand 28 648431 23158 177412.0 < 2.2e-16 ***
## quantity 1 333729 333729 2556652.3 < 2.2e-16 ***
## pack_size 13 240365 18490 141646.3 < 2.2e-16 ***
## brand:quantity 28 20598 736 5635.7 < 2.2e-16 ***
## Residuals 198553 25918 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
• These results demonstrate the strong predictive power of
the selected features and the model’s effectiveness in
uncovering revenue drivers.
Variable Importance in Linear Regression Model:
top_vars1 <- varImp(lm_model)
tidy_vars1 <- top_vars1 %>% head(20)
print(tidy_vars1)
## Overall
## brandCCs 8.609736e+01
## brandCheetos 9.971482e+01
## brandCheezels 5.847882e+01
## brandCobs 4.612246e+01
## brandDorito 4.342043e+00
## brandDoritos 1.053334e+02
## brandFrench 6.504406e+01
## brandGrain 2.310227e+00
## brandGrnWves 2.162047e-10
## brandInfuzions 3.792556e+01
## brandInfzns 3.905534e+01
## brandKettle 1.145380e+02
## brandNatural 8.937486e+01
## brandNCC 6.821387e+01
## brandOld 1.444717e+02
## brandPringles 5.680685e-01
## brandRed 1.102868e+02
## brandRRD 1.146858e+02
## brandSmith 9.967471e+01
## brandSmiths 7.650353e+01
• The variable importance table provides a
comprehensive view of how different categories (top 20 shown)
influence revenue.
Residual Analysis:
plot(lm_model)




• The model fits reasonably well, with residuals
showing a largely random scatter. However, minor issues
with variance and normality suggest exploring
transformations or advanced models like Random Forest for
improved accuracy.
ANOVA and Post Hoc Analysis (Tukey Test):
model_data2 <- merged_data %>%
mutate(pack_size = as.factor(pack_size),
store_id = as.factor(store_id),
brand = as.factor(brand),
month_year = as.factor(month_year),
product_id = as.factor(product_id),
life_stage = as.factor(life_stage),
tier = as.factor(tier),
quantity = as.factor(quantity))
aov_model1 <- aov(revenue ~ brand + quantity + pack_size, data = model_data2)
tukey_summary <- tidy(TukeyHSD(aov_model1))
print(tukey_summary)
## # A tibble: 626 × 7
## term contrast null.value estimate conf.low conf.high adj.p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 brand CCs-Burger 0 -0.395 -0.448 -0.342 0
## 2 brand Cheetos-Burger 0 1.40 1.34 1.46 0
## 3 brand Cheezels-Burger 0 4.33 4.28 4.38 0
## 4 brand Cobs-Burger 0 2.91 2.86 2.96 0
## 5 brand Dorito-Burger 0 7.90 7.85 7.96 0
## 6 brand Doritos-Burger 0 3.71 3.66 3.75 0
## 7 brand French-Burger 0 1.22 1.16 1.29 0
## 8 brand Grain-Burger 0 2.50 2.45 2.55 0
## 9 brand GrnWves-Burger 0 1.47 1.40 1.53 0
## 10 brand Infuzions-Burger 0 2.53 2.48 2.58 0
## # ℹ 616 more rows
TukeyHSD Pairwise Comparisons:
tukey_results <- TukeyHSD(aov_model1)
plot(tukey_results, las = 1, col = "blue", cex.axis = 0.8)



• Pack Size: Not all pack sizes have the same
impact on the revenue.
• Quantity: The effect of quantity
on the revenue varies across different levels.
• Brand: Different brands exhibit
varying levels of influence on the revenue.
Data Splitting: Training and Testing Sets:
set.seed(123)
sample_data <- sample(nrow(model_data), size = floor(0.75 * nrow(model_data)))
train_data <- model_data[sample_data, ]
test_data <- model_data[-sample_data, ]
Random Forest Model Training and Performance
Evaluation:
# Trained Random Forest model:
rf_model <- randomForest(revenue ~ brand + pack_size + quantity ,
data = train_data, ntree = 100, importance = TRUE)
# Made predictions on the test_data:
rf_predictions <- predict(rf_model, newdata = test_data)
# Calculated validation metrics on the test_data:
actuals <- test_data$revenue
residuals <- actuals - rf_predictions
# Calculated R-squared and RMSE:
rsquared_rf <- 1 - sum(residuals^2) / sum((actuals - mean(actuals))^2)
rmse_rf <- sqrt(mean(residuals^2))
# Print results
cat("Random Forest R-squared:", rsquared_rf, "\n")
## Random Forest R-squared: 0.9790648
cat("Random Forest RMSE:", rmse_rf, "\n")
## Random Forest RMSE: 0.3654712
• The Random Forest model achieved an R-squared of
0.979, explaining 97.9% of the variance in revenue,
with a low RMSE of 0.365, demonstrating accurate and
reliable predictions.
Random Forest Model Summary:
print(rf_model)
##
## Call:
## randomForest(formula = revenue ~ brand + pack_size + quantity, data = train_data, ntree = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.1505496
## % Var explained: 97.64
• The random forest model with 100 trees explains
97.64% of the variance in revenue, with a mean squared residual of
0.15.
Feature Importance Analysis:
importance(rf_model)
## %IncMSE IncNodePurity
## brand 45.34634 529829.2
## pack_size 30.25738 294077.5
## quantity 84.35584 322880.8
• Feature importance analysis identified quantity as
the most significant predictor (%IncMSE: 84.36),
followed by brand (45.35) and pack_size (30.26). These
findings underscore the critical impact of quantity on
revenue predictions.
Variable Importance Analysis with Random Forest:
varImp(rf_model)
## Overall
## brand 45.34634
## pack_size 30.25738
## quantity 84.35584
• The variable importance table provides a
comprehensive view of how different categories
influence revenue.
Visualizing Feature Importance with Random
Forest:
varImpPlot(rf_model, main = "Random Forest Feature Importance")

• The plot suggests that quantity is the most
important predictor, followed by brand and then
pack_size.
Created a Data Frame for Actuals and
Predictions:
results <- data.frame(
Actuals = actuals,
Predictions = rf_predictions
)
Scatterplot for Actuals vs Predictions:
results %>%
ggplot(aes(Actuals, Predictions)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
labs(x = "Actual Revenue",
y = "Predicted Revenue",
title = "Actual vs. Predicted Revenue",
caption = "Data Analyst : JP") +
annotate(
"text", x = min(actuals), y = max(rf_predictions),
label = paste("R-squared:", round(rsquared_rf, 2), "\nRMSE:", round(rmse_rf, 2)),
hjust = 0, vjust = 1, size = 4, color = "darkgreen"
)

Model Performance as Data Frame:
model_performance <- data.frame(
Model = c("Random Forest", "Linear Regression"),
R_squared = c(rsquared_rf, rsquared1),
RMSE = c(rmse_rf, rmse1))
Bar Chart for R-squared Comparison:
model_performance %>%
ggplot(aes(Model, R_squared, fill = Model)) +
geom_bar(stat = "identity", color = "black") +
theme_minimal() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold"),
legend.position = "none") +
labs(x = "Model",
y = "R-squared",
title = "R-squared Comparison Between Models",
caption = "Data Analyst : JP")

Bar Chart for RMSE Comparison:
model_performance %>%
ggplot(aes(Model, RMSE, fill = Model)) +
geom_bar(stat = "identity", color = "black") +
theme_minimal() +
theme(axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.title = element_text(size = 16, face = "bold"),
legend.position = "none") +
labs(x = "Model",
y = "RMSE",
title = "RMSE Comparison Between Models",
caption = "Data Analyst : JP")

Saving The Best Model:
saveRDS(rf_model, file = "best_rf_model.rds")
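The persisted model can be restored in a later session with readRDS(); a usage sketch:
# Reload the saved model and reuse it for scoring
best_rf_model <- readRDS("best_rf_model.rds")
new_predictions <- predict(best_rf_model, newdata = test_data)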
Later
We are tasked with evaluating the performance of a trial
conducted across three stores (Store 77, Store 86, and Store 88) to
assess whether the trial had a significant impact on sales and customer
behavior. Specifically, we aim to analyze whether the changes observed
in the trial stores can be attributed to factors such as increased
customer traffic, higher transaction frequency, or other underlying
causes.
# Load data
qvi <- read.csv("QVI_data.csv")
# Check structure of data
str(qvi)
## 'data.frame': 264834 obs. of 12 variables:
## $ LYLTY_CARD_NBR : int 1000 1002 1003 1003 1004 1005 1007 1007 1009 1010 ...
## $ DATE : chr "2018-10-17" "2018-09-16" "2019-03-07" "2019-03-08" ...
## $ STORE_NBR : int 1 1 1 1 1 1 1 1 1 1 ...
## $ TXN_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PROD_NBR : int 5 58 52 106 96 86 49 10 20 51 ...
## $ PROD_NAME : chr "Natural Chip Compny SeaSalt175g" "Red Rock Deli Chikn&Garlic Aioli 150g" "Grain Waves Sour Cream&Chives 210G" "Natural ChipCo Hony Soy Chckn175g" ...
## $ PROD_QTY : int 2 1 1 1 1 1 1 1 1 2 ...
## $ TOT_SALES : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
## $ PACK_SIZE : int 175 150 210 175 160 165 110 150 330 170 ...
## $ BRAND : chr "NATURAL" "RRD" "GRNWVES" "NATURAL" ...
## $ LIFESTAGE : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "YOUNG FAMILIES" ...
## $ PREMIUM_CUSTOMER: chr "Premium" "Mainstream" "Budget" "Budget" ...
# Remove duplicates if any
qvi <- qvi %>%
distinct()
# Convert DATE to Date format if it's not already
qvi$DATE <- as.Date(qvi$DATE)
# Check for duplicates again and print them
qvi %>%
filter(duplicated(.)) %>%
print()
## [1] LYLTY_CARD_NBR DATE STORE_NBR TXN_ID
## [5] PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## [9] PACK_SIZE BRAND LIFESTAGE PREMIUM_CUSTOMER
## <0 rows> (or 0-length row.names)
# Define trial store IDs
trial_stores <- c(77, 86, 88)
# Define the trial period
trial_start <- as.Date("2018-07-01")
trial_end <- as.Date("2019-06-30")
# Step 1: Calculate total sales, transactions, and average transactions for trial stores
trial_sales <- qvi %>%
filter(STORE_NBR %in% trial_stores, as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end) %>%
group_by(STORE_NBR, DATE) %>%
summarise(
trial_sales = sum(TOT_SALES),
trial_txn_count = n(),
avg_txn_per_customer = trial_sales / trial_txn_count,
.groups = "drop"
)
# Step 2: Calculate total sales, transactions, and average transactions for control store candidates
control_candidates <- qvi %>%
filter(!STORE_NBR %in% trial_stores, as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end) %>%
group_by(STORE_NBR, DATE) %>%
summarise(
store_sales = sum(TOT_SALES),
txn_count = n(),
avg_txn_per_customer = store_sales / txn_count,
.groups = "drop"
)
# Step 3: Identify the best control store for each trial store using correlation
find_best_control <- function(trial_store_id) {
# Extract sales data for the selected trial store
trial_store_sales <- trial_sales %>%
filter(STORE_NBR == trial_store_id) %>%
select(DATE, trial_sales)
# Compare correlation with each control store
correlations <- control_candidates %>%
left_join(trial_store_sales, by = "DATE") %>%
group_by(STORE_NBR) %>%
summarise(
correlation = ifelse(sum(!is.na(store_sales) & !is.na(trial_sales)) > 1,
{
# Check if standard deviation is zero
if (sd(store_sales, na.rm = TRUE) == 0 | sd(trial_sales, na.rm = TRUE) == 0) {
NA # Set correlation to NA if standard deviation is zero
} else {
cor(store_sales, trial_sales, use = "complete.obs")
}
},
NA),
.groups = "drop"
) %>%
arrange(desc(correlation))
# Return the best matching control store, excluding any with NA correlation
return(correlations %>% filter(!is.na(correlation)) %>% slice(1))
}
# Find the best control stores for trial stores 77, 86, and 88
best_control_77 <- find_best_control(77)
best_control_86 <- find_best_control(86)
best_control_88 <- find_best_control(88)
# Display the best control stores for each trial store
# Caution: store 85 appears among the lowest-volume stores above, so its
# correlation of 1 with trial store 86 rests on very few overlapping days
best_control_77
## # A tibble: 1 × 2
## STORE_NBR correlation
## <int> <dbl>
## 1 193 0.374
best_control_86
## # A tibble: 1 × 2
## STORE_NBR correlation
## <int> <dbl>
## 1 85 1
best_control_88
## # A tibble: 1 × 2
## STORE_NBR correlation
## <int> <dbl>
## 1 204 0.191
# Step 4: Combine the trial store data with its best control store data
combined_data_77 <- qvi %>%
filter(STORE_NBR %in% c(77, best_control_77$STORE_NBR)) %>%
filter(as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end)
combined_data_86 <- qvi %>%
filter(STORE_NBR %in% c(86, best_control_86$STORE_NBR)) %>%
filter(as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end)
combined_data_88 <- qvi %>%
filter(STORE_NBR %in% c(88, best_control_88$STORE_NBR)) %>%
filter(as.Date(DATE) >= trial_start & as.Date(DATE) <= trial_end)
# Step 5: Perform the t-test for each trial store against its best control store
t_test_results <- function(trial_data) {
t_test_result <- t.test(TOT_SALES ~ STORE_NBR, data = trial_data)
return(t_test_result$p.value)
}
# Get t-test results for each trial store
p_value_77 <- t_test_results(combined_data_77)
p_value_86 <- t_test_results(combined_data_86)
p_value_88 <- t_test_results(combined_data_88)
# Display p-values for each trial store
p_value_77
## [1] 0.6068824
p_value_86
## [1] 0.3906579
p_value_88
## [1] 5.496461e-09
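# Optional: collect the trial-vs-control p-values into one table for easier
# reading (a sketch; tibble() comes from the tidyverse assumed loaded earlier)
tibble(
  trial_store   = c(77, 86, 88),
  control_store = c(best_control_77$STORE_NBR,
                    best_control_86$STORE_NBR,
                    best_control_88$STORE_NBR),
  p_value       = c(p_value_77, p_value_86, p_value_88)
)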
# Aggregate trial_sales and control_candidates by STORE_NBR and DATE
trial_sales_agg <- trial_sales %>%
group_by(STORE_NBR, DATE) %>%
summarise(
trial_sales = sum(trial_sales),
trial_txn_count = sum(trial_txn_count),
avg_txn_per_customer_trial = mean(avg_txn_per_customer),
.groups = "drop"
)
control_candidates_agg <- control_candidates %>%
group_by(STORE_NBR, DATE) %>%
summarise(
store_sales = sum(store_sales),
txn_count = sum(txn_count),
avg_txn_per_customer_control = mean(avg_txn_per_customer),
.groups = "drop"
)
# Step 6: Sales change calculation (including a correction for avg_txn_per_customer)
# Note: joining by DATE alone pairs each trial-store day with every control
# store, which is why rows repeat in the preview below. Also note that
# .groups is a summarise() argument; inside mutate() it would create a
# literal .groups column, so it is omitted here.
sales_change <- trial_sales_agg %>%
left_join(control_candidates_agg, by = "DATE", suffix = c("_trial", "_control")) %>%
mutate(
change_in_customers = trial_txn_count - txn_count,
change_in_avg_txn_per_customer = avg_txn_per_customer_trial - avg_txn_per_customer_control
)
# Display sales change for validation
head(sales_change)
## # A tibble: 6 × 11
## STORE_NBR_trial DATE trial_sales trial_txn_count avg_txn_per_customer_…¹
## <int> <date> <dbl> <int> <dbl>
## 1 77 2018-07-01 15.6 2 7.8
## 2 77 2018-07-01 15.6 2 7.8
## 3 77 2018-07-01 15.6 2 7.8
## 4 77 2018-07-01 15.6 2 7.8
## 5 77 2018-07-01 15.6 2 7.8
## 6 77 2018-07-01 15.6 2 7.8
## # ℹ abbreviated name: ¹avg_txn_per_customer_trial
## # ℹ 6 more variables: STORE_NBR_control <int>, store_sales <dbl>,
## #   txn_count <int>, avg_txn_per_customer_control <dbl>,
## #   change_in_customers <int>, change_in_avg_txn_per_customer <dbl>
# Step 7: Visualization of sales trends for trial stores
ggplot(trial_sales, aes(x = DATE, y = trial_sales, color = factor(STORE_NBR))) +
geom_line() +
labs(title = "Sales Trends for Trial Stores", x = "Date", y = "Total Sales",
color = "Store Number",
caption = "Data Analyst : JP")
