# Do not modify this chunk
library(tidyverse)
library(readxl) # package that will help you to load MS Excel data in R.
library(flextable)
library(ggplot2) # will help with advanced plotting if necessary.
In this assignment, we perform basic data exploration and
visualization on marketing data. This is to get insights on how
customers behave to what FreshDirect company offers.
FreshDirect is the leader in online grocery delivery. Their marketing
data captures customer information such as demographics, transaction
behavior, and ordering patterns to enable loyalty analysis,
segmentation, and predictive modeling.
Your task is to load the data using the package readxl
allowing you to load excel files. Identify the function from the package
with specific chosen parameters from it to get rid of some big issues
that may come with data importing. Then you also need to perform some
data cleaning for some computations.
Below is the data description.
Column Name | Description |
|---|---|
LOYALTY_SEGMENT | Shopper classification based on purchase frequency (e.g., Weekly, Bi-Weekly, Monthly) |
AGE | Customer's age |
INCOME | Household income (may be grouped into ranges) |
GENDER | Gender of the primary shopper |
ZIP_CODE | Residential ZIP code of customer |
DMA | Designated Market Area (media/advertising region) |
GEOGRAPHY | Broader geographic grouping |
ACQUIRED_DATE | Date the customer first registered or became active |
12 Mo. DELIVERY_FEE_PAID | Total delivery fees paid in the last 12 months |
12 Mo. DELIVERYPASS_USED | Number of times DeliveryPass subscription was used in 12 months |
12 Mo. DISCOUNT_AMOUNT | Total discounts applied in 12 months |
12 Mo. Orders | Total number of orders in the past 12 months |
12 Mo. ORDERS_W_PROMO | Number of orders that included a promo in 12 months |
12 Mo. Sales | Total sales generated by customer in 12 months |
24 Mo. DELIVERY_FEE_PAID | Total delivery fees paid in the last 24 months |
24 Mo. DELIVERYPASS_USED | Number of times DeliveryPass was used in 24 months |
24 Mo. DISCOUNT_AMOUNT | Total discounts applied in 24 months |
24 Mo. Orders | Total number of orders in the past 24 months |
24 Mo. Orders w. Promo | Number of orders with a promo in 24 months |
24 Mo. Sales | Total sales generated by customer in 24 months |
SUNDAY ORDERS 12 MO. | Number of orders placed on Sundays (12 months) |
MONDAY ORDERS 12 MO. | Number of orders placed on Mondays (12 months) |
TUESDAY ORDERS 12 MO. | Number of orders placed on Tuesdays (12 months) |
WEDNESDAY ORDERS 12 MO. | Number of orders placed on Wednesdays (12 months) |
THURSDAY ORDERS 12 MO. | Number of orders placed on Thursdays (12 months) |
FRIDAY ORDERS 12 MO. | Number of orders placed on Fridays (12 months) |
SATURDAY ORDERS 12 MO. | Number of orders placed on Saturdays (12 months) |
Use R to attempt each of the following questions. We recommend you to write your interpretation in your English.
# Import data in a variable named df
df <- read_excel("FD_data.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
#View(df)
#Our data has got a lot of missing value indicators that is NA strings. so we need to do some pre_cleaning when loading
# Import data in a variable named df
df <- read_excel("FD_data.xlsx",skip=1 ,na = c("", "NA", "N/A", "NULL", "#N/A", " - ","-"))
print(head(df))
## # A tibble: 6 × 27
## LOYALTY_SEGMENT AGE INCOME GENDER ZIP_CODE DMA GEOGRAPHY ACQUIRED_DATE
## <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 2. Bi-Weekly Shopp… 60 $150,… F 10465 NY-NJ Bronx 6/4/12 0:00
## 2 5. Infrequent Shop… 53 $20,0… F 10462 NY-NJ Bronx 7/1/12 0:00
## 3 3. Every Three Wee… NA <NA> <NA> 10462 NY-NJ Bronx 7/10/12 0:00
## 4 3. Every Three Wee… 29 $150,… F 11206 NY-NJ Brooklyn 1/7/12 0:00
## 5 3. Every Three Wee… NA <NA> <NA> 11231 NY-NJ Brooklyn 1/8/12 0:00
## 6 2. Bi-Weekly Shopp… 54 $30,0… F 11205 NY-NJ Brooklyn 3/27/06 0:00
## # ℹ 19 more variables: `12 Mo. DELIVERY_FEE_PAID` <chr>,
## # `12 Mo. DELIVERYPASS_USED` <dbl>, `12 Mo. DISCOUNT_AMOUNT` <chr>,
## # `12 Mo. Orders` <dbl>, `12 Mo. ORDERS_W_PROMO` <dbl>, `12 Mo. Sales` <chr>,
## # `24 Mo. DELIVERY_FEE_PAID` <chr>, `24 Mo. DELIVERYPASS_USED` <dbl>,
## # `24 Mo. DISCOUNT_AMOUNT` <chr>, `24 Mo. Orders` <dbl>,
## # `24 Mo. Orders w. Promo` <dbl>, `24 Mo. Sales` <chr>,
## # `SUNDAY ORDERS 12 MO.` <dbl>, `MONDAY ORDERS 12 MO.` <dbl>, …
#----
check_missing_values <- function(df, print_all = TRUE) {
#calculating missing counts
missing_counts <- colSums(is.na(df))
# finding column with highest missing count
column_highest_missing <- names(which.max(missing_counts))
max_missing <- max(missing_counts)
total_rows <- nrow(df)
# print summary
cat("=== MISSING VALUE ANALYSIS ===\n")
cat("Column with highest missing count:", column_highest_missing,
"with", max_missing, "missing values (",
round(max_missing/total_rows * 100, 2), "%)\n\n")
# geneerate data frame
missing_counts_df <- data.frame(
Column = names(missing_counts),
Missing_Count = missing_counts,
Percentage = round(missing_counts/total_rows * 100, 2)
) %>%
arrange(desc(Missing_Count))
#print all columns
if(print_all) {
cat("Missing values by column (sorted by highest missing):\n")
print(missing_counts_df)
cat("\nTotal missing values in dataset:", sum(missing_counts), "\n")
cat("Total complete cases:", sum(complete.cases(df)), "out of", total_rows, "\n")
}
# Return the data frame invisibly for further use
invisible(missing_counts_df)
}
check_missing_values(df)
## === MISSING VALUE ANALYSIS ===
## Column with highest missing count: GENDER with 659 missing values ( 21.97 %)
##
## Missing values by column (sorted by highest missing):
## Column Missing_Count Percentage
## GENDER GENDER 659 21.97
## THURSDAY ORDERS 12 MO. THURSDAY ORDERS 12 MO. 656 21.87
## SATURDAY ORDERS 12 MO. SATURDAY ORDERS 12 MO. 622 20.73
## WEDNESDAY ORDERS 12 MO. WEDNESDAY ORDERS 12 MO. 574 19.13
## FRIDAY ORDERS 12 MO. FRIDAY ORDERS 12 MO. 562 18.73
## TUESDAY ORDERS 12 MO. TUESDAY ORDERS 12 MO. 548 18.27
## AGE AGE 543 18.10
## INCOME INCOME 543 18.10
## MONDAY ORDERS 12 MO. MONDAY ORDERS 12 MO. 484 16.13
## SUNDAY ORDERS 12 MO. SUNDAY ORDERS 12 MO. 447 14.90
## 12 Mo. ORDERS_W_PROMO 12 Mo. ORDERS_W_PROMO 239 7.97
## 24 Mo. Orders w. Promo 24 Mo. Orders w. Promo 239 7.97
## LOYALTY_SEGMENT LOYALTY_SEGMENT 0 0.00
## ZIP_CODE ZIP_CODE 0 0.00
## DMA DMA 0 0.00
## GEOGRAPHY GEOGRAPHY 0 0.00
## ACQUIRED_DATE ACQUIRED_DATE 0 0.00
## 12 Mo. DELIVERY_FEE_PAID 12 Mo. DELIVERY_FEE_PAID 0 0.00
## 12 Mo. DELIVERYPASS_USED 12 Mo. DELIVERYPASS_USED 0 0.00
## 12 Mo. DISCOUNT_AMOUNT 12 Mo. DISCOUNT_AMOUNT 0 0.00
## 12 Mo. Orders 12 Mo. Orders 0 0.00
## 12 Mo. Sales 12 Mo. Sales 0 0.00
## 24 Mo. DELIVERY_FEE_PAID 24 Mo. DELIVERY_FEE_PAID 0 0.00
## 24 Mo. DELIVERYPASS_USED 24 Mo. DELIVERYPASS_USED 0 0.00
## 24 Mo. DISCOUNT_AMOUNT 24 Mo. DISCOUNT_AMOUNT 0 0.00
## 24 Mo. Orders 24 Mo. Orders 0 0.00
## 24 Mo. Sales 24 Mo. Sales 0 0.00
##
## Total missing values in dataset: 6116
## Total complete cases: 996 out of 3000
#----
#View(df)
#head(df)
# Check the dimension of df
print(cat("Dataset dimensions:",dim(df), "\n\n"))
## Dataset dimensions: 3000 27
##
## NULL
# Display the data structure of df
print(cat("Data structure:\n"))
## Data structure:
## NULL
str(df)
## tibble [3,000 × 27] (S3: tbl_df/tbl/data.frame)
## $ LOYALTY_SEGMENT : chr [1:3000] "2. Bi-Weekly Shoppers" "5. Infrequent Shoppers" "3. Every Three Week Shoppers" "3. Every Three Week Shoppers" ...
## $ AGE : num [1:3000] 60 53 NA 29 NA 54 40 68 33 NA ...
## $ INCOME : chr [1:3000] "$150,000" "$20,000" NA "$150,000" ...
## $ GENDER : chr [1:3000] "F" "F" NA "F" ...
## $ ZIP_CODE : num [1:3000] 10465 10462 10462 11206 11231 ...
## $ DMA : chr [1:3000] "NY-NJ" "NY-NJ" "NY-NJ" "NY-NJ" ...
## $ GEOGRAPHY : chr [1:3000] "Bronx" "Bronx" "Bronx" "Brooklyn" ...
## $ ACQUIRED_DATE : chr [1:3000] "6/4/12 0:00" "7/1/12 0:00" "7/10/12 0:00" "1/7/12 0:00" ...
## $ 12 Mo. DELIVERY_FEE_PAID: chr [1:3000] "$0.00" "$5.99" "$5.99" "$0.00" ...
## $ 12 Mo. DELIVERYPASS_USED: num [1:3000] 51 0 12 13 23 0 0 0 0 0 ...
## $ 12 Mo. DISCOUNT_AMOUNT : chr [1:3000] "$98.65" "$0.00" "$139.81" "$20.96" ...
## $ 12 Mo. Orders : num [1:3000] 57 4 18 15 25 31 8 12 8 3 ...
## $ 12 Mo. ORDERS_W_PROMO : num [1:3000] 19 NA 16 7 4 2 5 1 8 NA ...
## $ 12 Mo. Sales : chr [1:3000] "$25,195" "$84" "$1,496" "$1,092" ...
## $ 24 Mo. DELIVERY_FEE_PAID: chr [1:3000] "$0.00" "$5.99" "$17.97" "$17.97" ...
## $ 24 Mo. DELIVERYPASS_USED: num [1:3000] 103 0 26 13 46 0 0 0 0 0 ...
## $ 24 Mo. DISCOUNT_AMOUNT : chr [1:3000] "$128.16" "$0.00" "$257.65" "$50.92" ...
## $ 24 Mo. Orders : num [1:3000] 105 1 39 21 46 35 9 24 18 3 ...
## $ 24 Mo. Orders w. Promo : num [1:3000] 25 NA 33 14 8 9 6 3 15 NA ...
## $ 24 Mo. Sales : chr [1:3000] "$37,999" "$130" "$3,273" "$1,823" ...
## $ SUNDAY ORDERS 12 MO. : num [1:3000] NA 2 2 1 NA 4 1 6 5 NA ...
## $ MONDAY ORDERS 12 MO. : num [1:3000] 50 1 NA 1 3 3 1 1 NA NA ...
## $ TUESDAY ORDERS 12 MO. : num [1:3000] 2 NA 1 2 5 NA 3 NA 1 1 ...
## $ WEDNESDAY ORDERS 12 MO. : num [1:3000] NA NA 4 4 4 1 NA 1 NA NA ...
## $ THURSDAY ORDERS 12 MO. : num [1:3000] NA NA 4 1 3 3 1 NA NA 1 ...
## $ FRIDAY ORDERS 12 MO. : num [1:3000] 2 NA 5 3 4 1 1 NA 1 1 ...
## $ SATURDAY ORDERS 12 MO. : num [1:3000] NA 1 2 1 4 19 1 4 1 NA ...
cat("\n")
Point out any issue with the data (asterix serve as bullet points in markdown. Add * as much as possible.)
Multiple representations of missing values: #N/A, - (with spaces) for example the AGE and INCOME columns should benumbers but contain #N/A text, preventing them from being treated as numeric. Column names contained spaces, periods, and special characters Dollar amounts included $ symbol and commas (e.g., $25,195) that is they are stored as text and canot be used for mathematical operations One record showed ($1) instead of -1 Column names with time periods (12 Mo., 24 Mo.) created parsing challenges Activity (DOW) Dash for zero/missing SUNDAY ORDERS 12 MO.: - Treated as text, must be converted to 0 for analysis.
df$INCOME <- gsub("\\(\\$1\\)", "1", df$INCOME)
currency_columns <- c(
"12 Mo. DELIVERY_FEE_PAID", "12 Mo. DISCOUNT_AMOUNT", "12 Mo. Sales",
"24 Mo. DELIVERY_FEE_PAID", "24 Mo. DISCOUNT_AMOUNT", "24 Mo. Sales",
"INCOME"
)
clean_currency <- function(x) {
x <- gsub("\\$", "", x)
x <- gsub(",", "", x)
x <- gsub("^\\((.*)\\)$", "-\\1", x)
x <- trimws(x)
x <- ifelse(x %in% c("", "NA", "N/A", "NULL", "#N/A", "-", " - "), NA, x)
as.numeric(x)
}
df[currency_columns] <- lapply(df[currency_columns], clean_currency)
#View(df)
reason to change the format to is to convert our currency to numeric for calcualtions
# Function to replace NA with 0 for count columns
replace_na_counts <- function(df) {
count_patterns <- c(
"DELIVERYPASS_USED", "Orders", "ORDERS_W_PROMO",
"SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY"
)
# finding all columns that match count patterns
count_cols <- names(df)[grepl(paste(count_patterns, collapse = "|"), names(df))]
cat("Replacing NA with 0 in columns:", paste(count_cols, collapse = ", "), "\n")
# NA with 0
df <- df %>%
mutate(across(
any_of(count_cols),
~ ifelse(is.na(.), 0, .)
))
return(df)
}
df <- replace_na_counts(df)
## Replacing NA with 0 in columns: 12 Mo. DELIVERYPASS_USED, 12 Mo. Orders, 12 Mo. ORDERS_W_PROMO, 24 Mo. DELIVERYPASS_USED, 24 Mo. Orders, 24 Mo. Orders w. Promo, SUNDAY ORDERS 12 MO., MONDAY ORDERS 12 MO., TUESDAY ORDERS 12 MO., WEDNESDAY ORDERS 12 MO., THURSDAY ORDERS 12 MO., FRIDAY ORDERS 12 MO., SATURDAY ORDERS 12 MO.
#View(df)
The intuition for converting all NAs to 0 is that an NA in the order data simply indicates ZERO order was made.
# Combined Demographic Data Imputation Function
impute_demographics <- function(df) {
cat("=== HANDLING MISSING INCOME (Segment-Based Imputation) ===\n")
segment_medians <- df %>%
group_by(LOYALTY_SEGMENT, GEOGRAPHY) %>%
summarise(
SEGMENT_MEDIAN_INCOME = median(INCOME, na.rm = TRUE),
SEGMENT_COUNT = n(),
.groups = 'drop'
)
cat("Income medians by segment:\n")
print(segment_medians)
cat("\n")
# impute all demographic variables
df_imputed <- df %>%
# Join segment medians for income imputation
left_join(segment_medians, by = c("LOYALTY_SEGMENT", "GEOGRAPHY")) %>%
# Impute all demographics in one mutate call
mutate(
# INCOME: Segment-based imputation with fallback
INCOME = ifelse(is.na(INCOME), SEGMENT_MEDIAN_INCOME, INCOME),
INCOME = ifelse(is.na(INCOME), median(INCOME, na.rm = TRUE), INCOME),
# median imputation
AGE = ifelse(is.na(AGE), median(AGE, na.rm = TRUE), AGE),
GENDER = ifelse(is.na(GENDER), "Unknown", GENDER),
) %>%
select(-SEGMENT_MEDIAN_INCOME, -SEGMENT_COUNT)
return(df_imputed)
}
df <- impute_demographics(df)
## === HANDLING MISSING INCOME (Segment-Based Imputation) ===
## Income medians by segment:
## # A tibble: 43 × 4
## LOYALTY_SEGMENT GEOGRAPHY SEGMENT_MEDIAN_INCOME SEGMENT_COUNT
## <chr> <chr> <dbl> <int>
## 1 1. Weekly Shoppers Brooklyn 100000 116
## 2 1. Weekly Shoppers Jersey Shore 100000 9
## 3 1. Weekly Shoppers Manhattan 100000 452
## 4 1. Weekly Shoppers New Jersey 50000 11
## 5 1. Weekly Shoppers New Jersey (Hudson) 37500. 4
## 6 1. Weekly Shoppers Queens 125000 28
## 7 1. Weekly Shoppers Staten Island 150000 2
## 8 1. Weekly Shoppers Westchester/CT 150000 9
## 9 2. Bi-Weekly Shoppers Bronx 150000 1
## 10 2. Bi-Weekly Shoppers Brooklyn 75000 201
## # ℹ 33 more rows
# Verify the results
cat("=== FINAL VALIDATION ===\n")
## === FINAL VALIDATION ===
cat("Missing INCOME values:", sum(is.na(df$INCOME)), "\n")
## Missing INCOME values: 0
cat("Missing AGE values:", sum(is.na(df$AGE)), "\n")
## Missing AGE values: 0
cat("Missing GENDER values:", sum(is.na(df$GENDER)), "\n")
## Missing GENDER values: 0
cat("Any missing demographics?", any(is.na(df$INCOME) | is.na(df$AGE) | is.na(df$GENDER)), "\n")
## Any missing demographics? FALSE
# verify handling
missing_after <- sapply(df, function(x) sum(is.na(x)))
cat("Missing values after handling:", sum(missing_after), "\n")
## Missing values after handling: 0
The reason l used each approach for data cleaning: 1.For Day-of-Week Data (20% missing) Assumption: If order count is missing for a specific day, the customer didn’t order that day Then we set to 0 - this is behaviorally accurate For Demographics (18-22% missing) 1.AGE/INCOME: Numeric variables - imputing with median preserves distribution GENDER: Categorical variable - “Unknown” category maintains transparency For Promotion Data (8% missing) Assumption: Missing promo data means no promotions were used Set to 0 - reasonable for behavioral data And imputation for income is based on segments usually “LOYALTY_SEGMENT”, “GEOGRAPHY” hence this is more suitable than using mean value
*Now l dont have missing values in my dataset and thats awesome we move into analysis
#View(df)
# The definitive way to confirm - check for duplicate combinations of key identifiers
key_identifiers <- c("ZIP_CODE", "AGE", "GENDER", "ACQUIRED_DATE")
duplicate_check <- df %>%
group_by(across(all_of(key_identifiers))) %>%
filter(n() > 1) %>%
nrow()
c
## function (...) .Primitive("c")
if(duplicate_check == 0) {
cat("CONFIRMED: Each row represents a unique customer\n")
unique_customers <- nrow(df)
cat("Number of unique customers:", unique_customers, "\n")
} else {
# Show the duplicate cases
duplicates <- df %>%
group_by(across(all_of(key_identifiers))) %>%
filter(n() > 1) %>%
select(all_of(key_identifiers), LOYALTY_SEGMENT, `12 Mo. Orders`)
print(duplicates)
}
## # A tibble: 8 × 6
## # Groups: ZIP_CODE, AGE, GENDER, ACQUIRED_DATE [4]
## ZIP_CODE AGE GENDER ACQUIRED_DATE LOYALTY_SEGMENT `12 Mo. Orders`
## <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 10021 44 Unknown 2/28/12 0:00 2. Bi-Weekly Shoppers 25
## 2 10021 44 Unknown 2/28/12 0:00 4. Once a Month Shoppers 11
## 3 10021 44 Unknown 11/27/02 0:00 3. Every Three Week Shop… 22
## 4 10021 44 Unknown 11/27/02 0:00 4. Once a Month Shoppers 7
## 5 10029 44 Unknown 4/5/12 0:00 4. Once a Month Shoppers 13
## 6 10003 44 Unknown 4/7/12 0:00 1. Weekly Shoppers 101
## 7 10029 44 Unknown 4/5/12 0:00 4. Once a Month Shoppers 9
## 8 10003 44 Unknown 4/7/12 0:00 2. Bi-Weekly Shoppers 24
# Additional verification - check if behavioral data supports uniqueness
behavioral_uniqueness <- df %>%
summarise(
unique_order_counts = n_distinct(`12 Mo. Orders`),
unique_sales_values = n_distinct(`12 Mo. Sales`)
)
cat("\n=== BEHAVIORAL UNIQUENESS CHECK ===\n")
##
## === BEHAVIORAL UNIQUENESS CHECK ===
cat("Unique order counts:", behavioral_uniqueness$unique_order_counts, "/", nrow(df), "\n")
## Unique order counts: 128 / 3000
cat("Unique sales values:", behavioral_uniqueness$unique_sales_values, "/", nrow(df), "\n")
## Unique sales values: 2485 / 3000
if(behavioral_uniqueness$unique_order_counts == nrow(df)) {
cat("All customers have unique order counts\n")
}
Interpretation: We have multiple customers with the exact same,ZIP_CODE: 10021, 10029, 10003 AGE: 44; GENDER: Unknown; ACQUIRED_DATE: Same dates for multiple customers But they have different: LOYALTY_SEGMENT: Different classifications 12 Mo. Orders: Different order counts This Should not happen if Each Row = Unique Customer Dataset of 3000 records probably represents only about 2788 unique customers, with each customer appearing an average of 2 times with different loyalty segments and order counts.
#View(df)
df$AGE <- as.numeric(df$AGE)
age_summary <- summary(df$AGE)
cat("Age distribution:\n")
## Age distribution:
print(age_summary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 37.00 44.00 45.03 52.00 101.00
ggplot(df, aes(x = AGE)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7, na.rm = TRUE) +
labs(title = "Age Distribution of Customers", x = "Age", y = "Count") +
theme_minimal()
. Interpretation: The histogram shows the age
distribution of customers. Most customers fall between ages 25 and 75,
with a clear peak around 45 years old. There is also a smaller cluster
at very low ages (near 0–10), which may indicate outliers. Overall, the
distribution is right-skewed, meaning there are fewer older customers
compared to younger and middle-aged ones. The sharp spike at 45 shows
that they is quite a larger number of customers at 45.Our maximum age is
101 and minimum age is 0 and average age is 45 years
income_summary <- summary(df$INCOME)
income_standard_deviation <- sd(df$INCOME, na.rm = TRUE)
cat(" The Income distribution is given by:\n")
## The Income distribution is given by:
print(income_summary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 50000 100000 106220 150000 300000
cat("The Standard deviation is given by:", income_standard_deviation, "\n")
## The Standard deviation is given by: 82077.25
Interpretation: Min–Max: Income ranges from 1 to 300,000 — very wide spread.
Median vs Mean: Median (100,000) slightly below mean (106,220) → slight right skew.
Quartiles: 25% earn ≤50,000; 75% earn ≤150,000.
Std. Dev.: 82,077 indicates high variability in income.
df <- df %>%
mutate(GENDER = case_when(
GENDER == "F" ~ "Female",
GENDER == "M" ~ "Male",
TRUE ~ as.character(GENDER)
))
gender_counts <- table(df$GENDER)
cat("Gender distribution:\n")
## Gender distribution:
print(gender_counts)
##
## Female Male Unknown
## 1596 745 659
Interpretation: Female: 1,596 individuals – largest group signifies Females shop more than man
Male: 745 individuals – male group are less than half of ladies it shows that male shop less than man.
Unknown: 659 individuals – significant missing/unspecified gender data they is quite large number of people in the city who prefer to keep their gender private.
geography_counts <- df %>%
count(GEOGRAPHY) %>%
arrange(desc(n))
cat("Geography with highest customer count:", geography_counts$GEOGRAPHY[1], "with", geography_counts$n[1], "customers\n")
## Geography with highest customer count: Manhattan with 2054 customers
ggplot(df, aes(x = GEOGRAPHY, fill = GEOGRAPHY)) +
geom_bar() +
geom_text(stat = 'count', aes(label = ..count..), vjust = -0.5) +
labs(title = "Customer Count by Geography") +
theme_minimal() +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1)
)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Interpretation: Manhattan has the highest customer count of (2054) followed by Brooklyn(699) and lastly Nassau and Philadelphia Area both they have 1
loyalty_counts <- df %>%
count(LOYALTY_SEGMENT) %>%
mutate(percentage = n/sum(n) * 100)
# Bar chart
ggplot(loyalty_counts, aes(x = LOYALTY_SEGMENT, y = n, fill = LOYALTY_SEGMENT)) +
geom_bar(stat = "identity") +
labs(title = "Customers by Loyalty Segment", x = "Loyalty Segment", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
loyalty_counts
Interpretation: Bi-Weekly and Every Three Week Shoppers dominate in numbers and frequency, indicating the bulk of shopping activity happens on a 2–3 week cycle.
Weekly Shoppers also form a significant portion but slightly fewer than Bi-Weekly shoppers.
Once a Month and Infrequent Shoppers contribute less to overall activity, suggesting potential for targeted marketing to increase engagement.
90 Day New Shoppers and No Segment represent negligible activity, possibly new or unclassified customers.
avg_orders_per_customer <- mean(df$`12 Mo. Orders`, na.rm = TRUE)
cat("Average orders per customer:", avg_orders_per_customer, "\n")
## Average orders per customer: 30.68
Interpretation: On average, each customer placed about 31 orders in the past year, indicating relatively frequent engagement. This suggests a high level of customer activity and repeat purchases.
avg_sales_12mo <- mean(df$`12 Mo. Sales`, na.rm = TRUE)
cat("Average sales per customer (12 months):","$",round(avg_sales_12mo, 2), "\n")
## Average sales per customer (12 months): $ 3643.54
Interpretation: On average, each customer contributed $3,643.54 in sales over the past 12 months, indicating a strong revenue contribution per customer.
#Customers who used DeliveryPass at least once
deliverypass_users <- sum(df$`12 Mo. DELIVERYPASS_USED` > 0, na.rm = TRUE)
cat("Customers who used DeliveryPass at least once:", deliverypass_users, "\n")
## Customers who used DeliveryPass at least once: 1585
Interpretation: Out of the total customer base, 1,585 customers engaged with the DeliveryPass subscription at least once, indicating a moderate adoption rate.
df <- df %>%
mutate(
INCOME_GROUP = case_when(
INCOME < 30000 ~ "Lower Income",
INCOME >= 30000 & INCOME < 70000 ~ "Middle Income",
INCOME >= 70000 ~ "Higher Income",
TRUE ~ "Unknown"
),
INCOME_GROUP = factor(INCOME_GROUP,
levels = c("Lower Income", "Middle Income", "Higher Income", "Unknown"))
)
# Correlation analysis
cor_test <- cor.test(df$INCOME, df$`12 Mo. Orders`, use = "complete.obs")
cat("=== CORRELATION ANALYSIS ===\n")
## === CORRELATION ANALYSIS ===
cat("Correlation between income and orders:", round(cor_test$estimate, 3), "\n")
## Correlation between income and orders: 0.055
cat("P-value:", format.pval(cor_test$p.value, digits = 3), "\n")
## P-value: 0.00237
if(cor_test$p.value < 0.05) {
if(cor_test$estimate > 0) {
cat("Significant positive correlation: Higher income customers tend to place more orders\n")
} else {
cat("Significant negative correlation: Higher income customers tend to place fewer orders\n")
}
} else {
cat("No significant correlation between income and order count\n")
}
## Significant positive correlation: Higher income customers tend to place more orders
cat("\n")
# Analysis by income_groups
income_group_analysis <- df %>%
group_by(INCOME_GROUP) %>%
summarise(
customers = n(),
avg_orders = mean(`12 Mo. Orders`, na.rm = TRUE),
median_orders = median(`12 Mo. Orders`, na.rm = TRUE),
min_orders = min(`12 Mo. Orders`, na.rm = TRUE),
max_orders = max(`12 Mo. Orders`, na.rm = TRUE),
sd_orders = sd(`12 Mo. Orders`, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(percentage = customers / sum(customers) * 100)
cat("=== INCOME GROUP ANALYSIS ===\n")
## === INCOME GROUP ANALYSIS ===
print(income_group_analysis)
## # A tibble: 3 × 8
## INCOME_GROUP customers avg_orders median_orders min_orders max_orders
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Lower Income 476 28.7 23 1 158
## 2 Middle Income 480 28.6 24 1 92
## 3 Higher Income 2044 31.6 27 1 172
## # ℹ 2 more variables: sd_orders <dbl>, percentage <dbl>
cat("\n")
# Statistical test between groups
if(require(rstatix)) {
anova_test <- aov(`12 Mo. Orders` ~ INCOME_GROUP, data = df)
cat("=== ANOVA TEST (Difference between income groups) ===\n")
print(summary(anova_test))
cat("\n")
}
## Loading required package: rstatix
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
## === ANOVA TEST (Difference between income groups) ===
## Df Sum Sq Mean Sq F value Pr(>F)
## INCOME_GROUP 2 5598 2799.0 5.67 0.00348 **
## Residuals 2997 1479467 493.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(df, aes(x = INCOME_GROUP, y = `12 Mo. Orders`, fill = INCOME_GROUP)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
labs(title = "Orders by Income Group",
subtitle = paste("Correlation:", round(cor_test$estimate, 3),
ifelse(cor_test$p.value < 0.05, "(Significant)", "(Not significant)")),
x = "Income Group",
y = "12-Month Orders",
caption = "Red diamond shows mean value") +
theme_minimal() +
theme(legend.position = "none")
ggplot(df, aes(x = INCOME, y = `12 Mo. Orders`)) +
geom_point(alpha = 0.6, size = 2) +
geom_smooth(method = "lm", color = "red", se = TRUE) +
labs(title = "Income vs Orders Correlation",
subtitle = paste("r =", round(cor_test$estimate, 3),
ifelse(cor_test$p.value < 0.05, "(Significant)", "(Not significant)")),
x = "Income ($)",
y = "12-Month Orders") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
#Bar chart of average orders by income group
ggplot(income_group_analysis, aes(x = INCOME_GROUP, y = avg_orders, fill = INCOME_GROUP)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = avg_orders - sd_orders, ymax = avg_orders + sd_orders),
width = 0.2) +
geom_text(aes(label = round(avg_orders, 1)), vjust = -0.5, size = 4, fontface = "bold") +
labs(title = "Average Orders by Income Group",
x = "Income Group",
y = "Average 12-Month Orders") +
theme_minimal() +
theme(legend.position = "none")
# Additional insight: Orders per dollar of income
df <- df %>%
mutate(ORDERS_PER_INCOME = ifelse(INCOME > 0, `12 Mo. Orders` / INCOME, NA))
cat("=== ORDERS PER DOLLAR OF INCOME ===\n")
## === ORDERS PER DOLLAR OF INCOME ===
orders_per_income <- df %>%
group_by(INCOME_GROUP) %>%
summarise(
avg_orders_per_income = mean(ORDERS_PER_INCOME, na.rm = TRUE),
.groups = 'drop'
)
print(orders_per_income)
## # A tibble: 3 × 2
## INCOME_GROUP avg_orders_per_income
## <fct> <dbl>
## 1 Lower Income 23.7
## 2 Middle Income 0.000620
## 3 Higher Income 0.000265
cat("\n")
#interpretation
cat("=== FINAL INTERPRETATION ===\n")
## === FINAL INTERPRETATION ===
highest_group <- income_group_analysis %>%
filter(avg_orders == max(avg_orders))
lowest_group <- income_group_analysis %>%
filter(avg_orders == min(avg_orders))
cat("• Highest average orders:", highest_group$INCOME_GROUP, "group with",
round(highest_group$avg_orders, 1), "orders\n")
## • Highest average orders: 3 group with 31.6 orders
cat("• Lowest average orders:", lowest_group$INCOME_GROUP, "group with",
round(lowest_group$avg_orders, 1), "orders\n")
## • Lowest average orders: 2 group with 28.6 orders
cat("• Correlation strength:", case_when(
abs(cor_test$estimate) < 0.3 ~ "Weak",
abs(cor_test$estimate) < 0.7 ~ "Moderate",
TRUE ~ "Strong"
), "\n")
## • Correlation strength: Weak
cat("• Practical significance:", ifelse(abs(cor_test$estimate) > 0.2,
"May be meaningful for business",
"Likely not practically significant"), "\n")
## • Practical significance: Likely not practically significant
Interpretation: The regression line is almost flat (slope ≈ 0) here is a relationship: Higher income customers do place slightly more orders But it’s extremely weak: Income explains only 0.3% of the variation in orders (r² = 0.055² = 0.003) Based on our graphs we can also notice that higher income class places more order than middle income and lower income classes
sales_by_gender <- df %>%
group_by(GENDER) %>%
summarise(
avg_sales = mean(`12 Mo. Sales`, na.rm = TRUE),
n = n()
)
sales_by_gender
Interpretation: Females have the highest average sales (3771), slightly above Males (3681) and Unknown group shows noticeably lower average sales (3292). Interpretation: Gender appears linked to sales, with females spending a bit more on average, while customers with unknown gender spend the least.
LOYALTY_SEGMENTS differ in terms of twelve month
sales?sales_by_loyalty <- df %>%
group_by(LOYALTY_SEGMENT) %>%
summarise(
avg_sales = mean(`12 Mo. Sales`, na.rm = TRUE),
total_sales = sum(`12 Mo. Sales`, na.rm = TRUE),
n = n()
)
print(sales_by_loyalty)
## # A tibble: 7 × 4
## LOYALTY_SEGMENT avg_sales total_sales n
## <chr> <dbl> <dbl> <int>
## 1 1. Weekly Shoppers 6882. 4342664 631
## 2 2. Bi-Weekly Shoppers 4423. 3653599 826
## 3 3. Every Three Week Shoppers 2509. 1964765 783
## 4 4. Once a Month Shoppers 1536. 525143 342
## 5 5. Infrequent Shoppers 1070. 441893 413
## 6 7. 90 Day New Shoppers 978 978 1
## 7 No Segment 393. 1573 4
Interpretation: Weekly Shoppers generate the highest average (≈ 6882) and total sales, showing the strongest customer value.
Bi-Weekly Shoppers also contribute heavily with high totals, though lower per-customer averages compared to Weekly Shoppers
Every Three Week and Once a Month Shoppers show progressively lower average of 2509.278 and 1535.506 respectively total sales
Infrequent Shoppers and 90 Day New Shoppers contribute minimal sales.
No Segment customers show very low averages and totals, indicating little loyalty impact.
df$`12 Mo. ORDERS_W_PROMO` <- as.numeric(df$`12 Mo. ORDERS_W_PROMO`)
promo_by_age <- df %>%
filter(!is.na(AGE)) %>%
mutate(age_group = ifelse(AGE < 30, "Under 30", "30+")) %>%
group_by(age_group) %>%
summarise(
avg_promo_orders = mean(`12 Mo. ORDERS_W_PROMO`, na.rm = TRUE),
promo_ratio = mean(`12 Mo. ORDERS_W_PROMO`/`12 Mo. Orders`, na.rm = TRUE),
n = n()
)
promo_by_age
Interpretation: Under 30 customers have a slightly higher average promo orders (7.47 vs 7.24) and a noticeably higher promo ratio (30.6% vs 26.7%).
This suggests younger customers rely on promotions more than older ones, both in frequency and relative share of their purchases.
#View(df)
discount_by_income <- df %>%
group_by(INCOME_GROUP) %>%
summarise(avg_12_discount = mean(`12 Mo. DISCOUNT_AMOUNT`, na.rm = TRUE))
discount_by_income
discount_by_income <- df %>%
group_by(INCOME_GROUP) %>%
summarise(avg_24_discount = mean(`24 Mo. DISCOUNT_AMOUNT`, na.rm = TRUE))
discount_by_income
Interpretation: Lower Income customers use the highest discounts on average (40.1 and 50.1) followed by Middle Income customers use slightly less (37.1 and 63.9) and Higher Income customers use the least (33.0 and 54.3) Discount usage decreases as income increases lower-income customers are more responsive to discounts.
deliverypass_sales <- df %>%
mutate(has_deliverypass = `12 Mo. DELIVERYPASS_USED` > 0) %>%
group_by(has_deliverypass) %>%
summarise(
avg_sales = mean(`12 Mo. Sales`, na.rm = TRUE),
n = n()
)
deliverypass_sales
Interpretation: Customers with DeliveryPass have much higher average sales (4570) compared to those without (2606) so YES this suggests DeliveryPass usage is strongly associated with higher spending and greater customer value..
# Clean delivery fee data
df$`12 Mo. DELIVERY_FEE_PAID` <- gsub("\\$", "", df$`12 Mo. DELIVERY_FEE_PAID`)
df$`12 Mo. DELIVERY_FEE_PAID` <- as.numeric(df$`12 Mo. DELIVERY_FEE_PAID`)
fees_by_loyalty <- df %>%
group_by(LOYALTY_SEGMENT) %>%
summarise(
avg_delivery_fee = mean(`12 Mo. DELIVERY_FEE_PAID`, na.rm = TRUE),
n = n()
)
fees_by_loyalty
Interpretation: Weekly Shoppers pay the highest average delivery fees (45.4), followed by Bi-Weekly Shoppers (63.2), less frequent segments (Once a Month, Infrequent, No Segment) pay noticeably lower fees. Interpretation: Frequent shoppers do not pay less in delivery fees in fact, they pay more, likely because higher order frequency drives up cumulative delivery charges.
sales_percentage <- df %>%
group_by(LOYALTY_SEGMENT) %>%
summarise(
total_sales = sum(`12 Mo. Sales`, na.rm = TRUE),
percentage = total_sales / sum(df$`12 Mo. Sales`, na.rm = TRUE) * 100
)
sales_percentage
Interpretation: Weekly and Bi-Weekly Shoppers together generate over 73% of total sales, making them the core customer base every Three Week Shoppers also contribute significantly (≈18%).
Once a Month and Infrequent Shoppers contribute less than 10% combined, indicating lower engagement 90 Day New Shoppers and No Segment have a negligible contribution.
Sales are heavily concentrated among Weekly and Bi-Weekly shoppers, so loyalty programs and promotions should prioritize these groups, while targeted campaigns could encourage less frequent shoppers to increase their purchase frequency.
# Parse acquisition date
df <- df %>%
mutate(ACQUIRED_DATE = as.Date(ACQUIRED_DATE, format = "%m/%d/%y"),
customer_tenure = as.numeric(difftime(Sys.Date(), ACQUIRED_DATE, units = "days"))/365)
tenure_vs_orders <- cor.test(df$customer_tenure, df$`12 Mo. Orders`, use = "complete.obs")
tenure_vs_orders
##
## Pearson's product-moment correlation
##
## data: df$customer_tenure and df$`12 Mo. Orders`
## t = 7.3276, df = 2998, p-value = 2.997e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0973210 0.1676363
## sample estimates:
## cor
## 0.1326455
Interpretation: Correlation (r): 0.133, 95% CI: [0.097, 0.168],p-value: 2.99e-13 (highly significant)
There is a positive but weak correlation between customer tenure (earlier acquisition date) and number of orders in the past 12 months. Statistically, the relationship is significant (p < 0.001), but the effect size is small.
This suggests that older customers tend to place slightly more orders, but the strength of loyalty explained by tenure alone is limited.Other factors (such as promotions, shopping habits, or income) likely play a larger role in customer loyalty than tenure.
top_zip_codes <- df %>%
group_by(ZIP_CODE) %>%
summarise(
avg_spending = mean(`12 Mo. Sales`, na.rm = TRUE),
n_customers = n()
) %>%
filter(n_customers >= 1) %>%
arrange(desc(avg_spending)) %>%
head(10)
top_zip_codes
Interpretation: The highest per customer spending ZIPs (10465, 06807, 07003, 10463, 10044, 07304) each have only one customer, so their high averages may not be representative.
ZIP codes with more customers (10282, 10069, 11210) still show strong per-customer spending levels ($5,400–6,200), making them more reliable indicators of affluent clusters.
While some single-customer ZIPs appear as outliers, the areas with multiple customers (e.g., 10282 and 10069) likely represent genuine high-value customer bases
What is the day of the week with the highest average orders?
# Clean day-of-week data
data_cols <- c("SUNDAY ORDERS 12 MO.", "MONDAY ORDERS 12 MO.", "TUESDAY ORDERS 12 MO.",
"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.", "FRIDAY ORDERS 12 MO.",
"SATURDAY ORDERS 12 MO.")
for(col in data_cols){
df[[col]] <- gsub("- ", "0", df[[col]])
df[[col]] <- as.numeric(df[[col]])
}
dow_data <- df %>%
select(all_of(data_cols)) %>%
summarise_all(mean, na.rm = TRUE) %>%
pivot_longer(everything(), names_to = "Day", values_to="Avg_Orders")
highest_day <- dow_data[which.max(dow_data$Avg_Orders),]
highest_day
Interpretation: Sunday has got the highest average sales
weekend_orders <- mean(df$`SATURDAY ORDERS 12 MO.` + df$`SUNDAY ORDERS 12 MO.`, na.rm = TRUE)
weekday_orders <- mean(
df$`MONDAY ORDERS 12 MO.` + df$`TUESDAY ORDERS 12 MO.` + df$`WEDNESDAY ORDERS 12 MO.` +
df$`THURSDAY ORDERS 12 MO.` + df$`FRIDAY ORDERS 12 MO.`, na.rm = TRUE
) / 5
cat("Average weekend orders:", weekend_orders, "\n")
## Average weekend orders: 9.698667
cat("Average weekday orders:", weekday_orders, "\n")
## Average weekday orders: 4.009067
Interpretation: Yes the weekend ordering (Sat+Sun) higher than weekday ordering probably because people would because people have more free time on weekends
dow_by_loyalty <- df %>%
group_by(LOYALTY_SEGMENT) %>%
summarise(
Sunday = mean(df$`SUNDAY ORDERS 12 MO.`, na.rm = TRUE),
Monday = mean(df$`MONDAY ORDERS 12 MO.`, na.rm = TRUE),
Tuesday = mean(df$`TUESDAY ORDERS 12 MO.`, na.rm = TRUE),
Wednesday = mean(df$`WEDNESDAY ORDERS 12 MO.`, na.rm = TRUE),
Thursday = mean(df$`THURSDAY ORDERS 12 MO.`, na.rm = TRUE),
Friday = mean(df$`FRIDAY ORDERS 12 MO.`, na.rm = TRUE),
Saturday = mean(df$`SATURDAY ORDERS 12 MO.`, na.rm = TRUE)
)
# Convert to long format for plotting
dow_long <- dow_by_loyalty %>%
pivot_longer(
cols = Sunday:Saturday,
names_to = "Day",
values_to = "Avg_Orders"
)
# Ensure proper day order (Monday to Sunday)
dow_long$Day <- factor(dow_long$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Display the average orders per loyalty segment
cat("=== AVERAGE ORDERS PER LOYALTY SEGMENT ===\n")
## === AVERAGE ORDERS PER LOYALTY SEGMENT ===
print(dow_by_loyalty)
## # A tibble: 7 × 8
## LOYALTY_SEGMENT Sunday Monday Tuesday Wednesday Thursday Friday Saturday
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1. Weekly Shoppers 5.72 5.13 4.13 3.53 3.16 4.10 3.98
## 2 2. Bi-Weekly Shoppers 5.72 5.13 4.13 3.53 3.16 4.10 3.98
## 3 3. Every Three Week … 5.72 5.13 4.13 3.53 3.16 4.10 3.98
## 4 4. Once a Month Shop… 5.72 5.13 4.13 3.53 3.16 4.10 3.98
## 5 5. Infrequent Shoppe… 5.72 5.13 4.13 3.53 3.16 4.10 3.98
## 6 7. 90 Day New Shoppe… 5.72 5.13 4.13 3.53 3.16 4.10 3.98
## 7 No Segment 5.72 5.13 4.13 3.53 3.16 4.10 3.98
# Plot line chart
ggplot(dow_long, aes(x = Day, y = Avg_Orders, color = LOYALTY_SEGMENT, group = LOYALTY_SEGMENT)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(
title = "Average Orders by Day of Week per Loyalty Segment",
x = "Day of Week",
y = "Average Orders",
color = "Loyalty Segment"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom"
) +
scale_y_continuous(limits = c(0, NA)) # Start y-axis at 0
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Additional: Show overall average orders by loyalty segment
cat("\n=== OVERALL AVERAGE ORDERS BY LOYALTY SEGMENT ===\n")
##
## === OVERALL AVERAGE ORDERS BY LOYALTY SEGMENT ===
overall_avg <- df %>%
group_by(LOYALTY_SEGMENT) %>%
summarise(
Avg_Total_Orders = mean(`12 Mo. Orders`, na.rm = TRUE),
Total_Customers = n()
) %>%
arrange(desc(Avg_Total_Orders))
print(overall_avg)
## # A tibble: 7 × 3
## LOYALTY_SEGMENT Avg_Total_Orders Total_Customers
## <chr> <dbl> <int>
## 1 1. Weekly Shoppers 61.6 631
## 2 2. Bi-Weekly Shoppers 35.6 826
## 3 3. Every Three Week Shoppers 20.3 783
## 4 4. Once a Month Shoppers 12.6 342
## 5 No Segment 9.75 4
## 6 5. Infrequent Shoppers 8.59 413
## 7 7. 90 Day New Shoppers 6 1
Interpretation: Generally everyone in the Loyalty Segment prefers to order on Sunday more than any other day with exception of 90 Day New Shoppers who prefers Wednesdays Weekly Shoppers, Bi-Weekly Shoppers, Every Three Week Shoppers, once a Month Shoppers do not prefer to order mid-week especially on Thursday And Lastly 90 Day New Shoppers prefer to order on Wednesday, Friday
# Day-of-week columns
day_cols <- c("SUNDAY ORDERS 12 MO.", "MONDAY ORDERS 12 MO.", "TUESDAY ORDERS 12 MO.",
"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.", "FRIDAY ORDERS 12 MO.",
"SATURDAY ORDERS 12 MO.")
# Calculate promo orders and total orders by day
day_comparison <- df %>%
summarise(across(all_of(day_cols), ~ sum(`12 Mo. ORDERS_W_PROMO` * (. / `12 Mo. Orders`), na.rm = TRUE))) %>%
pivot_longer(everything(), names_to = "Day", values_to = "Estimated_Promo_Orders") %>%
mutate(Day = gsub("_ORDERS_12_MO", "", Day), Day = gsub("_", " ", Day)) %>%
left_join(
df %>% summarise(across(all_of(day_cols), sum, na.rm = TRUE)) %>%
pivot_longer(everything(), names_to = "Day", values_to = "Total_Orders") %>%
mutate(Day = gsub("_ORDERS_12_MO", "", Day), Day = gsub("_", " ", Day)),
by = "Day"
) %>%
mutate(Promo_Rate = Estimated_Promo_Orders / Total_Orders * 100)
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(all_of(day_cols), sum, na.rm = TRUE)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
# Day with highest promo rate
max_promo_day <- day_comparison[which.max(day_comparison$Promo_Rate), ]
cat("=== PROMO ORDER ANALYSIS BY DAY ===\n")
## === PROMO ORDER ANALYSIS BY DAY ===
print(day_comparison %>% arrange(desc(Promo_Rate)))
## # A tibble: 7 × 4
## Day Estimated_Promo_Orders Total_Orders Promo_Rate
## <chr> <dbl> <dbl> <dbl>
## 1 FRIDAY ORDERS 12 MO. 3387. 12298 27.5
## 2 THURSDAY ORDERS 12 MO. 2390. 9467 25.2
## 3 SATURDAY ORDERS 12 MO. 2937. 11943 24.6
## 4 WEDNESDAY ORDERS 12 MO. 2582. 10584 24.4
## 5 SUNDAY ORDERS 12 MO. 4101. 17153 23.9
## 6 TUESDAY ORDERS 12 MO. 2808. 12398 22.7
## 7 MONDAY ORDERS 12 MO. 2947. 15389 19.1
cat("\nDay with highest promo rate:", max_promo_day$Day,
"with", round(max_promo_day$Promo_Rate, 1), "% of orders using promotions\n")
##
## Day with highest promo rate: FRIDAY ORDERS 12 MO. with 27.5 % of orders using promotions
# ANOVA test
if(require(rstatix)) {
promo_long <- df %>%
mutate(Promo_Rate = `12 Mo. ORDERS_W_PROMO` / `12 Mo. Orders`) %>%
select(-`12 Mo. ORDERS_W_PROMO`, -`12 Mo. Orders`) %>%
pivot_longer(all_of(day_cols), names_to = "Day", values_to = "Day_Orders") %>%
mutate(Day = gsub("_ORDERS_12_MO", "", Day), Day = gsub("_", " ", Day))
anova_result <- anova_test(data = promo_long %>% filter(Day_Orders > 0),
formula = Promo_Rate ~ Day)
cat("\n=== STATISTICAL SIGNIFICANCE ===\n")
print(anova_result)
}
##
## === STATISTICAL SIGNIFICANCE ===
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Day 6 17100 6.046 2.48e-06 * 0.002
# Visualization
ggplot(day_comparison, aes(x = reorder(Day, Promo_Rate), y = Promo_Rate, fill = Day)) +
geom_col() +
geom_text(aes(label = paste0(round(Promo_Rate, 1), "%")), hjust = -0.1, size = 4, fontface = "bold") +
labs(title = "Promo Usage Rate by Day of Week", subtitle = "Percentage of orders using promotions",
x = "Day of Week", y = "Promo Usage Rate (%)") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none")
Interpretation: Friday has the highest promo usage: 27% of Friday orders use promotions, Monday has the lowest promo usage: Only 19.1% of Monday orders use promotions
Weekdays vs Weekend: Promo usage is generally higher later in the week Yes, promo orders cluster on specific days - particularly Fridays
day_cols <- c("SUNDAY ORDERS 12 MO.", "MONDAY ORDERS 12 MO.", "TUESDAY ORDERS 12 MO.",
"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.", "FRIDAY ORDERS 12 MO.",
"SATURDAY ORDERS 12 MO.")
# Compute total sales and percentages
day_distribution <- df %>%
summarise(across(all_of(day_cols), sum, na.rm = TRUE)) %>%
pivot_longer(everything(), names_to = "Day", values_to = "Total_Sales") %>%
mutate(Day = gsub(" ORDERS 12 MO\\.", "", Day),
Percentage = Total_Sales / sum(Total_Sales) * 100)
# Coefficient of Variation (CV) to assess distribution
cv <- sd(day_distribution$Total_Sales) / mean(day_distribution$Total_Sales) * 100
skew_pattern <- if(cv < 30) "Evenly distributed" else "Skewed to a few days"
cat("=== 12-Month Sales Distribution Across Days ===\n")
## === 12-Month Sales Distribution Across Days ===
print(day_distribution %>% arrange(desc(Total_Sales)))
## # A tibble: 7 × 3
## Day Total_Sales Percentage
## <chr> <dbl> <dbl>
## 1 SUNDAY 17153 19.2
## 2 MONDAY 15389 17.2
## 3 TUESDAY 12398 13.9
## 4 FRIDAY 12298 13.8
## 5 SATURDAY 11943 13.4
## 6 WEDNESDAY 10584 11.9
## 7 THURSDAY 9467 10.6
cat("\nCoefficient of Variation (CV):", round(cv,1), "%\n")
##
## Coefficient of Variation (CV): 21 %
cat("Sales pattern across days:", skew_pattern, "\n")
## Sales pattern across days: Evenly distributed
# Visualization
ggplot(day_distribution, aes(x = reorder(Day, Total_Sales), y = Total_Sales, fill = Day)) +
geom_col() +
geom_text(aes(label = paste0(round(Percentage,1), "%")), hjust = -0.1, size = 4, fontface = "bold") +
labs(title = "12-Month Sales Distribution Across Days",
subtitle = paste("Pattern:", skew_pattern),
x = "Day of Week", y = "Total Sales") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none")
Interpretation: The promo usage rates range from 19.1% to 27%, this represents a narrow spread of only 7.9 percentage points and all days fall within a relatively tight range
# Sales by loyalty segment
ggplot(sales_by_loyalty, aes(x = LOYALTY_SEGMENT, y = avg_sales, fill = LOYALTY_SEGMENT)) +
geom_bar(stat = "identity") +
labs(title = "Average Sales by Loyalty Segment",
subtitle = "Weekly shoppers generate the highest revenue",
x = "Loyalty Segment", y = "Average Sales ($)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Day of week ordering pattern
dow_data$Day <- factor(dow_data$Day, levels = day_cols)
ggplot(dow_data, aes(x = Day, y = Avg_Orders, fill = Day)) +
geom_bar(stat = "identity") +
labs(title = "Average Orders by Day of Week",
subtitle = "Friday is the most popular shopping day",
x = "Day of Week", y = "Average Orders") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation: The analysis reveals a clear and predictable revenue hierarchy directly tied to shopping frequency, with weekly shoppers dominating as the highest-value segment at over $6,000 in average sales, followed by bi-weekly shoppers in the $4,000-$5,000 range. The pattern continues downward through every-three-week shoppers, monthly shoppers, and finally infrequent shoppers who generate the lowest revenue at under $2,000. This demonstrates that purchase frequency strongly correlates with customer value, as weekly shoppers are approximately three times more valuable than infrequent customers. The strategic imperative is clear: prioritize retention of high-frequency weekly shoppers, implement conversion programs to move bi-weekly shoppers to weekly purchasing patterns, and develop reactivation campaigns for infrequent customers to maximize overall revenue potential.
# T-test for DeliveryPass impact on sales
deliverypass_users_sales <- df %>%
filter(`12 Mo. DELIVERYPASS_USED` > 0) %>%
pull(`12 Mo. Sales`)
non_users_sales <- df %>%
filter(`12 Mo. DELIVERYPASS_USED` == 0) %>%
pull(`12 Mo. Sales`)
t_test_result <- t.test(deliverypass_users_sales, non_users_sales)
t_test_result
##
## Welch Two Sample t-test
##
## data: deliverypass_users_sales and non_users_sales
## t = 16.626, df = 2934.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1732.170 2195.345
## sample estimates:
## mean of x mean of y
## 4569.777 2606.020
Interpretation: The Welch Two Sample t-test results (t = 16.626, p < 2.2e-16) provide overwhelming evidence that DeliveryPass users and non-users have significantly different sales outcomes. The p-value (essentially zero) indicates there is virtually no chance this difference occurred randomly. The 95% confidence interval (1732.170 to 2195.345) confirms that DeliveryPass users spend between $1,732 and $2,195 more on average than non-users. DeliveryPass subscribers generate $4,570 in average sales, compared to $2,606 for non-subscribers—a 75% increase in revenue. This translates to a $1,963 average uplift per subscriber, demonstrating that the program directly correlates with higher customer spending.
Submit the .Rmd and the knitted PDF files using the
correct naming.