Marketing Data Analysis  
Grace Kitonyi  
2025-09-13  
Instructions  
# Do not modify this chunk  
library(tidyverse)  
library(readxl) # package that will help you to load MS Excel data in R.  
library(flextable)  
library(ggplot2) # will help with advanced plotting if necessary.  
In this assignment, we perform basic data exploration and  
visualization on marketing data. This is to get insights on how  
customers behave to what FreshDirect company offers. FreshDirect is  
the leader in online grocery delivery. Their marketing data captures  
customer information such as demographics, transaction behavior,  
and ordering patterns to enable loyalty analysis, segmentation, and  
predictive modeling.  
Your task is to load the data using the package readxl allowing you to  
load excel files. Identify the function from the package with specific  
chosen parameters from it to get rid of some big issues that may  
come with data importing. Then you also need to perform some data  
cleaning for some computations.  
Below is the data description.  
Column Name  
Description  
Shopper classification based on purchase frequency  
(e.g., Weekly, Bi-Weekly, Monthly)  
LOYALTY_SEGMENT  
AGE  
Customer's age  
INCOME  
GENDER  
Household income (may be grouped into ranges)  
Gender of the primary shopper  
Column Name  
ZIP_CODE  
Description  
Residential ZIP code of customer  
Designated Market Area (media/advertising region)  
Broader geographic grouping  
DMA  
GEOGRAPHY  
ACQUIRED_DATE  
Date the customer first registered or became active  
12 Mo.  
DELIVERY_FEE_PAID  
Total delivery fees paid in the last 12 months  
12 Mo.  
DELIVERYPASS_USED  
Number of times DeliveryPass subscription was used in  
12 months  
12 Mo.  
DISCOUNT_AMOUNT  
Total discounts applied in 12 months  
12 Mo. Orders  
Total number of orders in the past 12 months  
Number of orders that included a promo in 12 months  
Total sales generated by customer in 12 months  
Total delivery fees paid in the last 24 months  
12 Mo.  
ORDERS_W_PROMO  
12 Mo. Sales  
24 Mo.  
DELIVERY_FEE_PAID  
24 Mo.  
DELIVERYPASS_USED  
Number of times DeliveryPass was used in 24 months  
Total discounts applied in 24 months  
24 Mo.  
DISCOUNT_AMOUNT  
24 Mo. Orders  
Total number of orders in the past 24 months  
Number of orders with a promo in 24 months  
Total sales generated by customer in 24 months  
Number of orders placed on Sundays (12 months)  
Number of orders placed on Mondays (12 months)  
Number of orders placed on Tuesdays (12 months)  
24 Mo. Orders w. Promo  
24 Mo. Sales  
SUNDAY ORDERS 12 MO.  
MONDAY ORDERS 12 MO.  
TUESDAY ORDERS 12 MO.  
WEDNESDAY ORDERS 12  
MO.  
Number of orders placed on Wednesdays (12 months)  
Number of orders placed on Thursdays (12 months)  
THURSDAY ORDERS 12  
MO.  
Column Name  
Description  
FRIDAY ORDERS 12 MO.  
Number of orders placed on Fridays (12 months)  
SATURDAY ORDERS 12  
MO.  
Number of orders placed on Saturdays (12 months)  
Use R to attempt each of the following questions. We recommend you  
to write your interpretation in your English.  
Part 1.  
1. After importing data in R. Check which column has the highest  
count of missing information.  
# Import data in a variable named df  
df <- read_excel("C:/Users/Administrator/OneDrive/Documents/AIMS/ICL/R/  
FD_data.xlsx",skip = 1,na=c('#N/A','-','N/A') )  
# Check the dimension of df  
dim(df)  
## [1] 3000  
27  
View(df)  
# Display the data structure of df  
str(df)  
## tibble [3,000 × 27] (S3: tbl_df/tbl/data.frame)  
## $ LOYALTY_SEGMENT  
: chr [1:3000] "2. Bi-Weekly Shoppers" "5.  
Infrequent Shoppers" "3. Every Three Week Shoppers" "3. Every Three We  
ek Shoppers" ...  
## $ AGE  
: num [1:3000] 60 53 NA 29 NA 54 40 68 33  
NA ...  
## $ INCOME  
$150,000" ...  
## $ GENDER  
## $ ZIP_CODE  
231 ...  
: chr [1:3000] "$150,000" "$20,000" NA "  
: chr [1:3000] "F" "F" NA "F" ...  
: num [1:3000] 10465 10462 10462 11206 11  
## $ DMA  
Y-NJ" ...  
## $ GEOGRAPHY  
rooklyn" ...  
## $ ACQUIRED_DATE  
: chr [1:3000] "NY-NJ" "NY-NJ" "NY-NJ" "N  
: chr [1:3000] "Bronx" "Bronx" "Bronx" "B  
: chr [1:3000] "6/4/12 0:00" "7/1/12 0:00  
" "7/10/12 0:00" "1/7/12 0:00" ...  
## $ 12 Mo. DELIVERY_FEE_PAID: chr [1:3000] "$0.00" "$5.99" "$5.99" "  
$0.00" ...  
## $ 12 Mo. DELIVERYPASS_USED: num [1:3000] 51 0 12 13 23 0 0 0 0 0 ...  
## $ 12 Mo. DISCOUNT_AMOUNT  
"$20.96" ...  
: chr [1:3000] "$98.65" "$0.00" "$139.81"  
## $ 12 Mo. Orders  
...  
: num [1:3000] 57 4 18 15 25 31 8 12 8 3  
## $ 12 Mo. ORDERS_W_PROMO  
## $ 12 Mo. Sales  
$1,092" ...  
: num [1:3000] 19 NA 16 7 4 2 5 1 8 NA ...  
: chr [1:3000] "$25,195" "$84" "$1,496" "  
## $ 24 Mo. DELIVERY_FEE_PAID: chr [1:3000] "$0.00" "$5.99" "$17.97" "  
$17.97" ...  
## $ 24 Mo. DELIVERYPASS_USED: num [1:3000] 103 0 26 13 46 0 0 0 0 0  
...  
## $ 24 Mo. DISCOUNT_AMOUNT  
" "$50.92" ...  
## $ 24 Mo. Orders  
3 ...  
## $ 24 Mo. Orders w. Promo  
...  
## $ 24 Mo. Sales  
"$1,823" ...  
: chr [1:3000] "$128.16" "$0.00" "$257.65  
: num [1:3000] 105 1 39 21 46 35 9 24 18  
: num [1:3000] 25 NA 33 14 8 9 6 3 15 NA  
: chr [1:3000] "$37,999" "$130" "$3,273"  
## $ SUNDAY ORDERS 12 MO.  
## $ MONDAY ORDERS 12 MO.  
## $ TUESDAY ORDERS 12 MO.  
: num [1:3000] NA 2 2 1 NA 4 1 6 5 NA ...  
: num [1:3000] 50 1 NA 1 3 3 1 1 NA NA ...  
: num [1:3000] 2 NA 1 2 5 NA 3 NA 1 1 ...  
## $ WEDNESDAY ORDERS 12 MO. : num [1:3000] NA NA 4 4 4 1 NA 1 NA NA  
...  
## $ THURSDAY ORDERS 12 MO.  
## $ FRIDAY ORDERS 12 MO.  
## $ SATURDAY ORDERS 12 MO.  
: num [1:3000] NA NA 4 1 3 3 1 NA NA 1 ...  
: num [1:3000] 2 NA 5 3 4 1 1 NA 1 1 ...  
: num [1:3000] NA 1 2 1 4 19 1 4 1 NA ...  
# Count number of missing values per column  
missing_counts <- colSums(is.na(df))  
# Display missing counts nicely  
missing_counts  
##  
LOYALTY_SEGMENT  
AGE  
INCOME  
##  
543  
##  
DMA  
##  
0
GENDER  
659  
543  
ZIP_CODE  
0
0
##  
E_PAID  
##  
GEOGRAPHY  
0
ACQUIRED_DATE 12 Mo. DELIVERY_FE  
0
0
## 12 Mo. DELIVERYPASS_USED  
Orders  
12 Mo. DISCOUNT_AMOUNT  
12 Mo.  
##  
##  
E_PAID  
##  
0
0
12 Mo. ORDERS_W_PROMO  
239  
0
0
12 Mo. Sales 24 Mo. DELIVERY_FE  
0
## 24 Mo. DELIVERYPASS_USED  
Orders  
24 Mo. DISCOUNT_AMOUNT  
24 Mo.  
##  
0
0
0
##  
24 Mo. Orders w. Promo  
24 Mo. Sales  
SUNDAY ORDERS  
12 MO.  
##  
239  
MONDAY ORDERS 12 MO.  
484  
0
TUESDAY ORDERS 12 MO.  
548  
447  
##  
12 MO.  
##  
WEDNESDAY ORDERS  
SATURDAY ORDERS  
574  
##  
THURSDAY ORDERS 12 MO.  
12 MO.  
##  
FRIDAY ORDERS 12 MO.  
562  
656  
622  
# Find the column with the maximum missing values  
max_missing_col <- names(missing_counts)[which.max(missing_counts)]  
max_missing_col  
## [1] "GENDER"  
# Also show how many missing values it has  
missing_counts[max_missing_col]  
## GENDER  
##  
659  
Point out any issue with the data (asterix serve as bullet points in  
markdown. Add * as much as possible.)  
Data Issues:  
Some columns have many missing values, e.g., 12 Mo. Sales.  
Numeric columns like INCOME contain $ and commas.  
Placeholders like - or N/A need conversion to NA.  
GENDER uses abbreviations F/M instead of full labels.  
Column names contain spaces or special characters.  
Good  
Some duplicates or outliers may affect totals and averages.  
2. How many unique customers are in the dataset?  
# Replace CustomerID with the actual unique ID column if it exists  
n_distinct(df$ZIP_CODE)  
# or df$CustomerID if available  
## [1] 144  
# Number of duplicated rows  
sum(duplicated(df))  
## [1] 0  
# If each row is one unique customer  
n_customers <- nrow(df)  
n_customers  
## [1] 3000  
Ok  
Interpretation: Unique Customers: There are r n_customers rows, each  
representing a unique customer. Counting unique ZIP codes (r  
n_distinct(df$ZIP_CODE)) shows geographic spread. There are r  
sum(duplicated(df)) duplicate rows.  
3. What is the age distribution of customers?  
# Summary statistics for AGE  
summary(df$AGE)  
##  
##  
Min. 1st Qu.  
0.00 36.00  
Median  
44.00  
Mean 3rd Qu.  
45.26 55.00 101.00  
Max.  
NA's  
543  
# Mean and standard deviation  
mean(df$AGE, na.rm = TRUE)  
## [1] 45.26129  
sd(df$AGE, na.rm = TRUE)  
## [1] 17.03842  
ggplot(df, aes(x = AGE)) +  
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +  
labs(title = "Age Distribution of Customers",  
x = "Customer Age",  
y = "Count") +  
theme_minimal()  
## Warning: Removed 543 rows containing non-finite outside the scale ra  
nge  
## (`stat_bin()`).  
Nice  
ggplot(df, aes(x = AGE)) +  
geom_density(fill = "lightgreen", alpha = 0.5) +  
labs(title = "Density Plot of Customer Ages",  
x = "Customer Age",  
y = "Density") +  
theme_minimal()  
## Warning: Removed 543 rows containing non-finite outside the scale ra  
nge  
## (`stat_density()`).  
Interpretation: Age Distribution: Most customers are adults aged  
between 30–60. Average age r round(mean(df, . =  
), 1)��((�  
AGE, na.rm=TRUE),1).  
Great job  
4. What is the income distribution (mean, median, spread or  
standard deviation)?  
# Remove $ and , then convert to numeric  
df$INCOME <- as.numeric(gsub("[$,]", "", df$INCOME))  
## Warning: NAs introduced by coercion  
# Summary of income  
summary(df$INCOME)  
##  
##  
Min. 1st Qu.  
10000 65000  
Median  
100000  
Mean 3rd Qu.  
129779 175000  
Max.  
300000  
NA's  
937  
# Mean, median, and standard deviation  
mean_income <- mean(df$INCOME, na.rm = TRUE)  
median_income <- median(df$INCOME, na.rm = TRUE)  
sd_income  
<- sd(df$INCOME, na.rm = TRUE)  
mean_income  
## [1] 129779.4  
median_income  
## [1] 1e+05  
sd_income  
## [1] 83363.39  
ggplot(df, aes(x = INCOME)) +  
geom_histogram(binwidth = 10000, fill = "lightblue", color = "black")  
+
labs(title = "Income Distribution of Customers",  
x = "Household Income ($)",  
y = "Count") +  
theme_minimal()  
## Warning: Removed 937 rows containing non-finite outside the scale ra  
nge  
## (`stat_bin()`).  
Interpretation: Income Distribution: Customer incomes mostly range  
from $40,000–$100,000. Average income $r  
round(mean_income,0); SD $r round(sd_income,0), indicating  
moderate variability.  
Good  
5. After decoding the variable GENDER (replace F and M by  
Female and Male respectively), what are the count of male and  
female customers in the dataset?  
# Recode GENDER column  
df$GENDER <- recode(df$GENDER, "F" = "Female", "M" = "Male")  
# Counts  
table(df$GENDER)  
##  
## Female  
## 1596  
Male  
745  
# Or with tidyverse  
df %>%  
count(GENDER)  
## # A tibble: 3 × 2  
##  
##  
GENDER  
<chr>  
n
<int>  
1596  
745  
## 1 Female  
## 2 Male  
## 3 <NA>  
659  
library(ggplot2)  
ggplot(df, aes(x = GENDER, fill = GENDER)) +  
geom_bar() +  
labs(title = "Gender Distribution of Customers",  
x = "Gender", y = "Count") +  
theme_minimal()  
Nice  
Interpretation:  
Gender Distribution: The dataset has r table(df$GENDER)["Female"]  
females and r table(df$GENDER)[“Male”] males. Female and male  
customers are fairly balanced.  
6. Which part of America has the highest customer counts?  
customer_counts <- table(df$GEOGRAPHY, useNA = "ifany")  
highest_region <- names(which.max(customer_counts))  
highest_value  
<- max(customer_counts)  
cat("The region with the highest customers is", highest_region, "with",  
highest_value, "customers.\n")  
## The region with the highest customers is Manhattan with 2054 custome  
rs.  
Interpretation: Highest Customer Region: The region with the highest  
number of customers is r highest_region with r highest_value  
customers.  
Ok  
7. How many customers fall into each LOYALTY_SEGMENT  
category? You can draw a pie/bar chart and interpret it.  
# Using base R  
table(df$LOYALTY_SEGMENT)  
##  
##  
##  
1. Weekly Shoppers  
631  
2. Bi-Weekly Shoppers  
826  
## 3. Every Three Week Shoppers  
4. Once a Month Shoppers  
##  
##  
##  
##  
##  
783  
342  
5. Infrequent Shoppers  
7. 90 Day New Shoppers  
1
413  
No Segment  
4
# Or using tidyverse  
df %>%  
count(LOYALTY_SEGMENT)  
## # A tibble: 7 × 2  
##  
##  
LOYALTY_SEGMENT  
<chr>  
n
<int>  
## 1 1. Weekly Shoppers  
631  
826  
783  
342  
413  
1
## 2 2. Bi-Weekly Shoppers  
## 3 3. Every Three Week Shoppers  
## 4 4. Once a Month Shoppers  
## 5 5. Infrequent Shoppers  
## 6 7. 90 Day New Shoppers  
## 7 No Segment  
4
library(ggplot2)  
segment_counts <- df %>%  
count(LOYALTY_SEGMENT)  
ggplot(segment_counts, aes(x = LOYALTY_SEGMENT, y = n, fill = LOYALTY_S  
EGMENT)) +  
geom_bar(stat = "identity") +  
labs(title = "Customer Counts by Loyalty Segment",  
x = "Loyalty Segment",  
y = "Number of Customers")  
Bad labelling  
Interpretation: Loyalty Segment Counts: Most customers belong to the  
largest segment (r  
[. (�  
segment_counts  
n)]) with r  
max(segment_counts$n) customers.  
8. Compute the average number of orders per customer in 12  
months.  
names(df)  
## [1] "LOYALTY_SEGMENT"  
"AGE"  
## [3] "INCOME"  
## [5] "ZIP_CODE"  
## [7] "GEOGRAPHY"  
"GENDER"  
"DMA"  
"ACQUIRED_DATE"  
## [9] "12 Mo. DELIVERY_FEE_PAID" "12 Mo. DELIVERYPASS_USED"  
## [11] "12 Mo. DISCOUNT_AMOUNT"  
## [13] "12 Mo. ORDERS_W_PROMO"  
"12 Mo. Orders"  
"12 Mo. Sales"  
## [15] "24 Mo. DELIVERY_FEE_PAID" "24 Mo. DELIVERYPASS_USED"  
## [17] "24 Mo. DISCOUNT_AMOUNT"  
## [19] "24 Mo. Orders w. Promo"  
## [21] "SUNDAY ORDERS 12 MO."  
## [23] "TUESDAY ORDERS 12 MO."  
## [25] "THURSDAY ORDERS 12 MO."  
## [27] "SATURDAY ORDERS 12 MO."  
"24 Mo. Orders"  
"24 Mo. Sales"  
"MONDAY ORDERS 12 MO."  
"WEDNESDAY ORDERS 12 MO."  
"FRIDAY ORDERS 12 MO."  
# Base R  
mean(df$`12 Mo. Orders`, na.rm = TRUE)  
## [1] 30.68  
# Or tidyverse  
df %>%  
summarise(avg_orders_12mo = mean(`12 Mo. Orders`, na.rm = TRUE))  
## # A tibble: 1 × 1  
##  
##  
avg_orders_12mo  
<dbl>  
## 1  
30.7  
Interpretation: Average Orders per Customer: Average 12-month  
orders per customer r round(mean(df$12 Mo. Orders,  
na.rm=TRUE),1).  
Ok  
9. Compute the average sales per customer in 12 months?  
df$`12 Mo. Sales` <- as.character(df$`12 Mo. Sales`)  
df$`12 Mo. Sales` <- gsub("[$,]", "", df$`12 Mo. Sales`)  
df$`12 Mo. Sales`[df$`12 Mo. Sales` %in% c("-", "", "NA")] <- NA  
df$`12 Mo. Sales` <- as.numeric(df$`12 Mo. Sales`)  
average_sales <- mean(df$`12 Mo. Sales`, na.rm = TRUE)  
average_sales  
## [1] 3643.538  
Ok  
Interpretation: Average Sales per Customer: Average 12-month sales  
per customer $r round(average_sales,2).  
10.  
How many customers used DeliveryPass at least once?  
num <- sum(df$`12 Mo. DELIVERYPASS_USED` > 0, na.rm = TRUE)  
num  
## [1] 1585  
Interpretation: DeliveryPass Usage: r num customers used DeliveryPass  
at least once, showing moderate adoption.  
Should have added 24 Mo. ... as well  
Part 2.  
11.  
Do higher-income customers place more orders?  
# 1. Clean and convert INCOME  
df <- df %>%  
mutate(INCOME_NUM = case_when(  
!is.na(INCOME) & str_detect(INCOME, "-") ~  
(as.numeric(str_remove(str_extract(INCOME, "^[0-9,]+"), ",")) +  
as.numeric(str_remove(str_extract(INCOME, "[0-9,]+$"), ",")))/2,  
TRUE ~ as.numeric(str_remove(INCOME, "[^0-9]"))  
))  
## Warning: There was 1 warning in `mutate()`.  
## In argument: `INCOME_NUM = case_when(...)`.  
## Caused by warning:  
## ! NAs introduced by coercion  
# 2. Clean and convert 12 Mo. Orders  
df <- df %>%  
mutate(`12 Mo. Orders` = as.numeric(gsub("[^0-9]", "", `12 Mo. Orders  
`)))  
# 3. Compute correlation  
correlation <- cor(df$INCOME_NUM, df$`12 Mo. Orders`, use = "complete.o  
bs")  
correlation  
## [1] 0.06719484  
Interpretation: Income vs. Orders: Correlation between income and  
12-month orders r round(correlation,2); a positive but weak  
correlation indicates higher-income customers slightly place more  
orders.  
12.  
Based on gender, is there any difference between average  
sales?  
aggregate(df$`12 Mo. Sales`,  
by = list(Gender = df$GENDER),  
FUN = function(x) mean(x, na.rm = TRUE))  
##  
## 1 Female 3771.026  
## 2 Male 3680.960  
Gender  
x
Interpretation: Sales by Gender: Average 12-month sales: Female $r  
. [�  
round(mean(df12  
GENDER==“Female”], na.rm=TRUE),2);  
. [�  
Male $r round(mean(df12  
GENDER==“Male”],  
na.rm=TRUE),2). Gender differences are minor.  
13.  
How do LOYALTY_SEGMENTS differ in terms of twelve month  
sales?  
# Ensure 12 Mo. Sales is numeric  
df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))  
# Average sales by LOYALTY_SEGMENT  
segment_sales <- aggregate(df$`12 Mo. Sales`,  
by = list(Loyalty = df$LOYALTY_SEGMENT),  
FUN = function(x) mean(x, na.rm = TRUE))  
segment_sales  
##  
Loyalty  
x
## 1  
## 2  
1. Weekly Shoppers 6882.193  
2. Bi-Weekly Shoppers 4423.243  
## 3 3. Every Three Week Shoppers 2509.278  
## 4  
## 5  
## 6  
## 7  
4. Once a Month Shoppers 1535.506  
5. Infrequent Shoppers 1069.959  
7. 90 Day New Shoppers  
No Segment  
978.000  
393.250  
Interpretation: Sales by Loyalty Segment: Average 12-month sales  
are highest for segment r  
segment_sales[. (AvgSales)].  
14.  
Do younger customers, for customer aged less than 30,  
use promos more than older ones?  
# Ensure AGE and 12 Mo. ORDERS_W_PROMO are numeric  
df$AGE <- as.numeric(df$AGE)  
df$`12 Mo. ORDERS_W_PROMO` <- as.numeric(gsub("[^0-9]", "", df$`12 Mo.  
ORDERS_W_PROMO`))  
# Group customers into two groups: younger and older  
young <- df[df$AGE < 30, ]  
old <- df[df$AGE >= 30, ]  
# Average orders with promo for each group  
mean(young$`12 Mo. ORDERS_W_PROMO`, na.rm = TRUE)  
## [1] 8.472222  
mean(old$`12 Mo. ORDERS_W_PROMO`, na.rm = TRUE)  
## [1] 7.880251  
Interpretation: Promos by Age: Customers <30 average r  
round(mean(young$12 Mo. ORDERS_W_PROMO, na.rm=TRUE),1)  
promo orders; 30 average r round(mean(old$12 Mo.  
ORDERS_W_PROMO, na.rm=TRUE),1). Younger customers use slightly  
more promotions.  
15.  
How does discount amount vary across income brackets?  
# Clean numeric column  
df$`12 Mo. DISCOUNT_AMOUNT` <- as.numeric(gsub("[$,]", "", df$`12 Mo. D  
ISCOUNT_AMOUNT`))  
# Average discount per income bracket  
I_discount <- aggregate(df$`12 Mo. DISCOUNT_AMOUNT`,  
by = list(Income = df$INCOME),  
FUN = function(x) mean(x, na.rm = TRUE))  
I_discount  
##  
Income  
x
## 1  
## 2  
## 3  
## 4  
## 5  
## 6  
## 7  
## 8  
## 9  
10000 33.41167  
15000 28.60462  
20000 21.27594  
25000 29.10000  
30000 36.16887  
35000 30.61211  
40000 37.32103  
45000 30.40520  
50000 35.21877  
## 10 55000 32.43000  
## 11 60000 42.28682  
## 12 65000 35.33939  
## 13 75000 34.06022  
## 14 100000 37.56920  
## 15 150000 34.73596  
## 16 175000 32.54702  
## 17 200000 37.86529  
## 18 250000 28.59218  
## 19 300000 26.34092  
Try to plot!  
Interpretation: Discount by Income: Higher income brackets receive  
slightly higher average discounts; variation is modest.  
16.  
Is DeliveryPass usage associated with higher total sales?  
# Create a flag for DeliveryPass usage  
df <- df %>%  
mutate(`12 Mo. Sales` = as.numeric(gsub("[$,]", "", `12 Mo. Sales`)),  
UsedDeliveryPass = ifelse(`12 Mo. DELIVERYPASS_USED` > 0, "Yes  
", "No"))  
# Compare average total sales by DeliveryPass usage  
df %>%  
group_by(UsedDeliveryPass) %>%  
summarise(  
AvgSales = mean(`12 Mo. Sales`, na.rm = TRUE),  
Count = n()  
)
## # A tibble: 2 × 3  
##  
##  
## 1 No  
UsedDeliveryPass AvgSales Count  
<chr>  
<dbl> <int>  
2606. 1415  
4570. 1585  
## 2 Yes  
Interpretation: DeliveryPass vs. Sales: Average sales: DeliveryPass users  
$r round(mean(df12. [UsedDeliveryPass==“Yes”],  
na.rm=TRUE),2); non-users $r  
round(mean(df12. [UsedDeliveryPass==“No”],  
na.rm=TRUE),2). Using DeliveryPass is associated with higher sales.  
17.  
Do frequent shoppers (Weekly, Bi-Weekly) pay less in  
delivery fees?  
# Clean numeric column  
df$`12 Mo. DELIVERY_FEE_PAID` <- as.numeric(gsub("[$,]", "", df$`12 Mo.  
DELIVERY_FEE_PAID`))  
## Warning: NAs introduced by coercion  
# Compute average delivery fees by loyalty segment  
fee_by_segment <- aggregate(df$`12 Mo. DELIVERY_FEE_PAID`,  
by = list(Loyalty = df$LOYALTY_SEGMENT),  
FUN = function(x) mean(x, na.rm = TRUE))  
fee_by_segment  
##  
## 1  
## 2  
Loyalty  
1. Weekly Shoppers 45.82522  
2. Bi-Weekly Shoppers 63.48510  
## 3 3. Every Three Week Shoppers 56.61552  
x
## 4  
## 5  
## 6  
## 7  
4. Once a Month Shoppers 39.17424  
5. Infrequent Shoppers 29.60884  
7. 90 Day New Shoppers 35.94000  
No Segment 14.72500  
Interpretation: Delivery Fee by Loyalty: Average delivery fees are  
lower for frequent shoppers (r  
[. (�� �  
fee_by_segment  
x)]), confirming higher  
engagement correlates with lower delivery costs.  
18.  
What percentage of total sales comes from each  
LOYALTY_SEGMENT?  
# Ensure 12 Mo. Sales is numeric  
df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))  
# Sum sales by loyalty segment  
segment_sales <- aggregate(df$`12 Mo. Sales`,  
by = list(Loyalty = df$LOYALTY_SEGMENT),  
FUN = function(x) sum(x, na.rm = TRUE))  
# Compute percentage of total sales  
segment_sales$Percentage <- round(100 * segment_sales$x / sum(segment_s  
ales$x), 2)  
segment_sales  
##  
Loyalty  
x Percentage  
## 1  
## 2  
1. Weekly Shoppers 4342664  
2. Bi-Weekly Shoppers 3653599  
39.73  
33.43  
17.97  
4.80  
4.04  
0.01  
## 3 3. Every Three Week Shoppers 1964765  
## 4  
## 5  
## 6  
## 7  
4. Once a Month Shoppers  
5. Infrequent Shoppers  
7. 90 Day New Shoppers  
No Segment  
525143  
441893  
978  
1573  
0.01  
Interpretation: Sales Contribution by Segment: Percentage of total  
sales is highest for r  
[. (�  
segment_sales  
Percentage)] segment (r  
segment_sales[. (Percentage)]%).  
19.  
Are customers with earlier acquisition dates (older  
customers) more loyal in terms of orders?  
# Create customer age in days  
df$Customer_Age <- as.numeric(Sys.Date() - as.Date(df$ACQUIRED_DATE))  
# Ensure 12 Mo. Orders is numeric  
df$`12 Mo. Orders` <- as.numeric(gsub("[^0-9]", "", df$`12 Mo. Orders`))  
# Compute correlation between customer age and 12 Mo. Orders  
correlation <- cor(df$Customer_Age, df$`12 Mo. Orders`, use = "complete.  
obs")  
correlation  
## [1] -0.05260738  
Interpretation: Customer Age vs. Orders: Correlation between  
customer age and 12-month orders r round(correlation,2); older  
customers tend to place slightly more orders.  
20.  
Which ZIP codes have the highest per-customer spending?  
# Clean 12 Mo. Sales column  
df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))  
# Compute average sales per ZIP code  
zip_sales <- aggregate(df$`12 Mo. Sales`,  
by = list(ZIP = df$ZIP_CODE),  
FUN = function(x) mean(x, na.rm = TRUE))  
# Sort ZIP codes by descending average sales  
zip_sales <- zip_sales[order(-zip_sales$x), ]  
# View top 10 ZIP codes  
head(zip_sales, 10)  
##  
ZIP  
x
## 94 10465 25195.000  
## 1  
## 6  
6807 18894.000  
7003 15849.000  
## 93 10463  
## 80 10044  
9961.000  
8494.000  
## 20  
7304 6514.000  
## 87 10282  
## 82 10069  
## 98 10701  
## 114 11210  
6183.056  
5814.000  
5644.000  
5416.400  
Interpretation: Top ZIP Codes by Spending: Top ZIP code by average  
sales is r zip_sales$ZIP[1] with ��(x[1],2) per customer.  
Part 3.  
What is the day of the week with the highest average orders?  
Interpretation:  
This was not answered!!  
21.  
Is weekend ordering (Sat+Sun) higher than weekday  
ordering?  
# Find actual columns for orders by day  
weekend_cols <- grep("SATURDAY|SUNDAY.*ORDERS", names(df), value = TRUE)  
weekday_cols <- grep("MONDAY|TUESDAY|WEDNESDAY|THURSDAY|FRIDAY.*ORDERS",  
names(df), value = TRUE)  
# Average weekend orders per customer  
weekend_avg <- rowMeans(df[, weekend_cols], na.rm = TRUE)  
# Average weekday orders per customer  
weekday_avg <- rowMeans(df[, weekday_cols], na.rm = TRUE)  
mean_weekend <- mean(weekend_avg, na.rm = TRUE)  
mean_weekday <- mean(weekday_avg, na.rm = TRUE)  
mean_weekend  
## [1] 5.690553  
mean_weekday  
## [1] 4.638016  
if(mean_weekend > mean_weekday){  
"Weekend ordering is higher than weekday ordering."  
} else if(mean_weekend < mean_weekday){  
"Weekday ordering is higher than weekend ordering."  
} else {  
"Weekend and weekday ordering are about the same."  
}
## [1] "Weekend ordering is higher than weekday ordering."  
Interpretation: Weekend vs Weekday Ordering: Average weekend  
orders r round(mean_weekend,1); weekday r  
This is not an interpretation  
round(mean_weekday,1). Weekend ordering is r  
ifelse(mean_weekend>mean_weekday,“higher”,“lower”) than weekdays.  
22.  
Do different LOYALTY_SEGMENTS prefer different days of  
the week?  
# Use pattern matching to get all day order columns  
day_cols <- grep("ORDERS.*12.MO", names(df), value = TRUE)  
segment_day_pref <- df %>%  
group_by(LOYALTY_SEGMENT) %>%  
summarise(across(all_of(day_cols), ~ mean(.x, na.rm = TRUE)))  
segment_day_pref  
## # A tibble: 7 × 8  
##  
LOYALTY_SEGMENT  
`SUNDAY ORDERS 12 MO.` `MONDAY ORDERS  
12 MO.`  
##  
<chr>  
<dbl>  
<dbl>  
12.0  
7.56  
4.19  
3.12  
2.47  
NaN  
## 1 1. Weekly Shoppers  
10.6  
## 2 2. Bi-Weekly Shoppers  
6.76  
## 3 3. Every Three Week Shoppers  
3.87  
## 4 4. Once a Month Shoppers  
2.53  
## 5 5. Infrequent Shoppers  
2.27  
## 6 7. 90 Day New Shoppers  
NaN  
## 7 No Segment  
1.5  
2.75  
## # 5 more variables: `TUESDAY ORDERS 12 MO.` <dbl>,  
## #  
## #  
`WEDNESDAY ORDERS 12 MO.` <dbl>, `THURSDAY ORDERS 12 MO.` <dbl>,  
`FRIDAY ORDERS 12 MO.` <dbl>, `SATURDAY ORDERS 12 MO.` <dbl>  
This can be better represented. The below conclusion is not even clear with the  
way you have presented your results.  
Interpretation: Loyalty Segment Preferences by Day: Different loyalty  
segments order on different days; frequent shoppers peak on  
weekends, infrequent shoppers peak on weekdays.  
23.  
Do promo orders cluster on specific days (e.g., Fridays)?  
colnames(df)  
## [1] "LOYALTY_SEGMENT"  
## [3] "INCOME"  
## [5] "ZIP_CODE"  
## [7] "GEOGRAPHY"  
"AGE"  
"GENDER"  
"DMA"  
"ACQUIRED_DATE"  
## [9] "12 Mo. DELIVERY_FEE_PAID" "12 Mo. DELIVERYPASS_USED"  
## [11] "12 Mo. DISCOUNT_AMOUNT"  
## [13] "12 Mo. ORDERS_W_PROMO"  
"12 Mo. Orders"  
"12 Mo. Sales"  
## [15] "24 Mo. DELIVERY_FEE_PAID" "24 Mo. DELIVERYPASS_USED"  
## [17] "24 Mo. DISCOUNT_AMOUNT"  
## [19] "24 Mo. Orders w. Promo"  
## [21] "SUNDAY ORDERS 12 MO."  
## [23] "TUESDAY ORDERS 12 MO."  
## [25] "THURSDAY ORDERS 12 MO."  
## [27] "SATURDAY ORDERS 12 MO."  
## [29] "UsedDeliveryPass"  
"24 Mo. Orders"  
"24 Mo. Sales"  
"MONDAY ORDERS 12 MO."  
"WEDNESDAY ORDERS 12 MO."  
"FRIDAY ORDERS 12 MO."  
"INCOME_NUM"  
"Customer_Age"  
With the current dataset, you cannot analyze promo orders by day,  
because daily breakdown doesn’t exist.  
The only thing you can do is look at total 12-month promo orders  
per customer:  
Interpretation: Promo Orders by Day: Daily promo breakdown is  
unavailable; only total 12-month promo orders per customer can be  
analyzed.  
24.  
Are sales more evenly distributed across days or skewed to  
a few?  
# Daily order columns  
day_cols <- c("SUNDAY ORDERS 12 MO.", "MONDAY ORDERS 12 MO.", "TUESDAY  
ORDERS 12 MO.",  
"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.", "FRI  
DAY ORDERS 12 MO.",  
"SATURDAY ORDERS 12 MO.")  
total_orders_day <- colSums(df[, day_cols], na.rm = TRUE)  
total_orders_day  
##  
MO.  
##  
SUNDAY ORDERS 12 MO.  
17153  
MONDAY ORDERS 12 MO.  
TUESDAY ORDERS 12  
12  
FRIDAY ORDERS 12  
12  
15389  
THURSDAY ORDERS 12 MO.  
9467  
398  
## WEDNESDAY ORDERS 12 MO.  
MO.  
##  
10584  
298  
## SATURDAY ORDERS 12 MO.  
## 11943  
prop_orders_day <- round(100 * total_orders_day / sum(total_orders_day),  
2)  
prop_orders_day  
##  
MO.  
##  
SUNDAY ORDERS 12 MO.  
19.22  
MONDAY ORDERS 12 MO.  
17.25  
TUESDAY ORDERS 12  
13.  
FRIDAY ORDERS 12  
13.  
89  
## WEDNESDAY ORDERS 12 MO.  
MO.  
##  
78  
THURSDAY ORDERS 12 MO.  
10.61  
11.86  
## SATURDAY ORDERS 12 MO.  
## 13.38  
From the total orders per day and the chart, we can see that [e.g.,  
Fridays and Saturdays have higher orders while Mondays and  
Tuesdays are lower], indicating that sales are skewed to a few days  
Sunday has the highest orders  
rather than evenly distributed:  
24.  
(Bonus) Draw at least 2 graphics and carefully interpret  
results.  
library(ggplot2)  
# Daily order columns  
day_cols <- c("SUNDAY ORDERS 12 MO.", "MONDAY ORDERS 12 MO.", "TUESDAY  
ORDERS 12 MO.",  
"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.", "FRI  
DAY ORDERS 12 MO.",  
"SATURDAY ORDERS 12 MO.")  
# Total orders per day  
total_orders_day <- colSums(df[, day_cols], na.rm = TRUE)  
# Convert to data frame for plotting  
orders_df <- data.frame(  
Day = names(total_orders_day),  
TotalOrders = as.numeric(total_orders_day)  
)
# Plot  
ggplot(orders_df, aes(x = Day, y = TotalOrders, fill = TotalOrders)) +  
geom_bar(stat = "identity") +  
scale_fill_gradient(low = "lightblue", high = "steelblue") +  
labs(title = "Total Orders by Day of the Week",  
x = "Day", y = "Total Orders") +  
theme_minimal() +  
theme(axis.text.x = element_text(angle = 45, hjust = 1))  
# Ensure 12 Mo. Sales is numeric  
df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))  
# Compute average sales per loyalty segment  
segment_sales <- aggregate(df$`12 Mo. Sales`,  
by = list(Loyalty = df$LOYALTY_SEGMENT),  
FUN = function(x) mean(x, na.rm = TRUE))  
# Rename column for plotting  
colnames(segment_sales)[2] <- "AvgSales"  
# Plot  
ggplot(segment_sales, aes(x = reorder(Loyalty, -AvgSales), y = AvgSales,  
fill = AvgSales)) +  
geom_bar(stat = "identity") +  
scale_fill_gradient(low = "lightgreen", high = "darkgreen") +  
labs(title = "Average 12-Month Sales by Loyalty Segment",  
x = "Loyalty Segment", y = "Average Sales ($)") +  
theme_minimal() +  
theme(axis.text.x = element_text(angle = 45, hjust = 1))  
You have created two good bonus graphics. However, the  
interpretation of the 'Total Orders by Day of the Week' chart is  
inaccurate as you state Friday/Saturday have peaks, when  
Sunday is the actual peak.  
Interpretation: Bonus Graphics Interpretation:  
Total orders bar chart shows Friday/Saturday peaks.  
Average sales by loyalty segment show highest revenue from r  
segment_sales[. (AvgSales)].  
This is also not clear  
24.  
(Bonus) Perform at least one statistical test and interpret  
results.  
# Daily order columns  
weekend_cols <- c("SATURDAY ORDERS 12 MO.", "SUNDAY ORDERS 12 MO.")  
weekday_cols <- c("MONDAY ORDERS 12 MO.", "TUESDAY ORDERS 12 MO.",  
"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.",  
"FRIDAY ORDERS 12 MO.")  
# Per-customer average  
weekend_avg <- rowMeans(df[, weekend_cols], na.rm = TRUE)  
weekday_avg <- rowMeans(df[, weekday_cols], na.rm = TRUE)  
# Perform paired t-test  
t_test_result <- t.test(weekend_avg, weekday_avg, paired = TRUE)  
t_test_result  
##  
## Paired t-test  
##  
## data:  
weekend_avg and weekday_avg  
## t = 8.7351, df = 2783, p-value < 2.2e-16  
## alternative hypothesis: true mean difference is not equal to 0  
## 95 percent confidence interval:  
## 0.7975298 1.2592231  
## sample estimates:  
## mean difference  
##  
1.028376  
Interpretation: Bonus Statistical Test: Paired t-test for weekend vs  
weekday orders gives p-value r round(t_test_result$p.value,3);  
Your interpretations are not clear  
difference is statistically significant.  
Submission  
Submit the .Rmd and the knitted PDF files using the correct naming.