Marketing Data Analysis

Grace Kitonyi

2025-09-13

Instructions

# Do not modify this chunk

library(tidyverse)

library(readxl) # package that will help you to load MS Excel data in R.

library(flextable)

library(ggplot2) # will help with advanced plotting if necessary.

In this assignment, we perform basic data exploration and

visualization on marketing data. This is to get insights on how

customers behave to what FreshDirect company offers. FreshDirect is

the leader in online grocery delivery. Their marketing data captures

customer information such as demographics, transaction behavior,

and ordering patterns to enable loyalty analysis, segmentation, and

predictive modeling.

Your task is to load the data using the package readxl allowing you to

load excel files. Identify the function from the package with specific

chosen parameters from it to get rid of some big issues that may

come with data importing. Then you also need to perform some data

cleaning for some computations.

Below is the data description.

Column Name

Description

Shopper classification based on purchase frequency

(e.g., Weekly, Bi-Weekly, Monthly)

LOYALTY_SEGMENT

AGE

Customer's age

INCOME

GENDER

Household income (may be grouped into ranges)

Gender of the primary shopper

Column Name

ZIP_CODE

Description

Residential ZIP code of customer

Designated Market Area (media/advertising region)

Broader geographic grouping

DMA

GEOGRAPHY

ACQUIRED_DATE

Date the customer first registered or became active

12 Mo.

DELIVERY_FEE_PAID

Total delivery fees paid in the last 12 months

12 Mo.

DELIVERYPASS_USED

Number of times DeliveryPass subscription was used in

12 months

12 Mo.

DISCOUNT_AMOUNT

Total discounts applied in 12 months

12 Mo. Orders

Total number of orders in the past 12 months

Number of orders that included a promo in 12 months

Total sales generated by customer in 12 months

Total delivery fees paid in the last 24 months

12 Mo.

ORDERS_W_PROMO

12 Mo. Sales

24 Mo.

DELIVERY_FEE_PAID

24 Mo.

DELIVERYPASS_USED

Number of times DeliveryPass was used in 24 months

Total discounts applied in 24 months

24 Mo.

DISCOUNT_AMOUNT

24 Mo. Orders

Total number of orders in the past 24 months

Number of orders with a promo in 24 months

Total sales generated by customer in 24 months

Number of orders placed on Sundays (12 months)

Number of orders placed on Mondays (12 months)

Number of orders placed on Tuesdays (12 months)

24 Mo. Orders w. Promo

24 Mo. Sales

SUNDAY ORDERS 12 MO.

MONDAY ORDERS 12 MO.

TUESDAY ORDERS 12 MO.

WEDNESDAY ORDERS 12

MO.

Number of orders placed on Wednesdays (12 months)

Number of orders placed on Thursdays (12 months)

THURSDAY ORDERS 12

MO.

Column Name

Description

FRIDAY ORDERS 12 MO.

Number of orders placed on Fridays (12 months)

SATURDAY ORDERS 12

MO.

Number of orders placed on Saturdays (12 months)

Use R to attempt each of the following questions. We recommend you

to write your interpretation in your English.

Part 1.

1. After importing data in R. Check which column has the highest

count of missing information.

# Import data in a variable named df

df <- read_excel("C:/Users/Administrator/OneDrive/Documents/AIMS/ICL/R/

FD_data.xlsx",skip = 1,na=c('#N/A','-','N/A') )

# Check the dimension of df

dim(df)

## [1] 3000

27

View(df)

# Display the data structure of df

str(df)

## tibble [3,000 × 27] (S3: tbl_df/tbl/data.frame)

## $ LOYALTY_SEGMENT

: chr [1:3000] "2. Bi-Weekly Shoppers" "5.

Infrequent Shoppers" "3. Every Three Week Shoppers" "3. Every Three We

ek Shoppers" ...

## $ AGE

: num [1:3000] 60 53 NA 29 NA 54 40 68 33

NA ...

## $ INCOME

$150,000" ...

## $ GENDER

## $ ZIP_CODE

231 ...

: chr [1:3000] "$150,000" "$20,000" NA "

: chr [1:3000] "F" "F" NA "F" ...

: num [1:3000] 10465 10462 10462 11206 11

## $ DMA

Y-NJ" ...

## $ GEOGRAPHY

rooklyn" ...

## $ ACQUIRED_DATE

: chr [1:3000] "NY-NJ" "NY-NJ" "NY-NJ" "N

: chr [1:3000] "Bronx" "Bronx" "Bronx" "B

: chr [1:3000] "6/4/12 0:00" "7/1/12 0:00

" "7/10/12 0:00" "1/7/12 0:00" ...

## $ 12 Mo. DELIVERY_FEE_PAID: chr [1:3000] "$0.00" "$5.99" "$5.99" "

$0.00" ...

## $ 12 Mo. DELIVERYPASS_USED: num [1:3000] 51 0 12 13 23 0 0 0 0 0 ...

## $ 12 Mo. DISCOUNT_AMOUNT

"$20.96" ...

: chr [1:3000] "$98.65" "$0.00" "$139.81"

## $ 12 Mo. Orders

...

: num [1:3000] 57 4 18 15 25 31 8 12 8 3

## $ 12 Mo. ORDERS_W_PROMO

## $ 12 Mo. Sales

$1,092" ...

: num [1:3000] 19 NA 16 7 4 2 5 1 8 NA ...

: chr [1:3000] "$25,195" "$84" "$1,496" "

## $ 24 Mo. DELIVERY_FEE_PAID: chr [1:3000] "$0.00" "$5.99" "$17.97" "

$17.97" ...

## $ 24 Mo. DELIVERYPASS_USED: num [1:3000] 103 0 26 13 46 0 0 0 0 0

...

## $ 24 Mo. DISCOUNT_AMOUNT

" "$50.92" ...

## $ 24 Mo. Orders

3 ...

## $ 24 Mo. Orders w. Promo

...

## $ 24 Mo. Sales

"$1,823" ...

: chr [1:3000] "$128.16" "$0.00" "$257.65

: num [1:3000] 105 1 39 21 46 35 9 24 18

: num [1:3000] 25 NA 33 14 8 9 6 3 15 NA

: chr [1:3000] "$37,999" "$130" "$3,273"

## $ SUNDAY ORDERS 12 MO.

## $ MONDAY ORDERS 12 MO.

## $ TUESDAY ORDERS 12 MO.

: num [1:3000] NA 2 2 1 NA 4 1 6 5 NA ...

: num [1:3000] 50 1 NA 1 3 3 1 1 NA NA ...

: num [1:3000] 2 NA 1 2 5 NA 3 NA 1 1 ...

## $ WEDNESDAY ORDERS 12 MO. : num [1:3000] NA NA 4 4 4 1 NA 1 NA NA

...

## $ THURSDAY ORDERS 12 MO.

## $ FRIDAY ORDERS 12 MO.

## $ SATURDAY ORDERS 12 MO.

: num [1:3000] NA NA 4 1 3 3 1 NA NA 1 ...

: num [1:3000] 2 NA 5 3 4 1 1 NA 1 1 ...

: num [1:3000] NA 1 2 1 4 19 1 4 1 NA ...

# Count number of missing values per column

missing_counts <- colSums(is.na(df))

# Display missing counts nicely

missing_counts

##

LOYALTY_SEGMENT

AGE

INCOME

##

543

##

DMA

##

0

GENDER

659

543

ZIP_CODE

0

##

E_PAID

##

GEOGRAPHY

0

ACQUIRED_DATE 12 Mo. DELIVERY_FE

0

## 12 Mo. DELIVERYPASS_USED

Orders

12 Mo. DISCOUNT_AMOUNT

12 Mo.

##

E_PAID

##

0

12 Mo. ORDERS_W_PROMO

239

0

12 Mo. Sales 24 Mo. DELIVERY_FE

0

## 24 Mo. DELIVERYPASS_USED

Orders

24 Mo. DISCOUNT_AMOUNT

24 Mo.

##

0

##

24 Mo. Orders w. Promo

24 Mo. Sales

SUNDAY ORDERS

12 MO.

##

239

MONDAY ORDERS 12 MO.

484

0

TUESDAY ORDERS 12 MO.

548

447

##

12 MO.

##

WEDNESDAY ORDERS

SATURDAY ORDERS

574

##

THURSDAY ORDERS 12 MO.

12 MO.

##

FRIDAY ORDERS 12 MO.

562

656

622

# Find the column with the maximum missing values

max_missing_col <- names(missing_counts)[which.max(missing_counts)]

max_missing_col

## [1] "GENDER"

# Also show how many missing values it has

missing_counts[max_missing_col]

## GENDER

##

659

Point out any issue with the data (asterix serve as bullet points in

markdown. Add * as much as possible.)

 Data Issues:

Some columns have many missing values, e.g., 12 Mo. Sales.

Numeric columns like INCOME contain $ and commas.

Placeholders like - or N/A need conversion to NA.

GENDER uses abbreviations F/M instead of full labels.

Column names contain spaces or special characters.

Good

Some duplicates or outliers may affect totals and averages.

2. How many unique customers are in the dataset?

# Replace CustomerID with the actual unique ID column if it exists

n_distinct(df$ZIP_CODE)

# or df$CustomerID if available

## [1] 144

# Number of duplicated rows

sum(duplicated(df))

## [1] 0

# If each row is one unique customer

n_customers <- nrow(df)

n_customers

## [1] 3000

Ok

Interpretation: Unique Customers: There are r n_customers rows, each

representing a unique customer. Counting unique ZIP codes (r

n_distinct(df$ZIP_CODE)) shows geographic spread. There are r

sum(duplicated(df)) duplicate rows.

3. What is the age distribution of customers?

# Summary statistics for AGE

summary(df$AGE)

##

Min. 1st Qu.

0.00 36.00

Median

44.00

Mean 3rd Qu.

45.26 55.00 101.00

Max.

NA's

543

# Mean and standard deviation

mean(df$AGE, na.rm = TRUE)

## [1] 45.26129

sd(df$AGE, na.rm = TRUE)

## [1] 17.03842

ggplot(df, aes(x = AGE)) +

geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +

labs(title = "Age Distribution of Customers",

x = "Customer Age",

y = "Count") +

theme_minimal()

## Warning: Removed 543 rows containing non-finite outside the scale ra

nge

## (`stat_bin()`).

Nice

ggplot(df, aes(x = AGE)) +

geom_density(fill = "lightgreen", alpha = 0.5) +

labs(title = "Density Plot of Customer Ages",

x = "Customer Age",

y = "Density") +

theme_minimal()

## Warning: Removed 543 rows containing non-finite outside the scale ra

nge

## (`stat_density()`).

Interpretation: Age Distribution: Most customers are adults aged

between 30–60. Average age ≈ r round(mean(df��, ��. �� =

��), 1)��ℎ�� ≈ ��(��(��

AGE, na.rm=TRUE),1).

Great job

4. What is the income distribution (mean, median, spread or

standard deviation)?

# Remove $ and , then convert to numeric

df$INCOME <- as.numeric(gsub("[$,]", "", df$INCOME))

## Warning: NAs introduced by coercion

# Summary of income

summary(df$INCOME)

##

Min. 1st Qu.

10000 65000

Median

100000

Mean 3rd Qu.

129779 175000

Max.

300000

NA's

937

# Mean, median, and standard deviation

mean_income <- mean(df$INCOME, na.rm = TRUE)

median_income <- median(df$INCOME, na.rm = TRUE)

sd_income

<- sd(df$INCOME, na.rm = TRUE)

mean_income

## [1] 129779.4

median_income

## [1] 1e+05

sd_income

## [1] 83363.39

ggplot(df, aes(x = INCOME)) +

geom_histogram(binwidth = 10000, fill = "lightblue", color = "black")

+

labs(title = "Income Distribution of Customers",

x = "Household Income ($)",

y = "Count") +

theme_minimal()

## Warning: Removed 937 rows containing non-finite outside the scale ra

nge

## (`stat_bin()`).

Interpretation: Income Distribution: Customer incomes mostly range

from $40,000–$100,000. Average income ≈ $r

round(mean_income,0); SD ≈ $r round(sd_income,0), indicating

moderate variability.

Good

5. After decoding the variable GENDER (replace F and M by

Female and Male respectively), what are the count of male and

female customers in the dataset?

# Recode GENDER column

df$GENDER <- recode(df$GENDER, "F" = "Female", "M" = "Male")

# Counts

table(df$GENDER)

##

## Female

## 1596

Male

745

# Or with tidyverse

df %>%

count(GENDER)

## # A tibble: 3 × 2

##

GENDER

<chr>

n

<int>

1596

745

## 1 Female

## 2 Male

## 3 <NA>

659

library(ggplot2)

ggplot(df, aes(x = GENDER, fill = GENDER)) +

geom_bar() +

labs(title = "Gender Distribution of Customers",

x = "Gender", y = "Count") +

theme_minimal()

Nice

Interpretation:

Gender Distribution: The dataset has r table(df$GENDER)["Female"]

females and r table(df$GENDER)[“Male”] males. Female and male

customers are fairly balanced.

6. Which part of America has the highest customer counts?

customer_counts <- table(df$GEOGRAPHY, useNA = "ifany")

highest_region <- names(which.max(customer_counts))

highest_value

<- max(customer_counts)

cat("The region with the highest customers is", highest_region, "with",

highest_value, "customers.\n")

## The region with the highest customers is Manhattan with 2054 custome

rs.

Interpretation: Highest Customer Region: The region with the highest

number of customers is r highest_region with r highest_value

customers.

Ok

7. How many customers fall into each LOYALTY_SEGMENT

category? You can draw a pie/bar chart and interpret it.

# Using base R

table(df$LOYALTY_SEGMENT)

##

1. Weekly Shoppers

631

2. Bi-Weekly Shoppers

826

## 3. Every Three Week Shoppers

4. Once a Month Shoppers

##

783

342

5. Infrequent Shoppers

7. 90 Day New Shoppers

1

413

No Segment

4

# Or using tidyverse

df %>%

count(LOYALTY_SEGMENT)

## # A tibble: 7 × 2

##

LOYALTY_SEGMENT

<chr>

n

<int>

## 1 1. Weekly Shoppers

631

826

783

342

413

1

## 2 2. Bi-Weekly Shoppers

## 3 3. Every Three Week Shoppers

## 4 4. Once a Month Shoppers

## 5 5. Infrequent Shoppers

## 6 7. 90 Day New Shoppers

## 7 No Segment

4

library(ggplot2)

segment_counts <- df %>%

count(LOYALTY_SEGMENT)

ggplot(segment_counts, aes(x = LOYALTY_SEGMENT, y = n, fill = LOYALTY_S

EGMENT)) +

geom_bar(stat = "identity") +

labs(title = "Customer Counts by Loyalty Segment",

x = "Loyalty Segment",

y = "Number of Customers")

Bad labelling

Interpretation: Loyalty Segment Counts: Most customers belong to the

largest segment (r

�� [�ℎ��ℎ. ��(��

segment_counts

n)]) with r

�

max(segment_counts$n) customers.

8. Compute the average number of orders per customer in 12

months.

names(df)

## [1] "LOYALTY_SEGMENT"

"AGE"

## [3] "INCOME"

## [5] "ZIP_CODE"

## [7] "GEOGRAPHY"

"GENDER"

"DMA"

"ACQUIRED_DATE"

## [9] "12 Mo. DELIVERY_FEE_PAID" "12 Mo. DELIVERYPASS_USED"

## [11] "12 Mo. DISCOUNT_AMOUNT"

## [13] "12 Mo. ORDERS_W_PROMO"

"12 Mo. Orders"

"12 Mo. Sales"

## [15] "24 Mo. DELIVERY_FEE_PAID" "24 Mo. DELIVERYPASS_USED"

## [17] "24 Mo. DISCOUNT_AMOUNT"

## [19] "24 Mo. Orders w. Promo"

## [21] "SUNDAY ORDERS 12 MO."

## [23] "TUESDAY ORDERS 12 MO."

## [25] "THURSDAY ORDERS 12 MO."

## [27] "SATURDAY ORDERS 12 MO."

"24 Mo. Orders"

"24 Mo. Sales"

"MONDAY ORDERS 12 MO."

"WEDNESDAY ORDERS 12 MO."

"FRIDAY ORDERS 12 MO."

# Base R

mean(df$`12 Mo. Orders`, na.rm = TRUE)

## [1] 30.68

# Or tidyverse

df %>%

summarise(avg_orders_12mo = mean(`12 Mo. Orders`, na.rm = TRUE))

## # A tibble: 1 × 1

##

avg_orders_12mo

<dbl>

## 1

30.7

Interpretation: Average Orders per Customer: Average 12-month

orders per customer ≈ r round(mean(df$12 Mo. Orders,

na.rm=TRUE),1).

Ok

9. Compute the average sales per customer in 12 months?

df$`12 Mo. Sales` <- as.character(df$`12 Mo. Sales`)

df$`12 Mo. Sales` <- gsub("[$,]", "", df$`12 Mo. Sales`)

df$`12 Mo. Sales`[df$`12 Mo. Sales` %in% c("-", "", "NA")] <- NA

df$`12 Mo. Sales` <- as.numeric(df$`12 Mo. Sales`)

average_sales <- mean(df$`12 Mo. Sales`, na.rm = TRUE)

average_sales

## [1] 3643.538

Ok

Interpretation: Average Sales per Customer: Average 12-month sales

per customer ≈ $r round(average_sales,2).

10.

How many customers used DeliveryPass at least once?

num <- sum(df$`12 Mo. DELIVERYPASS_USED` > 0, na.rm = TRUE)

num

## [1] 1585

Interpretation: DeliveryPass Usage: r num customers used DeliveryPass

at least once, showing moderate adoption.

Should have added 24 Mo. ... as well

Part 2.

11.

Do higher-income customers place more orders?

# 1. Clean and convert INCOME

df <- df %>%

mutate(INCOME_NUM = case_when(

!is.na(INCOME) & str_detect(INCOME, "-") ~

(as.numeric(str_remove(str_extract(INCOME, "^[0-9,]+"), ",")) +

as.numeric(str_remove(str_extract(INCOME, "[0-9,]+$"), ",")))/2,

TRUE ~ as.numeric(str_remove(INCOME, "[^0-9]"))

))

## Warning: There was 1 warning in `mutate()`.

## ℹ In argument: `INCOME_NUM = case_when(...)`.

## Caused by warning:

## ! NAs introduced by coercion

# 2. Clean and convert 12 Mo. Orders

df <- df %>%

mutate(`12 Mo. Orders` = as.numeric(gsub("[^0-9]", "", `12 Mo. Orders

`)))

# 3. Compute correlation

correlation <- cor(df$INCOME_NUM, df$`12 Mo. Orders`, use = "complete.o

bs")

correlation

## [1] 0.06719484

Interpretation: Income vs. Orders: Correlation between income and

12-month orders ≈ r round(correlation,2); a positive but weak

correlation indicates higher-income customers slightly place more

orders.

12.

Based on gender, is there any difference between average

sales?

aggregate(df$`12 Mo. Sales`,

by = list(Gender = df$GENDER),

FUN = function(x) mean(x, na.rm = TRUE))

##

## 1 Female 3771.026

## 2 Male 3680.960

Gender

x

Interpretation: Sales by Gender: Average 12-month sales: Female ≈ $r

��. ��[��

round(mean(df12

GENDER==“Female”], na.rm=TRUE),2);

��. ��[��

Male ≈ $r round(mean(df12

GENDER==“Male”],

na.rm=TRUE),2). Gender differences are minor.

13.

How do LOYALTY_SEGMENTS differ in terms of twelve month

sales?

# Ensure 12 Mo. Sales is numeric

df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))

# Average sales by LOYALTY_SEGMENT

segment_sales <- aggregate(df$`12 Mo. Sales`,

by = list(Loyalty = df$LOYALTY_SEGMENT),

FUN = function(x) mean(x, na.rm = TRUE))

segment_sales

##

Loyalty

x

## 1

## 2

1. Weekly Shoppers 6882.193

2. Bi-Weekly Shoppers 4423.243

## 3 3. Every Three Week Shoppers 2509.278

## 4

## 5

## 6

## 7

4. Once a Month Shoppers 1535.506

5. Infrequent Shoppers 1069.959

7. 90 Day New Shoppers

No Segment

978.000

393.250

Interpretation: Sales by Loyalty Segment: Average 12-month sales

are highest for segment r

segment_sales��[�ℎ��ℎ. ��(��_��AvgSales)].

14.

Do younger customers, for customer aged less than 30,

use promos more than older ones?

# Ensure AGE and 12 Mo. ORDERS_W_PROMO are numeric

df$AGE <- as.numeric(df$AGE)

df$`12 Mo. ORDERS_W_PROMO` <- as.numeric(gsub("[^0-9]", "", df$`12 Mo.

ORDERS_W_PROMO`))

# Group customers into two groups: younger and older

young <- df[df$AGE < 30, ]

old <- df[df$AGE >= 30, ]

# Average orders with promo for each group

mean(young$`12 Mo. ORDERS_W_PROMO`, na.rm = TRUE)

## [1] 8.472222

mean(old$`12 Mo. ORDERS_W_PROMO`, na.rm = TRUE)

## [1] 7.880251

Interpretation: Promos by Age: Customers <30 average r

round(mean(young$12 Mo. ORDERS_W_PROMO, na.rm=TRUE),1)

promo orders; ≥30 average r round(mean(old$12 Mo.

ORDERS_W_PROMO, na.rm=TRUE),1). Younger customers use slightly

more promotions.

15.

How does discount amount vary across income brackets?

# Clean numeric column

df$`12 Mo. DISCOUNT_AMOUNT` <- as.numeric(gsub("[$,]", "", df$`12 Mo. D

ISCOUNT_AMOUNT`))

# Average discount per income bracket

I_discount <- aggregate(df$`12 Mo. DISCOUNT_AMOUNT`,

by = list(Income = df$INCOME),

FUN = function(x) mean(x, na.rm = TRUE))

I_discount

##

Income

x

## 1

## 2

## 3

## 4

## 5

## 6

## 7

## 8

## 9

10000 33.41167

15000 28.60462

20000 21.27594

25000 29.10000

30000 36.16887

35000 30.61211

40000 37.32103

45000 30.40520

50000 35.21877

## 10 55000 32.43000

## 11 60000 42.28682

## 12 65000 35.33939

## 13 75000 34.06022

## 14 100000 37.56920

## 15 150000 34.73596

## 16 175000 32.54702

## 17 200000 37.86529

## 18 250000 28.59218

## 19 300000 26.34092

Try to plot!

Interpretation: Discount by Income: Higher income brackets receive

slightly higher average discounts; variation is modest.

16.

Is DeliveryPass usage associated with higher total sales?

# Create a flag for DeliveryPass usage

df <- df %>%

mutate(`12 Mo. Sales` = as.numeric(gsub("[$,]", "", `12 Mo. Sales`)),

UsedDeliveryPass = ifelse(`12 Mo. DELIVERYPASS_USED` > 0, "Yes

", "No"))

# Compare average total sales by DeliveryPass usage

df %>%

group_by(UsedDeliveryPass) %>%

summarise(

AvgSales = mean(`12 Mo. Sales`, na.rm = TRUE),

Count = n()

)

## # A tibble: 2 × 3

##

## 1 No

UsedDeliveryPass AvgSales Count

<chr>

2606. 1415

4570. 1585

## 2 Yes

Interpretation: DeliveryPass vs. Sales: Average sales: DeliveryPass users

≈ $r round(mean(df12��. ��[��UsedDeliveryPass==“Yes”],

na.rm=TRUE),2); non-users ≈ $r

round(mean(df12��. ��[��UsedDeliveryPass==“No”],

na.rm=TRUE),2). Using DeliveryPass is associated with higher sales.

17.

Do frequent shoppers (Weekly, Bi-Weekly) pay less in

delivery fees?

# Clean numeric column

df$`12 Mo. DELIVERY_FEE_PAID` <- as.numeric(gsub("[$,]", "", df$`12 Mo.

DELIVERY_FEE_PAID`))

## Warning: NAs introduced by coercion

# Compute average delivery fees by loyalty segment

fee_by_segment <- aggregate(df$`12 Mo. DELIVERY_FEE_PAID`,

by = list(Loyalty = df$LOYALTY_SEGMENT),

FUN = function(x) mean(x, na.rm = TRUE))

fee_by_segment

##

## 1

## 2

Loyalty

1. Weekly Shoppers 45.82522

2. Bi-Weekly Shoppers 63.48510

## 3 3. Every Three Week Shoppers 56.61552

x

## 4

## 5

## 6

## 7

4. Once a Month Shoppers 39.17424

5. Infrequent Shoppers 29.60884

7. 90 Day New Shoppers 35.94000

No Segment 14.72500

Interpretation: Delivery Fee by Loyalty: Average delivery fees are

lower for frequent shoppers (r

��[�ℎ��ℎ. ��(��

fee_by_segment

x)]), confirming higher

�

engagement correlates with lower delivery costs.

18.

What percentage of total sales comes from each

LOYALTY_SEGMENT?

# Ensure 12 Mo. Sales is numeric

df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))

# Sum sales by loyalty segment

segment_sales <- aggregate(df$`12 Mo. Sales`,

by = list(Loyalty = df$LOYALTY_SEGMENT),

FUN = function(x) sum(x, na.rm = TRUE))

# Compute percentage of total sales

segment_sales$Percentage <- round(100 * segment_sales$x / sum(segment_s

ales$x), 2)

segment_sales

##

Loyalty

x Percentage

## 1

## 2

1. Weekly Shoppers 4342664

2. Bi-Weekly Shoppers 3653599

39.73

33.43

17.97

4.80

4.04

0.01

## 3 3. Every Three Week Shoppers 1964765

## 4

## 5

## 6

## 7

4. Once a Month Shoppers

5. Infrequent Shoppers

7. 90 Day New Shoppers

No Segment

525143

441893

978

1573

0.01

Interpretation: Sales Contribution by Segment: Percentage of total

sales is highest for r

��[�ℎ��ℎ. ��(��

segment_sales

Percentage)] segment (r

�

segment_sales��[�ℎ��ℎ. ��(��_��Percentage)]%).

19.

Are customers with earlier acquisition dates (older

customers) more loyal in terms of orders?

# Create customer age in days

df$Customer_Age <- as.numeric(Sys.Date() - as.Date(df$ACQUIRED_DATE))

# Ensure 12 Mo. Orders is numeric

df$`12 Mo. Orders` <- as.numeric(gsub("[^0-9]", "", df$`12 Mo. Orders`))

# Compute correlation between customer age and 12 Mo. Orders

correlation <- cor(df$Customer_Age, df$`12 Mo. Orders`, use = "complete.

obs")

correlation

## [1] -0.05260738

Interpretation: Customer Age vs. Orders: Correlation between

customer age and 12-month orders ≈ r round(correlation,2); older

customers tend to place slightly more orders.

20.

Which ZIP codes have the highest per-customer spending?

# Clean 12 Mo. Sales column

df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))

# Compute average sales per ZIP code

zip_sales <- aggregate(df$`12 Mo. Sales`,

by = list(ZIP = df$ZIP_CODE),

FUN = function(x) mean(x, na.rm = TRUE))

# Sort ZIP codes by descending average sales

zip_sales <- zip_sales[order(-zip_sales$x), ]

# View top 10 ZIP codes

head(zip_sales, 10)

##

ZIP

x

## 94 10465 25195.000

## 1

## 6

6807 18894.000

7003 15849.000

## 93 10463

## 80 10044

9961.000

8494.000

## 20

7304 6514.000

## 87 10282

## 82 10069

## 98 10701

## 114 11210

6183.056

5814.000

5644.000

5416.400

Interpretation: Top ZIP Codes by Spending: Top ZIP code by average

sales is r zip_sales$ZIP[1] with ��(��_��x[1],2) per customer.

Part 3.

What is the day of the week with the highest average orders?

Interpretation:

This was not answered!!

21.

Is weekend ordering (Sat+Sun) higher than weekday

ordering?

# Find actual columns for orders by day

weekend_cols <- grep("SATURDAY|SUNDAY.*ORDERS", names(df), value = TRUE)

weekday_cols <- grep("MONDAY|TUESDAY|WEDNESDAY|THURSDAY|FRIDAY.*ORDERS",

names(df), value = TRUE)

# Average weekend orders per customer

weekend_avg <- rowMeans(df[, weekend_cols], na.rm = TRUE)

# Average weekday orders per customer

weekday_avg <- rowMeans(df[, weekday_cols], na.rm = TRUE)

mean_weekend <- mean(weekend_avg, na.rm = TRUE)

mean_weekday <- mean(weekday_avg, na.rm = TRUE)

mean_weekend

## [1] 5.690553

mean_weekday

## [1] 4.638016

if(mean_weekend > mean_weekday){

"Weekend ordering is higher than weekday ordering."

} else if(mean_weekend < mean_weekday){

"Weekday ordering is higher than weekend ordering."

} else {

"Weekend and weekday ordering are about the same."

}

## [1] "Weekend ordering is higher than weekday ordering."

Interpretation: Weekend vs Weekday Ordering: Average weekend

orders ≈ r round(mean_weekend,1); weekday ≈ r

This is not an interpretation

round(mean_weekday,1). Weekend ordering is r

ifelse(mean_weekend>mean_weekday,“higher”,“lower”) than weekdays.

22.

Do different LOYALTY_SEGMENTS prefer different days of

the week?

# Use pattern matching to get all day order columns

day_cols <- grep("ORDERS.*12.MO", names(df), value = TRUE)

segment_day_pref <- df %>%

group_by(LOYALTY_SEGMENT) %>%

summarise(across(all_of(day_cols), ~ mean(.x, na.rm = TRUE)))

segment_day_pref

## # A tibble: 7 × 8

##

LOYALTY_SEGMENT

`SUNDAY ORDERS 12 MO.` `MONDAY ORDERS

12 MO.`

##

<chr>

<dbl>

12.0

7.56

4.19

3.12

2.47

NaN

## 1 1. Weekly Shoppers

10.6

## 2 2. Bi-Weekly Shoppers

6.76

## 3 3. Every Three Week Shoppers

3.87

## 4 4. Once a Month Shoppers

2.53

## 5 5. Infrequent Shoppers

2.27

## 6 7. 90 Day New Shoppers

NaN

## 7 No Segment

1.5

2.75

## # ℹ 5 more variables: `TUESDAY ORDERS 12 MO.` <dbl>,

## #

`WEDNESDAY ORDERS 12 MO.` <dbl>, `THURSDAY ORDERS 12 MO.` <dbl>,

`FRIDAY ORDERS 12 MO.` <dbl>, `SATURDAY ORDERS 12 MO.` <dbl>

This can be better represented. The below conclusion is not even clear with the

way you have presented your results.

Interpretation: Loyalty Segment Preferences by Day: Different loyalty

segments order on different days; frequent shoppers peak on

weekends, infrequent shoppers peak on weekdays.

23.

Do promo orders cluster on specific days (e.g., Fridays)?

colnames(df)

## [1] "LOYALTY_SEGMENT"

## [3] "INCOME"

## [5] "ZIP_CODE"

## [7] "GEOGRAPHY"

"AGE"

"GENDER"

"DMA"

"ACQUIRED_DATE"

## [9] "12 Mo. DELIVERY_FEE_PAID" "12 Mo. DELIVERYPASS_USED"

## [11] "12 Mo. DISCOUNT_AMOUNT"

## [13] "12 Mo. ORDERS_W_PROMO"

"12 Mo. Orders"

"12 Mo. Sales"

## [15] "24 Mo. DELIVERY_FEE_PAID" "24 Mo. DELIVERYPASS_USED"

## [17] "24 Mo. DISCOUNT_AMOUNT"

## [19] "24 Mo. Orders w. Promo"

## [21] "SUNDAY ORDERS 12 MO."

## [23] "TUESDAY ORDERS 12 MO."

## [25] "THURSDAY ORDERS 12 MO."

## [27] "SATURDAY ORDERS 12 MO."

## [29] "UsedDeliveryPass"

"24 Mo. Orders"

"24 Mo. Sales"

"MONDAY ORDERS 12 MO."

"WEDNESDAY ORDERS 12 MO."

"FRIDAY ORDERS 12 MO."

"INCOME_NUM"

"Customer_Age"

With the current dataset, you cannot analyze promo orders by day,

because daily breakdown doesn’t exist.

The only thing you can do is look at total 12-month promo orders

per customer:

Interpretation: Promo Orders by Day: Daily promo breakdown is

unavailable; only total 12-month promo orders per customer can be

analyzed.

24.

Are sales more evenly distributed across days or skewed to

a few?

# Daily order columns

day_cols <- c("SUNDAY ORDERS 12 MO.", "MONDAY ORDERS 12 MO.", "TUESDAY

ORDERS 12 MO.",

"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.", "FRI

DAY ORDERS 12 MO.",

"SATURDAY ORDERS 12 MO.")

total_orders_day <- colSums(df[, day_cols], na.rm = TRUE)

total_orders_day

##

MO.

##

SUNDAY ORDERS 12 MO.

17153

MONDAY ORDERS 12 MO.

TUESDAY ORDERS 12

12

FRIDAY ORDERS 12

12

15389

THURSDAY ORDERS 12 MO.

9467

398

## WEDNESDAY ORDERS 12 MO.

MO.

##

10584

298

## SATURDAY ORDERS 12 MO.

## 11943

prop_orders_day <- round(100 * total_orders_day / sum(total_orders_day),

2)

prop_orders_day

##

MO.

##

SUNDAY ORDERS 12 MO.

19.22

MONDAY ORDERS 12 MO.

17.25

TUESDAY ORDERS 12

13.

FRIDAY ORDERS 12

13.

89

## WEDNESDAY ORDERS 12 MO.

MO.

##

78

THURSDAY ORDERS 12 MO.

10.61

11.86

## SATURDAY ORDERS 12 MO.

## 13.38

From the total orders per day and the chart, we can see that [e.g.,

Fridays and Saturdays have higher orders while Mondays and

Tuesdays are lower], indicating that sales are skewed to a few days

Sunday has the highest orders

rather than evenly distributed:

24.

(Bonus) Draw at least 2 graphics and carefully interpret

results.

library(ggplot2)

# Daily order columns

day_cols <- c("SUNDAY ORDERS 12 MO.", "MONDAY ORDERS 12 MO.", "TUESDAY

ORDERS 12 MO.",

"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.", "FRI

DAY ORDERS 12 MO.",

"SATURDAY ORDERS 12 MO.")

# Total orders per day

total_orders_day <- colSums(df[, day_cols], na.rm = TRUE)

# Convert to data frame for plotting

orders_df <- data.frame(

Day = names(total_orders_day),

TotalOrders = as.numeric(total_orders_day)

)

# Plot

ggplot(orders_df, aes(x = Day, y = TotalOrders, fill = TotalOrders)) +

geom_bar(stat = "identity") +

scale_fill_gradient(low = "lightblue", high = "steelblue") +

labs(title = "Total Orders by Day of the Week",

x = "Day", y = "Total Orders") +

theme_minimal() +

theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Ensure 12 Mo. Sales is numeric

df$`12 Mo. Sales` <- as.numeric(gsub("[$,]", "", df$`12 Mo. Sales`))

# Compute average sales per loyalty segment

segment_sales <- aggregate(df$`12 Mo. Sales`,

by = list(Loyalty = df$LOYALTY_SEGMENT),

FUN = function(x) mean(x, na.rm = TRUE))

# Rename column for plotting

colnames(segment_sales)[2] <- "AvgSales"

# Plot

ggplot(segment_sales, aes(x = reorder(Loyalty, -AvgSales), y = AvgSales,

fill = AvgSales)) +

geom_bar(stat = "identity") +

scale_fill_gradient(low = "lightgreen", high = "darkgreen") +

labs(title = "Average 12-Month Sales by Loyalty Segment",

x = "Loyalty Segment", y = "Average Sales ($)") +

theme_minimal() +

theme(axis.text.x = element_text(angle = 45, hjust = 1))

You have created two good bonus graphics. However, the

interpretation of the 'Total Orders by Day of the Week' chart is

inaccurate as you state Friday/Saturday have peaks, when

Sunday is the actual peak.

Interpretation: Bonus Graphics Interpretation:

Total orders bar chart shows Friday/Saturday peaks.

Average sales by loyalty segment show highest revenue from r

segment_sales��[�ℎ��ℎ. ��(��_��AvgSales)].

This is also not clear

24.

(Bonus) Perform at least one statistical test and interpret

results.

# Daily order columns

weekend_cols <- c("SATURDAY ORDERS 12 MO.", "SUNDAY ORDERS 12 MO.")

weekday_cols <- c("MONDAY ORDERS 12 MO.", "TUESDAY ORDERS 12 MO.",

"WEDNESDAY ORDERS 12 MO.", "THURSDAY ORDERS 12 MO.",

"FRIDAY ORDERS 12 MO.")

# Per-customer average

weekend_avg <- rowMeans(df[, weekend_cols], na.rm = TRUE)

weekday_avg <- rowMeans(df[, weekday_cols], na.rm = TRUE)

# Perform paired t-test

t_test_result <- t.test(weekend_avg, weekday_avg, paired = TRUE)

t_test_result

##

## Paired t-test

##

## data:

weekend_avg and weekday_avg

## t = 8.7351, df = 2783, p-value < 2.2e-16

## alternative hypothesis: true mean difference is not equal to 0

## 95 percent confidence interval:

## 0.7975298 1.2592231

## sample estimates:

## mean difference

##

1.028376

Interpretation: Bonus Statistical Test: Paired t-test for weekend vs

weekday orders gives p-value ≈ r round(t_test_result$p.value,3);

Your interpretations are not clear

difference is statistically significant.

Submission

Submit the .Rmd and the knitted PDF files using the correct naming.