packages <- c("SmartEDA", "tidyr", "dplyr", "ggplot2", "randomForest", "tinytex", "scales", "plotly", "corrplot", "caret", "tidyverse", "lubridate")
# Install any packages that are not yet installed
installed_packages <- rownames(installed.packages())
for (pkg in packages) {
if (!(pkg %in% installed_packages)) {
install.packages(pkg)
}
library(pkg, character.only = TRUE)
}
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attache Paket: 'dplyr'
## Die folgenden Objekte sind maskiert von 'package:stats':
##
## filter, lag
## Die folgenden Objekte sind maskiert von 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: Paket 'ggplot2' wurde unter R Version 4.4.1 erstellt
## Warning: Paket 'randomForest' wurde unter R Version 4.4.1 erstellt
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attache Paket: 'randomForest'
## Das folgende Objekt ist maskiert 'package:ggplot2':
##
## margin
## Das folgende Objekt ist maskiert 'package:dplyr':
##
## combine
## Warning: Paket 'plotly' wurde unter R Version 4.4.1 erstellt
##
## Attache Paket: 'plotly'
## Das folgende Objekt ist maskiert 'package:ggplot2':
##
## last_plot
## Das folgende Objekt ist maskiert 'package:stats':
##
## filter
## Das folgende Objekt ist maskiert 'package:graphics':
##
## layout
## corrplot 0.92 loaded
## Warning: Paket 'caret' wurde unter R Version 4.4.1 erstellt
## Lade nötiges Paket: lattice
## Warning: Paket 'readr' wurde unter R Version 4.4.1 erstellt
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ randomForest::combine() masks dplyr::combine()
## ✖ purrr::discard() masks scales::discard()
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The following report details an in-depth analysis of our dataset, “email_data_mens,” which provides key insights into customer behavior in response to our email marketing campaigns. Our dataset consists of 21,307 unique customers, with detailed demographic and behavioral data across 15 columns. This report offers a comprehensive overview of the dataset structure, RFM model findings, and specific patterns identified in conversion rates and customer segment preferences. Based on these insights, recommendations are presented to enhance campaign effectiveness, increase conversions, and optimize targeted strategies.
At first we are going to load all of our required packages and introduce our Dataset “email_data_men”
load("Group3_email_data_men.rda")
summary(email_data_mens)
## recency history_segment history mens
## Min. : 1.000 1) $0 - $100 :7724 Min. : 29.99 Min. :0.0000
## 1st Qu.: 2.000 2) $100 - $200 :4691 1st Qu.: 63.58 1st Qu.:0.0000
## Median : 6.000 3) $200 - $350 :4090 Median : 157.22 Median :1.0000
## Mean : 5.774 4) $350 - $500 :2097 Mean : 242.84 Mean :0.5509
## 3rd Qu.: 9.000 5) $500 - $750 :1597 3rd Qu.: 325.21 3rd Qu.:1.0000
## Max. :12.000 6) $750 - $1,000: 644 Max. :3215.97 Max. :1.0000
## 7) $1,000 + : 464
## womens zip_code newbie channel
## Min. :0.0000 Rural :3243 Min. :0.0000 Multichannel:2577
## 1st Qu.:0.0000 Surburban:9501 1st Qu.:0.0000 Phone :9240
## Median :1.0000 Urban :8563 Median :1.0000 Web :9490
## Mean :0.5514 Mean :0.5015
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
##
## visit conversion spend
## Min. :0.0000 Min. :0.00000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 0.000
## Median :0.0000 Median :0.00000 Median : 0.000
## Mean :0.1828 Mean :0.01253 Mean : 1.423
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.: 0.000
## Max. :1.0000 Max. :1.00000 Max. :499.000
##
The dataset “email_data_mens” includes key customer information useful for Customer Base Analysis and for constructing an RFM (Recency, Frequency, Monetary) Model. This dataset comprises 21,307 customers, represented by various demographic and behavioral variables that allow us to segment and understand the customer base effectively.
For our RFM model we can use the “Recency” variable, “Newbie” will be used to measure Frequency of customer visits, while “History” serves as a proxy for Monetary value. Through this model and segmentation, we aim to uncover patterns in customer engagement, spending behaviors, and preferred channels.
What we can tell from the summary of our dataset: Low conversion rate: From the summary we can already tell, that we do have a low conversion rate, only 1,25% of emails result in conversions. This means, that there might be a significant opportunity. The average email conversion rate differs a lot from industry to industry. As per “https://www.klaviyo.com/uk/blog/email-marketing-benchmarks-open-click-and-conversion-rates” The average conversion rate across all industries is 1,71%, while some top performers have a conversion rate as high as 5,71%.
Customer Segmentation: The largest proportion of customers falls into the low spending segment (0-100$), with about 7724
Channel preference: Web and Phone are the primary channels, with multichannel being less utilized.
Let’s check if there are any missing or duplicated values:
missing_values <- sum(is.na(email_data_mens))
missing_values
## [1] 0
duplicates <- duplicated(email_data_mens)
any(duplicates)
## [1] TRUE
Luckily for us, our Dataset is already cleaned.
Let’s visualize some of those datapoints to find out what the relevant variables for our analysis will be.
# Create the 'gender' column based on 'mens' and 'womens' columns
email_data_mens$gender <- with(email_data_mens, ifelse(mens == 1 & womens == 1, "Other",
ifelse(mens == 1, "Men",
ifelse(womens == 1, "Women", "Unspecified"))))
# Create a summary table grouped by zip_code and gender
demographics_by_zip <- email_data_mens %>%
filter(gender %in% c("Men", "Women", "Other")) %>% # Exclude 'Unspecified' if present
group_by(zip_code, gender) %>%
summarise(Count = n(), .groups = "drop") # Explicitly drop grouping after summarising
# Pivot the data to a wider format
demographics_by_zip_wide <- demographics_by_zip %>%
pivot_wider(names_from = gender, values_from = Count, values_fill = 0) %>%
mutate(Total = Men + Women + Other)
# Calculate total for all zip codes
totals_row <- demographics_by_zip_wide %>%
summarise(
zip_code = "Total", # Label this row as "Total"
Men = sum(Men, na.rm = TRUE),
Women = sum(Women, na.rm = TRUE),
Other = sum(Other, na.rm = TRUE),
Total = sum(Total, na.rm = TRUE)
)
# Bind the totals row to the original data
demographics_by_zip_wide_with_totals <- bind_rows(demographics_by_zip_wide, totals_row)
# View the updated table with totals row
print(demographics_by_zip_wide_with_totals)
## # A tibble: 4 × 5
## zip_code Men Other Women Total
## <chr> <int> <int> <int> <int>
## 1 Rural 1445 323 1475 3243
## 2 Surburban 4301 975 4225 9501
## 3 Urban 3812 883 3868 8563
## 4 Total 9558 2181 9568 21307
This table gives us a detailled analysis of our costumer demographics and where they live.
ggplot(demographics_by_zip, aes(x = gender, y = Count, fill = gender)) +
geom_bar(stat = "identity") +
facet_wrap(~ zip_code) + # Create a separate plot for each zip code area
labs(
title = "Demographic Distribution by Zip Code",
x = "Gender",
y = "Number of Customers",
fill = "Gender"
) +
theme_minimal() +
theme(
strip.text = element_text(size = 14, face = "bold"), # Style facet labels
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 14),
legend.position = "none"
) +
scale_fill_manual(values = c("Men" = "#1f77b4",
"Women" = "#ff7f0e",
"Other" = "#2ca02c"))
The bar chart further visualizes the distribution of our customer demographics.
unique(email_data_mens$history_segment)
## [1] 5) $500 - $750 2) $100 - $200 1) $0 - $100 3) $200 - $350
## [5] 6) $750 - $1,000 4) $350 - $500 7) $1,000 +
## 7 Levels: 1) $0 - $100 2) $100 - $200 3) $200 - $350 ... 7) $1,000 +
ggplot(email_data_mens, aes(x = history_segment)) +
geom_bar(fill = "steelblue") +
geom_text(stat = "count", aes(label = after_stat(count)),
vjust = -0.5, size = 3.5) + # Add count labels above bars
labs(
title = "Spending Distribution of Customers",
x = "Spending Bracket",
y = "Number of Customers"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
) +
ylim(0, max(table(email_data_mens$history_segment)) * 1.1)
We can see, the lowest spending bracket is clearly the one with the most customers and the bigger the spending bracket, the fewer customers there are in it.
# Create a summary table for newbie vs returning customers
newbie_distribution <- email_data_mens %>%
group_by(newbie) %>%
summarise(Count = n()) %>%
mutate(Category = ifelse(newbie == 1, "New Customers", "Returning Customers"))
# Calculate total customers
total_customers <- sum(newbie_distribution$Count)
# Add percentage to the summary table
newbie_distribution <- newbie_distribution %>%
mutate(Percentage = round((Count / total_customers) * 100, 1),
Label = paste0(Count, " (", Percentage, "%)"))
newbie_distribution
## # A tibble: 2 × 5
## newbie Count Category Percentage Label
## <int> <int> <chr> <dbl> <chr>
## 1 0 10621 Returning Customers 49.8 10621 (49.8%)
## 2 1 10686 New Customers 50.2 10686 (50.2%)
ggplot(newbie_distribution, aes(x = Category, y = Count, fill = Category)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Label), vjust = -0.5, size = 5) +
labs(
title = "New vs. Returning Customers",
x = "Customer Type",
y = "Number of Customers"
) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 14)
) +
ylim(0, max(newbie_distribution$Count) * 1.1)
The number of returning and new customers is more or less equal, as shown on the bar chart and the datatable.
# Convert 'conversion' to a factor with descriptive labels
email_data_mens <- email_data_mens %>%
mutate(conversion_status = factor(conversion, levels = c(0, 1),
labels = c("Not Converted", "Converted")))
# Summarize counts per channel and conversion status
channel_conversion_summary <- email_data_mens %>%
group_by(channel, conversion_status) %>%
summarise(Count = n(), .groups = "drop")
# Calculate total customers per channel to compute conversion rates
channel_totals <- email_data_mens %>%
group_by(channel) %>%
summarise(Total = n(), .groups = "drop")
# Create conversion labels
conversion_labels <- channel_totals %>%
left_join(
email_data_mens %>%
group_by(channel) %>%
summarise(Converted = sum(conversion == 1), .groups = "drop"),
by = "channel"
) %>%
mutate(Conversion_Rate = round((Converted / Total) * 100, 2),
Label = paste0("Conversion Rate: ", Conversion_Rate, "%"))
conversion_labels
## # A tibble: 3 × 5
## channel Total Converted Conversion_Rate Label
## <fct> <int> <int> <dbl> <chr>
## 1 Multichannel 2577 44 1.71 Conversion Rate: 1.71%
## 2 Phone 9240 100 1.08 Conversion Rate: 1.08%
## 3 Web 9490 123 1.3 Conversion Rate: 1.3%
Conversion rates can be influenced by the total number of customers in each channel. Channels with a smaller customer base might show a higher conversion rate due to smaller, more targeted audiences. Conversely, larger bases might have lower conversion rates because it’s harder to achieve high engagement across a broad audience.
You could argue that the conversion rate for “Multichannel” is the highest. However, it is also the channel with the least amount of customers and therefore the one with the least amount of conversions.
#mapping the customer journey out,to see where we lose our customers
transition_rates <- email_data_mens %>%
group_by(channel) %>%
summarise(
total_customers = n(),
visit_rate = sum(visit) / n(),
visit_conversion_rate = sum(conversion) / sum(visit[visit == 1])
)
print(transition_rates)
## # A tibble: 3 × 4
## channel total_customers visit_rate visit_conversion_rate
## <fct> <int> <dbl> <dbl>
## 1 Multichannel 2577 0.211 0.0807
## 2 Phone 9240 0.163 0.0665
## 3 Web 9490 0.194 0.0667
The visit rate shows us the proportion of customers who visited the platform through each channel as well es the visit conversion rate, which tells us how many people convert after visiting. Again Multichannel performs the best and Phone the worst.
Now that we identified and evaluated our important variables, we can implement our statistical Models.
Let us find out which variables could actually be responsible for leading to conversions:
# Select numerical variables including the target
num_vars <- email_data_mens %>% select(recency, history, conversion)
# Compute correlation matrix
cor_matrix <- cor(num_vars, use = "complete.obs")
# Plot the correlation matrix
corrplot(cor_matrix, method = "color", addCoef.col = "black",
tl.col = "black", number.cex = 0.7, title = "Correlation Matrix",
mar = c(0,0,1,0))
The correlation Matrix shows us, that there is a weak negative correlation between recency and history. Other than that, there is no correlation at all between recency and history to conversion.
gender_conversion <- table(email_data_mens$gender, email_data_mens$conversion)
chisq.test(gender_conversion)
##
## Pearson's Chi-squared test
##
## data: gender_conversion
## X-squared = 27.358, df = 2, p-value = 1.146e-06
There is a statistically significant association between gender and conversion. This suggests that conversion rates are influenced by gender, and the distribution of conversion differs based on gender categories.
# Channel vs Conversion
channel_conversion <- table(email_data_mens$channel, email_data_mens$conversion)
chisq.test(channel_conversion)
##
## Pearson's Chi-squared test
##
## data: channel_conversion
## X-squared = 6.6199, df = 2, p-value = 0.03652
There is a moderate association between channel and conversion, with a p-value just below 0.05. This means that the marketing channel used may influence conversion rates, but the strength of the association is relatively weak
# Zip Code vs Conversion
zip_conversion <- table(email_data_mens$zip_code, email_data_mens$conversion)
chisq.test(zip_conversion)
##
## Pearson's Chi-squared test
##
## data: zip_conversion
## X-squared = 0.34122, df = 2, p-value = 0.8432
There is no statistically significant association between zip code and conversion. This means that geographic location (represented by zip code) does not affect conversion rates
#stepwise logistic regression
set.seed(123)
# Create training and testing sets
train_index <- createDataPartition(email_data_mens$conversion, p = 0.7, list = FALSE)
train_data <- email_data_mens[train_index, ]
test_data <- email_data_mens[-train_index, ]
# Fit the logistic regression model with selected variables
logistic_model <- glm(conversion ~ gender + channel + history + recency,
data = train_data, family = binomial)
# Summary of the model
summary(logistic_model)
##
## Call:
## glm(formula = conversion ~ gender + channel + history + recency,
## family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1810217 0.2814755 -14.854 < 2e-16 ***
## genderOther 0.6184147 0.2260939 2.735 0.00623 **
## genderWomen -0.0107015 0.1706018 -0.063 0.94998
## channelPhone -0.3783237 0.2326204 -1.626 0.10387
## channelWeb -0.1777727 0.2261561 -0.786 0.43183
## history 0.0003352 0.0002813 1.192 0.23332
## recency -0.0375055 0.0230230 -1.629 0.10330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1894.9 on 14914 degrees of freedom
## Residual deviance: 1869.3 on 14908 degrees of freedom
## AIC: 1883.3
##
## Number of Fisher Scoring iterations: 7
# Predict on test data
test_data$pred_prob <- predict(logistic_model, newdata = test_data, type = "response")
test_data$pred_class <- ifelse(test_data$pred_prob > 0.3, 1, 0)
# Confusion Matrix
confusionMatrix(factor(test_data$pred_class), factor(test_data$conversion))
## Warning in confusionMatrix.default(factor(test_data$pred_class),
## factor(test_data$conversion)): Levels are not in the same order for reference
## and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6299 93
## 1 0 0
##
## Accuracy : 0.9855
## 95% CI : (0.9822, 0.9882)
## No Information Rate : 0.9855
## P-Value [Acc > NIR] : 0.5275
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9855
## Neg Pred Value : NaN
## Prevalence : 0.9855
## Detection Rate : 0.9855
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
From the logistic regression model we can again conclude, that the gender does significantly effect the conversion. In particular, The “Other” gender category significantly increases the log-odds of conversion compared to “Men,” suggesting that people in the “Other” category are more likely to convert.
Due to the fact, that non other variables have a significant impact on the model, the prediction does not predict any conversions.
Let’s try a rfm model Objective: Adjusting the Recency metric to measure the time since the last purchase in months, where:
1 month represents the most recent purchase. 12 months represents the least recent purchase within the past year.
Problem: Exact shopping frequency is unknown. Available Data: A newbie column where: 1 = Newbie (frequency = 1) 0 = Returning customer (frequency > 1) Approach: Assign frequency = 1 for newbies and frequency = 2 for returning customers.
email_data_mens <- email_data_mens %>%
mutate(customer_id = row_number())
# Define analysis_date as today's date
analysis_date <- Sys.Date()
email_data_mens <- email_data_mens %>%
mutate(
Frequency = ifelse(newbie == 1, 1, 2) # Assign 1 for newbies, 2 for returning customers
)
rfm_table <- email_data_mens %>%
mutate(
Recency = ifelse(recency > 12, 12, recency),
Frequency = ifelse(newbie == 1, 1, 2), # Default frequency assignment
Monetary = history
) %>%
select(customer_id, Recency, Frequency, Monetary)
rfm_table <- rfm_table %>%
mutate(
# Normalize Recency: Lower Recency is better
Recency_Normalized = (12 - Recency) / 11, # Assuming Recency is capped at 12
# Normalize Frequency: Already between 1-2
Frequency_Normalized = (Frequency - 1) / 1, # Scaled to 0-1
# Normalize Monetary
Monetary_Normalized = (Monetary - min(Monetary)) / (max(Monetary) - min(Monetary))
) %>%
mutate(
# Assign weights to each metric
R_weighted = Recency_Normalized * 0.5,
F_weighted = Frequency_Normalized * 0.2,
M_weighted = Monetary_Normalized * 0.5,
# Calculate Total RFM Score
RFM_Score = R_weighted + F_weighted + M_weighted
)
ggplot(rfm_table, aes(x = R_weighted, y = F_weighted, color = M_weighted)) +
geom_jitter(width = 0.2, height = 0.2, alpha = 0.7) +
scale_color_gradient(low = "blue", high = "red") +
labs(
title = "RFM Segmentation",
x = "Recency Score",
y = "Frequency Score",
color = "Monetary Score"
) +
theme_minimal()
# Prepare data for clustering
rfm_clusters <- rfm_table %>%
select(Recency, Frequency, Monetary) %>%
scale() # Standardize the data
# Determine optimal number of clusters using Elbow Method
set.seed(123)
wss <- sapply(1:10, function(k){
kmeans(rfm_clusters, centers = k, nstart = 10)$tot.withinss
})
# Plot Elbow Method
plot(1:10, wss, type = "b", pch = 19,
xlab = "Number of Clusters",
ylab = "Total Within-Clusters Sum of Squares",
main = "Elbow Method for Determining Optimal Clusters")
Based on the Elbow Method we chose k=5 for optimal clustering.
optimal_k = 5
# Perform K-Means Clustering
set.seed(123)
kmeans_model <- kmeans(rfm_clusters, centers = optimal_k, nstart = 25)
# Assign cluster labels to the RFM table
rfm_table$Cluster <- as.factor(kmeans_model$cluster)
The RFM Model consists of various types of customers, which we can now assign based on our clusters. According to the types we will build our following business strategy.
To achieve this, we first have to calculate the average values for all our variables in the model. The resulting table will be the base for our following analysis.
# Analyze RFM Averages per Cluster
cluster_summary <- rfm_table %>%
mutate(cluster = kmeans_model$cluster) %>%
group_by(cluster) %>%
summarise(
Avg_Recency = mean(Recency),
Avg_Frequency = mean(Frequency),
Avg_Monetary = mean(Monetary),
Avg_RFM_Score = mean(RFM_Score),
Count = n()
)
print(cluster_summary)
## # A tibble: 5 × 6
## cluster Avg_Recency Avg_Frequency Avg_Monetary Avg_RFM_Score Count
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 9.44 1 149. 0.135 4061
## 2 2 3.19 1 144. 0.418 4078
## 3 3 9.03 2 158. 0.355 5290
## 4 4 3.40 1 796. 0.511 2547
## 5 5 2.86 2 210. 0.644 5331
This leads to the following risk categories:
Cluster 1: “At-Risk Customers”. These customers have engaged less recently, do not purchase frequently, and have moderate spending. They may be starting to drift away and could benefit from re-engagement efforts to prevent churn.
Cluster 2: “Recent but Low-Frequency Spenders”. This group has recently engaged but spends infrequently and at a moderate level. They could be potential for upselling with targeted offers to increase engagement frequency.
Cluster 3: “Occasional Loyalists”. These customers engage relatively frequently (Frequency = 2) but with lower recent activity (Recency = 9.03). Their moderate spending and frequency indicate that they value the brand but may need periodic nudges to stay engaged.
Cluster 4: “High-Value New Customers”. These customers spend significantly more but engage less frequently. Their recent engagement indicates potential for high lifetime value, suggesting they may respond well to loyalty-building strategies.
Cluster 5: “Loyal Champions”. This group is the most engaged with high frequency, recent activity, and above-average spending. They are prime candidates for loyalty programs, VIP offers, and brand advocacy initiatives.
# assigne the clusters their label names
rfm_table_named <- rfm_table %>%
mutate(cluster = kmeans_model$cluster) %>% # Add cluster labels from the K-means model
mutate(Customer_Segment = case_when(
cluster == 1 ~ "At-Risk Customers",
cluster == 2 ~ "Recent but Low-Frequency Spenders",
cluster == 3 ~ "Occasional Loyalists",
cluster == 4 ~ "High-Value New Customers",
cluster == 5 ~ "Loyal Champions"
))
# Summarize the segments for visualisation
segment_summary <- rfm_table_named %>%
group_by(Customer_Segment) %>%
summarise(
Avg_Recency = mean(Recency),
Avg_Frequency = mean(Frequency),
Avg_Monetary = mean(Monetary),
Count = n()
)
# Plotting the number of customers in each segment
ggplot(segment_summary, aes(x = reorder(Customer_Segment, -Count), y = Count, fill = Customer_Segment)) +
geom_bar(stat = "identity") +
labs(title = "Customer Segment Distribution",
x = "Customer Segment",
y = "Number of Customers") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
Now we visualize our RFM Model using Plotly for a 3D visualisation.
# creating a plotly graph for 3d visualisation
plot_ly(rfm_table_named, x = ~Recency, y = ~Frequency, z = ~Monetary,
color = ~Customer_Segment, colors = "Set2",
type = 'scatter3d', mode = 'markers') %>%
layout(title = "3D RFM Segmentation",
scene = list(xaxis = list(title = "Recency"),
yaxis = list(title = "Frequency"),
zaxis = list(title = "Monetary")))
Now that we do have 5 different customer segments according to our RFM model, let’s try to analyze key demographic variables such as gender, zip_code, and history_segment within each cluster
current_wd <- getwd()
Sys.setenv(TMPDIR = current_wd)
Sys.setenv(TEMP = current_wd)
Sys.setenv(TMP = current_wd)
Sys.setenv(TMPDIR = "C:/Users/atapp/OneDrive/WISO Studium/SBWL/Data Science/Applications of Data Science/project 1")
library(knitr)
# Assuming 'rfm_table' contains 'customer_id' and 'Cluster'
# Merge 'rfm_table' with 'email_data_mens'
# Ensure 'customer_id' is unique in both datasets
email_data_mens_unique <- email_data_mens %>%
distinct(customer_id, .keep_all = TRUE)
# Merge the datasets
email_data_with_clusters <- email_data_mens_unique %>%
left_join(rfm_table %>% select(customer_id, Cluster), by = "customer_id")
# Demographic Summary by Cluster
demographics_summary <- email_data_with_clusters %>%
group_by(Cluster) %>%
summarise(
Count = n(),
Men = sum(gender == "Men"),
Women = sum(gender == "Women"),
Other = sum(gender == "Other"),
Rural = sum(zip_code == "Rural"),
Urban = sum(zip_code == "Urban"),
Suburban = sum(zip_code == "Suburban"),
Newbies = sum(newbie == 1),
Returning_Customers = sum(newbie == 0)
) %>%
mutate(
Men_Percent = Men / Count * 100,
Women_Percent = Women / Count * 100,
Other_Percent = Other / Count * 100,
Rural_Percent = Rural / Count * 100,
Urban_Percent = Urban / Count * 100,
Suburban_Percent = Suburban / Count * 100,
Newbies_Percent = Newbies / Count * 100,
Returning_Percent = Returning_Customers / Count * 100
) %>%
select(
Cluster, Count,
Men_Percent, Women_Percent, Other_Percent,
Rural_Percent, Urban_Percent, Suburban_Percent,
Newbies_Percent, Returning_Percent
)
# Display the demographic summary
#kable(demographics_summary, caption = "Demographic Breakdown by Cluster")
print(demographics_summary)
## # A tibble: 5 × 10
## Cluster Count Men_Percent Women_Percent Other_Percent Rural_Percent
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 4061 46.2 48.1 5.71 14.4
## 2 2 4078 48.0 46.8 5.20 14.6
## 3 3 5290 46.7 46.7 6.60 16.1
## 4 4 2547 33.8 32.7 33.5 14.9
## 5 5 5331 44.9 45.1 10.0 15.6
## # ℹ 4 more variables: Urban_Percent <dbl>, Suburban_Percent <dbl>,
## # Newbies_Percent <dbl>, Returning_Percent <dbl>
# Gender Distribution Across Clusters
ggplot(email_data_with_clusters, aes(x = Cluster, fill = gender)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Gender Distribution Across Clusters",
x = "Cluster",
y = "Proportion",
fill = "Gender"
) +
theme_minimal()
# Zip Code Distribution Across Clusters
ggplot(email_data_with_clusters, aes(x = Cluster, fill = zip_code)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Zip Code Distribution Across Clusters",
x = "Cluster",
y = "Proportion",
fill = "Zip Code"
) +
theme_minimal()
Let’s assess engagement metrics per customer
# Channel Distribution Across Clusters with Normalization
channel_summary_normalized <- email_data_with_clusters %>%
group_by(Cluster) %>%
mutate(Total_Cluster_Count = n()) %>%
group_by(Cluster, channel) %>%
summarise(Count = n(), Total_Cluster_Count = first(Total_Cluster_Count)) %>%
mutate(Proportion = Count / Total_Cluster_Count * 100)
## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.
# Display the normalized channel summary
kable(channel_summary_normalized, caption = "Normalized Channel Preferences by Cluster")
Cluster | channel | Count | Total_Cluster_Count | Proportion |
---|---|---|---|---|
1 | Multichannel | 274 | 4061 | 6.747107 |
1 | Phone | 1834 | 4061 | 45.161290 |
1 | Web | 1953 | 4061 | 48.091603 |
2 | Multichannel | 245 | 4078 | 6.007847 |
2 | Phone | 1962 | 4078 | 48.111820 |
2 | Web | 1871 | 4078 | 45.880333 |
3 | Multichannel | 394 | 5290 | 7.448015 |
3 | Phone | 2412 | 5290 | 45.595463 |
3 | Web | 2484 | 5290 | 46.956522 |
4 | Multichannel | 1000 | 2547 | 39.261877 |
4 | Phone | 767 | 2547 | 30.113859 |
4 | Web | 780 | 2547 | 30.624264 |
5 | Multichannel | 664 | 5331 | 12.455449 |
5 | Phone | 2265 | 5331 | 42.487338 |
5 | Web | 2402 | 5331 | 45.057212 |
# Stacked Bar Chart for Normalized Channel Preferences
ggplot(channel_summary_normalized, aes(x = Cluster, y = Proportion, fill = channel)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Pastel1") +
labs(
title = "Normalized Channel Preferences by Cluster",
x = "Cluster",
y = "Percentage",
fill = "Channel"
) +
theme_minimal()
# Proportion of Visits per Cluster
visits_summary <- email_data_with_clusters %>%
group_by(Cluster) %>%
summarise(
Visit_Rate = mean(visit, na.rm = TRUE) * 100, # Proportion of customers who visited
Total_Customers = n() # Total number of customers in each cluster
)
# Display the visits summary
kable(visits_summary, caption = "Visit Rate by Cluster")
Cluster | Visit_Rate | Total_Customers |
---|---|---|
1 | 10.56390 | 4061 |
2 | 15.79205 | 4078 |
3 | 18.69565 | 5290 |
4 | 22.53632 | 2547 |
5 | 23.59782 | 5331 |
# Bar Plot for Visit Rates per Cluster
ggplot(visits_summary, aes(x = Cluster, y = Visit_Rate, fill = Cluster)) +
geom_bar(stat = "identity") +
labs(
title = "Visit Rate Across Clusters",
x = "Cluster",
y = "Visit Rate (%)"
) +
theme_minimal() +
theme(legend.position = "none")
# Conversion Rate by Cluster
conversion_summary <- email_data_with_clusters %>%
group_by(Cluster) %>%
summarise(
Total_Customers = n(),
Converted = sum(conversion_status == "Converted"),
Conversion_Rate = (Converted / Total_Customers) * 100
)
# Display the conversion summary
kable(conversion_summary, caption = "Conversion Rates by Cluster")
Cluster | Total_Customers | Converted | Conversion_Rate |
---|---|---|---|
1 | 4061 | 36 | 0.8864812 |
2 | 4078 | 34 | 0.8337420 |
3 | 5290 | 57 | 1.0775047 |
4 | 2547 | 49 | 1.9238320 |
5 | 5331 | 91 | 1.7069968 |
# Bar Chart for Conversion Rates
ggplot(conversion_summary, aes(x = Cluster, y = Conversion_Rate, fill = Cluster)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Conversion Rates by Cluster",
x = "Cluster",
y = "Conversion Rate (%)"
) +
theme_minimal() +
theme(legend.position = "none")
# History Segment Distribution Across Clusters
history_summary <- email_data_with_clusters %>%
group_by(Cluster, history_segment) %>%
summarise(Count = n()) %>%
mutate(Proportion = Count / sum(Count) * 100)
## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.
# Display the history summary
kable(history_summary, caption = "History Segment Distribution by Cluster")
Cluster | history_segment | Count | Proportion |
---|---|---|---|
1 | 1) $0 - $100 | 2034 | 50.086186 |
1 | 2) $100 - $200 | 1011 | 24.895346 |
1 | 3) $200 - $350 | 620 | 15.267176 |
1 | 4) $350 - $500 | 176 | 4.333908 |
1 | 5) $500 - $750 | 220 | 5.417385 |
2 | 1) $0 - $100 | 1815 | 44.507111 |
2 | 2) $100 - $200 | 1195 | 29.303580 |
2 | 3) $200 - $350 | 796 | 19.519372 |
2 | 4) $350 - $500 | 272 | 6.669936 |
3 | 1) $0 - $100 | 2314 | 43.742911 |
3 | 2) $100 - $200 | 1301 | 24.593573 |
3 | 3) $200 - $350 | 1150 | 21.739130 |
3 | 4) $350 - $500 | 525 | 9.924386 |
4 | 4) $350 - $500 | 62 | 2.434236 |
4 | 5) $500 - $750 | 1377 | 54.063604 |
4 | 6) $750 - $1,000 | 644 | 25.284649 |
4 | 7) $1,000 + | 464 | 18.217511 |
5 | 1) $0 - $100 | 1561 | 29.281561 |
5 | 2) $100 - $200 | 1184 | 22.209717 |
5 | 3) $200 - $350 | 1524 | 28.587507 |
5 | 4) $350 - $500 | 1062 | 19.921215 |
# Stacked Bar Chart for History Segments
ggplot(history_summary, aes(x = Cluster, y = Proportion, fill = history_segment)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set3") +
labs(
title = "History Segment Distribution by Cluster",
x = "Cluster",
y = "Percentage",
fill = "History Segment"
) +
theme_minimal()
transition_rates <- email_data_mens %>%
group_by(channel) %>%
summarise(
total_customers = n(),
visit_rate = sum(visit) / n(),
conversion_rate = sum(conversion) / sum(visit[visit == 1])
)
print(transition_rates)
## # A tibble: 3 × 4
## channel total_customers visit_rate conversion_rate
## <fct> <int> <dbl> <dbl>
## 1 Multichannel 2577 0.211 0.0807
## 2 Phone 9240 0.163 0.0665
## 3 Web 9490 0.194 0.0667