Packages

packages <- c("SmartEDA", "tidyr", "dplyr", "ggplot2", "randomForest", "tinytex", "scales", "plotly", "corrplot", "caret", "tidyverse", "lubridate")

# Install any packages that are not yet installed
installed_packages <- rownames(installed.packages())
for (pkg in packages) {
  if (!(pkg %in% installed_packages)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attache Paket: 'dplyr'

## Die folgenden Objekte sind maskiert von 'package:stats':
## 
##     filter, lag

## Die folgenden Objekte sind maskiert von 'package:base':
## 
##     intersect, setdiff, setequal, union

## Warning: Paket 'ggplot2' wurde unter R Version 4.4.1 erstellt

## Warning: Paket 'randomForest' wurde unter R Version 4.4.1 erstellt

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attache Paket: 'randomForest'

## Das folgende Objekt ist maskiert 'package:ggplot2':
## 
##     margin

## Das folgende Objekt ist maskiert 'package:dplyr':
## 
##     combine

## Warning: Paket 'plotly' wurde unter R Version 4.4.1 erstellt

## 
## Attache Paket: 'plotly'

## Das folgende Objekt ist maskiert 'package:ggplot2':
## 
##     last_plot

## Das folgende Objekt ist maskiert 'package:stats':
## 
##     filter

## Das folgende Objekt ist maskiert 'package:graphics':
## 
##     layout

## corrplot 0.92 loaded

## Warning: Paket 'caret' wurde unter R Version 4.4.1 erstellt

## Lade nötiges Paket: lattice

## Warning: Paket 'readr' wurde unter R Version 4.4.1 erstellt

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ✔ purrr     1.0.2     ✔ tibble    3.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor()     masks scales::col_factor()
## ✖ randomForest::combine() masks dplyr::combine()
## ✖ purrr::discard()        masks scales::discard()
## ✖ plotly::filter()        masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ purrr::lift()           masks caret::lift()
## ✖ randomForest::margin()  masks ggplot2::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Business Case

The following report details an in-depth analysis of our dataset, “email_data_mens,” which provides key insights into customer behavior in response to our email marketing campaigns. Our dataset consists of 21,307 unique customers, with detailed demographic and behavioral data across 15 columns. This report offers a comprehensive overview of the dataset structure, RFM model findings, and specific patterns identified in conversion rates and customer segment preferences. Based on these insights, recommendations are presented to enhance campaign effectiveness, increase conversions, and optimize targeted strategies.

Dataset

At first we are going to load all of our required packages and introduce our Dataset “email_data_men”

load("Group3_email_data_men.rda")
summary(email_data_mens)

##     recency               history_segment    history             mens       
##  Min.   : 1.000   1) $0 - $100    :7724   Min.   :  29.99   Min.   :0.0000  
##  1st Qu.: 2.000   2) $100 - $200  :4691   1st Qu.:  63.58   1st Qu.:0.0000  
##  Median : 6.000   3) $200 - $350  :4090   Median : 157.22   Median :1.0000  
##  Mean   : 5.774   4) $350 - $500  :2097   Mean   : 242.84   Mean   :0.5509  
##  3rd Qu.: 9.000   5) $500 - $750  :1597   3rd Qu.: 325.21   3rd Qu.:1.0000  
##  Max.   :12.000   6) $750 - $1,000: 644   Max.   :3215.97   Max.   :1.0000  
##                   7) $1,000 +     : 464                                     
##      womens            zip_code        newbie               channel    
##  Min.   :0.0000   Rural    :3243   Min.   :0.0000   Multichannel:2577  
##  1st Qu.:0.0000   Surburban:9501   1st Qu.:0.0000   Phone       :9240  
##  Median :1.0000   Urban    :8563   Median :1.0000   Web         :9490  
##  Mean   :0.5514                    Mean   :0.5015                      
##  3rd Qu.:1.0000                    3rd Qu.:1.0000                      
##  Max.   :1.0000                    Max.   :1.0000                      
##                                                                        
##      visit          conversion          spend        
##  Min.   :0.0000   Min.   :0.00000   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:  0.000  
##  Median :0.0000   Median :0.00000   Median :  0.000  
##  Mean   :0.1828   Mean   :0.01253   Mean   :  1.423  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:  0.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :499.000  
##

The dataset “email_data_mens” includes key customer information useful for Customer Base Analysis and for constructing an RFM (Recency, Frequency, Monetary) Model. This dataset comprises 21,307 customers, represented by various demographic and behavioral variables that allow us to segment and understand the customer base effectively.

For our RFM model we can use the “Recency” variable, “Newbie” will be used to measure Frequency of customer visits, while “History” serves as a proxy for Monetary value. Through this model and segmentation, we aim to uncover patterns in customer engagement, spending behaviors, and preferred channels.

What we can tell from the summary of our dataset: Low conversion rate: From the summary we can already tell, that we do have a low conversion rate, only 1,25% of emails result in conversions. This means, that there might be a significant opportunity. The average email conversion rate differs a lot from industry to industry. As per “https://www.klaviyo.com/uk/blog/email-marketing-benchmarks-open-click-and-conversion-rates” The average conversion rate across all industries is 1,71%, while some top performers have a conversion rate as high as 5,71%.

Customer Segmentation: The largest proportion of customers falls into the low spending segment (0-100$), with about 7724

Channel preference: Web and Phone are the primary channels, with multichannel being less utilized.

Let’s check if there are any missing or duplicated values:

missing_values <- sum(is.na(email_data_mens))
missing_values

## [1] 0

duplicates <- duplicated(email_data_mens)
any(duplicates)

## [1] TRUE

Luckily for us, our Dataset is already cleaned.

First Analysis

Let’s visualize some of those datapoints to find out what the relevant variables for our analysis will be.

Demographics

# Create the 'gender' column based on 'mens' and 'womens' columns
email_data_mens$gender <- with(email_data_mens, ifelse(mens == 1 & womens == 1, "Other",
                                   ifelse(mens == 1, "Men",
                                          ifelse(womens == 1, "Women", "Unspecified"))))

# Create a summary table grouped by zip_code and gender
demographics_by_zip <- email_data_mens %>%
  filter(gender %in% c("Men", "Women", "Other")) %>%  # Exclude 'Unspecified' if present
  group_by(zip_code, gender) %>%
  summarise(Count = n(), .groups = "drop")  # Explicitly drop grouping after summarising

# Pivot the data to a wider format
demographics_by_zip_wide <- demographics_by_zip %>%
  pivot_wider(names_from = gender, values_from = Count, values_fill = 0) %>%
  mutate(Total = Men + Women + Other)

# Calculate total for all zip codes
totals_row <- demographics_by_zip_wide %>%
  summarise(
    zip_code = "Total",  # Label this row as "Total"
    Men = sum(Men, na.rm = TRUE),
    Women = sum(Women, na.rm = TRUE),
    Other = sum(Other, na.rm = TRUE),
    Total = sum(Total, na.rm = TRUE)
  )

# Bind the totals row to the original data
demographics_by_zip_wide_with_totals <- bind_rows(demographics_by_zip_wide, totals_row)

# View the updated table with totals row
print(demographics_by_zip_wide_with_totals)

## # A tibble: 4 × 5
##   zip_code    Men Other Women Total
##   <chr>     <int> <int> <int> <int>
## 1 Rural      1445   323  1475  3243
## 2 Surburban  4301   975  4225  9501
## 3 Urban      3812   883  3868  8563
## 4 Total      9558  2181  9568 21307

This table gives us a detailled analysis of our costumer demographics and where they live.

ggplot(demographics_by_zip, aes(x = gender, y = Count, fill = gender)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ zip_code) +  # Create a separate plot for each zip code area
  labs(
    title = "Demographic Distribution by Zip Code",
    x = "Gender",
    y = "Number of Customers",
    fill = "Gender"
  ) +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 14, face = "bold"),  # Style facet labels
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    legend.position = "none"
  ) +
  scale_fill_manual(values = c("Men" = "#1f77b4", 
                               "Women" = "#ff7f0e", 
                               "Other" = "#2ca02c"))

The bar chart further visualizes the distribution of our customer demographics.

Monetary history

unique(email_data_mens$history_segment)

## [1] 5) $500 - $750   2) $100 - $200   1) $0 - $100     3) $200 - $350  
## [5] 6) $750 - $1,000 4) $350 - $500   7) $1,000 +     
## 7 Levels: 1) $0 - $100 2) $100 - $200 3) $200 - $350 ... 7) $1,000 +

ggplot(email_data_mens, aes(x = history_segment)) +
  geom_bar(fill = "steelblue") +
  geom_text(stat = "count", aes(label = after_stat(count)), 
            vjust = -0.5, size = 3.5) +  # Add count labels above bars
  labs(
    title = "Spending Distribution of Customers",
    x = "Spending Bracket",
    y = "Number of Customers"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
  ) +
  ylim(0, max(table(email_data_mens$history_segment)) * 1.1)

We can see, the lowest spending bracket is clearly the one with the most customers and the bigger the spending bracket, the fewer customers there are in it.

Frequency

# Create a summary table for newbie vs returning customers
newbie_distribution <- email_data_mens %>%
  group_by(newbie) %>%
  summarise(Count = n()) %>%
  mutate(Category = ifelse(newbie == 1, "New Customers", "Returning Customers"))

# Calculate total customers
total_customers <- sum(newbie_distribution$Count)

# Add percentage to the summary table
newbie_distribution <- newbie_distribution %>%
  mutate(Percentage = round((Count / total_customers) * 100, 1),
         Label = paste0(Count, " (", Percentage, "%)"))

newbie_distribution

## # A tibble: 2 × 5
##   newbie Count Category            Percentage Label        
##    <int> <int> <chr>                    <dbl> <chr>        
## 1      0 10621 Returning Customers       49.8 10621 (49.8%)
## 2      1 10686 New Customers             50.2 10686 (50.2%)

ggplot(newbie_distribution, aes(x = Category, y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Label), vjust = -0.5, size = 5) +
  labs(
    title = "New vs. Returning Customers",
    x = "Customer Type",
    y = "Number of Customers"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14)
  ) +
  ylim(0, max(newbie_distribution$Count) * 1.1)

The number of returning and new customers is more or less equal, as shown on the bar chart and the datatable.

Conversion

# Convert 'conversion' to a factor with descriptive labels
email_data_mens <- email_data_mens %>%
  mutate(conversion_status = factor(conversion, levels = c(0, 1),
                                    labels = c("Not Converted", "Converted")))

# Summarize counts per channel and conversion status
channel_conversion_summary <- email_data_mens %>%
  group_by(channel, conversion_status) %>%
  summarise(Count = n(), .groups = "drop")

# Calculate total customers per channel to compute conversion rates
channel_totals <- email_data_mens %>%
  group_by(channel) %>%
  summarise(Total = n(), .groups = "drop")

# Create conversion labels
conversion_labels <- channel_totals %>%
  left_join(
    email_data_mens %>%
      group_by(channel) %>%
      summarise(Converted = sum(conversion == 1), .groups = "drop"),
    by = "channel"
  ) %>%
  mutate(Conversion_Rate = round((Converted / Total) * 100, 2),
         Label = paste0("Conversion Rate: ", Conversion_Rate, "%"))

conversion_labels

## # A tibble: 3 × 5
##   channel      Total Converted Conversion_Rate Label                 
##   <fct>        <int>     <int>           <dbl> <chr>                 
## 1 Multichannel  2577        44            1.71 Conversion Rate: 1.71%
## 2 Phone         9240       100            1.08 Conversion Rate: 1.08%
## 3 Web           9490       123            1.3  Conversion Rate: 1.3%

Conversion rates can be influenced by the total number of customers in each channel. Channels with a smaller customer base might show a higher conversion rate due to smaller, more targeted audiences. Conversely, larger bases might have lower conversion rates because it’s harder to achieve high engagement across a broad audience.

You could argue that the conversion rate for “Multichannel” is the highest. However, it is also the channel with the least amount of customers and therefore the one with the least amount of conversions.

#mapping the customer journey out,to see where we lose our customers
transition_rates <- email_data_mens %>%
  group_by(channel) %>%
  summarise(
    total_customers = n(),
    visit_rate = sum(visit) / n(),
    visit_conversion_rate = sum(conversion) / sum(visit[visit == 1])
  )

print(transition_rates)

## # A tibble: 3 × 4
##   channel      total_customers visit_rate visit_conversion_rate
##   <fct>                  <int>      <dbl>                 <dbl>
## 1 Multichannel            2577      0.211                0.0807
## 2 Phone                   9240      0.163                0.0665
## 3 Web                     9490      0.194                0.0667

The visit rate shows us the proportion of customers who visited the platform through each channel as well es the visit conversion rate, which tells us how many people convert after visiting. Again Multichannel performs the best and Phone the worst.

Models

Now that we identified and evaluated our important variables, we can implement our statistical Models.

Correlation Matrix

Let us find out which variables could actually be responsible for leading to conversions:

# Select numerical variables including the target
num_vars <- email_data_mens %>% select(recency, history, conversion)

# Compute correlation matrix
cor_matrix <- cor(num_vars, use = "complete.obs")

# Plot the correlation matrix
corrplot(cor_matrix, method = "color", addCoef.col = "black",
         tl.col = "black", number.cex = 0.7, title = "Correlation Matrix",
         mar = c(0,0,1,0))

The correlation Matrix shows us, that there is a weak negative correlation between recency and history. Other than that, there is no correlation at all between recency and history to conversion.

Chi-squared tests

gender_conversion <- table(email_data_mens$gender, email_data_mens$conversion)
chisq.test(gender_conversion)

## 
##  Pearson's Chi-squared test
## 
## data:  gender_conversion
## X-squared = 27.358, df = 2, p-value = 1.146e-06

There is a statistically significant association between gender and conversion. This suggests that conversion rates are influenced by gender, and the distribution of conversion differs based on gender categories.

# Channel vs Conversion
channel_conversion <- table(email_data_mens$channel, email_data_mens$conversion)
chisq.test(channel_conversion)

## 
##  Pearson's Chi-squared test
## 
## data:  channel_conversion
## X-squared = 6.6199, df = 2, p-value = 0.03652

There is a moderate association between channel and conversion, with a p-value just below 0.05. This means that the marketing channel used may influence conversion rates, but the strength of the association is relatively weak

# Zip Code vs Conversion
zip_conversion <- table(email_data_mens$zip_code, email_data_mens$conversion)
chisq.test(zip_conversion)

## 
##  Pearson's Chi-squared test
## 
## data:  zip_conversion
## X-squared = 0.34122, df = 2, p-value = 0.8432

There is no statistically significant association between zip code and conversion. This means that geographic location (represented by zip code) does not affect conversion rates

Logistic regression

#stepwise logistic regression
set.seed(123)
# Create training and testing sets
train_index <- createDataPartition(email_data_mens$conversion, p = 0.7, list = FALSE)
train_data <- email_data_mens[train_index, ]
test_data <- email_data_mens[-train_index, ]

# Fit the logistic regression model with selected variables
logistic_model <- glm(conversion ~ gender + channel + history + recency,
                      data = train_data, family = binomial)

# Summary of the model
summary(logistic_model)

## 
## Call:
## glm(formula = conversion ~ gender + channel + history + recency, 
##     family = binomial, data = train_data)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.1810217  0.2814755 -14.854  < 2e-16 ***
## genderOther   0.6184147  0.2260939   2.735  0.00623 ** 
## genderWomen  -0.0107015  0.1706018  -0.063  0.94998    
## channelPhone -0.3783237  0.2326204  -1.626  0.10387    
## channelWeb   -0.1777727  0.2261561  -0.786  0.43183    
## history       0.0003352  0.0002813   1.192  0.23332    
## recency      -0.0375055  0.0230230  -1.629  0.10330    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1894.9  on 14914  degrees of freedom
## Residual deviance: 1869.3  on 14908  degrees of freedom
## AIC: 1883.3
## 
## Number of Fisher Scoring iterations: 7

# Predict on test data
test_data$pred_prob <- predict(logistic_model, newdata = test_data, type = "response")
test_data$pred_class <- ifelse(test_data$pred_prob > 0.3, 1, 0)

# Confusion Matrix
confusionMatrix(factor(test_data$pred_class), factor(test_data$conversion))

## Warning in confusionMatrix.default(factor(test_data$pred_class),
## factor(test_data$conversion)): Levels are not in the same order for reference
## and data. Refactoring data to match.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6299   93
##          1    0    0
##                                           
##                Accuracy : 0.9855          
##                  95% CI : (0.9822, 0.9882)
##     No Information Rate : 0.9855          
##     P-Value [Acc > NIR] : 0.5275          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9855          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9855          
##          Detection Rate : 0.9855          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

From the logistic regression model we can again conclude, that the gender does significantly effect the conversion. In particular, The “Other” gender category significantly increases the log-odds of conversion compared to “Men,” suggesting that people in the “Other” category are more likely to convert.

Due to the fact, that non other variables have a significant impact on the model, the prediction does not predict any conversions.

RFM Model

Let’s try a rfm model Objective: Adjusting the Recency metric to measure the time since the last purchase in months, where:

1 month represents the most recent purchase. 12 months represents the least recent purchase within the past year.

Problem: Exact shopping frequency is unknown. Available Data: A newbie column where: 1 = Newbie (frequency = 1) 0 = Returning customer (frequency > 1) Approach: Assign frequency = 1 for newbies and frequency = 2 for returning customers.

email_data_mens <- email_data_mens %>%
  mutate(customer_id = row_number())
# Define analysis_date as today's date
analysis_date <- Sys.Date()


email_data_mens <- email_data_mens %>%
  mutate(
    Frequency = ifelse(newbie == 1, 1, 2)  # Assign 1 for newbies, 2 for returning customers
  )

rfm_table <- email_data_mens %>%
  mutate(
    Recency = ifelse(recency > 12, 12, recency),  
    Frequency = ifelse(newbie == 1, 1, 2),        # Default frequency assignment
    Monetary = history
  ) %>%
  select(customer_id, Recency, Frequency, Monetary)

rfm_table <- rfm_table %>%
  mutate(
    # Normalize Recency: Lower Recency is better
    Recency_Normalized = (12 - Recency) / 11,  # Assuming Recency is capped at 12

    # Normalize Frequency: Already between 1-2
    Frequency_Normalized = (Frequency - 1) / 1,  # Scaled to 0-1

    # Normalize Monetary
    Monetary_Normalized = (Monetary - min(Monetary)) / (max(Monetary) - min(Monetary))
  ) %>%
  mutate(
    # Assign weights to each metric
    R_weighted = Recency_Normalized * 0.5,
    F_weighted = Frequency_Normalized * 0.2,
    M_weighted = Monetary_Normalized * 0.5,

    # Calculate Total RFM Score
    RFM_Score = R_weighted + F_weighted + M_weighted
  )
ggplot(rfm_table, aes(x = R_weighted, y = F_weighted, color = M_weighted)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "RFM Segmentation",
    x = "Recency Score",
    y = "Frequency Score",
    color = "Monetary Score"
  ) +
  theme_minimal()

K-Means Clustering

# Prepare data for clustering
rfm_clusters <- rfm_table %>%
  select(Recency, Frequency, Monetary) %>%
  scale()  # Standardize the data

# Determine optimal number of clusters using Elbow Method
set.seed(123)
wss <- sapply(1:10, function(k){
  kmeans(rfm_clusters, centers = k, nstart = 10)$tot.withinss
})
# Plot Elbow Method
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of Clusters",
     ylab = "Total Within-Clusters Sum of Squares",
     main = "Elbow Method for Determining Optimal Clusters")

Based on the Elbow Method we chose k=5 for optimal clustering.

optimal_k = 5
# Perform K-Means Clustering
set.seed(123)
kmeans_model <- kmeans(rfm_clusters, centers = optimal_k, nstart = 25)

# Assign cluster labels to the RFM table
rfm_table$Cluster <- as.factor(kmeans_model$cluster)

Analyzing the Clusters

The RFM Model consists of various types of customers, which we can now assign based on our clusters. According to the types we will build our following business strategy.

To achieve this, we first have to calculate the average values for all our variables in the model. The resulting table will be the base for our following analysis.

# Analyze RFM Averages per Cluster

cluster_summary <- rfm_table %>%
  mutate(cluster = kmeans_model$cluster) %>%
  group_by(cluster) %>%
  summarise(
    Avg_Recency = mean(Recency),
    Avg_Frequency = mean(Frequency),
    Avg_Monetary = mean(Monetary),
    Avg_RFM_Score = mean(RFM_Score),
    Count = n()
  )
print(cluster_summary)

## # A tibble: 5 × 6
##   cluster Avg_Recency Avg_Frequency Avg_Monetary Avg_RFM_Score Count
##     <int>       <dbl>         <dbl>        <dbl>         <dbl> <int>
## 1       1        9.44             1         149.         0.135  4061
## 2       2        3.19             1         144.         0.418  4078
## 3       3        9.03             2         158.         0.355  5290
## 4       4        3.40             1         796.         0.511  2547
## 5       5        2.86             2         210.         0.644  5331

This leads to the following risk categories:

Cluster 1: “At-Risk Customers”. These customers have engaged less recently, do not purchase frequently, and have moderate spending. They may be starting to drift away and could benefit from re-engagement efforts to prevent churn.
Cluster 2: “Recent but Low-Frequency Spenders”. This group has recently engaged but spends infrequently and at a moderate level. They could be potential for upselling with targeted offers to increase engagement frequency.
Cluster 3: “Occasional Loyalists”. These customers engage relatively frequently (Frequency = 2) but with lower recent activity (Recency = 9.03). Their moderate spending and frequency indicate that they value the brand but may need periodic nudges to stay engaged.
Cluster 4: “High-Value New Customers”. These customers spend significantly more but engage less frequently. Their recent engagement indicates potential for high lifetime value, suggesting they may respond well to loyalty-building strategies.
Cluster 5: “Loyal Champions”. This group is the most engaged with high frequency, recent activity, and above-average spending. They are prime candidates for loyalty programs, VIP offers, and brand advocacy initiatives.

# assigne the clusters their label names
rfm_table_named <- rfm_table %>%
  mutate(cluster = kmeans_model$cluster) %>%  # Add cluster labels from the K-means model
  mutate(Customer_Segment = case_when(
    cluster == 1 ~ "At-Risk Customers",
    cluster == 2 ~ "Recent but Low-Frequency Spenders",
    cluster == 3 ~ "Occasional Loyalists",
    cluster == 4 ~ "High-Value New Customers",
    cluster == 5 ~ "Loyal Champions"
  ))

# Summarize the segments for visualisation
segment_summary <- rfm_table_named %>%
  group_by(Customer_Segment) %>%
  summarise(
    Avg_Recency = mean(Recency),
    Avg_Frequency = mean(Frequency),
    Avg_Monetary = mean(Monetary),
    Count = n()
  )

# Plotting the number of customers in each segment
ggplot(segment_summary, aes(x = reorder(Customer_Segment, -Count), y = Count, fill = Customer_Segment)) +
  geom_bar(stat = "identity") +
  labs(title = "Customer Segment Distribution",
       x = "Customer Segment",
       y = "Number of Customers") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")

Now we visualize our RFM Model using Plotly for a 3D visualisation.

# creating a plotly graph for 3d visualisation
plot_ly(rfm_table_named, x = ~Recency, y = ~Frequency, z = ~Monetary, 
             color = ~Customer_Segment, colors = "Set2",
             type = 'scatter3d', mode = 'markers') %>%
  layout(title = "3D RFM Segmentation",
         scene = list(xaxis = list(title = "Recency"),
                      yaxis = list(title = "Frequency"),
                      zaxis = list(title = "Monetary")))

Analisation of demographic values

Now that we do have 5 different customer segments according to our RFM model, let’s try to analyze key demographic variables such as gender, zip_code, and history_segment within each cluster

current_wd <- getwd()

Sys.setenv(TMPDIR = current_wd)
Sys.setenv(TEMP = current_wd)
Sys.setenv(TMP = current_wd)
Sys.setenv(TMPDIR = "C:/Users/atapp/OneDrive/WISO Studium/SBWL/Data Science/Applications of Data Science/project 1")
library(knitr)
# Assuming 'rfm_table' contains 'customer_id' and 'Cluster'
# Merge 'rfm_table' with 'email_data_mens'

# Ensure 'customer_id' is unique in both datasets
email_data_mens_unique <- email_data_mens %>%
  distinct(customer_id, .keep_all = TRUE)

# Merge the datasets
email_data_with_clusters <- email_data_mens_unique %>%
  left_join(rfm_table %>% select(customer_id, Cluster), by = "customer_id")

# Demographic Summary by Cluster
demographics_summary <- email_data_with_clusters %>%
  group_by(Cluster) %>%
  summarise(
    Count = n(),
    Men = sum(gender == "Men"),
    Women = sum(gender == "Women"),
    Other = sum(gender == "Other"),
    Rural = sum(zip_code == "Rural"),
    Urban = sum(zip_code == "Urban"),
    Suburban = sum(zip_code == "Suburban"),
    Newbies = sum(newbie == 1),
    Returning_Customers = sum(newbie == 0)
   
  ) %>%
  mutate(
    Men_Percent = Men / Count * 100,
    Women_Percent = Women / Count * 100,
    Other_Percent = Other / Count * 100,
    Rural_Percent = Rural / Count * 100,
    Urban_Percent = Urban / Count * 100,
    Suburban_Percent = Suburban / Count * 100,
    Newbies_Percent = Newbies / Count * 100,
    Returning_Percent = Returning_Customers / Count * 100
  ) %>%
  select(
    Cluster, Count,
    Men_Percent, Women_Percent, Other_Percent,
    Rural_Percent, Urban_Percent, Suburban_Percent,
    Newbies_Percent, Returning_Percent
  )

# Display the demographic summary
#kable(demographics_summary, caption = "Demographic Breakdown by Cluster")
print(demographics_summary)

## # A tibble: 5 × 10
##   Cluster Count Men_Percent Women_Percent Other_Percent Rural_Percent
##   <fct>   <int>       <dbl>         <dbl>         <dbl>         <dbl>
## 1 1        4061        46.2          48.1          5.71          14.4
## 2 2        4078        48.0          46.8          5.20          14.6
## 3 3        5290        46.7          46.7          6.60          16.1
## 4 4        2547        33.8          32.7         33.5           14.9
## 5 5        5331        44.9          45.1         10.0           15.6
## # ℹ 4 more variables: Urban_Percent <dbl>, Suburban_Percent <dbl>,
## #   Newbies_Percent <dbl>, Returning_Percent <dbl>

# Gender Distribution Across Clusters
ggplot(email_data_with_clusters, aes(x = Cluster, fill = gender)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Gender Distribution Across Clusters",
    x = "Cluster",
    y = "Proportion",
    fill = "Gender"
  ) +
  theme_minimal()

# Zip Code Distribution Across Clusters
ggplot(email_data_with_clusters, aes(x = Cluster, fill = zip_code)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Zip Code Distribution Across Clusters",
    x = "Cluster",
    y = "Proportion",
    fill = "Zip Code"
  ) +
  theme_minimal()

Let’s assess engagement metrics per customer

# Channel Distribution Across Clusters with Normalization
channel_summary_normalized <- email_data_with_clusters %>%
  group_by(Cluster) %>%
  mutate(Total_Cluster_Count = n()) %>%
  group_by(Cluster, channel) %>%
  summarise(Count = n(), Total_Cluster_Count = first(Total_Cluster_Count)) %>%
  mutate(Proportion = Count / Total_Cluster_Count * 100)

## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.

# Display the normalized channel summary
kable(channel_summary_normalized, caption = "Normalized Channel Preferences by Cluster")

Normalized Channel Preferences by Cluster
Cluster	channel	Count	Total_Cluster_Count	Proportion
1	Multichannel	274	4061	6.747107
1	Phone	1834	4061	45.161290
1	Web	1953	4061	48.091603
2	Multichannel	245	4078	6.007847
2	Phone	1962	4078	48.111820
2	Web	1871	4078	45.880333
3	Multichannel	394	5290	7.448015
3	Phone	2412	5290	45.595463
3	Web	2484	5290	46.956522
4	Multichannel	1000	2547	39.261877
4	Phone	767	2547	30.113859
4	Web	780	2547	30.624264
5	Multichannel	664	5331	12.455449
5	Phone	2265	5331	42.487338
5	Web	2402	5331	45.057212

# Stacked Bar Chart for Normalized Channel Preferences
ggplot(channel_summary_normalized, aes(x = Cluster, y = Proportion, fill = channel)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    title = "Normalized Channel Preferences by Cluster",
    x = "Cluster",
    y = "Percentage",
    fill = "Channel"
  ) +
  theme_minimal()

# Proportion of Visits per Cluster
visits_summary <- email_data_with_clusters %>%
  group_by(Cluster) %>%
  summarise(
    Visit_Rate = mean(visit, na.rm = TRUE) * 100,  # Proportion of customers who visited
    Total_Customers = n()  # Total number of customers in each cluster
  )

# Display the visits summary
kable(visits_summary, caption = "Visit Rate by Cluster")

Visit Rate by Cluster
Cluster	Visit_Rate	Total_Customers
1	10.56390	4061
2	15.79205	4078
3	18.69565	5290
4	22.53632	2547
5	23.59782	5331

# Bar Plot for Visit Rates per Cluster
ggplot(visits_summary, aes(x = Cluster, y = Visit_Rate, fill = Cluster)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Visit Rate Across Clusters",
    x = "Cluster",
    y = "Visit Rate (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Conversion Rate by Cluster
conversion_summary <- email_data_with_clusters %>%
  group_by(Cluster) %>%
  summarise(
    Total_Customers = n(),
    Converted = sum(conversion_status == "Converted"),
    Conversion_Rate = (Converted / Total_Customers) * 100
  )

# Display the conversion summary
kable(conversion_summary, caption = "Conversion Rates by Cluster")

Conversion Rates by Cluster
Cluster	Total_Customers	Converted	Conversion_Rate
1	4061	36	0.8864812
2	4078	34	0.8337420
3	5290	57	1.0775047
4	2547	49	1.9238320
5	5331	91	1.7069968

# Bar Chart for Conversion Rates
ggplot(conversion_summary, aes(x = Cluster, y = Conversion_Rate, fill = Cluster)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Conversion Rates by Cluster",
    x = "Cluster",
    y = "Conversion Rate (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# History Segment Distribution Across Clusters
history_summary <- email_data_with_clusters %>%
  group_by(Cluster, history_segment) %>%
  summarise(Count = n()) %>%
  mutate(Proportion = Count / sum(Count) * 100)

## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.

# Display the history summary
kable(history_summary, caption = "History Segment Distribution by Cluster")

History Segment Distribution by Cluster
Cluster	history_segment	Count	Proportion
1	1) $0 - $100	2034	50.086186
1	2) $100 - $200	1011	24.895346
1	3) $200 - $350	620	15.267176
1	4) $350 - $500	176	4.333908
1	5) $500 - $750	220	5.417385
2	1) $0 - $100	1815	44.507111
2	2) $100 - $200	1195	29.303580
2	3) $200 - $350	796	19.519372
2	4) $350 - $500	272	6.669936
3	1) $0 - $100	2314	43.742911
3	2) $100 - $200	1301	24.593573
3	3) $200 - $350	1150	21.739130
3	4) $350 - $500	525	9.924386
4	4) $350 - $500	62	2.434236
4	5) $500 - $750	1377	54.063604
4	6) $750 - $1,000	644	25.284649
4	7) $1,000 +	464	18.217511
5	1) $0 - $100	1561	29.281561
5	2) $100 - $200	1184	22.209717
5	3) $200 - $350	1524	28.587507
5	4) $350 - $500	1062	19.921215

# Stacked Bar Chart for History Segments
ggplot(history_summary, aes(x = Cluster, y = Proportion, fill = history_segment)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "History Segment Distribution by Cluster",
    x = "Cluster",
    y = "Percentage",
    fill = "History Segment"
  ) +
  theme_minimal()

transition_rates <- email_data_mens %>%
  group_by(channel) %>%
  summarise(
    total_customers = n(),
    visit_rate = sum(visit) / n(),
    conversion_rate = sum(conversion) / sum(visit[visit == 1])
  )

print(transition_rates)

## # A tibble: 3 × 4
##   channel      total_customers visit_rate conversion_rate
##   <fct>                  <int>      <dbl>           <dbl>
## 1 Multichannel            2577      0.211          0.0807
## 2 Phone                   9240      0.163          0.0665
## 3 Web                     9490      0.194          0.0667

Customer Base Analysis

Alex Tappeiner, Nicole Dangl

2024-10-25