Code
library(readxl)
library(tidyverse)
library(dplyr)
library(rfm)This project delivers a comprehensive Customer Value Segmentation analysis for our online education platform, utilizing transaction data spanning 9,935 orders across a 12-month period (January 1, 2019 – December 31, 2019). By implementing a Recency, Frequency, and Monetary (RFM) framework, the analysis moves past aggregated metrics to map true student and institutional behavior. The objective is to transition from uniform marketing to high-precision, automated B2C workflows and high-touch B2B account management.
High-Volume Core vs. High-Value Institutional Skew: The platform’s overall distribution is heavily right-skewed. While the median individual order value sits at $636.72, a significant volume of transaction outliers stretches up to $9,854.88. This creates a dual-layer ecosystem: a highly dense, transaction-heavy consumer (B2C) base and a high-ticket enterprise (B2B) client layer.
Frequency-Driven Lifetime Value: Diagnostic heatmaps and facet matrices reveal that customer lifetime value is overwhelmingly dictated by purchase frequency. Due to relatively uniform individual transaction limits, customers rarely achieve a high Monetary tier through a single purchase; instead, they “earn” their way into high-value cohorts through sequential curriculum progression or multi-seat corporate expansions.
Excellent Platform Momentum: Recency distribution is heavily weighted toward the end of the fiscal year, with nearly 80% of all transaction activity concentrated within a fresh 0-to-60-day window. This reflects robust platform adoption and an exceptionally engaged user base closing out Q4.
The model successfully grouped the active database into distinct behavioral segments, revealing critical revenue-protection priorities:
| Segment | Median Monetary Value | Strategic Classification | Core Action Plan |
| At Risk | $29,711 | High-Value Enterprise Leak | Immediate Sales Intervention: Proactive outreach by account managers with tailored training bundles or contract renewals before permanent churn occurs. |
| Champions | $27,910 | Hyper-Active Core Advocates | Retention & VIP Access: Enroll in early-access program reviews, beta testing for new curriculums, and enterprise-wide licensing renewals. |
| Can’t Lose Them | $24,631 | Lapsed Institutional Clients | Win-Back Campaigns: Dedicated executive-level consulting outreach to identify platform adoption blockers or competitive friction points. |
| Potential Loyalist | $17,331 | High-Potential Mid-Tier Base | Milestone Up-selling: Target with mid-curriculum promotions to push them horizontally into high-frequency habits. |
| New / Promising | $8,013 – $13,811 | Emerging Pipeline | Automated Cultivation: Trigger behavioral email tracks recommending the next logical stage of their learning certification path. |
library(readxl)
library(tidyverse)
library(dplyr)
library(rfm)df<- read_excel("data - Online Edu.xlsx",
sheet = "Orders")
df <- df %>%
mutate(Date = as.Date(Date),
User_id = as.factor(User_id)) %>%
filter(!is.na(User_id), !is.na(Date), !is.na(Order_value))# Summary of the data
summary(df) Order_id Date User_id Order_value
Length:9935 Min. :2019-01-01 14470 : 38 Min. : 13.44
Class :character 1st Qu.:2019-05-15 11545 : 33 1st Qu.: 249.54
Mode :character Median :2019-08-09 18295 : 33 Median : 636.72
Mean :2019-07-27 18910 : 33 Mean :1340.96
3rd Qu.:2019-10-22 21355 : 33 3rd Qu.:1641.22
Max. :2019-12-31 14395 : 32 Max. :9854.88
(Other):9733
The summary shows that there are total 9,935 orders with minimum value of $13.44 (an introduction course maybe) maximum value of $9,854.88 (might be a corporate purchase ) and median value of $636.72.
# Order value distribution
ggplot(df, aes(y = Order_value)) +
geom_histogram() +
labs(title = "Order Value Plot",
subtitle = "Order value distribution") +
theme_minimal()The chart shows that it is a classic long tail business, with value for majority of the orders fall into the range between $13.44 and $500 (lower tier). Mid-tier range from $500 to $2,500 remains relatively thick, shows that a substantial number of students are willing to invest that amount. Top-tier range from $2,500 to nearly $10,000. The business demonstrates a classic 80/20 rule. The top-tier users should be closed protected as they contribute the lion’s share of the company revenue. Actions need to take immediately if they fall into “At risk” or “About to sleep” category
ggplot(df, aes(y = Order_value)) +
geom_boxplot() +
labs(title = "Order Value Plot",
subtitle = "Order value distribution") +
theme_minimal()The chart reinforces the finding that the overwhelming majority of transactions are lower-to-mid tier retail purchases. However, the upper whisker stops around $3,700 shows that any users paying over that amount are likely corporate users and should be analysed separately.
max_date <- max(df$Date, na.rm = TRUE)
rfm_result <- rfm_table_order(
data = df,
customer_id = User_id,
revenue = Order_value,
order_date = Date,
analysis_date = max_date
)
head(rfm_result$rfm)# A tibble: 6 × 9
customer_id recency_days transaction_count amount recency_score
<fct> <dbl> <int> <dbl> <int>
1 10015 36 16 9607. 3
2 10015 36 16 9607. 3
3 10015 36 16 9607. 3
4 10015 36 16 9607. 3
5 10015 36 16 9607. 3
6 10015 36 16 9607. 3
# ℹ 4 more variables: frequency_score <int>, monetary_score <int>,
# rfm_score <dbl>, Order_id <chr>
# Plot heatmap of the rfm results
rfm_plot_heatmap(rfm_result)The chart shows that the monetary value is almost entirely driven by Frequency rather than recency, where customer lifetime value (CLV) is a direct function of repeat purchases. The bottom right square is the one to action upon as they were high frequency and high-spending customers but have gone cold completely. Marketing efforts should also be put to convert Frequency Score 2 customers to Frequency Score 3 and higher.
# Facet plot
rfm_plot_bar_chart(rfm_result)The chart shows that each individual transaction is relatively uniform. high monetary bracket are largely driven by repeat purchase, which indicate that the marketing campaign should be highlight upskilling to encourage repeat purchase behavior
# View the distributions via Histograms
rfm_plot_histogram(rfm_result)The chart shows that the users are highly active and fresh as majority of users have made purchase in the last 60 days (7,800 out of 9,935), which indicates the Q4 marketing campaign is highly successful.
# Define standard RFM segment boundaries
segment_names <- c(
"Champions", "Potential Loyalist", "Loyal Customers", "Promising",
"New Customers", "Can't Lose Them", "At Risk", "Need Attention", "About To Sleep", "Lost"
)
recency_lower <- c(4, 3, 2, 3, 4, 1, 1, 1, 2, 1)
recency_upper <- c(5, 5, 4, 4, 5, 2, 2, 3, 3, 1)
frequency_lower <- c(4, 3, 2, 1, 1, 3, 2, 3, 1, 1)
frequency_upper <- c(5, 5, 4, 3, 3, 4, 5, 5, 3, 5)
monetary_lower <- c(4, 2, 2, 3, 1, 4, 4, 3, 1, 1)
monetary_upper <- c(5, 5, 4, 5, 5, 5, 5, 5, 4, 5)
# Map your customers to the segments
segmented_data <- rfm_segment(
rfm_result,
segment_names,
recency_lower,
recency_upper,
frequency_lower,
frequency_upper,
monetary_lower,
monetary_upper
)rfm_plot_segment_scatter(segmented_data, "monetary", "recency")The “Champion” users are the green dots at the bottom of the graph, who have made purchase recently with high price tag. Those could be targeted using advanced certification paths or enterprise-wide renewals.
“Cannot lose” and “At risk” represented by the gold and orage dots are middle-right (monetary value between $20,000 - $40,000 and recency between 50 - 200 days) and are immediate marketingtarget to bring them back with curriculum updates, custom training or corporate group discount.
“The High-density retail engine” represented by blue, pink and teal dots crowded in the bottom-left corder are in the pipeline for future champions. Cross selling “level 2” or advanced courses when they about finish the courses purchased
rfm_plot_segment_scatter(segmented_data, "frequency", "recency")Focus should be given to those fall into Frequency 15 - 25 and Recency teween 100 and 200, those who were heavy users in the first half of the year but stopped buying in the last 3 - 6 months. A direct account management intervention is highly desired.
# 1. Summarize the median monetary value per segment (just like the package does)
plot_data <- segmented_data %>%
group_by(segment) %>%
summarize(median_monetary = median(amount, na.rm = TRUE)) %>%
filter(!is.na(segment)) # Remove any unsegmented data
# 2. Plot it with unique colors per bar
ggplot(plot_data, aes(x = segment, y = median_monetary, fill = segment)) +
geom_col(width = 0.8) + # Creates the bars
geom_text(aes(label = round(median_monetary)), vjust = -0.5, size = 3.5) + # Adds the numbers on top
scale_fill_brewer(palette = "Set3") + # Applies a distinct color palette to every bar
labs(
title = "Median Monetary Value by Segment",
x = "Segment",
y = "Median Monetary Value"
) +
theme_minimal() + # Keeps a clean background
theme(
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), # Rotates x-axis text
plot.title = element_text(hjust = 0.5, face = "bold", size = 14), # Centers title
legend.position = "none" # Hides redundant legend
)The segment of “At risk” and “Can’t lose them” should be prioritized by reaching out with personalized curriculum renewals, platform updates or corporate packages, as their monetary value is too high to lose.
Efforts should be make to push “Potential Loyalist” and “Promising” segment to “Champion” segment by tagetted upselling campigns, advanced certification bundles
The monetary value of “Lost” segment is high, losing one of them means a $10,000 lose. Internal assessment should be conducted on Customer Acquisition Cost (CAC), and marketing campaign should be drafted to reduce lost rate
# Calculate user counts
user_counts <- segmented_data %>%
group_by(segment) %>%
summarize(total_users = n_distinct(customer_id)) %>%
filter(!is.na(segment)) %>%
arrange(desc(total_users))
# Lock in the factor levels
user_counts$segment <- factor(user_counts$segment, levels = user_counts$segment)
# Generate the chart
ggplot(user_counts, aes(x = segment, y = total_users, fill = segment)) +
geom_col(width = 0.8) +
geom_text(aes(label = scales::comma(total_users)),
vjust = -0.5,
size = 3.5,
fontface = "bold") +
scale_fill_viridis_d(option = "plasma") +
labs(
title = "Distribution of Unique Users Across RFM Segment",
x = "Customer Segment",
y = "Number of Unique Users"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, face = "bold"),
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
legend.position = "none"
)| Segment | User Count | Median Value | Total Risk/Opportunity | Operational Action | Channel |
| At Risk | 40 | $29,711 | ~$1.18M | Immediate Revenue Rescue | High-Touch (Sales/AM Call) |
| Can’t Lose Them | 41 | $24,631 | ~$1.01M | Long-Term Win-Back | High-Touch (Executive Outreach) |
| Potential Loyalist | 179 | $17,331 | ~$3.10M | Milestone Cross-Selling | Programmatic/Email Tracks |
| Champions | 124 | $27,910 | ~$3.46M | VIP Advocacy & Renewals | Exclusive Perks / Early Access |
The business looks good in the way that the number in “Champions” and “Potential Loyalist” are higher than any other segment.
The combined number of users in “Can’t lose them” and “At Risk” are only 81, manual contact could be arranged to offer tailor made programs.
The number of users in “Lost” segment (91) and those in “About to Sleep” (80) could be targeted by automated email to try to wake them up