RFM Analysis of Online Education Platform

Author

Tony Wang

EXECUTIVE SUMMARY

Project Overview

This project delivers a comprehensive Customer Value Segmentation analysis for our online education platform, utilizing transaction data spanning 9,935 orders across a 12-month period (January 1, 2019 – December 31, 2019). By implementing a Recency, Frequency, and Monetary (RFM) framework, the analysis moves past aggregated metrics to map true student and institutional behavior. The objective is to transition from uniform marketing to high-precision, automated B2C workflows and high-touch B2B account management.

Key Data Insights & Structural Diagnostics

  • High-Volume Core vs. High-Value Institutional Skew: The platform’s overall distribution is heavily right-skewed. While the median individual order value sits at $636.72, a significant volume of transaction outliers stretches up to $9,854.88. This creates a dual-layer ecosystem: a highly dense, transaction-heavy consumer (B2C) base and a high-ticket enterprise (B2B) client layer.

  • Frequency-Driven Lifetime Value: Diagnostic heatmaps and facet matrices reveal that customer lifetime value is overwhelmingly dictated by purchase frequency. Due to relatively uniform individual transaction limits, customers rarely achieve a high Monetary tier through a single purchase; instead, they “earn” their way into high-value cohorts through sequential curriculum progression or multi-seat corporate expansions.

  • Excellent Platform Momentum: Recency distribution is heavily weighted toward the end of the fiscal year, with nearly 80% of all transaction activity concentrated within a fresh 0-to-60-day window. This reflects robust platform adoption and an exceptionally engaged user base closing out Q4.

Strategic Segment Breakdown

The model successfully grouped the active database into distinct behavioral segments, revealing critical revenue-protection priorities:

Segment Median Monetary Value Strategic Classification Core Action Plan
At Risk $29,711 High-Value Enterprise Leak Immediate Sales Intervention: Proactive outreach by account managers with tailored training bundles or contract renewals before permanent churn occurs.
Champions $27,910 Hyper-Active Core Advocates Retention & VIP Access: Enroll in early-access program reviews, beta testing for new curriculums, and enterprise-wide licensing renewals.
Can’t Lose Them $24,631 Lapsed Institutional Clients Win-Back Campaigns: Dedicated executive-level consulting outreach to identify platform adoption blockers or competitive friction points.
Potential Loyalist $17,331 High-Potential Mid-Tier Base Milestone Up-selling: Target with mid-curriculum promotions to push them horizontally into high-frequency habits.
New / Promising $8,013 – $13,811 Emerging Pipeline Automated Cultivation: Trigger behavioral email tracks recommending the next logical stage of their learning certification path.

Load Libraries

Code
library(readxl)
library(tidyverse)
library(dplyr)
library(rfm)

Load Data

Code
df<- read_excel("data - Online Edu.xlsx", 
    sheet = "Orders")

df <- df %>%
  mutate(Date = as.Date(Date),
         User_id = as.factor(User_id)) %>%
  filter(!is.na(User_id), !is.na(Date), !is.na(Order_value))

Summary of the data

Code
# Summary of the data
summary(df)
   Order_id              Date               User_id      Order_value     
 Length:9935        Min.   :2019-01-01   14470  :  38   Min.   :  13.44  
 Class :character   1st Qu.:2019-05-15   11545  :  33   1st Qu.: 249.54  
 Mode  :character   Median :2019-08-09   18295  :  33   Median : 636.72  
                    Mean   :2019-07-27   18910  :  33   Mean   :1340.96  
                    3rd Qu.:2019-10-22   21355  :  33   3rd Qu.:1641.22  
                    Max.   :2019-12-31   14395  :  32   Max.   :9854.88  
                                         (Other):9733                    

The summary shows that there are total 9,935 orders with minimum value of $13.44 (an introduction course maybe) maximum value of $9,854.88 (might be a corporate purchase ) and median value of $636.72.

Order Value Distribution

Code
# Order value distribution
ggplot(df, aes(y = Order_value)) +
  geom_histogram() +
  labs(title = "Order Value Plot",
       subtitle = "Order value distribution") +
  theme_minimal()

The chart shows that it is a classic long tail business, with value for majority of the orders fall into the range between $13.44 and $500 (lower tier). Mid-tier range from $500 to $2,500 remains relatively thick, shows that a substantial number of students are willing to invest that amount. Top-tier range from $2,500 to nearly $10,000. The business demonstrates a classic 80/20 rule. The top-tier users should be closed protected as they contribute the lion’s share of the company revenue. Actions need to take immediately if they fall into “At risk” or “About to sleep” category

Order Value in Boxplot

Code
ggplot(df, aes(y = Order_value)) +
  geom_boxplot() +
    labs(title = "Order Value Plot",
       subtitle = "Order value distribution") +
  theme_minimal()

The chart reinforces the finding that the overwhelming majority of transactions are lower-to-mid tier retail purchases. However, the upper whisker stops around $3,700 shows that any users paying over that amount are likely corporate users and should be analysed separately.

Build RFM Table

Code
max_date <- max(df$Date, na.rm = TRUE)

rfm_result <- rfm_table_order(
  data = df,
  customer_id   = User_id,
  revenue       = Order_value,
  order_date    = Date,
  analysis_date = max_date
)

head(rfm_result$rfm)
# A tibble: 6 × 9
  customer_id recency_days transaction_count amount recency_score
  <fct>              <dbl>             <int>  <dbl>         <int>
1 10015                 36                16  9607.             3
2 10015                 36                16  9607.             3
3 10015                 36                16  9607.             3
4 10015                 36                16  9607.             3
5 10015                 36                16  9607.             3
6 10015                 36                16  9607.             3
# ℹ 4 more variables: frequency_score <int>, monetary_score <int>,
#   rfm_score <dbl>, Order_id <chr>

Heatmap of rfm result

Code
# Plot heatmap of the rfm results
rfm_plot_heatmap(rfm_result)

The chart shows that the monetary value is almost entirely driven by Frequency rather than recency, where customer lifetime value (CLV) is a direct function of repeat purchases. The bottom right square is the one to action upon as they were high frequency and high-spending customers but have gone cold completely. Marketing efforts should also be put to convert Frequency Score 2 customers to Frequency Score 3 and higher.

Bar Chart of rfm result

Code
# Facet plot
rfm_plot_bar_chart(rfm_result)

The chart shows that each individual transaction is relatively uniform. high monetary bracket are largely driven by repeat purchase, which indicate that the marketing campaign should be highlight upskilling to encourage repeat purchase behavior

Recency Table

Code
# View the distributions via Histograms
rfm_plot_histogram(rfm_result)

The chart shows that the users are highly active and fresh as majority of users have made purchase in the last 60 days (7,800 out of 9,935), which indicates the Q4 marketing campaign is highly successful.

Build Segment Table

Code
# Define standard RFM segment boundaries
segment_names <- c(
  "Champions", "Potential Loyalist", "Loyal Customers", "Promising",
  "New Customers", "Can't Lose Them", "At Risk", "Need Attention", "About To Sleep", "Lost"
)

recency_lower   <- c(4, 3, 2, 3, 4, 1, 1, 1, 2, 1)
recency_upper   <- c(5, 5, 4, 4, 5, 2, 2, 3, 3, 1)
frequency_lower <- c(4, 3, 2, 1, 1, 3, 2, 3, 1, 1)
frequency_upper <- c(5, 5, 4, 3, 3, 4, 5, 5, 3, 5)
monetary_lower  <- c(4, 2, 2, 3, 1, 4, 4, 3, 1, 1)
monetary_upper  <- c(5, 5, 4, 5, 5, 5, 5, 5, 4, 5)

# Map your customers to the segments
segmented_data <- rfm_segment(
  rfm_result, 
  segment_names, 
  recency_lower, 
  recency_upper,
  frequency_lower, 
  frequency_upper, 
  monetary_lower, 
  monetary_upper
)

Scatter Plot of Monetary and Recency

Code
rfm_plot_segment_scatter(segmented_data, "monetary", "recency")

The “Champion” users are the green dots at the bottom of the graph, who have made purchase recently with high price tag. Those could be targeted using advanced certification paths or enterprise-wide renewals.

“Cannot lose” and “At risk” represented by the gold and orage dots are middle-right (monetary value between $20,000 - $40,000 and recency between 50 - 200 days) and are immediate marketingtarget to bring them back with curriculum updates, custom training or corporate group discount.

“The High-density retail engine” represented by blue, pink and teal dots crowded in the bottom-left corder are in the pipeline for future champions. Cross selling “level 2” or advanced courses when they about finish the courses purchased

Scatter Plot of Frequency and Recency

Code
rfm_plot_segment_scatter(segmented_data, "frequency", "recency")

Focus should be given to those fall into Frequency 15 - 25 and Recency teween 100 and 200, those who were heavy users in the first half of the year but stopped buying in the last 3 - 6 months. A direct account management intervention is highly desired.

Median Monetary Value by Segment

Code
# 1. Summarize the median monetary value per segment (just like the package does)
plot_data <- segmented_data %>%
  group_by(segment) %>%
  summarize(median_monetary = median(amount, na.rm = TRUE)) %>%
  filter(!is.na(segment)) # Remove any unsegmented data

# 2. Plot it with unique colors per bar
ggplot(plot_data, aes(x = segment, y = median_monetary, fill = segment)) +
  geom_col(width = 0.8) +  # Creates the bars
  geom_text(aes(label = round(median_monetary)), vjust = -0.5, size = 3.5) + # Adds the numbers on top
  scale_fill_brewer(palette = "Set3") + # Applies a distinct color palette to every bar
  labs(
    title = "Median Monetary Value by Segment",
    x = "Segment",
    y = "Median Monetary Value"
  ) +
  theme_minimal() + # Keeps a clean background
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), # Rotates x-axis text
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14), # Centers title
    legend.position = "none" # Hides redundant legend
  )

The segment of “At risk” and “Can’t lose them” should be prioritized by reaching out with personalized curriculum renewals, platform updates or corporate packages, as their monetary value is too high to lose.

Efforts should be make to push “Potential Loyalist” and “Promising” segment to “Champion” segment by tagetted upselling campigns, advanced certification bundles

The monetary value of “Lost” segment is high, losing one of them means a $10,000 lose. Internal assessment should be conducted on Customer Acquisition Cost (CAC), and marketing campaign should be drafted to reduce lost rate

Segment User Counts

Code
# Calculate user counts
user_counts <- segmented_data %>%
  group_by(segment) %>%
  summarize(total_users = n_distinct(customer_id)) %>%
  filter(!is.na(segment)) %>%
  arrange(desc(total_users))

# Lock in the factor levels
user_counts$segment <- factor(user_counts$segment, levels = user_counts$segment)

# Generate the chart
ggplot(user_counts, aes(x = segment, y = total_users, fill = segment)) +
  geom_col(width = 0.8) +
  geom_text(aes(label = scales::comma(total_users)), 
            vjust = -0.5, 
            size = 3.5, 
            fontface = "bold") +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Distribution of Unique Users Across RFM Segment",
    x = "Customer Segment",
    y = "Number of Unique Users"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    legend.position = "none"
  )

Resource Allocation Matrix

Segment User Count Median Value Total Risk/Opportunity Operational Action Channel
At Risk 40 $29,711 ~$1.18M Immediate Revenue Rescue High-Touch (Sales/AM Call)
Can’t Lose Them 41 $24,631 ~$1.01M Long-Term Win-Back High-Touch (Executive Outreach)
Potential Loyalist 179 $17,331 ~$3.10M Milestone Cross-Selling Programmatic/Email Tracks
Champions 124 $27,910 ~$3.46M VIP Advocacy & Renewals Exclusive Perks / Early Access

The business looks good in the way that the number in “Champions” and “Potential Loyalist” are higher than any other segment.

The combined number of users in “Can’t lose them” and “At Risk” are only 81, manual contact could be arranged to offer tailor made programs.

The number of users in “Lost” segment (91) and those in “About to Sleep” (80) could be targeted by automated email to try to wake them up