In this project, I analyze Learnova’s prospect data to identify the key factors that influence whether a user enrolls in an online course. Using exploratory data analysis and visualization, I examine demographic, behavioral, and marketing-related variables such as age, occupation status, engagement time, and ad exposure. I then develop and evaluate classification models—including logistic regression, random forest, and boosted tree models—using metrics like precision, recall, F1-score, and ROC-AUC. The goal is to provide actionable insights into the drivers of enrollment and recommend strategies that help Learnova better target and convert prospective learners.
The digital learning sector is growing rapidly and is projected to reach $370B by 2026, at a CAGR of roughly 8.5%. Learnova is a startup targeting students and professionals with advanced tech courses. The challenge is that not all leads convert. Outreach (calls, emails, ads) costs time and money, so prioritizing high-probability leads is essential.
How do we identify which prospects are most likely to enroll, so we can prioritize outreach and improve efficiency?
The dataset covers demographics such as age and occupation; engagement behaviors such as profile completion, site visits, and time spent; and acquisition channels such as ads, forums, and referrals. The target variable is whether a prospect actually enrolled (enrollment_status). Before any further analysis, our initial view is that four groups of variables could influence the decision to enroll:
- Demographics: Age, Occupation
- Behavior: Profile Status, Visits, Engagement Time, Pages per Session
- Channel: Initial Contact, Recent Engagement
- Advertising Source: Ads, Forums, Referrals
Loading the necessary libraries and reading in the CSV file
library(readr)
library(tidyverse)
library(lubridate)
library(scales)
library(broom)
df <- read_csv("C:/Users/jaych/Downloads/Learnova_Leads (1).csv")

num_cols <- df %>%
  select(where(is.numeric)) %>%
  names()

num_summary <- df %>%
  select(all_of(num_cols)) %>%
  summary()
num_summary
##     user_age     site_visits     engagement_time  avg_pages_per_session
##  Min.   :18.0   Min.   : 0.000   Min.   :   0.0   Min.   : 0.000
##  1st Qu.:36.0   1st Qu.: 2.000   1st Qu.: 148.8   1st Qu.: 2.078
##  Median :51.0   Median : 3.000   Median : 376.0   Median : 2.792
##  Mean   :46.2   Mean   : 3.567   Mean   : 724.0   Mean   : 3.026
##  3rd Qu.:57.0   3rd Qu.: 5.000   3rd Qu.:1336.8   3rd Qu.: 3.756
##  Max.   :63.0   Max.   :30.000   Max.   :2537.0   Max.   :18.434
##  enrollment_status
##  Min.   :0.0000
##  1st Qu.:0.0000
##  Median :0.0000
##  Mean   :0.2986
##  3rd Qu.:1.0000
##  Max.   :1.0000
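Because enrollment_status is coded 0/1, its mean in the summary above (0.2986) is the overall enrollment rate, roughly 30% of prospects. The gap between the mean and median of engagement_time (724.0 vs 376.0) likewise points to a right-skewed distribution. A small sketch of both readings on made-up values (the vectors `enrolled` and `time_spent` are hypothetical, not the Learnova data):

```r
# Hypothetical 0/1 outcome for ten prospects (illustrative only)
enrolled <- c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0)

# For a 0/1 variable, the mean equals the share of 1s
enroll_rate <- mean(enrolled)
class_share <- prop.table(table(enrolled))

# Hypothetical right-skewed engagement times: a few large values
# pull the mean above the median
time_spent <- c(20, 50, 80, 120, 150, 300, 900, 1800, 2400)
c(mean = mean(time_spent), median = median(time_spent))
```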
Enrollment Rate by Key Categorical Variables
# Enrollment rate for each level of a categorical column (column passed unquoted)
rate_by <- function(data, col) {
  data %>%
    group_by({{ col }}) %>%
    summarise(
      n = n(),
      enroll_rate = mean(enrollment_status == 1)
    ) %>%
    arrange(desc(enroll_rate))
}

by_occupation <- rate_by(df, occupation_status)
by_profile <- rate_by(df, profile_status)
by_initial <- rate_by(df, initial_contact)
by_recent <- rate_by(df, recent_engagement)
by_occupation; by_profile; by_initial; by_recent

# Bar chart of enrollment rate per category, labelled with percentages
plot_rate <- function(tbl, x, xlab) {
  ggplot(tbl, aes(x = {{ x }}, y = enroll_rate)) +
    geom_col() +
    geom_text(aes(label = percent(enroll_rate, accuracy = 0.1)), vjust = -0.4) +
    scale_y_continuous(labels = percent) +
    labs(title = paste("Enrollment Rate by", xlab), x = xlab, y = "Enrollment Rate") +
    theme(axis.text.x = element_text(angle = 20, hjust = 1))
}

plot_rate(by_occupation, occupation_status, "Occupation Status")

Enrollment Rate by Advertisement Source
source_cols <- c("newspaper_ad","magazine_ad","online_ad","edu_forums","word_of_mouth")

# For each source column, enrollment rate grouped by that column's value
source_rates <- map_dfr(source_cols, function(col) {
  df %>%
    mutate(yes_no = .data[[col]]) %>%
    group_by(yes_no) %>%
    summarise(
      n = n(),
      enroll_rate = mean(enrollment_status == 1)
    ) %>%
    mutate(ad_source = col)
}) %>%
  relocate(ad_source)

source_rates

Enrollment Rate by Numeric Variables (Engagement)
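The numeric variables in this section are grouped into Low, Medium, and High bands by cutting at the first and third quartiles, so the middle band holds roughly the central half of the values. A minimal illustration of that cut on a made-up vector (the values in `x` are hypothetical, chosen so the result can be checked by hand):

```r
# Hypothetical values, chosen so the quartiles are easy to verify
x <- 1:9

# First and third quartiles (R's default quantile type)
qs <- quantile(x, probs = c(0.25, 0.75))

# Intervals are right-closed: (-Inf, Q1], (Q1, Q3], (Q3, Inf)
bins <- cut(x,
            breaks = c(-Inf, qs[1], qs[2], Inf),
            labels = c("Low", "Medium", "High"),
            ordered_result = TRUE)

table(bins)
```

One caveat, stated as general R behavior rather than something observed in this data: if the two quartiles are equal, which can happen with heavily tied counts, `cut()` stops with a "'breaks' are not unique" error, so such a column would need different break points.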
# Bin a numeric vector into Low / Medium / High at its 1st and 3rd quartiles
iqr_bin <- function(x) {
  qs <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  cut(x,
      breaks = c(-Inf, qs[1], qs[2], Inf),
      labels = c("Low", "Medium", "High"),
      include.lowest = TRUE, right = TRUE, ordered_result = TRUE)
}

df <- df %>%
  mutate(
    site_visits_bin = iqr_bin(site_visits),
    engagement_time_bin = iqr_bin(engagement_time),
    avg_pages_per_session_bin = iqr_bin(avg_pages_per_session),
    user_age_bin = iqr_bin(user_age)
  )

# Same idea as rate_by(), but takes the column name as a string
rate_by_name <- function(data, var) {
  data %>%
    group_by(.data[[var]]) %>%
    summarise(
      n = n(),
      enroll_rate = mean(enrollment_status == 1),
      .groups = "drop"
    ) %>%
    arrange(desc(enroll_rate))
}

by_user_age <- rate_by_name(df, "user_age_bin")
by_site_visits <- rate_by_name(df, "site_visits_bin")
by_engage_time <- rate_by_name(df, "engagement_time_bin")
by_avg_pages <- rate_by_name(df, "avg_pages_per_session_bin")
by_user_age; by_site_visits; by_engage_time; by_avg_pages

df_num <- df %>% select(all_of(num_cols))
# Quick correlation heatmap (removes columns with 0 variance)
df_num_nzv <- df_num %>% select(where(~ sd(.x, na.rm = TRUE) > 0))
GGally::ggcorr(df_num_nzv, label = TRUE, label_alpha = TRUE, hjust = 0.8, layout.exp = 2) +
  ggtitle("Correlation Matrix (Numeric Features)")

Profile completion, engagement depth, and source quality remain the driving factors of learner enrollment.
1. Prioritize Hot Leads
   - High profile completion
   - High site engagement
   - Referral-based leads
2. Deploy Predictive Scoring
   - “Probability of Enrolling” score
   - Ranked leads and efficient outreach
3. Outreach Sequencing
   - Website → Email → Phone
   - Occupation-based marketing, tailored campaigns
4. Referral Programs
   - Discounts for friends
5. Pilot Test
   - 2-Week Plan
   - Use the predictive score to guide outreach, compare enrollment lift, and iterate on key features
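Recommendation 2, together with the modeling goal stated in the introduction, can be sketched with a logistic regression in base R. This is a minimal sketch on synthetic data, not the fitted model behind this report: the data frame `leads`, its columns `profile_high` and `enrolled`, the simulated coefficients, the holdout split, and the 0.5 cutoff are all assumptions made for illustration.

```r
set.seed(42)

# Synthetic leads (hypothetical data-generating process, not the Learnova data)
n <- 600
leads <- data.frame(
  engagement_time = runif(n, 0, 2500),
  profile_high    = rbinom(n, 1, 0.5)
)
log_odds <- -2 + 0.001 * leads$engagement_time + 1.0 * leads$profile_high
leads$enrolled <- rbinom(n, 1, plogis(log_odds))

# Simple holdout split (rows are already in random order)
train <- leads[1:400, ]
test  <- leads[401:n, ]

# Logistic regression: models the probability of enrolling
fit <- glm(enrolled ~ engagement_time + profile_high,
           data = train, family = binomial)

# "Probability of enrolling" score for unseen leads
test$score <- predict(fit, newdata = test, type = "response")

# Rank leads so outreach can start with the highest scores
ranked <- test[order(test$score, decreasing = TRUE), ]

# Precision, recall, and F1 at a 0.5 cutoff (an arbitrary choice here)
pred <- as.integer(test$score >= 0.5)
tp <- sum(pred == 1 & test$enrolled == 1)
fp <- sum(pred == 1 & test$enrolled == 0)
fn <- sum(pred == 0 & test$enrolled == 1)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# ROC-AUC via the rank-sum (Mann-Whitney) identity
r   <- rank(test$score)
n1  <- sum(test$enrolled == 1)
n0  <- sum(test$enrolled == 0)
auc <- (sum(r[test$enrolled == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

round(c(precision = precision, recall = recall, f1 = f1, auc = auc), 3)
head(ranked, 5)
```

In a pilot, the cutoff would more plausibly be set by outreach capacity (for example, contacting the top-ranked share of leads) than fixed at 0.5, and enrollment lift would be compared against leads contacted without the score.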