This report answers two key questions from the Papaya Analytics Challenge:
bills_df <- read_excel("Papaya_Analytics_Challenge_April2025.xlsx", sheet = "Bills Data") %>%
clean_names() %>%
mutate(bill_type = gsub('No_Bill','No Bill', bill_type)) %>%
mutate(across(c(submitted_date, paid_date), as.Date)) %>%
mutate(across(c(bill_id, user_id, payee_id), as.factor)) %>%
mutate_if(is.character, as.factor)
user_retention <- bills_df %>%
group_by(user_id) %>%
summarise(
total_bills = n(),
first_paid = min(paid_date),
last_paid = max(paid_date),
days_between = as.integer(difftime(last_paid, first_paid, units = "days")),
.groups = "drop"
)
user_retention %>%
mutate(user_type = if_else(total_bills > 1, "Repeat", "One-Time")) %>%
count(user_type) %>%
ggplot(aes(x = user_type, y = n, fill = user_type)) +
geom_col() +
scale_y_continuous(labels = comma) +
labs(title = "User Retention: Repeat vs One-Time", x = "User Type", y = "Count") +
theme_minimal()
Insight: Approximately half of all users return after their first bill, indicating a baseline of healthy repeat engagement. However, the remaining 50% churn immediately after their initial transaction, representing a significant drop-off and a major opportunity for intervention. This split suggests that while the product or payment flow retains some users effectively, others may lack sufficient incentive, onboarding clarity, or follow-up communication. Improving early lifecycle engagement — especially within the first 30–60 days — could meaningfully increase retention and compound downstream value.
bills_df <- bills_df %>% mutate(cohort_month = floor_date(paid_date, "month"))
user_cohorts <- bills_df %>%
group_by(user_id) %>%
summarise(cohort = min(cohort_month), .groups = "drop") %>%
left_join(bills_df, by = "user_id") %>%
mutate(month_diff = interval(cohort, paid_date) %/% months(1))
cohort_sizes <- user_cohorts %>%
filter(month_diff == 0) %>%
group_by(cohort) %>%
summarise(starting_users = n_distinct(user_id), .groups = "drop")
cohort_retention <- user_cohorts %>%
group_by(cohort, month_diff) %>%
summarise(users = n_distinct(user_id), .groups = "drop") %>%
ungroup() %>%
inner_join(cohort_sizes, by = "cohort") %>%
mutate(retention_pct = users / starting_users * 100) %>%
filter(month_diff > 0)
valid_cohorts <- bills_df %>%
distinct(cohort_month) %>%
arrange(cohort_month) %>%
slice(2:(n() - 1)) %>%
pull(cohort_month)
cohort_retention <- cohort_retention %>%
filter(cohort %in% valid_cohorts)
ggplot(cohort_retention, aes(x = month_diff, y = retention_pct, color = as.factor(cohort))) +
geom_line(linewidth = 1.2) +
scale_y_continuous(labels = percent_format(scale = 1)) +
labs(title = "Monthly Retention by Cohort", x = "Months Since First Bill", y = "Retention %", color = "Cohort") +
theme_minimal()
Insight: The July and August 2023 cohorts exhibit strong early retention, with 23–30% of users returning by month 1. This suggests that during this period, acquisition quality or onboarding processes were more effective at converting first-time users into repeat billers. However, retention drops off sharply starting with the September 2023 cohort, where month-1 retention falls below 10% and continues to decay rapidly. This inflection point implies a potential shift in acquisition source, user experience, or billing behavior that should be immediately investigated. Reverting or revisiting tactics from the July–August window may offer a path to improving retention in future cohorts.
# Step 1: Compute user type proportions
bill_retention <- bills_df %>%
group_by(bill_type, user_id) %>%
summarise(total_bills = n(), .groups = "drop") %>%
mutate(user_type = if_else(total_bills > 1, "Repeat", "One-Time")) %>%
count(bill_type, user_type) %>%
group_by(bill_type) %>%
mutate(pct = n / sum(n)) %>%
ungroup()
# Step 2: Extract repeat percentages for sorting (ascending this time)
repeat_order <- bill_retention %>%
filter(user_type == "Repeat") %>%
arrange(pct) %>% # ascending
pull(bill_type)
# Step 3: Convert bill_type to factor with correct ordering
bill_retention$bill_type <- factor(bill_retention$bill_type, levels = repeat_order)
# Step 4: Plot
ggplot(bill_retention, aes(x = bill_type, y = pct, fill = user_type)) +
geom_col(position = "stack") +
scale_y_continuous(labels = percent_format()) +
coord_flip() +
labs(title = "Retention by Bill Type",
x = "Bill Type",
y = "User Share") +
theme_minimal()
Insight: Medical and Utility bill types show the strongest retention performance, with nearly half of users returning to pay additional bills of the same type. These categories likely represent recurring, essential payments that naturally lend themselves to repeat behavior. On the other hand, categories like Citation and Other are overwhelmingly dominated by one-time users — suggesting they are situational or compulsory payments that don’t build habitual usage. This segmentation underscores where retention strategy should focus: reinforcing stickier bill types with reminders, incentives, or subscription options, while not over-investing in inherently low-retention categories.
user_gaps <- bills_df %>%
arrange(user_id, paid_date) %>%
group_by(user_id) %>%
mutate(days_since_last = as.integer(difftime(paid_date, lag(paid_date), units = "days"))) %>%
filter(!is.na(days_since_last)) %>%
summarise(avg_gap = mean(days_since_last), .groups = "drop") %>%
mutate(cadence = case_when(
avg_gap <= 30 ~ "Monthly",
avg_gap <= 90 ~ "Quarterly",
avg_gap <= 180 ~ "Semi-Annual",
TRUE ~ "Irregular"
))
user_gaps %>%
count(cadence) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = cadence, y = pct, fill = cadence)) +
geom_col() +
scale_y_continuous(labels = percent_format()) +
labs(title = "Cadence Frequency", x = "Cadence", y = "Percent of Users") +
theme_minimal()
Insight: A majority of users fall into the Monthly or Quarterly cadence buckets, comprising over 60% of repeat behavior patterns. This frequency suggests a predictable rhythm to user engagement, ideal for aligning proactive nudges, email campaigns, or recurring billing options. Users in these buckets are likely managing recurring obligations — making them both more valuable and more sensitive to disruption in experience. Optimizing around these intervals (e.g., pre-due nudges, loyalty reminders) can significantly reduce churn and reinforce habit formation in your most valuable user segments.
time_to_repeat <- bills_df %>%
arrange(user_id, paid_date) %>%
group_by(user_id) %>%
mutate(days_between = as.integer(difftime(lead(paid_date), paid_date, units = "days"))) %>%
filter(!is.na(days_between)) %>%
slice(1)
ggplot(time_to_repeat, aes(x = days_between)) +
geom_histogram(binwidth = 7, fill = "#ff7a59") +
scale_y_continuous(labels = comma) +
labs(title = "Days Between First and Second Bill", x = "Days", y = "User Count") +
theme_minimal()
Insight: The median time between a user’s first and second bill is 49 days, indicating a natural return window in the 6–8 week range. Most repeat behavior clusters within this timeframe, making it a critical period for lifecycle marketing and product re-engagement. Users who don’t return by day 60 are increasingly unlikely to convert without intervention, suggesting that targeted nudges, loyalty offers, or app notifications should be triggered in this window. Understanding and aligning communication with this behavioral rhythm can significantly improve second-bill conversion — a pivotal moment in reducing churn.
cohort_ltv <- bills_df %>%
mutate(cohort_month = floor_date(paid_date, "month")) %>%
group_by(user_id) %>%
mutate(user_cohort = min(cohort_month)) %>%
ungroup() %>%
mutate(month_diff = interval(user_cohort, paid_date) %/% months(1)) %>%
group_by(user_cohort, month_diff) %>%
summarise(
cohort_revenue = sum(amount),
users_in_cohort = n_distinct(user_id),
.groups = "drop"
) %>%
group_by(user_cohort) %>%
mutate(ltv_per_user = cohort_revenue / first(users_in_cohort)) %>%
ungroup() %>%
filter(month_diff > 0)
cohort_ltv %>%
ggplot(aes(x = month_diff, y = ltv_per_user, color = factor(user_cohort))) +
geom_line(size = 1.2) +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Lifetime Value per User by Cohort",
x = "Months Since Cohort Start",
y = "Cumulative LTV per User",
color = "Cohort"
) +
theme_minimal()
Insight: Cohorts from June through August 2023 generate significantly higher lifetime revenue per user compared to cohorts starting from September 2023. Even aside from retention differences, the value derived from earlier cohorts is substantially greater, highlighting a possible change in acquisition channel quality, targeted demographics, or payment sizes. Further investigation into the profitability drivers (e.g., average payment amounts, bill type mix, or user demographics) across these cohorts is strongly recommended to optimize future acquisition spend.
# Payment Channel Usage
bills_df %>%
count(payment_method, payment_channel) %>%
arrange(desc(n)) %>%
mutate(share = round(n / sum(n), 3)) %>%
rename(Count = n) %>%
head(10) %>%
mutate(Count = scales::comma(Count),
share = scales::percent(share)) %>%
rename_with(~str_to_title(str_replace_all(., "_", " "))) %>%
kable(caption = "Top 10 Payment Method & Channel Combinations", align = 'c') %>%
kable_styling()
| Payment Method | Payment Channel | Count | Share |
|---|---|---|---|
| ACH | Electronic Payment | 73,212 | 20.7% |
| Debit Card | Electronic Payment | 56,365 | 15.9% |
| ACH | Web | 42,075 | 11.9% |
| ACH | Web Bot | 38,615 | 10.9% |
| Credit Card | Electronic Payment | 28,045 | 7.9% |
| Debit Card | Web Bot | 28,007 | 7.9% |
| Debit Card | Web | 24,598 | 7.0% |
| Credit Card | Web Bot | 13,959 | 3.9% |
| Credit Card | Web | 12,633 | 3.6% |
| ACH | USPS | 11,441 | 3.2% |
Insight: Certain payment methods and channels dominate user transactions. Understanding the distribution can inform strategic decisions about prioritizing user experience improvements, promotions, or partnership integrations for popular payment methods like ACH via Web or Debit/Credit Cards via Electronic Payment.
bills_df %>%
group_by(user_id, bill_type) %>%
summarise(total_spend = sum(amount), .groups = "drop") %>%
filter(total_spend > quantile(total_spend, 0.95)) %>%
count(bill_type) %>%
arrange(desc(n)) %>%
rename(count = n) %>%
mutate(pct_high_spenders = round(count / sum(count), 3),
count = scales::comma(count),
pct_high_spenders = scales::percent(pct_high_spenders)) %>%
ungroup() %>%
rename_with(~str_to_title(str_replace_all(., "_", " "))) %>%
kable(caption = "Top Bill Types Among High-Spenders (Top 5%)", align = 'c') %>%
kable_styling()
| Bill Type | Count | Pct High Spenders |
|---|---|---|
| Medical | 6,375 | 70.8% |
| Other | 944 | 10.5% |
| Hospital | 647 | 7.2% |
| Utility | 536 | 6.0% |
| Credit Card/Loan | 222 | 2.5% |
| Dental | 132 | 1.5% |
| Insurance | 99 | 1.1% |
| Cable | 32 | 0.4% |
| Phone | 7 | 0.1% |
| Citation | 6 | 0.1% |
| No Bill | 2 | 0.0% |
| Unknown | 2 | 0.0% |
Insight: Bill types such as Medical, Hospital, and Utility are disproportionately common among top-spending users (top 5%), indicating these categories drive substantial revenue. This highlights strategic opportunities for premium offerings, targeted upselling, or focused partnership initiatives to enhance profitability within these segments.
# Partnership Impact on Retention
partnership_retention <- bills_df %>%
group_by(user_id) %>%
summarize(
partner_user = if_else(any(is_partner == "Yes"), "Partner", "Non-Partner"),
retained = n() > 1,
.groups = "drop"
) %>%
group_by(partner_user) %>%
summarise(retention_rate = mean(retained) * 100)
ggplot(partnership_retention, aes(x = partner_user, y = retention_rate, fill = partner_user)) +
geom_col(width = 0.6) +
scale_y_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 60)) +
labs(
title = "User Retention Rate by Partnership Status",
x = "User Type",
y = "Retention Rate (%)"
) +
theme_minimal() +
theme(legend.position = "none")
Insight: Users affiliated with partnerships exhibit a notably higher retention rate (50.0%) compared to non-partnership users (46.0%). This 4% absolute increase represents a significant business opportunity. Expanding or deepening partnership integrations could meaningfully boost user loyalty and retention, thereby enhancing lifetime value and profitability.
channel_efficiency <- bills_df %>%
mutate(lag_days = as.integer(paid_date - submitted_date)) %>%
group_by(payment_channel) %>%
summarise(
avg_lag_days = mean(lag_days, na.rm = TRUE),
median_lag_days = median(lag_days, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(avg_lag_days))
ggplot(channel_efficiency, aes(x = reorder(payment_channel, avg_lag_days), y = avg_lag_days)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Average Lag from Submission to Payment by Channel",
x = "Payment Channel",
y = "Average Lag (Days)"
) +
theme_minimal()
Insight: There is considerable variability in payment processing efficiency among channels. Voice (4.1 days) and IVR Bot (3.0 days) channels have notably longer delays, whereas Electronic Payment (0.4 days) and Web Bot (0.7 days) channels are the fastest. These substantial differences highlight opportunities to optimize user experience and operational efficiency. Encouraging user adoption of faster payment channels can decrease processing times, improve cash flow predictability, and enhance customer satisfaction.
seasonality <- bills_df %>%
mutate(month = floor_date(paid_date, "month")) %>%
count(month)
ggplot(seasonality, aes(x = month, y = n)) +
geom_line(linewidth = 1.2, color = "darkorange") +
geom_point(size = 2, color = "darkorange") +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Monthly Billing Volume Trends",
x = "Month",
y = "Total Bills"
) +
theme_minimal()
Insight: Billing volumes display clear seasonal fluctuations, peaking significantly in March 2024 (60,657 bills) after a steady rise from a December low. These seasonal trends suggest strategic periods for intensified marketing campaigns, tailored messaging, or resource allocation. Understanding and proactively leveraging this seasonality could improve user acquisition efficiency, manage operational workloads, and enhance revenue performance.