1 Overview

This report answers two key questions from the Papaya Analytics Challenge:

  1. What can you tell us about our retention using this dataset?
  2. Are there other interesting insights you can provide about our business from this dataset?

1.1 Key Findings:

  • 49.9% of users return after their first bill. (Goal 1)
  • Time to Repeat: Median is 49 days — target return window is 6–8 weeks. (Goal 1)
  • Monthly & Quarterly Cadence: Over 60% of users follow these patterns. (Goal 1)
  • Strongest repeat by bill type: Medical and Utility. (Goal 1)
  • High value overdue users exist and are targetable. (Goal 2)
  • Lifetime Value per user Significant decline since Sep 2023 indicates possible acquisition shifts. (Goal 2)
  • Dominant Payment Methods: ACH-Web & Card-Electronic Payment present key optimization opportunities. (Goal 2)
  • High-Spender Categories: Top 5% predominantly Medical, Hospital, & Utility—ideal for premium strategies. (Goal 2)
  • Partnership Advantage: Partners retain 4% more users, highlighting strategic partnership potential. (Goal 2)
  • Channel Efficiency: Payment lag varies significantly (Electronic Payment: <1 day; Voice: ~4 days). (Goal 2)
  • Seasonal Trends: Billing peaks in March, optimal for targeted marketing; more data needed to confirm seasonality. (Goal 2)

2 Repeat vs One-Time Users

bills_df <- read_excel("Papaya_Analytics_Challenge_April2025.xlsx", sheet = "Bills Data") %>%
  clean_names() %>%
  mutate(bill_type = gsub('No_Bill','No Bill', bill_type)) %>%
  mutate(across(c(submitted_date, paid_date), as.Date)) %>%
  mutate(across(c(bill_id, user_id, payee_id), as.factor)) %>%
  mutate_if(is.character, as.factor)

user_retention <- bills_df %>%
  group_by(user_id) %>%
  summarise(
    total_bills = n(),
    first_paid = min(paid_date),
    last_paid = max(paid_date),
    days_between = as.integer(difftime(last_paid, first_paid, units = "days")),
    .groups = "drop"
  )

user_retention %>%
  mutate(user_type = if_else(total_bills > 1, "Repeat", "One-Time")) %>%
  count(user_type) %>%
  ggplot(aes(x = user_type, y = n, fill = user_type)) +
  geom_col() +
  scale_y_continuous(labels = comma) +
  labs(title = "User Retention: Repeat vs One-Time", x = "User Type", y = "Count") +
  theme_minimal()

Insight: Approximately half of all users return after their first bill, indicating a baseline of healthy repeat engagement. However, the remaining 50% churn immediately after their initial transaction, representing a significant drop-off and a major opportunity for intervention. This split suggests that while the product or payment flow retains some users effectively, others may lack sufficient incentive, onboarding clarity, or follow-up communication. Improving early lifecycle engagement — especially within the first 30–60 days — could meaningfully increase retention and compound downstream value.


3 Monthly Cohort Retention

bills_df <- bills_df %>% mutate(cohort_month = floor_date(paid_date, "month"))

user_cohorts <- bills_df %>%
  group_by(user_id) %>%
  summarise(cohort = min(cohort_month), .groups = "drop") %>%
  left_join(bills_df, by = "user_id") %>%
  mutate(month_diff = interval(cohort, paid_date) %/% months(1))

cohort_sizes <- user_cohorts %>%
  filter(month_diff == 0) %>%
  group_by(cohort) %>%
  summarise(starting_users = n_distinct(user_id), .groups = "drop")

cohort_retention <- user_cohorts %>%
  group_by(cohort, month_diff) %>%
  summarise(users = n_distinct(user_id), .groups = "drop") %>%
  ungroup() %>%
  inner_join(cohort_sizes, by = "cohort") %>%
  mutate(retention_pct = users / starting_users * 100) %>%
  filter(month_diff > 0)

valid_cohorts <- bills_df %>%
  distinct(cohort_month) %>%
  arrange(cohort_month) %>%
  slice(2:(n() - 1)) %>%
  pull(cohort_month)

cohort_retention <- cohort_retention %>%
  filter(cohort %in% valid_cohorts)

ggplot(cohort_retention, aes(x = month_diff, y = retention_pct, color = as.factor(cohort))) +
  geom_line(linewidth = 1.2) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  labs(title = "Monthly Retention by Cohort", x = "Months Since First Bill", y = "Retention %", color = "Cohort") +
  theme_minimal()

Insight: The July and August 2023 cohorts exhibit strong early retention, with 23–30% of users returning by month 1. This suggests that during this period, acquisition quality or onboarding processes were more effective at converting first-time users into repeat billers. However, retention drops off sharply starting with the September 2023 cohort, where month-1 retention falls below 10% and continues to decay rapidly. This inflection point implies a potential shift in acquisition source, user experience, or billing behavior that should be immediately investigated. Reverting or revisiting tactics from the July–August window may offer a path to improving retention in future cohorts.


4 Retention by Bill Type

# Step 1: Compute user type proportions
bill_retention <- bills_df %>%
  group_by(bill_type, user_id) %>%
  summarise(total_bills = n(), .groups = "drop") %>%
  mutate(user_type = if_else(total_bills > 1, "Repeat", "One-Time")) %>%
  count(bill_type, user_type) %>%
  group_by(bill_type) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

# Step 2: Extract repeat percentages for sorting (ascending this time)
repeat_order <- bill_retention %>%
  filter(user_type == "Repeat") %>%
  arrange(pct) %>%  # ascending
  pull(bill_type)

# Step 3: Convert bill_type to factor with correct ordering
bill_retention$bill_type <- factor(bill_retention$bill_type, levels = repeat_order)

# Step 4: Plot
ggplot(bill_retention, aes(x = bill_type, y = pct, fill = user_type)) +
  geom_col(position = "stack") +
  scale_y_continuous(labels = percent_format()) +
  coord_flip() +
  labs(title = "Retention by Bill Type",
       x = "Bill Type",
       y = "User Share") +
  theme_minimal()

Insight: Medical and Utility bill types show the strongest retention performance, with nearly half of users returning to pay additional bills of the same type. These categories likely represent recurring, essential payments that naturally lend themselves to repeat behavior. On the other hand, categories like Citation and Other are overwhelmingly dominated by one-time users — suggesting they are situational or compulsory payments that don’t build habitual usage. This segmentation underscores where retention strategy should focus: reinforcing stickier bill types with reminders, incentives, or subscription options, while not over-investing in inherently low-retention categories.


5 Behavioral Frequency

user_gaps <- bills_df %>%
  arrange(user_id, paid_date) %>%
  group_by(user_id) %>%
  mutate(days_since_last = as.integer(difftime(paid_date, lag(paid_date), units = "days"))) %>%
  filter(!is.na(days_since_last)) %>%
  summarise(avg_gap = mean(days_since_last), .groups = "drop") %>%
  mutate(cadence = case_when(
    avg_gap <= 30 ~ "Monthly",
    avg_gap <= 90 ~ "Quarterly",
    avg_gap <= 180 ~ "Semi-Annual",
    TRUE ~ "Irregular"
  ))

user_gaps %>%
  count(cadence) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = cadence, y = pct, fill = cadence)) +
  geom_col() +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Cadence Frequency", x = "Cadence", y = "Percent of Users") +
  theme_minimal()

Insight: A majority of users fall into the Monthly or Quarterly cadence buckets, comprising over 60% of repeat behavior patterns. This frequency suggests a predictable rhythm to user engagement, ideal for aligning proactive nudges, email campaigns, or recurring billing options. Users in these buckets are likely managing recurring obligations — making them both more valuable and more sensitive to disruption in experience. Optimizing around these intervals (e.g., pre-due nudges, loyalty reminders) can significantly reduce churn and reinforce habit formation in your most valuable user segments.


6 Time to Repeat

time_to_repeat <- bills_df %>%
  arrange(user_id, paid_date) %>%
  group_by(user_id) %>%
  mutate(days_between = as.integer(difftime(lead(paid_date), paid_date, units = "days"))) %>%
  filter(!is.na(days_between)) %>%
  slice(1)

ggplot(time_to_repeat, aes(x = days_between)) +
  geom_histogram(binwidth = 7, fill = "#ff7a59") +
  scale_y_continuous(labels = comma) +
  labs(title = "Days Between First and Second Bill", x = "Days", y = "User Count") +
  theme_minimal()

Insight: The median time between a user’s first and second bill is 49 days, indicating a natural return window in the 6–8 week range. Most repeat behavior clusters within this timeframe, making it a critical period for lifecycle marketing and product re-engagement. Users who don’t return by day 60 are increasingly unlikely to convert without intervention, suggesting that targeted nudges, loyalty offers, or app notifications should be triggered in this window. Understanding and aligning communication with this behavioral rhythm can significantly improve second-bill conversion — a pivotal moment in reducing churn.

7 Cohort LTV

cohort_ltv <- bills_df %>%
  mutate(cohort_month = floor_date(paid_date, "month")) %>%
  group_by(user_id) %>%
  mutate(user_cohort = min(cohort_month)) %>%
  ungroup() %>%
  mutate(month_diff = interval(user_cohort, paid_date) %/% months(1)) %>%
  group_by(user_cohort, month_diff) %>%
  summarise(
    cohort_revenue = sum(amount),
    users_in_cohort = n_distinct(user_id),
    .groups = "drop"
  ) %>%
  group_by(user_cohort) %>%
  mutate(ltv_per_user = cohort_revenue / first(users_in_cohort)) %>%
  ungroup() %>%
  filter(month_diff > 0)


cohort_ltv %>%
  ggplot(aes(x = month_diff, y = ltv_per_user, color = factor(user_cohort))) +
  geom_line(size = 1.2) +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title = "Lifetime Value per User by Cohort",
    x = "Months Since Cohort Start",
    y = "Cumulative LTV per User",
    color = "Cohort"
  ) +
  theme_minimal()

Insight: Cohorts from June through August 2023 generate significantly higher lifetime revenue per user compared to cohorts starting from September 2023. Even aside from retention differences, the value derived from earlier cohorts is substantially greater, highlighting a possible change in acquisition channel quality, targeted demographics, or payment sizes. Further investigation into the profitability drivers (e.g., average payment amounts, bill type mix, or user demographics) across these cohorts is strongly recommended to optimize future acquisition spend.

8 Payment Method & Channel Analysis

# Payment Channel Usage
bills_df %>%
  count(payment_method, payment_channel) %>%
  arrange(desc(n)) %>%
  mutate(share = round(n / sum(n), 3)) %>%
  rename(Count = n) %>%
  head(10) %>%
  mutate(Count = scales::comma(Count),
         share = scales::percent(share)) %>% 
  rename_with(~str_to_title(str_replace_all(., "_", " "))) %>%
  kable(caption = "Top 10 Payment Method & Channel Combinations", align = 'c') %>%
  kable_styling()
Top 10 Payment Method & Channel Combinations
Payment Method Payment Channel Count Share
ACH Electronic Payment 73,212 20.7%
Debit Card Electronic Payment 56,365 15.9%
ACH Web 42,075 11.9%
ACH Web Bot 38,615 10.9%
Credit Card Electronic Payment 28,045 7.9%
Debit Card Web Bot 28,007 7.9%
Debit Card Web 24,598 7.0%
Credit Card Web Bot 13,959 3.9%
Credit Card Web 12,633 3.6%
ACH USPS 11,441 3.2%

Insight: Certain payment methods and channels dominate user transactions. Understanding the distribution can inform strategic decisions about prioritizing user experience improvements, promotions, or partnership integrations for popular payment methods like ACH via Web or Debit/Credit Cards via Electronic Payment.

9 High-Spender Segmentation

bills_df %>%
  group_by(user_id, bill_type) %>%
  summarise(total_spend = sum(amount), .groups = "drop") %>%
  filter(total_spend > quantile(total_spend, 0.95)) %>%
  count(bill_type) %>%
  arrange(desc(n)) %>%
  rename(count = n) %>%
  mutate(pct_high_spenders = round(count / sum(count), 3),
         count = scales::comma(count),
         pct_high_spenders = scales::percent(pct_high_spenders)) %>%
  ungroup() %>%
  rename_with(~str_to_title(str_replace_all(., "_", " "))) %>%
  kable(caption = "Top Bill Types Among High-Spenders (Top 5%)", align = 'c') %>%
  kable_styling()
Top Bill Types Among High-Spenders (Top 5%)
Bill Type Count Pct High Spenders
Medical 6,375 70.8%
Other 944 10.5%
Hospital 647 7.2%
Utility 536 6.0%
Credit Card/Loan 222 2.5%
Dental 132 1.5%
Insurance 99 1.1%
Cable 32 0.4%
Phone 7 0.1%
Citation 6 0.1%
No Bill 2 0.0%
Unknown 2 0.0%

Insight: Bill types such as Medical, Hospital, and Utility are disproportionately common among top-spending users (top 5%), indicating these categories drive substantial revenue. This highlights strategic opportunities for premium offerings, targeted upselling, or focused partnership initiatives to enhance profitability within these segments.

10 Partnership Retention

# Partnership Impact on Retention
partnership_retention <- bills_df %>%
  group_by(user_id) %>%
  summarize(
    partner_user = if_else(any(is_partner == "Yes"), "Partner", "Non-Partner"),
    retained = n() > 1,
    .groups = "drop"
  ) %>%
  group_by(partner_user) %>%
  summarise(retention_rate = mean(retained) * 100)

ggplot(partnership_retention, aes(x = partner_user, y = retention_rate, fill = partner_user)) +
  geom_col(width = 0.6) +
  scale_y_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 60)) +
  labs(
    title = "User Retention Rate by Partnership Status",
    x = "User Type",
    y = "Retention Rate (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Insight: Users affiliated with partnerships exhibit a notably higher retention rate (50.0%) compared to non-partnership users (46.0%). This 4% absolute increase represents a significant business opportunity. Expanding or deepening partnership integrations could meaningfully boost user loyalty and retention, thereby enhancing lifetime value and profitability.

11 Payment Channel Efficiency

channel_efficiency <- bills_df %>%
  mutate(lag_days = as.integer(paid_date - submitted_date)) %>%
  group_by(payment_channel) %>%
  summarise(
    avg_lag_days = mean(lag_days, na.rm = TRUE),
    median_lag_days = median(lag_days, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_lag_days))

ggplot(channel_efficiency, aes(x = reorder(payment_channel, avg_lag_days), y = avg_lag_days)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Average Lag from Submission to Payment by Channel",
    x = "Payment Channel",
    y = "Average Lag (Days)"
  ) +
  theme_minimal()

Insight: There is considerable variability in payment processing efficiency among channels. Voice (4.1 days) and IVR Bot (3.0 days) channels have notably longer delays, whereas Electronic Payment (0.4 days) and Web Bot (0.7 days) channels are the fastest. These substantial differences highlight opportunities to optimize user experience and operational efficiency. Encouraging user adoption of faster payment channels can decrease processing times, improve cash flow predictability, and enhance customer satisfaction.