Wolt segmentation

Installing packages

Loading data

EDA

My goal is to gradually develop an understanding of my data. I'll approach it in cycles: asking questions, trying to answer them, and generating new questions along the way.
The key here is to find good questions to ask, which I don't know yet. So it will be a creative process of searching, with the help of a few frameworks so I can be sure not to get lost in the process.

My usual framework is to cover at least the basic questions and investigate the anomalies I notice along the way.

Step 1 - Checking data health

At this stage, I have 2 general questions:

  1. What is this data about?
  2. Is the data good enough to be used?

1. Variables

Get a first glimpse of the data to form a basic understanding of the variables.
Let's group the variables by category:

  • Demography:
    USER_ID
    REGISTRATION_COUNTRY
  • Interface:
    PREFERRED_DEVICE
    IOS_PURCHASES
    WEB_PURCHASES
    ANDROID_PURCHASES
  • Purchases:
    PURCHASE_COUNT
  • Delivery type:
    PURCHASE_COUNT_DELIVERY
    PURCHASE_COUNT_TAKEAWAY
    AVERAGE_DELIVERY_DISTANCE_KMS
  • Purchase habits:
    MOST_COMMON_WEEKDAY_TO_PURCHASE
    MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE
    BREAKFAST_PURCHASES
    LUNCH_PURCHASES
    EVENING_PURCHASES
    DINNER_PURCHASES
  • Venue type:
    DISTINCT_PURCHASE_VENUE_COUNT
    PREFERRED_RESTAURANT_TYPES
    PURCHASE_COUNT_BY_STORE_TYPE
  • Retention:
    REGISTRATION_DATE
    FIRST_PURCHASE_DAY
    LAST_PURCHASE_DAY
    USER_HAS_VALID_PAYMENT_METHOD
  • Frequency:
    AVG_DAYS_BETWEEN_PURCHASES
    MEDIAN_DAYS_BETWEEN_PURCHASES
  • Revenue:
    TOTAL_PURCHASES_EUR
    MIN_PURCHASE_VALUE_EUR
    AVG_PURCHASE_VALUE_EUR
    MAX_PURCHASE_VALUE_EUR

Summary:
A lot of purchase-related data, already aggregated (this is good). Numeric, categorical and date variables.
Almost no data about non-purchasers.
But for paying users it's enough for segmentation.
And one abomination: the non-tidy store type variables.

Basic understanding - done

2. Data health

What is healthy data? Here is my framework:

  • Import errors - wrong delimiters, missing colnames, footnotes or appendix, wrong encoding
  • Data types - wrong date formats, char instead of numeric, unneeded lists
  • Dataset quality - inconsistency in Tidy structure, missing data, unusual NAs or 0s, bad naming
  • Complex variables - high cardinality, parameters for split, “other” category

Quality seems ok, but I want to do some fixes:

  • done - Move id to be the 1st column
  • skip - Lowercase variable names
  • skip - Deal with PREFERRED_RESTAURANT_TYPES and PURCHASE_COUNT_BY_STORE_TYPE
  • done - Substitute NAs with 0s (with some exceptions, like MIN_PURCHASE_VALUE_EUR)
  • done - Convert logical to character: USER_HAS_VALID_PAYMENT_METHOD
  • done - Convert numeric to character: MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE, MOST_COMMON_WEEKDAY_TO_PURCHASE
  • done - Rename NA to "unknown": USER_HAS_VALID_PAYMENT_METHOD, REGISTRATION_COUNTRY, PREFERRED_DEVICE, MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE, MOST_COMMON_WEEKDAY_TO_PURCHASE
  • done - Drop LATE_NIGHT_PURCHASES - it has only NAs and 0s
  • done - Add PAYING_USER for easy marking
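
A minimal sketch of these fixes, assuming the raw data sits in a tibble called `users` (the name and pipeline details are hypothetical):

```r
library(dplyr)
library(tidyr)

users_clean <- users %>%
  relocate(USER_ID) %>%                          # id as the 1st column
  mutate(
    # logical -> character
    USER_HAS_VALID_PAYMENT_METHOD = as.character(USER_HAS_VALID_PAYMENT_METHOD),
    # numeric -> character
    across(c(MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE,
             MOST_COMMON_WEEKDAY_TO_PURCHASE), as.character),
    # NA -> "unknown" for the categorical variables
    across(c(USER_HAS_VALID_PAYMENT_METHOD, REGISTRATION_COUNTRY,
             PREFERRED_DEVICE, MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE,
             MOST_COMMON_WEEKDAY_TO_PURCHASE),
           ~ replace_na(.x, "unknown")),
    # NA -> 0 for the purchase counters (but not MIN_PURCHASE_VALUE_EUR)
    across(c(PURCHASE_COUNT, IOS_PURCHASES, WEB_PURCHASES, ANDROID_PURCHASES,
             PURCHASE_COUNT_DELIVERY, PURCHASE_COUNT_TAKEAWAY),
           ~ replace_na(.x, 0)),
    # easy marking of paying users
    PAYING_USER = PURCHASE_COUNT > 0
  ) %>%
  select(-LATE_NIGHT_PURCHASES)                  # only NAs and 0s there
```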

For users with a PURCHASE_COUNT of 0, the other variables hold NAs instead of 0s. Why? To make calculating some other metrics easier?
For these variables it would be reasonable to fill the missing values with 0.
There is a huge share of 0s, so almost all distribution plots will be unreadable without filtering them out.

There is still missing data, but it's not essential and can be handled later.

Health is fine, so - done

Conclusion

Now that I've got a feel for the data and performed some cleaning, I can move on to the analysis step.

Step 2 - Univariate analysis

My question here is - what type of variation occurs within my variables?

  1. What are the typical values?
  2. Are there any visible clusters?
  3. Are there outliers?

To answer these, I'll use descriptive statistics to get central tendency, spread and distribution shape:
distributions for numeric variables, frequencies for categorical ones.
Data visualisation is also useful, because it's easy to miss something in a table full of values.

1. Typical values

For every variable, I'll quickly run through these questions:

  • Which values are the most common? Why?
  • Which are rare? Why? Does that match my expectations?
  • Any unusual patterns? Possible explanations?

For numerical variables:
Building a table that combines descriptive stats and barplots.
* On the 1st try, the table was full of 0s, because most variables only make sense for paying users.
* So let's exclude non-payers to make the picture cleaner.
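
A sketch of how such a table could be produced, assuming the cleaned tibble is called `users_clean`; `skimr::skim()` conveniently combines descriptive stats with inline mini-histograms:

```r
library(dplyr)
library(skimr)

users_clean %>%
  filter(PAYING_USER) %>%          # exclude non-payers: too many 0s otherwise
  select(where(is.numeric)) %>%
  skim()
```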

All the distributions are stacked towards the lowest values and have long tails.
This indicates that a large number of users make a small number of purchases, and a small number of users make a large number of purchases.
This is what I'd expect from e-commerce data, so I'm assuming these are not data errors but the intended nature of the app.

For categorical variables:

I'm using skimr to get an idea of the variable levels.

Then building frequency tables using funModeling.
This works well inside an R notebook, but needs a workaround for the rendered HTML.
I'll try to fix this if I have more time later.
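
Roughly like this (a sketch; the variable selection is just an example):

```r
library(dplyr)
library(funModeling)

users_clean %>%
  select(REGISTRATION_COUNTRY, PREFERRED_DEVICE, USER_HAS_VALID_PAYMENT_METHOD,
         MOST_COMMON_WEEKDAY_TO_PURCHASE,
         MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE) %>%
  freq()    # prints a frequency table and a barplot per variable
```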

Now let's plot the same variables.

(Plots: Country, Device, Card, Time, Weekday)

Unusual or interesting stuff:

  • The number of purchases varies heavily
  • There are many users (25%) with more than 2 months between purchases.
  • I can use days between purchases to get a c2 metric (% of users who bought 2+ times) - see the sketch after this list.
  • In average purchase value, there is a curious spike at the beginning of the distribution. Discounts?
  • No major seasonality according to the MOST_COMMON_… variables
  • 95%+ of users are from the FIN-DNK-GRC markets
  • The iOS share is slightly above Android. What if I exclude paying users? Almost the same.
  • Only 20% of paying users have PREFERRED_RESTAURANT_TYPES. How is it calculated? Can the user pick it?
  • Why does USER_HAS_VALID_PAYMENT_METHOD have 40% "false" in the PAYING_USER group? Out-of-date cards?
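
A quick sketch of the c2 metric mentioned above (assuming `users_clean` and the `PAYING_USER` flag from the cleaning step):

```r
library(dplyr)

# share of paying users who bought at least twice
users_clean %>%
  filter(PAYING_USER) %>%
  summarise(c2 = mean(PURCHASE_COUNT >= 2))
```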

Understanding basic patterns - done

2. Visual clusters

Any interesting subgroups?

  • What are the similarities within subgroups?
  • Differences?
  • Can I describe them somehow?

There are good candidates for Whales / Dolphins.
Clusters based on meal type or time of the day.
Or by platform preference and delivery type.

Understanding clusters - done

3. Outliers

Are there outliers?

  • Anything at min/max?
  • Are they genuinely unusual, or just missing data or mistakes?
  • Try to exclude them and redo analysis?

I didn't notice anything broken, just some signs of power users or strange patterns:

  • Users with too many distinct venues. 70 distinct places? Wow, I want to talk to that guy!
  • Same with 320 purchases, almost daily, considering it's a 1-year timeline. Not so many such users, though.

Understanding outliers - done

Conclusion

This is a pretty good dataset, considering the goal of finding segments for reactivation.
It has a good conversion window, a representative cohort size, a fair share of paying users, good numeric features and kinda-OK categorical features.

I've analysed the variation to understand what my data consists of and how I can split it into segments.
I'm most interested in the numeric variables that can be a base for RFM analysis.

Let's also look at aggregated metrics across all the data to get a top-level view of the product.

Very impressive conversion rate, and conversion to the second order as well.

Step 3 - Multivariate analysis

What is my goal here? Formally, it's this: what type of covariation occurs between my variables?
Covariation is the tendency for variables to vary together in a related way. Visualization is a good way to spot it.
So, I’m looking for some visual indicators of a relationships or dependencies in data.

For different data types, there are different approaches:

  1. Comparing Num vs Num, using correlations heatmap and scatterplots
  2. Comparing Cat vs Num, using boxplots/violins
  3. Comparing Cat vs Cat, using contingency tables

0. Basic feature engineering

To save time, I’ll do only basic things:

  • Cardinality
    Done - Combine REGISTRATION_COUNTRY values outside the top 3 into "Other" (see the sketch after this list)
    Half-done - Deal with the JSON-like strings - PREFERRED_RESTAURANT_TYPES, PURCHASE_COUNT_BY_STORE_TYPE - split and generalise them into categories
  • Chunking
    Skip - FIRST_PURCHASE_DAY, LAST_PURCHASE_DAY - split into 3 chunks
    Skip - REGISTRATION_DATE - weeks
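
The country lumping is nearly a one-liner with forcats (a sketch; `fct_lump_n()` keeps the n most frequent levels):

```r
library(dplyr)
library(forcats)

users_clean <- users_clean %>%
  mutate(REGISTRATION_COUNTRY = fct_lump_n(factor(REGISTRATION_COUNTRY),
                                           n = 3, other_level = "Other"))
```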

1. Correlations

The best way to get ideas about the relationships between numeric variables is a heatmap of the correlation matrix.
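
Something like this (a sketch; corrplot is one of several packages that can draw it):

```r
library(dplyr)
library(corrplot)

users_clean %>%
  filter(PAYING_USER) %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%   # tolerate the remaining NAs
  corrplot(method = "color", order = "hclust", tl.cex = 0.6)
```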

Findings:

  • Restaurant purchases are strongly correlated with TOTAL_PURCHASES_EUR
    Need to convert this variable into a 2-level categorical: Restaurant vs other venue types
  • AVERAGE_DELIVERY_DISTANCE_KMS is not correlated with literally anything
    Is this a data error? Or is the product just that good at long-distance delivery? What about groceries?
  • MEDIAN_DAYS_BETWEEN_PURCHASES is only moderately correlated with TOTAL_PURCHASES_EUR and purchase counts
    Cheap-Frequent vs Expensive-Infrequent might be a good division
  • AVG_PURCHASE_VALUE_EUR is highly correlated with both MIN and MAX
    Are users sticking to their behaviour? I want to create a Consistency variable to look for a high-value segment

Usually it's a good idea to double-check findings by building scatterplots, but I'm skipping that here.

Main takeaway: RFM analysis is looking more and more promising.
There are a lot of non-obvious correlations, and I can enrich the data with new categorical variables.

2. Boxplots

Similarities between countries

Greece's average check (AVG_PURCHASE_VALUE_EUR) is roughly half that of the other markets, and its other metrics lag similarly.
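
A sketch of such a boxplot (log scale to tame the long tails):

```r
library(dplyr)
library(ggplot2)

users_clean %>%
  filter(PAYING_USER) %>%
  ggplot(aes(REGISTRATION_COUNTRY, AVG_PURCHASE_VALUE_EUR)) +
  geom_boxplot() +
  scale_y_log10()
```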

Between hours of the day

Mornings are bad? Not many venues open yet? People are still asleep?

By store type

(Plot: total purchases by store type)

Not sure about the correctness of the plot, so skipping it for now.

3. Cat vs cat analysis

Skip - I'll build pair plots and summary tables in the segmentation section

Conclusion

The correlation section gave me ideas for new variables; the categorical section was a little disappointing.
So I want to convert numeric variables to categorical by chunking them. Most of the non-obvious relationships in the data lie there.

Step 4 - Data structures analysis

The big question here: should I go on with segmentation using a Product approach or a Data approach?

  • Product - RFM analysis: enriching the data with new variables, then segmenting it
  • Data - ML-grade Feature Engineering, Clustering, PCA, Factor analysis and dimension reduction

Conclusion

I'll go with RFM analysis, since I'm already seeing signs it can be meaningful. The Data approach is risky due to cluster interpretation.

Segmentation

Approach justification

Let's sync with the task: the marketing team wants recommendations on how to reactivate different kinds of users, based on segments.

Why do they want to use segments? Because there is no "average user" - averages can be misleading.
Different groups of users might react in opposite ways to the same activities.
So I need to find interesting groups that differ from each other and share similar problems we can target.

To do that, I need to choose a segmentation base and some categorical groups to compare.

I’ll go with RFM analysis:

Recency - the app should hook users and not let them go away (not in a toxic way, of course)
Frequency - the app should become a familiar JTBD-solving tool
Monetary - the app must give users value, and they need to be comfortable paying for it

Why?

  • We have all the metrics in the dataset
  • These metrics are good predictors of user behavior
  • It's simple - I don't have much time
  • It's actionable

I'm aiming to find the group of the most valuable users and the one that is at risk (a good target for re-engagement).

Plan

  • Done - New features
  • Done - Setting a segmentation base
  • Done - Splitting by groups
  • Done - Estimating the size and revenue share
  • Done - Marketing activities plan
  • Done - Focusing on something and checking cat slices
  • Conclusions

Adding features and filtering data

I still have NAs; it's a problem, but I'll try to ignore it.

Segmentation base

More chunking and adding the essential RFM variables.

Some checking functions are commented out.
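
A minimal sketch of the segmentation base, assuming the date columns are already Dates; the activity definitions follow the segment description below, and the variable names are hypothetical:

```r
library(dplyr)

# date of the data cut
data_cut <- max(users_clean$LAST_PURCHASE_DAY, na.rm = TRUE)

rfm <- users_clean %>%
  filter(PAYING_USER) %>%
  mutate(
    recency_days = as.numeric(data_cut - LAST_PURCHASE_DAY),
    tenure_days  = as.numeric(LAST_PURCHASE_DAY - REGISTRATION_DATE),
    # activity groups: Loyal / Inactive / Dropping (see definitions below)
    activity = case_when(
      recency_days <= 30 ~ "Loyal",      # bought within a month of the data cut
      tenure_days  <= 60 ~ "Inactive",   # bought and left within 60 days of registration
      TRUE               ~ "Dropping"    # everyone in between
    )
  )
```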

Segments

Let's build and name the segments.

Segments description

Let's build a heatmap and check the sizes.

The numbers on top of the heatmap are the sizes of the segments (% of all paying users).
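
A sketch of the heatmap, assuming `rfm` has been enriched with a `segment` column (the five row groups below) next to the `activity` column:

```r
library(dplyr)
library(ggplot2)

rfm %>%
  count(segment, activity) %>%
  mutate(share = 100 * n / sum(n)) %>%              # % of all paying users
  ggplot(aes(activity, segment, fill = share)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.1f%%", share)))
```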

In the rows we have 5 groups:

  • Whales - Frequent (weekly) purchases with big checks (more than 20 EUR)
    These are our best users, with ~25% of total revenue
  • Dolphins - Frequent, but with a small check
  • Celebrators - Rare, but with a big check
    The biggest group
  • Casuals - Rare and cheap
  • Onetimers - No second purchase
    Also big

In the columns, 3 groups by activity:

  • Inactive - Bought something and left within 60 days after registration
  • Loyal - Last purchase was less than a month before the data cut
  • Dropping - Users in between

Metrics for scoring

Let's use a couple of metrics to analyse the segments.

I will use this table to prioritise ideas.
We have many possible marketing activities, and I want to estimate which of them are the most lucrative.

The Delta coefficient shows how much revenue comes from 1% of users in a segment (see the sketch below).
If we convert users from a popular segment with a low Delta into another segment with a larger Delta, it's a success.
So the goal of the marketing team is to move users from one segment to better ones,
or to prevent users from naturally flowing the other way.
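
A sketch of the Delta computation (revenue share per 1% of users; column names follow the hypothetical `rfm` table above):

```r
library(dplyr)

seg_scores <- rfm %>%
  group_by(segment, activity) %>%
  summarise(users   = n(),
            revenue = sum(TOTAL_PURCHASES_EUR),
            .groups = "drop") %>%
  mutate(user_share    = 100 * users / sum(users),
         revenue_share = 100 * revenue / sum(revenue),
         delta         = revenue_share / user_share)   # >1 means above-average value
```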

Most interesting segments are:

  • Inactive and Dropping Onetimers -> Loyal Onetimers
    They tried once and left.
    Try to reach them and give them a way to engage with the app more.
  • Dropping Celebrators -> Loyal Celebrators
    They buy rarely, but with a big check.
    Help them with incentives to use the app and form a habit.
  • Loyal Celebrators -> Loyal Whales
    Same, but they are not leaving.
    Give them more reasons to purchase regularly.
  • Dropping Whales -> Loyal Whales
    Our best users. It seems we are a really good solution for their JTBD.
    Give them premium support and exclusive deals.

Ideas for marketing team

1. Outreach

  • Personalized emails - reminding of the app's benefits, new features, or special offers
    Target: Inactive Onetimers - 22% of users, to convert them to Loyal Casuals (x5 Delta)
    10% (very generous!) click rate -> 25% purchase rate (same) -> 0.55% of users move from Inactive Onetimers to Loyal Casuals
    Before: 0.55% x 0.2 Delta
    After: 0.55% x 1.0 Delta, so the gain is 0.55% x 0.8 = +0.44% revenue (see the check after this list)
    Decision: meh
  • Push notifications - timed and attractive offers
    Target: Inactive Onetimers
  • Marketing Campaigns - campaigns on social media
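
The back-of-the-envelope check for the email idea above:

```r
moved <- 0.22 * 0.10 * 0.25   # targeted share * click rate * purchase rate = 0.55%
gain  <- moved * (1.0 - 0.2)  # Delta improves from 0.2 to 1.0
gain                          # 0.0044 -> roughly +0.44% of revenue
```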

2. Offering Incentives

  • Special Offers - offers, discounts, loyalty points, bundles
    Target: Dropping Celebrators, Inactive Onetimers
  • Partnerships - with popular venues for exclusive deals
    Target: Dropping Whales
  • New features - for returning users: profile customization, delivery priority, loyalty program, better recommendations
    Target: Dropping Celebrators - give them a way to develop a habit of using a product
  • Bonuses - for using specific features
    Target: Inactive Onetimers - help them get onboarded
  • Helping users with onboarding
  • Upsells
    Target: Loyal Whales

3. Improving app

  • Price experiments - some groups are more sensitive to overall price or commissions changes
    Target: Dropping Celebrators, Loyal Celebrators - try making them buy more for the same budget
  • Usability - fixing UX problems, better onboarding
    Target: Inactive Whales
  • New Features - recurring orders, timed deliveries, subscriptions, meal packs, lifestyle guidance
    Target: Loyal Celebrators - Give them a way to purchase more, try different venues or venue types.
  • Premium support - solving problems and giving advice
    Target: Dropping and Loyal Whales
  • Referral programs

4. Learn why users leave / stay

  • Data analysis - find possible causes of decreasing activity
  • Interviews - enriching the data, e.g. is it a solo buyer or a group
  • New segments - preferred food category, average spending
  • Feedback from current activities - what works?

Other slices

I'm dropping the Recency grouping and just checking the other categorical variables I've made.

Findings:

  • DAYS_TO_FIRST_PURCHASE
    For Celebrators, if the 1st purchase happened later than 1 week after registration, revenue drops 50%.
    Send them targeted push notifications?
  • VOLATILITY_OF_DAYS_BETWEEN_PURCHASES
    For Whales and Celebrators, the power of a stable habit is huge: almost x7 for Whales and x3 for Celebrators.
    We need to help users form this habit. Offer them healthy-lifestyle guidance?
  • VOLATILITY_OF_PURCHASE_VALUE
    Same. Might be some data-related biases.
    Fixed budget, buying more for the same money? Try discounts?
  • MOST_COMMON_WEEKDAY_TO_PURCHASE, MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE
    There are minor differences; I can't come up with anything here
  • REGISTRATION_COUNTRY
  • PREFERRED_DEVICE
  • AVERAGE_DELIVERY_DISTANCE_KMS
    Dolphins are x2 lower at long distances. Why? Is it groceries?
    Offer scheduling?
  • DISTINCT_PURCHASE_VENUE_COUNT
    For everyone: more distinct venues, more revenue. Not sure it's a causal relationship, but the correlation is strong.
  • PREFERRED_VENUE_TYPE
    For Whales. Oh wow, Groceries and Retail Stores are important. Restaurant-only users are x6 worse.
    Encourage users to try a different style of shopping.

It would be a good idea to focus separately on Dropping users and go through the slices again.

Conclusion

I'm tired.

Need a mood-enhancing pic.