Wolt segmentation

Installing packages

Loading data

EDA

My goal is to gradually develop an understanding of my data. I'll approach it in cycles: asking questions, trying to answer them, and generating new questions along the way.
The key here is to find good questions to ask, which I don't know yet. So it will be a creative process of searching, with the help of a few frameworks so I can be sure not to get lost in the process.

My usual framework is to cover at least the basic questions and investigate the anomalies I notice along the way.

Step 1 - Checking data health

At this stage, I have 2 general questions:

  1. What is this data about?
  2. Is the data good enough to be used?

1. Variables

Get a first glimpse of the data to form a basic understanding of the variables.
Let's group the variables by category:

  • Demography:
    USER_ID
    REGISTRATION_COUNTRY
  • Interface:
    PREFERRED_DEVICE
    IOS_PURCHASES
    WEB_PURCHASES
    ANDROID_PURCHASES
  • Purchases:
    PURCHASE_COUNT
  • Delivery type:
    PURCHASE_COUNT_DELIVERY
    PURCHASE_COUNT_TAKEAWAY
    AVERAGE_DELIVERY_DISTANCE_KMS
  • Purchase habits:
    MOST_COMMON_WEEKDAY_TO_PURCHASE
    MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE
    BREAKFAST_PURCHASES
    LUNCH_PURCHASES
    EVENING_PURCHASES
    DINNER_PURCHASES
  • Venue type:
    DISTINCT_PURCHASE_VENUE_COUNT
    PREFERRED_RESTAURANT_TYPES
    PURCHASE_COUNT_BY_STORE_TYPE
  • Retention:
    REGISTRATION_DATE
    FIRST_PURCHASE_DAY
    LAST_PURCHASE_DAY
    USER_HAS_VALID_PAYMENT_METHOD
  • Frequency:
    AVG_DAYS_BETWEEN_PURCHASES
    MEDIAN_DAYS_BETWEEN_PURCHASES
  • Revenue:
    TOTAL_PURCHASES_EUR
    MIN_PURCHASE_VALUE_EUR
    AVG_PURCHASE_VALUE_EUR
    MAX_PURCHASE_VALUE_EUR

Summary:
A lot of purchase-related data, already aggregated (this is good). Numeric, categorical and date variables.
Almost no data about non-purchasers.
But for paying users it's enough for segmentation.
And one abomination: the non-tidy store type variables.

Basic understanding - done

2. Data health

What is healthy data? Here is my framework:

  • Import errors - wrong delimiters, missing colnames, footnotes or appendix, wrong encoding
  • Data types - wrong date formats, char instead of numeric, unneeded lists
  • Dataset quality - inconsistency in Tidy structure, missing data, unusual NAs or 0s, bad naming
  • Complex variables - high cardinality, parameters for split, “other” category

Quality seems ok, but I want to do some fixes:

  • done - Move id to be the 1st column
  • skip - Lowercase variable names
  • skip - Deal with PREFERRED_RESTAURANT_TYPES and PURCHASE_COUNT_BY_STORE_TYPE
  • done - Substitute NAs with 0s (with some exceptions, like MIN_PURCHASE_VALUE_EUR)
  • done - Convert logical to character: USER_HAS_VALID_PAYMENT_METHOD
  • done - Convert numeric to character: MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE, MOST_COMMON_WEEKDAY_TO_PURCHASE
  • done - Rename NA to "unknown": USER_HAS_VALID_PAYMENT_METHOD, REGISTRATION_COUNTRY, PREFERRED_DEVICE, MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE, MOST_COMMON_WEEKDAY_TO_PURCHASE
  • done - Drop LATE_NIGHT_PURCHASES - it has only NAs and 0s
  • done - Add PAYING_USER for easy marking
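
A minimal sketch of these fixes, assuming the raw data sits in a tibble called `users` (the name and pipeline details are hypothetical):

```r
library(dplyr)
library(tidyr)

users_clean <- users %>%
  relocate(USER_ID) %>%                          # id as the 1st column
  mutate(
    # logical -> character
    USER_HAS_VALID_PAYMENT_METHOD = as.character(USER_HAS_VALID_PAYMENT_METHOD),
    # numeric -> character
    across(c(MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE,
             MOST_COMMON_WEEKDAY_TO_PURCHASE), as.character),
    # NA -> "unknown" for the categorical variables
    across(c(USER_HAS_VALID_PAYMENT_METHOD, REGISTRATION_COUNTRY,
             PREFERRED_DEVICE, MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE,
             MOST_COMMON_WEEKDAY_TO_PURCHASE),
           ~ replace_na(.x, "unknown")),
    # NA -> 0 for the purchase counters (but not MIN_PURCHASE_VALUE_EUR)
    across(c(PURCHASE_COUNT, IOS_PURCHASES, WEB_PURCHASES, ANDROID_PURCHASES,
             PURCHASE_COUNT_DELIVERY, PURCHASE_COUNT_TAKEAWAY),
           ~ replace_na(.x, 0)),
    # easy marking of paying users
    PAYING_USER = PURCHASE_COUNT > 0
  ) %>%
  select(-LATE_NIGHT_PURCHASES)                  # only NAs and 0s there
```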

For users with a PURCHASE_COUNT of 0, the other variables hold NAs instead of 0s. Why? To make calculating some other metrics easier?
For these variables it would be reasonable to fill the missing values with 0.
There is a huge share of 0s, so almost all distribution plots will be unreadable without filtering them out.

There is still missing data, but it's not essential and can be handled later.

Health is fine, so - done

Conclusion

Now that I've got a feel for the data and performed some cleaning, I can move on to the analysis step.

Step 2 - Univariate analysis

My question here is - what type of variation occurs within my variables?

  1. What are the typical values?
  2. Are there any visible clusters?
  3. Are there outliers?

To answer these, I'll use descriptive statistics to get central tendency, spread and distribution shape:
distributions for numeric variables, frequencies for categorical ones.
Data visualisation is also useful, because it's easy to miss something in a table full of values.

1. Typical values

For every variable, I'll quickly run through these questions:

  • Which values are the most common? Why?
  • Which are rare? Why? Does that match my expectations?
  • Any unusual patterns? Possible explanations?

For numerical variables:
Building a table that combines descriptive stats and barplots.
* On the 1st try, the table was full of 0s, because most variables only make sense for paying users.
* So let's exclude non-payers to make the picture cleaner.
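
A sketch of how such a table could be produced, assuming the cleaned tibble is called `users_clean`; `skimr::skim()` conveniently combines descriptive stats with inline mini-histograms:

```r
library(dplyr)
library(skimr)

users_clean %>%
  filter(PAYING_USER) %>%          # exclude non-payers: too many 0s otherwise
  select(where(is.numeric)) %>%
  skim()
```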

All the distributions are stacked towards the lowest values and have long tails.
This indicates that a large number of users make a small number of purchases, and a small number of users make a large number of purchases.
This is what I'd expect from e-commerce data, so I'm assuming these are not data errors but the intended nature of the app.

For categorical variables:

I'm using skimr to get an idea of the variable levels.

Then building frequency tables using funModeling.
This works well inside an R notebook, but needs a workaround for the rendered HTML.
I'll try to fix this if I have more time later.
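
Roughly like this (a sketch; the variable selection is just an example):

```r
library(dplyr)
library(funModeling)

users_clean %>%
  select(REGISTRATION_COUNTRY, PREFERRED_DEVICE, USER_HAS_VALID_PAYMENT_METHOD,
         MOST_COMMON_WEEKDAY_TO_PURCHASE,
         MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE) %>%
  freq()    # prints a frequency table and a barplot per variable
```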

Now let's plot the same variables.

(Plots: Country, Device, Card, Time, Weekday)

Unusual or interesting stuff:

  • The number of purchases varies heavily
  • There are many users (25%) with more than 2 months between purchases.
  • I can use days between purchases to get a c2 metric (% of users who bought 2+ times) - see the sketch after this list.
  • In average purchase value, there is a curious spike at the beginning of the distribution. Discounts?
  • No major seasonality according to the MOST_COMMON_… variables
  • 95%+ of users are from the FIN-DNK-GRC markets
  • The iOS share is slightly above Android. What if I exclude paying users? Almost the same.
  • Only 20% of paying users have PREFERRED_RESTAURANT_TYPES. How is it calculated? Can the user pick it?
  • Why does USER_HAS_VALID_PAYMENT_METHOD have 40% "false" in the PAYING_USER group? Out-of-date cards?
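
A quick sketch of the c2 metric mentioned above (assuming `users_clean` and the `PAYING_USER` flag from the cleaning step):

```r
library(dplyr)

# share of paying users who bought at least twice
users_clean %>%
  filter(PAYING_USER) %>%
  summarise(c2 = mean(PURCHASE_COUNT >= 2))
```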

Understanding basic patterns - done

2. Visual clusters

Any interesting subgroups?

  • What are the similarities within subgroups?
  • Differences?
  • Can I describe them somehow?

There are good candidates for Whales / Dolphins.
Clusters based on meal type or time of the day.
Or by platform preference and delivery type.

Understanding clusters - done

3. Outliers

Are there outliers?

  • Anything at min/max?
  • Are they genuinely unusual, or just missing data or mistakes?
  • Try to exclude them and redo analysis?

I didn't notice anything broken, just some signs of power users or strange patterns:

  • Users with too many distinct venues. 70 distinct places? Wow, I want to talk to that guy!
  • Same with 320 purchases, almost daily, considering it's a 1-year timeline. Not so many such users, though.

Understanding outliers - done

Conclusion

This is a pretty good dataset, considering the goal of finding segments for reactivation.
It has a good conversion window, a representative cohort size, a fair share of paying users, good numeric features and kinda-OK categorical features.

I've analysed the variation to understand what my data consists of and how I can split it into segments.
I'm most interested in the numeric variables that can be a base for RFM analysis.

Let's also look at aggregated metrics across all the data to get a top-level view of the product.

Very impressive conversion rate, and conversion to the second order as well.

Step 3 - Multivariate analysis

What is my goal here? Formally, it's this: what type of covariation occurs between my variables?
Covariation is the tendency for variables to vary together in a related way. Visualization is a good way to spot it.
So, I’m looking for some visual indicators of a relationships or dependencies in data.

For different data types, there are different approaches:

  1. Comparing Num vs Num, using correlations heatmap and scatterplots
  2. Comparing Cat vs Num, using boxplots/violins
  3. Comparing Cat vs Cat, using contingency tables

0. Basic feature engineering

To save time, I’ll do only basic things:

  • Cardinality
    Done - Combine REGISTRATION_COUNTRY values outside the top 3 into "Other" (see the sketch after this list)
    Half-done - Deal with the JSON-like strings - PREFERRED_RESTAURANT_TYPES, PURCHASE_COUNT_BY_STORE_TYPE - split and generalise them into categories
  • Chunking
    Skip - FIRST_PURCHASE_DAY, LAST_PURCHASE_DAY - split into 3 chunks
    Skip - REGISTRATION_DATE - weeks
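
The country lumping is nearly a one-liner with forcats (a sketch; `fct_lump_n()` keeps the n most frequent levels):

```r
library(dplyr)
library(forcats)

users_clean <- users_clean %>%
  mutate(REGISTRATION_COUNTRY = fct_lump_n(factor(REGISTRATION_COUNTRY),
                                           n = 3, other_level = "Other"))
```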

1. Correlations

The best way to get ideas about the relationships between numeric variables is a heatmap of the correlation matrix.
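
Something like this (a sketch; corrplot is one of several packages that can draw it):

```r
library(dplyr)
library(corrplot)

users_clean %>%
  filter(PAYING_USER) %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%   # tolerate the remaining NAs
  corrplot(method = "color", order = "hclust", tl.cex = 0.6)
```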

Findings:

  • Restaurant purchases are strongly correlated with TOTAL_PURCHASES_EUR
    Need to convert this variable into a 2-level categorical: Restaurant vs other venue types
  • AVERAGE_DELIVERY_DISTANCE_KMS is not correlated with literally anything
    Is this a data error? Or is the product just that good at long-distance delivery? What about groceries?
  • MEDIAN_DAYS_BETWEEN_PURCHASES is only moderately correlated with TOTAL_PURCHASES_EUR and purchase counts
    Cheap-Frequent vs Expensive-Infrequent might be a good division
  • AVG_PURCHASE_VALUE_EUR is highly correlated with both MIN and MAX
    Are users sticking to their behaviour? I want to create a Consistency variable to look for a high-value segment

Usually it's a good idea to double-check findings by building scatterplots, but I'm skipping that here.

Main takeaway: RFM analysis is looking more and more promising.
There are a lot of non-obvious correlations, and I can enrich the data with new categorical variables.

2. Boxplots

Similarities between countries

Greece's average check (AVG_PURCHASE_VALUE_EUR) is roughly half that of the other markets, and its other metrics lag similarly.
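
A sketch of such a boxplot (log scale to tame the long tails):

```r
library(dplyr)
library(ggplot2)

users_clean %>%
  filter(PAYING_USER) %>%
  ggplot(aes(REGISTRATION_COUNTRY, AVG_PURCHASE_VALUE_EUR)) +
  geom_boxplot() +
  scale_y_log10()
```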

Between hours of the day

Mornings are bad? Not many venues open yet? People are still asleep?

By store type

(Plot: total purchases by store type)

Not sure about the correctness of the plot, so skipping it for now.

3. Cat vs cat analysis

Skip - I'll build pair plots and summary tables in the segmentation section

Conclusion

The correlation section gave me ideas for new variables; the categorical section was a little disappointing.
So I want to convert numeric variables to categorical by chunking them. Most of the non-obvious relationships in the data lie there.

Step 4 - Data structures analysis

The big question here: should I go on with segmentation using a Product approach or a Data approach?

  • Product - RFM analysis: enriching the data with new variables, then segmenting it
  • Data - ML-grade Feature Engineering, Clustering, PCA, Factor analysis and dimension reduction

Conclusion

I'll go with RFM analysis, since I'm already seeing signs it can be meaningful. The Data approach is risky due to cluster interpretation.

Segmentation

Approach justification

Let's sync with the task: the marketing team wants recommendations on how to reactivate different kinds of users, based on segments.

Why do they want to use segments? Because there is no "average user" - averages can be misleading.
Different groups of users might react in opposite ways to the same activities.
So I need to find interesting groups that differ from each other and share similar problems we can target.

To do that, I need to choose a segmentation base and some categorical groups to compare.

I’ll go with RFM analysis:

Recency - the app should hook users and not let them go away (not in a toxic way, of course)
Frequency - the app should become a familiar JTBD-solving tool
Monetary - the app must give users value, and they need to be comfortable paying for it

Why?

  • We have all the metrics in the dataset
  • These metrics are good predictors of user behavior
  • It's simple - I don't have much time
  • It's actionable

I'm aiming to find the group of the most valuable users and the one that is at risk (a good target for re-engagement).

Plan

  • Done - New features
  • Done - Setting a segmentation base
  • Done - Splitting by groups
  • Done - Estimating the size and revenue share
  • Done - Marketing activities plan
  • Done - Focusing on something and checking cat slices
  • Conclusions

Adding features and filtering data

I still have NAs; it's a problem, but I'll try to ignore it.

Segmentation base

More chunking and adding the essential RFM variables.

Some checking functions are commented out.
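
A minimal sketch of the segmentation base, assuming the date columns are already Dates; the activity definitions follow the segment description below, and the variable names are hypothetical:

```r
library(dplyr)

# date of the data cut
data_cut <- max(users_clean$LAST_PURCHASE_DAY, na.rm = TRUE)

rfm <- users_clean %>%
  filter(PAYING_USER) %>%
  mutate(
    recency_days = as.numeric(data_cut - LAST_PURCHASE_DAY),
    tenure_days  = as.numeric(LAST_PURCHASE_DAY - REGISTRATION_DATE),
    # activity groups: Loyal / Inactive / Dropping (see definitions below)
    activity = case_when(
      recency_days <= 30 ~ "Loyal",      # bought within a month of the data cut
      tenure_days  <= 60 ~ "Inactive",   # bought and left within 60 days of registration
      TRUE               ~ "Dropping"    # everyone in between
    )
  )
```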

Segments

Let's build and name the segments.

Segments description

Let's build a heatmap and check the sizes.

The numbers on top of the heatmap are the sizes of the segments (% of all paying users).
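
A sketch of the heatmap, assuming `rfm` has been enriched with a `segment` column (the five row groups below) next to the `activity` column:

```r
library(dplyr)
library(ggplot2)

rfm %>%
  count(segment, activity) %>%
  mutate(share = 100 * n / sum(n)) %>%              # % of all paying users
  ggplot(aes(activity, segment, fill = share)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.1f%%", share)))
```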

In the rows we have 5 groups:

  • Whales - Frequent (weekly) purchases with big checks (more than 20 EUR)
    These are our best users, with ~25% of total revenue
  • Dolphins - Frequent, but with a small check
  • Celebrators - Rare, but with a big check
    The biggest group
  • Casuals - Rare and cheap
  • Onetimers - No second purchase
    Also big

In the columns, 3 groups by activity:

  • Inactive - Bought something and left within 60 days after registration
  • Loyal - Last purchase was less than a month before the data cut
  • Dropping - Users in between

Metrics for scoring

Let's use a couple of metrics to analyse the segments.

I will use this table to prioritise ideas.
We have many possible marketing activities, and I want to estimate which of them are the most lucrative.

The Delta coefficient shows how much revenue comes from 1% of users in a segment (see the sketch below).
If we convert users from a popular segment with a low Delta into another segment with a larger Delta, it's a success.
So the goal of the marketing team is to move users from one segment to better ones,
or to prevent users from naturally flowing the other way.
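
A sketch of the Delta computation (revenue share per 1% of users; column names follow the hypothetical `rfm` table above):

```r
library(dplyr)

seg_scores <- rfm %>%
  group_by(segment, activity) %>%
  summarise(users   = n(),
            revenue = sum(TOTAL_PURCHASES_EUR),
            .groups = "drop") %>%
  mutate(user_share    = 100 * users / sum(users),
         revenue_share = 100 * revenue / sum(revenue),
         delta         = revenue_share / user_share)   # >1 means above-average value
```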

Most interesting segments are:

  • Inactive and Dropping Onetimers -> Loyal Onetimers
    They tried once and left.
    Try to reach them and give them a way to engage with the app more.
  • Dropping Celebrators -> Loyal Celebrators
    They buy rarely, but with a big check.
    Help them with incentives to use the app and form a habit.
  • Loyal Celebrators -> Loyal Whales
    Same, but they are not leaving.
    Give them more reasons to purchase regularly.
  • Dropping Whales -> Loyal Whales
    Our best users. It seems we are a really good solution for their JTBD.
    Give them premium support and exclusive deals.

Ideas for marketing team

1. Outreach

  • Personalized emails - reminding of the app's benefits, new features, or special offers
    Target: Inactive Onetimers - 22% of users, to convert them to Loyal Casuals (x5 Delta)
    10% (very generous!) click rate -> 25% purchase rate (same) -> 0.55% of users move from Inactive Onetimers to Loyal Casuals
    Before: 0.55% x 0.2 Delta
    After: 0.55% x 1.0 Delta, so the gain is 0.55% x 0.8 = +0.44% revenue (see the check after this list)
    Decision: meh
  • Push notifications - timed and attractive offers
    Target: Inactive Onetimers
  • Marketing Campaigns - campaigns on social media
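
The back-of-the-envelope check for the email idea above:

```r
moved <- 0.22 * 0.10 * 0.25   # targeted share * click rate * purchase rate = 0.55%
gain  <- moved * (1.0 - 0.2)  # Delta improves from 0.2 to 1.0
gain                          # 0.0044 -> roughly +0.44% of revenue
```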

2. Offering Incentives

  • Special Offers - offers, discounts, loyalty points, bundles
    Target: Dropping Celebrators, Inactive Onetimers
  • Partnerships - with popular venues for exclusive deals
    Target: Dropping Whales
  • New features - for returning users: profile customization, delivery priority, loyalty program, better recommendations
    Target: Dropping Celebrators - give them a way to develop a habit of using a product
  • Bonuses - for using specific features
    Target: Inactive Onetimers - help them get onboarded
  • Helping users with onboarding
  • Upsells
    Target: Loyal Whales

3. Improving app

  • Price experiments - some groups are more sensitive to overall price or commissions changes
    Target: Dropping Celebrators, Loyal Celebrators - try making them buy more for the same budget
  • Usability - fixing UX problems, better onboarding
    Target: Inactive Whales
  • New Features - recurring orders, timed deliveries, subscriptions, meal packs, lifestyle guidance
    Target: Loyal Celebrators - Give them a way to purchase more, try different venues or venue types.
  • Premium support - solving problems and giving advice
    Target: Dropping and Loyal Whales
  • Referral programs

4. Learn why users leave / stay

  • Data analysis - find possible causes of decreasing activity
  • Interviews - enriching the data, e.g. is it a solo buyer or a group
  • New segments - preferred food category, average spending
  • Feedback from current activities - what works?

Other slices

I'm dropping the Recency grouping and just checking the other categorical variables I've made.

Findings:

  • DAYS_TO_FIRST_PURCHASE
    For Celebrators, if the 1st purchase happened later than 1 week after registration, revenue drops 50%.
    Send them targeted push notifications?
  • VOLATILITY_OF_DAYS_BETWEEN_PURCHASES
    For Whales and Celebrators, the power of a stable habit is huge: almost x7 for Whales and x3 for Celebrators.
    We need to help users form this habit. Offer them healthy-lifestyle guidance?
  • VOLATILITY_OF_PURCHASE_VALUE
    Same. Might be some data-related biases.
    Fixed budget, buying more for the same money? Try discounts?
  • MOST_COMMON_WEEKDAY_TO_PURCHASE, MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE
    There are minor differences; I can't come up with anything here
  • REGISTRATION_COUNTRY
  • PREFERRED_DEVICE
  • AVERAGE_DELIVERY_DISTANCE_KMS
    Dolphins are x2 lower at long distances. Why? Is it groceries?
    Offer scheduling?
  • DISTINCT_PURCHASE_VENUE_COUNT
    For everyone: more distinct venues, more revenue. Not sure it's a causal relationship, but the correlation is strong.
  • PREFERRED_VENUE_TYPE
    For Whales. Oh wow, Groceries and Retail Stores are important. Restaurant-only users are x6 worse.
    Encourage users to try a different style of shopping.

It would be a good idea to focus separately on Dropping users and go through the slices again.

Conclusion

I'm tired.

Need a mood-enhancing pic.