With the new year, I started to look for new employment opportunities and even managed to land a handful of final stage interviews before it all grounded to a halt following the coronavirus pandemic. Invariably, as part of the selection process I was asked to analyse a set of data and compile a number of data driven-recommendations to present in my final meeting.
In this post I retrace the steps I took for one of the take home analysis I was tasked with and revisit clustering, one of my favourite analytic methods. Only this time the set up is a lot closer to a real-world situation in that the data I had to analyse came with a mix of categorical and numerical feature. Simply put, this could not be tackled with a bog-standard K-means algorithm as it’s based on pairwise Euclidean distances and has no direct application to categorical data.
The data represents all online acquisitions in February 2018 and their subcripion status 3 mohnths later (9th June 2018) for a Fictional News Aggregator Subscription business. It contains a number of parameters describing each account, like account creation date, campaign attributed to acquisition, payment method and length of product trial. A full description of the variables can be found in the Appendix.
I got hold of this dataset in the course of a recruitment selection process as I was asked to carry out an analysis and present results and recommendations that could helps improve their product take up in my final meeting.
Although fictitius in nature, I thought it best to further anonimise the dataset by changing names and values of most of the variables as well as removing several features that were of no use to the analysis. You can find the dataset on my GitHub profile, along with the scripts I used to carry out the analysis.
In this project I’m also testing quite a few of the adorn_ and the tabyl functions from the janitor library, a family of functions created to help expedite the initial data exploration and cleaning but also very useful to create and format summary tables.
As this is raw data, it needed some cleansing and rearranging. Let’s start with loading up the data and make some adjustments to the variable names.
data_raw <-
readxl::read_xlsx(
path = "../00_data/Subscription Data.xlsx",
sheet = 'Data',
trim_ws = TRUE,
col_names = TRUE,
guess_max = 2000
) %>%
# all headers to lower case
set_names(names(.) %>% str_to_lower) %>%
# shortening some names
rename_at(vars(contains("cancellation")),
funs(str_replace_all(., "cancellation", "canc"))) %>%
# swapping space with underscore in some names
rename_at(vars(contains(" ")),
funs(str_replace_all(., "[ ]", "_")))A first glance at the data structure and it all looks in good order: all variables are in the format I would expect them to be.
## Observations: 4,853
## Variables: 16
## $ account_id <chr> "ID0026621", "ID0033642", "ID0036592", "ID00...
## $ created_date <dttm> 2008-11-20, 2008-11-20, 2008-11-20, 2008-11...
## $ country <chr> "United Kingdom", "United Kingdom", "United ...
## $ status <chr> "Cancelled", "Cancelled", "Active", "Cancell...
## $ product_group <chr> "Premium", "Standard", "Premium", "Standard"...
## $ payment_frequency <chr> "Monthly", "Monthly", "Annual", "Monthly", "...
## $ campaign_code <chr> "55372", "57472", "38072", "56472", "56572",...
## $ start_date <dttm> 2018-02-01, 2018-02-12, 2018-02-05, 2018-02...
## $ end_date <dttm> 2019-02-01, 2019-02-12, 2019-02-05, 2019-02...
## $ canc_date <dttm> 2018-02-01, 2018-03-12, NA, 2018-02-12, NA,...
## $ canc_reason <chr> "Amendment", "Lack of time", NA, "Failed Dir...
## $ monthly_price <dbl> 15.00, 6.99, 16.67, 0.00, 3.75, 6.99, 11.67,...
## $ contract_monthly_price <dbl> 15.00, 6.99, 16.67, 6.99, 3.75, 6.99, 11.67,...
## $ trial_length <chr> NA, NA, NA, "1M", NA, "1M", NA, "1M", "1M", ...
## $ trial_monthly_price <dbl> NA, NA, NA, 0, NA, 0, NA, 0, 0, NA, NA, NA, ...
## $ payment_method <chr> "Credit Card", "Credit Card", "Direct Debit"...
| Name | Piped data |
| Number of rows | 4853 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 3 |
| POSIXct | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| account_id | 0 | 1.00 | 9 | 9 | 0 | 4824 | 0 |
| country | 7 | 1.00 | 4 | 20 | 0 | 77 | 0 |
| status | 0 | 1.00 | 6 | 20 | 0 | 4 | 0 |
| product_group | 0 | 1.00 | 7 | 8 | 0 | 2 | 0 |
| payment_frequency | 0 | 1.00 | 6 | 10 | 0 | 3 | 0 |
| campaign_code | 0 | 1.00 | 5 | 7 | 0 | 73 | 0 |
| canc_reason | 3547 | 0.27 | 5 | 27 | 0 | 20 | 0 |
| trial_length | 473 | 0.90 | 2 | 2 | 0 | 4 | 0 |
| payment_method | 0 | 1.00 | 6 | 12 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| monthly_price | 0 | 1.0 | 7.34 | 3.79 | 0 | 6.99 | 6.99 | 6.99 | 26.00 | ▁▇▁▁▁ |
| contract_monthly_price | 0 | 1.0 | 8.20 | 3.21 | 0 | 6.99 | 6.99 | 6.99 | 26.00 | ▁▇▁▁▁ |
| trial_monthly_price | 474 | 0.9 | 0.00 | 0.10 | 0 | 0.00 | 0.00 | 0.00 | 6.94 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| created_date | 0 | 1.00 | 2008-11-20 | 2018-02-28 | 2017-12-20 | 971 |
| start_date | 0 | 1.00 | 2018-02-01 | 2018-02-28 | 2018-02-14 | 28 |
| end_date | 0 | 1.00 | 2018-03-01 | 2019-02-28 | 2019-02-13 | 47 |
| canc_date | 3549 | 0.27 | 2017-10-31 | 2019-02-21 | 2018-04-01 | 156 |
There are 29 (58/2) duplicate account_id
data_raw %>%
# selecting custoomer IDs that appear more than once
group_by(account_id) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
janitor::adorn_totals()It looks like the large majority of duplicate customer ID are linked to an Amendment on their account.
data_raw %>%
group_by(account_id) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
# appending all data back in
left_join(data_raw) %>%
# select some columns for a closer look
select(account_id, canc_reason) %>%
arrange(canc_reason)Given the very small number of duplicates compared to the total, I’m simply removing the duplicates.
Overwhelming majority of subscriptions are UK based
data_raw %>%
group_by(country) %>%
count() %>%
ungroup() %>%
arrange(desc(n)) %>%
ungroup() %>%
mutate(country = country %>% as_factor()) %>%
filter(n > 7) %>%
ggplot(aes(x = country, y = n)) +
geom_col(fill = "#E69F00", colour = "red") +
theme_minimal() +
labs(title = 'Number of subscriptions by acquisition country',
caption = '',
x = 'Country of Residence',
y = 'Number of Subscribers') +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8))Majority of accounts are either Active or Cancelled
I’m rolling Lapsed and Pending Cancellation into Cancelled
All looks OK here
Under payment frequency there are only a handful of Fixed Term
# no NAs
data_raw %>%
group_by(payment_frequency) %>%
count() %>%
ungroup() %>%
janitor::adorn_totals()I’m dropping Fixed Term
Top three campaigns account for over 72% of total acquisitions in February 2018
# no NAs
data_raw %>%
group_by(campaign_code) %>%
count() %>%
ungroup() %>%
arrange(desc(n)) %>%
ungroup() %>%
mutate(campaign_code = campaign_code %>% as_factor()) %>%
filter(n > 3) %>%
ggplot(aes(x = campaign_code, y = n)) +
geom_col(fill = "steelblue", colour = "blue") +
theme_minimal() +
labs(title = 'Number of subscriptions by acquisition campaign',
caption = '',
x = 'Campaign Code',
y = 'Number of Subscribers') +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8))Dropping unknown campaign code
trial_monthly_price price should be zero for all
data_raw %>%
select(contains('trial'), monthly_price) %>%
# filter(!(trial_end_date)) %>%
arrange(desc(trial_length)) %>%
head()Amending trial_monthly_price to zero and the 6.94 monthly_price to 6.99
there’s too much noise in the cancellation reasons, which needs simplifying to get a clear read
data_raw %>%
group_by(status, canc_reason) %>%
count() %>%
ungroup() %>%
arrange(status, n) %>%
janitor::adorn_totals()Grouping up some of the reason for cancelling and dealing with the NAs
data_clean <-
data_clean %>%
mutate(
canc_reason =
case_when(
status == 'Active' & canc_reason == 'Unknown' ~ '-',
# setting all NAs to zero as they're easier to deal with
is.na(canc_reason) ~ '-',
canc_reason == 'App performance' |
canc_reason == 'No compatible devices' |
canc_reason == 'Look and feel' |
canc_reason == 'Functionality' |
canc_reason == 'Download' ~ 'UX Related',
canc_reason == 'Failed Credit Card Payment' |
canc_reason == 'Failed PayPal Payment' |
canc_reason == 'Failed Direct Debit Payment' ~ 'Failed Payment',
canc_reason == 'Political' |
canc_reason == 'Editorial' |
canc_reason == 'Lack of content' ~ 'Editorial',
canc_reason == 'Competitor' |
canc_reason == 'Apple news' ~ 'Competitor',
canc_reason == 'Product switch' |
canc_reason == 'Amendment' |
canc_reason == 'Duplicate subscription' ~ 'Other',
canc_reason == 'Not known' |
canc_reason == 'Unknown' ~ '-',
TRUE ~ canc_reason)
) Majority of subscriptions go for 1M (one month) trial. I’ve triangulated dates with trial length and they all check out. NAs represent subscribers already enrolled so no trial for them
data_clean %>%
filter(country == 'United Kingdom') %>%
group_by(
monthly_price,
contract_monthly_price
) %>%
count() %>%
ungroup() %>%
arrange(desc(n)) %>%
janitor::adorn_totals()Where monthly_price is zero they’ve all cancelled. I believe it coincides with old customers who cancelled on Feb-18
data_clean %>%
filter(trial_length == '1M') %>% # 4233 are 1M subscriptions or 90% of total
group_by(
product_group,
trial_monthly_price, # all zero
monthly_price,
contract_monthly_price) %>%
count() %>%
ungroup() %>%
janitor::adorn_totals()A handful of Unknown are to be found in payment_method
Dropping unknown
A final look at the cleansed data: I’m happy with everything here!
| Name | Piped data |
| Number of rows | 4749 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 3 |
| POSIXct | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| account_id | 0 | 1.00 | 9 | 9 | 0 | 4749 | 0 |
| country | 1 | 1.00 | 4 | 20 | 0 | 77 | 0 |
| status | 0 | 1.00 | 6 | 9 | 0 | 2 | 0 |
| product_group | 0 | 1.00 | 7 | 8 | 0 | 2 | 0 |
| payment_frequency | 0 | 1.00 | 6 | 7 | 0 | 2 | 0 |
| campaign_code | 0 | 1.00 | 5 | 5 | 0 | 70 | 0 |
| canc_reason | 0 | 1.00 | 1 | 14 | 0 | 8 | 0 |
| trial_length | 413 | 0.91 | 2 | 2 | 0 | 4 | 0 |
| payment_method | 0 | 1.00 | 6 | 12 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| monthly_price | 0 | 1.00 | 7.34 | 3.69 | 0 | 6.99 | 6.99 | 6.99 | 26 | ▁▇▁▁▁ |
| contract_monthly_price | 0 | 1.00 | 8.17 | 3.12 | 0 | 6.99 | 6.99 | 6.99 | 26 | ▁▇▁▁▁ |
| trial_monthly_price | 413 | 0.91 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0.00 | 0 | ▁▁▇▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| created_date | 0 | 1.00 | 2008-11-20 | 2018-02-28 | 2017-12-25 | 958 |
| start_date | 0 | 1.00 | 2018-02-01 | 2018-02-28 | 2018-02-13 | 28 |
| end_date | 0 | 1.00 | 2018-03-04 | 2019-02-28 | 2019-02-13 | 29 |
| canc_date | 3511 | 0.26 | 2017-10-31 | 2019-02-19 | 2018-04-04 | 153 |
The full R code and all relevant files can be found on my GitHub profile @ K Medoid Clustering
For the article that inspired my foray into K-medois clustering see this excellent TDS post by Thomas Filaire: Clustering on mixed type data
For a tidy and fully-featured approach to counting things, see the tabyls Function Vignette
Table 1 – Variable Definitions
| Attribute | Description |
|---|---|
| Account ID | Unique account ID |
| Created Date | Date of original account creation |
| Country | Country of account holder |
| Status | Current status - active/inactive |
| Product Group | Product type |
| Payment Frequency | Most subscriptions are a 1 year contract term - payable annually or monthly |
| Campaign Code | Unique identifier for campaign attributed to acquisition |
| Start Date | Start date of the trial |
| End Date | Scheduled end of term |
| Cancellation Date | Date of instruction to cancel |
| Cancellation Reason | Reason given for cancellation |
| Monthly Price | Current monthly price of subscription |
| Contract Monthly Price | Price after promo period |
| Trial Length | Length of trial |
| Trial Monthly Price | Price during trial |
| Payment Method | Payment method |