With the new year, I started to look for new employment opportunities and even managed to land a handful of final stage interviews before it all grounded to a halt following the coronavirus pandemic. Invariably, as part of the selection process I was asked to analyse a set of data and compile a number of data driven-recommendations to present in my final meeting.

In this post I retrace the steps I took for one of the take home analysis I was tasked with and revisit clustering, one of my favourite analytic methods. Only this time the set up is a lot closer to a real-world situation in that the data I had to analyse came with a mix of categorical and numerical feature. Simply put, this could not be tackled with a bog-standard K-means algorithm as it’s based on pairwise Euclidean distances and has no direct application to categorical data.

The data

The data represents all online acquisitions in February 2018 and their subcripion status 3 mohnths later (9th June 2018) for a Fictional News Aggregator Subscription business. It contains a number of parameters describing each account, like account creation date, campaign attributed to acquisition, payment method and length of product trial. A full description of the variables can be found in the Appendix.

I got hold of this dataset in the course of a recruitment selection process as I was asked to carry out an analysis and present results and recommendations that could helps improve their product take up in my final meeting.

Although fictitius in nature, I thought it best to further anonimise the dataset by changing names and values of most of the variables as well as removing several features that were of no use to the analysis. You can find the dataset on my GitHub profile, along with the scripts I used to carry out the analysis.

In this project I’m also testing quite a few of the adorn_ and the tabyl functions from the janitor library, a family of functions created to help expedite the initial data exploration and cleaning but also very useful to create and format summary tables.

Data loading, inspection and manupulation

As this is raw data, it needed some cleansing and rearranging. Let’s start with loading up the data and make some adjustments to the variable names.

A first glance at the data structure and it all looks in good order: all variables are in the format I would expect them to be.

## Observations: 4,853
## Variables: 16
## $ account_id             <chr> "ID0026621", "ID0033642", "ID0036592", "ID00...
## $ created_date           <dttm> 2008-11-20, 2008-11-20, 2008-11-20, 2008-11...
## $ country                <chr> "United Kingdom", "United Kingdom", "United ...
## $ status                 <chr> "Cancelled", "Cancelled", "Active", "Cancell...
## $ product_group          <chr> "Premium", "Standard", "Premium", "Standard"...
## $ payment_frequency      <chr> "Monthly", "Monthly", "Annual", "Monthly", "...
## $ campaign_code          <chr> "55372", "57472", "38072", "56472", "56572",...
## $ start_date             <dttm> 2018-02-01, 2018-02-12, 2018-02-05, 2018-02...
## $ end_date               <dttm> 2019-02-01, 2019-02-12, 2019-02-05, 2019-02...
## $ canc_date              <dttm> 2018-02-01, 2018-03-12, NA, 2018-02-12, NA,...
## $ canc_reason            <chr> "Amendment", "Lack of time", NA, "Failed Dir...
## $ monthly_price          <dbl> 15.00, 6.99, 16.67, 0.00, 3.75, 6.99, 11.67,...
## $ contract_monthly_price <dbl> 15.00, 6.99, 16.67, 6.99, 3.75, 6.99, 11.67,...
## $ trial_length           <chr> NA, NA, NA, "1M", NA, "1M", NA, "1M", "1M", ...
## $ trial_monthly_price    <dbl> NA, NA, NA, 0, NA, 0, NA, 0, 0, NA, NA, NA, ...
## $ payment_method         <chr> "Credit Card", "Credit Card", "Direct Debit"...
Data summary
Name Piped data
Number of rows 4853
Number of columns 16
_______________________
Column type frequency:
character 9
numeric 3
POSIXct 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
account_id 0 1.00 9 9 0 4824 0
country 7 1.00 4 20 0 77 0
status 0 1.00 6 20 0 4 0
product_group 0 1.00 7 8 0 2 0
payment_frequency 0 1.00 6 10 0 3 0
campaign_code 0 1.00 5 7 0 73 0
canc_reason 3547 0.27 5 27 0 20 0
trial_length 473 0.90 2 2 0 4 0
payment_method 0 1.00 6 12 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
monthly_price 0 1.0 7.34 3.79 0 6.99 6.99 6.99 26.00 ▁▇▁▁▁
contract_monthly_price 0 1.0 8.20 3.21 0 6.99 6.99 6.99 26.00 ▁▇▁▁▁
trial_monthly_price 474 0.9 0.00 0.10 0 0.00 0.00 0.00 6.94 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
created_date 0 1.00 2008-11-20 2018-02-28 2017-12-20 971
start_date 0 1.00 2018-02-01 2018-02-28 2018-02-14 28
end_date 0 1.00 2018-03-01 2019-02-28 2019-02-13 47
canc_date 3549 0.27 2017-10-31 2019-02-21 2018-04-01 156

Individual variables exploration

account_id duplicates

There are 29 (58/2) duplicate account_id

It looks like the large majority of duplicate customer ID are linked to an Amendment on their account.

Given the very small number of duplicates compared to the total, I’m simply removing the duplicates.

status

Majority of accounts are either Active or Cancelled

I’m rolling Lapsed and Pending Cancellation into Cancelled

Product group

All looks OK here

Payment Frequency

Under payment frequency there are only a handful of Fixed Term

I’m dropping Fixed Term

Cancellation date / reason

there’s too much noise in the cancellation reasons, which needs simplifying to get a clear read

Grouping up some of the reason for cancelling and dealing with the NAs

trial length

Majority of subscriptions go for 1M (one month) trial. I’ve triangulated dates with trial length and they all check out. NAs represent subscribers already enrolled so no trial for them

price

Where monthly_price is zero they’ve all cancelled. I believe it coincides with old customers who cancelled on Feb-18

payment method

A handful of Unknown are to be found in payment_method

Dropping unknown

A final look at the cleansed data: I’m happy with everything here!

Data summary
Name Piped data
Number of rows 4749
Number of columns 16
_______________________
Column type frequency:
character 9
numeric 3
POSIXct 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
account_id 0 1.00 9 9 0 4749 0
country 1 1.00 4 20 0 77 0
status 0 1.00 6 9 0 2 0
product_group 0 1.00 7 8 0 2 0
payment_frequency 0 1.00 6 7 0 2 0
campaign_code 0 1.00 5 5 0 70 0
canc_reason 0 1.00 1 14 0 8 0
trial_length 413 0.91 2 2 0 4 0
payment_method 0 1.00 6 12 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
monthly_price 0 1.00 7.34 3.69 0 6.99 6.99 6.99 26 ▁▇▁▁▁
contract_monthly_price 0 1.00 8.17 3.12 0 6.99 6.99 6.99 26 ▁▇▁▁▁
trial_monthly_price 413 0.91 0.00 0.00 0 0.00 0.00 0.00 0 ▁▁▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
created_date 0 1.00 2008-11-20 2018-02-28 2017-12-25 958
start_date 0 1.00 2018-02-01 2018-02-28 2018-02-13 28
end_date 0 1.00 2018-03-04 2019-02-28 2019-02-13 29
canc_date 3511 0.26 2017-10-31 2019-02-19 2018-04-04 153

Saving data_clean

Code Repository

The full R code and all relevant files can be found on my GitHub profile @ K Medoid Clustering

References

Appendix

Table 1 – Variable Definitions

Attribute Description
Account ID Unique account ID
Created Date Date of original account creation
Country Country of account holder
Status Current status - active/inactive
Product Group Product type
Payment Frequency Most subscriptions are a 1 year contract term - payable annually or monthly
Campaign Code Unique identifier for campaign attributed to acquisition
Start Date Start date of the trial
End Date Scheduled end of term
Cancellation Date Date of instruction to cancel
Cancellation Reason Reason given for cancellation
Monthly Price Current monthly price of subscription
Contract Monthly Price Price after promo period
Trial Length Length of trial
Trial Monthly Price Price during trial
Payment Method Payment method