Segmenting with Mixed Type Data

library(tidyverse)
library(readxl) 
library(skimr)
library(knitr)
library(janitor)

With the new year, I started to look for new employment opportunities and even managed to land a handful of final stage interviews before it all grounded to a halt following the coronavirus pandemic. Invariably, as part of the selection process I was asked to analyse a set of data and compile a number of data driven-recommendations to present in my final meeting.

In this post I retrace the steps I took for one of the take home analysis I was tasked with and revisit clustering, one of my favourite analytic methods. Only this time the set up is a lot closer to a real-world situation in that the data I had to analyse came with a mix of categorical and numerical feature. Simply put, this could not be tackled with a bog-standard K-means algorithm as it’s based on pairwise Euclidean distances and has no direct application to categorical data.

The data

The data represents all online acquisitions in February 2018 and their subcripion status 3 mohnths later (9th June 2018) for a Fictional News Aggregator Subscription business. It contains a number of parameters describing each account, like account creation date, campaign attributed to acquisition, payment method and length of product trial. A full description of the variables can be found in the Appendix.

I got hold of this dataset in the course of a recruitment selection process as I was asked to carry out an analysis and present results and recommendations that could helps improve their product take up in my final meeting.

Although fictitius in nature, I thought it best to further anonimise the dataset by changing names and values of most of the variables as well as removing several features that were of no use to the analysis. You can find the dataset on my GitHub profile, along with the scripts I used to carry out the analysis.

In this project I’m also testing quite a few of the adorn_ and the tabyl functions from the janitor library, a family of functions created to help expedite the initial data exploration and cleaning but also very useful to create and format summary tables.

Data loading, inspection and manupulation

As this is raw data, it needed some cleansing and rearranging. Let’s start with loading up the data and make some adjustments to the variable names.

data_raw <-
   readxl::read_xlsx(
      path = "../00_data/Subscription Data.xlsx",
      sheet = 'Data',
      trim_ws = TRUE,
      col_names = TRUE,
      guess_max = 2000
   ) %>% 
   # all headers to lower case
   set_names(names(.) %>% str_to_lower) %>% 
   # shortening some names
   rename_at(vars(contains("cancellation")),
             funs(str_replace_all(., "cancellation", "canc"))) %>% 
   # swapping space with underscore in some names
   rename_at(vars(contains(" ")),
             funs(str_replace_all(., "[ ]", "_")))

A first glance at the data structure and it all looks in good order: all variables are in the format I would expect them to be.

data_raw %>% glimpse()

## Observations: 4,853
## Variables: 16
## $ account_id             <chr> "ID0026621", "ID0033642", "ID0036592", "ID00...
## $ created_date           <dttm> 2008-11-20, 2008-11-20, 2008-11-20, 2008-11...
## $ country                <chr> "United Kingdom", "United Kingdom", "United ...
## $ status                 <chr> "Cancelled", "Cancelled", "Active", "Cancell...
## $ product_group          <chr> "Premium", "Standard", "Premium", "Standard"...
## $ payment_frequency      <chr> "Monthly", "Monthly", "Annual", "Monthly", "...
## $ campaign_code          <chr> "55372", "57472", "38072", "56472", "56572",...
## $ start_date             <dttm> 2018-02-01, 2018-02-12, 2018-02-05, 2018-02...
## $ end_date               <dttm> 2019-02-01, 2019-02-12, 2019-02-05, 2019-02...
## $ canc_date              <dttm> 2018-02-01, 2018-03-12, NA, 2018-02-12, NA,...
## $ canc_reason            <chr> "Amendment", "Lack of time", NA, "Failed Dir...
## $ monthly_price          <dbl> 15.00, 6.99, 16.67, 0.00, 3.75, 6.99, 11.67,...
## $ contract_monthly_price <dbl> 15.00, 6.99, 16.67, 6.99, 3.75, 6.99, 11.67,...
## $ trial_length           <chr> NA, NA, NA, "1M", NA, "1M", NA, "1M", "1M", ...
## $ trial_monthly_price    <dbl> NA, NA, NA, 0, NA, 0, NA, 0, 0, NA, NA, NA, ...
## $ payment_method         <chr> "Credit Card", "Credit Card", "Direct Debit"...

data_raw %>% skim()

Data summary
Name	Piped data
Number of rows	4853
Number of columns	16
_______________________
Column type frequency:
character	9
numeric	3
POSIXct	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
account_id	0	1.00	9	9	4824
country	7	1.00	4	20	77
status	0	1.00	6	20	4
product_group	0	1.00	7	8	2
payment_frequency	0	1.00	6	10	3
campaign_code	0	1.00	5	7	73
canc_reason	3547	0.27	5	27	20
trial_length	473	0.90	2	2	4
payment_method	0	1.00	6	12	4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p25	p50	p75	p100	hist
monthly_price	0	1.0	7.34	3.79	6.99	6.99	6.99	26.00	▁▇▁▁▁
contract_monthly_price	0	1.0	8.20	3.21	6.99	6.99	6.99	26.00	▁▇▁▁▁
trial_monthly_price	474	0.9	0.00	0.10	0.00	0.00	0.00	6.94	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
created_date	0	1.00	2008-11-20	2018-02-28	2017-12-20	971
start_date	0	1.00	2018-02-01	2018-02-28	2018-02-14	28
end_date	0	1.00	2018-03-01	2019-02-28	2019-02-13	47
canc_date	3549	0.27	2017-10-31	2019-02-21	2018-04-01	156

Individual variables exploration

account_id duplicates

There are 29 (58/2) duplicate account_id

data_raw %>% 
   # selecting custoomer IDs that appear more than once
   group_by(account_id) %>% 
   count() %>% 
   filter(n > 1) %>% 
   ungroup() %>% 
   janitor::adorn_totals()

It looks like the large majority of duplicate customer ID are linked to an Amendment on their account.

data_raw %>% 
   group_by(account_id) %>% 
   count() %>% 
   filter(n > 1) %>% 
   ungroup() %>% 
   # appending all data back in
   left_join(data_raw) %>% 
   # select some columns for a closer look
   select(account_id, canc_reason) %>% 
   arrange(canc_reason)

Given the very small number of duplicates compared to the total, I’m simply removing the duplicates.

data_clean <- 
   data_raw %>% 
   group_by(account_id) %>% 
   count() %>% 
   filter(n == 1) %>% 
   ungroup() %>% 
   # appending all data back in
   left_join(data_raw) %>% 
   select(-n)

country

Overwhelming majority of subscriptions are UK based

data_raw %>% 
   group_by(country) %>% 
   count() %>% 
   ungroup() %>% 
   arrange(desc(n)) %>% 
   ungroup() %>% 
   mutate(country = country %>% as_factor()) %>% 
   filter(n > 7) %>%
   
   ggplot(aes(x = country, y = n)) +
   geom_col(fill = "#E69F00", colour = "red") +
   theme_minimal() +
   labs(title     = 'Number of subscriptions by acquisition country',
         caption  = '',
         x        = 'Country of Residence',
         y        = 'Number of Subscribers') +
   theme(plot.title = element_text(hjust = 0.5),
         axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

status

Majority of accounts are either Active or Cancelled

data_raw %>% 
   group_by(status) %>% 
   count() %>% 
   ungroup()

I’m rolling Lapsed and Pending Cancellation into Cancelled

data_clean <- 
   data_clean %>%
   mutate(status = 
             case_when(status == 'Lapsed' ~ 'Cancelled',
                       status == 'Pending Cancellation' ~ 'Cancelled',
                       TRUE ~ status)
   )

Product group

All looks OK here

# no NAs
data_raw %>% 
   group_by(product_group) %>% 
   count() %>% 
   ungroup() %>% 
   janitor::adorn_totals()

Payment Frequency

Under payment frequency there are only a handful of Fixed Term

# no NAs
data_raw %>% 
   group_by(payment_frequency) %>% 
   count() %>% 
   ungroup() %>% 
   janitor::adorn_totals()

I’m dropping Fixed Term

data_clean <- 
   data_clean %>% 
   filter(payment_frequency != 'Fixed Term')

Campaign

Top three campaigns account for over 72% of total acquisitions in February 2018

# no NAs
data_raw %>% 
   group_by(campaign_code) %>% 
   count() %>% 
   ungroup() %>% 
   arrange(desc(n)) %>% 
   ungroup() %>% 
   mutate(campaign_code = campaign_code %>% as_factor()) %>% 
   filter(n > 3) %>% 
   
   ggplot(aes(x = campaign_code, y = n)) +
   geom_col(fill = "steelblue", colour = "blue") +
   theme_minimal() +
   labs(title     = 'Number of subscriptions by acquisition campaign',
         caption  = '',
         x        = 'Campaign Code',
         y        = 'Number of Subscribers') +
   theme(plot.title = element_text(hjust = 0.5),
         axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Dropping unknown campaign code

data_clean <- 
   data_clean %>% 
   filter(campaign_code != 'Unknown')

Trial end date

trial_monthly_price price should be zero for all

data_raw %>% 
   select(contains('trial'), monthly_price) %>% 
   # filter(!(trial_end_date)) %>% 
   arrange(desc(trial_length)) %>% 
   head()

Amending trial_monthly_price to zero and the 6.94 monthly_price to 6.99

data_clean <- 
   data_clean %>% 
   mutate(trial_monthly_price = 
             case_when(trial_monthly_price == 6.94 ~ 0,
                       TRUE ~ trial_monthly_price),
          monthly_price = 
             case_when(monthly_price == 6.94 ~ 6.99,
                       TRUE ~ monthly_price)
          )

Cancellation date / reason

there’s too much noise in the cancellation reasons, which needs simplifying to get a clear read

data_raw %>% 
   group_by(status, canc_reason) %>% 
   count() %>% 
   ungroup() %>% 
   arrange(status, n) %>% 
   janitor::adorn_totals()

Grouping up some of the reason for cancelling and dealing with the NAs

data_clean <- 
   data_clean %>% 
      mutate(
         canc_reason = 
          case_when(
            status == 'Active' & canc_reason == 'Unknown' ~ '-',
            
            # setting all NAs to zero as they're easier to deal with
            is.na(canc_reason)  ~ '-',
            
            canc_reason == 'App performance' | 
            canc_reason == 'No compatible devices' |
            canc_reason == 'Look and feel'   | 
            canc_reason == 'Functionality' |
            canc_reason == 'Download'                      ~ 'UX Related',
            
            canc_reason == 'Failed Credit Card Payment' | 
            canc_reason == 'Failed PayPal Payment' |
            canc_reason == 'Failed Direct Debit Payment'   ~ 'Failed Payment',
            
            canc_reason == 'Political' |
            canc_reason == 'Editorial' |
            canc_reason == 'Lack of content'               ~ 'Editorial',
            
            canc_reason == 'Competitor' |
            canc_reason == 'Apple news'                    ~ 'Competitor',
            
            canc_reason == 'Product switch' |
            canc_reason == 'Amendment' |                  
            canc_reason == 'Duplicate subscription'        ~ 'Other',
            
            canc_reason == 'Not known' | 
            canc_reason == 'Unknown'                       ~ '-',
            
            TRUE                                           ~ canc_reason)
      )

trial length

Majority of subscriptions go for 1M (one month) trial. I’ve triangulated dates with trial length and they all check out. NAs represent subscribers already enrolled so no trial for them

data_raw %>% 
   group_by(trial_length) %>% 
   count() %>% 
   ungroup() %>% 
   janitor::adorn_totals()

price

data_clean %>% 
   filter(country == 'United Kingdom') %>% 
   group_by( 
      monthly_price,
      contract_monthly_price
      ) %>% 
   count() %>% 
   ungroup() %>% 
   arrange(desc(n)) %>% 
   janitor::adorn_totals()

Where monthly_price is zero they’ve all cancelled. I believe it coincides with old customers who cancelled on Feb-18

data_clean %>%
   filter(trial_length == '1M') %>% # 4233 are 1M subscriptions or 90% of total
   group_by( 
            product_group,
            trial_monthly_price, # all zero
            monthly_price, 
            contract_monthly_price) %>% 
   count() %>% 
   ungroup() %>% 
   janitor::adorn_totals()

payment method

A handful of Unknown are to be found in payment_method

data_raw %>% 
   group_by(payment_method) %>% 
   count() %>% 
   ungroup() %>% 
   adorn_totals()

Dropping unknown

data_clean <- 
   data_clean %>% 
   filter(payment_method != 'Unknown')

A final look at the cleansed data: I’m happy with everything here!

data_clean %>% 
   skim()

Data summary
Name	Piped data
Number of rows	4749
Number of columns	16
_______________________
Column type frequency:
character	9
numeric	3
POSIXct	4
________________________
Group variables	None