Cohort Analysis with R

Introduction

In this post I perform a behavioral analytics technique called Cohort Analysis to get an idea of monthly retention and churn rates for cohorts based on acquisition date. This technique provides a way of understanding customer trends, which aids an organization to better target its audience, and make better business decisions.

Before the Cohort Analysis can be executed, I must perform a good deal of data wrangling. The data wrangling sections will be labeled “preprocessing”. Feel free to skip those sections if you wish.

For this analysis, I’ll be utilizing an anonymized transactional data set called ‘Online Retail Data Set’ from UCI’s Machine Learning Repository. The data set was donated on 2015-11-06. You can access data here: http://archive.ics.uci.edu/ml/datasets/online+retail

A little bit about the data set from UCI itself:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Preprocessing: Features

The data set has been loaded in the R session from my working directory and cleverly named “online.retail”. Below we get a look at the initial 30 of 541,909 observations in 10 observation chunks, and the 8 variables.

# We're only going to show the head of this data because it
# is too large for RMarkdown to process visually.

DT::datatable(head(online.retail,30),
              rownames = FALSE,
              options = list(
                pageLength = 10,
                lengthMenu = c(10, 20, 30)),
                caption = htmltools::tags$caption(
                  style = 'caption-side:bottom; text-align: right;',
                  'Source:', htmltools::em('UCI Machine Learning Repository')))

## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html

Preprocessing: Data Types

At this point it’s best to check the variables’ data types and see if any data cleaning is necessary.

As shown below, the CustomerID variable is an integer, which is fine, but the InvoiceDate variable is a factor. This isn’t fine. Later I’ll convert this factored object into a date object.

str(online.retail)

## 'data.frame':    541909 obs. of  8 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ StockCode  : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
##  $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 4027 4035 932 1959 2980 3235 1573 1698 1695 259 ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: Factor w/ 23260 levels "1/10/2011 10:04",..: 6839 6839 6839 6839 6839 6839 6839 6840 6840 6841 ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...

Preprocessing: Duplicates

Checking for duplicate observations and removing them.

cat("The number of non-duplicate observations within the data set is",
    nrow(unique(online.retail)), "out of", "\n",
    nrow(online.retail),
    "indicating that there are",
    nrow(online.retail)-nrow(unique(online.retail)),
    "duplicates within the data set.","\n",
    "online.retail2 is our new, duplicate observation free data frame.")

## The number of non-duplicate observations within the data set is 536641 out of 
##  541909 indicating that there are 5268 duplicates within the data set. 
##  online.retail2 is our new, duplicate observation free data frame.

# Storing the index position of the
# observations with duplicated values
dupes <- which(duplicated(online.retail))

# Subsetting out the duplicated values
# using their index positions
online.retail2 <- online.retail[-dupes,]

rm(online.retail,dupes)

Preprocessing: NAs

Checking for observations with NA and/or missing values and removing them.

cat("There are",
    nrow(online.retail2[!complete.cases(online.retail2),]),
    "out of",
    nrow(online.retail2),
    "total observations that feature at least one NA value.","\n",
    "The CustomerId column alone features all of the",
    nrow(online.retail2[!complete.cases(online.retail2$CustomerID),]),
    "of the total NA values. These", "\n", "observations will be excluded from the new data frame, online.retail3,
    giving us a new total of",
    nrow(online.retail2)-nrow(online.retail2[!complete.cases(online.retail2),]),
    "observations to work with.")

## There are 135037 out of 536641 total observations that feature at least one NA value. 
##  The CustomerId column alone features all of the 135037 of the total NA values. These 
##  observations will be excluded from the new data frame, online.retail3,
##     giving us a new total of 401604 observations to work with.

online.retail3 <- online.retail2[complete.cases(online.retail2),]

rm(online.retail2)

Preprocessing: Dates

Converting the variable InvoiceDate into a date object, and dumping the time element from said date object.

# Converting the InvoiceDate column into a date object
online.retail3$InvoiceDate <- as.Date(online.retail3$InvoiceDate, format = "%m/%d/%Y")

# Setting up a column for the year data to make it easier to focus on 2011
online.retail3$Year <- as.numeric(format(online.retail3$InvoiceDate, '%Y'))

# Seperating the 2010 observations from the 2011
# observations to focus strictly on the 2011 cohorts
cohorts2011 <- online.retail3[online.retail3$Year==2011,]

# Dumping the unneeded variables
cohorts2011 <- cohorts2011[,c("CustomerID","InvoiceDate","Year")]


str(cohorts2011)

## 'data.frame':    375250 obs. of  3 variables:
##  $ CustomerID : int  13313 13313 13313 13313 13313 13313 13313 13313 13313 13313 ...
##  $ InvoiceDate: Date, format: "2011-01-04" "2011-01-04" ...
##  $ Year       : num  2011 2011 2011 2011 2011 ...

Creating the Cohorts

Here is where I

Get the initial InvoiceDate for each customer. This is the customer’s join date
Inner join both customer purchase frequencies and join dates to the Cohort2011 data frame
Label each customer via the month of their join date. This label will act as the customer’s cohort label.
Lastly, I produce an audit table to confirm that the join dates are in fact the customers’ initial invoice date. Enabling filtering for the table is pivotal for this.

Info for those new to analytics

For those asking themselves, “What’s a cohort?”: A cohort is a group of people who have a common characteristic during a period of time. In this case, its the date a customer became a customer. Cohorts can also be segmented via marketing channel.

In this example, our cohorts will be grouped by a monthly interval. The interval can be weekly, monthly, quarterly, annually, biannually, etc. It all depends on the nature of the business cycle.

SaaS (software as a service) brands like Salesforce charge on an annual basis, so an annual Cohort would be best for them. Netflix on the other hand operates with monthly contracts, therefore deeming a monthly Cohort optimal. One’s domain knowledge must be sharper for non-subscription-based services. Typically, weekly, monthly, or quarterly intervals would do fine in those scenarios.

# Getting the purchase frequencies per CustomerID
#order.frequency <- as.data.frame(table(cohorts2011$CustomerID)) #FREQ NOT NEEDED?
#colnames(order.frequency)[1] <- "Cust.Id" #FREQ NOT NEEDED?

# Getting the first transaction dates for each customer
join.date <- aggregate(InvoiceDate~CustomerID,cohorts2011,min, na.rm = TRUE)

# Changing the name of the column InvoiceDate to Join_Date
# since this is the first transaction date for each customer
colnames(join.date)[2] <- "Join_Date"

# Merge the Join date data to the cohort2011 data frame
cohorts2011 <- merge(cohorts2011, join.date, by.x = "CustomerID",by.y = "CustomerID", all.x = TRUE)

# Merge the frequency data to the cohort2011 data frame
#cohorts2011 <- merge(cohorts2011, order.frequency, by.x = "CustomerID",by.y = "Cust.Id", all.x = TRUE) #FREQ NOT NEEDED?

# Creating the groups/Cohorts based on the join date month
cohorts2011$Cohort <- as.numeric(format(cohorts2011$Join_Date, "%m"))

rm(join.date)


# We're only going to show the head of this data because it
# is too large for RMarkdown to process visually without freezing.

# This visualization is not necessary. We're using this
# interactive table to confirm that the join data is correct. 

DT::datatable(head(cohorts2011,500),
              filter = 'top',
              rownames = FALSE,
              options = list(
                pageLength = 10,
                pageLength = c(10,20,30,40,50)))

Cohort Age

Here I calculate the number of days a customer has been a customer with the company. Based on this information, I calculate the amount of months said customer has been active. Active as in making purchases.

# Calculating the difference in days between the invoice date column by join date column
# There is no option for month, but getting the month from the days is simple division
cohorts2011$Age_by_Day <- as.numeric(difftime(cohorts2011$InvoiceDate,cohorts2011$Join_Date,units = c("days")))

# Dividing the days by 30 to get the number of months
cohorts2011$Age_by_Month <- floor(cohorts2011$Age_by_Day/30)

# Dumping the day element from the join date column
cohorts2011$Join_Date <- format(cohorts2011$Join_Date, "%Y-%m")

# Now we remove the day element from the InvoiceDate data since
# this Cohort Analysis is based on monthly activity.
cohorts2011$InvoiceDate <- format(cohorts2011$InvoiceDate, "%Y-%m")
# Its important that we remove the day data so that we can remove
# extra observations in the months where customers have multiple
# observations in a single month due to having multiple orders in
# a single month. We'll use the function duplicated for this later


# We relabel the cohort column data to something more intuitive for the sake
# of the report consumers, then factor them since these are sequential
groups <- c("Jan Cohorts",
             "Feb Cohorts",
             "Mar Cohorts",
             "Apr Cohorts",
             "May Cohorts",
             "Jun Cohorts",
             "Jul Cohorts",
             "Aug Cohorts",
             "Sep Cohorts",
             "Oct Cohorts",
             "Nov Cohorts",
             "Dec Cohorts")

for(i in 1:12){
  cohorts2011[cohorts2011$Cohort==i,"Cohort"] <- groups[i]
}
rm(i,groups)

cohorts2011$Cohort <- factor(cohorts2011$Cohort,ordered = T,levels =c("Jan Cohorts",
                                                                      "Feb Cohorts",
                                                                      "Mar Cohorts",
                                                                      "Apr Cohorts",
                                                                      "May Cohorts",
                                                                      "Jun Cohorts",
                                                                      "Jul Cohorts",
                                                                      "Aug Cohorts",
                                                                      "Sep Cohorts",
                                                                      "Oct Cohorts",
                                                                      "Nov Cohorts",
                                                                      "Dec Cohorts"))


# We're only going to show the head of this data because it
# is too large for RMarkdown to process visually without freezing

DT::datatable(head(cohorts2011,500),
              filter = 'top',
              rownames = FALSE,
              options = list(
                pageLength = 10,
                pageLength = c(10,20,30,40,50)))

MAU Mixpanel

Reminder

All cohorts are of the year 2011
Cohort membership is based on the month of customer acquisition

Each of the twelve rows represent a cohort, and its columns, 1 through 11, represent the months post acquisition. Column 0 shows the count of customer acquisition per month. The intersections of row and column feature the count of active cohort members.

# By excluding both columns Age_by_Day and Age_by_Month
# we're able to remove the extra monthly observations to
# avoid counting unique customer IDs multiple times in 
# any single month.

# The day and month Age variables keep us from removing
# duplicates which is why we need to exclude them both
dupes <- which(duplicated(cohorts2011[,c(-5,-6)]))

# Removing the duplicate observations
cohorts2011 <- cohorts2011[-dupes,]

# Dropping to the dupes vector
# for memory efficiency
rm(dupes)


# Creating rows for each cohort group
# Creating columns for each value in the Age_by_Month column;0-11
# The default aggregation setup for dcast is, fun.aggregate = length
cohorts.wide <- reshape2::dcast(cohorts2011,Cohort~Age_by_Month,
                                value.var="CustomerID",
                                fun.aggregate = length)

## Warning in split_indices(.group, .n): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored

# Cloning the output for retention and churn mixpanels
# to be used later
cw.retention <- cohorts.wide
cw.churn <- cohorts.wide

# Creating 19 breaks and 20 rgb color values ranging from blue to white
breaks <- quantile(cohorts.wide[,3:13], probs = seq(.05, .95, .05), na.rm = TRUE)
colors <- sapply(round(seq(155, 80, length.out = length(breaks) + 1), 0),
                 function(x){ rgb(x,x,155, maxColorValue = 155) } )


# The Retention Mixpanel with counts
DT::datatable(cohorts.wide,
              class = 'cell-border stripe',
             rownames = FALSE,
             options = list(
               ordering=F,
               dom = 't',
               pageLength = 12) ) %>%
             formatStyle("0",
                         backgroundColor = 'lightgrey',
                         fontWeight = 'bold') %>%
  formatStyle(names(cohorts.wide[c(-1,-2)]),fontWeight = 'bold',color = 'white', backgroundColor = styleInterval(breaks,colors))

Retention Rate Mixpanel

The retention rate mixpanel below features the same data from the previously shown MAU mixpanel. The difference here is the intersections between rows and columns now feature the retention rates for each cohort group. Column 0 gives us the count of acquired customers for each month.

We can see that the January cohort features significantly higher retention rates than the other cohorts. By looking upwards, diagonally, from each acquisition month, we’re able to keep track of any specific month other than January. With this, we can see that the month of October features a higher rate of retention than other months for most of the cohorts.

The horizontal panel below the retention rate mixpanel features the average retention rate for each monthly age of all cohorts. The average count of acquired customers per month is 402.5. Generally, the average retention rate increases as the cohort age increases, but mostly hovers between 22% and 26%.

# Calculating the percentages. month number/join month number
# DT will handle the *100 and % formating.
# The sequence needs to be reversed because if the first
# column is worked on, everything else will be divided by 1.
# Instead of formatting column 0 to show 100% for each row, it seems
# more useful to leave this as the original count, showing how
# many new customers were acquired in its respective month. This
# is why the for loop ends right before column 0.
for (i in rev(3:ncol(cw.retention))){
  cw.retention[,i] <- round(cw.retention[,i]/cw.retention[,2],4)
}
rm(i)

# Cloning the retention mixpanel
retention.avgs <- cw.retention

# When calculating the column averages, 0 won't get ignored,
# which is a problem. Converting these 0 to NAs solves this issue.
retention.avgs[retention.avgs == 0.0000] <- NA
avgs.ret <- round(apply(retention.avgs[,-1],2,mean, na.rm=TRUE),4)

# We use the zero because this is a numerical vector
# Changing it after the merge can't happen due to the
# factoring of the Cohort labels
avgs.ret <- c(0,avgs.ret)

# Adding the averages row to the retention mixpanel
cw.retention <- rbind(cw.retention,avgs.ret)

## Warning in `[<-.factor`(`*tmp*`, ri, value = 0): invalid factor level, NA
## generated

# Creating 19 breaks and 20 rgb color values ranging from blue to white
breaks <- quantile(cw.retention[,3:13], probs = seq(.05, .95, .05), na.rm = TRUE)
colors <- sapply(round(seq(155, 80, length.out = length(breaks) + 1), 0),
                 function(x){ rgb(x,x,155, maxColorValue = 155) } )


# The retention rate mixpanel
DT::datatable(cw.retention,
              class = 'cell-border stripe',
             rownames = FALSE,
             options = list(
               ordering=F,
               dom = 't',
               pageLength = 13) ) %>%
             formatStyle("0",
                         backgroundColor = 'lightgrey',
                         fontWeight = 'bold') %>%
  formatPercentage(c(3:13),2) %>% # We don't want column 0 in %
  formatStyle("1", fontWeight = 'bold') %>%
  formatStyle(names(cw.retention[c(-1,-2)]),color = 'white',fontWeight = 'bold', backgroundColor = styleInterval(breaks,colors))

Churn Rate Mixpanel

The churn rate mixpanel below features the same data from the previously shown MAU Mixpanel, two sections up. The difference here is the intersections between rows and columns now feature the churn rates of cohorts. Column 0 gives us the count of acquired customers for each month.

As stated in the retention rate mixpanel section, the January cohort features higher than normal retention rates. This looks to be a seasonal effect of the holidays. With this mixpanel, we can see a clear pattern just by looking upwards, diagonally from the December cohort’s acquisition month. As you can see, the month of December has a churn rate that ranges mostly in the high 90% for each cohort. The months of October and November feature the lowest churn rates, further confirming my hypothesis of the seasonal effect towards the holidays, which causes lower than normal churn rates for the closest months post December.

The horizontal panel below the churn rate mixpanel features the average churn rate for each monthly age of all cohorts. The average count of acquired customers per month is 402.5. Generally, the average churn rate decreases as the age increases, but mostly hovers between 77% and 73%.

# Calculating the percentages. month number/join month number
# DT will handle the *100 and % formating.
# The sequence needs to be reversed because if the first
# column is worked on, everything else will be divided by 1.
# Instead of formatting column 0 to show 100% for each row, it seems
# more useful to leave this as the original count, showing how
# many new customers were acquired in its respective month. This
# is why the for loop ends right before column 0.
for (i in rev(3:ncol(cw.churn))){
  
  #Calculates the retention rate
  cw.churn[,i] <- round(cw.churn[,i]/cw.churn[,2],4)
  
  # Turns the retention rate into the churn rate. The ifelse
  # part is to avoid doing any calculations to the zeros.
  cw.churn[,i] <- ifelse(cw.churn[,i] !=0, 1.0-cw.churn[,i], 0+cw.churn[,i])
}
rm(i)

# Cloning the churn mixpanel
churn.avgs <- cw.churn

# When calculating the column averages, 0 gets included in the calculations,
# This is a problem. Converting these 0 to NAs solves this issue thanks to na.rm.
churn.avgs[churn.avgs == 0.0000] <- NA
avgs.chu <- round(apply(churn.avgs[,-1],2,mean, na.rm=TRUE),4)

# We use the zero because this is a numerical vector
# Changing it after the merge can't happen due to the
# factoring of the Cohort labels
avgs.chu <- c(0,avgs.chu)

# Adding the averages row to the retention mixpanel
cw.churn <- rbind(cw.churn,avgs.chu)

## Warning in `[<-.factor`(`*tmp*`, ri, value = 0): invalid factor level, NA
## generated

# Creating 19 breaks and 20 rgb color values ranging from red to white
breaks2 <- quantile(cw.churn[,3:13], probs = seq(.05, .95, .05), na.rm = TRUE)
colors2 <- sapply(round(seq(255, 40, length.out = length(breaks2) + 1), 0),
                 function(x){ rgb(255,x,x, maxColorValue = 255) } )


# The churn rate mixpanel
DT::datatable(cw.churn,
              class = 'cell-border stripe',
             rownames = FALSE,
             options = list(
               ordering=F,
               dom = 't',
               pageLength = 13) ) %>%
             formatStyle("0",
                         backgroundColor = 'lightgrey',
                         fontWeight = 'bold') %>%
  formatPercentage(c(3:13),2) %>% # We don't want column 0 in %
  formatStyle("1", fontWeight = 'bold') %>%
  formatStyle(names(cw.churn[c(-1,-2)]),color = 'white',fontWeight = 'bold', backgroundColor = styleInterval(breaks2,colors2))