Introduction

Business analytics, big data, and data science are very hot topics today, and for good reason. Companies are sitting on a treasure trove of data but usually lack the skills and the people to analyze and exploit it efficiently. Those companies that develop the skills and hire the right people to analyze and exploit the data will have a clear competitive advantage.

This is especially true in one domain: marketing. About 90% of the data collected by companies today relates to customer actions and marketing activities: which pages customers visit, which products they buy, in what quantities and at what price, which banners they see, which emails they open, and how effective these actions have been at influencing their behavior.

The domain of marketing analytics is huge and covers fancy topics such as text mining, social network analysis, sentiment analysis, real-time bidding, online campaign optimization, and so on. But at the heart of marketing lie a few basic questions that often remain unanswered:

  • One: Who are my customers?
  • Two: Which customers should I target and spend most of my marketing budget on?
  • Three: What’s the future value of my customers, so I can concentrate on those who will be worth the most to the company in the future?

That’s exactly what this series of articles is all about:

  • Segmentation: Understanding your customers.
  • Scoring Models: Targeting the right ones.
  • Customer Lifetime Value: Anticipating their future value.

These are the foundations of marketing analytics, and throughout this series of articles we will be using the R statistical language to answer those questions. The series is arranged as follows:

Part One: Statistical Segmentation

In this part, we will start with some exploratory data analysis, then talk about one kind of segmentation, hierarchical segmentation: What is it? How is it done? What are the segmentation variables? How do you choose the right number of segments? And how do you do all this in R?

Part Two: Managerial Segmentation

In part two, we’ll continue talking about segmentation, but this time from a managerial point of view rather than a statistical one. That is, dividing your customer database into meaningful, easy-to-manage segments that are closely relevant to the managerial goal you’re trying to achieve.

Part Three: Scoring Models

After that, in part three, we will dive into the wonderful world of predictive modeling and build scoring models: models that predict which customers will make purchases over the next 12 months, and for how much money. We will do that using a combination of linear and logistic regression models.

Part Four: Customer Lifetime Value

In the fourth and last part, we will use Markov chain models to compute customer lifetime value, which is a metric of a customer’s value to the organization over the entire history of the relationship.

Now, roll up your sleeves and let’s start part one:

Prerequisites

The following R packages are required for the analysis:

library(ggplot2)      # for plotting
library(dplyr)        # for data manipulation and transformation
library(tidyr)        # for applying tidy data principles
library(readr)        # for data import
library(lubridate)    # for date and time manipulation
library(plotly)       # for interactive visualization on the web 
library(ggdendro)     # for plotting the dendrogram
library(ggthemes)     # for adding themes to ggplot 
library(DT)           # for html tables printing 
library(tibble)       # for better data frame interaction
library(RColorBrewer) # for creating nice-looking color palettes

Data Preparation

We will be working with a dataset that identifies customers and their purchase dates and amounts. It is stored in the text file “purchases.txt”.

Our first step is to read this dataset into R and explore it. So, let’s get started:

# Load text file into local variable called 'data'
data = tbl_df(read.delim(file = 'purchases.txt', header = FALSE, sep = '\t', dec = '.'))

# Look at the first few rows of our data
head(data, 30) %>% datatable(style = "bootstrap")
# let's have a look at its structure
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51243 obs. of  3 variables:
##  $ V1: int  760 860 1200 1420 1940 1960 2620 3050 3120 3260 ...
##  $ V2: num  25 50 100 50 70 40 30 50 150 45 ...
##  $ V3: Factor w/ 1879 levels "2005-01-02","2005-01-04",..: 668 1099 93 622 1160 1319 121 251 173 798 ...

The first column of our data is the customer id, the second is the purchase amount, and the third is the purchase date. We notice that the columns are not named accordingly, and that the date column is not recognized as a date but as a factor. So we need to fix these two issues. We also want to add a new column for the year of purchase. The following code accomplishes these three tasks:

# Add column names and parse the last column as a date data structure
# Then extract year of purchase and add it as a new column
colnames(data) = c('customer_id', 'purchase_amount', 'date_of_purchase')
data$date_of_purchase <- ymd(data$date_of_purchase)
data$year_of_purchase <- year(data$date_of_purchase)

# Display the data set after transformation
head(data, 30) %>% datatable(style = "bootstrap")
# let's again look at the structure
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51243 obs. of  4 variables:
##  $ customer_id     : int  760 860 1200 1420 1940 1960 2620 3050 3120 3260 ...
##  $ purchase_amount : num  25 50 100 50 70 40 30 50 150 45 ...
##  $ date_of_purchase: Date, format: "2009-11-06" "2012-09-28" ...
##  $ year_of_purchase: num  2009 2012 2005 2009 2013 ...

So, we solved the 3 issues. Now, let’s have a look at a statistical summary of our data:

summary(data$purchase_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   30.00   62.34   60.00 4500.00

Notice that three quarters of the purchases are below $60, while the maximum purchase is $4,500. So most customers make moderate payments, while only a very few make big purchases.

3 Questions..3 Dimensions

Before building any models, we need to explore our data. The way you explore your data is to ask questions: the more questions you ask and try to answer, the better you understand your data. So, we’ll be asking three questions across three time dimensions:

  • Dimension One .. The Year: Basically, we want to know what’s happening over the years. To be specific, we’ll try to answer three questions:
    1. The Total number of Purchases Each Year
    2. The Average amount of Purchases Each Year
    3. The Total Purchase Amount Each Year
  • Dimension Two .. The Month: Here, we will ask the same three questions, but instead of asking them over the years, we want to see what’s happening across the months of the year:
    1. Total Purchase Amount per Month
    2. Average Purchase Amount per Month
    3. Total number of Purchases per Month
  • Dimension Three .. The Month-Day: Again, we’ll ask the same three questions, but instead of just the months, we want to explore our data across the joint distribution of Months and Days:
    1. Total Purchase Amount Across Months and Days
    2. Average Purchase Amount Across Months and Days
    3. Total number of Purchases Across Months and Days

Throughout our answers to these questions, we’ll be using almost exclusively two packages: dplyr and ggplot2. The following pattern is going to be repeated (a minimal sketch of the full pipeline follows this list):

  • Use group_by to group your data by dimension. So, when we’re working with years, we’ll use group_by(Year); when working with months, group_by(Month); and when working with the joint dimension, group_by(Month, Day).

  • Use summarise for the required computations, e.g. summarise(sum(purchase_amount)) for total amounts, summarise(mean(purchase_amount)) for average amounts, and summarise(n()) for the total number of purchases (frequency).

  • After that, we pass the result of the above operations to ggplot to produce the required visualizations. We will use only two types of plots, with some variations: geom_bar for bar plots and geom_tile for heatmaps.
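
To make this concrete, here is the minimal sketch of the recurring pipeline mentioned above, using the year dimension and the total amount as an example (the grouping variable, the summary function, and the plot settings are what will change from plot to plot):

# the recurring dplyr + ggplot2 pattern, sketched on the Year dimension
data %>%
  # 1. group by the dimension of interest (later: Month, or Month and Day)
  group_by(Year = factor(year_of_purchase)) %>%
  # 2. compute the required summary: sum(), mean(), or n()
  summarise(TotalAmount = sum(purchase_amount)) %>%
  # 3. pass the summarised data frame to ggplot and draw the bars
  ggplot(aes(x = Year, y = TotalAmount)) +
  geom_bar(stat = "identity")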

If you’re a beginner in R, or need a refresher, feel free to check these links:

ggplot2 tutorial from the creator of ggplot2, Hadley Wickham.

dplyr tutorial from Data School.

Now, you’re ready to continue:

The Year Dimension

Let’s get started by answering our three questions in the first dimension:

Total No. of Purchases per Year

Let’s answer our first question. How many purchases have been made per year? The following code will accomplish that:

# Number of purchases per year

no_purch_year <- data %>%
  # group data by year_of_purchase (after converting it to factor)
  group_by(Year = factor(year_of_purchase)) %>%
  # calculate no. of purchases per year
  summarize(Frequency = n())

# have a look at the new data frame created
no_purch_year %>% datatable(style = "bootstrap")

Now, let’s plot the table we just produced:

no_purch_year_p <- ggplot(no_purch_year, aes(x = Year, 
                                             y = Frequency)) +
  geom_bar(stat = "identity", fill = "aquamarine4") +
  labs(x = "Year of Purchase", y = "Total No. of Purchases",
       title = "Total No. of Purchases per Year")
ggplotly(no_purch_year_p)


As you can see, the number of purchases per year goes up very quickly at the beginning, and then it keeps growing, but at a much slower pace.

Average Purchase amount per Year

Now, for the second question, we want to know how much is spent on average each year. Let’s do this in R:

# Average purchase amount per year
mean_purch_year <- data %>%
  # group data by the year
  group_by(Year = factor(year_of_purchase)) %>%
  # calculate the average of purchases per each year
  summarize(AvgAmount = mean(purchase_amount))

# have a look at the new generated table
mean_purch_year %>% datatable(style = "bootstrap")

Also, we want to see a graphical representation for the last table:

mean_purch_year_p <- ggplot(mean_purch_year, aes(x = Year, 
                                                 y = AvgAmount)) +
  geom_bar(stat = "identity", fill = "slateblue4") +
  labs(x = "Year of Purchase", y = "Avg. Purchase Amount",
       title = "Avg. Purchase Amount per Year")
ggplotly(mean_purch_year_p)


We notice that the average purchase amount was pretty stable at first and then started to increase from 2010 onward, although at a very slow pace.

Total Purchase Amount per Year

Here we’re interested in the total purchase amount per year. Again, we will use a similar code, with a slight change:

# Total purchase amounts per year

tot_purch_year <- data %>%
  # group data by year
  group_by(Year = factor(year_of_purchase)) %>%
  # calculate the total (sum) amount of purchases per year
  summarize(TotAmount = sum(purchase_amount))

# look at your new table
tot_purch_year %>% datatable(style = "bootstrap")

Again, plotting:

# first, let's divide TotAmount by 1000, because we want the results expressed in 1000s
tot_purch_year$TotAmount <- tot_purch_year$TotAmount/1000
tot_purch_year_p <- ggplot(tot_purch_year, aes(x = Year, y = TotAmount)) +
  geom_bar(stat = "identity", fill = "orange3") +
  labs(x = "Year of Purchase", y = "Total Purchase Amount", 
       title = "Total Purchase Amount Per Year ($1,000)")
ggplotly(tot_purch_year_p)


Here we can see a nice positive trend. Total revenues increase as years progress in a nearly linear fashion.

Look carefully at the last two plots, i.e. the total purchase amount per year and the average purchase amount per year. Do you notice an interesting pattern? Think carefully.

Here’s what I notice: while the total revenue (purchase amount) increases almost linearly, the average revenue doesn’t. It remained almost constant for some time and then started to increase, but at a much slower pace than the total revenue. Why is that?

If you think about it, this must be due to a few customers who make the largest purchases. Although they are very few, their contribution to total revenues is quite significant, and this is what drives the nearly linear increase in revenues over the years.

On the other hand, most customers make only moderate purchases. So, when you take the average, these small amounts drive down the average revenue. This is obvious from the summary statistics we ran before. Let’s run it again:

summary(data$purchase_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   30.00   62.34   60.00 4500.00

See how three quarters of the purchases are below $60, while the remaining customers, although few in number, make the largest payments. Notice how the mean is pulled down to only $62.34 by the many small purchases, while the maximum purchase amount is a staggering $4,500.

Having answered our questions in the first dimension, let’s move on to the next one:

The Month Dimension

Total Purchase Amount Per Month:

Now, instead of looking at the revenues per year, let’s see the distribution of total revenues across the months:

# let's do it this time in one go:

# starting from our data:
tot_rev_month_p <- data %>% 
  # add a new variable (Month) by extracting the month component from date_of_purchase
  mutate(Month = month(date_of_purchase, label = T)) %>% 
  # for each month:
  group_by(Month) %>% 
  # calculate the total purchase amount (divide by 1000 to express the results in thousands)
  summarise(TotalAmount = sum(purchase_amount)/1000) %>% 
  # pass the data frame resulting from above to ggplot,
  # then create a bar plot
  ggplot(aes(x = Month, y = TotalAmount)) +
  geom_bar(stat = "identity", fill = "deepskyblue4") +
  # adding labels
  labs(y = "Total Purchase Amount", title = "Total Purchase Amount Across Months ($1,000)") +
  # remove the x-axis label
  theme(axis.title.x = element_blank())

ggplotly(tot_rev_month_p)


Well, we all expected December to be the highest month in terms of revenue generation, due to Christmas. But another interesting pattern appears: revenues also rise in the middle of the year, around May and the couple of months on either side of it, as well as in the couple of months before December, while January, February, and August are the lowest.

You might consider taking these patterns into account when planning your marketing campaigns, by targeting the months where customers tend to buy more. Or maybe it is actually because of past marketing campaigns and promotions that customers buy more during these months (other than December, that is).

Let’s answer the second question:

Avg. Purchase Amount Per Month:

Now, we will repeat the same steps as before, but instead of looking at the total purchase amount, we’ll look at the average across the months:

# let's do it this time in one go:

# starting from our data:
avg_rev_month_p <- data %>% 
  # add a new variable (Month) by extracting the month component from the date
  mutate(Month = month(date_of_purchase, label = T)) %>% 
  # for each month:
  group_by(Month) %>% 
  # calculate the avg. purchase amount 
  summarise(AvgAmount = mean(purchase_amount)) %>% 
  # pass the data frame resulting from above to ggplot,
  # then create a bar plot
  ggplot(aes(x = Month, y = AvgAmount)) +
  geom_bar(stat = "identity", fill = "maroon4", alpha = 0.8) +
  # adding labels
  labs(y = "Avg. Purchase Amount", title = "Avg. Purchase Amount Across Months") +
  # remove the x-axis label
  theme(axis.title.x = element_blank())

ggplotly(avg_rev_month_p)


Hmmm, that’s interesting! Now May is the highest one. So, it seems that in December, a very large number of people are making purchases, but at a relatively small amount per purchase. On the other hand, a relatively small number of people are making purchases in May, but when they do, they make larger purchases on average.

What? You don’t believe me? Okay, let’s prove it by answering the third question in the month dimension:

Total Number of Purchases per Month:

Let’s now repeat the same plot, but with the total number of purchases (the frequency) on the y-axis:

# let's do it this time in one go:

# starting from our data:
count_rev_month_p <- data %>% 
  # add a new variable (Month) by extracting the month component from the date
  mutate(Month = month(date_of_purchase, label = T)) %>% 
  # for each month:
  group_by(Month) %>% 
  # count the number of purchases (frequency)
  summarise(Frequency = n()) %>% 
  # pass the data frame resulting from above to ggplot,
  # then create a bar plot
  ggplot(aes(x = Month, y = Frequency)) +
  geom_bar(stat = "identity", fill = "red4") +
  # adding labels
  labs(y = "Frequency", title = "Frequency of Purchases Across Months") +
  # remove the x-axis label
  theme(axis.title.x = element_blank())

ggplotly(count_rev_month_p)


Ha! You believe me now? It turns out that May is actually lower than many other months in terms of transaction frequency. This must mean that in May, people make significantly larger purchases on average, which is something worth further investigation.

Now, let’s move on to the third and most exciting dimension:

The Month-Day Dimension

Total Purchase Amount Across Months and Days:

Let’s take our exploration a step further and see the distribution of total revenues across months of the year and days of the week. So, how can we do this?

Well, there are many solutions to this question, but I will demonstrate only 3 of them:

Solution 1: Facetted Bar Plots

Let’s repeat our plot of total purchase amount across months, but with one difference: we will make multiple plots, each of which shows the total purchase amount across months for a specific day of the week (Monday, Tuesday, etc.). To accomplish that, we add the facet_wrap function to our ggplot object; this is the only difference, and everything else remains almost identical. Let’s see how we can do this:

# starting from our data:
tot_rev_month_day_p <- data %>% 
  # add 2 new variables 
  # 1- Month: by extracting the month component from the date
  # 2- Day: by extracting the day component from the date
  mutate(Month = month(date_of_purchase, label = T),
         Day = wday(date_of_purchase, label = T)) %>% 
  # for each (month, day) pair
  group_by(Month, Day) %>% 
  # calculate the total purchase amount (divide by 1000 to display results in thousands)
  summarise(TotalAmount = sum(purchase_amount)/1000) %>% 
  # pass the data frame resulting from above to ggplot,
  # then create a bar plot
  ggplot(aes(x = Month, y = TotalAmount)) +
  geom_bar(stat = "identity", aes(fill = Day)) +
  # use different colors
  scale_fill_brewer(palette = "Dark2") +
  # adding labels
  labs(title = "Total Purchase Amount Across Months ($1000)") +
  # facet across Day (create a plot for each day of the week)
  facet_wrap(~Day, nrow = 4) +
  # remove the x-axis label
  theme(axis.title = element_blank())

ggplotly(tot_rev_month_day_p)

That’s interesting! To my surprise, it seems that Sunday and Monday generate the least revenue. Wednesday looks pretty lucrative, though. Why is that? Is it just customers’ tendency, or is it because the company used to run promotions on that day? We need more investigation to answer this question; it cannot be answered by looking at the data alone. The remaining days, Thursday, Friday, and Saturday, seem close to each other.

Although this plot looks nice, making comparisons is not easy. For example, which of Friday, Thursday, and Saturday is highest? To make comparisons easier, we will have a look at the second approach:

Solution 2: Stacked Bar Plots

Instead of making a separate plot for each day as before, we’ll stack the days on top of each other. This is done by setting the position argument inside geom_bar to "fill", which stacks the bars and rescales each month to 100%, so that each bar shows every day’s share of that month’s revenue.

# starting from our data:
tot_rev_month_day_p_stacked <- data %>% 
  # add 2 new variables 
  # 1- Month: by extracting the month component from the date
  # 2- Day: by extracting the day component from the date
  mutate(Month = month(date_of_purchase, label = T),
         Day = wday(date_of_purchase, label = T)) %>% 
  # for each (month, day) pair
  group_by(Month, Day) %>% 
  # calculate the total purchase amount 
  summarise(TotalAmount = sum(purchase_amount)/1000) %>% 
  # pass the data frame resulting from above to ggplot,
  # then create a bar plot
  ggplot(aes(x = Month, y = TotalAmount)) +
  # make a stacked bar plot
  geom_bar(stat = "identity", aes(fill = Day, label = TotalAmount), position = "fill") +
  # use different colors
  scale_fill_brewer(palette = "Dark2") +
  # adding labels
  labs(title = "Total Purchase Amount Across Months ($1000)") +
  # remove the x-axis label
  theme(axis.title = element_blank())

ggplotly(tot_rev_month_day_p_stacked)

I think the answer to our question is more obvious now: Friday seems to generate noticeably more revenue than Thursday or Saturday. Let’s try one final approach to make things even better:

Solution 3: The Heatmap

This is my favorite one! We will make a heatmap in which months are plotted on the x-axis and days of the week on the y-axis. Each (month, day) pair is represented by a square, and the color of the square indicates the total purchase amount generated on that day and month across the years of our database; the darker the color, the higher the total purchase amount. Let’s see how this is done in R:

# start from our data
heat_p <- data %>% 
  # add 2 new variables 
  # 1- Month: by extracting the month component from the date
  # 2- Day: by extracting the day component from the date
  mutate(Month = month(date_of_purchase, label = T),
         Day = wday(date_of_purchase, label = T)) %>% 
  # for each (month, day) pair
  group_by(Month, Day) %>% 
  # calculate the total purchase amount
  summarise(TotalAmount = sum(purchase_amount)/1000) %>% 
  ggplot(aes(x = Month, y = Day)) +
  # add the tiles (the squares) for the heatmap; make the fill color proportional to TotalAmount
  geom_tile(aes(fill = TotalAmount)) +
  # create the color gradient for the tiles
  scale_fill_gradient(name = "Tot. Amount", low = "white", high = "maroon4") +
  labs(title = "Total Purchase Amount Across Months and Days ($1000)",
       caption = "The Heatmap is a very good visualization for Interaction between 3 variables")

ggplotly(heat_p)


We can see that making comparisons here is even easier. It’s obvious as we noticed before that Wednesday and Friday are the highest in terms of revenue generation.

Avg. Purchase Amount Across Months and Days:

Let’s repeat the same map, but this time for the average purchasing amount, not the total:

heat_p_avg <- data %>% 
  # add 2 new variables 
  # 1- Month: by extracting the month component from the date
  # 2- Day: by extracting the day component from the date
  mutate(Month = month(date_of_purchase, label = T),
         Day = wday(date_of_purchase, label = T)) %>% 
  group_by(Month, Day) %>% 
  summarise(AvgAmount = mean(purchase_amount)) %>% 
  ggplot(aes(x = Month, y = Day)) +
  # add the tiles for the heatmap
  geom_tile(aes(fill = AvgAmount)) +
  scale_fill_gradient(name = "Avg. Amount", low = "white", high = "red") +
  labs(title = "Avg. Purchase Amount Across Months and Days")

ggplotly(heat_p_avg)


Oh, this wasn’t expected at all! Over ten years, it’s Monday in three specific months, March, May, and June, that is by far the highest in terms of average purchase amount. Hover over the map to see the big difference between these three bright red squares and the other ones.

Why is that? This is a question worth investigating. Does it have to do with the type of business the company is running? Is it due to some very effective promotions? Or is it just that customers happen to have an inclination to buy significantly larger amounts on this day in these three months, which doesn’t sound reasonable at all? Again, the data can’t explain this on its own, although it revealed these interesting patterns and questions for us.

Now, for our last visualization:

Total No. of Purchases Across Months and Days:

Again, let’s repeat the same map, but for the total number of purchases (frequency):

heat_p_freq <- data %>% 
  # add 2 new variables 
  # 1- Month: by extracting the month component from the date
  # 2- Day: by extracting the day component from the date
  mutate(Month = month(date_of_purchase, label = T),
         Day = wday(date_of_purchase, label = T)) %>% 
  group_by(Month, Day) %>% 
  summarise(Frequency = n()) %>% 
  ggplot(aes(x = Month, y = Day)) +
  # add the tiles for the heatmap
  geom_tile(aes(fill = Frequency)) +
  scale_fill_gradient(name = "Frequency", low = "white", high = "blue4") +
  labs(title = "Frequency of Purchases Across Months and Days")

ggplotly(heat_p_freq)


This looks similar to the first map, the map of the total purchase amount, which makes perfect sense; the more transactions your customers make, the more total revenues they generate.

All right then, we have asked quite a number of questions and managed to answer them using visualization. Try to dig deeper, come up with new questions, and try to answer them. Think of new dimensions that might be of interest to you. For example, you could investigate the Year-Month dimension, or the Year-Day dimension, and answer the same three questions across these two new dimensions. Or, even better, come up with your own interesting questions.

Now, let’s move on to the main topic of this article:

Segmentation

Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests and spending habits.

In other words, segmentation transforms your huge customer database into something clear and usable. Think about it: you can’t treat all your customers the same way, offer them the same product, charge the same price, or communicate the same benefits, because they differ in needs, wants, habits, and profiles. Those differences allow you to customize your offerings, adapt your messages, and optimize your marketing campaigns.

Now, the question is: what makes a good and effective segmentation? Many textbooks mention a long list of criteria to define what a good segmentation is. Such criteria may include:

  • Homogeneity within segments: Customers within each segment should be similar enough.

  • Heterogeneity across segments: Customers who are different should fall into different groups.

  • Responsiveness: Each segment should react differently to a different marketing mix.

  • Measurability: The size of each segment can be measured.

And of course this list is correct. But to put it simply: a good segmentation is one that is relevant from both a statistical and a managerial perspective. But what does this actually mean?

Throughout the rest of this episode, I will be talking about segmentation from a statistical point of view. Specifically, I will talk about hierarchical segmentation: What is it? How does it work? And how can we implement it in R? The next episode will be devoted to managerial segmentation.

Hierarchical segmentation

To illustrate how hierarchical segmentation works, let’s take a very simple graphical example. Let’s assume that you have only ten customers in your database and that these ten customers are only described by two factors, or what we call segmentation variables:

  • How often they make a purchase in one of your stores (per year)
  • How much they spend every time they shop.

Graphically, these ten customers can be represented by the plot below:



You can see that the market neatly separates into three segments. Each segment represents a group of customers that is distinct from the other groups and all customers within each group are quite similar.

Segment C contains customers who buy very often, and for a lot of money at each purchase occasion. Segment B groups together customers who make regular purchases but for lower amounts. And segment A groups together customers who shop much less frequently but, when they do, spend a lot. From a managerial point of view, these three segments make a lot of sense:

  • Segment C is strategic and may generate a vast portion of the firm’s profit

  • Segment B constitutes the bulk of the purchases, because these customers buy frequently, but may not be as profitable as one might think.

  • Segment A is a perfect target for marketing actions that might encourage more frequent purchases, such as special store events or seasonal coupons.

But how can statistical software find the three segments automatically? Well, there are many methods, but the one we’ll explore is called hierarchical clustering. The process the software goes through is as follows:

First, you consider that each and every customer is its own segment. So, there are as many segments as there are customers, in this case 10.

Then you ask the question: “Which two customers could I group together so that I would lose the least information?” That is, which two are so similar that if I treated them as absolutely identical, it would make no difference? If you look at the last figure, you can easily see that the closest pair of customers is 5 and 6. So we should start by grouping them together.

The trick is simply to continue the process. By grouping these two very similar customers together, we went from ten segments to nine. Then we keep going until we finally have only three segments.
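
To see these steps end to end, here is a minimal sketch on a made-up ten-customer dataset (the numbers are invented purely for illustration and are meant to mimic the three groups in the figure):

# ten hypothetical customers described by two segmentation variables
toy <- data.frame(
  frequency = c(2, 3, 2, 12, 11, 11, 13, 12, 13, 11),      # purchases per year
  amount    = c(90, 95, 100, 15, 20, 25, 18, 95, 100, 90)  # $ spent per purchase
)
# standardize, compute pairwise distances, then cluster hierarchically
toy_hc <- hclust(dist(scale(toy)), method = "ward.D2")
# the dendrogram shows the order in which customers are merged
ggdendrogram(toy_hc)
# stop the merging at three segments
cutree(toy_hc, k = 3)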

The figure below illustrates this process:


Nothing prevents us from continuing the process, however. We could keep going and group these customers into only two segments, but of course we’d lose a lot of information.

By grouping together those customers in the top part of the chart, we’ll simply have one big segment of large spenders, but without the ability to distinguish frequent buyers from occasional ones, and the segmentation would be much less useful for managers.


So, as you can see, a good segmentation is all about striking a balance between preserving information (not producing so few segments that you lose significant information about your customers) and achieving simplicity (not producing so many segments that managing them becomes too complex). In short, the goal is to find the sweet spot between treating each customer individually, as if each were their own segment, and treating all customers the same way, as if there were only one big segment to which everybody belonged.

But where do you draw the line? Where do you stop and say: this is the most suitable number of segments to achieve this goal? The next section is devoted to answering exactly this question:

How Many Segments?

To begin with, there’s no magic way to know precisely how many segments are appropriate. First and foremost, it depends on how managerially relevant your segmentation is.

If it makes sense, from a managerial perspective, to simplify and reduce the number of segments to make your solution more usable, then by all means do it. And if it’s reasonable to expand your segmentation by a segment or two, because distinguishing customers more precisely makes managerial sense, then again, you have your answer.

But if you are unsure or don’t know where to start, there is a tool that can guide your decision: the dendrogram. A dendrogram is a graphical representation of the hierarchical clustering process, which is why we talk about hierarchical segmentation. All customers end up being grouped together, but there is a hierarchy, a priority, and the dendrogram illustrates this nicely. The figure below shows the dendrogram for our simple customer dataset:

At the bottom, you have all the customers you are segmenting, and the tree shows how quickly, and in which order, these customers are grouped together into segments. At the end of the process, at the very top, all customers fall into the same segment.

But the dendrogram also shows you how much information you lose by grouping segments together. If by grouping two segments together you’re losing a lot of information, then you will see a big jump in the dendrogram. That’s when you’re beginning to group together customers who are too different from one another. And if you’d like to avoid losing that much information you need to stop the process right before.

In our example there is a sudden jump between three and two segments, meaning that by going down to only two segments you lose a lot of information. So you should stop the segmentation process before that, at three segments. There is no magic bullet here, but the dendrogram is a very useful tool when you don’t know where to start.
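
If you prefer a numeric check to eyeballing the tree, the merge heights stored in any hclust object tell the same story. A minimal sketch, reusing the toy_hc object from the earlier sketch (any hclust result works the same way):

# merge heights, from the final merge (2 clusters -> 1) backwards
rev(toy_hc$height)
# the first two values (the 2 -> 1 and 3 -> 2 merges) are much larger than the rest:
# that is the numeric counterpart of the "big jump", so we stop at three segments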

Another visualization for the dendrogram that shows the clusters very well is the following:

Another Visualization of the Dendrogram


H.Clustering in Action

Do you want to see how hierarchical segmentation works live in front of you? That would be nice, wouldn’t it? Well, check out this beautiful Shiny application: hclustering in action

created by Raffael Vogler

Left-click on the grey panel in the lower left corner to add or remove a point. Points can be dragged. Add at least three points for the app to start working. Keep playing with the app by adding some points close to each other and other points away from each other, and see how the application performs hierarchical clustering for you in the background and draws the dendrogram.

Segmentation Variables

All right, so we have now talked about how customers are grouped together based on their similarities, and how to decide on a suitable number of segments. But wait, I hear you ask: if you group customers based on their similarity, similarity in what exactly? Well, that’s a very good question, my friend, and here’s the answer:

It depends on the managerial question you’re asking. If you’d like to understand how people use your website, for instance, maybe you should study similarities in terms of pages visited, number of clicks, and duration of visits. If you’re trying to optimize product recommendations or trying to personalize catalogues, maybe you should study similarities in terms of products purchased in the past. If you are trying to optimize marketing campaigns, you should definitely group customers based on their similarities in terms of profit and responsiveness to past marketing campaigns.

The specific indicators on which you compare customers are called segmentation variables. The software doesn’t care what kind of data you are analyzing. If you feed the segmentation algorithm with data that is managerially irrelevant, it will find similarities that are useless. So whatever segmentation study you want to conduct, the first question should always be, what managerial goal do I want to achieve? And based on the answer, then, and only then, you should ask yourself what segmentation variables are relevant to achieve that managerial goal. Don’t select segmentation variables simply because they are easily available to you.

For our purposes, we will focus on three widely known segmentation variables. These are:

Recency, Frequency, Monetary Value

In many marketing studies, three specific marketing indicators often turn out to be invaluable. They’re called Recency, Frequency, and Monetary Value and they’ve been shown to be some of the best predictors of future purchases and customer profitability.

Recency indicates when a customer made his last purchase. Usually, the smaller the recency, that is, the more recent the last purchase, the more likely the next purchase will happen soon. On the other hand, if a customer has not made any purchase for a long period of time, he may have lost interest or switched to the competition, which is bad news for future business.

Frequency refers to the number of purchases made in the past. The more purchases have been made in the past, the more likely additional purchases will occur in the future.

Finally, monetary value refers to the amount of money spent on average at each purchase occasion. Obviously, the more a customer spends on average, the more valuable he is.

A segmentation that uses recency, frequency, and monetary value as segmentation variables is often referred to as an RFM segmentation. But these key marketing variables are usually not readily available in the data you want to segment. More often than not, you simply have access to a transactional database that contains a list of past purchases made by each customer, and you need to compute recency, frequency, and monetary value yourself.

In the next section, we’ll show how to compute these indicators in R.

RFM in R

In this section, we’re going to compute recency, frequency, and monetary value from the dataset. Let’s look again at our data to see where we are right now:

# show 30 rows of our data
head(data, 30)

Now, we’re going to compute something a bit specific: the number of days that elapsed between each purchase date and the last day in the database, which we assume to be January 1st, 2016, and store that number of days in a new variable called days_since. We will use it to compute recency.

# compute days_since
data <- data %>% 
  mutate(days_since = as.numeric(interval(date_of_purchase,"2016-01-01"))/3600/24)

# look at the data again (30 rows)
head(data, 30) %>% datatable(style = "bootstrap")


Now, let’s compute recency, frequency, and average purchase amount. As before, we will use dplyr package to achieve that:

# Compute recency, frequency, and average purchase amount
customers <- data %>%
  # group the data by customer_id
  group_by(customer_id) %>%
            # for each customer_id, calculate:
  summarise(Recency = min(days_since, na.rm = T),
            Frequency = n(),
            AvgAmount = mean(purchase_amount, na.rm = T))

# now, have a look at our customers data
head(customers, 30) %>% datatable(style = "bootstrap")


# let's see how many unique customers we have
nrow(customers)
## [1] 18417
# and how many purchases in total did we have in the original data set?
nrow(data)
## [1] 51243

So we can see that the customers dataset has only 18,417 rows, meaning there are only 18,417 unique customers in this dataset. These 18,417 customers have made a total of 51,243 purchases.

Let’s explore our newly created data further:

# let's see some summary statistics:
# exclude the customer_id from the summary statistics as it's just an id
customers %>% select(-customer_id) %>% summary()
##     Recency       Frequency        AvgAmount      
##  Min.   :   1   Min.   : 1.000   Min.   :   5.00  
##  1st Qu.: 244   1st Qu.: 1.000   1st Qu.:  21.67  
##  Median :1070   Median : 2.000   Median :  30.00  
##  Mean   :1253   Mean   : 2.782   Mean   :  57.79  
##  3rd Qu.:2130   3rd Qu.: 3.000   3rd Qu.:  50.00  
##  Max.   :4014   Max.   :45.000   Max.   :4500.00

Here we can see that the average recency of a customer is 1,253 days. So, on average, customers last purchased about four years ago. Some lapsed ten years ago, some only a few days ago, but the average is about four years. The minimum is barely one day, and the maximum is pretty much the length of the entire dataset.

In terms of frequency, some have made only one purchase. Actually, many have made only one purchase. Some have made an astounding number of 45 purchases over their lifetime, or at least over the 10 or 11 years we are observing, but on average, people have made approximately 2.8 purchases.

And then in terms of amount, the average amount goes from a minimum of $5 to a staggering $4,500 and the mean is around $57 per purchase per individual.

Let’s take our exploration a step further and look at the distributions of recency, frequency, and monetary value:

recency_p <- ggplot(data = customers, aes(x = Recency)) +
  geom_histogram(fill = "orange3", bins = 30) +
  scale_x_continuous(breaks = seq(0, 4000, by = 500)) +
  labs(x= "Recency", title = "Distribution of Recency")

ggplotly(recency_p)


From the histogram we can see that we have a few customers whose recency is about 4,000 days, and a bunch of customers whose recency is much more recent, about 100 days or so.

If you look at the histogram in terms of frequency:

frequency_p <- ggplot(data = customers, aes(x = Frequency)) +
  geom_histogram(fill = "blue", color = "blue3") +
  scale_x_continuous(breaks = seq(0, 40, by = 2)) +
  labs(x= "Frequency", title = "Distribution of Frequency")

ggplotly(frequency_p)


You can see the skewness of the distribution is even more extreme. Many people have made only 1 or 2 purchases, and then when you go to 5, 10, 20, 30 purchases, they are even rarer in the entire database.

And then finally, in terms of avg. purchase amount:

amount_p <- ggplot(data = customers, aes(x = AvgAmount)) +
  geom_histogram(fill = "orangered2", color = "orangered3") +
  scale_x_continuous(breaks = seq(0, 4000, by = 500)) +
  labs(x= "Avg. Purchase Amount", title = "Distribution of Avg. Purchase Amount")

ggplotly(amount_p)


The chart doesn’t look very nice, simply because you have a lot of customers who made purchases of very low amounts and extremely few who made purchases of much larger amounts, like $4,000. So one thing you can do is increase the number of bins of the histogram and zoom in on a smaller range of x values (the average purchase amounts). Since most purchases are below $1,000 on average, we will limit our x-axis to 1,000 using the coord_cartesian function in ggplot, and also increase the number of bins to 200:

amount_p2 <- ggplot(data = customers, aes(x = AvgAmount)) +
  geom_histogram(fill = "orangered2", color = "orangered3", bins = 200) +
  scale_x_continuous(breaks = seq(0, 1000, by = 100)) +
  coord_cartesian(xlim = c(0, 1000)) +
  labs(x= "Avg. Purchase Amount", title = "Distribution of Avg. Purchase Amount")

ggplotly(amount_p2)


That’s a better view of the histogram, and as you can see, most people spend around $40, $50. A few spend over $100 and extremely few spend above $200, on average.

Data Transformation

Up until now, we’ve shown that customers are grouped together based on how similar or dissimilar they are to one another, and that similarity can be seen as a measure of distance. The problem, however, is that to compute how similar two customers are, you sometimes have to compare apples and oranges.

For instance, if you’d like to group customers based on recency, frequency, and monetary value, you are basically comparing variables that are measured in days, purchase occasions, and dollars (or any currency). These segmentation variables do not even use the same scales. So how do you compare one to another?
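
A quick sketch with three hypothetical customers shows the problem: without any rescaling, whichever variable happens to have the largest units dominates the distance.

# three made-up customers: Recency in days, Frequency in purchases, AvgAmount in dollars
cust1 <- c(Recency = 3000, Frequency = 2,  AvgAmount = 30)
cust2 <- c(Recency = 2000, Frequency = 2,  AvgAmount = 30)  # same behaviour, 1,000 days more recent
cust3 <- c(Recency = 3000, Frequency = 20, AvgAmount = 30)  # same recency, buys ten times as often
dist(rbind(cust1, cust2))  # 1000 -- driven entirely by the days
dist(rbind(cust1, cust3))  # 18   -- a tenfold difference in frequency barely registers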

The answer is: Data Transformation. You need to prepare and transform your data, so that your segmentation variables can be compared to one another. But what does this actually mean?

Well, data transformation is a big topic, but for our purposes, we mean two things:

  • Standardization:

This solves the problem of having different scales among your variables. Simply put, standardizing a variable x means subtracting its mean and dividing by its standard deviation:

\[ z = \dfrac{x - \text{mean}(x)}{\text{sd}(x)} \]

We won’t go into the details, but it simply means that, regardless of the actual scales used for your segmentation variables, they will roughly be rescaled to a range between minus two and plus two, with some extreme values falling outside that range. By doing so, we can compare variables even though their original scales were different.
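
In R, the scale() function does exactly this for each column. A tiny sketch on a throwaway vector, just to show the equivalence:

x <- c(5, 25, 30, 60, 4500)  # a toy vector of purchase amounts
(x - mean(x)) / sd(x)        # standardized by hand
as.numeric(scale(x))         # scale() produces the same numbers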

  • Dispersion Adjustment:

This one deals with the dispersion of the data, and the best example is probably the average purchase amount. We saw earlier in the histogram of average purchase amount that most customers buy for moderate amounts and only a few buy for very large amounts. In this case, the distribution is said to be skewed. Let’s have another look at the distribution of average purchase amount:



So what’s the problem? The problem is not statistical, it’s managerial. Would you say that a customer who spends $5 on average is different from someone who spends $15 instead? Definitely: the latter generates three times as much money at each purchase occasion. For a manager they’re clearly different, and it might be best if they were in different segments.

But what about two customers who spend $310 and $320, respectively? From a managerial point of view, not a big difference, right? Yet from a statistical point of view, the difference in purchase amount is exactly the same: $10.

When you’re facing that kind of situation, it might be worth transforming your data and taking the logarithm of the amount. Once you take the log, the distribution will look something like this:



The same difference of $10 has a huge impact on the left part of the chart, those who spend the least, but a minor influence on the right part of the chart, those who spend the most. The transformation is not strictly necessary from a statistical point of view, but from a managerial point of view, the segmentation solution will make much more sense. To learn more about data transformations and their underlying mathematical concepts, watch this short lecture.
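
To put numbers on that intuition, compare the same $10 gap before and after a log10 transformation, using the hypothetical amounts from above:

log10(15)  - log10(5)    # ~ 0.48  -- $5 vs $15 is a big difference on the log scale
log10(320) - log10(310)  # ~ 0.014 -- $310 vs $320 is almost negligible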

Okay, now we are ready. In the next two sections, we’ll prepare our data for segmentation, and then we’ll run the hierarchical segmentation and see what we get.

Data Transformation in R

First, let’s see how we can perform the log transformation in ggplot2. This lets you see the graphical effect of the transformation without changing the original data:

amount_p_log2 <- ggplot(data = customers, aes(x = AvgAmount)) +
  geom_histogram(fill = "orange3", color = "orange3", bins = 15, alpha = 0.9) +
  # all we need is to add this line, which performs the log transformation
  # on the plotted data without changing the original data
  scale_x_log10() +
  labs(x= "Avg. Purchase Amount", title = "Transformed Avg. Purchase Amount")

ggplotly(amount_p_log2)


We can now see that, thanks to the transformation, the distribution looks pretty close to a normal distribution.

Now, let’s actually perform the log transformation and the standardization, and store the result in a new dataset we’ll call customers_trans, for “customers transformed”:

customers_trans <- customers %>%
  # first let's remove customer_id column
  select(-customer_id) %>% 
  # transform the AvgAmount variable by taking the log10
  transform(AvgAmount = log10(AvgAmount)) %>% 
  # scale the variables (Recency, Frequency, AvgAmount)
  scale() %>% 
  # the output of scale() function is a matrix, convert it back to a data frame
  tbl_df() %>% 
  # add back the customer_id
  mutate(customer_id = customers$customer_id) %>% 
  # let the customer_id appear first in the data frame
  select(customer_id, everything())
  
# let's have a look at our newly transformed data
head(customers_trans, 30) %>% datatable(style = "bootstrap")

Now our new dataset customers_trans is the scaled version of customers, meaning that each column has a mean of zero and a standard deviation of one. Let’s check that:

# first, let's check whether the means of all 3 variables are zero
customers_trans %>% 
  # again remove the customer_id before the computation
  select(-customer_id) %>% 
  # compute the mean of all variables
  summarise_all(mean) %>% 
  # round the results to 2 decimals
  round(2)

Perfect! That’s exactly what we expected. Now, let’s check the standard deviations:

customers_trans %>% 
  select(-customer_id) %>% 
  # compute the standard deviation of all variables
  summarise_all(sd)

Excellent! So we have checked numerically that our data have been standardized. Let’s now see it pictorially by plotting the distributions of Recency, Frequency, and AvgAmount:

customers_trans %>% 
  gather(key = rfmVars, value = value, -customer_id) %>% 
  ggplot(aes(x = value)) +
  geom_histogram(aes(fill = rfmVars, color =rfmVars), alpha = 0.7) +
  coord_cartesian(xlim = c(-4, 4)) + 
  scale_x_continuous(breaks = -4:4) +
  facet_wrap(~rfmVars, nrow = 3) + 
  labs(x = "Value of a Segmentation Variable", title = "RFM Variables after Transformation", 
       fill = NULL, color = NULL) +
  theme(legend.position = "none")

We can see that all three variables have been centered around zero, with most observations lying between -2 and 2. AvgAmount, moreover, looks very close to normal thanks to the log transformation we performed. So now we can compare deviations in recency to deviations in frequency or in average amount. Basically, the data is now ready to be segmented and analyzed.

Hierarchical Segmentation in R

The first step in running a hierarchical clustering is to compute the distances between customers, knowing that the closer two customers are, the sooner they will be clustered together into the same segment. But here is an issue: our customer database contains 18,000+ customers, so if you want to compute distances among all of them, you’ll be asking R to compare more than 18,000 customers with one another, which amounts to:

nrow(customers)**2
## [1] 339185889

That is roughly 340 million customer-to-customer combinations. On many machines, that would simply be too much to handle in terms of memory. Instead, we will take a sample of the dataset, calculate the distances on that sample, and then run our hierarchical clustering.
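
Before sampling, here is a back-of-the-envelope estimate of what the full distance computation would cost in memory (a rough sketch; it assumes the 8 bytes R uses per numeric value, and that dist() stores only the unique pairs):

n <- nrow(customers)            # 18,417 customers
n * (n - 1) / 2                 # ~170 million unique pairwise distances
n * (n - 1) / 2 * 8 / 1024^3    # ~1.3 GB of memory for the distance object alone

With that in mind, let’s take the sample and run the clustering in R: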

# take a 10% sample from both the original customers data and the standardized one

customers_sample <- customers %>% 
  # take the 1st, 11th, 21st, ... rows until the last row
  slice(seq(1, nrow(customers), by = 10))

customers_trans_sample <- customers_trans %>% 
  slice(seq(1, nrow(customers_trans), by = 10))

# run the hierarchical clustering 
clusters <- customers_trans_sample %>%
  # remove customer_id
  select(-customer_id) %>% 
  # compute the distance matrix of the standardized data
  dist() %>% 
  # lastly, perform hierarchical clustering on the distance matrix
  # note: read the documentation of hclust (?hclust) to understand the method argument
  hclust(method = "ward.D2")


# now, let's see the dendrogram of our clusters
dendro <- ggdendrogram(clusters, leaf_labels = F, labels = F)
dendro +
  labs(x = "Customer ID", title = "H.Clustering for our Sample Dataset")

At the very bottom of the dendrogram we have the roughly 1,800 customers of our sample. You can then see how all these individuals are clustered together progressively, step by step, up to the stage where there is only one big cluster.

As we explained before, the dendrogram suggests stopping at four clusters or before; that’s where you notice the big jump. So let’s cut the tree at four clusters and see what we get:

# cut the tree resulting from the hclust() function into 4 groups, as suggested by the dendrogram
members = cutree(clusters, k = 4)

# The (members) vector has a length equal to the length of customers_trans_sample
# and it dictates to which group each customer belongs
# let's have a look at it
head(members, 20)
##  [1] 1 1 1 2 3 1 3 4 4 3 3 4 2 1 1 2 4 3 3 4

That means the first 3 customers belong to group 1, the 4th belongs to group 2, the 5th to group 3, and so on.

If you run the function table, it will count how many customers belong to each cluster:

table(members)
## members
##   1   2   3   4 
## 521 130 859 332

So here, cluster number 2 contains only 130 individuals, while cluster number 3 contains 859. Of course, that’s not very useful if you don’t know what these clusters are all about. So what you’d like to do next is compute the average profile of each segment. For that purpose we don’t care about the standardized variables any more; what we care about are the averages of the three variables on their original scales. So we’re going to aggregate the original, unstandardized data, customers_sample:

segments_profiles <- customers_sample %>% 
  # add the members as a new variable to customers and name it (segment)
  mutate(segment = members) %>%
  # group by the newly added segment variable 
  group_by(segment) %>% 
  # for each segment calculate the following summaries
  summarise(members_count = n(),
            avg_recency = round(mean(Recency), 2),
            avg_frequency = round(mean(Frequency), 2),
            avg_purchase_amount = round(mean(AvgAmount), 2))

# have a look at the segments profiles
segments_profiles %>% knitr::kable(caption = "Segments Profiles")
Segments Profiles

segment   members_count   avg_recency   avg_frequency   avg_purchase_amount
      1             521       2612.50            1.30                 29.03
      2             130        193.65           10.62                 42.02
      3             859        712.52            2.55                 31.12
      4             332        972.55            2.76                149.68

From the table we can see that, for example, segment 1, which contains 521 individuals, has an average recency of 2,612 days, an average frequency of 1.3 past purchases, and an average purchase amount of $29. This segment looks unimportant compared to the other ones: it has the worst values on all three segmentation variables. Segment 4 stands out, spending much more on average ($149) with a much lower average recency of 972 days. Segment 2 is the best in terms of average frequency and average recency, and second best in terms of average purchase amount, while segment 4 is the best in terms of average purchase amount.

We can see how effective such a profiling mechanism is, because by only looking at this simple table, we can quickly deduce that segments 2 and 4 are the most important segments to our business, and that we should spend most of our marketing efforts and costs on them. To see this even more clearly, let’s plot these profiles using box plots:

recency_p <- customers_sample %>% 
  mutate(segment = members) %>% 
  group_by(segment = factor(segment)) %>% 
  ggplot(aes(x = segment, y = Recency, color = segment)) +
  geom_boxplot() +
  labs(title = "Recency Ditribution Across the 4 Segments", x = "Segment", color = NULL)

ggplotly(recency_p)


If you need a refresher on boxplots, check this link: how to read and use a Boxplot. But simply put, the box captures the middle 50% of the data, the horizontal line inside the box shows the median, and the whiskers (the vertical lines outside the box) show the reasonable extent of the data. Any dots outside the whiskers are good candidates for outliers.

Looking at the plot above confirms what we said before: segment 1 is the worst in terms of recency. It has a median of 2,577 days, and nearly 25% of its members have a recency between 3,000 and 4,000 days.

On the other hand, segment 2 is the best, with a median as low as 60 days and only a few outliers exceeding the threshold of 611 days.

Segments 3 and 4 are close to each other in terms of recency, as their distributions show, with a slight advantage for segment 3. This is consistent with the average recencies we saw in the profiles table: 712 and 972 days for segments 3 and 4, respectively.

But what about the other two segmentation variables? Will segments 3 and 4 also be close to each other? Let’s see for ourselves:

# plotting the frequency distributions of the 4 segments
frequency_p <- customers_sample %>% 
  mutate(segment = members) %>% 
  group_by(segment = factor(segment)) %>% 
  ggplot(aes(x = segment, y = Frequency, color = segment)) +
  geom_boxplot() +
  labs(title = "Frequency Ditribution Across the 4 Segments", x = "Segment", color = NULL)

ggplotly(frequency_p)


Segment 1 is still the worst in terms of frequency, while segment 2 is far ahead of the competition. Segments 3 and 4 are still close to each other; most segment 3 members have a frequency slightly higher than their counterparts in segment 4, as we can see by comparing the boxes and whiskers of the two segments.

But we also notice that quite a number of segment 4 members have a very high frequency; those are the dots above segment 4’s whisker. These very frequent buyers pulled up the average frequency of the whole segment so that it topped segment 3’s, although very slightly. Go back to the profiles table above to see that segment 4 has an average frequency of 2.76, while segment 3’s is 2.55.

Okay, up until now it seems the software did a good job by separating segment 1 from segment 2, as one performs the worst while the other performs the best. But what about segments 3 and 4? Why did our machine decide to separate them from one another when, as we saw, they’re quite similar? Let’s look at the third segmentation variable, average purchase amount. Maybe it will reveal something to us:

# plotting the average purchase amount distributions of the 4 segments
AvgAmount_p <- customers_sample %>% 
  mutate(segment = members) %>% 
  group_by(segment = factor(segment)) %>% 
  ggplot(aes(x = segment, y = AvgAmount, color = segment)) +
  geom_boxplot() +
  labs(title = "Avg. Purchase Amount Ditribution Across the 4 Segments", x = "Segment", color = NULL) 

ggplotly(AvgAmount_p)


Now I get it! Segments 1 through 3 all have an average purchase amount distribution confined below the $100 threshold, while segment 4 has exclusively collected all the big buyers. You can see them above segment 4’s whisker; there are a lot of them!

Do you remember earlier when we noticed that the average purchase amount increased only slightly over the years, while the total purchase amount was increasing linearly? Go back to The Year Dimension if you don’t remember. We concluded that this must be due to some big buyers who managed to increase the total purchase amount significantly across the years.

We found those buyers here! Some of them even exceed $2,000 worth of purchases. These big buyers pulled the average purchase amount of segment 4 up to an astounding $150, almost four times that of segment 2, the second best, which has an average purchase amount of $42.

Notice that those big buyers pulled up the average purchase amount of our entire customer database only slightly, because their big purchases were diluted by the sheer number of buyers, across all segments, with very low purchase amounts. Remember that at that point we hadn’t yet segmented our customer database.

But now that the software has put all those big buyers in one segment, they make a huge impact on that segment’s average purchase amount. This was a very sound decision, and it’s exactly why segment 4 had to be separated from segment 3.

To sum up, segments 2 and 4 are strategic to our business: segment 2 represents our loyal customers, those who buy frequently and don’t lapse long between purchase occasions, although on average they don’t make very big payments; and segment 4 represents our big customers, who generate large revenues when they buy, although they buy less frequently than segment 2 members.

Conclusion

In this first part of our marketing analytics series, we started with crude, raw numbers about customers and their purchases. We used R, ggplot2, and dplyr to explore these numbers until they started to speak, to tell their story, and to reveal their hidden patterns. We then used hierarchical clustering to divide our huge customer database into four meaningful and distinct groups of customers, in the sense that members of each group are close to one another in terms of the segmentation variables and at the same time far away from members of other groups.

In the next part of our series, we will hand control to the human, the manager, and let him do the segmentation for us. Just as our machine managed to achieve a good segmentation using what it knows best, computation, the human will hopefully do a good job as well using what he knows best: reason.

But why? If the software has accomplished the task, why bother having a human do it again? Well, it turns out the statistical segmentation solution has some drawbacks.

What are those drawbacks? How will the manager overcome them using managerial segmentation? And how can we do all this in R?

The answers to all these questions are exactly what the next part covers. So stay tuned!