Customer Segmentation of Marketing Data Using the Tidyclust Package

Intro and Summary

Intro

Cluster analysis is a method used to group people or items based on their similarities. In marketing, it helps businesses identify different groups of customers who share similar characteristics or behaviors. This process is called audience segmentation.

By using cluster analysis, marketers can divide their audience into distinct groups. For example, they might find groups of customers who prefer low prices, those who are loyal to specific brands, or those who are quick to try new products. Each of these groups can then receive tailored marketing messages and offers that speak directly to their interests and needs.

This targeted approach allows businesses to create more personalized and effective marketing strategies, leading to better customer engagement and satisfaction. In essence, cluster analysis helps marketers understand their audience better and connect with them in a more meaningful way.

In this notebook we will be looking at marketing data from Kaggle. This dataset includes demographic information as well as purchasing information and participation in offers made by the store. We will simulate audience segmentation for a store that is just opening by using only the demographic information to segment our audience. We will then determine how useful these clusters are by observing the differences among the clusters and seeing how much predictive power the clusters add in predicting whether or not customers will participate in offers made by the store.

Our clusters are made by giving extra weight to variables on whether and how many kids or teens are present in the household. We compare these clusters to a base set of clusters that weighs all the variables evenly and a third set of clusters that gives extra weight to household income, our assumption being that the presence of kids/teens in a household affects a household's grocery purchasing behavior more than income does.

Summary of Results (In case you don’t want to read the whole thing)

Our tuning estimates suggest that 9 clusters might be the optimal number, so we move forward with that. In our main set of clusters, which gives extra weight to the number and presence of kids and teens in a household, this creates a clear set of groups that have no children or teens in the house. Other variables also help distinguish the groups in this cluster set: for example, our exploratory analysis finds that income tends to be higher for households without kids or teens present, and our cluster set likewise shows that the groups without kids or teens have higher incomes.

When compared with only the other demographic variables, as a store might do before it opens and before it has purchasing history data for its market, our clusters prove to be among the top 5 most important variables for predicting whether or not customers participate in the various offers made by the store, and in 5 of the 6 offers the cluster set that weighted kids and teens more heavily proved more predictive than both the baseline cluster set and the cluster set that weighted income more.

When we include the purchasing history variables, imitating a store that has been open for some time and has been able to collect purchasing history on its customers, our clusters, which were built only on demographic information, drop in importance for the predictive model but are still used by it. With the additional variables, our kids/teens-weighted clusters are more important than the baseline clusters in predicting participation for only 1 of the 6 offers, and better than the income-weighted clusters in only 2 of the 6.

Overall our clusters add real value in dividing the total audience into manageable groups that can be targeted with specific marketing strategies. These clusters, built only on demographic data, provide significant additional value when used with a dataset that only has demographic variables, and they still provide some value in a dataset that also includes purchase history data.

The Data

We will be using the Marketing Analytics dataset shared by Jack Daoud on the Kaggle website here.

For this exercise we will be starting with only the demographic variables of the customers, as if our store is just opening, we are seeing these customers for the first time, and we are trying to segment them for our first marketing campaigns. After we have made the clusters, we will look at how much information they add when predicting the likelihood of a customer accepting an offer. For now let's start with a quick look at the data we have to work with.
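
The chunks in this post pick up after the data has already been loaded, so here is a minimal sketch of the assumed setup. The file name, the read_csv() call, and the package list (including firatheme as the source of theme_fira()) are assumptions; all that matters is that marketing_data holds the full Kaggle dataset (already dummy-encoded) and marketing_data_initial keeps only the 13 demographic columns used for clustering.

library(tidyverse)   # dplyr, tidyr, ggplot2, readr
library(tidymodels)  # recipes, workflows, workflowsets, rsample, tune
library(tidyclust)   # k_means(), tune_cluster(), cluster metrics
library(plotly)      # 3d cluster plots later on
library(firatheme)   # assumed source of theme_fira()

# full dataset -- "marketing_data.csv" is a placeholder for the Kaggle download
marketing_data <- read_csv("marketing_data.csv")

# demographic-only view used for clustering (the 13 columns glimpsed below)
marketing_data_initial <- marketing_data %>%
  select(
    Income, Kidhome, Teenhome, Age,
    starts_with("marital_"),
    starts_with("education_")
  )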

Below we can see we have about 2,200 records of 13 variables. Categorical variables have already been transformed into dummy variables for us. We have information on Income, children at home, relationship status and education.

glimpse(marketing_data_initial)
## Rows: 2,205
## Columns: 13
## $ Income               <dbl> 58138, 46344, 71613, 26646, 58293, 62513, 55635, …
## $ Kidhome              <dbl> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0…
## $ Teenhome             <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0…
## $ Age                  <dbl> 63, 66, 55, 36, 39, 53, 49, 35, 46, 70, 44, 61, 6…
## $ marital_Divorced     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
## $ marital_Married      <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0…
## $ marital_Single       <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ marital_Together     <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1…
## $ marital_Widow        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_Basic      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ education_Graduation <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1…
## $ education_Master     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ education_PhD        <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0…

Looking at the output below we can see the dataset has no missing values, so we don't have to worry about that. "That's good, one less thing."

skimr::skim(marketing_data_initial)
Data summary
Name marketing_data_initial
Number of rows 2205
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Income 0 1 51622.09 20713.06 1730 35196 51287 68281 113734 ▂▇▇▆▁
Kidhome 0 1 0.44 0.54 0 0 0 1 2 ▇▁▆▁▁
Teenhome 0 1 0.51 0.54 0 0 0 1 2 ▇▁▇▁▁
Age 0 1 51.10 11.71 24 43 50 61 80 ▂▇▇▆▂
marital_Divorced 0 1 0.10 0.31 0 0 0 0 1 ▇▁▁▁▁
marital_Married 0 1 0.39 0.49 0 0 0 1 1 ▇▁▁▁▅
marital_Single 0 1 0.22 0.41 0 0 0 0 1 ▇▁▁▁▂
marital_Together 0 1 0.26 0.44 0 0 0 1 1 ▇▁▁▁▃
marital_Widow 0 1 0.03 0.18 0 0 0 0 1 ▇▁▁▁▁
education_Basic 0 1 0.02 0.15 0 0 0 0 1 ▇▁▁▁▁
education_Graduation 0 1 0.50 0.50 0 0 1 1 1 ▇▁▁▁▇
education_Master 0 1 0.17 0.37 0 0 0 0 1 ▇▁▁▁▂
education_PhD 0 1 0.22 0.41 0 0 0 0 1 ▇▁▁▁▂

Exploratory plots

First let’s get an idea of what kind of data we’re working with.

Luckily all our variables are already numeric (the factor variables have been converted to dummy variables), so let's make a correlation plot to start.

From the plot below a few correlations are visible. Age has a slight positive correlation with Income and Teenhome and a slight negative correlation with Kidhome. Conceptually this makes sense: as you get older, you likely have more years of work experience, which leads to higher income. As you get older your children also get older, so it's more likely you have teens in your home rather than young children. We also see a slight positive correlation between Income and the level of education.

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(marketing_data_initial))

Now let's get an idea of how many people fall into each level of our categorical variables (education and marital status).

Here we can see that most of the customers are married, single, or together (likely dating), whereas divorced and widowed customers make up a minority.

For education, a single category, Graduation, makes up about half of the customers. Interestingly, we appear to have more PhDs than Master's degrees.

marketing_data_initial %>%
  select(starts_with("marital")) %>%
  pivot_longer(everything(), names_to = "marital_status", values_to = "count") %>%
  ggplot(aes(x = marital_status, y = count, fill = marital_status)) +
  geom_col() +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

marketing_data_initial %>%
  select(starts_with("education")) %>%
  pivot_longer(everything(), names_to = "education_level", values_to = "count") %>%
  ggplot(aes(x = education_level, y = count, fill = education_level)) +
  geom_col() +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Income looks to be unimodal with fairly short tails.

marketing_data_initial %>%
  ggplot(aes(x = Income)) +
  geom_histogram() +
  theme_fira()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most people have zero kids and among those with kids, most only have one.

marketing_data_initial %>%
  ggplot(aes(x = Kidhome)) +
  geom_bar() +
  theme_fira()

Most people also don’t have a teen, but the homes with a teen make up a larger share than the homes with a kid.

marketing_data_initial %>%
  ggplot(aes(x = Teenhome)) +
  geom_bar() +
  theme_fira()

If we add the Kidhome variable and the Teenhome variable to create a new variable we call dependents, we see that most households do indeed have a dependent and some even have 3, something that wasn't quite clear looking at these variables independently. We will likely want to keep this new variable for our clustering (but maybe use PCA among the three? or not?).

marketing_data_initial %>%
  mutate(dependents = Kidhome + Teenhome) %>%
  ggplot(aes(x = dependents)) +
  geom_bar() +
  theme_fira()

We saw that age correlates with income and with kids and teens at home, so what do they all look like together? It looks like high income households are more likely to not have a dependent. One-dependent homes seem to dominate two clusters that sandwich an area with more two-dependent households. Three-dependent households are pretty rare and scattered, but they don't seem to appear for anyone under the age of 40, or at least are very rare for that younger group.

marketing_data_initial %>%
  mutate(dependents = Kidhome + Teenhome) %>%
  ggplot(aes(x = Age, y = Income, color = factor(dependents))) +
  geom_point() +
  theme_fira()

Ideas for Approach

Cluster analysis is different from most other machine learning methods people think of because it's unsupervised, meaning we don't train it toward a specific outcome. For example, if we had data that showed who accepted different offers (we do, but for now we're pretending our store just opened), we'd train a model on those previous results to better predict who will accept our next offer, and the data on who accepted and who didn't would act as a sort of supervisor for how the model should fit.

For cluster analysis though, there’s no end prediction. We just want to find a numerical way to separate people into clusters. At this step, we aren’t really trying to determine who will and won’t accept an offer, we’re more trying to break our population into groups that are similar enough within the group that we can target them with a specific marketing strategy that is effective, and different enough across groups that the different groups merit separate marketing strategies.

Because of this we have a different set of tools for tuning the model. One of them is playing with the weights of different variables. I believe that the number of dependents (kids and teens) in a household will likely have a bigger effect on a household's grocery shopping habits than the other variables. Because of this I will add a little extra weight to this variable in one of my procedures. For comparison I will also include a second procedure where income is given a little extra weight, and a third procedure where all variables are weighted the same in the cluster analysis. At the end we will see whether any of these methods adds better predictive value as to whether or not someone accepts the specific offers made.
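
To make the weighting idea concrete: k-means works on squared Euclidean distances, so multiplying an already standardized column by a weight w scales that column's contribution to every distance by w^2. The toy example below is just an illustration with made-up numbers, not part of the analysis.

# two customers described by two standardized variables
a <- c(income = 0.5, dependents = 1.0)
b <- c(income = 1.5, dependents = 0.0)

# unweighted squared distance: both variables contribute equally
sum((a - b)^2)
## [1] 2

# up-weight dependents by 1.1: its contribution grows by 1.1^2 = 1.21
w <- c(income = 1, dependents = 1.1)
sum((w * a - w * b)^2)
## [1] 2.21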

The Clustering

Setting the model specification

First we want to set up our model. We will be doing k-means clustering and we will be tuning the number of clusters.

kmeans_mod <- k_means(num_clusters = tune())

Recipes

Luckily the recipes package makes it easy for us to specify our different procedures, and the workflowsets package makes it easy to tune our model across all of them. We're also working with a small enough dataset that this shouldn't be too expensive in time or computational workload. Now let's specify our recipes!

### base recipe (doing the bare minimum)
base_recipe <- recipe(~., data = marketing_data_initial) %>%
  step_normalize(all_numeric())

### weight dependents
weights_dependents_recipe <- recipe(~., data = marketing_data_initial) %>%
  step_mutate(
    dependents        = Kidhome+Teenhome,
    dependent_present = if_else(dependents >= 1, 1, 0)
    ) %>%
  step_select(-c(Kidhome, Teenhome)) %>%
  step_normalize(all_numeric()) %>%
  step_mutate(
    dependents        = dependents*1.1,
    dependent_present = dependent_present*1.1
    )

### weight income 
weights_income_recipe <- recipe(~., data = marketing_data_initial) %>%
  step_normalize(all_numeric()) %>%
  step_mutate(Income = Income*1.1)


### Create the workflow set
cluster_wflow <- workflow_set(
      preproc = list(
        base                          = base_recipe,
        weights_dependents            = weights_dependents_recipe,
        weights_income                = weights_income_recipe
    ),
      models = list(kmean = kmeans_mod),
      cross = TRUE
   )

Tuning the Clusters and Including BIC

In addition to the common SSE metrics used to pick the number of clusters, I also want to measure BIC, which some research suggests can achieve better results than the SSE ratio. Unfortunately the tidyclust package doesn't seem to include BIC as one of its available metrics. It does have a way to add custom metrics, but it doesn't seem to be as easy as the yardstick package's way of adding custom metrics for supervised machine learning outcomes. This took a lot more work than I had hoped, but by looking at the documentation for the metric functions of the tidyclust package, I was able to figure out how to add BIC as a custom metric to include in our tuning process. The version implemented below uses a simple k-means approximation, BIC = D + log(n) * m * k, where D is the total within-cluster SSE, n is the number of observations, m is the number of variables, and k is the number of clusters. The code for that is below.

### create custom function to calculate BIC
extract_post_preprocessor <- function(object, new_data) {
  preprocessor <- hardhat::extract_preprocessor(object)

  if (inherits(preprocessor, "workflow_variables")) {
    new_data <- dplyr::select(new_data, !!preprocessor$predictors)
  } else if (rlang::is_formula(preprocessor)) {
    new_data <- hardhat::mold(preprocessor, new_data)$predictors
  } else if (inherits(preprocessor, "recipe")) {
    new_data <- object %>%
      hardhat::extract_recipe() %>%
      recipes::bake(new_data, recipes::all_predictors())
  }
  new_data
}

bic <- function(object, ...) {
  UseMethod("bic")
}

bic <- new_cluster_metric(bic, direction = "minimize")

bic.cluster_spec <- function(object, ...) {
  rlang::abort(
    paste(
      "This function requires a fitted model.",
      "Please use `fit()` on your cluster specification."
    )
  )
}

bic.cluster_fit <- function(object, new_data = NULL, dist_fun = NULL, ...) {
  if (is.null(dist_fun)) {
    dist_fun <- Rfast::dista
  }

  res <- bic_impl(object, new_data, dist_fun, ...)

  return(tibble::tibble(
    .metric = "bic",
    .estimator = "standard",
    .estimate = res
  ))
}

bic.workflow <- bic.cluster_fit

bic_vec <- function(object, new_data = NULL, dist_fun = Rfast::dista, ...) {
  bic_impl(object, new_data, dist_fun, ...)
}

bic_impl <- function(object, new_data = NULL, dist_fun = Rfast::dista, ...) {
  # Preprocess new data with the workflow's recipe before computing distances
  if (inherits(object, "workflow") && !is.null(new_data)) {
    new_data <- extract_post_preprocessor(object, new_data)
  }
  if (inherits(object, "workflow")) {
    object <- extract_fit_parsnip(object)
  }

  # m = number of variables, n = number of observations, k = number of clusters
  m <- ncol(object$fit$centers)
  n <- length(object$fit$cluster)
  k <- nrow(object$fit$centers)

  # D = total within-cluster SSE, either from the fit or computed on new data
  if (is.null(new_data)) {
    D <- object$fit$tot.withinss
  } else {
    D <- sse_within_total_vec(object, new_data)
  }

  # BIC approximation: within-cluster SSE plus a complexity penalty of log(n) * m * k
  D + log(n) * m * k
}

Finally, we can tune the model. We will create a tuning grid that lets us try 1-15 clusters. The workflow_map function allows us to tune across all of the workflows in our set. In short, each of our recipes will be used to make 15 different models with 15 different settings for our num_clusters argument, so that we can see how well each number of clusters performs with each of our recipes.
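
One object the tuning call relies on, marketing_cv, is never created in the chunks shown here. Judging from the "over 10 folds" labels on the plots below, it is presumably a 10-fold cross-validation split of the demographic data, along the lines of this sketch (the exact call is an assumption):

# 10-fold cross-validation resamples, matching the
# "over 10 folds" axis labels used on the tuning plots
marketing_cv <- vfold_cv(marketing_data_initial, v = 10)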

clust_num_grid <- grid_regular(num_clusters(range = c(1, 15)), levels = 15)


### get our tuning results
set.seed(1234)
tune_results <- workflow_map(
  cluster_wflow,
  fn = "tune_cluster", 
  resamples = marketing_cv, 
  grid = clust_num_grid,
  metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio, bic), 
  control = control_grid(save_pred = TRUE, extract = identity, save_workflow = FALSE, parallel_over = "resamples"),
  verbose = TRUE
  )
## i 1 of 3 tuning:     base_kmean
## ✔ 1 of 3 tuning:     base_kmean (27s)
## i 2 of 3 tuning:     weights_dependents_kmean
## ✔ 2 of 3 tuning:     weights_dependents_kmean (29.7s)
## i 3 of 3 tuning:     weights_income_kmean
## ✔ 3 of 3 tuning:     weights_income_kmean (28.2s)

Let’s look at the SSE ratio.

all_res_dataframe <- map_dfr(
  collect_metrics(tune_results) %>% distinct(wflow_id) %>% pull(wflow_id),
  ~ bind_cols(collect_metrics(extract_workflow_set_result(tune_results, .x)), recipe = .x)
)


all_res_dataframe %>%
  filter(.metric == "sse_ratio") %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("mean WSS/TSS ratio, over 10 folds") +
  xlab("Number of clusters") +
  facet_wrap(~recipe)

And our custom BIC.

all_res_dataframe %>%
  filter(.metric == "bic") %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("BIC, over 10 folds") +
  xlab("Number of clusters") +
  facet_wrap(~recipe)

Best Recipe?

We can see from the graphs above that some recipes perform better than others. To get a slightly clearer picture, let's take a look at the performance of all the recipes at 6 clusters (the number used in the plots below). We see consistent rankings across both metrics.

Okay, so we can all agree that the income-weighted clusters are the least suited here, which is interesting: we increased the weight of a variable there, and we would expect that to increase the separation between our groups. The other two approaches seem to be a little more competitive, but the dependents-weighted model is the one we chose and the others are for comparison, so if I change my mind now I look like a loser…but whichever one adds the most predictive value at the end is the one I chose…kidding.

bic_plot <- all_res_dataframe %>%
  select(num_clusters, .metric, mean, recipe) %>%
  filter(num_clusters == 6 & .metric == "bic") %>%
  mutate(recipe = fct_reorder(recipe, mean)) %>%
  ggplot(aes(x = recipe, y = mean)) +
  geom_segment(aes(y=0, yend=mean), color="grey") +
  geom_point(color = "blue", size=4) +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  ggtitle("BIC")

sse_plot <- all_res_dataframe %>%
  select(num_clusters, .metric, mean, recipe) %>%
  filter(num_clusters == 6 & .metric == "sse_ratio") %>%
  mutate(recipe = fct_reorder(recipe, mean)) %>%
  ggplot(aes(x = recipe, y = mean)) +
  geom_segment(aes(y=0, yend=mean), color="grey") +
  geom_point(color = "blue", size=4) +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  ggtitle("SSE Ratio")

gridExtra::grid.arrange(bic_plot, sse_plot, nrow = 1)

How Many Clusters

It looks like 9 clusters is pretty consistently a decent pick. It's also nice that the number will be consistent across recipes, which makes for a more equal comparison.

all_res_dataframe %>%
  filter(
      .metric == "sse_ratio"
      ) %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("mean WSS/TSS ratio, over 10 folds") +
  xlab("Number of clusters") +
  facet_wrap(~recipe)

all_res_dataframe %>%
  filter(
      .metric == "bic"
      ) %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("mean BIC, over 10 folds") +
  xlab("Number of clusters") +
  facet_wrap(~recipe)

So now let’s get our cluster assignments.

kmod <- k_means(num_clusters = 9)

base_wflow <- workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(kmod)

depend_wflow <- workflow() %>%
  add_recipe(weights_dependents_recipe) %>%
  add_model(kmod)

income_wflow <- workflow() %>%
  add_recipe(weights_income_recipe) %>%
  add_model(kmod)

base_clusters       <- fit(base_wflow, marketing_data_initial)   %>% extract_cluster_assignment()%>%pull(.cluster)
dependent_clusters  <- fit(depend_wflow, marketing_data_initial) %>% extract_cluster_assignment()%>%pull(.cluster)
income_clusters     <- fit(income_wflow, marketing_data_initial) %>% extract_cluster_assignment()%>%pull(.cluster)

Visualizing our Clusters

We can visualize our clusters with 3d plots to see what our groups look like. First we do PCA (without any weighting) on our initial variables and pull the first three components. We add our clusters to this dataframe. Finally we can make 3d scatter plots with the three PCA components being the dimensions our data is plotted on and coloring the observations based on their cluster. Let’s take a look and see if there’s much difference between the groups of our clusters.

pca_res <- prcomp(marketing_data_initial,
       center = TRUE,
       scale. = TRUE)

pca_res <- bind_cols(
  pca_res$x[,1:3], 
  base_clusters       = base_clusters,
  dependent_clusters  = dependent_clusters,
  income_clusters     = income_clusters
  )

First, a 3d plot based on our clusters built with the weighted dependents variables, our primary model. Pretty cool: we can see some pretty clear groupings.

plot_ly(pca_res, x = ~PC1, y = ~PC2, z = ~PC3, color = ~dependent_clusters) %>% add_markers()
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Now looking at the clustering with income given extra weight. Interestingly, cluster 6 in this set combines observations that were split between clusters 5 and 6 in the previous set. We can see some other variations in how the observations are grouped compared with the previous model, but they still appear to make clearly distinct groups based on the PCA.

plot_ly(pca_res, x = ~PC1, y = ~PC2, z = ~PC3, color = ~income_clusters) %>% add_markers()
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Finally, our base clusters are below. Again, some slight variation from the other two. At least from the front view these clusters seem to match the dependents-weighted clusters more, though not perfectly. We can still make out clear distinctions among the groups formed this way.

plot_ly(pca_res, x = ~PC1, y = ~PC2, z = ~PC3, color = ~base_clusters) %>% add_markers()
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

The plots above demonstrate an inherent challenge in unsupervised machine learning: we don't really have a guided end goal. We want to separate people into groups to make it easy to target them with specific marketing strategies, but in this unguided fashion we're left to make best-guess judgments about which dimensions to divide those groups on. It's not a perfect science, and with additional data we would be able to make better distinctions based on the outcomes we want (e.g. training a supervised model on the likelihood of accepting a specific offer). For now though, this is still useful for segmenting our entire population into separate groups.

Did it Work?

Audience segmentation serves its specific purpose in marketing. Dividing customers into manageable groups that we can target with specific marketing campaigns makes it easier for organizations to provide tailored messaging that doesn't become 2,200 different messages. It helps strike a balance between writing specific copy for every individual and using one cookie-cutter campaign to target a diverse community. In that way, clustering has already helped. We can look at our 9 clusters and come up with the necessary campaign strategies to target them, and those strategies don't all have to be unique. For example, clusters 8 and 3 might both be good groups to target with a discount on diapers, while clusters 6 and 7 are better suited for offers on chips and salsa; meanwhile, clusters 6 and 3 might both be more likely to respond to offers via social media ads while clusters 8 and 7 are more likely to respond to coupons in the Sunday paper. Dividing our customers into more manageable groups is already helpful, but let's see just how effective these clusters are for our needs. First, we will see if there are any clear demographic differences between the clusters, then we will see if the clusters serve as good predictive variables, relative to our other available variables, for whether or not someone will accept an offer.

Are they Distinct?

The first step in determining just how helpful these clusters are is seeing whether they create distinct demographic groups that we can easily build distinct marketing strategies for. First, let's look at Income.

Despite not giving Income extra weight in these clusters, we still see some distinct differences. Clusters 1, 7 and 9 are distinctly higher income while the others all seem pretty similar.

marketing_initial_clust <- bind_cols(
  marketing_data_initial, 
  base_clusters       = base_clusters,
  dependent_clusters  = dependent_clusters,
  income_clusters     = income_clusters
  )

marketing_initial_clust %>%
  ggplot(aes(x = dependent_clusters, y = Income, fill = dependent_clusters)) +
  geom_violin() +
  geom_boxplot(width=0.1, color="black") + 
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

Do they have kids at home? Well, clusters 1, 7 and 9 (the same higher income groups from above) can be targeted specifically with messages that do not appeal to households with kids. We're getting some good information here with these clusters. While not leaning excessively towards children in the household, cluster 2 does seem to be one that might be more responsive to marketing messages aimed at households with children.

marketing_initial_clust %>%
  ggplot(aes(x = dependent_clusters, group = Kidhome, fill = Kidhome)) +
  geom_bar(position = "dodge") +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

We're seeing a trend! Clusters 1, 7 and 9 again all have no teens in the household either. We may be able to combine these three into a single cluster, but we're definitely seeing them emerge as a distinct group warranting its own marketing strategy.

marketing_initial_clust %>%
  ggplot(aes(x = dependent_clusters, group = Teenhome, fill = Teenhome)) +
  geom_bar(position = "dodge") +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

Here's where we see the difference. It's obvious from the graphs above that clusters 1, 7 and 9 all have no teens or kids in the household. This graph is the clearest one in which we can see that five of the other six clusters can all be targeted with a marketing strategy geared towards families, or at least towards households with at least one, if not more, minor dependents.

marketing_initial_clust %>%
  mutate(dependents = Teenhome+Kidhome) %>%
  ggplot(aes(x = dependent_clusters, group = dependents, fill = dependents)) +
  geom_bar(position = "dodge") +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

Not a lot of distinction on age. Clusters 2-5 seem to get gradually older, and while cluster 6 drops in median age, it still reaches a higher max age. Overall, though, not a lot of extreme differentiation on age.

marketing_initial_clust %>%
  mutate(dependents = Teenhome+Kidhome) %>%
  ggplot(aes(x = dependent_clusters, y = Age, fill = dependent_clusters)) +
  geom_violin() +
  geom_boxplot(width=0.1, color="black") + 
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

Some distinction among relationship statuses: clusters 3 and 5 seem to be in dating relationships(?) and cluster 6 is exclusively divorcees.

marketing_initial_clust %>%
  pivot_longer(starts_with("marital_"), names_to = "marital_status", values_to = "value") %>%
  select(dependent_clusters, marital_status, value) %>%
  filter(value == 1) %>%
  ggplot(aes(x = dependent_clusters, group = marital_status, fill = marital_status)) +
  geom_bar(position = "dodge", stat = "count") +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

Clusters 8 and 9 are exclusively people with Master's degrees, cluster 4 is exclusively PhDs, and cluster 1 is exclusively college graduates(?). If we want to market specifically on education or relationship status, these clusters give us some distinct groups to target.

marketing_initial_clust %>%
  pivot_longer(starts_with("education_"), names_to = "education", values_to = "value") %>%
  select(dependent_clusters, education, value) %>%
  filter(value == 1) %>%
  ggplot(aes(x = dependent_clusters, group = education, fill = education)) +
  geom_bar(position = "dodge", stat = "count") +
  theme_fira() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

Are they Predictive?

There are multiple ways to determine how important or helpful a variable is in predicting an outcome. For this example we will be measuring the variable importance in a random forest model.

First, let's take a look at the Response to the final campaign with only the variables we would have had at the beginning.

Gah!!! I'm so bad at this, I'm a terrible data scientist!!! I don't belong here!!! I was wrong: the income clusters are more important than the dependent clusters, and both are far less predictive than Income and Age and a little less predictive than the Teenhome variable. I knew income would have an impact on behavior here, but I honestly thought having kids and teens at home would have a larger impact and would be the better thing to weight our clusters on. Well, at least they did better than the base clusters, so my hunch was a little better than nothing. But there's still hope. We're looking at this a little more holistically than just this one campaign; there were 5 other offers made to customers, so let's see which variables were most important for those.

library(tidymodels)
library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
modeling_data <- marketing_data_initial %>%
  bind_cols(
    base_clusters       = base_clusters,
    dependent_clusters  = dependent_clusters,
    income_clusters     = income_clusters,
    Response            = marketing_data$Response,
    AcceptedCmp1        = marketing_data$AcceptedCmp1,
    AcceptedCmp2        = marketing_data$AcceptedCmp2,
    AcceptedCmp3        = marketing_data$AcceptedCmp3,
    AcceptedCmp4        = marketing_data$AcceptedCmp4,
    AcceptedCmp5        = marketing_data$AcceptedCmp5
  ) %>%
  mutate(across(c(starts_with("AcceptedCmp"), Response), ~factor(.x)))

rf_mod <- rand_forest(mode = "classification", trees = 100) %>% 
  set_engine("ranger", importance = "impurity")
  
rf_recipe <- recipe(Response ~ ., data = modeling_data %>% select(-starts_with("AcceptedCmp"))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10) +
  ggtitle("Response")

Yes!!! Amazing, I'm so great at this, I'm an incredible data scientist!!! I was right: clusters made with a higher weighting of dependents in the household are more helpful than clusters that weight income more. Even the base clusters are slightly more predictive than the income-weighted clusters. There's also a noticeable drop in importance after the dependents clusters; income and age are still the most important variables, but our new clusters add significant predictive value. Now that we only have four additional outcomes left to test against, let's put those outcomes in a pretty grid of plots.

rf_recipe <- recipe(AcceptedCmp1 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10) +
  ggtitle("AcceptedCmp1")

Exciting news! In all four of these, the dependent clusters were more predictive than the other clusters! In five out of six marketing campaigns, our clusters weighted more heavily on dependents added more predictive power than the baseline clusters or the income-weighted clusters. For all of these previous offers, income and age were the most important variables in making predictions, but in two of them the dependent clusters were the third most important variable! This is exciting: it indicates that our clusters probably add significant value as informative and useful segmentations of the customers when we only have demographic information available to us. But what about when we have additional customer data, such as their purchases?

rf_recipe <- recipe(AcceptedCmp2 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


AcceptedCmp2_plot <- rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10) +
  ggtitle("AcceptedCmp2")

rf_recipe <- recipe(AcceptedCmp3 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp2, AcceptedCmp4, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


AcceptedCmp3_plot <- rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10) +
  ggtitle("AcceptedCmp3")

rf_recipe <- recipe(AcceptedCmp4 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


AcceptedCmp4_plot <- rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10) +
  ggtitle("AcceptedCmp4")

rf_recipe <- recipe(AcceptedCmp5 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


AcceptedCmp5_plot <- rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10) +
  ggtitle("AcceptedCmp5")

gridExtra::grid.arrange(AcceptedCmp2_plot, AcceptedCmp3_plot, 
                        AcceptedCmp4_plot, AcceptedCmp5_plot,
                        nrow = 2)

Good news and bad news. Our clusters drop in importance for the random forest models when we include the additional variables, but that's understandable. Clusters built only on demographic data probably shouldn't be expected to hold up once we get more useful purchasing habits data. But they still break the top 25 and are used by the models to help make predictions, so these clusters do in fact still add value! It does look like our dependent clusters are more important than both of the other two sets of clusters in only one campaign, and more important than the income clusters in only two.

modeling_data <- marketing_data %>%
  bind_cols(
    base_clusters       = base_clusters,
    dependent_clusters  = dependent_clusters,
    income_clusters     = income_clusters
  ) %>%
  mutate(across(c(starts_with("AcceptedCmp"), Response), ~factor(.x)))

rf_mod <- rand_forest(mode = "classification", trees = 100) %>% 
  set_engine("ranger", importance = "impurity")
  
rf_recipe <- recipe(Response ~ ., data = modeling_data %>% select(-starts_with("AcceptedCmp"))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 25) +
  ggtitle("Response")

rf_recipe <- recipe(AcceptedCmp1 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 25) +
  ggtitle("AcceptedCmp1")

rf_recipe <- recipe(AcceptedCmp2 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 25) +
  ggtitle("AcceptedCmp2")

rf_recipe <- recipe(AcceptedCmp3 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp2, AcceptedCmp4, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 25) +
  ggtitle("AcceptedCmp3")

rf_recipe <- recipe(AcceptedCmp4 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp3, AcceptedCmp2, AcceptedCmp5))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 25) +
  ggtitle("AcceptedCmp4")

rf_recipe <- recipe(AcceptedCmp5 ~ ., data = modeling_data %>% select(-c(Response, AcceptedCmp1, AcceptedCmp3, AcceptedCmp4, AcceptedCmp2))) 

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)


rf_workflow %>% 
  fit(modeling_data) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 25) +
  ggtitle("AcceptedCmp5")