CLUSTER ANALYSIS FOR CUSTOMER SEGMENTATION

by Wesley Shen

01/09/17 8:35pm

Introduction

In early 2017, I embarked on a personal project inspired by some of the frustrations my wife and I experience when cooking with recipes. I was interested to see whether there are others out there who experience the same frustrations and what characterizes them in terms of their attitudes towards recipe purchase and usage. With a business idea in mind, I ultimately wanted to pinpoint key customer targets.

I created a survey containing demographic questions and a series of Likert-scale attitudinal questions, followed by some price-point questions that let me build a Van Westendorp Price Sensitivity Meter.

Key concepts in the following cluster analysis include:

  • normalizing data
  • hierarchical clustering and dendrograms (often used in genome biology)
  • Gower distance (as opposed to Euclidean distance)
  • k-medoids clustering (as opposed to k-means)
  • silhouette width
  • t-SNE dimensionality reduction for high-dimensional points

Load Packages

library(heatmaply)
library(dplyr)
library(tidyr)
library(dendextend)
library(cluster)
library(Rtsne)
library(ggplot2)

Read Data

## Observations: 70
## Variables: 27
## $ Timestamp         <chr> "2017/01/07 9:18:19 AM GMT+13", "2017/01/07 ...
## $ Age               <int> 30, 33, 30, 32, 33, 32, 52, 31, 57, 32, 63, ...
## $ Gender            <chr> "Male", "Male", "Female", "Male", "Female", ...
## $ City              <chr> "Auckland", "Auckland", "Hamilton", "Aucklan...
## $ Country           <chr> "NZ", "NZ", "NZ", "NZ", "NZ", "NZ", "Canada"...
## $ Status            <chr> "Married", "Married", "Married", "Married", ...
## $ Have_Children     <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Ye...
## $ Fan_Author        <int> 5, 4, 1, 3, 4, 5, 2, 4, 4, 2, 4, 5, 2, 3, 4,...
## $ Convenience       <int> 2, 4, 4, 3, 4, 5, 4, 1, 5, 2, 3, 4, 1, 2, 3,...
## $ Trust_Chef        <int> 5, 2, 4, 4, 3, 4, 2, 3, 4, 1, 3, 3, 3, 3, 2,...
## $ Trust_Reviews     <int> 1, 4, 4, 4, 4, 4, 3, 5, 4, 3, 4, 3, 4, 3, 3,...
## $ Photography       <int> 4, 4, 3, 2, 3, 4, 3, 2, 5, 2, 2, 2, 4, 3, 3,...
## $ Collect           <int> 3, 1, 1, 2, 2, 4, 3, 2, 1, 2, 4, 4, 2, 4, 4,...
## $ Price_Expensive   <int> 5, 5, 5, 3, 5, 5, 3, 3, 5, 3, 4, 4, 2, 4, 5,...
## $ Display           <int> 4, 1, 1, 4, 1, 5, 1, 3, 1, 3, 3, 1, 4, 4, 4,...
## $ Accuracy          <int> 5, 4, 5, 4, 1, 3, 5, 1, 5, 1, 4, 4, 4, 4, 2,...
## $ Small_Device      <int> 5, 5, 5, 5, 5, 4, 5, 5, 5, 3, 5, 5, 5, 5, 5,...
## $ Purchase          <int> 4, 2, 2, 2, 4, 4, 4, 2, 2, 3, 5, 4, 1, 4, 2,...
## $ Family_Friends    <int> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Printed <int> 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Free_Online       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Magazines         <int> 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ Too_Expensive     <dbl> 4.95, 3.00, 2.00, 2.00, 5.00, 5.00, 40.00, 1...
## $ Too_Cheap         <dbl> 0.25, 0.50, 0.01, 0.01, 1.00, 0.50, 1.00, 0....
## $ Getting_Expensive <dbl> 3.99, 2.00, 1.00, 1.50, 3.00, 3.00, 35.00, 5...
## $ Bargain           <dbl> 0.79, 0.99, 0.01, 0.30, 0.50, 1.00, 20.00, 1...

Visualize Data

Check Missing Data

Note the one missing value, in the marital status question (Status).

sapply(Recipes, function(x) sum(is.na(x)))
##         Timestamp               Age            Gender              City 
##                 0                 0                 0                 0 
##           Country            Status     Have_Children        Fan_Author 
##                 0                 1                 0                 0 
##       Convenience        Trust_Chef     Trust_Reviews       Photography 
##                 0                 0                 0                 0 
##           Collect   Price_Expensive           Display          Accuracy 
##                 0                 0                 0                 0 
##      Small_Device          Purchase    Family_Friends Purchased_Printed 
##                 0                 0                 0                 0 
## Purchased_Digital       Free_Online         Magazines     Too_Expensive 
##                 0                 0                 0                 0 
##         Too_Cheap Getting_Expensive           Bargain 
##                 0                 0                 0

Prune Data

I was mainly interested in respondents living in New Zealand, so I excluded those residing elsewhere. After that filter, 100% of the remaining respondents source free recipes online, so I also exclude ‘Free_Online’ from the cluster analysis, since a constant column provides no additional information. The pricing data is excluded as well.

Recipes_Clean <- Recipes %>%
  filter(Country == "NZ") %>%
  select(-Timestamp, -Country, -City, -Free_Online,
         -Too_Expensive, -Too_Cheap, -Getting_Expensive, -Bargain) %>%
  drop_na()

glimpse(Recipes_Clean)
## Observations: 61
## Variables: 19
## $ Age               <int> 30, 33, 30, 32, 33, 32, 31, 32, 63, 30, 24, ...
## $ Gender            <chr> "Male", "Male", "Female", "Male", "Female", ...
## $ Status            <chr> "Married", "Married", "Married", "Married", ...
## $ Have_Children     <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Ye...
## $ Fan_Author        <int> 5, 4, 1, 3, 4, 5, 4, 2, 4, 5, 2, 4, 3, 5, 5,...
## $ Convenience       <int> 2, 4, 4, 3, 4, 5, 1, 2, 3, 4, 1, 3, 4, 4, 3,...
## $ Trust_Chef        <int> 5, 2, 4, 4, 3, 4, 3, 1, 3, 3, 3, 2, 4, 5, 5,...
## $ Trust_Reviews     <int> 1, 4, 4, 4, 4, 4, 5, 3, 4, 3, 4, 3, 4, 3, 5,...
## $ Photography       <int> 4, 4, 3, 2, 3, 4, 2, 2, 2, 2, 4, 3, 3, 5, 4,...
## $ Collect           <int> 3, 1, 1, 2, 2, 4, 2, 2, 4, 4, 2, 4, 4, 5, 4,...
## $ Price_Expensive   <int> 5, 5, 5, 3, 5, 5, 3, 3, 4, 4, 2, 5, 5, 4, 5,...
## $ Display           <int> 4, 1, 1, 4, 1, 5, 3, 3, 3, 1, 4, 4, 3, 4, 1,...
## $ Accuracy          <int> 5, 4, 5, 4, 1, 3, 1, 1, 4, 4, 4, 2, 1, 1, 5,...
## $ Small_Device      <int> 5, 5, 5, 5, 5, 4, 5, 3, 5, 5, 5, 5, 4, 5, 4,...
## $ Purchase          <int> 4, 2, 2, 2, 4, 4, 2, 3, 5, 4, 1, 2, 4, 4, 2,...
## $ Family_Friends    <int> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ Purchased_Printed <int> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Magazines         <int> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,...
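
The reason for dropping ‘Free_Online’ can be checked mechanically: once the data is filtered, any column with a single distinct value cannot help separate respondents. The following is a minimal base R sketch on a hypothetical toy data frame (toy and its values are made up for illustration and are not the survey data).

```r
# Hypothetical toy data, not the survey responses
toy <- data.frame(
  Country     = c("NZ", "NZ", "Canada"),
  Free_Online = c(1, 1, 0),
  Magazines   = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# Keep New Zealand respondents only
nz <- toy[toy$Country == "NZ", ]

# Columns with exactly one distinct value after filtering
constant_cols <- names(nz)[sapply(nz, function(x) length(unique(x)) == 1)]
print(constant_cols)
```

In this toy example both Country and Free_Online become constant after the filter, which mirrors why both are removed in the select() call above.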

Transform Data

Just by visualizing the data, it is apparent that not all variables are normally distributed, so I did not standardize (z-score) the data. Instead, I use normalize() from heatmaply to rescale every variable to the 0-1 range, which puts the variables on a common scale and produces a useful heatmap across all of them.

Recipes_Clean$Gender <- as.numeric(factor(Recipes_Clean$Gender,levels=c("Male","Female")))
Recipes_Clean$Status <- as.numeric(factor(Recipes_Clean$Status))
Recipes_Clean$Have_Children <- ifelse(Recipes_Clean$Have_Children == "Yes",1,0)
Recipes_Clean$Purchase <- ifelse(Recipes_Clean$Purchase %in% c(4,5),1,0)

glimpse(Recipes_Clean)
## Observations: 61
## Variables: 19
## $ Age               <int> 30, 33, 30, 32, 33, 32, 31, 32, 63, 30, 24, ...
## $ Gender            <dbl> 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Status            <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 2, 2, 3,...
## $ Have_Children     <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,...
## $ Fan_Author        <int> 5, 4, 1, 3, 4, 5, 4, 2, 4, 5, 2, 4, 3, 5, 5,...
## $ Convenience       <int> 2, 4, 4, 3, 4, 5, 1, 2, 3, 4, 1, 3, 4, 4, 3,...
## $ Trust_Chef        <int> 5, 2, 4, 4, 3, 4, 3, 1, 3, 3, 3, 2, 4, 5, 5,...
## $ Trust_Reviews     <int> 1, 4, 4, 4, 4, 4, 5, 3, 4, 3, 4, 3, 4, 3, 5,...
## $ Photography       <int> 4, 4, 3, 2, 3, 4, 2, 2, 2, 2, 4, 3, 3, 5, 4,...
## $ Collect           <int> 3, 1, 1, 2, 2, 4, 2, 2, 4, 4, 2, 4, 4, 5, 4,...
## $ Price_Expensive   <int> 5, 5, 5, 3, 5, 5, 3, 3, 4, 4, 2, 5, 5, 4, 5,...
## $ Display           <int> 4, 1, 1, 4, 1, 5, 3, 3, 3, 1, 4, 4, 3, 4, 1,...
## $ Accuracy          <int> 5, 4, 5, 4, 1, 3, 1, 1, 4, 4, 4, 2, 1, 1, 5,...
## $ Small_Device      <int> 5, 5, 5, 5, 5, 4, 5, 3, 5, 5, 5, 5, 4, 5, 4,...
## $ Purchase          <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ Family_Friends    <int> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ Purchased_Printed <int> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Magazines         <int> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,...

Recipes_Norm <- normalize(Recipes_Clean)

glimpse(Recipes_Norm)
## Observations: 61
## Variables: 19
## $ Age               <dbl> 0.23255814, 0.30232558, 0.23255814, 0.279069...
## $ Gender            <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Status            <dbl> 0.6666667, 0.6666667, 0.6666667, 0.6666667, ...
## $ Have_Children     <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,...
## $ Fan_Author        <dbl> 1.00, 0.75, 0.00, 0.50, 0.75, 1.00, 0.75, 0....
## $ Convenience       <dbl> 0.25, 0.75, 0.75, 0.50, 0.75, 1.00, 0.00, 0....
## $ Trust_Chef        <dbl> 1.00, 0.25, 0.75, 0.75, 0.50, 0.75, 0.50, 0....
## $ Trust_Reviews     <dbl> 0.00, 0.75, 0.75, 0.75, 0.75, 0.75, 1.00, 0....
## $ Photography       <dbl> 0.75, 0.75, 0.50, 0.25, 0.50, 0.75, 0.25, 0....
## $ Collect           <dbl> 0.50, 0.00, 0.00, 0.25, 0.25, 0.75, 0.25, 0....
## $ Price_Expensive   <dbl> 1.00, 1.00, 1.00, 0.50, 1.00, 1.00, 0.50, 0....
## $ Display           <dbl> 0.75, 0.00, 0.00, 0.75, 0.00, 1.00, 0.50, 0....
## $ Accuracy          <dbl> 1.00, 0.75, 1.00, 0.75, 0.00, 0.50, 0.00, 0....
## $ Small_Device      <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 0.75, 1.00, 0....
## $ Purchase          <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ Family_Friends    <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ Purchased_Printed <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Magazines         <dbl> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,...
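
The rescaled values above are consistent with min-max scaling, (x - min) / (max - min). Here is a minimal sketch, assuming that this is what normalize() does for numeric columns. min_max is a hypothetical helper, and ages is a small illustrative vector whose minimum (20) and maximum (63) match the age range seen in the cluster summaries further down.

```r
# Hypothetical helper: rescale a numeric vector to the 0-1 range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Illustrative ages; 20 and 63 are the youngest and oldest NZ respondents
ages <- c(20, 30, 33, 63)

scaled <- min_max(ages)
print(round(scaled, 4))
```

Here 30 maps to 10/43 ≈ 0.2326 and 33 maps to 13/43 ≈ 0.3023, in line with the first two Age values in the glimpse() output above.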

Dendro Heatmap

Now that the data has been normalized, I create a heatmap that automatically places similar rows next to each other. To the right of the heatmap is a dendrogram showing 5 distinct clusters, represented by the different branch colours.

row_dend  <- Recipes_Norm %>% dist %>% hclust %>% as.dendrogram %>%
   set("branches_k_color", k = 5) %>% set("branches_lwd", c(1,3)) %>%
   ladderize
heatmaply(Recipes_Norm, xlab = "Features", ylab = "Respondents", main = "Heatmap (Normalized)", Rowv=row_dend)
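
The pipeline above builds the tree with dist() and hclust() and only colours its branches. To pull explicit cluster memberships out of such a tree, cutree() can be used. This is a minimal sketch on hypothetical one-dimensional toy data (not the survey data), relying on the defaults of Euclidean distance and complete linkage.

```r
# Hypothetical toy data: two obvious groups on a line
toy <- matrix(c(0, 0.1, 0.2, 5, 5.1, 5.2), ncol = 1)

# Hierarchical clustering on Euclidean distances (complete linkage by default)
hc <- hclust(dist(toy))

# Cut the tree into two groups and count members per group
groups <- cutree(hc, k = 2)
print(table(groups))
```

The same call on the respondent tree, with k set to the number of coloured branches, would give the membership behind each colour in the dendrogram.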

Calculate Distance

Because the data contains different variable types (e.g. ordinal, nominal, continuous), I use the Gower distance, computed with daisy() from the cluster package. As a sanity check, I look at the most similar and the most dissimilar pair of respondents based on their survey responses.

set.seed(1234)

library(cluster)
gower_dist <- daisy(Recipes_Clean, metric="gower")

# Sanity Check: Output most similar pair
gower_mat <- as.matrix(gower_dist)
Recipes_Clean[which(gower_mat == min(gower_mat[gower_mat !=min(gower_mat)]), arr.ind=TRUE)[1, ],]
## # A tibble: 2 x 19
##     Age Gender Status Have_Children Fan_Author Convenience Trust_Chef
##   <int>  <dbl>  <dbl>         <dbl>      <int>       <int>      <int>
## 1    35     2.     3.            1.          4           4          2
## 2    33     2.     3.            1.          4           4          3
## # ... with 12 more variables: Trust_Reviews <int>, Photography <int>,
## #   Collect <int>, Price_Expensive <int>, Display <int>, Accuracy <int>,
## #   Small_Device <int>, Purchase <dbl>, Family_Friends <int>,
## #   Purchased_Printed <int>, Purchased_Digital <int>, Magazines <int>

# Sanity Check: Output most dissimilar pair
Recipes_Clean[which(gower_mat == max(gower_mat[gower_mat !=max(gower_mat)]), arr.ind=TRUE)[1, ],]
## # A tibble: 2 x 19
##     Age Gender Status Have_Children Fan_Author Convenience Trust_Chef
##   <int>  <dbl>  <dbl>         <dbl>      <int>       <int>      <int>
## 1    34     1.     3.            1.          4           4          3
## 2    28     2.     3.            0.          3           2          4
## # ... with 12 more variables: Trust_Reviews <int>, Photography <int>,
## #   Collect <int>, Price_Expensive <int>, Display <int>, Accuracy <int>,
## #   Small_Device <int>, Purchase <dbl>, Family_Friends <int>,
## #   Purchased_Printed <int>, Purchased_Digital <int>, Magazines <int>
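
To make the distance itself less of a black box: for numeric variables, the Gower distance between two respondents is the average, across variables, of the absolute difference divided by that variable's range. Below is a minimal sketch on hypothetical toy data; gower_pair is an illustrative helper rather than part of the cluster package, and it covers only the all-numeric case.

```r
# Hypothetical toy data: three respondents, two numeric variables
toy <- data.frame(Age = c(20, 30, 60), Score = c(1, 3, 5))

# Gower distance between rows i and j for numeric columns:
# range-scaled absolute differences, averaged over the variables
gower_pair <- function(df, i, j) {
  ranges <- sapply(df, function(x) max(x) - min(x))
  diffs  <- abs(unlist(df[i, ]) - unlist(df[j, ]))
  mean(diffs / ranges)
}

manual <- gower_pair(toy, 1, 2)   # (10/40 + 2/4) / 2
print(manual)

# Optional cross-check against daisy(), if the cluster package is installed
if (requireNamespace("cluster", quietly = TRUE)) {
  from_daisy <- as.matrix(cluster::daisy(toy, metric = "gower"))[1, 2]
  print(all.equal(manual, from_daisy))
}
```

Because every variable is divided by its own range, a 10-year age gap and a 1-point gap on a 5-point scale contribute on comparable terms.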

Number of Clusters

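I use k-medoids clustering (PAM, partitioning around medoids), which follows essentially the same procedure as k-means but uses actual observations (medoids) as cluster centers rather than centroids. It can therefore work directly from a dissimilarity matrix such as the Gower distances.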
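
The difference between a centroid and a medoid is easiest to see on a small example. The sketch below uses hypothetical toy values with one outlier; the medoid is taken to be the observation with the smallest total distance to all the others.

```r
# Hypothetical toy data: four nearby points and one outlier
x <- c(1, 2, 3, 4, 100)
d <- as.matrix(dist(x))

# Centroid: the arithmetic mean, which need not be an observed value
centroid <- mean(x)

# Medoid: the observed point with the smallest summed distance to the rest
medoid_idx <- which.min(rowSums(d))
medoid     <- x[medoid_idx]

print(c(centroid = centroid, medoid = medoid))
```

The centroid is dragged towards the outlier, while the medoid stays on a real observation. A medoid also needs only pairwise dissimilarities, which is what makes it a natural fit for Gower distances.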

sil_width <- c(NA)

for(i in 2:10){
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

Five clusters yield the highest average silhouette width, meaning that, on average, respondents are best matched to their own cluster and poorly matched to neighbouring clusters. This is in line with the 5 distinct clusters identified in the heatmap.
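
For reference, the silhouette width of one observation is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. Below is a minimal base R sketch of the average silhouette width. avg_silhouette is a hypothetical helper written for illustration (it assumes at least two clusters and scores singleton clusters as 0), and the data is a toy example rather than the survey.

```r
# Hypothetical helper: average silhouette width from distances and labels
avg_silhouette <- function(d, labels) {
  d <- as.matrix(d)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- labels == labels[i]
    own[i] <- FALSE                       # exclude the point itself
    if (!any(own)) { s[i] <- 0; next }    # singleton cluster
    a <- mean(d[i, own])                  # cohesion within own cluster
    others <- setdiff(unique(labels), labels[i])
    b <- min(sapply(others, function(cl) mean(d[i, labels == cl])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Toy data: two well-separated pairs on a line
d <- dist(c(0, 1, 10, 11))

good <- avg_silhouette(d, c(1, 1, 2, 2))  # sensible grouping
bad  <- avg_silhouette(d, c(1, 2, 1, 2))  # deliberately mixed grouping
print(c(good = good, bad = bad))
```

A grouping that follows the structure of the data scores close to 1, while a mixed-up grouping goes negative, which is why the peak of the curve is used to choose the number of clusters.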

Cluster Summaries

pam_fit <- pam(gower_dist, diss = TRUE, k = 5)

pam_results <- Recipes_Clean %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

pam_results$the_summary
## [[1]]
##       Age            Gender          Status  Have_Children   Fan_Author  
##  Min.   :28.00   Min.   :1.000   Min.   :3   Min.   :1     Min.   :3.00  
##  1st Qu.:30.00   1st Qu.:2.000   1st Qu.:3   1st Qu.:1     1st Qu.:4.00  
##  Median :32.50   Median :2.000   Median :3   Median :1     Median :4.00  
##  Mean   :34.58   Mean   :1.917   Mean   :3   Mean   :1     Mean   :4.25  
##  3rd Qu.:37.75   3rd Qu.:2.000   3rd Qu.:3   3rd Qu.:1     3rd Qu.:5.00  
##  Max.   :47.00   Max.   :2.000   Max.   :3   Max.   :1     Max.   :5.00  
##   Convenience      Trust_Chef   Trust_Reviews    Photography   
##  Min.   :2.000   Min.   :2.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:2.75   1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :3.00   Median :3.000   Median :3.000  
##  Mean   :4.083   Mean   :3.50   Mean   :3.167   Mean   :2.833  
##  3rd Qu.:5.000   3rd Qu.:5.00   3rd Qu.:4.000   3rd Qu.:3.250  
##  Max.   :5.000   Max.   :5.00   Max.   :4.000   Max.   :4.000  
##     Collect     Price_Expensive    Display       Accuracy  
##  Min.   :1.00   Min.   :3.000   Min.   :1.0   Min.   :1.0  
##  1st Qu.:2.00   1st Qu.:4.000   1st Qu.:1.0   1st Qu.:3.0  
##  Median :3.00   Median :4.000   Median :2.0   Median :4.0  
##  Mean   :2.75   Mean   :4.167   Mean   :2.5   Mean   :3.5  
##  3rd Qu.:3.25   3rd Qu.:5.000   3rd Qu.:4.0   3rd Qu.:4.0  
##  Max.   :5.00   Max.   :5.000   Max.   :5.0   Max.   :5.0  
##   Small_Device      Purchase      Family_Friends Purchased_Printed
##  Min.   :2.000   Min.   :0.0000   Min.   :1      Min.   :0.0000   
##  1st Qu.:4.750   1st Qu.:1.0000   1st Qu.:1      1st Qu.:1.0000   
##  Median :5.000   Median :1.0000   Median :1      Median :1.0000   
##  Mean   :4.583   Mean   :0.9167   Mean   :1      Mean   :0.9167   
##  3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:1      3rd Qu.:1.0000   
##  Max.   :5.000   Max.   :1.0000   Max.   :1      Max.   :1.0000   
##  Purchased_Digital   Magazines         cluster 
##  Min.   :0.0000    Min.   :0.0000   Min.   :1  
##  1st Qu.:0.0000    1st Qu.:0.0000   1st Qu.:1  
##  Median :0.0000    Median :0.0000   Median :1  
##  Mean   :0.1667    Mean   :0.3333   Mean   :1  
##  3rd Qu.:0.0000    3rd Qu.:1.0000   3rd Qu.:1  
##  Max.   :1.0000    Max.   :1.0000   Max.   :1  
## 
## [[2]]
##       Age            Gender         Status      Have_Children   
##  Min.   :22.00   Min.   :1.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:24.00   1st Qu.:1.75   1st Qu.:3.000   1st Qu.:0.0000  
##  Median :28.50   Median :2.00   Median :3.000   Median :0.0000  
##  Mean   :30.33   Mean   :1.75   Mean   :2.833   Mean   :0.3333  
##  3rd Qu.:33.25   3rd Qu.:2.00   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :48.00   Max.   :2.00   Max.   :4.000   Max.   :1.0000  
##    Fan_Author    Convenience      Trust_Chef    Trust_Reviews  
##  Min.   :1.00   Min.   :1.000   Min.   :2.000   Min.   :3.000  
##  1st Qu.:2.75   1st Qu.:2.750   1st Qu.:3.000   1st Qu.:4.000  
##  Median :3.00   Median :4.000   Median :3.500   Median :4.000  
##  Mean   :3.00   Mean   :3.417   Mean   :3.333   Mean   :4.083  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.250  
##  Max.   :4.00   Max.   :5.000   Max.   :4.000   Max.   :5.000  
##   Photography       Collect    Price_Expensive    Display    
##  Min.   :2.000   Min.   :1.0   Min.   :2.00    Min.   :1.00  
##  1st Qu.:2.000   1st Qu.:1.0   1st Qu.:4.00    1st Qu.:1.00  
##  Median :3.000   Median :1.0   Median :4.00    Median :1.00  
##  Mean   :2.917   Mean   :1.5   Mean   :4.00    Mean   :1.75  
##  3rd Qu.:3.250   3rd Qu.:2.0   3rd Qu.:4.25    3rd Qu.:2.25  
##  Max.   :5.000   Max.   :3.0   Max.   :5.00    Max.   :4.00  
##     Accuracy      Small_Device      Purchase Family_Friends
##  Min.   :1.000   Min.   :4.000   Min.   :0   Min.   :1     
##  1st Qu.:3.000   1st Qu.:5.000   1st Qu.:0   1st Qu.:1     
##  Median :4.000   Median :5.000   Median :0   Median :1     
##  Mean   :3.583   Mean   :4.917   Mean   :0   Mean   :1     
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:0   3rd Qu.:1     
##  Max.   :5.000   Max.   :5.000   Max.   :0   Max.   :1     
##  Purchased_Printed Purchased_Digital   Magazines       cluster 
##  Min.   :0.0000    Min.   :0.00000   Min.   :0.00   Min.   :2  
##  1st Qu.:0.0000    1st Qu.:0.00000   1st Qu.:0.00   1st Qu.:2  
##  Median :0.0000    Median :0.00000   Median :0.00   Median :2  
##  Mean   :0.3333    Mean   :0.08333   Mean   :0.25   Mean   :2  
##  3rd Qu.:1.0000    3rd Qu.:0.00000   3rd Qu.:0.25   3rd Qu.:2  
##  Max.   :1.0000    Max.   :1.00000   Max.   :1.00   Max.   :2  
## 
## [[3]]
##       Age            Gender          Status     Have_Children 
##  Min.   :24.00   Min.   :1.000   Min.   :2.00   Min.   :0.00  
##  1st Qu.:28.75   1st Qu.:1.000   1st Qu.:2.75   1st Qu.:0.75  
##  Median :31.50   Median :2.000   Median :3.00   Median :1.00  
##  Mean   :31.50   Mean   :1.625   Mean   :2.75   Mean   :0.75  
##  3rd Qu.:32.50   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:1.00  
##  Max.   :44.00   Max.   :2.000   Max.   :3.00   Max.   :1.00  
##    Fan_Author    Convenience      Trust_Chef    Trust_Reviews  
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.75   1st Qu.:1.750   1st Qu.:2.500   1st Qu.:1.750  
##  Median :3.00   Median :2.000   Median :3.000   Median :4.000  
##  Mean   :3.00   Mean   :2.375   Mean   :3.125   Mean   :3.375  
##  3rd Qu.:3.25   3rd Qu.:3.000   3rd Qu.:4.250   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##   Photography      Collect     Price_Expensive    Display   
##  Min.   :2.00   Min.   :1.00   Min.   :1       Min.   :1.0  
##  1st Qu.:2.75   1st Qu.:1.00   1st Qu.:3       1st Qu.:1.0  
##  Median :3.00   Median :2.00   Median :3       Median :1.5  
##  Mean   :3.00   Mean   :2.00   Mean   :3       Mean   :2.0  
##  3rd Qu.:3.25   3rd Qu.:2.25   3rd Qu.:3       3rd Qu.:3.0  
##  Max.   :4.00   Max.   :4.00   Max.   :5       Max.   :4.0  
##     Accuracy     Small_Device    Purchase    Family_Friends
##  Min.   :1.00   Min.   :3.0   Min.   :0.00   Min.   :0     
##  1st Qu.:2.00   1st Qu.:4.0   1st Qu.:0.00   1st Qu.:0     
##  Median :3.50   Median :5.0   Median :0.00   Median :0     
##  Mean   :3.25   Mean   :4.5   Mean   :0.25   Mean   :0     
##  3rd Qu.:4.25   3rd Qu.:5.0   3rd Qu.:0.25   3rd Qu.:0     
##  Max.   :5.00   Max.   :5.0   Max.   :1.00   Max.   :0     
##  Purchased_Printed Purchased_Digital   Magazines        cluster 
##  Min.   :0.000     Min.   :0         Min.   :0.000   Min.   :3  
##  1st Qu.:0.000     1st Qu.:0         1st Qu.:0.000   1st Qu.:3  
##  Median :1.000     Median :0         Median :0.000   Median :3  
##  Mean   :0.625     Mean   :0         Mean   :0.375   Mean   :3  
##  3rd Qu.:1.000     3rd Qu.:0         3rd Qu.:1.000   3rd Qu.:3  
##  Max.   :1.000     Max.   :0         Max.   :1.000   Max.   :3  
## 
## [[4]]
##       Age            Gender      Status      Have_Children   
##  Min.   :20.00   Min.   :2   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:28.00   1st Qu.:2   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :30.50   Median :2   Median :3.000   Median :0.0000  
##  Mean   :34.44   Mean   :2   Mean   :2.667   Mean   :0.3333  
##  3rd Qu.:32.75   3rd Qu.:2   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :63.00   Max.   :2   Max.   :4.000   Max.   :1.0000  
##    Fan_Author     Convenience      Trust_Chef    Trust_Reviews  
##  Min.   :2.000   Min.   :2.000   Min.   :2.000   Min.   :1.000  
##  1st Qu.:3.250   1st Qu.:2.250   1st Qu.:3.000   1st Qu.:3.000  
##  Median :4.000   Median :4.000   Median :4.000   Median :4.000  
##  Mean   :4.111   Mean   :3.444   Mean   :3.833   Mean   :3.667  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.750   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##   Photography       Collect      Price_Expensive    Display     
##  Min.   :2.000   Min.   :2.000   Min.   :3       Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4       1st Qu.:2.250  
##  Median :4.000   Median :4.000   Median :4       Median :4.000  
##  Mean   :3.833   Mean   :4.056   Mean   :4       Mean   :3.389  
##  3rd Qu.:4.750   3rd Qu.:5.000   3rd Qu.:4       3rd Qu.:4.750  
##  Max.   :5.000   Max.   :5.000   Max.   :5       Max.   :5.000  
##     Accuracy      Small_Device      Purchase      Family_Friends  
##  Min.   :1.000   Min.   :4.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:5.000   1st Qu.:0.2500   1st Qu.:1.0000  
##  Median :4.000   Median :5.000   Median :1.0000   Median :1.0000  
##  Mean   :3.222   Mean   :4.778   Mean   :0.7222   Mean   :0.8333  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :5.000   Max.   :5.000   Max.   :1.0000   Max.   :1.0000  
##  Purchased_Printed Purchased_Digital   Magazines         cluster 
##  Min.   :1         Min.   :0.00000   Min.   :0.0000   Min.   :4  
##  1st Qu.:1         1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:4  
##  Median :1         Median :0.00000   Median :1.0000   Median :4  
##  Mean   :1         Mean   :0.05556   Mean   :0.9444   Mean   :4  
##  3rd Qu.:1         3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:4  
##  Max.   :1         Max.   :1.00000   Max.   :1.0000   Max.   :4  
## 
## [[5]]
##       Age            Gender      Status  Have_Children      Fan_Author   
##  Min.   :23.00   Min.   :2   Min.   :2   Min.   :0.0000   Min.   :2.000  
##  1st Qu.:30.50   1st Qu.:2   1st Qu.:3   1st Qu.:1.0000   1st Qu.:2.000  
##  Median :32.00   Median :2   Median :3   Median :1.0000   Median :4.000  
##  Mean   :34.18   Mean   :2   Mean   :3   Mean   :0.9091   Mean   :3.182  
##  3rd Qu.:34.50   3rd Qu.:2   3rd Qu.:3   3rd Qu.:1.0000   3rd Qu.:4.000  
##  Max.   :56.00   Max.   :2   Max.   :4   Max.   :1.0000   Max.   :4.000  
##   Convenience      Trust_Chef    Trust_Reviews    Photography   
##  Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   :2.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.500   1st Qu.:2.000  
##  Median :2.000   Median :2.000   Median :3.000   Median :3.000  
##  Mean   :2.727   Mean   :2.545   Mean   :3.182   Mean   :3.182  
##  3rd Qu.:3.500   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     Collect      Price_Expensive    Display         Accuracy    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :3.000   Median :3.000   Median :1.000  
##  Mean   :2.636   Mean   :3.182   Mean   :3.091   Mean   :1.727  
##  3rd Qu.:3.500   3rd Qu.:4.500   3rd Qu.:4.500   3rd Qu.:2.500  
##  Max.   :4.000   Max.   :5.000   Max.   :5.000   Max.   :3.000  
##   Small_Device      Purchase       Family_Friends Purchased_Printed
##  Min.   :1.000   Min.   :0.00000   Min.   :1      Min.   :0.0000   
##  1st Qu.:3.000   1st Qu.:0.00000   1st Qu.:1      1st Qu.:1.0000   
##  Median :4.000   Median :0.00000   Median :1      Median :1.0000   
##  Mean   :3.636   Mean   :0.09091   Mean   :1      Mean   :0.9091   
##  3rd Qu.:5.000   3rd Qu.:0.00000   3rd Qu.:1      3rd Qu.:1.0000   
##  Max.   :5.000   Max.   :1.00000   Max.   :1      Max.   :1.0000   
##  Purchased_Digital   Magazines    cluster 
##  Min.   :0         Min.   :1   Min.   :5  
##  1st Qu.:0         1st Qu.:1   1st Qu.:5  
##  Median :0         Median :1   Median :5  
##  Mean   :0         Mean   :1   Mean   :5  
##  3rd Qu.:0         3rd Qu.:1   3rd Qu.:5  
##  Max.   :0         Max.   :1   Max.   :5
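
The full summary() output is long, so a table of per-cluster means can be easier to scan when comparing segments. Here is a minimal base R sketch using aggregate() on a hypothetical toy data frame (the values are made up; in the analysis the input would be the data with the cluster column attached).

```r
# Hypothetical toy data with a cluster label per respondent
toy <- data.frame(
  cluster  = c(1, 1, 2, 2),
  Purchase = c(1, 1, 0, 1),
  Collect  = c(3, 5, 1, 2)
)

# Mean of every variable within each cluster
cluster_means <- aggregate(. ~ cluster, data = toy, FUN = mean)
print(cluster_means)
```

For the 0/1 variables, the mean reads directly as the share of respondents in the cluster who answered yes.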

Cluster One: Most likely to purchase individual recipes, provided the criteria below are met and the price is acceptable

  1. Have children in the household
  2. Fans of the author
  3. Likely to trust the author more than the reviews
  4. Use recipes that require less time in the kitchen
  5. Generally find recipe books a little on the expensive side

Cluster Two: Least likely to purchase

  1. No children
  2. Trust reviews
  3. Use a small device when referring to a recipe
  4. Least likely to purchase printed recipe books

Cluster Three: Unlikely to purchase

  1. Favour printed recipes
  2. Have never purchased digital recipes

Cluster Four: Likely to purchase

  1. All female
  2. Fans of the author
  3. Expect beautiful photography
  4. Have a recipe book “collection”
  5. Use a small device when referring to a recipe

Cluster Five: No distinct features compared to other clusters

Visualize Clusters

tsne_obj <- Rtsne(gower_dist, dims=2, perplexity=20, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering))

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

Conclusion

From this cluster analysis, I’ve identified two target customer segments (Clusters One and Four, the two groups most likely to purchase). It also gives me a good springboard to a classification problem: working out which segment a new customer is likely to belong to and predicting their propensity to purchase.