01/09/17 8:35pm
In early 2017, I embarked on a personal project inspired by some of the frustrations my wife and I experience when cooking with recipes. I was interested to see whether there are others out there who experience the same frustrations and what characterizes them in terms of their attitudes towards recipe purchase and usage. With a business idea in mind, I ultimately wanted to pinpoint key customer targets.
I created a survey containing demographic questions and a series of Likert-scale attitudinal questions, followed by a few price-point questions that let me build a Van Westendorp Price Sensitivity Meter.
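Key concepts in the following cluster analysis include min-max normalization, heatmaps with hierarchical-clustering dendrograms, Gower distance, K-medoids (PAM) clustering, silhouette width and t-SNE.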
library(heatmaply)
library(dplyr)
library(tidyr)
library(dendextend)
library(cluster)
library(Rtsne)
library(ggplot2)
## Observations: 70
## Variables: 27
## $ Timestamp <chr> "2017/01/07 9:18:19 AM GMT+13", "2017/01/07 ...
## $ Age <int> 30, 33, 30, 32, 33, 32, 52, 31, 57, 32, 63, ...
## $ Gender <chr> "Male", "Male", "Female", "Male", "Female", ...
## $ City <chr> "Auckland", "Auckland", "Hamilton", "Aucklan...
## $ Country <chr> "NZ", "NZ", "NZ", "NZ", "NZ", "NZ", "Canada"...
## $ Status <chr> "Married", "Married", "Married", "Married", ...
## $ Have_Children <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Ye...
## $ Fan_Author <int> 5, 4, 1, 3, 4, 5, 2, 4, 4, 2, 4, 5, 2, 3, 4,...
## $ Convenience <int> 2, 4, 4, 3, 4, 5, 4, 1, 5, 2, 3, 4, 1, 2, 3,...
## $ Trust_Chef <int> 5, 2, 4, 4, 3, 4, 2, 3, 4, 1, 3, 3, 3, 3, 2,...
## $ Trust_Reviews <int> 1, 4, 4, 4, 4, 4, 3, 5, 4, 3, 4, 3, 4, 3, 3,...
## $ Photography <int> 4, 4, 3, 2, 3, 4, 3, 2, 5, 2, 2, 2, 4, 3, 3,...
## $ Collect <int> 3, 1, 1, 2, 2, 4, 3, 2, 1, 2, 4, 4, 2, 4, 4,...
## $ Price_Expensive <int> 5, 5, 5, 3, 5, 5, 3, 3, 5, 3, 4, 4, 2, 4, 5,...
## $ Display <int> 4, 1, 1, 4, 1, 5, 1, 3, 1, 3, 3, 1, 4, 4, 4,...
## $ Accuracy <int> 5, 4, 5, 4, 1, 3, 5, 1, 5, 1, 4, 4, 4, 4, 2,...
## $ Small_Device <int> 5, 5, 5, 5, 5, 4, 5, 5, 5, 3, 5, 5, 5, 5, 5,...
## $ Purchase <int> 4, 2, 2, 2, 4, 4, 4, 2, 2, 3, 5, 4, 1, 4, 2,...
## $ Family_Friends <int> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Printed <int> 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Free_Online <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Magazines <int> 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ Too_Expensive <dbl> 4.95, 3.00, 2.00, 2.00, 5.00, 5.00, 40.00, 1...
## $ Too_Cheap <dbl> 0.25, 0.50, 0.01, 0.01, 1.00, 0.50, 1.00, 0....
## $ Getting_Expensive <dbl> 3.99, 2.00, 1.00, 1.50, 3.00, 3.00, 35.00, 5...
## $ Bargain <dbl> 0.79, 0.99, 0.01, 0.30, 0.50, 1.00, 20.00, 1...
Note the one missing value, in the marital status question (Status).
sapply(Recipes, function(x) sum(is.na(x)))
## Timestamp Age Gender City
## 0 0 0 0
## Country Status Have_Children Fan_Author
## 0 1 0 0
## Convenience Trust_Chef Trust_Reviews Photography
## 0 0 0 0
## Collect Price_Expensive Display Accuracy
## 0 0 0 0
## Small_Device Purchase Family_Friends Purchased_Printed
## 0 0 0 0
## Purchased_Digital Free_Online Magazines Too_Expensive
## 0 0 0 0
## Too_Cheap Getting_Expensive Bargain
## 0 0 0
I was mainly interested in respondents living in New Zealand. After excluding those residing elsewhere, 100% of the remaining respondents source free recipes online, so I drop ‘Free_Online’ from the cluster analysis because it carries no additional information. I also exclude the timestamp, location and pricing columns, and drop the one row with a missing value.
Recipes_Clean <- Recipes %>%
filter(Country == "NZ") %>%
select(-Timestamp, -Country, -City, -Free_Online, -Too_Expensive, -Too_Cheap, -Getting_Expensive, -Bargain) %>%
drop_na()
glimpse(Recipes_Clean)
## Observations: 61
## Variables: 19
## $ Age <int> 30, 33, 30, 32, 33, 32, 31, 32, 63, 30, 24, ...
## $ Gender <chr> "Male", "Male", "Female", "Male", "Female", ...
## $ Status <chr> "Married", "Married", "Married", "Married", ...
## $ Have_Children <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Ye...
## $ Fan_Author <int> 5, 4, 1, 3, 4, 5, 4, 2, 4, 5, 2, 4, 3, 5, 5,...
## $ Convenience <int> 2, 4, 4, 3, 4, 5, 1, 2, 3, 4, 1, 3, 4, 4, 3,...
## $ Trust_Chef <int> 5, 2, 4, 4, 3, 4, 3, 1, 3, 3, 3, 2, 4, 5, 5,...
## $ Trust_Reviews <int> 1, 4, 4, 4, 4, 4, 5, 3, 4, 3, 4, 3, 4, 3, 5,...
## $ Photography <int> 4, 4, 3, 2, 3, 4, 2, 2, 2, 2, 4, 3, 3, 5, 4,...
## $ Collect <int> 3, 1, 1, 2, 2, 4, 2, 2, 4, 4, 2, 4, 4, 5, 4,...
## $ Price_Expensive <int> 5, 5, 5, 3, 5, 5, 3, 3, 4, 4, 2, 5, 5, 4, 5,...
## $ Display <int> 4, 1, 1, 4, 1, 5, 3, 3, 3, 1, 4, 4, 3, 4, 1,...
## $ Accuracy <int> 5, 4, 5, 4, 1, 3, 1, 1, 4, 4, 4, 2, 1, 1, 5,...
## $ Small_Device <int> 5, 5, 5, 5, 5, 4, 5, 3, 5, 5, 5, 5, 4, 5, 4,...
## $ Purchase <int> 4, 2, 2, 2, 4, 4, 2, 3, 5, 4, 1, 2, 4, 4, 2,...
## $ Family_Friends <int> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ Purchased_Printed <int> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Magazines <int> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,...
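Visualizing the data makes it apparent that not all variables are normally distributed, so I did not standardize them. Instead, I first convert the categorical variables to numeric codes and collapse Purchase into a binary flag (ratings of 4 or 5 become 1), then use normalize() to rescale every variable to the 0 to 1 range. This puts the variables on a common scale and produces a useful heatmap for all of them.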
Recipes_Clean$Gender <- as.numeric(factor(Recipes_Clean$Gender,levels=c("Male","Female")))
Recipes_Clean$Status <- as.numeric(factor(Recipes_Clean$Status))
Recipes_Clean$Have_Children <- ifelse(Recipes_Clean$Have_Children == "Yes",1,0)
Recipes_Clean$Purchase <- ifelse(Recipes_Clean$Purchase %in% c(4,5),1,0)
glimpse(Recipes_Clean)
## Observations: 61
## Variables: 19
## $ Age <int> 30, 33, 30, 32, 33, 32, 31, 32, 63, 30, 24, ...
## $ Gender <dbl> 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Status <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 2, 2, 3,...
## $ Have_Children <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,...
## $ Fan_Author <int> 5, 4, 1, 3, 4, 5, 4, 2, 4, 5, 2, 4, 3, 5, 5,...
## $ Convenience <int> 2, 4, 4, 3, 4, 5, 1, 2, 3, 4, 1, 3, 4, 4, 3,...
## $ Trust_Chef <int> 5, 2, 4, 4, 3, 4, 3, 1, 3, 3, 3, 2, 4, 5, 5,...
## $ Trust_Reviews <int> 1, 4, 4, 4, 4, 4, 5, 3, 4, 3, 4, 3, 4, 3, 5,...
## $ Photography <int> 4, 4, 3, 2, 3, 4, 2, 2, 2, 2, 4, 3, 3, 5, 4,...
## $ Collect <int> 3, 1, 1, 2, 2, 4, 2, 2, 4, 4, 2, 4, 4, 5, 4,...
## $ Price_Expensive <int> 5, 5, 5, 3, 5, 5, 3, 3, 4, 4, 2, 5, 5, 4, 5,...
## $ Display <int> 4, 1, 1, 4, 1, 5, 3, 3, 3, 1, 4, 4, 3, 4, 1,...
## $ Accuracy <int> 5, 4, 5, 4, 1, 3, 1, 1, 4, 4, 4, 2, 1, 1, 5,...
## $ Small_Device <int> 5, 5, 5, 5, 5, 4, 5, 3, 5, 5, 5, 5, 4, 5, 4,...
## $ Purchase <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ Family_Friends <int> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ Purchased_Printed <int> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Magazines <int> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,...
Recipes_Norm <- normalize(Recipes_Clean)
glimpse(Recipes_Norm)
## Observations: 61
## Variables: 19
## $ Age <dbl> 0.23255814, 0.30232558, 0.23255814, 0.279069...
## $ Gender <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Status <dbl> 0.6666667, 0.6666667, 0.6666667, 0.6666667, ...
## $ Have_Children <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,...
## $ Fan_Author <dbl> 1.00, 0.75, 0.00, 0.50, 0.75, 1.00, 0.75, 0....
## $ Convenience <dbl> 0.25, 0.75, 0.75, 0.50, 0.75, 1.00, 0.00, 0....
## $ Trust_Chef <dbl> 1.00, 0.25, 0.75, 0.75, 0.50, 0.75, 0.50, 0....
## $ Trust_Reviews <dbl> 0.00, 0.75, 0.75, 0.75, 0.75, 0.75, 1.00, 0....
## $ Photography <dbl> 0.75, 0.75, 0.50, 0.25, 0.50, 0.75, 0.25, 0....
## $ Collect <dbl> 0.50, 0.00, 0.00, 0.25, 0.25, 0.75, 0.25, 0....
## $ Price_Expensive <dbl> 1.00, 1.00, 1.00, 0.50, 1.00, 1.00, 0.50, 0....
## $ Display <dbl> 0.75, 0.00, 0.00, 0.75, 0.00, 1.00, 0.50, 0....
## $ Accuracy <dbl> 1.00, 0.75, 1.00, 0.75, 0.00, 0.50, 0.00, 0....
## $ Small_Device <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 0.75, 1.00, 0....
## $ Purchase <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ Family_Friends <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ Purchased_Printed <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Purchased_Digital <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Magazines <dbl> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,...
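The normalized values above are consistent with simple min-max rescaling: for example, an Age of 30 becomes (30 - 20) / (63 - 20) ≈ 0.2326, where 20 and 63 are the youngest and oldest NZ respondents. Here is a minimal base R sketch of that idea. `rescale01` is a hypothetical helper written for illustration, not a function from heatmaply, and its handling of constant columns is my own assumption.

```r
# Min-max rescaling of one numeric vector to the [0, 1] range.
# rescale01 is a hypothetical helper name, not part of heatmaply.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Assumption: map a constant column to 0 to avoid dividing by zero
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy vector that includes the observed extremes of Age (20 and 63)
age <- c(20, 30, 33, 63)
age_scaled <- rescale01(age)
print(round(age_scaled, 4))

# A 1-5 Likert item is mapped onto five evenly spaced values
likert_scaled <- rescale01(1:5)
print(likert_scaled)
```

Unlike z-score standardization, this makes no assumption about the shape of each distribution; it only pins each column's minimum to 0 and maximum to 1.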
Now that the data has been normalized, I create a heatmap that automatically reorders the rows so that similar respondents sit together. To the right of the heatmap is a dendrogram showing 5 distinct clusters, each drawn in a different colour.
# Hierarchical clustering of respondents; colour the branches of 5 clusters
row_dend <- Recipes_Norm %>% dist %>% hclust %>% as.dendrogram %>%
  set("branches_k_color", k = 5) %>% set("branches_lwd", c(1,3)) %>%
  ladderize
heatmaply(Recipes_Norm, xlab = "Features", ylab = "Respondents", main = "Heatmap (Normalized)", Rowv=row_dend)
Because the data contains different variable types (e.g. ordinal, nominal, continuous), I use Gower distance via daisy() from the cluster package. As a sanity check, I look at the most similar and the most dissimilar pair of respondents based on their survey responses.
set.seed(1234)
library(cluster)
gower_dist <- daisy(Recipes_Clean, metric="gower")
# Sanity Check: Output most similar pair
gower_mat <- as.matrix(gower_dist)
Recipes_Clean[which(gower_mat == min(gower_mat[gower_mat !=min(gower_mat)]), arr.ind=TRUE)[1, ],]
## # A tibble: 2 x 19
## Age Gender Status Have_Children Fan_Author Convenience Trust_Chef
## <int> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 35 2. 3. 1. 4 4 2
## 2 33 2. 3. 1. 4 4 3
## # ... with 12 more variables: Trust_Reviews <int>, Photography <int>,
## # Collect <int>, Price_Expensive <int>, Display <int>, Accuracy <int>,
## # Small_Device <int>, Purchase <dbl>, Family_Friends <int>,
## # Purchased_Printed <int>, Purchased_Digital <int>, Magazines <int>
# Sanity Check: Output most dissimilar pair
Recipes_Clean[which(gower_mat == max(gower_mat[gower_mat !=max(gower_mat)]), arr.ind=TRUE)[1, ],]
## # A tibble: 2 x 19
## Age Gender Status Have_Children Fan_Author Convenience Trust_Chef
## <int> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 34 1. 3. 1. 4 4 3
## 2 28 2. 3. 0. 3 2 4
## # ... with 12 more variables: Trust_Reviews <int>, Photography <int>,
## # Collect <int>, Price_Expensive <int>, Display <int>, Accuracy <int>,
## # Small_Device <int>, Purchase <dbl>, Family_Friends <int>,
## # Purchased_Printed <int>, Purchased_Digital <int>, Magazines <int>
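To make the Gower distance less of a black box, here is a small hand computation on made-up data. For each variable it takes a partial dissimilarity between 0 and 1 (range-scaled absolute difference for numeric columns, simple mismatch for categorical ones) and averages them. `gower_pair` and `toy` are hypothetical names for this sketch, which ignores missing values and variable weights.

```r
# Gower distance between two rows, computed by hand.
# Numeric columns: |a - b| / range of the column.
# Categorical columns: 0 if the values match, 1 otherwise.
# toy is made-up data, not the survey; gower_pair is a hypothetical helper.
toy <- data.frame(
  Age         = c(20, 30, 60),
  Convenience = c(1, 3, 5),
  Gender      = factor(c("Male", "Female", "Female"))
)

gower_pair <- function(df, i, j) {
  per_column <- sapply(df, function(col) {
    if (is.numeric(col)) {
      abs(col[i] - col[j]) / diff(range(col))
    } else {
      as.numeric(col[i] != col[j])
    }
  })
  mean(per_column)
}

d12 <- gower_pair(toy, 1, 2)  # (10/40 + 2/4 + 1) / 3
d23 <- gower_pair(toy, 2, 3)  # (30/40 + 2/4 + 0) / 3
print(c(d12 = d12, d23 = d23))
```

One thing worth checking, as far as I understand daisy(): it infers each variable's type from the column class, so columns that have already been converted with as.numeric() (such as Status) would be treated as numeric rather than nominal. Keeping them as factors would make the mismatch rule above apply instead.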
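I use K-medoids clustering (PAM, partitioning around medoids), which follows essentially the same procedure as K-means but uses actual observations (medoids) as cluster centres rather than centroids. To choose the number of clusters, I fit PAM for k = 2 to 10 and record the average silhouette width of each fit.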
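The difference between a medoid and a centroid can be seen on a tiny made-up example. This is only an illustrative sketch in base R, not the PAM algorithm itself: it finds the single observation with the smallest total distance to all others and compares it with the mean.

```r
# Toy 1-D example: the medoid is the observation with the smallest total
# distance to all other observations, so it is always a real data point.
x <- c(1, 2, 3, 5, 10)              # made-up values with one outlier
d <- as.matrix(dist(x))             # pairwise distances

medoid_index <- which.min(rowSums(d))
medoid_value <- x[medoid_index]
centroid     <- mean(x)

print(medoid_value)  # one of the observed values
print(centroid)      # the mean, pulled towards the outlier
```

Because the medoid only needs pairwise distances, the same idea works directly on a Gower dissimilarity matrix, where a mean of mixed-type variables would not be meaningful.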
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width)
A five-cluster solution yields the highest average silhouette width, meaning that respondents are, on average, best matched to their own cluster and poorly matched to neighbouring clusters. This is in line with the 5 distinct clusters identified in the heatmap.
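For readers unfamiliar with the silhouette width, here is a hand computation on made-up data. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). The sketch assumes every cluster has at least two members; `silhouette_one` is a hypothetical helper name.

```r
# Silhouette width computed by hand on made-up 1-D data.
# a = mean distance to the other members of the point's own cluster
# b = smallest mean distance to the members of any other cluster
# s = (b - a) / max(a, b); values near 1 indicate a well-matched point.
# Assumption: every cluster has at least two members.
x        <- c(1, 2, 3, 10, 11, 12)
clusters <- c(1, 1, 1, 2, 2, 2)
d        <- as.matrix(dist(x))

silhouette_one <- function(i) {
  own <- clusters == clusters[i]
  own[i] <- FALSE  # exclude the point itself
  a <- mean(d[i, own])
  other_ids <- setdiff(unique(clusters), clusters[i])
  b <- min(sapply(other_ids, function(k) mean(d[i, clusters == k])))
  (b - a) / max(a, b)
}

sil     <- sapply(seq_along(x), silhouette_one)
avg_sil <- mean(sil)
print(round(sil, 3))
print(round(avg_sil, 3))
```

The average over all points is the kind of quantity stored in pam_fit$silinfo$avg.width, which is why the number of clusters with the highest value is preferred.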
pam_fit <- pam(gower_dist, diss = TRUE, k = 5)
pam_results <- Recipes_Clean %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## Age Gender Status Have_Children Fan_Author
## Min. :28.00 Min. :1.000 Min. :3 Min. :1 Min. :3.00
## 1st Qu.:30.00 1st Qu.:2.000 1st Qu.:3 1st Qu.:1 1st Qu.:4.00
## Median :32.50 Median :2.000 Median :3 Median :1 Median :4.00
## Mean :34.58 Mean :1.917 Mean :3 Mean :1 Mean :4.25
## 3rd Qu.:37.75 3rd Qu.:2.000 3rd Qu.:3 3rd Qu.:1 3rd Qu.:5.00
## Max. :47.00 Max. :2.000 Max. :3 Max. :1 Max. :5.00
## Convenience Trust_Chef Trust_Reviews Photography
## Min. :2.000 Min. :2.00 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:2.75 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000 Median :3.000
## Mean :4.083 Mean :3.50 Mean :3.167 Mean :2.833
## 3rd Qu.:5.000 3rd Qu.:5.00 3rd Qu.:4.000 3rd Qu.:3.250
## Max. :5.000 Max. :5.00 Max. :4.000 Max. :4.000
## Collect Price_Expensive Display Accuracy
## Min. :1.00 Min. :3.000 Min. :1.0 Min. :1.0
## 1st Qu.:2.00 1st Qu.:4.000 1st Qu.:1.0 1st Qu.:3.0
## Median :3.00 Median :4.000 Median :2.0 Median :4.0
## Mean :2.75 Mean :4.167 Mean :2.5 Mean :3.5
## 3rd Qu.:3.25 3rd Qu.:5.000 3rd Qu.:4.0 3rd Qu.:4.0
## Max. :5.00 Max. :5.000 Max. :5.0 Max. :5.0
## Small_Device Purchase Family_Friends Purchased_Printed
## Min. :2.000 Min. :0.0000 Min. :1 Min. :0.0000
## 1st Qu.:4.750 1st Qu.:1.0000 1st Qu.:1 1st Qu.:1.0000
## Median :5.000 Median :1.0000 Median :1 Median :1.0000
## Mean :4.583 Mean :0.9167 Mean :1 Mean :0.9167
## 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.:1 3rd Qu.:1.0000
## Max. :5.000 Max. :1.0000 Max. :1 Max. :1.0000
## Purchased_Digital Magazines cluster
## Min. :0.0000 Min. :0.0000 Min. :1
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1
## Median :0.0000 Median :0.0000 Median :1
## Mean :0.1667 Mean :0.3333 Mean :1
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1
## Max. :1.0000 Max. :1.0000 Max. :1
##
## [[2]]
## Age Gender Status Have_Children
## Min. :22.00 Min. :1.00 Min. :1.000 Min. :0.0000
## 1st Qu.:24.00 1st Qu.:1.75 1st Qu.:3.000 1st Qu.:0.0000
## Median :28.50 Median :2.00 Median :3.000 Median :0.0000
## Mean :30.33 Mean :1.75 Mean :2.833 Mean :0.3333
## 3rd Qu.:33.25 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :48.00 Max. :2.00 Max. :4.000 Max. :1.0000
## Fan_Author Convenience Trust_Chef Trust_Reviews
## Min. :1.00 Min. :1.000 Min. :2.000 Min. :3.000
## 1st Qu.:2.75 1st Qu.:2.750 1st Qu.:3.000 1st Qu.:4.000
## Median :3.00 Median :4.000 Median :3.500 Median :4.000
## Mean :3.00 Mean :3.417 Mean :3.333 Mean :4.083
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.250
## Max. :4.00 Max. :5.000 Max. :4.000 Max. :5.000
## Photography Collect Price_Expensive Display
## Min. :2.000 Min. :1.0 Min. :2.00 Min. :1.00
## 1st Qu.:2.000 1st Qu.:1.0 1st Qu.:4.00 1st Qu.:1.00
## Median :3.000 Median :1.0 Median :4.00 Median :1.00
## Mean :2.917 Mean :1.5 Mean :4.00 Mean :1.75
## 3rd Qu.:3.250 3rd Qu.:2.0 3rd Qu.:4.25 3rd Qu.:2.25
## Max. :5.000 Max. :3.0 Max. :5.00 Max. :4.00
## Accuracy Small_Device Purchase Family_Friends
## Min. :1.000 Min. :4.000 Min. :0 Min. :1
## 1st Qu.:3.000 1st Qu.:5.000 1st Qu.:0 1st Qu.:1
## Median :4.000 Median :5.000 Median :0 Median :1
## Mean :3.583 Mean :4.917 Mean :0 Mean :1
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:0 3rd Qu.:1
## Max. :5.000 Max. :5.000 Max. :0 Max. :1
## Purchased_Printed Purchased_Digital Magazines cluster
## Min. :0.0000 Min. :0.00000 Min. :0.00 Min. :2
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00 1st Qu.:2
## Median :0.0000 Median :0.00000 Median :0.00 Median :2
## Mean :0.3333 Mean :0.08333 Mean :0.25 Mean :2
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.25 3rd Qu.:2
## Max. :1.0000 Max. :1.00000 Max. :1.00 Max. :2
##
## [[3]]
## Age Gender Status Have_Children
## Min. :24.00 Min. :1.000 Min. :2.00 Min. :0.00
## 1st Qu.:28.75 1st Qu.:1.000 1st Qu.:2.75 1st Qu.:0.75
## Median :31.50 Median :2.000 Median :3.00 Median :1.00
## Mean :31.50 Mean :1.625 Mean :2.75 Mean :0.75
## 3rd Qu.:32.50 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:1.00
## Max. :44.00 Max. :2.000 Max. :3.00 Max. :1.00
## Fan_Author Convenience Trust_Chef Trust_Reviews
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.75 1st Qu.:1.750 1st Qu.:2.500 1st Qu.:1.750
## Median :3.00 Median :2.000 Median :3.000 Median :4.000
## Mean :3.00 Mean :2.375 Mean :3.125 Mean :3.375
## 3rd Qu.:3.25 3rd Qu.:3.000 3rd Qu.:4.250 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000
## Photography Collect Price_Expensive Display
## Min. :2.00 Min. :1.00 Min. :1 Min. :1.0
## 1st Qu.:2.75 1st Qu.:1.00 1st Qu.:3 1st Qu.:1.0
## Median :3.00 Median :2.00 Median :3 Median :1.5
## Mean :3.00 Mean :2.00 Mean :3 Mean :2.0
## 3rd Qu.:3.25 3rd Qu.:2.25 3rd Qu.:3 3rd Qu.:3.0
## Max. :4.00 Max. :4.00 Max. :5 Max. :4.0
## Accuracy Small_Device Purchase Family_Friends
## Min. :1.00 Min. :3.0 Min. :0.00 Min. :0
## 1st Qu.:2.00 1st Qu.:4.0 1st Qu.:0.00 1st Qu.:0
## Median :3.50 Median :5.0 Median :0.00 Median :0
## Mean :3.25 Mean :4.5 Mean :0.25 Mean :0
## 3rd Qu.:4.25 3rd Qu.:5.0 3rd Qu.:0.25 3rd Qu.:0
## Max. :5.00 Max. :5.0 Max. :1.00 Max. :0
## Purchased_Printed Purchased_Digital Magazines cluster
## Min. :0.000 Min. :0 Min. :0.000 Min. :3
## 1st Qu.:0.000 1st Qu.:0 1st Qu.:0.000 1st Qu.:3
## Median :1.000 Median :0 Median :0.000 Median :3
## Mean :0.625 Mean :0 Mean :0.375 Mean :3
## 3rd Qu.:1.000 3rd Qu.:0 3rd Qu.:1.000 3rd Qu.:3
## Max. :1.000 Max. :0 Max. :1.000 Max. :3
##
## [[4]]
## Age Gender Status Have_Children
## Min. :20.00 Min. :2 Min. :1.000 Min. :0.0000
## 1st Qu.:28.00 1st Qu.:2 1st Qu.:2.000 1st Qu.:0.0000
## Median :30.50 Median :2 Median :3.000 Median :0.0000
## Mean :34.44 Mean :2 Mean :2.667 Mean :0.3333
## 3rd Qu.:32.75 3rd Qu.:2 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :63.00 Max. :2 Max. :4.000 Max. :1.0000
## Fan_Author Convenience Trust_Chef Trust_Reviews
## Min. :2.000 Min. :2.000 Min. :2.000 Min. :1.000
## 1st Qu.:3.250 1st Qu.:2.250 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :4.111 Mean :3.444 Mean :3.833 Mean :3.667
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.750 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Photography Collect Price_Expensive Display
## Min. :2.000 Min. :2.000 Min. :3 Min. :1.000
## 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4 1st Qu.:2.250
## Median :4.000 Median :4.000 Median :4 Median :4.000
## Mean :3.833 Mean :4.056 Mean :4 Mean :3.389
## 3rd Qu.:4.750 3rd Qu.:5.000 3rd Qu.:4 3rd Qu.:4.750
## Max. :5.000 Max. :5.000 Max. :5 Max. :5.000
## Accuracy Small_Device Purchase Family_Friends
## Min. :1.000 Min. :4.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:5.000 1st Qu.:0.2500 1st Qu.:1.0000
## Median :4.000 Median :5.000 Median :1.0000 Median :1.0000
## Mean :3.222 Mean :4.778 Mean :0.7222 Mean :0.8333
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :5.000 Max. :5.000 Max. :1.0000 Max. :1.0000
## Purchased_Printed Purchased_Digital Magazines cluster
## Min. :1 Min. :0.00000 Min. :0.0000 Min. :4
## 1st Qu.:1 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:4
## Median :1 Median :0.00000 Median :1.0000 Median :4
## Mean :1 Mean :0.05556 Mean :0.9444 Mean :4
## 3rd Qu.:1 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:4
## Max. :1 Max. :1.00000 Max. :1.0000 Max. :4
##
## [[5]]
## Age Gender Status Have_Children Fan_Author
## Min. :23.00 Min. :2 Min. :2 Min. :0.0000 Min. :2.000
## 1st Qu.:30.50 1st Qu.:2 1st Qu.:3 1st Qu.:1.0000 1st Qu.:2.000
## Median :32.00 Median :2 Median :3 Median :1.0000 Median :4.000
## Mean :34.18 Mean :2 Mean :3 Mean :0.9091 Mean :3.182
## 3rd Qu.:34.50 3rd Qu.:2 3rd Qu.:3 3rd Qu.:1.0000 3rd Qu.:4.000
## Max. :56.00 Max. :2 Max. :4 Max. :1.0000 Max. :4.000
## Convenience Trust_Chef Trust_Reviews Photography
## Min. :1.000 Min. :1.000 Min. :2.000 Min. :2.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.500 1st Qu.:2.000
## Median :2.000 Median :2.000 Median :3.000 Median :3.000
## Mean :2.727 Mean :2.545 Mean :3.182 Mean :3.182
## 3rd Qu.:3.500 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Collect Price_Expensive Display Accuracy
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :3.000 Median :3.000 Median :1.000
## Mean :2.636 Mean :3.182 Mean :3.091 Mean :1.727
## 3rd Qu.:3.500 3rd Qu.:4.500 3rd Qu.:4.500 3rd Qu.:2.500
## Max. :4.000 Max. :5.000 Max. :5.000 Max. :3.000
## Small_Device Purchase Family_Friends Purchased_Printed
## Min. :1.000 Min. :0.00000 Min. :1 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:0.00000 1st Qu.:1 1st Qu.:1.0000
## Median :4.000 Median :0.00000 Median :1 Median :1.0000
## Mean :3.636 Mean :0.09091 Mean :1 Mean :0.9091
## 3rd Qu.:5.000 3rd Qu.:0.00000 3rd Qu.:1 3rd Qu.:1.0000
## Max. :5.000 Max. :1.00000 Max. :1 Max. :1.0000
## Purchased_Digital Magazines cluster
## Min. :0 Min. :1 Min. :5
## 1st Qu.:0 1st Qu.:1 1st Qu.:5
## Median :0 Median :1 Median :5
## Mean :0 Mean :1 Mean :5
## 3rd Qu.:0 3rd Qu.:1 3rd Qu.:5
## Max. :0 Max. :1 Max. :5
Cluster One: Most likely to purchase individual recipes, provided the criteria below are met and the price is acceptable
Cluster Two: Least likely to purchase
Cluster Three: Unlikely to purchase
Cluster Four: Likely to purchase
Cluster Five: No distinct features compared to other clusters
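The per-cluster summaries above are long, so a compact table of means per cluster can make labels like these easier to check. The following is a minimal base R sketch on made-up data. With the real data, the same call could presumably be applied to Recipes_Clean with pam_fit$clustering added as the cluster column, but that is an assumption rather than something shown in the original analysis.

```r
# Compact cluster profile: one row per cluster, one mean per variable.
# toy and its cluster labels are made up for illustration.
toy <- data.frame(
  Purchase = c(1, 1, 0, 0, 0, 1),
  Collect  = c(4, 5, 1, 2, 1, 3),
  cluster  = c(1, 1, 2, 2, 2, 1)
)

profile <- aggregate(cbind(Purchase, Collect) ~ cluster, data = toy, FUN = mean)
print(profile)
```

Since Purchase is coded 0/1, its mean per cluster reads directly as the share of respondents in that cluster who are likely to purchase.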
# Project the Gower distances into two dimensions with t-SNE and colour by cluster
tsne_obj <- Rtsne(gower_dist, dims = 2, perplexity = 20, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering))
ggplot(tsne_data, aes(x = X, y = Y)) +
  geom_point(aes(color = cluster))
From this cluster analysis, I’ve identified two target customer segments. It also gives me a good springboard to a classification problem: predicting which segment a new customer is likely to belong to, and their propensity to purchase.