This dataset provides a comprehensive profile of customers, including demographic information, purchasing behavior, and responses to marketing campaigns. Using clustering analysis, we can group customers into distinct segments based on shared characteristics such as income, education, and spending habits. This can help us understand the different types of customers in our market and tailor our marketing strategies to each segment.
Data source: https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering
#Importing dataset
library(readr)
mydata <- read_csv("~/R data/MVA Homework/customer_segmentation.csv")
## Rows: 2240 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Education, Marital_Status, Dt_Customer
## dbl (26): ID, Year_Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 29
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 5524 1957 Graduation Single 58138 0 0 04-09-2012 58 635 88
## 2 2174 1954 Graduation Single 46344 1 1 08-03-2014 38 11 1
## 3 4141 1965 Graduation Together 71613 0 0 21-08-2013 26 426 49
## 4 6182 1984 Graduation Together 26646 1 0 10-02-2014 26 11 4
## 5 5324 1981 PhD Married 58293 1 0 19-01-2014 94 173 43
## 6 7446 1967 Master Together 62513 0 1 09-09-2013 16 520 42
## # ℹ 18 more variables: MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>, MntGoldProds <dbl>,
## # NumDealsPurchases <dbl>, NumWebPurchases <dbl>, NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## # NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>, AcceptedCmp5 <dbl>, AcceptedCmp1 <dbl>,
## # AcceptedCmp2 <dbl>, Complain <dbl>, Z_CostContact <dbl>, Z_Revenue <dbl>, Response <dbl>
#Randomly choosing and keeping only 200 units, because the full dataset is large (2,240 rows)
set.seed(200)
mydata <- mydata[sample(nrow(mydata), 200), ]
#Replacing the ID column with a simple sequence 1..200 for the sampled units
mydata$ID <- seq(1, nrow(mydata))
# Selecting the relevant columns
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- select(mydata, ID, Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts)
head(mydata)
## # A tibble: 6 × 6
## ID Income MntWines MntFruits MntMeatProducts MntFishProducts
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 80617 594 51 631 72
## 2 2 41658 8 4 12 15
## 3 3 42664 21 0 3 0
## 4 4 81975 983 76 184 180
## 5 5 16014 3 9 4 7
## 6 6 60482 255 43 134 37
Unit of observation: Individual customer
Sample size: 200 customers
Variables:
++ ID: Unique identifier for each individual in the dataset.
++ Income: The annual income of the individual (money units).
++ MntWines: The amount of money spent on wines per month (money units).
++ MntFruits: The amount of money spent on fruits per month (money units).
++ MntMeatProducts: The amount of money spent on meat products per month (money units).
++ MntFishProducts: The amount of money spent on fish products per month (money units).
Research question:
How can we segment our customer base into distinct groups based on their income and spending habits on various products (wines, fruits, meat, and fish products)?
# Removing rows where units have NA in column Income
mydata <- mydata[!is.na(mydata$Income), ]
summary(mydata)
## ID Income MntWines MntFruits MntMeatProducts MntFishProducts
## Min. : 1.0 Min. : 4023 Min. : 0.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 51.0 1st Qu.:31385 1st Qu.: 16.0 1st Qu.: 2.00 1st Qu.: 12.0 1st Qu.: 3.00
## Median :101.0 Median :48070 Median : 123.0 Median : 7.00 Median : 46.0 Median : 11.00
## Mean :100.9 Mean :48962 Mean : 286.1 Mean : 22.92 Mean :157.1 Mean : 31.58
## 3rd Qu.:151.0 3rd Qu.:65316 3rd Qu.: 493.0 3rd Qu.: 26.00 3rd Qu.:207.0 3rd Qu.: 32.00
## Max. :200.0 Max. :95529 Max. :1285.0 Max. :199.00 Max. :890.0 Max. :250.00
Explanation of a few of the summary statistics:
Min: The customer with the lowest income earns 4,023 money units per year. The minimum amount spent on wines, fruits, meat products, and fish products per month is 0, 0, 1, and 0 money units respectively.
Max: The customer with the highest income earns 95,529 money units per year. The maximum amount spent on wines, fruits, meat products, and fish products per month is 1,285, 199, 890, and 250 money units respectively.
1st Quartile: 25% of customers earn 31,385 money units or less per year, and 25% of customers spend 16, 2, 12, and 3 money units or less per month on wines, fruits, meat products, and fish products respectively.
Median: The median income of customers is 48,070 money units per year. The median amount spent on wines, fruits, meat products, and fish products per month is 123, 7, 46, and 11 money units respectively.
Mean: The average income of customers is 48,962 money units per year. On average, customers spend 286.1, 22.92, 157.1, and 31.58 money units per month on wines, fruits, meat products, and fish products respectively.
3rd Quartile: 75% of customers earn 65,316 money units or less per year, and 75% of customers spend 493, 26, 207, and 32 money units or less per month on wines, fruits, meat products, and fish products respectively.
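The same statistics can also be collected into one tidy table, which scales better as the number of variables grows. A minimal sketch using dplyr and tidyr (tidyr is loaded later in this analysis anyway); the column names chosen here are illustrative:
library(dplyr)
library(tidyr)
#One row per variable with the usual location statistics
mydata %>%
  pivot_longer(-ID, names_to = "Variable", values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(Min = min(Value), Q1 = quantile(Value, 0.25), Median = median(Value),
            Mean = mean(Value), Q3 = quantile(Value, 0.75), Max = max(Value))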
#Standardizing
mydata$Income_z <- scale(mydata$Income)
mydata$MntWines_z <- scale(mydata$MntWines)
mydata$MntFruits_z <- scale(mydata$MntFruits)
mydata$MntMeatProducts_z <- scale(mydata$MntMeatProducts)
mydata$MntFishProducts_z <- scale(mydata$MntFishProducts)
#Finding outliers: each customer's dissimilarity is the Euclidean distance of their
#standardized profile from the overall average (the origin in z-space)
mydata$Dissimilarity_z <- sqrt(mydata$Income_z^2 + mydata$MntWines_z^2 + mydata$MntFruits_z^2 + mydata$MntMeatProducts_z^2 + mydata$MntFishProducts_z^2)
#Showing 10 units with the highest value of dissimilarity
head(mydata[order(-mydata$Dissimilarity_z), c("ID", "Dissimilarity_z")], 10)
## # A tibble: 10 × 2
## ID Dissimilarity_z[,1]
## <int> <dbl>
## 1 183 6.41
## 2 73 6.08
## 3 132 5.32
## 4 34 5.20
## 5 188 4.86
## 6 10 4.81
## 7 74 4.72
## 8 168 4.50
## 9 77 4.44
## 10 120 4.42
#Removing outliers
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(198, 114, 34, 123))
#Standardizing one more time after removing units
mydata$Income_z <- scale(mydata$Income)
mydata$MntWines_z <- scale(mydata$MntWines)
mydata$MntFruits_z <- scale(mydata$MntFruits)
mydata$MntMeatProducts_z <- scale(mydata$MntMeatProducts)
mydata$MntFishProducts_z <- scale(mydata$MntFishProducts)
round(head(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")], 3), 2)
## # A tibble: 3 × 5
## Income_z[,1] MntWines_z[,1] MntFruits_z[,1] MntMeatProducts_z[,1] MntFishProducts_z[,1]
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.48 0.9 0.88 2.23 0.8
## 2 -0.33 -0.8 -0.53 -0.66 -0.33
## 3 -0.28 -0.77 -0.65 -0.71 -0.62
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distance_euclidian <- get_dist(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
                               method = "euclidean")
Distance_euclidian2 <- Distance_euclidian^2
fviz_dist(Distance_euclidian2)
Based on the dissimilarity-matrix heatmap, it appears that the customers could be classified into roughly three groups. Before clustering, however, we check the Hopkins statistic to see whether the data is clusterable at all.
#Checking whether the data is clusterable with the Hopkins statistic
#install.packages("factoextra")
library(factoextra)
get_clust_tendency(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8267641
##
## $plot
## NULL
The Hopkins statistic is approximately 0.83, well above 0.5, which means that the data has a clustering tendency (it is clusterable).
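As a quick sanity check (a sketch, not part of the original workflow), the same statistic can be computed for uniformly random data spanning the same range as the standardized variables; a value near 0.5 would indicate no clustering tendency. The seed and the name random_data are arbitrary choices:
#Sanity check: Hopkins statistic of uniformly random data with the same range
z <- scale(mydata[c("Income", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts")])
set.seed(123)
random_data <- apply(z, 2, function(x) runif(length(x), min(x), max(x)))
get_clust_tendency(as.data.frame(random_data), n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat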
library(dplyr)
library(factoextra)
WARD <- mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 193
With 193 customers, the algorithm takes 192 merge steps before everything ends up in a single cluster (group).
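The number of merge steps follows directly from the number of objects; more informative for choosing the number of groups are the heights of the last merges, since a large jump between consecutive heights suggests a natural place to cut the tree. A quick look (sketch):
#Heights of the last 10 merges in the Ward dendrogram
round(tail(WARD$height, 10), 2)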
library(factoextra)
fviz_dend(WARD,
k = 2,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Because the branches on the right-hand side of the dendrogram merge at low heights, a cut separating them further would have to be placed very close to those merges; we therefore decide to cluster the customers into two groups.
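The choice of k = 2 can also be cross-checked with an internal validity index. A hedged sketch using the average silhouette width for hierarchical cuts (fviz_nbclust() and hcut() are both part of factoextra, which is already loaded); the k with the highest average silhouette width is the suggested number of groups:
#Average silhouette width for k = 2..10 on the standardized variables
fviz_nbclust(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
             FUNcluster = hcut,
             method = "silhouette",
             k.max = 10)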
We now determine the initial leaders of the groups (the centroids of the Ward solution), which will serve as the starting centers for the K-means method.
mydata$ClassificationWARD <- cutree(WARD,
k = 2)
head(mydata[c("ID", "ClassificationWARD")])
## # A tibble: 6 × 2
## ID ClassificationWARD
## <int> <int>
## 1 1 1
## 2 2 2
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
InitialLeaders <- aggregate(mydata[,c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
by = list(mydata$ClassificationWARD),
FUN = mean)
round(InitialLeaders, 3)
## Group.1 Income_z MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1 1 1.252 1.092 1.163 1.402 1.168
## 2 2 -0.462 -0.403 -0.429 -0.517 -0.431
library(factoextra)
#Performing K-means clustering (hkmeans internally repeats the Ward step to obtain
#initial centers equivalent to the leaders computed above)
K_MEANS <- hkmeans(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
k = 2,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 2 clusters of sizes 58, 135
##
## Cluster means:
## Income_z MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1 1.1809418 1.1489551 1.0386770 1.2642228 1.0225457
## 2 -0.5073676 -0.4936252 -0.4462464 -0.5431476 -0.4393159
##
## Clustering vector:
## [1] 1 2 2 1 2 2 2 1 1 2 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2
## [58] 1 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 1 1 2 1 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2
## [115] 2 1 1 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 1 1 1 1 1 1 2 2 1 2
## [172] 2 2 2 1 1 1 1 2 2 1 1 2 2 1 2 2 2 2 2 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 300.1258 126.0920
## (between_SS / total_SS = 55.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
## [8] "iter" "ifault" "data" "hclust"
The “Cluster means” table shows the final leaders (cluster centroids).
The clustering vector shows that the first unit is assigned to group 1, the second unit to group 2, and so on.
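How firmly individual customers sit in their assigned group can be checked with silhouette widths. A minimal sketch using the cluster package (an additional dependency assumed here, not used elsewhere in this analysis) and the Euclidean distance matrix computed earlier:
library(cluster)
#Silhouette widths: values near 1 mean a unit fits its cluster well,
#values near 0 mean it lies between the two clusters
sil <- silhouette(K_MEANS$cluster, Distance_euclidian)
summary(sil)$avg.width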
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
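Note that fviz_cluster() projects the five standardized variables onto their first two principal components for plotting. How much of the total variance this 2-D view captures can be checked with prcomp(); a sketch:
#Share of variance captured by the first two principal components used in the cluster plot
summary(prcomp(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")]))$importance[, 1:2]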
We will now check if any unit has been reclassified.
mydata$ClassificationK_MEANS <- K_MEANS$cluster
head(mydata[, c("ID", "ClassificationWARD", "ClassificationK_MEANS")])
## # A tibble: 6 × 3
## ID ClassificationWARD ClassificationK_MEANS
## <int> <int> <int>
## 1 1 1 1
## 2 2 2 2
## 3 3 2 2
## 4 4 1 1
## 5 5 2 2
## 6 6 2 2
table(mydata$ClassificationWARD)
##
## 1 2
## 52 141
table(mydata$ClassificationK_MEANS)
##
## 1 2
## 58 135
table(mydata$ClassificationWARD, mydata$ClassificationK_MEANS)
##
## 1 2
## 1 52 0
## 2 6 135
There has been some reclassification. Ward’s method assigned 52 units to group 1 and 141 units to group 2. All 52 units from group 1 kept their assignment, while 6 of the 141 units from group 2 were reclassified into group 1.
After K-means there are therefore 58 units in group 1 (52 original members plus the 6 reclassified units) and 135 units in group 2.
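The reclassified customers can be listed directly by comparing the two classification columns; a small sketch (the group labels correspond between the two solutions, as the cross-table above shows):
#IDs of the customers whose group changed between Ward and K-means
mydata$ID[mydata$ClassificationWARD != mydata$ClassificationK_MEANS]
#Share of customers that kept the same group
mean(mydata$ClassificationWARD == mydata$ClassificationK_MEANS)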
Averages <- K_MEANS$centers
round(Averages, 3)
## Income_z MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1 1.181 1.149 1.039 1.264 1.023
## 2 -0.507 -0.494 -0.446 -0.543 -0.439
library(ggplot2)
library(tidyr)
Picture <- as.data.frame(Averages)
Picture$id <- 1:nrow(Picture)
Picture <- pivot_longer(Picture, cols = c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z"))
Picture$Group <- factor(Picture$id,
levels = c(1, 2),
labels = c("1", "2"))
Picture$nameFactor <- factor(Picture$name,
levels = c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z"),
labels = c("Income", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts"))
ggplot(Picture, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 2) +
geom_line(aes(group = id, linetype = Group, col = Group), linewidth = 1) +
ylab("Average") +
xlab("Cluster variables")
We will now check the appropriateness of the cluster variables used.
fit <- aov(cbind(Income_z, MntWines_z, MntFruits_z, MntMeatProducts_z, MntFishProducts_z) ~ as.factor(ClassificationK_MEANS),
data = mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationK_MEANS) 1 115.64 115.64 289.25 < 2.2e-16 ***
## Residuals 191 76.36 0.40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationK_MEANS) 1 109.461 109.461 253.3 < 2.2e-16 ***
## Residuals 191 82.539 0.432
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationK_MEANS) 1 89.457 89.457 166.62 < 2.2e-16 ***
## Residuals 191 102.543 0.537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationK_MEANS) 1 132.525 132.525 425.6 < 2.2e-16 ***
## Residuals 191 59.475 0.311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 5 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationK_MEANS) 1 86.7 86.700 157.26 < 2.2e-16 ***
## Residuals 191 105.3 0.551
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject the null hypothesis of equal group means for every cluster variable (p < 0.001 in all five ANOVAs), which indicates that each variable helps to separate the two groups, i.e., the choice of cluster variables is appropriate.
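Beyond the p-values, each variable's discriminating power can be summarized with eta-squared (the between-group sum of squares divided by the total sum of squares), extracted from the same ANOVA output; a sketch:
#Eta-squared per cluster variable: share of its variance explained by the two groups
sapply(summary(fit), function(tab) {
  ss <- tab[[1]][["Sum Sq"]]
  round(ss[1] / sum(ss), 3)
})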
On the basis of 5 standardized variables (Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts), we clustered 193 customers into 2 groups by hierarchical clustering (Ward’s method, ward.D2, with Euclidean distance); customers with a missing Income value or an unusually high dissimilarity had been removed beforehand. The number of groups was determined from the dendrogram. We then optimized the classification with the K-means method, in which 6 units were reclassified.
Group 1 (58 customers, 30.1%) consists of customers characterized by higher-than-average values on all variables: their income is above average and they spend more than average on every product category (wines, fruits, meat products, fish products).
Group 2 (135 customers, 69.9%) consists of customers characterized by lower-than-average values on all variables: their income is below average and they spend less than average on every product category (wines, fruits, meat products, fish products).
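To describe the two segments in original money units rather than z-scores, the group means of the unstandardized variables can be tabulated; a minimal sketch:
#Group means on the original scale (money units), useful for naming the segments
aggregate(mydata[, c("Income", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts")],
          by = list(Group = mydata$ClassificationK_MEANS),
          FUN = mean)
These averages make the segments easier to communicate to a marketing audience: a smaller high-income, high-spending segment and a larger low-income, low-spending one.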