MVA Homework 4

Data

This dataset provides a comprehensive profile of customers, including demographic information, purchasing behavior, and responses to marketing campaigns. Using clustering analysis, we can group customers into distinct segments based on shared characteristics such as income, education, and spending habits. This can help us understand the different types of customers in our market and tailor our marketing strategies to each segment.

Data source: https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering

#Importing dataset
library(readr)
mydata <- read_csv("~/R data/MVA Homework/customer_segmentation.csv")
## Rows: 2240 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Education, Marital_Status, Dt_Customer
## dbl (26): ID, Year_Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 29
##      ID Year_Birth Education  Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits
##   <dbl>      <dbl> <chr>      <chr>           <dbl>   <dbl>    <dbl> <chr>         <dbl>    <dbl>     <dbl>
## 1  5524       1957 Graduation Single          58138       0        0 04-09-2012       58      635        88
## 2  2174       1954 Graduation Single          46344       1        1 08-03-2014       38       11         1
## 3  4141       1965 Graduation Together        71613       0        0 21-08-2013       26      426        49
## 4  6182       1984 Graduation Together        26646       1        0 10-02-2014       26       11         4
## 5  5324       1981 PhD        Married         58293       1        0 19-01-2014       94      173        43
## 6  7446       1967 Master     Together        62513       0        1 09-09-2013       16      520        42
## # ℹ 18 more variables: MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>, MntGoldProds <dbl>,
## #   NumDealsPurchases <dbl>, NumWebPurchases <dbl>, NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## #   NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>, AcceptedCmp5 <dbl>, AcceptedCmp1 <dbl>,
## #   AcceptedCmp2 <dbl>, Complain <dbl>, Z_CostContact <dbl>, Z_Revenue <dbl>, Response <dbl>
#Randomly sampling 200 units to keep, because the full dataset is large (2,240 rows)
set.seed(200) 
mydata <- mydata[sample(nrow(mydata), 200), ]

#Replacing the original ID column with sequential IDs for the 200 sampled units
mydata$ID <- seq(1, nrow(mydata))

# Selecting the relevant columns
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- select(mydata, ID, Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts)
head(mydata)
## # A tibble: 6 × 6
##      ID Income MntWines MntFruits MntMeatProducts MntFishProducts
##   <int>  <dbl>    <dbl>     <dbl>           <dbl>           <dbl>
## 1     1  80617      594        51             631              72
## 2     2  41658        8         4              12              15
## 3     3  42664       21         0               3               0
## 4     4  81975      983        76             184             180
## 5     5  16014        3         9               4               7
## 6     6  60482      255        43             134              37

Research question:

How can we segment our customer base into distinct groups based on their income and spending habits on various products (wines, fruits, meat, and fish products)?

Descriptive statistics

# Removing rows where units have NA in column Income 
mydata <- mydata[!is.na(mydata$Income), ]
summary(mydata)
##        ID            Income         MntWines        MntFruits      MntMeatProducts MntFishProducts 
##  Min.   :  1.0   Min.   : 4023   Min.   :   0.0   Min.   :  0.00   Min.   :  1.0   Min.   :  0.00  
##  1st Qu.: 51.0   1st Qu.:31385   1st Qu.:  16.0   1st Qu.:  2.00   1st Qu.: 12.0   1st Qu.:  3.00  
##  Median :101.0   Median :48070   Median : 123.0   Median :  7.00   Median : 46.0   Median : 11.00  
##  Mean   :100.9   Mean   :48962   Mean   : 286.1   Mean   : 22.92   Mean   :157.1   Mean   : 31.58  
##  3rd Qu.:151.0   3rd Qu.:65316   3rd Qu.: 493.0   3rd Qu.: 26.00   3rd Qu.:207.0   3rd Qu.: 32.00  
##  Max.   :200.0   Max.   :95529   Max.   :1285.0   Max.   :199.00   Max.   :890.0   Max.   :250.00

Explanation of a few parameters: for every spending variable the mean is well above the median (for example, MntWines has a mean of 286.1 but a median of only 123.0), so the spending distributions are strongly right-skewed; a minority of customers spends far more than the typical customer.
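
As a quick check of this skewness, the means and medians can be compared directly (a minimal base-R sketch using the columns selected above; vars is a helper name introduced here):

#Mean vs. median for each variable: a mean far above the median indicates right skew
vars <- c("Income", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts")
round(sapply(mydata[vars], function(x) c(Mean = mean(x), Median = median(x))), 1)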

Clustering

#Standardizing
mydata$Income_z <- scale(mydata$Income)
mydata$MntWines_z <- scale(mydata$MntWines)
mydata$MntFruits_z <- scale(mydata$MntFruits)
mydata$MntMeatProducts_z <- scale(mydata$MntMeatProducts)
mydata$MntFishProducts_z <- scale(mydata$MntFishProducts)

#Finding outliers: Euclidean distance of each standardized unit from the overall centroid (the origin)
mydata$Dissimilarity_z <- sqrt(mydata$Income_z^2 + mydata$MntWines_z^2 + mydata$MntFruits_z^2 + mydata$MntMeatProducts_z^2 + mydata$MntFishProducts_z^2)

#Showing 10 units with the highest value of dissimilarity
head(mydata[order(-mydata$Dissimilarity_z), c("ID", "Dissimilarity_z")], 10)
## # A tibble: 10 × 2
##       ID Dissimilarity_z[,1]
##    <int>               <dbl>
##  1   183                6.41
##  2    73                6.08
##  3   132                5.32
##  4    34                5.20
##  5   188                4.86
##  6    10                4.81
##  7    74                4.72
##  8   168                4.50
##  9    77                4.44
## 10   120                4.42
#Removing outliers
library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(198, 114, 34, 123))

#Standardizing one more time after removing units
mydata$Income_z <- scale(mydata$Income)
mydata$MntWines_z <- scale(mydata$MntWines)
mydata$MntFruits_z <- scale(mydata$MntFruits)
mydata$MntMeatProducts_z <- scale(mydata$MntMeatProducts)
mydata$MntFishProducts_z <- scale(mydata$MntFishProducts)

round(head(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")], 3), 2)
## # A tibble: 3 × 5
##   Income_z[,1] MntWines_z[,1] MntFruits_z[,1] MntMeatProducts_z[,1] MntFishProducts_z[,1]
##          <dbl>          <dbl>           <dbl>                 <dbl>                 <dbl>
## 1         1.48           0.9             0.88                  2.23                  0.8 
## 2        -0.33          -0.8            -0.53                 -0.66                 -0.33
## 3        -0.28          -0.77           -0.65                 -0.71                 -0.62
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distance_euclidian <- get_dist(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
                               method = "euclidean")

Distance_euclidian2 <- Distance_euclidian^2
fviz_dist(Distance_euclidian2)

Based on the dissimilarity matrix, the customers appear to fall into roughly three groups. But before clustering, we check the Hopkins statistic to see whether the data is truly clusterable.

#Checking if the data is clusterable with the Hopkins statistic
#install.packages("factoextra")
library(factoextra)
get_clust_tendency(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.8267641
## 
## $plot
## NULL

The Hopkins statistic is 0.83, which is well above 0.5; this means the data has a clustering tendency (it is clusterable).
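
As a sanity check (not part of the original workflow), the Hopkins statistic of uniformly random data with the same dimensions should land near 0.5 (random_data is a helper name introduced here):

#Hopkins statistic of random uniform data of the same size (expected to be close to 0.5)
set.seed(1)
random_data <- as.data.frame(matrix(runif(nrow(mydata) * 5), ncol = 5))
get_clust_tendency(random_data, n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat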

library(dplyr)
library(factoextra)

WARD <- mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")] %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 193

Starting from 193 units, the algorithm needs 192 merge steps to end up with a single cluster (group).
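
The merge heights stored in the hclust object show how dissimilar the merged groups were at each step; a large jump near the end suggests where to stop merging. A short sketch:

#Heights of the last few merges; a big jump between consecutive steps suggests the cut point
round(tail(WARD$height, 5), 2)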

library(factoextra)

fviz_dend(WARD,
          k = 2,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Because the merge height of the group on the right side of the dendrogram is low, a cut into three groups would have to be made very close to it; we therefore decide to cluster the customers into two groups.
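
This choice of k = 2 can be cross-checked with the average silhouette width (an optional check, not part of the original analysis; Z is a helper name, and hcut's defaults match the ward.D2 clustering used here):

#Average silhouette width for different numbers of clusters; the peak suggests k
Z <- as.matrix(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")])
fviz_nbclust(Z, FUNcluster = hcut, method = "silhouette")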

We now determine the initial leaders (centroids) of the Ward groups, which will serve as the starting points for the K-means method.

mydata$ClassificationWARD <- cutree(WARD,
                                    k = 2)

head(mydata[c("ID", "ClassificationWARD")])
## # A tibble: 6 × 2
##      ID ClassificationWARD
##   <int>              <int>
## 1     1                  1
## 2     2                  2
## 3     3                  2
## 4     4                  1
## 5     5                  2
## 6     6                  2
InitialLeaders <- aggregate(mydata[,c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
                            by = list(mydata$ClassificationWARD),
                            FUN = mean)

round(InitialLeaders, 3)
##   Group.1 Income_z MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1       1    1.252      1.092       1.163             1.402             1.168
## 2       2   -0.462     -0.403      -0.429            -0.517            -0.431
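
The hkmeans() call below automates this two-step procedure. Written out by hand, it would amount to passing the Ward group means to kmeans() as the starting centers, roughly as in this sketch (KM_manual is a hypothetical name):

#Equivalent manual step: K-means seeded with the Ward group means
KM_manual <- kmeans(as.matrix(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")]),
                    centers = as.matrix(InitialLeaders[, -1]))
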
library(factoextra)

#Performing K-means clustering, seeded with the hierarchical (Ward) solution
K_MEANS <- hkmeans(mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")],
                   k = 2,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS
## Hierarchical K-means clustering with 2 clusters of sizes 58, 135
## 
## Cluster means:
##     Income_z MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1  1.1809418  1.1489551   1.0386770         1.2642228         1.0225457
## 2 -0.5073676 -0.4936252  -0.4462464        -0.5431476        -0.4393159
## 
## Clustering vector:
##   [1] 1 2 2 1 2 2 2 1 1 2 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2
##  [58] 1 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 1 1 2 1 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2
## [115] 2 1 1 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 1 1 1 1 1 1 2 2 1 2
## [172] 2 2 2 1 1 1 1 2 2 1 1 2 2 1 2 2 2 2 2 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 300.1258 126.0920
##  (between_SS / total_SS =  55.6 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
##  [8] "iter"         "ifault"       "data"         "hclust"

In the “Cluster means” table we can see the final leaders (the group centroids after K-means).

The clustering vector shows each unit’s final group: the first unit is assigned to group 1, the second to group 2, and so on.

fviz_cluster(K_MEANS,
             palette = "jama",
             repel = FALSE,
             ggtheme = theme_classic())

We will now check if any unit has been reclassified.

mydata$ClassificationK_MEANS <- K_MEANS$cluster
head(mydata[, c("ID", "ClassificationWARD", "ClassificationK_MEANS")])
## # A tibble: 6 × 3
##      ID ClassificationWARD ClassificationK_MEANS
##   <int>              <int>                 <int>
## 1     1                  1                     1
## 2     2                  2                     2
## 3     3                  2                     2
## 4     4                  1                     1
## 5     5                  2                     2
## 6     6                  2                     2
table(mydata$ClassificationWARD)
## 
##   1   2 
##  52 141
table(mydata$ClassificationK_MEANS)
## 
##   1   2 
##  58 135
table(mydata$ClassificationWARD, mydata$ClassificationK_MEANS)
##    
##       1   2
##   1  52   0
##   2   6 135

There has been some reclassification. Ward’s method initially placed 52 units in group 1 and 141 units in group 2. All 52 units of group 1 kept their group, while 6 units were reclassified from group 2 into group 1; no unit moved in the other direction.

After K-means there are therefore 58 units in group 1 (6 of which came from group 2) and 135 units in group 2.
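
The resulting group shares, used in the conclusion below, can be computed directly:

#Share of customers in each K-means group, in percent
round(100 * prop.table(table(mydata$ClassificationK_MEANS)), 2)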

Averages <- K_MEANS$centers
round(Averages, 3)
##   Income_z MntWines_z MntFruits_z MntMeatProducts_z MntFishProducts_z
## 1    1.181      1.149       1.039             1.264             1.023
## 2   -0.507     -0.494      -0.446            -0.543            -0.439
library(ggplot2)
library(tidyr)

Picture <- as.data.frame(Averages)
Picture$id <- 1:nrow(Picture)
Picture <- pivot_longer(Picture, cols = c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z"))

Picture$Group <- factor(Picture$id,
                        levels = c(1, 2),
                        labels = c("1", "2"))

Picture$nameFactor <- factor(Picture$name,
                             levels = c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z"),
                             labels = c("Income", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts"))

ggplot(Picture, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 2) +
  geom_line(aes(group = id, linetype = Group, col = Group), linewidth = 1) +
  ylab("Average") +
  xlab("Cluster variables")

We will now check the appropriateness of the cluster variables used.

fit <- aov(cbind(Income_z, MntWines_z, MntFruits_z, MntMeatProducts_z, MntFishProducts_z) ~ as.factor(ClassificationK_MEANS),
           data = mydata)

summary(fit)
##  Response 1 :
##                                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   1 115.64  115.64  289.25 < 2.2e-16 ***
## Residuals                        191  76.36    0.40                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   1 109.461 109.461   253.3 < 2.2e-16 ***
## Residuals                        191  82.539   0.432                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   1  89.457  89.457  166.62 < 2.2e-16 ***
## Residuals                        191 102.543   0.537                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   1 132.525 132.525   425.6 < 2.2e-16 ***
## Residuals                        191  59.475   0.311                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   1   86.7  86.700  157.26 < 2.2e-16 ***
## Residuals                        191  105.3   0.551                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis of equal group means for all five cluster variables (all p-values < 0.001), which indicates that each chosen cluster variable discriminates between the two groups. (These p-values should be read descriptively rather than as a formal test, since the groups were constructed to maximize exactly these differences.)
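
Beyond statistical significance, the share of each variable’s variance explained by the grouping (eta squared) shows how strongly each variable separates the groups. A small sketch computed from the same one-way ANOVAs (zvars and eta_sq are hypothetical names):

#Eta squared per cluster variable: between-group SS divided by total SS
zvars <- mydata[c("Income_z", "MntWines_z", "MntFruits_z", "MntMeatProducts_z", "MntFishProducts_z")]
eta_sq <- sapply(zvars, function(x) {
  tab <- anova(lm(as.numeric(x) ~ as.factor(mydata$ClassificationK_MEANS)))
  tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])
})
round(eta_sq, 3)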

Conclusion

On the basis of 5 standardized variables (Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts), we clustered 193 customers into 2 groups by hierarchical clustering (Ward’s algorithm, squared Euclidean distance; 4 customers had been removed because of high dissimilarity). The number of groups was determined from the dendrogram. We then refined the classification with the K-means method, in which 6 units were reclassified.

Group 1 (58 customers, 30.05%) consists of customers characterized by higher-than-average values of all variables: their income is above average and they spend more than average on all product categories (wine, fruits, meat products, fish products).

Group 2 (135 customers, 69.95%) consists of customers characterized by lower-than-average values of all variables: their income is below average and they spend less than average on all product categories (wine, fruits, meat products, fish products).