MVA Homework 4

Data

This dataset provides a comprehensive profile of customers, including demographic information, purchasing behavior, and responses to marketing campaigns. Using clustering analysis, we can group customers into distinct segments based on shared characteristics such as income, education, and spending habits. This can help us understand the different types of customers in our market and tailor our marketing strategies to each segment.

Data source: https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering

#Importing dataset
library(readr)
mydata <- read_csv("~/R data/MVA Homework/customer_segmentation.csv")
## Rows: 2240 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Education, Marital_Status, Dt_Customer
## dbl (26): ID, Year_Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 29
##      ID Year_Birth Education  Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits
##   <dbl>      <dbl> <chr>      <chr>           <dbl>   <dbl>    <dbl> <chr>         <dbl>    <dbl>     <dbl>
## 1  5524       1957 Graduation Single          58138       0        0 04-09-2012       58      635        88
## 2  2174       1954 Graduation Single          46344       1        1 08-03-2014       38       11         1
## 3  4141       1965 Graduation Together        71613       0        0 21-08-2013       26      426        49
## 4  6182       1984 Graduation Together        26646       1        0 10-02-2014       26       11         4
## 5  5324       1981 PhD        Married         58293       1        0 19-01-2014       94      173        43
## 6  7446       1967 Master     Together        62513       0        1 09-09-2013       16      520        42
## # ℹ 18 more variables: MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>, MntGoldProds <dbl>,
## #   NumDealsPurchases <dbl>, NumWebPurchases <dbl>, NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## #   NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>, AcceptedCmp5 <dbl>, AcceptedCmp1 <dbl>,
## #   AcceptedCmp2 <dbl>, Complain <dbl>, Z_CostContact <dbl>, Z_Revenue <dbl>, Response <dbl>
#Randomly sampling 200 units to work with, since the full dataset is large (2,240 rows)
set.seed(200) 
mydata <- mydata[sample(nrow(mydata), 200), ]

#Replacing the original ID column with a sequential ID
mydata$ID <- seq(1, nrow(mydata))

#Creating columns with Age of customers and number of children
mydata$Age <- 2024 - mydata$Year_Birth
mydata$ChildHome <- mydata$Kidhome + mydata$Teenhome

# Selecting the relevant columns
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- select(mydata, ID, Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts, Age, ChildHome, NumWebPurchases, NumStorePurchases, Recency, NumWebVisitsMonth)
head(mydata)
## # A tibble: 6 × 12
##      ID Income MntWines MntFruits MntMeatProducts MntFishProducts   Age ChildHome NumWebPurchases NumStorePurchases
##   <int>  <dbl>    <dbl>     <dbl>           <dbl>           <dbl> <dbl>     <dbl>           <dbl>             <dbl>
## 1     1  80617      594        51             631              72    29         0               4                 8
## 2     2  41658        8         4              12              15    49         2               1                 2
## 3     3  42664       21         0               3               0    57         1               1                 3
## 4     4  81975      983        76             184             180    62         1               6                 4
## 5     5  16014        3         9               4               7    45         2               1                 4
## 6     6  60482      255        43             134              37    48         1               7                 7
## # ℹ 2 more variables: Recency <dbl>, NumWebVisitsMonth <dbl>

Research question

How can we segment our customer base into distinct groups based on their income and spending habits on various products (wines, fruits, meat, and fish products)?

Descriptive statistics

# Removing rows where units have NA in column Income 
mydata <- mydata[!is.na(mydata$Income), ]
summary(mydata)
##        ID            Income         MntWines        MntFruits      MntMeatProducts MntFishProducts       Age      
##  Min.   :  1.0   Min.   : 4023   Min.   :   0.0   Min.   :  0.00   Min.   :  1.0   Min.   :  0.00   Min.   :29.0  
##  1st Qu.: 51.0   1st Qu.:31385   1st Qu.:  16.0   1st Qu.:  2.00   1st Qu.: 12.0   1st Qu.:  3.00   1st Qu.:46.0  
##  Median :101.0   Median :48070   Median : 123.0   Median :  7.00   Median : 46.0   Median : 11.00   Median :53.0  
##  Mean   :100.9   Mean   :48962   Mean   : 286.1   Mean   : 22.92   Mean   :157.1   Mean   : 31.58   Mean   :54.3  
##  3rd Qu.:151.0   3rd Qu.:65316   3rd Qu.: 493.0   3rd Qu.: 26.00   3rd Qu.:207.0   3rd Qu.: 32.00   3rd Qu.:65.0  
##  Max.   :200.0   Max.   :95529   Max.   :1285.0   Max.   :199.00   Max.   :890.0   Max.   :250.00   Max.   :84.0  
##    ChildHome      NumWebPurchases  NumStorePurchases    Recency      NumWebVisitsMonth
##  Min.   :0.0000   Min.   : 0.000   Min.   : 0.000    Min.   : 0.00   Min.   : 0.000   
##  1st Qu.:0.0000   1st Qu.: 2.000   1st Qu.: 3.000    1st Qu.:26.00   1st Qu.: 4.000   
##  Median :1.0000   Median : 3.000   Median : 4.000    Median :51.00   Median : 6.000   
##  Mean   :0.9442   Mean   : 3.888   Mean   : 5.584    Mean   :48.85   Mean   : 5.548   
##  3rd Qu.:1.0000   3rd Qu.: 6.000   3rd Qu.: 8.000    3rd Qu.:74.00   3rd Qu.: 7.000   
##  Max.   :3.0000   Max.   :11.000   Max.   :13.000    Max.   :99.00   Max.   :20.000

A few highlights from the summary: Income ranges from 4,023 to 95,529 with a median of 48,070. Spending on wines is strongly right-skewed (median 123 vs. mean 286.1), and the other product categories show the same pattern. Customers are between 29 and 84 years old (median 53), and households have at most three children at home.

Clustering

#Standardizing
mydata$Income_z <- scale(mydata$Income)
mydata$MntWines_z <- scale(mydata$MntWines)
mydata$MntFruits_z <- scale(mydata$MntFruits)
mydata$MntMeatProducts_z <- scale(mydata$MntMeatProducts)
mydata$MntFishProducts_z <- scale(mydata$MntFishProducts)
mydata$ChildHome_z <- scale(mydata$ChildHome)
mydata$NumWebP_z <- scale(mydata$NumWebPurchases)
mydata$NumStoreP_z <- scale(mydata$NumStorePurchases)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)
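
The ten scale() calls can be collapsed into one step with dplyr::across(); this optional, equivalent sketch wraps scale() in as.numeric() so the new columns are plain vectors rather than one-column matrices, and renames two columns to match the shortened names used below:

#Standardizing all ten variables in one step
library(dplyr)
vars <- c("Income", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts",
          "ChildHome", "NumWebPurchases", "NumStorePurchases", "Recency", "NumWebVisitsMonth")
mydata <- mydata %>%
  mutate(across(all_of(vars), ~ as.numeric(scale(.x)), .names = "{.col}_z")) %>%
  rename(NumWebP_z = NumWebPurchases_z, NumStoreP_z = NumStorePurchases_z)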

Correlation Matrix

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[, c("Income_z", "ChildHome_z", "NumWebP_z", "Recency_z", "NumWebVisitsMonth_z")]), 
      type = "pearson")
##                     Income_z ChildHome_z NumWebP_z Recency_z NumWebVisitsMonth_z
## Income_z                1.00       -0.34      0.54     -0.02               -0.66
## ChildHome_z            -0.34        1.00     -0.16      0.13                0.35
## NumWebP_z               0.54       -0.16      1.00     -0.08               -0.05
## Recency_z              -0.02        0.13     -0.08      1.00               -0.04
## NumWebVisitsMonth_z    -0.66        0.35     -0.05     -0.04                1.00
## 
## n= 197 
## 
## 
## P
##                     Income_z ChildHome_z NumWebP_z Recency_z NumWebVisitsMonth_z
## Income_z                     0.0000      0.0000    0.8227    0.0000             
## ChildHome_z         0.0000               0.0257    0.0626    0.0000             
## NumWebP_z           0.0000   0.0257                0.2708    0.4457             
## Recency_z           0.8227   0.0626      0.2708              0.5626             
## NumWebVisitsMonth_z 0.0000   0.0000      0.4457    0.5626
The strongest associations among the candidate cluster variables are between Income_z and NumWebVisitsMonth_z (r = -0.66) and between Income_z and NumWebP_z (r = 0.54); none of the correlations is high enough to make a variable redundant.

#Finding outliers: for each unit, the Euclidean distance from the overall centroid
#(the origin of the standardized space) across the five cluster variables
mydata$Dissimilarity_z <- sqrt(mydata$Income_z^2 + mydata$NumWebP_z^2 + mydata$Recency_z^2 + mydata$NumWebVisitsMonth_z^2 + mydata$ChildHome_z^2)

#Showing the 10 units with the highest dissimilarity
head(mydata[order(-mydata$Dissimilarity_z), c("ID", "Dissimilarity_z")], 10)
## # A tibble: 10 × 2
##       ID Dissimilarity_z[,1]
##    <int>               <dbl>
##  1    90                5.93
##  2   109                5.72
##  3    30                4.28
##  4   168                3.37
##  5   179                3.28
##  6   101                3.24
##  7   126                3.21
##  8   156                3.19
##  9   172                3.15
## 10   163                3.09
#Removing the three most dissimilar units (dissimilarity above 4)
library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(90, 109, 30))
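
The same removal can be written without hard-coding IDs, as a threshold on the dissimilarity measure; a minimal sketch, with the cutoff of 4 chosen by eye to separate the three extreme units (5.93, 5.72 and 4.28) from the rest of the list above:

#Equivalent removal via a threshold on the dissimilarity measure
mydata <- filter(mydata, as.numeric(Dissimilarity_z) < 4)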

#Standardizing one more time after removing units
mydata$Income_z <- scale(mydata$Income)
mydata$ChildHome_z <- scale(mydata$ChildHome)
mydata$NumWebP_z <- scale(mydata$NumWebPurchases)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)

round(head(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")], 3), 2)
## # A tibble: 3 × 5
##   Income_z[,1] ChildHome_z[,1] NumWebP_z[,1] NumWebVisitsMonth_z[,1] Recency_z[,1]
##          <dbl>           <dbl>         <dbl>                   <dbl>         <dbl>
## 1         1.49           -1.22          0.02                   -1.46         -0.24
## 2        -0.38            1.39         -1.11                   -0.59         -0.66
## 3        -0.34            0.08         -1.11                    0.28         -0.17
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distance_euclidean <- get_dist(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                               method = "euclidean")

#Squaring the Euclidean distances before displaying them
Distance_euclidean2 <- Distance_euclidean^2

#Showing dissimilarity matrix
fviz_dist(Distance_euclidean2)

Based on the dissimilarity matrix, it appears that the customers can be classified into four groups. Before clustering, however, we check the Hopkins statistic to see whether the data are clusterable at all.

#Checking whether the data are clusterable with the Hopkins statistic
#install.packages("factoextra")
library(factoextra)
get_clust_tendency(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.6484489
## 
## $plot
## NULL

The Hopkins statistic is 0.65, which is above 0.5, so the data show some clustering tendency and are worth clustering (values around 0.5 would indicate uniformly random data).

library(dplyr)
library(factoextra)

WARD <- mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")] %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 194

With 194 units, the agglomeration needs 193 merge steps to arrive at a single cluster (group).

library(factoextra)

fviz_dend(WARD,
          k = 4,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Based on the dendrogram, cutting the tree into four groups appears to be the best choice.
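
As a cross-check that is not part of the original pipeline, the average silhouette width can be inspected over a range of k with factoextra's fviz_nbclust(); a minimal sketch using hcut(), which defaults to Ward's method (ward.D2) on Euclidean distances:

#Optional: average silhouette width for k = 2, ..., 10
library(factoextra)
fviz_nbclust(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
             hcut, method = "silhouette", k.max = 10)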

We now determine the initial leaders of the groups, which will serve as the starting points for the K-means method.

mydata$ClassificationWARD <- cutree(WARD,
                                    k = 4)

head(mydata[c("ID", "Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z", "Dissimilarity_z", "ClassificationWARD")])
## # A tibble: 6 × 8
##      ID Income_z[,1] ChildHome_z[,1] NumWebP_z[,1] NumWebVisitsMonth_z[,1] Recency_z[,1] Dissimilarity_z[,1]
##   <int>        <dbl>           <dbl>         <dbl>                   <dbl>         <dbl>               <dbl>
## 1     1        1.49          -1.22          0.0213                  -1.46         -0.241                2.33
## 2     2       -0.383          1.39         -1.11                    -0.592        -0.655                1.97
## 3     3       -0.335          0.0807       -1.11                     0.278        -0.172                1.15
## 4     4        1.55           0.0807        0.774                    0.713        -1.62                 2.43
## 5     5       -1.61           1.39         -1.11                    -1.03         -0.241                2.52
## 6     6        0.520          0.0807        1.15                     0.713         1.10                 1.78
## # ℹ 1 more variable: ClassificationWARD <int>
#Calculating the positions of the initial leaders (group centroids)

InitialLeaders <- aggregate(mydata[,c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                            by = list(mydata$ClassificationWARD),
                            FUN = mean)

round(InitialLeaders, 3)
##   Group.1 Income_z ChildHome_z NumWebP_z NumWebVisitsMonth_z Recency_z
## 1       1    1.164      -0.947     0.157              -1.342     0.189
## 2       2   -0.682       0.046    -0.530               0.423    -0.595
## 3       3    0.609       0.382     1.391               0.301    -0.013
## 4       4   -0.827       0.793    -0.663               0.595     1.100
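
For reference, these Ward centroids could also be handed directly to stats::kmeans() as starting centers; the hkmeans() call below wraps exactly this hierarchical-then-K-means idea, so the following is only an equivalent manual sketch (column 1 of InitialLeaders holds the group label and is dropped):

#Seeding K-means manually with the Ward group centroids
K_MEANS_manual <- kmeans(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                         centers = as.matrix(InitialLeaders[, -1]))
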
library(factoextra)

#Performing hierarchical K-means clustering (hkmeans)
K_MEANS <- hkmeans(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                   k = 4,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS
## Hierarchical K-means clustering with 4 clusters of sizes 44, 56, 43, 51
## 
## Cluster means:
##     Income_z ChildHome_z  NumWebP_z NumWebVisitsMonth_z   Recency_z
## 1  1.2059384 -1.01697895  0.0726130          -1.4232561  0.04390027
## 2 -0.7537547  0.03412505 -0.6435780           0.4958348 -0.77547806
## 3  0.5556536  0.14146303  1.4732741           0.3591842 -0.14037876
## 4 -0.6812575  0.72065020 -0.5981449           0.3806196  0.93198914
## 
## Clustering vector:
##   [1] 1 2 2 3 4 3 3 1 1 4 1 1 3 4 4 3 3 4 2 3 4 3 2 2 2 4 1 4 1 2 2 1 4 1 2 3 4 4 2 4 2 3 4 2 4 4 4 1 2 3 3 4 4 1 2 4 2
##  [58] 1 4 2 1 4 2 3 1 2 2 2 4 1 1 3 1 3 3 2 1 3 2 4 1 1 4 4 1 1 2 4 1 2 1 1 4 2 2 2 4 3 2 3 4 4 3 2 1 3 4 2 3 1 2 2 4 4
## [115] 1 3 3 4 2 4 3 2 2 2 2 4 3 2 1 3 2 2 2 4 2 4 2 1 4 1 2 3 3 4 4 4 3 1 2 4 3 4 3 3 1 2 1 2 2 2 1 3 1 1 1 1 1 2 4 2 2
## [172] 3 4 3 3 3 3 3 4 4 1 1 2 2 1 4 2 3 4 4 2 3 3 1
## 
## Within cluster sum of squares by cluster:
## [1]  88.84582 113.99444 102.72623 124.01733
##  (between_SS / total_SS =  55.5 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
##  [8] "iter"         "ifault"       "data"         "hclust"

In the “Cluster means” table we can see the final leaders (the group centroids).

In the clustering vector we can see that the first unit is classified into group 1, the second into group 2, and so on.

fviz_cluster(K_MEANS,
             palette = "jama",
             repel = FALSE,
             ggtheme = theme_classic())

fviz_cluster() displays the units on the first two principal components. We will now check whether any units have been reclassified.

mydata$ClassificationK_MEANS <- K_MEANS$cluster
head(mydata[, c("ID", "ClassificationWARD", "ClassificationK_MEANS")])
## # A tibble: 6 × 3
##      ID ClassificationWARD ClassificationK_MEANS
##   <int>              <int>                 <int>
## 1     1                  1                     1
## 2     2                  2                     2
## 3     3                  2                     2
## 4     4                  3                     3
## 5     5                  2                     4
## 6     6                  3                     3
table(mydata$ClassificationWARD)
## 
##  1  2  3  4 
## 47 75 39 33
table(mydata$ClassificationK_MEANS)
## 
##  1  2  3  4 
## 44 56 43 51
table(mydata$ClassificationWARD, mydata$ClassificationK_MEANS)
##    
##      1  2  3  4
##   1 40  0  3  4
##   2  4 55  5 11
##   3  0  0 35  4
##   4  0  1  0 32

Quite a few units were reclassified.

Group 1 started with 47 units: 40 stayed, 3 moved to group 3 and 4 moved to group 4. It now has 44 units, 4 of which came from group 2.

Group 2 started with 75 units: 4 moved to group 1, 5 to group 3 and 11 to group 4. It now has 56 units, 1 of which came from group 4.

Group 3 started with 39 units: 4 moved to group 4. It now has 43 units, of which 3 came from group 1 and 5 from group 2.

Group 4 started with 33 units: 1 moved to group 2. It now has 51 units, of which 4 came from group 1, 11 from group 2 and 4 from group 3.
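
Because hkmeans() starts from this same Ward solution, the group labels of the two classifications correspond one-to-one, so the total number of reclassified units is simply the off-diagonal sum of the table above:

#Counting units whose K-means group differs from their initial Ward group
sum(mydata$ClassificationWARD != mydata$ClassificationK_MEANS)   #32 of the 194 units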

Averages <- K_MEANS$centers
round(Averages, 3)
##   Income_z ChildHome_z NumWebP_z NumWebVisitsMonth_z Recency_z
## 1    1.206      -1.017     0.073              -1.423     0.044
## 2   -0.754       0.034    -0.644               0.496    -0.775
## 3    0.556       0.141     1.473               0.359    -0.140
## 4   -0.681       0.721    -0.598               0.381     0.932
library(ggplot2)
library(tidyr)

Picture <- as.data.frame(Averages)
Picture$id <- 1:nrow(Picture)
Picture <- pivot_longer(Picture, cols = c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z"))

Picture$Group <- factor(Picture$id,
                        levels = c(1, 2, 3, 4),
                        labels = c("1", "2", "3", "4"))

Picture$nameFactor <- factor(Picture$name,
                             levels = c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z"),
                             labels = c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z"))

ggplot(Picture, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 2) +
  geom_line(aes(group = id, linetype = Group, col = Group), linewidth = 1) +
  ylab("Average") +
  xlab("Cluster variables")

The profiles suggest interpretable segments: group 1 combines high income with few children at home and few web visits; group 2 has below-average income and the most recent purchases (low Recency_z); group 3 pairs above-average income with by far the most web purchases; group 4 has below-average income, more children at home and the longest time since the last purchase.

We will now check the appropriateness of the cluster variables used.

#Are all the cluster variables successful at classifying units into groups? Performing ANOVAs.

fit <- aov(cbind(Income_z, ChildHome_z, NumWebP_z, NumWebVisitsMonth_z, Recency_z) ~ as.factor(ClassificationK_MEANS),
           data = mydata)

summary(fit)
##  Response 1 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 132.751  44.250  139.55 < 2.2e-16 ***
## Residuals                        190  60.249   0.317                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3  72.919  24.306  38.459 < 2.2e-16 ***
## Residuals                        190 120.081   0.632                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 135.007  45.002  147.44 < 2.2e-16 ***
## Residuals                        190  57.993   0.305                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 115.833  38.611  95.067 < 2.2e-16 ***
## Residuals                        190  77.167   0.406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3  78.907 26.3025  43.802 < 2.2e-16 ***
## Residuals                        190 114.093  0.6005                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis for every cluster variable (p < 0.001): the group means differ significantly on all five variables, which supports the choice of cluster variables.
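
The F-tests can be complemented with an effect-size reading: each standardized variable has a total sum of squares of n - 1, so dividing the between-group sum of squares by the total gives eta-squared, the share of variance explained by the grouping. From the tables above it ranges from about 0.38 (ChildHome_z) to 0.70 (NumWebP_z); a minimal sketch to compute it:

#Eta-squared per cluster variable: between-group SS / total SS
vars_z <- c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")
sapply(vars_z, function(v) {
  a <- anova(aov(as.numeric(mydata[[v]]) ~ as.factor(mydata$ClassificationK_MEANS)))
  round(a[["Sum Sq"]][1] / sum(a[["Sum Sq"]]), 3)
})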

aggregate(mydata$NumStorePurchases, 
          by = list(mydata$ClassificationK_MEANS), 
          FUN = "mean")
##   Group.1        x
## 1       1 8.750000
## 2       2 3.428571
## 3       3 8.000000
## 4       4 3.509804
fit <- aov(NumStorePurchases ~ as.factor(ClassificationK_MEANS), 
           data = mydata)

summary(fit)
##                                   Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 1170.2   390.1   85.71 <2e-16 ***
## Residuals                        190  864.7     4.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: average(NumStorePurchases, G1) = average(NumStorePurchases, G2) = average(NumStorePurchases, G3) = average(NumStorePurchases, G4)

H1: at least one group average of NumStorePurchases differs

We can reject the null hypothesis (p < 0.001): the groups also differ in the average number of store purchases, so the variable NumStorePurchases is useful for describing them.
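
The same external-validation pattern can be repeated for the spending variables from the research question, which were likewise kept out of the clustering; a sketch for MntWines (MntFruits, MntMeatProducts and MntFishProducts work the same way):

#Mean wine spending per group, followed by a one-way ANOVA
aggregate(mydata$MntWines,
          by = list(mydata$ClassificationK_MEANS),
          FUN = "mean")

summary(aov(MntWines ~ as.factor(ClassificationK_MEANS),
            data = mydata))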

Conclusion

On the basis of five standardized variables (income, number of children at home, number of web purchases, number of monthly web visits, and recency), we clustered 194 customers into 4 groups with hierarchical clustering (Ward's method, ward.D2, on Euclidean distances). The number of groups was determined from the dendrogram. We then refined the classification with the K-means method, during which 32 of the 194 units were reclassified.