MVA Homework 4

Data

This dataset provides a comprehensive profile of customers, including demographic information, purchasing behavior, and responses to marketing campaigns. Using clustering analysis, we can group customers into distinct segments based on shared characteristics such as income, education, and spending habits. This can help us understand the different types of customers in our market and tailor our marketing strategies to each segment.

Data source: https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering

#Importing dataset
library(readr)
mydata <- read_csv("~/R data/MVA Homework/customer_segmentation.csv")
## Rows: 2240 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Education, Marital_Status, Dt_Customer
## dbl (26): ID, Year_Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 29
##      ID Year_Birth Education  Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits
##   <dbl>      <dbl> <chr>      <chr>           <dbl>   <dbl>    <dbl> <chr>         <dbl>    <dbl>     <dbl>
## 1  5524       1957 Graduation Single          58138       0        0 04-09-2012       58      635        88
## 2  2174       1954 Graduation Single          46344       1        1 08-03-2014       38       11         1
## 3  4141       1965 Graduation Together        71613       0        0 21-08-2013       26      426        49
## 4  6182       1984 Graduation Together        26646       1        0 10-02-2014       26       11         4
## 5  5324       1981 PhD        Married         58293       1        0 19-01-2014       94      173        43
## 6  7446       1967 Master     Together        62513       0        1 09-09-2013       16      520        42
## # ℹ 18 more variables: MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>, MntGoldProds <dbl>,
## #   NumDealsPurchases <dbl>, NumWebPurchases <dbl>, NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## #   NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>, AcceptedCmp5 <dbl>, AcceptedCmp1 <dbl>,
## #   AcceptedCmp2 <dbl>, Complain <dbl>, Z_CostContact <dbl>, Z_Revenue <dbl>, Response <dbl>
#Randomly sampling 200 units to work with, since the full dataset is large (2,240 rows)
set.seed(200) 
mydata <- mydata[sample(nrow(mydata), 200), ]

#Replacing the original ID column with a sequential ID
mydata$ID <- seq(1, nrow(mydata))

#Creating columns with Age of customers and number of children
mydata$Age <- 2024 - mydata$Year_Birth
mydata$ChildHome <- mydata$Kidhome + mydata$Teenhome

# Selecting the relevant columns
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- select(mydata, ID, Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts, Age, ChildHome, NumWebPurchases, NumStorePurchases, Recency, NumWebVisitsMonth)
head(mydata)
## # A tibble: 6 × 12
##      ID Income MntWines MntFruits MntMeatProducts MntFishProducts   Age ChildHome NumWebPurchases NumStorePurchases
##   <int>  <dbl>    <dbl>     <dbl>           <dbl>           <dbl> <dbl>     <dbl>           <dbl>             <dbl>
## 1     1  80617      594        51             631              72    29         0               4                 8
## 2     2  41658        8         4              12              15    49         2               1                 2
## 3     3  42664       21         0               3               0    57         1               1                 3
## 4     4  81975      983        76             184             180    62         1               6                 4
## 5     5  16014        3         9               4               7    45         2               1                 4
## 6     6  60482      255        43             134              37    48         1               7                 7
## # ℹ 2 more variables: Recency <dbl>, NumWebVisitsMonth <dbl>

Research question

How can we segment our customer base into distinct groups based on their income and spending habits on various products (wines, fruits, meat, and fish products)?

Descriptive statistics

# Removing rows where units have NA in column Income 
mydata <- mydata[!is.na(mydata$Income), ]
summary(mydata)
##        ID            Income         MntWines        MntFruits      MntMeatProducts MntFishProducts       Age      
##  Min.   :  1.0   Min.   : 4023   Min.   :   0.0   Min.   :  0.00   Min.   :  1.0   Min.   :  0.00   Min.   :29.0  
##  1st Qu.: 51.0   1st Qu.:31385   1st Qu.:  16.0   1st Qu.:  2.00   1st Qu.: 12.0   1st Qu.:  3.00   1st Qu.:46.0  
##  Median :101.0   Median :48070   Median : 123.0   Median :  7.00   Median : 46.0   Median : 11.00   Median :53.0  
##  Mean   :100.9   Mean   :48962   Mean   : 286.1   Mean   : 22.92   Mean   :157.1   Mean   : 31.58   Mean   :54.3  
##  3rd Qu.:151.0   3rd Qu.:65316   3rd Qu.: 493.0   3rd Qu.: 26.00   3rd Qu.:207.0   3rd Qu.: 32.00   3rd Qu.:65.0  
##  Max.   :200.0   Max.   :95529   Max.   :1285.0   Max.   :199.00   Max.   :890.0   Max.   :250.00   Max.   :84.0  
##    ChildHome      NumWebPurchases  NumStorePurchases    Recency      NumWebVisitsMonth
##  Min.   :0.0000   Min.   : 0.000   Min.   : 0.000    Min.   : 0.00   Min.   : 0.000   
##  1st Qu.:0.0000   1st Qu.: 2.000   1st Qu.: 3.000    1st Qu.:26.00   1st Qu.: 4.000   
##  Median :1.0000   Median : 3.000   Median : 4.000    Median :51.00   Median : 6.000   
##  Mean   :0.9442   Mean   : 3.888   Mean   : 5.584    Mean   :48.85   Mean   : 5.548   
##  3rd Qu.:1.0000   3rd Qu.: 6.000   3rd Qu.: 8.000    3rd Qu.:74.00   3rd Qu.: 7.000   
##  Max.   :3.0000   Max.   :11.000   Max.   :13.000    Max.   :99.00   Max.   :20.000

A few highlights from the summary: Income ranges from 4,023 to 95,529 with a median of 48,070. Spending on wines is strongly right-skewed (median 123 vs. mean 286.1), and the other product categories show the same pattern. Customers are between 29 and 84 years old (median 53), and households have at most three children at home.

Clustering

#Standardizing
mydata$Income_z <- scale(mydata$Income)
mydata$MntWines_z <- scale(mydata$MntWines)
mydata$MntFruits_z <- scale(mydata$MntFruits)
mydata$MntMeatProducts_z <- scale(mydata$MntMeatProducts)
mydata$MntFishProducts_z <- scale(mydata$MntFishProducts)
mydata$ChildHome_z <- scale(mydata$ChildHome)
mydata$NumWebP_z <- scale(mydata$NumWebPurchases)
mydata$NumStoreP_z <- scale(mydata$NumStorePurchases)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)
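
The ten scale() calls can be collapsed into one step with dplyr::across(); this optional, equivalent sketch wraps scale() in as.numeric() so the new columns are plain vectors rather than one-column matrices, and renames two columns to match the shortened names used below:

#Standardizing all ten variables in one step
library(dplyr)
vars <- c("Income", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts",
          "ChildHome", "NumWebPurchases", "NumStorePurchases", "Recency", "NumWebVisitsMonth")
mydata <- mydata %>%
  mutate(across(all_of(vars), ~ as.numeric(scale(.x)), .names = "{.col}_z")) %>%
  rename(NumWebP_z = NumWebPurchases_z, NumStoreP_z = NumStorePurchases_z)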

Correlation Matrix

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[, c("Income_z", "ChildHome_z", "NumWebP_z", "Recency_z", "NumWebVisitsMonth_z")]), 
      type = "pearson")
##                     Income_z ChildHome_z NumWebP_z Recency_z NumWebVisitsMonth_z
## Income_z                1.00       -0.34      0.54     -0.02               -0.66
## ChildHome_z            -0.34        1.00     -0.16      0.13                0.35
## NumWebP_z               0.54       -0.16      1.00     -0.08               -0.05
## Recency_z              -0.02        0.13     -0.08      1.00               -0.04
## NumWebVisitsMonth_z    -0.66        0.35     -0.05     -0.04                1.00
## 
## n= 197 
## 
## 
## P
##                     Income_z ChildHome_z NumWebP_z Recency_z NumWebVisitsMonth_z
## Income_z                     0.0000      0.0000    0.8227    0.0000             
## ChildHome_z         0.0000               0.0257    0.0626    0.0000             
## NumWebP_z           0.0000   0.0257                0.2708    0.4457             
## Recency_z           0.8227   0.0626      0.2708              0.5626             
## NumWebVisitsMonth_z 0.0000   0.0000      0.4457    0.5626
The strongest associations among the candidate cluster variables are between Income_z and NumWebVisitsMonth_z (r = -0.66) and between Income_z and NumWebP_z (r = 0.54); none of the correlations is high enough to make a variable redundant.

#Finding outliers: for each unit, the Euclidean distance from the overall centroid
#(the origin of the standardized space) across the five cluster variables
mydata$Dissimilarity_z <- sqrt(mydata$Income_z^2 + mydata$NumWebP_z^2 + mydata$Recency_z^2 + mydata$NumWebVisitsMonth_z^2 + mydata$ChildHome_z^2)

#Showing the 10 units with the highest dissimilarity
head(mydata[order(-mydata$Dissimilarity_z), c("ID", "Dissimilarity_z")], 10)
## # A tibble: 10 × 2
##       ID Dissimilarity_z[,1]
##    <int>               <dbl>
##  1    90                5.93
##  2   109                5.72
##  3    30                4.28
##  4   168                3.37
##  5   179                3.28
##  6   101                3.24
##  7   126                3.21
##  8   156                3.19
##  9   172                3.15
## 10   163                3.09
#Removing the three most dissimilar units (dissimilarity above 4)
library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(90, 109, 30))
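
The same removal can be written without hard-coding IDs, as a threshold on the dissimilarity measure; a minimal sketch, with the cutoff of 4 chosen by eye to separate the three extreme units (5.93, 5.72 and 4.28) from the rest of the list above:

#Equivalent removal via a threshold on the dissimilarity measure
mydata <- filter(mydata, as.numeric(Dissimilarity_z) < 4)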

#Standardizing one more time after removing units
mydata$Income_z <- scale(mydata$Income)
mydata$ChildHome_z <- scale(mydata$ChildHome)
mydata$NumWebP_z <- scale(mydata$NumWebPurchases)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)

round(head(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")], 3), 2)
## # A tibble: 3 × 5
##   Income_z[,1] ChildHome_z[,1] NumWebP_z[,1] NumWebVisitsMonth_z[,1] Recency_z[,1]
##          <dbl>           <dbl>         <dbl>                   <dbl>         <dbl>
## 1         1.49           -1.22          0.02                   -1.46         -0.24
## 2        -0.38            1.39         -1.11                   -0.59         -0.66
## 3        -0.34            0.08         -1.11                    0.28         -0.17
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distance_euclidean <- get_dist(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                               method = "euclidean")

#Squaring the Euclidean distances before displaying them
Distance_euclidean2 <- Distance_euclidean^2

#Showing dissimilarity matrix
fviz_dist(Distance_euclidean2)

Based on the dissimilarity matrix, it appears that the customers can be classified into four groups. Before clustering, however, we check the Hopkins statistic to see whether the data are clusterable at all.

#Checking whether the data are clusterable with the Hopkins statistic
#install.packages("factoextra")
library(factoextra)
get_clust_tendency(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.6484489
## 
## $plot
## NULL

The Hopkins statistic is 0.65, which is above 0.5, so the data show some clustering tendency and are worth clustering (values around 0.5 would indicate uniformly random data).

library(dplyr)
library(factoextra)

WARD <- mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")] %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 194

With 194 units, the agglomeration needs 193 merge steps to arrive at a single cluster (group).

library(factoextra)

fviz_dend(WARD,
          k = 4,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Based on the dendrogram, cutting the tree into four groups appears to be the best choice.
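
As a cross-check that is not part of the original pipeline, the average silhouette width can be inspected over a range of k with factoextra's fviz_nbclust(); a minimal sketch using hcut(), which defaults to Ward's method (ward.D2) on Euclidean distances:

#Optional: average silhouette width for k = 2, ..., 10
library(factoextra)
fviz_nbclust(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
             hcut, method = "silhouette", k.max = 10)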

We now determine the initial leaders of the groups, which will serve as the starting points for the K-means method.

mydata$ClassificationWARD <- cutree(WARD,
                                    k = 4)

head(mydata[c("ID", "Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z", "Dissimilarity_z", "ClassificationWARD")])
## # A tibble: 6 × 8
##      ID Income_z[,1] ChildHome_z[,1] NumWebP_z[,1] NumWebVisitsMonth_z[,1] Recency_z[,1] Dissimilarity_z[,1]
##   <int>        <dbl>           <dbl>         <dbl>                   <dbl>         <dbl>               <dbl>
## 1     1        1.49          -1.22          0.0213                  -1.46         -0.241                2.33
## 2     2       -0.383          1.39         -1.11                    -0.592        -0.655                1.97
## 3     3       -0.335          0.0807       -1.11                     0.278        -0.172                1.15
## 4     4        1.55           0.0807        0.774                    0.713        -1.62                 2.43
## 5     5       -1.61           1.39         -1.11                    -1.03         -0.241                2.52
## 6     6        0.520          0.0807        1.15                     0.713         1.10                 1.78
## # ℹ 1 more variable: ClassificationWARD <int>
#Calculating the positions of the initial leaders (group centroids)

InitialLeaders <- aggregate(mydata[,c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                            by = list(mydata$ClassificationWARD),
                            FUN = mean)

round(InitialLeaders, 3)
##   Group.1 Income_z ChildHome_z NumWebP_z NumWebVisitsMonth_z Recency_z
## 1       1    1.164      -0.947     0.157              -1.342     0.189
## 2       2   -0.682       0.046    -0.530               0.423    -0.595
## 3       3    0.609       0.382     1.391               0.301    -0.013
## 4       4   -0.827       0.793    -0.663               0.595     1.100
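
For reference, these Ward centroids could also be handed directly to stats::kmeans() as starting centers; the hkmeans() call below wraps exactly this hierarchical-then-K-means idea, so the following is only an equivalent manual sketch (column 1 of InitialLeaders holds the group label and is dropped):

#Seeding K-means manually with the Ward group centroids
K_MEANS_manual <- kmeans(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                         centers = as.matrix(InitialLeaders[, -1]))
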
library(factoextra)

#Performing hierarchical K-means clustering (hkmeans)
K_MEANS <- hkmeans(mydata[c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")],
                   k = 4,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS
## Hierarchical K-means clustering with 4 clusters of sizes 44, 56, 43, 51
## 
## Cluster means:
##     Income_z ChildHome_z  NumWebP_z NumWebVisitsMonth_z   Recency_z
## 1  1.2059384 -1.01697895  0.0726130          -1.4232561  0.04390027
## 2 -0.7537547  0.03412505 -0.6435780           0.4958348 -0.77547806
## 3  0.5556536  0.14146303  1.4732741           0.3591842 -0.14037876
## 4 -0.6812575  0.72065020 -0.5981449           0.3806196  0.93198914
## 
## Clustering vector:
##   [1] 1 2 2 3 4 3 3 1 1 4 1 1 3 4 4 3 3 4 2 3 4 3 2 2 2 4 1 4 1 2 2 1 4 1 2 3 4 4 2 4 2 3 4 2 4 4 4 1 2 3 3 4 4 1 2 4 2
##  [58] 1 4 2 1 4 2 3 1 2 2 2 4 1 1 3 1 3 3 2 1 3 2 4 1 1 4 4 1 1 2 4 1 2 1 1 4 2 2 2 4 3 2 3 4 4 3 2 1 3 4 2 3 1 2 2 4 4
## [115] 1 3 3 4 2 4 3 2 2 2 2 4 3 2 1 3 2 2 2 4 2 4 2 1 4 1 2 3 3 4 4 4 3 1 2 4 3 4 3 3 1 2 1 2 2 2 1 3 1 1 1 1 1 2 4 2 2
## [172] 3 4 3 3 3 3 3 4 4 1 1 2 2 1 4 2 3 4 4 2 3 3 1
## 
## Within cluster sum of squares by cluster:
## [1]  88.84582 113.99444 102.72623 124.01733
##  (between_SS / total_SS =  55.5 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
##  [8] "iter"         "ifault"       "data"         "hclust"

In the “Cluster means” table we can see the final leaders (the group centroids).

In the clustering vector we can see that the first unit is classified into group 1, the second into group 2, and so on.

fviz_cluster(K_MEANS,
             palette = "jama",
             repel = FALSE,
             ggtheme = theme_classic())

fviz_cluster() displays the units on the first two principal components. We will now check whether any units have been reclassified.

mydata$ClassificationK_MEANS <- K_MEANS$cluster
head(mydata[, c("ID", "ClassificationWARD", "ClassificationK_MEANS")])
## # A tibble: 6 × 3
##      ID ClassificationWARD ClassificationK_MEANS
##   <int>              <int>                 <int>
## 1     1                  1                     1
## 2     2                  2                     2
## 3     3                  2                     2
## 4     4                  3                     3
## 5     5                  2                     4
## 6     6                  3                     3
table(mydata$ClassificationWARD)
## 
##  1  2  3  4 
## 47 75 39 33
table(mydata$ClassificationK_MEANS)
## 
##  1  2  3  4 
## 44 56 43 51
table(mydata$ClassificationWARD, mydata$ClassificationK_MEANS)
##    
##      1  2  3  4
##   1 40  0  3  4
##   2  4 55  5 11
##   3  0  0 35  4
##   4  0  1  0 32

Quite a few units were reclassified.

Group 1 started with 47 units: 40 stayed, 3 moved to group 3 and 4 moved to group 4. It now has 44 units, 4 of which came from group 2.

Group 2 started with 75 units: 4 moved to group 1, 5 to group 3 and 11 to group 4. It now has 56 units, 1 of which came from group 4.

Group 3 started with 39 units: 4 moved to group 4. It now has 43 units, of which 3 came from group 1 and 5 from group 2.

Group 4 started with 33 units: 1 moved to group 2. It now has 51 units, of which 4 came from group 1, 11 from group 2 and 4 from group 3.
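
Because hkmeans() starts from this same Ward solution, the group labels of the two classifications correspond one-to-one, so the total number of reclassified units is simply the off-diagonal sum of the table above:

#Counting units whose K-means group differs from their initial Ward group
sum(mydata$ClassificationWARD != mydata$ClassificationK_MEANS)   #32 of the 194 units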

Averages <- K_MEANS$centers
round(Averages, 3)
##   Income_z ChildHome_z NumWebP_z NumWebVisitsMonth_z Recency_z
## 1    1.206      -1.017     0.073              -1.423     0.044
## 2   -0.754       0.034    -0.644               0.496    -0.775
## 3    0.556       0.141     1.473               0.359    -0.140
## 4   -0.681       0.721    -0.598               0.381     0.932
library(ggplot2)
library(tidyr)

Picture <- as.data.frame(Averages)
Picture$id <- 1:nrow(Picture)
Picture <- pivot_longer(Picture, cols = c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z"))

Picture$Group <- factor(Picture$id,
                        levels = c(1, 2, 3, 4),
                        labels = c("1", "2", "3", "4"))

Picture$nameFactor <- factor(Picture$name,
                             levels = c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z"),
                             labels = c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z"))

ggplot(Picture, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 2) +
  geom_line(aes(group = id, linetype = Group, col = Group), linewidth = 1) +
  ylab("Average") +
  xlab("Cluster variables")

The profiles suggest interpretable segments: group 1 combines high income with few children at home and few web visits; group 2 has below-average income and the most recent purchases (low Recency_z); group 3 pairs above-average income with by far the most web purchases; group 4 has below-average income, more children at home and the longest time since the last purchase.

We will now check the appropriateness of the cluster variables used.

#Are all the cluster variables successful at classifying units into groups? Performing ANOVAs.

fit <- aov(cbind(Income_z, ChildHome_z, NumWebP_z, NumWebVisitsMonth_z, Recency_z) ~ as.factor(ClassificationK_MEANS),
           data = mydata)

summary(fit)
##  Response 1 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 132.751  44.250  139.55 < 2.2e-16 ***
## Residuals                        190  60.249   0.317                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3  72.919  24.306  38.459 < 2.2e-16 ***
## Residuals                        190 120.081   0.632                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 135.007  45.002  147.44 < 2.2e-16 ***
## Residuals                        190  57.993   0.305                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 115.833  38.611  95.067 < 2.2e-16 ***
## Residuals                        190  77.167   0.406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationK_MEANS)   3  78.907 26.3025  43.802 < 2.2e-16 ***
## Residuals                        190 114.093  0.6005                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis for every cluster variable (p < 0.001): the group means differ significantly on all five variables, which supports the choice of cluster variables.
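
The F-tests can be complemented with an effect-size reading: each standardized variable has a total sum of squares of n - 1, so dividing the between-group sum of squares by the total gives eta-squared, the share of variance explained by the grouping. From the tables above it ranges from about 0.38 (ChildHome_z) to 0.70 (NumWebP_z); a minimal sketch to compute it:

#Eta-squared per cluster variable: between-group SS / total SS
vars_z <- c("Income_z", "ChildHome_z", "NumWebP_z", "NumWebVisitsMonth_z", "Recency_z")
sapply(vars_z, function(v) {
  a <- anova(aov(as.numeric(mydata[[v]]) ~ as.factor(mydata$ClassificationK_MEANS)))
  round(a[["Sum Sq"]][1] / sum(a[["Sum Sq"]]), 3)
})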

aggregate(mydata$NumStorePurchases, 
          by = list(mydata$ClassificationK_MEANS), 
          FUN = "mean")
##   Group.1        x
## 1       1 8.750000
## 2       2 3.428571
## 3       3 8.000000
## 4       4 3.509804
fit <- aov(NumStorePurchases ~ as.factor(ClassificationK_MEANS), 
           data = mydata)

summary(fit)
##                                   Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClassificationK_MEANS)   3 1170.2   390.1   85.71 <2e-16 ***
## Residuals                        190  864.7     4.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: average(NumStorePurchases, G1) = average(NumStorePurchases, G2) = average(NumStorePurchases, G3) = average(NumStorePurchases, G4)

H1: at least one group average of NumStorePurchases differs

We can reject the null hypothesis (p < 0.001): the groups also differ in the average number of store purchases, so the variable NumStorePurchases is useful for describing them.
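
The same external-validation pattern can be repeated for the spending variables from the research question, which were likewise kept out of the clustering; a sketch for MntWines (MntFruits, MntMeatProducts and MntFishProducts work the same way):

#Mean wine spending per group, followed by a one-way ANOVA
aggregate(mydata$MntWines,
          by = list(mydata$ClassificationK_MEANS),
          FUN = "mean")

summary(aov(MntWines ~ as.factor(ClassificationK_MEANS),
            data = mydata))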

Conclusion

On the basis of five standardized variables (income, number of children at home, number of web purchases, number of monthly web visits, and recency), we clustered 194 customers into 4 groups with hierarchical clustering (Ward's method, ward.D2, on Euclidean distances). The number of groups was determined from the dendrogram. We then refined the classification with the K-means method, during which 32 of the 194 units were reclassified.