Preparation - Workspace and Notation
Read and Describe Data
Variables Selection
Hierarchical Clustering
5 Hierarchical Clustering
Calculate the Means per Cluster
- Profile Customers for Targeting
Partitional Clustering (here: K-Means)
- Set Seed
- Example: Kmeans for 5 Clusters

In this exercise, you will analyze a data set derived from a survey with the following questions:

Attitudinal questions regarding online shopping:

When shopping online I…

fun_exploring: … enjoy discovering the newest products. (1: does not apply, 7: completely applies)
search_specific: … am only looking for specific products. (1: does not apply, 7: completely applies)
buy_too_much: … often buy more than planned. (1: does not apply, 7: completely applies)
compare_prices: … compare prices on different websites. (1: does not apply, 7: completely applies)

Preparation - Workspace and Notation

# Before you start, always clear your workspace first:
rm(list = ls())

# Set the working directory to the current directory
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
getwd()

## [1] "/Users/peterkurschner/Library/Mobile Documents/com~apple~CloudDocs/Documents/ExchangeNachhilfeschüler/remlingeramelie@gmail.com/Aufgaben/16.01.2026"

# Optionally: Prevent R from using scientific number formatting (e.g., 1E-1, instead of 0.1)
options(scipen = 999)

Read and Describe Data

# YOUR TASK!
# Read the file "data_ST_exercise.csv" into the data object "data_clustering"
data_clustering<- read.csv("data_ST_exercise.csv")

# Describe the data
str(data_clustering)

## 'data.frame':    40 obs. of  9 variables:
##  $ fun_exploring   : int  7 2 6 3 2 2 2 4 7 4 ...
##  $ search_specific : int  3 6 3 7 7 6 6 6 2 7 ...
##  $ buy_too_much    : int  7 2 6 2 1 1 2 3 6 3 ...
##  $ compare_prices  : int  4 6 3 6 1 6 2 7 4 2 ...
##  $ age             : int  24 39 24 19 42 32 45 35 18 30 ...
##  $ income          : int  55000 55000 45000 25000 30000 35000 55000 30000 70000 80000 ...
##  $ read_online_news: int  4 3 4 5 2 5 3 6 3 4 ...
##  $ use_social_media: int  6 3 6 4 1 4 3 4 5 1 ...
##  $ listen_radio    : int  2 4 1 3 5 4 6 3 2 6 ...

Number of respondents (sample size, i.e., the number of rows): 40

Number of variables (the measured attributes, i.e., the number of columns): 9

summary(data_clustering)

##  fun_exploring   search_specific  buy_too_much compare_prices      age       
##  Min.   :1.000   Min.   :1.00    Min.   :1.0   Min.   :1.00   Min.   :15.00  
##  1st Qu.:2.000   1st Qu.:3.00    1st Qu.:2.0   1st Qu.:2.75   1st Qu.:24.00  
##  Median :4.000   Median :6.00    Median :4.0   Median :3.00   Median :30.50  
##  Mean   :3.825   Mean   :4.95    Mean   :3.8   Mean   :3.70   Mean   :30.95  
##  3rd Qu.:5.250   3rd Qu.:7.00    3rd Qu.:6.0   3rd Qu.:5.00   3rd Qu.:38.25  
##  Max.   :7.000   Max.   :7.00    Max.   :7.0   Max.   :7.00   Max.   :45.00  
##      income      read_online_news use_social_media  listen_radio 
##  Min.   :25000   Min.   :1.000    Min.   :1.00     Min.   :1.00  
##  1st Qu.:30000   1st Qu.:2.000    1st Qu.:2.75     1st Qu.:2.00  
##  Median :42500   Median :3.000    Median :4.00     Median :4.00  
##  Mean   :46000   Mean   :3.375    Mean   :3.85     Mean   :3.55  
##  3rd Qu.:60000   3rd Qu.:4.250    3rd Qu.:6.00     3rd Qu.:5.00  
##  Max.   :80000   Max.   :7.000    Max.   :7.00     Max.   :7.00

Variables Selection

# This code chunk prepares a data set for segmentation (e.g., a cluster analysis)
# Select segmentation variables and save them in a new data set:
data_segmentation <- data_clustering[, c("fun_exploring", "search_specific",
  "buy_too_much", "compare_prices")]
head(data_segmentation)

Hierarchical Clustering

Basis for the hierarchical cluster analysis: the goal is to determine mathematically how “close” or “far apart” the individual respondents are in their answers. Explanation of the terms:

dist_matrix: The name of the new object. It stores a so-called distance matrix: a table that holds, for every pair of respondents (e.g., person A and person B), a value indicating how dissimilar they are.
dist(…): The R function that computes these distances. It compares every row of the data set with every other row.
method = "euclidean": The Euclidean distance is the standard method (like a ruler measuring the straight-line distance between two points in space). The smaller the value, the more similar two respondents are.
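As a quick check of what dist() computes, the following sketch recomputes the Euclidean distance for the first two respondents by hand. The answer vectors are copied from the first two values of each segmentation variable in the str() output above; the object names (respondent_1, respondent_2, manual_distance, builtin_distance) are chosen here for illustration only.

```r
# Answers of the first two respondents, copied from the str() output above
respondent_1 <- c(fun_exploring = 7, search_specific = 3, buy_too_much = 7, compare_prices = 4)
respondent_2 <- c(fun_exploring = 2, search_specific = 6, buy_too_much = 2, compare_prices = 6)

# Euclidean distance by hand: square root of the summed squared differences
manual_distance <- sqrt(sum((respondent_1 - respondent_2)^2))

# The same distance via dist(), applied to a two-row matrix
builtin_distance <- as.numeric(dist(rbind(respondent_1, respondent_2), method = "euclidean"))

round(manual_distance, 2)
round(builtin_distance, 2)
```

Both values round to 7.94, which is the entry for respondents 1 and 2 in the distance matrix printed in this exercise.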

5 Hierarchical Clustering

Perform Hierarchical Clustering

# Calculate distances (Euclidean)
distances <- dist(data_segmentation, method = "euclidean")

# Hierarchical clustering using Ward's method
hclust_solution <- hclust(distances, method = "ward.D2")

Distance Matrix with Euclidean Distance

YOUR TASK! Create the distance matrix

dist_matrix <- dist(data_segmentation, method = "euclidean")
round(dist_matrix, 2)

##       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2  7.94                                                                      
## 3  1.73 7.07                                                                 
## 4  7.81 1.41 7.07                                                            
## 5  9.27 5.20 7.81 5.20                                                       
## 6  8.60 1.00 7.68 1.73 5.10                                                  
## 7  7.94 4.00 6.48 4.24 1.73 4.12                                             
## 8  6.56 2.45 6.16 2.00 6.71 3.00 5.48                                        
## 9  1.41 7.81 1.73 7.81 9.17 8.37 7.81 6.56                                   
## 10 6.71 4.69 5.48 4.24 3.00 5.00 2.45 5.10 6.86                              
## 11 8.43 1.41 7.35 2.00 4.12 1.00 3.16 3.46 8.19 4.24                         
## 12 7.62 3.32 6.40 3.32 2.83 3.74 1.73 4.58 7.75 2.24 3.00                    
## 13 1.73 7.07 0.00 7.07 7.81 7.68 6.48 6.16 1.73 5.48 7.35 6.40               
## 14 8.43 3.16 7.07 3.46 2.24 3.00 1.41 4.90 8.19 3.16 2.00 2.24 7.07          
## 15 9.00 4.24 7.62 4.47 1.73 4.36 1.41 6.00 9.00 3.16 3.46 1.73 7.62 2.00     
## 16 6.16 5.00 5.00 4.58 3.74 5.48 3.00 5.20 6.48 1.00 4.80 2.45 5.00 3.87 3.61
## 17 1.41 7.81 1.73 7.81 9.17 8.37 7.81 6.56 0.00 6.86 8.19 7.75 1.73 8.19 9.00
## 18 8.43 3.16 7.07 3.74 2.65 3.32 1.41 5.10 8.31 3.46 2.45 1.73 7.07 1.41 1.41
## 19 8.83 3.32 7.55 3.61 2.45 3.46 1.73 5.20 8.83 3.32 2.65 1.41 7.55 1.73 1.00
## 20 3.00 6.00 1.41 6.16 6.71 6.56 5.29 5.48 2.65 4.69 6.16 5.39 1.41 5.83 6.48
## 21 2.00 6.71 1.00 6.56 7.35 7.35 6.08 5.74 2.45 4.80 7.00 5.83 1.00 6.71 7.14
## 22 0.00 7.94 1.73 7.81 9.27 8.60 7.94 6.56 1.41 6.71 8.43 7.62 1.73 8.43 9.00
## 23 8.60 4.12 7.14 4.58 2.00 4.24 1.00 5.92 8.49 3.32 3.32 2.00 7.14 1.73 1.00
## 24 8.12 2.24 7.00 2.24 3.16 2.45 2.24 3.87 8.12 3.00 1.73 1.41 7.00 1.73 2.24
## 25 5.57 2.83 5.10 2.45 6.24 3.61 4.90 1.41 5.74 4.24 3.74 3.87 5.10 4.69 5.48
## 26 2.45 6.56 1.00 6.71 7.35 7.21 5.92 5.92 2.45 5.20 6.86 5.83 1.00 6.56 7.00
## 27 5.92 2.45 5.10 2.45 5.20 3.32 3.74 2.45 6.08 3.46 3.16 2.65 5.10 3.74 4.24
## 28 1.73 7.35 1.41 7.21 8.06 8.06 6.78 6.32 2.65 5.48 7.75 6.40 1.41 7.48 7.75
## 29 2.45 8.54 1.73 8.66 8.83 9.17 7.55 7.81 2.45 6.71 8.77 7.62 1.73 8.31 8.66
## 30 7.81 1.41 7.07 0.00 5.20 1.73 4.24 2.00 7.81 4.24 2.00 3.32 7.07 3.46 4.47
## 31 5.29 3.00 4.58 2.65 5.48 3.74 4.12 2.24 5.48 3.32 3.61 3.16 4.58 4.12 4.80
## 32 4.24 7.81 3.00 8.06 7.87 8.12 6.71 7.42 3.16 6.40 7.68 7.35 3.00 7.14 8.06
## 33 3.16 7.55 1.73 7.81 7.75 8.12 6.40 7.14 2.83 5.92 7.68 6.63 1.73 7.14 7.55
## 34 4.58 7.35 3.16 7.75 7.42 7.68 6.16 7.21 3.61 6.16 7.21 6.86 3.16 6.63 7.48
## 35 5.29 3.00 4.58 2.65 5.48 3.74 4.12 2.24 5.48 3.32 3.61 3.16 4.58 4.12 4.80
## 36 7.62 3.32 6.40 3.32 2.83 3.74 1.73 4.58 7.75 2.24 3.00 0.00 6.40 2.24 1.73
## 37 2.65 7.07 1.41 7.35 7.94 7.68 6.48 6.48 2.24 6.00 7.35 6.56 1.41 7.07 7.62
## 38 6.56 2.45 6.16 2.00 6.71 3.00 5.48 0.00 6.56 5.10 3.46 4.58 6.16 4.90 6.00
## 39 8.19 3.16 6.93 3.16 2.24 3.32 1.41 4.69 8.19 2.45 2.45 1.00 6.93 1.41 1.41
## 40 5.74 2.45 5.10 2.83 6.24 3.32 4.69 2.00 5.74 4.69 3.46 3.87 5.10 4.47 5.29
##      16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
## 2                                                                            
## 3                                                                            
## 4                                                                            
## 5                                                                            
## 6                                                                            
## 7                                                                            
## 8                                                                            
## 9                                                                            
## 10                                                                           
## 11                                                                           
## 12                                                                           
## 13                                                                           
## 14                                                                           
## 15                                                                           
## 16                                                                           
## 17 6.48                                                                      
## 18 3.87 8.31                                                                 
## 19 3.74 8.83 1.00                                                            
## 20 4.36 2.65 5.83 6.40                                                       
## 21 4.24 2.45 6.71 7.07 1.73                                                  
## 22 6.16 1.41 8.43 8.83 3.00 2.00                                             
## 23 3.74 8.49 1.00 1.41 5.92 6.78 8.60                                        
## 24 3.46 8.12 1.73 1.41 5.92 6.48 8.12 2.45                                   
## 25 4.12 5.74 4.69 4.80 4.47 4.58 5.57 5.39 3.61                              
## 26 4.69 2.45 6.40 6.93 1.00 1.41 2.45 6.48 6.48 4.80                         
## 27 3.32 6.08 3.46 3.61 4.24 4.58 5.92 4.12 2.65 1.41 4.58                    
## 28 4.80 2.65 7.35 7.68 2.45 1.00 1.73 7.42 7.14 5.10 1.73 5.10               
## 29 6.16 2.45 8.19 8.72 2.65 2.45 2.45 8.12 8.37 6.71 2.00 6.56 2.24          
## 30 4.58 7.81 3.74 3.61 6.16 6.56 7.81 4.58 2.24 2.45 6.71 2.45 7.21 8.66     
## 31 3.16 5.48 4.12 4.24 3.87 4.00 5.29 4.69 3.16 1.00 4.24 1.00 4.58 6.16 2.65
## 32 6.32 3.16 7.42 8.12 2.65 3.74 4.24 7.35 7.75 6.71 3.16 6.56 4.36 3.16 8.06
## 33 5.48 2.83 7.00 7.62 1.73 2.45 3.16 6.93 7.35 6.08 1.41 5.74 2.65 1.41 7.81
## 34 6.08 3.61 6.78 7.55 2.45 3.87 4.58 6.71 7.28 6.48 3.00 6.16 4.47 3.32 7.75
## 35 3.16 5.48 4.12 4.24 3.87 4.00 5.29 4.69 3.16 1.00 4.24 1.00 4.58 6.16 2.65
## 36 2.45 7.75 1.73 1.41 5.39 5.83 7.62 2.00 1.41 3.87 5.83 2.65 6.40 7.62 3.32
## 37 5.57 2.24 6.93 7.55 1.41 2.24 2.65 7.00 7.14 5.48 1.00 5.29 2.45 1.73 7.35
## 38 5.20 6.56 5.10 5.20 5.48 5.74 6.56 5.92 3.87 1.41 5.92 2.45 6.32 7.81 2.00
## 39 3.00 8.19 1.41 1.00 5.83 6.40 8.19 1.73 1.00 4.24 6.40 3.16 7.07 8.19 3.16
## 40 4.58 5.74 4.24 4.58 4.24 4.80 5.74 5.00 3.61 1.41 4.58 1.41 5.29 6.56 2.83
##      31   32   33   34   35   36   37   38   39
## 2                                              
## 3                                              
## 4                                              
## 5                                              
## 6                                              
## 7                                              
## 8                                              
## 9                                              
## 10                                             
## 11                                             
## 12                                             
## 13                                             
## 14                                             
## 15                                             
## 16                                             
## 17                                             
## 18                                             
## 19                                             
## 20                                             
## 21                                             
## 22                                             
## 23                                             
## 24                                             
## 25                                             
## 26                                             
## 27                                             
## 28                                             
## 29                                             
## 30                                             
## 31                                             
## 32 6.16                                        
## 33 5.48 2.45                                   
## 34 5.92 1.00 2.24                              
## 35 0.00 6.16 5.48 5.92                         
## 36 3.16 7.35 6.63 6.86 3.16                    
## 37 5.00 2.65 1.00 2.45 5.00 6.56               
## 38 2.24 7.42 7.14 7.21 2.24 4.58 6.48          
## 39 3.61 7.55 7.14 7.07 3.61 1.00 7.07 4.69     
## 40 1.73 6.40 5.74 6.00 1.73 3.87 5.10 2.00 4.24

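Interpretation: When reading the output, focus on the values. R prints only the lower triangle of the matrix, so the diagonal (each respondent compared with themselves, distance 0) is not shown. A 0.00 in the printed output therefore means that two different respondents gave identical answers (e.g., respondents 3 and 13). High values indicate respondents with very different profiles in data_segmentation.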
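Pairs with a distance of zero can also be listed programmatically rather than read off the printout. The sketch below uses a small made-up data set (toy_answers, not the exercise file) in which rows 1 and 3 are identical; all object names are illustrative.

```r
# Toy answers: rows 1 and 3 are identical on purpose
toy_answers <- data.frame(q1 = c(7, 2, 7, 3), q2 = c(3, 6, 3, 7))

# Convert the dist object to a full square matrix
toy_dist <- as.matrix(dist(toy_answers, method = "euclidean"))

# Keep only the upper triangle so that each pair is reported once
# and the diagonal (self-comparisons) is ignored
identical_pairs <- which(toy_dist == 0 & upper.tri(toy_dist), arr.ind = TRUE)
identical_pairs
```

Each row of identical_pairs holds the row and column index of one pair of respondents with identical answers.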

YOUR TASK! Conduct the cluster analysis with Ward's method. In this step you perform the actual grouping (clustering). Using the distances calculated in the previous step, the algorithm now determines which respondents are merged into a cluster.

Run Clustering with Ward Method

# Run the cluster analysis with Ward's method
hclust_solution <- hclust(dist_matrix, method = "ward.D2")
# Ward's method minimizes the variance within the clusters. The goal: respondents within a cluster should be as similar as possible. The method is popular because it tends to form compact, homogeneous clusters.

# Display the result (summary)
hclust_solution

## 
## Call:
## hclust(d = dist_matrix, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 40

Derivation of Number of Clusters

Dendrogram

YOUR TASK! Create the dendrogram

# Plot dendrogram
# Draws the "family tree" of the clusters
plot(hclust_solution, main = "Dendrogram of Hierarchical Clustering")
# Optional: Add rectangles to show 3 clusters
rect.hclust(hclust_solution, k = 3, border = "red")

YOUR TASK! Select the optimal number of clusters

# k = optimal number of clusters
# Draw dendrogram with red borders around the k clusters

plot(hclust_solution)
rect.hclust(hclust_solution, k = 4 , border = "red")

# What rect.hclust() does: it draws rectangles around the clusters in the plot that is already open.
# k = ...: the number of clusters you have chosen (e.g., k = 4). The function cuts the tree at the corresponding height and identifies these groups.
# border = "red": draws the rectangles in red.
# For example: rect.hclust(hclust_solution, k = 5, border = "red")

Elbow-Criterion

For the elbow-criterion, we need to plot the within cluster sum of squared errors (within SSE) vs. the number of clusters. The elbow plot is a visual aid for determining the optimal number of clusters k in a cluster analysis. It is one of the most common methods besides visual inspection of the dendrogram.

The plot contrasts two quantities:

X-axis: the number of clusters k (from 1 up to the total number of respondents).
Y-axis: the within-cluster sum of squared errors (WSS), also called the within-cluster sum of squares (WCSS). This is the quantity you are about to prepare as within_sse.

Goal: you want to keep the WSS small (i.e., keep the variance within the groups low so that the clusters are compact).
Shape: moving from a single cluster (k = 1) to more and more clusters, the WSS drops rapidly because the groups become smaller and more homogeneous.
The bend: at some point the curve flattens out. This point, where further improvement is only marginal (the point of “diminishing returns”), looks like an elbow in the curve.
The optimal k: the number of clusters right at this bend is usually the recommended choice. Additional clusters beyond it add little to explaining the variance in the data.

within_sse <- hclust_solution$height # 'Height' refers to the dendrogram height, which is used here as a proxy for the within SSE
within_sse

##  [1]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
##  [8]  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
## [15]  1.000000  1.000000  1.154701  1.290994  1.290994  1.290994  1.581139
## [22]  1.581139  1.653280  1.732051  1.843909  1.897367  2.000000  2.198484
## [29]  2.476557  2.563480  3.126944  3.277630  3.346834  4.637270  5.323782
## [36]  5.795223  6.155524 12.713429 27.123237

The within_sse variable goes from low (many clusters) to high (1 cluster). We need to reverse it to plot it from high (1 cluster) to low (40 clusters, every customer in a separate segment).

within_sse <- rev(within_sse) # After reversing with rev(), the values are in the order needed for the y-axis of the plot:
# The first value is now the largest height (corresponds to 1 cluster).
# The last value is the smallest height (corresponds to 40 clusters).
# This prepares the data for the next step: the actual plot used to find the "elbow".
within_sse

##  [1] 27.123237 12.713429  6.155524  5.795223  5.323782  4.637270  3.346834
##  [8]  3.277630  3.126944  2.563480  2.476557  2.198484  2.000000  1.897367
## [15]  1.843909  1.732051  1.653280  1.581139  1.581139  1.290994  1.290994
## [22]  1.290994  1.154701  1.000000  1.000000  1.000000  1.000000  1.000000
## [29]  1.000000  1.000000  1.000000  1.000000  0.000000  0.000000  0.000000
## [36]  0.000000  0.000000  0.000000  0.000000

# Plot the elbow-criterion
plot(within_sse, 
     type = 'b', 
     main = "Hierarchical Clustering - Elbow Criterion", 
     xlab = "Number of Clusters", 
     ylab = "Height")
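The dendrogram height is only a stand-in for the within SSE. As a hedged sketch, the within SSE can also be computed directly from the cluster assignments that cutree() returns for each k. The data below are randomly generated toy data (not the exercise file), and within_ss() is a helper defined here for illustration, not a base R function.

```r
# Toy stand-in for data_segmentation (hypothetical data)
set.seed(1)
toy_data <- data.frame(
  a = sample(1:7, 20, replace = TRUE),
  b = sample(1:7, 20, replace = TRUE)
)
toy_hclust <- hclust(dist(toy_data, method = "euclidean"), method = "ward.D2")

# Helper: sum of squared deviations of each row from its own cluster mean
within_ss <- function(data, assignment) {
  sum(sapply(split(data, assignment), function(cluster_rows) {
    sum(scale(cluster_rows, center = TRUE, scale = FALSE)^2)
  }))
}

# Within SSE for k = 1, ..., 8 clusters, cutting the same tree each time
k_values <- 1:8
wss <- sapply(k_values, function(k) within_ss(toy_data, cutree(toy_hclust, k = k)))

plot(k_values, wss,
     type = "b",
     main = "Elbow Criterion (toy data)",
     xlab = "Number of Clusters",
     ylab = "Within SSE")
```

Because the cuts of one tree are nested, the curve can only fall or stay flat as k grows; the elbow is read off in the same way as in the height-based plot.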

Save Cluster Assignments

YOUR TASK! Save the cluster assignments. For example, in case of 3 clusters: “data_clustering$h_cluster_assignment <- cutree(hclust_solution, 3)”

# Assign a cluster number to each of the 40 respondents
data_clustering$h_cluster_assignment <- cutree(hclust_solution, k = 3)

# Check the data set (the new column appears at the end)
head(data_clustering)
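Before describing the clusters, it can help to check how many respondents fall into each one. The following sketch shows this with table() on randomly generated toy data (not the exercise file); the object names are illustrative.

```r
# Toy data: 12 hypothetical respondents with two variables
set.seed(2)
toy_data <- data.frame(a = rnorm(12), b = rnorm(12))

# Same pipeline as in the exercise: distances, Ward clustering, cut into k = 3
toy_hclust <- hclust(dist(toy_data, method = "euclidean"), method = "ward.D2")
toy_assignment <- cutree(toy_hclust, k = 3)

# Number of respondents per cluster
segment_sizes <- table(toy_assignment)
segment_sizes
```

Applied to the exercise data, the equivalent call would be table(data_clustering$h_cluster_assignment).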

Cluster Description

YOUR TASK! Calculate the mean of each shopping attitude variable, grouped by cluster using the ‘aggregate’ function:

# Create the object 'agg_data':
agg_data <- aggregate(
  x = data_segmentation,                               # the segmentation variables
  by = list(cluster = data_clustering$h_cluster_assignment), # grouped by cluster ID
  FUN = mean                                           # computes the mean
)

# Now that 'agg_data' exists, the names can be added:
agg_data$cluster_names <- c("Explorers", "Bargain hunters", "Loyal buyers")

# Show the result
agg_data
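To see what aggregate() returns, here is a minimal sketch on a made-up data set with two clusters, where the means are easy to verify by hand. The values and names (toy_scores, toy_assignment, toy_names) are hypothetical.

```r
# Four hypothetical respondents, two per cluster
toy_scores <- data.frame(
  fun_exploring  = c(6, 7, 2, 3),
  compare_prices = c(3, 3, 6, 7)
)
toy_assignment <- c(1, 1, 2, 2)

# Mean of every column, grouped by cluster
toy_means <- aggregate(x = toy_scores, by = list(cluster = toy_assignment), FUN = mean)

# The number of names must match the number of clusters (rows of toy_means)
toy_names <- c("Explorers", "Bargain hunters")
stopifnot(length(toy_names) == nrow(toy_means))
toy_means$cluster_names <- toy_names

toy_means
```

The result has one row per cluster: cluster 1 has means 6.5 and 3, cluster 2 has means 2.5 and 6.5.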

Calculate the Means per Cluster

agg_data <- aggregate(
  x = data_segmentation,
  by = list(cluster = data_clustering$h_cluster_assignment),
  FUN = mean
)

# Assign the names for the clusters (the number of names must match k!)
agg_data$cluster_names <- c("Explorers", "Bargain hunters", "Loyal buyers")

agg_data

YOUR TASK! Describe briefly the characteristics of each cluster.

Cluster 1:

High scores on TODO
Medium/low scores on TODO
Enjoy discovering new products and buy more than planned. -> Potential description: Explorers

Cluster 2:

High scores on TODO
Low scores on TODO
Look for specific products and compare prices. -> Potential description: Bargain hunters (Schnäppchenjäger)

Cluster 3:

High score on TODO
Low scores on TODO
Only look for specific products, but do not compare prices on other websites. -> Potential description: Loyal buyers

# Add a name for each cluster
agg_data$cluster_names <- c("Explorers", "Bargain hunters", "Loyal buyers")
agg_data

You can also visualize these differences in boxplots:

YOUR TASK: Create the boxplots for the variables “search_specific”, “buy_too_much” and “compare_prices”.

# boxplot for fun_exploring
boxplot(
  data_clustering$fun_exploring ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Distribution of fun_exploring",
  ylab = "fun_exploring",
  xlab = "Name of cluster"
)

# Boxplot for search_specific
boxplot(
  data_clustering$search_specific ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Attitude: Search Specific",
  col = "lightgreen"
)

# boxplot for buy_too_much

boxplot(
  data_clustering$buy_too_much ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Attitude: Buy Too Much",
  col = "lightcoral"
)

# boxplot for compare_prices

boxplot(
  data_clustering$compare_prices ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Attitude: Compare Prices",
  col = "lightgoldenrod"
)

Profile Customers for Targeting

To profile the customer segments, we compare the means of the remaining variables (e.g., demographics, media usage).

# '5:9' returns columns 5 to 9 (all variables except for the segmentation variables).
# Column 10 is h_cluster_assignment itself and is therefore left out.
agg_data_2 <- aggregate(
  x = data_clustering[, 5:9],
  by = list(cluster = data_clustering$h_cluster_assignment),
  FUN = mean
)
agg_data_2
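Selecting columns by position breaks as soon as a column is added or moved. As an alternative sketch, the profiling variables can be selected by name. The toy data below reuse the age and income values of the first two respondents from the str() output; the cluster assignment and the shortened variable list are hypothetical.

```r
# Two respondents; the cluster assignment is made up for illustration
toy_data <- data.frame(
  fun_exploring        = c(7, 2),
  compare_prices       = c(4, 6),
  age                  = c(24, 39),
  income               = c(55000, 55000),
  h_cluster_assignment = c(1, 2)
)

# Everything that is neither a segmentation variable nor the assignment column
segmentation_vars <- c("fun_exploring", "compare_prices")
profiling_vars <- setdiff(names(toy_data), c(segmentation_vars, "h_cluster_assignment"))

toy_profile <- aggregate(
  x = toy_data[, profiling_vars],
  by = list(cluster = toy_data$h_cluster_assignment),
  FUN = mean
)
toy_profile
```

With this approach the cluster assignment column never ends up among the averaged variables, regardless of its position.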

Cluster 1 (Explorers)

Lowest age (around 27)
Medium income (around 49000)
Highest social media usage intensity (around 6)

We should target customers from cluster 1 via social media; advertising content should suit younger people with medium income.

Cluster 2 (Bargain hunters)

Medium age (around 30)
Lowest income (around 33333)
Read online news (around 3)

We should target customers from cluster 2 via online news; advertising content should suit people of medium age with low income.

Cluster 3 (Loyal buyers)

Highest age (around 37)
Highest income (around 54231)
Listen to radio (around 2)

We should target customers from cluster 3 via radio; advertising content should suit older people with high income.

Visualization of targeting variables in boxplots:

YOUR TASK: Create boxplots for income, read_online_news, use_social_media and listen_radio.

# Boxplot for age
boxplot(
  data_clustering$age ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Distribution of age",
  ylab = "age",
  xlab = "Name of cluster"
)

# Boxplot for income

boxplot(
  data_clustering$income ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Distribution of Income",
  ylab = "Income",
  col = "darkgreen"
)

# Boxplot for read_online_news

boxplot(
  data_clustering$read_online_news ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Media: Read Online News",
  col = "orange"
)

# Boxplot for use_social_media

boxplot(
  data_clustering$use_social_media ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Media: Social Media Usage",
  col = "purple"
)

# Boxplot for listen_radio

boxplot(
  data_clustering$listen_radio ~ data_clustering$h_cluster_assignment,
  names = agg_data$cluster_names,
  main = "Media: Listen to Radio",
  col = "blue"
)

Partitional Clustering (here: K-Means)

K-Means uses random starting positions. Therefore, results can differ slightly every time you run the analysis.

Set Seed

In order to fix the starting positions (and hence the results), you can set the seed of R’s random number generator:

set.seed(123) # 123 is an arbitrary number, you can also choose any other number
# Run K-Means with the chosen number of clusters, k = 3
kmeans_solution <- kmeans(data_segmentation, centers = 3)
# Save the K-Means cluster assignments
data_clustering$k_cluster_assignment <- kmeans_solution$cluster

# Build a tidy table of the cluster centers (means)
kmeans_centers <- as.data.frame(kmeans_solution$centers)
kmeans_centers$cluster <- c("Cluster 1", "Cluster 2", "Cluster 3")

# Visualize the K-Means centers
barplot(t(as.matrix(kmeans_solution$centers)), 
        beside = TRUE, 
        col = terrain.colors(4), 
        main = "K-Means Cluster Profiles (Centers)",
        xlab = "Cluster", 
        ylab = "Mean Score",
        legend.text = colnames(data_segmentation),
        args.legend = list(x = "topright", cex = 0.7))
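The effect of fixing the seed can be checked directly: two runs started from the same seed give identical assignments. The sketch below uses randomly generated toy data (not the exercise file); all object names are illustrative.

```r
# Toy data: 30 hypothetical respondents with two variables
set.seed(42)
toy_data <- data.frame(a = rnorm(30), b = rnorm(30))

# Two K-Means runs, each started from the same seed
set.seed(123)
run_1 <- kmeans(toy_data, centers = 3)
set.seed(123)
run_2 <- kmeans(toy_data, centers = 3)

# Same seed, same starting positions, same cluster assignments
same_result <- identical(run_1$cluster, run_2$cluster)
same_result
```

In addition to a fixed seed, kmeans() accepts an nstart argument (e.g., nstart = 25) that tries several random starts and keeps the best one, which makes the solution less dependent on a single starting position.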

For K-Means, we need to run the analysis for a specific value of k (number of clusters). To identify the optimal number of clusters, we run K-Means for different values of k and compare the results (e.g., by looking at the Elbow-Criterion).
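A minimal sketch of this comparison, assuming toy data with three artificial groups instead of the exercise file: K-Means is run for k = 1 to 8 and the total within-cluster sum of squares (the tot.withinss element of the kmeans() result) is plotted against k. The range of k and nstart = 25 are choices made here for illustration.

```r
# Toy data: three artificial groups of 15 hypothetical respondents each
set.seed(123)
toy_data <- rbind(
  data.frame(a = rnorm(15, mean = 0), b = rnorm(15, mean = 0)),
  data.frame(a = rnorm(15, mean = 5), b = rnorm(15, mean = 5)),
  data.frame(a = rnorm(15, mean = 0), b = rnorm(15, mean = 5))
)

# Run K-Means once per candidate k and collect the total within-cluster SS
k_values <- 1:8
total_within_ss <- sapply(k_values, function(k) {
  kmeans(toy_data, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(k_values, total_within_ss,
     type = "b",
     main = "K-Means - Elbow Criterion (toy data)",
     xlab = "Number of Clusters",
     ylab = "Total within-cluster SS")
```

For the exercise, toy_data would be replaced by data_segmentation.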

Example: Kmeans for 5 Clusters

To run kmeans, simply adjust the following command (replace 5 by the appropriate number of clusters)

kmeans_solution <- kmeans(data_segmentation, centers = 5) # Example for k = 5 clusters
kmeans_solution$centers

##   fun_exploring search_specific buy_too_much compare_prices
## 1      7.000000        2.500000     6.500000       4.000000
## 2      5.500000        1.000000     4.000000       2.000000
## 3      2.000000        6.692308     2.230769       2.538462
## 4      3.166667        6.083333     2.833333       5.833333
## 5      5.555556        2.888889     6.111111       2.777778

YOUR TASK! Run kmeans with an appropriate number of clusters.

Exercise 6 - Cluster Analysis (Segmenting & Targeting)

Prof. Dr. Bernd Skiera - Goethe University Frankfurt