Data Mining - Homework 3

Cluster Analysis

Data Preparation
- Data Cleaning: Handle missing values & outliers
- Feature Selection: Choose relevant features for clustering.
- Normalization/Standardization: Min-max, Z-score, weighted normalization, etc.
Choosing a Clustering Algorithm
- K-Means Clustering: define K.
- Hierarchical Clustering: define linkage methods (e.g., single, complete, or Ward’s method).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Useful for detecting clusters of arbitrary shapes and identifying outliers.
Determining the Optimal Number of Clusters:
- Business approach: Based on domain knowledge, how many clusters are meaningful?
- Technical apporach: Elbow Method (for K-Means), Silhouette Score, Gap Statistic
Clustering the Data
- Apply the chosen clustering algorithm
- Assign data points to clusters
Evaluating Cluster Quality
- Visual Evaluation: using techniques like PCA, t-SNE, UMAP
- Quantitative Evaluation: Silhouette Score, Dunn Index, Davies-Bouldin Index
Interpreting the Results
- Cluster Characteristics: Analyze the centroids or medoids by important features. Use profiling plots.
- Label the Clusters: Give meaningful names to the clusters.

# load libraries
library(mlba)
library(factoextra)
library(cluster)
library(dplyr)
library(ggplot2)
library(skimr)
library(tidyverse)
library(kableExtra)
library(caret)

set.seed(123)

15.2

Pharmaceutical Industry. An equities analyst is studying the pharmaceutical industry and would like your help in exploring and understanding the financial data collected by her firm. Her main objective is to understand the structure of the pharmaceutical industry using some basic financial measures.

Financial data gathered on 21 firms in the pharmaceutical industry are available in the file Pharmaceuticals.csv. For each firm, the following variables are recorded:

Market capitalization (in billions of dollars)
Beta
Price/earnings ratio
Return on equity
Return on assets
Asset turnover
Leverage
Estimated revenue growth
Net profit margin
Median recommendation (across major brokerages)
Location of firm’s headquarters
Stock exchange on which the firm is listed Use cluster analysis to explore and analyze the given dataset as follows:

a.

Use only the numerical variables to cluster the 21 firms. Justify the various choices made in conducting the cluster analysis, such as:

weights for different variables,
the specific clustering algorithm(s) used,
the number of clusters formed, and so on.

Load Data

# load data
data(Pharmaceuticals) # need to load library mlba
skim(Pharmaceuticals) # need to load library skimr

Data summary
Name	Pharmaceuticals
Number of rows	21
Number of columns	14
_______________________
Column type frequency:
character	5
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Symbol	1	3	4	21
Name	1	5	34	21
Median_Recommendation	1	4	13	4
Location	1	2	11	7
Exchange	1	4	6	3

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Market_Cap	1	57.65	58.60	0.41	6.30	48.19	73.84	199.47	▇▆▁▂▂
Beta	1	0.53	0.26	0.18	0.35	0.46	0.65	1.11	▆▇▃▂▂
PE_Ratio	1	25.46	16.31	3.60	18.90	21.50	27.90	82.50	▅▇▁▁▁
ROE	1	25.80	15.08	3.90	14.90	22.60	31.00	62.90	▇▇▃▂▂
ROA	1	10.51	5.32	1.40	5.70	11.20	15.00	20.30	▃▇▂▇▂
Asset_Turnover	1	0.70	0.22	0.30	0.60	0.60	0.90	1.10	▂▇▁▆▂
Leverage	1	0.59	0.78	0.00	0.16	0.34	0.60	3.51	▇▂▁▁▁
Rev_Growth	1	13.37	11.05	-3.17	6.38	9.37	21.87	34.21	▅▇▅▂▅
Net_Profit_Margin	1	15.70	6.56	2.60	11.20	16.10	21.10	25.50	▂▅▇▅▇

# head of data
head(Pharmaceuticals)

##   Symbol                Name Market_Cap Beta PE_Ratio  ROE  ROA Asset_Turnover
## 1    ABT Abbott Laboratories      68.44 0.32     24.7 26.4 11.8            0.7
## 2    AGN      Allergan, Inc.       7.58 0.41     82.5 12.9  5.5            0.9
## 3    AHM        Amersham plc       6.30 0.46     20.7 14.9  7.8            0.9
## 4    AZN     AstraZeneca PLC      67.63 0.52     21.5 27.4 15.4            0.9
## 5    AVE             Aventis      47.16 0.32     20.1 21.8  7.5            0.6
## 6    BAY            Bayer AG      16.90 1.11     27.9  3.9  1.4            0.6
##   Leverage Rev_Growth Net_Profit_Margin Median_Recommendation Location Exchange
## 1     0.42       7.54              16.1          Moderate Buy       US     NYSE
## 2     0.60       9.16               5.5          Moderate Buy   CANADA     NYSE
## 3     0.27       7.05              11.2            Strong Buy       UK     NYSE
## 4     0.00      15.00              18.0         Moderate Sell       UK     NYSE
## 5     0.34      26.81              12.9          Moderate Buy   FRANCE     NYSE
## 6     0.00      -3.17               2.6                  Hold  GERMANY     NYSE

# check missing values
sum(is.na(Pharmaceuticals))

## [1] 0

Normalize Data

# check if numeric variables, then normalize data
Pharmaceuticals.scaled <- Pharmaceuticals %>%
  select_if(is.numeric) %>%
  scale()

Clustering

We either choose K-means or Hierarchical clustering for this analysis.

K-means Clustering

K-means clustering
Elbow method to find the best k
Gap statistic to find the best k

# K-means clustering using for loop and kmeans function
set.seed(120)
wss <- numeric(10)
for (i in 1:10) {
  kmeansPharmaceuticals <- kmeans(Pharmaceuticals.scaled, centers = i, nstart = 5)
  wss[i] <- kmeansPharmaceuticals$tot.withinss
}
# nstart = 25 to avoid local minima should be big enough to avoid local minima, and small enough to save computation time 

# Plot the elbow method using fviz_nbclust
fviz_nbclust(Pharmaceuticals.scaled, kmeans, method = "wss") + theme_minimal()

# Gap statistic to find the best k
set.seed(121)
gap_stat <- clusGap(Pharmaceuticals.scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat) + theme_minimal()

# K-means clustering, using k = 4
k4 <- kmeans(Pharmaceuticals.scaled, centers = 4, nstart = 5)
k4a <- kmeans(Pharmaceuticals.scaled, centers = 4, nstart = 200)
k4

## K-means clustering with 4 clusters of sizes 7, 5, 5, 4
## 
## Cluster means:
##    Market_Cap       Beta   PE_Ratio        ROE        ROA Asset_Turnover
## 1  0.08926902 -0.4618336 -0.3208615  0.3260892  0.5396003   6.589509e-02
## 2 -0.57238455 -0.6220844  0.8692748 -0.7381675 -0.7242993  -3.108624e-16
## 3 -0.90905697  1.4110965 -0.2613021 -0.7063477 -1.1114156  -1.014784e+00
## 4  1.69558112 -0.1780563 -0.1984582  1.2349879  1.3503431   1.153164e+00
##     Leverage Rev_Growth Net_Profit_Margin
## 1 -0.2559803 -0.7230135         0.7343816
## 2 -0.2991312  0.3682951        -0.8069490
## 3  1.0319661  0.2701808        -0.6941793
## 4 -0.4680782  0.4671788         0.5912425
## 
## Clustering vector:
##  [1] 1 2 2 1 2 3 1 3 3 1 4 3 4 3 4 1 4 2 1 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 16.655937 22.069459 31.940530  9.284424
##  (between_SS / total_SS =  55.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

k4a

## K-means clustering with 4 clusters of sizes 3, 6, 4, 8
## 
## Cluster means:
##    Market_Cap       Beta   PE_Ratio        ROE        ROA Asset_Turnover
## 1 -0.52462814  0.4451409  1.8498439 -1.0404550 -1.1865838  -3.330669e-16
## 2 -0.82617719  0.4775991 -0.3696184 -0.5631589 -0.8514589  -9.994088e-01
## 3  1.69558112 -0.1780563 -0.1984582  1.2349879  1.3503431   1.153164e+00
## 4 -0.03142211 -0.4360989 -0.3172485  0.1950459  0.4083915   1.729746e-01
##     Leverage Rev_Growth Net_Profit_Margin
## 1 -0.3443544 -0.5769454        -1.6095439
## 2  0.8502201  0.9158889        -0.3319956
## 3 -0.4680782  0.4671788         0.5912425
## 4 -0.2744931 -0.7041516         0.5569544
## 
## Clustering vector:
##  [1] 4 1 4 4 2 1 4 2 2 4 3 2 3 2 3 4 3 1 4 2 4
## 
## Within cluster sum of squares by cluster:
## [1] 14.938904 32.143356  9.284424 21.879320
##  (between_SS / total_SS =  56.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# visualize the clusters with symbol variable as lable of the nodes
fviz_cluster(k4, data = Pharmaceuticals.scaled, geom = "point", ellipse.type = "convex", ellipse = TRUE, label = "symbol", ggtheme = theme_minimal())

fviz_cluster(k4a, data = Pharmaceuticals.scaled, geom = "point", ellipse.type = "convex", ellipse = TRUE, label = "symbol", ggtheme = theme_minimal())

Hierarchical Clustering

Hierarchical clustering
Dendrogram to find the best k

# Hierarchical clustering, using dendrogram to find the best k
d <- dist(Pharmaceuticals.scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, main = "Dendrogram", hang = -1)
# Add a horizontal line at height 5.8
abline(h = 5.8, col = "red", lty = 2)

# cut the dendrogram at 4 clusters
cutree(hc, h = 5.8)

##  [1] 1 2 3 1 3 2 1 3 3 1 4 3 4 3 4 1 4 2 1 3 1

# highlight the clusters
rect.hclust(hc, k = 4, border = 2:5)

b.

Interpret the clusters with respect to the numerical variables used in forming the clusters.

# centroids of clusters
k4$centers

##    Market_Cap       Beta   PE_Ratio        ROE        ROA Asset_Turnover
## 1  0.08926902 -0.4618336 -0.3208615  0.3260892  0.5396003   6.589509e-02
## 2 -0.57238455 -0.6220844  0.8692748 -0.7381675 -0.7242993  -3.108624e-16
## 3 -0.90905697  1.4110965 -0.2613021 -0.7063477 -1.1114156  -1.014784e+00
## 4  1.69558112 -0.1780563 -0.1984582  1.2349879  1.3503431   1.153164e+00
##     Leverage Rev_Growth Net_Profit_Margin
## 1 -0.2559803 -0.7230135         0.7343816
## 2 -0.2991312  0.3682951        -0.8069490
## 3  1.0319661  0.2701808        -0.6941793
## 4 -0.4680782  0.4671788         0.5912425

Is there any other ways to present the cluster information?

# cluster centroids profiling plot
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

centers <- as.data.frame(k4$centers)
centers$cluster <- rownames(centers)
centers.melt <- melt(centers, id.vars = "cluster")
ggplot(centers.melt, aes(x = variable, y = value, group = cluster, color = cluster)) +
  geom_line() +
  geom_point() +
  labs(title = "Cluster Centroids Profiling Plot", x = "Variables", y = "Centroid Values") +
  theme_minimal()

#### Answer

Cluster 1: …
Cluster 2: …
Cluster 3: …
Cluster 4: …

c.

Is there a pattern in the clusters with respect to the non-numerical variables (10 to 12)? (those not used in forming the clusters)

# add cluster variable to Pharmaceuticals data
Pharmaceuticals$cluster <- k4$cluster

# table of cluster variable by location, stock exchange, and median recommendation
table(Pharmaceuticals$cluster, Pharmaceuticals$Location)

##    
##     CANADA FRANCE GERMANY IRELAND SWITZERLAND UK US
##   1      0      0       0       0           1  1  5
##   2      1      1       0       0           0  1  2
##   3      0      0       1       1           0  0  3
##   4      0      0       0       0           0  1  3

table(Pharmaceuticals$cluster, Pharmaceuticals$Exchange)

##    
##     AMEX NASDAQ NYSE
##   1    0      0    7
##   2    0      0    5
##   3    1      1    3
##   4    0      0    4

table(Pharmaceuticals$cluster, Pharmaceuticals$Median_Recommendation)

##    
##     Hold Moderate Buy Moderate Sell Strong Buy
##   1    4            1             2          0
##   2    1            2             1          1
##   3    2            2             1          0
##   4    2            2             0          0

Answer

Any pattern in the clusters with respect to the non-numerical variables (10 to 12)?

d.

Provide an appropriate name for each cluster using any or all of the variables in the dataset.

Answer

Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:

15.4

Marketing to Frequent Fliers. The file EastWestAirlinesCluster.csv contains information on 3999 passengers who belong to an airline’s frequent flier program. For each passenger, the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.

a.

Apply hierarchical clustering with Euclidean distance and Ward’s method. Make sure to normalize the data first. How many clusters appear?

# read data from mlba package
data(EastWestAirlinesCluster, package = "mlba")
airlines.df <- EastWestAirlinesCluster

# check data
head(airlines.df)

# normalize data, keep ID.
airlines.norm <- sapply(airlines.df[, -1], scale)

# check data
head(airlines.norm)

# apply hierarchical clustering with Euclidean distance and Ward’s method
distance <- dist(airlines.norm, method = "euclidean")
airlines.ward <- hclust(distance, method = "ward.D2")

# check data
summary(airlines.ward)

##             Length Class  Mode     
## merge       7996   -none- numeric  
## height      3998   -none- numeric  
## order       3999   -none- numeric  
## labels         0   -none- NULL     
## method         1   -none- character
## call           3   -none- call     
## dist.method    1   -none- character

# plot dendrogram
plot(airlines.ward, main = "Dendrogram", ann = FALSE, hang = -1)

Answer

The dendrogram shows ?? clusters.

b.

What would happen if the data were not normalized?

# check the summary of the unnormailzed data
skim(airlines.df)

Data summary
Name	airlines.df
Number of rows	3999
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID.	1	2014.82	1160.76	1	1010.5	2016	3020.5	4021	▇▇▇▇▇
Balance	1	73601.33	100775.66	0	18527.5	43097	92404.0	1704838	▇▁▁▁▁
Qual_miles	1	144.11	773.66	0	0.0	0	0.0	11148	▇▁▁▁▁
cc1_miles	1	2.06	1.38	1	1.0	1	3.0	5	▇▁▂▂▁
cc2_miles	1	1.01	0.15	1	1.0	1	1.0	3	▇▁▁▁▁
cc3_miles	1	1.01	0.20	1	1.0	1	1.0	5	▇▁▁▁▁
Bonus_miles	1	17144.85	24150.97	0	1250.0	7171	23800.5	263685	▇▁▁▁▁
Bonus_trans	1	11.60	9.60	0	3.0	12	17.0	86	▇▂▁▁▁
Flight_miles_12mo	1	460.06	1400.21	0	0.0	0	311.0	30817	▇▁▁▁▁
Flight_trans_12	1	1.37	3.79	0	0.0	0	1.0	53	▇▁▁▁▁
Days_since_enroll	1	4118.56	2065.13	2	2330.0	4096	5790.5	8296	▅▇▇▇▅
Award.	1	0.37	0.48	0	0.0	0	1.0	1	▇▁▁▁▅

Answer

c.

Compare the cluster centroid to characterize the different clusters, and try to give each cluster a label.

# cut the dendrogram to get 2 clusters
airlines.cut1 <- cutree(airlines.ward, k = 2)

# calculate cluster centroids
centers2 <- aggregate(airlines.norm, by = list(airlines.cut1), FUN = mean)
centers2

##   Group.1    Balance Qual_miles  cc1_miles   cc2_miles   cc3_miles Bonus_miles
## 1       1 -0.2667555 -0.1742877 -0.5935343  0.05959289 -0.06275873  -0.5013072
## 2       2  0.4397049  0.2872862  0.9783490 -0.09822960  0.10344800   0.8263269
##   Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll     Award.
## 1  -0.4696781        -0.1706092      -0.1775063        -0.1674339 -0.3774981
## 2   0.7741912         0.2812228       0.2925916         0.2759887  0.6222468

Answer

Cluster 1:
Cluster 2:

d.

To check the stability of the clusters,

remove a random 5% of the data (by taking a random sample of 95% of the records),
and repeat the analysis.

Does the same picture emerge?

Answer

Compare the denfromgrams:

# remove 5% of the data
set.seed(124)
airlines.df2 <- airlines.df[sample(1:nrow(airlines.df), 0.95 * nrow(airlines.df)), ]

# normalize data, keep ID.
airlines.norm2 <- sapply(airlines.df2[, -1], scale)

# apply hierarchical clustering with Euclidean distance and Ward’s method
distance2 <- dist(airlines.norm2, method = "euclidean")
airlines.ward2 <- hclust(distance2, method = "ward.D2")

# check clusters
summary(airlines.ward2)

##             Length Class  Mode     
## merge       7596   -none- numeric  
## height      3798   -none- numeric  
## order       3799   -none- numeric  
## labels         0   -none- NULL     
## method         1   -none- character
## call           3   -none- call     
## dist.method    1   -none- character

# plot dendrogram
plot(airlines.ward2, main = "Dendrogram", ann = FALSE, hang = -1)

# plot 2 dendrograms in one plot
par(mfrow = c(1, 2))
plot(airlines.ward, main = "Dendrogram 1", ann = FALSE, hang = -1)
plot(airlines.ward2, main = "Dendrogram 2", ann = FALSE, hang = -1)

par(mfrow = c(1, 1))

Comment: are they similar or different?

e.

Use k-means clustering with the number of clusters that you found above. Does the same picture emerge?

Answer

# apply k-means clustering with k = 2
airlines.kmeans <- kmeans(airlines.norm, centers = 2, nstart = 3) # try different nstart values

table(airlines.cut1, airlines.kmeans$cluster)

##              
## airlines.cut1    1    2
##             1   53 2436
##             2 1246  264

Group 2000 chairs Red. Blue. (first clustering method) Tall 500 500 Short 500 500 (2nd method)

Group 2000 chairs
Red. Blue. (first clustering method) Tall 0 1000 Short 1000 0 (2nd method) ### f.

Which clusters would you target for offers, and what types of offers would you target to customers in that cluster?

Answer

Look at the cluster centroids and characteristics to decide which clusters to target for offers.

Data Mining - Homework 3

2024-10-15

Cluster Analysis

15.2

a.

Load Data

Normalize Data

Clustering

K-means Clustering

Hierarchical Clustering

b.

c.

Answer

d.

Answer

15.4

a.

Answer

b.

Answer

c.

Answer

d.

Answer

e.

Answer

Answer