The file EastWestAirlinesCluster.xls (available on the textbook website http://dataminingbook.com/) contains information on 4000 passengers who belong to an airline’s frequent flier program. For each passenger the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.
1.1 Apply hierarchical clustering with Euclidean distance and Ward’s method. Make sure to standardize the data first. How many clusters appear?
1.2 What would happen if the data were not standardized?
1.3 Compare the cluster centroids to characterize the different clusters and try to give each cluster a label.
1.4 To check the stability of the clusters, remove a random 5% of the data (by taking a random sample of 95% of the records), and repeat the analysis. Does the same picture emerge?
1.5 Use k-means clustering with the number of clusters that you found above in Part (a). Does the same picture emerge? If not, how does it contrast or validate the finding in Part c above?
1.6 Which cluster(s) would you target for offers, and what type of offers would you target to customers in that cluster? Include proper reasoning in support of your choice of cluster(s) and the corresponding offer(s).
As per the nature of the columns in EastWestAirlinesCluster.xls, The data is mixed data. Hence the categorical variables are converted into binary and then clustering is applied.
As per the generated dendogram below, if slicing is done at height 100 then there are 3 clusters formed.
As per the dendogram branching is done into 2 parts and left branch alone formed one cluster and 2nd branch(Right Branch) again splitted into two more branches. The two branches of right branch are formed into an cluster. Hence in total 3 clusters with maximum number of customers.
## Libraries
library("dummies")
library("dendextend")
library("dendextendRcpp")
library("gridExtra")
library("cluster")
library("factoextra")
library("MASS")
library("fpc")
## Set the working directory to where the Excel file is
setwd('J:\\ISB Business Analytics\\Data Mining\\Data Mining Assignment 1')
## Input file read
input <- read.csv("EastWestAirlinesClusterCSV.csv",header=TRUE)
## 1. Loading and preparing data
mydatawd <- input[,2:11]
#Creating Dummy variables for categorical data
mydata <- dummy.data.frame(mydatawd, names = "cc1_miles", omit.constants=FALSE )
mydata <- dummy.data.frame(mydatawd, names = "cc2_miles", omit.constants=FALSE )
mydata <- dummy.data.frame(mydatawd, names = "cc3_miles", omit.constants=FALSE )
# Standardize Data
my_data <- scale(mydata)
# 2. Compute dissimilarity matrix
d <- dist(my_data, method = "euclidean")
# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )
Stanardization/Normalization of data is not done then below issues will influence the model.
-> Distance measure will be wrongly clacualted, if all the variables are not with equal weight.
-> Laregest scale dominating the measure
The below dendogarm generated with the given data set without standardization is as below. As per the insights large value units are influencing the small values.
The below are the metrics of cluster centriods.
Cluster 2 metrics are leading in progressive way except non-flight bonus transactions.
These are the segment of customers, who are associated with EastWest Airlines since long time.
Cluster 3 metrics depicts that the flight miles and flight transactions in last 12 months is zero and their
non-flight bonus transactions are leading than the other clusters.
Cluser 1 metrics depicts that the level of spending is average to cluster 2 and 3.
## [,1] [,2] [,3]
## Balance 73547.288828 115682.00 89401.333
## Qual_miles 144.071142 295.50 0.000
## cc1_miles 2.057114 3.75 3.000
## cc2_miles 1.014529 1.00 1.000
## cc3_miles 1.009519 3.00 2.000
## Bonus_miles 17064.150802 74597.50 47920.000
## Bonus_trans 11.576653 32.75 17.000
## Flight_miles_12mo 460.035822 825.00 0.000
## Flight_trans_12 1.373246 2.75 0.000
## Days_since_enroll 4117.809118 5058.75 3863.333
The Process to check for the stability of clusters check. Create different Dendogarms with the sample of 95% data. Here Dend1 and Dend2 are with sample of 95% data.
# Hierarchical clustering using Ward's method
Dend1 <- as.dendrogram(res.hc)
#Random Sample1 with 95% of data
input2=input[sample(nrow(my_data),replace=F,size=0.95*nrow(input)),]
d <- dist(input2, method = "euclidean")
res2.hc <- hclust(d, method = "ward.D2" )
Dend2 <- as.dendrogram(res2.hc)
#Random Sample2 with 95% of data
input3=input[sample(nrow(my_data),replace=F,size=0.95*nrow(input)),]
d <- dist(input3, method = "euclidean")
res3.hc <- hclust(d, method = "ward.D2" )
Dend3 <- as.dendrogram(res3.hc)
# Global Comparison of Dendograms
# Total Population Metrics
Dend1
## 'dendrogram' with 2 branches and 3999 members total, at height 128.0329
# Random Sample 1 Metrics
Dend2
## 'dendrogram' with 2 branches and 3799 members total, at height 6463412
# Random Sample 2 Metrics
Dend3
## 'dendrogram' with 2 branches and 3799 members total, at height 6469119
# Comparison of Population, Random sample 1 and Random sample 2 dendogarms.
all.equal(Dend3, Dend2, Dend1, use.edge.length = TRUE)
## [1] "Difference in branch heights - Mean relative difference: 0.03081155"
From above stastics of dendogram, dendogram comparison and dendogram visualizations, the dendogarams are not same(The Pictures are not same).For every instance of sample, new insights are emerged.
The below R code depicts the process to get the aggregate metrics for Hierarchial clustering and K-Means.
## K-means clustering
set.seed(123)
fit <- kmeans(my_data, 3) # 3 cluster solution
#Aggregation of k-means
mydatak <- data.frame(mydata, fit$cluster) # append cluster membership
temp <- aggregate(mydatak, by=list(fit$cluster), FUN=mean)
#to find the size of clusters
ClusterCo <- aggregate(mydatak, by=list(fit$cluster), FUN=sum)
#to find the cluster size
d <- transform(ClusterCo, clusterSize = fit.cluster / Group.1)
d <- transform(d, fit.cluster= fit.cluster/ clusterSize)
temp$clusterSize <- d$clusterSize
temp$clusterPCT <- (d$clusterSize*100)/3999
# transpose to change from horizontal to vertical
temp2 <- t(temp)
round_df <- function(x, digits) {
# round all numeric variables
# x: data frame
# digits: number of digits to round
numeric_columns <- sapply(x, class) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
temp4 <- round_df(temp2, 2)
#Hierarchical Aggregate calculations
# Hierarchical clustering using Ward's method
#set.seed(123)
groups <- cutree(res.hc, k=3) # cut tree into 3 clusters
membership <-as.matrix(groups)
membership <- data.frame(membership)
names(membership) <- c("cluster")
mydatao <- data.frame(mydata, membership$cluster) # append cluster membership
temp <- aggregate(mydatao, by=list(membership$cluster), FUN=mean)
temp2 <- t(temp)
round_df <- function(x, digits) {
# round all numeric variables
# x: data frame
# digits: number of digits to round
numeric_columns <- sapply(x, class) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
temp5 <- round_df(temp2, 2)
Along with the primary dataset metrics, each cluster size( clusterSize ) and percentage of cluster size( clusterPCT ) in given population are calculated and tabulated.
From above metrics, none of the clusters are matching, Hence the outcomes of K-means cluster and Hierarchial clustering are not same.
From the metrics of K-Means Aggregate Metrics with Mean, The below are the target offers to the customers.
Business Proportion : 62.64% of customers fall into this cluster. There is potential business growth
opportunity, if the passengers are turned into regular travellers.
Potential Offers :
1. If frequent flyer credit card is used then the reward points per travel
percentage can increased.
2. If the number of travel checkins are more than 5 in an year, Then extra
bonus points can be awarded to the customer.
Business Proportion : This cluster is second largest with 33.31% of customers.
Potential Offers :
1. If Bonus miles are used with the proportion of 80:20 then the customers can
be awarded with 50% off with frequent flyer credit card.
Step 1: Download the Wine data from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets/Wine)
Step 2: Do a Principal Components Analysis (PCA) on the data. Please include (copy-paste) the relevant software outputs in your submission while answering the following questions.
2.1 Enumerate the insights you gathered during your PCA exercise. (Please do not clutter your report with too MANY insignificant insights as it will dilute the value of your other significant findings)
2.2 What are the social and business values of those insights, and how the value of those insights can be harnessed?
Step 3: Do a cluster analysis using (i) all chemical measurements (ii) using two most significant PC scores. Please include (copy-paste) the relevant software outputs in your submission while answering the following questions.
2.3 Any more insights you come across during the clustering exercise?
2.4 Are there clearly separable clusters of wines? How many clusters did you go with? How the clusters obtained in part (i) are different from or similar to clusters obtained in part (ii), qualitatively?
2.5 Could you suggest a subset of the chemical measurements that can separate wines more distinctly? How did you go about choosing that subset? How do the rest of the measurements that were not included while clustering, vary across those clusters?
setwd('J:\\ISB Business Analytics\\Data Mining\\Data Mining Assignment 1\\WIne DataSet')
## Principal Component Analysis
input <- read.csv("WineData.csv",header=TRUE)
mydata <- input[1:178,2:14]
# Pricipal Components Analysis
# entering raw data and extracting PCs
# from the correlation matrix
wine.pca <- princomp(mydata, cor=TRUE)
summary(wine.pca) # print variance accounted for
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.1692972 1.5801816 1.2025273 0.9586313 0.92370351
## Proportion of Variance 0.3619885 0.1920749 0.1112363 0.0706903 0.06563294
## Cumulative Proportion 0.3619885 0.5540634 0.6652997 0.7359900 0.80162293
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.80103498 0.74231281 0.59033665 0.53747553
## Proportion of Variance 0.04935823 0.04238679 0.02680749 0.02222153
## Cumulative Proportion 0.85098116 0.89336795 0.92017544 0.94239698
## Comp.10 Comp.11 Comp.12 Comp.13
## Standard deviation 0.50090167 0.47517222 0.41081655 0.321524394
## Proportion of Variance 0.01930019 0.01736836 0.01298233 0.007952149
## Cumulative Proportion 0.96169717 0.97906553 0.99204785 1.000000000
As per the barplot above the principal components 1,2,3,4 and 5 are highly negatively correlated.
As per the summary statistics, the cumulative proportion of variance till principal componet is 80%.So 5 principal components can be considered for the analysis.
As per the standrad deviation metrics above the principal componets 1 through 5 are having above and equal to 1. It indicate the better predictors.
setwd('J:\\ISB Business Analytics\\Data Mining\\Data Mining Assignment 1\\WIne DataSet')
inputk1 <- read.csv("WineData.csv",header=TRUE)
winek <- inputk1[1:178,2:14]
wineks <- scale(winek)
wine.km <- kmeans(wineks, 3)
#Size of clusters
wine.km$size
## [1] 65 51 62
#Size of clusters
wine.km$centers
## Alcohol Malic_Acid Ash Ash_Alcalinity Magnesium
## 1 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869
## 2 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047
## 3 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208
## Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
## 1 -0.07576891 0.02075402 -0.03343924 0.05810161
## 2 -0.97657548 -1.21182921 0.72402116 -0.77751312
## 3 0.88274724 0.97506900 -0.56050853 0.57865427
## Color_Intensity Hue OD280_OD315 Proline
## 1 -0.8993770 0.4605046 0.2700025 -0.7517257
## 2 0.9388902 -1.1615122 -1.2887761 -0.4059428
## 3 0.1705823 0.4726504 0.7770551 1.1220202
#aggregate(inputk1[-1], by=list(cluster=wine.km$cluster), mean)
From Below plot the insights are as below.
–> The Cluster representing Green colour, the composition of chemicals Alcohol,OD280_OD315 , Proline and Total_Phenols and Flavanoids is more compared to other chemical proportions.
These metrics indicate that the composition of checmicals is harzardous to health.
–> The Cluster representing black colour, the composition of chemicals Flavanoids, Hue, OD280_OD315 and Proanthocyanins is less compared to other chemical proportions.
–> The Cluster representing red colour, the composition of chemicals Alcohol, Malic_Acid, Magnesium and Color_Intensity is less compared to other chemical proportions.
The metrics indicate that the compisition of checmicals is good for health.
As per the below plot the clusters are clearly seperable as there is no interminging of the clusters.
The optimum number of clusters are 3 from above metrics. The optimum number of clusters are decided based on
NBClust function results.
2.2 Social and Business Value Insights
The social and business insights of each principal component can drwan with metrics of loading with help of correlation effect of each component.
Food additives rich with advantages
Health Harzardous composition of chemicals