Tin Yun Hon Assignment 4
tyh6518 6-8pm
dta <- read.csv('sales_data.csv', header = TRUE)
str(dta)
## 'data.frame': 30 obs. of 4 variables:
## $ Age : int 23 24 26 33 21 34 36 36 32 38 ...
## $ Average.table.size : int 576 720 576 1008 720 1008 1008 1260 576 576 ...
## $ Purchases.per.year : int 1 1 1 6 1 6 3 3 4 4 ...
## $ Dollars.per.purchase: int 100 200 125 750 250 950 900 1200 200 150 ...
dta_std = scale(dta) ## standardize first
library(matrixStats)
colMeans(dta_std)
## Age Average.table.size Purchases.per.year
## 6.996718e-17 -1.831868e-16 9.575674e-17
## Dollars.per.purchase
## 7.956598e-17
colSds(dta_std)
## [1] 1 1 1 1
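The standardization check above can also be done in base R. This is a minimal sketch on a small made-up data frame (`toy` is hypothetical, not the sales data): `scale()` subtracts each column's mean and divides by its sample standard deviation, so a by-hand computation should agree with it.

```r
# Hypothetical toy data, not the sales data used above
toy <- data.frame(age = c(23, 35, 41, 58), spend = c(100, 750, 400, 1200))

# scale() centers each column and divides by its sample standard deviation
toy_std <- scale(toy)

# Same computation by hand, column by column
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Compare values only (scale() attaches extra attributes)
same <- isTRUE(all.equal(as.vector(toy_std), as.vector(manual)))

# Base-R equivalent of matrixStats::colSds()
col_sds <- apply(toy_std, 2, sd)
```

Standardizing first matters here because table size and dollars per purchase are on much larger scales than age or purchases per year, and would otherwise dominate any Euclidean distance.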
k-means cluster analysis
k = 2
km.2 = kmeans(dta_std, 2, nstart = 20)
dta_km2 = data.frame(dta,km.2$cluster)
aggregate(dta_km2, by = list(km.2$cluster), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year Dollars.per.purchase
## 1 1 38.92 881.28 3.32 551.6
## 2 2 50.00 2160.00 2.40 2010.0
## km.2.cluster
## 1 1
## 2 2
## For k = 2, there are 2 clusters with sizes 25 and 5, and the between-cluster sum of squares is 39.4% of the total sum of squares (between_SS / total_SS).
## Cluster 1 has younger customers (average age 38.92) who buy smaller tables (average table size 881.28) at a lower price (551.6 dollars per purchase) and purchase slightly more often (3.32 purchases per year).
## Cluster 2 has older customers (average age 50) who buy larger tables (average table size 2160.00) at a higher price (2010 dollars per purchase) and purchase less often (2.4 purchases per year).
k = 3
km.3 = kmeans(dta_std, 3, nstart = 20)
dta_km3 = data.frame(dta,km.3$cluster)
aggregate(dta_km3, by = list(km.3$cluster), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year
## 1 1 50.00000 2160.0000 2.400000
## 2 2 32.21053 854.5263 4.000000
## 3 3 60.16667 966.0000 1.166667
## Dollars.per.purchase km.3.cluster
## 1 2010.0000 1
## 2 568.4211 2
## 3 498.3333 3
## For k = 3, the clusters have sizes 5, 19 and 6 (clusters 1 to 3 as labelled in the table above), and the between-cluster sum of squares is 47.64% of the total sum of squares.
## Cluster 1 has older customers (average age 50) who buy the largest and most expensive tables in middle volumes.
## Cluster 2 has the youngest customers (average age 32.2) who buy the smallest tables at a low price in the highest volumes.
## Cluster 3 has the oldest customers (average age 60.2) who buy mid-sized and the least expensive tables in the lowest volumes.
k = 4
km.4 = kmeans(dta_std, 4, nstart = 20)
dta_km4 = data.frame(dta,km.4$cluster)
aggregate(dta_km4, by = list(km.4$cluster), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year Dollars.per.purchase
## 1 1 27.2 619.2 2.2 227.5
## 2 2 51.2 1058.4 1.6 689.0
## 3 3 37.8 1051.2 9.0 925.0
## 4 4 50.0 2160.0 2.4 2010.0
## km.4.cluster
## 1 1
## 2 2
## 3 3
## 4 4
## For k = 4, the clusters have sizes 10, 10, 5 and 5 (clusters 1 to 4 as labelled in the table above).
## Cluster 1 has the youngest customers (average age 27.2), with the smallest average table size and the lowest dollars per purchase.
## Cluster 2 has the oldest customers (average age 51.2), who are more likely to buy mid-sized and less expensive tables in the lowest volumes.
## Cluster 3 has middle-aged customers (average age 37.8), who are more likely to buy mid-sized and more expensive tables in the highest volumes.
## Cluster 4 has older customers (average age 50), who are more likely to buy the largest and most expensive tables in middle volumes.
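The cluster sizes and sum-of-squares shares quoted in the k-means comments can be read directly off the fitted object. The sketch below is a hedged illustration on made-up data: `toy_std` is a hypothetical stand-in for `dta_std`, and the seed value is arbitrary. Setting a seed keeps the random starts, and therefore the cluster labels, the same from run to run.

```r
set.seed(1)  # arbitrary seed; makes the random starts repeatable

# Hypothetical stand-in for dta_std: 30 rows, 4 standardized columns
toy_std <- scale(matrix(rnorm(120), nrow = 30, ncol = 4))

# For each k, print the cluster sizes and keep between_SS / total_SS
ratios <- sapply(2:4, function(k) {
  km <- kmeans(toy_std, centers = k, nstart = 20)
  # km$size is reported in the same order as the cluster labels
  cat("k =", k, "| sizes:", km$size,
      "| between_SS / total_SS:",
      round(100 * km$betweenss / km$totss, 1), "%\n")
  km$betweenss / km$totss
})
```

Since `km$tot.withinss + km$betweenss` equals `km$totss`, a larger between-cluster share means tighter clusters, and the share usually rises as k increases.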
Hierarchical cluster analysis
Distance matrix
(1) Correlation-based distance and complete linkage and describe the clusters.
cd = as.dist(1 - cor(t(dta_std)))
h.cd.complete = hclust(cd, method = 'complete')
plot(h.cd.complete)
abline(h = 1, col = 'red')

h1 = cutree(h.cd.complete,5)
dta1 = data.frame(dta,h1)
aggregate(dta1, by = list(h1), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year
## 1 1 27.33333 780 3.000000
## 2 2 38.66667 1476 4.333333
## 3 3 35.83333 738 6.666667
## 4 4 47.50000 1860 1.333333
## 5 5 60.16667 966 1.166667
## Dollars.per.purchase h1
## 1 450.0000 1
## 2 1450.0000 2
## 3 475.0000 3
## 4 1600.0000 4
## 5 498.3333 5
## Cluster 1 has the youngest customers, who buy smaller and the least expensive tables in middle volumes.
## Cluster 2 has middle-aged customers, who buy larger and more expensive tables in higher volumes.
## Cluster 3 has middle-aged customers, who buy the smallest and less expensive tables in the highest volumes.
## Cluster 4 has older customers, who buy the largest and most expensive tables in lower volumes.
## Cluster 5 has the oldest customers, who buy mid-sized and less expensive tables in the lowest volumes.
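To see what the correlation-based distance used in (1) measures, here is a small sketch with three made-up customer profiles (the matrix `toy` and its row names are hypothetical). `cor()` correlates columns, so the matrix is transposed to correlate rows, and `1 - cor` then ranges from 0 (same profile shape) to 2 (opposite shape).

```r
# Hypothetical standardized profiles: one row per customer, 4 columns
toy <- rbind(a = c(-1.0, -0.5,  0.5,  1.0),
             b = c(-2.0, -1.0,  1.0,  2.0),   # same shape as a, twice the magnitude
             c = c( 1.0,  0.5, -0.5, -1.0))   # mirror image of a

# Transpose so that cor() compares customers (rows), not variables
cd_toy <- as.dist(1 - cor(t(toy)))
ed_toy <- dist(toy, method = "euclidean")

round(as.matrix(cd_toy), 2)  # a-b is 0, a-c is 2
round(as.matrix(ed_toy), 2)  # a-b is positive even though the shapes match
```

Correlation-based distance ignores how large the values are and only compares the pattern across the four variables, which is one reason it can group customers differently from Euclidean distance.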
(2) Euclidean distance and average linkage and describe the clusters.
ed = dist(dta_std,method = 'euclidean')
h.ed.average = hclust(ed, method = 'average')
plot(h.ed.average)
abline(h = 2.2, col = 'red')

h2 = cutree(h.ed.average,4)
dta2 = data.frame(dta,h2)
aggregate(dta2, by = list(h2), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year
## 1 1 30.62500 812.25 2.687500
## 2 2 40.66667 1080.00 11.000000
## 3 3 50.00000 2160.00 2.400000
## 4 4 60.16667 966.00 1.166667
## Dollars.per.purchase h2
## 1 492.1875 1
## 2 975.0000 2
## 3 2010.0000 3
## 4 498.3333 4
## Cluster 1 has the youngest customers, who buy the smallest and least expensive tables in middle volumes.
## Cluster 2 has middle-aged customers, who buy mid-sized and more expensive tables in the highest volumes.
## Cluster 3 has older customers, who buy the largest and most expensive tables in middle volumes.
## Cluster 4 has the oldest customers, who buy smaller and less expensive tables in the lowest volumes.
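The red line drawn with `abline(h = ...)` and the `cutree(..., k)` call describe the same cut in two ways. The sketch below shows the link on made-up data (`toy_std` is a hypothetical stand-in for `dta_std`, and the seed is arbitrary): any height strictly between the merge that leaves 4 clusters and the merge that leaves 3 produces the same 4 groups as asking for `k = 4`.

```r
set.seed(1)  # arbitrary seed for the made-up data

# Hypothetical stand-in for dta_std
toy_std <- scale(matrix(rnorm(120), nrow = 30, ncol = 4))

hc <- hclust(dist(toy_std, method = "euclidean"), method = "average")

# Merge heights in increasing order (29 merges for 30 observations)
hts <- sort(hc$height)
n <- length(hts)

# Midpoint between the merge that leaves 4 clusters and the one that leaves 3
cut_h <- (hts[n - 3] + hts[n - 2]) / 2

by_k <- cutree(hc, k = 4)      # cut by number of clusters
by_h <- cutree(hc, h = cut_h)  # cut by height on the dendrogram

table(by_k)          # cluster sizes
all(by_k == by_h)    # both cuts assign every observation to the same group
```

Printing `table()` of the cut alongside the cluster means makes it easier to spot very small clusters before interpreting them.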
(3) correlation-based distance and average linkage and describe the clusters.
h.cd.average = hclust(cd, method = 'average')
plot(h.cd.average)
abline(h = 1, col = 'red')

h3 = cutree(h.cd.average,3)
dta3 = data.frame(dta,h3)
aggregate(dta3, by = list(h3), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year
## 1 1 30.73333 763.2 4.466667
## 2 2 44.55556 1732.0 2.333333
## 3 3 60.16667 966.0 1.166667
## Dollars.per.purchase h3
## 1 460.0000 1
## 2 1550.0000 2
## 3 498.3333 3
## Cluster 1 has the youngest customers, who are more likely to buy the smallest and least expensive tables in the highest volumes.
## Cluster 2 has middle-aged customers, who are more likely to buy the largest and most expensive tables in middle volumes.
## Cluster 3 has the oldest customers, who are more likely to buy mid-sized and less expensive tables in the lowest volumes.
(4) Euclidean distance and single linkage and describe the clusters.
h.ed.single = hclust(ed, method = 'single')
plot(h.ed.single)
abline(h = 1.2, col = 'red')

h4 = cutree(h.ed.single,4)
dta4 = data.frame(dta,h4)
aggregate(dta4, by = list(h4), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year
## 1 1 38.68182 854.1818 2.272727
## 2 2 40.66667 1080.0000 11.000000
## 3 3 44.00000 2160.0000 7.000000
## 4 4 51.50000 2160.0000 1.250000
## Dollars.per.purchase h4
## 1 493.8636 1
## 2 975.0000 2
## 3 2250.0000 3
## 4 1950.0000 4
## Cluster 1 has middle-aged customers, who are more likely to buy the smallest and least expensive tables in low volumes.
## Cluster 2 has middle-aged customers, who are more likely to buy smaller and less expensive tables in the highest volumes.
## Cluster 3 has only one customer.
## Cluster 4 has older customers, who are more likely to buy large and expensive tables in the lowest volumes.
(5) correlation-based distance and single linkage and describe the clusters.
h.cd.single = hclust(cd, method = 'single')
plot(h.cd.single)
abline(h = 0.16, col = 'red')

h5 = cutree(h.cd.single,5)
dta5 = data.frame(dta,h5)
aggregate(dta5, by = list(h5), FUN = mean)
## Group.1 Age Average.table.size Purchases.per.year
## 1 1 30.42857 766.2857 4.642857
## 2 2 38.66667 1476.0000 4.333333
## 3 3 47.50000 1860.0000 1.333333
## 4 4 35.00000 720.0000 2.000000
## 5 5 60.16667 966.0000 1.166667
## Dollars.per.purchase h5
## 1 450.0000 1
## 2 1450.0000 2
## 3 1600.0000 3
## 4 600.0000 4
## 5 498.3333 5
table(h5)
## h5
## 1 2 3 4 5
## 14 3 6 1 6
## Cluster 1 has the youngest customers, who are more likely to buy smaller and the least expensive tables in the highest volumes.
## Cluster 2 has middle-aged customers, who are more likely to buy larger and more expensive tables in higher volumes.
## Cluster 3 has older customers, who are more likely to buy the largest and most expensive tables in lower volumes.
## Cluster 4 has only one customer.
## Cluster 5 has the oldest customers, who are more likely to buy mid-sized and less expensive tables in the lowest volumes.
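Several of the solutions above contain very small clusters, and different linkages split the same customers differently. One way to check both is to tabulate each cut and cross-tabulate two of them. This is only a sketch on made-up data (`toy_std`, `g_avg` and `g_sgl` are hypothetical names, and the seed is arbitrary), so the printed counts say nothing about the sales data.

```r
set.seed(1)  # arbitrary seed for the made-up data

# Hypothetical stand-in for dta_std
toy_std <- scale(matrix(rnorm(120), nrow = 30, ncol = 4))
ed_toy <- dist(toy_std, method = "euclidean")

# Same distance matrix, two linkages, both cut into 4 groups
g_avg <- cutree(hclust(ed_toy, method = "average"), k = 4)
g_sgl <- cutree(hclust(ed_toy, method = "single"), k = 4)

# Cluster sizes under single linkage, and the labels of any one-member clusters
sizes_sgl <- table(g_sgl)
singletons <- names(sizes_sgl)[sizes_sgl == 1]

# Rows: average-linkage groups; columns: single-linkage groups
xt <- table(average = g_avg, single = g_sgl)
xt
```

A column of `xt` that holds most of the observations suggests single linkage has chained them into one large group, which would fit the one-customer clusters seen in (4) and (5).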