Tin Yun Hon Assignment 4

tyh6518 6-8pm

dta <- read.csv('sales_data.csv', header = TRUE)
str(dta)

## 'data.frame':    30 obs. of  4 variables:
##  $ Age                 : int  23 24 26 33 21 34 36 36 32 38 ...
##  $ Average.table.size  : int  576 720 576 1008 720 1008 1008 1260 576 576 ...
##  $ Purchases.per.year  : int  1 1 1 6 1 6 3 3 4 4 ...
##  $ Dollars.per.purchase: int  100 200 125 750 250 950 900 1200 200 150 ...

dta_std = scale(dta) ## standardize first
library(matrixStats)
colMeans(dta_std)

##                  Age   Average.table.size   Purchases.per.year 
##         6.996718e-17        -1.831868e-16         9.575674e-17 
## Dollars.per.purchase 
##         7.956598e-17

colSds(dta_std)

## [1] 1 1 1 1
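
The column means above are on the order of 1e-17 rather than exactly zero because of floating-point rounding. A minimal sketch of what scale() computes, using a small made-up data frame (the toy values and the names toy, toy_std and manual are illustrative assumptions, not part of sales_data.csv):

```r
# Toy data (hypothetical values, not taken from sales_data.csv)
toy <- data.frame(age = c(23, 34, 50, 61), spend = c(100, 950, 2000, 500))

# scale() subtracts each column mean and divides by each column sd
toy_std <- scale(toy)

# The same computation done by hand with sweep()
centered <- sweep(as.matrix(toy), 2, colMeans(toy), "-")
manual   <- sweep(centered, 2, apply(toy, 2, sd), "/")

# Both versions agree up to floating-point tolerance
all.equal(as.vector(toy_std), as.vector(manual))

# Base-R check of the column sds, without matrixStats
apply(toy_std, 2, sd)
```

Standardizing first matters here because the four variables are on very different scales, and both k-means and Euclidean distance would otherwise be dominated by the variables with the largest raw spread.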

k-means cluster analysis

k = 2

km.2 = kmeans(dta_std, 2, nstart = 20)
dta_km2 = data.frame(dta,km.2$cluster)
aggregate(dta_km2, by = list(km.2$cluster), FUN = mean)

##   Group.1   Age Average.table.size Purchases.per.year Dollars.per.purchase
## 1       1 38.92             881.28               3.32                551.6
## 2       2 50.00            2160.00               2.40               2010.0
##   km.2.cluster
## 1            1
## 2            2

## For k = 2, there are 2 clusters with sizes of 25 and 5, and the between-cluster sum of squares is 39.4% of the total sum of squares (between_SS / total_SS). Cluster 1 has younger customers (average age 38.92) who tend to buy smaller (average table size 881.28) and less expensive (551.6 dollars per purchase) tables, at 3.32 purchases per year. Cluster 2 has older customers (average age 50) who buy larger (average table size 2160.00) and more expensive (2010 dollars per purchase) tables, at 2.4 purchases per year.
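
The cluster sizes and the sum-of-squares percentage can be read directly from the fitted kmeans object. A minimal sketch, assuming the built-in iris measurements as stand-in data (the assignment itself uses dta_std) and an arbitrary seed:

```r
set.seed(1)                       # arbitrary seed so the random starts are repeatable
x  <- scale(iris[, 1:4])          # stand-in data, standardized like dta_std
km <- kmeans(x, centers = 2, nstart = 20)

km$size                           # number of observations in each cluster

# Share of the total sum of squares that lies between clusters;
# this is the percentage shown when a kmeans object is printed
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)

# The total splits into a between part and a within part
all.equal(km$totss, km$betweenss + km$tot.withinss)
```

Because the between-cluster share can only go up as k increases, it is more useful for comparing how much each extra cluster adds than as an absolute measure of quality.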

k = 3

km.3 = kmeans(dta_std, 3, nstart = 20)
dta_km3 = data.frame(dta,km.3$cluster)
aggregate(dta_km3, by = list(km.3$cluster), FUN = mean)

##   Group.1      Age Average.table.size Purchases.per.year
## 1       1 50.00000          2160.0000           2.400000
## 2       2 32.21053           854.5263           4.000000
## 3       3 60.16667           966.0000           1.166667
##   Dollars.per.purchase km.3.cluster
## 1            2010.0000            1
## 2             568.4211            2
## 3             498.3333            3

## There are 3 clusters with sizes of 5, 19 and 6. The between-cluster sum of squares is 47.64% of the total sum of squares. Cluster 1 has older customers (50) who buy the largest and most expensive tables in middle volumes. Cluster 2 has the youngest customers (32), who buy the smallest tables at low prices in the highest volumes. Cluster 3 has the oldest customers (60), who buy smaller tables at the lowest prices in the lowest volumes.

k = 4

km.4 = kmeans(dta_std, 4, nstart = 20)
dta_km4 = data.frame(dta,km.4$cluster)
aggregate(dta_km4, by = list(km.4$cluster), FUN = mean)

##   Group.1  Age Average.table.size Purchases.per.year Dollars.per.purchase
## 1       1 27.2              619.2                2.2                227.5
## 2       2 51.2             1058.4                1.6                689.0
## 3       3 37.8             1051.2                9.0                925.0
## 4       4 50.0             2160.0                2.4               2010.0
##   km.4.cluster
## 1            1
## 2            2
## 3            3
## 4            4

## The 4 clusters have sizes of 10, 10, 5 and 5. Cluster 1 has the youngest customers (average age 27.2), with the lowest average table size and the lowest dollars per purchase. Cluster 2 has the oldest customers (average age 51.2), who are more likely to buy middle-sized and less expensive tables in the lowest volumes. Cluster 3 has middle-aged customers who are more likely to buy middle-sized and more expensive tables in the highest volumes. Cluster 4 has older customers who are more likely to buy the largest and most expensive tables in middle volumes.
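
The cluster numbers that kmeans assigns are arbitrary and can change from run to run, so a written description can drift out of step with the printed table. A minimal sketch of one way to make the labels stable, by renumbering clusters in increasing order of the first variable's center (iris is stand-in data, and the names ord and relabel are illustrative):

```r
set.seed(1)                          # arbitrary seed for repeatable random starts
x  <- scale(iris[, 1:4])             # stand-in data, standardized like dta_std
km <- kmeans(x, centers = 3, nstart = 20)

# ord[i] is the original id of the cluster with the i-th smallest center
# on the first variable
ord <- order(km$centers[, 1])

# New label i goes to every observation whose original id is ord[i]
relabel <- match(km$cluster, ord)

# Cluster means of the first variable now increase with the label
tapply(x[, 1], relabel, mean)
table(relabel)                       # cluster sizes under the new labels
```

With the assignment data, ordering by Age would mean that cluster 1 is always the youngest group regardless of how the random starts fall.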

Hierarchical cluster analysis

Distance matrix

(1) correlation-based distance and complete linkage and describe the clusters.

cd = as.dist(1 - cor(t(dta_std)))
h.cd.complete = hclust(cd, method = 'complete')
plot(h.cd.complete)
abline(h = 1, col = 'red')

h1 = cutree(h.cd.complete,5)
dta1 = data.frame(dta,h1)
aggregate(dta1, by = list(h1), FUN = mean)

##   Group.1      Age Average.table.size Purchases.per.year
## 1       1 27.33333                780           3.000000
## 2       2 38.66667               1476           4.333333
## 3       3 35.83333                738           6.666667
## 4       4 47.50000               1860           1.333333
## 5       5 60.16667                966           1.166667
##   Dollars.per.purchase h1
## 1             450.0000  1
## 2            1450.0000  2
## 3             475.0000  3
## 4            1600.0000  4
## 5             498.3333  5

## Cluster 1 has the youngest customers, who buy smaller and the least expensive tables in middle volumes.
## Cluster 2 has middle-aged customers who buy larger and more expensive tables in higher volumes.
## Cluster 3 has middle-aged customers who buy the smallest and less expensive tables in the highest volumes.
## Cluster 4 has older customers who buy the largest and most expensive tables in lower volumes.
## Cluster 5 has the oldest customers, who buy middle-sized and less expensive tables in the lowest volumes.
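
Correlation-based distance compares the shape of each customer's standardized profile rather than its magnitude, which is why it can group customers differently from Euclidean distance. A minimal sketch with a made-up three-row matrix (the rows a, b and c are illustrative, not assignment data):

```r
# Three hypothetical profiles: b has the same shape as a at 10x the level,
# c is the mirror image of a
m <- rbind(a = c(1, 2, 3, 4),
           b = c(10, 20, 30, 40),
           c = c(4, 3, 2, 1))

# cor() correlates columns, so transpose to correlate the rows (observations)
cd <- as.dist(1 - cor(t(m)))
ed <- dist(m, method = "euclidean")

as.matrix(cd)   # a and b are at distance 0, a and c at distance 2
as.matrix(ed)   # a and b are far apart, a and c are close
```

Correlation distance ranges from 0 (perfectly correlated profiles) to 2 (perfectly opposite profiles), so a cut at height 1 separates profiles that are positively correlated from those that are not.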

(2) Euclidean distance and average linkage and describe the clusters.

ed = dist(dta_std,method = 'euclidean')
h.ed.average = hclust(ed, method = 'average')
plot(h.ed.average)
abline(h = 2.2, col = 'red')

h2 = cutree(h.ed.average,4)
dta2 = data.frame(dta,h2)
aggregate(dta2, by = list(h2), FUN = mean)

##   Group.1      Age Average.table.size Purchases.per.year
## 1       1 30.62500             812.25           2.687500
## 2       2 40.66667            1080.00          11.000000
## 3       3 50.00000            2160.00           2.400000
## 4       4 60.16667             966.00           1.166667
##   Dollars.per.purchase h2
## 1             492.1875  1
## 2             975.0000  2
## 3            2010.0000  3
## 4             498.3333  4

## Cluster 1 has the youngest customers, who buy the smallest and least expensive tables in middle volumes.
## Cluster 2 has middle-aged customers who buy middle-sized and more expensive tables in the highest volumes.
## Cluster 3 has older customers who buy the largest and most expensive tables in middle volumes.
## Cluster 4 has the oldest customers, who buy smaller and less expensive tables in the lowest volumes.
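
The red line drawn with abline() marks a cut height on the dendrogram, while cutree() is called with a number of clusters. A minimal sketch of how the two correspond, assuming the built-in USArrests data as a stand-in (not the assignment data) and illustrative names hc, by_k, h_cut and by_h:

```r
x  <- scale(USArrests)                      # stand-in data, standardized
hc <- hclust(dist(x), method = "average")
n  <- nrow(x)

# Cut by number of clusters
by_k <- cutree(hc, k = 4)

# hc$height holds the n - 1 merge heights in increasing order; after merge j
# there are n - j clusters, so any height between merges n - 4 and n - 3
# leaves exactly 4 clusters
h_cut <- mean(hc$height[c(n - 4, n - 3)])
by_h  <- cutree(hc, h = h_cut)

table(by_k)                                 # cluster sizes
all(by_k == by_h)                           # both cuts give the same memberships
```

Checking table() on the result is also a quick way to spot singleton clusters before describing them.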

(3) correlation-based distance and average linkage and describe the clusters.

h.cd.average = hclust(cd, method = 'average')
plot(h.cd.average)
abline(h = 1, col = 'red')

h3 = cutree(h.cd.average,3)
dta3 = data.frame(dta,h3)
aggregate(dta3, by = list(h3), FUN = mean)

##   Group.1      Age Average.table.size Purchases.per.year
## 1       1 30.73333              763.2           4.466667
## 2       2 44.55556             1732.0           2.333333
## 3       3 60.16667              966.0           1.166667
##   Dollars.per.purchase h3
## 1             460.0000  1
## 2            1550.0000  2
## 3             498.3333  3

## Cluster 1 has the youngest customers, who are more likely to buy the smallest and least expensive tables in the highest volumes.
## Cluster 2 has middle-aged customers who are more likely to buy the largest and most expensive tables in middle volumes.
## Cluster 3 has the oldest customers, who are more likely to buy middle-sized and less expensive tables in the lowest volumes.

(4) Euclidean distance and single linkage and describe the clusters.

h.ed.single = hclust(ed, method = 'single')
plot(h.ed.single)
abline(h = 1.2, col = 'red')

h4 = cutree(h.ed.single,4)
dta4 = data.frame(dta,h4)
aggregate(dta4, by = list(h4), FUN = mean)

##   Group.1      Age Average.table.size Purchases.per.year
## 1       1 38.68182           854.1818           2.272727
## 2       2 40.66667          1080.0000          11.000000
## 3       3 44.00000          2160.0000           7.000000
## 4       4 51.50000          2160.0000           1.250000
##   Dollars.per.purchase h4
## 1             493.8636  1
## 2             975.0000  2
## 3            2250.0000  3
## 4            1950.0000  4

## Cluster 1 has middle-aged customers who are more likely to buy the smallest and least expensive tables in low volumes.
## Cluster 2 has middle-aged customers who are more likely to buy smaller and less expensive tables in the highest volumes.
## Cluster 3 has only one customer (age 44, table size 2160, 7 purchases per year, 2250 dollars per purchase).
## Cluster 4 has older customers who are more likely to buy larger and more expensive tables in the lowest volumes.

(5) correlation-based distance and single linkage and describe the clusters.

h.cd.single = hclust(cd, method = 'single')
plot(h.cd.single)
abline(h = 0.16, col = 'red')

h5 = cutree(h.cd.single,5)
dta5 = data.frame(dta,h5)
aggregate(dta5, by = list(h5), FUN = mean)

##   Group.1      Age Average.table.size Purchases.per.year
## 1       1 30.42857           766.2857           4.642857
## 2       2 38.66667          1476.0000           4.333333
## 3       3 47.50000          1860.0000           1.333333
## 4       4 35.00000           720.0000           2.000000
## 5       5 60.16667           966.0000           1.166667
##   Dollars.per.purchase h5
## 1             450.0000  1
## 2            1450.0000  2
## 3            1600.0000  3
## 4             600.0000  4
## 5             498.3333  5

table(h5)

## h5
##  1  2  3  4  5 
## 14  3  6  1  6

## Cluster 1 has the youngest customers, who are more likely to buy smaller and the least expensive tables in the highest volumes.
## Cluster 2 has middle-aged customers who are more likely to buy larger and more expensive tables in higher volumes.
## Cluster 3 has older customers who are more likely to buy the largest and most expensive tables in lower volumes.
## Cluster 4 has only one customer.
## Cluster 5 has the oldest customers, who are more likely to buy middle-sized and less expensive tables in the lowest volumes.