In HW3, we will use the wine quality data set to answer the following questions.
# Distribution of wine quality
table(data$quality)
##
## 30 40 50 60 70 80
## 10 53 681 638 199 18
# The mean of wine quality
mean(data$quality)
## [1] 56.36023
# standard deviation of quality
sd(data$quality)
## [1] 8.075694
We can see that most observations have a value of quality rated near 50 and 60. The mean of quality is around 56, treating quality as a continuous variable. This gives us an idea about what the high and low quality indicate. The value of higher quality should be somehow above 56 and values of lower quality should be lower than 56, but not too far from the mean (because the standard deviation is around 8).
data <- na.omit(data)
data <- scale(data)
head(data)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 2 -0.2984541 1.9668271 -1.391037 0.04340257 0.22380518
## 3 -0.2984541 1.2966596 -1.185699 -0.16937425 0.09632273
## 4 1.6543385 -1.3840105 1.483689 -0.45307667 -0.26487754
## 5 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 6 -0.5281944 0.7381867 -1.391037 -0.52400227 -0.26487754
## free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 -0.46604672 -0.3790141 0.55809987 1.2882399
## 2 0.87236532 0.6241680 0.02825193 -0.7197081
## 3 -0.08364328 0.2289750 0.13422152 -0.3310730
## 4 0.10755844 0.4113718 0.66406945 -0.9787982
## 5 -0.46604672 -0.3790141 0.55809987 1.2882399
## 6 -0.27484500 -0.1966174 0.55809987 1.2882399
## sulphates alcohol quality
## 1 -0.57902538 -0.9599458 -0.7875763
## 2 0.12891007 -0.5845942 -0.7875763
## 3 -0.04807379 -0.5845942 -0.7875763
## 4 -0.46103614 -0.5845942 0.4507074
## 5 -0.57902538 -0.9599458 -0.7875763
## 6 -0.57902538 -0.9599458 -0.7875763
set.seed(123)
fviz_nbclust(data, kmeans, method = "wss")
By using this function, we can get a plot of optimal number of clusters. As the number of clusters (k) increase, wss decreases. However, since we are interested in clusters that have high and low values of the wine quality, we need to examine which number k (2,3,…) can make the group segmentation more meaningful in terms of wine quality rather than use k that minimizes wss.
# When K = 2 clusters
set.seed(123)
km.out=kmeans(data,2,nstart=20)
k2 <- kmeans(data, 2, nstart = 20)
data2 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
mutate(Cluster = k2$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
data2 <- as.data.frame(data2)
data2
## Cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 1 7.429783 0.6031854 0.1598126 2.436440
## 2 2 9.862051 0.3971880 0.4636581 2.716239
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 0.08115878 16.94970 51.19034 0.9963218 3.371154
## 2 0.09840000 14.01197 38.28205 0.9974832 3.207043
## sulphates alcohol quality
## 1 0.6024753 10.24446 54.18146
## 2 0.7546496 10.73242 60.13675
# When adding K = 2 clusters to original data and examining average values
# We find K is too small for us to compare the variables between different wine qualities.
# since there are only two groups.
# we cannot clearly see patterns in the variables.
# Examine K = 8 clusters, because at 8 the slope of the wss line flattens
# and adding another cluster does not improve wss much.
set.seed(123)
km.out=kmeans(data,8,nstart=20)
k8 <- kmeans(data, 8, nstart = 20)
data8 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
mutate(Cluster = k8$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
data8 <- as.data.frame(data8)
data8
## Cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 1 8.097799 0.5330975 0.28264151 2.521698
## 2 2 8.485714 0.5282143 0.50107143 1.989286
## 3 3 8.765546 0.3479622 0.42180672 2.442647
## 4 4 8.170588 0.5216176 0.35176471 9.532353
## 5 5 6.441837 0.5819643 0.09801020 2.161224
## 6 6 11.581000 0.4226000 0.53240000 2.778000
## 7 7 8.357422 0.5029102 0.26539062 2.130664
## 8 8 7.319453 0.7045289 0.07112462 2.346049
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 0.08521069 26.929245 88.18868 0.9971118 3.297484
## 2 0.35964286 15.357143 63.25000 0.9970864 3.043214
## 3 0.07552941 13.029412 29.68908 0.9956391 3.271261
## 4 0.10141176 34.176471 102.23529 0.9990579 3.260882
## 5 0.06810204 18.581633 40.54592 0.9942211 3.486173
## 6 0.08959000 10.845000 32.25000 0.9991640 3.139300
## 7 0.08654687 9.359375 32.94922 0.9970747 3.269687
## 8 0.08463830 11.916413 33.77812 0.9967071 3.413495
## sulphates alcohol quality
## 1 0.6305031 9.838155 52.95597
## 2 1.2789286 9.496429 53.57143
## 3 0.7383613 11.652101 65.67227
## 4 0.6541176 10.035294 56.17647
## 5 0.6391837 11.693197 60.61224
## 6 0.7207500 10.392333 58.25000
## 7 0.6041406 9.794727 53.90625
## 8 0.5896960 9.968794 51.39818
# With K = 8 clusters the differences between quality values are closer to each other
# As a result, it is harder to distinguish the major difference
# among variables between different wine qualities.
The process was repeated for values of K ranging from 3 to 7.
Most of the clusters generated presented either
or
# K = 3 clusters
set.seed(123)
km.out=kmeans(data,3,nstart=20)
fviz_cluster(km.out, data = data)
k3 <- kmeans(data, 3, nstart = 20)
data3 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
mutate(Cluster = k3$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
data3 <- as.data.frame(data3)
data3
## Cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 1 9.969763 0.3972332 0.4670356 2.591008
## 2 2 8.215681 0.5378406 0.2905398 3.054370
## 3 3 7.191051 0.6161435 0.1192472 2.216406
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 0.09881818 11.40119 31.19170 0.9974330 3.198794
## 2 0.08741131 26.58740 88.36761 0.9973112 3.283907
## 3 0.07933807 13.17116 34.29545 0.9959414 3.406875
## sulphates alcohol quality
## 1 0.7576877 10.829578 60.90909
## 2 0.6253213 9.831577 52.80206
## 3 0.6047443 10.457528 55.05682
K = 3 clusters divide the wine quality into different levels: 60.9 (higher value), 52.8 (lower value) and 55 (moderate value, near average 56).
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(dist(data, method = "euclidean"), method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1) #notice how row.names become labels
# Using the same number of groups in Question 1
# cut off = 3
sub_grp <- cutree(hc1, k = 3)
# Number of members in each cluster
table(sub_grp)
## sub_grp
## 1 2 3
## 1581 2 16
#Add clusters back into original data
library(dplyr)
d1 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
mutate(cluster = sub_grp) %>%
group_by(cluster) %>%
summarise_all("mean")
d1 <- as.data.frame(d1)
d1
## cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 1 8.317457 0.5292188 0.2686907 2.445383
## 2 2 8.450000 0.4650000 0.8800000 2.600000
## 3 3 8.518750 0.3975000 0.4206250 11.762500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 0.08661796 15.72233 45.94624 0.9967196 3.312638
## 2 0.61050000 20.00000 57.00000 0.9982000 2.900000
## 3 0.10593750 30.43750 96.68750 0.9992450 3.211875
## sulphates alcohol quality
## 1 0.6570209 10.42723 56.35041
## 2 1.6300000 9.40000 45.00000
## 3 0.6481250 10.13125 58.75000
#Visualize clusters
fviz_cluster(list(data = data, cluster = sub_grp))
By using Hierarchical Cluster Analysis, K = 3 clusters divide the wine quality into different levels: 58.75 (higher value), 45 (lower value) and 56.35 (moderate value, near average 56).
pr.out=prcomp(data, scale = TRUE)
# PC1
pr.out$rotation[,1]
## fixed.acidity volatile.acidity citric.acid
## 0.487883358 -0.265128984 0.473335467
## residual.sugar chlorides free.sulfur.dioxide
## 0.139154423 0.197426792 -0.045880713
## total.sulfur.dioxide density pH
## 0.004066746 0.370301191 -0.432720849
## sulphates alcohol quality
## 0.254535354 -0.073176777 0.112488776
# PC2
pr.out$rotation[,2]
## fixed.acidity volatile.acidity citric.acid
## -0.004173212 0.338967858 -0.137358104
## residual.sugar chlorides free.sulfur.dioxide
## 0.167736336 0.189788185 0.259483136
## total.sulfur.dioxide density pH
## 0.363971374 0.330780789 -0.065440145
## sulphates alcohol quality
## -0.109333620 -0.502708647 -0.473166214
biplot(pr.out, scale=0)
pr.out$sdev
## [1] 1.7666827 1.4972916 1.2972739 1.1022799 0.9865412 0.8139977 0.7863319
## [8] 0.7112472 0.6413326 0.5726425 0.4245216 0.2439629
pr.var=pr.out$sdev^2
pve=pr.var/sum(pr.var)
pve
## [1] 0.260097308 0.186823504 0.140243308 0.101251739 0.081105302
## [6] 0.055216020 0.051526483 0.042156046 0.034275628 0.027326616
## [11] 0.015018219 0.004959826
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),
type="b")
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained",
ylim=c(0,1),type="b")
The first PC (horizontal axis) places more positive weight on variable fixed.acidity and places negative weights on variable pH.
The second PC (vertical axis) places positive weight on variable total.sulfur.dioxide and negative weight on variables alcohol and quality.
In supervised learning, PCA can be used to perform regression models using the principal component scores as features (reference: link). For example, we can use the 5 most important variables (fixed.acidity, pH, total.sulfur.dioxide, alcohol and quality) in the dataset concentrated in its first two principal components as input variables and price as output variable for supervised learning, which could lead to a more robust model and predict the wine price better.