
1)

Use K Means Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.) Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

Wine Quality Frequency
Quality Frequency
30 10
40 53
50 681
60 638
70 199
80 18

Quality takes only six values, and most observations fall in the 50 and 60 categories. This makes it difficult to judge what counts as a good wine, or in other words how substantial the difference between a 50 and a 60 is. What can be said is that wines below 50 are lower-quality wines and wines at 70 or above are better-quality ones. The sample therefore has 217 “Good” wines and 63 “Bad” wines, and the objective of the clustering is to capture as many of each as possible in distinct clusters.
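The “Good” and “Bad” counts can be checked directly from the frequency table. The following is a minimal sketch that only re-tallies the numbers already shown; the variable names are illustrative.

```python
# Quality frequency table from the text above: quality score -> number of wines.
quality_counts = {30: 10, 40: 53, 50: 681, 60: 638, 70: 199, 80: 18}

# Thresholds as stated in the text: below 50 is "Bad", 70 or above is "Good".
bad = sum(n for q, n in quality_counts.items() if q < 50)
good = sum(n for q, n in quality_counts.items() if q >= 70)
total = sum(quality_counts.values())

print(f"Good: {good}, Bad: {bad}, Total: {total}")
# prints "Good: 217, Bad: 63, Total: 1599"
```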

5 Clusters
cluster   30   40   50   60   70   80
1          2    5   88  179   78    6
2          7   40  319  183    8    0
3          0    1   25  189  105   12
4          1    6  230   78    7    0
5          0    1   19    9    1    0

6 Clusters
cluster   30   40   50   60   70   80
1          0    1   16   12    5    0
2          0    1   18    9    1    0
3          2    3   84  175   77    6
4          7   37  302  174    8    0
5          0    1   24  179  103   12
6          1   10  237   89    5    0

7 Clusters
cluster   30   40   50   60   70   80
1          1    4  232   94    6    0
2          0    1   11  125  112   13
3          0    1   18    9    1    0
4          0    1   16   12    5    0
5          2    4   88  122   34    1
6          7   40  285  151    6    0
7          0    2   31  125   35    4

8 Clusters
cluster   30   40   50   60   70   80
1          1    4  218   86    6    0
2          0    0    3  102  111   13
3          0    1   16   12    5    0
4          1    2   63  110   34    1
5          0    1   17    9    1    0
6          7   33  197   89    2    0
7          0    2   29  124   37    4
8          1   10  138  106    3    0

As the tables above show, the 7-cluster solution captures the largest combined number of “Good” wines (125 of 217, in cluster 2) and “Bad” wines (47 of 63, in cluster 6) in single clusters.
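The comparison across cluster counts can be reproduced from the tables alone. The sketch below assumes the selection rule is "the largest number of Good (70 and 80) and Bad (30 and 40) wines held by any single cluster"; the function name `best_capture` is illustrative and not part of the original analysis.

```python
# Contingency tables copied from above: for each number of clusters k,
# one row per cluster with counts for quality 30, 40, 50, 60, 70, 80.
tables = {
    5: [[2, 5, 88, 179, 78, 6], [7, 40, 319, 183, 8, 0],
        [0, 1, 25, 189, 105, 12], [1, 6, 230, 78, 7, 0],
        [0, 1, 19, 9, 1, 0]],
    6: [[0, 1, 16, 12, 5, 0], [0, 1, 18, 9, 1, 0],
        [2, 3, 84, 175, 77, 6], [7, 37, 302, 174, 8, 0],
        [0, 1, 24, 179, 103, 12], [1, 10, 237, 89, 5, 0]],
    7: [[1, 4, 232, 94, 6, 0], [0, 1, 11, 125, 112, 13],
        [0, 1, 18, 9, 1, 0], [0, 1, 16, 12, 5, 0],
        [2, 4, 88, 122, 34, 1], [7, 40, 285, 151, 6, 0],
        [0, 2, 31, 125, 35, 4]],
    8: [[1, 4, 218, 86, 6, 0], [0, 0, 3, 102, 111, 13],
        [0, 1, 16, 12, 5, 0], [1, 2, 63, 110, 34, 1],
        [0, 1, 17, 9, 1, 0], [7, 33, 197, 89, 2, 0],
        [0, 2, 29, 124, 37, 4], [1, 10, 138, 106, 3, 0]],
}


def best_capture(rows):
    """Return (max Good count, max Bad count) held by any single cluster."""
    best_good = max(r[4] + r[5] for r in rows)  # quality 70 and 80
    best_bad = max(r[0] + r[1] for r in rows)   # quality 30 and 40
    return best_good, best_bad


summary = {k: best_capture(rows) for k, rows in tables.items()}
for k, (g, b) in summary.items():
    print(f"k={k}: Good={g}/217, Bad={b}/63, combined={g + b}")
```

Under this rule the 5-cluster solution matches the 7-cluster one on Bad wines (47) but captures fewer Good wines (117 against 125), so the combined count still favors 7 clusters.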

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality

In the 7-cluster solution, cluster 6 corresponds to lower-quality wines. Its most distinctive features are, on average: lower fixed.acidity, above-average volatile acidity, low citric acid, average sugar, average chlorides, low free sulphites, average total sulphites, higher pH, and below-average alcohol.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis?

In the 7-cluster solution, cluster 2 corresponds to higher-quality wines. Its most distinctive features are, on average: average fixed.acidity, below-average volatile acidity, higher citric acid, high sugar, average chlorides, average free sulphites, lower total sulphites, average pH, and above-average alcohol.
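Descriptions such as "above average" or "lower than average" can be read off by standardizing each variable and averaging the z-scores within each cluster. The sketch below shows one way to do this; the function name `cluster_profiles`, the four toy wines, and their values are hypothetical and are not taken from the dataset.

```python
from statistics import mean, pstdev


def cluster_profiles(rows, labels):
    """Mean z-score of each variable within each cluster.

    rows: list of dicts mapping variable name -> value (one dict per wine).
    labels: cluster label for each row, in the same order.
    """
    names = list(rows[0])
    mu = {v: mean(r[v] for r in rows) for v in names}
    sd = {v: pstdev([r[v] for r in rows]) for v in names}
    profiles = {}
    for c in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        profiles[c] = {
            v: mean((r[v] - mu[v]) / sd[v] for r in members) for v in names
        }
    return profiles


# Hypothetical toy data (made-up values, only to exercise the function).
rows = [
    {"alcohol": 9.0, "volatile.acidity": 0.70},
    {"alcohol": 9.5, "volatile.acidity": 0.65},
    {"alcohol": 12.0, "volatile.acidity": 0.30},
    {"alcohol": 12.5, "volatile.acidity": 0.35},
]
labels = [6, 6, 2, 2]

profiles = cluster_profiles(rows, labels)
for cluster, profile in profiles.items():
    print(cluster, {v: round(z, 2) for v, z in profile.items()})
```

A positive value means the cluster sits above the overall average for that variable, a negative value below it, and a value near zero corresponds to "average".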

2)

Use Hierarchical Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.) Use complete linkage and the same number of groups that you found to be the most meaningful in question 1. Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.
If you want to make a good bottle of wine, then what characteristics are most important according to this analysis? Have your conclusions changed using Hierarchical clustering rather than k means clustering? Present any figures that assist you in your analysis.

Wine Quality Frequency by Cluster (complete linkage, 7 clusters)
cluster   30   40   50   60   70   80
1          3   40  608  517  157   12
2          0    2   37   23    1    0
3          7    9   19   70   22    3
4          0    1    1    0    0    0
5          0    0   13   19   16    3
6          0    1    3    8    1    0
7          0    0    0    1    2    0

Hierarchical clustering does not seem to be a good way to separate good and bad quality wines. Although cluster 4 has a lower mean quality, it contains only two wines, and cluster 7, which has a higher mean quality, contains only three. Cluster 1, by contrast, holds most of the sample across every quality level.
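The imbalance can be quantified from the table by computing each cluster's size and mean quality. This is a minimal sketch that uses only the counts reported above; the variable names are illustrative.

```python
qualities = [30, 40, 50, 60, 70, 80]

# Complete-linkage table from above: cluster -> counts per quality level.
hier = {
    1: [3, 40, 608, 517, 157, 12],
    2: [0, 2, 37, 23, 1, 0],
    3: [7, 9, 19, 70, 22, 3],
    4: [0, 1, 1, 0, 0, 0],
    5: [0, 0, 13, 19, 16, 3],
    6: [0, 1, 3, 8, 1, 0],
    7: [0, 0, 0, 1, 2, 0],
}

stats = {}
for cluster, counts in hier.items():
    size = sum(counts)
    mean_quality = sum(q * n for q, n in zip(qualities, counts)) / size
    stats[cluster] = (size, mean_quality)
    print(f"cluster {cluster}: size={size}, mean quality={mean_quality:.1f}")
```

Cluster 1 alone contains 1337 of the 1599 wines, while the clusters with the most extreme mean quality (4 and 7) contain two and three wines respectively.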

3)

Use Principal Components Analysis to reduce the dimensions of your data. How much of the variation in your data is explained by the first two principal components? How might you use the first two components to do supervised learning on some other variable tied to wine (e.g., wine price)?

The first two principal components explain about 44.7% of the variation in the data (cumulative proportion 0.4469208).

We could regress wine price on the scores of the first two principal components. Then, for a wine that is not in our sample, we could compute its component scores from its chemical measurements and use the fitted regression to estimate its price.
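This approach is usually called principal components regression. The sketch below illustrates it with NumPy on randomly generated stand-in data. The data, the `price` variable, and the choice of 11 predictors are assumptions made only for illustration, so the printed share of variance will not match the 0.4469208 reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 wines, 11 chemical variables, made-up price.
X = rng.normal(size=(200, 11))
price = 10 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

# Standardize each variable, then obtain principal components from the SVD.
mu, sd = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sd
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)  # share of variance per component
print("Share explained by first two components:", explained[:2].sum())

# Scores on the first two components, then least squares of price on them.
W = Vt[:2].T                      # loadings, shape (11, 2)
scores = Z @ W                    # shape (200, 2)
A = np.column_stack([np.ones(len(scores)), scores])  # intercept + 2 scores
coef, *_ = np.linalg.lstsq(A, price, rcond=None)

# New wine: standardize with the training mean and sd, project, apply the fit.
new_wine = rng.normal(size=11)
new_scores = ((new_wine - mu) / sd) @ W
predicted_price = coef[0] + new_scores @ coef[1:]
print("Predicted price for the new wine:", predicted_price)
```

The new wine is standardized with the training mean and standard deviation and projected with the training loadings, so that its scores are on the same scale as those used to fit the regression.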