In HW3, we will use the wine quality data set to answer the following questions.

Read the data and related packages first

chooseCRANmirror(graphics=FALSE, ind=1)
knitr::opts_chunk$set(echo = TRUE)
data <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv")
library(ggplot2)
install.packages("factoextra")

## 
## The downloaded binary packages are in
##  /var/folders/zp/3g72mswd05d6nsdjf4g88dzc0000gn/T//RtmpI9cHxO/downloaded_packages

library(factoextra)

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

library(tidyverse)

## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --

## <U+221A> tibble  1.4.2     <U+221A> purrr   0.2.4
## <U+221A> tidyr   0.7.2     <U+221A> dplyr   0.7.4
## <U+221A> readr   1.1.1     <U+221A> stringr 1.2.0
## <U+221A> tibble  1.4.2     <U+221A> forcats 0.2.0

## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Since we are interested in wine quality, first, examining the distribution of quality in the dataset might be useful.

# Distribution of wine quality
table(data$quality)

## 
##  30  40  50  60  70  80 
##  10  53 681 638 199  18

# The mean of wine quality
mean(data$quality)

## [1] 56.36023

# standard deviation of quality
sd(data$quality)

## [1] 8.075694

Interpretation:

We can see that most observations have a value of quality rated near 50 and 60. The mean of quality is around 56, treating quality as a continuous variable. This gives us an idea about what the high and low quality indicate. The value of higher quality should be somehow above 56 and values of lower quality should be lower than 56, but not too far from the mean (because the standard deviation is around 8).

Delete N/A data and Scale the data

data <- na.omit(data)
data <- scale(data)
head(data)

##   fixed.acidity volatile.acidity citric.acid residual.sugar   chlorides
## 1    -0.5281944        0.9615758   -1.391037    -0.45307667 -0.24363047
## 2    -0.2984541        1.9668271   -1.391037     0.04340257  0.22380518
## 3    -0.2984541        1.2966596   -1.185699    -0.16937425  0.09632273
## 4     1.6543385       -1.3840105    1.483689    -0.45307667 -0.26487754
## 5    -0.5281944        0.9615758   -1.391037    -0.45307667 -0.24363047
## 6    -0.5281944        0.7381867   -1.391037    -0.52400227 -0.26487754
##   free.sulfur.dioxide total.sulfur.dioxide    density         pH
## 1         -0.46604672           -0.3790141 0.55809987  1.2882399
## 2          0.87236532            0.6241680 0.02825193 -0.7197081
## 3         -0.08364328            0.2289750 0.13422152 -0.3310730
## 4          0.10755844            0.4113718 0.66406945 -0.9787982
## 5         -0.46604672           -0.3790141 0.55809987  1.2882399
## 6         -0.27484500           -0.1966174 0.55809987  1.2882399
##     sulphates    alcohol    quality
## 1 -0.57902538 -0.9599458 -0.7875763
## 2  0.12891007 -0.5845942 -0.7875763
## 3 -0.04807379 -0.5845942 -0.7875763
## 4 -0.46103614 -0.5845942  0.4507074
## 5 -0.57902538 -0.9599458 -0.7875763
## 6 -0.57902538 -0.9599458 -0.7875763

Question 1)

Use K Means Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.)

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis?

First step:

set.seed(123)
fviz_nbclust(data, kmeans, method = "wss")

Interpretation:

By using this function, we can get a plot of optimal number of clusters. As the number of clusters (k) increase, wss decreases. However, since we are interested in clusters that have high and low values of the wine quality, we need to examine which number k (2,3,…) can make the group segmentation more meaningful in terms of wine quality rather than use k that minimizes wss.

Second step: Examine K = 2,3,4,…,8 to make a decision on K number

# When K = 2 clusters
set.seed(123)
km.out=kmeans(data,2,nstart=20)
k2 <- kmeans(data, 2, nstart = 20)

data2 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
  mutate(Cluster = k2$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
data2 <- as.data.frame(data2)
data2

##   Cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       1      7.429783        0.6031854   0.1598126       2.436440
## 2       2      9.862051        0.3971880   0.4636581       2.716239
##    chlorides free.sulfur.dioxide total.sulfur.dioxide   density       pH
## 1 0.08115878            16.94970             51.19034 0.9963218 3.371154
## 2 0.09840000            14.01197             38.28205 0.9974832 3.207043
##   sulphates  alcohol  quality
## 1 0.6024753 10.24446 54.18146
## 2 0.7546496 10.73242 60.13675

# When adding K = 2 clusters to original data and examining average values
# We find K is too small for us to compare the variables between different wine qualities.
# since there are only two groups.
# we cannot clearly see patterns in the variables.

# Examine K = 8 clusters, because at 8 the slope of the wss line flattens
# and adding another cluster does not improve wss much.  
set.seed(123)
km.out=kmeans(data,8,nstart=20)

k8 <- kmeans(data, 8, nstart = 20)

data8 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
  mutate(Cluster = k8$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
data8 <- as.data.frame(data8)
data8

##   Cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       1      8.097799        0.5330975  0.28264151       2.521698
## 2       2      8.485714        0.5282143  0.50107143       1.989286
## 3       3      8.765546        0.3479622  0.42180672       2.442647
## 4       4      8.170588        0.5216176  0.35176471       9.532353
## 5       5      6.441837        0.5819643  0.09801020       2.161224
## 6       6     11.581000        0.4226000  0.53240000       2.778000
## 7       7      8.357422        0.5029102  0.26539062       2.130664
## 8       8      7.319453        0.7045289  0.07112462       2.346049
##    chlorides free.sulfur.dioxide total.sulfur.dioxide   density       pH
## 1 0.08521069           26.929245             88.18868 0.9971118 3.297484
## 2 0.35964286           15.357143             63.25000 0.9970864 3.043214
## 3 0.07552941           13.029412             29.68908 0.9956391 3.271261
## 4 0.10141176           34.176471            102.23529 0.9990579 3.260882
## 5 0.06810204           18.581633             40.54592 0.9942211 3.486173
## 6 0.08959000           10.845000             32.25000 0.9991640 3.139300
## 7 0.08654687            9.359375             32.94922 0.9970747 3.269687
## 8 0.08463830           11.916413             33.77812 0.9967071 3.413495
##   sulphates   alcohol  quality
## 1 0.6305031  9.838155 52.95597
## 2 1.2789286  9.496429 53.57143
## 3 0.7383613 11.652101 65.67227
## 4 0.6541176 10.035294 56.17647
## 5 0.6391837 11.693197 60.61224
## 6 0.7207500 10.392333 58.25000
## 7 0.6041406  9.794727 53.90625
## 8 0.5896960  9.968794 51.39818

# With K = 8 clusters the differences between quality values are closer to each other
# As a result, it is harder to distinguish the major difference 
# among variables between different wine qualities.

Cluster analysis with K ranging from 3 to 7 - Results Interpretation:

The process was repeated for values of K ranging from 3 to 7.

Most of the clusters generated presented either

a challenge in clearly relating the change in the variables to the change in quality

a challenge in clearly identifying cluster boundaries (quality means were very close to one another).

The best K appears to be 3 where the above two challenges are not present concurrently.

Question 1) Answer:

Based on the above analysis, K = 3 makes the group division most meaningful in terms of wine quality

# K = 3 clusters
set.seed(123)
km.out=kmeans(data,3,nstart=20)
fviz_cluster(km.out, data = data)

k3 <- kmeans(data, 3, nstart = 20)

data3 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
  mutate(Cluster = k3$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
data3 <- as.data.frame(data3)
data3

##   Cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       1      9.969763        0.3972332   0.4670356       2.591008
## 2       2      8.215681        0.5378406   0.2905398       3.054370
## 3       3      7.191051        0.6161435   0.1192472       2.216406
##    chlorides free.sulfur.dioxide total.sulfur.dioxide   density       pH
## 1 0.09881818            11.40119             31.19170 0.9974330 3.198794
## 2 0.08741131            26.58740             88.36761 0.9973112 3.283907
## 3 0.07933807            13.17116             34.29545 0.9959414 3.406875
##   sulphates   alcohol  quality
## 1 0.7576877 10.829578 60.90909
## 2 0.6253213  9.831577 52.80206
## 3 0.6047443 10.457528 55.05682

Question 1) Answer Interpretation:

K = 3 clusters divide the wine quality into different levels: 60.9 (higher value), 52.8 (lower value) and 55 (moderate value, near average 56).

For the higher mean value of wine quality (60.9),

variables with the highest mean values amongst the 3 groups: fixed.acidity, citric.acid, chlorides, density, sulphates and alcohol.
variables with moderate mean values amongst the 3 groups: residual.sugar
and variables with have the lowest mean values amongst the 3 groups: volatile.acidity, free.sulfur.dioxide, total.sulfur.dioxide and pH.

For the modeate mean value of wine quality (55),

variables with the highest mean values amongst the 3 groups: volatile.acidity and pH
variables with moderate mean values amongst the 3 groups: free.sulfur.dioxide, total.sulfur.dioxide and alcohol
and variables with the lowest mean values amongst the 3 groups: fixed.acidity, citric.acid, residual.sugar, chlorides, density and sulphates.

For the lowest mean value of wine quality (52.8),

variables with the highest mean values amongst the 3 groups: residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide
variables with moderate mean values amongst the 3 groups: fixed.acidity, volatile.acidity, citric.acid, chlorides, density, pH and sulphates
and variables with the lowest mean values amongst the 3 groups: alcohol.

Three variables have the distinct patterns and very different values in line with the quality and can be used as a reference for making good quality wine. Specifically the most important characteristics for a good wine are low free.sulfur.dioxide and low total.sulfur.dioxide and high alcohol.

Question 2)

Use Hierarchical Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.) Use complete linkage and the same number of groups that you found to be the most meaningful in question 1.

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis? Have your conclusions changed using Hierarchical clustering rather than k means clustering? Present any figures that assist you in your analysis.

# Hierarchical clustering using Complete Linkage
hc1 <- hclust(dist(data, method = "euclidean"), method = "complete" )

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1) #notice how row.names become labels

# Using the same number of groups in Question 1
# cut off = 3
sub_grp <- cutree(hc1, k = 3)

# Number of members in each cluster
table(sub_grp)

## sub_grp
##    1    2    3 
## 1581    2   16

#Add clusters back into original data
library(dplyr)
d1 <- read.csv("/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/HW3/Wine_data.csv") %>%
  mutate(cluster = sub_grp) %>%
  group_by(cluster) %>%
  summarise_all("mean")
d1 <- as.data.frame(d1)
d1

##   cluster fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       1      8.317457        0.5292188   0.2686907       2.445383
## 2       2      8.450000        0.4650000   0.8800000       2.600000
## 3       3      8.518750        0.3975000   0.4206250      11.762500
##    chlorides free.sulfur.dioxide total.sulfur.dioxide   density       pH
## 1 0.08661796            15.72233             45.94624 0.9967196 3.312638
## 2 0.61050000            20.00000             57.00000 0.9982000 2.900000
## 3 0.10593750            30.43750             96.68750 0.9992450 3.211875
##   sulphates  alcohol  quality
## 1 0.6570209 10.42723 56.35041
## 2 1.6300000  9.40000 45.00000
## 3 0.6481250 10.13125 58.75000

#Visualize clusters
fviz_cluster(list(data = data, cluster = sub_grp))

Question 2) Answer:

By using Hierarchical Cluster Analysis, K = 3 clusters divide the wine quality into different levels: 58.75 (higher value), 45 (lower value) and 56.35 (moderate value, near average 56).

For the higher mean value of wine quality (58.75),

variables with the highest mean values amongst the 3 groups: fixed.acidity, residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide and density
variables with moderate mean values amongst the 3 groups: citric.acid, chlorides, pH and alcohol
and variables with have the lowest mean values amongst the 3 groups: volatile.acidity and sulphates

For the modeate mean value of wine quality (56.35),

variables with the highest mean values amongst the 3 groups: volatile.acidity, pH and alcohol
variables with moderate mean values amongst the 3 groups: sulphates
and variables with the lowest mean values amongst the 3 groups: fixed.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and density

For the lowest mean value of wine quality (45),

variables with the highest mean values amongst the 3 groups: citric.acid, chlorides and sulphates
variables with moderate mean values amongst the 3 groups: fixed.acidity, volatile.acidity, residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide and density
and variables with the lowest mean values amongst the 3 groups: pH and alcohol

There is only one variable that has a distinct pattern and very different values in line with the quality and can be used as a reference for making good quality wine. Specifically the most important characteristic for a good wine is low sulphates.

My conclusions have changed using Hierarchical Cluster Analysis. The reason might be that Hierarchical Cluster Analysis results in unbalanced groups due to the small number of clusters we were required to use from Question 1 (K = 3).

Question 3)

Use Principal Components Analysis to reduce the dimensions of your data. How much of the variation in your data is explained by the first two principal components. How might you use the first two components to do supervised learning on some other variable tied to wine (e.g. - wine price)?

pr.out=prcomp(data, scale = TRUE)

# PC1
pr.out$rotation[,1]

##        fixed.acidity     volatile.acidity          citric.acid 
##          0.487883358         -0.265128984          0.473335467 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          0.139154423          0.197426792         -0.045880713 
## total.sulfur.dioxide              density                   pH 
##          0.004066746          0.370301191         -0.432720849 
##            sulphates              alcohol              quality 
##          0.254535354         -0.073176777          0.112488776

# PC2
pr.out$rotation[,2]

##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.004173212          0.338967858         -0.137358104 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          0.167736336          0.189788185          0.259483136 
## total.sulfur.dioxide              density                   pH 
##          0.363971374          0.330780789         -0.065440145 
##            sulphates              alcohol              quality 
##         -0.109333620         -0.502708647         -0.473166214

biplot(pr.out, scale=0)

pr.out$sdev

##  [1] 1.7666827 1.4972916 1.2972739 1.1022799 0.9865412 0.8139977 0.7863319
##  [8] 0.7112472 0.6413326 0.5726425 0.4245216 0.2439629

pr.var=pr.out$sdev^2

pve=pr.var/sum(pr.var)

pve

##  [1] 0.260097308 0.186823504 0.140243308 0.101251739 0.081105302
##  [6] 0.055216020 0.051526483 0.042156046 0.034275628 0.027326616
## [11] 0.015018219 0.004959826

plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),
     type="b")

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", 
     ylim=c(0,1),type="b")

Question 3) Answer:

From the plot above, we observe that around 45% of the variation in the data is explained by the first two principal components.

The first PC (horizontal axis) places more positive weight on variable fixed.acidity and places negative weights on variable pH.

The second PC (vertical axis) places positive weight on variable total.sulfur.dioxide and negative weight on variables alcohol and quality.

In supervised learning, PCA can be used to perform regression models using the principal component scores as features (reference: link). For example, we can use the 5 most important variables (fixed.acidity, pH, total.sulfur.dioxide, alcohol and quality) in the dataset concentrated in its first two principal components as input variables and price as output variable for supervised learning, which could lead to a more robust model and predict the wine price better.

HW 3: Unsupervised Learning Practice

Dongnanzi Zheng

March 6, 2018

Since we are interested in wine quality, first, examining the distribution of quality in the dataset might be useful.

Interpretation:

Delete N/A data and Scale the data

Question 1)

Use K Means Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.)

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis?

First step:

Interpretation:

Second step: Examine K = 2,3,4,…,8 to make a decision on K number

Cluster analysis with K ranging from 3 to 7 - Results Interpretation:

The best K appears to be 3 where the above two challenges are not present concurrently.

Question 1) Answer:

Based on the above analysis, K = 3 makes the group division most meaningful in terms of wine quality

Question 1) Answer Interpretation:

Three variables have the distinct patterns and very different values in line with the quality and can be used as a reference for making good quality wine. Specifically the most important characteristics for a good wine are low free.sulfur.dioxide and low total.sulfur.dioxide and high alcohol.

Question 2)

Use Hierarchical Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.) Use complete linkage and the same number of groups that you found to be the most meaningful in question 1.

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis? Have your conclusions changed using Hierarchical clustering rather than k means clustering? Present any figures that assist you in your analysis.

Question 2) Answer:

There is only one variable that has a distinct pattern and very different values in line with the quality and can be used as a reference for making good quality wine. Specifically the most important characteristic for a good wine is low sulphates.

My conclusions have changed using Hierarchical Cluster Analysis. The reason might be that Hierarchical Cluster Analysis results in unbalanced groups due to the small number of clusters we were required to use from Question 1 (K = 3).

Question 3)

Use Principal Components Analysis to reduce the dimensions of your data. How much of the variation in your data is explained by the first two principal components. How might you use the first two components to do supervised learning on some other variable tied to wine (e.g. - wine price)?

Question 3) Answer:

From the plot above, we observe that around 45% of the variation in the data is explained by the first two principal components.