This is the R codes for the two questions in Assignment 2. Please download and load the data R data A2.RData to run the codes.

load("A2.RData")

From the R data, you are given two data frames cleanser and spotify, as well as two functions for cluster analysis cluster_mean and elbow_plot.

## cleanser : Classes 'tbl_df', 'tbl' and 'data.frame': 330 obs. of  8 variables:
##  $ id       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ profile  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ ratings  : num  6 3 5 2 3 1 1 6 6 6 ...
##  $ form     : Factor w/ 3 levels "Powder","Concentrate",..: 2 1 3 1 1 2 3 3 1 2 ...
##  $ noapply  : Factor w/ 3 levels "200 times","100 times",..: 1 1 2 1 3 1 2 1 2 3 ...
##  $ disinfect: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 2 ...
##  $ bio      : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 1 1 1 1 ...
##  $ price    : Factor w/ 3 levels "35 cents","49 cents",..: 1 1 2 2 3 3 3 2 2 2 ...
## cluster_mean : function (x)  
## elbow_plot : function (height, nclust = 10)  
## spotify : Classes 'tbl_df', 'tbl' and 'data.frame':  2017 obs. of  9 variables:
##  $ acousticness    : num  0.0102 0.199 0.0344 0.604 0.18 0.00479 0.0145 0.0202 0.0481 0.00208 ...
##  $ danceability    : num  0.833 0.743 0.838 0.494 0.678 0.804 0.739 0.266 0.603 0.836 ...
##  $ energy          : num  0.434 0.359 0.412 0.338 0.561 0.56 0.472 0.348 0.944 0.603 ...
##  $ instrumentalness: num  2.19e-02 6.11e-03 2.34e-04 5.10e-01 5.12e-01 0.00 7.27e-06 6.64e-01 0.00 0.00 ...
##  $ speechiness     : num  0.431 0.0794 0.289 0.0261 0.0694 0.185 0.156 0.0371 0.347 0.237 ...
##  $ valence         : num  0.286 0.588 0.173 0.23 0.904 0.264 0.308 0.393 0.398 0.386 ...
##  $ song_title      : chr  "Mask Off" "Redbone" "Xanny Family" "Master Of None" ...
##  $ artist          : chr  "Future" "Childish Gambino" "Future" "Beach House" ...
##  $ ID              : num  1 2 3 4 5 6 7 8 9 10 ...

Question 1

First, following the instructions in Question 1a, we will run a regression of ratings on the following variables: form, noapply, disinfect, bio, and price.

summary(lm(ratings ~ 1 + form + noapply + disinfect + bio + price, 
           data = cleanser))
## 
## Call:
## lm(formula = ratings ~ 1 + form + noapply + disinfect + bio + 
##     price, data = cleanser)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.193 -1.354  0.178  1.098  4.328 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.0204     0.3187  15.751  < 2e-16 ***
## formConcentrate    0.3665     0.2398   1.528   0.1274    
## formPremix        -0.2978     0.2569  -1.159   0.2472    
## noapply100 times  -0.1180     0.2471  -0.478   0.6333    
## noapply50 times   -0.4726     0.2565  -1.843   0.0663 .  
## disinfectYes       0.9433     0.2154   4.379 1.62e-05 ***
## bioYes             0.1497     0.2103   0.712   0.4771    
## price49 cents     -1.3947     0.2512  -5.553 5.90e-08 ***
## price79 cents     -2.8187     0.2502 -11.264  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.817 on 321 degrees of freedom
## Multiple R-squared:  0.3139, Adjusted R-squared:  0.2968 
## F-statistic: 18.36 on 8 and 321 DF,  p-value: < 2.2e-16

From the results, you can apply the 3 rules to transform the coefficients into partworths. Please see the suggested solutions for more details.

Question 2

For this question, we first run a hierarchical clustering analysis to determine the no. of clusters. Then, based on the no. of clusters, we obtain the cluster means and interpret the clusters.

Note that we will use the first 6 variables in the data frame spotify to run the hierarchical clustering.

Question 2a

# run a hierarchical clustering with Euclidean distance and Ward's method
spotify_dist <- dist(spotify[,1:6], method = "euclidean")
spotify_clust <- hclust(spotify_dist, method = "ward.D2")

From the results, spotify_clust, we obtain height and use it to produce an elbow plot.

elbow_plot(spotify_clust$height)

From the analysis, it is straightforward that the elbow point is 5, as from 4 to 5, the within-cluster variation has a big decrease, but from 5 to 6, the decrease is relatively small.

Next, we obtain the clustering output, and transform it into a factor indicating for each song the cluster it belongs to.

# if we set the no. of clusters to 5
clust_5 <- as.factor(cutree(spotify_clust, k = 5))

Question 2b

For Question 2b, we perform an ANOVA analysis to test if each variable indeed differs across the 5 clusters we have identified in clust_5.

anova_clust_5 <- aov(cbind(acousticness,
                           danceability,
                           energy,
                           instrumentalness,
                           speechiness,
                           valence) ~ clust_5, data = spotify)
summary(anova_clust_5)
##  Response acousticness :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 87.569 21.8923  904.44 < 2.2e-16 ***
## Residuals   2012 48.701  0.0242                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response danceability :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4  8.973 2.24332  104.23 < 2.2e-16 ***
## Residuals   2012 43.302 0.02152                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response energy :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 40.792  10.198  424.42 < 2.2e-16 ***
## Residuals   2012 48.345   0.024                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response instrumentalness :
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 131.352  32.838  3463.4 < 2.2e-16 ***
## Residuals   2012  19.077   0.009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response speechiness :
##               Df  Sum Sq  Mean Sq F value    Pr(>F)    
## clust_5        4  0.9113 0.227820  29.777 < 2.2e-16 ***
## Residuals   2012 15.3935 0.007651                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response valence :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 72.226 18.0565  712.86 < 2.2e-16 ***
## Residuals   2012 50.963  0.0253                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the analysis, all the variable differ significantly across the 5 clusters.

Question 2c

Next, we derive the cluster means by using the cluster_mean function. It takes in the ANOVA object.

cluster_mean(anova_clust_5)
##                   cluster_1  cluster_2  cluster_3  cluster_4 cluster_5
## acousticness     0.07870514 0.15388140 0.10211908 0.61749474 0.9103421
## danceability     0.58731090 0.69779907 0.62910931 0.56311053 0.4043026
## energy           0.72903712 0.73969003 0.74412955 0.39352316 0.1692197
## instrumentalness 0.01714902 0.03037208 0.71164777 0.01482938 0.7363289
## speechiness      0.09981253 0.11010093 0.07171417 0.04981684 0.0395000
## valence          0.37388538 0.76540031 0.43885587 0.35326000 0.1695118

Solutions when the no. of clusters is 4

Some of you may notice that based on the elbow plot, the elbow point should be 5. However, if you look closely, the 5th cluster in clust_5 is relatively small. You can use a function table to obtain the no. of songs in each cluster:

table(clust_5)
## clust_5
##   1   2   3   4   5 
## 862 642 247 190  76

The 5th cluster only has 76 songs or roughly 3.77%. This cluster is relatively small. So, it is also fine if you think the no. of clusters should be equal to 4 instead of 5 for easier interpretation. Next, I will show the solutions based on the no. of clusters equal to 4.

# obtain the clusters when the no. of clusters = 4
clust_4 <- as.factor(cutree(spotify_clust, k = 4))

# perform anova analysis 
anova_clust_4 <- aov(cbind(acousticness,
                           danceability,
                           energy,
                           instrumentalness,
                           speechiness,
                           valence) ~ clust_4, data = spotify)
summary(anova_clust_4)
##  Response acousticness :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 82.914 27.6379  1042.7 < 2.2e-16 ***
## Residuals   2013 53.357  0.0265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response danceability :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3  7.604 2.53473  114.22 < 2.2e-16 ***
## Residuals   2013 44.671 0.02219                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response energy :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 38.061 12.6870  500.02 < 2.2e-16 ***
## Residuals   2013 51.076  0.0254                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response instrumentalness :
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 103.093  34.364  1461.4 < 2.2e-16 ***
## Residuals   2013  47.336   0.024                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response speechiness :
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3  0.9055 0.30183  39.456 < 2.2e-16 ***
## Residuals   2013 15.3992 0.00765                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response valence :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 70.393 23.4643  894.65 < 2.2e-16 ***
## Residuals   2013 52.796  0.0262                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# obtain the cluster means 
cluster_mean(anova_clust_4)
##                   cluster_1  cluster_2  cluster_3  cluster_4
## acousticness     0.07870514 0.15388140 0.10211908 0.70116541
## danceability     0.58731090 0.69779907 0.62910931 0.51773684
## energy           0.72903712 0.73969003 0.74412955 0.32943647
## instrumentalness 0.01714902 0.03037208 0.71164777 0.22097212
## speechiness      0.09981253 0.11010093 0.07171417 0.04686917
## valence          0.37388538 0.76540031 0.43885587 0.30076053