This is the R codes for the two questions in Assignment 2. Please download and load the data R data A2.RData to run the codes.
load("A2.RData")From the R data, you are given two data frames cleanser and spotify, as well as two functions for cluster analysis cluster_mean and elbow_plot.
## cleanser : Classes 'tbl_df', 'tbl' and 'data.frame': 330 obs. of 8 variables:
## $ id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ profile : num 1 2 3 4 5 6 7 8 9 10 ...
## $ ratings : num 6 3 5 2 3 1 1 6 6 6 ...
## $ form : Factor w/ 3 levels "Powder","Concentrate",..: 2 1 3 1 1 2 3 3 1 2 ...
## $ noapply : Factor w/ 3 levels "200 times","100 times",..: 1 1 2 1 3 1 2 1 2 3 ...
## $ disinfect: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 2 ...
## $ bio : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 1 1 1 1 ...
## $ price : Factor w/ 3 levels "35 cents","49 cents",..: 1 1 2 2 3 3 3 2 2 2 ...
## cluster_mean : function (x)
## elbow_plot : function (height, nclust = 10)
## spotify : Classes 'tbl_df', 'tbl' and 'data.frame': 2017 obs. of 9 variables:
## $ acousticness : num 0.0102 0.199 0.0344 0.604 0.18 0.00479 0.0145 0.0202 0.0481 0.00208 ...
## $ danceability : num 0.833 0.743 0.838 0.494 0.678 0.804 0.739 0.266 0.603 0.836 ...
## $ energy : num 0.434 0.359 0.412 0.338 0.561 0.56 0.472 0.348 0.944 0.603 ...
## $ instrumentalness: num 2.19e-02 6.11e-03 2.34e-04 5.10e-01 5.12e-01 0.00 7.27e-06 6.64e-01 0.00 0.00 ...
## $ speechiness : num 0.431 0.0794 0.289 0.0261 0.0694 0.185 0.156 0.0371 0.347 0.237 ...
## $ valence : num 0.286 0.588 0.173 0.23 0.904 0.264 0.308 0.393 0.398 0.386 ...
## $ song_title : chr "Mask Off" "Redbone" "Xanny Family" "Master Of None" ...
## $ artist : chr "Future" "Childish Gambino" "Future" "Beach House" ...
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
Question 1
First, following the instructions in Question 1a, we will run a regression of ratings on the following variables: form, noapply, disinfect, bio, and price.
summary(lm(ratings ~ 1 + form + noapply + disinfect + bio + price,
data = cleanser))##
## Call:
## lm(formula = ratings ~ 1 + form + noapply + disinfect + bio +
## price, data = cleanser)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.193 -1.354 0.178 1.098 4.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0204 0.3187 15.751 < 2e-16 ***
## formConcentrate 0.3665 0.2398 1.528 0.1274
## formPremix -0.2978 0.2569 -1.159 0.2472
## noapply100 times -0.1180 0.2471 -0.478 0.6333
## noapply50 times -0.4726 0.2565 -1.843 0.0663 .
## disinfectYes 0.9433 0.2154 4.379 1.62e-05 ***
## bioYes 0.1497 0.2103 0.712 0.4771
## price49 cents -1.3947 0.2512 -5.553 5.90e-08 ***
## price79 cents -2.8187 0.2502 -11.264 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.817 on 321 degrees of freedom
## Multiple R-squared: 0.3139, Adjusted R-squared: 0.2968
## F-statistic: 18.36 on 8 and 321 DF, p-value: < 2.2e-16
From the results, you can apply the 3 rules to transform the coefficients into partworths. Please see the suggested solutions for more details.
Question 2
For this question, we first run a hierarchical clustering analysis to determine the no. of clusters. Then, based on the no. of clusters, we obtain the cluster means and interpret the clusters.
Note that we will use the first 6 variables in the data frame spotify to run the hierarchical clustering.
Question 2a
# run a hierarchical clustering with Euclidean distance and Ward's method
spotify_dist <- dist(spotify[,1:6], method = "euclidean")
spotify_clust <- hclust(spotify_dist, method = "ward.D2")From the results, spotify_clust, we obtain height and use it to produce an elbow plot.
elbow_plot(spotify_clust$height) From the analysis, it is straightforward that the elbow point is 5, as from 4 to 5, the within-cluster variation has a big decrease, but from 5 to 6, the decrease is relatively small.
Next, we obtain the clustering output, and transform it into a factor indicating for each song the cluster it belongs to.
# if we set the no. of clusters to 5
clust_5 <- as.factor(cutree(spotify_clust, k = 5))Question 2b
For Question 2b, we perform an ANOVA analysis to test if each variable indeed differs across the 5 clusters we have identified in clust_5.
anova_clust_5 <- aov(cbind(acousticness,
danceability,
energy,
instrumentalness,
speechiness,
valence) ~ clust_5, data = spotify)
summary(anova_clust_5)## Response acousticness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 87.569 21.8923 904.44 < 2.2e-16 ***
## Residuals 2012 48.701 0.0242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response danceability :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 8.973 2.24332 104.23 < 2.2e-16 ***
## Residuals 2012 43.302 0.02152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response energy :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 40.792 10.198 424.42 < 2.2e-16 ***
## Residuals 2012 48.345 0.024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response instrumentalness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 131.352 32.838 3463.4 < 2.2e-16 ***
## Residuals 2012 19.077 0.009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response speechiness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 0.9113 0.227820 29.777 < 2.2e-16 ***
## Residuals 2012 15.3935 0.007651
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response valence :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 72.226 18.0565 712.86 < 2.2e-16 ***
## Residuals 2012 50.963 0.0253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the analysis, all the variable differ significantly across the 5 clusters.
Question 2c
Next, we derive the cluster means by using the cluster_mean function. It takes in the ANOVA object.
cluster_mean(anova_clust_5)## cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
## acousticness 0.07870514 0.15388140 0.10211908 0.61749474 0.9103421
## danceability 0.58731090 0.69779907 0.62910931 0.56311053 0.4043026
## energy 0.72903712 0.73969003 0.74412955 0.39352316 0.1692197
## instrumentalness 0.01714902 0.03037208 0.71164777 0.01482938 0.7363289
## speechiness 0.09981253 0.11010093 0.07171417 0.04981684 0.0395000
## valence 0.37388538 0.76540031 0.43885587 0.35326000 0.1695118
Solutions when the no. of clusters is 4
Some of you may notice that based on the elbow plot, the elbow point should be 5. However, if you look closely, the 5th cluster in clust_5 is relatively small. You can use a function table to obtain the no. of songs in each cluster:
table(clust_5)## clust_5
## 1 2 3 4 5
## 862 642 247 190 76
The 5th cluster only has 76 songs or roughly 3.77%. This cluster is relatively small. So, it is also fine if you think the no. of clusters should be equal to 4 instead of 5 for easier interpretation. Next, I will show the solutions based on the no. of clusters equal to 4.
# obtain the clusters when the no. of clusters = 4
clust_4 <- as.factor(cutree(spotify_clust, k = 4))
# perform anova analysis
anova_clust_4 <- aov(cbind(acousticness,
danceability,
energy,
instrumentalness,
speechiness,
valence) ~ clust_4, data = spotify)
summary(anova_clust_4)## Response acousticness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 82.914 27.6379 1042.7 < 2.2e-16 ***
## Residuals 2013 53.357 0.0265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response danceability :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 7.604 2.53473 114.22 < 2.2e-16 ***
## Residuals 2013 44.671 0.02219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response energy :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 38.061 12.6870 500.02 < 2.2e-16 ***
## Residuals 2013 51.076 0.0254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response instrumentalness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 103.093 34.364 1461.4 < 2.2e-16 ***
## Residuals 2013 47.336 0.024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response speechiness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 0.9055 0.30183 39.456 < 2.2e-16 ***
## Residuals 2013 15.3992 0.00765
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response valence :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 70.393 23.4643 894.65 < 2.2e-16 ***
## Residuals 2013 52.796 0.0262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# obtain the cluster means
cluster_mean(anova_clust_4)## cluster_1 cluster_2 cluster_3 cluster_4
## acousticness 0.07870514 0.15388140 0.10211908 0.70116541
## danceability 0.58731090 0.69779907 0.62910931 0.51773684
## energy 0.72903712 0.73969003 0.74412955 0.32943647
## instrumentalness 0.01714902 0.03037208 0.71164777 0.22097212
## speechiness 0.09981253 0.11010093 0.07171417 0.04686917
## valence 0.37388538 0.76540031 0.43885587 0.30076053