This is the R codes for the two questions in Assignment 2. Please download and load the data R data A2.RData to run the codes. Also, don’t forget to load the MSR package to your work space.

# load the MSR package 
library(MSR)

# load the data 
load("A2.RData")

From the R data, you are given two data frames app.installs and spotify.

Question 1

Question 1a

For Question 2a, we will first estimate the \(p\), \(q\), \(M\) of the apps in the app.installs. Notice that app.installs is a list containing the daily installations of apps. To refer to the daily installations within the list, we use the $ sign like in data frames. For example, to refer to the daily installations of “Dogbook”, we use app.installs$Dogbook.

To estimate the Bass parameters, you may use the function estimate_bass, which can be found in the MSR package. One way is to do the estimation one by one for all the apps, like the following:

  • estimate_bass(app.installs$Dogbook)
  • estimate_bass(app.installs$Books)
  • \(\cdots \cdots\)
  • estimate_bass(app.installs$Weather)

Alternatively, you can use a function at R called lapply which applies a function to a list. For more information of lapply, please run ?lapply in your command line.

bass_parameters <- lapply(app.installs,estimate_bass)

The results of the estimation are:

##                                   p           q          M
## Dogbook                0.0012085370 0.009555393  426147556
## Books                  0.0011914796 0.007892291   26127121
## PuzzleBee              0.0011913232 0.011301563  432231050
## Astrology              0.0012175449 0.016335565  391132854
## Languages              0.0012181564 0.017320782    3235610
## BioRhythms             0.0047749840 0.013975515   11855916
## TV.Shows               0.0015368375 0.013910218    7208739
## X.fluff.Friends        0.0012821425 0.014008677 1110381495
## The.Greek.Community    0.0023553027 0.013864790   37232112
## Pop.Culture.Quizzes    0.0008782294 0.014046545   27159228
## My.Countdowns          0.0010571332 0.006894111   51303262
## Yo.Momma               0.0029939254 0.015094171   53674508
## Fantasy.Stock.Exchange 0.0022325679 0.009898809   53947020
## PersonalDNA            0.0043451070 0.017478213   54315815
## Weather                0.0007818028 0.018393308   54432287

Question 2c

As an example, I will set \(p = 0.001\), \(q = 0.01\) and \(M = 50,000,000\). This is ONLY an example! You may use any combinations of \(\{p,q,M\}\) as you see reasonable in Question 2b.

We set T = 181:200 in total 20 days. You can use the function predict_bass from the MSR package.

# setting p, q and M
p <- 0.001
q <- 0.01
M <- 50000000

# setting the value of time periods
T <- 181:200

# do the prediction
N <- predict_bass(T,c(p,q,M))
N
##  [1] 18250034 18397795 18545800 18694038 18842503 18991184 19140075 19289166
##  [9] 19438448 19587914 19737553 19887358 20037320 20187429 20337678 20488056
## [17] 20638556 20789167 20939882 21090691

We can also plot the cumulative no. of installations with a line plot:

plot(T,N,type = "l")

Question 2

For this question, we first run a hierarchical clustering analysis to determine the no. of clusters. Then, based on the no. of clusters, we obtain the cluster means and interpret the clusters.

Note that we will use the first 6 variables in the data frame spotify to run the hierarchical clustering.

Question 2a

# run a hierarchical clustering with Euclidean distance and Ward's method
spotify_dist <- dist(spotify[,1:6], method = "euclidean")
spotify_clust <- hclust(spotify_dist, method = "ward.D2")

From the results, spotify_clust, we obtain height and use it to produce an elbow plot.

elbow_plot(spotify_clust$height)

From the analysis, it is straightforward that the elbow point is 5, as from 4 to 5, the within-cluster variation has a big decrease, but from 5 to 6, the decrease is relatively small.

Next, we obtain the clustering output, and transform it into a factor indicating for each song the cluster it belongs to.

# if we set the no. of clusters to 5
clust_5 <- as.factor(cutree(spotify_clust, k = 5))

Question 2b

For Question 2b, we perform an ANOVA analysis to test if each variable indeed differs across the 5 clusters we have identified in clust_5.

anova_clust_5 <- aov(cbind(acousticness,
                           danceability,
                           energy,
                           instrumentalness,
                           speechiness,
                           valence) ~ clust_5, data = spotify)
summary(anova_clust_5)
##  Response acousticness :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 87.569 21.8923  904.44 < 2.2e-16 ***
## Residuals   2012 48.701  0.0242                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response danceability :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4  8.973 2.24332  104.23 < 2.2e-16 ***
## Residuals   2012 43.302 0.02152                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response energy :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 40.792  10.198  424.42 < 2.2e-16 ***
## Residuals   2012 48.345   0.024                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response instrumentalness :
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 131.352  32.838  3463.4 < 2.2e-16 ***
## Residuals   2012  19.077   0.009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response speechiness :
##               Df  Sum Sq  Mean Sq F value    Pr(>F)    
## clust_5        4  0.9113 0.227820  29.777 < 2.2e-16 ***
## Residuals   2012 15.3935 0.007651                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response valence :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_5        4 72.226 18.0565  712.86 < 2.2e-16 ***
## Residuals   2012 50.963  0.0253                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the analysis, all the variable differ significantly across the 5 clusters.

Question 2c

Next, we derive the cluster means by using the cluster_mean function. It takes in the ANOVA object.

cluster_mean(anova_clust_5)
##                   cluster_1  cluster_2  cluster_3  cluster_4 cluster_5
## acousticness     0.07870514 0.15388140 0.10211908 0.61749474 0.9103421
## danceability     0.58731090 0.69779907 0.62910931 0.56311053 0.4043026
## energy           0.72903712 0.73969003 0.74412955 0.39352316 0.1692197
## instrumentalness 0.01714902 0.03037208 0.71164777 0.01482938 0.7363289
## speechiness      0.09981253 0.11010093 0.07171417 0.04981684 0.0395000
## valence          0.37388538 0.76540031 0.43885587 0.35326000 0.1695118

Solutions when the no. of clusters is 4

Some of you may notice that based on the elbow plot, the elbow point should be 5. However, if you look closely, the 5th cluster in clust_5 is relatively small. You can use a function table to obtain the no. of songs in each cluster:

table(clust_5)
## clust_5
##   1   2   3   4   5 
## 862 642 247 190  76

The 5th cluster only has 76 songs or roughly 3.77%. This cluster is relatively small. So, it is also fine if you think the no. of clusters should be equal to 4 instead of 5 for easier interpretation. Next, I will show the solutions based on the no. of clusters equal to 4.

# obtain the clusters when the no. of clusters = 4
clust_4 <- as.factor(cutree(spotify_clust, k = 4))

# perform anova analysis 
anova_clust_4 <- aov(cbind(acousticness,
                           danceability,
                           energy,
                           instrumentalness,
                           speechiness,
                           valence) ~ clust_4, data = spotify)
summary(anova_clust_4)
##  Response acousticness :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 82.914 27.6379  1042.7 < 2.2e-16 ***
## Residuals   2013 53.357  0.0265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response danceability :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3  7.604 2.53473  114.22 < 2.2e-16 ***
## Residuals   2013 44.671 0.02219                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response energy :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 38.061 12.6870  500.02 < 2.2e-16 ***
## Residuals   2013 51.076  0.0254                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response instrumentalness :
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 103.093  34.364  1461.4 < 2.2e-16 ***
## Residuals   2013  47.336   0.024                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response speechiness :
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3  0.9055 0.30183  39.456 < 2.2e-16 ***
## Residuals   2013 15.3992 0.00765                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response valence :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## clust_4        3 70.393 23.4643  894.65 < 2.2e-16 ***
## Residuals   2013 52.796  0.0262                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# obtain the cluster means 
cluster_mean(anova_clust_4)
##                   cluster_1  cluster_2  cluster_3  cluster_4
## acousticness     0.07870514 0.15388140 0.10211908 0.70116541
## danceability     0.58731090 0.69779907 0.62910931 0.51773684
## energy           0.72903712 0.73969003 0.74412955 0.32943647
## instrumentalness 0.01714902 0.03037208 0.71164777 0.22097212
## speechiness      0.09981253 0.11010093 0.07171417 0.04686917
## valence          0.37388538 0.76540031 0.43885587 0.30076053