This is the R codes for the two questions in Assignment 2. Please
download and load the data R data A2.RData to run the
codes. Also, don’t forget to load the MSR package to your
work space.
From the R data, you are given two data frames
app.installs and spotify.
Question 1
Question 1a
For Question 2a, we will first estimate the \(p\), \(q\), \(M\)
of the apps in the app.installs. Notice that
app.installs is a list containing the daily installations
of apps. To refer to the daily installations within the list, we use the
$ sign like in data frames. For example, to refer to the
daily installations of “Dogbook”, we use
app.installs$Dogbook.
To estimate the Bass parameters, you may use the function
estimate_bass, which can be found in the MSR
package. One way is to do the estimation one by one for all the apps,
like the following:
estimate_bass(app.installs$Dogbook)estimate_bass(app.installs$Books)- \(\cdots \cdots\)
estimate_bass(app.installs$Weather)
Alternatively, you can use a function at R called lapply
which applies a function to a list. For more information of
lapply, please run ?lapply in your command
line.
The results of the estimation are:
## p q M
## Dogbook 0.0012085370 0.009555393 426147556
## Books 0.0011914796 0.007892291 26127121
## PuzzleBee 0.0011913232 0.011301563 432231050
## Astrology 0.0012175449 0.016335565 391132854
## Languages 0.0012181564 0.017320782 3235610
## BioRhythms 0.0047749840 0.013975515 11855916
## TV.Shows 0.0015368375 0.013910218 7208739
## X.fluff.Friends 0.0012821425 0.014008677 1110381495
## The.Greek.Community 0.0023553027 0.013864790 37232112
## Pop.Culture.Quizzes 0.0008782294 0.014046545 27159228
## My.Countdowns 0.0010571332 0.006894111 51303262
## Yo.Momma 0.0029939254 0.015094171 53674508
## Fantasy.Stock.Exchange 0.0022325679 0.009898809 53947020
## PersonalDNA 0.0043451070 0.017478213 54315815
## Weather 0.0007818028 0.018393308 54432287
Question 2c
As an example, I will set \(p = 0.001\), \(q = 0.01\) and \(M = 50,000,000\). This is ONLY an example! You may use any combinations of \(\{p,q,M\}\) as you see reasonable in Question 2b.
We set T = 181:200 in total 20 days. You can use the
function predict_bass from the MSR
package.
# setting p, q and M
p <- 0.001
q <- 0.01
M <- 50000000
# setting the value of time periods
T <- 181:200
# do the prediction
N <- predict_bass(T,c(p,q,M))
N## [1] 18250034 18397795 18545800 18694038 18842503 18991184 19140075 19289166
## [9] 19438448 19587914 19737553 19887358 20037320 20187429 20337678 20488056
## [17] 20638556 20789167 20939882 21090691
We can also plot the cumulative no. of installations with a line plot:
Question 2
For this question, we first run a hierarchical clustering analysis to determine the no. of clusters. Then, based on the no. of clusters, we obtain the cluster means and interpret the clusters.
Note that we will use the first 6 variables in the data frame
spotify to run the hierarchical clustering.
Question 2a
# run a hierarchical clustering with Euclidean distance and Ward's method
spotify_dist <- dist(spotify[,1:6], method = "euclidean")
spotify_clust <- hclust(spotify_dist, method = "ward.D2")From the results, spotify_clust, we obtain
height and use it to produce an elbow plot.
From the analysis, it is straightforward that the elbow point is 5, as from 4 to 5, the within-cluster variation has a big decrease, but from 5 to 6, the decrease is relatively small.
Next, we obtain the clustering output, and transform it into a factor indicating for each song the cluster it belongs to.
Question 2b
For Question 2b, we perform an ANOVA analysis to test if each
variable indeed differs across the 5 clusters we have identified in
clust_5.
anova_clust_5 <- aov(cbind(acousticness,
danceability,
energy,
instrumentalness,
speechiness,
valence) ~ clust_5, data = spotify)
summary(anova_clust_5)## Response acousticness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 87.569 21.8923 904.44 < 2.2e-16 ***
## Residuals 2012 48.701 0.0242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response danceability :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 8.973 2.24332 104.23 < 2.2e-16 ***
## Residuals 2012 43.302 0.02152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response energy :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 40.792 10.198 424.42 < 2.2e-16 ***
## Residuals 2012 48.345 0.024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response instrumentalness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 131.352 32.838 3463.4 < 2.2e-16 ***
## Residuals 2012 19.077 0.009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response speechiness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 0.9113 0.227820 29.777 < 2.2e-16 ***
## Residuals 2012 15.3935 0.007651
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response valence :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_5 4 72.226 18.0565 712.86 < 2.2e-16 ***
## Residuals 2012 50.963 0.0253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the analysis, all the variable differ significantly across the 5 clusters.
Question 2c
Next, we derive the cluster means by using the
cluster_mean function. It takes in the ANOVA object.
## cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
## acousticness 0.07870514 0.15388140 0.10211908 0.61749474 0.9103421
## danceability 0.58731090 0.69779907 0.62910931 0.56311053 0.4043026
## energy 0.72903712 0.73969003 0.74412955 0.39352316 0.1692197
## instrumentalness 0.01714902 0.03037208 0.71164777 0.01482938 0.7363289
## speechiness 0.09981253 0.11010093 0.07171417 0.04981684 0.0395000
## valence 0.37388538 0.76540031 0.43885587 0.35326000 0.1695118
Solutions when the no. of clusters is 4
Some of you may notice that based on the elbow plot, the elbow point
should be 5. However, if you look closely, the 5th cluster in
clust_5 is relatively small. You can use a function table
to obtain the no. of songs in each cluster:
## clust_5
## 1 2 3 4 5
## 862 642 247 190 76
The 5th cluster only has 76 songs or roughly 3.77%. This cluster is relatively small. So, it is also fine if you think the no. of clusters should be equal to 4 instead of 5 for easier interpretation. Next, I will show the solutions based on the no. of clusters equal to 4.
# obtain the clusters when the no. of clusters = 4
clust_4 <- as.factor(cutree(spotify_clust, k = 4))
# perform anova analysis
anova_clust_4 <- aov(cbind(acousticness,
danceability,
energy,
instrumentalness,
speechiness,
valence) ~ clust_4, data = spotify)
summary(anova_clust_4)## Response acousticness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 82.914 27.6379 1042.7 < 2.2e-16 ***
## Residuals 2013 53.357 0.0265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response danceability :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 7.604 2.53473 114.22 < 2.2e-16 ***
## Residuals 2013 44.671 0.02219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response energy :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 38.061 12.6870 500.02 < 2.2e-16 ***
## Residuals 2013 51.076 0.0254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response instrumentalness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 103.093 34.364 1461.4 < 2.2e-16 ***
## Residuals 2013 47.336 0.024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response speechiness :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 0.9055 0.30183 39.456 < 2.2e-16 ***
## Residuals 2013 15.3992 0.00765
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response valence :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_4 3 70.393 23.4643 894.65 < 2.2e-16 ***
## Residuals 2013 52.796 0.0262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## cluster_1 cluster_2 cluster_3 cluster_4
## acousticness 0.07870514 0.15388140 0.10211908 0.70116541
## danceability 0.58731090 0.69779907 0.62910931 0.51773684
## energy 0.72903712 0.73969003 0.74412955 0.32943647
## instrumentalness 0.01714902 0.03037208 0.71164777 0.22097212
## speechiness 0.09981253 0.11010093 0.07171417 0.04686917
## valence 0.37388538 0.76540031 0.43885587 0.30076053