The dataset I used contains Spotify songs released between 1921 and 2020.
library(reshape2)  # melt()
library(ggplot2)
library(forecast)  # ses(), ets(), auto.arima(), tslm(), forecast(), accuracy()
library(vars)      # VAR(), VARselect(), serial.test()

Songs <- read.csv("data_by_year.csv")       # yearly averages of audio features
Allsongs <- read.csv("data.csv")            # one row per track
Songsmelt <- melt(Songs, id.vars = "year")  # long format for faceted plotting
There are 14 variables, including the release year, acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, and popularity. Each is reported as the average over songs released in that year. Popularity is interesting but hard to interpret: the ratings were collected recently, so average popularity keeps rising simply because listeners favor the songs they are used to hearing. Looking at the plots, I find danceability the most interesting. Songs from the 1920s-1930s have high danceability scores, the scores fall to their lowest point around the 1950s, and they have risen steadily since then. This pattern may be related to the Great Depression and the economic boom after WW2.
ggplot(Songsmelt, aes(x = year, y = value)) + geom_point() + facet_wrap(~ variable, scales = "free") + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Songsts <- ts(Songs, start = 1921, frequency = 1)  # all yearly averages as a ts
SongPopularity <- ts(Songs[, "popularity"], start = 1921, frequency = 1)
SongDance <- ts(Songs[, "danceability"], start = 1921, frequency = 1)
SongTempo <- ts(Songs[, "tempo"], start = 1921, frequency = 1)
autoplot(SongPopularity, main='Song Popularity')
autoplot(SongDance, main='Song Danceability')
autoplot(SongTempo, main='Song Tempo')
cor(SongDance, SongTempo)
## [1] 0.6035434
summary(Allsongs)
## valence year acousticness artists
## Min. :0.0000 Min. :1921 Min. :0.0000 Length:170653
## 1st Qu.:0.3170 1st Qu.:1956 1st Qu.:0.1020 Class :character
## Median :0.5400 Median :1977 Median :0.5160 Mode :character
## Mean :0.5286 Mean :1977 Mean :0.5021
## 3rd Qu.:0.7470 3rd Qu.:1999 3rd Qu.:0.8930
## Max. :1.0000 Max. :2020 Max. :0.9960
## danceability duration_ms energy explicit
## Min. :0.0000 Min. : 5108 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.4150 1st Qu.: 169827 1st Qu.:0.2550 1st Qu.:0.00000
## Median :0.5480 Median : 207467 Median :0.4710 Median :0.00000
## Mean :0.5374 Mean : 230948 Mean :0.4824 Mean :0.08458
## 3rd Qu.:0.6680 3rd Qu.: 262400 3rd Qu.:0.7030 3rd Qu.:0.00000
## Max. :0.9880 Max. :5403500 Max. :1.0000 Max. :1.00000
## id instrumentalness key liveness
## Length:170653 Min. :0.000000 Min. : 0.0 Min. :0.0000
## Class :character 1st Qu.:0.000000 1st Qu.: 2.0 1st Qu.:0.0988
## Mode :character Median :0.000216 Median : 5.0 Median :0.1360
## Mean :0.167010 Mean : 5.2 Mean :0.2058
## 3rd Qu.:0.102000 3rd Qu.: 8.0 3rd Qu.:0.2610
## Max. :1.000000 Max. :11.0 Max. :1.0000
## loudness mode name popularity
## Min. :-60.000 Min. :0.0000 Length:170653 Min. : 0.00
## 1st Qu.:-14.615 1st Qu.:0.0000 Class :character 1st Qu.: 11.00
## Median :-10.580 Median :1.0000 Mode :character Median : 33.00
## Mean :-11.468 Mean :0.7069 Mean : 31.43
## 3rd Qu.: -7.183 3rd Qu.:1.0000 3rd Qu.: 48.00
## Max. : 3.855 Max. :1.0000 Max. :100.00
## release_date speechiness tempo
## Length:170653 Min. :0.00000 Min. : 0.00
## Class :character 1st Qu.:0.03490 1st Qu.: 93.42
## Mode :character Median :0.04500 Median :114.73
## Mean :0.09839 Mean :116.86
## 3rd Qu.:0.07560 3rd Qu.:135.54
## Max. :0.97000 Max. :243.51
The ACF and PACF plots show the autocorrelation and partial autocorrelation of the danceability series; both confirm strong serial dependence. I then split the data into a training set (through 2014) and a testing set (2015-2020).
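For reference, the lag-$k$ autocorrelation that acf() estimates is
$$\rho_k = \frac{\operatorname{Cov}(y_t,\, y_{t-k})}{\operatorname{Var}(y_t)},$$
while the PACF at lag $k$ is the correlation between $y_t$ and $y_{t-k}$ after the linear effect of the intermediate lags $y_{t-1}, \dots, y_{t-k+1}$ has been removed.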
acf(SongDance)
pacf(SongDance)
SongDancediff <- diff(SongDance)  # first difference, kept for stationarity checks
train.songs <- window(SongDance, end = 2014)
test.songs <- window(SongDance, start = 2015)
train.songsall <- window(Songsts, end = 2014)
test.songsall <- window(Songsts, start = 2015)
The first model I built uses simple exponential smoothing (SES). This model incorporates no trend or seasonal behavior, so the resulting forecast is just a horizontal line.
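To see why the forecast is flat: SES keeps only a smoothed level $\ell_t$,
$$\ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1}, \qquad \hat{y}_{t+h|t} = \ell_t \quad \text{for all } h \ge 1,$$
so every forecast horizon receives the same value.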
Next I fit an ETS model. This framework incorporates separate error, trend, and seasonal components, and it selects among candidate models using information criteria such as the AIC and BIC. The chosen model is ETS(M,Ad,N): multiplicative error, additive damped trend, and no seasonal component. The absence of seasonality makes sense for annual data. The forecast shows an increasing trend.
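For reference, the point forecast of a damped additive trend model is
$$\hat{y}_{t+h|t} = \ell_t + (\phi + \phi^2 + \cdots + \phi^h)\, b_t,$$
and with the fitted $\phi = 0.8$ (shown in the output below) the trend contribution levels off as the horizon $h$ grows, so the upward forecast gradually flattens.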
An ARIMA (autoregressive integrated moving average) model combines autoregression, differencing, and a moving average. As the ACF and PACF plots show, the data are strongly autocorrelated, so ARIMA should produce good results. auto.arima() selected an ARIMA(1,1,0) model, whose forecast looks similar to the SES one.
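In terms of the differenced series $y'_t = y_t - y_{t-1}$, the selected model is an AR(1),
$$y'_t = \phi_1\, y'_{t-1} + \varepsilon_t, \qquad \hat{\phi}_1 \approx -0.55 \ \text{(see the output below)},$$
so the forecast differences decay toward zero quickly and the level forecast flattens out, much like SES.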
I also fit a time series regression model, using the AIC to choose which predictors to keep. The adjusted R^2 is strong at about 0.85. The prediction is very different from the other three models, with a dip in 2015; its shape looks right, but it sits well below the actual danceability scores.
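For clarity, the fitted regression has the form
$$\widehat{\text{danceability}}_t = \hat{\beta}_0 + \hat{\beta}_1\,\text{loudness}_t + \hat{\beta}_2\,\text{liveness}_t + \hat{\beta}_3\,\text{tempo}_t + \hat{\beta}_4\,\text{instrumentalness}_t + \hat{\beta}_5\,\text{speechiness}_t + \hat{\beta}_6\,\text{valence}_t,$$
and, unlike the other three models, forecasting with it requires the 2015-2020 values of the predictors, which are supplied from the test set below.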
fit.ses <- ses(train.songs)           # simple exponential smoothing
fit.ETS <- ets(train.songs)           # automatic ETS selection
fit.Arima <- auto.arima(train.songs)  # automatic ARIMA selection
fit.lm <- tslm(danceability ~ loudness + liveness + tempo + instrumentalness + speechiness + valence, data = train.songsall)
fit.ETS
## ETS(M,Ad,N)
##
## Call:
## ets(y = train.songs)
##
## Smoothing parameters:
## alpha = 0.1964
## beta = 0.1964
## phi = 0.8
##
## Initial states:
## l = 0.3565
## b = 0.1025
##
## sigma: 0.0475
##
## AIC AICc BIC
## -258.0424 -257.0769 -242.7826
fit.Arima
## Series: train.songs
## ARIMA(1,1,0)
##
## Coefficients:
## ar1
## -0.5496
## s.e. 0.0879
##
## sigma^2 estimated as 0.0008505: log likelihood=197.1
## AIC=-390.2 AICc=-390.07 BIC=-385.14
summary(fit.lm)
##
## Call:
## tslm(formula = danceability ~ loudness + liveness + tempo + instrumentalness +
## speechiness + valence, data = train.songsall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.035418 -0.013599 -0.000546 0.009662 0.052543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2454488 0.1018274 2.410 0.018039 *
## loudness 0.0084137 0.0013897 6.054 3.49e-08 ***
## liveness -0.8818339 0.1358135 -6.493 5.01e-09 ***
## tempo 0.0028528 0.0008127 3.510 0.000712 ***
## instrumentalness 0.1013650 0.0277289 3.656 0.000438 ***
## speechiness 0.4025114 0.0259924 15.486 < 2e-16 ***
## valence 0.3356060 0.0430169 7.802 1.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0182 on 87 degrees of freedom
## Multiple R-squared: 0.8578, Adjusted R-squared: 0.848
## F-statistic: 87.48 on 6 and 87 DF, p-value: < 2.2e-16
fcast.ses <- forecast(fit.ses, h = 6)
fcast.ETS <- forecast(fit.ETS, h = 6)
fcast.Arima <- forecast(fit.Arima, h = 6)
fcast.lm <- forecast(fit.lm, newdata = data.frame(test.songsall))  # regression needs future predictor values
autoplot(fcast.ses) + autolayer(test.songs)
autoplot(fcast.ETS) + autolayer(test.songs)
autoplot(fcast.Arima) + autolayer(test.songs)
autoplot(fcast.lm) + autolayer(test.songs)
I also noticed the 0.60 correlation between tempo and danceability, so I decided to fit a VAR model on the two series jointly. Its forecast looks similar to the ARIMA one.
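Writing danceability as $d_t$ and tempo as $m_t$, a VAR($p$) models the two series jointly,
$$\begin{pmatrix} d_t \\ m_t \end{pmatrix} = c + A_1 \begin{pmatrix} d_{t-1} \\ m_{t-1} \end{pmatrix} + \cdots + A_p \begin{pmatrix} d_{t-p} \\ m_{t-p} \end{pmatrix} + \varepsilon_t,$$
where each $A_i$ is a $2 \times 2$ coefficient matrix, so each series depends on lags of both itself and the other.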
train.songstemp <- window(SongTempo, end = 2014)
test.songstemp <- window(SongTempo, start = 2015)
songVar <- cbind(train.songs, train.songstemp)
VARselect(songVar[, 1:2], lag.max = 8, type = "const")[["selection"]]
## AIC(n) HQ(n) SC(n) FPE(n)
## 4 2 2 4
var1 <- VAR(songVar[, 1:2], p = 2, type = "const")  # lag order 2, picked by HQ and SC above
serial.test(var1, lags.pt = 10, type = "PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var1
## Chi-squared = 34.866, df = 32, p-value = 0.3333
var2 <- VAR(songVar[, 1:2], p = 4, type = "const")  # lag order 4, picked by AIC and FPE
serial.test(var2, lags.pt = 10, type = "PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var2
## Chi-squared = 41.751, df = 24, p-value = 0.01376
fcast.var <- forecast(var1, h = 6)  # use var1: its residuals pass the Portmanteau test, var2's do not
fcast.var %>%
  autoplot() + xlab("Year")
fcast.var[1]
## $model
##
## VAR Estimation Results:
## =======================
##
## Estimated coefficients for equation train.songs:
## ================================================
## Call:
## train.songs = train.songs.l1 + train.songstemp.l1 + train.songs.l2 + train.songstemp.l2 + const
##
## train.songs.l1 train.songstemp.l1 train.songs.l2 train.songstemp.l2
## 0.425742365 -0.002059692 0.371960466 0.002617805
## const
## 0.045140699
##
##
## Estimated coefficients for equation train.songstemp:
## ====================================================
## Call:
## train.songstemp = train.songs.l1 + train.songstemp.l1 + train.songs.l2 + train.songstemp.l2 + const
##
## train.songs.l1 train.songstemp.l1 train.songs.l2 train.songstemp.l2
## 13.4672057 0.5936525 -18.3724306 0.2920932
## const
## 16.1015677
Looking at the accuracy measures, the linear regression model fits the training data best, but the ETS model has the lowest RMSE on the test set.
accuracy(fcast.ses, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002280188 0.02945345 0.01972465 0.188572 3.780342 0.888221
## Test set 0.056067951 0.06652267 0.05606795 8.550136 8.550136 2.524797
## ACF1 Theil's U
## Training set -0.1241110 NA
## Test set 0.3143762 2.167138
accuracy(fcast.ETS, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001873897 0.02522909 0.01710386 -0.476116 3.296318 0.770204
## Test set 0.051905774 0.06108404 0.05190577 7.924512 7.924512 2.337370
## ACF1 Theil's U
## Training set -0.1954701 NA
## Test set 0.2697272 1.985876
accuracy(fcast.Arima, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002608245 0.02885193 0.01858916 0.3129528 3.576429 0.8370888
## Test set 0.051672452 0.06254149 0.05167245 7.8607125 7.860712 2.3268628
## ACF1 Theil's U
## Training set -0.03008585 NA
## Test set 0.32811344 2.035452
accuracy(fcast.lm, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.134097e-18 0.01750806 0.01374176 -0.1049302 2.603616 0.6188052
## Test set 6.458409e-02 0.06637262 0.06458409 10.1065085 10.106508 2.9082868
## ACF1 Theil's U
## Training set 0.4279913 NA
## Test set 0.2227189 2.141784
accuracy(fcast.var[[2]]$train.songs, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.223546e-18 0.02694409 0.01861317 -0.2669777 3.579185 0.8381699
## Test set 6.168726e-02 0.07255202 0.06168726 9.4180203 9.418020 2.7778398
## ACF1 Theil's U
## Training set -0.01941191 NA
## Test set 0.34758770 2.356297
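As a quick cross-check, here is a minimal sketch that gathers the test-set RMSE of each model into one sorted vector (it assumes the forecast objects above are still in the workspace):
rmse <- c(
  SES   = accuracy(fcast.ses, test.songs)["Test set", "RMSE"],
  ETS   = accuracy(fcast.ETS, test.songs)["Test set", "RMSE"],
  ARIMA = accuracy(fcast.Arima, test.songs)["Test set", "RMSE"],
  LM    = accuracy(fcast.lm, test.songs)["Test set", "RMSE"],
  VAR   = accuracy(fcast.var[[2]]$train.songs, test.songs)["Test set", "RMSE"]
)
sort(rmse)  # smaller is better; ETS has the lowest test RMSE here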