The dataset I used contains Spotify songs released between 1921 and 2020.
library(reshape2)  # melt()
library(ggplot2)
library(forecast)  # ses(), ets(), auto.arima(), tslm(), forecast(), accuracy()
library(vars)      # VAR(), VARselect(), serial.test()

Songs <- read.csv("data_by_year.csv")       # yearly averages of audio features
Allsongs <- read.csv("data.csv")            # one row per track
Songsmelt <- melt(Songs, id.vars = "year")  # long format for faceted plotting
There are 14 variables, including the release year, acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, and popularity. Each is reported as the average over songs released in that year. Popularity is interesting but hard to interpret: the ratings were collected recently, so average popularity keeps rising simply because listeners favor the songs they are used to hearing. Looking at the plots, I find danceability the most interesting. Songs from the 1920s-1930s have high danceability scores, the scores fall to their lowest point around the 1950s, and they have risen steadily since then. This pattern may be related to the Great Depression and the economic boom after WW2.
ggplot(Songsmelt, aes(x = year, y = value)) + geom_point() + facet_wrap(~ variable, scales = "free") + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Songsts <- ts(Songs, start = 1921, frequency = 1)  # all yearly averages as a ts
SongPopularity <- ts(Songs[, "popularity"], start = 1921, frequency = 1)
SongDance <- ts(Songs[, "danceability"], start = 1921, frequency = 1)
SongTempo <- ts(Songs[, "tempo"], start = 1921, frequency = 1)
autoplot(SongPopularity, main='Song Popularity')
autoplot(SongDance, main='Song Danceability')
autoplot(SongTempo, main='Song Tempo')
cor(SongDance, SongTempo)
## [1] 0.6035434
summary(Allsongs)
## valence year acousticness artists
## Min. :0.0000 Min. :1921 Min. :0.0000 Length:170653
## 1st Qu.:0.3170 1st Qu.:1956 1st Qu.:0.1020 Class :character
## Median :0.5400 Median :1977 Median :0.5160 Mode :character
## Mean :0.5286 Mean :1977 Mean :0.5021
## 3rd Qu.:0.7470 3rd Qu.:1999 3rd Qu.:0.8930
## Max. :1.0000 Max. :2020 Max. :0.9960
## danceability duration_ms energy explicit
## Min. :0.0000 Min. : 5108 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.4150 1st Qu.: 169827 1st Qu.:0.2550 1st Qu.:0.00000
## Median :0.5480 Median : 207467 Median :0.4710 Median :0.00000
## Mean :0.5374 Mean : 230948 Mean :0.4824 Mean :0.08458
## 3rd Qu.:0.6680 3rd Qu.: 262400 3rd Qu.:0.7030 3rd Qu.:0.00000
## Max. :0.9880 Max. :5403500 Max. :1.0000 Max. :1.00000
## id instrumentalness key liveness
## Length:170653 Min. :0.000000 Min. : 0.0 Min. :0.0000
## Class :character 1st Qu.:0.000000 1st Qu.: 2.0 1st Qu.:0.0988
## Mode :character Median :0.000216 Median : 5.0 Median :0.1360
## Mean :0.167010 Mean : 5.2 Mean :0.2058
## 3rd Qu.:0.102000 3rd Qu.: 8.0 3rd Qu.:0.2610
## Max. :1.000000 Max. :11.0 Max. :1.0000
## loudness mode name popularity
## Min. :-60.000 Min. :0.0000 Length:170653 Min. : 0.00
## 1st Qu.:-14.615 1st Qu.:0.0000 Class :character 1st Qu.: 11.00
## Median :-10.580 Median :1.0000 Mode :character Median : 33.00
## Mean :-11.468 Mean :0.7069 Mean : 31.43
## 3rd Qu.: -7.183 3rd Qu.:1.0000 3rd Qu.: 48.00
## Max. : 3.855 Max. :1.0000 Max. :100.00
## release_date speechiness tempo
## Length:170653 Min. :0.00000 Min. : 0.00
## Class :character 1st Qu.:0.03490 1st Qu.: 93.42
## Mode :character Median :0.04500 Median :114.73
## Mean :0.09839 Mean :116.86
## 3rd Qu.:0.07560 3rd Qu.:135.54
## Max. :0.97000 Max. :243.51
The ACF and PACF plots show the autocorrelation and partial autocorrelation of the danceability series; both confirm strong serial dependence. I then split the data into a training set (through 2014) and a testing set (2015-2020).
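For reference, the lag-$k$ autocorrelation that acf() estimates is
$$\rho_k = \frac{\operatorname{Cov}(y_t,\, y_{t-k})}{\operatorname{Var}(y_t)},$$
while the PACF at lag $k$ is the correlation between $y_t$ and $y_{t-k}$ after the linear effect of the intermediate lags $y_{t-1}, \dots, y_{t-k+1}$ has been removed.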
acf(SongDance)
pacf(SongDance)
SongDancediff <- diff(SongDance)  # first difference, kept for stationarity checks
train.songs <- window(SongDance, end = 2014)
test.songs <- window(SongDance, start = 2015)
train.songsall <- window(Songsts, end = 2014)
test.songsall <- window(Songsts, start = 2015)
The first model I built uses simple exponential smoothing (SES). This model incorporates no trend or seasonal behavior, so the resulting forecast is just a horizontal line.
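To see why the forecast is flat: SES keeps only a smoothed level $\ell_t$,
$$\ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1}, \qquad \hat{y}_{t+h|t} = \ell_t \quad \text{for all } h \ge 1,$$
so every forecast horizon receives the same value.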
Next I fit an ETS model. This framework incorporates separate error, trend, and seasonal components, and it selects among candidate models using information criteria such as the AIC and BIC. The chosen model is ETS(M,Ad,N): multiplicative error, additive damped trend, and no seasonal component. The absence of seasonality makes sense for annual data. The forecast shows an increasing trend.
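For reference, the point forecast of a damped additive trend model is
$$\hat{y}_{t+h|t} = \ell_t + (\phi + \phi^2 + \cdots + \phi^h)\, b_t,$$
and with the fitted $\phi = 0.8$ (shown in the output below) the trend contribution levels off as the horizon $h$ grows, so the upward forecast gradually flattens.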
An ARIMA (autoregressive integrated moving average) model combines autoregression, differencing, and a moving average. As the ACF and PACF plots show, the data are strongly autocorrelated, so ARIMA should produce good results. auto.arima() selected an ARIMA(1,1,0) model, whose forecast looks similar to the SES one.
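In terms of the differenced series $y'_t = y_t - y_{t-1}$, the selected model is an AR(1),
$$y'_t = \phi_1\, y'_{t-1} + \varepsilon_t, \qquad \hat{\phi}_1 \approx -0.55 \ \text{(see the output below)},$$
so the forecast differences decay toward zero quickly and the level forecast flattens out, much like SES.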
I also fit a time series regression model, using the AIC to choose which predictors to keep. The adjusted R^2 is strong at about 0.85. The prediction is very different from the other three models, with a dip in 2015; its shape looks right, but it sits well below the actual danceability scores.
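For clarity, the fitted regression has the form
$$\widehat{\text{danceability}}_t = \hat{\beta}_0 + \hat{\beta}_1\,\text{loudness}_t + \hat{\beta}_2\,\text{liveness}_t + \hat{\beta}_3\,\text{tempo}_t + \hat{\beta}_4\,\text{instrumentalness}_t + \hat{\beta}_5\,\text{speechiness}_t + \hat{\beta}_6\,\text{valence}_t,$$
and, unlike the other three models, forecasting with it requires the 2015-2020 values of the predictors, which are supplied from the test set below.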
fit.ses <- ses(train.songs)           # simple exponential smoothing
fit.ETS <- ets(train.songs)           # automatic ETS selection
fit.Arima <- auto.arima(train.songs)  # automatic ARIMA selection
fit.lm <- tslm(danceability ~ loudness + liveness + tempo + instrumentalness + speechiness + valence, data = train.songsall)
fit.ETS
## ETS(M,Ad,N)
##
## Call:
## ets(y = train.songs)
##
## Smoothing parameters:
## alpha = 0.1964
## beta = 0.1964
## phi = 0.8
##
## Initial states:
## l = 0.3565
## b = 0.1025
##
## sigma: 0.0475
##
## AIC AICc BIC
## -258.0424 -257.0769 -242.7826
fit.Arima
## Series: train.songs
## ARIMA(1,1,0)
##
## Coefficients:
## ar1
## -0.5496
## s.e. 0.0879
##
## sigma^2 estimated as 0.0008505: log likelihood=197.1
## AIC=-390.2 AICc=-390.07 BIC=-385.14
summary(fit.lm)
##
## Call:
## tslm(formula = danceability ~ loudness + liveness + tempo + instrumentalness +
## speechiness + valence, data = train.songsall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.035418 -0.013599 -0.000546 0.009662 0.052543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2454488 0.1018274 2.410 0.018039 *
## loudness 0.0084137 0.0013897 6.054 3.49e-08 ***
## liveness -0.8818339 0.1358135 -6.493 5.01e-09 ***
## tempo 0.0028528 0.0008127 3.510 0.000712 ***
## instrumentalness 0.1013650 0.0277289 3.656 0.000438 ***
## speechiness 0.4025114 0.0259924 15.486 < 2e-16 ***
## valence 0.3356060 0.0430169 7.802 1.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0182 on 87 degrees of freedom
## Multiple R-squared: 0.8578, Adjusted R-squared: 0.848
## F-statistic: 87.48 on 6 and 87 DF, p-value: < 2.2e-16
fcast.ses <- forecast(fit.ses, h = 6)
fcast.ETS <- forecast(fit.ETS, h = 6)
fcast.Arima <- forecast(fit.Arima, h = 6)
fcast.lm <- forecast(fit.lm, newdata = data.frame(test.songsall))  # regression needs future predictor values
autoplot(fcast.ses) + autolayer(test.songs)
autoplot(fcast.ETS) + autolayer(test.songs)
autoplot(fcast.Arima) + autolayer(test.songs)
autoplot(fcast.lm) + autolayer(test.songs)
I also noticed the 0.60 correlation between tempo and danceability, so I decided to fit a VAR model on the two series jointly. Its forecast looks similar to the ARIMA one.
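Writing danceability as $d_t$ and tempo as $m_t$, a VAR($p$) models the two series jointly,
$$\begin{pmatrix} d_t \\ m_t \end{pmatrix} = c + A_1 \begin{pmatrix} d_{t-1} \\ m_{t-1} \end{pmatrix} + \cdots + A_p \begin{pmatrix} d_{t-p} \\ m_{t-p} \end{pmatrix} + \varepsilon_t,$$
where each $A_i$ is a $2 \times 2$ coefficient matrix, so each series depends on lags of both itself and the other.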
train.songstemp <- window(SongTempo, end = 2014)
test.songstemp <- window(SongTempo, start = 2015)
songVar <- cbind(train.songs, train.songstemp)
VARselect(songVar[, 1:2], lag.max = 8, type = "const")[["selection"]]
## AIC(n) HQ(n) SC(n) FPE(n)
## 4 2 2 4
var1 <- VAR(songVar[, 1:2], p = 2, type = "const")  # lag order 2, picked by HQ and SC above
serial.test(var1, lags.pt = 10, type = "PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var1
## Chi-squared = 34.866, df = 32, p-value = 0.3333
var2 <- VAR(songVar[, 1:2], p = 4, type = "const")  # lag order 4, picked by AIC and FPE
serial.test(var2, lags.pt = 10, type = "PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var2
## Chi-squared = 41.751, df = 24, p-value = 0.01376
fcast.var <- forecast(var1, h = 6)  # use var1: its residuals pass the Portmanteau test, var2's do not
fcast.var %>%
  autoplot() + xlab("Year")
fcast.var[1]
## $model
##
## VAR Estimation Results:
## =======================
##
## Estimated coefficients for equation train.songs:
## ================================================
## Call:
## train.songs = train.songs.l1 + train.songstemp.l1 + train.songs.l2 + train.songstemp.l2 + const
##
## train.songs.l1 train.songstemp.l1 train.songs.l2 train.songstemp.l2
## 0.425742365 -0.002059692 0.371960466 0.002617805
## const
## 0.045140699
##
##
## Estimated coefficients for equation train.songstemp:
## ====================================================
## Call:
## train.songstemp = train.songs.l1 + train.songstemp.l1 + train.songs.l2 + train.songstemp.l2 + const
##
## train.songs.l1 train.songstemp.l1 train.songs.l2 train.songstemp.l2
## 13.4672057 0.5936525 -18.3724306 0.2920932
## const
## 16.1015677
Looking at the accuracy measures, the linear regression model fits the training data best, but the ETS model has the lowest RMSE on the test set.
accuracy(fcast.ses, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002280188 0.02945345 0.01972465 0.188572 3.780342 0.888221
## Test set 0.056067951 0.06652267 0.05606795 8.550136 8.550136 2.524797
## ACF1 Theil's U
## Training set -0.1241110 NA
## Test set 0.3143762 2.167138
accuracy(fcast.ETS, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001873897 0.02522909 0.01710386 -0.476116 3.296318 0.770204
## Test set 0.051905774 0.06108404 0.05190577 7.924512 7.924512 2.337370
## ACF1 Theil's U
## Training set -0.1954701 NA
## Test set 0.2697272 1.985876
accuracy(fcast.Arima, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002608245 0.02885193 0.01858916 0.3129528 3.576429 0.8370888
## Test set 0.051672452 0.06254149 0.05167245 7.8607125 7.860712 2.3268628
## ACF1 Theil's U
## Training set -0.03008585 NA
## Test set 0.32811344 2.035452
accuracy(fcast.lm, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.134097e-18 0.01750806 0.01374176 -0.1049302 2.603616 0.6188052
## Test set 6.458409e-02 0.06637262 0.06458409 10.1065085 10.106508 2.9082868
## ACF1 Theil's U
## Training set 0.4279913 NA
## Test set 0.2227189 2.141784
accuracy(fcast.var[[2]]$train.songs, test.songs)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.223546e-18 0.02694409 0.01861317 -0.2669777 3.579185 0.8381699
## Test set 6.168726e-02 0.07255202 0.06168726 9.4180203 9.418020 2.7778398
## ACF1 Theil's U
## Training set -0.01941191 NA
## Test set 0.34758770 2.356297
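As a quick cross-check, here is a minimal sketch that gathers the test-set RMSE of each model into one sorted vector (it assumes the forecast objects above are still in the workspace):
rmse <- c(
  SES   = accuracy(fcast.ses, test.songs)["Test set", "RMSE"],
  ETS   = accuracy(fcast.ETS, test.songs)["Test set", "RMSE"],
  ARIMA = accuracy(fcast.Arima, test.songs)["Test set", "RMSE"],
  LM    = accuracy(fcast.lm, test.songs)["Test set", "RMSE"],
  VAR   = accuracy(fcast.var[[2]]$train.songs, test.songs)["Test set", "RMSE"]
)
sort(rmse)  # smaller is better; ETS has the lowest test RMSE here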