In this project, I compared the efficacy of three forecasting approaches - simple exponential smoothing, ARIMA, and the vector-based methods VAR and VECM - on an engagement prediction task. Specifically, I wanted to see how well each technique predicts a single song’s Spotify streams.

The dataset I used contained around 1.5 years’ worth of weekly streams for 5 songs by the same, unnamed artist. The weeks themselves were unlabeled, so it’s not clear exactly when the data starts. In this project, I attempt to forecast the next 8 weeks of streams for one of the songs in the data. As we’ll see, the other 4 songs’ data may come in handy.

Loading Data

library(fpp2)
library(ggplot2)
library(quantmod)
library(readxl)
library(seasonal)
library(tseries)
library(tsibble)
library(urca)
library(vars)
library(Metrics)
library(zoo)

# Load data, select song 1 streams
df = read_excel("artist_spotify_streams.xlsx", col_types = "numeric")
df = df[1:83, c("streams_song1")]

# Create time-series song 1 object
df.ts = ts(df, frequency = 52)

# First look at plotted series
plot(df.ts, xlab = "Year", ylab = "Weekly Song 1 Streams")

Above, you can see Song 1’s weekly streams. The pronounced dips just after the year 1 and year 2 marks suggest there may be some yearly seasonality in this data.
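As a quick, informal check (a sketch using ggAcf() from the forecast package, which fpp2 loads), a bump in the autocorrelation near lag 52 would be consistent with yearly seasonality, though with only 83 weekly observations the estimates out at that lag are quite noisy.

# Informal seasonality check (sketch): look for elevated autocorrelation
# around lag 52, i.e. one year of weekly data
ggAcf(df.ts, lag.max = 60)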

# function to train/test split
split = function(ts, test_num) {
    # ts = time series of aggregated data
    # test_num = number of samples held out for testing
    
    # for storing output time series
    out = list()
    # length of time series
    n = nrow(ts)
    # indices for slicing
    test_start = n - test_num + 1
    # create training and test sets
    train = ts[1:(test_start - 1), ]
    test = ts[test_start:n, ]
    
    # store in list
    out$train = ts(train, frequency = 52)
    # note: the test start (year 2, week 24) is hardcoded for this 83-week
    # series with an 8-week test set
    out$test = ts(test, frequency = 52, start = c(2, 24))
    
    return(out)
    
}

# configure test set size (8 weeks)
test.num = 8

# split datasets
train = split(df.ts, test.num)$train
test = split(df.ts, test.num)$test

# plot settings to 2x1
par(mfrow = c(2, 1))
# plot train and test sets
plot(train, xlab = "Year", ylab = "Train Song 1 Streams")
plot(test, xlab = "Year", ylab = "Test Song 1 Streams")

The first technique we will try is simple exponential smoothing (SES), with a twist. SES is a forecasting technique that outputs a flat (that is, unchanging) forecast based on the past observations of the forecasted variable. It is named as such because it weighs observations exponentially less in its forecast as the observations get older.
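To make the “exponentially less” idea concrete, here is a small illustrative sketch (the smoothing parameter alpha = 0.3 is an assumed value for illustration, not the one ses() will estimate later).

# Illustrative sketch: SES weights the k-th most recent observation by
# alpha * (1 - alpha)^k, so older observations count exponentially less
alpha = 0.3  # assumed value for illustration; ses() estimates alpha from the data
weights = alpha * (1 - alpha)^(0:9)
round(weights, 3)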

The twist we will be adding here is a seasonality approximation: Since we have gathered from our initial EDA that there could be some yearly seasonality to this data, it would probably not be advisable to forecast future streams of this song as a flat forecast; instead, we will augment that flat forecast with the effects we would expect yearly seasonality to add to the streaming trend.

That being said, because we have less than 2 years of data, we have no simple way of adding seasonality to an SES forecast: with less than two full periods’ worth of data, standard seasonal methods cannot estimate a seasonal component. Instead, we can use Fourier terms to approximate what yearly seasonality would look like, assuming the first year’s data contains a representative seasonal pattern. We will add the SES forecast to the estimated seasonality to obtain our first model’s final forecast.
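For a quick look at what those Fourier terms actually are, this short sketch prints the first few rows of the K = 3 harmonic regressors that the model below feeds into tslm().

# Sketch: the K = 3 Fourier regressors (three sine/cosine pairs at the yearly
# frequency) used to approximate seasonality in the model below
head(fourier(train, K = 3), 3)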

Model 1: Simple Exponential Smoothing (SES) with Fourier Seasonality Estimation

# Model 1: Simple Exponential Smoothing with Fourier seasonality estimation

# Approximate seasonal trend with Fourier terms
decompose.df = tslm(train ~ trend + fourier(train, 3))
trend = coef(decompose.df)[1] + coef(decompose.df)["trend"] * seq_along(train)
components = cbind(data = train, trend = trend, season = train - trend - residuals(decompose.df), 
    remainder = residuals(decompose.df))

# Check decomposition
autoplot(components, facets = TRUE)

# Isolate seasonally-adjusted data
adjust_train = train - components[, "season"]

# Isolate seasonality
seasonal_train = components[, "season"]

# Forecast seasonally adjusted data with Simple Exponential Smoothing
adjust.pred = ses(adjust_train, h = 8)

# Naive forecast for seasonal component
seasonal.pred = snaive(seasonal_train, 8)

# Create function for combining seasonal and adjusted forecasts
combine.forecasts = function(y, z) {
    # Combine forecasts
    method = "Combined Naive Seasonal & SES Model"
    mean = y$mean + z$mean
    lower = y$lower + z$lower
    upper = y$upper + z$upper
    level = y$level
    x = y$x + z$x
    residuals = y$residuals + z$residuals
    fitted = y$fitted + z$fitted
    
    # Construct output list
    output = list(method = method, mean = mean, lower = lower, upper = upper, level = level, 
        x = x, residuals = residuals, fitted = fitted)
    # Return with forecasting class
    return(structure(output, class = "forecast"))
}

# Recombine seasonally adjusted component and seasonal component
fourier.forecast = combine.forecasts(adjust.pred, seasonal.pred)

# Plot recombined forecast
autoplot(fourier.forecast) + ylab("Song 1 Weekly Streams") + xlab("Year")

The SES + Fourier seasonality model predicts a streaming pattern that is initially upward-sloping, and then tapers downward for the next 8 weeks.

# extract predictions
pred.model1 = fourier.forecast$mean

rmse.model1 = rmse(test, pred.model1)
print(rmse.model1)
## [1] 13639.65
year = time(test)

df.model1 = as_tibble(cbind(pred.model1, test, year))

# plot forecasts
ggplot(df.model1, aes(x = year, y = test)) + geom_line(aes(y = test, color = "True Values"), 
    size = 1.5) + geom_line(aes(y = pred.model1, color = "Forecast"), size = 1.5) + 
    ggtitle("Forecast of Stream Volume") + xlab("Year") + ylab("Streams") + theme(plot.title = element_text(hjust = 0.5), 
    legend.title = element_blank())

Unfortunately, our SES+Fourier model’s forecast is woefully off. Although the model starts off strong with its positive trend, the seasonality approximated from the first year’s data dictates that the forecast keep rising, while the true stream counts begin to dip. Thus, our RMSE of around 14k is fairly large.

The next model we will look at is the heralded, oft-used ARIMA model, a dependable forecasting tool for univariate time series. Although we cannot model yearly seasonality here due to the short length of our training data, we can count on the forecast package’s auto.arima() function to fit a sensible model for forecasting future values.

Model 2: Non-seasonal ARIMA

# Model 2: Non-seasonal ARIMA

# Assess stationarity
train %>% ur.kpss() %>% summary()  # statistic exceeds the 10% and (barely) the 5% critical values: borderline non-stationary
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 0.4682 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
checkresiduals(train, df = 1)  # Ljung-Box on the un-differenced data shows strong autocorrelation, consistent with non-stationarity

# Check what auto.arima thinks is best model
fit.arima = auto.arima(train, stepwise = FALSE, approximation = FALSE)

# Review ACF and PCF to gauge model order
ggAcf(train)  # could be order 2 moving average model

ggPacf(train)  # could also be order 2 autoregressive model

# Forecast 8 weeks
arima.forecast = forecast(fit.arima, h = 8)

# Plot ARIMA forecast
autoplot(arima.forecast) + ylab("Song 1 Weekly Streams") + xlab("Year")

From the ACF, we can see sustained positive autocorrelations, which suggests the recent past carries useful information for predicting at least the next 8 weeks.

The ARIMA model chosen - a differenced, second-order autoregressive model - predicts a mostly flat forecast for the next 8 weeks.
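For reference, and assuming the selected specification really is ARIMA(2,1,0) as described (fit.arima should be inspected to confirm the exact order and whether drift was included), the same model could be fit explicitly.

# Sketch: fit the described specification directly with Arima() from the
# forecast package (assumes auto.arima chose ARIMA(2,1,0))
fit.manual = Arima(train, order = c(2, 1, 0))
forecast(fit.manual, h = 8)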

# extract predictions
pred.model2 = arima.forecast$mean

rmse.model2 = rmse(test, pred.model2)
print(rmse.model2)
## [1] 3239.591
year = time(test)

df.model2 = as_tibble(cbind(pred.model2, test, year))

# plot forecasts
ggplot(df.model2, aes(x = year, y = test)) + geom_line(aes(y = test, color = "True Values"), 
    size = 1.5) + geom_line(aes(y = pred.model2, color = "Forecast"), size = 1.5) + 
    ggtitle("ARIMA Forecast of Stream Volume") + xlab("Year") + ylab("Streams") + 
    theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())

The markedly lower RMSE of model 2 compared to model 1 reflects how ARIMA models can be a reliable forecasting method, even when seasonality cannot be incorporated. As in the previous model’s validation, the true values swing more sharply than our forecast, so the ARIMA model is not especially precise week to week. That said, the forecast trends in the correct direction and appears to average out the peaks and valleys present in the true data. Comparing endpoints alone, the forecast heads the right way with relatively little bias. Our lower RMSE - around 3.2k - reflects this model’s usefulness.
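As a small supplement to RMSE (a sketch, not part of the original comparison), the mean error gives a rough read on that bias, i.e. whether the ARIMA forecast runs systematically high or low.

# Sketch: mean error as a rough bias check (positive means the forecast ran
# low on average, negative means it ran high)
bias.model2 = mean(as.numeric(test) - as.numeric(pred.model2))
print(bias.model2)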

The last approach we will examine is a vector autoregression (VAR)-based approach. Essentially, if we have parallel time series that we can treat as endogenous (that is, jointly determined, each influencing the others), we can use them to predict one another. This is where the streaming data for other songs by the same artist comes into play: since the popularity of one song by an artist is plausibly related to the popularity of the artist’s other songs, we can model each song’s streaming performance as a linear combination of the others’ past performance. However, this approach requires that all of the songs’ stream series be stationary; if they are not, we may need to rethink the approach.
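A compact alternative to the song-by-song checks in the next section (a sketch; it assumes the workbook uses the streams_song1 through streams_song5 column names that appear elsewhere in this post) would be to run the KPSS test over all training series in one pass.

# Sketch: KPSS test statistic for every song's training period in one loop
# (assumes the streams_song1 ... streams_song5 column names used throughout)
all.songs = read_excel("artist_spotify_streams.xlsx", col_types = "numeric")[1:83, paste0("streams_song", 1:5)]
all.train = head(all.songs, nrow(all.songs) - test.num)
sapply(all.train, function(x) ur.kpss(x)@teststat)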

Model 3: Vector Autoregression (VAR) / Vector Error Correction Model (VECM)

# Model 3: Vector Autoregression (VAR) / Vector Error Correction Model (VECM)

# load Song 2 data
df2 = read_excel("artist_spotify_streams.xlsx")
df2 = df2[1:83, c("streams_song2")]
test.num = 8
train2 = split(df2, test.num)$train
test2 = split(df2, test.num)$test

# plot Song 2
par(mfrow = c(2, 1))
plot(train2)
plot(test2)

# check stationarity
checkresiduals(train2, df = 1)

train2 %>% ur.kpss() %>% summary()  # non-stationary!
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 1.78 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
# load song 3 data
df3 = read_excel("artist_spotify_streams.xlsx")
df3 = df3[1:83, c("streams_song3")]
train3 = split(df3, test.num)$train
test3 = split(df3, test.num)$test

# plot song 3
par(mfrow = c(2, 1))
plot(train3)
plot(test3)

# check stationarity
checkresiduals(train3, df = 1)

train3 %>% ur.kpss() %>% summary()  # sort of stationary, but could be non-stationary
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 0.2612 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
# load song 4 data
df4 = read_excel("artist_spotify_streams.xlsx")
df4 = df4[1:83, c("streams_song4")]
train4 = split(df4, test.num)$train
test4 = split(df4, test.num)$test

# plot song 4
par(mfrow = c(2, 1))
plot(train4)
plot(test4)

# check stationarity
checkresiduals(train4, df = 1)

train4 %>% ur.kpss() %>% summary()  # non-stationary!
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 0.9415 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
# load song 5 data
df5 = read_excel("artist_spotify_streams.xlsx")
df5 = df5[1:83, c("streams_song5")]
train5 = split(df5, test.num)$train
test5 = split(df5, test.num)$test

# plot song 5
par(mfrow = c(2, 1))
plot(train5)
plot(test5)

# check stationarity
checkresiduals(train5, df = 1)

train5 %>% ur.kpss() %>% summary()  # non-stationary!
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 1.0474 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
# Assemble df with all songs
songs.vars = data.frame(train, train2, train3, train4, train5)

# Make VECM model, run Johansen's cointegration test
jotest = ca.jo(songs.vars, type = "trace", K = 2, ecdet = "none", spec = "longrun")
summary(jotest)  # shows r=1 cointegration!
## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: trace statistic , with linear trend 
## 
## Eigenvalues (lambda):
## [1] 0.601461586 0.381917223 0.134772388 0.096643605 0.003566559
## 
## Values of teststatistic and critical values of test:
## 
##            test 10pct  5pct  1pct
## r <= 4 |   0.26  6.50  8.18 11.65
## r <= 3 |   7.68 15.66 17.95 23.52
## r <= 2 |  18.25 28.71 31.52 37.22
## r <= 1 |  53.37 45.23 48.28 55.43
## r = 0  | 120.53 66.49 70.60 78.87
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##                     train.l2 streams_song2.l2 streams_song3.l2 streams_song4.l2
## train.l2           1.0000000        1.0000000        1.0000000       1.00000000
## streams_song2.l2  -0.5219788        0.4043681       -0.9161308      -0.11479033
## streams_song3.l2  47.8318014        1.0244506        4.7737101      -1.64284070
## streams_song4.l2 -22.1419900      -11.7531756      -14.1076357      -3.34578823
## streams_song5.l2   5.8503796        7.5276022        1.2621178       0.08009886
##                  streams_song5.l2
## train.l2               1.00000000
## streams_song2.l2      -0.39550849
## streams_song3.l2      -0.55603557
## streams_song4.l2      -0.73271760
## streams_song5.l2       0.02224005
## 
## Weights W:
## (This is the loading matrix)
## 
##                     train.l2 streams_song2.l2 streams_song3.l2 streams_song4.l2
## train.d         -0.059169335      -0.02498955      0.040039889     -0.044030674
## streams_song2.d -0.046611376      -0.01909851      0.001621429     -0.022321871
## streams_song3.d -0.019644560       0.01014852      0.006856553      0.003863351
## streams_song4.d -0.009106541      -0.00327162      0.012369298      0.021150952
## streams_song5.d -0.058637176      -0.04718616     -0.019711972      0.090666379
##                 streams_song5.l2
## train.d             -0.019761061
## streams_song2.d     -0.036781190
## streams_song3.d     -0.003682716
## streams_song4.d     -0.003292574
## streams_song5.d      0.003059808
cajorls(jotest)
## $rlm
## 
## Call:
## lm(formula = substitute(form1), data = data.mat)
## 
## Coefficients:
##                    train.d     streams_song2.d  streams_song3.d
## ect1               -5.917e-02  -4.661e-02       -1.964e-02     
## constant            8.693e+04   7.099e+04        2.882e+04     
## train.dl1          -8.683e-02   2.658e-01        2.375e-02     
## streams_song2.dl1  -7.408e-02  -1.746e-01       -3.148e-02     
## streams_song3.dl1  -1.237e+00  -1.564e+00       -5.510e-01     
## streams_song4.dl1   1.492e+00   1.331e+00        3.605e-01     
## streams_song5.dl1  -3.558e-01  -2.714e-01       -1.153e-01     
##                    streams_song4.d  streams_song5.d
## ect1               -9.107e-03       -5.864e-02     
## constant            1.321e+04        8.456e+04     
## train.dl1           1.858e-02        3.424e-02     
## streams_song2.dl1  -1.969e-02        1.084e-01     
## streams_song3.dl1  -2.839e-01       -2.903e+00     
## streams_song4.dl1   1.891e-01        1.527e+00     
## streams_song5.dl1  -5.198e-02       -4.979e-01     
## 
## 
## $beta
##                         ect1
## train.l2           1.0000000
## streams_song2.l2  -0.5219788
## streams_song3.l2  47.8318014
## streams_song4.l2 -22.1419900
## streams_song5.l2   5.8503796
# Convert VECM model to VAR for predictions
var.vecm = vec2var(jotest, r = 1)

# Forecast with predict() function
pred_var.vecm = predict(var.vecm, n.ahead = 8)

Since all of our songs’ stream data appears to be at least somewhat non-stationary, an ordinary VAR on the levels (‘VAR in levels’) is not advisable. One alternative is to run a VAR on differenced versions of all the songs’ streams, but this ignores possible long-run relationships between the levels of the streaming data. Another alternative is a vector error correction model (VECM), which takes both levels and differences into account. To make sure this is suitable, we fit the model with ca.jo() and test for cointegration among the parallel series using Johansen’s trace test. This test gives a statistic of 120.53 for r = 0, which exceeds even the 1% critical value (~120 vs. ~79), so we can reject the hypothesis of no cointegration and proceed with a VECM. The statistic for r <= 1 (53.37) falls just short of its 1% critical value (55.43), so at the 1% level we settle on a single cointegrating relationship and use r = 1 when converting the model for prediction.
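For reference, the ‘VAR on differences’ alternative mentioned above could be sketched with VAR() from the vars package; this is only an illustration of that route, since it gives up the long-run levels relationship the VECM retains.

# Sketch of the differenced-VAR alternative (not used for the results above)
songs.diff = data.frame(lapply(songs.vars, function(x) diff(as.numeric(x))))
# VARselect(songs.diff, lag.max = 8) could guide the lag choice; p = 1 is assumed here
var.diff = VAR(songs.diff, p = 1, type = "const")
# predict(var.diff, n.ahead = 8) would forecast differences, which would then
# need to be cumulated back onto each series' last observed level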

Because this method does not produce a forecast object, I have not included an analogous plot of the entire series with the forecast. However, the forecast values can still be compared to the true values below.
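If a full-series view is still wanted, one way to approximate it (a sketch, reusing the start = c(2, 24) alignment from the split() function above) is to wrap the VECM point forecast in a ts object and layer it over the training series.

# Sketch: align the VECM point forecast with the test period and overlay it on
# the training series for a forecast-style plot
vecm.mean = ts(pred_var.vecm$fcst$train[, "fcst"], frequency = 52, start = c(2, 24))
autoplot(train) + autolayer(vecm.mean, series = "VECM forecast") + 
    ylab("Song 1 Weekly Streams") + xlab("Year")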

pred.model3b = pred_var.vecm$fcst$train[, 1]

rmse.model3b = rmse(test, pred.model3b)
print(rmse.model3b)
## [1] 4903.522
year = time(test)

df.model3b = as_tibble(cbind(pred.model3b, test, year))

# plot forecasts
ggplot(df.model3b, aes(x = year, y = test)) + geom_line(aes(y = test, color = "True Values"), 
    size = 1.5) + geom_line(aes(y = pred.model3b, color = "Forecast"), size = 1.5) + 
    ggtitle("VECM Forecast of Stream Volume") + xlab("Year") + ylab("Streams") + 
    theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())

Although the VECM model interestingly matches the true data better in some ways - it ‘sees’ the sharp initial increase in the true values and shows some ability to pick up the true trend direction - the ARIMA model from the previous section still wins on error, with the VECM scoring around 4.9k RMSE versus around 3.2k for the ARIMA.

We can conclude that although our cointegration-based approach works to a certain extent - songs 2 through 5 do appear to provide information useful for forecasting song 1 - it may be overkill compared to the tried-and-true approach of ARIMA modeling on song 1’s past streams alone.
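To put the three approaches side by side, here is a quick sketch that simply gathers the RMSE values computed above.

# Sketch: collect the three test-set RMSEs for a side-by-side comparison
round(c(SES_Fourier = rmse.model1, ARIMA = rmse.model2, VECM = rmse.model3b))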

Of course, it’s important to note that all of these models could be improved with contextual information and supporting variables added as predictors. We have no date labels on any of this data, and since holidays and time of year affect music streaming, being able to incorporate them would likely improve accuracy. Furthermore, predictors that contextualize what’s going on with this particular artist - such as the timing of releases, competitors’ releases, and press, or perhaps parallel stream data from YouTube - would aid the predictive power of any of these models. Nevertheless, in the absence of all of this supporting information, it’s striking how well a standard, non-seasonal ARIMA model does in predicting the next 8 weeks of this data.

Thanks for reading through my code! Feel free to use anything I have done here for your own projects.