Stock markets are where individual and institutional investors come together to buy and sell shares of publicly listed companies. With the digital ecosystem evolving rapidly over the last decade, most exchanges today are electronic, which makes abundant and accurate data publicly available. Every publicly listed company has an instantaneous stock price during trading hours, driven by the forces of supply and demand: together, these determine the price of each security, i.e., the levels at which market participants, whether investors or traders, are willing to buy or sell.
An important question here is: if accurate and abundant historic and real-time data exist, why can't stock prices be predicted accurately? If they could, data scientists, and not businessmen, would be the richest people on the planet. The answer lies in the aforementioned supply and demand. Although the phrase sounds simple, it is shaped by hundreds or perhaps thousands of microeconomic and macroeconomic factors, as well as hundreds of factors within the company itself. This makes the stock market very unpredictable and one of the hardest domains to forecast. For example, would forecasting models built in early 2019 have predicted the stock market crash caused by COVID in 2020?
I have been interested in the stock market since I was 16. The fascination was around how the marketplaces function; the cascading effect that the collapse of an economic superpower like the US has on all other global markets is interesting. I started investing as soon as I got my first job. My motivation for choosing stock price prediction is a combination of my interest in the stock market and the challenge of getting accurate predictions. I love challenges!
The data that I have chosen for analysis is the stock price of Alphabet Inc., which trades under the ticker symbol 'GOOGL'. We source the publicly available data from Yahoo Finance using the R package 'quantmod'. It is ironic that we are using Yahoo Finance to analyze the GOOGL stock price :) Regardless, the data is accurate and comes from a trustworthy source.
Out of a plethora of publicly listed companies, I chose to analyze GOOGL stock price for the following reasons -
Another important aspect of a time series is the frequency of the data. Stock market data can be obtained at a yearly, monthly, weekly, daily, hourly, per-minute, or even per-second level. For our analysis, we will look at the closing price at a daily level since January 2007.
Although Yahoo Finance sources its data directly from the exchanges, the data generating process behind stock prices is not simple. In simple terms, the exchange computes the stock price from the buy and sell orders placed in both the delivery (cash) and options markets.
We will be using the following libraries -
library(quantmod)
library(dplyr)
library(lubridate)
library(tidyverse)
library(vtable)
library(sjPlot)
library(MASS)
library(tseries)
library(forecast)
library(prophet)
getSymbols("GOOGL", src = "yahoo")
## [1] "GOOGL"
GOOGL <- dplyr::select(data.frame(date = index(GOOGL), coredata(GOOGL)), c('date','GOOGL.Close'))
head(GOOGL)
## date GOOGL.Close
## 1 2007-01-03 234.0290
## 2 2007-01-04 241.8719
## 3 2007-01-05 243.8388
## 4 2007-01-08 242.0320
## 5 2007-01-09 242.9930
## 6 2007-01-10 244.9750
#No. of observations
nrow(GOOGL)
## [1] 3810
There are in total 3,810 dates for which we have a closing stock price.
#Check for duplicate dates and missing values
nrow(GOOGL) == length(unique(GOOGL$date))
## [1] TRUE
sum(is.na(GOOGL$GOOGL.Close))
## [1] 0
We do not have any duplicate dates, and there are no missing values in the stock price. (Weekends and market holidays are naturally absent, since the exchange is closed on those days.) As mentioned earlier, stock market data is usually complete and accurate if it comes from the right source :)
#Range of data
range(GOOGL$date, na.rm = TRUE)
## [1] "2007-01-03" "2022-02-17"
The dataset runs from January 3rd, 2007 through February 17th, 2022, the most recent trading day at the time of writing, so it is up to date.
#Data Summary
sumtable(GOOGL, add.median=TRUE)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| GOOGL.Close | 3810 | 764.705 | 639.675 | 128.849 | 287.758 | 549.79 | 1076.453 | 2996.77 |
Looking at the summary of the stock price, we see that during this period the closing price ranged from about $129 to $2,997, and the mean ($764.71) is well above the median ($549.79), which suggests a right-skewed distribution.
#histogram
ggplot(data = GOOGL, aes(x = GOOGL.Close)) +
geom_histogram(bins = 30) +
xlab("Closing Price of Google stock") + ylab("Frequency") +
ggtitle("Histogram of Google stock closing price")
The histogram confirms our earlier observation that the distribution is right skewed. A bar in this histogram reads as, say, "the closing price of the Google stock was between $150 and $250 on about 145 days." Does this kind of statement help with the forecast we want to do? I don't think so.
#density plot
ggplot(data = GOOGL, aes(x = GOOGL.Close)) +
geom_density() +
xlab("Closing Price of Google stock") + ylab("Density") +
ggtitle("Density plot of Google stock closing price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The above chart gives us an estimate of the probability density. Clearly, the distribution is right skewed, and it cannot be approximated by a normal distribution.
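To put a rough number on this skew, we can compute the sample skewness by hand (a quick sketch in base R, using no additional packages):
#Sample skewness: third standardized moment of the closing price
x <- GOOGL$GOOGL.Close
skew <- mean((x - mean(x))^3) / sd(x)^3
skew   #a value well above 0 confirms a right-skewed distribution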
#boxplot
ggplot(data = GOOGL, aes(y = GOOGL.Close)) +
geom_boxplot() +
ylab("Closing Price of Google stock") +
ggtitle("Box plot of Google stock closing price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The box plot has almost an inverted-hammer shape, which is typical of a highly right-skewed distribution. The middle 50% of the data lies between roughly $288 and $1,076 (the interquartile range), and the plot flags values above about $2,250 as extreme. We know these are not erroneous data points; they are simply genuinely extreme values in our dataset.
#line chart
ggplot(data = GOOGL, aes(x = date, y = GOOGL.Close )) +
geom_line() +
xlab("Date") + ylab("Closing Price of Google stock") +
ggtitle("Google stock price trend over time")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
This chart appears to be the most useful of all the charts we have seen, as it clearly shows the trend of the stock over time: a long, relatively gradual climb, a sharp drop in early 2020 around the COVID crash, and a steep run-up afterwards.
We want to see whether a simple linear trend can capture the pattern in the data and help us predict the stock price, and if so, what portion of the variance is explained by the predictor (which in our case is the date).
lm_model <- lm(GOOGL.Close ~ date, data=GOOGL)
summary(lm_model)
##
## Call:
## lm(formula = GOOGL.Close ~ date, data = GOOGL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -425.45 -205.97 -126.26 93.16 1307.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.868e+03 5.360e+01 -90.83 <2e-16 ***
## date 3.460e-01 3.277e-03 105.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 322.8 on 3808 degrees of freedom
## Multiple R-squared: 0.7454, Adjusted R-squared: 0.7454
## F-statistic: 1.115e+04 on 1 and 3808 DF, p-value: < 2.2e-16
tab_model(lm_model, show.intercept = FALSE)
Dependent variable: GOOGL Close
| Predictors | Estimates | CI | p |
|---|---|---|---|
| date | 0.35 | 0.34 – 0.35 | <0.001 |
| Observations | 3810 | | |
| R2 / R2 adjusted | 0.745 / 0.745 | | |
Based on the model, we can draw the following conclusions -
- The date is a highly significant predictor (p < 0.001); on average, the closing price has risen by about $0.35 per calendar day over this period.
- The linear trend explains roughly 74.5% of the variance in the closing price (R-squared = 0.745).
- The residuals are nevertheless large (up to about $1,308), so a straight line is far from an adequate model of the price on its own.
Variance stationarity
#line chart
ggplot(data = GOOGL, aes(x = date, y = GOOGL.Close )) +
geom_line() +
xlab("Date") + ylab("Closing Price of Google stock") +
ggtitle("Google stock price trend over time")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
Based on the plot above, we can clearly see that the time series is not variance stationary, i.e., its variance is not constant across time. Further, the clear positive trend, with no reversion to a constant level, indicates that the series is not mean stationary either. Both kinds of non-stationarity make time series modelling harder, so we deal with them before fitting a model.
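Before transforming, a quick numeric check makes the point that the spread of the closing price grows over time. This is a sketch that simply groups the series by calendar year, using the already-loaded dplyr and lubridate:
#Standard deviation of the closing price by calendar year
GOOGL %>%
mutate(year = year(date)) %>%
group_by(year) %>%
summarise(sd_close = sd(GOOGL.Close))
If the series were variance stationary, these yearly standard deviations would be roughly comparable; here we expect them to grow substantially over the sample.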
To further confirm that the series is variance non-stationary, we compare the original time series with transformed versions, applying both the log transformation and the Box-Cox transformation. To check for mean stationarity, we then apply the popular Augmented Dickey-Fuller (ADF) test.
Log transformation
#line chart after log transformation
GOOGL$GOOGL.log_close <- log(GOOGL$GOOGL.Close)
ggplot(data = GOOGL, aes(x = date, y = GOOGL.log_close)) +
geom_line() +
xlab("Date") + ylab("Log(Closing Price)") +
ggtitle("Log transformed closing price of google over time")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
We notice that the variance appears much more constant after the log transformation than in the original data.
Box-Cox transformation
#line chart after box cox transformation
bc <- boxcox(GOOGL$GOOGL.Close ~ GOOGL$date)   #profile log-likelihood over a grid of lambda values
lambda <- bc$x[which.max(bc$y)]                #lambda that maximises the log-likelihood
GOOGL$GOOGL.bc_close <- GOOGL$GOOGL.Close ^ lambda
ggplot(data = GOOGL, aes(x = date, y = GOOGL.bc_close)) +
geom_line() +
xlab("Date") + ylab("Transformed Closing Price of Google stock") +
ggtitle("Google stock price trend over time after Box-Cox transformation")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
Although the fluctuations in variance are smaller after the Box-Cox transformation, the trend has reversed (the estimated lambda is negative, so larger prices map to smaller transformed values). For this reason, we shall proceed with the log transformation.
Mean Stationarity
The Augmented Dickey-Fuller test (ADF test) is one of the most commonly used statistical tests for checking whether a given time series is stationary. It tests for the presence of a unit root, which indicates that the series is non-stationary.
The null hypothesis of the ADF test is that the series has a unit root, i.e., that the coefficient on the first lag of Y is equal to 1. When the test statistic is lower (more negative) than the critical value, we reject the null hypothesis and infer that the time series is stationary.
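For reference, a common form of the ADF regression is sketched below (tseries::adf.test fits a version with a constant, a trend term and k lagged differences, with k chosen from the series length by default):
$$\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{k} \delta_i\, \Delta y_{t-i} + \varepsilon_t$$
The null hypothesis of a unit root corresponds to $\gamma = 0$ (equivalently, a coefficient of 1 on the first lag of $y$), against the alternative $\gamma < 0$ (stationary).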
We perform the ADF test on the original closing price, the log-transformed closing price, and the Box-Cox transformed closing price.
GOOGL <- drop_na(GOOGL)
adf.test(x=GOOGL$GOOGL.Close)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.Close
## Dickey-Fuller = -0.0020527, Lag order = 15, p-value = 0.99
## alternative hypothesis: stationary
adf.test(x=GOOGL$GOOGL.log_close)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.log_close
## Dickey-Fuller = -2.8913, Lag order = 15, p-value = 0.2009
## alternative hypothesis: stationary
adf.test(x=GOOGL$GOOGL.bc_close)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.bc_close
## Dickey-Fuller = -3.271, Lag order = 15, p-value = 0.07563
## alternative hypothesis: stationary
We can see that although the p-value decreases dramatically after each transformation, we still cannot reject the null hypothesis of non-stationarity at the 5% significance level.
To resolve the mean non-stationarity, we compute the first-order difference of the log-transformed closing price.
#Mean Stationary
GOOGL$GOOGL.log.Close_diff = GOOGL$GOOGL.log_close - lag(GOOGL$GOOGL.log_close)
ggplot(data = GOOGL, aes(x = date, y = GOOGL.log.Close_diff)) +
geom_line() +
xlab("Date") + ylab("First order Lag") +
ggtitle("Time series of first order lag for log transformed Google stock price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The differenced series above looks reasonably mean stationary, although there are outliers that deviate far more than most points. We now run the ADF test on the first-order differenced series to check whether we have achieved mean stationarity.
GOOGL <- drop_na(GOOGL)
adf.test(x=GOOGL$GOOGL.log.Close_diff)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.log.Close_diff
## Dickey-Fuller = -15.25, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
Clearly, the p-value from the test is below the 0.05 significance level, indicating that the differenced series is now mean stationary.
From a visual glance, we do not see any evident seasonality or cyclicity. This is expected: seasonality has very little impact on stock market data.
For any time series modelling, it is necessary to understand the nature of the series, i.e., whether it is an autoregressive (AR) process, a moving average (MA) process, or both, and to determine its order. The ACF and PACF plots provide this intuition.
#par(mfrow = c(1,2))
acf(GOOGL$GOOGL.log.Close_diff,lag.max=20, main="ACF plot of differenced log closing price of Google")
pacf(GOOGL$GOOGL.log.Close_diff,lag.max=20, main="PACF plot of differenced log closing price of Google")
Based on the ACF plot, the differenced series most likely follows a first-order MA process.
ARIMA, short for ‘Auto Regressive Integrated Moving Average’, is a class of models that ‘explains’ a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.
Any ‘non-seasonal’ time series that exhibits patterns and is not pure white noise can be modelled with ARIMA. An ARIMA model is characterized by 3 terms, p, d and q, where p is the order of the AR term, q is the order of the MA term, and d is the degree of differencing required to make the time series stationary.
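In backshift-operator notation ($B\,y_t = y_{t-1}$), an ARIMA(p, d, q) model can be written as:
$$\Big(1 - \sum_{i=1}^{p}\phi_i B^i\Big)(1 - B)^d\, y_t = c + \Big(1 + \sum_{j=1}^{q}\theta_j B^j\Big)\varepsilon_t$$
where the $\phi_i$ are the AR coefficients, the $\theta_j$ are the MA coefficients, and $\varepsilon_t$ is white noise.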
From the ACF plot, we had an intuition that the differenced series could follow a first-order MA process. However, we shall fit several ARIMA models to verify this, using AIC and BIC to compare them.
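As a quick reminder, both criteria trade off goodness of fit against model complexity (lower is better):
$$\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln(n) - 2\ln\hat{L}$$
where $\hat{L}$ is the maximized likelihood, $k$ the number of estimated parameters and $n$ the number of observations; BIC penalizes extra parameters more heavily for large samples.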
BIC(
arima(GOOGL$GOOGL.log_close, order = c(0,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(2,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(3,1,1), method = 'ML')
)
## df BIC
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 0), method = "ML") 2 8738.072
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 1), method = "ML") 3 3585.540
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 0), method = "ML") 1 -19691.919
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 1), method = "ML") 2 -19692.731
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 0), method = "ML") 3 -19672.517
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 1), method = "ML") 4 -19673.286
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 0), method = "ML") 2 -19692.813
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 1), method = "ML") 3 -19684.531
## arima(GOOGL$GOOGL.log_close, order = c(2, 1, 1), method = "ML") 4 -19676.370
## arima(GOOGL$GOOGL.log_close, order = c(3, 1, 1), method = "ML") 5 -19668.386
According to the BIC values, ARIMA(1,1,0) and ARIMA(0,1,1) are essentially tied for the lowest BIC (about -19692.8 and -19692.7 respectively).
AIC(arima(GOOGL$GOOGL.log_close, order = c(0,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(2,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(3,1,1), method = 'ML')
)
## df AIC
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 0), method = "ML") 2 8725.582
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 1), method = "ML") 3 3566.804
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 0), method = "ML") 1 -19698.163
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 1), method = "ML") 2 -19705.221
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 0), method = "ML") 3 -19691.252
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 1), method = "ML") 4 -19698.267
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 0), method = "ML") 2 -19705.302
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 1), method = "ML") 3 -19703.265
## arima(GOOGL$GOOGL.log_close, order = c(2, 1, 1), method = "ML") 4 -19701.350
## arima(GOOGL$GOOGL.log_close, order = c(3, 1, 1), method = "ML") 5 -19699.610
The AIC values tell the same story: ARIMA(1,1,0) and ARIMA(0,1,1) are effectively tied (AIC of about -19705.3 and -19705.2). Since the difference is negligible and the ACF pointed towards an MA(1) structure, we proceed with ARIMA(0,1,1), i.e., a first-order MA process on the once-differenced log price.
best_model <- arima(GOOGL$GOOGL.log_close, order = c(0,1,1))
resi <- best_model$residuals
predi <- GOOGL$GOOGL.log_close - resi   #fitted values = observed minus residuals
ggplot()+
geom_line(aes(GOOGL$date,GOOGL$GOOGL.log_close)) +
geom_line(aes(GOOGL$date,predi),color='blue',alpha=0.4) +
theme_bw() +
xlab("Date") + ylab("Log(Close price)") +
ggtitle("In-sample prediction for the best model")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The Ljung-Box test checks whether the autocorrelations of the residuals are jointly non-zero, i.e., whether the residuals behave like white noise; a small p-value indicates that some autocorrelation remains.
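The Ljung-Box statistic aggregates the first $h$ residual autocorrelations $\hat\rho_k$:
$$Q = n(n+2)\sum_{k=1}^{h}\frac{\hat\rho_k^{\,2}}{n-k}$$
Under the null hypothesis of white-noise residuals, $Q$ is approximately chi-squared distributed (with degrees of freedom reduced by the number of fitted ARMA parameters when applied to model residuals).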
forecast::checkresiduals(best_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 23.509, df = 9, p-value = 0.005149
##
## Model df: 1. Total lags used: 10
acf(resi,lag.max=20, main="ACF plot of Residuals")
pacf(resi,lag.max=20, main="PACF plot of Residuals")
Box.test(resi,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resi
## X-squared = 40.217, df = 20, p-value = 0.004689
RMSE = sqrt(mean((predi - GOOGL$GOOGL.log_close)^2,na.rm=T))
RMSE
## [1] 0.01818918
The in-sample RMSE of about 0.018 on the log scale corresponds to a typical error of roughly 1.8% of the price. We now produce a 5-step-ahead forecast from the best model.
prediction = predict(best_model, n.ahead = 5)
for(i in 1:5){
print(paste("Prediction for t +",i," : ",prediction$pred[i]))
}
## [1] "Prediction for t + 1 : 7.88445752623601"
## [1] "Prediction for t + 2 : 7.88445752623601"
## [1] "Prediction for t + 3 : 7.88445752623601"
## [1] "Prediction for t + 4 : 7.88445752623601"
## [1] "Prediction for t + 5 : 7.88445752623601"
We see here that with the best model, ARIMA(0,1,1), the forecasts for t+1, t+2, and so on are all identical. This is expected rather than a bug: for an ARIMA(0,1,1) model without drift, every forecast beyond one step ahead equals the one-step-ahead forecast, since the MA(1) term carries information only one step into the future. Only the forecast standard errors grow with the horizon.
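A short sketch of why this happens: for an ARIMA(0,1,1) model without drift, the differenced series is $\Delta y_t = \varepsilon_t + \theta_1\varepsilon_{t-1}$, so
$$\hat{y}_{T+1|T} = y_T + \theta_1\hat{\varepsilon}_T, \qquad \hat{y}_{T+h|T} = \hat{y}_{T+1|T} \ \text{ for all } h \ge 1,$$
because future shocks have expectation zero and the MA(1) term reaches only one step ahead. The prediction intervals, by contrast, widen with the horizon, as the growing standard errors below show.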
pred_data = data.frame(
Forecast=prediction,
date = (max(GOOGL$date)+1):(max(GOOGL$date)+1+4)
)
pred_data$date <- as.Date(pred_data$date)
head(pred_data)
## Forecast.pred Forecast.se date
## 1 7.884458 0.01819135 2022-02-18
## 2 7.884458 0.02510950 2022-02-19
## 3 7.884458 0.03049670 2022-02-20
## 4 7.884458 0.03506580 2022-02-21
## 5 7.884458 0.03910463 2022-02-22
pred_merge = GOOGL %>%
full_join(pred_data) %>%
mutate(
pred_high = (Forecast.pred + 2*Forecast.se),
pred_low = (Forecast.pred - 2*Forecast.se),
pred.pred = Forecast.pred,
error = Forecast.pred - GOOGL.log_close
)
Now, we shall plot the forecasted values for the time series. A date filter has been applied so that the blue prediction ribbon is clearly visible.
pred_merge %>%
filter(date > '2021-01-01') %>%
ggplot()+
geom_line(aes(date,GOOGL.log_close))+
geom_line(aes(date,Forecast.pred),color='blue')+
geom_ribbon(aes(x = date, ymin = pred_low, ymax = pred_high), fill='blue', alpha=0.2) +
ylab("Log(Close price)") +
xlab("Date") +
ggtitle("Prediction of log of Google stock price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
Note that this is still the forecast of the log-transformed close price. We need to transform back to the original scale and look at the forecast there.
pred_merge %>%
filter(date > '2021-01-01') %>%
ggplot()+
geom_line(aes(date,GOOGL.Close))+
geom_line(aes(date,exp(Forecast.pred)),color='blue')+
geom_ribbon(aes(x = date, ymin = exp(pred_low), ymax = exp(pred_high)), fill='blue', alpha=0.2) +
ylab("Close price") +
xlab("Date") +
ggtitle("Prediction of Google stock price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
After modelling the time series with ARIMA, we now model the same series using Facebook's Prophet. Prophet is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit together with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data; however, we don't expect stock market data to have strong seasonal or holiday effects.
Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. A major difference compared to ARIMA models is that Prophet does not require the input series to be mean or variance stationary.
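Conceptually, Prophet decomposes the series into additive components:
$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$
where $g(t)$ is a piecewise linear or logistic trend, $s(t)$ captures the periodic (weekly/yearly) seasonality, $h(t)$ the holiday effects, and $\varepsilon_t$ the error term.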
We treat the data before 1-Jan-2021 as the training set and the data from 1-Jan-2021 onwards (through 17-Feb-2022) as the test set.
GOOGL_prophet_data = GOOGL %>%
rename(ds = date, # Have to name our date variable "ds"
y = GOOGL.Close) %>%
dplyr::select(ds,y)
train_prophet = GOOGL_prophet_data %>%
filter(ds < ymd("2021-01-01"))
test_prophet = GOOGL_prophet_data %>%
filter(ds >= ymd("2021-01-01"))
model = prophet(train_prophet)
future_pred = make_future_dataframe(model, periods = 365)
forecast = predict(model,future_pred)
plot(model,forecast)+
ylab("Closing Price of Google")+xlab("Date")+
ggtitle("GOOGL price Forecast using Prophet Model") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The chart below presents an interactive version of the forecast plot. Hover over the chart to see the values, and use the slider to adjust the time period.
dyplot.prophet(model,forecast)
Plotting the individual components of the time series, we see that there is no day-of-year component, indicating no seasonality at that level.
The day-of-week component has lows on Saturday and Sunday. This makes sense because stock markets are closed on weekends, so data is systematically missing on those two days.
prophet_plot_components(model,forecast)
Most real-world time series have occasional abrupt changes in their trajectories, referred to as changepoints. Identifying them is important for accurately modelling changes in trend. By default, Prophet identifies the changepoints itself, but we can also control this behaviour through arguments, as sketched below.
Changepoints are only inferred for the first 80% of the data in order to prevent overfitting.
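As a sketch, these are the main changepoint-related arguments of the prophet() function (the values shown are, as I understand them, the package defaults; adjust as needed):
#Sketch: controlling changepoint detection in prophet()
model_cp = prophet(train_prophet,
n.changepoints = 25,             #number of potential changepoints to place
changepoint.range = 0.8,         #fraction of the history in which to place them
changepoint.prior.scale = 0.05)  #flexibility of the trend at each changepoint
Increasing changepoint.prior.scale makes the trend more flexible, at the risk of overfitting.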
plot(model,forecast)+add_changepoints_to_plot(model)+
xlab("Date")+
ylab("Close price of Google")+
ggtitle("Default Changepoints in the forecasted time series") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
Now we manually increase the number of potential changepoints by setting the n.changepoints argument of the prophet function to 50.
model = prophet(train_prophet, n.changepoints=50)
forecast = predict(model,future_pred)
plot(model,forecast)+add_changepoints_to_plot(model) +
xlab("Date")+
ylab("Close price of Google")+
ggtitle("Forecasted time series with 50 changepoints") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
We see that there are more changepoints detected now.
There are two ways of combining time series components: additive and multiplicative. In an additive model the individual components are added together, which suits linear trend/seasonality; in a multiplicative model they are multiplied together, which is generally used for non-linear trend/seasonality.
The default in Prophet is additive, so the plots we saw earlier used additive seasonality. Below we explore multiplicative seasonality by setting seasonality.mode to ‘multiplicative’.
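Schematically, with trend $T_t$, seasonal component $S_t$ and remainder $\varepsilon_t$, the two forms are:
$$\text{additive: } y_t = T_t + S_t + \varepsilon_t \qquad\qquad \text{multiplicative: } y_t = T_t \times S_t \times \varepsilon_t$$
Taking logs turns a multiplicative decomposition into an additive one, which is why log transformations and multiplicative seasonality often serve similar purposes.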
#Multiplicative seasonality
mul_model = prophet(train_prophet, seasonality.mode = 'multiplicative')
future_pred = make_future_dataframe(mul_model, periods = 365)
forecast = predict(mul_model,future_pred)   #predict from the multiplicative model
plot(mul_model,forecast)+
ylab("Closing Price of Google")+xlab("Date")+
ggtitle("GOOGL price Forecast using Prophet Model (Multiplicative)") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
prophet_plot_components(mul_model, forecast)
Another interesting and useful difference between ARIMA and Prophet is that we can include holiday information in Prophet models. In the code below, we use add_country_holidays to add US holidays to the model.
model = prophet(train_prophet,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train_prophet)
forecast = predict(model,future_pred)
prophet_plot_components(model,forecast)
Further examining holidays, we can see the impact of major holidays -
forecast %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
theme_bw()
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## Warning: Removed 10 rows containing missing values (geom_col).
We see a positive impact around New Year's Eve and Columbus Day, and a negative impact around Veterans Day. However, we should not over-interpret these effects for stock market data. For datasets like product demand, understanding the impact of holidays would be far more critical and insightful.
#Assessing Prophet Models
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
inner_join(test_prophet, by = "ds")   #align predictions with the test set by date
RMSE = sqrt(mean((forecast_metric_data$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(forecast_metric_data$y - forecast_metric_data$yhat))
MAPE = mean(abs((forecast_metric_data$y - forecast_metric_data$yhat)/forecast_metric_data$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 992.69"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 922.91"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.36"
Given a MAPE of roughly 36%, the model's out-of-sample performance on the test window is poor; a pure trend extrapolation struggles over a year-long horizon.
#Cross-validation with Prophet
df.cv <- cross_validation(model, initial = 730, period = 300, horizon = 300, units = 'days')
## Making 14 forecasts with cutoffs between 2009-07-02 and 2020-03-06
head(df.cv)
## y ds yhat yhat_lower yhat_upper cutoff
## 1 205.0100 2009-07-06 213.4323 201.7635 224.6946 2009-07-02
## 2 198.5135 2009-07-07 214.3315 202.7911 225.5828 2009-07-02
## 3 201.4464 2009-07-08 214.8179 203.6980 226.6125 2009-07-02
## 4 205.4004 2009-07-09 214.5177 202.7274 225.6356 2009-07-02
## 5 207.4074 2009-07-10 214.6672 202.5218 225.7187 2009-07-02
## 6 212.3624 2009-07-13 211.4312 199.9079 223.1120 2009-07-02
The model has produced 14 forecasts, with an initial training period of 730 days, a period of 300 days between cutoff dates, and a horizon of 300 days.
unique(df.cv$cutoff)
## [1] "2009-07-02 GMT" "2010-04-28 GMT" "2011-02-22 GMT" "2011-12-19 GMT"
## [5] "2012-10-14 GMT" "2013-08-10 GMT" "2014-06-06 GMT" "2015-04-02 GMT"
## [9] "2016-01-27 GMT" "2016-11-22 GMT" "2017-09-18 GMT" "2018-07-15 GMT"
## [13] "2019-05-11 GMT" "2020-03-06 GMT"
#cross-validation Actual vs Predicted
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Amazon Closing Price")+
scale_color_discrete(name = 'Cutoff')
#Cross Validation RMSE metric plot
plot_cross_validation_metric(df.cv, metric = 'rmse')
The RMSE of the model increases with the horizon. This seems reasonable: the model predicts the near future better than the more distant future.