Introduction

Stock markets are where individual and institutional investors come together to buy and sell shares of publicly listed companies. With the digital ecosystem evolving rapidly over the last decade, most exchanges today are electronic, which means abundant and accurate data about them is publicly available. Every publicly listed company has an instantaneous stock price associated with it during trading hours. That price is driven by the forces of supply and demand, which determine the levels at which market participants, whether investors or traders, are willing to buy or sell each security.

An important question here is: if accurate and abundant historical and real-time data exists, why can't stock prices be predicted accurately? If they could, data scientists, not businessmen, would be the richest people on the planet. The answer lies in the aforementioned supply and demand. Although these are two simple terms, they are influenced by hundreds or perhaps thousands of microeconomic and macroeconomic factors, as well as hundreds of factors within the company itself. This makes the stock market very unpredictable and one of the hardest domains to forecast. For example, would a forecasting model built in early 2019 have predicted the stock market crash caused by COVID in 2020?

I have been interested in the stock market since I was 16. The fascination was with how these marketplaces function; the cascading effect that the collapse of an economic superpower like the US has on every other global market is a good example. I started investing as soon as I got my first job. The motivation for choosing stock price prediction is a combination of my interest in the stock market and the challenge of getting accurate predictions. I love challenges!

Data

The data that I have chosen for analysis is the stock price of Alphabet Inc., which trades under the ticker symbol ‘GOOGL’. We shall source the publicly available data from Yahoo Finance using the R library ‘quantmod’. It is a little ironic that we are using Yahoo Finance to analyze the GOOGL stock price :) Regardless, the data is accurate and comes from a trustworthy source.

Out of a plethora of publicly listed companies, I chose to analyze GOOGL stock price for the following reasons -

  1. It is Google!
  2. It is a large-cap company with relatively low market volatility
  3. I used to head the Developers Students Club by Google at my University. So, I always wanted it to be my first stock purchase in the US.

Another important aspect of a time series is the frequency of the data. Stock market data can be obtained at yearly, monthly, weekly, daily, hourly, minute, or perhaps even second-level granularity. For our analysis, we will be looking at the daily closing stock price since January 2007.

Although Yahoo Finance sources its data directly from the exchanges, the data generating process behind stock prices is not simple. In short, the exchange determines the traded price from the buy and sell orders placed by participants in the cash and derivatives markets.

Analysis

Library requirements

We will be using the following libraries -

library(quantmod)
library(dplyr)
library(lubridate)
library(tidyverse)
library(vtable)
library(sjPlot)
library(MASS)
library(tseries)
library(forecast)
library(prophet)

Loading the dataset


getSymbols("GOOGL", src = "yahoo")
## [1] "GOOGL"
GOOGL <- dplyr::select(data.frame(date = index(GOOGL), coredata(GOOGL)), c('date','GOOGL.Close'))
head(GOOGL)
##         date GOOGL.Close
## 1 2007-01-03    234.0290
## 2 2007-01-04    241.8719
## 3 2007-01-05    243.8388
## 4 2007-01-08    242.0320
## 5 2007-01-09    242.9930
## 6 2007-01-10    244.9750
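
As an aside on frequency: if a weekly or monthly series were preferred, the raw xts object returned by getSymbols could be aggregated before converting it to a data frame. A minimal sketch (purely illustrative; the variable names are ours, and the rest of the analysis stays at daily frequency):

# Fetch into a local xts object and aggregate the OHLC bars (illustrative only).
googl_xts <- getSymbols("GOOGL", src = "yahoo", auto.assign = FALSE)
googl_weekly  <- to.weekly(googl_xts)   # one OHLC bar per week
googl_monthly <- to.monthly(googl_xts)  # one OHLC bar per month
head(Cl(googl_weekly))                  # Cl() extracts the closing price column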

Exploration


#No. of observations
nrow(GOOGL)
## [1] 3810

There are in total 3,810 dates for which we have a closing stock price.

#Check for duplicate dates and missing values
nrow(GOOGL) == length(unique(GOOGL$date))
## [1] TRUE
sum(is.na(GOOGL$GOOGL.Close))
## [1] 0

We do not have any duplicate dates, and there are no missing values in the stock price. (Weekends and market holidays are naturally absent, since the market is closed on those days.) As noted earlier, stock market data is usually complete and accurate if it comes from the right source :)

#Range of data
range(GOOGL$date, na.rm = TRUE)
## [1] "2007-01-03" "2022-02-17"

The dataset runs from January 3rd, 2007 through February 17th, 2022, the most recent trading day at the time of writing, so it is up to date.

#Data Summary
sumtable(GOOGL, add.median=TRUE)
Summary Statistics
Variable      N     Mean     Std. Dev.  Min      Pctl. 25  Pctl. 50  Pctl. 75  Max
GOOGL.Close   3810  764.705  639.675    128.849  287.758   549.79    1076.453  2996.77

Looking at the summary of the stock price, we see that during this period,

  1. The stock hit a low of about $129 and a high of about $2,997, roughly a 23-fold increase from low to high. Incredible.
  2. The mean is much higher than the median, indicating a right-skewed distribution. This makes sense because GOOGL has grown far faster in the last 5-10 years than in the earlier part of the sample; the quick skewness check below quantifies this.
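
To put a number on the skew suggested by the mean/median gap, we can compute the sample skewness directly. A quick sketch using base R only (the variable names are ours); a value well above 1 would confirm a strongly right-skewed distribution:

# Sample skewness: positive values indicate a right-skewed distribution.
x <- GOOGL$GOOGL.Close
skew <- mean((x - mean(x))^3) / sd(x)^3
skew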

Visualizations


Histogram

#histogram
ggplot(data = GOOGL, aes(x = GOOGL.Close)) +
  geom_histogram(bins = 30) +
  xlab("Closing Price of Google stock") + ylab("Frequency") +
  ggtitle("Histogram of Google stock closing price")

The histogram confirms our earlier observation that the distribution is right skewed. Each bar reads along the lines of "the stock price of Google was between, say, $150 and $250 on 145 days." Does this view help with the forecast we want to produce? Not really; it discards the time dimension entirely.

Density plot


#density plot
ggplot(data = GOOGL, aes(x = GOOGL.Close)) +
  geom_density() +
  xlab("Closing Price of Google stock") + ylab("Density") +
  ggtitle("Density plot of Google stock closing price")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

The above chart shows an estimate of the probability density function. Clearly, the distribution is right skewed, and it cannot reasonably be approximated by a normal distribution; the quick Q-Q check below makes this explicit.
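
A normal Q-Q plot is a simple way to visualize the departure from normality. A base-R sketch (output not shown):

# Points far off the reference line at the upper end confirm the heavy right tail.
qqnorm(GOOGL$GOOGL.Close, main = "Normal Q-Q plot of Google closing prices")
qqline(GOOGL$GOOGL.Close)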

Box plot


#boxplot
ggplot(data = GOOGL, aes(y = GOOGL.Close)) +
  geom_boxplot() +
  ylab("Closing Price of Google stock") +
  ggtitle("Box plot of Google stock closing price")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

The box plot has almost an inverted-hammer shape, which is typical of a highly right-skewed distribution. The middle 50% of the data lies between roughly $288 and $1,076 (the quartiles from the summary table). The box plot also flags values above roughly $2,250 as extreme. We know these are not erroneous observations, just genuinely extreme values in our dataset.

Line Chart


#line chart
ggplot(data = GOOGL, aes(x = date, y = GOOGL.Close )) +
  geom_line() +
  xlab("Date") + ylab("Closing Price of Google stock") +
  ggtitle("Google stock price trend over time")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

This chart appears to be the most useful among all the charts we have seen. It clearly shows the trend of the stock over time. A few observations we can make from the chart are -

  1. The volatility of the stock market is clearly visible
  2. Although there are crests and troughs over short periods, there is a clear upward trend over the long run
  3. The upward trend does not appear to be linear
  4. (For the stock-market geeks) The post-COVID recovery has been remarkable

Key Observations from Analysis

  1. The data is clean; no preprocessing is needed
  2. The box plot suggests that there may be a few outliers. Removing them does not make sense, since we know they are not due to errors in data collection. However, we might want to consider subsetting the data to study trends for a specific time range, because the pre-COVID and post-COVID trends are quite different
  3. Histograms, density plots, and box plots are great visualization methods, but they are not well suited to studying time series data because they ignore the time dimension
  4. The line chart shows a strong positive trend over the long run

Does a linear model do a good job capturing the trend in our time series?

We want to see whether a simple linear trend can capture the pattern in the data and help us predict stock prices, and if so, what portion of the variance is explained by the predictor (which in our case is the date).

lm_model <- lm(GOOGL.Close ~ date, data=GOOGL)
summary(lm_model)
## 
## Call:
## lm(formula = GOOGL.Close ~ date, data = GOOGL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -425.45 -205.97 -126.26   93.16 1307.84 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.868e+03  5.360e+01  -90.83   <2e-16 ***
## date         3.460e-01  3.277e-03  105.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 322.8 on 3808 degrees of freedom
## Multiple R-squared:  0.7454, Adjusted R-squared:  0.7454 
## F-statistic: 1.115e+04 on 1 and 3808 DF,  p-value: < 2.2e-16


Summary table


tab_model(lm_model, show.intercept = FALSE)
                   GOOGL Close
Predictors         Estimates    CI            p
date               0.35         0.34 – 0.35   <0.001
Observations       3810
R2 / R2 adjusted   0.745 / 0.745


Based on the model, we can draw the following conclusions -

  1. The linear model with date as the predictor captures ~74.5% of the variance in the stock price. This is not bad, but we can do much better.
  2. The results are statistically significant, i.e., we can confidently say that there is a relationship between the date and the stock price
  3. The coefficient on date suggests that, on average, the closing stock price increases by about $0.35 per day; the fitted trend is overlaid on the series in the sketch below
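
To see how well (or poorly) the straight line tracks the series, we can overlay the fitted values on the observed prices. A minimal sketch (lm_fit is a column name we introduce here; output not shown):

# Overlay the fitted linear trend on the observed closing prices.
GOOGL %>%
  mutate(lm_fit = fitted(lm_model)) %>%
  ggplot() +
  geom_line(aes(date, GOOGL.Close)) +
  geom_line(aes(date, lm_fit), color = "red") +
  xlab("Date") + ylab("Closing Price of Google stock") +
  ggtitle("Observed prices vs. fitted linear trend")

Given the non-linear growth seen in the line chart, the straight line is likely to deviate substantially from the most recent years, which is one reason a high R-squared can still be misleading for forecasting.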

Time Series Modelling

Stationarity in Time series

Variance stationarity

#line chart
ggplot(data = GOOGL, aes(x = date, y = GOOGL.Close )) +
  geom_line() +
  xlab("Date") + ylab("Closing Price of Google stock") +
  ggtitle("Google stock price trend over time")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

Based on the plot above, we can clearly see that the time series is not variance stationary, i.e., its variance is not constant across time. Further, the clear positive trend, with no reversion to a constant level, indicates that the series is not mean stationary either. Both forms of non-stationarity are undesirable, since most classical time series models assume a stationary series.

To address the variance non-stationarity, we compare the original series with transformed versions of it, using the log transformation and the Box-Cox transformation. To check for mean stationarity, we then apply the popular Augmented Dickey-Fuller (ADF) test.

Log transformation

#line chart after log transformation

GOOGL$GOOGL.log_close <- log(GOOGL$GOOGL.Close)

ggplot(data = GOOGL, aes(x = date, y = GOOGL.log_close)) +
  geom_line() +
  xlab("Date") + ylab("Log(Closing Price)") +
  ggtitle("Log transformed closing price of google over time")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

We can notice that the variance appears more constant after the transformation compared to the original data.

Box-Cox transformation

#line chart after box cox transformation
bc <- boxcox(GOOGL$GOOGL.Close ~ GOOGL$date)

lambda <- bc$x[which.max(bc$y)]

GOOGL$GOOGL.bc_close <- GOOGL$GOOGL.Close ** lambda
  
ggplot(data = GOOGL, aes(x = date, y = GOOGL.bc_close)) +
  geom_line() +
  xlab("Date") + ylab("Transformed Closing Price of Google stock") +
  ggtitle("Google stock price trend over time after Box-Cox transformation")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

Although the variance fluctuations are reduced, the trend has reversed direction. This happens because the estimated lambda is negative, and raising the series to a negative power flips its ordering (the full Box-Cox formula, (y^lambda - 1)/lambda, divides by lambda and would preserve the direction). For simplicity, we proceed with the log transformation.

Mean Stationarity

The Augmented Dickey-Fuller (ADF) test is one of the most commonly used statistical tests for assessing whether a time series is stationary. It tests for the presence of a unit root, which indicates that the series is non-stationary.

The null hypothesis of the ADF test is that a unit root is present (equivalently, that the coefficient on the first lag of Y equals 1). When the test statistic is lower (more negative) than the critical value, or equivalently when the p-value falls below the chosen significance level, we reject the null hypothesis and infer that the series is stationary.

We perform the ADF test on the original closing price, the log-transformed closing price, and the Box-Cox-transformed closing price.

GOOGL <- drop_na(GOOGL)

adf.test(x=GOOGL$GOOGL.Close)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  GOOGL$GOOGL.Close
## Dickey-Fuller = -0.0020527, Lag order = 15, p-value = 0.99
## alternative hypothesis: stationary
adf.test(x=GOOGL$GOOGL.log_close)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  GOOGL$GOOGL.log_close
## Dickey-Fuller = -2.8913, Lag order = 15, p-value = 0.2009
## alternative hypothesis: stationary
adf.test(x=GOOGL$GOOGL.bc_close)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  GOOGL$GOOGL.bc_close
## Dickey-Fuller = -3.271, Lag order = 15, p-value = 0.07563
## alternative hypothesis: stationary

Although the p-value decreases dramatically after the transformations, in none of the three cases can we reject the null hypothesis at the 5% significance level, so the series is still not mean stationary even after transformation.

To resolve the mean non-stationarity, we take the first difference of the log-transformed closing price.

#First difference to achieve mean stationarity
GOOGL$GOOGL.log.Close_diff = GOOGL$GOOGL.log_close - lag(GOOGL$GOOGL.log_close)

ggplot(data = GOOGL, aes(x = date, y = GOOGL.log.Close_diff)) +
  geom_line() +
  xlab("Date") + ylab("First difference of Log(Close)") +
  ggtitle("First difference of the log-transformed Google stock price")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

The graph above suggests that the first-differenced series is mean stationary, although a few large spikes deviate much more than typical observations. We now perform the ADF test on the differenced series to confirm that mean stationarity has been achieved.

GOOGL <- drop_na(GOOGL)
adf.test(x=GOOGL$GOOGL.log.Close_diff)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  GOOGL$GOOGL.log.Close_diff
## Dickey-Fuller = -15.25, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary

The p-value from the test is below the 0.05 significance level, indicating that the differenced series is now mean stationary.
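
As a complementary check, the KPSS test (available in the already-loaded tseries package) reverses the hypotheses: its null is stationarity, so a large p-value here would be consistent with the ADF result above. A sketch (output not shown):

# KPSS test on the differenced series: the null hypothesis is level stationarity.
kpss.test(GOOGL$GOOGL.log.Close_diff)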

Assessing seasonality and cyclicity

From a visual inspection, we do not see any evident seasonality or cyclicity. This is expected for stock market data, where seasonality typically has very little impact.

ACF and PACF plots

For any time series modelling, it is necessary to understand the nature of the series, i.e., whether it is an autoregressive (AR) process, a moving average (MA) process, or both, and to determine its order. The ACF and PACF plots provide this intuition.

#par(mfrow = c(1,2))
acf(GOOGL$GOOGL.log.Close_diff, lag.max=20, main="ACF plot of differenced log closing price of Google")

pacf(GOOGL$GOOGL.log.Close_diff, lag.max=20, main="PACF plot of differenced log closing price of Google")

Based on the ACF plot, the differenced series most likely follows a first-order MA process (the ACF cuts off after lag 1).
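
As a quick sanity check on that reading, we can simulate an MA(1) process and compare its ACF with ours; the defining signature is a single significant spike at lag 1 followed by a cut-off. A sketch (the MA coefficient 0.5 is arbitrary; output not shown):

# The ACF of a simulated MA(1) series cuts off after lag 1.
set.seed(42)
sim_ma1 <- arima.sim(model = list(ma = 0.5), n = 1000)
acf(sim_ma1, lag.max = 20, main = "ACF of a simulated MA(1) process")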

ARIMA Modelling and Forecasting

ARIMA, short for ‘AutoRegressive Integrated Moving Average’, is a class of models that ‘explains’ a given time series based on its own past values, i.e., its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.

Any ‘non-seasonal’ time series that exhibits patterns and is not random white noise can be modelled with an ARIMA model. An ARIMA model is characterized by three terms, p, d, and q, where p is the order of the AR term, d is the number of differences required to make the series stationary, and q is the order of the MA term.

From the ACF plot, we had an intuition that the series could be following a first-order MA process after one difference. To verify this, we fit several candidate ARIMA models and compare them using AIC and BIC.

BIC(
    arima(GOOGL$GOOGL.log_close, order = c(0,0,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(0,0,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(0,1,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(0,1,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,0,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,0,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,1,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,1,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(2,1,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(3,1,1), method = 'ML')
  )
##                                                                 df        BIC
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 0), method = "ML")  2   8738.072
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 1), method = "ML")  3   3585.540
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 0), method = "ML")  1 -19691.919
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 1), method = "ML")  2 -19692.731
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 0), method = "ML")  3 -19672.517
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 1), method = "ML")  4 -19673.286
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 0), method = "ML")  2 -19692.813
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 1), method = "ML")  3 -19684.531
## arima(GOOGL$GOOGL.log_close, order = c(2, 1, 1), method = "ML")  4 -19676.370
## arima(GOOGL$GOOGL.log_close, order = c(3, 1, 1), method = "ML")  5 -19668.386

According to the BIC values, ARIMA(1,1,0) (BIC = -19692.81) and ARIMA(0,1,1) (BIC = -19692.73) are essentially tied for best, well ahead of the other candidates.

AIC(arima(GOOGL$GOOGL.log_close, order = c(0,0,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(0,0,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(0,1,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(0,1,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,0,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,0,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,1,0), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(1,1,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(2,1,1), method = 'ML'),
    arima(GOOGL$GOOGL.log_close, order = c(3,1,1), method = 'ML')
)
##                                                                 df        AIC
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 0), method = "ML")  2   8725.582
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 1), method = "ML")  3   3566.804
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 0), method = "ML")  1 -19698.163
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 1), method = "ML")  2 -19705.221
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 0), method = "ML")  3 -19691.252
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 1), method = "ML")  4 -19698.267
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 0), method = "ML")  2 -19705.302
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 1), method = "ML")  3 -19703.265
## arima(GOOGL$GOOGL.log_close, order = c(2, 1, 1), method = "ML")  4 -19701.350
## arima(GOOGL$GOOGL.log_close, order = c(3, 1, 1), method = "ML")  5 -19699.610

AIC paints the same picture: ARIMA(1,1,0) (AIC = -19705.30) and ARIMA(0,1,1) (AIC = -19705.22) are essentially tied. Given the MA(1) signature in the ACF of the differenced series, we proceed with ARIMA(0,1,1), i.e., a first-order MA process on the first-differenced log series.
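
As a cross-check on the manual search above, forecast::auto.arima can search over p, d, and q automatically using the same information criteria. A sketch (output not shown; the selected order may differ slightly depending on the search settings):

# Automatic order selection on the log-transformed series (non-seasonal search).
auto_fit <- forecast::auto.arima(GOOGL$GOOGL.log_close, seasonal = FALSE)
summary(auto_fit)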

Actuals vs Predicted for the best model:

best_model <- arima(GOOGL$GOOGL.log_close, order = c(0,1,1))

resi <- best_model$residuals
predi <- GOOGL$GOOGL.log_close - resi  # fitted values = actual minus residual

ggplot()+
  geom_line(aes(GOOGL$date,GOOGL$GOOGL.log_close)) +
  geom_line(aes(GOOGL$date,predi),color='blue',alpha=0.4) +
  theme_bw() +
  xlab("Date") + ylab("Log(Close price)") +
  ggtitle("In-sample prediction for the best model")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

Box-Ljung test

This test checks whether the autocorrelations of the residuals are jointly zero (the null hypothesis); a small p-value indicates that significant autocorrelation remains. As the output below shows, the p-values are below 0.05, so the ARIMA(0,1,1) residuals still contain some autocorrelation, i.e., the simple model does not capture all of the structure in the series.

forecast::checkresiduals(best_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 23.509, df = 9, p-value = 0.005149
## 
## Model df: 1.   Total lags used: 10
acf(resi,lag.max=20, main="ACF plot of Residuals")

pacf(resi,lag.max=20, main="PACF plot of Residuals")

Box.test(resi,type='Ljung-Box',lag=20)
## 
##  Box-Ljung test
## 
## data:  resi
## X-squared = 40.217, df = 20, p-value = 0.004689

RMSE

RMSE = sqrt(mean((predi - GOOGL$GOOGL.log_close)^2,na.rm=T))
RMSE
## [1] 0.01818918
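
Note that this RMSE is computed on the log scale, so a value of about 0.018 corresponds roughly to a typical one-step error of 1.8%. To express the in-sample error in dollars instead, one could back-transform the fitted values first. A sketch (the variable names are ours):

# Back-transform the fitted values to the original price scale
# (ignoring any retransformation bias) and recompute error metrics there.
pred_dollars  <- exp(predi)
RMSE_dollars  <- sqrt(mean((pred_dollars - GOOGL$GOOGL.Close)^2, na.rm = TRUE))
MAPE_insample <- mean(abs(pred_dollars - GOOGL$GOOGL.Close) / GOOGL$GOOGL.Close, na.rm = TRUE)
c(RMSE = RMSE_dollars, MAPE = MAPE_insample)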

Forecast using ARIMA

We now generate a 5-step-ahead forecast from the best model.

prediction = predict(best_model, n.ahead = 5)
for(i in 1:5){
  print(paste("Prediction for t +",i," : ",prediction$pred[i]))
}
## [1] "Prediction for t + 1  :  7.88445752623601"
## [1] "Prediction for t + 2  :  7.88445752623601"
## [1] "Prediction for t + 3  :  7.88445752623601"
## [1] "Prediction for t + 4  :  7.88445752623601"
## [1] "Prediction for t + 5  :  7.88445752623601"

We see that with the best model, ARIMA(0,1,1), the point forecasts are identical for t+1, t+2, and so on. This is expected rather than a bug: in an ARIMA(0,1,1) model the MA term only influences the one-step-ahead forecast, so all further point forecasts stay flat at that value, while the prediction intervals (the standard errors below) keep widening with the horizon.
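
To see the point forecasts together with their widening uncertainty, the forecast package (already loaded) can be applied to the same fitted model. A sketch (output not shown):

# 5-step-ahead forecast with 80% and 95% prediction intervals.
fc <- forecast::forecast(best_model, h = 5)
fc
autoplot(fc)   # the point forecast is flat, but the intervals fan out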

pred_data = data.frame(
  Forecast=prediction,
  date = (max(GOOGL$date)+1):(max(GOOGL$date)+1+4)
)
pred_data$date <- as.Date(pred_data$date)
head(pred_data)
##   Forecast.pred Forecast.se       date
## 1      7.884458  0.01819135 2022-02-18
## 2      7.884458  0.02510950 2022-02-19
## 3      7.884458  0.03049670 2022-02-20
## 4      7.884458  0.03506580 2022-02-21
## 5      7.884458  0.03910463 2022-02-22
pred_merge = GOOGL %>% 
  full_join(pred_data) %>%
  mutate(
    pred_high = (Forecast.pred + 2*Forecast.se),
    pred_low = (Forecast.pred - 2*Forecast.se),
    pred.pred = Forecast.pred,
    error = Forecast.pred - GOOGL.log_close
    )

Now, we plot the forecasted values for the time series. A date filter has been applied so that the blue forecast ribbon is clearly visible.

pred_merge %>%
  filter(date > '2021-01-01') %>%
  ggplot()+
  geom_line(aes(date,GOOGL.log_close))+
  geom_line(aes(date,Forecast.pred),color='blue')+
  geom_ribbon(aes(x = date, ymin = pred_low, ymax = pred_high), fill='blue', alpha=0.2) +
  ylab("Log(Close price)") +
  xlab("Date") +
  ggtitle("Prediction of log of Google stock price")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

Note that this is still a forecast of the log-transformed closing price. We need to transform it back to the original scale to look at the forecast in dollars.

pred_merge %>%
  filter(date > '2021-01-01') %>%
  ggplot()+
  geom_line(aes(date,GOOGL.Close))+
  geom_line(aes(date,exp(Forecast.pred)),color='blue')+
  geom_ribbon(aes(x = date, ymin = exp(pred_low), ymax = exp(pred_high)), fill='blue', alpha=0.2) +
  ylab("Close price") +
  xlab("Date") +
  ggtitle("Prediction of Google stock price")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

Prophet Model Forecasting

After modelling the time series with ARIMA, we now model the same series using Facebook's Prophet. Prophet is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data; however, we don't expect stock market data to show strong seasonal or holiday effects.

Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. A major difference from ARIMA models is that Prophet does not require the input time series to be mean or variance stationary.

We shall use the data before January 1st, 2021 as the training set and the data from January 1st, 2021 onwards (through mid-February 2022) as the test set.

GOOGL_prophet_data = GOOGL %>% 
    rename(ds = date, # Have to name our date variable "ds"
    y = GOOGL.Close) %>%
  dplyr::select(ds,y)


train_prophet = GOOGL_prophet_data %>% 
  filter(ds < ymd("2021-01-01"))

test_prophet = GOOGL_prophet_data %>% 
  filter(ds >= ymd("2021-01-01"))


model = prophet(train_prophet)
future_pred = make_future_dataframe(model, periods = 365)

forecast = predict(model,future_pred)

plot(model,forecast)+
ylab("Closing Price of Google")+xlab("Date")+
  ggtitle("GOOGL price Forecast using Prophet Model") +
  theme_bw()+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

The below chart presents an interactive version of the forecast plot. Hover over the chart to check for values. Use the slider to adjust time period.

dyplot.prophet(model,forecast)

Decomposition

Plotting the individual components of the time series, we see that there is no day-of-year panel, indicating no detectable seasonality at that level.

The day-of-week panel dips on Saturday and Sunday. This makes sense because stock markets are closed on weekends, so we have systematically missing data on those two days of the week.

prophet_plot_components(model,forecast)

Changepoints

Most real time series have occasional abrupt changes in their trajectories, referred to as changepoints. Identifying them is important for accurately modelling changes in trend. By default, Prophet detects changepoints automatically, but this behaviour can be controlled through function arguments.

By default, changepoints are only inferred for the first 80% of the data in order to prevent overfitting.

plot(model,forecast)+add_changepoints_to_plot(model)+
  xlab("Date")+
  ylab("Close price of Google")+
  ggtitle("Default Changepoints in the forecasted time series") +
  theme_bw()+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

Now, we manually increase the number of potential changepoints by setting the n.changepoints argument of the prophet function to 50.

model = prophet(train_prophet, n.changepoints=50)
forecast = predict(model,future_pred)

plot(model,forecast)+add_changepoints_to_plot(model) +
  xlab("Date")+
  ylab("Close price of Google")+
  ggtitle("Forecasted time series with 50 changepoints") +
  theme_bw()+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

We see that there are more changepoints detected now.
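
Beyond the number of candidate changepoints, Prophet also exposes how much of the history is searched and how flexible the trend is allowed to be. A sketch using the changepoint.range and changepoint.prior.scale arguments (the specific values are arbitrary; output not shown):

# Search the first 95% of the history for changepoints and allow a more flexible trend.
model_cp <- prophet(train_prophet,
                    changepoint.range = 0.95,
                    changepoint.prior.scale = 0.1)
forecast_cp <- predict(model_cp, future_pred)
plot(model_cp, forecast_cp) + add_changepoints_to_plot(model_cp)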

Additive and Multiplicative Seasonality

There are two ways of combining time series components: additive and multiplicative. In an additive model the components are summed, which suits series whose seasonal fluctuations stay roughly constant in magnitude. In a multiplicative model the components are multiplied, which suits series whose seasonal fluctuations grow with the level of the series.

The default is additive, so the plots we saw earlier used additive seasonality. Below we explore multiplicative seasonality by setting seasonality.mode to ‘multiplicative’.

#Multiplicative seasonality

mul_model = prophet(train_prophet, seasonality.mode = 'multiplicative')
future_pred = make_future_dataframe(mul_model, periods = 365)

forecast = predict(mul_model, future_pred)

plot(mul_model,forecast)+
ylab("Closing Price of Google")+xlab("Date")+
  ggtitle("GOOGL price Forecast using Prophet Model (Multiplicative)") +
  theme_bw()+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5))

prophet_plot_components(mul_model, forecast)

Holidays

Another interesting and useful difference between ARIMA and Prophet is that we can include holiday information in Prophet models. In the code excerpt below, we use add_country_holidays to include US holidays in the model.

model = prophet(train_prophet,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train_prophet)

forecast = predict(model,future_pred)
prophet_plot_components(model,forecast)

Further examining holidays, we can see the impact of major holidays -

forecast %>%
  filter(holidays != 0) %>%
  dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
  mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
  summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
  pivot_longer(
    cols = `Christmas Day`:`Washington's Birthday`,
    names_to = 'holiday', 
    values_to = 'effect'
  ) %>%
ggplot() +
  geom_col(aes(effect,holiday))+
  theme_bw()
## Warning in max(., na.rm = T): no non-missing arguments to max; returning -Inf
## (warning repeated 10 times)
## Warning: Removed 10 rows containing missing values (geom_col).

We see a positive effect around New Year's Day and Columbus Day and a negative effect around Veterans Day. However, we should not read too much into holiday effects for stock market data; for datasets such as product demand, understanding the impact of holidays would be far more critical and insightful.

Evaluating out-of-sample performance

#Assessing Prophet Models

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2020-01-01"))
RMSE = sqrt(mean((test_prophet$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test_prophet$y - forecast_metric_data$yhat))
MAPE = mean(abs((test_prophet$y - forecast_metric_data$yhat)/test_prophet$y))

print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 992.69"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 922.91"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.36"

A MAPE of roughly 36% suggests that the model struggles on the held-out period, largely because the sharp post-2020 run-up lies outside the patterns seen in training. Note also that the actual and predicted vectors above are aligned by position rather than by date, so these figures should be treated as rough; a safer, date-aligned computation is sketched below.
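
A safer pattern is to join the forecast to the test actuals on the date before computing the metrics, so that only matching trading days are compared. A sketch (eval_df is a name we introduce; output not shown):

# Align predictions and actuals by date, then compute out-of-sample metrics.
eval_df <- forecast %>%
  mutate(ds = as.Date(ds)) %>%
  inner_join(test_prophet %>% mutate(ds = as.Date(ds)), by = "ds")

with(eval_df, c(
  RMSE = sqrt(mean((y - yhat)^2)),
  MAE  = mean(abs(y - yhat)),
  MAPE = mean(abs((y - yhat) / y))
))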

Rolling window cross-validation

#Cross-validation with Prophet
df.cv <- cross_validation(model, initial = 730, period = 300, horizon = 300, units = 'days')
## Making 14 forecasts with cutoffs between 2009-07-02 and 2020-03-06
head(df.cv)
##          y         ds     yhat yhat_lower yhat_upper     cutoff
## 1 205.0100 2009-07-06 213.4323   201.7635   224.6946 2009-07-02
## 2 198.5135 2009-07-07 214.3315   202.7911   225.5828 2009-07-02
## 3 201.4464 2009-07-08 214.8179   203.6980   226.6125 2009-07-02
## 4 205.4004 2009-07-09 214.5177   202.7274   225.6356 2009-07-02
## 5 207.4074 2009-07-10 214.6672   202.5218   225.7187 2009-07-02
## 6 212.3624 2009-07-13 211.4312   199.9079   223.1120 2009-07-02

The procedure produced 14 forecasts, with an initial training period of 730 days, a period of 300 days between cutoff dates, and a forecast horizon of 300 days.

unique(df.cv$cutoff)
##  [1] "2009-07-02 GMT" "2010-04-28 GMT" "2011-02-22 GMT" "2011-12-19 GMT"
##  [5] "2012-10-14 GMT" "2013-08-10 GMT" "2014-06-06 GMT" "2015-04-02 GMT"
##  [9] "2016-01-27 GMT" "2016-11-22 GMT" "2017-09-18 GMT" "2018-07-15 GMT"
## [13] "2019-05-11 GMT" "2020-03-06 GMT"
#cross-validation Actual vs Predicted
df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Amazon Closing Price")+
  scale_color_discrete(name = 'Cutoff')

#Cross Validation RMSE metric plot
plot_cross_validation_metric(df.cv, metric = 'rmse')

The model's RMSE increases with the horizon. This seems reasonable: the model predicts the near future better than the more distant future.
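
To quantify this, Prophet's performance_metrics function summarizes the cross-validation errors as a function of the horizon. A sketch (output not shown):

# Tabulate RMSE, MAE, MAPE, etc. by forecast horizon from the CV results.
cv_metrics <- performance_metrics(df.cv, rolling_window = 0.1)
head(cv_metrics)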