Stock markets are where individual and institutional investors come together to buy and sell shares of publicly listed companies. With the digital ecosystem evolving rapidly over the last decade, most of today's exchanges are electronic, which means this domain offers abundant and accurate publicly available data. Every publicly listed company has an instantaneous stock price associated with it during trading hours. The stock price is driven by supply and demand: these forces determine the price of each security, i.e., the levels at which market participants, investors, or traders are willing to buy or sell.
An important question here is: if accurate and abundant historical and real-time data exist, why can't the stock price be predicted accurately? If that were possible, data scientists, and not businessmen, would be the world's wealthiest people. The answer lies in the interaction between supply and demand mentioned earlier. Although these are two simple terms, they are influenced by hundreds, perhaps thousands, of microeconomic and macroeconomic factors, along with hundreds of factors within the company itself. This makes the stock market very unpredictable and one of the most complex domains for forecasting. For example, would forecasting models built in early 2019 have predicted the stock market crash caused by COVID-19 in 2020?
I have been interested in the stock market since I was 16. The fascination was around how the marketplaces functioned. The cascading effect of the collapse of an economic superpower like the US on all other global markets is fascinating. I started investing as soon as I got my first job. The motivation to choose stock price prediction for forecasting is a combination of my interest in the stock market and the challenge in getting accurate predictions. I love challenges!
The data that I have chosen for analysis is the stock price of Alphabet Inc., which goes by the ticker symbol 'GOOGL'. We shall source the publicly available data from Yahoo Finance using the R library 'quantmod'. It is ironic that we are using Yahoo Finance to analyze the GOOGL stock price :) Regardless, the data is accurate and comes from a trustworthy source.
Out of a plethora of publicly listed companies, I chose to analyze GOOGL stock price for the following reasons -
Another important aspect of a time series is the frequency of the data. Stock market data can be obtained at a yearly, monthly, weekly, daily, hourly, minute, or perhaps even second level. For our analysis, we will be looking at the closing stock price at a daily level since Jan-2007.
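As a side note, the start date and frequency can be made explicit when pulling the data. The sketch below is an illustration only, not the download used later in the analysis; the default start date in quantmod's getSymbols() is 2007-01-01, which is presumably why our series begins on the first trading day of 2007.
#Illustrative sketch only: request daily GOOGL data with an explicit start date.
#The actual download used in this analysis appears in the next section.
library(quantmod)
getSymbols("GOOGL", src = "yahoo", from = "2007-01-01", to = Sys.Date())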
Although Yahoo Finance sources its data directly from the exchanges, the data generating process for stock prices is not simple. In simple terms, the exchange computes the stock price based on the number of participants willing to buy and sell the stock in both the delivery and options markets.
In this section, we shall explore the different elements of the time series. We shall check the completeness of the data and look for potential data quality issues such as missing or erroneous values. We end this section by visualizing the data as a time series and making some notable observations.
Throughout the analysis, we shall be using the following libraries -
library(quantmod)
library(dplyr)
library(lubridate)
library(tidyverse)
library(vtable)
library(sjPlot)
library(MASS)
library(tseries)
library(forecast)
library(prophet)
The dataset is obtained from the publicly available data on Yahoo Finance using the R library 'quantmod'. The getSymbols() function from the library is used to load the data.
getSymbols("GOOGL", src = "yahoo")
## [1] "GOOGL"
GOOGL <- dplyr::select(data.frame(date = index(GOOGL), coredata(GOOGL)), c('date','GOOGL.Close'))
head(GOOGL)
## date GOOGL.Close
## 1 2007-01-03 234.0290
## 2 2007-01-04 241.8719
## 3 2007-01-05 243.8388
## 4 2007-01-08 242.0320
## 5 2007-01-09 242.9930
## 6 2007-01-10 244.9750
We start with an introductory exploration looking at the length and breadth of our data. Then, we look for any missing values and inconsistencies. We then look at the summary of the dataset to get an overall picture of the spread of data.
#No. of observations
nrow(GOOGL)
## [1] 3815
There are in total 3815 dates for which we have the stock closing price.
#Check for missing dates and Nulls
nrow(GOOGL) == length(unique(GOOGL$date))
## [1] TRUE
sum(is.null(GOOGL$GOOGL.Close))
## [1] 0
There are no duplicate dates, and there are no nulls in the stock price. As noted earlier, stock market data is usually complete and accurate if it comes from the right source :)
#Range of data
range(GOOGL$date, na.rm = TRUE)
## [1] "2007-01-03" "2022-02-25"
The dataset runs from January 3rd, 2007, until February 25th, 2022, i.e., it is up to date as of the time of this analysis. However, if we look at the dates carefully, we notice that data is consistently missing for every Saturday and Sunday. This is expected in stock market data since the markets are closed on those days, so it is safe to ignore this particular kind of missingness.
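As a quick sanity check (not part of the original code), we can confirm that only weekdays appear among the dates:
#Sanity check (illustrative): count observations by day of week.
#Saturday and Sunday should be absent because the markets are closed.
table(weekdays(GOOGL$date))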
#Data Summary
sumtable(GOOGL, add.median=TRUE)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| GOOGL.Close | 3815 | 767.137 | 642.772 | 128.849 | 287.878 | 550.04 | 1077.77 | 2996.77 |
Looking at the summary of the stock price, we see that during this period the closing price ranged from a minimum of about $129 to a maximum of about $2,997, with a mean of roughly $767 sitting well above the median of $550, reflecting the strong growth in the later years.
#line chart
ggplot(data = GOOGL, aes(x = date, y = GOOGL.Close )) +
geom_line() +
xlab("Date") + ylab("Closing Price of Google stock") +
ggtitle("Google stock price trend over time")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The line chart clearly shows the trend of the stock over time. A few observations we can make from the chart: the price shows a clear upward trend over the period, the size of its fluctuations grows as the price level rises, and there is a sharp but short-lived drop around the COVID-19 crash in early 2020.
Intuitively, stationarity means that the statistical properties of the process generating a time series do not change over time. There are two properties that we generally check when assessing the stationarity of a time series - mean stationarity and variance stationarity.
Stationarity is important because stationary processes are easier to analyze and model. Due to this, stationarity has become a common assumption for many practices and tools in time series analysis, including the ARIMA modelling.
For a variance stationary time series, the variance in data is near constant at different instances of time. For a mean stationary time series, the data needs to be mean reverting.
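To make these two properties concrete, one simple illustrative check (not part of the original analysis) is to compute the mean and standard deviation of the closing price year by year; for a stationary series, both columns would stay roughly flat over time.
#Illustrative check: yearly mean and standard deviation of the closing price.
#Large drift in either column is an informal sign of mean/variance non-stationarity.
GOOGL %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarise(mean_close = mean(GOOGL.Close), sd_close = sd(GOOGL.Close))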
#line chart
ggplot(data = GOOGL, aes(x = date, y = GOOGL.Close )) +
geom_line() +
xlab("Date") + ylab("Closing Price of Google stock") +
ggtitle("Google stock price trend over time")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
Based on the plot above, we can see that the time series is not variance stationary, i.e., its variance is not constant across periods. Further, the clear positive trend, with no reversion to a constant value, indicates that the time series is not mean stationary either. Both forms of non-stationarity are undesirable for straightforward time series modeling.
To further confirm that the series is variance non-stationary, we compare the original time series with transformed versions of the data, using a log transformation and a Box-Cox transformation. To check for mean stationarity, we use the popular Augmented Dickey-Fuller test.
Log transformation
#line chart after log transformation
GOOGL$GOOGL.log_close <- log(GOOGL$GOOGL.Close)
ggplot(data = GOOGL, aes(x = date, y = GOOGL.log_close)) +
geom_line() +
xlab("Date") + ylab("Log(Closing Price)") +
ggtitle("Log transformed closing price of google over time")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
We can notice that the variance appears more constant after the transformation compared to the original data.
Box-Cox transformation
#line chart after box cox transformation
bc <- boxcox(GOOGL$GOOGL.Close ~ GOOGL$date)      #profile the log-likelihood over a grid of lambda values
lambda <- bc$x[which.max(bc$y)]                   #lambda that maximizes the log-likelihood
GOOGL$GOOGL.bc_close <- GOOGL$GOOGL.Close ^ lambda #apply the power transformation
ggplot(data = GOOGL, aes(x = date, y = GOOGL.bc_close)) +
geom_line() +
xlab("Date") + ylab("Transformed Closing Price of Google stock") +
ggtitle("Google stock price trend over time after Box-Cox transformation")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
Although the fluctuations in variance are smaller after the transformation, we can see that the trend has reversed (the estimated lambda is negative, so the raw power transform flips the ordering). For this reason, we shall proceed with the log transformation.
Mean Stationarity
The Augmented Dickey-Fuller test (ADF test) is a standard statistical test used to check whether a given time series is stationary. It is one of the most commonly used tests for analyzing the stationarity of a series and is based on the presence of a unit root, which indicates that the time series is non-stationary.
The null hypothesis in the ADF test is that the coefficient on the first lag of Y is equal to 1, i.e., that a unit root is present. When the test statistic is lower (more negative) than the critical value - equivalently, when the p-value falls below the chosen significance level - we reject the null hypothesis and infer that the time series is stationary.
We perform the ADF test on the original closing price, the log-transformed closing price, and the Box-Cox transformed closing price.
GOOGL <- drop_na(GOOGL)
adf.test(x=GOOGL$GOOGL.Close)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.Close
## Dickey-Fuller = 0.081803, Lag order = 15, p-value = 0.99
## alternative hypothesis: stationary
adf.test(x=GOOGL$GOOGL.log_close)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.log_close
## Dickey-Fuller = -2.8746, Lag order = 15, p-value = 0.208
## alternative hypothesis: stationary
adf.test(x=GOOGL$GOOGL.bc_close)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.bc_close
## Dickey-Fuller = -3.2656, Lag order = 15, p-value = 0.07657
## alternative hypothesis: stationary
Looking at the p-values, although they decrease dramatically after the transformations, we still cannot confidently reject the null hypothesis at the 95% confidence level.
To resolve the mean non-stationarity, we compute the first-order difference of the log-transformed closing price.
#Mean Stationary
GOOGL$GOOGL.log.Close_diff = GOOGL$GOOGL.log_close - lag(GOOGL$GOOGL.log_close)
ggplot(data = GOOGL, aes(x = date, y = GOOGL.log.Close_diff)) +
geom_line() +
xlab("Date") + ylab("First order Lag") +
ggtitle("Time series of first order lag for log transformed Google stock price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The graph above shows good mean-stationary behaviour in the first-order differences. However, there are also outliers that deviate far more than most points. We now perform the ADF test on the differenced series to see whether we have achieved mean stationarity.
GOOGL <- drop_na(GOOGL)
adf.test(x=GOOGL$GOOGL.log.Close_diff)
##
## Augmented Dickey-Fuller Test
##
## data: GOOGL$GOOGL.log.Close_diff
## Dickey-Fuller = -15.288, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
Clearly, the p-value from the test is below the 0.05 significance level, indicating that the differenced series is now mean stationary.
We do not see any evident seasonality or cyclicity from a visual inspection. This is expected for stock market data, where seasonality has significantly less impact.
For any time series modeling, it is necessary to know the nature of the time series - i.e., whether it is an autoregressive process, a moving average process, or both - and to determine the order. The ACF and PACF plots provide an intuition for this.
#par(mfrow = c(1,2))
acf(GOOGL$GOOGL.log.Close_diff, lag.max=20, main="ACF plot of differenced log closing price of Google")
pacf(GOOGL$GOOGL.log.Close_diff, lag.max=20, main="PACF plot of differenced log closing price of Google")
Based on the ACF plot, the differenced data most likely follows a first-order MA process.
ARIMA, short for ‘Auto-Regressive Integrated Moving Average’, is a class of models that ‘explains’ a given time series based on its past values, lags, and lagged forecast errors so that the equation can be used to forecast future values.
Any ‘non-seasonal’ time series that exhibits patterns and is not a random white noise can be modeled with ARIMA models. Three terms characterize an ARIMA model: p, d, q, where - p is the order of the AR term, q is the order of the MA term, and d is the number of differences required to make the time series stationary.
From the ACF plot, we suspected that the differenced series could follow a first-order MA process. However, we shall build several ARIMA models to verify this, using AIC and BIC to evaluate model performance.
BIC(
arima(GOOGL$GOOGL.log_close, order = c(0,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(2,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(3,1,1), method = 'ML')
)
## df BIC
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 0), method = "ML") 2 8764.686
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 1), method = "ML") 3 3605.165
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 0), method = "ML") 1 -19715.861
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 1), method = "ML") 2 -19716.463
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 0), method = "ML") 3 -19696.440
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 1), method = "ML") 4 -19697.002
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 0), method = "ML") 2 -19716.541
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 1), method = "ML") 3 -19708.259
## arima(GOOGL$GOOGL.log_close, order = c(2, 1, 1), method = "ML") 4 -19700.095
## arima(GOOGL$GOOGL.log_close, order = c(3, 1, 1), method = "ML") 5 -19692.122
According to the BIC values, ARIMA(0,1,1) and ARIMA(1,1,0) are essentially tied for the best model (BIC ≈ -19716.5). Consistent with the MA(1) signature in the ACF, we take order = c(0,1,1) forward.
AIC(arima(GOOGL$GOOGL.log_close, order = c(0,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(0,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,0,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,0), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(1,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(2,1,1), method = 'ML'),
arima(GOOGL$GOOGL.log_close, order = c(3,1,1), method = 'ML')
)
## df AIC
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 0), method = "ML") 2 8752.193
## arima(GOOGL$GOOGL.log_close, order = c(0, 0, 1), method = "ML") 3 3586.426
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 0), method = "ML") 1 -19722.107
## arima(GOOGL$GOOGL.log_close, order = c(0, 1, 1), method = "ML") 2 -19728.955
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 0), method = "ML") 3 -19715.179
## arima(GOOGL$GOOGL.log_close, order = c(1, 0, 1), method = "ML") 4 -19721.988
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 0), method = "ML") 2 -19729.033
## arima(GOOGL$GOOGL.log_close, order = c(1, 1, 1), method = "ML") 3 -19726.998
## arima(GOOGL$GOOGL.log_close, order = c(2, 1, 1), method = "ML") 4 -19725.079
## arima(GOOGL$GOOGL.log_close, order = c(3, 1, 1), method = "ML") 5 -19723.353
According to AIC, the picture is the same: ARIMA(0,1,1), a first-order MA process with one order of differencing, is essentially tied with ARIMA(1,1,0) for the lowest AIC (≈ -19729). This supports modeling the differenced series as a first-order MA process, so we proceed with ARIMA(0,1,1) as the best model.
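As an additional cross-check (not part of the original model selection), the forecast package can search over the (p, d, q) orders automatically. Under its default settings it should select a low-order model close to the ones above, although the exact order it returns may differ slightly.
#Illustrative cross-check: automatic order selection on the log-transformed close.
auto_fit <- forecast::auto.arima(GOOGL$GOOGL.log_close, seasonal = FALSE)
summary(auto_fit)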
**Actuals vs Predicted for the best model:**
best_model <- arima(GOOGL$GOOGL.log_close, order = c(0,1,1))
resi <- best_model$residuals
predi <- GOOGL$GOOGL.log_close - resi   #fitted values = observed - residuals
ggplot()+
geom_line(aes(GOOGL$date,GOOGL$GOOGL.log_close)) +
geom_line(aes(GOOGL$date,predi),color='blue',alpha=0.4) +
theme_bw() +
xlab("Date") + ylab("Log(Close price)") +
ggtitle("In-sample prediction for the best model")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
**Analysis of residuals: Box-Ljung test**
This test checks whether the autocorrelations of the residuals are jointly zero, i.e., whether the residuals behave like white noise. A small p-value (below 0.05) indicates that significant autocorrelation remains in the residuals.
forecast::checkresiduals(best_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 23.564, df = 9, p-value = 0.005047
##
## Model df: 1. Total lags used: 10
acf(resi,lag.max=20, main="ACF plot of Residuals")
pacf(resi,lag.max=20, main="PACF plot of Residuals")
Box.test(resi,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resi
## X-squared = 39.447, df = 20, p-value = 0.005863
**In-sample performance**
#Back-transform from the log scale with exp() before computing error metrics
RMSE = sqrt(mean((exp(predi) - exp(GOOGL$GOOGL.log_close))^2, na.rm = TRUE))
MAE = mean(abs(exp(predi) - exp(GOOGL$GOOGL.log_close)), na.rm = TRUE)
MAPE = mean(abs((exp(predi) - exp(GOOGL$GOOGL.log_close)) / exp(GOOGL$GOOGL.log_close)), na.rm = TRUE)
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 16.8"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 8.96"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.01"
Train Test Split
train = GOOGL %>%
filter(GOOGL$date<'2020-01-01')
test = GOOGL %>%
filter(GOOGL$date>='2020-01-01')
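The step that fits the model on the training window and builds pred_data2 is not shown above; the sketch below is a presumed reconstruction, assuming the same ARIMA(0,1,1) specification refit on the training data and a 30-step-ahead forecast.
#Presumed reconstruction (not shown in the original): refit ARIMA(0,1,1) on the
#training window and forecast 30 steps ahead. pred_data2 holds the point
#forecasts (Forecast.pred) and standard errors (Forecast.se) used below.
train_model <- arima(train$GOOGL.log_close, order = c(0,1,1), method = 'ML')
prediction2 <- predict(train_model, n.ahead = 30)
pred_data2 <- data.frame(Forecast = prediction2)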
Results on OOS Test data:
#Back-transform from the log scale with exp() before computing error metrics
RMSE = sqrt(mean((exp(test$GOOGL.log_close[1:30]) - exp(pred_data2$Forecast.pred))^2))
MAE = mean(abs(exp(test$GOOGL.log_close[1:30]) - exp(pred_data2$Forecast.pred)))
MAPE = mean(abs((exp(test$GOOGL.log_close[1:30]) - exp(pred_data2$Forecast.pred)) / exp(test$GOOGL.log_close[1:30])))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 118.7"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 111.82"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.08"
We see a very high RMSE compared to the in-sample figures. This is primarily because the model predicts an almost straight line throughout the period and is unable to capture the variation.
We now produce a 30-period-ahead forecast from the best model.
prediction = predict(best_model, n.ahead = 30)
pred_data = data.frame(
  Forecast = prediction,
  date = seq(max(GOOGL$date) + 1, by = "day", length.out = 30)  #next 30 calendar days (includes weekends)
)
head(pred_data)
## Forecast.pred Forecast.se date
## 1 7.896272 0.01819646 2022-02-26
## 2 7.896272 0.02512423 2022-02-27
## 3 7.896272 0.03051791 2022-02-28
## 4 7.896272 0.03509214 2022-03-01
## 5 7.896272 0.03913533 2022-03-02
## 6 7.896272 0.04279825 2022-03-03
Now, we shall plot the forecasted values for the time series. The date filter has been applied so that the constructed blue ribbon can be seen clearly.
pred_merge = GOOGL %>%
full_join(pred_data) %>%
mutate(
pred_high = (Forecast.pred + 2*Forecast.se),
pred_low = (Forecast.pred - 2*Forecast.se),
pred.pred = Forecast.pred,
error = Forecast.pred - GOOGL.log_close
)
pred_merge %>%
filter(date > '2021-01-01') %>%
ggplot()+
geom_line(aes(date,GOOGL.log_close))+
geom_line(aes(date,Forecast.pred),color='blue')+
geom_ribbon(aes(x = date, ymin = pred_low, ymax = pred_high), fill='blue', alpha=0.2) +
ylab("Log(Close price)") +
xlab("Date") +
ggtitle("Prediction of log of Google stock price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
We see here that the best model, with parameters c(0,1,1), produces the same predicted value for t+1, t+2, and so on. This is expected for an ARIMA(0,1,1): the MA(1) term only adjusts the first step ahead, after which the forecast stays flat at that level. Sadly, this model is not good enough to predict a horizon of 30 days into the future; the almost straight predicted line shows how the model fails to capture the data generating process and the variation in the data.
Note that this is still a forecast of the log-transformed close price. We need to transform back to the original scale and look at the forecast.
pred_merge %>%
filter(date > '2021-01-01') %>%
ggplot()+
geom_line(aes(date,GOOGL.Close))+
geom_line(aes(date,exp(Forecast.pred)),color='blue')+
geom_ribbon(aes(x = date, ymin = exp(pred_low), ymax = exp(pred_high)), fill='blue', alpha=0.2) +  #back-transform the interval as well
ylab("Close price") +
xlab("Date") +
ggtitle("Prediction of Google stock price")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
After modeling the time series using the ARIMA model, we now focus on modeling the same time series using the Facebook Prophet model. It is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series with strong seasonal effects and several seasons of historical data. However, we don’t expect the stock market data to have a seasonal or holiday effect.
Prophet is robust to missing data and shifts in the trend, and it typically handles outliers well. A significant difference compared to ARIMA models is that the Prophet model does not require the input time series to be mean and variance stationary.
We shall be considering data before 1-Jan-2021 as the training dataset and data from 1-Jan-2021 onwards (through the end of the available series) as the testing dataset.
GOOGL_prophet_data = GOOGL %>%
rename(ds = date, # Have to name our date variable "ds"
y = GOOGL.Close) %>%
dplyr::select(ds,y)
train_prophet = GOOGL_prophet_data %>%
filter(ds < ymd("2021-01-01"))
test_prophet = GOOGL_prophet_data %>%
filter(ds >= ymd("2021-01-01"))
model = prophet(train_prophet)
future_pred = make_future_dataframe(model, periods = 365)
forecast = predict(model,future_pred)
plot(model,forecast)+
ylab("Closing Price of Google")+xlab("Date")+
ggtitle("GOOGL price Forecast using Prophet Model") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
The chart below presents an interactive version of the forecast plot. Hover over the chart to check values, and use the slider to adjust the time period.
dyplot.prophet(model,forecast)
Since we are modeling the GOOGL closing price, the series does not have a natural maximum. However, Prophet's logistic growth mode requires a carrying capacity, so we set a (somewhat arbitrary) cap of 2500; we also specify a floor of zero, since closing prices cannot be negative.
# Set floor in train data
train_prophet$floor = 0
train_prophet$cap = 2500
# Set floor in forecast data
future_pred$floor = 0
future_pred$cap = 2500
sat_model = prophet(train_prophet, growth='logistic')
sat_forecast = predict(sat_model,future_pred)
plot(sat_model,sat_forecast)+
theme_bw()+ xlab("Date")+ylab("GOOGL Closing Price")
Plotting the individual components of the time series, we see that no yearly (day-of-year) component appears, indicating no seasonality at that level.
The day-of-week component has lows on Saturday and Sunday. This makes sense because stock markets are closed on Saturday and Sunday, so we have systematically missing data on these two days of the week, as discussed before.
prophet_plot_components(model,forecast)
Most real time series have abrupt changes in their trajectories, referred to as changepoints. Identifying them can be important for accurately modelling changes in trend. By default, Prophet identifies the changepoints itself, but we can also control this behaviour through its arguments.
By default, changepoints are only inferred for the first 80% of the data in order to prevent overfitting.
plot(model,forecast)+add_changepoints_to_plot(model)+
xlab("Date")+
ylab("Close price of Google")+
ggtitle("Default Changepoints in the forecasted time series") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
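For reference (not used in this analysis), the fraction of history over which changepoints are placed can be widened from the default 80% via the changepoint.range argument; a minimal sketch, assuming the same training data:
#Illustrative: place potential changepoints over the first 95% of the training
#history instead of the default 80%.
model_cp95 = prophet(train_prophet, changepoint.range = 0.95)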
Next, we manually increase the number of potential changepoints by setting the n.changepoints argument of the prophet function to 50.
model = prophet(train_prophet, n.changepoints=50)
forecast = predict(model,future_pred)
plot(model,forecast)+add_changepoints_to_plot(model) +
xlab("Date")+
ylab("Close price of Google")+
ggtitle("Forecasted time series with 50 changepoints") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
We see that there are more changepoints detected now.
There are two ways of combining time series components - additive and multiplicative. In an additive model, the individual components are added together, which typically suits a linear trend/seasonality. In a multiplicative model, the individual components are multiplied together, which is generally used for a non-linear trend/seasonality.
The default is additive, so the plots we saw earlier correspond to additive seasonality. Below, we therefore explore only multiplicative seasonality by setting seasonality.mode to 'multiplicative'.
mul_model = prophet(train_prophet, seasonality.mode = 'multiplicative')
future_pred = make_future_dataframe(mul_model, periods = 365)
forecast = predict(mul_model,future_pred)
plot(mul_model,forecast)+
ylab("Closing Price of Google")+xlab("Date")+
ggtitle("GOOGL price Forecast using Prophet Model (Multiplicative)") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.title = element_text(hjust = 0.5))
prophet_plot_components(mul_model, forecast)
Another interesting and useful difference between ARIMA and Prophet is that we can include holiday information in Prophet models. However, we do not include it in this analysis of the GOOGL stock because of the nature of its data generation, i.e., stock market data is missing on important holidays anyway since the markets are closed. For datasets like product demand, understanding the impact of holidays can be far more critical and insightful.
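For completeness, the sketch below shows one hedged way holiday effects could be included via Prophet's built-in country holidays; it is illustrative only and not used for the GOOGL analysis.
#Illustrative only: a Prophet model with US holiday effects.
#Not used for GOOGL, since the market is closed on major holidays anyway.
holiday_model <- prophet()                                   #create an unfitted model
holiday_model <- add_country_holidays(holiday_model, country_name = 'US')
holiday_model <- fit.prophet(holiday_model, train_prophet)   #then fit on the training data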
#Assessing Prophet Models
forecast_metric_data = forecast %>%
  as_tibble() %>%
  mutate(ds = as.Date(ds)) %>%
  inner_join(test_prophet, by = "ds")   #align forecasts with the test observations by date
RMSE = sqrt(mean((forecast_metric_data$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(forecast_metric_data$y - forecast_metric_data$yhat))
MAPE = mean(abs((forecast_metric_data$y - forecast_metric_data$yhat)/forecast_metric_data$y))
print(paste("RMSE:",round(rmse,2)))
## [1] "RMSE: 13.34"
print(paste("MAE:",round(mae,2)))
## [1] "MAE: 7.23"
print(paste("MAPE:",round(mape,2)))
## [1] "MAPE: 0.79"
Based on the above metrics, we can conclude that the model is performing decently over this evaluation period.
#Cross-validation with Prophet
df.cv <- cross_validation(model, initial = 730, period = 300, horizon = 300, units = 'days')
head(df.cv)
## y ds yhat yhat_lower yhat_upper cutoff
## 1 205.0100 2009-07-06 211.9774 200.7885 222.9298 2009-07-02
## 2 198.5135 2009-07-07 212.7559 201.2861 224.3727 2009-07-02
## 3 201.4464 2009-07-08 213.1945 202.1290 223.7628 2009-07-02
## 4 205.4004 2009-07-09 212.8311 201.7436 224.7930 2009-07-02
## 5 207.4074 2009-07-10 212.9399 201.9880 224.2011 2009-07-02
## 6 212.3624 2009-07-13 209.6956 198.6565 221.2828 2009-07-02
The model has produced 14 forecasts, with an initial training period of 730 days, a period of 300 days between cutoffs, and a horizon of 300 days.
unique(df.cv$cutoff)
## [1] "2009-07-02 GMT" "2010-04-28 GMT" "2011-02-22 GMT" "2011-12-19 GMT"
## [5] "2012-10-14 GMT" "2013-08-10 GMT" "2014-06-06 GMT" "2015-04-02 GMT"
## [9] "2016-01-27 GMT" "2016-11-22 GMT" "2017-09-18 GMT" "2018-07-15 GMT"
## [13] "2019-05-11 GMT" "2020-03-06 GMT"
#cross-validation Actual vs Predicted
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Amazon Closing Price")+
scale_color_discrete(name = 'Cutoff')
#Cross Validation RMSE metric plot
p1 <- plot_cross_validation_metric(df.cv, metric = 'rmse')
p2 <- plot_cross_validation_metric(df.cv, metric = 'mae')
p3 <- plot_cross_validation_metric(df.cv, metric = 'mape')
p4 <- gridExtra::grid.arrange(p1, p2,p3, nrow = 2)
The RMSE of the model increases with the horizon. This seems reasonable, because the model predicts the near future better than the more distant future.
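The horizon-wise metrics behind these plots can also be inspected numerically with performance_metrics(); a minimal sketch, output not shown in the original:
#Illustrative: tabulate the cross-validation error metrics by forecast horizon
#instead of reading them off the plots.
cv_metrics <- performance_metrics(df.cv, rolling_window = 0.1)
head(cv_metrics)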
In-sample performance
ARIMA(0,1,1):
RMSE: $16.8
MAE: $8.96
MAPE: 0.98%
Prophet:
RMSE: $13.34
MAE: $7.23
MAPE: 0.79%
OOS model performance
ARIMA(0,1,1) (train-test split, 30 days into the future):
RMSE: $118.7
MAE: $111.82
MAPE: 15.34%
Prophet (CV over 10 days):
RMSE: $39.79
MAE: $16.79
MAPE: 4.64%
Comparing the ARIMA model with the c(0,1,1) specification against the Prophet model, the Prophet model appears to forecast considerably better, given its lower prediction errors.