Description of the dataset and its source

I want to choose the Meta Platforms, Inc.(Facebook) dataset for my Forecasting and Time Series assignment-2. This dataset was published on Yahoo Finance website and contains the data of daily stock prices.

Adj.Close indicates the adjusted closing price of a stock and as per general information on the internet, Adj.Close is a more accurate variable for analysis.

Challenges with Stock Price data

Stock market data are non-stationary and chaotic in nature. This makes it challenging for investors to invest the money for making profits. Stock prices vary due to many factors. Primarily, they are affected by changes in investor demand.

Motivation for this project

Investing in stocks is an alternative source of income for many people who are interested in stock market. I am one among such people looking to make lucrative profits by investing in Facebook stocks. There is always a risk of losing money by taking impulsive decisions in this business. The ability to predict future stock prices helps to make the right decisions. For the same reason, I decided to work on Meta Platforms(Facebook) stock prices data.

Project Goal

My aim is to analyze historical stock prices data for Facebook and predict their future prices.

Data Import and Wrangling

We can load the required libraries for this project.

library(ggplot2)
library(quantmod)
library(dplyr)
library(mltools)
library(tidyr)
library(forecast)
library(tseries)
library(lubridate)
library(zoo)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
data <- read.csv("C:/UC MSBANA/Forecasting/Facebook Stock Price Prediction/FB.csv")
data['Date'] <- as.Date(data$Date)
head(data)
##         Date  Open  High   Low Close Adj.Close    Volume
## 1 2012-08-15 20.64 21.41 20.40 21.20     21.20  47861100
## 2 2012-08-16 20.44 20.48 19.69 19.87     19.87 157565300
## 3 2012-08-17 20.01 20.08 19.00 19.05     19.05 129293400
## 4 2012-08-20 19.05 20.13 18.75 20.01     20.01 101186600
## 5 2012-08-21 19.58 19.98 19.09 19.16     19.16  70640600
## 6 2012-08-22 19.36 19.53 18.96 19.44     19.44  49892200
tail(data)
##            Date   Open   High    Low  Close Adj.Close    Volume
## 2379 2022-01-27 297.75 301.71 294.26 294.64    294.64  21629900
## 2380 2022-01-28 295.62 301.90 293.03 301.71    301.71  21871600
## 2381 2022-01-31 300.68 313.79 299.32 313.26    313.26  21579500
## 2382 2022-02-01 314.56 319.66 312.12 319.00    319.00  18023800
## 2383 2022-02-02 327.82 328.00 316.87 323.00    323.00  58458300
## 2384 2022-02-03 244.65 248.00 235.75 237.76    237.76 187807000
summary(data)
##       Date                 Open             High             Low        
##  Min.   :2012-08-15   Min.   : 18.08   Min.   : 18.27   Min.   : 17.55  
##  1st Qu.:2014-12-29   1st Qu.: 77.25   1st Qu.: 78.17   1st Qu.: 76.51  
##  Median :2017-05-11   Median :141.03   Median :142.26   Median :139.74  
##  Mean   :2017-05-12   Mean   :149.24   Mean   :151.00   Mean   :147.46  
##  3rd Qu.:2019-09-24   3rd Qu.:191.35   3rd Qu.:193.52   3rd Qu.:188.81  
##  Max.   :2022-02-03   Max.   :381.68   Max.   :384.33   Max.   :378.81  
##      Close          Adj.Close          Volume         
##  Min.   : 17.73   Min.   : 17.73   Min.   :  5913100  
##  1st Qu.: 77.52   1st Qu.: 77.52   1st Qu.: 15671325  
##  Median :141.06   Median :141.06   Median : 22057450  
##  Mean   :149.29   Mean   :149.29   Mean   : 30347713  
##  3rd Qu.:191.50   3rd Qu.:191.50   3rd Qu.: 35227375  
##  Max.   :382.18   Max.   :382.18   Max.   :365457900
str(data)
## 'data.frame':    2384 obs. of  7 variables:
##  $ Date     : Date, format: "2012-08-15" "2012-08-16" ...
##  $ Open     : num  20.6 20.4 20 19 19.6 ...
##  $ High     : num  21.4 20.5 20.1 20.1 20 ...
##  $ Low      : num  20.4 19.7 19 18.8 19.1 ...
##  $ Close    : num  21.2 19.9 19 20 19.2 ...
##  $ Adj.Close: num  21.2 19.9 19 20 19.2 ...
##  $ Volume   : int  47861100 157565300 129293400 101186600 70640600 49892200 32813700 29622200 20704000 25417000 ...

Checking for missing values

colSums(is.na(data))
##      Date      Open      High       Low     Close Adj.Close    Volume 
##         0         0         0         0         0         0         0

Range and Standard Deviation

sd(data$Adj.Close)
## [1] 89.15445
range(data$Adj.Close)
## [1]  17.73 382.18
  • The standard deviation of Adj.Close is approximately 89.15$ and the lowest, highest values are 17.73\(, 382.18\) respectively.

Boxplot of Adjusted Closing Price

boxplot(data$Adj.Close, main = "Adjusted Closing Price")

From the above boxplot, we note that a few points lie above the IQR. But we can not consider them as outliers as they are stock price values and could denote important information.

Line Chart

ggplot(data = data, aes(x = Date, y = Adj.Close )) +
  geom_line() + labs(title = "Line Chart for Adjusted Closing Price across the years")

Histogram

hist(data$Adj.Close, main = "Histogram of Adjusted Closing Price", xlab = "Adjusted Closing Price($)")

Density Chart

ggplot(data = data, aes(x = Volume)) + geom_density(col = "#009688", fill = "#BDBDBD") + labs(title = "Meta Stocks Volume Traded Across Years", subtitle = "2012-2022", x = "Volume Traded (Qty)") + geom_vline(aes(xintercept = mean(Volume)), color = "#03A9F4", linetype = "dashed", size = 1)

From the above Density chart, we note that the mean of Volume of stocks traded lies around 3.3e+07 and the density of volume is between 2e+07 and 1.4e+08

Building Linear Regression Model

The task in this project is to predict the price of stocks over time.

lm01 = lm(Adj.Close ~ Date, data = data)
tab_model(lm01)
  Adj.Close
Predictors Estimates CI p
(Intercept) -1326.74 -1345.00 – -1308.49 <0.001
Date 0.09 0.08 – 0.09 <0.001
Observations 2384
R2 / R2 adjusted 0.914 / 0.914

Rolling Statistics

meta_roll <- data %>%
# filter(Date>=ymd('2012-08-15') & Date<=ymd("2013-02-15")) %>%
  mutate(
    close_mean = zoo::rollmean(
      Adj.Close, 
      k = 180, 
      fill = NA),
    close_sd = zoo::rollapply(
      Adj.Close, 
      FUN = sd, 
      width = 180, 
      fill = NA)
  )
meta_rollmean <- meta_roll %>%
  ggplot() +
  geom_line(aes(Date, close_mean)) +
  theme_bw() +
  ggtitle("Meta(FB) Mean over Time (6 month rolling window)") +
  ylab("Adjusted Closing Price") +
  xlab("Date")
meta_rollsd <- meta_roll %>%
  ggplot() +
  geom_line(aes(Date, close_sd)) +
  theme_bw() +
  ggtitle("Differenced Meta(FB) SD over Time (6 month rolling window)") +
  ylab("Adjusted Closing Price") +
  xlab("Date")
meta_rollmean 

meta_rollsd

From the Rolling Statistics plots, we note that the time series is both mean non-stationary and variance non-stationary.

Stationarity Check

ADF Test

print(adf.test(data$Adj.Close))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data$Adj.Close
## Dickey-Fuller = -2.6913, Lag order = 13, p-value = 0.2857
## alternative hypothesis: stationary

From the above ADF Test output, we observe the p-value is greater than 0.05. So, we fail to reject the Null Hypothesis and infer that the time series is non-stationary.

KPSS Test

kpss.test(data$Adj.Close, null="Trend")
## 
##  KPSS Test for Trend Stationarity
## 
## data:  data$Adj.Close
## KPSS Trend = 2.2555, Truncation lag parameter = 8, p-value = 0.01

From the output of KPSS Test, the p-value is less than 0.05. So, we reject the null hypothesis and infer that the time series is trend non-stationary.

Transformations

As the data is non-stationary, we perform transformations to get Constant Variance in the target variable.

Log Transformation for Constant Variance

meta_log = data %>%
  mutate(close_log = log1p(Adj.Close))
meta_log %>%
  ggplot()+
      geom_line(aes(Date, close_log))+
      theme_bw()+
      ggtitle("Transformed Adjusted Closing Price over Time (Log)")+
      ylab("Adjusted Closing Price (Log)")+
      xlab("Date")

Handling Mean stationarity

meta_diff = meta_log %>%
  mutate(close_diff_log = close_log - lag(close_log))
meta_diff %>% 
  select(Date,Adj.Close,close_diff_log) %>% 
  head()
##         Date Adj.Close close_diff_log
## 1 2012-08-15     21.20             NA
## 2 2012-08-16     19.87    -0.06177956
## 3 2012-08-17     19.05    -0.04008367
## 4 2012-08-20     20.01     0.04676941
## 5 2012-08-21     19.16    -0.04129807
## 6 2012-08-22     19.44     0.01379337
meta_diff %>%
  ggplot()+
      geom_line(aes(Date,close_diff_log))+
      theme_bw()+
      ggtitle("Adjusted Closing Price (First Difference)")+
      ylab("Differenced Closing Price (Difference))")+
      xlab("Date")

From the above output, we observe that the time series is Mean-stationary.

Rechecking stationarity with tests

We must remove the NA Values from the differenced data

meta_final <- na.omit(meta_diff) 
adf.test(meta_final$close_diff_log) # Differenced Log Close Value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  meta_final$close_diff_log
## Dickey-Fuller = -13.182, Lag order = 13, p-value = 0.01
## alternative hypothesis: stationary

KPSS Test to re-verify

kpss.test(meta_final$close_diff_log) 
## 
##  KPSS Test for Level Stationarity
## 
## data:  meta_final$close_diff_log
## KPSS Level = 0.30095, Truncation lag parameter = 8, p-value = 0.1

The results of ADF, KPSS Tests indicate the stationarity of time series.

ACF and PACF plots for AR and MA Order

par(mfrow = c(1,2))
ACF_Plot <- acf(meta_final$close_diff_log,lag.max=20, main = "ACF plot for differenced Meta(FB) Time series")
PACF_Plot <- pacf(meta_final$close_diff_log,lag.max=20, main = "PACF plot for differenced Meta(FB) Time series")

Identifying Seasonality from Original Data

ACF plot for Seasonality

acf(data$Adj.Close, main = "ACF plot for Adjusted Closing Price of original Data")

  • From the above ACF plot of Adjusted Closing Price of original data, we note that there is no significant trend in the coefficients. So, this time series data has no seasonality.

PACF plot for Seasonality

pacf(data$Adj.Close, main = "PACF plot for Adjusted Closing Price of original Data")

Fitting ARIMA Models

Comparing AIC of several ARIMA Models

AIC(
  arima(meta_final$close_diff_log,order=c(0,1,1)),
  arima(meta_final$close_diff_log,order=c(0,1,2)),
  arima(meta_final$close_diff_log,order=c(0,1,3)),
  arima(meta_final$close_diff_log,order=c(1,1,0)),
  arima(meta_final$close_diff_log,order=c(1,1,2)),
  arima(meta_final$close_diff_log,order=c(1,1,1)),
  arima(meta_final$close_diff_log,order=c(2,1,0)),
  arima(meta_final$close_diff_log,order=c(3,1,0))
)
##                                                      df       AIC
## arima(meta_final$close_diff_log, order = c(0, 1, 1))  2 -11265.83
## arima(meta_final$close_diff_log, order = c(0, 1, 2))  3 -11270.46
## arima(meta_final$close_diff_log, order = c(0, 1, 3))  4 -11269.02
## arima(meta_final$close_diff_log, order = c(1, 1, 0))  2 -10362.28
## arima(meta_final$close_diff_log, order = c(1, 1, 2))  4 -11269.74
## arima(meta_final$close_diff_log, order = c(1, 1, 1))  3 -11270.81
## arima(meta_final$close_diff_log, order = c(2, 1, 0))  3 -10586.39
## arima(meta_final$close_diff_log, order = c(3, 1, 0))  4 -10710.63

As per AIC Values, ARIMA(1,1,1) is the best model with a AIC = -11270.81.

Comparing BIC

BIC(
  arima(meta_final$close_diff_log,order=c(0,1,1)),
  arima(meta_final$close_diff_log,order=c(0,1,2)),
  arima(meta_final$close_diff_log,order=c(0,1,3)),
  arima(meta_final$close_diff_log,order=c(1,1,2)),
  arima(meta_final$close_diff_log,order=c(1,1,0)),
  arima(meta_final$close_diff_log,order=c(1,1,1)),
  arima(meta_final$close_diff_log,order=c(2,1,0)),
  arima(meta_final$close_diff_log,order=c(3,1,0))
  
)
##                                                      df       BIC
## arima(meta_final$close_diff_log, order = c(0, 1, 1))  2 -11254.28
## arima(meta_final$close_diff_log, order = c(0, 1, 2))  3 -11253.13
## arima(meta_final$close_diff_log, order = c(0, 1, 3))  4 -11245.91
## arima(meta_final$close_diff_log, order = c(1, 1, 2))  4 -11246.64
## arima(meta_final$close_diff_log, order = c(1, 1, 0))  2 -10350.73
## arima(meta_final$close_diff_log, order = c(1, 1, 1))  3 -11253.48
## arima(meta_final$close_diff_log, order = c(2, 1, 0))  3 -10569.07
## arima(meta_final$close_diff_log, order = c(3, 1, 0))  4 -10687.53

As per BIC Values, ARIMA(0,1,1) is the best model with a BIC = -11254.28.

Auto ARIMA

auto_arima = auto.arima(meta_final$close_log,max.p = 7, max.q=7,max.order = 15)

auto_arima
## Series: meta_final$close_log 
## ARIMA(1,1,2) with drift 
## 
## Coefficients:
##           ar1     ma1     ma2   drift
##       -0.9443  0.8907  -0.063  0.0011
## s.e.   0.0440  0.0481   0.021  0.0004
## 
## sigma^2 = 0.0005115:  log likelihood = 5647.62
## AIC=-11285.24   AICc=-11285.21   BIC=-11256.36

From the output of Auto-ARIMA, ARIMA(1,1,2) is the best model. We finalize this model for further analysis

Ljung-Box Test

best_mod = arima(meta_final$close_log,order=c(1,1,2))
forecast::checkresiduals(best_mod)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)
## Q* = 25.221, df = 7, p-value = 0.0006932
## 
## Model df: 3.   Total lags used: 10

From the ACF plot, we observe that residuals appear to be white noise.

par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)

Ljung-Box Test Results

Box.test(resid,type='Ljung-Box',lag=1)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.017154, df = 1, p-value = 0.8958
Box.test(resid,type='Ljung-Box',lag=5)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 6.7406, df = 5, p-value = 0.2407
Box.test(resid,type='Ljung-Box',lag=10)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 25.221, df = 10, p-value = 0.004941
Box.test(resid,type='Ljung-Box',lag=15)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 26.953, df = 15, p-value = 0.02912
Box.test(resid,type='Ljung-Box',lag=20)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 44.115, df = 20, p-value = 0.001453

Ljung-Box test for Residual auto correlation suggests that if p-value is less than 0.05, then there exists resiual auto correlation at that lag. From the results above, we can conclude that there is auto-correlation at lags 10, 15, 20.

meta_final$resid = best_mod$residuals
meta_final$resid_invtransform=exp(meta_final$resid)-1
meta_final$adjclose_invtransform=exp(meta_final$close_log)-1
head(meta_final)
##         Date  Open  High   Low Close Adj.Close    Volume close_log
## 2 2012-08-16 20.44 20.48 19.69 19.87     19.87 157565300  3.038313
## 3 2012-08-17 20.01 20.08 19.00 19.05     19.05 129293400  2.998229
## 4 2012-08-20 19.05 20.13 18.75 20.01     20.01 101186600  3.044999
## 5 2012-08-21 19.58 19.98 19.09 19.16     19.16  70640600  3.003700
## 6 2012-08-22 19.36 19.53 18.96 19.44     19.44  49892200  3.017494
## 7 2012-08-23 19.50 19.73 19.36 19.44     19.44  32813700  3.017494
##   close_diff_log        resid resid_invtransform adjclose_invtransform
## 2    -0.06177956  0.003038311        0.003042932                 19.87
## 3    -0.04008367 -0.040003185       -0.039213621                 19.05
## 4     0.04676941  0.044504120        0.045509284                 20.01
## 5    -0.04129807 -0.037627267       -0.036928157                 19.16
## 6     0.01379337  0.009918894        0.009968249                 19.44
## 7     0.00000000  0.002384820        0.002387666                 19.44

Actual vs Predicted from Model

meta_final$pred =  meta_final$Adj.Close -  meta_final$resid_invtransform

ggplot()+
  geom_line(aes(meta_final$Date, meta_final$Adj.Close),color='red')+
  geom_line(aes(meta_final$Date, meta_final$pred),color='blue',alpha=0.4)+
  theme_bw()+
  xlab("Date")+
  ylab("Adjusted Closing Price ")

Estimating Model Performance

Since we modeled our outcome as log values, we need to exponentiate the predicted values to get the true RMSE.

RMSE = sqrt(mean((meta_final$pred - meta_final$Adj.Close)^2, na.rm=T))
RMSE
## [1] 0.02261912
  • We are getting an RMSE = 0.02262.

  • The result suggests that our model predicts the Meta stock prices within ~0 points on average in-sample.

Building Forecasts

head(best_mod)
## $coef
##         ar1         ma1         ma2 
## -0.38371785  0.32952355  0.00958274 
## 
## $sigma2
## [1] 0.0005120141
## 
## $var.coef
##              ar1         ma1           ma2
## ar1  0.060247786 -0.05994187  0.0041935925
## ma1 -0.059941867  0.06009803 -0.0040060197
## ma2  0.004193592 -0.00400602  0.0008118332
## 
## $mask
## [1] TRUE TRUE TRUE
## 
## $loglik
## [1] 5644.482
## 
## $aic
## [1] -11280.96
best_mod %>%
  forecast(h=150) %>% 
  autoplot()

Inverse Tranformation of Log values

forecast= best_mod %>%
  forecast(h=150)

forecast$mean=exp(forecast$mean)-1
forecast$fitted=exp(forecast$fitted)-1
forecast$x=exp(forecast$x)-1


autoplot(forecast, main = 'Forecast of Actual Stock Price over time', ylab = 'Actual Stock Price')

The forecast indicates that there could be decrease in the Meta stock Prices in the near future.

Fitting a Facebook Prophet Model

library(prophet)
prophet_data = data %>% 
    rename(ds = Date, # Have to name our date variable "ds"
    y = Adj.Close)  # Have to name our time series "y"
train = prophet_data %>% 
  filter(ds<ymd("2021-01-01"))
test = prophet_data %>%
  filter(ds>=ymd("2021-01-01"))
model = prophet(train)
future = make_future_dataframe(model,periods = 365)
forecast = predict(model,future)

Plotting Forecast

plot(model,forecast)+
ylab("Adjusted Closing Price of Meta Stocks")+xlab("Date")+theme_bw()

Interactive Charts

dyplot.prophet(model,forecast)

Time Series Decomposition

Visualization of Time Series components

prophet_plot_components(model,forecast)

Identifying Saturation Points

Let us make a two-year forecast to identify the need for saturation points.

two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Adjusted Closing Price")

From the above graph, we observe that there is no abnormal rise or fall in the Stock Price values. Also, naturally Stock Price values remain positive. So, there is no need to set Saturation Points for this model.

Estimating the trend and changepoints

Plotting Changepoints

  • The algorithm examines 25 equally-spaced potential changepoints by default as potential candidates for a change in the trend.
  • Examines the actual rate of change at each potential changepoint, and removes those with low rates of change.
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Adjusted Closing Price")

Changepoint Hyperparameters

We can manually specify more number of changepoints to improve performance.

# Number of Changepoints
model = prophet(train,n.changepoints=50)
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Adjusted Stock Price")

prophet_plot_components(model,forecast)

forecast_plot_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2021-01-01"))

forecast_not_scaled = ggplot()+
geom_line(aes(test$ds,test$y))+ xlab("Date") + ylab("Adjusted Closing Price from test data") +
geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat), color='red')
forecast_scaled = forecast_not_scaled + 
ylim(0,1000) 
forecast_scaled

Assessment of Seasonality

prophet_plot_components(model,forecast)

Additive Seasonality

additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)+
ylim(0,2500)

prophet_plot_components(additive,add_fcst)

Multiplicative Seasonality

multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,2500)

prophet_plot_components(multi, multi_fcst)

Assessment of holidays

model = prophet(train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)
forecast = predict(model,future)
prophet_plot_components(model,forecast)

Examining Impact of Holidays

forecast %>% 
  filter(
    holidays!=0,
    ds == ymd("2020-11-11") | 
    ds == ymd("2020-10-12")) %>%
    select(ds,`Veterans Day`,`Columbus Day`)
##           ds Veterans Day Columbus Day
## 1 2020-10-12    0.0000000    0.9984599
## 2 2020-11-11    0.7518858    0.0000000
forecast %>%
  filter(holidays != 0) %>%
  dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
  mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
  summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
  pivot_longer(
    cols = `Christmas Day`:`Washington's Birthday`,
    names_to = 'holiday', 
    values_to = 'effect'
  ) %>%
ggplot() +
  geom_col(aes(effect,holiday))+
  theme_bw()

  • From the above plot, Veterans Day, Columbus Day seem to be the most important holidays affecting stock prices

Prophet Model Assessment

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2021-01-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 42.51"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 35.61"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.11"

The observed values of in-sample performance metrics are RMSE = 42.51, MAE = 35.61, MAPE = 0.11.

Rolling Window Cross-Validation with Prophet Model

library(tidyverse)
library(caret)
library(prophet)
df.cv <- cross_validation(model, initial = 730, period = 30, horizon = 30, units = 'days')
head(df.cv)
##       y         ds     yhat yhat_lower yhat_upper     cutoff
## 1 77.26 2014-09-05 78.66006   75.69237   81.64817 2014-09-04
## 2 77.89 2014-09-08 79.52553   76.38671   82.71303 2014-09-04
## 3 76.67 2014-09-09 79.93858   76.95082   83.07826 2014-09-04
## 4 77.43 2014-09-10 80.45137   77.50634   83.46513 2014-09-04
## 5 77.92 2014-09-11 80.81329   77.84398   83.95192 2014-09-04
## 6 77.48 2014-09-12 81.04144   77.93581   83.99811 2014-09-04
unique(df.cv$cutoff)
##  [1] "2014-09-04 GMT" "2014-10-04 GMT" "2014-11-03 GMT" "2014-12-03 GMT"
##  [5] "2015-01-02 GMT" "2015-02-01 GMT" "2015-03-03 GMT" "2015-04-02 GMT"
##  [9] "2015-05-02 GMT" "2015-06-01 GMT" "2015-07-01 GMT" "2015-07-31 GMT"
## [13] "2015-08-30 GMT" "2015-09-29 GMT" "2015-10-29 GMT" "2015-11-28 GMT"
## [17] "2015-12-28 GMT" "2016-01-27 GMT" "2016-02-26 GMT" "2016-03-27 GMT"
## [21] "2016-04-26 GMT" "2016-05-26 GMT" "2016-06-25 GMT" "2016-07-25 GMT"
## [25] "2016-08-24 GMT" "2016-09-23 GMT" "2016-10-23 GMT" "2016-11-22 GMT"
## [29] "2016-12-22 GMT" "2017-01-21 GMT" "2017-02-20 GMT" "2017-03-22 GMT"
## [33] "2017-04-21 GMT" "2017-05-21 GMT" "2017-06-20 GMT" "2017-07-20 GMT"
## [37] "2017-08-19 GMT" "2017-09-18 GMT" "2017-10-18 GMT" "2017-11-17 GMT"
## [41] "2017-12-17 GMT" "2018-01-16 GMT" "2018-02-15 GMT" "2018-03-17 GMT"
## [45] "2018-04-16 GMT" "2018-05-16 GMT" "2018-06-15 GMT" "2018-07-15 GMT"
## [49] "2018-08-14 GMT" "2018-09-13 GMT" "2018-10-13 GMT" "2018-11-12 GMT"
## [53] "2018-12-12 GMT" "2019-01-11 GMT" "2019-02-10 GMT" "2019-03-12 GMT"
## [57] "2019-04-11 GMT" "2019-05-11 GMT" "2019-06-10 GMT" "2019-07-10 GMT"
## [61] "2019-08-09 GMT" "2019-09-08 GMT" "2019-10-08 GMT" "2019-11-07 GMT"
## [65] "2019-12-07 GMT" "2020-01-06 GMT" "2020-02-05 GMT" "2020-03-06 GMT"
## [69] "2020-04-05 GMT" "2020-05-05 GMT" "2020-06-04 GMT" "2020-07-04 GMT"
## [73] "2020-08-03 GMT" "2020-09-02 GMT" "2020-10-02 GMT" "2020-11-01 GMT"
## [77] "2020-12-01 GMT"

Cross-Validation Actual vs Predicted

df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color = factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Adjusted Closing Price")+
  scale_color_discrete(name = 'Cutoff')

Table for Performance metrics

We can generate a table for various performance metrics for different time horizons.

performance_metrics(df.cv)
##    horizon      mse     rmse      mae       mape      mdape      smape
## 1   3 days 237.0759 15.39727 11.28247 0.06555776 0.05776137 0.06557412
## 2   4 days 254.1698 15.94270 11.89033 0.06920585 0.05870524 0.06922211
## 3   5 days 260.7918 16.14905 12.06646 0.07113284 0.06180914 0.07118866
## 4   6 days 276.2903 16.62198 12.32790 0.07366077 0.06260145 0.07357199
## 5   7 days 278.4242 16.68605 12.32800 0.07417991 0.06360208 0.07392957
## 6   8 days 287.8141 16.96509 12.61406 0.07571646 0.06605957 0.07520361
## 7   9 days 293.8805 17.14294 12.76515 0.07672832 0.06531488 0.07594572
## 8  10 days 300.2734 17.32840 12.80279 0.07727192 0.06616708 0.07621038
## 9  11 days 337.8040 18.37944 13.53539 0.08255582 0.06683741 0.08109157
## 10 12 days 361.9484 19.02494 14.16337 0.08755908 0.06983869 0.08592160
## 11 13 days 353.9851 18.81449 14.07069 0.08771223 0.07252843 0.08640626
## 12 14 days 332.5793 18.23676 13.56329 0.08452660 0.07115921 0.08363761
## 13 15 days 319.0513 17.86201 13.35490 0.08150186 0.07124705 0.08129182
## 14 16 days 333.5775 18.26410 13.71659 0.08370060 0.07124705 0.08297630
## 15 17 days 343.3003 18.52836 13.91428 0.08437655 0.07089151 0.08328492
## 16 18 days 350.0995 18.71094 13.84236 0.08457387 0.06355048 0.08318298
## 17 19 days 346.8999 18.62525 13.61401 0.08321377 0.06209464 0.08208697
## 18 20 days 323.4802 17.98555 13.20953 0.08165723 0.06112274 0.08128777
## 19 21 days 324.8037 18.02231 13.23080 0.08199376 0.06252495 0.08197834
## 20 22 days 328.6912 18.12984 13.46191 0.08355830 0.06366675 0.08388263
## 21 23 days 361.2234 19.00588 13.87081 0.08468539 0.06656644 0.08483306
## 22 24 days 365.0178 19.10544 13.90756 0.08519675 0.06656644 0.08493320
## 23 25 days 374.7946 19.35961 13.77430 0.08471988 0.06505186 0.08423006
## 24 26 days 333.9311 18.27378 13.11343 0.08216458 0.06357478 0.08124851
## 25 27 days 336.6305 18.34749 13.49021 0.08408171 0.06535908 0.08330225
## 26 28 days 350.7630 18.72867 13.99384 0.08590424 0.07119074 0.08532094
## 27 29 days 393.2120 19.82957 14.91387 0.09035837 0.08163422 0.09037273
## 28 30 days 409.7096 20.24128 14.89378 0.08941382 0.08112612 0.08959061
##     coverage
## 1  0.3577918
## 2  0.3333333
## 3  0.3018868
## 4  0.3102344
## 5  0.3342827
## 6  0.3207547
## 7  0.3144654
## 8  0.3001715
## 9  0.3039832
## 10 0.2641509
## 11 0.2641509
## 12 0.2720708
## 13 0.2893082
## 14 0.3018868
## 15 0.3253827
## 16 0.3476919
## 17 0.3477754
## 18 0.3333333
## 19 0.3291405
## 20 0.2955975
## 21 0.3018868
## 22 0.2870535
## 23 0.3021197
## 24 0.3039429
## 25 0.2791485
## 26 0.2642653
## 27 0.2327044
## 28 0.2641509

Plots for various metrics

Now, we can plot visualizations for metrics like RMSE, MAE, MAPE, MDAPE, SMAPE.

plot_cross_validation_metric(df.cv, metric = 'rmse')

plot_cross_validation_metric(df.cv, metric = 'mae')

plot_cross_validation_metric(df.cv, metric = 'mape')

plot_cross_validation_metric(df.cv, metric = 'mdape')

plot_cross_validation_metric(df.cv, metric = 'smape')

ARIMA Model for RMSE Comparison

best_mod_ar <- auto.arima(train$y,stationary=FALSE,allowdrift=FALSE,
                 seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
best_mod_ar
## Series: train$y 
## ARIMA(3,1,0) 
## 
## Coefficients:
##           ar1     ar2      ar3
##       -0.0661  0.0125  -0.0916
## s.e.   0.0217  0.0217   0.0217
## 
## sigma^2 = 9.389:  log likelihood = -5350.09
## AIC=10708.19   AICc=10708.2   BIC=10730.8
checkresiduals(best_mod_ar)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,0)
## Q* = 28.423, df = 7, p-value = 0.0001843
## 
## Model df: 3.   Total lags used: 10

We get ARIMA(3,1,0) as the best model on training dataset.

Out of Sample RMSE for ARIMA

test_pred = predict(best_mod_ar, 275)
test_pred_ar = test_pred$pred

error_ar = test$y - test_pred_ar

out_rmse_ar=sqrt(mean(error_ar^2,na.rm=T))

Prophet Model for RMSE comparison

best_mod_pr = prophet(train, weekly.seasonality=FALSE, daily.seasonality = FALSE, 
                yearly.seasonality=TRUE)

future = make_future_dataframe(model,periods = 275)
forecast = predict(best_mod_pr,future)

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2021-01-01"))


out_rmse_pr = sqrt(mean((test$y - forecast_metric_data$yhat)^2))

Out of Sample RMSE comparison for ARIMA and Prophet

tibble(
  `best_ARIMA` = round(out_rmse_ar,2),
  `best_Prophet` = round(out_rmse_pr,2)
)
## # A tibble: 1 x 2
##   best_ARIMA best_Prophet
##        <dbl>        <dbl>
## 1       58.3         43.4

As per the RMSE Criterion, the Prophet model has less RMSE. So, it is the better model of the two.

In general, ARIMA Modeling has its limitations for being computationally expensive and having difficulty to predict change points. It is deemed to give poor long-term forecast.

Prophet model gains an upper hand over ARIMA with the benefit of forecasting daily data with weekly and yearly seasonality, plus holiday effects.