I want to choose the Meta Platforms, Inc. (Facebook) dataset for my Forecasting and Time Series Assignment 2. This dataset was published on the Yahoo Finance website and contains daily stock prices with the following columns: Date, Open, High, Low, Close, Adj.Close, and Volume.
Adj.Close indicates the adjusted closing price of the stock; because it accounts for corporate actions such as splits and dividends, it is generally considered the more accurate variable for analysis.
Stock market data are non-stationary and chaotic in nature, which makes it challenging for investors to make profitable decisions. Stock prices vary due to many factors, primarily changes in investor demand.
Investing in stocks is an alternative source of income for many people interested in the stock market, and I am one of them, looking to make profits by investing in Facebook stock. There is always a risk of losing money through impulsive decisions, and the ability to predict future stock prices helps in making the right ones. For this reason, I decided to work with Meta Platforms (Facebook) stock price data.
My aim is to analyze historical stock price data for Facebook and predict future prices.
We can load the required libraries for this project.
library(ggplot2)
library(quantmod)
library(dplyr)
library(mltools)
library(tidyr)
library(forecast)
library(tseries)
library(lubridate)
library(zoo)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
data <- read.csv("C:/UC MSBANA/Forecasting/Facebook Stock Price Prediction/FB.csv")
data['Date'] <- as.Date(data$Date)
head(data)
## Date Open High Low Close Adj.Close Volume
## 1 2012-08-15 20.64 21.41 20.40 21.20 21.20 47861100
## 2 2012-08-16 20.44 20.48 19.69 19.87 19.87 157565300
## 3 2012-08-17 20.01 20.08 19.00 19.05 19.05 129293400
## 4 2012-08-20 19.05 20.13 18.75 20.01 20.01 101186600
## 5 2012-08-21 19.58 19.98 19.09 19.16 19.16 70640600
## 6 2012-08-22 19.36 19.53 18.96 19.44 19.44 49892200
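Since quantmod is already loaded, the same series could alternatively be pulled straight from Yahoo Finance. This is a sketch rather than a reproducible call: Facebook's ticker has since changed from FB to META, so the symbol below is an assumption about the current listing.
fb_xts <- getSymbols("META", src = "yahoo",
                     from = "2012-08-15", to = "2022-02-04",
                     auto.assign = FALSE) # return an xts object instead of assigning into the workspace
head(fb_xts)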
tail(data)
## Date Open High Low Close Adj.Close Volume
## 2379 2022-01-27 297.75 301.71 294.26 294.64 294.64 21629900
## 2380 2022-01-28 295.62 301.90 293.03 301.71 301.71 21871600
## 2381 2022-01-31 300.68 313.79 299.32 313.26 313.26 21579500
## 2382 2022-02-01 314.56 319.66 312.12 319.00 319.00 18023800
## 2383 2022-02-02 327.82 328.00 316.87 323.00 323.00 58458300
## 2384 2022-02-03 244.65 248.00 235.75 237.76 237.76 187807000
summary(data)
## Date Open High Low
## Min. :2012-08-15 Min. : 18.08 Min. : 18.27 Min. : 17.55
## 1st Qu.:2014-12-29 1st Qu.: 77.25 1st Qu.: 78.17 1st Qu.: 76.51
## Median :2017-05-11 Median :141.03 Median :142.26 Median :139.74
## Mean :2017-05-12 Mean :149.24 Mean :151.00 Mean :147.46
## 3rd Qu.:2019-09-24 3rd Qu.:191.35 3rd Qu.:193.52 3rd Qu.:188.81
## Max. :2022-02-03 Max. :381.68 Max. :384.33 Max. :378.81
## Close Adj.Close Volume
## Min. : 17.73 Min. : 17.73 Min. : 5913100
## 1st Qu.: 77.52 1st Qu.: 77.52 1st Qu.: 15671325
## Median :141.06 Median :141.06 Median : 22057450
## Mean :149.29 Mean :149.29 Mean : 30347713
## 3rd Qu.:191.50 3rd Qu.:191.50 3rd Qu.: 35227375
## Max. :382.18 Max. :382.18 Max. :365457900
str(data)
## 'data.frame': 2384 obs. of 7 variables:
## $ Date : Date, format: "2012-08-15" "2012-08-16" ...
## $ Open : num 20.6 20.4 20 19 19.6 ...
## $ High : num 21.4 20.5 20.1 20.1 20 ...
## $ Low : num 20.4 19.7 19 18.8 19.1 ...
## $ Close : num 21.2 19.9 19 20 19.2 ...
## $ Adj.Close: num 21.2 19.9 19 20 19.2 ...
## $ Volume : int 47861100 157565300 129293400 101186600 70640600 49892200 32813700 29622200 20704000 25417000 ...
colSums(is.na(data))
## Date Open High Low Close Adj.Close Volume
## 0 0 0 0 0 0 0
sd(data$Adj.Close)
## [1] 89.15445
range(data$Adj.Close)
## [1] 17.73 382.18
boxplot(data$Adj.Close, main = "Adjusted Closing Price")
From the above boxplot, we note that a few points lie above the upper whisker (1.5 × IQR above the third quartile). However, we should not treat them as outliers: they are genuine stock prices and may carry important information.
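As a quick check, a sketch that counts how many observations lie beyond the standard upper fence the boxplot flags:
q <- quantile(data$Adj.Close, probs = c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * IQR(data$Adj.Close) # same fence the boxplot whiskers use by default
sum(data$Adj.Close > upper_fence)               # number of points drawn above the whisker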
ggplot(data = data, aes(x = Date, y = Adj.Close )) +
geom_line() + labs(title = "Line Chart for Adjusted Closing Price across the years")
hist(data$Adj.Close, main = "Histogram of Adjusted Closing Price", xlab = "Adjusted Closing Price($)")
ggplot(data = data, aes(x = Volume)) +
geom_density(col = "#009688", fill = "#BDBDBD") +
labs(title = "Meta Stocks Volume Traded Across Years",
subtitle = "2012-2022",
x = "Volume Traded (Qty)") +
geom_vline(aes(xintercept = mean(Volume)),
color = "#03A9F4", linetype = "dashed", size = 1)
From the above density chart, we note that the mean volume of stocks traded (dashed line) lies around 3.0e+07, while most of the density falls between 2e+07 and 1.4e+08.
The task in this project is to predict the price of stocks over time.
lm01 = lm(Adj.Close ~ Date, data = data)
tab_model(lm01)
Response: Adj.Close

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | -1326.74 | -1345.00 – -1308.49 | <0.001 |
| Date | 0.09 | 0.08 – 0.09 | <0.001 |
| Observations | 2384 | | |
| R2 / R2 adjusted | 0.914 / 0.914 | | |
From the above linear regression model, the p-value is less than 0.001, so we can reject the null hypothesis and infer a significant relationship between the adjusted closing price and Date.
The adjusted R-squared is 0.914, indicating that the Date variable explains 91.4% of the variation in the stock price.
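Because lm() treats a Date regressor as days since 1970-01-01, the Date coefficient is the average change in Adj.Close per calendar day. A quick sketch annualizes it:
coef(lm01)["Date"] * 365 # about $33 per year, given the ~$0.09/day estimate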
meta_roll <- data %>%
# filter(Date>=ymd('2012-08-15') & Date<=ymd("2013-02-15")) %>%
mutate(
close_mean = zoo::rollmean(
Adj.Close,
k = 180,
fill = NA),
close_sd = zoo::rollapply(
Adj.Close,
FUN = sd,
width = 180,
fill = NA)
)
meta_rollmean <- meta_roll %>%
ggplot() +
geom_line(aes(Date, close_mean)) +
theme_bw() +
ggtitle("Meta(FB) Mean over Time (6 month rolling window)") +
ylab("Adjusted Closing Price") +
xlab("Date")
meta_rollsd <- meta_roll %>%
ggplot() +
geom_line(aes(Date, close_sd)) +
theme_bw() +
ggtitle("Differenced Meta(FB) SD over Time (6 month rolling window)") +
ylab("Adjusted Closing Price") +
xlab("Date")
meta_rollmean
meta_rollsd
From the Rolling Statistics plots, we note that the time series is both mean non-stationary and variance non-stationary.
print(adf.test(data$Adj.Close))
##
## Augmented Dickey-Fuller Test
##
## data: data$Adj.Close
## Dickey-Fuller = -2.6913, Lag order = 13, p-value = 0.2857
## alternative hypothesis: stationary
From the above ADF Test output, we observe the p-value is greater than 0.05. So, we fail to reject the Null Hypothesis and infer that the time series is non-stationary.
kpss.test(data$Adj.Close, null="Trend")
##
## KPSS Test for Trend Stationarity
##
## data: data$Adj.Close
## KPSS Trend = 2.2555, Truncation lag parameter = 8, p-value = 0.01
From the output of the KPSS test, the p-value is less than 0.05. So, we reject the null hypothesis of trend stationarity and infer that the time series is not trend-stationary.
As the data are non-stationary, we apply a log transformation to stabilize the variance of the target variable.
meta_log = data %>%
mutate(close_log = log1p(Adj.Close))
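Note that log1p(x) computes log(1 + x); its exact inverse is expm1(y) = exp(y) - 1, which is what the back-transformations later in the analysis rely on. A one-line sanity check:
all.equal(expm1(log1p(data$Adj.Close)), data$Adj.Close) # should be TRUE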
meta_log %>%
ggplot()+
geom_line(aes(Date, close_log))+
theme_bw()+
ggtitle("Transformed Adjusted Closing Price over Time (Log)")+
ylab("Adjusted Closing Price (Log)")+
xlab("Date")
meta_diff = meta_log %>%
mutate(close_diff_log = close_log - lag(close_log))
meta_diff %>%
select(Date,Adj.Close,close_diff_log) %>%
head()
## Date Adj.Close close_diff_log
## 1 2012-08-15 21.20 NA
## 2 2012-08-16 19.87 -0.06177956
## 3 2012-08-17 19.05 -0.04008367
## 4 2012-08-20 20.01 0.04676941
## 5 2012-08-21 19.16 -0.04129807
## 6 2012-08-22 19.44 0.01379337
meta_diff %>%
ggplot()+
geom_line(aes(Date,close_diff_log))+
theme_bw()+
ggtitle("Adjusted Closing Price (First Difference)")+
ylab("Differenced Closing Price (Difference))")+
xlab("Date")
From the above plot, we observe that the differenced series appears mean-stationary.
We must remove the NA values introduced by differencing before re-running the stationarity tests.
meta_final <- na.omit(meta_diff)
adf.test(meta_final$close_diff_log) # Differenced Log Close Value
##
## Augmented Dickey-Fuller Test
##
## data: meta_final$close_diff_log
## Dickey-Fuller = -13.182, Lag order = 13, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(meta_final$close_diff_log)
##
## KPSS Test for Level Stationarity
##
## data: meta_final$close_diff_log
## KPSS Level = 0.30095, Truncation lag parameter = 8, p-value = 0.1
The results of the ADF and KPSS tests both indicate that the differenced series is stationary.
par(mfrow = c(1,2))
ACF_Plot <- acf(meta_final$close_diff_log,lag.max=20, main = "ACF plot for differenced Meta(FB) Time series")
PACF_Plot <- pacf(meta_final$close_diff_log,lag.max=20, main = "PACF plot for differenced Meta(FB) Time series")
acf(data$Adj.Close, main = "ACF plot for Adjusted Closing Price of original Data")
pacf(data$Adj.Close, main = "PACF plot for Adjusted Closing Price of original Data")
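The ACF/PACF of the differenced series motivate the candidate (p, q) orders compared below. As a cross-check on the differencing order, a sketch using forecast::ndiffs (KPSS-based by default); I would expect it to suggest one difference for the log prices and none after differencing:
ndiffs(meta_log$close_log)        # expected: 1
ndiffs(meta_final$close_diff_log) # expected: 0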
AIC(
arima(meta_final$close_diff_log,order=c(0,1,1)),
arima(meta_final$close_diff_log,order=c(0,1,2)),
arima(meta_final$close_diff_log,order=c(0,1,3)),
arima(meta_final$close_diff_log,order=c(1,1,0)),
arima(meta_final$close_diff_log,order=c(1,1,2)),
arima(meta_final$close_diff_log,order=c(1,1,1)),
arima(meta_final$close_diff_log,order=c(2,1,0)),
arima(meta_final$close_diff_log,order=c(3,1,0))
)
## df AIC
## arima(meta_final$close_diff_log, order = c(0, 1, 1)) 2 -11265.83
## arima(meta_final$close_diff_log, order = c(0, 1, 2)) 3 -11270.46
## arima(meta_final$close_diff_log, order = c(0, 1, 3)) 4 -11269.02
## arima(meta_final$close_diff_log, order = c(1, 1, 0)) 2 -10362.28
## arima(meta_final$close_diff_log, order = c(1, 1, 2)) 4 -11269.74
## arima(meta_final$close_diff_log, order = c(1, 1, 1)) 3 -11270.81
## arima(meta_final$close_diff_log, order = c(2, 1, 0)) 3 -10586.39
## arima(meta_final$close_diff_log, order = c(3, 1, 0)) 4 -10710.63
As per the AIC values, ARIMA(1,1,1) is the best model, with an AIC of -11270.81. (Note that close_diff_log is already first-differenced, so specifying d = 1 here effectively applies a second difference to the log series.)
BIC(
arima(meta_final$close_diff_log,order=c(0,1,1)),
arima(meta_final$close_diff_log,order=c(0,1,2)),
arima(meta_final$close_diff_log,order=c(0,1,3)),
arima(meta_final$close_diff_log,order=c(1,1,2)),
arima(meta_final$close_diff_log,order=c(1,1,0)),
arima(meta_final$close_diff_log,order=c(1,1,1)),
arima(meta_final$close_diff_log,order=c(2,1,0)),
arima(meta_final$close_diff_log,order=c(3,1,0))
)
## df BIC
## arima(meta_final$close_diff_log, order = c(0, 1, 1)) 2 -11254.28
## arima(meta_final$close_diff_log, order = c(0, 1, 2)) 3 -11253.13
## arima(meta_final$close_diff_log, order = c(0, 1, 3)) 4 -11245.91
## arima(meta_final$close_diff_log, order = c(1, 1, 2)) 4 -11246.64
## arima(meta_final$close_diff_log, order = c(1, 1, 0)) 2 -10350.73
## arima(meta_final$close_diff_log, order = c(1, 1, 1)) 3 -11253.48
## arima(meta_final$close_diff_log, order = c(2, 1, 0)) 3 -10569.07
## arima(meta_final$close_diff_log, order = c(3, 1, 0)) 4 -10687.53
As per the BIC values, ARIMA(0,1,1) is the best model, with a BIC of -11254.28.
auto_arima = auto.arima(meta_final$close_log,max.p = 7, max.q=7,max.order = 15)
auto_arima
## Series: meta_final$close_log
## ARIMA(1,1,2) with drift
##
## Coefficients:
## ar1 ma1 ma2 drift
## -0.9443 0.8907 -0.063 0.0011
## s.e. 0.0440 0.0481 0.021 0.0004
##
## sigma^2 = 0.0005115: log likelihood = 5647.62
## AIC=-11285.24 AICc=-11285.21 BIC=-11256.36
From the output of auto.arima, ARIMA(1,1,2) with drift is the best model. We finalize this specification for further analysis.
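One caveat: stats::arima(), used below, cannot include the drift term that auto.arima selected, which is one reason the coefficients reported later differ from the auto.arima output. A sketch of an equivalent fit that keeps the drift:
best_mod_drift <- forecast::Arima(meta_final$close_log,
                                  order = c(1, 1, 2), include.drift = TRUE)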
best_mod = arima(meta_final$close_log,order=c(1,1,2))
forecast::checkresiduals(best_mod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)
## Q* = 25.221, df = 7, p-value = 0.0006932
##
## Model df: 3. Total lags used: 10
From the residual ACF plot, most autocorrelations lie within the confidence bounds, so the residuals look close to white noise; however, the Ljung-Box test above (p < 0.001) suggests some autocorrelation remains.
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.017154, df = 1, p-value = 0.8958
Box.test(resid,type='Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: resid
## X-squared = 6.7406, df = 5, p-value = 0.2407
Box.test(resid,type='Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: resid
## X-squared = 25.221, df = 10, p-value = 0.004941
Box.test(resid,type='Ljung-Box',lag=15)
##
## Box-Ljung test
##
## data: resid
## X-squared = 26.953, df = 15, p-value = 0.02912
Box.test(resid,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resid
## X-squared = 44.115, df = 20, p-value = 0.001453
The Ljung-Box test checks for residual autocorrelation: a p-value below 0.05 indicates significant autocorrelation up to that lag. From the results above, we conclude that there is residual autocorrelation at lags 10, 15, and 20, but not at lags 1 and 5.
meta_final$resid = best_mod$residuals
meta_final$resid_invtransform=exp(meta_final$resid)-1
meta_final$adjclose_invtransform=exp(meta_final$close_log)-1
head(meta_final)
## Date Open High Low Close Adj.Close Volume close_log
## 2 2012-08-16 20.44 20.48 19.69 19.87 19.87 157565300 3.038313
## 3 2012-08-17 20.01 20.08 19.00 19.05 19.05 129293400 2.998229
## 4 2012-08-20 19.05 20.13 18.75 20.01 20.01 101186600 3.044999
## 5 2012-08-21 19.58 19.98 19.09 19.16 19.16 70640600 3.003700
## 6 2012-08-22 19.36 19.53 18.96 19.44 19.44 49892200 3.017494
## 7 2012-08-23 19.50 19.73 19.36 19.44 19.44 32813700 3.017494
## close_diff_log resid resid_invtransform adjclose_invtransform
## 2 -0.06177956 0.003038311 0.003042932 19.87
## 3 -0.04008367 -0.040003185 -0.039213621 19.05
## 4 0.04676941 0.044504120 0.045509284 20.01
## 5 -0.04129807 -0.037627267 -0.036928157 19.16
## 6 0.01379337 0.009918894 0.009968249 19.44
## 7 0.00000000 0.002384820 0.002387666 19.44
meta_final$pred = meta_final$Adj.Close - meta_final$resid_invtransform
ggplot(meta_final)+
geom_line(aes(Date, Adj.Close),color='red')+
geom_line(aes(Date, pred),color='blue',alpha=0.4)+
theme_bw()+
xlab("Date")+
ylab("Adjusted Closing Price")
Since we modeled the outcome on the log scale, we exponentiate back to the price scale before computing the RMSE.
RMSE = sqrt(mean((meta_final$pred - meta_final$Adj.Close)^2, na.rm=T))
RMSE
## [1] 0.02261912
We are getting an RMSE of 0.02262.
Because pred is constructed by subtracting the back-transformed residual from the observed price, this figure is effectively the RMSE of the back-transformed log-scale residuals: a typical one-step-ahead error of roughly 2.3%, not a dollar amount.
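For an error on the dollar scale, a sketch that back-transforms the one-step fitted values instead, using the identity residual = observed - fitted on the log scale:
fitted_log <- meta_final$close_log - meta_final$resid # one-step fitted values on the log scale
fitted_price <- expm1(fitted_log)                     # undo the log1p transform
sqrt(mean((fitted_price - meta_final$Adj.Close)^2, na.rm = TRUE)) # in-sample RMSE in dollars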
head(best_mod)
## $coef
## ar1 ma1 ma2
## -0.38371785 0.32952355 0.00958274
##
## $sigma2
## [1] 0.0005120141
##
## $var.coef
## ar1 ma1 ma2
## ar1 0.060247786 -0.05994187 0.0041935925
## ma1 -0.059941867 0.06009803 -0.0040060197
## ma2 0.004193592 -0.00400602 0.0008118332
##
## $mask
## [1] TRUE TRUE TRUE
##
## $loglik
## [1] 5644.482
##
## $aic
## [1] -11280.96
best_mod %>%
forecast(h=150) %>%
autoplot()
forecast= best_mod %>%
forecast(h=150)
forecast$mean=exp(forecast$mean)-1
forecast$fitted=exp(forecast$fitted)-1
forecast$x=exp(forecast$x)-1
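# The prediction intervals should be back-transformed the same way, so that the
# plotted bands are also on the price scale (a sketch; lower/upper are the
# interval matrices of the forecast object):
forecast$lower=exp(forecast$lower)-1
forecast$upper=exp(forecast$upper)-1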
autoplot(forecast, main = 'Forecast of Actual Stock Price over time', ylab = 'Actual Stock Price')
The forecast indicates that there could be a decrease in Meta stock prices in the near future.
library(prophet)
prophet_data = data %>%
rename(ds = Date, # Have to name our date variable "ds"
y = Adj.Close) # Have to name our time series "y"
train = prophet_data %>%
filter(ds<ymd("2021-01-01"))
test = prophet_data %>%
filter(ds>=ymd("2021-01-01"))
model = prophet(train)
future = make_future_dataframe(model,periods = 365)
forecast = predict(model,future)
plot(model,forecast)+
ylab("Adjusted Closing Price of Meta Stocks")+xlab("Date")+theme_bw()
dyplot.prophet(model,forecast)
prophet_plot_components(model,forecast)
Let us make a two-year forecast to identify the need for saturation points.
two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Adjusted Closing Price")
From the above graph, we observe no abnormal rise or fall in the forecast stock prices; moreover, stock prices naturally remain positive. So there is no need to set saturation points for this model.
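For reference, a hedged sketch of how saturation would be specified if it were needed: Prophet supports logistic growth with a cap column, where the cap of 1000 below is purely illustrative.
train_cap = train %>% mutate(cap = 1000)  # carrying capacity required for logistic growth
model_sat = prophet(train_cap, growth = 'logistic')
future_sat = make_future_dataframe(model_sat, periods = 365)
future_sat$cap = 1000                     # the future frame needs the cap as well
forecast_sat = predict(model_sat, future_sat)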
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Adjusted Closing Price")
We can manually specify a larger number of changepoints to try to improve performance.
# Number of Changepoints
model = prophet(train,n.changepoints=50)
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Adjusted Stock Price")
prophet_plot_components(model,forecast)
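Trend flexibility can also be tuned without fixing the number of changepoints, via the changepoint prior scale (default 0.05). A sketch, where larger values permit a more flexible trend at the risk of overfitting:
model_flexible = prophet(train, changepoint.prior.scale = 0.5)
flexible_fcst = predict(model_flexible, future)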
forecast_plot_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2021-01-01"))
forecast_not_scaled = ggplot()+
geom_line(aes(test$ds,test$y))+ xlab("Date") + ylab("Adjusted Closing Price from test data") +
geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat), color='red')
forecast_scaled = forecast_not_scaled +
ylim(0,1000)
forecast_scaled
Next, we compare additive and multiplicative seasonality specifications for Prophet.
additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)+
ylim(0,2500)
prophet_plot_components(additive,add_fcst)
multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,2500)
prophet_plot_components(multi, multi_fcst)
We can also incorporate US holiday effects into the Prophet model.
model = prophet(train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)
forecast = predict(model,future)
prophet_plot_components(model,forecast)
forecast %>%
filter(
holidays!=0,
ds == ymd("2020-11-11") |
ds == ymd("2020-10-12")) %>%
select(ds,`Veterans Day`,`Columbus Day`)
## ds Veterans Day Columbus Day
## 1 2020-10-12 0.0000000 0.9984599
## 2 2020-11-11 0.7518858 0.0000000
forecast %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
theme_bw()
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2021-01-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 42.51"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 35.61"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.11"
The observed out-of-sample performance metrics on the test window are RMSE = 42.51, MAE = 35.61, and MAPE = 0.11 (about 11%).
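One caveat, as an assumption on my part: make_future_dataframe() emits calendar days, including non-trading days, so test$y and forecast_metric_data$yhat may differ in length and silently recycle. A stricter sketch joins forecasts to the test dates first:
eval_df = test %>% inner_join(forecast_metric_data, by = "ds")
sqrt(mean((eval_df$y - eval_df$yhat)^2)) # RMSE on matched trading days only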
library(tidyverse)
library(caret)
library(prophet)
We now run rolling-origin cross-validation: the model is trained on an initial window of 730 days and evaluated over a 30-day horizon, with the cutoff advancing 30 days per fold.
df.cv <- cross_validation(model, initial = 730, period = 30, horizon = 30, units = 'days')
head(df.cv)
## y ds yhat yhat_lower yhat_upper cutoff
## 1 77.26 2014-09-05 78.66006 75.69237 81.64817 2014-09-04
## 2 77.89 2014-09-08 79.52553 76.38671 82.71303 2014-09-04
## 3 76.67 2014-09-09 79.93858 76.95082 83.07826 2014-09-04
## 4 77.43 2014-09-10 80.45137 77.50634 83.46513 2014-09-04
## 5 77.92 2014-09-11 80.81329 77.84398 83.95192 2014-09-04
## 6 77.48 2014-09-12 81.04144 77.93581 83.99811 2014-09-04
unique(df.cv$cutoff)
## [1] "2014-09-04 GMT" "2014-10-04 GMT" "2014-11-03 GMT" "2014-12-03 GMT"
## [5] "2015-01-02 GMT" "2015-02-01 GMT" "2015-03-03 GMT" "2015-04-02 GMT"
## [9] "2015-05-02 GMT" "2015-06-01 GMT" "2015-07-01 GMT" "2015-07-31 GMT"
## [13] "2015-08-30 GMT" "2015-09-29 GMT" "2015-10-29 GMT" "2015-11-28 GMT"
## [17] "2015-12-28 GMT" "2016-01-27 GMT" "2016-02-26 GMT" "2016-03-27 GMT"
## [21] "2016-04-26 GMT" "2016-05-26 GMT" "2016-06-25 GMT" "2016-07-25 GMT"
## [25] "2016-08-24 GMT" "2016-09-23 GMT" "2016-10-23 GMT" "2016-11-22 GMT"
## [29] "2016-12-22 GMT" "2017-01-21 GMT" "2017-02-20 GMT" "2017-03-22 GMT"
## [33] "2017-04-21 GMT" "2017-05-21 GMT" "2017-06-20 GMT" "2017-07-20 GMT"
## [37] "2017-08-19 GMT" "2017-09-18 GMT" "2017-10-18 GMT" "2017-11-17 GMT"
## [41] "2017-12-17 GMT" "2018-01-16 GMT" "2018-02-15 GMT" "2018-03-17 GMT"
## [45] "2018-04-16 GMT" "2018-05-16 GMT" "2018-06-15 GMT" "2018-07-15 GMT"
## [49] "2018-08-14 GMT" "2018-09-13 GMT" "2018-10-13 GMT" "2018-11-12 GMT"
## [53] "2018-12-12 GMT" "2019-01-11 GMT" "2019-02-10 GMT" "2019-03-12 GMT"
## [57] "2019-04-11 GMT" "2019-05-11 GMT" "2019-06-10 GMT" "2019-07-10 GMT"
## [61] "2019-08-09 GMT" "2019-09-08 GMT" "2019-10-08 GMT" "2019-11-07 GMT"
## [65] "2019-12-07 GMT" "2020-01-06 GMT" "2020-02-05 GMT" "2020-03-06 GMT"
## [69] "2020-04-05 GMT" "2020-05-05 GMT" "2020-06-04 GMT" "2020-07-04 GMT"
## [73] "2020-08-03 GMT" "2020-09-02 GMT" "2020-10-02 GMT" "2020-11-01 GMT"
## [77] "2020-12-01 GMT"
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color = factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Adjusted Closing Price")+
scale_color_discrete(name = 'Cutoff')
We can generate a table for various performance metrics for different time horizons.
performance_metrics(df.cv)
## horizon mse rmse mae mape mdape smape
## 1 3 days 237.0759 15.39727 11.28247 0.06555776 0.05776137 0.06557412
## 2 4 days 254.1698 15.94270 11.89033 0.06920585 0.05870524 0.06922211
## 3 5 days 260.7918 16.14905 12.06646 0.07113284 0.06180914 0.07118866
## 4 6 days 276.2903 16.62198 12.32790 0.07366077 0.06260145 0.07357199
## 5 7 days 278.4242 16.68605 12.32800 0.07417991 0.06360208 0.07392957
## 6 8 days 287.8141 16.96509 12.61406 0.07571646 0.06605957 0.07520361
## 7 9 days 293.8805 17.14294 12.76515 0.07672832 0.06531488 0.07594572
## 8 10 days 300.2734 17.32840 12.80279 0.07727192 0.06616708 0.07621038
## 9 11 days 337.8040 18.37944 13.53539 0.08255582 0.06683741 0.08109157
## 10 12 days 361.9484 19.02494 14.16337 0.08755908 0.06983869 0.08592160
## 11 13 days 353.9851 18.81449 14.07069 0.08771223 0.07252843 0.08640626
## 12 14 days 332.5793 18.23676 13.56329 0.08452660 0.07115921 0.08363761
## 13 15 days 319.0513 17.86201 13.35490 0.08150186 0.07124705 0.08129182
## 14 16 days 333.5775 18.26410 13.71659 0.08370060 0.07124705 0.08297630
## 15 17 days 343.3003 18.52836 13.91428 0.08437655 0.07089151 0.08328492
## 16 18 days 350.0995 18.71094 13.84236 0.08457387 0.06355048 0.08318298
## 17 19 days 346.8999 18.62525 13.61401 0.08321377 0.06209464 0.08208697
## 18 20 days 323.4802 17.98555 13.20953 0.08165723 0.06112274 0.08128777
## 19 21 days 324.8037 18.02231 13.23080 0.08199376 0.06252495 0.08197834
## 20 22 days 328.6912 18.12984 13.46191 0.08355830 0.06366675 0.08388263
## 21 23 days 361.2234 19.00588 13.87081 0.08468539 0.06656644 0.08483306
## 22 24 days 365.0178 19.10544 13.90756 0.08519675 0.06656644 0.08493320
## 23 25 days 374.7946 19.35961 13.77430 0.08471988 0.06505186 0.08423006
## 24 26 days 333.9311 18.27378 13.11343 0.08216458 0.06357478 0.08124851
## 25 27 days 336.6305 18.34749 13.49021 0.08408171 0.06535908 0.08330225
## 26 28 days 350.7630 18.72867 13.99384 0.08590424 0.07119074 0.08532094
## 27 29 days 393.2120 19.82957 14.91387 0.09035837 0.08163422 0.09037273
## 28 30 days 409.7096 20.24128 14.89378 0.08941382 0.08112612 0.08959061
## coverage
## 1 0.3577918
## 2 0.3333333
## 3 0.3018868
## 4 0.3102344
## 5 0.3342827
## 6 0.3207547
## 7 0.3144654
## 8 0.3001715
## 9 0.3039832
## 10 0.2641509
## 11 0.2641509
## 12 0.2720708
## 13 0.2893082
## 14 0.3018868
## 15 0.3253827
## 16 0.3476919
## 17 0.3477754
## 18 0.3333333
## 19 0.3291405
## 20 0.2955975
## 21 0.3018868
## 22 0.2870535
## 23 0.3021197
## 24 0.3039429
## 25 0.2791485
## 26 0.2642653
## 27 0.2327044
## 28 0.2641509
Now, we can plot visualizations for metrics like RMSE, MAE, MAPE, MDAPE, SMAPE.
plot_cross_validation_metric(df.cv, metric = 'rmse')
plot_cross_validation_metric(df.cv, metric = 'mae')
plot_cross_validation_metric(df.cv, metric = 'mape')
plot_cross_validation_metric(df.cv, metric = 'mdape')
plot_cross_validation_metric(df.cv, metric = 'smape')
Finally, we compare the best ARIMA and Prophet models on the holdout data. We first run an exhaustive (non-stepwise) search for the best non-seasonal ARIMA on the training series:
best_mod_ar <- auto.arima(train$y, stationary = FALSE, allowdrift = FALSE,
seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
best_mod_ar
## Series: train$y
## ARIMA(3,1,0)
##
## Coefficients:
## ar1 ar2 ar3
## -0.0661 0.0125 -0.0916
## s.e. 0.0217 0.0217 0.0217
##
## sigma^2 = 9.389: log likelihood = -5350.09
## AIC=10708.19 AICc=10708.2 BIC=10730.8
checkresiduals(best_mod_ar)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,0)
## Q* = 28.423, df = 7, p-value = 0.0001843
##
## Model df: 3. Total lags used: 10
We get ARIMA(3,1,0) as the best model on the training dataset, although the Ljung-Box p-value above (0.0002) indicates some residual autocorrelation remains.
test_pred = predict(best_mod_ar, 275) # forecast over the test window (~275 trading days)
test_pred_ar = test_pred$pred
error_ar = test$y - test_pred_ar
out_rmse_ar = sqrt(mean(error_ar^2, na.rm = TRUE)) # out-of-sample RMSE for ARIMA
best_mod_pr = prophet(train, weekly.seasonality = FALSE, daily.seasonality = FALSE,
yearly.seasonality = TRUE)
future = make_future_dataframe(best_mod_pr, periods = 275) # build the future frame from the model being evaluated, not the earlier holiday model
forecast = predict(best_mod_pr, future)
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2021-01-01"))
out_rmse_pr = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
tibble(
`best_ARIMA` = round(out_rmse_ar,2),
`best_Prophet` = round(out_rmse_pr,2)
)
## # A tibble: 1 x 2
## best_ARIMA best_Prophet
## <dbl> <dbl>
## 1 58.3 43.4
As per the RMSE criterion, the Prophet model has the lower RMSE (43.4 vs 58.3), so it is the better of the two.
In general, ARIMA modeling has its limitations: it can be computationally expensive, has difficulty adapting to changepoints, and tends to produce poor long-term forecasts.
The Prophet model gains an upper hand over ARIMA because it natively handles daily data with weekly and yearly seasonality, plus holiday effects.