Author: Fnu Jagriti
In the modern era, both organizations and individuals make data-driven decisions, and we often have to deal with time-series data, i.e., data collected sequentially in time. I have chosen the Amazon stock price dataset for my analysis. Stock price prediction is a hot topic in the financial world, and accurate predictions can yield significant profit. For an investor, stock price data play a crucial role in buy or sell decisions. The aim of this project is to perform technical analysis in order to predict Amazon’s future stock prices from its historical data.
# Required packages
library(tidyverse)
library(quantmod)
library(highcharter)
library(ggplot2)
library(dplyr)
library(tseries)
library(lubridate)
library(forecast)
library(patchwork)
library(data.table)
library(prophet)
# Loading AMAZON Stock Price data in a variable
AMZN_data <- getSymbols(Symbols = "AMZN", src = "yahoo", from = "2007-01-03", to = "2022-01-19", auto.assign = FALSE)
The dataset is retrieved with the “quantmod” package in R, which pulls data from Yahoo Finance; it contains Amazon’s historical stock prices from 2007-01-03 to 2022-01-18.
# Structure of the AMAZON Stock Price data
glimpse(AMZN_data)
## An 'xts' object on 2007-01-03/2022-01-18 containing:
## Data: num [1:3788, 1:6] 38.7 38.6 38.7 38.2 37.6 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:6] "AMZN.Open" "AMZN.High" "AMZN.Low" "AMZN.Close" ...
## Indexed by objects of class: [Date] TZ: UTC
## xts Attributes:
## List of 2
## $ src : chr "yahoo"
## $ updated: POSIXct[1:1], format: "2022-02-26 20:37:30"
There are a total of 6 variables (columns) and 3788 observations (rows) in the dataset. The variables are as follows:
AMZN.Open: The opening price for that day
AMZN.High: The highest price paid for the stock that day
AMZN.Low: The lowest price paid for the stock that day
AMZN.Close: The closing price for that day
AMZN.Volume: The volume of stocks moved that day
AMZN.Adjusted: The adjusted closing price for that day
# Displaying last 6 observations of the AMAZON Stock Price data
tail(AMZN_data)
## AMZN.Open AMZN.High AMZN.Low AMZN.Close AMZN.Volume AMZN.Adjusted
## 2022-01-10 3211.71 3233.23 3126.09 3229.72 4389900 3229.72
## 2022-01-11 3230.00 3327.00 3214.03 3307.24 3140300 3307.24
## 2022-01-12 3331.50 3337.56 3288.34 3304.14 2501500 3304.14
## 2022-01-13 3305.01 3324.43 3221.82 3224.28 2609400 3224.28
## 2022-01-14 3203.00 3245.00 3196.01 3242.76 2295800 3242.76
## 2022-01-18 3182.10 3194.69 3153.29 3178.35 3364600 3178.35
The table above shows the last 6 observations of the Amazon stock price dataset and gives an overview of the data on which I will perform exploratory data analysis and create visualizations that will later support forecast modeling.
The closing price is the variable that will be forecasted. I expect difficulties in forecasting it because of random fluctuations that are not captured by seasonality or trend, such as the sudden onset of COVID-19.
Missing Values Check:
#To check for missing values in each column
colSums(is.na(AMZN_data))
## AMZN.Open AMZN.High AMZN.Low AMZN.Close AMZN.Volume
## 0 0 0 0 0
## AMZN.Adjusted
## 0
From the output above, we can see there are no missing values, so the data are clean in this respect.
Summary Statistics
# Summary Statistics of Amazon Stock Price Data
summary(AMZN_data)
## Index AMZN.Open AMZN.High AMZN.Low
## Min. :2007-01-03 Min. : 35.29 Min. : 37.07 Min. : 34.68
## 1st Qu.:2010-10-05 1st Qu.: 156.98 1st Qu.: 159.26 1st Qu.: 154.51
## Median :2014-07-12 Median : 335.20 Median : 337.45 Median : 331.56
## Mean :2014-07-11 Mean : 875.34 Mean : 884.58 Mean : 865.09
## 3rd Qu.:2018-04-16 3rd Qu.:1525.01 3rd Qu.:1541.28 3rd Qu.:1497.64
## Max. :2022-01-18 Max. :3744.00 Max. :3773.08 Max. :3696.79
## AMZN.Close AMZN.Volume AMZN.Adjusted
## Min. : 35.03 Min. : 881300 Min. : 35.03
## 1st Qu.: 156.43 1st Qu.: 3021500 1st Qu.: 156.43
## Median : 333.82 Median : 4331550 Median : 333.82
## Mean : 875.09 Mean : 5449362 Mean : 875.09
## 3rd Qu.:1517.01 3rd Qu.: 6531825 3rd Qu.:1517.01
## Max. :3731.41 Max. :104329200 Max. :3731.41
Outlier Detection
An important part of data cleaning is identifying outliers. Outliers may arise from incorrect data or from genuinely extreme values. We will examine the outliers in each numeric column to decide whether to keep or remove them.
I have used boxplots to inspect the outliers in each numeric column.
# Creating boxplot of all numeric variables
par(mfrow=c(2,3))
boxplot(AMZN_data$AMZN.Open, ylab = "Opening Price")
boxplot(AMZN_data$AMZN.High, ylab = "Highest Price")
boxplot(AMZN_data$AMZN.Low, ylab = "Lowest Price")
boxplot(AMZN_data$AMZN.Close, ylab = "Closing Price")
boxplot(AMZN_data$AMZN.Volume, ylab = "Transaction Volume")
boxplot(AMZN_data$AMZN.Adjusted, ylab = "Adjusted Price")
From the boxplots above, we can see that every numeric column contains outliers. At this point, these outliers appear to come from genuinely extreme values rather than data errors, so we will keep them in the dataset.
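As a rough numerical companion to the boxplots, the sketch below counts, for each column, how many observations fall outside the usual 1.5 times IQR whiskers. The rule is only a heuristic, so treat the counts as indicative rather than definitive.
# Count observations beyond the 1.5 * IQR whiskers for each numeric column
sapply(as.data.frame(AMZN_data), function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
})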
As we know, time-series data are collected sequentially in time, usually at a fixed interval such as a day, a month, or a year. When we plot the data, we can identify patterns such as trend, seasonality, or a combination of both. Identifying such patterns supports effective decisions; for example, seasonal patterns in stock prices can help determine when to buy or sell. We can see these patterns in more detail by decomposing the data.
Stock price data can be visualized with static or interactive plots. These plots help identify patterns in the data and thus aid in making informed investment decisions.
Static Plot
# Candle stick chart
chartSeries(AMZN_data, name = "Amazon Stock Price Over Years")
From the above chart, we can see an overall upward trend and some seasonal pattern. For better clarity, I will decompose the series, focusing on the closing price; the decomposition will be done on both the monthly and the daily closing price.
First, we aggregate the closing price to a monthly series and then decompose it to look for patterns.
# Transforming closing price on monthly period
monthly_cp <- Cl(to.monthly(AMZN_data))
# Monthly decomposition of Amazon Stock Price Data
monthly_decomp <- decompose(as.ts(monthly_cp, start=c(2007,1)))
plot(monthly_decomp)
The output above shows four panels of our monthly closing price data:
Observed: the original series.
Trend: long-term movements in the mean. There has been a continuous upward trend since 2007, and since 2015 closing prices have been increasing at a noticeably faster rate.
Seasonal: the repetitive seasonal fluctuation in the data. Amazon’s closing price tends to peak in August and bottom out in March; based on this pattern alone, August looks like the best time to sell and March the best time to buy (see the sketch after this list).
Random: irregular fluctuations not captured by the trend or seasonal components. The COVID-19 shock is a prime example. When the random component dominates, forecasting becomes difficult.
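Rather than eyeballing the seasonal panel, the peak and trough months can be read off the decomposition directly. This is a minimal sketch, assuming the monthly series starts in January so the seasonal figure aligns with calendar months:
# Seasonal figure from the decomposition: one effect per calendar month
seasonal_figure <- monthly_decomp$figure
names(seasonal_figure) <- month.abb
seasonal_figure[which.max(seasonal_figure)] # month with the strongest positive effect
seasonal_figure[which.min(seasonal_figure)] # month with the strongest negative effect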
Now we will use the daily closing price for further analysis. We take the logarithm of the daily closing prices and then the first difference (lag 1) of the log series. The log transformation turns Amazon’s exponential growth into a roughly linear trend and stabilizes the variance, and differencing the log series removes the trend and much of the autocorrelation and seasonal behavior, leaving something close to daily log returns. We will also clean the data by checking for and removing any missing values.
# Transforming closing price on daily period
daily_cp <- Cl(to.daily(AMZN_data))
# Log of Daily closing price
daily_cp_log <- log(daily_cp)
# Differentiation of Log of Daily closing price having lag 1
daily_cp_log_diff <- diff(daily_cp_log, lag = 1)
# Checking for NA values in daily_cp_log_diff
anyNA(daily_cp_log_diff)
## [1] TRUE
# Counting number of NA values in daily_cp_log_diff
colSums(is.na(daily_cp_log_diff))
## AMZN_data.Close
## 1
# Filtering out the observations with NA values in daily_cp_log_diff
daily_cp_log_diff <- daily_cp_log_diff[!is.na(daily_cp_log_diff)]
Since there is only one missing value (the first difference is undefined for the first observation), removing it causes essentially no data loss or bias.
The log closing price and its first difference are important inputs for the forecasting models later on, because the transformation and differencing remove the growth and seasonal effects that could otherwise distort the predictions.
#Plot of Logarithmic Daily closing Price
plot(daily_cp_log, main = "Amazon Log Closing Price")
The above plot shows a clear growth trend, but there is still considerable variability in the rate of growth. This indicates that the series is non-stationary. We remove this non-stationarity by differencing the series.
#Plot of Differentiation of Logarithmic Daily closing Price
plot(daily_cp_log_diff, type = "l", main = "Amazon Log Differentiation")
From the above plot, we can see that after differencing the series is much closer to stationary. This will later help our forecasting model make better predictions.
Interactive Plot
highchart(type="stock") %>%
hc_add_series(AMZN_data) %>%
hc_add_series(SMA(na.omit(Cl(AMZN_data)),n=50),name="SMA(50)") %>%
hc_add_series(SMA(na.omit(Cl(AMZN_data)),n=200),name="SMA(200)") %>%
hc_title(text="<b>Amazon Price Chart Over the Years</b>")
The above chart shows a short-term SMA (Simple Moving Average) in black and a long-term SMA in green. These help us comment on the trend using the golden cross and death cross heuristics.
We have a golden cross when the short-term SMA crosses above the long-term SMA, and a death cross when it crosses below. In other words, a golden cross is commonly read as a good time to buy the stock and a death cross as a good time to sell, although other factors also need to be considered.
From the plot, we can see that the short-term SMA (black line) is mostly above the long-term SMA (green line), indicating golden-cross conditions. However, the black line dips below the green line (a death cross) between roughly January and March 2019.
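The crossover dates do not have to be read off the chart by eye. A minimal sketch that flags the days on which the SMA(50) - SMA(200) spread changes sign, i.e., the golden and death crosses:
# Dates on which the short-term SMA crosses the long-term SMA
spread <- na.omit(SMA(Cl(AMZN_data), n = 50) - SMA(Cl(AMZN_data), n = 200))
flips <- which(diff(sign(as.numeric(spread))) != 0) + 1 # first day under the new regime
index(spread)[flips]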
#Converting the dataset from xts to dataframe
AMZN_data_df <- as.data.frame(AMZN_data)
AMZN_data_df$Date <- row.names(AMZN_data_df)
# Move last column to first position
cln <- ncol(AMZN_data_df) # 7
AMZN_data_df <- AMZN_data_df[, c(cln, 1:(cln-1))]
row.names(AMZN_data_df) <- NULL
head(AMZN_data_df)
## Date AMZN.Open AMZN.High AMZN.Low AMZN.Close AMZN.Volume AMZN.Adjusted
## 1 2007-01-03 38.68 39.06 38.05 38.70 12405100 38.70
## 2 2007-01-04 38.59 39.14 38.26 38.90 6318400 38.90
## 3 2007-01-05 38.72 38.79 37.60 38.37 6619700 38.37
## 4 2007-01-08 38.22 38.31 37.17 37.50 6783000 37.50
## 5 2007-01-09 37.60 38.06 37.34 37.78 5703000 37.78
## 6 2007-01-10 37.49 37.70 37.07 37.15 6527500 37.15
AMZN_data_df <- AMZN_data_df%>%arrange(Date)
AMZN_data_df$Index <- 1:3788
#Linear Model fit
linearmodel = lm(AMZN.Close~Index, data = AMZN_data_df)
summary(linearmodel)
##
## Call:
## lm(formula = AMZN.Close ~ Index, data = AMZN_data_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -714.25 -431.25 -83.11 299.37 1434.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.559e+02 1.653e+01 -39.68 <2e-16 ***
## Index 8.081e-01 7.556e-03 106.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 508.6 on 3786 degrees of freedom
## Multiple R-squared: 0.7513, Adjusted R-squared: 0.7512
## F-statistic: 1.144e+04 on 1 and 3786 DF, p-value: < 2.2e-16
From the above statistics, the R-squared value is 0.75, which means the time index explains about 75% of the variability in the closing price. However, other models may forecast better and handle seasonality and trend better than linear regression; as we continue, we will look for a better-fitting time-series model.
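A quick visual check of that fit: the sketch below overlays the fitted straight-line trend (in red) on the actual closing price, which makes any long stretches of over- or under-prediction by the linear model easy to see.
# Overlay the fitted linear trend on the actual closing price
AMZN_data_df %>%
  mutate(lm_fit = fitted(linearmodel)) %>%
  ggplot(aes(Index, AMZN.Close)) +
  geom_line() +
  geom_line(aes(y = lm_fit), color = "red") +
  labs(x = "Trading day index", y = "Closing price")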
# To keep index of Amazon Stock Price dataset in a variable 'date'
date = zoo::index(AMZN_data)
# Split into Train/Test
AMZN_Train = AMZN_data_df %>%
filter(date < ymd("2018-01-01"))
AMZN_Test = AMZN_data_df %>%
filter(date >= ymd("2018-01-01"))
# Plot Time Series with a vertical line at 2018 to partition train and test data
AMZN_data %>%
ggplot() +
geom_line(aes(date, AMZN.Close)) +
geom_vline(aes(xintercept = ymd("2018-01-01")), color = 'red')
From the above time-series plot, we can see that the series is not stationary: the variance is not constant over time. This is what I expected, since stock prices fluctuate frequently in response to many random factors; a stock price series is not expected to be stationary, otherwise prediction would be far easier than it actually is.
Since the series is variance non-stationary, we will transform it using a Box-Cox transformation and a log transformation.
# Create Box-Cox and log transformed variables
AMZN_data_df = AMZN_data_df %>%
mutate(
AMZN_close_box = BoxCox(AMZN.Close, lambda = 'auto'),
AMZN_close_log = log1p(AMZN.Close)
)
# Save best lambda
best_lambda = attr(BoxCox(AMZN_data_df$AMZN.Close, lambda = 'auto'), 'lambda')
best_lambda
## [1] 0.157565
The Box-Cox transformation selects a best lambda of 0.157565, the value that brings the transformed series closest to a normal distribution.
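As a quick sanity check, BoxCox.lambda() from the forecast package estimates the optimal lambda directly. Its defaults differ slightly from BoxCox(lambda = 'auto'), so I would expect a value close to, but not necessarily identical to, 0.157565.
# Independent estimate of the Box-Cox lambda
BoxCox.lambda(AMZN_data_df$AMZN.Close)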
# Plot each var
par(mfrow = c(1, 3))
plot(AMZN_data_df$AMZN.Close, type = 'l')
plot(AMZN_data_df$AMZN_close_box, type = 'l')
plot(AMZN_data_df$AMZN_close_log, type = 'l')
# Plot rolling SD of each var
AMZN_close_sd = AMZN_data_df %>%
mutate(sd = zoo::rollapply(AMZN.Close, 7, FUN = sd, fill = NA)) %>%
ggplot() +
geom_line(aes(date, sd)) +
labs(title = 'Original Close 7-Day Rolling Standard Deviation') +
theme_bw()
AMZN_box_close_sd = AMZN_data_df %>%
mutate(sd = zoo::rollapply(AMZN_close_box, 7, FUN = sd, fill = NA)) %>%
ggplot() +
geom_line(aes(date, sd)) +
labs(title = 'Box-Cox Transformed 7-Day Rolling Standard Deviation') +
theme_bw()
AMZN_log_close_sd = AMZN_data_df %>%
mutate(sd = zoo::rollapply(AMZN_close_log, 7, FUN = sd, fill = NA)) %>%
ggplot() +
geom_line(aes(date, sd)) +
labs(title = 'Log Transformed 7-Day Rolling Standard Deviation') +
theme_bw()
AMZN_close_sd + AMZN_box_close_sd + AMZN_log_close_sd
Both the Box-Cox transformation and the log transformation help stabilize the variance of this time series.
# Test for stationarity
adf.test(AMZN_data_df$AMZN_close_box)
##
## Augmented Dickey-Fuller Test
##
## data: AMZN_data_df$AMZN_close_box
## Dickey-Fuller = -2.4154, Lag order = 15, p-value = 0.4024
## alternative hypothesis: stationary
kpss.test(AMZN_data_df$AMZN_close_box)
##
## KPSS Test for Level Stationarity
##
## data: AMZN_data_df$AMZN_close_box
## KPSS Level = 36.945, Truncation lag parameter = 9, p-value = 0.01
The p-value of the Augmented Dickey-Fuller test is 0.4024, greater than the 0.05 significance level, so we fail to reject the null hypothesis of a unit root; this suggests the series is non-stationary.
The p-value of the KPSS test is 0.01, less than 0.05, so we reject its null hypothesis of level stationarity, which also points to non-stationarity.
Together, these tests indicate the process is not mean stationary. Let’s take a first difference of the data and check whether that makes it mean stationary.
# Take first difference
AMZN_data_df = AMZN_data_df %>%
mutate(AMZN_close_dif = AMZN_close_box - lag(AMZN_close_box))
# Test for stationary again
adf.test(AMZN_data_df$AMZN_close_dif[!is.na(AMZN_data_df$AMZN_close_dif)]) # Stationary
##
## Augmented Dickey-Fuller Test
##
## data: AMZN_data_df$AMZN_close_dif[!is.na(AMZN_data_df$AMZN_close_dif)]
## Dickey-Fuller = -16.292, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(AMZN_data_df$AMZN_close_dif[!is.na(AMZN_data_df$AMZN_close_dif)]) # Fails to reject stationarity
##
## KPSS Test for Level Stationarity
##
## data: AMZN_data_df$AMZN_close_dif[!is.na(AMZN_data_df$AMZN_close_dif)]
## KPSS Level = 0.039197, Truncation lag parameter = 9, p-value = 0.1
The p-value of the Augmented Dickey-Fuller test is now 0.01, less than 0.05, so we reject the unit-root null; and the p-value of the KPSS test is 0.1, greater than 0.05, so we fail to reject level stationarity. Both tests now suggest the differenced series is mean stationary.
# Plot differenced data
AMZN_data_df %>%
ggplot() +
geom_line(aes(date, AMZN_close_dif))
The above plot also shows that the differenced series is mean stationary, and there is no obvious seasonality left. This is because taking the first difference of the Box-Cox-transformed closing price removes the trend and most of the systematic structure from the series.
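A compact cross-check: ndiffs() from the forecast package estimates how many differences are needed for stationarity, and given the test results above I would expect it to return 1.
# Estimated number of differences required for stationarity
ndiffs(AMZN_data_df$AMZN_close_box)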
# Examine ACF/PACF
par(mfrow = c(1, 2))
acf(AMZN_data_df$AMZN_close_dif[!is.na(AMZN_data_df$AMZN_close_dif)])
pacf(AMZN_data_df$AMZN_close_dif[!is.na(AMZN_data_df$AMZN_close_dif)])
In the ACF plot above, the autocorrelations cut off quickly rather than tailing off gradually, which argues against a pure autoregressive process and points to a moving-average one. Reading the MA order off the ACF, roughly two lags stand clearly above the significance bounds (the blue dashed lines), suggesting an MA component of order 2. And since a single difference was enough to make the series approximately stationary, the order of integration appears to be 1.
Let’s confirm the order of this series using auto.arima.
# Run auto ARIMA
auto_arima = auto.arima(AMZN_data_df$AMZN_close_box)
auto_arima
## Series: AMZN_data_df$AMZN_close_box
## ARIMA(0,1,2) with drift
##
## Coefficients:
## ma1 ma2 drift
## -0.0302 -0.0346 3e-03
## s.e. 0.0163 0.0165 9e-04
##
## sigma^2 = 0.003328: log likelihood = 5431.28
## AIC=-10854.57 AICc=-10854.56 BIC=-10829.61
Interestingly, auto.arima selects the same order we identified by eye: ARIMA(0,1,2) with drift for the Box-Cox-transformed series, with an AIC of -10854.57.
However, a few more spikes in the ACF plot cross the significance bounds; they may just be white noise, but let’s check by fitting a handful of alternative orders. Note that all of these models use the Box-Cox-transformed closing price.
# Fitting seven ARIMA models with different orders
mod0 = arima(AMZN_data_df$AMZN_close_box,order=c(0,0,1))
mod1 = arima(AMZN_data_df$AMZN_close_box,order=c(0,1,0))
mod2 = arima(AMZN_data_df$AMZN_close_box,order=c(1,0,0))
mod3 = arima(AMZN_data_df$AMZN_close_box,order=c(1,1,1))
mod4 = arima(AMZN_data_df$AMZN_close_box,order=c(0,1,2))
mod5 = arima(AMZN_data_df$AMZN_close_box,order=c(0,2,1))
mod6 = arima(AMZN_data_df$AMZN_close_box,order=c(2,1,0))
#Comparing AIC and BIC values of different models
AIC_BIC_Table <- matrix(
  c(AIC(mod0), AIC(mod1), AIC(mod2), AIC(mod3), AIC(mod4), AIC(mod5), AIC(mod6),
    BIC(mod0), BIC(mod1), BIC(mod2), BIC(mod3), BIC(mod4), BIC(mod5), BIC(mod6)),
  nrow = 7, ncol = 2, byrow = FALSE,
  dimnames = list(
    c('model 0- ARIMA(0,0,1): ', 'model 1- ARIMA(0,1,0): ', 'model 2- ARIMA(1,0,0): ',
      'model 3- ARIMA(1,1,1): ', 'model 4- ARIMA(0,1,2): ', 'model 5- ARIMA(0,2,1): ',
      'model 6- ARIMA(2,1,0): '),
    c('AIC', 'BIC')
  )
)
AIC_BIC_Table <- as.table(AIC_BIC_Table)
AIC_BIC_Table
## AIC BIC
## model 0- ARIMA(0,0,1): 14903.47 14922.19
## model 1- ARIMA(0,1,0): -10842.81 -10836.57
## model 2- ARIMA(1,0,0): -10832.55 -10813.83
## model 3- ARIMA(1,1,1): -10845.02 -10826.30
## model 4- ARIMA(0,1,2): -10845.01 -10826.30
## model 5- ARIMA(0,2,1): -10838.86 -10826.38
## model 6- ARIMA(2,1,0): -10844.77 -10826.05
Based on the above summary of seven ARIMA models, model 3, ARIMA(1,1,1), performs best on AIC with a value of -10845.02, slightly lower than model 4, ARIMA(0,1,2), at -10845.01.
By BIC, however, the best model is model 1, ARIMA(0,1,0), with the lowest BIC of -10836.57.
I will go with ARIMA(1,1,1) as the “best” model because it has the lowest AIC, and for forecasting I prefer AIC over BIC: AIC aims to select the model that best approximates an unknown, high-dimensional reality without assuming the true model is among the candidates, which suits prediction of future closing prices.
#Computing in-sample fitted values (fitted = observed - residuals) before the inverse Box-Cox transformation
in_sample_predict = AMZN_data_df$AMZN_close_box - mod3$residuals
in_sample_predict = InvBoxCox(in_sample_predict,lambda=best_lambda)
#Converting Date column in Amazon dataset to Date type
AMZN_data_df$Date <- as.Date(AMZN_data_df$Date)
# Plot of Actual vs Predicted Values
ggplot()+
geom_line(aes(AMZN_data_df$Date,AMZN_data_df$AMZN.Close))+
geom_line(aes(AMZN_data_df$Date,in_sample_predict),color='red')
From the actual vs. predicted graph above, we can see that the fitted values track the actual data almost perfectly. This is expected: these are one-step-ahead in-sample fits, so the close match says little about how the model will perform out of sample.
# Checking the residuals
checkresiduals(mod3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 10.288, df = 8, p-value = 0.2454
##
## Model df: 2. Total lags used: 10
Based on the plot above, the residuals appear approximately normally distributed. Some autocorrelation is visible at higher lags, but it is not a major concern.
#Box-Ljung Test for autocorrelation
Box.test(mod3$residuals,type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: mod3$residuals
## X-squared = 0.048461, df = 1, p-value = 0.8258
Box.test(mod3$residuals,type = 'Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: mod3$residuals
## X-squared = 1.1637, df = 5, p-value = 0.9483
The Box-Ljung tests above show no significant autocorrelation in the residuals at lags 1 and 5, so the residuals behave like white noise.
Next, we assess the in-sample root mean squared error of the model based on these fitted values.
# RMSE Calculation
RMSE = sqrt(mean((in_sample_predict - AMZN_data_df$AMZN.Close)^2,na.rm=T))
RMSE
## [1] 24.88717
The in-sample RMSE is 24.89, which is small relative to closing prices that range from roughly 35 to 3,731 over the sample period.
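For context alongside the out-of-sample metrics reported later, the same in-sample fitted values can also be summarized with MAE and MAPE. A small sketch:
# In-sample MAE and MAPE from the fitted values
in_mae <- mean(abs(in_sample_predict - AMZN_data_df$AMZN.Close), na.rm = TRUE)
in_mape <- mean(abs((in_sample_predict - AMZN_data_df$AMZN.Close) / AMZN_data_df$AMZN.Close), na.rm = TRUE)
c(MAE = in_mae, MAPE = in_mape)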
# 150-step (roughly 5-month) forecast from the end of the data
Prediction_5 = predict(mod3,n.ahead=150)
# Create a prediction dataset for the next 5 months
AMZN_Prediction = data.frame(
pred = InvBoxCox(Prediction_5$pred,lambda=best_lambda),
se = InvBoxCox(Prediction_5$se,lambda=best_lambda),
Date=c(seq(as.Date('2022-01-19'), as.Date('2022-06-17'), by = "days"))
)
AMZN_Prediction_MERGE = AMZN_data_df %>%
mutate(in_sample_predict = in_sample_predict) %>%
full_join(AMZN_Prediction) %>%
mutate(
pred_high = pred+2*se,
pred_low = pred-2*se,
pred = pred,lambda=best_lambda,
error = pred - AMZN.Close
)
# Forecast
AMZN_Prediction_MERGE%>%
mutate(pred = if_else(is.na(pred) & !is.na(lead(pred)),in_sample_predict,pred)) %>%
ggplot()+
geom_line(aes(Date,AMZN.Close))+
geom_line(aes(Date,in_sample_predict),color='blue')+
geom_line(aes(Date,pred),color='red')+
geom_ribbon(aes(x=Date,ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)
From the forecast above, we can see that the predicted values (red line) flatten out to essentially the same closing price for every future period. This is what an ARIMA(1,1,1) without a drift term does: its point forecasts converge quickly to a constant level, which is not a reasonable projection for this series. We therefore need a better way of modeling the data to obtain more accurate predictions of Amazon’s stock price.
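Before switching approaches entirely, one refinement is worth sketching: the forecast package can handle both the drift term and the Box-Cox back-transformation internally. The settings below (a drift term and a 150-step horizon) are my choices to mirror the analysis above rather than part of the original pipeline; the point forecasts then trend upward instead of collapsing to a constant.
# ARIMA(1,1,1) with drift on the original closing price; lambda handles the Box-Cox transform
fit_drift <- Arima(AMZN_data_df$AMZN.Close, order = c(1, 1, 1),
                   include.drift = TRUE, lambda = best_lambda)
autoplot(forecast(fit_drift, h = 150))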
The Facebook Prophet algorithm draws on time-series decomposition, breaking the time series into:
Seasonal components: Daily, weekly, monthly, yearly, etc.
Holidays: For daily data
Trend: Estimated along the data with unique slopes identified using changepoint detection
The basic model is fit as
y_t = g_t + s_t + h_t + ε_t
where g_t is the trend, s_t the seasonality, h_t the holiday effects, and ε_t the error term.
The advantage of the Prophet model is that we do not need to make the data mean or variance stationary, nor do we need to inspect ACF/PACF plots.
For the Amazon stock price data, I will examine Prophet’s decomposition of the time series into its elements (trend, seasonality, etc.).
#Fitting a Basic Model
prophet_data = AMZN_data_df %>%
rename(ds = Date, # Have to name our date variable "ds"
y = AMZN.Close) # Have to name our time series "y"
#Train data
train = prophet_data %>%
filter(ds<ymd("2020-01-01"))
#Test data
test = prophet_data %>%
filter(ds>=ymd("2020-01-01"))
model = prophet(train,daily.seasonality = FALSE)
future = make_future_dataframe(model,periods = 365)
forecast = predict(model,future)
Above, I split the data into train and test sets: the training data contains observations before 2020 and the test data contains observations from 2020 onward. I then fit the Prophet model on the training data. Because the observations are daily, with no intraday detail, I set the model’s daily seasonality to FALSE. Finally, I use the fitted model to forecast future values.
#Plotting the Forecast
plot(model,forecast)+
ylab("Amazon Closing Price")+xlab("Date")+theme_bw()
The plot above shows the forecast from the Prophet model.
#Interactive Charts
dyplot.prophet(model,forecast)
Above is an interactive chart showing the forecast from this model. We can see that the forecast broadly follows the same trend as the historical data.
Next, we would like to observe the decomposition of the time series data such as trend, weekly and yearly seasonality.
#Visualizing the Time series Decomposition
prophet_plot_components(model,forecast)
Based on the decomposition above, we can see that the trend increases substantially after 2015. The weekly component shows a drop in closing price over the weekend, and the yearly component shows an increase around August and a decrease toward December, the year end.
#plotting a two year forecast
two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Amazon Close Price")
Since the model is forecasting Amazon’s closing price, there is no natural maximum. However, Prophet’s logistic growth requires both a cap and a floor, so we set the floor to zero (closing prices cannot be negative) and choose a cap of 2500 for this illustration.
# Set floor and cap in the training data and the one-year future data frame
train$floor = 0
train$cap = 2500
future$floor = 0
future$cap = 2500
# Set floor and cap in the two-year future data frame
two_yr_future$floor = 0
two_yr_future$cap = 2500
sat_model = prophet(train,growth='logistic')
sat_two_yr_forecast = predict(sat_model,two_yr_future)
plot(sat_model,sat_two_yr_forecast)+ylim(0,2500)+
theme_bw()+xlab("Date")+ylab("Amazon Closing Price")
#trend and Seasonality in Prophet
prophet_plot_components(model,forecast)
Amazon’s closing price shows its largest seasonal effect around August and its smallest toward the year end in December. Among weekdays, Thursday shows the highest effect.
#Plotting Changepoints
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Amazon Closing Price")
Changepoints mark the dates at which the estimated trend slope changes.
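A small sketch: the fitted prophet object stores its changepoint dates, so they can be listed directly rather than only read off the plot.
# Candidate changepoint dates detected by the fitted model
head(model$changepoints)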
After increasing the number of changepoints to 50, the plot looks as follows.
# Number of Changepoints
model = prophet(train,n.changepoints=50)
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Amazon Closing Price")
#Assessing seasonal decomposition
prophet_plot_components(model,forecast)
The plot above shows the trend and seasonality components after the changepoints have been identified.
#additive Seasonality
additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)+
ylim(0,2500)
#additive Seasonality Plot
prophet_plot_components(additive,add_fcst)
#Multiplicative Seasonality
multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,2500)
#Multiplicative Seasonality Plot
prophet_plot_components(multi, multi_fcst)
#specifying Holidays
model = prophet(train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)
forecast = predict(model,future)
prophet_plot_components(model,forecast)
#Examining Impact of Holidays
forecast %>%
filter(
holidays!=0,
ds == ymd("2018-11-12") |
ds == ymd("2010-12-31")) %>%
select(ds,`Veterans Day (Observed)`,`New Year's Day (Observed)`)
## ds Veterans Day (Observed) New Year's Day (Observed)
## 1 2010-12-31 0.000000 49.09847
## 2 2018-11-12 2.151406 0.00000
From the output above, we can see a positive holiday effect for both New Year’s Day (Observed) and Veterans Day (Observed).
#Holiday that Impacts the most
forecast %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
theme_bw()
Based on the plot above, the holiday with the largest estimated effect is New Year’s Day (Observed).
#Assessing Prophet Models
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
#RMSE
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 928.7"
#MAE
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 834.52"
#MAPE
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.26"
These metrics are computed on the held-out test period (2020 onward). The model captures the overall trend, but a MAPE of 0.26 means the forecasts are off by roughly 26% of the actual closing price on average, so there is room for improvement.
#Cross-validation with Prophet
df.cv <- cross_validation(model, initial = 730, period = 300, horizon = 300, units = 'days')
head(df.cv)
## y ds yhat yhat_lower yhat_upper cutoff
## 1 82.40 2009-04-28 88.63560 83.90079 93.79087 2009-04-27
## 2 79.79 2009-04-29 89.33093 83.91845 94.59263 2009-04-27
## 3 80.52 2009-04-30 90.20225 85.26709 95.27904 2009-04-27
## 4 78.96 2009-05-01 90.80308 86.14640 95.72392 2009-04-27
## 5 79.77 2009-05-04 92.14277 87.36572 97.13775 2009-04-27
## 6 81.90 2009-05-05 92.91316 87.38120 97.95071 2009-04-27
The cross-validation has fit the model over 13 forecast windows, using an initial training period of 730 days, a period of 300 days between cutoffs, and a forecast horizon of 300 days.
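The per-horizon error summaries can be computed with prophet’s performance_metrics(); a short sketch:
# Summarize cross-validation errors (RMSE, MAE, MAPE, ...) by forecast horizon
cv_metrics <- performance_metrics(df.cv)
head(cv_metrics)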
unique(df.cv$cutoff)
## [1] "2009-04-27 GMT" "2010-02-21 GMT" "2010-12-18 GMT" "2011-10-14 GMT"
## [5] "2012-08-09 GMT" "2013-06-05 GMT" "2014-04-01 GMT" "2015-01-26 GMT"
## [9] "2015-11-22 GMT" "2016-09-17 GMT" "2017-07-14 GMT" "2018-05-10 GMT"
## [13] "2019-03-06 GMT"
#cross-validation Actual vs Predicted
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Amazon Closing Price")+
scale_color_discrete(name = 'Cutoff')
#Cross Validation RMSE metric plot
plot_cross_validation_metric(df.cv, metric = 'rmse')
The RMSE of the model increases with the horizon. This seems reasonable: the model predicts the near future better than periods farther away.
ARIMA Model for RMSE comparison
To check the out-of-sample performance of the ARIMA approach, we fit an ARIMA(1,1,1) to the training data (observations before 2018).
arima_var_transform = AMZN_Train %>%
mutate(closingPrice_box = BoxCox(AMZN.Close,lambda='auto'))
# Save Lambda from Box Cox to Use Later
best_lambda2 = attr(BoxCox(AMZN_Train$AMZN.Close,lambda='auto'),'lambda')
mod3_auto = arima(arima_var_transform$closingPrice_box,order=c(1,1,1))
checkresiduals(mod3_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 9.1219, df = 8, p-value = 0.3321
##
## Model df: 2. Total lags used: 10
Box.test(mod3_auto$residuals,type = 'Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: mod3_auto$residuals
## X-squared = 7.9862, df = 5, p-value = 0.157
Since the p-value is greater than 0.05, we fail to reject the null hypothesis that the residuals are independently distributed; they behave like white noise.
Calculating the out-of-sample RMSE, MAE, and MAPE for the ARIMA model:
# Get predictions for 1019 steps ahead, one for each observation in the test set
test_pred = predict(mod3_auto, n.ahead = 1019)
test_pred_ar = test_pred$pred
error_ar = AMZN_Test$AMZN.Close - test_pred_ar
out_rmse_ar=sqrt(mean(error_ar^2,na.rm=T))
out_mae_ar = mean(abs(error_ar))
out_mape_ar = mean(abs((error_ar)/AMZN_Test$AMZN.Close))
out_rmse_ar
## [1] 2454.235
out_mae_ar
## [1] 2334.464
out_mape_ar
## [1] 0.9812361
Out-of-sample RMSE for the best ARIMA model: 2454.235. Out-of-sample MAE: 2334.464. Out-of-sample MAPE: 0.9812361.
Prophet Model for RMSE comparison
library(prophet)
Prophet_best_model = prophet(train)
future = make_future_dataframe(Prophet_best_model,periods = 300)
forecast = predict(Prophet_best_model,future)
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2018-01-01"))
out_rmse_pr = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
out_mae_pr = mean(abs(test$y - forecast_metric_data$yhat))
out_mape_pr = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
out_rmse_pr
## [1] 1184.803
out_mae_pr
## [1] 1075.054
out_mape_pr
## [1] 0.3525634
Out-of-sample RMSE for the best Prophet model: 1184.803. Out-of-sample MAE: 1075.054. Out-of-sample MAPE: 0.3525634.
Out of Sample RMSE comparison between best models built on ARIMA and Prophet
tibble(
`best_ARIMA` = round(out_rmse_ar,2),
`best_Prophet` = round(out_rmse_pr,2)
)
## # A tibble: 1 x 2
## best_ARIMA best_Prophet
## <dbl> <dbl>
## 1 2454. 1185.
The RMSE of the Prophet model is lower than that of the ARIMA model, so we pick the Prophet model for predicting Amazon’s stock price.
The ARIMA model is extremely useful for understanding the data-generating process of a time series, but the Prophet model usually gives better out-of-sample predictions. Comparing the two therefore gives an overall picture of which model performs better for this particular time series.
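For completeness, the same out-of-sample comparison can be extended to MAE and MAPE using the quantities already computed above; a short summary sketch:
# Side-by-side out-of-sample metrics for the two models
tibble(
  metric = c("RMSE", "MAE", "MAPE"),
  ARIMA = round(c(out_rmse_ar, out_mae_ar, out_mape_ar), 2),
  Prophet = round(c(out_rmse_pr, out_mae_pr, out_mape_pr), 2)
)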