Forecasting Amazon stock price

Problem Statement

  • Amazon is a leading provider of online retail services and, through Amazon Web Services (AWS), of computing, storage, database, and enterprise cloud services. What started as an online book store now offers many more services, such as the video streaming service Amazon Prime Video, the digital personal assistant Amazon Echo, and the Kindle e-book store. Amazon has grown from 10 employees in 1995 to about 1.3 million employees today. Such growth has been possible because of its financial strength: today it has a market cap of roughly $1.6 trillion, driven by the remarkable growth of Amazon stock over the last quarter century.
  • The goal of the project is to forecast the closing price of Amazon stock for the current year (January 2021 to May 2021).
  • This project uses data from Yahoo Finance from 2006 to 2020 to train ARIMA and Prophet models, which are widely used for time series forecasting.

Variable dictionary

Variable description

Variable   Description
Open       Opening price of the stock on the day
High       Highest price of the stock on the day
Low        Lowest price of the stock on the day
Close      Closing price of the stock on the day
Volume     Total volume traded on the day
Adjusted   Closing price adjusted for corporate actions such as stock splits and dividends

Libraries used

library(quantmod)
library(xts)
library(dygraphs)
library(dplyr)
library(forecast)
library(prophet)
library(lubridate)
library(TSA)
library(lmtest)

Exploratory Data Analysis

Data Visualization and Analysis

Reading data

getSymbols("AMZN", src = "yahoo", from = "2006-01-01", to = "2020-12-31",resetIndex = TRUE)
amz_stock  <- as.data.frame(AMZN)
colnames(amz_stock) <- c("Open","High","Low","Close","Volume","Adjusted")

Studying data

head(amz_stock)
##             Open  High   Low Close  Volume Adjusted
## 2006-01-03 47.47 47.85 46.25 47.58 7582200    47.58
## 2006-01-04 47.49 47.73 46.69 47.25 7440900    47.25
## 2006-01-05 47.16 48.20 47.11 47.65 5417200    47.65
## 2006-01-06 47.97 48.58 47.32 47.87 6152900    47.87
## 2006-01-09 46.55 47.10 46.40 47.08 8943100    47.08
## 2006-01-10 46.41 46.75 45.36 45.65 9686100    45.65

Plotting data
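
The interactive price chart is not reproduced in this excerpt; a minimal sketch of how the closing price could be plotted with the dygraphs package loaded above (the exact plotting call used originally is an assumption):

# Interactive chart of the daily closing price (assumed reconstruction of the original plot)
dygraph(AMZN[, "AMZN.Close"], main = "Amazon daily closing price (2006-2020)") %>%
  dyRangeSelector()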

Statistical summary of the data

summary(amz_stock)
##       Open              High              Low              Close        
##  Min.   :  26.09   Min.   :  26.30   Min.   :  25.76   Min.   :  26.07  
##  1st Qu.:  95.00   1st Qu.:  95.88   1st Qu.:  93.09   1st Qu.:  94.43  
##  Median : 282.00   Median : 284.87   Median : 279.74   Median : 282.10  
##  Mean   : 646.96   Mean   : 653.97   Mean   : 639.08   Mean   : 646.86  
##  3rd Qu.: 882.25   3rd Qu.: 891.92   3rd Qu.: 880.57   3rd Qu.: 885.61  
##  Max.   :3547.00   Max.   :3552.25   Max.   :3486.69   Max.   :3531.45  
##      Volume             Adjusted      
##  Min.   :   881300   Min.   :  26.07  
##  1st Qu.:  3156600   1st Qu.:  94.43  
##  Median :  4614600   Median : 282.10  
##  Mean   :  5706870   Mean   : 646.86  
##  3rd Qu.:  6810300   3rd Qu.: 885.61  
##  Max.   :104329200   Max.   :3531.45
  • The minimum closing price of Amazon stock was around $26, and it has risen to approximately $3,500, an increase of roughly 13,900% over 14 years.
  • The third quartile shows that 75% of the closing prices are below roughly $885, which means the tremendous growth in Amazon's stock price has occurred in the last few years.

Examine the stock data

Plot the data and examine the series for non-constant variance, trend, and stationarity

When analyzing a time series plot it is important to consider the presence of trend, seasonality, and non-constant variance. These components play a major role in determining the techniques used for transformation and model fitting.

plot.ts(amz_stock[,"Close"], ylab ="Closing Price")

  • The plot of the closing price shows non-constant variance and a non-constant mean.
  • There is a trend over time but no visible seasonality, so the closing price series is not stationary.

Using Box Cox for stabilizing the variance

  • The Box-Cox transformation is a family of power transformations indexed by a parameter lambda, which is estimated from the data. The estimated lambda is then used to transform the series so that its variance is stabilized.
library(forecast)
(lambda <- BoxCox.lambda(amz_stock[,"Close"]))
## [1] 0.1335584
  • The estimated lambda is about 0.13, which is close to 0, so we apply a log transformation to the series (see the Box-Cox definition below).
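
For reference, the Box-Cox family of transformations is

\[w_t = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y_t, & \lambda = 0, \end{cases}\]

so an estimated \(\lambda\) close to 0 corresponds approximately to a log transformation.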

Plot of time series after transformation

tclose_price <- log(amz_stock[,"Close"])
plot(ts(tclose_price), ylab = "Transformed closing price")

  • We see that the log transformation reduces the variance compared to the original time series.

Taking difference to check for seasonality

  • Seasonal variation, or seasonality, refers to cycles that repeat regularly over time. Seasonality in time series data can occur daily, weekly, monthly, or yearly.
ts(tclose_price) %>% diff() %>% ggtsdisplay()

  • There is no recurring pattern in the lags of the ACF or PACF plots, so there is no evidence of seasonality in the time series.

Checking for stationarity using the ADF test

  • The Augmented Dickey-Fuller test (ADF test) is a common statistical test of whether a given time series is stationary. The test statistic and the p-value can be used to infer whether the series is stationary.
  • Null hypothesis: The series is non-stationary
  • Alternative hypothesis: The series is stationary
library(tseries)
ts(tclose_price) %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -3.4717, Lag order = 15, p-value = 0.04499
## alternative hypothesis: stationary
  • We perform the ADF test after the log transformation to check the stationarity of the series. Since the p-value of 0.04499 is very close to the significance level of 0.05, it is better to difference the series to make it stationary.

Differencing to remove non-stationarity

  • Differencing is a transformation used to make a time series stationary and to stabilize its mean. In a stationary time series the effect of time is removed, and the statistical distribution can be described with a standard probability distribution.
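
Note that the first difference of the log price is the daily log return,

\[\nabla \log P_t = \log P_t - \log P_{t-1} = \log\!\left(\frac{P_t}{P_{t-1}}\right),\]

which is why the differenced series fluctuates around a roughly constant mean near zero.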
ts(tclose_price) %>% diff() %>% ggtsdisplay()

ts(tclose_price) %>% diff() %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -16.107, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
  • We perform the ADF test after taking the first difference; since the p-value of 0.01 is smaller than 0.05, we reject the null hypothesis and conclude that the series is now stationary.

Model fitting

Plotting ACF and PACF

Autocorrelation function (ACF)

For a stationary time series, the autocorrelation of \(X_t\) and \(X_{t+k}\) (the k-th order autocorrelation) is \(\rho_k = \gamma_k/\gamma_0\), where \(\gamma_k = \mathrm{Cov}(X_t, X_{t+k})\) is the lag-k autocovariance and \(\gamma_0 = \mathrm{Var}(X_t)\),

for all t = 1, …, n. As a function of k, \(\rho_k\) is called the autocorrelation function (ACF), and its graph is called the correlogram. The ACF, or an estimate of it, is used to identify a suitable model for a time series.
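
To make the definition concrete, here is a short illustrative check (a sketch, not part of the original analysis) that computes the lag-1 sample autocorrelation of the differenced log series directly and compares it with R's acf():

# Lag-1 sample autocorrelation computed as gamma_1 / gamma_0 (the same estimator acf() uses)
x <- diff(tclose_price)
n <- length(x)
gamma0 <- sum((x - mean(x))^2) / n
gamma1 <- sum((x[-1] - mean(x)) * (x[-n] - mean(x))) / n
gamma1 / gamma0
acf(x, plot = FALSE)$acf[2]   # should agree with the ratio above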

Partial Autocorrelation function (PACF)

Partial autocorrelation is the autocorrelation between \(X_t\) and \(X_{t+k}\) after their dependence on the intervening variables \(X_{t+1}\), …, \(X_{t+k-1}\) has been removed. In other words, it is the conditional correlation \(\phi_{kk} = \mathrm{cor}(X_t, X_{t+k} \mid X_{t+1}, \ldots, X_{t+k-1})\) for any t = 1, …, n.

tclose_price %>% diff() %>% acf()

tclose_price %>% diff() %>% pacf()

  • The ACF plot does not have any significant lags and the PACF plot has significant lags only at higher orders, so we will use the EACF to determine the order of the ARIMA model.

Using the extended sample ACF (EACF) to find potential orders of the model

  • The extended autocorrelation function (EACF) is one method proposed to assess the orders of an ARMA(p, q) model.
  • The EACF table for an ARMA(p, q) process should theoretically have a triangular pattern of zeroes with the top-left zero occurring in the p-th row and q-th column (with the row and column labels both starting from 0).
  • H0: Null hypothesis: No autocorrelation in the residuals after we fit the model
  • HA: Alternative hypothesis: Autocorrelation at some lag
  • o -> indicates that we fail to reject the null (preferred)
  • x -> indicates that we reject the null (i.e., there is some autocorrelation, which is not preferred)
eacf(diff(ts(tclose_price)))
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o x o o o o o o o o x  o  o  o 
## 1 x x o o o o o o o o o  o  o  o 
## 2 x x o o o o o o o o x  o  o  o 
## 3 x x x o o o o o o o o  o  o  o 
## 4 x x o x o o o o o o o  o  o  o 
## 5 x o x x x o o o o o o  o  o  o 
## 6 x x x x x x o o o o o  o  o  o 
## 7 x x x x x x o o o o o  o  o  o
  • The top-left “o” of the triangle suggests MA(2), so we consider ARIMA(0,1,2). But let’s also check the AIC values for higher-order models such as ARIMA(2,1,2).

ARIMA models

Overview of ARIMA models

  • ARIMA stands for “autoregressive integrated moving average.” It is a model used in statistics and econometrics to describe data recorded over time. The model is used to understand past data or predict future values in a series, and it applies when a metric is recorded at regular intervals, from fractions of a second to daily, weekly, or monthly periods.
  • AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
  • I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from the observation at the previous time step) in order to make the time series stationary.
  • MA: Moving Average. A model that uses the dependency between an observation and the residual error from a moving average model applied to lagged observations.

The parameters of the ARIMA(p, d, q) model, written out in the equation after this list, are defined as follows:

  • p: The number of lag observations included in the model, also called the lag order.
  • d: The number of times that the raw observations are differenced, also called the degree of differencing.
  • q: The size of the moving average window, also called the order of moving average.
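
In backshift-operator notation, an ARIMA(p, d, q) model for a series \(X_t\) can be written as

\[\left(1 - \sum_{i=1}^{p}\phi_i B^i\right)(1 - B)^d X_t = c + \left(1 + \sum_{j=1}^{q}\theta_j B^j\right)\varepsilon_t,\]

where \(B\) is the backshift operator (\(B X_t = X_{t-1}\)), \(\phi_i\) are the AR coefficients, \(\theta_j\) are the MA coefficients, and \(\varepsilon_t\) is white noise.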

Fitting ARIMA(0,1,2)

# close_price is assumed to be the untransformed closing price series (amz_stock[,"Close"]);
# lambda = 0 applies a log transformation inside Arima()
fit1 <- Arima(ts(close_price), order = c(0,1,2), lambda = 0)

Checking residuals

  • For ARIMA(0,1,2) the residual plot does not look much like white noise; the ACF and PACF plots of the residuals have significant lags at higher orders.
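
The residual plots referred to above are not reproduced in this excerpt; a minimal sketch of how they could be generated with the forecast package (the exact call used originally is an assumption):

# Time plot, ACF and PACF of the ARIMA(0,1,2) residuals
resid(fit1) %>% ggtsdisplay()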

Ljung Box test

  • The Ljung-Box test is a diagnostic tool used to test the lack of fit of a time series model. The test is applied to the residuals after fitting an ARMA(p, q) model and examines whether autocorrelation remains among the residuals (the test statistic is given after the hypotheses below).

  • H0: Null hypothesis: There is no auto-correlation between residuals.

  • HA: Alternative hypothesis : There is auto-correlation between residuals.
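
For reference, the Ljung-Box statistic for \(h\) lags of the residual series is

\[Q = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{n-k},\]

which under the null hypothesis is approximately \(\chi^2\) distributed with \(h - (p + q)\) degrees of freedom when applied to the residuals of a fitted ARMA(p, q) model; this is why fitdf is set to \(p + q\) in the Box.test calls below (e.g. df = 30 - 2 = 28 for ARIMA(0,1,2)).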

Box.test (resid(fit1), lag = 30, type = "Ljung",fitdf = 2)
## 
##  Box-Ljung test
## 
## data:  resid(fit1)
## X-squared = 42.426, df = 28, p-value = 0.03951
  • As per the Ljung-Box test there is autocorrelation among the residuals, since the p-value of 0.03951 is less than 0.05, so we reject the null hypothesis. Let’s try increasing the order and fitting a new model.

Fitting ARIMA(2,1,2)

fit2 <- Arima(ts(close_price), order = c(2,1,2), lambda = 0)

Box.test (resid(fit2), lag = 30, type = "Ljung", fitdf = 4)
## 
##  Box-Ljung test
## 
## data:  resid(fit2)
## X-squared = 42.03, df = 26, p-value = 0.02437
  • For ARIMA(2,1,2) the ACF and PACF plots of the residuals still have significant lags at 14 and 34, and the Ljung-Box test shows that there is autocorrelation among the residuals. Now let’s try higher AR and MA orders to model the strong correlation in the data.

Fitting ARIMA(5,1,5)

fit3 <- Arima(ts(close_price), order = c(5,1,5), lambda = 0)

Box.test (resid(fit3), lag = 30, type = "Ljung", fitdf = 10)
## 
##  Box-Ljung test
## 
## data:  resid(fit3)
## X-squared = 27.127, df = 20, p-value = 0.1317
  • For ARIMA(5,1,5) the ACF and PACF plots do not have many significant lags, but there are high spikes around lag 30. The p-value of the Ljung-Box test is 0.1317, which is greater than 0.05, so we fail to reject the null hypothesis and can say that the residuals are not correlated. Let’s try a higher order to see if we can reduce the high spikes in the plots.

Fitting ARIMA(5,1,6)

fit4 <- Arima(ts(close_price), order = c(5,1,6), lambda = 0)

Box.test (resid(fit4), lag = 30, type = "Ljung", fitdf = 11)
## 
##  Box-Ljung test
## 
## data:  resid(fit4)
## X-squared = 27.009, df = 19, p-value = 0.1044
  • For ARIMA(5,1,6) the ACF and PACF plots do not have any significant lags, and the p-value of the Ljung-Box test shows no autocorrelation among the residuals, thus resembling white noise. Let’s try one more, higher order to see if we can get a better-performing model.

Fitting ARIMA(6,1,6)

fit5 <- Arima(ts(close_price), order = c(6,1,6), lambda = 0)
coeftest(fit5)
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1 -0.033699   0.650313  -0.0518  0.95867    
## ar2 -0.017453   0.116998  -0.1492  0.88142    
## ar3  0.026187   0.054029   0.4847  0.62790    
## ar4 -0.100001   0.051676  -1.9352  0.05297 .  
## ar5 -0.890664   0.068194 -13.0608  < 2e-16 ***
## ar6  0.127646   0.603800   0.2114  0.83257    
## ma1  0.014840   0.661096   0.0224  0.98209    
## ma2 -0.009416   0.109770  -0.0858  0.93164    
## ma3 -0.030883   0.055413  -0.5573  0.57731    
## ma4  0.078929   0.051956   1.5192  0.12872    
## ma5  0.882031   0.060370  14.6104  < 2e-16 ***
## ma6 -0.134723   0.610777  -0.2206  0.82542    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Box.test (resid(fit5), lag = 30, type = "Ljung", fitdf = 12)
## 
##  Box-Ljung test
## 
## data:  resid(fit5)
## X-squared = 26.88, df = 18, p-value = 0.08128
  • For ARIMA(6,1,6) the ACF and PACF plots do not have any significant lags, and the p-value of the Ljung-Box test shows no autocorrelation among the residuals, thus resembling white noise.

Summary of the ARIMA models

Model          Residuals                    AIC
ARIMA(0,1,2)   Violate assumptions          -17390.95
ARIMA(2,1,2)   Violate assumptions          -17387.25
ARIMA(5,1,5)   Do not violate assumptions   -17399.22
ARIMA(5,1,6)   Do not violate assumptions   -17397.71
ARIMA(6,1,6)   Do not violate assumptions   -17395.97

Overview of AIC

  • The Akaike Information Criterion (AIC) is a widely used measure of a statistical model. It combines
  1. the goodness of fit, and
  2. the simplicity/parsimony of the model
  into a single statistic (see the formula below). When comparing two forecasting models, the one with the lower AIC is generally “better”.
  • The two ARIMA models with the lowest AIC values are ARIMA(5,1,5) and ARIMA(5,1,6). As seen from the residual plots above, the residuals of ARIMA(5,1,6) resemble white noise more closely than those of ARIMA(5,1,5), so we finalize ARIMA(5,1,6) for forecasting.
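
For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the criterion is

\[\mathrm{AIC} = 2k - 2\ln\hat{L},\]

so lower values indicate a better trade-off between goodness of fit and model complexity.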

Summary of the selected model ARIMA(5,1,6)

## Series: ts(close_price) 
## ARIMA(5,1,6) 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2     ar3      ar4      ar5     ma1     ma2      ma3
##       -0.1589  -0.0272  0.0168  -0.1095  -0.9206  0.1426  0.0008  -0.0218
## s.e.   0.0457   0.0560  0.0510   0.0406   0.0353  0.0484  0.0567   0.0497
##          ma4     ma5      ma6
##       0.0897  0.9093  -0.0056
## s.e.  0.0391  0.0376   0.0184
## 
## sigma^2 estimated as 0.0005807:  log likelihood=8710.85
## AIC=-17397.71   AICc=-17397.63   BIC=-17322.88
## 
## Training set error measures:
##                     ME     RMSE     MAE        MPE     MAPE     MASE
## Training set 0.8947686 20.88799 9.21193 0.08785396 1.594305 1.004396
##                     ACF1
## Training set -0.04903058

Forecasting using ARIMA model

Forecasting the closing price of Amazon stock

  • We will forecast the closing price of Amazon stock for the period January 2021 to May 2021 using ARIMA(5,1,6). This data was not used for training the model, so we will be performing out-of-sample prediction; a sketch of the objects used below follows.
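
The objects final_model and amz_stock_curr used below are created outside this excerpt; a minimal sketch of how they would presumably be set up (the object names, date range, and exact calls are assumptions):

# Assumed setup: the selected ARIMA(5,1,6) fit and the actual Jan-May 2021 prices
final_model <- fit4   # the ARIMA(5,1,6) model fitted above
amzn_2021 <- getSymbols("AMZN", src = "yahoo", from = "2021-01-01",
                        to = "2021-05-31", auto.assign = FALSE)
amz_stock_curr <- as.data.frame(amzn_2021)
colnames(amz_stock_curr) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")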

Forecast plot
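
The forecast plot itself is not reproduced here; a minimal sketch of how it could be drawn with the forecast package (an assumption, not the original call):

# Plot the 102-step-ahead forecast with prediction intervals
final_model %>% forecast(h = 102) %>% autoplot()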

Forecasted values

# h = 102 corresponds to the number of trading days from January through May 2021
forecasted_values <- final_model %>% forecast(h = 102)

RMSE of predicted data

accuracy(amz_stock_curr$Close,forecasted_values$mean)
##               ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 53.0394 129.9629 105.3485 1.618909 3.216088 0.9127146  51.20192
  • The out-of-sample RMSE is 129.9629. Stock data is volatile, and forecasting is a challenging task in financial markets because stock price series are non-linear and non-stationary.

Prophet model

Overview of Prophet model

  • Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.

  • Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

  • Prophet is open source software released by Facebook’s Core Data Science team.

  • The prophet function performs the fitting and returns a model object.

  • The function requires a dataframe with columns named ds and y, containing the dates and the numeric values respectively.

Preprocessing the data

data <- read.csv("AMZN.csv")
amz_data <- data[, c("Date", "Close")]
colnames(amz_data) <- c("ds", "y")
  • Built-in country holidays are added to the Prophet model (rather than to the dataframe) with add_country_holidays() before fitting, as shown in the next code block; the holidays included depend on the country_name passed as a parameter.

Fitting prophet on time series data

# Holidays must be added to an unfitted model object, so create the model,
# add the built-in US holidays, and then fit on the prepared dataframe
pmodel <- prophet()
pmodel <- add_country_holidays(pmodel, country_name = 'US')
pmodel <- fit.prophet(pmodel, amz_data)
  • Predictions are made on a dataframe with a column ds containing the dates for which predictions are to be made.
  • The make_future_dataframe function takes the model object and a number of periods to forecast and produces a suitable dataframe. By default it will also include the historical dates so we can evaluate in-sample fit.

Predicting the closing price

# 151 calendar days cover Jan 1 to May 31, 2021; keep only weekdays (wday: 1 = Sunday, 7 = Saturday)
future <- make_future_dataframe(pmodel, periods = 151) %>% filter(!wday(ds) %in% c(1, 7))
  • The predict function is used to get the forecast. The forecast object is a dataframe with a column yhat containing the forecast, plus additional columns for uncertainty intervals and seasonal components.
forecast <- predict(pmodel, future)
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
##              ds     yhat yhat_lower yhat_upper
## 3878 2021-05-24 3063.254   2891.191   3257.671
## 3879 2021-05-25 3067.107   2879.869   3258.662
## 3880 2021-05-26 3070.103   2886.742   3249.940
## 3881 2021-05-27 3071.602   2883.461   3250.283
## 3882 2021-05-28 3070.218   2877.223   3262.024
## 3883 2021-05-31 3077.940   2895.322   3247.888

Plotting the forecast

dyplot.prophet(pmodel, forecast)
  • The prophet_plot_components function shows the forecast broken down into trend, weekly seasonality, and yearly seasonality.

Plotting the forecast components

prophet_plot_components(pmodel,forecast)

  • There is a steep upward trend in the closing price of Amazon stock since 2015

Evaluating accuracy

forecast_insample <- forecast %>% select(ds, yhat) %>% filter(ds <= as.Date("2021-01-01"))
forecast_outofsample <- forecast %>% select(ds, yhat) %>% filter(ds >= as.Date("2021-01-01"))
  • In-sample accuracy
accuracy(forecast_insample$yhat,amz_data$y)
##                    ME     RMSE      MAE       MPE     MAPE
## Test set -0.001355992 143.9533 78.69301 -2.441521 21.66887
  • Out-of-sample accuracy
accuracy(forecast_outofsample$yhat,amz_stock_curr$Close)
##               ME     RMSE     MAE      MPE     MAPE
## Test set 286.955 310.8159 286.955 8.805399 8.805399
  • The Prophet model predicts the stock price better for the in-sample data than for the out-of-sample data.

Conclusion

  • In this project we used Amazon stock price data from 2006 to 2020 and built models to forecast the daily closing price for the period January to May 2021.

  • The variance of the data is stabilized using a log transformation, and differencing is used to achieve stationarity in the time series.

  • ARIMA models are fitted to the stabilized data and residual diagnostics are performed to examine whether the model assumptions are violated. The AIC of the models is compared, and models with lower AIC values are preferred.

  • Summary of the models meeting residual assumptions

Model          AIC
ARIMA(5,1,5)   -17399.22
ARIMA(5,1,6)   -17397.71
ARIMA(6,1,6)   -17395.97
  • Out of the five ARIMA models, ARIMA(5,1,6) has ACF and PACF residual plots that resemble white noise and a low AIC value, hence it is selected as the final model for forecasting the closing price.

  • Sample forecasted values using ARIMA model

Date        Actual        Forecasted
1/19/2021   3120.76001    3278.193
1/20/2021   3263.379883   3274.846
1/21/2021   3306.98999    3273.525
1/22/2021   3292.22998    3267.755
1/25/2021   3294          3270.031
1/26/2021   3326.129883   3274.589
1/27/2021   3232.580078   3276.931
1/28/2021   3237.620117   3278.323
1/29/2021   3206.199951   3283.193
  • The Prophet model is implemented to incorporate holiday effects in the stock data, and it performs better for in-sample than for out-of-sample prediction.