1.0 Abstract

Stock prediction is the attempt to predict the future value of a company’s stock. The most common approach uses the historical values of the stock, in other words a time series analysis. In this paper, we take a closer look at the Amazon stock and use its historical values to predict its future returns. With this in mind, we would like to answer the following research question:

“Can we predict the future price for a stock by looking at the historical data?”

The dataset in this paper is gathered from Yahoo Finance and consists of the historical values for the Amazon stock from 2015-01-01 to 2020-12-30, the last trading day returned by the download. In total, we have 1510 observations and six variables (opening price, closing price, lowest price, highest price, adjusted closing price, and volume). In the exploratory data analysis, we build a better understanding of the stock and its trend. Furthermore, we use the ARIMA model to forecast future stock prices.

One important point that we had to take into consideration when we fitted our model was that the logarithmic closing price was not stationary. To deal with this, we differenced the time series so that it became stationary. The ARIMA model showed that there was a rising trend in the stock, with fairly good accuracy. However, it is important to note that this is merely a technical analysis, and there are many factors that change stock prices that have not been taken into consideration. For instance, we have not taken market trends or company news into consideration. Because of this, the results should be read as an indicator to look further into the stock, and not as a recommendation to buy the stock.

2.0 Keywords

Stock Analysis, Stock Forecasting, ARIMA Modelling, AMZN

3.0 Introduction

Stock price prediction is a popular topic in the financial world, and if it is done accurately, it may yield significant profit. There are mainly two methods used for stock prediction: technical analysis and fundamental analysis. Technical analysis looks at the price movement and other data of the stock in order to predict its future price. Fundamental analysis, on the other hand, looks more in-depth at the underlying factors that affect the company and its profits. In this paper, we mainly conduct a technical analysis in order to predict Amazon’s future stock prices.

Since we are conducting a technical stock price analysis, we are trying to answer the following research question in this paper:

“Can we predict the future price for a stock by looking at the historical data?”

In order to answer this, we will conduct an exploratory data analysis of the historical values of the company’s stock, and a time series analysis with the ARIMA model to obtain forecasting results for the Amazon stock.

4.0 Methodology

4.1 Dataset Description

The data in this paper is gathered from Yahoo Finance and includes multiple historical values for Amazon’s stock from 2015-01-01 to 2020-12-30, the last trading day returned by the download. There are in total six different variables in the dataset and 1510 observations. The variables in the dataset are as follows:

AMZN.Open: The opening price for that day

AMZN.High: The highest price paid for the stock that day

AMZN.Low: The lowest price paid for the stock that day

AMZN.Close: The closing price for that day

AMZN.Volume: The number of shares traded that day

AMZN.Adjusted: The adjusted closing price for that day

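require(xts)
require(quantmod)
# Suppress the quantmod startup notice; a bare assignment would only create an unused variable
options("getSymbols.warning4.0" = FALSE)
# Download daily AMZN prices from Yahoo Finance as an xts object
AMZN <- getSymbols(Symbols = "AMZN", src = "yahoo",
                   from = "2015-01-01", to = "2020-12-31", auto.assign = FALSE)
tail(AMZN)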

##            AMZN.Open AMZN.High AMZN.Low AMZN.Close AMZN.Volume AMZN.Adjusted
## 2020-12-22   3202.84   3222.00  3180.08    3206.52     2369400       3206.52
## 2020-12-23   3205.00   3210.13  3184.17    3185.27     2093800       3185.27
## 2020-12-24   3193.90   3202.00  3169.00    3172.69     1451900       3172.69
## 2020-12-28   3194.00   3304.00  3172.69    3283.96     5686800       3283.96
## 2020-12-29   3309.94   3350.65  3281.22    3322.00     4872900       3322.00
## 2020-12-30   3341.00   3342.10  3282.47    3285.85     3209300       3285.85

The table above shows a small portion of the dataset. It gives an overview of the initial dataset that will be used for both the visualization and the forecast modelling.

4.2 Data Analytics: Modelling, Methods and Tools

The dataset gives us the basis for analysing the historical trends of the stock and for forecasting its future price. In the exploratory data analysis, we will take a closer look at the overall movement of the stock and compare this with the simple moving average (SMA). We will focus on the closing price throughout this paper, because it is the most frequently used measure for analysing a stock’s performance (Hayes, 2021). Furthermore, we will analyse both the logarithmic closing price and the differenced logarithmic closing price. We take the logarithm because it turns exponential growth into an approximately linear trend, and we difference the result to remove that trend and much of the autocorrelation in the series (Anggatama, 2020).
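The effect of the two transformations can be seen on a toy example. The sketch below uses made-up prices that grow by exactly 1% per step (not AMZN data): the logarithm turns the growth into a straight line, and the first difference of the log price is then constant.

```r
# Toy price series growing by exactly 1% per step (illustrative values, not AMZN data)
prices <- 100 * 1.01^(0:5)

log_prices  <- log(prices)        # exponential growth becomes a straight line
log_returns <- diff(log_prices)   # first difference of the log price

print(round(log_returns, 6))      # every element equals log(1.01)
```

On real prices the differences are not constant, but they fluctuate around a stable level instead of trending, which is the property the later modelling steps rely on.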

In order to forecast the stock, we will use an autoregressive integrated moving average (ARIMA) model in R. To find the best-suited model for the stock we will begin with conducting the augmented Dickey-Fuller test. This is to see if the data is stationary or not. Further, we will plot the ACF and PACF to figure out which values the moving average terms (MA) and the autoregression terms (AR) should have for our ARIMA model. Finally, we will fit the model and apply it in order to forecast the future logarithmic closing prices for the stock.

5.0 Wrangling and Visualization

5.1 Wrangling: Data Filtering, Transformation and Combination

As mentioned in the methodology, we need to subset the initial dataset. First, we create two datasets for the closing price: one daily and one monthly. The monthly closing price is used for the decomposed visualization, and the daily closing price for the further analysis. Then we create a series of the logarithmic values of the daily closing prices. Finally, we create a series of the differenced logarithmic daily closing prices with a lag of one. Differencing leaves the first observation undefined, so we remove this single missing value to avoid complications later in the analysis. Since only one value is removed, there is next to no data loss or bias.

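# Daily and monthly closing prices
AMZN_closing <- Cl(to.daily(AMZN))
AMZN_dc <- Cl(to.monthly(AMZN))
# Log closing price and its first difference (daily log returns)
AMZN_log <- log(AMZN_closing)
AMZN_diff <- diff(AMZN_log, lag = 1)
# The difference is undefined for the first observation: check for it, then drop it
anyNA(AMZN_diff)
colSums(is.na(AMZN_diff))
AMZN_diff <- AMZN_diff[!is.na(AMZN_diff)]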

5.2 Exploratory data analysis

In order to understand the raw data better, we create a graph that displays the movement of all the variables of the stock during the given time period. The figure below is a snapshot of an interactive graph (Graph 1), which is available at the added URL. It helps us to understand the movement better and to look at events that may give us some initial trend indications.

require(dplyr)
require(highcharter)
highchart(type = "stock") %>% 
  hc_add_series(AMZN) %>%
  hc_add_series(SMA(na.omit(Cl(AMZN)), n = 50), name = "SMA(50)") %>%
  hc_add_series(SMA(na.omit(Cl(AMZN)), n = 200), name = "SMA(200)") %>%
  hc_title(text = "<b>AMZN Price 2015-2020</b>")

# Code Source (Source 4)
# Title: "Time Series & Stock Analysis" 
# Author: Christie, A
# Date: 6/11/2020
# Code Version: R code
# Availability: https://www.rpubs.com/AurelliaChristie/time-series-and-stock-analysis

We get an overview of the stock movement during the time period, together with the short-term and the long-term SMA curves. The short-term SMA (50 days) is the black curve, and the long-term SMA (200 days) is the green curve in the graph. With these curves, we are able to comment on the trend by using the theory of the golden cross.

Crossovers between the two SMA curves form chart patterns that can indicate whether a stock has a growing or declining trend (Hayes, 2021). In particular, we have the golden cross when the short-term SMA crosses above the long-term SMA. We can see this instance on October 10th, 2019, and we can see that the stock did indeed have a growing trend after this. Furthermore, we can see from the graph that both the short-term SMA and the long-term SMA have either an upward trend or a stable trend. If we take a closer look at the final year in the graph, it may look like the short-term SMA has an upward trend, whilst the long-term SMA has a stable trend. If this is the case, we may see another golden cross in the future for the stock. From the initial analysis of the stock, we get an indication that the stock has a growing trend.
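The golden cross rule can also be checked programmatically instead of by eye. The following is a minimal sketch in base R on a made-up series; the helper names `sma` and `golden_cross` are hypothetical and are not part of quantmod or highcharter, and the window lengths of 50 and 200 are taken from the chart code above.

```r
# Simple moving average over the trailing n observations (base R only)
sma <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

# Hypothetical helper: indices where the short SMA moves from below to above the long SMA
golden_cross <- function(x, n_short = 50, n_long = 200) {
  s <- sma(x, n_short)
  l <- sma(x, n_long)
  above <- s > l                    # NA until both averages are defined
  which(diff(above) == 1) + 1       # positions of FALSE -> TRUE transitions
}

# Toy series: falls for 300 steps, then rises for 300 steps
toy <- c(seq(200, 100, length.out = 300), seq(100, 250, length.out = 300))
crosses <- golden_cross(toy)
print(crosses)
```

Applied to the real data, the same helper could be called on `as.numeric(Cl(AMZN))`, and the returned positions mapped back to dates with `index(AMZN)`.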

For further analysis, we will now look at the decomposed closing price for the Amazon stock. We do this in order to get a better understanding of the stock movement, and to see the trend of the stock, the seasonal pattern, and whether there are random factors that have affected the stock’s movement. In the graph below, we get an overview of the different components that may affect the closing price for the stock.

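dc <- decompose(ts(as.numeric(AMZN_dc), start = c(2015, 1), frequency = 12))  # assumed definition; dc is not defined in the original
plot(dc)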

From the graph, we can see that there is indeed a growing trend for the stock, and this growth has been relatively steady over the six years studied. Furthermore, we can see that the stock has a seasonal pattern: it tends to be at its peak during the summer season, and the best time to sell the stock is during the start of the year. An interesting component that we can take a closer look at is the random fluctuation. We can see a clear decline in the stock during February 2020. This is easily explained by the covid-19 pandemic, which had a big effect on the stock price. We can also see this descent in the first graph, where there is a dip in the stock price during the same period. This fluctuation is something that we have to take into consideration when we fit our forecasting model.
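To make clear what the decomposition plot contains, the sketch below applies base R's `decompose` to a synthetic monthly series. All values are assumed for illustration; only the length (72 months starting in January 2015) mirrors the study period.

```r
# Synthetic monthly series (assumed values): linear trend + 12-month seasonal pattern + noise
set.seed(42)
months <- 72
seasonal <- rep(10 * sin(2 * pi * (1:12) / 12), times = months / 12)
toy_monthly <- ts(100 + 2 * (1:months) + seasonal + rnorm(months, sd = 3),
                  start = c(2015, 1), frequency = 12)

parts <- decompose(toy_monthly)   # classical additive decomposition

# observed = trend + seasonal + random wherever the moving-average trend is defined
recomposed <- as.numeric(parts$trend + parts$seasonal + parts$random)
ok <- !is.na(recomposed)
max_err <- max(abs(recomposed[ok] - as.numeric(toy_monthly)[ok]))
print(c(defined_months = sum(ok), max_error = max_err))
```

The trend is a centred moving average, so it is undefined for the first and last six months. The random component is simply what is left after trend and seasonal pattern are subtracted, which is why a shock such as February 2020 shows up there.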

The last thing we will look at in this part is the logarithmic closing price and the differenced logarithmic closing price. These values are the main values that our forecasting will be based on, because the transformations take away the seasonal factors or growth factors that may affect our predictions. In the graphs below, we can see a visualisation of both of the values.

plot(AMZN_log, main = "AMZN log closing price")

When we look at the logarithmic closing price, we can see a clear growing trend, with some variation in the growth of the stock. This gives us an initial indication that the series is non-stationary. For a stationary series, the mean, the variance, and the autocorrelation are constant over time, and a trending series violates this. A simple non-stationary example is a random walk, in which today’s price is equal to yesterday’s price plus some white noise. In order to fit the ARIMA model in the best way possible, we want to remove the non-stationarity of the logged closing price, and we do this by differencing it.
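The random-walk idea can be written out explicitly. The notation below is a standard textbook formulation introduced here for illustration, not taken from the sources: $y_t$ is the logarithmic closing price on day $t$, $\mu$ is a constant drift, and $\varepsilon_t$ is white noise with variance $\sigma^2$.

```latex
\begin{aligned}
y_t &= \mu + y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{WN}(0, \sigma^2) \\
y_t &= y_0 + \mu t + \sum_{i=1}^{t} \varepsilon_i
  \quad\Rightarrow\quad \operatorname{E}[y_t \mid y_0] = y_0 + \mu t, \quad \operatorname{Var}(y_t \mid y_0) = t\,\sigma^2 \\
\nabla y_t &= y_t - y_{t-1} = \mu + \varepsilon_t
\end{aligned}
```

Both the mean and the variance of $y_t$ change with $t$, so the level series is non-stationary, while the first difference $\nabla y_t$ has constant mean $\mu$ and constant variance $\sigma^2$.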

plot(AMZN_diff, type = "l", main = "AMZN diff")

After we difference the logged closing price with a lag equal to one, we can see that we now have a more stationary series. We do this in order to get a more accurate prediction from our forecasting model.

In the next part of this paper, we will take a closer look at the chosen forecasting model for our prediction, the ARIMA model.

6.0 Results

6.1 Results from ARIMA Forecasting

The ARIMA model is an autoregressive model that is used to either better understand or to predict future trends in a time series. The model is widely used for technical financial analysis, and it forecasts future prices by looking back at the historical prices. Before we fit and apply the model, we will analyse and find the best values for our model. We start the process by applying an augmented Dickey-Fuller test.

6.1.1 Augmented Dickey-Fuller test

The augmented Dickey-Fuller (ADF) test is a statistical test that is often used to determine whether a time series is stationary or not (Prabhakaran, 2019). In order to determine whether or not we have stationarity in our time series, we have to formulate a null hypothesis and an alternative hypothesis. In this case, our null and alternative hypotheses are:

H0: There is no stationarity in the time series
HA: There is stationarity in the time series

We will use these hypotheses for testing both the logarithmic time series and the differenced logarithmic time series, and we will test at a significance level of 0.05.
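The call that produced the test outputs in this section is not shown. Their layout resembles the output of `adf.test` from the tseries package, but that is an assumption. As a package-free sketch of what such a test computes when no lagged differences are included, the helper below (hypothetical name `df_stat`) regresses the differenced series on a constant, a linear trend, and the lagged level, and returns the t-statistic of the lagged level. It is run on simulated data, not on AMZN.

```r
# Hypothetical helper: Dickey-Fuller t-statistic with constant and linear trend, no lagged differences
df_stat <- function(y) {
  y <- as.numeric(y)
  dy <- diff(y)                  # Delta y_t
  y_lag <- y[-length(y)]         # y_{t-1}
  trend <- seq_along(dy)
  fit <- lm(dy ~ y_lag + trend)
  summary(fit)$coefficients["y_lag", "t value"]
}

set.seed(123)
random_walk <- cumsum(rnorm(500))   # non-stationary by construction
white_noise <- diff(random_walk)    # its first difference, stationary by construction

crit_5pct <- -3.41  # approximate large-sample 5% critical value for the constant + trend case
stats <- c(random_walk = df_stat(random_walk), white_noise = df_stat(white_noise))
print(stats)
print(stats < crit_5pct)   # TRUE means the unit-root null hypothesis is rejected at 5%
```

The statistic does not follow an ordinary t distribution under the null hypothesis, which is why it is compared with a Dickey-Fuller critical value instead of the usual 1.96.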

## 
##  Augmented Dickey-Fuller Test
## 
## data:  AMZN_log
## Dickey-Fuller = -3.2936, Lag order = 0, p-value = 0.07177
## alternative hypothesis: stationary

When we conduct an ADF test for the logarithmic time series, we get a p-value of 0.07. Since this is above 0.05, we fail to reject the null hypothesis, so we cannot conclude that the logarithmic time series is stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  AMZN_diff
## Dickey-Fuller = -39.773, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

When we test the differenced logarithmic time series, we get a p-value of 0.01. We therefore reject the null hypothesis at the 0.05 level and conclude that the differenced series is stationary, which is in line with what we saw in the exploratory data analysis. We will continue to use the differenced logarithmic time series for fitting and applying our ARIMA model.

6.1.2 ACF and PACF

The first step for the ARIMA model is to explore the ACF and PACF values for our time series. The ACF measures and plots the correlation between the time series and lagged copies of itself, whilst the PACF measures and plots the partial correlation. ACF and PACF are very similar measures, except that each correlation in the PACF controls for the correlation between observations at shorter lag lengths (Sage Publications, 2017). Therefore, at a lag of one, where there are no shorter lags to control for, the ACF and the PACF take the same value. It is important to evaluate both of the plots in order to determine the autoregressive (AR) terms and the moving average (MA) terms for our ARIMA model.

require(astsa)
acf2(AMZN_diff, max.lag = 30)

The ACF plot and the PACF plot look nearly the same, and at a lag of one they coincide. Furthermore, we can see that there is no significant cut-off in either of the plots, and they are not tailing off at any certain level. This gives us an indication that the best-fitting model has zero MA terms and zero AR terms. In other words, our ARIMA model will be an ARIMA (0,0,0) for our differenced logarithmic time series.
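For comparison, the sketch below computes the ACF and PACF of simulated white noise with base R's `acf` and `pacf`. The mean and standard deviation are assumed values chosen to be on the scale of daily log returns; they are not estimates from the AMZN data.

```r
set.seed(1)
n <- 1500
white_noise <- rnorm(n, mean = 0.0016, sd = 0.02)   # assumed values, simulated series

acf_vals  <- acf(white_noise, lag.max = 30, plot = FALSE)$acf[-1]   # drop lag 0
pacf_vals <- pacf(white_noise, lag.max = 30, plot = FALSE)$acf      # starts at lag 1

bound <- 1.96 / sqrt(n)   # approximate 95% band for a white-noise series
print(c(acf_lag1 = acf_vals[1], pacf_lag1 = pacf_vals[1]))   # identical at lag one
print(mean(abs(acf_vals) > bound))   # share of lags outside the band
```

For white noise, roughly one lag in twenty is expected to fall outside the band by chance, so a few isolated spikes without a cut-off or tail-off pattern do not by themselves call for AR or MA terms.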

6.1.3 ARIMA Model

6.1.3.1 Residual analysis and overfitting

In order to be confident in our initial conclusion that we have an ARIMA (0,0,0) for this time series, we want to conduct a residual analysis of the model. Furthermore, we will also overfit the model in order to see whether additional terms are significant and give us a better result. During a residual analysis, we look at the standardized residuals, the ACF of the residuals, the normal Q-Q plot, and the Q-statistic p-values. In the standardized residuals, we look for obvious patterns to determine whether the residuals behave like white noise. If they do not, the model has left structure in the series unexplained, and we have to change it before making a prediction. The ACF of the residuals can also be used to assess the whiteness of the residuals, and the Q-Q plot is used to assess their normality. Finally, the Q-statistic p-values are also used to assess whiteness, and we want the points to be above the blue line.
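The Q-statistic p-values in the diagnostic plot can be reproduced with a Ljung-Box test. The sketch below uses base R's `Box.test` on simulated residuals (a stand-in, not the AMZN residuals) and treats the 0.05 level as the threshold that the blue line represents, which is an assumption about the plot.

```r
set.seed(2)
resid_like <- rnorm(1500, sd = 0.02)   # stand-in for model residuals (simulated)

# Ljung-Box test: H0 is that the first h autocorrelations are jointly zero
lb_p <- sapply(1:20, function(h) {
  Box.test(resid_like, lag = h, type = "Ljung-Box")$p.value
})

print(round(lb_p, 3))
print(mean(lb_p > 0.05))   # share of lags at which whiteness is not rejected
```

A p-value above the threshold means that whiteness is not rejected at that lag, so points above the line support the model.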

Res <- sarima(AMZN_diff,0,0,0)

The residual analysis for our ARIMA (0,0,0) indicates that we have the right parameters for our model. We can see from the standardized residuals that there is no obvious pattern. Furthermore, the ACF for the residuals is mainly in between the blue lines in the plot, and the extreme values for our Q-Q plot are only at the ends of the graph. Lastly, we can see that some of the points in the p-value plot are above the blue line, and that the estimated mean (xmean) of the model is significant.

##       Estimate    SE t.value p.value
## xmean   0.0016 5e-04  3.1396  0.0017

All of this indicates that this is the best-fitting model for us to use for our forecast. However, we will also overfit our model to check whether added terms are significant.

When we check the residuals and the p-value for an ARIMA (0,1,0), we can see that this is not a fitting model. All of the points in the p-value plot are under the blue line, which indicates that the residuals are not white noise. Furthermore, the estimated constant is not significant, and we conclude that we will not add the extra differencing.

AMZN_fit1 <- sarima(AMZN_diff,0,1,0)

##          Estimate    SE t.value p.value
## constant        0 7e-04  0.0091  0.9928

When we check the residuals and the p-values for an ARIMA (0,0,1) and an ARIMA (1,0,0), we can also see that these models are not fitting for our time series. The residual plots for both of these models are fairly similar to the ARIMA (0,0,0) plot. However, the p-values for the added ma1 and ar1 terms are not significant. We can, therefore, conclude that the ARIMA (0,0,0) model is indeed the most fitting model for our differenced logarithmic closing price.

AMZN_fit2 <- sarima(AMZN_diff,0,0,1)

##       Estimate     SE t.value p.value
## ma1    -0.0244 0.0259 -0.9453  0.3447
## xmean   0.0016 0.0005  3.2191  0.0013

AMZN_fit3 <- sarima(AMZN_diff,1,0,0)

##       Estimate     SE t.value p.value
## ar1    -0.0243 0.0257 -0.9439  0.3454
## xmean   0.0016 0.0005  3.2221  0.0013
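The significance columns in the tables above follow from the estimates and their standard errors. The sketch below recomputes them from the rounded numbers printed in the outputs, using a normal approximation; with about 1500 observations a t distribution gives nearly the same values. Because the inputs are rounded, the results only match the printed columns approximately.

```r
# Estimates and standard errors copied from the sarima outputs above (rounded values)
est <- c(ma1 = -0.0244, ar1 = -0.0243, xmean = 0.0016)
se  <- c(ma1 =  0.0259, ar1 =  0.0257, xmean = 0.0005)

t_val <- est / se
p_val <- 2 * pnorm(-abs(t_val))   # two-sided p-value, normal approximation

print(round(cbind(t_val, p_val), 4))
```

The ma1 and ar1 estimates are less than one standard error away from zero, which is why adding them does not improve on the ARIMA (0,0,0).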

6.1.3.2 Applying the model

We have now found the best-fitting model for our differenced logarithmic closing price, and we will apply the model to our logarithmic closing price since this is the value we want to forecast. However, as we recall from our ADF test, there is no stationarity in this time series. In order to resolve this problem, we will apply the ARIMA (0,1,0) instead of the ARIMA (0,0,0) model.

arima_log <- arima(AMZN_log, order = c(0,1,0))
# Code Source (Source 2)
# Title: "Forecasting Stock using Arima Model" 
# Author: Anggatama, T, K 
# Date: 5/10/2020
# Code Version: R code 
# Availability: https://rpubs.com/kevinTongam/arimaforecast

We changed the d parameter in our model from zero to one. With d equal to one, the model differences the logarithmic time series internally, which handles the non-stationarity and the random-walk behaviour that we saw during the exploratory data analysis.
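Fitting an ARIMA (0,1,0) to the level series is closely related to fitting an ARIMA (0,0,0) to the differenced series, with one difference in base R: `arima` ignores the mean term once differencing is applied, so no drift is estimated. The sketch below illustrates this on a simulated random walk with drift; the parameters are assumed values, not AMZN estimates.

```r
set.seed(7)
# Simulated log-price path: random walk with drift (assumed parameters)
y <- cumsum(c(7, rnorm(499, mean = 0.002, sd = 0.02)))

fit_level <- arima(y, order = c(0, 1, 0))         # differencing inside the model, no mean term
fit_diff  <- arima(diff(y), order = c(0, 0, 0))   # differenced by hand, intercept = drift

print(length(coef(fit_level)))     # no coefficients are estimated
print(coef(fit_diff))              # estimated drift, close to mean(diff(y))
print(c(level = fit_level$sigma2, diff = fit_diff$sigma2))
```

The two variance estimates differ slightly because the level model attributes the drift to the noise. As far as we are aware, `sarima` from astsa adds a constant by default when d = 1, which would act as a drift term, so its fit corresponds more closely to `fit_diff` than to `fit_level`.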

sarima(AMZN_log, 0,1,0)

When we check the residuals for this model, we can see that the plots are similar to those of the ARIMA (0,0,0) model for our differenced logarithmic time series. We can, therefore, move on to forecasting the logarithmic closing price for our stock.

6.1.4 Forecasting

We will now forecast the stock, and we will look at the logarithmic closing prices 24 trading days into the future, since the series is daily and the forecast horizon is 24 steps. We can see from the graph that there is a growing trend for the logarithmic closing price for the stock. This is on par with our initial analysis of the stock, which we concluded during the exploratory data analysis.

sarima.for(AMZN_log, n.ahead = 24, p = 0, d = 1, q = 0)
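For a random walk with drift, the forecast has a simple closed form: the point forecast h steps ahead is the last observed value plus h times the drift, and the forecast standard error grows with the square root of h. The sketch below evaluates this by hand. The inputs are copied from outputs shown in this paper and treated as exact, which is an assumption; the sketch also ignores the estimation uncertainty of the drift, so it will not reproduce the `sarima.for` intervals exactly.

```r
# Inputs taken from outputs in this paper; treating them as exact is an assumption
last_log_price <- log(3285.85)   # AMZN.Close on 2020-12-30
drift  <- 0.0016                 # xmean of the differenced log series
sigma2 <- 0.0003777              # innovation variance reported for the ARIMA(0,1,0) fit

h <- 1:24
point <- last_log_price + h * drift   # y_T + h * mu
se    <- sqrt(h * sigma2)             # standard error grows with sqrt(h)

forecast_tbl <- data.frame(
  h = h,
  log_price = point,
  price = exp(point),                 # back-transformed point forecast
  lower = exp(point - 1.96 * se),     # approximate 95% interval on the price scale
  upper = exp(point + 1.96 * se)
)
print(tail(forecast_tbl, 1))
```

The rising forecast path comes entirely from the drift term. Without it, the point forecast of an ARIMA (0,1,0) stays flat at the last observed value.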

require(forecast)
summary(arima_log)

## 
## Call:
## arima(x = AMZN_log, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 0.0003777:  log likelihood = 3805.38,  aic = -7608.77
## 
## Training set error measures:
##                       ME       RMSE        MAE       MPE      MAPE      MASE
## Training set 0.001570414 0.01942813 0.01317818 0.0225606 0.1872634 0.9996257
##                     ACF1
## Training set -0.02445004

Furthermore, we can look at the mean absolute percentage error (MAPE) for our forecasting model. The MAPE is the mean of the absolute percentage errors, and this measure is commonly used for measuring forecasting accuracy (Springer, n.d.). The forecast package reports the MAPE in percent, so the value of 0.1872634 in the output corresponds to a MAPE of about 0.19%, which indicates that the results are good (Allwright, 2021). Note that this is an in-sample, one-step-ahead error measured on the logarithmic price scale, so it should not be read as the accuracy of the 24-step forecast.
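The definition of the MAPE is short enough to write out. The helper below (hypothetical name `mape`) returns the value in percent; the toy numbers are made up for illustration.

```r
# Hypothetical helper: mean absolute percentage error, returned in percent
mape <- function(actual, fitted) mean(abs((actual - fitted) / actual)) * 100

# Toy check with made-up values: errors of 10% and 5% average to 7.5%
toy_mape <- mape(c(100, 200), c(110, 190))
print(toy_mape)

# Rough plausibility check against the summary above: the reported MAE relative to
# the last log closing price, in percent
print(0.01317818 / log(3285.85) * 100)
```

Because the errors are measured on log prices, which are numbers of around 8 at the end of the sample, even a noticeable daily price move translates into a very small percentage error on this scale.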

7.0 Conclusion and Discussion

In this paper, we have worked on predicting the future stock prices for Amazon. We have conducted a technical analysis and applied the ARIMA model in order to predict future values. In addition, we have also conducted an exploratory data analysis to get a better understanding of the stock, and we have looked at the trend and factors of the stock that may have affected this trend. We have done these analyses in order to answer our research question.

We have seen that it is indeed possible to predict a stock’s future values by looking at its historical data. However, there are some limitations to this prediction. Firstly, the ARIMA model is not the best model for handling outliers in a time series. Outliers are present in most of the stocks on the market now because of the covid-19 pandemic. Therefore, it could be necessary to make use of more intricate forecasting models than the ARIMA model. Furthermore, the results are not necessarily a recommendation to buy the stock. This is because we have not included other market factors that may change a stock price, for instance, company events or market trends. Therefore, the results should be interpreted as an indication of the trend of the stock, and not as a purchasing recommendation.

8.0 Bibliography

Allwright, S. (2021). What is a good MAPE score? StephenAllwright.com. https://stephenallwright.com/good-mape-score/
Anggatama, K. T. (2020). Forecasting INDF Stock price using ARIMA model in R. RPubs. https://rpubs.com/kevinTongam/arimaforecast
Brownlee, J. (2017). White Noise Time Series with Python. Machine Learning Mastery. https://machinelearningmastery.com/white-noise-time-series-python/
Christie, A. (2020). Time Series & Stock Analysis. RPubs. https://www.rpubs.com/AurelliaChristie/time-series-and-stock-analysis
Hayes, A. (2021). Closing Price. Investopedia. https://www.investopedia.com/terms/c/closingprice.asp
Hayes, A. (2021). Golden Cross Definition. Investopedia. https://www.investopedia.com/terms/g/goldencross.asp
Prabhakaran, S. (2019). Augmented Dickey Fuller Test (ADF Test) – Must Read Guide. Machine Learning Plus. https://www.machinelearningplus.com/time-series/augmented-dickey-fuller-test/
Sage Publications. (2017). Learn About Time Series ACF and PACF in SPSS With Data From the USDA Feed Grains Database (1876-2015). Sage Research Methods Datasets. https://methods.sagepub.com/base/download/DatasetStudentGuide/time-series-acf-pacf-in-us-feedgrains-1876-2015
Springer, B. (n.d.). Forecast Errors. Encyclopedia of Production and Manufacturing Management. https://link.springer.com/referenceworkentry/10.1007%2F1-4020-0612-8_358
Stoffer, D. (n.d.). ARIMA Models in R. DataCamp. https://app.datacamp.com/learn/courses/arima-models-in-r
Xu, S. Y. (n.d.). Stock Price Forecasting Using Information from Yahoo Finance and Google Trend. UC Berkeley. https://www.econ.berkeley.edu/sites/default/files/Selene%20Yue%20Xu.pdf

Amazon Stock Forecasting

S148295

07/12/2021