Apple Stock Prediction (Time Series Forecasting)

Christopher Nindyo

2022-09-12

Objective

For a long time I have used Apple products, such as the iPad. Overall, I like them: they look elegant and are easy to use, even though the last few products were something of a disappointment for me. One thing led to another, and it made me wonder how well Apple's stock is doing in the market. In this article, I try to build a model to predict their stock value in the future.

Library

The libraries we use are listed below.

# load library

library(dplyr)     # data wrangling
library(lubridate) # date manipulation
library(forecast)  # time series library
library(MLmetrics) # calculate error
library(ggplot2)   # Beautify the graph
library(tidyr)     # Tidy the data
library(zoo)       # Order index observations
library(tseries)   # adf.test

Read Data

Import and read the data from the Apple Stock Dataset. I got the data from Kaggle; you can find it here.

# Read data
AAPL <- read.csv("AAPL.csv")

# Check the data
AAPL %>% head()
#>         Date     Open     High      Low    Close Adj.Close    Volume
#> 1 1980-12-12 0.128348 0.128906 0.128348 0.128348  0.100178 469033600
#> 2 1980-12-15 0.122210 0.122210 0.121652 0.121652  0.094952 175884800
#> 3 1980-12-16 0.113281 0.113281 0.112723 0.112723  0.087983 105728000
#> 4 1980-12-17 0.115513 0.116071 0.115513 0.115513  0.090160  86441600
#> 5 1980-12-18 0.118862 0.119420 0.118862 0.118862  0.092774  73449600
#> 6 1980-12-19 0.126116 0.126674 0.126116 0.126116  0.098436  48630400


We convert the Date column to a proper date type, select only the columns we need (Date, Close), and then complete the date sequence so that every calendar day appears in order.

AAPL <- AAPL %>% 
  mutate(Date = ymd(Date)) %>%   # parse Date as a proper date
  select(c(Date, Close)) %>%     # keep only the columns we need
  complete(Date = seq.Date(min(Date), max(Date), by="day")) %>%  # fill in every calendar day
  arrange(Date)                  # sort chronologically

AAPL %>% head()
#> # A tibble: 6 × 2
#>   Date        Close
#>   <date>      <dbl>
#> 1 1980-12-12  0.128
#> 2 1980-12-13 NA    
#> 3 1980-12-14 NA    
#> 4 1980-12-15  0.122
#> 5 1980-12-16  0.113
#> 6 1980-12-17  0.116
AAPL %>% tail()
#> # A tibble: 6 × 2
#>   Date       Close
#>   <date>     <dbl>
#> 1 2022-06-12   NA 
#> 2 2022-06-13  132.
#> 3 2022-06-14  133.
#> 4 2022-06-15  135.
#> 5 2022-06-16  130.
#> 6 2022-06-17  132.

This dataset has Date from 1980-12-12 to 2022-06-17.

Before we do anything else, let's check whether the data has any missing values.

# Check missing value
AAPL %>% is.na() %>% colSums()
#>  Date Close 
#>     0  4695

The data has 4695 missing values in the Close column.
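These missing values are the calendar days that complete() inserted for dates when the market was closed. A quick check (a sketch) that they fall mostly on weekends, with the rest on market holidays:

# Count missing Close values by day of week (sketch);
# most should land on Saturday/Sunday, the remainder on market holidays
AAPL %>% 
  filter(is.na(Close)) %>% 
  mutate(weekday = wday(Date, label = TRUE)) %>% 
  count(weekday)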

Inspect the data with a plot.

AAPL$Close %>% tail(200) %>% 
  plot(type="l",
       col = "blue",
       lwd = 2,
       xlab = "time (days)",
       ylab = 'Close',
       main = "Stock Market Close")

We cannot use the data like this, because time series observations must appear sequentially without gaps. Let's recall the fundamentals of time series data.

Main requirements for time series data: (1) the data must be sorted by time period, from the oldest observation to the newest; (2) the time interval between observations must be the same; (3) there must be no missing data in any interval.
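Requirement (2) can be verified quickly once the dates are completed; a minimal sketch:

# Check that consecutive dates are exactly one day apart
# expected: TRUE, since complete() filled every calendar day
all(diff(AAPL$Date) == 1)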

The missing Close values, however, violate requirement (3), so we need to give this dataset some treatment.

AAPL_new <- AAPL %>% 
  mutate(Close = na.fill(Close, "extend")) # "extend": interior NAs are interpolated, leading/trailing NAs take the nearest value
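As a quick check (a sketch), confirm that no missing values are left after filling:

# Count remaining NAs in Close; expected: 0
sum(is.na(AAPL_new$Close))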

AAPL_new %>%   
  tail(200) %>% # get the last 200 data
  plot(type="l",
       col = "blue",
       lwd = 2,
       xlab = "time (days)",
       ylab = 'Close',
       main = "Stock Market Close") # Plot AAPL_new

Check the whole dataset with a plot

AAPL_new %>% 
    plot(type="l",
       col = "blue",
       lwd = 2,
       xlab = "time (years)",
       ylab = 'Close',
       main = "Stock Market Close")

Check the last 6 observations in the dataset with a plot

AAPL_new %>% tail() %>% 
  plot(type="l",
       col = "blue",
       lwd = 2,
       xlab = "time (days)",
       ylab = 'Close',
       main = "Stock Market Close")

The series is now connected with no gaps, so we can continue the process.

Clean Data

Check the class of AAPL_new

class(AAPL_new)
#> [1] "tbl_df"     "tbl"        "data.frame"

The dataset currently has daily observations. We can aggregate it to monthly observations, which makes the analysis more applicable to the problem.

APPL_m <- AAPL_new %>% 
    group_by(month = lubridate::floor_date(Date, "month")) %>%
    summarize(Close = mean(Close))

APPL_m %>% head()
#> # A tibble: 6 × 2
#>   month      Close
#>   <date>     <dbl>
#> 1 1980-12-01 0.137
#> 2 1981-01-01 0.142
#> 3 1981-02-01 0.118
#> 4 1981-03-01 0.111
#> 5 1981-04-01 0.121
#> 6 1981-05-01 0.131


Products of Apple company

Time Series

Convert our data into a ts (time series) object.

AAPL_ts <- ts(
  data = APPL_m$Close,
  start = c(1980, 12), # first observation: December 1980
  frequency = 12       # monthly data
)

class(AAPL_ts)
#> [1] "ts"

We successfully converted our data APPL_m into a ts (time series) object.
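We can also verify the time attributes of the new object; a quick sketch:

start(AAPL_ts)     # expected: 1980 12 (December 1980)
end(AAPL_ts)       # the last observed month (mid-2022)
frequency(AAPL_ts) # 12, i.e. monthly observations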

Plot AAPL_ts

AAPL_ts %>% autoplot()

In my opinion, we should subset the observation window for the analysis. If we use all the years from 1980 to 2022, many of them may no longer be relevant. Based on the graph above, let's only use the data from 2010 to 2022, which gives us about 12 years of observations.

AAPL_w <- window(AAPL_ts, start = 2010) 

AAPL_w %>% autoplot()

Cross Validation

We cross-validate by splitting the data into a training set and a test set.

# test: the last 24 months (2 years)
AAPL_test <- tail(AAPL_w, 24)


# train: everything before the test period
AAPL_train <- head(AAPL_w, -length(AAPL_test))

Check AAPL_train with a visualization

AAPL_train %>% autoplot()

The graph above shows that AAPL_train has been split: the time axis now covers fewer observations than before.

Check the trend and seasonality of AAPL_train with decompose()

AAPL_dc <- decompose(AAPL_train, type = "multiplicative")
AAPL_dc %>% autoplot()
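In a multiplicative decomposition the series is treated as trend × seasonal × remainder. The estimated monthly seasonal factors can be inspected directly; a small sketch:

# The 12 estimated seasonal multipliers
# (values above 1 lift the series, values below 1 pull it down)
round(AAPL_dc$figure, 3)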

> AAPL_train has an increasing trend over time and also shows a seasonal pattern, so we use the Holt-Winters method.

One of the most famous Apple advertisements: this one is for the iPod

Holt-Winters

Modeling

We are going to model AAPL_train with the Holt-Winters method.

model <- HoltWinters(
  x = AAPL_train, 
  seasonal = "multiplicative"
)

model
#> Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
#> 
#> Call:
#> HoltWinters(x = AAPL_train, seasonal = "multiplicative")
#> 
#> Smoothing parameters:
#>  alpha: 0.9956081
#>  beta : 0.03856241
#>  gamma: 1
#> 
#> Coefficients:
#>           [,1]
#> a   93.5121822
#> b    1.7645664
#> s1   0.9679656
#> s2   0.9190867
#> s3   0.9556855
#> s4   1.0186695
#> s5   1.0325204
#> s6   1.0409575
#> s7   1.0672196
#> s8   1.0761699
#> s9   1.0366772
#> s10  0.9896450
#> s11  0.9716539
#> s12  0.9248786

Forecasting

Forecast 24 months (2 years) ahead with the model.

forecast_AAPL <- forecast(
  object = model,
  h = 24
) 

forecast_AAPL
#>          Point Forecast     Lo 80     Hi 80    Lo 95     Hi 95
#> Jul 2020       92.22462  88.25284  96.19640 86.15031  98.29893
#> Aug 2020       89.18939  83.62336  94.75542 80.67688  97.70189
#> Sep 2020       94.42735  87.23740 101.61730 83.43127 105.42343
#> Oct 2020      102.44805  93.57224 111.32385 88.87368 116.02242
#> Nov 2020      105.66299  95.53639 115.78958 90.17569 121.15028
#> Dec 2020      108.36324  97.07572 119.65076 91.10047 125.62602
#> Jan 2021      112.98029 100.36895 125.59163 93.69291 132.26767
#> Feb 2021      115.82679 102.09362 129.55995 94.82373 136.82985
#> Mar 2021      113.40552  99.17239 127.63864 91.63783 135.17320
#> Apr 2021      110.00681  95.42555 124.58807 87.70670 132.30692
#> May 2021      109.72150  94.42216 125.02085 86.32317 133.11983
#> Jun 2021      106.07153  90.90578 121.23728 82.87752 129.26554
#> Jul 2021      112.72109  95.87671 129.56547 86.95984 138.48235
#> Aug 2021      108.65086  91.70010 125.60162 82.72691 134.57481
#> Sep 2021      114.66380  96.08228 133.24531 86.24582 143.08177
#> Oct 2021      124.01817 103.24551 144.79083 92.24912 155.78721
#> Nov 2021      127.52639 105.49894 149.55385 93.83830 161.21449
#> Dec 2021      130.40531 107.22061 153.59001 94.94737 165.86325
#> Jan 2022      135.57844 110.82010 160.33679 97.71382 173.44307
#> Feb 2022      138.61447 112.65246 164.57647 98.90900 178.31993
#> Mar 2022      135.35694 109.36184 161.35205 95.60086 175.11303
#> Apr 2022      130.96234 105.17412 156.75056 91.52267 170.40201
#> May 2022      130.29608 104.00686 156.58530 90.09019 170.50197
#> Jun 2022      125.65565  99.89934 151.41195 86.26478 165.04652

Visualising

Let's see how well the model fits the unseen data (the test set) using a visualization.

AAPL_train %>% 
  autoplot()+
  autolayer(AAPL_test, series = "test")+ # data test
  autolayer(model$fitted[,1], series = "fitted values")+ # fitted value data train
  autolayer(forecast_AAPL$mean, series = "forecast data test") + # forecast model 
  ggtitle("AAPL Stock") + # Give name to the graph
  scale_x_continuous(name = "Time (years)") + # Edit x-axis name
  scale_y_continuous(name = "Value (dollars)") # Edit y-axis name

Conclusion

This Holt-Winters model did well enough to capture the trend over the last 24 months. As we can see from the graph above, the green line follows the trend of the blue line: when the blue line goes up, the green one does the same.

Unfortunately, the model is not good enough to capture the actual values of the Apple stock.

Seasonal ARIMA

Seasonal ARIMA (SARIMA) is an ARIMA method for time series objects that have a seasonal pattern. The modeling process with SARIMA is the same as with ARIMA.
ARIMA models can be used for business problems where the data shows evidence of non-stationarity; an initial differencing step can be applied one or more times to remove the non-stationarity of the mean (for example, a trend).

Stationary Test

We test the stationarity of the data with adf.test(). The Augmented Dickey-Fuller test (ADF test) is a common statistical test of whether a given time series is stationary.
The hypotheses are:
H0 : the data is not stationary
H1 : the data is stationary

# ADF test on `AAPL_train`
adf.test(AAPL_train)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  AAPL_train
#> Dickey-Fuller = 0.54405, Lag order = 4, p-value = 0.99
#> alternative hypothesis: stationary

We set alpha = 5%. The Augmented Dickey-Fuller test gives p-value = 0.99, so we fail to reject H0: the data is not stationary.

Differencing

diff(AAPL_train, lag = 12) %>% # seasonal differencing (lag = frequency = 12)
  diff() %>%                   # regular (first) differencing
  adf.test()                   # re-test stationarity
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -4.91, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary

Now the data is stationary, with d = 1 and D = 1.
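As a cross-check (a sketch), the forecast package also provides unit-root-based helpers that estimate how many regular and seasonal differences are needed:

ndiffs(AAPL_train)  # suggested number of regular differences (d)
nsdiffs(AAPL_train) # suggested number of seasonal differences (D)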

Model Fitting

Make a SARIMA model with auto.arima().
The model notation is SARIMA(p,d,q)(P,D,Q)[freq].

model_auto_sarima <- auto.arima(AAPL_train, seasonal = T)
model_auto_sarima
#> Series: AAPL_train 
#> ARIMA(0,1,1)(0,0,1)[12] with drift 
#> 
#> Coefficients:
#>          ma1     sma1   drift
#>       0.4258  -0.4266  0.5571
#> s.e.  0.0940   0.1144  0.1832
#> 
#> sigma^2 = 5.463:  log likelihood = -283.28
#> AIC=574.55   AICc=574.89   BIC=585.87
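Note that auto.arima() selected D = 0 here, even though we applied one seasonal difference earlier. If desired, a seasonal difference can be forced and the fast stepwise search disabled via the D, stepwise, and approximation arguments; a possible variation (a sketch, not run in this analysis, and it can be slow; the object name model_auto_sarima_forced is just for illustration):

# Sketch: force one seasonal difference and search more exhaustively
model_auto_sarima_forced <- auto.arima(
  AAPL_train,
  seasonal = TRUE,
  D = 1,                  # force one seasonal difference
  stepwise = FALSE,       # exhaustive search over candidate orders
  approximation = FALSE   # use exact likelihood when comparing models
)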

Tuning the model manually

Framework to get the values of (p,d,q)(P,D,Q)[freq]:

Step 1: Define the value of D (the order of seasonal differencing).

Step 2: Define the value of d (the order of regular differencing).

Step 3: Define the values of P and Q. For the seasonal orders, observe how many significant lags appear in the ACF/PACF at the data's frequency and its multiples.

diff(AAPL_train, lag = 12) %>% 
  diff() %>% 
  tsdisplay(lag.max = 36)

To determine the seasonal orders P and Q, we look at the lags at multiples of the data's frequency.

  • P : 0, 1, 2
  • D : 1
  • Q : 0, 1

To determine the non-seasonal orders p and q, we look at the first 5 lags that come out.

  • p : 0, 1 ,2, 3
  • d : 1
  • q : 0, 1, 2

The graph for the seasonal component

Some of the model combinations that could be formed:
* SARIMA(0,1,0)(0,1,0)[12]
* SARIMA(1,1,0)(0,1,0)[12]
* SARIMA(2,1,0)(0,1,0)[12]
* SARIMA(0,1,1)(0,1,0)[12]
* SARIMA(1,1,1)(0,1,0)[12]
* SARIMA(2,1,1)(0,1,0)[12]

and there are still many more; we only use these 6 combinations because the scope would be too big if we tested all of them.

model_sarima_1 <- Arima(
  AAPL_train, order = c(0,1,0), seasonal = c(0,1,0))
model_sarima_2 <- Arima(
  AAPL_train, order = c(1,1,0), seasonal = c(0,1,0))
model_sarima_3 <- Arima(
  AAPL_train, order = c(2,1,0), seasonal = c(0,1,0))
model_sarima_4 <- Arima(
  AAPL_train, order = c(0,1,1), seasonal = c(0,1,0))
model_sarima_5 <- Arima(
  AAPL_train, order = c(1,1,1), seasonal = c(0,1,0))
model_sarima_6 <- Arima(
  AAPL_train, order = c(2,1,1), seasonal = c(0,1,0))

Goodness of fit

Compare those models using their AIC values.

model_sarima_1$aic
#> [1] 627.9755
model_sarima_2$aic
#> [1] 614.1248
model_sarima_3$aic
#> [1] 607.3965
model_sarima_4$aic
#> [1] 611.1119
model_sarima_5$aic
#> [1] 612.5639
model_sarima_6$aic
#> [1] 605.9684
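The same comparison can be written more compactly by collecting the models in a list; a small sketch:

# Collect the candidate models and pull out each AIC in one call
sapply(
  list(sarima_1 = model_sarima_1, sarima_2 = model_sarima_2,
       sarima_3 = model_sarima_3, sarima_4 = model_sarima_4,
       sarima_5 = model_sarima_5, sarima_6 = model_sarima_6),
  function(m) m$aic
)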

Of the 6 manual SARIMA models, model_sarima_6 has the lowest AIC. We still choose model_auto_sarima, because its AIC is lower still. However, I am suspicious of model_auto_sarima: it has D = 0, while we would expect D = 1, since we applied one seasonal difference. Let's see how it goes.

Forecasting

Forecast with model_auto_sarima for the next 24 months

forecast_AAPL_sarima <- forecast(
  object = model_auto_sarima,
  h = 24
) 

forecast_AAPL_sarima
#>          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
#> Jul 2020       88.57132 85.57599  91.56666 83.99036  93.15229
#> Aug 2020       88.29391 83.07758  93.51024 80.31622  96.27160
#> Sep 2020       87.47726 80.73573  94.21878 77.16698  97.78753
#> Oct 2020       86.51932 78.53895  94.49969 74.31440  98.72424
#> Nov 2020       85.69997 76.64876  94.75118 71.85734  99.54260
#> Dec 2020       86.32323 76.31511  96.33135 71.01713 101.62933
#> Jan 2021       84.09560 73.21440  94.97680 67.45424 100.73696
#> Feb 2021       84.25919 72.56994  95.94845 66.38203 102.13636
#> Mar 2021       89.83250 77.38756 102.27745 70.79960 108.86541
#> Apr 2021       88.91064 75.75333 102.06795 68.78827 109.03301
#> May 2021       85.87427 72.04124  99.70731 64.71847 107.03007
#> Jun 2021       82.97098 68.49372  97.44823 60.82992 105.11203
#> Jul 2021       82.43946 67.65611  97.22281 59.83027 105.04864
#> Aug 2021       82.99652 68.01174  97.98130 60.07928 105.91376
#> Sep 2021       83.55358 68.37005  98.73712 60.33237 106.77480
#> Oct 2021       84.11064 68.73092  99.49037 60.58939 107.63190
#> Nov 2021       84.66771 69.09427 100.24115 60.85019 108.48523
#> Dec 2021       85.22477 69.45999 100.98954 61.11463 109.33491
#> Jan 2022       85.78183 69.82801 101.73565 61.38257 110.18109
#> Feb 2022       86.33889 70.19825 102.47954 61.65391 111.02388
#> Mar 2022       86.89596 70.57062 103.22129 61.92851 111.86340
#> Apr 2022       87.45302 70.94506 103.96098 62.20627 112.69976
#> May 2022       88.01008 71.32150 104.69866 62.48709 113.53307
#> Jun 2022       88.56714 71.69987 105.43442 62.77087 114.36341

Visualising

Visualize the forecast from model_auto_sarima and see how the model fits the unseen data (the test set).

AAPL_train %>% 
  autoplot() +
  autolayer(AAPL_test, series = 'test') +
  autolayer(model$fitted[,1], series = "fitted values")+ 
  autolayer(forecast_AAPL_sarima$mean, series = "forecast data test (Auto Sarima)") +
  ggtitle("AAPL Stock") + 
  scale_x_continuous(name = "Time (years)") + 
  scale_y_continuous(name = "Value (dollars)") 

Conclusion

I think it did not perform as well as I expected. This model cannot even predict the trend correctly, and the predicted values are far from the test data. My earlier suspicion turned out to be right. Let's try model_sarima_6 instead.

(Manual) Seasonal ARIMA

Forecasting

Forecast with model_sarima_6 for the next 24 months

forecast_AAPL_sarima_manual <- forecast(
  object = model_sarima_6,
  h = 24
) 

forecast_AAPL_sarima_manual
#>          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
#> Jul 2020       90.80738  86.38474  95.23002  84.04354  97.57122
#> Aug 2020       87.93078  80.22136  95.64021  76.14023  99.72133
#> Sep 2020       88.38712  78.65448  98.11977  73.50233 103.27192
#> Oct 2020       91.98675  81.12672 102.84679  75.37776 108.59575
#> Nov 2020       99.04682  87.44986 110.64377  81.31080 116.78283
#> Dec 2020      103.42345  91.17437 115.67252  84.69010 122.15679
#> Jan 2021      112.37318  99.43489 125.31147  92.58578 132.16059
#> Feb 2021      112.44965  98.78313 126.11617  91.54851 133.35079
#> Mar 2021       99.71399  85.32680 114.10117  77.71069 121.71729
#> Apr 2021      102.29676  87.23068 117.36284  79.25518 125.33834
#> May 2021      111.83614  96.13772 127.53457  87.82747 135.84481
#> Jun 2021      120.73993 104.44415 137.03571  95.81769 145.66218
#> Jul 2021      125.09894 106.58405 143.61382  96.78287 153.41501
#> Aug 2021      122.24501 100.86378 143.62623  89.54524 154.94477
#> Sep 2021      122.70289  98.95617 146.44961  86.38542 159.02036
#> Oct 2021      126.29397 100.82215 151.76580  87.33818 165.24977
#> Nov 2021      133.34643 106.50558 160.18727  92.29689 174.39596
#> Dec 2021      137.72057 109.61542 165.82571  94.73745 180.70368
#> Jan 2022      146.67157 117.29298 176.05015 101.74090 191.60223
#> Feb 2022      146.75014 116.08454 177.41573  99.85116 193.64912
#> Mar 2022      134.01564 102.08849 165.94279  85.18727 182.84400
#> Apr 2022      136.59845 103.46417 169.73273  85.92394 187.27296
#> May 2022      146.13736 111.85406 180.42066  93.70558 198.56914
#> Jun 2022      155.04074 119.65539 190.42610 100.92352 209.15797

Visualising

AAPL_train %>% 
  autoplot() +
  autolayer(AAPL_test, series = 'test') +
  autolayer(model$fitted[,1], series = "fitted values")+ 
  autolayer(forecast_AAPL_sarima_manual$mean, series = "forecast data test (Manual SARIMA)") +
  ggtitle("AAPL Stock") + 
  scale_x_continuous(name = "Time (years)") + 
  scale_y_continuous(name = "Value (dollars)") 

Conclusion

We can see that model_sarima_6 forecasts better than the auto SARIMA model. Overall, even though many individual points are off, the model fits the training data well and is able to follow the general movement of the test data.

Evaluation

Evaluate each model using the accuracy() function, then look at the MAPE value.

Model Holt Winters Evaluation

accuracy(forecast_AAPL$mean, AAPL_test)
#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 23.45628 25.48452 23.45628 16.36253 16.36253 0.4459276  2.490074

The Holt-Winters model's MAPE is 16.36253.
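MAPE is the mean absolute percentage error: the average of |actual − forecast| / |actual|, expressed in percent. As a sketch, it can be reproduced by hand for the Holt-Winters forecast:

# MAPE by hand over the 24 test months; expected: about 16.36
mean(abs((AAPL_test - forecast_AAPL$mean) / AAPL_test)) * 100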

Model Auto SARIMA Evaluation

accuracy(forecast_AAPL_sarima$mean, AAPL_test)
#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 53.03586 57.04421 53.03586 36.66569 36.66569 0.7951778  5.344257

The auto SARIMA model's MAPE is 36.66569.

Model Manual SARIMA Evaluation

accuracy(forecast_AAPL_sarima_manual$mean, AAPL_test)
#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 20.17145 22.78431 21.37533 14.50648 15.36277 0.2301237  2.341362

The manual SARIMA model's MAPE is 15.36277.

The manual SARIMA model has the lowest MAPE, so we can say it is the best of the three models. In my opinion, we can use this model for forecasting.

Visualising

AAPL_train %>% 
  autoplot()+
  autolayer(AAPL_test, series = "test")+ 
  autolayer(model$fitted[,1], series = "fitted values")+ 
  autolayer(forecast_AAPL$mean, series = "forecast data test (Holt Winters)") +
  autolayer(forecast_AAPL_sarima_manual$mean, series = "forecast data test (Manual SARIMA)") + 
  ggtitle("AAPL Stock") + 
  scale_x_continuous(name = "Time (years)") + 
  scale_y_continuous(name = "Value (dollars)") 

My Opinion

From all 3 models we built, I can see that the manual SARIMA model and the Holt-Winters model are relatively good. Why? Because they can forecast whether the stock value will increase or not.


In my opinion, for this particular problem the Holt-Winters model is the best one, even though it does not have the lowest MAPE. This model is almost always right when forecasting the trend. I would say it mimics the movement well enough that we could use it to decide when to buy or sell the stock.


The auto SARIMA model is the worst in this case. I spent a lot of time looking for a mistake above and still could not find anything. This method should give accuracy not that different from the others; even though we still have to check the parameters manually, auto SARIMA should not perform this badly.