Objective
For a long time, I have used Apple products such as the iPad. Overall, I like them: they look elegant and are easy to use, even though the last few products have been something of a disappointment for me. One thing led to another, and I started wondering how well Apple’s stock performs in the market. In this article, I try to build a model to predict its stock value in the future.
Library
The libraries we use are listed below.
# load library
library(dplyr) # data wrangling
library(lubridate) # date manipulation
library(forecast) # time series library
library(MLmetrics) # calculate error
library(ggplot2) # Beautify the graph
library(tidyr) # Tidy the data
library(zoo) # Order index observations
library(tseries) # adf.test
Read Data
Import and read the data from the Apple Stock Dataset. I got the data from Kaggle; you can find it here
# Read data
AAPL <- read.csv("AAPL.csv")
# Check the data
AAPL %>% head()
#> Date Open High Low Close Adj.Close Volume
#> 1 1980-12-12 0.128348 0.128906 0.128348 0.128348 0.100178 469033600
#> 2 1980-12-15 0.122210 0.122210 0.121652 0.121652 0.094952 175884800
#> 3 1980-12-16 0.113281 0.113281 0.112723 0.112723 0.087983 105728000
#> 4 1980-12-17 0.115513 0.116071 0.115513 0.115513 0.090160 86441600
#> 5 1980-12-18 0.118862 0.119420 0.118862 0.118862 0.092774 73449600
#> 6 1980-12-19 0.126116 0.126674 0.126116 0.126116 0.098436 48630400
We change the Date column to a date type, select the columns we need
(Date, Close), and then complete the dates so that they appear sequentially.
AAPL <- AAPL %>%
mutate(Date = ymd(Date)) %>%
select(c(Date, Close)) %>%
complete(Date = seq.Date(min(Date), max(Date), by="day")) %>%
arrange(Date) # sort by date, oldest first
AAPL %>% head()
#> # A tibble: 6 × 2
#> Date Close
#> <date> <dbl>
#> 1 1980-12-12 0.128
#> 2 1980-12-13 NA
#> 3 1980-12-14 NA
#> 4 1980-12-15 0.122
#> 5 1980-12-16 0.113
#> 6 1980-12-17 0.116
AAPL %>% tail()
#> # A tibble: 6 × 2
#> Date Close
#> <date> <dbl>
#> 1 2022-06-12 NA
#> 2 2022-06-13 132.
#> 3 2022-06-14 133.
#> 4 2022-06-15 135.
#> 5 2022-06-16 130.
#> 6 2022-06-17 132.
This dataset runs from 1980-12-12
to 2022-06-17. Before we do anything else,
let's check whether the data has missing values.
# Check missing value
AAPL %>% is.na() %>% colSums()
#> Date Close
#> 0 4695
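These gaps come from the calendar completion above: weekends and market holidays have no trading, so their Close is NA. As a quick optional check (a small sketch), we can count the missing days by weekday:
# Count the missing days by weekday (they should be mostly weekends plus holidays)
AAPL %>%
filter(is.na(Close)) %>%
mutate(weekday = wday(Date, label = TRUE)) %>%
count(weekday)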
The data has missing values. Let's inspect it with a plot.
AAPL$Close %>% tail(200) %>%
plot(type="l",
col = "blue",
lwd = 2,
xlab = "time (days)",
ylab = 'Close',
main = "Stock Market Close")
We cannot use the data as it is, because time series
data must appear sequentially. Let's recall the fundamentals of
time series data.
Main requirements for time series data: (1) the data must be sorted by time period, from the oldest observation to the newest; (2) the time interval must be the same throughout; (3) there must be no missing data in any interval.
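A quick way to check requirements (2) and (3) on our completed data (a minimal sketch):
# Requirement checks on the completed daily data
all(diff(AAPL$Date) == 1) # equal one-day intervals: TRUE
anyNA(AAPL$Close)         # missing values still present: TRUE, so treatment is needed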
So, we need to apply some treatment to this dataset.
library(zoo)
AAPL_new <- AAPL %>%
mutate(Close = na.fill(Close, "extend")) # Fill the missing value with "extend"
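na.fill(Close, "extend") linearly interpolates the interior NAs and extends the first and last observed values at the ends. An alternative treatment (a sketch only, not used further here) is to carry the last trading day's close forward:
# Alternative: last observation carried forward instead of interpolation
AAPL_locf <- AAPL %>%
mutate(Close = na.locf(Close, na.rm = FALSE))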
AAPL_new %>%
tail(200) %>% # get the last 200 data
plot(type="l",
col = "blue",
lwd = 2,
xlab = "time (days)",
ylab = 'Close',
main = "Stock Market Close") # Plot AAPL_new
Check the whole dataset with a plot
AAPL_new %>%
plot(type="l",
col = "blue",
lwd = 2,
xlab = "time (years)",
ylab = 'Close',
main = "Stock Market Close")
Check the last 6 observations in the dataset with a plot
AAPL_new %>% tail() %>%
plot(type="l",
col = "blue",
lwd = 2,
xlab = "time (days)",
ylab = 'Close',
main = "Stock Market Close")
The graph is now connected, so we can continue the process.
Clean Data
Check the class of AAPL_new
class(AAPL_new)
#> [1] "tbl_df" "tbl" "data.frame"
The dataset has daily observations. We can aggregate it to monthly observations so that the analysis is more applicable to the problem.
APPL_m <- AAPL_new %>%
group_by(month = lubridate::floor_date(Date, "month")) %>%
summarize(Close = mean(Close))
APPL_m %>% head()
#> # A tibble: 6 × 2
#> month Close
#> <date> <dbl>
#> 1 1980-12-01 0.137
#> 2 1981-01-01 0.142
#> 3 1981-02-01 0.118
#> 4 1981-03-01 0.111
#> 5 1981-04-01 0.121
#> 6 1981-05-01 0.131
Products of the Apple company
Time Series
Convert our data into a ts (time series) object
AAPL_ts <- ts(
data = APPL_m$Close,
start = c(1980, 12), # monthly series starting December 1980
frequency = 12
)
class(AAPL_ts)
#> [1] "ts"
We have successfully converted our data APPL_m to a ts
(time series) object.
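As a quick sanity check, we can confirm the time index of the new object:
# Confirm the time index of the monthly series
start(AAPL_ts)     # should be 1980, period 12 (December)
frequency(AAPL_ts) # 12 observations per year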
Plot AAPL_ts
AAPL_ts %>% autoplot()
In my opinion, we should subset the observation window for the analysis. If we use the full range from 1980 to 2022, many things may no longer be relevant. Based on the graph above, let's only use the data from 2010 to 2022, which gives us about 12 years of observations.
AAPL_w <- window(AAPL_ts, start = 2010)
AAPL_w %>% autoplot()
Cross Validation
Cross-validate by splitting the data into a training set and a test set.
# test
AAPL_test <- tail(AAPL_w, 24)
# train
AAPL_train <- head(AAPL_w, -length(AAPL_test))
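A quick check that the split is clean (the two pieces should exactly cover the windowed series):
# Train and test should add up to the windowed series
length(AAPL_train) + length(AAPL_test) == length(AAPL_w)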
Check AAPL_train with a visualization
AAPL_train %>% autoplot()
The graph above shows that
AAPL_train has been split off: the time axis
(x-axis) now covers fewer observations than before.
Check the trend and seasonality of AAPL_train with
decompose()
AAPL_dc <- decompose(AAPL_train, type = "multiplicative")
AAPL_dc %>% autoplot()
AAPL_train has an increasing trend over time and also shows seasonality, so we use the Holt-Winters method. Because the seasonal swings grow with the level of the series, we use the multiplicative form.
One of the most famous Apple advertisements, for the iPod
Holt Winters
Modeling
We are going to model AAPL_train with the Holt-Winters
method
model <- HoltWinters(
x = AAPL_train,
seasonal = "multiplicative"
)
model
#> Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
#>
#> Call:
#> HoltWinters(x = AAPL_train, seasonal = "multiplicative")
#>
#> Smoothing parameters:
#> alpha: 0.9956081
#> beta : 0.03856241
#> gamma: 1
#>
#> Coefficients:
#> [,1]
#> a 93.5121822
#> b 1.7645664
#> s1 0.9679656
#> s2 0.9190867
#> s3 0.9556855
#> s4 1.0186695
#> s5 1.0325204
#> s6 1.0409575
#> s7 1.0672196
#> s8 1.0761699
#> s9 1.0366772
#> s10 0.9896450
#> s11 0.9716539
#> s12 0.9248786
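The coefficients above translate directly into the point forecasts: h steps ahead, Holt-Winters multiplies the extrapolated level and trend by the matching seasonal factor, roughly (a + h * b) * s_h. A quick check by hand for the first forecast, using the values printed above:
# One-step-ahead forecast by hand: (a + 1 * b) * s1
(93.5121822 + 1 * 1.7645664) * 0.9679656
# roughly 92.22, matching the Jul 2020 point forecast below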
Forecasting
Forecast with the model for the next 24 months (2 years)
forecast_AAPL <- forecast(
object = model,
h = 24
)
forecast_AAPL
#> Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> Jul 2020 92.22462 88.25284 96.19640 86.15031 98.29893
#> Aug 2020 89.18939 83.62336 94.75542 80.67688 97.70189
#> Sep 2020 94.42735 87.23740 101.61730 83.43127 105.42343
#> Oct 2020 102.44805 93.57224 111.32385 88.87368 116.02242
#> Nov 2020 105.66299 95.53639 115.78958 90.17569 121.15028
#> Dec 2020 108.36324 97.07572 119.65076 91.10047 125.62602
#> Jan 2021 112.98029 100.36895 125.59163 93.69291 132.26767
#> Feb 2021 115.82679 102.09362 129.55995 94.82373 136.82985
#> Mar 2021 113.40552 99.17239 127.63864 91.63783 135.17320
#> Apr 2021 110.00681 95.42555 124.58807 87.70670 132.30692
#> May 2021 109.72150 94.42216 125.02085 86.32317 133.11983
#> Jun 2021 106.07153 90.90578 121.23728 82.87752 129.26554
#> Jul 2021 112.72109 95.87671 129.56547 86.95984 138.48235
#> Aug 2021 108.65086 91.70010 125.60162 82.72691 134.57481
#> Sep 2021 114.66380 96.08228 133.24531 86.24582 143.08177
#> Oct 2021 124.01817 103.24551 144.79083 92.24912 155.78721
#> Nov 2021 127.52639 105.49894 149.55385 93.83830 161.21449
#> Dec 2021 130.40531 107.22061 153.59001 94.94737 165.86325
#> Jan 2022 135.57844 110.82010 160.33679 97.71382 173.44307
#> Feb 2022 138.61447 112.65246 164.57647 98.90900 178.31993
#> Mar 2022 135.35694 109.36184 161.35205 95.60086 175.11303
#> Apr 2022 130.96234 105.17412 156.75056 91.52267 170.40201
#> May 2022 130.29608 104.00686 156.58530 90.09019 170.50197
#> Jun 2022 125.65565 99.89934 151.41195 86.26478 165.04652
Visualising
Let's see how the model fits the unseen data (the test set) with a visualization.
AAPL_train %>%
autoplot()+
autolayer(AAPL_test, series = "test")+ # data test
autolayer(model$fitted[,1], series = "fitted values")+ # fitted value data train
autolayer(forecast_AAPL$mean, series = "forecast data test") + # forecast model
ggtitle("AAPL Stock") + # Give name to the graph
scale_x_continuous(name = "Time (years)") + # Edit x-axis name
scale_y_continuous(name = "Value (dollars)") # Edit y-axis name
Conclusion
This Holt-Winters model did well enough to capture the trend of the last 24 months. As we can see from the graph above, the green line follows the trend of the blue line: when the blue line goes up, the green one does the same.
Unfortunately, the model is not good enough at capturing the actual values of the Apple stock.
Seasonal ARIMA
Seasonal ARIMA (SARIMA) is an ARIMA method for time series objects that have a seasonal pattern. The modeling process with SARIMA is the same as with ARIMA. ARIMA models can be used in business problems where the data shows evidence of non-stationarity: an initial differencing step can be applied one or more times to remove the non-stationarity of the mean function (for example, a trend).
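For reference, a SARIMA(p,d,q)(P,D,Q)[m] model can be written in backshift notation (the standard textbook form) as
$$\Phi_P(B^m)\,\phi_p(B)\,(1 - B)^d (1 - B^m)^D\, y_t = \Theta_Q(B^m)\,\theta_q(B)\,\varepsilon_t$$
where B is the backshift operator and m is the seasonal period (12 for monthly data).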
Stationarity Test
We test the stationarity of the data with adf.test().
The Augmented Dickey-Fuller test (ADF test) is a common
statistical test for whether a given time series is stationary.
The hypotheses are: H0 : the data is not stationary
H1 : the data is stationary
# ADF test on `AAPL_train`
adf.test(AAPL_train)
#>
#> Augmented Dickey-Fuller Test
#>
#> data: AAPL_train
#> Dickey-Fuller = 0.54405, Lag order = 4, p-value = 0.99
#> alternative hypothesis: stationary
We set alpha = 5%. The Augmented Dickey-Fuller test gives a p-value of 0.99, which means we fail to reject H0, so the data is not stationary.
Differencing
diff(AAPL_train, lag = 12) %>%
diff() %>%
adf.test()
#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -4.91, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary
Now the data is stationary, with d = 1 and D = 1.
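As a cross-check (optional; its suggestion may or may not agree with the manual choice), the forecast package can estimate the number of differences required:
# Suggested number of regular and seasonal differences
ndiffs(AAPL_train)  # compare with d = 1
nsdiffs(AAPL_train) # compare with D = 1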
Model Fitting
Make a SARIMA model with auto.arima(). The notation is
SARIMA(p,d,q)(P,D,Q)[freq]
model_auto_sarima <- auto.arima(AAPL_train, seasonal = T)
model_auto_sarima
#> Series: AAPL_train
#> ARIMA(0,1,1)(0,0,1)[12] with drift
#>
#> Coefficients:
#> ma1 sma1 drift
#> 0.4258 -0.4266 0.5571
#> s.e. 0.0940 0.1144 0.1832
#>
#> sigma^2 = 5.463: log likelihood = -283.28
#> AIC=574.55 AICc=574.89 BIC=585.87
Tuning the model manually
A framework to get the values of (p,d,q)(P,D,Q)[seasonal]:
Step 1: Determine the value of D.
Step 2: Determine the value of d.
Step 3: Determine the values of P and Q. For the seasonal part,
observe which lags come out in the ACF/PACF at lags equal to the
frequency or its multiples.
diff(AAPL_train, lag = 12) %>%
diff() %>%
tsdisplay(lag.max = 36)
For the seasonal part, we determine P and Q by looking at the lags that are multiples of the data's frequency.
- P : 0, 1, 2
- D : 1
- Q : 0, 1
For the non-seasonal part, we determine p and q by looking at the first 5 lags that come out.
- p : 0, 1 ,2, 3
- d : 1
- q : 0, 1, 2
The graph above shows the seasonal lags.
The combinations of models that might be formed include:
- SARIMA(0,1,0)(0,1,0)[12]
- SARIMA(1,1,0)(0,1,0)[12]
- SARIMA(2,1,0)(0,1,0)[12]
- SARIMA(0,1,1)(0,1,0)[12]
- SARIMA(1,1,1)(0,1,0)[12]
- SARIMA(2,1,1)(0,1,0)[12]
There are still many more, but we only use these 6 combinations because the scope would be too big if we tested all of them.
model_sarima_1 <- Arima(
AAPL_train, order = c(0,1,0), seasonal = c(0,1,0))
model_sarima_2 <- Arima(
AAPL_train, order = c(1,1,0), seasonal = c(0,1,0))
model_sarima_3 <- Arima(
AAPL_train, order = c(2,1,0), seasonal = c(0,1,0))
model_sarima_4 <- Arima(
AAPL_train, order = c(0,1,1), seasonal = c(0,1,0))
model_sarima_5 <- Arima(
AAPL_train, order = c(1,1,1), seasonal = c(0,1,0))
model_sarima_6 <- Arima(
AAPL_train, order = c(2,1,1), seasonal = c(0,1,0))
Goodness of fit
Compare these models by their AIC values
model_sarima_1$aic
#> [1] 627.9755
model_sarima_2$aic
#> [1] 614.1248
model_sarima_3$aic
#> [1] 607.3965
model_sarima_4$aic
#> [1] 611.1119
model_sarima_5$aic
#> [1] 612.5639
model_sarima_6$aic
#> [1] 605.9684
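For completeness, the same comparison could also be written as a short loop over the candidate orders (a sketch that fits the same six models and collects their AIC):
# Fit all candidate (p,1,q)(0,1,0)[12] models and collect their AIC
orders <- expand.grid(p = 0:2, q = 0:1)
orders$aic <- apply(orders, 1, function(o) {
Arima(AAPL_train, order = c(o["p"], 1, o["q"]), seasonal = c(0, 1, 0))$aic
})
orders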
Of the 6 manual SARIMA models, the one with the lowest AIC is
model_sarima_6. We still go with
model_auto_sarima first, because it has an even lower
AIC. But I am suspicious of model_auto_sarima: it has D
= 0, while it should be 1, since we differenced once seasonally. Let's see
how it goes.
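As a side note, if we wanted auto.arima() to respect the seasonal difference we chose manually, it accepts D directly (a sketch, not evaluated here):
# Force one seasonal difference instead of letting auto.arima estimate D
model_auto_D1 <- auto.arima(AAPL_train, seasonal = TRUE, D = 1)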
Forecasting
Forecast with model_auto_sarima for the next 24
months
forecast_AAPL_sarima <- forecast(
object = model_auto_sarima,
h = 24
)
forecast_AAPL_sarima
#> Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> Jul 2020 88.57132 85.57599 91.56666 83.99036 93.15229
#> Aug 2020 88.29391 83.07758 93.51024 80.31622 96.27160
#> Sep 2020 87.47726 80.73573 94.21878 77.16698 97.78753
#> Oct 2020 86.51932 78.53895 94.49969 74.31440 98.72424
#> Nov 2020 85.69997 76.64876 94.75118 71.85734 99.54260
#> Dec 2020 86.32323 76.31511 96.33135 71.01713 101.62933
#> Jan 2021 84.09560 73.21440 94.97680 67.45424 100.73696
#> Feb 2021 84.25919 72.56994 95.94845 66.38203 102.13636
#> Mar 2021 89.83250 77.38756 102.27745 70.79960 108.86541
#> Apr 2021 88.91064 75.75333 102.06795 68.78827 109.03301
#> May 2021 85.87427 72.04124 99.70731 64.71847 107.03007
#> Jun 2021 82.97098 68.49372 97.44823 60.82992 105.11203
#> Jul 2021 82.43946 67.65611 97.22281 59.83027 105.04864
#> Aug 2021 82.99652 68.01174 97.98130 60.07928 105.91376
#> Sep 2021 83.55358 68.37005 98.73712 60.33237 106.77480
#> Oct 2021 84.11064 68.73092 99.49037 60.58939 107.63190
#> Nov 2021 84.66771 69.09427 100.24115 60.85019 108.48523
#> Dec 2021 85.22477 69.45999 100.98954 61.11463 109.33491
#> Jan 2022 85.78183 69.82801 101.73565 61.38257 110.18109
#> Feb 2022 86.33889 70.19825 102.47954 61.65391 111.02388
#> Mar 2022 86.89596 70.57062 103.22129 61.92851 111.86340
#> Apr 2022 87.45302 70.94506 103.96098 62.20627 112.69976
#> May 2022 88.01008 71.32150 104.69866 62.48709 113.53307
#> Jun 2022 88.56714 71.69987 105.43442 62.77087 114.36341
Visualising
Visualize the forecast from model_auto_sarima and see how
the model fits the unseen data (the test set).
AAPL_train %>%
autoplot() +
autolayer(AAPL_test, series = 'test') +
autolayer(fitted(model_auto_sarima), series = "fitted values")+ # fitted values of the auto SARIMA model
autolayer(forecast_AAPL_sarima$mean, series = "forecast data test (Auto Sarima)") +
ggtitle("AAPL Stock") +
scale_x_continuous(name = "Time (years)") +
scale_y_continuous(name = "Value (dollars)")
Conclusion
This is not as good as I thought. The model cannot even predict the trend correctly, and the predicted values are far from the test data. My earlier suspicion turned out to be right. Let's try with model_sarima_6.
(Manual) Seasonal ARIMA
Forecasting
Forecast with model_sarima_6 for the next 24 months
forecast_AAPL_sarima_manual <- forecast(
object = model_sarima_6,
h = 24
)
forecast_AAPL_sarima_manual
#> Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> Jul 2020 90.80738 86.38474 95.23002 84.04354 97.57122
#> Aug 2020 87.93078 80.22136 95.64021 76.14023 99.72133
#> Sep 2020 88.38712 78.65448 98.11977 73.50233 103.27192
#> Oct 2020 91.98675 81.12672 102.84679 75.37776 108.59575
#> Nov 2020 99.04682 87.44986 110.64377 81.31080 116.78283
#> Dec 2020 103.42345 91.17437 115.67252 84.69010 122.15679
#> Jan 2021 112.37318 99.43489 125.31147 92.58578 132.16059
#> Feb 2021 112.44965 98.78313 126.11617 91.54851 133.35079
#> Mar 2021 99.71399 85.32680 114.10117 77.71069 121.71729
#> Apr 2021 102.29676 87.23068 117.36284 79.25518 125.33834
#> May 2021 111.83614 96.13772 127.53457 87.82747 135.84481
#> Jun 2021 120.73993 104.44415 137.03571 95.81769 145.66218
#> Jul 2021 125.09894 106.58405 143.61382 96.78287 153.41501
#> Aug 2021 122.24501 100.86378 143.62623 89.54524 154.94477
#> Sep 2021 122.70289 98.95617 146.44961 86.38542 159.02036
#> Oct 2021 126.29397 100.82215 151.76580 87.33818 165.24977
#> Nov 2021 133.34643 106.50558 160.18727 92.29689 174.39596
#> Dec 2021 137.72057 109.61542 165.82571 94.73745 180.70368
#> Jan 2022 146.67157 117.29298 176.05015 101.74090 191.60223
#> Feb 2022 146.75014 116.08454 177.41573 99.85116 193.64912
#> Mar 2022 134.01564 102.08849 165.94279 85.18727 182.84400
#> Apr 2022 136.59845 103.46417 169.73273 85.92394 187.27296
#> May 2022 146.13736 111.85406 180.42066 93.70558 198.56914
#> Jun 2022 155.04074 119.65539 190.42610 100.92352 209.15797
Visualising
AAPL_train %>%
autoplot() +
autolayer(AAPL_test, series = 'test') +
autolayer(fitted(model_sarima_6), series = "fitted values")+ # fitted values of the manual SARIMA model
autolayer(forecast_AAPL_sarima_manual$mean, series = "forecast data test (Manual SARIMA)") +
ggtitle("AAPL Stock") +
scale_x_continuous(name = "Time (years)") +
scale_y_continuous(name = "Value (dollars)")
Conclusion
We can see that model_sarima_6 forecasts better
than the auto SARIMA model. Overall, even though it misses many points, the
model fits the training data well, and the forecast can follow the values of
the test data.
Evaluation
Evaluate the models using the accuracy() function, then
look at the MAPE value.
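MAPE (mean absolute percentage error) compares the forecasts with the test values on a percentage scale:
$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|$$
where the y_t are the test values and the ŷ_t the forecasts; lower is better.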
Model Holt Winters Evaluation
accuracy(forecast_AAPL$mean, AAPL_test)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 23.45628 25.48452 23.45628 16.36253 16.36253 0.4459276 2.490074
The Holt-Winters model's MAPE is 16.36253
Model Auto SARIMA Evaluation
accuracy(forecast_AAPL_sarima$mean, AAPL_test)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 53.03586 57.04421 53.03586 36.66569 36.66569 0.7951778 5.344257
The auto SARIMA model's MAPE is 36.66569
Model Manual SARIMA Evaluation
accuracy(forecast_AAPL_sarima_manual$mean, AAPL_test)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 20.17145 22.78431 21.37533 14.50648 15.36277 0.2301237 2.341362
The manual SARIMA model's MAPE is 15.36277
The manual SARIMA model has the lowest MAPE, so we can say it is the best model among the three. In my opinion, we can use this model for forecasting.
Visualising
AAPL_train %>%
autoplot()+
autolayer(AAPL_test, series = "test")+
autolayer(model$fitted[,1], series = "fitted values (Holt Winters)")+ # fitted values of the Holt-Winters model
autolayer(forecast_AAPL$mean, series = "forecast data test (Holt Winters)") +
autolayer(forecast_AAPL_sarima_manual$mean, series = "forecast data test (Manual SARIMA)") +
ggtitle("AAPL Stock") +
scale_x_continuous(name = "Time (years)") +
scale_y_continuous(name = "Value (dollars)")
My Opinion
Of the 3 models we made, I can see that the manual SARIMA model and the Holt-Winters model are relatively good. Why? Because they can forecast whether the stock value will increase or not.
In my opinion, for this particular problem, the Holt-Winters model is the best one, even though it does not have the lowest MAPE. It is almost always right when forecasting the trend. I would say it mimics the series well enough that we could use it to decide when to buy or sell the stock.
The auto SARIMA model is the worst in this case. I spent a lot of time looking for a mistake above and still could not find anything. This method should have accuracy not that different from the others; even though we still have to check the parameters manually, auto SARIMA should not be this bad.