The goal of this project is to analyze the trends of the Google stock “Open” price and predict its future stock price based on its historical data using time series functions.
The stock price is analyzed in light of the impact of COVID-19, with the pre-COVID and post-COVID periods modeled separately.
From the fitted model, the future stock price of Google is predicted.
The data is obtained from Yahoo Finance using the quantmod package.
About Google stock
library(quantmod)
library(dplyr)
library(ggplot2)
library(forecast)
library(tseries)
library(DT)
library(tsfknn)
library(dygraphs)
library(TSA)
getSymbols("GOOG", src = "yahoo", from = "2004-08-19", to = "2019-02-28")
data_pre_COVID <- as.data.frame(GOOG)
colnames(data_pre_COVID) <- c("Open","High","Low","Close","Volume","Adjusted")
head(data_pre_COVID)
## Open High Low Close Volume Adjusted
## 2004-08-19 49.81329 51.83571 47.80083 49.98266 44871361 49.98266
## 2004-08-20 50.31640 54.33633 50.06235 53.95277 22942874 53.95277
## 2004-08-23 55.16822 56.52812 54.32139 54.49574 18342897 54.49574
## 2004-08-24 55.41230 55.59163 51.59162 52.23920 15319808 52.23920
## 2004-08-25 52.28403 53.79835 51.74604 52.80209 9232276 52.80209
## 2004-08-26 52.27905 53.77345 52.13459 53.75352 7128620 53.75352
getSymbols("GOOG", src = "yahoo", from = "2019-03-01")
data_post_COVID <- as.data.frame(GOOG)
colnames(data_post_COVID) <- c("Open","High","Low","Close","Volume","Adjusted")
head(data_post_COVID)
## Open High Low Close Volume Adjusted
## 2019-03-01 1124.90 1142.970 1124.750 1140.99 1450300 1140.99
## 2019-03-04 1146.99 1158.280 1130.690 1147.80 1446000 1147.80
## 2019-03-05 1150.06 1169.610 1146.195 1162.03 1443200 1162.03
## 2019-03-06 1162.49 1167.566 1155.490 1157.86 1099300 1157.86
## 2019-03-07 1155.72 1156.755 1134.910 1143.30 1166600 1143.30
## 2019-03-08 1126.73 1147.080 1123.300 1142.32 1212400 1142.32
Data Dictionary
| Variable | Class | Description |
|---|---|---|
| Open | num | Opening price of the stock on the day |
| High | num | Highest price of the stock on the day |
| Low | num | Lowest price of the stock on the day |
| Close | num | Closing price of the stock on the day |
| Volume | num | Total volume traded on the day |
| Adjusted | num | Closing price adjusted for corporate actions such as stock splits and dividends |
Summary of pre-COVID data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.41 233.48 306.53 448.05 604.77 1271.00
Summary of post-COVID data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1043 1209 1413 1498 1719 2437
Plot the data and examine the series for non-constant variance, trend, and non-stationarity.
From the plots we can observe a non-constant mean, non-constant variance, and an increasing trend.
The ACF and PACF plots help in determining seasonality and the ARIMA order.
ACF - Auto-correlation function
PACF - Partial Auto-correlation function
ACF and PACF for pre-COVID data
ACF and PACF for post-COVID data
Stabilizing the variance
Stationarity test
After taking a log transformation to stabilize the variance, the series is plotted to check for trend and non-stationarity. This is then confirmed by performing the Augmented Dickey-Fuller (ADF) test.
Null hypothesis : The series is non-stationary
Alternative hypothesis : The series is stationary
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -4.3945, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -2.386, Lag order = 8, p-value = 0.4149
## alternative hypothesis: stationary
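The first test, on the pre-COVID series, rejects the null hypothesis (p-value = 0.01), whereas the second, on the post-COVID series, does not (p-value = 0.4149), so the post-COVID series is treated as non-stationary.
Taking the first-order difference to remove non-stationarity from the post-COVID data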
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -7.2145, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
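The log-then-difference workflow above can be sketched on simulated data. This is a minimal illustration and not the code behind the output shown: `price` is a synthetic random walk with drift (an assumption standing in for the Open price), all variable names are hypothetical, and the `adf.test` calls run only if the tseries package is installed.

```r
# Synthetic stand-in for a stock price: exp of a random walk with drift (assumption)
set.seed(42)
n <- 600
log_price <- log(1100) + cumsum(rnorm(n, mean = 0.001, sd = 0.02))
price <- exp(log_price)

# Step 1: log transform to stabilize the variance
log_series <- log(price)

# Step 2: first-order difference to remove the stochastic trend
diff_series <- diff(log_series, differences = 1)

# The differenced log series is the series of daily log returns
returns_mean <- mean(diff_series)
returns_sd <- sd(diff_series)

# Step 3: ADF test before and after differencing (needs the tseries package)
if (requireNamespace("tseries", quietly = TRUE)) {
  print(tseries::adf.test(log_series))   # random walk: typically fails to reject the null
  print(tseries::adf.test(diff_series))  # differenced series: typically rejects the null
}
```

Differencing once shortens the series by one observation, which is why a model with d = 1 is fitted on one fewer data point than the original series contains.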
ARIMA model
Autoregressive integrated moving average (ARIMA)
Using an ARIMA model, a time series can be forecast from its own past values.
An ARIMA model is denoted ARIMA(p,d,q), where p is the AR order, d is the order of differencing, and q is the MA order.
For an AR (autoregressive) process, the ACF decays gradually while the PACF drops sharply after p significant lags.
For an MA (moving average) process, the ACF drops sharply after q significant lags while the PACF decays gradually.
If both the ACF and PACF decay gradually, a mixed ARMA process is considered for modeling.
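The ACF and PACF signatures described above can be checked on simulated processes. The sketch below uses only base R; the coefficients (0.7), the sample size, and the variable names are illustrative assumptions and are unrelated to the Google series.

```r
set.seed(1)
n <- 5000

# AR(1) with phi = 0.7: ACF decays gradually, PACF cuts off after lag 1
ar_series <- arima.sim(model = list(ar = 0.7), n = n)
ar_acf  <- as.numeric(acf(ar_series, lag.max = 10, plot = FALSE)$acf)[-1]  # drop lag 0
ar_pacf <- as.numeric(pacf(ar_series, lag.max = 10, plot = FALSE)$acf)

# MA(1) with theta = 0.7: ACF cuts off after lag 1, PACF decays gradually in magnitude
ma_series <- arima.sim(model = list(ma = 0.7), n = n)
ma_acf  <- as.numeric(acf(ma_series, lag.max = 10, plot = FALSE)$acf)[-1]
ma_pacf <- as.numeric(pacf(ma_series, lag.max = 10, plot = FALSE)$acf)

# Approximate 95% significance bound drawn on ACF/PACF plots
bound <- 1.96 / sqrt(n)

# First five lags of each function, one row per function
round(rbind(ar_acf, ar_pacf, ma_acf, ma_pacf)[, 1:5], 3)
```

Values whose magnitude stays below `bound` are the ones treated as insignificant when reading the order off the plots.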
Fitting ARIMA for pre-COVID data
## Series: log(ts(ts_data_pre_COVID))
## ARIMA(4,1,2) with drift
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 drift
## -0.2098 -0.7158 -0.0205 0.0506 0.1656 0.7335 8e-04
## s.e. 0.2102 0.1423 0.0204 0.0169 0.2103 0.1372 3e-04
##
## sigma^2 estimated as 0.0003727: log likelihood=9244.69
## AIC=-18473.39 AICc=-18473.35 BIC=-18423.76
##
## Ljung-Box test
##
## data: Residuals from ARIMA(4,0,2) with non-zero mean
## Q* = 12.051, df = 3, p-value = 0.00721
##
## Model df: 7. Total lags used: 10
Fitting ARIMA(7,2,2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,2,2)
## Q* = 7.0144, df = 3, p-value = 0.07144
##
## Model df: 9. Total lags used: 12
On fitting ARIMA(7,2,2), the residuals have nearly constant mean and variance, and the ACF and PACF plots resemble those of white noise. There is one significant spike at lag 33; however, with this many lags examined, roughly one in twenty is expected to cross the 95% bounds by chance, so the spike can be treated as insignificant.
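The residual check can also be run directly with base R's `Box.test`. The sketch below is an illustration on a simulated ARMA(1,1) series, not the pre-COVID model itself; the series, the coefficients, and the variable names are assumptions. The degrees of freedom equal the number of lags minus the number of fitted coefficients, which matches the Ljung-Box outputs above (10 - 7 = 3 and 12 - 9 = 3).

```r
set.seed(7)
# Synthetic ARMA(1,1) series standing in for the transformed price series (assumption)
y <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 2000)

# Fit the same structure that generated the data
fit <- arima(y, order = c(1, 0, 1))

# Ljung-Box test on the residuals; fitdf is the number of estimated ARMA coefficients
n_coef <- 2
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = n_coef)

lb$parameter  # degrees of freedom = lag - fitdf = 8
lb$p.value    # a large p-value gives no evidence of remaining autocorrelation
```

A p-value above the chosen significance level means the hypothesis of uncorrelated residuals is not rejected, which is the outcome sought when selecting the model order.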
Ljung-Box test
## Series: ts(ts_data_pre_COVID)
## ARIMA(7,2,2)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1
## -0.8643 -0.0154 0.0234 0.0204 -0.0083 -0.0603 -0.0586 -0.1791
## s.e. 0.0889 0.0224 0.0222 0.0221 0.0222 0.0222 0.0167 0.0877
## ma2
## -0.8169
## s.e. 0.0876
##
## sigma^2 estimated as 0.0003726: log likelihood=9241.22
## AIC=-18462.44 AICc=-18462.38 BIC=-18400.4
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.13945 8.669302 5.194017 -0.07431531 1.302684 0.997012
## ACF1
## Training set -0.0332432
Fitting ARIMA for post-COVID data
## Series: log(ts(ts_data_post_COVID))
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## -0.0916 0.0013
## s.e. 0.0420 0.0007
##
## sigma^2 estimated as 0.000349: log likelihood=1461.04
## AIC=-2916.07 AICc=-2916.03 BIC=-2903.03
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 6.6255, df = 9, p-value = 0.676
##
## Model df: 1. Total lags used: 10
On fitting ARIMA(0,1,1), the residuals have nearly constant mean and variance, and the ACF and PACF plots resemble those of white noise. The Ljung-Box p-value of 0.676 is consistent with uncorrelated residuals.
Ljung-Box test
Fitting ARIMA(0,1,1) for post-COVID data
## Series: ts(ts_data_post_COVID)
## ARIMA(0,1,1)
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ma1
## -0.0850
## s.e. 0.0417
##
## sigma^2 estimated as 0.0003505: log likelihood=1459.27
## AIC=-2914.53 AICc=-2914.51 BIC=-2905.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.514891 27.74935 19.08345 0.1303972 1.291125 0.9933073 0.02749667
Forecasting the pre-COVID data
pre_covid_forecasted_values <- data.frame(forecasted_values=pre_covid_forecast$mean)
head(pre_covid_forecasted_values)
## forecasted_values
## 1 1107.819
## 2 1108.277
## 3 1109.808
## 4 1109.391
## 5 1110.591
## 6 1111.239
Forecasting the post-COVID data
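# Compare the pre-COVID forecast with the first 60 observed post-COVID values
post_covid_first_60_observations <- ts_data_post_COVID[1:60]
# Squared forecast errors
error <- (pre_covid_forecasted_values - post_covid_first_60_observations)^2
# Root mean squared error (RMSE) over the 60 forecast steps
sqrt(sum(error)/60)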
## [1] 73.19979
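The forecast-and-evaluate step can be sketched end to end with base R. This is a minimal example under stated assumptions: `price` is a synthetic series rather than GOOG data, the variable names are hypothetical, and base `arima` is used in place of the forecast package. Fitting on `log(train)` and applying `exp` to the predictions corresponds to the Box-Cox transformation with lambda = 0 shown in the model summaries; base `arima` with d = 1 fits no drift term.

```r
set.seed(123)
# Synthetic price path: exp of a random walk with drift (assumption, not GOOG data)
n <- 560
price <- exp(log(1100) + cumsum(rnorm(n, mean = 0.001, sd = 0.02)))

# Hold out the last 60 observations for evaluation
h <- 60
train <- price[1:(n - h)]
test  <- price[(n - h + 1):n]

# Fit ARIMA(0,1,1) on the log scale
fit <- arima(log(train), order = c(0, 1, 1))

# Forecast h steps ahead and back-transform to the price scale
pred <- predict(fit, n.ahead = h)
forecast_price <- exp(as.numeric(pred$pred))

# Root mean squared error between the forecast and the held-out values
rmse <- sqrt(mean((forecast_price - test)^2))
rmse
```

Because the RMSE is computed on the price scale, it is expressed in the same units as the stock price and can be read against the typical price level of the evaluation window.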