Google stock analysis

Problem Statement

  • The goal of this project is to analyze the trends of the Google stock “Open” price and predict its future stock price based on its historical data using time series functions.

  • The stock price will be analyzed based the impact of COVID-19 -

    • For this, the training data set will be from “Aug 19, 2004” till “Feb 28, 2020”
    • A model will be fit onto the training data and will be used forecast the future prices based on the training data
    • The actual data is the stock prices from “Mar 1 2020” till “4 June 2021”
    • The error in forecast will be derived from the comparison between the actual and predicted values
  • From the model generated, the future stock price of Google is predicted

  • The data is obtained from Yahoo Finance using quantmod package

About Google stock -

  • Google.com is the most visited website worldwide was founded in 1998 and is headquartered at Mountain View, USA.
  • Google’s initial public offering (IPO) took place on August 19, 2004.
  • The company offered 19,605,052 shares at a price of $85 per share for the IPO.

Libraries used

library(quantmod)
library(dplyr)
library(ggplot2)
library(forecast)
library(tseries)
library(DT)
library(tsfknn)
library(dygraphs)
library(TSA)

Data

Pre-COVID data

  • This data consists of Google “Open” price from “2004-08-19” to “2019-02-28”
  • The data is obtained using the getSymbols function from the quantmod library
  • getSymbols is used to load the data from a specified source, the source here being yahoo. Yahoo Finance has the historical stock data which will be the source of truth
getSymbols("GOOG", src = "yahoo", from = "2004-08-19", to = "2019-02-28")
data_pre_COVID <- as.data.frame(GOOG)

colnames(data_pre_COVID) <- c("Open","High","Low","Close","Volume","Adjusted")
head(data_pre_COVID)
##                Open     High      Low    Close   Volume Adjusted
## 2004-08-19 49.81329 51.83571 47.80083 49.98266 44871361 49.98266
## 2004-08-20 50.31640 54.33633 50.06235 53.95277 22942874 53.95277
## 2004-08-23 55.16822 56.52812 54.32139 54.49574 18342897 54.49574
## 2004-08-24 55.41230 55.59163 51.59162 52.23920 15319808 52.23920
## 2004-08-25 52.28403 53.79835 51.74604 52.80209  9232276 52.80209
## 2004-08-26 52.27905 53.77345 52.13459 53.75352  7128620 53.75352
  • Visualization

Post-COVID data

  • This data consists of Google “Open” price from “2019-03-01” till “2021-06-04”
getSymbols("GOOG", src = "yahoo", from = "2019-03-01")
data_post_COVID <- as.data.frame(GOOG)

colnames(data_post_COVID) <- c("Open","High","Low","Close","Volume","Adjusted")
head(data_post_COVID)
##               Open     High      Low   Close  Volume Adjusted
## 2019-03-01 1124.90 1142.970 1124.750 1140.99 1450300  1140.99
## 2019-03-04 1146.99 1158.280 1130.690 1147.80 1446000  1147.80
## 2019-03-05 1150.06 1169.610 1146.195 1162.03 1443200  1162.03
## 2019-03-06 1162.49 1167.566 1155.490 1157.86 1099300  1157.86
## 2019-03-07 1155.72 1156.755 1134.910 1143.30 1166600  1143.30
## 2019-03-08 1126.73 1147.080 1123.300 1142.32 1212400  1142.32
  • Visualization

Summary of variables

Data Dictionary

Variable Class Description
Open num Opening price of the stock on the day
High num Highest price of the stock on the day
Low num Lowest price of the stock on the day
Close num Closing price of the stock on the day
Volume num Total Volume Traded
Adjusted num Adjusted price of the stock including any risks or strategies
  • For this project the “Open” price of the share will be used for analysis and understanding the trend
  • A stock’s opening price is always not identical to its prior day closing price
  • The difference rises with the after-hours trading that changes the investor valuations, deals, aquisitions, losses and expectations for the stock

Summary of Pre-covid data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.41  233.48  306.53  448.05  604.77 1271.00
  • Before COVID-19, the average value of Google stock was $448.05
  • The highest the stock has risen to is $1271

Summary of Post-covid data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1043    1209    1413    1498    1719    2437
  • After COVID-19, the average value of Google stock rose to $1498
  • The highest the stock has risen to is $2437
  • From this, it can be clearly noticed that COVID has had a positive impact on the share price of Google. The same will be acheived through the time series forecasting and modeling in further steps.

Data Exploration

Plot the data and examine for non constant variance, trend or stationarity in the series

  • From the data we can observe non-constant mean, non-constant variance and increasing trend.

  • The ACF and PACF graphs will help in determining the seasonality and the ARIMA order.

  • ACF - Auto-correlation function

    • ACF explains how the present value of a given time series is correlated with the past (1-unit past, 2-unit past, …, n-unit past) values
    • In the ACF plot, the x-axis expresses the correlation coefficient whereas the y-axis mentions the number of lags
  • PACF - Partial Auto-correlation function

    • PACF is the partial autocorrelation function that explains the partial correlation between the series and lags of itself

ACF and PACF for pre-covid data

  • From the ACF plot, it can be noticed that there are significant lags uptil 36
  • From PACF plot, there is only one high spike at lag=1.
  • Also, there appears no seasonality

ACF and PACF for post-covid data

  • From the ACF plot, it can be noticed that there is a slow decay of lag
  • From PACF plot, there is only one high spike at lag=1.
  • Also, there appears no seasonality

Stabilizing the variance

  • Since there is non-constant variance in the data, a log transformation is taken to stabilize the variance.

  • It can be observed that the variance has been stabilized with log transformation.
  • The original variance was varying from 200 to 1200 for pre-covid data. It has now been transformed from 4 to 7
  • Similarly, the original variance was varying from 1000 to 2200 for pre-covid data. It has now been transformed from 7 to 7.8

Stationarity test

  • After taking log transformation to stabilize the variance, the series is plotted to see for trend and non-stationary. This will be confirmed by performing Augumented Dickey-Fuller(adf) test.

  • Null hypothesis : The series is non-stationary

  • Alternative hypothesis : The series is stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -4.3945, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -2.386, Lag order = 8, p-value = 0.4149
## alternative hypothesis: stationary
  • Pre-COVID data - The series is stationary. This is because the p-value 0.01 is less than 0.05, with with the null hypothesis can be rejected and the alternate hypothesis is acceptes that the series is stationary
  • Post-COVID data - Here, the p-value 0.41 is greater than 0.05, hence the null hypothesis cannot be rejected and thus conclude that the series is non-stationary

Taking first order difference to remove non-stationary from post-COVID data

  • On taking the first order difference it can be noticed that the correlation from the ACF and PACF graphs has been reduced significantly.Next, the p-value is checked for stationarity using ADF test.
  • The the mean is constant and acf cuts off after lag 1 and pacf cut off after lag 1 with a significant spike at lag 26.
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -7.2145, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
  • The series is stationarity and the same is proved using adf test as the p-value 0.01 is less than 0.05. Hence the null hypothesis can be rejected and thus conclude that the post-covid series is stationarity.

Model fitting

ARIMA model

  • Autoregressive integrated moving average (ARIMA)

  • Using ARIMA model, a time series can be forecasted using the series past values

  • An ARIMA model is denoted as ARIMA(p,d,q) where ‘p’ stands for AR order, ‘d’ denotes the order of differencing and ‘q’ stands for ‘MA’ order

  • For the AR(Autoregressive) process, the ACF plot will gradually decrease and simultaneously the PACF should have a sharp drop after p significant lags.

  • For MA(Moving average) process, the ACF will show a sharp drop after a certain q number of lags while PACF will show a gradual decreasing trend

  • If both ACF and PACF plots demonstrate a gradual decreasing pattern, then the ARMA process will be considered for modeling.

Fitting ARIMA for pre-covid data

## Series: log(ts(ts_data_pre_COVID)) 
## ARIMA(4,1,2) with drift 
## 
## Coefficients:
##           ar1      ar2      ar3     ar4     ma1     ma2  drift
##       -0.2098  -0.7158  -0.0205  0.0506  0.1656  0.7335  8e-04
## s.e.   0.2102   0.1423   0.0204  0.0169  0.2103  0.1372  3e-04
## 
## sigma^2 estimated as 0.0003727:  log likelihood=9244.69
## AIC=-18473.39   AICc=-18473.35   BIC=-18423.76
  • auto.arima function suggests the order as (4,1,2). However, since d=0, the model ARIMA(4,0,2) can be tried

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,0,2) with non-zero mean
## Q* = 12.051, df = 3, p-value = 0.00721
## 
## Model df: 7.   Total lags used: 10
  • There are significant lags, hence a higher order of ARIMA can be tried

Fitting ARIMA(7,2,2)

  • On trying for higher order, the residuals are good. However, p-value is less than 0.05 which means there is correlation between the lags.
  • On increasing the differing order to 1, the p-value is still less than 0.05
  • Hence, the model is tried with a differencing order 2

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(7,2,2)
## Q* = 7.0144, df = 3, p-value = 0.07144
## 
## Model df: 9.   Total lags used: 12
  • On fitting ARIMA(10,0,10), the residual plot almost has constant mean and variance and acf and pacf plot resemble white noise. There is one significant spike at lag=33, however with the large amount of data that is being dealt, this can be considered as insignificant

  • Ljung-Box test

    • H0: Null hypothesis: There is no auto-correlation between residuals.
    • HA: Alternative hypothesis: There is auto-correlation between residuals.
    • The p value is greater than 0.01, hence the null hypothesis can be rejected. The residuals behave like white noise and have no correlation between them.
## Series: ts(ts_data_pre_COVID) 
## ARIMA(7,2,2) 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2     ar3     ar4      ar5      ar6      ar7      ma1
##       -0.8643  -0.0154  0.0234  0.0204  -0.0083  -0.0603  -0.0586  -0.1791
## s.e.   0.0889   0.0224  0.0222  0.0221   0.0222   0.0222   0.0167   0.0877
##           ma2
##       -0.8169
## s.e.   0.0876
## 
## sigma^2 estimated as 0.0003726:  log likelihood=9241.22
## AIC=-18462.44   AICc=-18462.38   BIC=-18400.4
## 
## Training set error measures:
##                    ME     RMSE      MAE         MPE     MAPE     MASE
## Training set -0.13945 8.669302 5.194017 -0.07431531 1.302684 0.997012
##                    ACF1
## Training set -0.0332432
  • The AIC of this model is -18462.44 and BIC is -18400.4

Fitting ARIMA for post-covid data

## Series: log(ts(ts_data_post_COVID)) 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##           ma1   drift
##       -0.0916  0.0013
## s.e.   0.0420  0.0007
## 
## sigma^2 estimated as 0.000349:  log likelihood=1461.04
## AIC=-2916.07   AICc=-2916.03   BIC=-2903.03
  • auto.arima suggests the model (0,1,1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 6.6255, df = 9, p-value = 0.676
## 
## Model df: 1.   Total lags used: 10
  • On fitting ARIMA(0,1,1), the residual plot almost has constant mean and variance and acf and pacf plot resemble white noise.

  • Ljung-Box test

    • H0: Null hypothesis: There is no auto-correlation between residuals.
    • HA: Alternative hypothesis: There is auto-correlation between residuals.
    • The p value is greater than 0.01, hence the null hypothesis cannot be rejected, which means there is no autocorrelation between the lags

Fitting ARIMA(0,2,1) for post-covid data

## Series: ts(ts_data_post_COVID) 
## ARIMA(0,1,1) 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ma1
##       -0.0850
## s.e.   0.0417
## 
## sigma^2 estimated as 0.0003505:  log likelihood=1459.27
## AIC=-2914.53   AICc=-2914.51   BIC=-2905.84
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 2.514891 27.74935 19.08345 0.1303972 1.291125 0.9933073 0.02749667

Forecasting the pre-COVID data

  • The forecast for next 60 days is obtained
pre_covid_forecasted_values <- data.frame(forecasted_values=pre_covid_forecast$mean)
head(pre_covid_forecasted_values)
##   forecasted_values
## 1          1107.819
## 2          1108.277
## 3          1109.808
## 4          1109.391
## 5          1110.591
## 6          1111.239

Forecasting the post-COVID data

  • The forecast for next 30 days is obtained

Validation of forecast

  • Here, the forecast obtained from the pre-covid data is matched up with the first 60 days of post-covid data to identify the error in forecast
post_covid_first_60_observations <- ts_data_post_COVID[1:60]

error <- (pre_covid_forecasted_values - post_covid_first_60_observations)^2
sqrt(sum(error)/60)
## [1] 73.19979
  • From this, the RMSE is 73.19 which means the error rate is 73.19%. Since we had considered a model with a differencing order=2 to obtain a p-value greater than 0.05, the model that was fit was not good to yield a lower RMSE.