Google-stock-analysis.utf8

Problem Statement

The goal of this project is to analyze the trends of the Google stock “Open” price and predict its future stock price based on its historical data using time series functions.
The stock price will be analyzed based the impact of COVID-19 -
- For this, the training data set will be from “Aug 19, 2004” till “Feb 28, 2020”
- A model will be fit onto the training data and will be used forecast the future prices based on the training data
- The actual data is the stock prices from “Mar 1 2020” till “4 June 2021”
- The error in forecast will be derived from the comparison between the actual and predicted values
From the model generated, the future stock price of Google is predicted
The data is obtained from Yahoo Finance using quantmod package

About Google stock -

Google.com is the most visited website worldwide was founded in 1998 and is headquartered at Mountain View, USA.
Google’s initial public offering (IPO) took place on August 19, 2004.
The company offered 19,605,052 shares at a price of $85 per share for the IPO.

Libraries used

library(quantmod)
library(dplyr)
library(ggplot2)
library(forecast)
library(tseries)
library(DT)
library(tsfknn)
library(dygraphs)
library(TSA)

Data

Pre-COVID data

This data consists of Google “Open” price from “2004-08-19” to “2019-02-28”
The data is obtained using the getSymbols function from the quantmod library
getSymbols is used to load the data from a specified source, the source here being yahoo. Yahoo Finance has the historical stock data which will be the source of truth

getSymbols("GOOG", src = "yahoo", from = "2004-08-19", to = "2019-02-28")
data_pre_COVID <- as.data.frame(GOOG)

colnames(data_pre_COVID) <- c("Open","High","Low","Close","Volume","Adjusted")

head(data_pre_COVID)

##                Open     High      Low    Close   Volume Adjusted
## 2004-08-19 49.81329 51.83571 47.80083 49.98266 44871361 49.98266
## 2004-08-20 50.31640 54.33633 50.06235 53.95277 22942874 53.95277
## 2004-08-23 55.16822 56.52812 54.32139 54.49574 18342897 54.49574
## 2004-08-24 55.41230 55.59163 51.59162 52.23920 15319808 52.23920
## 2004-08-25 52.28403 53.79835 51.74604 52.80209  9232276 52.80209
## 2004-08-26 52.27905 53.77345 52.13459 53.75352  7128620 53.75352

Visualization

Post-COVID data

This data consists of Google “Open” price from “2019-03-01” till “2021-06-04”

getSymbols("GOOG", src = "yahoo", from = "2019-03-01")
data_post_COVID <- as.data.frame(GOOG)

colnames(data_post_COVID) <- c("Open","High","Low","Close","Volume","Adjusted")

head(data_post_COVID)

##               Open     High      Low   Close  Volume Adjusted
## 2019-03-01 1124.90 1142.970 1124.750 1140.99 1450300  1140.99
## 2019-03-04 1146.99 1158.280 1130.690 1147.80 1446000  1147.80
## 2019-03-05 1150.06 1169.610 1146.195 1162.03 1443200  1162.03
## 2019-03-06 1162.49 1167.566 1155.490 1157.86 1099300  1157.86
## 2019-03-07 1155.72 1156.755 1134.910 1143.30 1166600  1143.30
## 2019-03-08 1126.73 1147.080 1123.300 1142.32 1212400  1142.32

Visualization

Summary of variables

Data Dictionary

Variable	Class	Description
Open	num	Opening price of the stock on the day
High	num	Highest price of the stock on the day
Low	num	Lowest price of the stock on the day
Close	num	Closing price of the stock on the day
Volume	num	Total Volume Traded
Adjusted	num	Adjusted price of the stock including any risks or strategies

For this project the “Open” price of the share will be used for analysis and understanding the trend
A stock’s opening price is always not identical to its prior day closing price
The difference rises with the after-hours trading that changes the investor valuations, deals, aquisitions, losses and expectations for the stock

Summary of Pre-covid data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.41  233.48  306.53  448.05  604.77 1271.00

Before COVID-19, the average value of Google stock was $448.05
The highest the stock has risen to is $1271

Summary of Post-covid data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1043    1209    1413    1498    1719    2437

After COVID-19, the average value of Google stock rose to $1498
The highest the stock has risen to is $2437
From this, it can be clearly noticed that COVID has had a positive impact on the share price of Google. The same will be acheived through the time series forecasting and modeling in further steps.

Data Exploration

Plot the data and examine for non constant variance, trend or stationarity in the series

From the data we can observe non-constant mean, non-constant variance and increasing trend.
The ACF and PACF graphs will help in determining the seasonality and the ARIMA order.
ACF - Auto-correlation function
- ACF explains how the present value of a given time series is correlated with the past (1-unit past, 2-unit past, …, n-unit past) values
- In the ACF plot, the x-axis expresses the correlation coefficient whereas the y-axis mentions the number of lags
PACF - Partial Auto-correlation function
- PACF is the partial autocorrelation function that explains the partial correlation between the series and lags of itself

ACF and PACF for pre-covid data

From the ACF plot, it can be noticed that there are significant lags uptil 36
From PACF plot, there is only one high spike at lag=1.
Also, there appears no seasonality

ACF and PACF for post-covid data

From the ACF plot, it can be noticed that there is a slow decay of lag
From PACF plot, there is only one high spike at lag=1.
Also, there appears no seasonality

Stabilizing the variance

Since there is non-constant variance in the data, a log transformation is taken to stabilize the variance.

It can be observed that the variance has been stabilized with log transformation.
The original variance was varying from 200 to 1200 for pre-covid data. It has now been transformed from 4 to 7
Similarly, the original variance was varying from 1000 to 2200 for pre-covid data. It has now been transformed from 7 to 7.8

Stationarity test

After taking log transformation to stabilize the variance, the series is plotted to see for trend and non-stationary. This will be confirmed by performing Augumented Dickey-Fuller(adf) test.
Null hypothesis : The series is non-stationary
Alternative hypothesis : The series is stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -4.3945, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -2.386, Lag order = 8, p-value = 0.4149
## alternative hypothesis: stationary

Pre-COVID data - The series is stationary. This is because the p-value 0.01 is less than 0.05, with with the null hypothesis can be rejected and the alternate hypothesis is acceptes that the series is stationary
Post-COVID data - Here, the p-value 0.41 is greater than 0.05, hence the null hypothesis cannot be rejected and thus conclude that the series is non-stationary

Taking first order difference to remove non-stationary from post-COVID data

On taking the first order difference it can be noticed that the correlation from the ACF and PACF graphs has been reduced significantly.Next, the p-value is checked for stationarity using ADF test.
The the mean is constant and acf cuts off after lag 1 and pacf cut off after lag 1 with a significant spike at lag 26.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -7.2145, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The series is stationarity and the same is proved using adf test as the p-value 0.01 is less than 0.05. Hence the null hypothesis can be rejected and thus conclude that the post-covid series is stationarity.

Model fitting

ARIMA model

Autoregressive integrated moving average (ARIMA)
Using ARIMA model, a time series can be forecasted using the series past values
An ARIMA model is denoted as ARIMA(p,d,q) where ‘p’ stands for AR order, ‘d’ denotes the order of differencing and ‘q’ stands for ‘MA’ order
For the AR(Autoregressive) process, the ACF plot will gradually decrease and simultaneously the PACF should have a sharp drop after p significant lags.
For MA(Moving average) process, the ACF will show a sharp drop after a certain q number of lags while PACF will show a gradual decreasing trend
If both ACF and PACF plots demonstrate a gradual decreasing pattern, then the ARMA process will be considered for modeling.

Fitting ARIMA for pre-covid data

## Series: log(ts(ts_data_pre_COVID)) 
## ARIMA(4,1,2) with drift 
## 
## Coefficients:
##           ar1      ar2      ar3     ar4     ma1     ma2  drift
##       -0.2098  -0.7158  -0.0205  0.0506  0.1656  0.7335  8e-04
## s.e.   0.2102   0.1423   0.0204  0.0169  0.2103  0.1372  3e-04
## 
## sigma^2 estimated as 0.0003727:  log likelihood=9244.69
## AIC=-18473.39   AICc=-18473.35   BIC=-18423.76

auto.arima function suggests the order as (4,1,2). However, since d=0, the model ARIMA(4,0,2) can be tried

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,0,2) with non-zero mean
## Q* = 12.051, df = 3, p-value = 0.00721
## 
## Model df: 7.   Total lags used: 10

There are significant lags, hence a higher order of ARIMA can be tried

Fitting ARIMA(7,2,2)

On trying for higher order, the residuals are good. However, p-value is less than 0.05 which means there is correlation between the lags.
On increasing the differing order to 1, the p-value is still less than 0.05
Hence, the model is tried with a differencing order 2

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(7,2,2)
## Q* = 7.0144, df = 3, p-value = 0.07144
## 
## Model df: 9.   Total lags used: 12

On fitting ARIMA(10,0,10), the residual plot almost has constant mean and variance and acf and pacf plot resemble white noise. There is one significant spike at lag=33, however with the large amount of data that is being dealt, this can be considered as insignificant
Ljung-Box test
- H0: Null hypothesis: There is no auto-correlation between residuals.
- HA: Alternative hypothesis: There is auto-correlation between residuals.
- The p value is greater than 0.01, hence the null hypothesis can be rejected. The residuals behave like white noise and have no correlation between them.

## Series: ts(ts_data_pre_COVID) 
## ARIMA(7,2,2) 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2     ar3     ar4      ar5      ar6      ar7      ma1
##       -0.8643  -0.0154  0.0234  0.0204  -0.0083  -0.0603  -0.0586  -0.1791
## s.e.   0.0889   0.0224  0.0222  0.0221   0.0222   0.0222   0.0167   0.0877
##           ma2
##       -0.8169
## s.e.   0.0876
## 
## sigma^2 estimated as 0.0003726:  log likelihood=9241.22
## AIC=-18462.44   AICc=-18462.38   BIC=-18400.4
## 
## Training set error measures:
##                    ME     RMSE      MAE         MPE     MAPE     MASE
## Training set -0.13945 8.669302 5.194017 -0.07431531 1.302684 0.997012
##                    ACF1
## Training set -0.0332432

The AIC of this model is -18462.44 and BIC is -18400.4

Fitting ARIMA for post-covid data

## Series: log(ts(ts_data_post_COVID)) 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##           ma1   drift
##       -0.0916  0.0013
## s.e.   0.0420  0.0007
## 
## sigma^2 estimated as 0.000349:  log likelihood=1461.04
## AIC=-2916.07   AICc=-2916.03   BIC=-2903.03

auto.arima suggests the model (0,1,1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 6.6255, df = 9, p-value = 0.676
## 
## Model df: 1.   Total lags used: 10

On fitting ARIMA(0,1,1), the residual plot almost has constant mean and variance and acf and pacf plot resemble white noise.
Ljung-Box test
- H0: Null hypothesis: There is no auto-correlation between residuals.
- HA: Alternative hypothesis: There is auto-correlation between residuals.
- The p value is greater than 0.01, hence the null hypothesis cannot be rejected, which means there is no autocorrelation between the lags

Fitting ARIMA(0,2,1) for post-covid data

## Series: ts(ts_data_post_COVID) 
## ARIMA(0,1,1) 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ma1
##       -0.0850
## s.e.   0.0417
## 
## sigma^2 estimated as 0.0003505:  log likelihood=1459.27
## AIC=-2914.53   AICc=-2914.51   BIC=-2905.84
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 2.514891 27.74935 19.08345 0.1303972 1.291125 0.9933073 0.02749667

Forecasting the pre-COVID data

The forecast for next 60 days is obtained

pre_covid_forecasted_values <- data.frame(forecasted_values=pre_covid_forecast$mean)
head(pre_covid_forecasted_values)

##   forecasted_values
## 1          1107.819
## 2          1108.277
## 3          1109.808
## 4          1109.391
## 5          1110.591
## 6          1111.239

Forecasting the post-COVID data

The forecast for next 30 days is obtained

Validation of forecast

Here, the forecast obtained from the pre-covid data is matched up with the first 60 days of post-covid data to identify the error in forecast

post_covid_first_60_observations <- ts_data_post_COVID[1:60]

error <- (pre_covid_forecasted_values - post_covid_first_60_observations)^2
sqrt(sum(error)/60)

## [1] 73.19979

From this, the RMSE is 73.19 which means the error rate is 73.19%. Since we had considered a model with a differencing order=2 to obtain a p-value greater than 0.05, the model that was fit was not good to yield a lower RMSE.

Google stock analysis