Overview

‘Time’ is the most important factor which ensures success in a business. It’s difficult to keep up with the pace of time. But, technology has developed some powerful methods using which we can ‘see things’ ahead of time. This project describes different timeseries and machine learning forecasting models applied to compare two real stocks close price dataset. ANLY 565 Time Series Project encourage us to apply machine learning techniques that go beyond standard linear regression. For this study we will start with a general idea of the stock price, incluiding dataset analysis. Followed by a general description and analysis of the dataset, our objective is to apply different forecasting predictive models for “SPDR S&P Oil & Gas Exploration & Production ETF (XOP) and Energy Select Sector SPDR Fund ETF (XLE)” stocks daily close price. This project will provide a step-by-step guide for fitting an regression and ARIMA model and we will compare the performance of both XOP and XLE stocks.

Introduction

XOP tracks an equal-weighted index of companies in the US oil & gas exploration & production space.XOP offers an equal-weight approach to oil & gas exploration & production. Equal weighting provides diversified exposure to an industry normally dominated by a few big names, but also produces a massive bias to small- and midcaps.XOP is reasonably priced, and usually earns back most of its fee through securities lending revenue. Whereas, XLE offers supremely liquid exposure to a marketlike basket of US energy firms. “Marketlike” in the context of the energy sector means concentrated exposure to the giants of the oil and gas industries. XLE pulls its stocks from the S&P 500 rather than the total market, so its portfolio is somewhat smaller than that of peer funds, and it favors large-caps. We choose these stocks to know if the stock value would retain its growth based on its previous values. As stock values change with time and are affected by other indicators that may or may not substantiate its real value, we can use a time-series model that examines differences between values in the series instead of actual values. This scenario falls perfectly in line with regression analysis, or more specifically, an ARIMA model.

Dataset

Dataset can be upload from Yahoo Finance,one of the most popular public resources for stock data. I used a genaralized code from which we can pull any stock and any range by using ETF code of the stock. We are using past five years data of XOP and XLE

Project Outline

Impoting the data

stock1<-"XOP"
stock2<- "XLE"
n.chosen.stocks<-50
#my.tickers<-c(paste0(nifty100list$Symbol,".NS"),ticker.MktIdx)
my.tickers <- c(stock1,stock2 )


first.date <- Sys.Date() - (365*5)
last.date <- Sys.Date()+1
freq.data <- 'weekly'
# set tickers

l.out <- BatchGetSymbols(tickers = my.tickers,
                         first.date = first.date,
                         last.date = last.date,
                         freq.data = freq.data,
                         bench.ticker = stock1,
                         do.fill.missing.prices = TRUE,
                         
                         cache.folder = file.path(tempdir(),
                                                  'BGS_Cache') )
l.out$df.tickers<-na_interpolation(l.out$df.tickers)
df.stocks <- l.out$df.tickers

#Create a new variable for XOP stock closing price
stock1_df<-df.stocks$price.close[df.stocks$ticker== stock1]
#Create a new variable for SPY stock closing price
stock2_df<-df.stocks$price.close[df.stocks$ticker== stock2]
#Create a date variable using reference date of stock dataset
date<-df.stocks$ref.date
date<-as.Date(date)

#Create a new data frame that includes XOP, SPY close price and date
new_df<-as.data.frame(cbind(stock1_df, stock2_df,date))
names(new_df)<-c(stock1,stock2,"date")
new_df$date<-as.Date(new_df$date)

Exploratory Data Analysis

Univariate distribution

p1<-ggplot(new_df, aes(XOP)) + 
  geom_histogram(bins = 50, aes(y = ..density..), col = "red", fill = "orange", alpha = 0.3) + 
  geom_density()

p2<-ggplot(new_df, aes(XLE)) + 
  geom_histogram(bins = 50, aes(y = ..density..), col = "red", fill = "orange", alpha = 0.3) + 
  geom_density()
grid.arrange(p1,p2, nrow=2,ncol=1)

From the above plot, we can say XOP has a left skewed data and moslty values are concentrated over the interval. Also, we can observe that the peak falls between 102 to 170 for XOP. For XLE, the distribution is quite similar to XOP, it is also left skewed and it seems correlated to each other which we will study in this project.

Time series plot

par(mfrow=c(1,2))
stock1ts<-ts(stock1_df,start = c(2015,33), end = c(2020,31),frequency = 52)
plot1<-ts_plot(stock1ts,title="XOP Weekly Stock Price")

stock2ts<-ts(stock2_df, start = c(2015,33),end = c(2020,31),frequency = 52)
plot2<-ts_plot(stock2ts, title="XLE Weekly Stock Price")

don <- xts(x = new_df[,-3], order.by = new_df$date)
p <- dygraph(don, main= "Time Series plot of XOP and XLE")
p

we can see that XOP had outperformed compared to XLE. We can see the ts plot and conclude that XOP has a lower 5-year return than XLE and this is may be XOP has a higher expense ratio than XLE.

Decomposing the Data into Components

To further analyze trends in the data, we will decompose the data into seasonal, and trend components. Decomposition is the foundation for building the ARIMA model by providing insight towards the changes in stock fluctuations.

Decomposition of XOP time series

ts_decompose(stock1ts)

In the above graph, the trend component describes the overall pattern of the series over the entire range of time,taking into account increases and decreases in prices together. From the plot, the trend is overall decreasing, the downfall in stock price occurs every year in the end of the winter season specially January and February. Random error shows the difference between the observed price and the estimated price. We can see more random error fluctuation in 2019, meaning there is greater variance and greater statistical error.

Decompositing of XLE time series

ts_decompose(stock2ts)

In the above graph, the trend component describes the overall pattern of the series over the entire range of time,taking into account increases and decreases in prices together. From the plot, the trend is overall increasing, the peak in stock price occurs every year in winter season(December, January,February) and sudden decreases after winter season.Random error shows the difference between the observed price and the estimated price. We can see more random error fluctuation in 2019, meaning there is greater variance and greater statistical error. This means more recent and future points become more unpredictable.

Testing of Stationary

H0: The time series data includes a unit root and time series is a Non-stationary time series

HA: The time-series data does not include a unit root, and is stationary.

adf.test(stock1ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  stock1ts
## Dickey-Fuller = -1.8444, Lag order = 6, p-value = 0.6414
## alternative hypothesis: stationary
adf.test(stock2ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  stock2ts
## Dickey-Fuller = -1.7806, Lag order = 6, p-value = 0.6683
## alternative hypothesis: stationary

The p-value for XOP is 0.6 which means we fail to reject the null hypothesis and the time series is nonstationary. So, we conclude that the XOP is likely to contain unit roots. The p-value of XLE is 0.71 which means the time series is also a nonstationary and does not include unit roots.

Making time series Stationary

We already know that both stocks time series is a nonstationary time series so we use differencing to make it a stationary time series.

Defferencing

In this project, we only difference once. Differencing to the first order which does mean, variance and autocorrelation constant over time. Also, y-intercept and slope, are constant over time.

# Differenced Price Returns
stock1_dif<-c(NA,diff(log(stock1ts)))
stock2_dif<-c(NA,diff(log(stock2ts)))
plot.ts(stock1_dif, main = "XOP Differenced Log Price")

plot.ts(stock2_dif, main = "XOP Differenced Log Price")

In the above plot, most major fluctuations are found in 2019, which is due to COVID-19, which was when the stock returns (and price) changed the most.

Lets create a new variable by using ts() and cbind()

new_df <- ts(cbind(stock1_dif,stock2_dif), start = c(2015,33), end = c(2020,31), frequency = 52)
new_df<- na.omit(new_df)
#Create new variable for both time series data
stock01<-new_df[,1]
stock02<-new_df[,2]

Lets again perform the ADF Test

adf.test(stock01, alternative = "stationary", k = 0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  stock01
## Dickey-Fuller = -16.19, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
adf.test(stock02, alternative = "stationary", k = 0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  stock02
## Dickey-Fuller = -16.189, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

The results here show what types of data would be appropriate to run a model. The Dickey-Fuller statistic shows that the more negative the number, the stronger the null hypothesis rejection, meaning there is no unit root. Similarly with any hypothesis test, the small p-value means there is strong evidence against the null hypothesis, meaning there is no unit root. Our both time series are now stationary.

Check cointegration

he Phillips-Ouliaris test shows whether there is evidence that the series are cointegrated, which justifies the use of a regression model. We will check po.test on our data.

po.test(new_df)
## 
##  Phillips-Ouliaris Cointegration Test
## 
## data:  new_df
## Phillips-Ouliaris demeaned = -269.49, Truncation lag parameter = 2,
## p-value = 0.01

The value of p is greater than 0.01 so it is cointegrated, so we can use linear regression model.

Regression Model

We will run a regression model to find out how XOP stock price will change when XLE stock price fluctuates. Here, our Dependent variable is XOP and Independent varible is XLE stock price.

stocks_lm<- lm(stock01~stock02)
summary(stocks_lm)
## 
## Call:
## lm(formula = stock01 ~ stock02)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.174313 -0.012225  0.000104  0.011833  0.183954 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.001179   0.001620  -0.728    0.467    
## stock02      1.267416   0.035516  35.686   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02598 on 256 degrees of freedom
## Multiple R-squared:  0.8326, Adjusted R-squared:  0.832 
## F-statistic:  1273 on 1 and 256 DF,  p-value: < 2.2e-16

We can say from the result that both variables have a positive relationship. When XLE stock price goes up, XOP will also increase and vice varsa. ### Check the residuals

lm_resid<-stocks_lm$residuals
acf(lm_resid)

pacf(lm_resid)

From the Acf and Pacf plots, we can say that many Residual points are significant, but this isnt good ACF for Residuals. The residuals seems to have an auto correlation, which suggest this is not a good fit of the model.

Building ARIMA Model

#Convert nonstationary to stationary time series to run ARIMA Model
stock1diff<-diff(stock1ts)
stock2diff<-diff(stock2ts)

Create train and test dataset We will split our data into two parts train and test dataset. We build model on train dataset and make prediction on test dataset.

#Create train and test dataset for XLE
train_st1<-window(stock1diff, start = c(2015,33), end=c(2020,20))
test_st1<-window(stock1diff, start = c(2020, 21))

#Create train and test dataset for XLE
train_st2<-window(stock2diff, start = c(2015,33), end=c(2020,20))
test_st2<-window(stock2diff, start = c(2020, 21))

Building Model and Forecasting: Autoregresive Integrating Moving Average

# Auto Arima Model for XOP
model1<-auto.arima(train_st1)
#Finding the best ARIMA model for XOP stock close price
azfinal.aic <- Inf
azfinal.order <- c(0,0,0)
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
   azcurrent.aic <- AIC(arima(train_st1, order=c(p, d, q)))
    if (azcurrent.aic < azfinal.aic) {
         azfinal.aic <- azcurrent.aic
       azfinal.order <- c(p, d, q)
         model2 <- arima(train_st1, order=azfinal.order)
       }
  }

#Order of th model
azfinal.order
## [1] 1 1 1
#Make prediction
pred <- predict(model1, n.ahead = 11)

First, I built auto.arima model1 and we got AIC 1611.532 then I tried a for loop function to find best fitting model where I used AR or MA process order from 1 to 4. In this case, I found best order 1,1,1 and AIC 1610.534. We move with model2 for furthur analysis because Aic is lower in model2 than model1.

Model Performance

#We can plot the residuals of the fitted model to see if we have evidence of discrete white noise:
checkresiduals(model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 38.242, df = 47, p-value = 0.815
## 
## Model df: 2.   Total lags used: 49

The run sequence plot shows that the residuals do not violate the assumption of constant location and scale. It also shows that most of the residuals are in the range (-10, 10). The ACF plot of the residuals from the ARIMA(1,1,1) model shows that all autocorrelations are within the threshold limits, indicating that the residuals are not autocorrelated at lag 1 all sample autocorrelations except Lag 15 fall inside the 95 % confidence bounds indicating the residuals appear to be random. The histogram plot indicate that the normal distribution provides an adequate fit for this model.

Ljung-Box test

#Next step is to run the Ljung-Box test and confirm that we have a good model fit:
Box.test(resid(model2), lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  resid(model2)
## X-squared = 16.332, df = 20, p-value = 0.6958

In the Ljung-Box test result, our null hypothesis is that our model does not show lack of fit and residuals are independently distributed. And our Ha is the residuals are not independently distributed; they exhibit serial correlation. We got (p-value = 0.080), indicating that the residuals are random and that the model provides an adequate fit to the data.

Forecasting for the next 8 weeks (2months)

fc1<-forecast(model2, h=19)
autoplot(forecast(fc1))

In the above plot, ws we can see, we have a blue line that represents the mean of our prediction. With the blue line explained we can see a darker and light darker areas, representing 80% and 95% confidence intervals respectively in lower and upper scenarios. We can say the XOP price is going to constant in coming 8 weeks.

Let’s carry out the same procedure for the XLE Stock closing price

# Auto.Arima Model for XLE
model01<-auto.arima(train_st2)

#Finding the best ARIMA model for 
azfinal.aic <- Inf
azfinal.order <- c(0,0,0)
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
  azcurrent.aic <- AIC(arima(train_st2, order=c(p, d, q)))
  if (azcurrent.aic < azfinal.aic) {
    azfinal.aic <- azcurrent.aic
    azfinal.order <- c(p, d, q)
    model02 <- arima(train_st2, order=azfinal.order)
  }
}


#Order of th model
azfinal.order
## [1] 2 0 2
#Make prediction
pred2 <- predict(model02, n.ahead = 11)

First, we built model01 with auto.arima function and we got AIC 1101.568 then I tried a for loop function to find best fitting model where I used AR or MA process order from 1 to 4. In this case, we I found best order 2,0,2 and AIC 1101.356. We move with model2 for furthur analysis because Aic is lower in model02 than model01.

#We can plot the residuals of the fitted model to see if we have evidence of discrete white noise:
checkresiduals(model02)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2) with non-zero mean
## Q* = 28.23, df = 44, p-value = 0.9689
## 
## Model df: 5.   Total lags used: 49

The ACF plot of the residuals from the ARIMA(2,0,2) model shows that all autocorrelations are within the threshold limits, indicating that the residuals are behaving like white noise. A portmanteau test returns a large p-value, also suggesting that the residuals are white noise.

#Next step is to run the Ljung-Box test and confirm that we have a good model fit:
Box.test(resid(model02), lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  resid(model02)
## X-squared = 14.291, df = 20, p-value = 0.8155

We got (p-value = 0.080), indicating that the residuals are random and that the model provides an adequate fit to the data.

Forecasting XLE stock close price for the next 8 weeks (2months)

#Let's now plot a forecast for the next 10 days of the XOP daily log returns:
fc2<-forecast(model02, h=19)
autoplot(forecast(fc2))

In the above plot, ws we can see, we have a blue line that represents the mean of our prediction. With the blue line explained we can see a darker and light darker areas, representing 80% and 95% confidence intervals respectively in lower and upper scenarios. We can see the XLE stock price is is doing to decrease in coming 8 weeks.

Results

creates a binary comparison, and computes an accuracy percentage for both Stocks

#Create actual stocks value to check accuracy

#Stock-1 XOP
forecasted<-pred$pred
realprice<-test_st1
comparison<-as.data.frame(cbind(realprice,forecasted))
comparison$accuracy<-sign(comparison$realprice)==sign(comparison$forecasted)
accuracy_percentage<-sum(comparison$accuracy==1)*100/length(comparison$accuracy)

print(comparison)
##    realprice   forecasted accuracy
## 1  -3.020001  1.122839866    FALSE
## 2   5.099999 -0.820373945    FALSE
## 3   0.049999 -0.017990578    FALSE
## 4   4.220001 -0.097384339    FALSE
## 5  -1.099998 -0.029304758     TRUE
## 6   1.750000  0.177448418     TRUE
## 7  -5.400002 -0.011920994     TRUE
## 8  -0.290001 -0.033844132     TRUE
## 9  -4.279999 -0.007759604     TRUE
## 10  4.079998 -0.277990501    FALSE
## 11  1.370003  0.001979706     TRUE
print(accuracy_percentage)
## [1] 54.54545
#Stock-2 XLE
forecasted2<-pred2$pred
realprice2<-test_st2
comparison2<-as.data.frame(cbind(realprice2,forecasted2))
comparison2$accuracy2<-sign(comparison2$realprice2)==sign(comparison2$forecasted2)
accuracy_percentage2<-sum(comparison2$accuracy2==1)*100/length(comparison2$accuracy2)

print(comparison2)
##    realprice2 forecasted2 accuracy2
## 1   -0.959999   0.6732310     FALSE
## 2    2.590000   0.1162043      TRUE
## 3    0.219998  -0.8075724     FALSE
## 4    2.800003  -0.4693488     FALSE
## 5   -2.900002   0.4176798     FALSE
## 6    1.340000   0.2691783      TRUE
## 7   -2.360000  -0.5518277      TRUE
## 8   -0.480000  -0.5618760      TRUE
## 9   -2.900001   0.1723022     FALSE
## 10   2.770000   0.3093302      TRUE
## 11   0.510002  -0.3248756     FALSE
print(accuracy_percentage2)
## [1] 45.45455

The ARIMA model is 28% accurate when making a forecast of an increase or decrease in the return of stock. The ARIMA model is 55% accurate when making a forecast of an increase or decrease in the return of stock.

Conclusion

In this study we focused in the application of different models, learning how to use them with the objective to forecast new price values. As we can see from our results, the models performed with similar future tendency predictions. ARIMA model predicting 27% accurate for XOP and 55% accurate for XLE stock. We tried regression model as well but after checking the cointegration we found that the regression model is not a good fit so we moved to another method. Other very relevant point we have not mentioned is that Auto Regressive models, as they base on the past data to predict future values, tend to have an asymptotic prediction in long period future forecasts. Finally, the model built here is not strong enough to be used for investment strategies, but it can accurately tell us whether the stock value will increase or decrease in short-term intervals more than half of the time. I will try to explore other models like SARIMA and Garch in the future study.