PROJECT-8

DATA ANALYSIS OF OPEC COUNTRIES CRUDE OIL PRICES

VIDULA RAJAIN - 2017B4A11582H

Introduction

Crude oil production by the Organization of the Petroleum Exporting Countries (OPEC) is an important factor that affects oil prices. The organization seeks to actively manage oil production in its member countries by setting production targets. Historically, crude oil prices have risen in times when OPEC production targets are reduced. OPEC member countries produce about 40 percent of the world’s crude oil. Equally important to global prices, OPEC’s oil exports represent about 60 percent of the total petroleum traded internationally. Because of this market share, OPEC’s actions can, and do, influence international oil prices.

Data

The given dataset is an Excel sheet containing the monthly U.S. landed costs of OPEC countries’ crude oil, in dollars per barrel, from October 1973 to May 2019. I intend to analyze this data, examine the historical trends it shows, and make predictions about future oil prices. The given data is a time series: a series of data points indexed (or listed, or graphed) in time order. Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for to produce better analyses or predictions.

R Code

We begin by loading the given dataset and reading the data. While reading the data, we skip the first 4 lines, which contain irrelevant information, and specify the data type of each column.

#setting working directory
setwd("C:/Users/vidul/Documents")

#reading file

library(readxl)
NDataset <- read_excel("OPEC COUNTRIES CRUDE OIL PRICES.xlsx", 
                       col_types = c("date", "numeric"), skip = 4)

We now take a look at the data to make sure it has been read correctly. We then reorder the rows so that the dates are in ascending order, and rename the columns to something simpler.

#inspecting the data
View(NDataset)

#reversing the row order so that dates run in ascending order
NDataset <- NDataset[seq(dim(NDataset)[1], 1), ]

#renaming the columns
names(NDataset) <- c("Date","Prices")
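An equivalent and arguably more readable way to get ascending dates is to sort explicitly by the Date column; this is a sketch that assumes the dplyr package is installed:

#sorting by date explicitly (assumes dplyr is available)
library(dplyr)
NDataset <- arrange(NDataset, Date)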

Generating time series

Next we store the prices in a new variable and generate a time series of the prices starting from October 1973 and running to May 2019, with frequency 12, since the data is monthly.

prices <- NDataset$Prices
prices_ts <- ts(prices, frequency=12, start=c(1973,10))
prices_ts
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1973                                                                  5.89
## 1974  11.94  13.10  13.04  13.01  12.93  12.84  12.33  12.36  11.97  12.10
## 1975  12.30  12.43  12.41  12.32  12.53  12.58  12.46  12.54  12.58  13.19
## 1976  13.27  13.25  13.29  13.26  13.20  13.35  13.28  13.38  13.32  13.32
## 1977  14.05  14.27  14.33  14.33  14.41  14.28  14.38  14.34  14.46  14.43
## 1978  14.37  14.33  14.33  14.29  14.24  14.24  14.23  14.22  14.30  14.35
## 1979  15.29  15.55  16.17  17.73  18.75  21.13  22.71  23.26  24.48  24.48
## 1980  30.74  32.27  32.17  32.91  33.60  34.04  34.24  34.63  34.55  34.32
## 1981  37.83  38.07  37.98  37.37  36.90  36.78  35.03  35.19  35.38  35.41
## 1982  35.74  35.52  35.14  34.51  34.38  35.19  35.01  34.16  34.08  34.69
## 1983  33.11  30.42  28.88  29.42  29.78  30.12  29.98  29.95  29.62  29.80
## 1984  29.34  29.40  29.60  29.37  29.42  29.44  28.68  28.94  28.95  28.46
## 1985  27.60  27.68  27.60  27.95  27.50  26.63  26.87  26.44  26.61  26.90
## 1986  22.68  15.40  13.67  12.97  11.98  11.70  11.14  11.54  12.60  13.15
## 1987  16.02  16.95  17.25  17.69  17.82  18.28  18.53  18.73  18.14  17.41
## 1988  14.61  14.59  14.74  15.27  15.50  15.00  14.25  13.69  12.92  12.74
## 1989  15.98  16.74  17.80  19.23  18.15  17.45  17.13  16.86  17.29  17.97
## 1990  19.77  18.98  17.68  15.83  15.15  15.53  19.01  26.31  30.27  31.08
## 1991  20.16  17.43  17.88  18.17  17.98  17.32  17.96  18.40  18.70  19.03
## 1992  15.38  15.87  16.29  18.07  18.65  19.57  19.06  18.70  18.83  18.56
## 1993  16.67  17.44  17.84  17.71  17.22  16.06  15.32  15.23  14.85  14.70
## 1994  12.94  12.59  13.05  14.47  15.62  16.48  16.88  15.69  15.25  15.51
## 1995  16.15  16.53  16.86  18.33  17.93  16.64  15.73  16.02  16.22  15.60
## 1996  17.41  18.06  19.81  20.26  19.17  18.64  19.15  20.16  21.66  22.78
## 1997  21.49  19.19  18.05  17.46  17.58  17.01  17.12  17.80  17.86  17.79
## 1998  13.89  12.93  12.45  12.04  11.75  11.07  11.06  11.06  12.07  11.34
## 1999   9.75  10.72  13.22  14.89  15.40  16.32  18.09  19.69  21.28  21.67
## 2000  25.86  26.61  26.23  24.97  26.84  28.06  27.96  29.00  30.13  29.06
## 2001  24.08  23.90  23.21  23.26  23.67  23.26  22.43  22.70  21.06  17.58
## 2002  17.57  19.68  22.79  24.03  24.11  23.98  25.06  25.94  26.37  24.73
## 2003  30.79  30.73  28.24  24.86  25.30  27.38  27.58  27.70  25.99  27.76
## 2004  31.23  31.86  32.92  33.69  35.70  35.21  37.85  40.65  42.83  44.21
## 2005  40.48  43.31  47.58  47.19  46.61  52.96  55.93  61.10  60.84  55.75
## 2006  58.10  56.72  60.38  65.76  66.09  67.16  69.21  65.49  57.86  54.98
## 2007  52.81  56.06  59.60  63.73  66.38  69.58  73.63  71.73  77.37  83.58
## 2008  89.66  94.71 103.78 112.11 122.98 128.10 124.20 109.56  89.55  64.33
## 2009  40.26  42.75  48.55  54.22  60.06  66.63  66.27  70.00  70.02  73.71
## 2010  75.94  76.32  78.40  80.00  73.98  74.55  74.84  75.46  76.39  80.52
## 2011  94.00 100.19 109.26 116.57 111.75 109.87 111.61 106.27 107.67 107.95
## 2012 110.32 115.76 117.99 116.10 108.26  97.29  99.49 105.27 107.02 105.81
## 2013 106.84 108.86 107.50 101.76 101.63 101.21 103.96 104.91 104.10  99.53
## 2014  98.30 100.41 100.36 101.81 101.54 102.39 100.17  97.19  91.07  82.50
## 2015  48.17  51.44  51.13  55.39  59.11  56.79  50.45  43.17  43.31  41.57
## 2016  28.94  29.63  34.02  36.80  42.48  44.70  41.78  42.46  42.62  45.65
## 2017  49.22  50.48  48.91  48.47  47.36  45.77  46.91  49.55  53.53  55.71
## 2018  64.78  62.93  63.53  66.95  71.50  70.65  70.54  70.48  72.45  72.19
## 2019  57.46  62.76  65.30  70.04  69.82                                   
##         Nov    Dec
## 1973   7.07   8.01
## 1974  12.16  12.34
## 1975  13.42  13.29
## 1976  13.39  13.46
## 1977  14.44  14.48
## 1978  14.48  14.63
## 1979  27.64  28.89
## 1980  35.42  35.65
## 1981  36.16  36.26
## 1982  34.46  34.16
## 1983  29.78  29.00
## 1984  28.42  27.95
## 1985  26.82  24.74
## 1986  13.72  15.01
## 1987  17.54  16.05
## 1988  12.87  14.67
## 1989  18.27  19.93
## 1990  27.77  23.26
## 1991  17.95  15.94
## 1992  17.28  16.62
## 1993  13.34  12.05
## 1994  15.63  15.34
## 1995  16.30  16.91
## 1996  22.15  22.22
## 1997  16.63  15.01
## 1998   9.73   8.87
## 1999  22.76  23.65
## 2000  27.86  24.82
## 2001  16.12  16.02
## 2002  24.53  28.07
## 2003  28.36  29.84
## 2004  39.15  37.18
## 2005  53.00  54.76
## 2006  54.77  56.21
## 2007  88.53  88.30
## 2008  47.34  38.36
## 2009  75.18  75.01
## 2010  84.38  89.30
## 2011 110.10 109.63
## 2012 102.26 103.38
## 2013  96.32  98.02
## 2014  73.17  58.35
## 2015  37.86  32.60
## 2016  44.98  49.07
## 2017  59.83  62.13
## 2018  63.30  57.11
## 2019
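As a quick sanity check (a sketch using base R accessors), we can confirm that the series starts in October 1973, ends in May 2019, and has 12 observations per year:

#start, end and frequency of the time series
start(prices_ts)     #should be 1973 10
end(prices_ts)       #should be 2019 5
frequency(prices_ts) #should be 12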

Generating Plot

We plot the time series of the monthly oil prices with dates on the X-axis and prices on the Y-axis. geom_line() connects the observations in order of the variable on the x-axis. We use geom_smooth() to add a smoothing line, which makes the trends easier to see; it adds a confidence band around the smoothing line as well.

#loading ggplot2
library(ggplot2)

#creating plot
p <- qplot(data=NDataset,x=Date, y=Prices)+
  geom_line()+
  geom_smooth()
p
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
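In more recent versions of ggplot2, qplot() has been deprecated; an equivalent call using ggplot() directly (a sketch) would be:

#equivalent plot built with ggplot() instead of qplot()
p <- ggplot(NDataset, aes(x = Date, y = Prices)) +
  geom_point() +
  geom_line() +
  geom_smooth() #loess smoother with a confidence band by default
p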

To make better interpretations from the plot, we decompose the series into different components.

Decomposition

The decomposition of time series is a statistical task that deconstructs a time series into several components, each representing one of the underlying categories of patterns. The decomposition model used here reduces the observed time series into 3 components: trend, seasonal effects, and random errors. The decomposition is additive; that is, we can add the various components to recover the original series.

prices_tscomponents <- decompose(prices_ts)
plot(prices_tscomponents)

We can observe that there exists a seasonal component in the time series, but when we look at the y-axis of the seasonal component, we see that its magnitude is negligible compared to the other components. So we can conclude that the time series has essentially no seasonal component, or that removing it would not cause any noticeable change in the series.
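We can back this claim numerically by comparing the range of the seasonal component with the range of the trend component; a quick sketch:

#the seasonal effects should span a much smaller range than the trend
range(prices_tscomponents$seasonal)
range(prices_tscomponents$trend, na.rm = TRUE) #trend has NAs at the ends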

From the trend and random component graphs we can make the following observations. Since this is historical data, we know the reasons behind the major price fluctuations:

  1. The sudden decline of crude oil prices in 2008.

The 2008 financial crisis and the Great Recession that followed induced a bear market and had a pronounced negative impact on the oil and gas sector, leading to a steep decline in prices. The recession led to a general drop in asset prices around the world as credit contracted and earnings projections fell. At the same time, rising unemployment and lower spending led to less demand for oil from both consumers and businesses.

  2. Decline in oil prices in 2015

The initial drop in oil prices from mid-2014 to early 2015 was primarily driven by supply factors, including booming U.S. oil production, receding geopolitical concerns, and shifting OPEC policies. However, deteriorating demand prospects played a role as well, particularly from mid-2015 to early 2016. This partly explains why the oil price plunge failed to provide a subsequent boost to global activity.

The Augmented Dickey Fuller (ADF) Test

The Augmented Dickey-Fuller test (ADF test) is a common statistical test used to determine whether a given time series is stationary or not.

The hypotheses for the test:

The null hypothesis for this test is that there is a unit root. The alternative hypothesis differs slightly according to which equation is used; the basic alternative is that the time series is stationary (or trend-stationary).

The function inspects the data automatically to determine an appropriate regression model. For example, a nonzero mean indicates that the regression will have a constant term. The three basic regression models are:

No constant, no trend: Δyₜ = γyₜ₋₁ + vₜ

Constant, no trend: Δyₜ = α + γyₜ₋₁ + vₜ

Constant and trend: Δyₜ = α + γyₜ₋₁ + λt + vₜ

The ADF function adds lagged differences to these models automatically as well.

library(aTSA)
## 
## Attaching package: 'aTSA'
## The following object is masked from 'package:graphics':
## 
##     identify
adf.test(prices_ts) 
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0 -0.032   0.635
## [2,]   1 -1.313   0.209
## [3,]   2 -0.990   0.325
## [4,]   3 -0.777   0.401
## [5,]   4 -0.724   0.420
## [6,]   5 -0.670   0.439
## Type 2: with drift no trend 
##      lag   ADF p.value
## [1,]   0 -1.22  0.6235
## [2,]   1 -2.67  0.0831
## [3,]   2 -2.23  0.2343
## [4,]   3 -1.89  0.3720
## [5,]   4 -1.86  0.3831
## [6,]   5 -1.81  0.4034
## Type 3: with drift and trend 
##      lag   ADF p.value
## [1,]   0 -1.71  0.7012
## [2,]   1 -3.76  0.0207
## [3,]   2 -3.13  0.0992
## [4,]   3 -2.75  0.2579
## [5,]   4 -2.67  0.2957
## [6,]   5 -2.58  0.3318
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

Conclusion: The ADF test gives p-values well above 0.01 and 0.05 for almost every lag and specification, so we fail to reject the null hypothesis of a unit root and conclude that our series is non-stationary. Comparing the results across the three model types (including the one with drift and trend) leads to the same conclusion.
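As a follow-up sketch, we can difference the series once and re-run the test; if the differenced series comes out stationary, that is consistent with the degree of differencing d = 1 selected by the ARIMA model later:

#ADF test on the first differences of the monthly prices
adf.test(diff(prices_ts))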

Phillips Perron unit root test (PP Test)

Like the ADF test, the Phillips-Perron test takes a unit root as its null hypothesis, but it corrects for serial correlation and heteroskedasticity in the errors non-parametrically instead of adding lagged difference terms.

pp.test(prices_ts)
## Phillips-Perron Unit Root Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##  lag Z_rho p.value
##    6 -1.63   0.436
## ----- 
##  Type 2: with drift no trend 
##  lag Z_rho p.value
##    6 -7.27    0.33
## ----- 
##  Type 3: with drift and trend 
##  lag Z_rho p.value
##    6 -14.1    0.28
## --------------- 
## Note: p-value = 0.01 means p.value <= 0.01

Since the p-values given by the PP test are also much higher than 0.01 and 0.05, we can confidently conclude that our time series is non-stationary.

Forecasting using ARIMA

An ARIMA model is a class of statistical models for analyzing and forecasting time series data.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average model and adds the notion of integration. The acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:

AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.

I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from the observation at the previous time step) in order to make the time series stationary.

MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

Each of these components is explicitly specified in the model as a parameter. A standard notation, ARIMA(p,d,q), substitutes integer values for the parameters to quickly indicate the specific ARIMA model being used.
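Before letting the software choose the full order for us, the forecast package offers ndiffs() as a quick estimate of the differencing order d required for stationarity; a sketch (the expected result of 1 is an assumption, consistent with the model fitted below):

#estimated number of differences required for stationarity
forecast::ndiffs(prices_ts)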

#Arima
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:aTSA':
## 
##     forecast
auto.arima(prices_ts)
## Series: prices_ts 
## ARIMA(2,1,1) 
## 
## Coefficients:
##          ar1      ar2      ma1
##       1.3864  -0.5855  -0.6524
## s.e.  0.0964   0.0579   0.1102
## 
## sigma^2 estimated as 5.039:  log likelihood=-1217.29
## AIC=2442.58   AICc=2442.66   BIC=2459.8

The parameters of the ARIMA model are defined as follows:

p: The number of lag observations included in the model, also called the lag order.

d: The number of times that the raw observations are differenced, also called the degree of differencing.

q: The size of the moving average window, also called the order of moving average.

The auto.arima() function automatically calculates the required values of p, d and q, and thus generates the ARIMA model without the need for any additional calculations on our part.

prices_tsarima <- arima(prices_ts, order=c(2,1,1))
prices_tsarima 
## 
## Call:
## arima(x = prices_ts, order = c(2, 1, 1))
## 
## Coefficients:
##          ar1      ar2      ma1
##       1.3864  -0.5855  -0.6524
## s.e.  0.0964   0.0579   0.1102
## 
## sigma^2 estimated as 5.011:  log likelihood = -1217.29,  aic = 2442.58
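As a quick check that the selected order is reasonable, one could fit a neighbouring candidate model and compare AIC values (lower is better); a sketch, where the alternative order is illustrative:

#fit a simpler candidate and compare information criteria
prices_tsarima_alt <- arima(prices_ts, order = c(1, 1, 1))
AIC(prices_tsarima_alt) #compare against 2442.58 for ARIMA(2,1,1)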

We now predict the prices for the next 12 months with a 95% confidence level.

prices_tsforecasts <- forecast(prices_tsarima, h=12, level=c(95))  
prices_tsforecasts
##          Point Forecast    Lo 95    Hi 95
## Jun 2019       68.94280 64.55541 73.33019
## Jul 2019       67.85545 59.07316 76.63773
## Aug 2019       66.86151 53.92112 79.80191
## Sep 2019       66.12015 49.61381 82.62649
## Oct 2019       65.67425 46.28770 85.06079
## Nov 2019       65.49010 43.84712 87.13308
## Dec 2019       65.49587 42.08982 88.90192
## Jan 2020       65.61168 40.79510 90.42827
## Feb 2020       65.76887 39.77284 91.76490
## Mar 2020       65.91899 38.88221 92.95577
## Apr 2020       66.03509 38.03206 94.03812
## May 2020       66.10815 37.17280 95.04350

We now plot all the prices, including the forecasted values:

plot(prices_tsforecasts)

According to this, the prices drop a little but seem to remain stable in the future. We now perform some tests to check the validity of our predictions.

ACF (Autocorrelation Function) Plot of Residuals

The ACF gives us the values of the autocorrelation of a series with its lagged values. We plot these values along with a confidence band. In simple terms, it describes how well the present value of the series is related to its past values.

acf(prices_tsforecasts$residuals, lag.max=20)

The autocorrelation plot of the residuals shows that, apart from the spike at lag 0, very few lags touch or cross the significance bounds. One lag clearly crosses the boundary, so some autocorrelation may be present, but its significance is low.

Box-Ljung Test:

The Box-Ljung test is a diagnostic tool used to test the lack of fit of a time series model. The test is applied to the residuals of a time series after fitting an ARIMA model to the data. The null hypothesis of the Box-Ljung test is that our model does not show lack of fit (or, in simple terms, the model is fine). The alternative hypothesis is that the model does show a lack of fit.

Box.test(prices_tsforecasts$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  prices_tsforecasts$residuals
## X-squared = 31.519, df = 20, p-value = 0.0487

At a significance level of 0.05, a p-value below 0.05 would lead us to reject the null hypothesis and conclude that some autocorrelation is present in the residuals. We get a p-value of 0.0487, which lies almost exactly on this boundary: strictly speaking we would reject at the 5% level, but we fail to reject at the 1% level, so the model does not show any serious lack of fit and can still be considered a good model.
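One quick robustness sketch is to repeat the Ljung-Box test at several lag choices (the lag values here are illustrative) and see whether the borderline p-value persists:

#p-values of the Ljung-Box test at a few different lags
sapply(c(10, 20, 30), function(l)
  Box.test(prices_tsforecasts$residuals, lag = l, type = "Ljung-Box")$p.value)

If we are still sceptical about whether or not we should reject the null hypothesis, we can also inspect the residuals directly: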

plot.ts(prices_tsforecasts$residuals)

The plot of the forecast residuals indicates that the variance of the forecast errors is approximately constant over time, although there may be some change in variance in the later part of the plot.

Histogram

We generate a histogram of the residuals.

hist(prices_tsforecasts$residuals)
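To judge normality more directly, we can rescale the histogram to a density and overlay a normal curve using the residuals’ sample mean and standard deviation; a sketch:

#histogram on the density scale with a fitted normal curve
res <- as.numeric(prices_tsforecasts$residuals)
hist(res, freq = FALSE, main = "Histogram of residuals")
curve(dnorm(x, mean = mean(res, na.rm = TRUE), sd = sd(res, na.rm = TRUE)),
      add = TRUE)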

The histogram indicates that the mean of the residuals is fairly close to zero. It is therefore reasonable to conclude that the errors in the forecast are normally distributed with mean zero and roughly constant variance. The forecast can be considered fairly acceptable and accurate, neglecting the occurrence of unforeseeable circumstances (like COVID-19).
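Finally, the forecast package bundles these residual diagnostics (time plot, ACF and histogram) together with a Ljung-Box test in a single convenience call; a sketch:

#one-call residual diagnostics for the fitted ARIMA model
checkresiduals(prices_tsarima)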