Introduction
Crude oil production by the Organization of the Petroleum Exporting Countries (OPEC) is an important factor that affects oil prices. This organization seeks to actively manage oil production in its member countries by setting production targets. Historically, crude oil prices have seen increases in times when OPEC production targets are reduced.OPEC member countries produce about 40 percent of the world’s crude oil. Equally important to global prices, OPEC’s oil exports represent about 60 percent of the total petroleum traded internationally. Because of this market share, OPEC’s actions can, and do, influence international oil prices.
Data
The given dataset is an excel sheet containing the monthly U.S. Landed Costs of OPEC Countries Crude Oil per Barrel in dollars from October 1973 to May 2019. I intend to analyze this data, the historical trends shown and make predictions about the future prices of oil. The given data is a time series. A time series is a series of data points indexed (or listed or graphed) in time order. Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for to make better analysis or predictions.
R Code
We begin by loading the given dataset and reading the data. While reading the data, we skip the first 4 lines which contain irrelevant information. We then specify the respective data type in the given columns.
#setting working directory
setwd("C:/Users/vidul/Documents")
#reading file
library(readxl)
NDataset <- read_excel("OPEC COUNTRIES CRUDE OIL PRICES.xlsx",
col_types = c("date", "numeric"), skip = 4)We now take a look at the data to make sure it has been read correctly. We then sequence the data according to the dates in ascending order. We change names of the columns to something more simple.
Generating time series
Next we store the prices in a new variable and generate a time series of the prices starting from October 1973 to May 2019 with the frequency 12, since it has monthly data.
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1973 5.89
## 1974 11.94 13.10 13.04 13.01 12.93 12.84 12.33 12.36 11.97 12.10
## 1975 12.30 12.43 12.41 12.32 12.53 12.58 12.46 12.54 12.58 13.19
## 1976 13.27 13.25 13.29 13.26 13.20 13.35 13.28 13.38 13.32 13.32
## 1977 14.05 14.27 14.33 14.33 14.41 14.28 14.38 14.34 14.46 14.43
## 1978 14.37 14.33 14.33 14.29 14.24 14.24 14.23 14.22 14.30 14.35
## 1979 15.29 15.55 16.17 17.73 18.75 21.13 22.71 23.26 24.48 24.48
## 1980 30.74 32.27 32.17 32.91 33.60 34.04 34.24 34.63 34.55 34.32
## 1981 37.83 38.07 37.98 37.37 36.90 36.78 35.03 35.19 35.38 35.41
## 1982 35.74 35.52 35.14 34.51 34.38 35.19 35.01 34.16 34.08 34.69
## 1983 33.11 30.42 28.88 29.42 29.78 30.12 29.98 29.95 29.62 29.80
## 1984 29.34 29.40 29.60 29.37 29.42 29.44 28.68 28.94 28.95 28.46
## 1985 27.60 27.68 27.60 27.95 27.50 26.63 26.87 26.44 26.61 26.90
## 1986 22.68 15.40 13.67 12.97 11.98 11.70 11.14 11.54 12.60 13.15
## 1987 16.02 16.95 17.25 17.69 17.82 18.28 18.53 18.73 18.14 17.41
## 1988 14.61 14.59 14.74 15.27 15.50 15.00 14.25 13.69 12.92 12.74
## 1989 15.98 16.74 17.80 19.23 18.15 17.45 17.13 16.86 17.29 17.97
## 1990 19.77 18.98 17.68 15.83 15.15 15.53 19.01 26.31 30.27 31.08
## 1991 20.16 17.43 17.88 18.17 17.98 17.32 17.96 18.40 18.70 19.03
## 1992 15.38 15.87 16.29 18.07 18.65 19.57 19.06 18.70 18.83 18.56
## 1993 16.67 17.44 17.84 17.71 17.22 16.06 15.32 15.23 14.85 14.70
## 1994 12.94 12.59 13.05 14.47 15.62 16.48 16.88 15.69 15.25 15.51
## 1995 16.15 16.53 16.86 18.33 17.93 16.64 15.73 16.02 16.22 15.60
## 1996 17.41 18.06 19.81 20.26 19.17 18.64 19.15 20.16 21.66 22.78
## 1997 21.49 19.19 18.05 17.46 17.58 17.01 17.12 17.80 17.86 17.79
## 1998 13.89 12.93 12.45 12.04 11.75 11.07 11.06 11.06 12.07 11.34
## 1999 9.75 10.72 13.22 14.89 15.40 16.32 18.09 19.69 21.28 21.67
## 2000 25.86 26.61 26.23 24.97 26.84 28.06 27.96 29.00 30.13 29.06
## 2001 24.08 23.90 23.21 23.26 23.67 23.26 22.43 22.70 21.06 17.58
## 2002 17.57 19.68 22.79 24.03 24.11 23.98 25.06 25.94 26.37 24.73
## 2003 30.79 30.73 28.24 24.86 25.30 27.38 27.58 27.70 25.99 27.76
## 2004 31.23 31.86 32.92 33.69 35.70 35.21 37.85 40.65 42.83 44.21
## 2005 40.48 43.31 47.58 47.19 46.61 52.96 55.93 61.10 60.84 55.75
## 2006 58.10 56.72 60.38 65.76 66.09 67.16 69.21 65.49 57.86 54.98
## 2007 52.81 56.06 59.60 63.73 66.38 69.58 73.63 71.73 77.37 83.58
## 2008 89.66 94.71 103.78 112.11 122.98 128.10 124.20 109.56 89.55 64.33
## 2009 40.26 42.75 48.55 54.22 60.06 66.63 66.27 70.00 70.02 73.71
## 2010 75.94 76.32 78.40 80.00 73.98 74.55 74.84 75.46 76.39 80.52
## 2011 94.00 100.19 109.26 116.57 111.75 109.87 111.61 106.27 107.67 107.95
## 2012 110.32 115.76 117.99 116.10 108.26 97.29 99.49 105.27 107.02 105.81
## 2013 106.84 108.86 107.50 101.76 101.63 101.21 103.96 104.91 104.10 99.53
## 2014 98.30 100.41 100.36 101.81 101.54 102.39 100.17 97.19 91.07 82.50
## 2015 48.17 51.44 51.13 55.39 59.11 56.79 50.45 43.17 43.31 41.57
## 2016 28.94 29.63 34.02 36.80 42.48 44.70 41.78 42.46 42.62 45.65
## 2017 49.22 50.48 48.91 48.47 47.36 45.77 46.91 49.55 53.53 55.71
## 2018 64.78 62.93 63.53 66.95 71.50 70.65 70.54 70.48 72.45 72.19
## 2019 57.46 62.76 65.30 70.04 69.82
## Nov Dec
## 1973 7.07 8.01
## 1974 12.16 12.34
## 1975 13.42 13.29
## 1976 13.39 13.46
## 1977 14.44 14.48
## 1978 14.48 14.63
## 1979 27.64 28.89
## 1980 35.42 35.65
## 1981 36.16 36.26
## 1982 34.46 34.16
## 1983 29.78 29.00
## 1984 28.42 27.95
## 1985 26.82 24.74
## 1986 13.72 15.01
## 1987 17.54 16.05
## 1988 12.87 14.67
## 1989 18.27 19.93
## 1990 27.77 23.26
## 1991 17.95 15.94
## 1992 17.28 16.62
## 1993 13.34 12.05
## 1994 15.63 15.34
## 1995 16.30 16.91
## 1996 22.15 22.22
## 1997 16.63 15.01
## 1998 9.73 8.87
## 1999 22.76 23.65
## 2000 27.86 24.82
## 2001 16.12 16.02
## 2002 24.53 28.07
## 2003 28.36 29.84
## 2004 39.15 37.18
## 2005 53.00 54.76
## 2006 54.77 56.21
## 2007 88.53 88.30
## 2008 47.34 38.36
## 2009 75.18 75.01
## 2010 84.38 89.30
## 2011 110.10 109.63
## 2012 102.26 103.38
## 2013 96.32 98.02
## 2014 73.17 58.35
## 2015 37.86 32.60
## 2016 44.98 49.07
## 2017 59.83 62.13
## 2018 63.30 57.11
## 2019
Generating Plot
We plot the time series of the monthly oil prices with dates on the X-axis and prices on the Y-axis. geom_line() connects all the observations together in order of the variable on the x axis. We use geom_smooth() to add a smoothing line in order to see what the trends look like more easily. It adds confidence bands on the smoothing line as well.
#loading ggplot2
library(ggplot2)
#creating plot
p <- qplot(data=NDataset,x=Date, y=Prices)+
geom_line()+
geom_smooth()
p## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
To make better interpretations from the plot, we decompose the series into different components.
Decomposition
The decomposition of time series is a statistical task that deconstructs a time series into several components, each representing one of the underlying categories of patterns. The decomposition model used here reduces the observed time series into 3 components: trend, seasonal effects, and random errors.The decomposition is additive, that is, we can add the various components to get the original series.
We can observe that there exists a seasonal component in the time series, but when we take a look at the y-axis of the seasonal component, we see that the magnitude is negligible compared to the other components. So, we can conclude that the time series has no seasonal component or removing it would not cause any noticeable changes in the time series.
From the trend and random component graphs we can make the following observations:- Since this is a historical data, we now know the reason for the price fluctuations.
- The sudden decline of crude oil prices in 2008.
The 2008 financial crisis and the Great Recession that followed induced a bear market and had a pronounced negative impact on the oil and gas sector as it led to a steep decline in the prices.The recession led to a general drop in asset prices around the world as credit contracted and earnings projections fell.At the same time, rising unemployment and lower spending led to less demand for oil by both consumers and businesses.
- Decline in oil prices in 2015
The initial drop in oil prices from mid-2014 to early 2015 was primarily driven by supply factors, including booming U.S. oil production, receding geopolitical concerns, and shifting OPEC policies. However, deteriorating demand prospects played a role as well, particularly from mid-2015 to early 2016. This partly explains why the oil price plunge failed to provide a subsequent boost to global activity.
The Augmented Dickey Fuller (ADF) Test
Augmented Dickey Fuller test (ADF Test) is a common statistical test used to determine whether a given Time series is stationary or not.
The hypotheses for the test:
The null hypothesis for this test is that there is a unit root. The alternate hypothesis differs slightly according to which equation you’re using. The basic alternate is that the time series is stationary (or trend-stationary).
The data is inspected by the function automatically to figure out an appropriate regression model. For example, a nonzero mean indicates the regression will have a constant term. The three basic regression models are:
No constant, no trend: Δyt = γyt-1 + vt Constant, no trend: Δyt = α + γyt-1 + vt Constant and trend: Δyt = α + γyt-1 + λt + vt
The ADF function adds lagged differences to these models automatically as well.
##
## Attaching package: 'aTSA'
## The following object is masked from 'package:graphics':
##
## identify
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -0.032 0.635
## [2,] 1 -1.313 0.209
## [3,] 2 -0.990 0.325
## [4,] 3 -0.777 0.401
## [5,] 4 -0.724 0.420
## [6,] 5 -0.670 0.439
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -1.22 0.6235
## [2,] 1 -2.67 0.0831
## [3,] 2 -2.23 0.2343
## [4,] 3 -1.89 0.3720
## [5,] 4 -1.86 0.3831
## [6,] 5 -1.81 0.4034
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -1.71 0.7012
## [2,] 1 -3.76 0.0207
## [3,] 2 -3.13 0.0992
## [4,] 3 -2.75 0.2579
## [5,] 4 -2.67 0.2957
## [6,] 5 -2.58 0.3318
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
Conclusion:- ADF test gives us a p value much higher than 0.01 and 0.05, hence we can say that our series in non-stationary. Also, if we compare the critical values for our time series (with trend) we can again clearly say that the series is non- stationary.
Phillips Perron unit root test (PP Test)
## Phillips-Perron Unit Root Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag Z_rho p.value
## 6 -1.63 0.436
## -----
## Type 2: with drift no trend
## lag Z_rho p.value
## 6 -7.27 0.33
## -----
## Type 3: with drift and trend
## lag Z_rho p.value
## 6 -14.1 0.28
## ---------------
## Note: p-value = 0.01 means p.value <= 0.01
Since the p value given by the PP test is also much higher than 0.01 and 0.05, we can surely conclude that our time series is non-stationary.
Forecasting using ARIMA
An ARIMA model is a class of statistical models for analyzing and forecasting time series data.
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average and adds the notion of integration. This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are: AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations. I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary. MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
Each of these components are explicitly specified in the model as a parameter. A standard notation is used of ARIMA(p,d,q) where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used.
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:aTSA':
##
## forecast
## Series: prices_ts
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 1.3864 -0.5855 -0.6524
## s.e. 0.0964 0.0579 0.1102
##
## sigma^2 estimated as 5.039: log likelihood=-1217.29
## AIC=2442.58 AICc=2442.66 BIC=2459.8
The parameters of the ARIMA model are defined as follows:
p: The number of lag observations included in the model, also called the lag order. d: The number of times that the raw observations are differenced, also called the degree of differencing. q: The size of the moving average window, also called the order of moving average.
The auto arima function automatically calculates the required values of p,d and q. And thus, generates the ARIMA model by itself without the need of any additional calculations by us.
##
## Call:
## arima(x = prices_ts, order = c(2, 1, 1))
##
## Coefficients:
## ar1 ar2 ma1
## 1.3864 -0.5855 -0.6524
## s.e. 0.0964 0.0579 0.1102
##
## sigma^2 estimated as 5.011: log likelihood = -1217.29, aic = 2442.58
We now predict the prices for the next 12 months with a 95% confidence level.
## Point Forecast Lo 95 Hi 95
## Jun 2019 68.94280 64.55541 73.33019
## Jul 2019 67.85545 59.07316 76.63773
## Aug 2019 66.86151 53.92112 79.80191
## Sep 2019 66.12015 49.61381 82.62649
## Oct 2019 65.67425 46.28770 85.06079
## Nov 2019 65.49010 43.84712 87.13308
## Dec 2019 65.49587 42.08982 88.90192
## Jan 2020 65.61168 40.79510 90.42827
## Feb 2020 65.76887 39.77284 91.76490
## Mar 2020 65.91899 38.88221 92.95577
## Apr 2020 66.03509 38.03206 94.03812
## May 2020 66.10815 37.17280 95.04350
We plot the all the prices now, including the forecasted price:-
According to this, the prices drop a little but seem to remain stable in the future. We now perform some tests to check the validity of our predictions.
ACF(Auto Correlation Function) plot of residuals:-
ACF gives us values of auto-correlation of any series with its lagged values.We plot these values along with the confidence band.In simple terms, it describes how well the present value of the series is related with its past values.
The autocorrelation plot of the residuals shows the plots at a very few lags apart from the lag at 0 touch or crosses the boundary. We can clearly see correlation at one lag crossing the boundary, so autocorrelation may be present but with very less significance.
Box-Ljung Test:
The Box-Ljung test is a diagnostic tool used to test the lack of fit of a time series model The test is applied to the residuals of a time series after fitting an ARIMA model to the data. The null hypothesis of the Box Ljung Test, is that our model does not show lack of fit (or in simple terms—the model is just fine). The alternate hypothesis is that the model does show a lack of fit.
##
## Box-Ljung test
##
## data: prices_tsforecasts$residuals
## X-squared = 31.519, df = 20, p-value = 0.0487
At p value of 0.05 we can say with 95% confidence that autocorrelation is present. But we get the p value to be 0.0487. Thus we are almost on the boundary, concluding that there is no autocorrelation at the 0.05 significance level. But, it is still very close to 0.05,therefore it is a good model . But if we are still sceptical about whether or not we should reject the null hypothesis, we can perform the following tasks:-
The plot of the forecast residuals in the sample indicates that the variance of the forecast errors tends to be approximately constant over time. But there may some change in variance in the later part of the plot.
Histogram :-
We generate a histogram of the residuals.
The histogram indicates that the mean tends to be fairly near zero. It is therefore valid to conclude that that the errors in the forecast are normally distributed with mean zero and constant variance. The forecast can be seen to be fairly acceptable and accurate neglecting the occurrence of unforeseeable circumstances (like COVID’19).