Are WTI & Brent oil prices co-integrated?

The U.S. Energy Information Administration (under the Department of Energy) publishes a monthly short-term outlook report on energy prices and fundamental data driving those forecast. The EIA also makes available the data in Excel format for transparency. The task at hand is to load a particular sheet from this Excel workbook (already saved as a CSV file and available on my GitHub page), only import WTI and Brent price data, conduct some exploratory data analysis, and confirm if the two price series are co-integrated to perhaps build a pairs trading strategy down the line.

First step: Get the data and switch a long format:

steo <- read_csv(curl("https://raw.githubusercontent.com/sjv1030/Data607-Project2/master/STEO_m.csv"),n_max=11)
## Warning: Missing column names filled in: 'X3' [3], 'X4' [4], 'X5' [5],
## 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12],
## 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17], 'X18' [18],
## 'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22], 'X23' [23], 'X24' [24],
## 'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28], 'X29' [29], 'X30' [30],
## 'X31' [31], 'X32' [32], 'X33' [33], 'X34' [34], 'X35' [35], 'X36' [36],
## 'X37' [37], 'X38' [38], 'X39' [39], 'X40' [40], 'X41' [41], 'X42' [42],
## 'X43' [43], 'X44' [44], 'X45' [45], 'X46' [46], 'X47' [47], 'X48' [48],
## 'X49' [49], 'X50' [50], 'X51' [51], 'X52' [52], 'X53' [53], 'X54' [54],
## 'X55' [55], 'X56' [56], 'X57' [57], 'X58' [58], 'X59' [59], 'X60' [60],
## 'X61' [61], 'X62' [62], 'X63' [63], 'X64' [64], 'X65' [65], 'X66' [66],
## 'X67' [67], 'X68' [68], 'X69' [69], 'X70' [70], 'X71' [71], 'X72' [72],
## 'X73' [73], 'X74' [74]
## Parsed with column specification:
## cols(
##   .default = col_character()
## )
## See spec(...) for full column specifications.
steo <- steo[,-1] %>% drop_na(X3)
steo.t <- as.data.frame(t(steo)) 

colnames(steo.t) <- c('year','month','wti','brent','usimpt','usref','gasoline')
steo.t <- steo.t[-1,]

data <- steo.t %>% 
        fill(1) %>%
        unite('date',1:2,sep="-",remove=FALSE) %>%
         dplyr::select(1:5) %>%
        mutate(
                wti = as.numeric(as.character(wti)),
                brent = as.numeric(as.character(brent)),
                wti_d = c(0,diff(log(wti))),
                brent_d = c(0,diff(log(brent))),
                spread = brent-wti,
                season = case_when(
                        .$month == 'Dec' | .$month == 'Jan' | .$month == 'Feb' ~ 'winter',
                        .$month == 'Mar' | .$month == 'Apr' | .$month == 'May' ~ 'spring',
                        .$month == 'Jun' | .$month == 'Jul' | .$month == 'Aug' ~ 'summer',
                        .$month == 'Sep' | .$month == 'Oct' | .$month == 'Nov' ~ 'fall')
        )

Create a time series (XTS object) of both WTI and Brent, as well as the spread, for charting purposes. It’s always good to visualize the data beforehand (easy way of spotting outliers, NAs, etc). The charts will only be of actual data as of Aug 2017. Data post Aug 2017 are EIA forecast, which we aren’t interested in.

idx <- seq(as.Date('2013/1/1'), as.Date('2017/8/1'),"month")

prices <- xts(cbind(data$wti[1:56],data$brent[1:56]),order.by = as.Date(idx))
colnames(prices) <- c('wti','brent')
dygraph(prices,main="WTI & Brent Prices")
spread <- xts(data$spread[1:56],order.by=as.Date(idx))
colnames(spread) <- 'spread'
dygraph(spread,main="Spread between Brent & WTI")

The spread chart above shows that the difference between Brent & WTI isn’t necessarily mean reverting and has dropped significantly since 2013 and 2014. Additionally, the affects of the Harvey hurricane show up in the Aug 2017 data point where the spread widen.

Let’s conduct some exploratory data analysis. The data are from 2013 to Aug 2017. For information purposes, it’s good to know what the average oil prices have been on an annual basis. As the chart below illustrates, prices have plunged post 2014.

annual <- data[1:56,] %>% 
        group_by(year) %>%
        summarise(wtiAVG = mean(wti),
                  brentAVG = mean(brent)
                  ) %>%
        gather('oil','price',2:3)

ggplot(annual,aes(x=year,y=price,fill=oil)) +
        geom_bar(stat='identity',position='dodge', width=0.75) +
        theme_gdocs() + scale_fill_manual(values = c("gray","navy")) +
        scale_y_continuous(labels = scales::dollar)

Oil prices can be subject to seasonal behavior. At least in the US, oil prices tend to increase as more individuals in the USA tend to drive more increasing gasoline demand, and thereby demand for WTI. Let’s confirm this behavior and see if the same thing can be spotted in Brent.

As the chart illustrates, WTI tends to be highest during the summer. Brent tends to be highest during the spring. Perhaps Europeans start their driving season slightly ahead of the USA, or there’s something else to explain the difference.

seasonal <- data[1:56,] %>%
        group_by(season) %>%
        summarise(wtiAVG = mean(wti),
                  brentAVG = mean(brent)
                  ) %>%
        gather('oil','price',2:3)

ggplot(seasonal,aes(x=season,y=price,fill=oil)) +
        geom_bar(stat='identity',position='dodge', width=0.75) +
        theme_gdocs() + scale_fill_manual(values = c("gray","navy")) +
        scale_y_continuous(labels = scales::dollar)

Co-Integration

Now we’ll assess if WTI and Brent are strongly statistically related allowing an individual to create a pairs trade by either shorting one and going long the other, or by trading the spread itself (there’s a futures contract on it that trades on the Chicago Mercantile Exchange).

First, we’ll look at a summary table of our data, then run a basic regression to see if there’s a positive or negative relationship between the two.

summary(data[1:56,])
##      date             year        month         wti        
##  Length:56          2013:12   Apr    : 5   Min.   : 30.32  
##  Class :character   2014:12   Aug    : 5   1st Qu.: 46.53  
##  Mode  :character   2015:12   Feb    : 5   Median : 52.98  
##                     2016:12   Jan    : 5   Mean   : 67.69  
##                     2017: 8   Jul    : 5   3rd Qu.: 94.90  
##                     2018: 0   Jun    : 5   Max.   :106.57  
##                               (Other):26                   
##      brent            wti_d              brent_d              spread      
##  Min.   : 30.70   Min.   :-0.245526   Min.   :-0.266415   Min.   :-0.980  
##  1st Qu.: 47.73   1st Qu.:-0.069184   1st Qu.:-0.059699   1st Qu.: 1.373  
##  Median : 57.33   Median : 0.001322   Median : 0.002949   Median : 3.220  
##  Mean   : 72.38   Mean   :-0.012131   Mean   :-0.013936   Mean   : 4.685  
##  3rd Qu.:107.55   3rd Qu.: 0.034532   3rd Qu.: 0.028027   3rd Qu.: 6.798  
##  Max.   :116.05   Max.   : 0.213866   Max.   : 0.195977   Max.   :20.740  
##                                                                           
##     season         
##  Length:56         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
attach(data[1:56,])
## The following object is masked _by_ .GlobalEnv:
## 
##     spread
reg1 <- lm(wti~brent)
summary(reg1)
## 
## Call:
## lm(formula = wti ~ brent)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7199  -1.2418   0.3666   1.8732   5.7682 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.15550    1.17717    3.53 0.000858 ***
## brent        0.87785    0.01511   58.10  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.263 on 54 degrees of freedom
## Multiple R-squared:  0.9843, Adjusted R-squared:  0.984 
## F-statistic:  3376 on 1 and 54 DF,  p-value: < 2.2e-16
dwtest(reg1)
## 
##  Durbin-Watson test
## 
## data:  reg1
## DW = 0.48354, p-value = 5.289e-13
## alternative hypothesis: true autocorrelation is greater than 0
reg2 <- lm(brent~wti)
summary(reg2)
## 
## Call:
## lm(formula = brent ~ wti)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.908 -2.032 -0.619  1.213 12.707 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.5198     1.3960  -2.521   0.0147 *  
## wti           1.1212     0.0193  58.105   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.688 on 54 degrees of freedom
## Multiple R-squared:  0.9843, Adjusted R-squared:  0.984 
## F-statistic:  3376 on 1 and 54 DF,  p-value: < 2.2e-16
dwtest(reg2)
## 
##  Durbin-Watson test
## 
## data:  reg2
## DW = 0.47726, p-value = 3.832e-13
## alternative hypothesis: true autocorrelation is greater than 0

Both basic regressions above indicate strong positive correlation between the two variables. However, the Durbin-Watson test tells us that both equations suffer from serial correlation (this will lead one to make wrong conclusions because the errors aren’t normally disturbed, which affect the measures of statistical significance). One solution is to test if the changes in each variable also suffers from serial correlation.

We can test for significance in serial correlation using the augmented Dickey-Fuller (ADF) test:

adf.test(wti)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  wti
## Dickey-Fuller = -1.5794, Lag order = 3, p-value = 0.7442
## alternative hypothesis: stationary
Acf(wti)

adf.test(brent)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  brent
## Dickey-Fuller = -1.4602, Lag order = 3, p-value = 0.7923
## alternative hypothesis: stationary
Acf(brent)

For both WTI & Brent, the ADF test confirm what the Durbin-Watson test had already told us, serial correlation exists on the price level data. Let’s now run the same test on the first difference of the data:

adf.test(wti_d)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  wti_d
## Dickey-Fuller = -3.95, Lag order = 3, p-value = 0.01819
## alternative hypothesis: stationary
Acf(wti_d)

adf.test(brent_d)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  brent_d
## Dickey-Fuller = -3.9268, Lag order = 3, p-value = 0.0192
## alternative hypothesis: stationary
Acf(brent_d)

The ADF test and ACF plots show that the first differenced data doesn’t suffer from serial correlation. Now we can run the same regression as before but using differenced data. However, better techniques exist such as using the Cochrane-Orcutt method.

reg1 <- cochrane.orcutt(reg1)
summary(reg1)
## Call:
## lm(formula = wti ~ brent)
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.088577   2.182641   1.415   0.1629    
## brent       0.899657   0.028802  31.236   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.0442 on 53 degrees of freedom
## Multiple R-squared:  0.9485 ,  Adjusted R-squared:  0.9475
## F-statistic: 975.7 on 1 and 53 DF,  p-value: < 8.25e-36
## 
## Durbin-Watson statistic 
## (original):    0.48354 , p-value: 5.289e-13
## (transformed): 1.76620 , p-value: 1.594e-01
reg2 <- cochrane.orcutt(reg2)
summary(reg2)
## Call:
## lm(formula = brent ~ wti)
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.075576   3.196733   0.962   0.3404    
## wti         1.004531   0.044272  22.690   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.1925 on 53 degrees of freedom
## Multiple R-squared:  0.9067 ,  Adjusted R-squared:  0.9049
## F-statistic: 514.8 on 1 and 53 DF,  p-value: < 5.814e-29
## 
## Durbin-Watson statistic 
## (original):    0.47726 , p-value: 3.832e-13
## (transformed): 1.91828 , p-value: 3.426e-01

The Cochrane-Orcutt fix address multiple issues with running a regular OLS regression on WTI & Brent. Thankfully, both revised regressions allow us to conclude a statistically significant positive relationship between the two variables.

Now, when two variables aren’t stationary (suffer from serial correlation, like WTI & Brent do at the price level), one can still run a very specific type of econometric model called a Vector Error Correction model. This kind of model is used when data are stationary, more information is extracted from price level versus first difference (i.e., the level of oil is more important that the percent change), and the data are co-integrated.

Co-integrated means the data move more or less in tandem. This is harder to test and a simple correlation analysis won’t suffice. In order to test for co-integration, we’ll use the egcm function from the egcm package which uses the Engle-Granger Cointegration Model (egcm). Engle and Granger actually won the Nobel prize in economics for this methodology.

e <- egcm(wti,brent, robust = TRUE,normalize=TRUE,log=TRUE)
summary(e)
## Y[i] =   1.0520 X[i] -   0.1022 + R[i], R[i] =   0.6882 R[i-1] + eps[i], eps ~ N(0,  0.0329^2)
##         (0.0140)        (0.0182)                (0.0865)
## 
## R[56] = 0.0364 (t = 0.836)
## 
## Unit Root Tests of Residuals
##                                                     Statistic    p-value
##   Augmented Dickey Fuller (ADF)                        -3.544    0.02386
##   Phillips-Perron (PP)                                -20.204    0.03880
##   Pantula, Gonzales-Farias and Fuller (PGFF)            0.646    0.04882
##   Elliott, Rothenberg and Stock DF-GLS (ERSD)          -1.192    0.57248
##   Johansen's Trace Test (JOT)                         -20.450    0.07059
##   Schmidt and Phillips Rho (SPR)                       -3.996    0.81186
## 
## Variances
##   SD(diff(X))          =   0.089560
##   SD(diff(Y))          =   0.092031
##   SD(diff(residuals))  =   0.037116
##   SD(residuals)        =   0.043566
##   SD(innovations)      =   0.032938
## 
## Half life       =   1.854585
## R[last]         =   0.036428 (t=0.84)
plot(e)

is.cointegrated(e)
## [1] TRUE

The egcm concludes that there is indeed co-integration. This dataset can now be used to create a pairs trading strategy.