A calendar spread in a futures contract is when a contract is bought and another sold x-number of months later for the same asset. The contract that is closer to expiration is called the front contract, and the contract that expires out in the future is called the back contract. Depending on the price differential of the asset, in this case WTI futures, knowing if the back contract price will be lower than the front contract, then an investor can buy the front contract and sell the back contract for a positive investment return. The converse is also true (e.g., sell the front and buy the back contract).
When the back contract price is higher than the front contract price, the spread (differential) is negative and the curve is upward sloping (contango). When the opposite is true, the spread is positive and the curve is downward sloping (backwardation).
Below is an observational study using supply and demand data from the U.S. Energy Information Administration (EIA) to see if this data can help forecast if the curve will be upward or downward sloping using a logit regression model.
The main question this analysis attempts to answer is: Are changes in WTI and gasoline inventory predictive of an upward sloping (contango) or downward sloping (backwardation) oil curve 1-month out?
Data is reported by the EIA and available via Quandl. The data are monthly between Feb 1991 and Oct 2017.
The response variable is the shape of the oil futures curve and is binary: 0 for upward sloping (contango), or +1 for downward sloping (backwardation).
The explanatory variables are numerical - percent change in oil and oil & gasoline inventory data.
Each case represents a monthly observation of a futures contract and inventory data. There are 321 observations in the given data set.
library(xts)
library(Quandl)
library(tidyverse)
library(zoo)
library(xts)
library(psych)
library(ggplot2)
library(reshape2)
Quandl.api_key("TWp9zrWTMziHW_hbugiN")
f1 <- Quandl("EIA/PET_RCLC1_M", type="xts")
f2 <- Quandl("EIA/PET_RCLC2_M", type="xts")
f3 <- Quandl("EIA/PET_RCLC3_M", type="xts")
f4 <- Quandl("EIA/PET_RCLC4_M", type="xts")
wti_stock <- Quandl("EIA/STEO_COSXPUS_M", type="xts")
gas_stock <- Quandl("EIA/STEO_MGTSPUS_M", type="xts")
# Calculate monthly, 3-month, 6-month, and annual changes in both oil and gasoline inventory data
wstkd <- diff(log(wti_stock))
wstkd3 <- diff(log(wti_stock),lag=3)
wstkd6 <- diff(log(wti_stock),lag=6)
wstkd12 <- diff(log(wti_stock),lag=12)
gstkd <- diff(log(gas_stock))
gstkd3 <- diff(log(gas_stock),lag=3)
gstkd6 <- diff(log(gas_stock),lag=6)
gstkd12 <- diff(log(gas_stock),lag=12)
# Lag the independent variables by 1 month since we want to use them as leading indicators
wstkdL1 <- lag(wstkd,1)
wstkd3L1 <- lag(wstkd3,1)
wstkd6L1 <- lag(wstkd6,1)
wstkd12L1 <- lag(wstkd12,1)
gstkdL1 <- lag(gstkd,1)
gstkd3L1 <- lag(gstkd3,1)
gstkd6L1 <- lag(gstkd6,1)
gstkd12L1 <- lag(gstkd12,1)
# Create calendar spreads and return sign. Negative means curve is in contango, and positive means curve is in backwardation. Since all the variables are of XTS data type, the dates are automatically taken into account.
f1_2 <- sign(f1-f2)
f1_3 <- sign(f1-f3)
f1_4 <- sign(f1-f4)
f2_3 <- sign(f2-f3)
f2_4 <- sign(f2-f4)
f3_4 <- sign(f3-f4)
# Combine all the data:
df <- na.omit(cbind(f1_2,f1_3,f1_4,f2_3,f2_4,f3_4,wstkdL1,wstkd3L1,wstkd6L1,wstkd12L1,gstkdL1,gstkd3L1,gstkd6L1,gstkd12L1))
colnames(df) <- c('f1_2','f1_3','f1_4','f2_3','f2_4','f3_4','wstkdL1','wstkd3L1','wstkd6L1','wstkd12L1','gstkdL1','gstkd3L1','gstkd6L1','gstkd12L1')
Before jumping into regression analysis, it’ll be interesting to see if there are any meaningful correlations between our depedent and independent variables. The corr.test function from the psych package shows significant correlations.
There are very little times when the spread variables equal 0, so they are all dropped. All -1s are then converted to 0 for the purpose of using a logit model.
describe(df[,1:6])
describe(df[,7:14]*100)
print("# of Observations in Contango and Backwardation")
## [1] "# of Observations in Contango and Backwardation"
table(df$f1_2)
##
## -1 0 1
## 198 1 122
print("% of Observations in Contango and Backwardation")
## [1] "% of Observations in Contango and Backwardation"
prop.table(table(df$f1_2)) * 100
##
## -1 0 1
## 61.6822430 0.3115265 38.0062305
print("# of Observations in Contango and Backwardation")
## [1] "# of Observations in Contango and Backwardation"
table(df$f1_3)
##
## -1 0 1
## 190 4 127
print("% of Observations in Contango and Backwardation")
## [1] "% of Observations in Contango and Backwardation"
prop.table(table(df$f1_3)) * 100
##
## -1 0 1
## 59.190031 1.246106 39.563863
print("# of Observations in Contango and Backwardation")
## [1] "# of Observations in Contango and Backwardation"
table(df$f1_4)
##
## -1 0 1
## 181 1 139
print("% of Observations in Contango and Backwardation")
## [1] "% of Observations in Contango and Backwardation"
prop.table(table(df$f1_4)) * 100
##
## -1 0 1
## 56.3862928 0.3115265 43.3021807
print("# of Observations in Contango and Backwardation")
## [1] "# of Observations in Contango and Backwardation"
table(df$f2_3)
##
## -1 0 1
## 180 1 140
print("% of Observations in Contango and Backwardation")
## [1] "% of Observations in Contango and Backwardation"
prop.table(table(df$f2_3)) * 100
##
## -1 0 1
## 56.0747664 0.3115265 43.6137072
print("# of Observations in Contango and Backwardation")
## [1] "# of Observations in Contango and Backwardation"
table(df$f2_4)
##
## -1 1
## 173 148
print("% of Observations in Contango and Backwardation")
## [1] "% of Observations in Contango and Backwardation"
prop.table(table(df$f2_4)) * 100
##
## -1 1
## 53.89408 46.10592
boxplot(coredata(df[,7:14]*100),col=c('powderblue'),main="Boxplots of the Explanatory Variables",ylab="%",las=2)
# Check correlations
corr.test(df)
## Call:corr.test(x = df)
## Correlation matrix
## f1_2 f1_3 f1_4 f2_3 f2_4 f3_4 wstkdL1 wstkd3L1 wstkd6L1
## f1_2 1.00 0.95 0.88 0.86 0.82 0.79 -0.18 -0.28 -0.31
## f1_3 0.95 1.00 0.94 0.92 0.87 0.85 -0.17 -0.27 -0.31
## f1_4 0.88 0.94 1.00 0.97 0.94 0.91 -0.18 -0.25 -0.30
## f2_3 0.86 0.92 0.97 1.00 0.95 0.92 -0.18 -0.24 -0.30
## f2_4 0.82 0.87 0.94 0.95 1.00 0.97 -0.14 -0.22 -0.27
## f3_4 0.79 0.85 0.91 0.92 0.97 1.00 -0.14 -0.22 -0.29
## wstkdL1 -0.18 -0.17 -0.18 -0.18 -0.14 -0.14 1.00 0.59 0.32
## wstkd3L1 -0.28 -0.27 -0.25 -0.24 -0.22 -0.22 0.59 1.00 0.65
## wstkd6L1 -0.31 -0.31 -0.30 -0.30 -0.27 -0.29 0.32 0.65 1.00
## wstkd12L1 -0.56 -0.57 -0.56 -0.55 -0.53 -0.53 0.24 0.39 0.55
## gstkdL1 -0.08 -0.08 -0.08 -0.08 -0.10 -0.08 -0.07 -0.03 -0.18
## gstkd3L1 -0.08 -0.09 -0.10 -0.07 -0.10 -0.08 0.14 0.05 -0.03
## gstkd6L1 -0.16 -0.16 -0.14 -0.12 -0.10 -0.10 0.30 0.40 0.28
## gstkd12L1 -0.39 -0.36 -0.34 -0.33 -0.32 -0.29 0.06 0.12 0.15
## wstkd12L1 gstkdL1 gstkd3L1 gstkd6L1 gstkd12L1
## f1_2 -0.56 -0.08 -0.08 -0.16 -0.39
## f1_3 -0.57 -0.08 -0.09 -0.16 -0.36
## f1_4 -0.56 -0.08 -0.10 -0.14 -0.34
## f2_3 -0.55 -0.08 -0.07 -0.12 -0.33
## f2_4 -0.53 -0.10 -0.10 -0.10 -0.32
## f3_4 -0.53 -0.08 -0.08 -0.10 -0.29
## wstkdL1 0.24 -0.07 0.14 0.30 0.06
## wstkd3L1 0.39 -0.03 0.05 0.40 0.12
## wstkd6L1 0.55 -0.18 -0.03 0.28 0.15
## wstkd12L1 1.00 0.07 0.13 0.20 0.38
## gstkdL1 0.07 1.00 0.53 0.33 0.18
## gstkd3L1 0.13 0.53 1.00 0.57 0.24
## gstkd6L1 0.20 0.33 0.57 1.00 0.34
## gstkd12L1 0.38 0.18 0.24 0.34 1.00
## Sample Size
## [1] 321
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## f1_2 f1_3 f1_4 f2_3 f2_4 f3_4 wstkdL1 wstkd3L1 wstkd6L1
## f1_2 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00
## f1_3 0.00 0.00 0.00 0.00 0.00 0.00 0.07 0.00 0.00
## f1_4 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.00
## f2_3 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00
## f2_4 0.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 0.00
## f3_4 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 0.00
## wstkdL1 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00
## wstkd3L1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## wstkd6L1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## wstkd12L1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## gstkdL1 0.16 0.14 0.15 0.17 0.08 0.17 0.18 0.63 0.00
## gstkd3L1 0.14 0.12 0.09 0.18 0.07 0.14 0.01 0.36 0.57
## gstkd6L1 0.00 0.00 0.01 0.04 0.08 0.09 0.00 0.00 0.00
## gstkd12L1 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.03 0.01
## wstkd12L1 gstkdL1 gstkd3L1 gstkd6L1 gstkd12L1
## f1_2 0.00 1.00 1.00 0.10 0.00
## f1_3 0.00 1.00 1.00 0.13 0.00
## f1_4 0.00 1.00 1.00 0.32 0.00
## f2_3 0.00 1.00 1.00 0.76 0.00
## f2_4 0.00 1.00 1.00 1.00 0.00
## f3_4 0.00 1.00 1.00 1.00 0.00
## wstkdL1 0.00 1.00 0.32 0.00 1.00
## wstkd3L1 0.00 1.00 1.00 0.00 0.76
## wstkd6L1 0.00 0.03 1.00 0.00 0.21
## wstkd12L1 0.00 1.00 0.36 0.01 0.00
## gstkdL1 0.22 0.00 0.00 0.00 0.05
## gstkd3L1 0.02 0.00 0.00 0.00 0.00
## gstkd6L1 0.00 0.00 0.00 0.00 0.00
## gstkd12L1 0.00 0.00 0.00 0.00 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
# Plot correlation matrix as a heatmap
cor_df <- cor(df)
melt_cor <- melt(cor_df)
# ggplot(data=melt_cor, aes(x=Var1,y=Var2,fill=value)) + geom_raster()
# NA the bottom half of matrix -- redundant
cor_top <- cor_df
cor_top[upper.tri(cor_top)] <- NA
# Create new heatmap
# Source: http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization#create-the-correlation-heatmap-with-ggplot2
melt_cor_top <- melt(cor_top, na.rm = TRUE)
p <- ggplot(data=melt_cor_top,aes(x=Var1,y=Var2,fill=value)) + geom_raster() + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") + theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
coord_fixed()
p + theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
# Plot some Histograms of indepdent variables
hist(na.omit(gstkd), main="Monthly Log Difference of Gasoline Inventory",xlab="")
hist(na.omit(wstkd), main="Monthly Log Difference of Oil Inventory",xlab = "")
# Remove any row where dependent is 0 (less than 10 instances)
df2 <- subset(df,!(f1_2 == 0 | f1_3 == 0 | f1_4 == 0 | f2_3 == 0 | f2_4 == 0 | f3_4 == 0))
# Convert -1s to 0s
df2[,1:6] <- apply(df2[,1:6],2,function(x) ifelse(x==-1,0,1))
f1_2m <- glm(f1_2~gstkd3L1+gstkd6L1+gstkd12L1+wstkd3L1+wstkd6L1+wstkd12L1,data=df2,
family=binomial(link='logit'))
f1_3m <- glm(f1_3~gstkd3L1+gstkd6L1+gstkd12L1+wstkd3L1+wstkd6L1+wstkd12L1,data=df2,
family=binomial(link='logit'))
f1_4m <- glm(f1_4~gstkd3L1+gstkd6L1+gstkd12L1+wstkd3L1+wstkd6L1+wstkd12L1,data=df2,
family=binomial(link='logit'))
f2_3m <- glm(f2_3~gstkd3L1+gstkd6L1+gstkd12L1+wstkd3L1+wstkd6L1+wstkd12L1,data=df2,
family=binomial(link='logit'))
f2_4m <- glm(f2_4~gstkd3L1+gstkd6L1+gstkd12L1+wstkd3L1+wstkd6L1+wstkd12L1,data=df2,
family=binomial(link='logit'))
f3_4m <- glm(f3_4~gstkd3L1+gstkd6L1+gstkd12L1+wstkd3L1+wstkd6L1+wstkd12L1,data=df2,
family=binomial(link='logit'))
# Print model summaries
summary(f1_2m)
##
## Call:
## glm(formula = f1_2 ~ gstkd3L1 + gstkd6L1 + gstkd12L1 + wstkd3L1 +
## wstkd6L1 + wstkd12L1, family = binomial(link = "logit"),
## data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6507 -0.6229 -0.2397 0.6192 2.1745
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5262 0.1540 -3.417 0.000633 ***
## gstkd3L1 1.3530 3.2027 0.422 0.672679
## gstkd6L1 2.1535 3.0742 0.701 0.483609
## gstkd12L1 -15.1226 4.0675 -3.718 0.000201 ***
## wstkd3L1 -4.5253 3.6846 -1.228 0.219385
## wstkd6L1 1.8599 3.2402 0.574 0.565965
## wstkd12L1 -21.8053 3.4141 -6.387 1.69e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 419.61 on 314 degrees of freedom
## Residual deviance: 268.27 on 308 degrees of freedom
## AIC: 282.27
##
## Number of Fisher Scoring iterations: 5
summary(f1_3m)
##
## Call:
## glm(formula = f1_3 ~ gstkd3L1 + gstkd6L1 + gstkd12L1 + wstkd3L1 +
## wstkd6L1 + wstkd12L1, family = binomial(link = "logit"),
## data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7561 -0.6844 -0.2639 0.6656 2.5176
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4023 0.1483 -2.713 0.00667 **
## gstkd3L1 1.2213 3.1226 0.391 0.69570
## gstkd6L1 1.0177 2.9900 0.340 0.73358
## gstkd12L1 -10.1917 3.7861 -2.692 0.00710 **
## wstkd3L1 -3.8023 3.5659 -1.066 0.28629
## wstkd6L1 2.3011 3.1347 0.734 0.46291
## wstkd12L1 -21.9582 3.3246 -6.605 3.98e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 424.00 on 314 degrees of freedom
## Residual deviance: 282.34 on 308 degrees of freedom
## AIC: 296.34
##
## Number of Fisher Scoring iterations: 5
summary(f1_4m)
##
## Call:
## glm(formula = f1_4 ~ gstkd3L1 + gstkd6L1 + gstkd12L1 + wstkd3L1 +
## wstkd6L1 + wstkd12L1, family = binomial(link = "logit"),
## data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7449 -0.7408 -0.2532 0.8165 2.3662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2064 0.1434 -1.439 0.1502
## gstkd3L1 0.4318 3.0299 0.143 0.8867
## gstkd6L1 1.4696 2.9146 0.504 0.6141
## gstkd12L1 -8.4730 3.6350 -2.331 0.0198 *
## wstkd3L1 -2.3462 3.4517 -0.680 0.4967
## wstkd6L1 1.4137 3.0268 0.467 0.6405
## wstkd12L1 -21.0718 3.1808 -6.625 3.48e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 430.23 on 314 degrees of freedom
## Residual deviance: 296.03 on 308 degrees of freedom
## AIC: 310.03
##
## Number of Fisher Scoring iterations: 5
summary(f2_3m)
##
## Call:
## glm(formula = f2_3 ~ gstkd3L1 + gstkd6L1 + gstkd12L1 + wstkd3L1 +
## wstkd6L1 + wstkd12L1, family = binomial(link = "logit"),
## data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6608 -0.7384 -0.2623 0.8400 2.3488
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2024 0.1425 -1.420 0.1557
## gstkd3L1 1.0214 3.0040 0.340 0.7338
## gstkd6L1 2.1870 2.8985 0.755 0.4505
## gstkd12L1 -9.0417 3.6275 -2.493 0.0127 *
## wstkd3L1 -2.5830 3.4319 -0.753 0.4517
## wstkd6L1 0.8882 3.0062 0.295 0.7676
## wstkd12L1 -20.2130 3.1022 -6.516 7.24e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 430.23 on 314 degrees of freedom
## Residual deviance: 299.13 on 308 degrees of freedom
## AIC: 313.13
##
## Number of Fisher Scoring iterations: 5
summary(f2_4m)
##
## Call:
## glm(formula = f2_4 ~ gstkd3L1 + gstkd6L1 + gstkd12L1 + wstkd3L1 +
## wstkd6L1 + wstkd12L1, family = binomial(link = "logit"),
## data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6064 -0.8123 -0.2794 0.8494 2.3914
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.05299 0.13798 -0.384 0.7010
## gstkd3L1 -1.67878 2.90025 -0.579 0.5627
## gstkd6L1 3.96949 2.81049 1.412 0.1578
## gstkd12L1 -8.41063 3.47432 -2.421 0.0155 *
## wstkd3L1 -2.57890 3.30890 -0.779 0.4358
## wstkd6L1 1.17319 2.86970 0.409 0.6827
## wstkd12L1 -18.09918 2.87684 -6.291 3.15e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.01 on 314 degrees of freedom
## Residual deviance: 317.28 on 308 degrees of freedom
## AIC: 331.28
##
## Number of Fisher Scoring iterations: 5
summary(f3_4m)
##
## Call:
## glm(formula = f3_4 ~ gstkd3L1 + gstkd6L1 + gstkd12L1 + wstkd3L1 +
## wstkd6L1 + wstkd12L1, family = binomial(link = "logit"),
## data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5612 -0.8196 -0.2421 0.8689 2.3630
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.016022 0.136871 -0.117 0.9068
## gstkd3L1 -0.666632 2.866282 -0.233 0.8161
## gstkd6L1 2.954731 2.778631 1.063 0.2876
## gstkd12L1 -6.673811 3.409350 -1.958 0.0503 .
## wstkd3L1 -1.764891 3.269005 -0.540 0.5893
## wstkd6L1 0.007402 2.834743 0.003 0.9979
## wstkd12L1 -17.639403 2.817824 -6.260 3.85e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.70 on 314 degrees of freedom
## Residual deviance: 322.28 on 308 degrees of freedom
## AIC: 336.28
##
## Number of Fisher Scoring iterations: 5
Judging from the various regressions above, it appears as though some factors should be strongly considered by investment professionals in this strategy.
For example, when forecasting the spread 1-month out, the year-over-year change in both oil and gas stock are significant. The negative coefficient implies that if inventories decrease, then the odds of the front contract being higher than the back contract is more likely, ceteris paribus.
This conclusion is actually the same for all models!
Diez, David M.; Barr, Chistopher D.; and Cetinkaya-Rundel, Mine - OpenIntro Statistics 3rd edition
U.S. Energy Information Administration
Chantziara, Thalia and Skiadopoulos, George - Can the Dynamics of the Term Structure of Petroleum Futures be forecasted? Evidence from Major Markets