Silverio J. Vasquez
Oct. 4, 2016
I decided to take a deeper dive into the following short-term WTI curves:
-3 month out contract (F3) less current contract (F0)
-2 month out contract (F2) less current contract (F0)
-1 month out contract (F1) less current contract (F0)
I didn't want to reinvent the wheel by programming a roll schedule (rolling over 5 days to smooth the transition), so instead I used EIA data in which the NYMEX contracts have already been rolled forward.
PCA was used to extract the largest driver of variation among the 3 WTI curves.
Macro data (both in levels and differenced) and fundamental data were collected, and PCA was again used to extract the largest drivers of variation (PCs) within each of these datasets. Then, all explanatory variables were pooled and PCA was applied to the combined set.
The first principal component of the WTI curves (PC1) explained the majority of the variation and was regressed on all the PCs gathered from the step above.
Due to data limitations, the analysis covers January 1997 through July 2016 at a monthly frequency.
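As a minimal sketch of the data preparation in R (the CSV file name and the contract columns F0 through F3 are assumptions, not the actual EIA field names):

# Monthly EIA series of already-rolled NYMEX WTI contract prices;
# columns assumed to be date, F0 (current), F1, F2, F3
wti <- read.csv("wti_nymex_monthly.csv", stringsAsFactors = FALSE)
wti$date <- as.Date(wti$date)

# The three short-term curve measures analyzed below
spreads <- data.frame(
  date  = wti$date,
  c2_c1 = wti$F1 - wti$F0,  # 1 month out less current
  c3_c1 = wti$F2 - wti$F0,  # 2 months out less current
  c4_c1 = wti$F3 - wti$F0   # 3 months out less current
)

# Restrict to the Jan 1997 - Jul 2016 window
spreads <- subset(spreads, date >= as.Date("1997-01-01") &
                           date <= as.Date("2016-07-31"))
summary(spreads[, -1])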
Summary Statistics (C1 = F0, C2 = F1, C3 = F2, C4 = F3)
C2-C1 C3-C1 C4-C1
Min. :-2.0800 Min. :-3.3100 Min. :-4.0400
1st Qu.:-0.3425 1st Qu.:-0.7025 1st Qu.:-1.1500
Median : 0.1750 Median : 0.2700 Median : 0.2600
Mean : 0.2088 Mean : 0.3305 Mean : 0.3848
3rd Qu.: 0.5925 3rd Qu.: 1.1300 3rd Qu.: 1.5625
Max. : 4.4700 Max. : 7.1900 Max. : 8.9800
Correlation Table (C1 = F0, C2 = F1, C3 = F2, C4 = F3)
C2-C1 C3-C1 C4-C1
C2-C1 1.000 0.990 0.972
C3-C1 0.990 1.000 0.995
C4-C1 0.972 0.995 1.000
The first principal component (PC1) explains 99% of the variation. Given PC1's negative relationship with all three curves and the fact that WTI is usually in backwardation, PC1 is likely related to the term structure. The component summary below shows the % of variance explained by each PC along with the loadings.
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 1.7236824 0.169035888 0.0185964117
Proportion of Variance 0.9903603 0.009524377 0.0001152755
Cumulative Proportion 0.9903603 0.999884724 1.0000000000
Comp.1 Comp.2 Comp.3
C2-C1 -0.5755046 0.74650575 0.3339515
C3-C1 -0.5800279 -0.08472591 -0.8101785
C4-C1 -0.5765085 -0.65996264 0.4817543
PC1 looks very similar to the WTI curves, as expected given PC1 explains 99% of the variation across all three curves.
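A sketch of how this PC1 series could be extracted (correlation-matrix PCA on the three spreads via princomp; the construction of pc1.z here is an assumption that mirrors the output above):

# Correlation-matrix PCA on the three curve measures
pc.curves <- princomp(spreads[, c("c2_c1", "c3_c1", "c4_c1")], cor = TRUE)

summary(pc.curves)   # proportion of variance explained by each component
loadings(pc.curves)  # loadings (PC1 loads negatively on all three spreads)

# PC1 scores - the dependent variable for the regressions below
pc1.z <- pc.curves$scores[, 1]

# Visual check that PC1 tracks the spreads themselves
plot(spreads$date, pc1.z, type = "l", xlab = "", ylab = "WTI curve PC1")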
Macro Variables:
HY option-adjusted spread over US 10-Year yield
VIX
10-yr and 2-yr Treasuries
Yield Curve (10s2s)
Fed Trade-Weighted Dollar (broad)
US LEI (Conference Board)
NOTE: Some transformations were applied - differencing and log-levels.
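A sketch of those transformations, assuming the monthly macro series sit in a data frame macro with hypothetical column names:

# Differenced macro series (month-over-month changes)
dmacro <- data.frame(
  d10yr  = diff(macro$us10yr),  # change in 10-yr Treasury yield
  d2yr   = diff(macro$us2yr),   # change in 2-yr Treasury yield
  dvix   = diff(macro$vix),
  dyc    = diff(macro$yc),      # change in the 10s2s yield curve
  dtwd   = diff(macro$twd),     # change in the trade-weighted dollar
  dlei   = diff(macro$lei),
  dhyspd = diff(macro$hyspd)    # change in the HY spread over the 10-yr
)

# Level / log-level macro series, aligned to the differenced sample
lmacro <- data.frame(
  yc     = macro$yc[-1],
  hyspd  = macro$hyspd[-1],
  llei   = log(macro$lei[-1]),  # log of the LEI index
  us10yr = macro$us10yr[-1],
  us2yr  = macro$us2yr[-1],
  vix    = macro$vix[-1]
)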
Fundamental Variables:
US net oil imports (EIA)
US oil production (EIA)
US refinery capacity (EIA)
US crude stock including SPR (EIA)
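Similarly for the fundamental series (the eia data frame and its columns are hypothetical; the transformed names mirror the correlation table below):

# Fundamental series (EIA) in differenced and level / log-level form
fund <- data.frame(
  dnetimp   = diff(eia$netimp),        # change in US net oil imports
  dloilprod = diff(log(eia$oilprod)),  # change in log US oil production
  loilprod  = log(eia$oilprod[-1]),    # log US oil production
  dstock    = diff(eia$stock),         # change in crude stocks incl. SPR
  lstock    = log(eia$stock[-1]),      # log crude stocks incl. SPR
  refcap    = eia$refcap[-1]           # refinery capacity
)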
C2-C1 C3-C1 C4-C1 pc1.z.97
C2-C1 1.00 0.99 0.97 -0.99
C3-C1 0.99 1.00 0.99 -1.00
C4-C1 0.97 0.99 1.00 -0.99
pc1.z.97 -0.99 -1.00 -0.99 1.00
d10yr.97 0.10 0.09 0.08 -0.09
d2yr.97 0.09 0.08 0.06 -0.08
dvix.97 -0.03 -0.03 -0.03 0.03
dyc.97 0.02 0.03 0.03 -0.02
dtwd.97 -0.11 -0.10 -0.09 0.10
dlei.97 -0.01 -0.02 -0.03 0.02
dhyspd.97 -0.15 -0.14 -0.12 0.14
yc.97 0.02 0.02 0.01 -0.02
hyspd.97 0.22 0.25 0.26 -0.24
llei.97 -0.04 -0.06 -0.07 0.06
us10yr.97 -0.25 -0.27 -0.28 0.27
us2yr.97 -0.17 -0.18 -0.19 0.18
vix.97 0.00 0.03 0.04 -0.02
dnetimp.97 -0.02 -0.02 -0.02 0.02
dloilprod.97 -0.09 -0.07 -0.07 0.08
loilprod.97 -0.08 -0.07 -0.06 0.07
dstock.97 -0.02 0.00 0.01 0.00
lstock.97 0.52 0.53 0.54 -0.53
refcap.97 -0.30 -0.32 -0.33 0.32
Macro data (level & log): top 2 principal components explain 85% of the variation.
Macro data (differenced): top 3 principal components explain 79% of the variation.
Fundamental data: top 3 principal components explain 70% of the variation.
All explanatory variables: top 5 principal components explain 67% of the variation.
pc1.z.97. = PC1 from WTI curves
pcml = PCs from macro data (level & log)
pcmd = PCs from macro data (differenced)
pcf = PCs from fundamental data
pca = PCs from all explanatory variables
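A sketch of how these PC sets and the model data frames used below (pc.df1 through pc.df4) could be assembled from the transformed series sketched earlier, assuming all series are aligned to the same monthly sample:

# PCA within each group of standardized explanatory variables,
# keeping the component counts noted above
pcml <- princomp(lmacro, cor = TRUE)$scores[, 1:2]                       # macro, level & log
pcmd <- princomp(dmacro, cor = TRUE)$scores[, 1:3]                       # macro, differenced
pcf  <- princomp(fund, cor = TRUE)$scores[, 1:3]                         # fundamentals
pca  <- princomp(cbind(lmacro, dmacro, fund), cor = TRUE)$scores[, 1:5]  # all variables

# One data frame per candidate model, each with the curve PC1 as the response
pc.df1 <- setNames(data.frame(pc1.z, pcml), c("pc1.z.97", "pcml1", "pcml2"))
pc.df2 <- setNames(data.frame(pc1.z, pcmd), c("pc1.z.97", paste0("pcmd", 1:3)))
pc.df3 <- setNames(data.frame(pc1.z, pcf),  c("pc1.z.97", paste0("pcf", 1:3)))
pc.df4 <- setNames(data.frame(pc1.z, pca),  c("pc1.z.97", paste0("pca", 1:5)))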
[,1]
pc1.z.97. 1.00
pcml1 0.19
pcml2 -0.05
pcmd1 0.10
pcmd2 0.02
pcmd3 -0.05
pcf1 0.31
pcf2 -0.33
pcf3 0.05
pca1 -0.29
pca2 -0.09
pca3 0.06
pca4 -0.06
pca5 -0.23
Regressing PC1 of the WTI curves on the macro (level & log) PCs results in a low adjusted R-squared with only one significant variable.
Call:
lm(formula = pc1.z.97 ~ ., data = pc.df1[1:216, ])
Residuals:
Min 1Q Median 3Q Max
-7.1020 -0.7614 -0.0014 1.1862 3.5487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01330 0.11739 0.113 0.9099
pcml1 0.15043 0.06249 2.407 0.0169 *
pcml2 0.06926 0.09091 0.762 0.4470
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.708 on 213 degrees of freedom
Multiple R-squared: 0.03031, Adjusted R-squared: 0.02121
F-statistic: 3.329 on 2 and 213 DF, p-value: 0.0377
This model suffers from serial correlation (DW test) and heteroskedasticity (BP test).
Durbin-Watson test
data: pc.df1.model
DW = 0.29692, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
studentized Breusch-Pagan test
data: pc.df1.model
BP = 7.5294, df = 2, p-value = 0.02317
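For reference, these are the standard tests from R's lmtest package; a sketch of the calls, with the model object named as in the output above:

library(lmtest)

# First candidate model: curve PC1 on the macro (level & log) PCs
pc.df1.model <- lm(pc1.z.97 ~ ., data = pc.df1[1:216, ])

dwtest(pc.df1.model)  # Durbin-Watson: serial correlation in the residuals
bptest(pc.df1.model)  # studentized Breusch-Pagan: heteroskedasticity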
The differenced-macro model below is a really bad model: nothing is significant and adjusted R-squared is essentially zero, so no further diagnostics were run.
Call:
lm(formula = pc1.z.97 ~ ., data = pc.df2[1:216, ])
Residuals:
Min 1Q Median 3Q Max
-8.1480 -0.9312 0.1153 1.2024 3.5891
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01796 0.11752 0.153 0.879
pcmd1 0.09946 0.06770 1.469 0.143
pcmd2 0.04876 0.09704 0.502 0.616
pcmd3 -0.09015 0.10613 -0.849 0.397
Residual standard error: 1.726 on 212 degrees of freedom
Multiple R-squared: 0.01417, Adjusted R-squared: 0.000224
F-statistic: 1.016 on 3 and 212 DF, p-value: 0.3865
In the original fundamental-data model, the 3rd PC was not significant, so it was dropped. The resulting model is much better.
Call:
lm(formula = pc1.z.97 ~ pcf1 + pcf2, data = pc.df3[1:216, ])
Residuals:
Min 1Q Median 3Q Max
-6.7117 -0.8285 0.0592 0.9765 2.9797
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01873 0.10699 -0.175 0.861211
pcf1 0.30846 0.09216 3.347 0.000965 ***
pcf2 -0.55765 0.08909 -6.259 2.1e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.551 on 213 degrees of freedom
Multiple R-squared: 0.2, Adjusted R-squared: 0.1925
F-statistic: 26.63 on 2 and 213 DF, p-value: 4.776e-11
However, this model suffers from serial correlation (DW test) and heteroskedasticity (BP test).
Durbin-Watson test
data: pc.df3.model2
DW = 0.51641, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
studentized Breusch-Pagan test
data: pc.df3.model2
BP = 8.7839, df = 2, p-value = 0.01238
In the original all-variables model, the 2nd and 4th PCs were not significant, so they were dropped. The resulting model is OK but the fit is still poor.
Call:
lm(formula = pc1.z.97 ~ pca1 + pca3 + pca5, data = pc.df4[1:216, ])
Residuals:
Min 1Q Median 3Q Max
-6.6854 -0.8598 -0.0455 1.0755 3.5761
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.008501 0.112829 0.075 0.94001
pca1 -0.183266 0.055491 -3.303 0.00112 **
pca3 0.198078 0.082525 2.400 0.01725 *
pca5 -0.315239 0.097390 -3.237 0.00140 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.626 on 212 degrees of freedom
Multiple R-squared: 0.126, Adjusted R-squared: 0.1136
F-statistic: 10.18 on 3 and 212 DF, p-value: 2.707e-06
However, this model suffers from serial correlation (DW test) and heteroskedasticity (BP test).
Durbin-Watson test
data: pc.df4.model2
DW = 0.41739, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
studentized Breusch-Pagan test
data: pc.df4.model2
BP = 12.034, df = 3, p-value = 0.007268
While the 1st PC of the WTI curves is most likely the term structure, since it explains 99% of the variation across all 3 curves, the macro and fundamental explanatory variables selected here didn't explain it well, as illustrated by the regression outputs.
Perhaps the right relationship wasn't explored in this analysis and more work can be done. I also think Partial Least Squares (PLS) could improve the model specification. PLS is similar to PCA but takes the dependent variable you are trying to explain into account, while PCA only looks at the matrix at hand - in this case, the independent variables alone.
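A minimal sketch of what that might look like with the pls package, fit on the raw explanatory variables rather than their PCs (expl.df reuses the hypothetical objects from the earlier sketches and is illustrative only, not something run in this analysis):

library(pls)

# All raw explanatory variables plus the curve PC1 as the response
expl.df <- data.frame(pc1.z.97 = pc1.z, lmacro, dmacro, fund)

# PLS builds components with the response in mind; pick the count by cross-validation
pls.fit <- plsr(pc1.z.97 ~ ., data = expl.df, ncomp = 5,
                validation = "CV", scale = TRUE)

summary(pls.fit)         # variance explained and cross-validated RMSEP
validationplot(pls.fit)  # RMSEP vs. number of components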
Thank you for your time.