WTI Curve Analysis Using Principal Component Analysis (PCA)

Silverio J. Vasquez
Oct. 4, 2016

WTI Curves

I decided to take a deeper dive into the following short-term WTI curves:

- 3-month-out contract (F3) less current contract (F0)

- 2-month-out contract (F2) less current contract (F0)

- 1-month-out contract (F1) less current contract (F0)

I didn't want to reinvent the wheel by programming a roll schedule (rolling over 5 days to smooth it out), so I used EIA data in which the NYMEX contracts have already been rolled forward.

source: https://www.eia.gov/dnav/pet/PET_PRI_FUT_S1_M.htm
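
As a minimal sketch of how the three curves are built, the snippet below subtracts the current-contract price (F0) from each deferred contract. It is written in plain Python purely for illustration; the prices are made-up values, not EIA data, and the function name `curve_spreads` is hypothetical.

```python
def curve_spreads(f0, f1, f2, f3):
    """Return the three spreads (deferred contract less current contract)."""
    return {"F1-F0": f1 - f0, "F2-F0": f2 - f0, "F3-F0": f3 - f0}

# Hypothetical prices in $/bbl for a single month (illustration only, not EIA data).
example = curve_spreads(f0=50.0, f1=50.5, f2=51.0, f3=51.25)
print(example)
```

A positive spread means the deferred contract is priced above the current one (contango); a negative spread means backwardation.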

Methodology

PCA was used to extract the largest driver of variation among the 3 WTI curves.

Macro data (both level and differenced) and fundamental data were collected. PCA was also used to extract the largest drivers of variation (PCs) among these datasets. Then, all explanatory variables were grouped together and PCA was used to extract the largest drivers of variation from all explanatory variables.

The first principal component of the WTI curves (PC1) explained the majority of the variation and was regressed on all the PCs gathered from the step above.

Due to data limitations, analysis is between Jan 1997 and Jul 2016 on a monthly basis.

Plots of WTI Curves

[Figure: WTI curves]

Plots of WTI Curves (Cont'd)

[Figure: WTI curves, continued]

Summary Statistics (C1 = F0, C2 = F1, C3 = F2, C4 = F3)

     C2-C1             C3-C1             C4-C1        
 Min.   :-2.0800   Min.   :-3.3100   Min.   :-4.0400  
 1st Qu.:-0.3425   1st Qu.:-0.7025   1st Qu.:-1.1500  
 Median : 0.1750   Median : 0.2700   Median : 0.2600  
 Mean   : 0.2088   Mean   : 0.3305   Mean   : 0.3848  
 3rd Qu.: 0.5925   3rd Qu.: 1.1300   3rd Qu.: 1.5625  
 Max.   : 4.4700   Max.   : 7.1900   Max.   : 8.9800  

Correlation Table (C1 = F0, C2 = F1, C3 = F2, C4 = F3)

      C2-C1 C3-C1 C4-C1
C2-C1 1.000 0.990 0.972
C3-C1 0.990 1.000 0.995
C4-C1 0.972 0.995 1.000

PCA for WTI curves

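The first principal component (PC1) explains 99% of the variation. PC1 loads negatively and almost equally (about -0.58) on all three curves, so it is likely related to the term structure: it rises as the spreads fall toward backwardation and falls as they widen into contango. Note that the positive medians in the summary statistics mean the spreads were in contango for at least half of this sample, even if WTI is often thought of as a backwardated market. The small chart below shows the % of variance explained by each PC.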

Importance of components:
                          Comp.1      Comp.2       Comp.3
Standard deviation     1.7236824 0.169035888 0.0185964117
Proportion of Variance 0.9903603 0.009524377 0.0001152755
Cumulative Proportion  0.9903603 0.999884724 1.0000000000

Loadings:
          Comp.1      Comp.2     Comp.3
C2-C1 -0.5755046  0.74650575  0.3339515
C3-C1 -0.5800279 -0.08472591 -0.8101785
C4-C1 -0.5765085 -0.65996264  0.4817543

[Figure: % of variance explained by each principal component]
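
The variance share and loadings reported for PC1 can be approximately reproduced from the correlation table alone, because PCA on standardized data is an eigen-decomposition of the correlation matrix (the reported standard deviations square to a total of about 3, which is consistent with that). The sketch below is plain Python rather than the R that produced the output, uses the three-decimal correlations, and finds only the leading component by power iteration. The function name `leading_component` is made up for this illustration.

```python
import math

# Correlation matrix of the three spreads, copied from the correlation table.
# Values are rounded to three decimals, so results match the output only approximately.
CORR = [
    [1.000, 0.990, 0.972],
    [0.990, 1.000, 0.995],
    [0.972, 0.995, 1.000],
]

def leading_component(matrix, iterations=200):
    """Power iteration: return (largest eigenvalue, its unit eigenvector)."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in prod))
        vec = [x / norm for x in prod]
    prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vec, prod))  # Rayleigh quotient
    return eigenvalue, vec

eigenvalue, loadings = leading_component(CORR)
# For standardized data the total variance equals the number of variables.
share = eigenvalue / len(CORR)
print(round(share, 3), [round(x, 3) for x in loadings])
```

This should give a variance share of roughly 0.99 and three loadings of roughly 0.58 each. The loadings come out positive here and negative in the R output; the sign of a principal component is arbitrary, so both describe the same component.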

Plot of the First Principal Component (PC1) of WTI Curves

PC1 closely tracks the WTI curves with the sign flipped (its correlation with each curve is about -0.99), as expected given that PC1 explains 99% of the variation across all three curves.

[Figure: first principal component (PC1) of the WTI curves]

Explanatory Variables

Macro Variables:

HY option-adjusted spread over US 10-Year yield

VIX

10-yr and 2-yr Treasuries

Yield Curve (10s2s)

Fed Trade-Weighted Dollar (broad)

US LEI (Conference Board)

NOTE: Some transformations were applied (differencing and log-level).
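
For readers unfamiliar with these transformations, here is a minimal Python sketch of differencing and log-level. The helper names and the input values are hypothetical. The variable names used later in the deck (for example `d10yr.97`, `llei.97`, `dloilprod.97`) suggest a `d` prefix for differenced, `l` for log-level and `dl` for differenced log series, but that reading is an assumption.

```python
import math

def first_difference(series):
    """Period-over-period change; the result is one observation shorter."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def log_level(series):
    """Natural log of each observation (inputs must be strictly positive)."""
    return [math.log(x) for x in series]

# Hypothetical monthly index values (illustration only).
levels = [100.0, 102.0, 101.0, 105.0]
print(first_difference(levels))             # [2.0, -1.0, 4.0]
print(first_difference(log_level(levels)))  # approximate monthly growth rates
```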

Fundamental Variables:

US net oil imports (EIA)

US oil production (EIA)

US refinery capacity (EIA)

US crude stock including SPR (EIA)

Correlation of Variables with WTI Curves' 1st Principal Component

             C2-C1 C3-C1 C4-C1 pc1.z.97
C2-C1         1.00  0.99  0.97    -0.99
C3-C1         0.99  1.00  0.99    -1.00
C4-C1         0.97  0.99  1.00    -0.99
pc1.z.97     -0.99 -1.00 -0.99     1.00
d10yr.97      0.10  0.09  0.08    -0.09
d2yr.97       0.09  0.08  0.06    -0.08
dvix.97      -0.03 -0.03 -0.03     0.03
dyc.97        0.02  0.03  0.03    -0.02
dtwd.97      -0.11 -0.10 -0.09     0.10
dlei.97      -0.01 -0.02 -0.03     0.02
dhyspd.97    -0.15 -0.14 -0.12     0.14
yc.97         0.02  0.02  0.01    -0.02
hyspd.97      0.22  0.25  0.26    -0.24
llei.97      -0.04 -0.06 -0.07     0.06
us10yr.97    -0.25 -0.27 -0.28     0.27
us2yr.97     -0.17 -0.18 -0.19     0.18
vix.97        0.00  0.03  0.04    -0.02
dnetimp.97   -0.02 -0.02 -0.02     0.02
dloilprod.97 -0.09 -0.07 -0.07     0.08
loilprod.97  -0.08 -0.07 -0.06     0.07
dstock.97    -0.02  0.00  0.01     0.00
lstock.97     0.52  0.53  0.54    -0.53
refcap.97    -0.30 -0.32 -0.33     0.32
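
Each entry in the table above is a pairwise Pearson correlation. A minimal Python sketch of that calculation follows; the two series are made-up numbers rather than data from this analysis, and the function name `pearson` is just an illustrative choice.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up series (illustration only): a component score and a variable that moves against it.
pc1_scores = [-1.0, 0.0, 1.0, 2.0]
variable = [3.0, 2.5, 1.0, 0.5]
print(round(pearson(pc1_scores, variable), 2))
```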

PCA on Macro (level & log-level) Explanatory Variables - Picked top 2

Top 2 principal components explain 85% of the variation.

[Figure: PCA of macro (level & log-level) variables]

PCA on Macro (differenced) Explanatory Variables - Picked top 3

Top 3 principal components explain 79% of the variation.

[Figure: PCA of macro (differenced) variables]

PCA on Fundamental Explanatory Variables - Picked top 3

Top 3 principal components explain 70% of the variation.

[Figure: PCA of fundamental variables]

PCA on All Explanatory Variables - Picked top 5

Top 5 principal components explain 67% of the variation.

[Figure: PCA of all explanatory variables]

Correlation of PCs with PC1 from Curves

pc1.z.97. = PC1 from WTI curves

pcml = PCs from macro data (level & log)

pcmd = PCs from macro data (differenced)

pcf = PCs from fundamental data

pca = PCs from all explanatory variables

           [,1]
pc1.z.97.  1.00
pcml1      0.19
pcml2     -0.05
pcmd1      0.10
pcmd2      0.02
pcmd3     -0.05
pcf1       0.31
pcf2      -0.33
pcf3       0.05
pca1      -0.29
pca2      -0.09
pca3       0.06
pca4      -0.06
pca5      -0.23

Regress PC1 from Curves on PCs from Macro data (level & log-level)

This model has a low adjusted R-squared (0.02), with only one significant variable (pcml1, at the 5% level).


Call:
lm(formula = pc1.z.97 ~ ., data = pc.df1[1:216, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1020 -0.7614 -0.0014  1.1862  3.5487 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.01330    0.11739   0.113   0.9099  
pcml1        0.15043    0.06249   2.407   0.0169 *
pcml2        0.06926    0.09091   0.762   0.4470  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.708 on 213 degrees of freedom
Multiple R-squared:  0.03031,   Adjusted R-squared:  0.02121 
F-statistic: 3.329 on 2 and 213 DF,  p-value: 0.0377
  • asterisks denote level of significance
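
The summary figures in the regression output above are linked by two standard formulas, which makes them easy to sanity-check. The Python sketch below recomputes the adjusted R-squared and the overall F-statistic from the reported multiple R-squared. It assumes 216 observations (the rows selected by `pc.df1[1:216, ]`, which is consistent with the 213 residual degrees of freedom) and 2 regressors; the function names are illustrative.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def f_statistic(r2, n_obs, n_predictors):
    """Overall F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / n_predictors) / ((1.0 - r2) / (n_obs - n_predictors - 1))

# Values read off the output above: R^2 = 0.03031, 216 rows, 2 regressors.
adj = adjusted_r_squared(0.03031, 216, 2)
f_stat = f_statistic(0.03031, 216, 2)
print(round(adj, 4), round(f_stat, 2))
```

Both results agree with the reported 0.02121 and 3.329 up to rounding of the inputs.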

Regress PC1 from Curves on PCs from Macro data (level & log-level) - Cont'd

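This model suffers from positive serial correlation (DW test) and heteroskedasticity (BP test), so the OLS standard errors and significance levels on the previous slide should be treated with caution.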


    Durbin-Watson test

data:  pc.df1.model
DW = 0.29692, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

    studentized Breusch-Pagan test

data:  pc.df1.model
BP = 7.5294, df = 2, p-value = 0.02317
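
To make the DW figure easier to interpret, the sketch below computes the Durbin-Watson statistic from a residual series and applies the common rule of thumb DW ≈ 2(1 - rho), where rho is the first-order autocorrelation of the residuals. It is plain Python with made-up residuals; it does not reproduce the p-value, and the function names are illustrative.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest no first-order autocorrelation."""
    num = sum((curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den

def implied_autocorrelation(dw):
    """Rule-of-thumb approximation only: DW is roughly 2 * (1 - rho)."""
    return 1.0 - dw / 2.0

# Made-up residuals that change sign rarely, i.e. are positively autocorrelated.
toy_residuals = [1.0, 1.2, 0.9, 1.1, -1.0, -1.1, -0.8, -1.2]
print(round(durbin_watson(toy_residuals), 3))
print(round(implied_autocorrelation(0.29692), 2))  # using the DW value reported above
```

Under that approximation, a DW of about 0.30 corresponds to a residual autocorrelation of roughly 0.85, which is why the test rejects so decisively.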

Regress PC1 from Curves on PCs from Macro differenced data

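This model has essentially no explanatory power: no coefficient is significant, the adjusted R-squared is about 0.0002, and the F-test p-value is 0.39. No further analysis is necessary.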


Call:
lm(formula = pc1.z.97 ~ ., data = pc.df2[1:216, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1480 -0.9312  0.1153  1.2024  3.5891 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01796    0.11752   0.153    0.879
pcmd1        0.09946    0.06770   1.469    0.143
pcmd2        0.04876    0.09704   0.502    0.616
pcmd3       -0.09015    0.10613  -0.849    0.397

Residual standard error: 1.726 on 212 degrees of freedom
Multiple R-squared:  0.01417,   Adjusted R-squared:  0.000224 
F-statistic: 1.016 on 3 and 212 DF,  p-value: 0.3865

Regress PC1 from Curves on PCs from Fundamental Data

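The original model's 3rd PC was not significant, so it was dropped. The resulting model fits much better than the macro models (adjusted R-squared of 0.19), and both remaining PCs are significant at the 0.1% level.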


Call:
lm(formula = pc1.z.97 ~ pcf1 + pcf2, data = pc.df3[1:216, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7117 -0.8285  0.0592  0.9765  2.9797 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01873    0.10699  -0.175 0.861211    
pcf1         0.30846    0.09216   3.347 0.000965 ***
pcf2        -0.55765    0.08909  -6.259  2.1e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.551 on 213 degrees of freedom
Multiple R-squared:    0.2, Adjusted R-squared:  0.1925 
F-statistic: 26.63 on 2 and 213 DF,  p-value: 4.776e-11

Regress PC1 from Curves on PCs from Fundamental data - Cont'd

However, this model suffers from serial correlation (DW test) and heteroskedasticity (BP test).


    Durbin-Watson test

data:  pc.df3.model2
DW = 0.51641, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

    studentized Breusch-Pagan test

data:  pc.df3.model2
BP = 8.7839, df = 2, p-value = 0.01238

Regress PC1 from Curves on PCs from All Explanatory Data

The original model's 2nd and 4th PCs were not significant, so they were dropped. The remaining PCs are all significant at the 5% level, but the fit is poor (adjusted R-squared of 0.11).


Call:
lm(formula = pc1.z.97 ~ pca1 + pca3 + pca5, data = pc.df4[1:216, 
    ])

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6854 -0.8598 -0.0455  1.0755  3.5761 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.008501   0.112829   0.075  0.94001   
pca1        -0.183266   0.055491  -3.303  0.00112 **
pca3         0.198078   0.082525   2.400  0.01725 * 
pca5        -0.315239   0.097390  -3.237  0.00140 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.626 on 212 degrees of freedom
Multiple R-squared:  0.126, Adjusted R-squared:  0.1136 
F-statistic: 10.18 on 3 and 212 DF,  p-value: 2.707e-06

Regress PC1 from Curves on PCs from All Explanatory Data - Cont'd

However, this model suffers from serial correlation (DW test) and heteroskedasticity (BP test).


    Durbin-Watson test

data:  pc.df4.model2
DW = 0.41739, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

    studentized Breusch-Pagan test

data:  pc.df4.model2
BP = 12.034, df = 3, p-value = 0.007268

Conclusion

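While the 1st PC of the WTI curves most likely represents the term structure, since it explains 99% of the variation across all 3 curves, the macro and fundamental explanatory variables selected here explained little of it: the best regression (on the fundamental PCs) reached an adjusted R-squared of only 0.19, and every model tested for serial correlation and heteroskedasticity showed both.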

Perhaps the right relationship wasn't explored in this analysis and more work can be done. I also think Partial Least Squares (PLS) can improve on model specification. PLS is similar to PCA but takes into consideration the dependent variable you are trying to explain, while PCA just focuses on the matrix at hand - the independent variables only in this case.
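
To illustrate the difference, the sketch below computes only the first PLS weight vector, which is proportional to the covariance between each centered predictor and the dependent variable. This is a toy Python illustration on made-up data, not a full PLS implementation and not part of the analysis above; the function names are hypothetical.

```python
import math

def center(column):
    """Subtract the mean from every value in the column."""
    mean = sum(column) / len(column)
    return [x - mean for x in column]

def first_pls_weights(columns, y):
    """First PLS weight vector: covariance of each centered predictor with y, scaled to unit length."""
    yc = center(y)
    cov = [sum(a * b for a, b in zip(center(col), yc)) for col in columns]
    norm = math.sqrt(sum(c * c for c in cov))
    return [c / norm for c in cov]

# Toy data: x1 tracks y exactly; x2 has far larger variance but is unrelated to y.
y = [1.0, 2.0, 3.0, 4.0]
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [10.0, -10.0, -10.0, 10.0]
weights = first_pls_weights([x1, x2], y)
print([round(w, 3) for w in weights])
```

Here all of the weight goes to x1, the predictor that actually co-moves with y. A covariance-based PCA of the same two columns would instead point its first component along x2, simply because x2 has the larger variance.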

Thank you for your time.