Analyisis of Oil Prices by Neil Shah

Introduction and Motivation

Oil and fossil fuels play a crtical part of modern life, fueling transpotation, powering industrial activity and driving global economic growth. Energy independence is a common theme in US rheotric. and the nation has have had a love-hate relationship with the commodity. During the 1970s oil embargo prices for fuel sky rocketed hobbling the US economy, and the spectre of “foreign oil” permeated through the national dialogue, with an “added tax” added to the consumer. However this relationshp shifted in 2008, when the “Shale Revolution”, a domestic boom in oil production, helped turn the US from an net importer to net exporter of oil. In recent time the administration has tweeted about oil prices as both a tax cut and integral part of the economy, and the recent crash in oil prices have highlighted the importance of oil in the US.

The goal of this report is to analyze the domestic impact on the United States economy in a top down macro-micro economic fashion–first looking at the impact on consumers, then on individual oil rich states and finally contrasting two cities within a state.

Specifically: Datasets will be split between “high oil prices” and “low oil prices”, and samples measured by statistical correlation.

Is high or low oil prices good for the US economy?

Methodology and Data Work Flow

While the analysis will be split into three focus area, they will all share the following work flow.

Research Question/Motivation:
Identification of Scope: Identifying the data of interest
Initial Data Acquistion: Acquiring relevant data through API, reading csv or other means.
Data Cleaning/Initial Exploratory Data Analysis: Cleaning and munging the data, and visualization of trends.
Modeling and Analysis: Testing research hypothesis/question through visual modeling
Evaluating Results and Conclusion: Evaluating research question and results in context.

Packages

The following packages are used for analysis

#Quandl for API 
library(Quandl)
#our good friend tidyverse
library(tidyverse)
library(readxl)
#Zoo for time series
library(zoo)
#Correlation plots 
library(corrplot)
source("http://www.sthda.com/upload/rquery_cormat.r")
#Statistical
library("Hmisc")

Data Sources and Management

Data will be acquired in the following distinct fashions

Reading of organized tabulated data found online through various soruces
Ingesting data through API calls via Quandl–notably the EIA [Energyy Information Agency], Bureau of Labor & Statistics, and other data sources.

Data will be stored in dataframes for ease of analysis, with all master data hosted on project GitHub

Quandl API

Quandl is a unique aggregator of various economic data that provides interfacing through an API which has documentation here. By specifying the type of data and then using a unique code, data can be ingested relatively simply. Users needs to register for the API, and authorize it via key, and use the QuandL R package, which was loaded earlier.

The author has saved his unique API key to the variable “key” which has been loaded without echo in this markdown–for reproducible results, one needs to do the same

For example going to the EIA webpage on Cushing oil Data one can see from the EIA url that the API key is “W_EPC0_SAX_YCUOK_MBBL&f=W”.

The data set can be passed to Quandl API via R.

cushing <-Quandl("EIA/PET_W_EPC0_SAX_YCUOK_MBBL_W", api_key=key)
head(cushing)

##         Date Value
## 1 2020-05-08 62444
## 2 2020-05-01 65446
## 3 2020-04-24 63378
## 4 2020-04-17 59741
## 5 2020-04-10 54965
## 6 2020-04-03 49241

Depending on the query, Quandl API can also have other parameters such as time-range or even format (which we will see below.)

For example specifying a start-date parameter below only queries data for 2005 onward.

crude <-Quandl("CHRIS/CME_CL3.6", type='zoo',start_date='2004-12-31', api_key=key)
head(crude)

## 2005-01-03 2005-01-04 2005-01-05 2005-01-06 2005-01-07 2005-01-10 
##      42.37      44.05      43.55      45.63      45.47      45.25

By chaining and manipulating the query, data sets can be generated which will be used extensively in this report.

Preliminaries: Time Series Analysis and Correlation

R [and the quandl API] have two unique objects for dealing with time-series. The “zoo” and “ts” time-series object; A “zoo” object is a special class object that consists of a series (or dataframe) where the index is time stamps, which may be irregular. Full documentation is here The “ts” is similar object but has fixed increments.

Dataseries can be formatted via type in the API call to “zoo”

Quandl("EIA/PET_W_EPC0_SAX_YCUOK_MBBL_W",type='zoo', api_key=key) %>% head()

## 2004-04-09 2004-04-16 2004-04-23 2004-04-30 2004-05-07 2004-05-14 
##      11677      12512      12905      12731      13631      14358

Either way the resultant is a vector with index/column time-stamps–the base sampling period is Weekly for EIA data.

Times series present their own nuances–specifically that time series data is usually compoesed of additive components of seasonality, trend and a randomness; as seen below.

More importantly another challenge is how to compare data on various time scales?

Resampling Time Series

One way solution is to transform this data is to re-sample said data to another granularity–an up-sampling involves converting a time series to a higher time frame (days to months for example) while down sampling would be the opposite (months to days.) Up sampling essentially averages out your data and smooths it out while down sampling, uses extrapolation and relies on the user for how to backfill values.

Due to the varied time measurements of the datasets in this analysis, this presents a challenge and thus the author sought to find a way to quickly resample the data, so direct comparison can be made in the same time frame.

One method could simply taking average over a time-series but that assumes there are fixed increments, and an easy cut off. Another solution the author found was using the xts object, an extension of zoo, which allows resampling based on months, and years, and other time frames. Full documentation is here.

For example calling the same cushing data as a zoo and using “as.xts()’ and”to.monthly()" makes a monthly sampled data from our weekly, automatically aggregating the data by the passed time period.

Quandl("EIA/PET_W_EPC0_SAX_YCUOK_MBBL_W",type='zoo', api_key=key) %>% as.xts() %>% to.monthly() %>% head()

##          ..Open ..High ..Low ..Close
## Apr 2004  11677  12905 11677   12731
## May 2004  13631  15971 13631   15971
## Jun 2004  16521  16521 11810   11810
## Jul 2004  15297  16577 15236   15236
## Aug 2004  14767  14767 14215   14395
## Sep 2004  15896  15896 13968   14094

The data is recast in OHLC or Open High Low Close format where Open is the price at the beginning of the original sample, high/low reflect the max/min values over the resampled range and close is the ending. To approximate the price over the period we can use the RowMeans function.

In order to expedite data munging the author made the following function can take a resample dataframe, calcualte the mean, drop the other columns, and rename it to a passed name.

cleaner <-function(dataframe,name)
{
  dataframe$mean<-rowMeans(dataframe)
  dataframe <- subset(dataframe, select = c(names(dataframe)[5]))
  names(dataframe)<-c(name)
  return(dataframe)
}

Using the above function to clean the previous Quandl function provides a clean dataset.

cushing <-Quandl("EIA/PET_W_EPC0_SAX_YCUOK_MBBL_W",type='zoo', api_key=key)
head(cleaner(to.monthly(cushing),'Cushing'))

##          Cushing
## Apr 2004 12247.5
## May 2004 14801.0
## Jun 2004 14165.5
## Jul 2004 15586.5
## Aug 2004 14536.0
## Sep 2004 14963.5

Merging time-series zoo objects can be accomplished via “merge()” function which joins time series objects in an outerjoin fashion on the index, which will populate columns with shared index and NAs for other values. This is a handy feature as by using a na.trim() function, which removes rows with NA, the dataset will automatically be truncated to a frame where there are shared values.

This method of resampling, transforming, merging and then truncating data will be used extensively to generate the core datasets for analysis.

Finally plotting time series zoo can be achieved via a normal plot or plot.zoo

plot(cleaner(to.monthly(cushing),'Cushing'),main='Cushing')

Statistical Correlation

To quantify the strength of relationship between time-series, the Spearman correlation coefficient will be calculated and compared. Simialr to the Pearson, the Spearman coefficiant (rho) ranges from strong negative relationship -1 to a strong positive relationship, +1; with 0 being no relationship. However the Spearman coefficient is non-paramaetric, that is it doesn’t require a normal distribution underlying like the Pearson, and therefore appropriate to use for time-series, which have components of trend, season that preclude normal distribution.

It’s important to note that

Spearman coefficient will tell the strength of monotonic relationship but not type–linear or what not.
It isn’t a robust statistic or metric used to judge one model from another, since it’s robust–if models were build RMSE could be used. That is beyond the scope of this report

For this analysis however this should be sufficient to evaluate general causality.

Other Data Considerations

Preliminary: Oil Data

Naturaly we’ll need oil pricing data to start this analysis–and there are various ways to obtain it but the easiest was through the Quandl API, which hosts databases for various statistics and economic activity. The price of oil can be proxied by the WTI futures (West Texas Intermediate) oil contract, which trades on the NYMEX. While the intracies of a futures contract is beyond the scope of this report, WTI will be chosen as a proxy, specifically the M+3 contract, which is the month+3 constract, representing the price of oil contract 3 months (or a quarter) in the future. This timing is used as it directly impacts the forward business cycle.

Key Assumptions Made: M+3 (next quarter) WTI oil derivative price is proxy for oil impact.

Loading Oil Data

Using the Quandl API a dataframe for crude can be generated along with a quick visualization of historical oil prices. The contract code (CL3 for 3rd month Crude) is married with the CME (exchange) and passed to the CHRIS (continuous future database).

Using Quandl a query can pull the last 20 years of data (start date of approximately 12-20 ).

crude <-Quandl("CHRIS/CME_CL3.6", type='zoo', start_date='2000-12-01',api_key=key)
plot.zoo(crude,main='Crude +3',xlab='Time',ylab='Price $ per bbl')

The plot shows the trend of crude oil prices for the 20th century–prices started off low and then exponentially increased until hitting a zenith of $140 in 2007, followed by a sharp crash From 2010-2015 oil prices remained elevated until another crash, at the beginning of 2015, followed by a period of lower prices. Most recently oil prices crashed at the start of 2020.

Oil Regimes

Focusing on the 2010 onward portion of the plot, and resampling crude prices on a monthly basis.

crude <-Quandl("CHRIS/CME_CL3.6", type='zoo', start_date='2009-12-01',api_key=key)
crude <-to.monthly(crude)
crude <-cleaner(crude,'Oil Price')
plot.zoo(crude,main='Crude +3',xlab='Time',ylab='Price $ per bbl')

Two distinct regions are now apparent

2010-end of 2014: Oil prices above $80–or “expensive oil prices”
2015 onward: Oil prices below $50 or “cheap oil prices”.

These two regime time-periods will serve as the basis for comparisons.

Given that the crude dataset has 126 rows and picking index 60 and 70, which mark off $72 and higher crude and approximately $45

These index locations can serve as the splits for testing out the expensive vs cheap crude.

Comparison Metrics

In order to provide a top-down analysis, the author has chosen metrics of comparison that cover macro (consumer), states and micro (city).

Macro (Consumer)

There isn’t a single metric to define “average consumer” but the author used the following as proxies. Oil is used for transportation fuels, fertilizer/agriculture/feed and plastic products, changes in oil prices could manifest in these metrics. One way to measure this is to compare oil prices to a price index, which is the measure of the changes of a basket of goods. The following metrics were used

Food Consumer Price Index: Prince index covering food/groceries–since oil is used to grow plants and feed animals, there could be an impact on input costs to what we eat.
Disposable Income: Index for how much money consumers have extra to spend–if oil is a burden/boon to consumer, this could reflect a gain/tax on consumers.
Transportation Price Index: Price index on fuel/public transportation cost: since oil is used as transportation fuel, there could be a relationship.

Data Sources

These datasets are compiled by the Federal Reserve Economic Department (FRED) which houses data on it’s website, and has an API for query, which connection via Quandl.

State Based Metrics:

While it’s no suprise that Saudi Arabia, Russia and Nigeria are among global oil producing countries, previous analysis has shown that the United States has emerged as a signfiicant producer of oil. Naturaly the US is far more diversed in economies than OPEC+ countries, that rely primarily on oil revenue, however there are still states that have significant exposure to the oil markets. This portion will attempt to find relationships between ’oil states" and how the commodity influences their economies.

The EIA provides the following graph on breakdown for oil.

Economists use job and labor statistics to help evaluate relations between markets and broader state economies.. A similar analysis will be used to evaluate oil prices impact on Texas and North Dakota based on the unemployment rate.

Unemployment Rate: Percentage of labor that isn’t actively employed, a wide proxy for economy state

Data Sources

One rich source of data is the BLS that classifies all industries and collect information on wages, and employment. These datasets are compiled by the Bureau of Labor and Statistics (BLS) which houses data on it’s website, and has an API for query, which connection via Quandl.

Micro (City) based analysis

Finally to drill down into the most micro-scale economy, it would be interesting to see how local cities are impact by the price of oil. Specifically comparing the economies of Midland, TX, deep in West Texas oil patch and Houston, TX the largest city in Texas.

Unemployment Rate: Percentage of labor that isn’t actively employed within each city.

Similar to the state , unemployment rates for Midland, TX and Houston, TX will be compared to the two regimes of oil prices.

Data Sources

Unemployment rates are available via FRED but there is no API call–thus the .csv files were loaded onto the project GitHub

Initial Data Acquistion

All the unique data series were loaded via API calls or CSV scrapes.

income <-Quandl("FRED/DSPI",type='zoo',api_key=key)
food=Quandl("FRED/PPIFDF",type='zoo',api_key=key)
transport=Quandl("FRED/CPIETRANS",type='zoo',api_key=key)
ND=Quandl("BLSE/LAUST380000000000003",type='zoo',api_key=key)
TX=Quandl("BLSE/LAUST480000000000003",type='zoo',api_key=key)
Midland=read.csv.zoo('https://raw.githubusercontent.com/shahneilp/DATA608/master/Project/midland.csv')
Houston=read.csv.zoo('https://raw.githubusercontent.com/shahneilp/DATA608/master/Project/houston.csv')

Data Cleaning

Resampling all the data to make sure they are in the same timeframe and cleaning via our helper.

food <-to.monthly(food)
food <-cleaner(food,'Food')
income <-to.monthly(income)
income <-cleaner(income,'Income')
transport <-to.monthly(transport)
transport <-cleaner(transport,'Transport')
ND<-to.monthly(ND)
ND <-cleaner(ND,'North Dakota')
TX<-to.monthly(TX)
TX <-cleaner(TX,'Texas')
Midland <-to.monthly(Midland)
Houston <-to.monthly(Houston)
Midland <-cleaner(Midland,'Midland')
Houston <-cleaner(Houston,'Houston')

Finally merging all the data sets with crude oil, and trimming NA values

df <- merge(crude,food,income,transport,ND,TX,Midland,Houston)
df <-na.trim(df)

Exploratory Data Analysis

Plotting the series and making use of the following best practices:

Data Ink Ratio: Remove unecessary colors
Chart Junk: Removing scales and ticks just to show the time evolution
Small Multiples to show many variables at the same time

plot(df,main='Oil Price relationships',xlab='Time',yaxt='n',xaxt='n')

The above visualization juxtaposes the Oil price with each of our metrics on the larger time scale. Visually there seems to be an obvious correlation between Crude oil prices and Transportation, but the other variables are less clear, and there is no wau to quantify the relationship. More importantly this method doesn’t tell us any more information about the low/high crude oil regimes.

We can split up each regime into it’s low and high regime via slicing the dataframe.

highoil <-df[1:70]
lowoil <-df[85:120]

Correlation Analysis

With the datasets now separtaed by high and low oil price, we can recast them in a series of plots to get a more granular looks.

Correlation can be calculated and tabluated via a table like the following:

cor(highoil)

##                Oil Price        Food     Income  Transport North Dakota
## Oil Price     1.00000000 -0.01599129 -0.4413433  0.6053968    0.1342111
## Food         -0.01599129  1.00000000  0.8641046  0.6927194   -0.6354154
## Income       -0.44134335  0.86410458  1.0000000  0.3360379   -0.6388786
## Transport     0.60539677  0.69271940  0.3360379  1.0000000   -0.4223503
## North Dakota  0.13421114 -0.63541535 -0.6388786 -0.4223503    1.0000000
## Texas         0.42998356 -0.82438772 -0.9564569 -0.2734683    0.6351877
## Midland       0.05852562 -0.93902093 -0.8653130 -0.5830499    0.6555254
## Houston       0.36343087 -0.85662049 -0.9490365 -0.3358234    0.6370820
##                   Texas     Midland    Houston
## Oil Price     0.4299836  0.05852562  0.3634309
## Food         -0.8243877 -0.93902093 -0.8566205
## Income       -0.9564569 -0.86531296 -0.9490365
## Transport    -0.2734683 -0.58304994 -0.3358234
## North Dakota  0.6351877  0.65552537  0.6370820
## Texas         1.0000000  0.89110900  0.9948196
## Midland       0.8911090  1.00000000  0.9218315
## Houston       0.9948196  0.92183146  1.0000000

cor(lowoil)

##                Oil Price       Food     Income  Transport North Dakota
## Oil Price     1.00000000 -0.2505667  0.3307491  0.7394208  -0.04960112
## Food         -0.25056666  1.0000000  0.6543135  0.3099155  -0.40515444
## Income        0.33074914  0.6543135  1.0000000  0.7454053  -0.42759771
## Transport     0.73942081  0.3099155  0.7454053  1.0000000  -0.42892950
## North Dakota -0.04960112 -0.4051544 -0.4275977 -0.4289295   1.00000000
## Texas        -0.32954812 -0.5880237 -0.8177779 -0.6980964   0.61194919
## Midland      -0.50503832 -0.4765928 -0.8090775 -0.7748388   0.52333661
## Houston      -0.34069426 -0.6267049 -0.8941161 -0.7371188   0.57821111
##                   Texas    Midland    Houston
## Oil Price    -0.3295481 -0.5050383 -0.3406943
## Food         -0.5880237 -0.4765928 -0.6267049
## Income       -0.8177779 -0.8090775 -0.8941161
## Transport    -0.6980964 -0.7748388 -0.7371188
## North Dakota  0.6119492  0.5233366  0.5782111
## Texas         1.0000000  0.8908496  0.9805283
## Midland       0.8908496  1.0000000  0.9046031
## Houston       0.9805283  0.9046031  1.0000000

However this table format is all numeric and it’s difficult to discern anything without doing tedious math.

The solution is to use a correlation matrix which scales the numerical correlation values with a color scheme, so trends can be identified without calculation.

Combining the correlation matrix with the multiple time-series provides a powerful visualzation to not only see trends, but also quickly gauge correlation relationships on visual inspection, and numerical strength.

Another added benefit of this technique is that cross-correlations between other variables can be identified as well [the author will not go into them since the 14 distinct combinations is beyond the scope of this report]

High Oil Prices

plot(highoil,main='High Oil Price Regime',xlab='Time',yaxt='n',xaxt='n')

corrplot(cor(highoil, method = c("spearman")))

Low Oil Prices

plot(lowoil,main='Low Oil Price Regime',xlab='Time',yaxt='n',xaxt='n')

corrplot(cor(lowoil, method = c("spearman")))

Conclusions

Macro Impact: Conclusions

The price of of food rises with high oil prices but has no correlation during low oil prices.

This is particularly interesting showing that under a certain threshold, high oil prices as possible inputs for fertilizer/fuel make their way into food prices. However there is a floor to this effect as as oil prices decreases, there is no relationship.

The price of transportation correlates with oil prices in both regimes, but is strongest during low oil regimes.

This makes sense given that oil is an input for all transportation vehicles but it seems that during the low price regimes, savings are passed on to the consumer. From experience most transportation tack on fuel surcharges to guess-timate their fuel costs, which might overshoot true cost. Once oil prices reach a threshold, those fuel surchages might be rescinded, and it benefits the consumers more.

Consumers have a higher disposable income correlation in low oil price regimes than in high price regime.

This result could be due to higher oil prices means higher prices for transport/food, as mentioned above–but in a low oil environment it’s hard to make a case.

State Impact: Conclusions

Texas:

Higher Oil prices actually have a negative correlation with unemployment rate. This isn’t suprising as it implies that higher oil prices means more jobs overall for Texas and less unemployment, which aligns with Texas being the most prolific oil producer, especially when prices are booming.
Low oil prices had little relationship with unemployment rate. What’s interesting is low oil prices had no overall impact on the unemployment rate–signifying that Texas’s economy is much more diverse than just oil & gas, and that depression in prices doesn’t always depress the overall economy.

North Dakota:

High oil prices had a weaker and negative correlation with unemployment rates.
Low oil prices had a weaker and negative correlation with unemployment rate.

Comparing both states, overall Texas is better positioned to take advantage of higher oil prices and grow it’s economy, probably due to the larger population and size of the state–this is evident in the stronger correlation during high oil prices in Texas, than North Dakota. Furthermore Texas is shielded from low oil prices as the there is less negative relationship with unemployment rate in low oil prices, than North Dakota.

While Texas is the oil king on the throne, it doesn’t solely rely on it despite it’s production status.

City Impact: Conclusions

Midland

Midland had negative correlations in both cases which makes sense given that as oil prices go up, unemployment rate goes down; however the stronger negative for high oil prices indicate more sensitivity.

Houston

Houston has a negative correlation with oil prices once again Texas being oil centric (from the previous section) the sign makes sense, and that higher oil prices have a bigger impact on brining more people to work.

Overall this analysis points to Midland being more sensitive to oil prices than Houston, given more negative correlation coefficients in both cases. This isn’t suprising given that

Houston is a large multi-economic city
Midland is primarily in oil region of Texas, and hence the economy is more impacted by the industry.

Overall Conclusions

This analysis has sought to answer the role of oil in the broader US economy and it’s importance by factoring in low/high prices and drew the following final insights.

Overall:

1. The broader consumer is penalized by higher oil prices but there is a marginal benefit in a fall of low oil prices, at the expense of oil focused states/cities.

2. Specialized economies that focus on oil disproportionately are impacted by oil prices and research suggest that diversification helps blunt the economic impact of low prices.

In addition this research helps point that there is a natural sweet spot for oil prices between 50-60$ in which the consumer benefits yet oil producing economies are not impacted severely–it is no suprise that this number is the common profit goal for domestic producers.

Future Work Suggestion

Different Oil Contracts Analysis: The use of different futures contracts by time and location could identify locational nuances and which factors are more sensitive to duration
Predictive Modeling: Given that correlation is establishd here, a multfactor model can be developed to try to predict price/metric based on factors
Advanced Metrics: If models are built, recall R^2 isn’t the basis for comparison a more robust statistical like RMSE is appropriate.

Introduction and Motivation

Methodology and Data Work Flow

Packages

Data Sources and Management

Quandl API

Preliminaries: Time Series Analysis and Correlation

Resampling Time Series

Statistical Correlation

Other Data Considerations

Preliminary: Oil Data

Loading Oil Data

Oil Regimes

Comparison Metrics

Macro (Consumer)

Data Sources

State Based Metrics:

Data Sources

Micro (City) based analysis

Data Sources

Initial Data Acquistion

Data Cleaning

Exploratory Data Analysis

Correlation Analysis

High Oil Prices

Low Oil Prices

Conclusions

Macro Impact: Conclusions

State Impact: Conclusions

Texas:

North Dakota:

City Impact: Conclusions

Midland

Houston

Overall Conclusions

Future Work Suggestion

References