Most of the data used for this analysis was downloaded directly from the EIA website; links to each source are listed below.

Data Codebook

Descriptions of the variables, their units of measurement, data categories and data sources can be found in the tables below:

Codebook Variable Description

Variable Name	Description
Date	Date of the observation
Month	Month of the observation
Year	Year of the observation
Active_Rigs	Number of active onshore and offshore drilling rigs recorded as working on that date
US_Population	US population estimate for the date of the observation
President_Party	Political party of the sitting US president on the date of the observation
Senate	Political party that holds the majority in the US Senate on the date of the observation
House	Political party that holds the majority in the US House of Representatives on the date of the observation
life_exp	US average life expectancy at the date of the observation
GDP	Gross Domestic Product of the US economy on the date of the observation
Oil_Production	US Total oil production on the date of the observation
Oil_Stock	US inventory of oil in stock on the date of the observation
Oil_Price	US average domestic first purchase price, not corrected by inflation
HDD	Monthly US heating degree days (HDD): the number of degrees that the daily average temperature falls below 65 °F
CDD	Monthly US cooling degree days (CDD): the number of degrees that the daily average temperature rises above 65 °F
Energy_Consumption	The amount of energy consumed by the US market on the date of the observation
Net_imports_Energy	The net amount of energy imported by the US market on the date of the observation
Oil_Reserves	US proven oil reserves on the date of the observation
Active_Conflicts	Number of active military conflicts worldwide on the date of the observation, including civil wars
Drilled_Wells	Number of wells drilled in the US on the date of the observation

Codebook Variable Units of measurement

Variable Name	Units of measurement
Date	Date format Year-Month-Day
Month	Integer from 1-12
Year	Integer 4 digits
Active_Rigs	Integer each unit represents a drilling rig
US_Population	Numerical in Millions of people
President_Party	Categorical variable Either Democrat or Republican
Senate	Categorical variable Either Democrat or Republican
House	Categorical variable Either Democrat or Republican
life_exp	Numerical life expectancy in years
GDP	Numerical GDP in trillions of US dollars (10^12 USD)
Oil_Production	Numerical Oil production in Thousand barrels per day
Oil_Stock	Numerical US inventory of oil in million barrels
Oil_Price	Numerical US dollars
HDD	Numerical Monthly US HDD measured in °F
CDD	Numerical Monthly US CDD measured in °F
Energy_Consumption	Numerical Energy consumed in quadrillion Btu (10^15 Btu)
Net_imports_Energy	Numerical Net energy imported in quadrillion Btu (10^15 Btu)
Oil_Reserves	Numerical Oil reserves in million barrels
Active_Conflicts	Numerical each unit represents an active military conflict
Drilled_Wells	Numerical each unit represents a drilled well

Codebook Variable Source of data

Variable Name	Source of data
Date	Included with other measurements
Month	Included with other measurements
Year	Included with other measurements
Active_Rigs	Data Link
US_Population	Data Link
President_Party	Data Link
Senate	Data Link
House	Data Link
life_exp	Data Link
GDP	Data Link
Oil_Production	Data Link
Oil_Stock	Data Link
Oil_Price	Data Link
HDD	Data Link
CDD	Data Link
Energy_Consumption	Data Link
Net_imports_Energy	Data Link
Oil_Reserves	Data Link
Active_Conflicts	Data Link
Active_Conflicts 2	Data Link
Drilled_Wells	Data Link

The raw data in its original .csv and .xls formats can be found at the following Link
The data wrangling code used in this work can be found at the following Link
The cleaned-up .RData files can be found at the following Link

Objective

The main purpose of this work is to create models that can predict rig activity in the US (Active_Rigs) and the behavior of the oil price (Oil_Price), based on various explanatory variables. To achieve this objective, it is first necessary to develop insight into the variables that might influence the behavior of both response variables.

Exploratory data analysis

The first step in our analysis is to load the dataset and explore its structure, its descriptive statistics, and whether or not the observations in the dataset are complete.

'data.frame':   552 obs. of  20 variables:
 $ Date              : Date, format: "1973-01-01" "1973-02-01" ...
 $ Month             : Factor w/ 12 levels "01","02","03",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Year              : num  1973 1973 1973 1973 1973 ...
 $ Active_Rigs       : num  1219 1126 1049 993 1046 ...
 $ US_Population     : num  212 212 212 212 212 ...
 $ President_Party   : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 2 2 2 2 2 2 ...
 $ Senate            : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
 $ House             : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
 $ life_exp          : num  71.4 71.4 71.4 71.4 71.4 ...
 $ GDP               : num  1.43 1.43 1.43 1.43 1.43 ...
 $ Oil_Production    : num  10856 11140 11006 11042 11002 ...
 $ Oil_Stock         : num  237 235 244 248 257 248 243 248 241 246 ...
 $ Oil_Price         : num  3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 ...
 $ HDD               : int  965 843 552 402 215 26 8 11 74 246 ...
 $ CDD               : int  5 3 19 13 62 214 297 280 142 50 ...
 $ Energy_Consumption: num  7.23 6.59 6.52 5.94 6.07 ...
 $ Net_imports_Energy: num  1.047 1.047 1.17 0.891 0.966 ...
 $ Oil_Reserves      : num  35300 35300 35300 35300 35300 35300 35300 35300 35300 35300 ...
 $ Active_Conflicts  : int  52 52 52 52 52 52 52 52 52 52 ...
 $ Drilled_Wells     : int  2166 1896 2151 1883 2298 2279 2367 2690 2333 2538 ...

      Date                Month          Year       Active_Rigs    
 Min.   :1973-01-01   01     : 46   Min.   :1973   Min.   : 407.0  
 1st Qu.:1984-06-23   02     : 46   1st Qu.:1984   1st Qu.: 856.8  
 Median :1995-12-16   03     : 46   Median :1996   Median :1184.5  
 Mean   :1995-12-16   04     : 46   Mean   :1996   Mean   :1435.0  
 3rd Qu.:2007-06-08   05     : 46   3rd Qu.:2007   3rd Qu.:1851.5  
 Max.   :2018-12-01   06     : 46   Max.   :2018   Max.   :4521.0  
                      (Other):276                                  
 US_Population     President_Party        Senate           House    
 Min.   :211.9   Democrat  :240    Democrat  :288   Democrat  :312  
 1st Qu.:235.8   Republican:312    Republican:264   Republican:240  
 Median :267.8                                                      
 Mean   :268.6                                                      
 3rd Qu.:301.2                                                      
 Max.   :327.0                                                      
                                                                    
    life_exp          GDP         Oil_Production    Oil_Stock     
 Min.   :71.36   Min.   : 1.429   Min.   : 5488   Min.   : 233.0  
 1st Qu.:74.56   1st Qu.: 4.041   1st Qu.: 7769   1st Qu.: 763.8  
 Median :75.82   Median : 7.882   Median : 9160   Median : 898.0  
 Mean   :75.98   Mean   : 9.027   Mean   : 9323   Mean   : 823.4  
 3rd Qu.:77.99   3rd Qu.:14.419   3rd Qu.:10252   3rd Qu.:1003.2  
 Max.   :78.84   Max.   :20.660   Max.   :16289   Max.   :1230.0  
                                                                  
   Oil_Price           HDD              CDD        Energy_Consumption
 Min.   :  3.89   Min.   :   3.0   Min.   :  2.0   Min.   :5.438     
 1st Qu.: 14.01   1st Qu.:  45.0   1st Qu.: 12.0   1st Qu.:6.592     
 Median : 23.51   Median : 310.0   Median : 42.0   Median :7.552     
 Mean   : 33.28   Mean   : 378.9   Mean   :103.6   Mean   :7.382     
 3rd Qu.: 45.19   3rd Qu.: 660.0   3rd Qu.:183.8   3rd Qu.:8.048     
 Max.   :128.08   Max.   :1266.0   Max.   :404.0   Max.   :9.655     
 NA's   :1        NA's   :2        NA's   :2       NA's   :2         
 Net_imports_Energy  Oil_Reserves   Active_Conflicts Drilled_Wells 
 Min.   :0.06212    Min.   :19121   Min.   : 52.0    Min.   :1268  
 1st Qu.:0.94186    1st Qu.:22351   1st Qu.: 83.5    1st Qu.:2288  
 Median :1.30195    Median :26544   Median :132.5    Median :2914  
 Mean   :1.39765    Mean   :26705   Mean   :121.8    Mean   :3482  
 3rd Qu.:1.85828    3rd Qu.:30529   3rd Qu.:151.2    3rd Qu.:4208  
 Max.   :2.74169    Max.   :39160   Max.   :205.0    Max.   :8556  
 NA's   :2          NA's   :12      NA's   :24       NA's   :96

              Date              Month               Year 
                 0                  0                  0 
       Active_Rigs      US_Population    President_Party 
                 0                  0                  0 
            Senate              House           life_exp 
                 0                  0                  0 
               GDP     Oil_Production          Oil_Stock 
                 0                  0                  0 
         Oil_Price                HDD                CDD 
                 1                  2                  2 
Energy_Consumption Net_imports_Energy       Oil_Reserves 
                 2                  2                 12 
  Active_Conflicts      Drilled_Wells 
                24                 96

The variable with the most NAs in the dataset is Drilled_Wells: the EIA has not provided this information since the end of 2010. For the exploratory analysis the complete dataset will be used, although for modeling and prediction this variable should perhaps be dropped, or only complete cases used.
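The inspection steps above, and the two options for handling Drilled_Wells, can be sketched in base R as follows. This is a minimal illustration rather than the author's code: the data frame `oil` and its values are made up, and only the column names follow the codebook.

```r
# Toy stand-in for the real dataset; column names follow the codebook,
# the values are invented for illustration only.
oil <- data.frame(
  Year          = c(2009, 2010, 2011, 2012),
  Active_Rigs   = c(1089, 1546, 1879, 1919),
  Oil_Price     = c(56.4, 74.7, NA, 94.5),
  Drilled_Wells = c(2500, 3100, NA, NA)
)

str(oil)      # structure: variable types and first values
summary(oil)  # descriptive statistics, including NA counts

# Number of missing values per variable
na_counts <- colSums(is.na(oil))
print(na_counts)

# Option 1: drop the variable with the most missing values
oil_no_wells <- oil[, setdiff(names(oil), "Drilled_Wells")]

# Option 2: keep only the rows with no missing values
oil_complete <- oil[complete.cases(oil), ]
```

Dropping the variable keeps every month but loses the well counts, while the complete-case option keeps all variables and discards the months after 2010; which is preferable depends on the model being fitted.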

After analyzing the structure of the dataset, we start with a series of pairs plots for different categories of variables which, for the sake of simplicity, we have divided into the following groups: Environmental, Socioeconomic, Political, Oil related and Energy related.

Environmental

Although pairwise plots can be complicated to read, they may highlight interesting patterns in our data, for instance:

Active rigs and oil prices have a direct positive relationship; however, there seem to be two different groups and trends, forming two different lines
Over the years, oil prices have in general increased and rig activity has decreased
CDD and HDD have a clear relationship with month, so it might be worthwhile to look for seasonality in our dataset
Although the data is noisy, HDD seems to have decreased over the years and CDD to have increased (global warming?); it might be interesting to group the data by month or season and then look at the behavior over the years
CDD and HDD can be described as nearly mutually exclusive: when it is hot it is unlikely to be cold at the same time (HDD vs CDD scatterplot)

Social & economic variables

Almost all the variables share a strong linear relationship with year, with the exception of active rigs; therefore we see a similar trend among them.

Oil Market

Oil production had a downward trend up to the beginning of the 2000s, at which point there is a clear change in trend (perhaps a different energy policy?)
There is a strong linear relationship between US oil reserves and production; perhaps one of the strongest incentives to increase production has been the growth of US proven reserves. Looking at the Production vs Year and Reserves vs Year curves, we can see how they almost overlap
There is a strong linear relationship between the number of drilled wells and the number of active rigs
Many of the variables seem to show a dual behavior; this might be related to their dependence on time and, of course, on the political factors that determine the energy policy of the country

Energy Market

Over time, the energy consumption of the US has increased, which is logical, since factors such as population, life expectancy and GDP have grown over time as well
There is a sharp break point in the curve of net energy imports over time: in the mid-2000s the upward linear trend was replaced by a strong downward trend (shale boom). It might be interesting to divide the data into two different time periods, since there seems to have been a 180-degree change in US energy policy

Political

This plot is extremely noisy because some of these variables are categorical; however, it still provides some interesting information

There seems to be a relationship between oil prices and the ruling party in the different branches of the US government; this is worth exploring in further detail
Active conflicts and oil prices seem to share a positive relationship (not necessarily linear, but positive). This seems logical: the US is one of the biggest arms producers and this industry has to be fueled, and as per market rules, the higher the demand for a product, the higher its price
The number of active conflicts seems to share a relationship with the ruling party in the different branches of the US government
The number of active rigs seems to be unaffected by politics
Active conflicts worldwide have been increasing over time; perhaps their nature has changed, but their number has increased over the years

Unsupervised Learning for EDA

After looking at the pairs plots in more detail, it seems that many of the observations are clustered together; it therefore seems logical to look for possible groups and associations that might bring relevant variables or data insights to light. To do this we will use unsupervised learning methods, such as K-means and hierarchical clustering.

To facilitate the analysis we will start by grouping the data by year instead of using the monthly division. To group by year we will use the median as the aggregation function, since most of the variables are right-skewed. This yearly data will be used initially to create a heatmap.
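The yearly grouping described above can be sketched with base R's `aggregate()`. The data frame `monthly` and its values are illustrative assumptions; the real dataset has many more columns and rows.

```r
# Toy monthly data: three months for each of two years (values invented).
monthly <- data.frame(
  Year        = rep(c(1973, 1974), each = 3),
  Oil_Price   = c(3.89, 3.89, 4.10, 6.50, 7.00, 12.00),
  Active_Rigs = c(1219, 1126, 1049, 1300, 1400, 1500)
)

# Collapse months into one row per year using the median, which is less
# sensitive to right skew than the mean. na.rm = TRUE is passed on to median().
yearly <- aggregate(
  monthly[, c("Oil_Price", "Active_Rigs")],
  by  = list(Year = monthly$Year),
  FUN = median,
  na.rm = TRUE
)
print(yearly)

# A heatmap would then typically be drawn on the scaled numeric columns,
# e.g. heatmap(scale(as.matrix(yearly[, -1]))), once enough years are available.
```

Categorical variables such as President_Party cannot be summarized with a median and would need a separate rule (for example, the most frequent value in the year); how the source handled them is not stated.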

Heatmap

Interestingly the heatmap shows that indeed recent years are part of a cluster by themselves and are characterized by:

Above-average oil prices
High levels of oil reserves
High levels of oil stock
High levels of oil production
Low levels of active rigs (it seems that drilling activity has focused on highly productive wells)
High energy consumption and demand
Lowest levels of imported energy
Highest levels of CDD (global warming?)
Lowest levels of HDD
Mixed politics at all levels: Executive, House and Senate

The next logical step in our analysis is to perform a cluster analysis and examine the characteristics of the different clusters of observations

Kmeans Cluster Analysis

The main issue with k-means clustering is choosing an adequate number of clusters. For certain datasets (sports, market segments, gender, etc.) it is fairly obvious, but for other types of problems, such as this one, the number is unknown. To determine the number of clusters for this analysis we will use the elbow and silhouette criteria

Elbow Scree Criteria

k	tot_withinss
1	308.00000
2	187.29814
3	124.30791
4	108.16505
5	85.18374
6	47.37352
7	43.09698
8	29.82660
9	26.98528
10	22.92257

Looking at the plot, the elbow seems to be located between k=2 and k=4; the silhouette will help us narrow this down further
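An elbow table like the one above can be produced by running k-means for each k and recording the total within-cluster sum of squares. The sketch below uses simulated data as a stand-in; the matrix `x`, the seed and `nstart = 25` are assumptions, not the author's settings.

```r
set.seed(42)

# Simulated stand-in: 45 observations of 7 numeric variables, standardized
# so that each column has mean 0 and standard deviation 1.
x <- scale(matrix(rnorm(45 * 7), nrow = 45, ncol = 7))

# Total within-cluster sum of squares for k = 1..10
ks <- 1:10
tot_withinss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

elbow <- data.frame(k = ks, tot_withinss = tot_withinss)
print(elbow)

# plot(elbow$k, elbow$tot_withinss, type = "b") would draw the elbow curve.
```

For standardized data the k = 1 value always equals (n - 1) × p. The 308 in the table above is consistent with 45 yearly observations and 7 scaled variables (44 × 7 = 308), although that reading is an inference from the numbers and is not stated in the source.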

Silhouette Plot

k	sil_width
2	0.3790503
3	0.3665515
4	0.4044439
5	0.4094645
6	0.4303218
7	0.4655798
8	0.4437710
9	0.4207146
10	0.4203871

The silhouette plot shows a peak in the average silhouette width at k=7. For further exploration, the number of centroids used will be 7
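The average silhouette width per k can be computed with `silhouette()` from the `cluster` package, which is distributed with R as a recommended package. As before, this is a hedged sketch on simulated data, not the author's code; `x`, the seed and `nstart` are assumptions.

```r
library(cluster)  # provides silhouette()

set.seed(42)
x <- scale(matrix(rnorm(45 * 7), nrow = 45, ncol = 7))  # simulated stand-in
d <- dist(x)                                            # Euclidean distances

# Average silhouette width for k = 2..10 (undefined for k = 1)
ks <- 2:10
sil_width <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

sil <- data.frame(k = ks, sil_width = sil_width)
print(sil)

# The k with the largest average width is the silhouette criterion's choice
best_k <- sil$k[which.max(sil$sil_width)]
```

On random data like this the widths are low and the chosen k carries no meaning; the point is only the mechanics of building the table.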

cluster	n
1	5
2	8
3	9
4	3
5	4
6	12
7	4

The barplot of silhouette widths shows that all points have a positive width, which suggests that no observation is clearly assigned to the wrong cluster. Let's look in more detail at the characteristics of each of the seven clusters.

cluster	Year	Oil_Price	Oil_Reserves	Oil_Stock	Oil_Production	Active_Rigs	Energy_Consumption	Net_imports_Energy	CDD	HDD	President_Party	House	Senate
1	1983.0	26.0600	28416.0	681.00	10258.581	2418.50	6.145066	0.6833525	40.50	325.50	Republican	Democrat	Republican
2	1976.5	8.2575	31567.5	308.50	10143.014	1829.75	6.213764	1.1358337	31.25	354.50	Mixed	Democrat	Democrat
3	1990.0	15.3750	26254.0	909.50	9045.215	836.50	6.909735	1.1485480	38.00	320.00	Republican	Democrat	Democrat
4	2016.0	42.8600	32773.0	1171.00	12807.305	874.50	8.046779	0.9050690	65.00	254.00	Democrat	Republican	Republican
5	2012.5	94.5275	31950.0	1044.00	9354.557	1875.00	8.033162	1.2129182	52.00	281.00	Democrat	Republican	Democrat
6	2000.5	23.2375	21954.0	894.75	7709.979	941.25	7.928625	2.1101968	43.25	301.25	Mixed	Republican	Republican
7	2008.5	68.9325	20999.5	1019.00	7125.754	1659.25	8.115634	2.0194070	43.00	291.75	Mixed	Democrat	Democrat

The results of the cluster analysis show clearly that more recent years are characterized by greater US oil independence, with an increase in production, a decrease in net energy imports, higher oil reserves, and moderate prices and rig activity

Hierarchical Clustering Dendrogram

Hierarchical clustering shows a similar pattern, in which more recent years form a cluster of their own, on the left side of the dendrogram
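A dendrogram of this kind is usually built with `hclust()` on a distance matrix. The sketch below is illustrative only: the data are simulated, the year labels are hypothetical, and complete linkage is an assumption, since the source does not state which linkage was used.

```r
set.seed(42)

# Simulated stand-in for the scaled yearly data
x <- scale(matrix(rnorm(45 * 7), nrow = 45, ncol = 7))
rownames(x) <- 1973:2017  # hypothetical year labels for the 45 rows

# Agglomerative clustering on Euclidean distances (complete linkage assumed)
hc <- hclust(dist(x), method = "complete")

# Cut the tree into 7 groups to compare with the k-means solution
groups <- cutree(hc, k = 7)
print(table(groups))

# plot(hc) would draw the dendrogram with the year labels on the leaves.
```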

With the insight gained from the unsupervised learning EDA, we can further explore the behavior of our variables in relation to the clusters

If we look at the four plots in more detail, we can see an interesting relationship, especially in the year 2007, highlighted by the red line: at the highest point for oil prices, oil production was at its lowest point, as were oil reserves, and the rig count was at low to mid levels. 2007 seems to be a turning point in the history of the US oil industry.

The analysis makes it clear that there is indeed some sort of relationship between these variables; however, for some of them, such as Active_Rigs and Oil_Price, the relationship still seems confusing, and it might be worth looking at it in more detail.

Interestingly, the relationship is positive, but it is definitely not linear. It might also be interesting to look at the behavior of both variables over time. Since oil price and rig count are almost two orders of magnitude apart, we should first normalize the variables and then plot their behavior over time
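Two common ways to put variables of very different magnitude on a shared scale are z-scores and min-max scaling. The source does not say which normalization was used, so both are shown here as a sketch. The input vectors reuse the rounded five-number summaries of Oil_Price and Active_Rigs printed earlier as toy data; `min_max` is a hypothetical helper.

```r
# Toy inputs: rounded min, quartiles and max from the summary output above
price <- c(3.89, 14.01, 23.51, 45.19, 128.08)
rigs  <- c(407, 857, 1185, 1852, 4521)

# Z-score: subtract the mean and divide by the standard deviation
z_price <- as.numeric(scale(price))
z_rigs  <- as.numeric(scale(rigs))

# Min-max: rescale to the interval [0, 1] (hypothetical helper)
min_max <- function(v) {
  (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
}
mm_price <- min_max(price)
mm_rigs  <- min_max(rigs)

print(data.frame(z_price, z_rigs, mm_price, mm_rigs))
```

After either transformation both series can be drawn on the same axis, which is what makes the comparison of their behavior over time readable.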

The time analysis shows what we suspected: price and active rigs are indeed related. In fact, whenever price has increased, active rigs have followed the same trend, which is logical from an economic point of view: the higher the price of the goods to be sold, the greater the incentive to produce them and to enter the oil and drilling business. Although there is a relationship between the two variables, in the early years active rigs seem to have been more volatile than price, but their behavior has stabilized over time, which might suggest that there are other variables or dynamics besides the oil price affecting the drilling business.

Drilling Dynamics I

The relationship between active rigs and drilled wells is clearly and strongly linear, so we can assume that whenever active rigs increase, drilled wells increase as well. In recent years, especially since the beginning of the 2000s, there is a clear linear relationship between active rigs, production and reserves. This seems logical, since to produce more oil we need either to bring in more wells or to make the current wells produce more, although it is not as simple as it sounds, due to various factors:

Not all drilled wells are development or production wells; some are exploration wells whose sole objective is to look for more oil
Not all drilled wells result in production or reserves, since a portion of them are so-called dry wells (no oil or gas)
Not all wells are drilled for oil; some are drilled for natural gas production
Oil production is a controlled variable, heavily influenced by the energy policy of the country

If we normalize our Production and Reserves variables, we can see the exploration and production dynamics more clearly.

Drilling Dynamics II

Now the picture is clearer: periods of high efficiency, in terms of reserves and production per newly drilled well, coincide with periods of low drilling activity. Let's call these sustain periods. During them, drilling activity serves to offset the decline rate of the wells in service. Whenever reserves and production start to lag behind energy demand projections, a new drilling cycle seems to start, in which the industry aggressively looks for new reservoirs to increase reserves and to bring new development wells into the production pipeline.

Feature Engineering for EDA

After this initial exploration of the data at hand, it becomes clear that we might need more information, or transformations of some of the variables, since in their current form they may not capture the complete picture of the problem.

All the code for the transformations can be found at the following link

Seasonality

The month variable by itself does not seem to produce a clear difference in either price or active rigs

The month variable might be compressed into a season variable, following the standard season dates in the US

Season	Dates
Spring	March 1 to May 31
Summer	June 1 to August 31
Autumn	September 1 to November 30
Winter	December 1 to February 28

After adding the season variable, the behavior still seems unaffected, although this will be confirmed in later sections of this work using statistical inference methods.
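The mapping from month to season in the table above can be coded as a small helper. This is a sketch under the assumption that Month is stored as a factor or character with values "01" to "12", as in the structure output; `season_of` is a hypothetical name.

```r
# Map a month ("01".."12" or 1..12) to the season defined in the table above.
season_of <- function(m) {
  # Convert via character so that factor labels, not level codes, are used
  m <- as.integer(as.character(m))
  out <- ifelse(m %in% 3:5,  "Spring",
         ifelse(m %in% 6:8,  "Summer",
         ifelse(m %in% 9:11, "Autumn", "Winter")))
  factor(out, levels = c("Spring", "Summer", "Autumn", "Winter"))
}

# Example on a factor of month labels
months  <- factor(c("01", "04", "07", "10", "12"))
seasons <- season_of(months)
print(data.frame(Month = months, Season = seasons))
```

Because the data are monthly, the day-level boundaries in the table reduce to whole months, so December, January and February all fall into Winter.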

Politics

Energy policy does play a critical role in the behavior of our variables. From the following boxplots, it seems that oil prices and rig activity differ depending on the ruling party at the different levels of the US government.

Although we have captured the dynamics of US politics using three variables (President_Party, Senate and House), none of them by itself fully describes which political party is truly in control of US energy and oil policy. For instance, a president facing an opposing House and Senate might act completely differently from a president with support at all levels. This is an interesting aspect that should be explored further

New variables will be coded to differentiate between a total control scenario, in which the same party holds all three powers, and a shared power scenario, in which at least one of the powers belongs to a different political party. Using these variables we will establish, for the total control scenarios, which party is ruling the government, and we will evaluate whether there is a difference between these scenarios


   shared     Total 
0.6521739 0.3478261

From Jan 1973 to Dec 2018, the US government has been under total control by one party 35% of the time.

Now let's look in more detail at how this has affected our response variables

There seems to be a certain difference in oil price when the government is fully controlled by a single party compared with the scenario in which power is shared; however, for active rigs the difference does not seem to be as dramatic.


  Democrat Republican     shared 
 0.1739130  0.1739130  0.6521739

From Jan 1973 to Dec 2018, each party has held total control for an equal share of the time (17% each), which together make up the 35% of the time spent under single-party control.
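The two control variables described above can be derived from President_Party, Senate and House with a pair of `ifelse()` calls. In this sketch the toy data frame `gov` is invented, `Gov_ctrl_type` is the name that appears in the test output later in this document, and `Gov_ctrl_party` is a hypothetical name for the three-level version.

```r
# Toy data: one row per observation, party labels as in the codebook
gov <- data.frame(
  President_Party = c("Republican", "Democrat", "Democrat",   "Republican"),
  Senate          = c("Democrat",   "Democrat", "Republican", "Republican"),
  House           = c("Democrat",   "Democrat", "Republican", "Republican"),
  stringsAsFactors = FALSE
)

# TRUE when the same party holds the presidency, the Senate and the House
same_party <- gov$President_Party == gov$Senate & gov$Senate == gov$House

# Two-level variable: total control vs shared power
gov$Gov_ctrl_type <- ifelse(same_party, "Total", "shared")

# Three-level variable: which party has total control, otherwise "shared"
gov$Gov_ctrl_party <- ifelse(same_party, gov$President_Party, "shared")

# Share of observations in each scenario
print(prop.table(table(gov$Gov_ctrl_type)))
print(prop.table(table(gov$Gov_ctrl_party)))
```

If the party columns are stored as factors, as in the structure output earlier, they should be converted with `as.character()` before the comparison and before being used inside `ifelse()`.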

Now let's look in more detail at how this has affected our response variables

When we facet the total-control scenarios by ruling party, a clear difference emerges in both variables. Although the plots reveal differences, these will only be confirmed in later sections of this work using statistical inference methods.

Economics

As we saw in the earlier pairs plots, GDP has grown in a positive linear fashion over time; therefore plots of GDP against oil price or active rigs do not reveal any behavior that was not already highlighted by the relationships of these variables with time. However, there might still be some further exploration to do with this economic variable: an interesting aspect might be to differentiate the behavior of our variables of interest between periods of economic recession and periods of normal economic behavior.

Name	Beginning	End	Short Description
1973 oil crisis	Nov 1973	Mar 1975	The 1973 oil crisis, a quadrupling of oil prices by OPEC.
1980 recession	Jan 1980	July 1980	The Federal Reserve raised interest rates dramatically.
1981–1982 recession	July 1981	Nov 1982	The Iranian Revolution and tight monetary policy in the US.
Early 1990s recession	July 1990	Mar 1991	The Federal Reserve raising interest rates and the 1990 oil price shock.
Early 2000s dot-com	Mar 2001	Nov 2001	Collapse of the speculative dot-com bubble and the September 11th attacks.
Great Recession	Dec 2007	June 2009	The subprime mortgage crisis and the collapse of the US housing bubble.


   Normal Recession 
0.8605072 0.1394928

From Jan 1973 to Dec 2018, the US economy has been in recession in 14% of the months. Now let's look in more detail at how this has affected our response variables
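A recession indicator can be built by checking each monthly date against the periods in the table above. The sketch below assumes both the start and end months are counted as recession months; the names `recessions`, `dates` and `Economy` are illustrative, not taken from the source code.

```r
# Recession periods from the table above, as first-of-month dates
recessions <- data.frame(
  start = as.Date(c("1973-11-01", "1980-01-01", "1981-07-01",
                    "1990-07-01", "2001-03-01", "2007-12-01")),
  end   = as.Date(c("1975-03-01", "1980-07-01", "1982-11-01",
                    "1991-03-01", "2001-11-01", "2009-06-01"))
)

# One date per month over the span of the dataset
dates <- seq(as.Date("1973-01-01"), as.Date("2018-12-01"), by = "month")

# TRUE when a date falls inside any recession period (endpoints inclusive)
in_recession <- vapply(
  seq_along(dates),
  function(i) any(dates[i] >= recessions$start & dates[i] <= recessions$end),
  logical(1)
)

Economy <- factor(ifelse(in_recession, "Recession", "Normal"),
                  levels = c("Normal", "Recession"))
print(prop.table(table(Economy)))
```

With inclusive endpoints this flags 78 of the 552 months, one more than the 77 implied by the share reported above, so the original coding evidently treated one of the boundary months differently.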

Looking at the price vs time behavior, it is interesting to notice that almost all the early recessions are associated with a spike in oil prices. Similar peaks can also be found in the active rigs curve over time.

The price and recession boxplot does not show a dramatic difference; however, the rig count seems to be directly affected by periods of economic recession. This difference in the behavior of the two variables might be explained by the fact that oil price is an international variable that does not depend only on the US economy, whereas active rigs in the US are directly affected by how the country's economy is performing. Although the plots reveal interesting differences, these will only be confirmed in later sections of this work using statistical inference methods.

Statistical Inference

Our previous analysis has led us to various questions about the nature of the relationships between our response variables and the explanatory ones. The initial EDA has raised these questions or, to be more precise, awakened our curiosity about them, but the answers can only be found using more advanced techniques. In this section we intend to answer these questions.

All the code for the statistical tests performed in this section can be found at the following Link

Does oil price vary with Months?

To answer this question we will first use an ANOVA test and, if it shows significant differences, we will then perform a TukeyHSD test to find out which categories are truly different. We will use an alpha of 5% (0.05)

term	df	sumsq	meansq	statistic	p.value
OIL_MODIFIED$Month	1	101.9351	101.9351	0.1321276	0.7163761
Residuals	549	423548.1536	771.4903	NA	NA

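As we can see, there is no significant difference: the ANOVA p-value is 0.72 (72%), well above our alpha of 0.05. Note that the table reports 1 degree of freedom for Month, which indicates that Month entered the model as a numeric variable; a one-way ANOVA across the 12 months treated as categories would have 11 degrees of freedom.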
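The ANOVA-then-Tukey procedure described above can be sketched as follows. The data are simulated and the data frame `toy` is an assumption; the point is the workflow and the effect of treating Month as a factor versus a number on the degrees of freedom.

```r
set.seed(1)

# Simulated stand-in: 10 years of monthly prices with no real month effect
toy <- data.frame(
  Month     = factor(rep(sprintf("%02d", 1:12), times = 10)),
  Oil_Price = rnorm(120, mean = 30, sd = 10)
)

# One-way ANOVA with Month as a 12-level factor: 11 degrees of freedom
fit       <- aov(Oil_Price ~ Month, data = toy)
anova_tab <- summary(fit)[[1]]
print(anova_tab)

# The same model with Month coerced to a number is a linear trend: 1 df
fit_num   <- aov(Oil_Price ~ as.integer(Month), data = toy)
anova_num <- summary(fit_num)[[1]]

# Follow up with Tukey's HSD only when the overall test is significant
alpha   <- 0.05
p_value <- anova_tab[["Pr(>F)"]][1]
if (p_value < alpha) {
  print(TukeyHSD(fit))
}
```

`TukeyHSD()` requires the grouping variable to be a factor, which is another reason to keep Month categorical when the question is whether any months differ from each other.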

Does Active_Rigs vary with Months?

term	df	sumsq	meansq	statistic	p.value
OIL_MODIFIED$Month	1	602666.4	602666.4	1.098467	0.2950631
Residuals	550	301753681.5	548643.1	NA	NA

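As we can see, there is no significant difference: the ANOVA p-value is 0.30 (30%), above our alpha of 0.05. Here too the table shows Month with 1 degree of freedom, so it was treated as a numeric variable rather than as 12 categories.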

Does oil price vary with Seasons?

term	df	sumsq	meansq	statistic	p.value
OIL_MODIFIED$Season	3	633.4052	211.1351	0.2730173	0.8448725
Residuals	547	423016.6836	773.3395	NA	NA

As we can see, there is no significant difference: the ANOVA p-value is 0.84 (84%), well above our alpha of 0.05

Does Active Rigs vary with Seasons?

term	df	sumsq	meansq	statistic	p.value
OIL_MODIFIED$Season	3	1138054	379351.5	0.690146	0.558328
Residuals	548	301218293	549668.4	NA	NA

As we can see, there is no significant difference: the ANOVA p-value is 0.56 (56%), well above our alpha of 0.05

Does Oil Price vary with the US President's political party?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval for the difference in means. We will use an alpha of 5% (0.05)


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$President_Party
t = 3.7616, df = 405.02, p-value = 0.0001937
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4.42473 14.11211
sample estimates:
  mean in group Democrat mean in group Republican 
                38.51321                 29.24479

As we can see, there is indeed a significant difference: the t-test p-value of 0.0001937 is below our selected alpha of 5%. Based on the 95% confidence interval, we can claim that Democratic US presidents have experienced on average oil prices between $4.4 and $14.1 higher than Republican presidents.
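The output above has the form produced by R's `t.test()` with a formula, which performs Welch's unequal-variance test by default. The sketch below reproduces the mechanics on simulated data; the data frame `toy`, the group sizes and the simulated means are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated stand-in with two groups of unequal size and variance
toy <- data.frame(
  President_Party = factor(rep(c("Democrat", "Republican"), times = c(40, 60))),
  Oil_Price       = c(rnorm(40, mean = 38, sd = 20),
                      rnorm(60, mean = 29, sd = 15))
)

# Welch two-sample t-test (var.equal = FALSE is the default)
tt <- t.test(Oil_Price ~ President_Party, data = toy, conf.level = 0.95)
print(tt)

# The pieces used in the interpretation
alpha       <- 0.05
significant <- tt$p.value < alpha  # reject equal means when TRUE
ci          <- tt$conf.int         # interval for mean(Democrat) - mean(Republican)
group_means <- tt$estimate
```

The confidence interval is for the first factor level minus the second, so a fully positive interval means the first group (here Democrat) has the higher mean, and a fully negative one means it has the lower mean.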

Does Active Rigs vary with the US President's political party?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval for the difference in means. We will use an alpha of 5% (0.05)


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$President_Party
t = -2.1034, df = 541.3, p-value = 0.03589
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -253.424469   -8.662711
sample estimates:
  mean in group Democrat mean in group Republican 
                1360.950                 1491.994

As we can see, there is indeed a significant difference: the t-test p-value of 0.03589 (3.6%) is below our selected alpha of 5%. Based on the 95% confidence interval, we can claim that Democratic US presidents have experienced on average between 9 and 253 fewer active rigs than Republican presidents.

Does Oil Price vary with the majority party of the US Senate?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval for the difference in means. We will use an alpha of 5% (0.05)


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Senate
t = 2.6457, df = 393.49, p-value = 0.008479
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.55085 10.52260
sample estimates:
  mean in group Democrat mean in group Republican 
                36.16326                 30.12654

As we can see, there is indeed a significant difference: the t-test p-value of 0.008479 (about 0.85%) is below our selected alpha of 5%. Based on the 95% confidence interval, we can claim that a Democrat-majority Senate has experienced on average oil prices between $1.6 and $10.5 higher than a Republican-majority Senate.

Does Active Rigs vary with the majority party of the US Senate?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval for the difference in means. We will use an alpha of 5% (0.05)


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Senate
t = 1.9558, df = 448.03, p-value = 0.05111
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -0.6066393 251.1350484
sample estimates:
  mean in group Democrat mean in group Republican 
                1494.927                 1369.663

As we can see, there is no significant difference at the 5% level: the t-test p-value is 0.05111 (5.1%), marginally above our alpha, and the 95% confidence interval includes zero

Does Oil Price vary with the majority party of the US House?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval for the difference in means. We will use an alpha of 5% (0.05)


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$House
t = -8.5814, df = 447.84, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.24962 -15.21222
sample estimates:
  mean in group Democrat mean in group Republican 
                24.72343                 44.45435

As we can see, the difference is significant: the t-test p-value is below 2.2e-16, far below our selected alpha level of 5%. Based on the 95% confidence interval, we can claim that periods with a Democratic House majority have on average experienced oil prices between 15.21 and 24.25 dollars lower than periods with a Republican House majority.

Does Active Rigs vary with the majority party of the US House?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval of the test parameter. We will use an alpha of 5% (0.05).


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$House
t = 9.3118, df = 493.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 407.6818 625.7310
sample estimates:
  mean in group Democrat mean in group Republican 
                1659.673                 1142.967

As we can see, the difference is significant: the t-test p-value is below 2.2e-16, far below our selected alpha level of 5%. Based on the 95% confidence interval, we can claim that periods with a Democratic House majority have on average experienced a rig count between 408 and 626 rigs higher than periods with a Republican House majority.

Do Oil prices vary if the US government is completely controlled by a single party at all levels?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval of the test parameter. We will use an alpha of 5% (0.05).


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Gov_ctrl_type
t = -0.24384, df = 504.06, p-value = 0.8075
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.963467  3.867454
sample estimates:
mean in group shared  mean in group Total 
            33.09189             33.63990

As we can see, there is no significant difference, since the t-test p-value of 0.8075 (80.75%) is well above our selected alpha level of 5%.

Do Active Rigs vary if the US government is completely controlled by a single party at all levels?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval of the test parameter. We will use an alpha of 5% (0.05).


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Gov_ctrl_type
t = 0.49325, df = 477.12, p-value = 0.6221
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -90.53302 151.21913
sample estimates:
mean in group shared  mean in group Total 
            1445.572             1415.229

As we can see, there is no significant difference, since the t-test p-value of 0.6221 (62.21%) is well above our selected alpha level of 5%.

In scenarios of complete control by one party, do oil prices vary with the party in control?

term	df	sumsq	meansq	statistic	p.value
OIL_MODIFIED$Gov_ctrl_party	2	10521.5	5260.7522	6.978196	0.0010171
Residuals	548	413128.6	753.8843	NA	NA

As we can see, the difference is significant: the ANOVA p-value of 0.0010171 is below our selected alpha level of 5%. With such a difference we will proceed with TukeyHSD to find out which categories differ significantly. It is important to highlight that TukeyHSD already reports p-values adjusted for multiple comparisons, using Tukey's honest significant difference method.

term	comparison	estimate	conf.low	conf.high	adj.p.value
OIL_MODIFIED$Gov_ctrl_party	Republican-Democrat	14.817787	5.4797130	24.1558616	0.0006204
OIL_MODIFIED$Gov_ctrl_party	shared-Democrat	6.822097	-0.5898721	14.2340666	0.0785599
OIL_MODIFIED$Gov_ctrl_party	shared-Republican	-7.995690	-15.4383933	-0.5529868	0.0317663

Based on the results we can claim that a completely Republican-controlled government has experienced oil prices between 5.48 and 24.16 dollars higher than a completely Democrat-controlled government. It is also interesting to mention that a completely Republican-controlled government shows higher oil prices than a shared government.
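
The ANOVA followed by TukeyHSD workflow can be sketched in base R as follows. The data frame `toy`, its column names and its values are hypothetical and only imitate the three `Gov_ctrl_party` groups; the tables above appear to have been tidied with another tool, which is not reproduced here.

```r
set.seed(2)  # arbitrary seed, for reproducibility only

# Hypothetical prices for three control scenarios
toy <- data.frame(
  price = c(rnorm(30, mean = 30, sd = 8),
            rnorm(30, mean = 45, sd = 8),
            rnorm(30, mean = 37, sd = 8)),
  party = factor(rep(c("Democrat", "Republican", "shared"), each = 30),
                 levels = c("Democrat", "Republican", "shared"))
)

# One-way ANOVA: is at least one group mean different?
fit <- aov(price ~ party, data = toy)
summary(fit)

# Pairwise comparisons with p-values adjusted by Tukey's method
tuk <- TukeyHSD(fit, conf.level = 0.95)
tuk$party  # columns: diff, lwr, upr, p adj
```

Each row of `tuk$party` is named after the pair being compared, for example `Republican-Democrat`, and `diff` is the first group's mean minus the second's.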

In scenarios of complete control by one party, do Active Rigs vary with the party in control?

term	df	sumsq	meansq	statistic	p.value
OIL_MODIFIED$Gov_ctrl_party	2	14370308	7185153.9	13.69736	1.6e-06
Residuals	549	287986040	524564.7	NA	NA

As we can see, the difference is significant: the ANOVA p-value of 1.6e-06 is below our selected alpha level of 5%. With such a difference we will proceed with TukeyHSD to find out which categories differ significantly. It is important to highlight that TukeyHSD already reports p-values adjusted for multiple comparisons, using Tukey's honest significant difference method.

term	comparison	estimate	conf.low	conf.high	adj.p.value
OIL_MODIFIED$Gov_ctrl_party	Republican-Democrat	-544.9583	-790.6342	-299.28247	0.0000008
OIL_MODIFIED$Gov_ctrl_party	shared-Democrat	-242.1361	-437.6505	-46.62175	0.0104769
OIL_MODIFIED$Gov_ctrl_party	shared-Republican	302.8222	107.3079	498.33658	0.0008702

Based on the results we can claim that a completely Republican-controlled government has experienced lower levels of rig activity than either a fully Democrat or a shared government. Completely Democrat-controlled governments have experienced the highest levels of active rigs.

Do Oil prices vary if the US economy is in recession?

To answer this question we will use a t-test and evaluate the resulting p-value and the confidence interval of the test parameter. We will use an alpha of 5% (0.05).


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Eco_Rec
t = -0.045776, df = 95.274, p-value = 0.9636
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.707692  7.360243
sample estimates:
   mean in group Normal mean in group Recession 
               33.25757                33.43130

As we can see, there is no significant difference, since the t-test p-value of 0.9636 (about 96.4%) is well above our selected alpha level of 5%.

Do Active Rigs vary if the US economy is in recession?


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Eco_Rec
t = -5.565, df = 85.947, p-value = 2.925e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -905.1819 -428.6917
sample estimates:
   mean in group Normal mean in group Recession 
               1341.985                2008.922

As we can see, the difference is significant: the t-test p-value of 2.925e-07 is below our selected alpha level of 5%. Based on the 95% confidence interval, we can claim that the US economy in recession has on average experienced a rig count between 429 and 905 rigs higher than the US economy under normal conditions.

Is oil price behavior independent of the price behavior in the previous month?

This might sound like an unrelated question, but it is on the contrary fundamental for this work: what is the point of attempting to predict the behavior of a variable, via regression or forecasting methods, if it behaves randomly with a certain probability? If that is the case, it is critical to know it in advance instead of developing models that are fundamentally wrong from their conception.

It is very common to hear that trading markets such as forex or the stock market behave randomly, that the ups and downs of prices follow patterns like those of a coin toss. Oil, being a commodity, is sometimes accused of the same.

This sounds like a very complex question, and it is, but a chi-square goodness-of-fit (GOF) test can be used to study the problem. We will label each observation as UP or DOWN depending on whether the oil price went up or down that month.


DOWN   UP 
 235  316


     DOWN        UP 
0.4264973 0.5735027

For our dataset, composed of 552 monthly observations of oil prices starting in January 1973, the proportion of UPs in the market is 57%. The next step in our analysis will be to generate a histogram of the distance between UPs, that is, how many months of DOWN we have before experiencing a market upturn again. We will call this variable streak.

price_streak	Freq
0	126
1	37
2	26
3	15
4	15
5	6
6	3
7	4
8	1
9	1
10	1
19	1

price_streak	Freq
0	0.5338983
1	0.1567797
2	0.1101695
3	0.0635593
4	0.0635593
5	0.0254237
6	0.0127119
7	0.0169492
8	0.0042373
9	0.0042373
10	0.0042373
19	0.0042373

Since the chi-square GOF test requires at least 5 observations in each category, we will group together streak values of 6 months or more.

price_streak	Freq	cummulative
0	0.5338983	0.5338983
1	0.1567797	0.6906780
2	0.1101695	0.8008475
3	0.0635593	0.8644068
4	0.0635593	0.9279661
5	0.0254237	0.9533898
6	0.0466102	1.0000000

With the streak distribution in hand, the next step is to calculate the probabilities of a geometric distribution with a success probability of 57%, in which values of 6 or more are grouped together.

Geometric_streak	Freq	cummulative
0	0.5796	0.5796
1	0.2420	0.8216
2	0.1043	0.9259
3	0.0411	0.9670
4	0.0205	0.9875
5	0.0071	0.9946
6	0.0054	1.0000

Having the geometric distribution proportions and the original table of oil price streak frequencies, we can perform a chi-square goodness-of-fit test. The selected alpha value for this test will be 5% (0.05).


    Chi-squared test for given probabilities

data:  price_observed
X-squared = 102.79, df = 6, p-value < 2.2e-16

As we can see, the p-value of the test is below 2.2e-16, almost 0, which means that it would be almost impossible to observe the oil price streak distribution under the null hypothesis of a geometric distribution with a success probability of 57%.

Since this is a complicated question, we will confirm our test by using simulation instead of the predefined statistical functions of R. We will draw, with replacement, 10000 (ten thousand) replicates of our dataset and calculate the chi-square statistic of each one under the null hypothesis that our distribution is equal to the geometric distribution with a success probability of 57%. We will then calculate the p-value of obtaining our observed chi-square statistic.
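
A goodness-of-fit test of this kind can be sketched with `chisq.test()`. The observed counts below are taken from the streak frequency table above, with streaks of 6 or more grouped together. The reference probabilities are computed analytically with `dgeom()` and `pgeom()`, which is an assumption about how the geometric table was built, so the statistic need not match the reported value exactly.

```r
# Observed streak counts from the frequency table above; streaks of 6 or
# more months (3 + 4 + 1 + 1 + 1 + 1 = 11) are grouped in the last cell.
price_observed <- c(126, 37, 26, 15, 15, 6, 11)
names(price_observed) <- c(0:5, "6+")

# Assumption: analytic probabilities of a geometric distribution with
# success probability 0.57; the last cell holds P(streak >= 6).
p_success <- 0.57
geom_probs <- c(dgeom(0:5, prob = p_success),
                pgeom(5, prob = p_success, lower.tail = FALSE))

# Some expected counts are below 5, so R warns that the chi-square
# approximation may be inaccurate; the warning is silenced here.
gof <- suppressWarnings(chisq.test(x = price_observed, p = geom_probs))
gof$statistic  # Pearson chi-square statistic
gof$parameter  # degrees of freedom (7 categories - 1 = 6)
gof$p.value
```

Because the smallest expected counts fall below 5, confirming the result by simulation is a sensible precaution.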

replicate	stat
1	8.371387
2	13.970183
3	5.948372
4	2.312726
5	4.998194
6	8.594592

[1] 10000     2

With this new data frame of simulations we will calculate the simulated p-value, which is just the proportion of simulated statistics that are equal to or more extreme than our observed chi-square statistic.

# A tibble: 1 x 1
  `mean(stat >= oil_observed_chi)`
                             <dbl>
1                                0

The blue curve of this graph corresponds to a chi-square distribution with 6 degrees of freedom. It almost perfectly overlaps the black curve, which is our simulated distribution; this confirms that our simulation was correct.

As we can see both graphically and from the calculation, this proportion, or simulated p-value, is virtually zero. This means that our oil price streak distribution does not follow a geometric distribution, and therefore that the oil price behavior in one month is not independent of its behavior in the previous month, making this behavior not random. This confirms our suspicion that oil prices are influenced by time and many other variables.

Why monthly oil prices and not daily? How volatile are daily oil prices?

This is an interesting question and perhaps a somewhat complicated one to answer. The idea of this study is to capture macroeconomic or industry related trends that, over the longer term, cause oil prices to rise or drop. Oil, being traded as a commodity on a daily basis, might be more volatile and more susceptible to speculation at that frequency.

To test this, we are going to evaluate the effect of US president Donald Trump's oil and energy related tweets on the daily Brent crude price. We are going to use only tweets from 2016 up to the first quarter of 2019.
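
The simulation step can be sketched in base R by drawing tables of streak counts directly from the null distribution with `rmultinom()`. This is a plain swap-in for whatever simulation tooling was actually used; the analytic geometric probabilities and the seed are assumptions made for the example.

```r
set.seed(2019)  # arbitrary seed, for reproducibility only

# Observed counts from the streak table (6 or more grouped together)
price_observed <- c(126, 37, 26, 15, 15, 6, 11)

# Assumption: null probabilities from a geometric distribution, p = 0.57
geom_probs <- c(dgeom(0:5, prob = 0.57),
                pgeom(5, prob = 0.57, lower.tail = FALSE))

n <- sum(price_observed)
expected <- n * geom_probs

# Pearson chi-square statistic for one table of counts
chi_stat <- function(counts) sum((counts - expected)^2 / expected)
oil_observed_chi <- chi_stat(price_observed)

# Each column of `sims` is one table of n streaks drawn under the null
sims <- rmultinom(10000, size = n, prob = geom_probs)
stat <- apply(sims, 2, chi_stat)

# Simulated p-value: share of simulated statistics at least as extreme
mean(stat >= oil_observed_chi)
```

Overlaying `density(stat)` on `dchisq(x, df = 6)` gives the kind of visual check described above.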

The compendium of Trump tweets was downloaded in JSON format from the following link.

The final database and transformation code can be found on the following link.

The Trump tweets archive used in this work is composed of 11363 tweets. Here is an example of the first lines of the archive:

V1	date	Month	Day	Year	oil_related_word
the person that hillary clinton least wants to run against is, by far, me. it will be the largest voter turnout ever - she will be swamped!	2016-01-01	Jan	01	2016	FALSE
“@codyraymille: i have never been interested in politics but because of you i want to get my political science degree. #trump2016” great!	2016-01-01	Jan	01	2016	FALSE
i will be going to mississippi tomorrow night - hear the crowds are going to be massive! look forward to it.	2016-01-01	Jan	01	2016	FALSE
“@casuperrunner: @georgehenryw huckabee is a good man…but he needs to get behind @realdonaldtrump” i agree!	2016-01-01	Jan	01	2016	FALSE
“@memeoryhead: i’m one of your biggest fans mr trump and i can’t wait for you to make america great ifagain! never forget you have support!”	2016-01-01	Jan	01	2016	FALSE
“@jodil792: we are standing with you! spreading the word…trump for president 2016!! i wake & sleep praying for you & the usa!+israel2”	2016-01-01	Jan	01	2016	FALSE

There are 1186 days between the first and the last day of the database, which means that President Trump tweeted on average 9.6 times per day from January 2016 until April 2019, making him very active on this social media platform.

For the sake of this work, we will focus only on tweets in which he used the following words, since we consider them possibly related to the oil & gas market:

oil
gas
opec
pipeline
barrel
oil price
drilling
oil production
gas price
Sanction
Sanctions
Russia

After filtering the complete database we found 421 tweets containing oil related words. With this information in place, the first question that comes to mind is whether or not there is a difference in Brent prices between days when Trump tweets about oil and days when he does not. To answer this question we will use a t-test with an alpha value of 5%.


    Welch Two Sample t-test

data:  trump_oil2$value by trump_oil2$oil_keyword
t = -7.3975, df = 344.41, p-value = 1.068e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.974548 -5.204563
sample estimates:
mean in group FALSE  mean in group TRUE 
           55.10396            62.19351

Interestingly, there is indeed a significant difference in Brent crude prices between days on which Trump tweets about oil and days on which he does not. The obtained p-value of 1.068e-12 is lower than our alpha level, making the test significant. Brent crude prices tend to be between 5.20 and 8.97 dollars higher on days on which Trump tweets about oil.

Obtaining this result only means that there is a correlation; it does not imply causation. It is important not to be confused and think that Trump's oil related tweets cause higher oil prices. In fact, a plausible hypothesis is that higher oil prices are what motivate Donald Trump to tweet about oil.

To test this hypothesis we will look at the distribution of Trump's oil tweets across oil price ranges.

Price_range	Tweets
(25.9,46]	16
(46,66]	93
(66,86.1]	79

As we can see, the majority of Donald Trump's oil tweets are concentrated in the higher oil price ranges. The real question is whether this distribution is sufficiently different from the expected values to be considered significant. To test this we will use a chi-square independence test with an alpha value of 0.05 (5%).


    Pearson's Chi-squared test

data:  table(cut(trump_oil2$value, 3), trump_oil2$oil_keyword)
X-squared = 43.082, df = 2, p-value = 4.414e-10

The results of the chi-square test suggest that oil tweets and oil price ranges are not independent, since the obtained p-value of 4.414e-10 is below the 5% alpha value. This is consistent with Trump's oil tweets being triggered by differences in oil prices.

If Trump's oil tweets and oil prices were independent, this would be the expected distribution:

Price_range	Chisq_Expected_Tweets
(25.9,46]	35.06258
(46,66]	105.64908
(66,86.1]	47.28834
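
The expected counts under independence can be read directly from the object returned by `chisq.test()`. In the sketch below the `TRUE` column holds the tweet counts from the table above; the `FALSE` column is not shown in the text and is an assumption, back-calculated from the reported expected counts. With these counts the test should come out close to the statistic and expected values reported above.

```r
# Rows are Brent price ranges, columns flag whether an oil tweet was posted.
# The TRUE column comes from the table above; the FALSE column is an
# assumption, back-calculated from the reported expected counts.
tweet_table <- matrix(
  c(136, 365, 126,   # oil_keyword == FALSE (assumed)
     16,  93,  79),  # oil_keyword == TRUE  (from the text)
  ncol = 2,
  dimnames = list(
    Price_range = c("(25.9,46]", "(46,66]", "(66,86.1]"),
    oil_keyword = c("FALSE", "TRUE")
  )
)

indep <- chisq.test(tweet_table)
indep$statistic           # Pearson chi-square statistic
indep$parameter           # degrees of freedom: (3 - 1) * (2 - 1) = 2
indep$expected[, "TRUE"]  # expected oil tweets per price range
```

Each expected count is the row total times the column total divided by the grand total, which is why the top price range is expected to hold about 47 oil tweets rather than the 79 observed.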

This difference can be better observed by comparing the expected and observed distributions.

It is now clearer that there is a more pronounced concentration of oil tweets in the 66 to 86 dollar Brent price range, and we can conclude that what mostly triggers Trump's oil tweets is a rise in oil prices.

It might also be interesting to study how Trump's oil tweets have changed over time.

Year	Tweet_by_year
2016	22
2017	79
2018	126
2019	38

Looking at the table in more detail, it is easy to notice a trend in which the number of oil related tweets has increased over time. In just the first quarter of 2019, Trump has already posted 38 oil tweets; if he continues at this pace, we should expect 152 oil related tweets by the end of the year.

       1 
179.6667

If we fit a linear model to the data for 2016, 2017 and 2018 and predict 2019, we should expect about 180 tweets by the end of the year. So it is clear that Trump has been getting more and more active on oil tweets over time, which moves us to our next line of questions.

Thinking logically, as rational beings we would only repeat, and even increase, a behavior if we expected it to produce the desired result; otherwise it might be considered insane, or at least illogical. The question is: are Trump's oil tweets able to decrease oil prices, at least in the short term?

To answer this question we will look at Brent crude prices on the day of an oil tweet, the day after, two days after and three days after, and compare the prices and their behavior across these days. We will also use linear regression inference to determine the magnitude of these effects.

After creating the variable we will perform an ANOVA test, with an alpha level of 0.05 (5%), to test whether Brent oil prices differ significantly across the different tweet days.
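
Both projections can be reproduced from the yearly table above. The sketch below uses the three complete years for the linear model and the first quarter of 2019 for the naive extrapolation; the object names are chosen for the example.

```r
# Yearly oil tweet counts for the complete years, from the table above
tweets_by_year <- data.frame(
  Year          = c(2016, 2017, 2018),
  Tweet_by_year = c(22, 79, 126)
)

# Naive projection: first-quarter 2019 count scaled to four quarters
q1_2019 <- 38
naive_2019 <- q1_2019 * 4

# Linear trend fitted on 2016 to 2018 and extrapolated to 2019
fit <- lm(Tweet_by_year ~ Year, data = tweets_by_year)
lm_2019 <- predict(fit, newdata = data.frame(Year = 2019))

naive_2019  # 152
lm_2019     # about 179.67
```

With only three points the fitted slope is 52 tweets per year, so the prediction is the three-year mean (about 75.7) plus two slopes.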

term	df	sumsq	meansq	statistic	p.value
trump_oil2$tweet_pattern	8	24926.24	3115.7805	23.63862	0
Residuals	806	106237.98	131.8089	NA	NA

The p-value of the F-statistic, reported as 0, confirms that there is indeed a significant difference, since it is below our selected alpha level. Looking at the mean values, we can observe a descending pattern: on average, the Brent oil market has followed a downward path after Donald Trump's oil tweets.

tweet_pattern	Mean_Price
Zero	62.19351
One	60.85161
Three	60.36951
Two	60.17584
Four	58.45350
Five	56.01485
Six	55.74000
Seven	54.86548
No_Tweet	48.71600

The day after an oil tweet, prices are approximately equal to those on the day of the tweet. However, two days after the tweet prices tend to decrease, as they do on the third and fourth days, and so on up to a week after.

Finally, we would like to confirm this trend. To do so we will run a t-test between the Brent prices on the day of a Trump oil tweet and the prices seven days after an oil tweet. We will perform a one-sided test, in which the alternative hypothesis is that the true difference in means is less than 0. For this test we will use a significance level (alpha) of 5%.


    Welch Two Sample t-test

data:  trump_oil_tweet_effect$value by trump_oil_tweet_effect$tweet_pattern
t = -3.2562, df = 39.559, p-value = 0.00116
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -3.537552
sample estimates:
mean in group Seven  mean in group Zero 
           54.86548            62.19351

Based on the results of the test, we can claim that there is indeed a significant difference, at a 95% confidence level, in Brent prices between the tweet day and seven days after a Trump tweet. We can expect Brent prices on the day of a Trump tweet to be at least 3.54 dollars higher than seven days after the tweet was made.

With such an interesting result it is important to confirm it. To do so we will perform a similar test using simulation: we will simulate 10000 differences between the observed groups of Brent prices on the day of a tweet and seven days after it, and calculate a simulated p-value.

The simulated data seem to fit a normal distribution very well. The next step is to calculate the simulated p-value, which is equivalent to calculating the proportion of simulated differences that are equal to or more extreme than the observed difference.

# A tibble: 1 x 1
   p_value
     <dbl>
1 0.000600

As with the formal t-test, the simulated p-value is below the 5% alpha level, which allows us to conclude that there is indeed a significant difference in Brent crude prices between the day of a tweet and seven days after.

To conclude, this is the main reason why using daily oil price data might not be advisable: it might be volatile enough to be influenced by speculation created by the social media activity of powerful entities such as the US president, and perhaps by many other factors that were not evaluated in this analysis. Although daily and weekly prices might be sensitive enough to be influenced by these kinds of effects, monthly prices are more closely related to macroeconomic and industry based trends and are therefore more suitable for long-term forecasting.
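
One way to obtain a simulated p-value for a difference in means is a permutation test, sketched below in base R. This is offered as an illustration and may differ from the simulation scheme actually used. The price vectors are hypothetical, evenly spaced values chosen only so that the example is deterministic apart from the shuffling.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Hypothetical Brent prices on the tweet day and seven days later
day_zero  <- seq(55, 70, length.out = 30)
day_seven <- seq(48, 62, length.out = 30)

value <- c(day_zero, day_seven)
group <- rep(c("Zero", "Seven"), each = 30)

# Observed difference in means, Seven minus Zero (negative if prices fell)
obs_diff <- mean(value[group == "Seven"]) - mean(value[group == "Zero"])

# Null distribution: shuffle the group labels and recompute the difference
perm_diff <- replicate(10000, {
  shuffled <- sample(group)
  mean(value[shuffled == "Seven"]) - mean(value[shuffled == "Zero"])
})

# One-sided simulated p-value: share of shuffled differences that are
# as low as or lower than the observed one
p_value <- mean(perm_diff <= obs_diff)
p_value
```

The direction of the comparison (`<=`) matches a one-sided alternative in which the difference in means is less than 0.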

Linear Regression Inference & Feature Confirmation

In the previous section of this work we answered many of the questions that the early EDA uncovered. With those answers we can begin the task of creating a model to predict the behavior of our response variables, but before generating definitive models it is better to first study which variables most affect our target response variables. In this section we will explore in more depth the linear relationship between our response and explanatory variables via various methods such as:

- Spearman correlation matrix
- Full linear model
- StepAIC, Akaike information criterion link
- BIC, Bayesian information criterion link
- Boruta selection criteria link

All the code related to the models built in this section can be found at the following link.

Correlation matrix Oil price

We will first build a matrix of Spearman correlation coefficients to identify the variables that have at least a weak correlation (above 0.3 or below -0.3) with Oil prices.

	x
Oil_Price	1.0000000
GDP	0.6840554
Year	0.6832099
US_Population	0.6832099
life_exp	0.6822414
Oil_Stock	0.5769223
Energy_Consumption	0.4434374
Drilled_Wells	0.3802835
Net_imports_Energy	0.3655332
Senate	0.3486705
Active_Conflicts	0.2770792
Active_Rigs	0.2539703
House	0.2100427
Gov_ctrl_type	0.1996047
President_Party	0.1613554
Eco_Rec	0.1045710
CDD	0.0671465
Month	0.0393345
Season	-0.0295112
HDD	-0.0533247
Gov_ctrl_party	-0.1241196
Oil_Production	-0.5322219
Oil_Reserves	-0.5824720

As we can see, the oil price is correlated with various variables, but some of these variables are highly correlated with each other as well. One such case is GDP, Year, US Population and life expectancy: all of them have been growing at almost the same rate over time and are therefore almost perfectly correlated with one another. This already tells us that we should consider using only one of them for model selection.
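
A Spearman correlation matrix can be computed with `cor(..., method = "spearman")`. The sketch below uses a small made-up data frame to show why two variables that both grow steadily over time, such as Year and US_Population, end up with exactly the same rank correlation against any third variable; none of the values are taken from the project data.

```r
# Hypothetical data: two variables that both increase every year,
# plus a price series that rises overall but not monotonically
toy <- data.frame(
  Year          = 1973:1982,
  US_Population = seq(211, 232, length.out = 10),
  Oil_Price     = c(4, 7, 8, 8.5, 9, 9.5, 13, 22, 32, 29)
)

rho <- cor(toy, method = "spearman")

# Correlations with the response, sorted from highest to lowest
sort(rho[, "Oil_Price"], decreasing = TRUE)

# Year and US_Population share the same ranks, so their Spearman
# correlation is 1 and they correlate identically with Oil_Price
rho["Year", "US_Population"]
```

When two predictors carry the same ranking information like this, keeping both adds nothing to a rank-based screen and invites multicollinearity in a linear model.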

Correlation matrix Active Rigs

We will first build a matrix of Spearman correlation coefficients to identify the variables that have at least a weak correlation (above 0.3 or below -0.3) with Active Rigs.

	x
Active_Rigs	1.0000000
Drilled_Wells	0.9281927
Oil_Reserves	0.4381892
Oil_Production	0.3700018
Eco_Rec	0.2988132
Oil_Price	0.2539703
President_Party	0.2051462
Gov_ctrl_type	0.1149544
Month	0.0761974
HDD	0.0523462
Season	-0.0083891
Senate	-0.0539184
CDD	-0.0775931
Gov_ctrl_party	-0.1415784
Net_imports_Energy	-0.2739901
Energy_Consumption	-0.3396680
GDP	-0.3614154
Year	-0.3666279
US_Population	-0.3666279
life_exp	-0.3678292
House	-0.4006386
Oil_Stock	-0.4074189
Active_Conflicts	-0.6697390

With active rigs we see a similar trend: we have highly correlated variables in our dataset, and we will have to drop some of them to avoid multicollinearity.

Full Linear model Y = Oil Price

We will begin by fitting a linear model of oil price against all of the available explanatory variables.


Call:
lm(formula = Oil_Price ~ ., data = OIL_MODIFIED_MODEL)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.275  -2.575   0.076   2.771  39.407 

Coefficients: (2 not defined because of singularities)
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -2.551e+04  5.085e+03  -5.016 7.71e-07 ***
Month                      2.759e-01  1.250e-01   2.207 0.027838 *  
Year                       1.390e+01  2.722e+00   5.107 4.93e-07 ***
Active_Rigs                3.909e-03  1.555e-03   2.513 0.012323 *  
US_Population             -7.068e+00  9.087e-01  -7.778 5.47e-14 ***
President_PartyRepublican -4.026e+00  1.094e+00  -3.678 0.000264 ***
SenateRepublican           2.555e+00  2.199e+00   1.162 0.245921    
HouseRepublican           -6.122e+00  2.710e+00  -2.259 0.024353 *  
life_exp                  -5.437e+00  2.236e+00  -2.432 0.015420 *  
GDP                        2.526e+01  2.025e+00  12.470  < 2e-16 ***
Oil_Production             2.847e-03  1.418e-03   2.008 0.045261 *  
Oil_Stock                 -9.132e-02  1.435e-02  -6.363 5.04e-10 ***
HDD                        2.705e-03  4.770e-03   0.567 0.570874    
CDD                        1.958e-02  1.168e-02   1.676 0.094369 .  
Energy_Consumption         2.127e-01  1.854e+00   0.115 0.908724    
Net_imports_Energy        -1.508e+01  2.566e+00  -5.875 8.46e-09 ***
Oil_Reserves              -2.038e-03  4.586e-04  -4.444 1.13e-05 ***
Active_Conflicts           4.234e-02  2.940e-02   1.440 0.150501    
Drilled_Wells              2.178e-03  8.049e-04   2.707 0.007067 ** 
Eco_RecRecession          -4.553e-01  1.039e+00  -0.438 0.661531    
SeasonSpring               1.360e+00  1.199e+00   1.134 0.257560    
SeasonSummer              -1.021e+00  1.760e+00  -0.580 0.562125    
SeasonWinter              -2.240e+00  1.615e+00  -1.387 0.166124    
Gov_ctrl_typeTotal        -1.438e+00  1.026e+00  -1.401 0.161852    
Gov_ctrl_partyRepublican          NA         NA      NA       NA    
Gov_ctrl_partyshared              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.597 on 432 degrees of freedom
  (96 observations deleted due to missingness)
Multiple R-squared:  0.9068,    Adjusted R-squared:  0.9018 
F-statistic: 182.7 on 23 and 432 DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: Oil_Price
                    Df Sum Sq Mean Sq   F value    Pr(>F)    
Month                1    309     309    7.1060  0.007971 ** 
Year                 1  91169   91169 2094.8581 < 2.2e-16 ***
Active_Rigs          1  50324   50324 1156.3228 < 2.2e-16 ***
US_Population        1    819     819   18.8116 1.798e-05 ***
President_Party      1   4745    4745  109.0386 < 2.2e-16 ***
Senate               1  15114   15114  347.2866 < 2.2e-16 ***
House                1   6137    6137  141.0032 < 2.2e-16 ***
life_exp             1    176     176    4.0327  0.045249 *  
GDP                  1   7642    7642  175.5931 < 2.2e-16 ***
Oil_Production       1     50      50    1.1425  0.285715    
Oil_Stock            1    419     419    9.6215  0.002049 ** 
HDD                  1    638     638   14.6708  0.000147 ***
CDD                  1      0       0    0.0104  0.918874    
Energy_Consumption   1    151     151    3.4625  0.063454 .  
Net_imports_Energy   1   2809    2809   64.5378 9.106e-15 ***
Oil_Reserves         1   1665    1665   38.2601 1.437e-09 ***
Active_Conflicts     1     53      53    1.2128  0.271390    
Drilled_Wells        1    266     266    6.1185  0.013761 *  
Eco_Rec              1      2       2    0.0538  0.816692    
Season               3    319     106    2.4455  0.063397 .  
Gov_ctrl_type        1     85      85    1.9635  0.161852    
Residuals          432  18801      44                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A linear model with all variables included gives us an adjusted coefficient of determination of 0.9018 and a residual standard error of 6.597 dollars. The ANOVA test of the coefficients tells us that the following variables are not significant for this model at an alpha level of 5% (Gov_ctrl_party could not be estimated because of singularities):

Oil_Production
CDD
Energy_Consumption
Active_Conflicts
Eco_Rec
Season
Gov_ctrl_type
Gov_ctrl_party
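
The "not defined because of singularities" note in the summary appears when a predictor is an exact linear combination of others, as seems to be the case for Gov_ctrl_party. The sketch below reproduces the effect on made-up data; the variables `x1`, `x2`, `x3` and `y` are hypothetical and unrelated to the project dataset.

```r
set.seed(3)  # arbitrary seed, for reproducibility only

n <- 60
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- toy$x1 + toy$x2  # x3 carries no information beyond x1 and x2
toy$y  <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = toy)

coef(fit)   # the coefficient of x3 is NA: it cannot be estimated
alias(fit)  # shows which combination of terms x3 duplicates
anova(fit)  # sequential ANOVA; the aliased term gets no row
```

Note that `anova()` on an `lm` fit tests terms sequentially, in the order they appear in the formula, so the p-value of a variable can change if the order of the predictors changes.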

To evaluate the generated model in more detail, we should look at the residuals to check their normality and constant variance.

The residuals plot shows that they are roughly centered and scattered around 0, except at high oil price levels. They also show a certain level of “heteroscedasticity”, meaning that the residuals get larger as the prediction moves from small to large. This is something we need to keep in mind when we later evaluate variable transformations. The Q-Q plot tells a similar story to the residuals plot.

Full Linear model Y = Active Rigs

We will begin by fitting a linear model of Active rigs against all of the available explanatory variables.


Call:
lm(formula = Active_Rigs ~ ., data = OIL_MODIFIED_MODEL)

Residuals:
    Min      1Q  Median      3Q     Max 
-579.89  -94.98    5.04   84.42 1157.95 

Coefficients: (2 not defined because of singularities)
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -3.177e+05  1.599e+05  -1.987 0.047572 *  
Month                      4.381e+00  3.855e+00   1.136 0.256415    
Year                       1.653e+02  8.571e+01   1.928 0.054481 .  
US_Population             -4.804e+01  2.971e+01  -1.617 0.106593    
President_PartyRepublican -1.998e+02  3.275e+01  -6.100 2.36e-09 ***
SenateRepublican          -1.155e+01  6.765e+01  -0.171 0.864463    
HouseRepublican            3.164e+00  8.371e+01   0.038 0.969867    
life_exp                   4.052e+01  6.910e+01   0.586 0.557910    
GDP                       -5.869e+01  7.249e+01  -0.810 0.418605    
Oil_Production            -1.978e-01  4.270e-02  -4.632 4.81e-06 ***
Oil_Stock                 -1.675e+00  4.538e-01  -3.692 0.000251 ***
Oil_Price                  3.687e+00  1.467e+00   2.513 0.012323 *  
HDD                        1.562e-01  1.464e-01   1.068 0.286327    
CDD                       -9.085e-02  3.598e-01  -0.252 0.800781    
Energy_Consumption        -2.245e+01  5.694e+01  -0.394 0.693599    
Net_imports_Energy        -6.835e+01  8.183e+01  -0.835 0.404064    
Oil_Reserves               5.680e-02  1.414e-02   4.017 6.95e-05 ***
Active_Conflicts           9.646e-01  9.038e-01   1.067 0.286436    
Drilled_Wells              4.099e-01  1.524e-02  26.892  < 2e-16 ***
Eco_RecRecession           1.725e+02  3.082e+01   5.597 3.89e-08 ***
SeasonSpring               6.016e+01  3.678e+01   1.636 0.102612    
SeasonSummer               1.551e+01  5.406e+01   0.287 0.774318    
SeasonWinter               5.255e+01  4.964e+01   1.059 0.290349    
Gov_ctrl_typeTotal         1.494e+01  3.157e+01   0.473 0.636218    
Gov_ctrl_partyRepublican          NA         NA      NA       NA    
Gov_ctrl_partyshared              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 202.6 on 432 degrees of freedom
  (96 observations deleted due to missingness)
Multiple R-squared:  0.9352,    Adjusted R-squared:  0.9318 
F-statistic: 271.2 on 23 and 432 DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: Active_Rigs
                    Df   Sum Sq  Mean Sq   F value    Pr(>F)    
Month                1   738085   738085   17.9807 2.731e-05 ***
Year                 1 45785358 45785358 1115.3882 < 2.2e-16 ***
US_Population        1 13504188 13504188  328.9788 < 2.2e-16 ***
President_Party      1  4453281  4453281  108.4875 < 2.2e-16 ***
Senate               1  5932278  5932278  144.5177 < 2.2e-16 ***
House                1 81501703 81501703 1985.4827 < 2.2e-16 ***
life_exp             1 34160309 34160309  832.1875 < 2.2e-16 ***
GDP                  1  8000118  8000118  194.8928 < 2.2e-16 ***
Oil_Production       1   469909   469909   11.4476 0.0007812 ***
Oil_Stock            1  4691129  4691129  114.2817 < 2.2e-16 ***
Oil_Price            1 11911533 11911533  290.1797 < 2.2e-16 ***
HDD                  1  1046345  1046345   25.4903 6.567e-07 ***
CDD                  1    89693    89693    2.1850 0.1400869    
Energy_Consumption   1   333804   333804    8.1319 0.0045579 ** 
Net_imports_Energy   1     3174     3174    0.0773 0.7810762    
Oil_Reserves         1   480056   480056   11.6948 0.0006863 ***
Active_Conflicts     1   369798   369798    9.0087 0.0028427 ** 
Drilled_Wells        1 41125657 41125657 1001.8721 < 2.2e-16 ***
Eco_Rec              1  1328248  1328248   32.3578 2.367e-08 ***
Season               3   122802    40934    0.9972 0.3939847    
Gov_ctrl_type        1     9197     9197    0.2240 0.6362178    
Residuals          432 17733086    41049                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A linear model with all variables included gives an adjusted coefficient of determination of 0.9318 and a residual standard error (RMSE) of about 203 rigs. The ANOVA table tells us which terms are not significant for this model at an alpha level of 5%. Gov_ctrl_party is listed as well because its coefficients could not be estimated at all (they are reported as NA):

CDD
Net_imports_Energy
Season
Gov_ctrl_type
Gov_ctrl_party
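
The summary figures quoted above can be cross-checked by hand from the printed output. The following is a minimal sketch in Python (the report itself was produced in R). The function names are illustrative and not part of the original analysis, and the inputs are the rounded values copied from the regression summary and the ANOVA table.

```python
import math

# Values copied from the regression summary and ANOVA table above.
r_squared = 0.9352      # Multiple R-squared
resid_df = 432          # residual degrees of freedom
model_df = 23           # numerator DF of the overall F-statistic
resid_ss = 17733086     # residual sum of squares (ANOVA "Residuals" row)

# Observations actually used in the fit: slopes + intercept + residual DF.
n_obs = resid_df + model_df + 1


def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p estimated slopes."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def residual_standard_error(rss, df):
    """Square root of the residual mean square."""
    return math.sqrt(rss / df)


def sequential_f(term_ss, term_df, rss, rdf):
    """F value of one ANOVA row: term mean square over residual mean square."""
    return (term_ss / term_df) / (rss / rdf)


adj_r2 = adjusted_r_squared(r_squared, n_obs, model_df)
rse = residual_standard_error(resid_ss, resid_df)
f_cdd = sequential_f(89693, 1, resid_ss, resid_df)  # CDD row of the ANOVA table

print(n_obs, adj_r2, rse, f_cdd)
```

Because the inputs are rounded, the results agree with the printed 0.9318, 202.6 and 2.1850 only up to rounding.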

To evaluate the fitted model in more detail, we should look at the residuals and check the normality and constant-variance assumptions.

The residuals plot shows that the residuals are roughly centered and scattered around 0, except at high levels of active rigs. They also show some degree of heteroscedasticity, meaning that the residuals get larger as the prediction moves from small to large. We need to keep this in mind when evaluating variable transformations later. The Q-Q plot tells a similar story to the residuals plot.

Multicollinearity Evaluation

Before going any further we should evaluate multicollinearity in our models. The correlation plots already gave us a hint of which variables are highly correlated. We will first check whether the dataset contains aliased variables, that is, variables that are perfect linear combinations of other variables.

Model :
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    HDD + CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves + 
    Active_Conflicts + Drilled_Wells + Eco_Rec + Season + Gov_ctrl_type + 
    Gov_ctrl_party

Complete :
                         (Intercept) Month Year Active_Rigs US_Population
Gov_ctrl_partyRepublican -1/2           0     0    0           0         
Gov_ctrl_partyshared        1           0     0    0           0         
                         President_PartyRepublican SenateRepublican
Gov_ctrl_partyRepublican  1/2                         0            
Gov_ctrl_partyshared        0                         0            
                         HouseRepublican life_exp GDP  Oil_Production
Gov_ctrl_partyRepublican  1/2               0        0    0          
Gov_ctrl_partyshared        0               0        0    0          
                         Oil_Stock HDD  CDD  Energy_Consumption
Gov_ctrl_partyRepublican    0         0    0    0              
Gov_ctrl_partyshared        0         0    0    0              
                         Net_imports_Energy Oil_Reserves Active_Conflicts
Gov_ctrl_partyRepublican    0                  0            0            
Gov_ctrl_partyshared        0                  0            0            
                         Drilled_Wells Eco_RecRecession SeasonSpring
Gov_ctrl_partyRepublican    0             0                0        
Gov_ctrl_partyshared        0             0                0        
                         SeasonSummer SeasonWinter Gov_ctrl_typeTotal
Gov_ctrl_partyRepublican    0            0          1/2              
Gov_ctrl_partyshared        0            0           -1

Model :
Active_Rigs ~ Month + Year + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    Oil_Price + HDD + CDD + Energy_Consumption + Net_imports_Energy + 
    Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec + 
    Season + Gov_ctrl_type + Gov_ctrl_party

Complete :
                         (Intercept) Month Year US_Population
Gov_ctrl_partyRepublican -1/2           0     0    0         
Gov_ctrl_partyshared        1           0     0    0         
                         President_PartyRepublican SenateRepublican
Gov_ctrl_partyRepublican  1/2                         0            
Gov_ctrl_partyshared        0                         0            
                         HouseRepublican life_exp GDP  Oil_Production
Gov_ctrl_partyRepublican  1/2               0        0    0          
Gov_ctrl_partyshared        0               0        0    0          
                         Oil_Stock Oil_Price HDD  CDD  Energy_Consumption
Gov_ctrl_partyRepublican    0         0         0    0    0              
Gov_ctrl_partyshared        0         0         0    0    0              
                         Net_imports_Energy Oil_Reserves Active_Conflicts
Gov_ctrl_partyRepublican    0                  0            0            
Gov_ctrl_partyshared        0                  0            0            
                         Drilled_Wells Eco_RecRecession SeasonSpring
Gov_ctrl_partyRepublican    0             0                0        
Gov_ctrl_partyshared        0             0                0        
                         SeasonSummer SeasonWinter Gov_ctrl_typeTotal
Gov_ctrl_partyRepublican    0            0          1/2              
Gov_ctrl_partyshared        0            0           -1
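
Each row of the "Complete" matrix above gives the coefficients that reproduce an aliased dummy variable exactly from the other model columns. The sketch below, written in Python for illustration, applies those coefficients to four hypothetical government configurations. The rows are assumptions made up for this example, not observations from the dataset, and `reconstruct` is an illustrative helper.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Non-zero coefficients copied from the "Complete" alias matrix above.
ALIAS = {
    "Gov_ctrl_partyRepublican": {
        "(Intercept)": -HALF,
        "President_PartyRepublican": HALF,
        "HouseRepublican": HALF,
        "Gov_ctrl_typeTotal": HALF,
    },
    "Gov_ctrl_partyshared": {
        "(Intercept)": Fraction(1),
        "Gov_ctrl_typeTotal": Fraction(-1),
    },
}


def reconstruct(term, row):
    """Rebuild an aliased dummy as a linear combination of the other columns."""
    return sum(coef * row[col] for col, coef in ALIAS[term].items())


def make_row(president_rep, house_rep, total_control):
    """Hypothetical design-matrix row (1 = yes, 0 = no)."""
    return {
        "(Intercept)": 1,
        "President_PartyRepublican": president_rep,
        "HouseRepublican": house_rep,
        "Gov_ctrl_typeTotal": total_control,
    }


# Hypothetical configurations: full Republican control, full Democratic
# control, and two split governments.
rows = [make_row(1, 1, 1), make_row(0, 0, 1), make_row(1, 0, 0), make_row(0, 1, 0)]

for row in rows:
    rep = reconstruct("Gov_ctrl_partyRepublican", row)
    shared = reconstruct("Gov_ctrl_partyshared", row)
    print(rep, shared)
```

Since both dummies are fully determined by columns already in the model, they carry no additional information, which is why their coefficients are reported as NA.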

As we can see, Gov_ctrl_type and Gov_ctrl_party are aliased with the other political variables, so we will drop them from the dataset and then generate a VIF matrix to look for other collinear explanatory variables.

	GVIF	Df	GVIF^(1/(2*Df))
Month	1.890459	1	1.374939
Year	10877.031124	1	104.293006
Active_Rigs	3.968626	1	1.992141
US_Population	7889.673094	1	88.823832
President_Party	2.441082	1	1.562396
Senate	7.488026	1	2.736426
House	11.497575	1	3.390808
life_exp	204.129970	1	14.287406
GDP	532.624103	1	23.078650
Oil_Production	22.474923	1	4.740772
Oil_Stock	118.034523	1	10.864369
HDD	23.352858	1	4.832479
CDD	16.766321	1	4.094670
Energy_Consumption	25.308852	1	5.030790
Net_imports_Energy	13.723124	1	3.704474
Oil_Reserves	26.968245	1	5.193096
Active_Conflicts	7.728751	1	2.780063
Eco_Rec	1.574211	1	1.254676
Season	32.850729	3	1.789606

	GVIF	Df	GVIF^(1/(2*Df))
Month	1.885898	1	1.373280
Year	11206.371199	1	105.860149
US_Population	7887.114065	1	88.809426
President_Party	2.466371	1	1.570468
Senate	7.292218	1	2.700411
House	11.227107	1	3.350687
life_exp	248.770202	1	15.772451
GDP	734.158067	1	27.095351
Oil_Production	36.535996	1	6.044501
Oil_Stock	117.483915	1	10.839000
Oil_Price	11.390239	1	3.374943
HDD	23.815865	1	4.880150
CDD	16.783883	1	4.096814
Energy_Consumption	27.020968	1	5.198170
Net_imports_Energy	26.840064	1	5.180740
Oil_Reserves	27.275467	1	5.222592
Active_Conflicts	9.332611	1	3.054932
Eco_Rec	1.484848	1	1.218543
Season	34.183125	3	1.801504
predictions_oil	41.542859	1	6.445375
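
The third column of the tables above follows directly from the first two. The Python sketch below illustrates this. For a term with one degree of freedom the GVIF is the ordinary VIF, so it also shows the R-squared that a given VIF implies when that variable is regressed on all the others. The function names are illustrative.

```python
def scaled_gvif(gvif, df):
    """GVIF^(1/(2*Df)): makes terms with different degrees of freedom comparable."""
    return gvif ** (1.0 / (2 * df))


def implied_r_squared(vif):
    """R-squared of a predictor regressed on all others, from VIF = 1 / (1 - R^2)."""
    return 1.0 - 1.0 / vif


# Values copied from the first table above.
year_scaled = scaled_gvif(10877.031124, 1)    # single-df term: plain square root
season_scaled = scaled_gvif(32.850729, 3)     # three-df factor: sixth root
year_r2 = implied_r_squared(10877.031124)

print(year_scaled, season_scaled, year_r2)
```

A GVIF above 10,000 for Year means that more than 99.99% of its variance is explained by the other predictors, which is expected when Year, US_Population and GDP all trend upward together.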

Variables that are highly collinear, with a GVIF^(1/(2*Df)) value greater than 5 in at least one of the two tables, include:

Year
US_Population
life_exp
GDP
Oil_Production
Energy_Consumption
Oil_Reserves

Before deciding whether to remove them from the models, we will evaluate more advanced feature selection algorithms.

StepAIC (Akaike information criterion), Oil Price model

We will run stepwise AIC selection in the backward direction and see which variables it drops.

Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    HDD + CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves + 
    Active_Conflicts + Drilled_Wells + Eco_Rec + Season

Final Model:
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves + 
    Active_Conflicts + Drilled_Wells + Season

       Step Df Deviance Resid. Df Resid. Dev      AIC
1                             433   18886.24 1744.005
2     - HDD  1 1.773351       434   18888.01 1742.048
3 - Eco_Rec  1 5.320678       435   18893.33 1740.177

The AIC algorithm decides to drop the following variables:

HDD
Eco_Rec (economic recession)
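
The AIC column of the stepwise path above can be reproduced from the residual deviance. The Python sketch below assumes the criterion is the one R's `extractAIC` reports for linear models, n * log(RSS / n) + 2 * edf, and that n = 456 observations were used (433 residual degrees of freedom plus 23 estimated coefficients). Both are assumptions inferred from the printed output.

```python
import math

N_OBS = 456  # assumption: 433 residual DF + 23 coefficients in the initial model


def aic_linear_model(rss, n, n_coefficients):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * edf."""
    return n * math.log(rss / n) + 2 * n_coefficients


# (step, residual deviance, residual DF) copied from the stepwise table above.
path = [
    ("initial", 18886.24, 433),
    ("- HDD", 18888.01, 434),
    ("- Eco_Rec", 18893.33, 435),
]

aic_values = []
for step, rss, resid_df in path:
    n_coef = N_OBS - resid_df  # number of estimated coefficients at this step
    aic_values.append(aic_linear_model(rss, N_OBS, n_coef))
    print(step, aic_values[-1])
```

Each removal raises the residual deviance slightly, but the saving of 2 per dropped coefficient outweighs that loss, so the AIC keeps decreasing until no further removal helps.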

StepAIC (Akaike information criterion), Active Rigs model

We will run stepwise AIC selection in the backward direction and see which variables it drops.

Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
Active_Rigs ~ Month + Year + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    Oil_Price + HDD + CDD + Energy_Consumption + Net_imports_Energy + 
    Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec + 
    Season

Final Model:
Active_Rigs ~ Year + US_Population + President_Party + Oil_Production + 
    Oil_Stock + Oil_Price + HDD + CDD + Net_imports_Energy + 
    Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec

                  Step Df    Deviance Resid. Df Resid. Dev      AIC
1                                           433   17742283 4865.450
2             - Season  3 122801.6609       436   17865085 4862.595
3              - House  1    262.9605       437   17865348 4860.602
4              - Month  1    638.7724       438   17865986 4858.618
5 - Energy_Consumption  1   2877.4922       439   17868864 4856.691
6             - Senate  1   9271.8029       440   17878136 4854.928
7           - life_exp  1  32495.7768       441   17910631 4853.756
8                - GDP  1  38134.0117       442   17948765 4852.726

The AIC algorithm decides to drop the following variables:

Season
House
Month
Energy_Consumption
Senate
life_exp
GDP

BIC (Bayesian information criterion), Oil Price model

We will run the BIC algorithm and see which variables have the highest likelihood of being included.

BIC penalizes the number of terms considerably more strictly than AIC. The variables that are not included in the most likely model are:

Month
Active_Rigs
Senate
Oil_Production
HDD
CDD
Energy_Consumption
Active_Conflicts
Eco_Rec
Season
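
To see why BIC excludes more variables than AIC, it helps to compare the per-term penalties. The Python sketch below assumes the same n = 456 complete observations as the linear model summary, and uses the HDD removal from the Oil_Price stepwise table as the example. The report does not show how the BIC search itself was run, so this only illustrates the penalty difference.

```python
import math

N_OBS = 456  # assumption: same complete rows as in the lm summary

AIC_PENALTY = 2.0              # cost per extra coefficient under AIC
BIC_PENALTY = math.log(N_OBS)  # cost per extra coefficient under BIC


def criterion_change(rss_with, rss_without, n, penalty, terms_dropped=1):
    """Change in the criterion when terms are dropped (negative favors dropping)."""
    fit_loss = n * math.log(rss_without / rss_with)
    return fit_loss - penalty * terms_dropped


# Dropping HDD from the Oil_Price model: RSS goes from 18886.24 to 18888.01.
d_aic = criterion_change(18886.24, 18888.01, N_OBS, AIC_PENALTY)
d_bic = criterion_change(18886.24, 18888.01, N_OBS, BIC_PENALTY)

print(BIC_PENALTY, d_aic, d_bic)
```

With 456 observations each coefficient costs about 6.1 under BIC versus 2 under AIC, so a variable has to reduce the residual sum of squares roughly three times as much (on the log scale) to stay in the model.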

BIC (Bayesian information criterion), Active Rigs model

We will run the BIC algorithm and see which variables have the highest likelihood of being included.

BIC penalizes the number of terms considerably more strictly than AIC. The variables that are not included in the most likely model are:

Month
Senate
House
life_exp
GDP
CDD
Energy_Consumption
Net_imports_Energy
Active_Conflicts
Season

Boruta Feature Selection Oil Price

We will run the Boruta feature selection algorithm and see which variables are included

Boruta performed 99 iterations in 21.89947 secs.
 17 attributes confirmed important: Active_Conflicts, Active_Rigs,
CDD, Eco_Rec, Energy_Consumption and 12 more;
 1 attributes confirmed unimportant: Season;
 1 tentative attributes left: HDD;

features	medianImpact	decision
GDP	13.583942	Confirmed
US_Population	13.207721	Confirmed
Year	13.116314	Confirmed
Active_Rigs	12.602486	Confirmed
Oil_Stock	12.566571	Confirmed
life_exp	10.985583	Confirmed
Senate	10.465949	Confirmed
Active_Conflicts	8.949173	Confirmed
Net_imports_Energy	8.597149	Confirmed
Oil_Reserves	8.522154	Confirmed
Oil_Production	8.505690	Confirmed
Month	6.678523	Confirmed
Energy_Consumption	6.076158	Confirmed
CDD	5.334423	Confirmed
President_Party	5.082532	Confirmed
House	4.744941	Confirmed
Eco_Rec	4.201994	Confirmed
HDD	2.801193	Tentative
Season	1.878089	Rejected

             Month               Year        Active_Rigs 
         Confirmed          Confirmed          Confirmed 
     US_Population    President_Party             Senate 
         Confirmed          Confirmed          Confirmed 
             House           life_exp                GDP 
         Confirmed          Confirmed          Confirmed 
    Oil_Production          Oil_Stock                HDD 
         Confirmed          Confirmed          Tentative 
               CDD Energy_Consumption Net_imports_Energy 
         Confirmed          Confirmed          Confirmed 
      Oil_Reserves   Active_Conflicts            Eco_Rec 
         Confirmed          Confirmed          Confirmed 
            Season 
          Rejected 
Levels: Tentative Confirmed Rejected

Only one variable was rejected by Boruta's final decision:

Season

Interestingly, the variables given the highest importance are economic variables such as GDP and population.

Boruta Feature Selection Active Rigs

Boruta performed 99 iterations in 21.22427 secs.
 16 attributes confirmed important: Active_Conflicts, Eco_Rec,
Energy_Consumption, GDP, House and 11 more;
 No attributes deemed unimportant.
 3 tentative attributes left: CDD, HDD, Season;

features	medianImpact	decision
Oil_Price	21.231639	Confirmed
Oil_Stock	14.001810	Confirmed
GDP	13.771841	Confirmed
Year	13.462641	Confirmed
US_Population	13.449564	Confirmed
life_exp	12.746623	Confirmed
Net_imports_Energy	11.438301	Confirmed
Active_Conflicts	11.436812	Confirmed
Oil_Reserves	11.024660	Confirmed
Oil_Production	10.984145	Confirmed
Eco_Rec	8.375199	Confirmed
President_Party	8.094818	Confirmed
Senate	8.037422	Confirmed
Energy_Consumption	6.580326	Confirmed
Month	5.184240	Confirmed
House	4.492152	Confirmed
CDD	1.929082	Tentative
HDD	1.739728	Tentative
Season	1.684771	Tentative

             Month               Year      US_Population 
         Confirmed          Confirmed          Confirmed 
   President_Party             Senate              House 
         Confirmed          Confirmed          Confirmed 
          life_exp                GDP     Oil_Production 
         Confirmed          Confirmed          Confirmed 
         Oil_Stock          Oil_Price                HDD 
         Confirmed          Confirmed          Tentative 
               CDD Energy_Consumption Net_imports_Energy 
         Tentative          Confirmed          Confirmed 
      Oil_Reserves   Active_Conflicts            Eco_Rec 
         Confirmed          Confirmed          Confirmed 
            Season 
         Tentative 
Levels: Tentative Confirmed Rejected

Three variables were classified as tentative by Boruta's final decision:

CDD
HDD
Season

MBA Thesis Part I, EDA & Inference

Luis Fernando Perez Armas

April 2, 2019
