The majority of the data used for this analysis was, downloaded directly from the EIA webpage, the link can be found down below.

EIA Webpage

Data Codebook

Description of variables, units of measurements, data category and the source of data, can be found on the tables below:

Codebook Varible Description

Variable Name Description
Date Date of the observation
Month Month of the observation
Year Year of the observation
Active_Rigs Number of active drilling rigs on and offshore recorded to be working on that date
US_Population USA population estimates for the date of the observation
President_Party Identified political party for the ruling US president on the date observation
Senate Political party that holds the majority of US Senate on the date of the observation
House Political party that holds the majority of US congress on the date of the observation
life_exp US average life expectancy at the date of the observation
GDP Gross Domestic Product of the US economy on the date of the observation
Oil_Production US Total oil production on the date of the observation
Oil_Stock US inventory of oil in stock on the date of the observation
Oil_Price US average domestic first purchase price, not corrected by inflation
HDD Monthly US HDD are the number of degrees that the daily average temperature falls below 65 °F
CDD Monthly US CDD are the number of degrees that the daily average temperature rises above 65 °F
Energy_Consumption The amount of energy consumed by the US market on the date of the observation
Net_imports_Energy The Net amount of energy imported by the US market on the date of the observation
Oil_Reserves US proven oil reserves on the date of the observation
Active_Conflicts Number of world active military conflicts on the date of the observation, includes civil wars
Drilled_Wells Number of drilled wells on US on the date of the observation

Codebook Varible Units of measurement

Variable Name Units of measurement
Date Date format Year-Month-Day
Month Integer from 1-12
Year Integer 4 digits
Active_Rigs Integer each unit represents a drilling rig
US_Population Numerical in Millions of people
President_Party Categorical variable Either Democrat or Republican
Senate Categorical variable Either Democrat or Republican
House Categorical variable Either Democrat or Republican
life_exp Numerical life expectancy in years
GDP Numerical GDP in Trillions of dollars 10^12 USD
Oil_Production Numerical Oil production in Thousand barrels per day
Oil_Stock Numerical US inventory of oil in million barrels
Oil_Price Numerical US dollars
HDD Numerical Monthly US HDD measured un number of °F
CDD Numerical Monthly US CDD measured un number of °F
Energy_Consumption Numerical Energy consumed in Quadrillion btu’s 10^15
Net_imports_Energy Numerical Net energy imported in Quadrillion btu’s 10^15
Oil_Reserves NUmerical Oil reserves in million barrels
Active_Conflicts Numerical each unit represents an active military conflict
Drilled_Wells Numerical each unit represents a drilled well

Codebook Varible Source of data

Variable Name Units of measurement
Date Included with other measurements
Month Included with other measurements
Year Included with other measurements
Active_Rigs Data Link
US_Population Data Link
President_Party Data Link
Senate Data Link
House Data Link
life_exp Data Link
GDP Data Link
Oil_Production Data Link
Oil_Stock Data Link
Oil_Price Data Link
HDD Data Link
CDD Data Link
Energy_Consumption Data Link
Net_imports_Energy Data Link
Oil_Reserves Data Link
Active_Conflicts Data Link
Active_Conflicts 2 Data Link
Drilled_Wells Data Link

Objective

The main purpose of this work is to create models that are able to predict the rig activity (Activity_Rig) in US and the behavior of oil price (Oil_Price), based on various explanatory variables. In order to achive this objective it is necessary to first develop insight on variables that might influence the behavior of both response variables.

Exploratory data analysis

The first step in our analysis will be to upload the dataset, and explore its structure, the descriptive statistics and if weather or not observations in the dataset are complete.

'data.frame':   552 obs. of  20 variables:
 $ Date              : Date, format: "1973-01-01" "1973-02-01" ...
 $ Month             : Factor w/ 12 levels "01","02","03",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Year              : num  1973 1973 1973 1973 1973 ...
 $ Active_Rigs       : num  1219 1126 1049 993 1046 ...
 $ US_Population     : num  212 212 212 212 212 ...
 $ President_Party   : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 2 2 2 2 2 2 ...
 $ Senate            : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
 $ House             : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
 $ life_exp          : num  71.4 71.4 71.4 71.4 71.4 ...
 $ GDP               : num  1.43 1.43 1.43 1.43 1.43 ...
 $ Oil_Production    : num  10856 11140 11006 11042 11002 ...
 $ Oil_Stock         : num  237 235 244 248 257 248 243 248 241 246 ...
 $ Oil_Price         : num  3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 ...
 $ HDD               : int  965 843 552 402 215 26 8 11 74 246 ...
 $ CDD               : int  5 3 19 13 62 214 297 280 142 50 ...
 $ Energy_Consumption: num  7.23 6.59 6.52 5.94 6.07 ...
 $ Net_imports_Energy: num  1.047 1.047 1.17 0.891 0.966 ...
 $ Oil_Reserves      : num  35300 35300 35300 35300 35300 35300 35300 35300 35300 35300 ...
 $ Active_Conflicts  : int  52 52 52 52 52 52 52 52 52 52 ...
 $ Drilled_Wells     : int  2166 1896 2151 1883 2298 2279 2367 2690 2333 2538 ...
      Date                Month          Year       Active_Rigs    
 Min.   :1973-01-01   01     : 46   Min.   :1973   Min.   : 407.0  
 1st Qu.:1984-06-23   02     : 46   1st Qu.:1984   1st Qu.: 856.8  
 Median :1995-12-16   03     : 46   Median :1996   Median :1184.5  
 Mean   :1995-12-16   04     : 46   Mean   :1996   Mean   :1435.0  
 3rd Qu.:2007-06-08   05     : 46   3rd Qu.:2007   3rd Qu.:1851.5  
 Max.   :2018-12-01   06     : 46   Max.   :2018   Max.   :4521.0  
                      (Other):276                                  
 US_Population     President_Party        Senate           House    
 Min.   :211.9   Democrat  :240    Democrat  :288   Democrat  :312  
 1st Qu.:235.8   Republican:312    Republican:264   Republican:240  
 Median :267.8                                                      
 Mean   :268.6                                                      
 3rd Qu.:301.2                                                      
 Max.   :327.0                                                      
                                                                    
    life_exp          GDP         Oil_Production    Oil_Stock     
 Min.   :71.36   Min.   : 1.429   Min.   : 5488   Min.   : 233.0  
 1st Qu.:74.56   1st Qu.: 4.041   1st Qu.: 7769   1st Qu.: 763.8  
 Median :75.82   Median : 7.882   Median : 9160   Median : 898.0  
 Mean   :75.98   Mean   : 9.027   Mean   : 9323   Mean   : 823.4  
 3rd Qu.:77.99   3rd Qu.:14.419   3rd Qu.:10252   3rd Qu.:1003.2  
 Max.   :78.84   Max.   :20.660   Max.   :16289   Max.   :1230.0  
                                                                  
   Oil_Price           HDD              CDD        Energy_Consumption
 Min.   :  3.89   Min.   :   3.0   Min.   :  2.0   Min.   :5.438     
 1st Qu.: 14.01   1st Qu.:  45.0   1st Qu.: 12.0   1st Qu.:6.592     
 Median : 23.51   Median : 310.0   Median : 42.0   Median :7.552     
 Mean   : 33.28   Mean   : 378.9   Mean   :103.6   Mean   :7.382     
 3rd Qu.: 45.19   3rd Qu.: 660.0   3rd Qu.:183.8   3rd Qu.:8.048     
 Max.   :128.08   Max.   :1266.0   Max.   :404.0   Max.   :9.655     
 NA's   :1        NA's   :2        NA's   :2       NA's   :2         
 Net_imports_Energy  Oil_Reserves   Active_Conflicts Drilled_Wells 
 Min.   :0.06212    Min.   :19121   Min.   : 52.0    Min.   :1268  
 1st Qu.:0.94186    1st Qu.:22351   1st Qu.: 83.5    1st Qu.:2288  
 Median :1.30195    Median :26544   Median :132.5    Median :2914  
 Mean   :1.39765    Mean   :26705   Mean   :121.8    Mean   :3482  
 3rd Qu.:1.85828    3rd Qu.:30529   3rd Qu.:151.2    3rd Qu.:4208  
 Max.   :2.74169    Max.   :39160   Max.   :205.0    Max.   :8556  
 NA's   :2          NA's   :12      NA's   :24       NA's   :96    
              Date              Month               Year 
                 0                  0                  0 
       Active_Rigs      US_Population    President_Party 
                 0                  0                  0 
            Senate              House           life_exp 
                 0                  0                  0 
               GDP     Oil_Production          Oil_Stock 
                 0                  0                  0 
         Oil_Price                HDD                CDD 
                 1                  2                  2 
Energy_Consumption Net_imports_Energy       Oil_Reserves 
                 2                  2                 12 
  Active_Conflicts      Drilled_Wells 
                24                 96 

The variable with most NA’s in the data set is Drilled wells, the dataset of the EIA did not provide this information since the end of 2010. for the exploratory analysis the complete dataset will be used altought for modeling and predicting perhaps this variable should be dropped or only use a complete.case dataset.

After the analysis of the structure of the dataset, we will first start with a series of pairs plot for different categories of variables, that for the sake of simplicity we have divided into the following groups: Environmental, Socioeconomic, Political, Oil related, Energy related

Environmental

Altogh pairwise graphs can result complicated, they migh highlight interesting patterns of our data, for instance:

Social & economic variables

Allmast all the variables share an strong linear relationship with year, with the exception of Activity rigs, therefore we see a similar trend between them.

Oil Market

Energy Market

Political

This plot is extremely noisy due to the fact that some of these variables are categoricals, however it still provides some interesting information

Unsupervised Learning for EDA

After looking at pairs plot with more detail, it seems like various of the observations are clustered together, therefore it seems logical as well to try to find possible groups and associations that might bring up to light relevant variables or data insights. In order to to this we will use unsupervised learning methods, such as K-means and hierarchical clustering to aid us.

In order to facilitate the analysis we will start with group data by years, instead of using the monthly division. To group by years we will use median as the grouping functions seems most of the variables are rightly skewed. This yearly data will be used initially to create a heatmap.

Heatmap

Interestingly the heatmap shows that indeed recent years are part of a cluster by themselves and are characterized by:

  • Above the average oil prices
  • High levels of oil reserves
  • High levels of oil stock
  • High levels of oil production
  • Low level of active rigs (seems that drilling activity has focused on highly productive wells)
  • High energy consumption and demand
  • Lowest levels of imported energy
  • Highest levels of CDD (Global warming??)
  • Smallest levels of HDD
  • Mixed politics in all levels, either Executive, House and Senate

The next logical step in our analisis is perform a cluster analysis and analyze the characteristics of the different clusters of observations

Kmeans Cluster Analysis

Tha main issue with kmeans clustering is the question, related to the number of adequate clusters, for certain datasets (sports, market segments, gender, etc.), it is faily obvious but for other type of problems such as this one the number is unknown. To determine the number of clusters for this analysis we will use the elbow and silhouette criteria

Elbow Scree Criteria

k tot_withinss
1 308.00000
2 187.29814
3 124.30791
4 108.16505
5 85.18374
6 47.37352
7 43.09698
8 29.82660
9 26.98528
10 22.92257

Looking at the plot, the elbow seems to be located between k=2 and k=4, the silhouette will help us further with this determination

Silhouette Plot

k sil_width
2 0.3790503
3 0.3665515
4 0.4044439
5 0.4094645
6 0.4303218
7 0.4655798
8 0.4437710
9 0.4207146
10 0.4203871

The Silhouette plot, show us a peak on the silhouette width at k=7. For further exploration the number of centroids to be used will be 7

cluster n
1 5
2 8
3 9
4 3
5 4
6 12
7 4

The barplot of sil width shows, that all points have a positive width, which means that clusters were correctly created. Lets see with more detail the characteristics of each one of the seven clusters created.

cluster Year Oil_Price Oil_Reserves Oil_Stock Oil_Production Active_Rigs Energy_Consumption Net_imports_Energy CDD HDD President_Party House Senate
1 1983.0 26.0600 28416.0 681.00 10258.581 2418.50 6.145066 0.6833525 40.50 325.50 Republican Democrat Republican
2 1976.5 8.2575 31567.5 308.50 10143.014 1829.75 6.213764 1.1358337 31.25 354.50 Mixed Democrat Democrat
3 1990.0 15.3750 26254.0 909.50 9045.215 836.50 6.909735 1.1485480 38.00 320.00 Republican Democrat Democrat
4 2016.0 42.8600 32773.0 1171.00 12807.305 874.50 8.046779 0.9050690 65.00 254.00 Democrat Republican Republican
5 2012.5 94.5275 31950.0 1044.00 9354.557 1875.00 8.033162 1.2129182 52.00 281.00 Democrat Republican Democrat
6 2000.5 23.2375 21954.0 894.75 7709.979 941.25 7.928625 2.1101968 43.25 301.25 Mixed Republican Republican
7 2008.5 68.9325 20999.5 1019.00 7125.754 1659.25 8.115634 2.0194070 43.00 291.75 Mixed Democrat Democrat

Results from the cluster analysis show us clearly that more recent years are categorized by more oil independence of US, with an increase in production, a decrease on net energy imports, higher oil reserves, and moderated prices and rig activity

Hierarchical Clustering Dendogram

Hierarchical clusters show us a similar trend as well, on which more recent years, can be found as a cluster of its own, on the left side area of the dendogram

With the insight gained thanks to the Unsupervised learning EDA, we can explore further the behavior of our variables with relationship with clustering

If we look with more detail at the four plots, we can see an interesting relatioship, specially on the year 2007, highlighted by the red line, as we can see on the highest point for oil prices, oil production was at its lowest point, oil reserves as well, and rig count on low to mid levels. 2007 Seems to be a turning point in the history of US oil industry.

It is clear by the analysis that there is indeed some sort of relationship between these variables, however on some of them like Active_Rigs and Oil_Price the relationship, still seems confusing, it might be worth to look onto this relationship with more detail.

Interestingly the relationship is positive, however it is definitively not linear, it might be also interesting at the behavior of both variables with time, since oil price and rigs are on almos two order of magnitud apart, we should first normalize the variables to perform the adequate analysis and them plot their behavior on time

The time analysis whos what we suspected, indeed price and active rigs have a relationship, in fact whenever price has increased active rigs has followed the same trend, which is logical from an economic point of view, the higher the prices of goods to be sold, the more incentive for their production and for entering the oil and drilling business. Altouhgt there is a relationship between the two variables it seems that on early years, active rigs was a more volatile variable in comparisson with price, but its behavior has stabilized with time, which might suggest that there are other variables or dynamics affecting the behavior of the drilling business besides the oil price.

Drilling Dynamics I

The relationship between Active Rigs and drilled wells it is crearly and strongly linear, therefore we can assume that in time whenever active rigs increase drilled wells increase as well. On recent years, specially since the beginning of 2000’s there is a clear linear relationship between active rigs, production and reserves, which seems logical, since to produce more oil, we either need to incorpore more wells or make the current wells produce more, altought it is not as simple as it sounds due to various factors:

  • Not all drilled wells are development or production wells, some of the wells are exploration wells with the sole objective of loooking for more oil
  • Not all drilled wells results in positive production or reserves, since a portion of wells are what is called dry wells (No oil or Gas)
  • Not all wells are drilled for oil, some of them are drilled for Natural gas production
  • Oil production is a controlled variable, heavily influenced by the energy policy of the country

If we normalize our Production and Reserves variables, we can see more clearly the exploration and production dynamics.

Drilling Dynamics II

Now it is more clear, periods of high efficiency of reserves per new drilled well, and production will match periods of low drilling activity, for those periods lets called them sustain periods, the drilling activity will be meant to adjust production by the declination rate of the in service wells, whenever those values start to lay behind energy demand projections a new drilling cycle seems starts on which the industry is aggresively looking for new reservoirs to increase reserves and to include new development well into the production pipeline.

Feature Engineering for EDA

After making an initial exploration of the data in hand, it becomes clear that we might need some more information or transformations of some of the variables, since their current form perhaps they dont capture the complete picture of the problem in hand.

All code from the transformation can be found on the following link

Seasonality

Monthly variables as itself, doesnt seems to produce a clear difference neither in price or active rigs

The monthly variable might be compresed into a season variable as per the standard Seasons dates on the US

Season Dates
Spring March 1 to May 31
Summer June 1 to August 31
Autumn September 1 to November 30
Winter December 1 to February 28

After addingg the season variable, the behavior seems to be unaffected altought this question will be confirmed on posterior sections of this work, by using statistical inference methods.

Politics

Energy Policy plays indeed a critical role on the behavior of our varibles, by the analysis of the following boxplots it seems that there is a difference on the ruling parties at different levels on the US government and the effect on Oil prices and Rig Activty.

Altought we have captured the dynamics of US politics using three variables President Party, Senate and House, neither of them by themselves describe completely, which political party is truly ruling the US, energy and oil policy, for istance a ruling president with an oposing house and senate, might act completely different as a president with support at all levels. This an instesting aspect that should be explored further

New varibles are going to be coded, on which we will differentiate between a total control scenario on which the same party is ruling all three powers and a shared power scenario on which at least one of the powers belongs to a different political party. Using this variable we will stablish on scenarios of total control which party is ruling the government, and we will evaluate if there is a difference on these scenarios


   shared     Total 
0.6521739 0.3478261 

From Jan 1973 until the present date, the US government has been on a total control by one party scenario 35% of the time.

Now lets look with more detail on how this, have affected our response variables

There seems to be a certain difference in oil price whenever the government is totally ruled by a single party in comparisson with the scenario on which power is shared, however in regards to activity rigs the difference doesnt seems to be that dramatic.


  Democrat Republican     shared 
 0.1739130  0.1739130  0.6521739 

From Jan 1973 until the present date, the US government has been on a total control by one party scenario 35% of the time. On these periods of total control, each party has enjoyed equal amount of time (17% each) rulling on total control.

Now lets look with more detail on how this, have affected our response variables

When we facet the scenarios on total government control by including the ruling party, a clear difference emerges on both variables. Altought graphics uncovered differences, this will only be confirmed on posterior sections of this work, by using statistical inference methods.

Economics

As we saw on early pair graphis, GDP has grown on a linear positive fashion in regards to time, therefore graphics of GDP and Oil Price or Activity rigs doesnt, reveal any new behavior that was not previously highlighted by the relationships of these variables and time. however there might still be some further exploration to make in regards to this economic variable, an interesting aspect to highlight might be to differentiate the behavior of our variables of interest on periods of economic recession and normal economic behavior.

Name Begining End Short Description
1973 oil crisis Nov 1973 Mar 1975 The 1973 oil crisis, a quadrupling of oil prices by OPEC.
1980 recession Jan 1980 July 1980 Federal Reserve, raised interest rates dramatically.
1981–1982 recession July 1981 Nov 1982 The Iranian Revolution & Tight monetary policy in the US.
Early 1990s recession July 1990 Mar 1991 Federal Reserve raising interest & 1990 oil price shock.
Early 2000s dot-com Mar 2001 Nov 2001 Collapse of speculative dot-com bubble September 11th attacks.
Great Recession Dec 2007 une 2009 The subprime mortgage crisis, collapse of US housing bubble.

   Normal Recession 
0.8605072 0.1394928 

From Jan 1973 until the present date, the US Economy has been on recession 14% of the months. Now lets look with more detail on how this, have affected our response variables

Looking at the Price vs time behavior it seems interesint to notice that almost all early recession are associated with an spike on oil prices. Similar peaks can be found as well on the active rigs curve in time.

The Price and recession boxplot doesnt show a dramatic difference, however the rig count seems to be directly affected by periods of economic recession. This difference in the behavior of both variables might be explained by the fact that oil price is an international variable that not only depends on the US economy, and in contrast active rigs in US in directly affected by how the country economy is performing.Altought graphics uncovered interesting differences, this will only be confirmed on posterior sections of this work, by using statistical inference methods.

Statistical Inference

Our Previous analysis have lead us to various questions, related to the nature of the relationships between our response variables explanatory ones. The initial EDA has uncovered or to be more precise it has awaken our curiosity regarding this questions, but the answers can only be found using more advanced techniques. One this section we intend to answer this questions.

All the code of the statistical tests performed on this section, can be found on the following Link

Does oil price varies with Months?

To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05

term df sumsq meansq statistic p.value
OIL_MODIFIED$Month 1 101.9351 101.9351 0.1321276 0.7163761
Residuals 549 423548.1536 771.4903 NA NA

As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.76-76%

Does Active_Rigs varies with Months?

To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05

term df sumsq meansq statistic p.value
OIL_MODIFIED$Month 1 602666.4 602666.4 1.098467 0.2950631
Residuals 550 301753681.5 548643.1 NA NA

As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.29-29%

Does oil price varies with Seasons?

To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05

term df sumsq meansq statistic p.value
OIL_MODIFIED$Season 3 633.4052 211.1351 0.2730173 0.8448725
Residuals 547 423016.6836 773.3395 NA NA

As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.84-84%

Does Active Rigs varies with Seasons?

To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05

term df sumsq meansq statistic p.value
OIL_MODIFIED$Season 3 1138054 379351.5 0.690146 0.558328
Residuals 548 301218293 549668.4 NA NA

As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.55-55%

Does Oil Price varies with the US Presidents political Party

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$President_Party
t = 3.7616, df = 405.02, p-value = 0.0001937
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4.42473 14.11211
sample estimates:
  mean in group Democrat mean in group Republican 
                38.51321                 29.24479 

As we can see there difference is indeed a significative difference since we obtained T test p-value of 0.0001937 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the test we can claim that Democrat US presidents have experieced on average oil prices between 4.4 and 14.1 $ higher than republican presidents.

Does Active Rigs varies with the US Presidents political Party?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$President_Party
t = -2.1034, df = 541.3, p-value = 0.03589
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -253.424469   -8.662711
sample estimates:
  mean in group Democrat mean in group Republican 
                1360.950                 1491.994 

As we can see there difference is indeed a significative difference since we obtained T test p-value of 0.03589-3.6 % at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the test we can claim that Democrat US presidents on average have experienced a number of active rigs between 253 and 9 rigs lower than republican presidents.

Does Oil price varies with the majority ruling party of US Senate?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Senate
t = 2.6457, df = 393.49, p-value = 0.008479
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.55085 10.52260
sample estimates:
  mean in group Democrat mean in group Republican 
                36.16326                 30.12654 

As we can see there difference is indeed a significative difference since we obtained T test p-value of 0.008479-0.8 % at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that Democrat dominant Senate has on average have experienced Oil prices between 1.5 and 10.5 $ higher than a dominant republican senate.

Does Active Rigs varies with the majority ruling party of US Senate?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Senate
t = 1.9558, df = 448.03, p-value = 0.05111
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -0.6066393 251.1350484
sample estimates:
  mean in group Democrat mean in group Republican 
                1494.927                 1369.663 

As we can see there is no a significative difference since we obtained an T test p-value of 0.05111-5.1%

Does Oil price varies with the majority ruling party of US House?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$House
t = -8.5814, df = 447.84, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.24962 -15.21222
sample estimates:
  mean in group Democrat mean in group Republican 
                24.72343                 44.45435 

As we can see there difference is indeed a significative difference since we obtained T test p-value of 2.2e-16 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that Democrat dominant House has on average experienced Oil prices between 24.24 and 15.21 $ lower than a dominant republican house.

Does Active Rigs varies with the majority ruling party of US House?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$House
t = 9.3118, df = 493.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 407.6818 625.7310
sample estimates:
  mean in group Democrat mean in group Republican 
                1659.673                 1142.967 

As we can see there difference is indeed a significative difference since we obtained T test p-value of 2.2e-16 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that Democrat dominant House has on average experienced a Rig count between 408 and 626 rig count higher than a dominant republican house.

Does Oil prices varies if the US government is completely dominated by one sigle party at all levels?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Gov_ctrl_type
t = -0.24384, df = 504.06, p-value = 0.8075
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.963467  3.867454
sample estimates:
mean in group shared  mean in group Total 
            33.09189             33.63990 

As we can see there is no a significative difference since we obtained an T test p-value of 0.8075-80.75%

Does Active rigs varies if the US government is completely dominated by one sigle party at all levels?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Gov_ctrl_type
t = 0.49325, df = 477.12, p-value = 0.6221
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -90.53302 151.21913
sample estimates:
mean in group shared  mean in group Total 
            1445.572             1415.229 

As we can see there is no a significative difference since we obtained an T test p-value of 0.6221-62.21%

On Scenarios of complete control by one party, does oil prices varies in relationship to the party?

To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05

term df sumsq meansq statistic p.value
OIL_MODIFIED$Gov_ctrl_party 2 10521.5 5260.7522 6.978196 0.0010171
Residuals 548 413128.6 753.8843 NA NA

As we can see there difference is indeed a significative difference since we obtained an ANOVA test p-value of 0.0010171 at a 95% confident interval, which is below our selected alpha level of 5%. With such a difference we will proceed with the TukeyHSD to find out which categories present a significative difference. It is importatnt to highlight that tukeyHSD already presents the corrected or adjusted p-values based on the Bonferroni correction.

term comparison estimate conf.low conf.high adj.p.value
OIL_MODIFIED$Gov_ctrl_party Republican-Democrat 14.817787 5.4797130 24.1558616 0.0006204
OIL_MODIFIED$Gov_ctrl_party shared-Democrat 6.822097 -0.5898721 14.2340666 0.0785599
OIL_MODIFIED$Gov_ctrl_party shared-Republican -7.995690 -15.4383933 -0.5529868 0.0317663

Based on the results we can claim that a completely controled republican goverment has experienced oil prices between 5.5 and 24.15 $ higher than a completely controled democrat government, It is also interesting to mention that a completely controled republican goverment shows higher oil prices in comparisson with a shared government.

On Scenarios of complete control by one party, does Active Rigs varies in relationship to the party?

To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05

term df sumsq meansq statistic p.value
OIL_MODIFIED$Gov_ctrl_party 2 14370308 7185153.9 13.69736 1.6e-06
Residuals 549 287986040 524564.7 NA NA

As we can see there difference is indeed a significative difference since we obtained an ANOVA test p-value of 1.6e-06 at a 95% confident interval, which is below our selected alpha level of 5%. With such a difference we will proceed with the TukeyHSD to find out which categories present a significative difference. It is importatnt to highlight that tukeyHSD already presents the corrected or adjusted p-values based on the Bonferroni correction.

term comparison estimate conf.low conf.high adj.p.value
OIL_MODIFIED$Gov_ctrl_party Republican-Democrat -544.9583 -790.6342 -299.28247 0.0000008
OIL_MODIFIED$Gov_ctrl_party shared-Democrat -242.1361 -437.6505 -46.62175 0.0104769
OIL_MODIFIED$Gov_ctrl_party shared-Republican 302.8222 107.3079 498.33658 0.0008702

Based on the results we can claim that a completely controled republican goverment has experienced lower levels of rig activity in comparisson with a fully democrat government or shared government. Completely Democrat governments has experienced the highest levels of active rigs.

Does Oil prices varies if the US economy is on recession?

To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05


    Welch Two Sample t-test

data:  OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Eco_Rec
t = -0.045776, df = 95.274, p-value = 0.9636
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.707692  7.360243
sample estimates:
   mean in group Normal mean in group Recession 
               33.25757                33.43130 

As we can see there is no a significative difference since we obtained an T test p-value of 0.9636-96.4%

Does Active Rigs varies if the US economy is on recession?


    Welch Two Sample t-test

data:  OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Eco_Rec
t = -5.565, df = 85.947, p-value = 2.925e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -905.1819 -428.6917
sample estimates:
   mean in group Normal mean in group Recession 
               1341.985                2008.922 

As we can see there difference is indeed a significative difference since we obtained T test p-value of 2.925e-07 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that US economy on recession has on average experienced a Rig count between 428 and 905 rig count higher than a US economy on normal conditions.

Does oil price behavior is independent of price behavior on previous month?

This might sound like an unrelated question, but it is on the contrary fundametal for this work, what is the point of attempting to predict via regression or forecasting methods, the behavior of a variable if it responds to randomness with a certain probability? and if this is the case it is critical to now it in advance instead of attempting to develop models that are fundamentally wrong from the conception.

It is very commong to hear from time to time that trading markets like forex or stock markets, behave randomly, that the ups and downs of prices follows patterns as such the throwing of a coin; being oil a commodity it is sometimes accused of the same issue.

This sounds like a very complex question, and it is, but a chi-square GOF test can be used to study the problem. We will label each observaion as Up or Down (D) depending on whether the Oil price market was up or down that month.


DOWN   UP 
 235  316 

     DOWN        UP 
0.4264973 0.5735027 

For our dataset, composed of 552 monthly observations of Oil prices, starting on january 1973, the proportion of UP’s in the market is of 57%. The next step in our analysis will be to generate an histogram of the distance between UP’s, how many months of DOWN we have before experiencing again a market turnup. We will cal this variable streak

price_streak Freq
0 126
1 37
2 26
3 15
4 15
5 6
6 3
7 4
8 1
9 1
10 1
19 1
price_streak Freq
0 0.5338983
1 0.1567797
2 0.1101695
3 0.0635593
4 0.0635593
5 0.0254237
6 0.0127119
7 0.0169492
8 0.0042373
9 0.0042373
10 0.0042373
19 0.0042373

Since the chi-square GOF test, requires to have at least 5 observations on each category, we will group up values of streak above 6 days.

price_streak Freq cummulative
0 0.5338983 0.5338983
1 0.1567797 0.6906780
2 0.1101695 0.8008475
3 0.0635593 0.8644068
4 0.0635593 0.9279661
5 0.0254237 0.9533898
6 0.0466102 1.0000000

With the streak distribution, the next step is to calculate the distribution of a geometric distribution with a success probability of 57%, on which values above 6 are grouped together.

Geometric_streak Freq cummulative
0 0.5796 0.5796
1 0.2420 0.8216
2 0.1043 0.9259
3 0.0411 0.9670
4 0.0205 0.9875
5 0.0071 0.9946
6 0.0054 1.0000

Having the geometric distribution proportions and the original table of oil price streak frequencies we can perform a chi-square goodness of fit test, the selected alpha value for this test will be 5% or 0.05.


    Chi-squared test for given probabilities

data:  price_observed
X-squared = 102.79, df = 6, p-value < 2.2e-16

As we can see the obtained p-value of the test is 2.2e-16, almost 0 which means that it is almost impossible to find the oil price streak distribution if we assume the null hypothesis of a geometric distribution with succes probability of 57%.

Since this is a complicate question we will confirm our test, by using simulation instead of the predefined statistical functions of R. We will replicate with replacement 10000 (10 thousands) observations of our dataset, and calculate the chi-square statistic assuming a null hypothesis on which our distribution is equal to the geometric distrbution with a probability of success of 57%, and then we will calculate the p-value of obtaining our observed chi-square statistic.

replicate stat
1 8.371387
2 13.970183
3 5.948372
4 2.312726
5 4.998194
6 8.594592
[1] 10000     2

With this new data frame of simulations we will calculate the simulated p-value which is just the proportion of simulated stats, more extreme or equal to our observed chi-square stat.

# A tibble: 1 x 1
  `mean(stat >= oil_observed_chi)`
                             <dbl>
1                                0

The blue curve of this graph corresponds to a chi-square distribution with 6 degrees of freedom, we can see how it almost perfecly overlap the black curve which is our simulated distribution, this give us confirmation that our simulation was correct.

As we can see both graphically and with the calculation this proportion or simulated p-value is virtually zero, which means that our oil price streak distribution doesnt follow a geometric distribution and which means as well that, there is no independence between the behavior of oil price one month after one other, making this behavior not random. This confirm our suspicious that oil prices are influenced by time and many other variables.

Why monthly oil prices and not daily? how volatiles are daily oil prices?

This is an interesting question and perhaps somehow complicated to answer, the idea of this study is to capture macroeconomic or industry related trends that on the somehow long-term cause oil prices to raise or to drop, oil being traded as a commodity on daily basis might be more volatile and more susceptible to speculation.

To test this, we are going to evaluate the effect of, US president Donald Trump, oil & energy related tweets, on the daily Brent crude price. We are going to use only tweet from 2016 up to the first quarter of 2019.

Compendium of trump tweets were downloaded in json format from the following link.

The final database and transformation code can be found on the following link.

The Trump tweets archive that is used on this workd is composed by 11363 tweets, here is an example of the first lines of the archive:

V1 date Month Day Year oil_related_word
the person that hillary clinton least wants to run against is, by far, me. it will be the largest voter turnout ever - she will be swamped! 2016-01-01 Jan 01 2016 FALSE
@codyraymille: i have never been interested in politics but because of you i want to get my political science degree. #trump2016” great! 2016-01-01 Jan 01 2016 FALSE
i will be going to mississippi tomorrow night - hear the crowds are going to be massive! look forward to it. 2016-01-01 Jan 01 2016 FALSE
@casuperrunner: @georgehenryw huckabee is a good man…but he needs to get behind @realdonaldtrump” i agree! 2016-01-01 Jan 01 2016 FALSE
@memeoryhead: i’m one of your biggest fans mr trump and i can’t wait for you to make america great ifagain! never forget you have support!” 2016-01-01 Jan 01 2016 FALSE
@jodil792: we are standing with you! spreading the word…trump for president 2016!! i wake & sleep praying for you & the usa!+israel2” 2016-01-01 Jan 01 2016 FALSE

There are 1186 days between the first day of the database and the last one, using this means that president trump is tweeting on average 9.6 tweet per day since january 2016, until april 2019, which makes him very active on this social media.

for the sake of this work, we will focus only on tweets on which he used the following words, since we consider them possibly related to the oil & gas market:

after filtering the complete database we found 421 positive tweets for oil related words. Already having this information in place, the first question that comes to my mind is weather or not there is a difference in brent prices, when trump tweets about oil and when he doesnt? to answer this question we will use a t.test with an alpha value of 5%


    Welch Two Sample t-test

data:  trump_oil2$value by trump_oil2$oil_keyword
t = -7.3975, df = 344.41, p-value = 1.068e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.974548 -5.204563
sample estimates:
mean in group FALSE  mean in group TRUE 
           55.10396            62.19351 

Interestingly there is indeed a significative difference between brent crude prices when trump tweets about oil, in comparisson with days on which he does not. the obtained p-value is 1.068e-12 which is lower that our alpha level making the test significative. Brent crude prices tend to be higher on days on which trumo tweets about oil, between 5.2 and 8.79 dollars higher.

Now obtaining this result only means that there is indeed a correlation, however it doesnt imply causation, it is important not to be confused and think that oil related tweet of trump cause higher oil prices. In fact we can conclude that is higher oil prices, what motivates Donald trump to tweet about oil prices.

To test this hypothesis we will look and the distribution of oil tweet of trump with oil prices ranges

Price_range Tweets
(25.9,46] 16
(46,66] 93
(66,86.1] 79

As we can see, the majority of oil tweets of donald trump, are condensed on high oil price ranges, but the real question is, is this distribution sufficiently different from expected values, to consider it significative? to test this we will use a chi-square independence test, using an alpha value of 0.05-5%.


    Pearson's Chi-squared test

data:  table(cut(trump_oil2$value, 3), trump_oil2$oil_keyword)
X-squared = 43.082, df = 2, p-value = 4.414e-10

The results obtained by the chi-square test, suggest that there is no independence between oil tweets and oil price ranges. which means that trump oil tweets are triggered by differences in oil prices, and this relationship is significative since the obtained p-value of 4.414e-10 is below the 5% alpha value.

If Trump oil tweets and oil prices were independent, this will be the expected distribution:

Price_range Chisq_Expected_Tweets
(25.9,46] 35.06258
(46,66] 105.64908
(66,86.1] 47.28834

This difference can be better observed by comparing the expected and observed distributions

So it is more clear, that there a more pronounced concentration of oil tweets into the 66-87 dollars brent price binge. And we can conclude that what mostly triggers Trump oil tweets is the raise in oil prices.

It might also be interesing to study how trump oil tweets have changed over time.

Year Tweet_by_year
2016 22
2017 79
2018 126
2019 38

Looking with more detail at the table, it is easy to notice a trend, on which the amount of oil related tweets, have increased over time, just on the first quarter of 2019, Trump has already tweeted 38 oil tweets, if we expect him to continue with this trend, we should expect 152 oil related tweets by the end of the year.

       1 
179.6667 

If we use a linear model on data of 2016,2017 and 2018 and predict 2019, tweets, we should expect at leats 179 tweets by the end of the year. So it is clear that Trump has been getting more and more active on oil tweets over time. Which move us to our next lines of questions.

Using logical thinking, as rational beings,we would only repeat and even increase a behavior if we expect it to produce the desired result, if not this might be considered to be insane or at least ilogical. The question is, does oil trump tweets, are able to decrease oil prices, at least on the shor term?

To be able to answer this question we will look at brent crude prices, the day of an oil tweet, the day after, two days after and three days after, and compare the price between these days and its behavior. We will also use linear regression inference to determine the magnitude of these effects.

After the variable creation we will perform an ANOVA test, with an alpha level of 0.05-5% to test if there is a significant level of difference on brent oil price values and the different tweet days.

term df sumsq meansq statistic p.value
trump_oil2$tweet_pattern 8 24926.24 3115.7805 23.63862 0
Residuals 806 106237.98 131.8089 NA NA

The obtained p-value of 0 of the F-statistic, confirms that there is indeed a significative difference, since it is below our selected alpha level. Looking at the mean values we can observe that there is a descendent pattern on which on average the brent oil market has reacted on a downslide pattern after donal trump oil tweets

tweet_pattern Mean_Price
Zero 62.19351
One 60.85161
Three 60.36951
Two 60.17584
Four 58.45350
Five 56.01485
Six 55.74000
Seven 54.86548
No_Tweet 48.71600

The day after an oil tweet, values are aproximately equal to the prices of the day of the tweet, altought, two days after values tend to decrease so as third days values and the forth day and so up until even a week after.

Finally we would like to confirm this trend, and to do so we will a t.test between the brent prices on Trump’s oil tweet day and the prices four days after an oil tweet, we will perform a one sided test, on which the alternative hypothesis will be that the true difference in mean is less than 0. To perform this test, we will use a significance alpha level of 5%.


    Welch Two Sample t-test

data:  trump_oil_tweet_effect$value by trump_oil_tweet_effect$tweet_pattern
t = -3.2562, df = 39.559, p-value = 0.00116
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -3.537552
sample estimates:
mean in group Seven  mean in group Zero 
           54.86548            62.19351 

Based on the results obtained on the test, we can claim that there is indeed a significative difference, at a 95% confidence level, on brent price values between the tweet day and four days after a trump tweet. We can expect oil brent prices on trump’s twee day being at least 0.38 dollars higher in comparisson with four days after the tweet was made.

With such an interesting result it is important to confirm the results, due to this we will perform a similar test, but using simulation, we will simulate 10000 differences in the observed data group of brent prices on the day of tweet and four days after tweet and we will calculate a simulated p-value.

The simulated data, seems to fit very well a normal distribution, the next step is to calculate the simulated p-value, which is equivalent of calculating the proportion of values of difference more extremes or equal to the observed difference.

# A tibble: 1 x 1
   p_value
     <dbl>
1 0.000600

Equivalent to what was obtained on the formal t.test, the simulated p-value is below the 5% alpha level, which allow us to conclude that there is indeed a significative difference on brent crude prices, on the day of a tweet and four days after.

To conclude, this is the main explanation of why using daily oil price data, might not be advisible, since it might be volatile enought to be influenced by speculations created by social media interactions of powerfull entities such as the US president, and perhaps many other factors that were not evaluated during this analysis. Altought daily~weekly prices might be sensible enought to be influenced by these kind of effects, monthly prices are more related to macro economics and industry based trends and therefore are more usable to forecast on a long term basis.

Linear Regression Inference & Feature Confirmation

On the previous section of this work, we answer already plenty of the questions that the early EDA uncovered, now with the answers of this questions we can begin with the task of creating a model with the objective of predict the behavior of our response variables, but before generating definitve models it is better to first study which variables affect the most our target response variables. On this section we will explore with more depth the linear relationship of our response and explanatory variables via various methods such as:

-Spearman Correlation matrix -Full linear model -StepAIC, akaike information criterion link -BIC Bayesian information criterion link -Boruta selection criteria link

All the code related to the models built on this section can be found on the following link

Correlation matrix Oil price

We will first make a correlation matrix of Spearman correlation coefficients to determine variables that have at least a weak (0.3,-0.3) correlation with Oil prices

x
Oil_Price 1.0000000
GDP 0.6840554
Year 0.6832099
US_Population 0.6832099
life_exp 0.6822414
Oil_Stock 0.5769223
Energy_Consumption 0.4434374
Drilled_Wells 0.3802835
Net_imports_Energy 0.3655332
Senate 0.3486705
Active_Conflicts 0.2770792
Active_Rigs 0.2539703
House 0.2100427
Gov_ctrl_type 0.1996047
President_Party 0.1613554
Eco_Rec 0.1045710
CDD 0.0671465
Month 0.0393345
Season -0.0295112
HDD -0.0533247
Gov_ctrl_party -0.1241196
Oil_Production -0.5322219
Oil_Reserves -0.5824720

As we can see oil is correlated with various variables, but some of this variables are highly correlated with each others as well, one of such cases is: GDP, YEAR, US Population and life expectancy, allmost all these variables have been growing at almost the same rate with time and therefore they are perfectly correlated, this already indicate us that we should consider only using one of them for the model selection.

Correlation matrix Active Rigs

We will first make a correlation matrix of Spearman correlation coefficients to determine variables that have at least a weak (0.3,-0.3) correlation with Active Rigs.

x
Active_Rigs 1.0000000
Drilled_Wells 0.9281927
Oil_Reserves 0.4381892
Oil_Production 0.3700018
Eco_Rec 0.2988132
Oil_Price 0.2539703
President_Party 0.2051462
Gov_ctrl_type 0.1149544
Month 0.0761974
HDD 0.0523462
Season -0.0083891
Senate -0.0539184
CDD -0.0775931
Gov_ctrl_party -0.1415784
Net_imports_Energy -0.2739901
Energy_Consumption -0.3396680
GDP -0.3614154
Year -0.3666279
US_Population -0.3666279
life_exp -0.3678292
House -0.4006386
Oil_Stock -0.4074189
Active_Conflicts -0.6697390

With active rigs we see a similar trend, we have highly correlated variables in our dataset, and we would have to drop some of them to avoid multicolinearity.

Full Linear model Y = Oil Price

We will first begin by making a linear model of oil price versus all of the available exploratory variables.


Call:
lm(formula = Oil_Price ~ ., data = OIL_MODIFIED_MODEL)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.275  -2.575   0.076   2.771  39.407 

Coefficients: (2 not defined because of singularities)
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -2.551e+04  5.085e+03  -5.016 7.71e-07 ***
Month                      2.759e-01  1.250e-01   2.207 0.027838 *  
Year                       1.390e+01  2.722e+00   5.107 4.93e-07 ***
Active_Rigs                3.909e-03  1.555e-03   2.513 0.012323 *  
US_Population             -7.068e+00  9.087e-01  -7.778 5.47e-14 ***
President_PartyRepublican -4.026e+00  1.094e+00  -3.678 0.000264 ***
SenateRepublican           2.555e+00  2.199e+00   1.162 0.245921    
HouseRepublican           -6.122e+00  2.710e+00  -2.259 0.024353 *  
life_exp                  -5.437e+00  2.236e+00  -2.432 0.015420 *  
GDP                        2.526e+01  2.025e+00  12.470  < 2e-16 ***
Oil_Production             2.847e-03  1.418e-03   2.008 0.045261 *  
Oil_Stock                 -9.132e-02  1.435e-02  -6.363 5.04e-10 ***
HDD                        2.705e-03  4.770e-03   0.567 0.570874    
CDD                        1.958e-02  1.168e-02   1.676 0.094369 .  
Energy_Consumption         2.127e-01  1.854e+00   0.115 0.908724    
Net_imports_Energy        -1.508e+01  2.566e+00  -5.875 8.46e-09 ***
Oil_Reserves              -2.038e-03  4.586e-04  -4.444 1.13e-05 ***
Active_Conflicts           4.234e-02  2.940e-02   1.440 0.150501    
Drilled_Wells              2.178e-03  8.049e-04   2.707 0.007067 ** 
Eco_RecRecession          -4.553e-01  1.039e+00  -0.438 0.661531    
SeasonSpring               1.360e+00  1.199e+00   1.134 0.257560    
SeasonSummer              -1.021e+00  1.760e+00  -0.580 0.562125    
SeasonWinter              -2.240e+00  1.615e+00  -1.387 0.166124    
Gov_ctrl_typeTotal        -1.438e+00  1.026e+00  -1.401 0.161852    
Gov_ctrl_partyRepublican          NA         NA      NA       NA    
Gov_ctrl_partyshared              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.597 on 432 degrees of freedom
  (96 observations deleted due to missingness)
Multiple R-squared:  0.9068,    Adjusted R-squared:  0.9018 
F-statistic: 182.7 on 23 and 432 DF,  p-value: < 2.2e-16
Analysis of Variance Table

Response: Oil_Price
                    Df Sum Sq Mean Sq   F value    Pr(>F)    
Month                1    309     309    7.1060  0.007971 ** 
Year                 1  91169   91169 2094.8581 < 2.2e-16 ***
Active_Rigs          1  50324   50324 1156.3228 < 2.2e-16 ***
US_Population        1    819     819   18.8116 1.798e-05 ***
President_Party      1   4745    4745  109.0386 < 2.2e-16 ***
Senate               1  15114   15114  347.2866 < 2.2e-16 ***
House                1   6137    6137  141.0032 < 2.2e-16 ***
life_exp             1    176     176    4.0327  0.045249 *  
GDP                  1   7642    7642  175.5931 < 2.2e-16 ***
Oil_Production       1     50      50    1.1425  0.285715    
Oil_Stock            1    419     419    9.6215  0.002049 ** 
HDD                  1    638     638   14.6708  0.000147 ***
CDD                  1      0       0    0.0104  0.918874    
Energy_Consumption   1    151     151    3.4625  0.063454 .  
Net_imports_Energy   1   2809    2809   64.5378 9.106e-15 ***
Oil_Reserves         1   1665    1665   38.2601 1.437e-09 ***
Active_Conflicts     1     53      53    1.2128  0.271390    
Drilled_Wells        1    266     266    6.1185  0.013761 *  
Eco_Rec              1      2       2    0.0538  0.816692    
Season               3    319     106    2.4455  0.063397 .  
Gov_ctrl_type        1     85      85    1.9635  0.161852    
Residuals          432  18801      44                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A linear model of all variables included, give us an adjusted coefficient of determination of 0.9018, and an RMSE value of 6.597 dollars. The ANOVA test of the coefficients tell us that the variables which are non significant, at an alpha level of 5%, for this model are:

To evaluate with more detail the generated model we should look at the residuals to evaluate their normality and constant variability criteria

Residuals plot, show us that they are almost centered and scatered around 0, except for high oil prices levels, they also shown certain levels of “heteroscedasticity”, meaning that the residuals get larger as the prediction moves from small to large. This is something we need to keep in mind for further evaluation of variables transformation. The qqplot show us a similar story than the residuals plot.

Full Linear model Y = Active Rigs

We will first begin by making a linear model of Active rigs versus all of the available exploratory variables.


Call:
lm(formula = Active_Rigs ~ ., data = OIL_MODIFIED_MODEL)

Residuals:
    Min      1Q  Median      3Q     Max 
-579.89  -94.98    5.04   84.42 1157.95 

Coefficients: (2 not defined because of singularities)
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -3.177e+05  1.599e+05  -1.987 0.047572 *  
Month                      4.381e+00  3.855e+00   1.136 0.256415    
Year                       1.653e+02  8.571e+01   1.928 0.054481 .  
US_Population             -4.804e+01  2.971e+01  -1.617 0.106593    
President_PartyRepublican -1.998e+02  3.275e+01  -6.100 2.36e-09 ***
SenateRepublican          -1.155e+01  6.765e+01  -0.171 0.864463    
HouseRepublican            3.164e+00  8.371e+01   0.038 0.969867    
life_exp                   4.052e+01  6.910e+01   0.586 0.557910    
GDP                       -5.869e+01  7.249e+01  -0.810 0.418605    
Oil_Production            -1.978e-01  4.270e-02  -4.632 4.81e-06 ***
Oil_Stock                 -1.675e+00  4.538e-01  -3.692 0.000251 ***
Oil_Price                  3.687e+00  1.467e+00   2.513 0.012323 *  
HDD                        1.562e-01  1.464e-01   1.068 0.286327    
CDD                       -9.085e-02  3.598e-01  -0.252 0.800781    
Energy_Consumption        -2.245e+01  5.694e+01  -0.394 0.693599    
Net_imports_Energy        -6.835e+01  8.183e+01  -0.835 0.404064    
Oil_Reserves               5.680e-02  1.414e-02   4.017 6.95e-05 ***
Active_Conflicts           9.646e-01  9.038e-01   1.067 0.286436    
Drilled_Wells              4.099e-01  1.524e-02  26.892  < 2e-16 ***
Eco_RecRecession           1.725e+02  3.082e+01   5.597 3.89e-08 ***
SeasonSpring               6.016e+01  3.678e+01   1.636 0.102612    
SeasonSummer               1.551e+01  5.406e+01   0.287 0.774318    
SeasonWinter               5.255e+01  4.964e+01   1.059 0.290349    
Gov_ctrl_typeTotal         1.494e+01  3.157e+01   0.473 0.636218    
Gov_ctrl_partyRepublican          NA         NA      NA       NA    
Gov_ctrl_partyshared              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 202.6 on 432 degrees of freedom
  (96 observations deleted due to missingness)
Multiple R-squared:  0.9352,    Adjusted R-squared:  0.9318 
F-statistic: 271.2 on 23 and 432 DF,  p-value: < 2.2e-16
Analysis of Variance Table

Response: Active_Rigs
                    Df   Sum Sq  Mean Sq   F value    Pr(>F)    
Month                1   738085   738085   17.9807 2.731e-05 ***
Year                 1 45785358 45785358 1115.3882 < 2.2e-16 ***
US_Population        1 13504188 13504188  328.9788 < 2.2e-16 ***
President_Party      1  4453281  4453281  108.4875 < 2.2e-16 ***
Senate               1  5932278  5932278  144.5177 < 2.2e-16 ***
House                1 81501703 81501703 1985.4827 < 2.2e-16 ***
life_exp             1 34160309 34160309  832.1875 < 2.2e-16 ***
GDP                  1  8000118  8000118  194.8928 < 2.2e-16 ***
Oil_Production       1   469909   469909   11.4476 0.0007812 ***
Oil_Stock            1  4691129  4691129  114.2817 < 2.2e-16 ***
Oil_Price            1 11911533 11911533  290.1797 < 2.2e-16 ***
HDD                  1  1046345  1046345   25.4903 6.567e-07 ***
CDD                  1    89693    89693    2.1850 0.1400869    
Energy_Consumption   1   333804   333804    8.1319 0.0045579 ** 
Net_imports_Energy   1     3174     3174    0.0773 0.7810762    
Oil_Reserves         1   480056   480056   11.6948 0.0006863 ***
Active_Conflicts     1   369798   369798    9.0087 0.0028427 ** 
Drilled_Wells        1 41125657 41125657 1001.8721 < 2.2e-16 ***
Eco_Rec              1  1328248  1328248   32.3578 2.367e-08 ***
Season               3   122802    40934    0.9972 0.3939847    
Gov_ctrl_type        1     9197     9197    0.2240 0.6362178    
Residuals          432 17733086    41049                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A linear model of all variables included, give us an adjusted coefficient of determination of 0.9318, and an RMSE value of 203 Rigs. The ANOVA test of the coefficients tell us that the variables which are non significant, at an alpha level of 5%, for this model are:

To evaluate with more detail the generated model we should look at the residuals to evaluate their normality and constant variability criteria

Residuals plot, show us that they are almost centered and scatered around 0, except for high levels of active rigs, they also shown certain levels of “heteroscedasticity”, meaning that the residuals get larger as the prediction moves from small to large. This is something we need to keep in mind for further evaluation of variables transformation. The qqplot show us a similar story than the residuals plot.

Multicolinearity Evaluation

Before going any further we should evaluate multicolinearity on our model, the correlation plots already gave us a hint of which variables were highly correlated. We will first evaluate if within our datasets we have aliased variables, which means if weather or not we have perfect linear combinations of variables.

Model :
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    HDD + CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves + 
    Active_Conflicts + Drilled_Wells + Eco_Rec + Season + Gov_ctrl_type + 
    Gov_ctrl_party

Complete :
                         (Intercept) Month Year Active_Rigs US_Population
Gov_ctrl_partyRepublican -1/2           0     0    0           0         
Gov_ctrl_partyshared        1           0     0    0           0         
                         President_PartyRepublican SenateRepublican
Gov_ctrl_partyRepublican  1/2                         0            
Gov_ctrl_partyshared        0                         0            
                         HouseRepublican life_exp GDP  Oil_Production
Gov_ctrl_partyRepublican  1/2               0        0    0          
Gov_ctrl_partyshared        0               0        0    0          
                         Oil_Stock HDD  CDD  Energy_Consumption
Gov_ctrl_partyRepublican    0         0    0    0              
Gov_ctrl_partyshared        0         0    0    0              
                         Net_imports_Energy Oil_Reserves Active_Conflicts
Gov_ctrl_partyRepublican    0                  0            0            
Gov_ctrl_partyshared        0                  0            0            
                         Drilled_Wells Eco_RecRecession SeasonSpring
Gov_ctrl_partyRepublican    0             0                0        
Gov_ctrl_partyshared        0             0                0        
                         SeasonSummer SeasonWinter Gov_ctrl_typeTotal
Gov_ctrl_partyRepublican    0            0          1/2              
Gov_ctrl_partyshared        0            0           -1              
Model :
Active_Rigs ~ Month + Year + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    Oil_Price + HDD + CDD + Energy_Consumption + Net_imports_Energy + 
    Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec + 
    Season + Gov_ctrl_type + Gov_ctrl_party

Complete :
                         (Intercept) Month Year US_Population
Gov_ctrl_partyRepublican -1/2           0     0    0         
Gov_ctrl_partyshared        1           0     0    0         
                         President_PartyRepublican SenateRepublican
Gov_ctrl_partyRepublican  1/2                         0            
Gov_ctrl_partyshared        0                         0            
                         HouseRepublican life_exp GDP  Oil_Production
Gov_ctrl_partyRepublican  1/2               0        0    0          
Gov_ctrl_partyshared        0               0        0    0          
                         Oil_Stock Oil_Price HDD  CDD  Energy_Consumption
Gov_ctrl_partyRepublican    0         0         0    0    0              
Gov_ctrl_partyshared        0         0         0    0    0              
                         Net_imports_Energy Oil_Reserves Active_Conflicts
Gov_ctrl_partyRepublican    0                  0            0            
Gov_ctrl_partyshared        0                  0            0            
                         Drilled_Wells Eco_RecRecession SeasonSpring
Gov_ctrl_partyRepublican    0             0                0        
Gov_ctrl_partyshared        0             0                0        
                         SeasonSummer SeasonWinter Gov_ctrl_typeTotal
Gov_ctrl_partyRepublican    0            0          1/2              
Gov_ctrl_partyshared        0            0           -1              

as we can see Gov_ctrl_type and Gov_ctrl_party are aliased with other political variables so we will drop them from the dataset, and then generate a VIF matrix to look for other colinear explanatory variables

GVIF Df GVIF^(1/(2*Df))
Month 1.890459 1 1.374939
Year 10877.031124 1 104.293006
Active_Rigs 3.968626 1 1.992141
US_Population 7889.673094 1 88.823832
President_Party 2.441082 1 1.562396
Senate 7.488026 1 2.736426
House 11.497575 1 3.390808
life_exp 204.129970 1 14.287406
GDP 532.624103 1 23.078650
Oil_Production 22.474923 1 4.740772
Oil_Stock 118.034523 1 10.864369
HDD 23.352858 1 4.832479
CDD 16.766321 1 4.094670
Energy_Consumption 25.308852 1 5.030790
Net_imports_Energy 13.723124 1 3.704474
Oil_Reserves 26.968245 1 5.193096
Active_Conflicts 7.728751 1 2.780063
Eco_Rec 1.574211 1 1.254676
Season 32.850729 3 1.789606
GVIF Df GVIF^(1/(2*Df))
Month 1.885898 1 1.373280
Year 11206.371199 1 105.860149
US_Population 7887.114065 1 88.809426
President_Party 2.466371 1 1.570468
Senate 7.292218 1 2.700411
House 11.227107 1 3.350687
life_exp 248.770202 1 15.772451
GDP 734.158067 1 27.095351
Oil_Production 36.535996 1 6.044501
Oil_Stock 117.483915 1 10.839000
Oil_Price 11.390239 1 3.374943
HDD 23.815865 1 4.880150
CDD 16.783883 1 4.096814
Energy_Consumption 27.020968 1 5.198170
Net_imports_Energy 26.840064 1 5.180740
Oil_Reserves 27.275467 1 5.222592
Active_Conflicts 9.332611 1 3.054932
Eco_Rec 1.484848 1 1.218543
Season 34.183125 3 1.801504
predictions_oil 41.542859 1 6.445375

Variables that are highly correlated that high a VIF value bigger than 5 are:

Before deciding to remove them from the models, we will evaluate the more advance feature selecion algorithms

StepAIC akaike information criterion Oil Price model

We will run the AIC algorithm backwards and see which variables the algorithm decides to drop

Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    HDD + CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves + 
    Active_Conflicts + Drilled_Wells + Eco_Rec + Season

Final Model:
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves + 
    Active_Conflicts + Drilled_Wells + Season

       Step Df Deviance Resid. Df Resid. Dev      AIC
1                             433   18886.24 1744.005
2     - HDD  1 1.773351       434   18888.01 1742.048
3 - Eco_Rec  1 5.320678       435   18893.33 1740.177

The AIC algorithm decides to drop the following variables:

StepAIC akaike information criterion Active Rigs model

We will run the AIC algorithm backwards and see which variables the algorithm decides to drop

Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
Active_Rigs ~ Month + Year + US_Population + President_Party + 
    Senate + House + life_exp + GDP + Oil_Production + Oil_Stock + 
    Oil_Price + HDD + CDD + Energy_Consumption + Net_imports_Energy + 
    Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec + 
    Season

Final Model:
Active_Rigs ~ Year + US_Population + President_Party + Oil_Production + 
    Oil_Stock + Oil_Price + HDD + CDD + Net_imports_Energy + 
    Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec

                  Step Df    Deviance Resid. Df Resid. Dev      AIC
1                                           433   17742283 4865.450
2             - Season  3 122801.6609       436   17865085 4862.595
3              - House  1    262.9605       437   17865348 4860.602
4              - Month  1    638.7724       438   17865986 4858.618
5 - Energy_Consumption  1   2877.4922       439   17868864 4856.691
6             - Senate  1   9271.8029       440   17878136 4854.928
7           - life_exp  1  32495.7768       441   17910631 4853.756
8                - GDP  1  38134.0117       442   17948765 4852.726

The AIC algorithm decides to drop the following variables:

BIC Bayesian information criterion Oil Price model

We will run the BIC algorithm and see which variables has the highest likehood for being included

BIC is considerably more strict on the penalization of the number of terms in comparisson with AIC, the variables that are not included in the most likely model are:

BIC Bayesian information criterion Active Rigs model

We will run the BIC algorithm and see which variables has the highest likehood for being included

BIC is considerably more strict on the penalization of the number of terms in comparisson with AIC, the variables that are not included in the most likely model are:

Boruta Feature Selection Oil Price

We will run the Boruta feature selection algorithm and see which variables are included

Boruta performed 99 iterations in 21.89947 secs.
 17 attributes confirmed important: Active_Conflicts, Active_Rigs,
CDD, Eco_Rec, Energy_Consumption and 12 more;
 1 attributes confirmed unimportant: Season;
 1 tentative attributes left: HDD;

features medianImpact decision
GDP 13.583942 Confirmed
US_Population 13.207721 Confirmed
Year 13.116314 Confirmed
Active_Rigs 12.602486 Confirmed
Oil_Stock 12.566571 Confirmed
life_exp 10.985583 Confirmed
Senate 10.465949 Confirmed
Active_Conflicts 8.949173 Confirmed
Net_imports_Energy 8.597149 Confirmed
Oil_Reserves 8.522154 Confirmed
Oil_Production 8.505690 Confirmed
Month 6.678523 Tentative
Energy_Consumption 6.076158 Confirmed
CDD 5.334423 Confirmed
President_Party 5.082532 Confirmed
House 4.744941 Confirmed
Eco_Rec 4.201994 Confirmed
HDD 2.801193 Confirmed
Season 1.878089 Rejected
             Month               Year        Active_Rigs 
         Confirmed          Confirmed          Confirmed 
     US_Population    President_Party             Senate 
         Confirmed          Confirmed          Confirmed 
             House           life_exp                GDP 
         Confirmed          Confirmed          Confirmed 
    Oil_Production          Oil_Stock                HDD 
         Confirmed          Confirmed          Tentative 
               CDD Energy_Consumption Net_imports_Energy 
         Confirmed          Confirmed          Confirmed 
      Oil_Reserves   Active_Conflicts            Eco_Rec 
         Confirmed          Confirmed          Confirmed 
            Season 
          Rejected 
Levels: Tentative Confirmed Rejected

Only one variable was dropped by borutas final decission:

interestigly the variables which are given the highest importance are economic variables like GDP and population

Boruta Feature Selection Active Rigs

Boruta performed 99 iterations in 21.22427 secs.
 16 attributes confirmed important: Active_Conflicts, Eco_Rec,
Energy_Consumption, GDP, House and 11 more;
 No attributes deemed unimportant.
 3 tentative attributes left: CDD, HDD, Season;

features medianImpact decision
Oil_Price 21.231639 Confirmed
Oil_Stock 14.001810 Confirmed
GDP 13.771841 Confirmed
Year 13.462641 Confirmed
US_Population 13.449564 Confirmed
life_exp 12.746623 Confirmed
Net_imports_Energy 11.438301 Confirmed
Active_Conflicts 11.436812 Confirmed
Oil_Reserves 11.024660 Confirmed
Oil_Production 10.984145 Confirmed
Eco_Rec 8.375199 Confirmed
President_Party 8.094818 Tentative
Senate 8.037422 Tentative
Energy_Consumption 6.580326 Confirmed
Month 5.184240 Confirmed
House 4.492152 Confirmed
CDD 1.929082 Confirmed
HDD 1.739728 Confirmed
Season 1.684771 Tentative
             Month               Year      US_Population 
         Confirmed          Confirmed          Confirmed 
   President_Party             Senate              House 
         Confirmed          Confirmed          Confirmed 
          life_exp                GDP     Oil_Production 
         Confirmed          Confirmed          Confirmed 
         Oil_Stock          Oil_Price                HDD 
         Confirmed          Confirmed          Tentative 
               CDD Energy_Consumption Net_imports_Energy 
         Tentative          Confirmed          Confirmed 
      Oil_Reserves   Active_Conflicts            Eco_Rec 
         Confirmed          Confirmed          Confirmed 
            Season 
         Tentative 
Levels: Tentative Confirmed Rejected

Threee variables were classified as tentative by borutas final decision: