The majority of the data used for this analysis was, downloaded directly from the EIA webpage, the link can be found down below.
Description of variables, units of measurements, data category and the source of data, can be found on the tables below:
Codebook Varible Description
| Variable Name | Description |
|---|---|
| Date | Date of the observation |
| Month | Month of the observation |
| Year | Year of the observation |
| Active_Rigs | Number of active drilling rigs on and offshore recorded to be working on that date |
| US_Population | USA population estimates for the date of the observation |
| President_Party | Identified political party for the ruling US president on the date observation |
| Senate | Political party that holds the majority of US Senate on the date of the observation |
| House | Political party that holds the majority of US congress on the date of the observation |
| life_exp | US average life expectancy at the date of the observation |
| GDP | Gross Domestic Product of the US economy on the date of the observation |
| Oil_Production | US Total oil production on the date of the observation |
| Oil_Stock | US inventory of oil in stock on the date of the observation |
| Oil_Price | US average domestic first purchase price, not corrected by inflation |
| HDD | Monthly US HDD are the number of degrees that the daily average temperature falls below 65 °F |
| CDD | Monthly US CDD are the number of degrees that the daily average temperature rises above 65 °F |
| Energy_Consumption | The amount of energy consumed by the US market on the date of the observation |
| Net_imports_Energy | The Net amount of energy imported by the US market on the date of the observation |
| Oil_Reserves | US proven oil reserves on the date of the observation |
| Active_Conflicts | Number of world active military conflicts on the date of the observation, includes civil wars |
| Drilled_Wells | Number of drilled wells on US on the date of the observation |
Codebook Varible Units of measurement
| Variable Name | Units of measurement |
|---|---|
| Date | Date format Year-Month-Day |
| Month | Integer from 1-12 |
| Year | Integer 4 digits |
| Active_Rigs | Integer each unit represents a drilling rig |
| US_Population | Numerical in Millions of people |
| President_Party | Categorical variable Either Democrat or Republican |
| Senate | Categorical variable Either Democrat or Republican |
| House | Categorical variable Either Democrat or Republican |
| life_exp | Numerical life expectancy in years |
| GDP | Numerical GDP in Trillions of dollars 10^12 USD |
| Oil_Production | Numerical Oil production in Thousand barrels per day |
| Oil_Stock | Numerical US inventory of oil in million barrels |
| Oil_Price | Numerical US dollars |
| HDD | Numerical Monthly US HDD measured un number of °F |
| CDD | Numerical Monthly US CDD measured un number of °F |
| Energy_Consumption | Numerical Energy consumed in Quadrillion btu’s 10^15 |
| Net_imports_Energy | Numerical Net energy imported in Quadrillion btu’s 10^15 |
| Oil_Reserves | NUmerical Oil reserves in million barrels |
| Active_Conflicts | Numerical each unit represents an active military conflict |
| Drilled_Wells | Numerical each unit represents a drilled well |
Codebook Varible Source of data
| Variable Name | Units of measurement |
|---|---|
| Date | Included with other measurements |
| Month | Included with other measurements |
| Year | Included with other measurements |
| Active_Rigs | Data Link |
| US_Population | Data Link |
| President_Party | Data Link |
| Senate | Data Link |
| House | Data Link |
| life_exp | Data Link |
| GDP | Data Link |
| Oil_Production | Data Link |
| Oil_Stock | Data Link |
| Oil_Price | Data Link |
| HDD | Data Link |
| CDD | Data Link |
| Energy_Consumption | Data Link |
| Net_imports_Energy | Data Link |
| Oil_Reserves | Data Link |
| Active_Conflicts | Data Link |
| Active_Conflicts 2 | Data Link |
| Drilled_Wells | Data Link |
The main purpose of this work is to create models that are able to predict the rig activity (Activity_Rig) in US and the behavior of oil price (Oil_Price), based on various explanatory variables. In order to achive this objective it is necessary to first develop insight on variables that might influence the behavior of both response variables.
The first step in our analysis will be to upload the dataset, and explore its structure, the descriptive statistics and if weather or not observations in the dataset are complete.
'data.frame': 552 obs. of 20 variables:
$ Date : Date, format: "1973-01-01" "1973-02-01" ...
$ Month : Factor w/ 12 levels "01","02","03",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Year : num 1973 1973 1973 1973 1973 ...
$ Active_Rigs : num 1219 1126 1049 993 1046 ...
$ US_Population : num 212 212 212 212 212 ...
$ President_Party : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 2 2 2 2 2 2 ...
$ Senate : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ House : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ life_exp : num 71.4 71.4 71.4 71.4 71.4 ...
$ GDP : num 1.43 1.43 1.43 1.43 1.43 ...
$ Oil_Production : num 10856 11140 11006 11042 11002 ...
$ Oil_Stock : num 237 235 244 248 257 248 243 248 241 246 ...
$ Oil_Price : num 3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 3.89 ...
$ HDD : int 965 843 552 402 215 26 8 11 74 246 ...
$ CDD : int 5 3 19 13 62 214 297 280 142 50 ...
$ Energy_Consumption: num 7.23 6.59 6.52 5.94 6.07 ...
$ Net_imports_Energy: num 1.047 1.047 1.17 0.891 0.966 ...
$ Oil_Reserves : num 35300 35300 35300 35300 35300 35300 35300 35300 35300 35300 ...
$ Active_Conflicts : int 52 52 52 52 52 52 52 52 52 52 ...
$ Drilled_Wells : int 2166 1896 2151 1883 2298 2279 2367 2690 2333 2538 ...
Date Month Year Active_Rigs
Min. :1973-01-01 01 : 46 Min. :1973 Min. : 407.0
1st Qu.:1984-06-23 02 : 46 1st Qu.:1984 1st Qu.: 856.8
Median :1995-12-16 03 : 46 Median :1996 Median :1184.5
Mean :1995-12-16 04 : 46 Mean :1996 Mean :1435.0
3rd Qu.:2007-06-08 05 : 46 3rd Qu.:2007 3rd Qu.:1851.5
Max. :2018-12-01 06 : 46 Max. :2018 Max. :4521.0
(Other):276
US_Population President_Party Senate House
Min. :211.9 Democrat :240 Democrat :288 Democrat :312
1st Qu.:235.8 Republican:312 Republican:264 Republican:240
Median :267.8
Mean :268.6
3rd Qu.:301.2
Max. :327.0
life_exp GDP Oil_Production Oil_Stock
Min. :71.36 Min. : 1.429 Min. : 5488 Min. : 233.0
1st Qu.:74.56 1st Qu.: 4.041 1st Qu.: 7769 1st Qu.: 763.8
Median :75.82 Median : 7.882 Median : 9160 Median : 898.0
Mean :75.98 Mean : 9.027 Mean : 9323 Mean : 823.4
3rd Qu.:77.99 3rd Qu.:14.419 3rd Qu.:10252 3rd Qu.:1003.2
Max. :78.84 Max. :20.660 Max. :16289 Max. :1230.0
Oil_Price HDD CDD Energy_Consumption
Min. : 3.89 Min. : 3.0 Min. : 2.0 Min. :5.438
1st Qu.: 14.01 1st Qu.: 45.0 1st Qu.: 12.0 1st Qu.:6.592
Median : 23.51 Median : 310.0 Median : 42.0 Median :7.552
Mean : 33.28 Mean : 378.9 Mean :103.6 Mean :7.382
3rd Qu.: 45.19 3rd Qu.: 660.0 3rd Qu.:183.8 3rd Qu.:8.048
Max. :128.08 Max. :1266.0 Max. :404.0 Max. :9.655
NA's :1 NA's :2 NA's :2 NA's :2
Net_imports_Energy Oil_Reserves Active_Conflicts Drilled_Wells
Min. :0.06212 Min. :19121 Min. : 52.0 Min. :1268
1st Qu.:0.94186 1st Qu.:22351 1st Qu.: 83.5 1st Qu.:2288
Median :1.30195 Median :26544 Median :132.5 Median :2914
Mean :1.39765 Mean :26705 Mean :121.8 Mean :3482
3rd Qu.:1.85828 3rd Qu.:30529 3rd Qu.:151.2 3rd Qu.:4208
Max. :2.74169 Max. :39160 Max. :205.0 Max. :8556
NA's :2 NA's :12 NA's :24 NA's :96
Date Month Year
0 0 0
Active_Rigs US_Population President_Party
0 0 0
Senate House life_exp
0 0 0
GDP Oil_Production Oil_Stock
0 0 0
Oil_Price HDD CDD
1 2 2
Energy_Consumption Net_imports_Energy Oil_Reserves
2 2 12
Active_Conflicts Drilled_Wells
24 96
The variable with most NA’s in the data set is Drilled wells, the dataset of the EIA did not provide this information since the end of 2010. for the exploratory analysis the complete dataset will be used altought for modeling and predicting perhaps this variable should be dropped or only use a complete.case dataset.
After the analysis of the structure of the dataset, we will first start with a series of pairs plot for different categories of variables, that for the sake of simplicity we have divided into the following groups: Environmental, Socioeconomic, Political, Oil related, Energy related
Environmental
Altogh pairwise graphs can result complicated, they migh highlight interesting patterns of our data, for instance:
Social & economic variables
Allmast all the variables share an strong linear relationship with year, with the exception of Activity rigs, therefore we see a similar trend between them.
Oil Market
Energy Market
Political
This plot is extremely noisy due to the fact that some of these variables are categoricals, however it still provides some interesting information
After looking at pairs plot with more detail, it seems like various of the observations are clustered together, therefore it seems logical as well to try to find possible groups and associations that might bring up to light relevant variables or data insights. In order to to this we will use unsupervised learning methods, such as K-means and hierarchical clustering to aid us.
In order to facilitate the analysis we will start with group data by years, instead of using the monthly division. To group by years we will use median as the grouping functions seems most of the variables are rightly skewed. This yearly data will be used initially to create a heatmap.
Heatmap
Interestingly the heatmap shows that indeed recent years are part of a cluster by themselves and are characterized by:
The next logical step in our analisis is perform a cluster analysis and analyze the characteristics of the different clusters of observations
Kmeans Cluster Analysis
Tha main issue with kmeans clustering is the question, related to the number of adequate clusters, for certain datasets (sports, market segments, gender, etc.), it is faily obvious but for other type of problems such as this one the number is unknown. To determine the number of clusters for this analysis we will use the elbow and silhouette criteria
Elbow Scree Criteria
| k | tot_withinss |
|---|---|
| 1 | 308.00000 |
| 2 | 187.29814 |
| 3 | 124.30791 |
| 4 | 108.16505 |
| 5 | 85.18374 |
| 6 | 47.37352 |
| 7 | 43.09698 |
| 8 | 29.82660 |
| 9 | 26.98528 |
| 10 | 22.92257 |
Looking at the plot, the elbow seems to be located between k=2 and k=4, the silhouette will help us further with this determination
Silhouette Plot
| k | sil_width |
|---|---|
| 2 | 0.3790503 |
| 3 | 0.3665515 |
| 4 | 0.4044439 |
| 5 | 0.4094645 |
| 6 | 0.4303218 |
| 7 | 0.4655798 |
| 8 | 0.4437710 |
| 9 | 0.4207146 |
| 10 | 0.4203871 |
The Silhouette plot, show us a peak on the silhouette width at k=7. For further exploration the number of centroids to be used will be 7
| cluster | n |
|---|---|
| 1 | 5 |
| 2 | 8 |
| 3 | 9 |
| 4 | 3 |
| 5 | 4 |
| 6 | 12 |
| 7 | 4 |
The barplot of sil width shows, that all points have a positive width, which means that clusters were correctly created. Lets see with more detail the characteristics of each one of the seven clusters created.
| cluster | Year | Oil_Price | Oil_Reserves | Oil_Stock | Oil_Production | Active_Rigs | Energy_Consumption | Net_imports_Energy | CDD | HDD | President_Party | House | Senate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1983.0 | 26.0600 | 28416.0 | 681.00 | 10258.581 | 2418.50 | 6.145066 | 0.6833525 | 40.50 | 325.50 | Republican | Democrat | Republican |
| 2 | 1976.5 | 8.2575 | 31567.5 | 308.50 | 10143.014 | 1829.75 | 6.213764 | 1.1358337 | 31.25 | 354.50 | Mixed | Democrat | Democrat |
| 3 | 1990.0 | 15.3750 | 26254.0 | 909.50 | 9045.215 | 836.50 | 6.909735 | 1.1485480 | 38.00 | 320.00 | Republican | Democrat | Democrat |
| 4 | 2016.0 | 42.8600 | 32773.0 | 1171.00 | 12807.305 | 874.50 | 8.046779 | 0.9050690 | 65.00 | 254.00 | Democrat | Republican | Republican |
| 5 | 2012.5 | 94.5275 | 31950.0 | 1044.00 | 9354.557 | 1875.00 | 8.033162 | 1.2129182 | 52.00 | 281.00 | Democrat | Republican | Democrat |
| 6 | 2000.5 | 23.2375 | 21954.0 | 894.75 | 7709.979 | 941.25 | 7.928625 | 2.1101968 | 43.25 | 301.25 | Mixed | Republican | Republican |
| 7 | 2008.5 | 68.9325 | 20999.5 | 1019.00 | 7125.754 | 1659.25 | 8.115634 | 2.0194070 | 43.00 | 291.75 | Mixed | Democrat | Democrat |
Results from the cluster analysis show us clearly that more recent years are categorized by more oil independence of US, with an increase in production, a decrease on net energy imports, higher oil reserves, and moderated prices and rig activity
Hierarchical Clustering Dendogram
Hierarchical clusters show us a similar trend as well, on which more recent years, can be found as a cluster of its own, on the left side area of the dendogram
With the insight gained thanks to the Unsupervised learning EDA, we can explore further the behavior of our variables with relationship with clustering
If we look with more detail at the four plots, we can see an interesting relatioship, specially on the year 2007, highlighted by the red line, as we can see on the highest point for oil prices, oil production was at its lowest point, oil reserves as well, and rig count on low to mid levels. 2007 Seems to be a turning point in the history of US oil industry.
It is clear by the analysis that there is indeed some sort of relationship between these variables, however on some of them like Active_Rigs and Oil_Price the relationship, still seems confusing, it might be worth to look onto this relationship with more detail.
Interestingly the relationship is positive, however it is definitively not linear, it might be also interesting at the behavior of both variables with time, since oil price and rigs are on almos two order of magnitud apart, we should first normalize the variables to perform the adequate analysis and them plot their behavior on time
The time analysis whos what we suspected, indeed price and active rigs have a relationship, in fact whenever price has increased active rigs has followed the same trend, which is logical from an economic point of view, the higher the prices of goods to be sold, the more incentive for their production and for entering the oil and drilling business. Altouhgt there is a relationship between the two variables it seems that on early years, active rigs was a more volatile variable in comparisson with price, but its behavior has stabilized with time, which might suggest that there are other variables or dynamics affecting the behavior of the drilling business besides the oil price.
Drilling Dynamics I
The relationship between Active Rigs and drilled wells it is crearly and strongly linear, therefore we can assume that in time whenever active rigs increase drilled wells increase as well. On recent years, specially since the beginning of 2000’s there is a clear linear relationship between active rigs, production and reserves, which seems logical, since to produce more oil, we either need to incorpore more wells or make the current wells produce more, altought it is not as simple as it sounds due to various factors:
If we normalize our Production and Reserves variables, we can see more clearly the exploration and production dynamics.
Drilling Dynamics II
Now it is more clear, periods of high efficiency of reserves per new drilled well, and production will match periods of low drilling activity, for those periods lets called them sustain periods, the drilling activity will be meant to adjust production by the declination rate of the in service wells, whenever those values start to lay behind energy demand projections a new drilling cycle seems starts on which the industry is aggresively looking for new reservoirs to increase reserves and to include new development well into the production pipeline.
After making an initial exploration of the data in hand, it becomes clear that we might need some more information or transformations of some of the variables, since their current form perhaps they dont capture the complete picture of the problem in hand.
All code from the transformation can be found on the following link
Seasonality
Monthly variables as itself, doesnt seems to produce a clear difference neither in price or active rigs
The monthly variable might be compresed into a season variable as per the standard Seasons dates on the US
| Season | Dates |
|---|---|
| Spring | March 1 to May 31 |
| Summer | June 1 to August 31 |
| Autumn | September 1 to November 30 |
| Winter | December 1 to February 28 |
After addingg the season variable, the behavior seems to be unaffected altought this question will be confirmed on posterior sections of this work, by using statistical inference methods.
Politics
Energy Policy plays indeed a critical role on the behavior of our varibles, by the analysis of the following boxplots it seems that there is a difference on the ruling parties at different levels on the US government and the effect on Oil prices and Rig Activty.
Altought we have captured the dynamics of US politics using three variables President Party, Senate and House, neither of them by themselves describe completely, which political party is truly ruling the US, energy and oil policy, for istance a ruling president with an oposing house and senate, might act completely different as a president with support at all levels. This an instesting aspect that should be explored further
New varibles are going to be coded, on which we will differentiate between a total control scenario on which the same party is ruling all three powers and a shared power scenario on which at least one of the powers belongs to a different political party. Using this variable we will stablish on scenarios of total control which party is ruling the government, and we will evaluate if there is a difference on these scenarios
shared Total
0.6521739 0.3478261
From Jan 1973 until the present date, the US government has been on a total control by one party scenario 35% of the time.
Now lets look with more detail on how this, have affected our response variables
There seems to be a certain difference in oil price whenever the government is totally ruled by a single party in comparisson with the scenario on which power is shared, however in regards to activity rigs the difference doesnt seems to be that dramatic.
Democrat Republican shared
0.1739130 0.1739130 0.6521739
From Jan 1973 until the present date, the US government has been on a total control by one party scenario 35% of the time. On these periods of total control, each party has enjoyed equal amount of time (17% each) rulling on total control.
Now lets look with more detail on how this, have affected our response variables
When we facet the scenarios on total government control by including the ruling party, a clear difference emerges on both variables. Altought graphics uncovered differences, this will only be confirmed on posterior sections of this work, by using statistical inference methods.
Economics
As we saw on early pair graphis, GDP has grown on a linear positive fashion in regards to time, therefore graphics of GDP and Oil Price or Activity rigs doesnt, reveal any new behavior that was not previously highlighted by the relationships of these variables and time. however there might still be some further exploration to make in regards to this economic variable, an interesting aspect to highlight might be to differentiate the behavior of our variables of interest on periods of economic recession and normal economic behavior.
| Name | Begining | End | Short Description |
|---|---|---|---|
| 1973 oil crisis | Nov 1973 | Mar 1975 | The 1973 oil crisis, a quadrupling of oil prices by OPEC. |
| 1980 recession | Jan 1980 | July 1980 | Federal Reserve, raised interest rates dramatically. |
| 1981–1982 recession | July 1981 | Nov 1982 | The Iranian Revolution & Tight monetary policy in the US. |
| Early 1990s recession | July 1990 | Mar 1991 | Federal Reserve raising interest & 1990 oil price shock. |
| Early 2000s dot-com | Mar 2001 | Nov 2001 | Collapse of speculative dot-com bubble September 11th attacks. |
| Great Recession | Dec 2007 | une 2009 | The subprime mortgage crisis, collapse of US housing bubble. |
Normal Recession
0.8605072 0.1394928
From Jan 1973 until the present date, the US Economy has been on recession 14% of the months. Now lets look with more detail on how this, have affected our response variables
Looking at the Price vs time behavior it seems interesint to notice that almost all early recession are associated with an spike on oil prices. Similar peaks can be found as well on the active rigs curve in time.
The Price and recession boxplot doesnt show a dramatic difference, however the rig count seems to be directly affected by periods of economic recession. This difference in the behavior of both variables might be explained by the fact that oil price is an international variable that not only depends on the US economy, and in contrast active rigs in US in directly affected by how the country economy is performing.Altought graphics uncovered interesting differences, this will only be confirmed on posterior sections of this work, by using statistical inference methods.
Our Previous analysis have lead us to various questions, related to the nature of the relationships between our response variables explanatory ones. The initial EDA has uncovered or to be more precise it has awaken our curiosity regarding this questions, but the answers can only be found using more advanced techniques. One this section we intend to answer this questions.
All the code of the statistical tests performed on this section, can be found on the following Link
Does oil price varies with Months?
To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Month | 1 | 101.9351 | 101.9351 | 0.1321276 | 0.7163761 |
| Residuals | 549 | 423548.1536 | 771.4903 | NA | NA |
As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.76-76%
Does Active_Rigs varies with Months?
To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Month | 1 | 602666.4 | 602666.4 | 1.098467 | 0.2950631 |
| Residuals | 550 | 301753681.5 | 548643.1 | NA | NA |
As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.29-29%
Does oil price varies with Seasons?
To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Season | 3 | 633.4052 | 211.1351 | 0.2730173 | 0.8448725 |
| Residuals | 547 | 423016.6836 | 773.3395 | NA | NA |
As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.84-84%
Does Active Rigs varies with Seasons?
To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Season | 3 | 1138054 | 379351.5 | 0.690146 | 0.558328 |
| Residuals | 548 | 301218293 | 549668.4 | NA | NA |
As we can see there is no a significative difference since we obtained an ANOVA test p-value of 0.55-55%
Does Oil Price varies with the US Presidents political Party
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Oil_Price by OIL_MODIFIED$President_Party
t = 3.7616, df = 405.02, p-value = 0.0001937
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.42473 14.11211
sample estimates:
mean in group Democrat mean in group Republican
38.51321 29.24479
As we can see there difference is indeed a significative difference since we obtained T test p-value of 0.0001937 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the test we can claim that Democrat US presidents have experieced on average oil prices between 4.4 and 14.1 $ higher than republican presidents.
Does Active Rigs varies with the US Presidents political Party?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$President_Party
t = -2.1034, df = 541.3, p-value = 0.03589
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-253.424469 -8.662711
sample estimates:
mean in group Democrat mean in group Republican
1360.950 1491.994
As we can see there difference is indeed a significative difference since we obtained T test p-value of 0.03589-3.6 % at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the test we can claim that Democrat US presidents on average have experienced a number of active rigs between 253 and 9 rigs lower than republican presidents.
Does Oil price varies with the majority ruling party of US Senate?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Senate
t = 2.6457, df = 393.49, p-value = 0.008479
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.55085 10.52260
sample estimates:
mean in group Democrat mean in group Republican
36.16326 30.12654
As we can see there difference is indeed a significative difference since we obtained T test p-value of 0.008479-0.8 % at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that Democrat dominant Senate has on average have experienced Oil prices between 1.5 and 10.5 $ higher than a dominant republican senate.
Does Active Rigs varies with the majority ruling party of US Senate?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Senate
t = 1.9558, df = 448.03, p-value = 0.05111
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6066393 251.1350484
sample estimates:
mean in group Democrat mean in group Republican
1494.927 1369.663
As we can see there is no a significative difference since we obtained an T test p-value of 0.05111-5.1%
Does Oil price varies with the majority ruling party of US House?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Oil_Price by OIL_MODIFIED$House
t = -8.5814, df = 447.84, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-24.24962 -15.21222
sample estimates:
mean in group Democrat mean in group Republican
24.72343 44.45435
As we can see there difference is indeed a significative difference since we obtained T test p-value of 2.2e-16 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that Democrat dominant House has on average experienced Oil prices between 24.24 and 15.21 $ lower than a dominant republican house.
Does Active Rigs varies with the majority ruling party of US House?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$House
t = 9.3118, df = 493.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
407.6818 625.7310
sample estimates:
mean in group Democrat mean in group Republican
1659.673 1142.967
As we can see there difference is indeed a significative difference since we obtained T test p-value of 2.2e-16 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that Democrat dominant House has on average experienced a Rig count between 408 and 626 rig count higher than a dominant republican house.
Does Oil prices varies if the US government is completely dominated by one sigle party at all levels?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Gov_ctrl_type
t = -0.24384, df = 504.06, p-value = 0.8075
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.963467 3.867454
sample estimates:
mean in group shared mean in group Total
33.09189 33.63990
As we can see there is no a significative difference since we obtained an T test p-value of 0.8075-80.75%
Does Active rigs varies if the US government is completely dominated by one sigle party at all levels?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Gov_ctrl_type
t = 0.49325, df = 477.12, p-value = 0.6221
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-90.53302 151.21913
sample estimates:
mean in group shared mean in group Total
1445.572 1415.229
As we can see there is no a significative difference since we obtained an T test p-value of 0.6221-62.21%
On Scenarios of complete control by one party, does oil prices varies in relationship to the party?
To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Gov_ctrl_party | 2 | 10521.5 | 5260.7522 | 6.978196 | 0.0010171 |
| Residuals | 548 | 413128.6 | 753.8843 | NA | NA |
As we can see there difference is indeed a significative difference since we obtained an ANOVA test p-value of 0.0010171 at a 95% confident interval, which is below our selected alpha level of 5%. With such a difference we will proceed with the TukeyHSD to find out which categories present a significative difference. It is importatnt to highlight that tukeyHSD already presents the corrected or adjusted p-values based on the Bonferroni correction.
| term | comparison | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Gov_ctrl_party | Republican-Democrat | 14.817787 | 5.4797130 | 24.1558616 | 0.0006204 |
| OIL_MODIFIED$Gov_ctrl_party | shared-Democrat | 6.822097 | -0.5898721 | 14.2340666 | 0.0785599 |
| OIL_MODIFIED$Gov_ctrl_party | shared-Republican | -7.995690 | -15.4383933 | -0.5529868 | 0.0317663 |
Based on the results we can claim that a completely controled republican goverment has experienced oil prices between 5.5 and 24.15 $ higher than a completely controled democrat government, It is also interesting to mention that a completely controled republican goverment shows higher oil prices in comparisson with a shared government.
On Scenarios of complete control by one party, does Active Rigs varies in relationship to the party?
To answer this question we will first use and ANOVA test and if it provides significant differences then we will perform a tukeyHSD test to find out which categories are truly different.We will use and alpha of 5% or 0.05
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Gov_ctrl_party | 2 | 14370308 | 7185153.9 | 13.69736 | 1.6e-06 |
| Residuals | 549 | 287986040 | 524564.7 | NA | NA |
As we can see there difference is indeed a significative difference since we obtained an ANOVA test p-value of 1.6e-06 at a 95% confident interval, which is below our selected alpha level of 5%. With such a difference we will proceed with the TukeyHSD to find out which categories present a significative difference. It is importatnt to highlight that tukeyHSD already presents the corrected or adjusted p-values based on the Bonferroni correction.
| term | comparison | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|
| OIL_MODIFIED$Gov_ctrl_party | Republican-Democrat | -544.9583 | -790.6342 | -299.28247 | 0.0000008 |
| OIL_MODIFIED$Gov_ctrl_party | shared-Democrat | -242.1361 | -437.6505 | -46.62175 | 0.0104769 |
| OIL_MODIFIED$Gov_ctrl_party | shared-Republican | 302.8222 | 107.3079 | 498.33658 | 0.0008702 |
Based on the results we can claim that a completely controled republican goverment has experienced lower levels of rig activity in comparisson with a fully democrat government or shared government. Completely Democrat governments has experienced the highest levels of active rigs.
Does Oil prices varies if the US economy is on recession?
To answer this question we will use a T test and we will evaluate the obtained p-values and the confidence interval of the test parameter.We will use and alpha of 5% or 0.05
Welch Two Sample t-test
data: OIL_MODIFIED$Oil_Price by OIL_MODIFIED$Eco_Rec
t = -0.045776, df = 95.274, p-value = 0.9636
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.707692 7.360243
sample estimates:
mean in group Normal mean in group Recession
33.25757 33.43130
As we can see there is no a significative difference since we obtained an T test p-value of 0.9636-96.4%
Does Active Rigs varies if the US economy is on recession?
Welch Two Sample t-test
data: OIL_MODIFIED$Active_Rigs by OIL_MODIFIED$Eco_Rec
t = -5.565, df = 85.947, p-value = 2.925e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-905.1819 -428.6917
sample estimates:
mean in group Normal mean in group Recession
1341.985 2008.922
As we can see there difference is indeed a significative difference since we obtained T test p-value of 2.925e-07 at a 95% confident interval, which is below our selected alpha level of 5%. Based on the values of the T test we can claim that US economy on recession has on average experienced a Rig count between 428 and 905 rig count higher than a US economy on normal conditions.
Does oil price behavior is independent of price behavior on previous month?
This might sound like an unrelated question, but it is on the contrary fundametal for this work, what is the point of attempting to predict via regression or forecasting methods, the behavior of a variable if it responds to randomness with a certain probability? and if this is the case it is critical to now it in advance instead of attempting to develop models that are fundamentally wrong from the conception.
It is very commong to hear from time to time that trading markets like forex or stock markets, behave randomly, that the ups and downs of prices follows patterns as such the throwing of a coin; being oil a commodity it is sometimes accused of the same issue.
This sounds like a very complex question, and it is, but a chi-square GOF test can be used to study the problem. We will label each observaion as Up or Down (D) depending on whether the Oil price market was up or down that month.
DOWN UP
235 316
DOWN UP
0.4264973 0.5735027
For our dataset, composed of 552 monthly observations of Oil prices, starting on january 1973, the proportion of UP’s in the market is of 57%. The next step in our analysis will be to generate an histogram of the distance between UP’s, how many months of DOWN we have before experiencing again a market turnup. We will cal this variable streak
| price_streak | Freq |
|---|---|
| 0 | 126 |
| 1 | 37 |
| 2 | 26 |
| 3 | 15 |
| 4 | 15 |
| 5 | 6 |
| 6 | 3 |
| 7 | 4 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 19 | 1 |
| price_streak | Freq |
|---|---|
| 0 | 0.5338983 |
| 1 | 0.1567797 |
| 2 | 0.1101695 |
| 3 | 0.0635593 |
| 4 | 0.0635593 |
| 5 | 0.0254237 |
| 6 | 0.0127119 |
| 7 | 0.0169492 |
| 8 | 0.0042373 |
| 9 | 0.0042373 |
| 10 | 0.0042373 |
| 19 | 0.0042373 |
Since the chi-square GOF test, requires to have at least 5 observations on each category, we will group up values of streak above 6 days.
| price_streak | Freq | cummulative |
|---|---|---|
| 0 | 0.5338983 | 0.5338983 |
| 1 | 0.1567797 | 0.6906780 |
| 2 | 0.1101695 | 0.8008475 |
| 3 | 0.0635593 | 0.8644068 |
| 4 | 0.0635593 | 0.9279661 |
| 5 | 0.0254237 | 0.9533898 |
| 6 | 0.0466102 | 1.0000000 |
With the streak distribution, the next step is to calculate the distribution of a geometric distribution with a success probability of 57%, on which values above 6 are grouped together.
| Geometric_streak | Freq | cummulative |
|---|---|---|
| 0 | 0.5796 | 0.5796 |
| 1 | 0.2420 | 0.8216 |
| 2 | 0.1043 | 0.9259 |
| 3 | 0.0411 | 0.9670 |
| 4 | 0.0205 | 0.9875 |
| 5 | 0.0071 | 0.9946 |
| 6 | 0.0054 | 1.0000 |
Having the geometric distribution proportions and the original table of oil price streak frequencies we can perform a chi-square goodness of fit test, the selected alpha value for this test will be 5% or 0.05.
Chi-squared test for given probabilities
data: price_observed
X-squared = 102.79, df = 6, p-value < 2.2e-16
As we can see the obtained p-value of the test is 2.2e-16, almost 0 which means that it is almost impossible to find the oil price streak distribution if we assume the null hypothesis of a geometric distribution with succes probability of 57%.
Since this is a complicate question we will confirm our test, by using simulation instead of the predefined statistical functions of R. We will replicate with replacement 10000 (10 thousands) observations of our dataset, and calculate the chi-square statistic assuming a null hypothesis on which our distribution is equal to the geometric distrbution with a probability of success of 57%, and then we will calculate the p-value of obtaining our observed chi-square statistic.
| replicate | stat |
|---|---|
| 1 | 8.371387 |
| 2 | 13.970183 |
| 3 | 5.948372 |
| 4 | 2.312726 |
| 5 | 4.998194 |
| 6 | 8.594592 |
[1] 10000 2
With this new data frame of simulations we will calculate the simulated p-value which is just the proportion of simulated stats, more extreme or equal to our observed chi-square stat.
# A tibble: 1 x 1
`mean(stat >= oil_observed_chi)`
<dbl>
1 0
The blue curve of this graph corresponds to a chi-square distribution with 6 degrees of freedom, we can see how it almost perfecly overlap the black curve which is our simulated distribution, this give us confirmation that our simulation was correct.
As we can see both graphically and with the calculation this proportion or simulated p-value is virtually zero, which means that our oil price streak distribution doesnt follow a geometric distribution and which means as well that, there is no independence between the behavior of oil price one month after one other, making this behavior not random. This confirm our suspicious that oil prices are influenced by time and many other variables.
Why monthly oil prices and not daily? how volatiles are daily oil prices?
This is an interesting question and perhaps somehow complicated to answer, the idea of this study is to capture macroeconomic or industry related trends that on the somehow long-term cause oil prices to raise or to drop, oil being traded as a commodity on daily basis might be more volatile and more susceptible to speculation.
To test this, we are going to evaluate the effect of, US president Donald Trump, oil & energy related tweets, on the daily Brent crude price. We are going to use only tweet from 2016 up to the first quarter of 2019.
Compendium of trump tweets were downloaded in json format from the following link.
The final database and transformation code can be found on the following link.
The Trump tweets archive that is used on this workd is composed by 11363 tweets, here is an example of the first lines of the archive:
| V1 | date | Month | Day | Year | oil_related_word |
|---|---|---|---|---|---|
| the person that hillary clinton least wants to run against is, by far, me. it will be the largest voter turnout ever - she will be swamped! | 2016-01-01 | Jan | 01 | 2016 | FALSE |
| “@codyraymille: i have never been interested in politics but because of you i want to get my political science degree. #trump2016” great! | 2016-01-01 | Jan | 01 | 2016 | FALSE |
| i will be going to mississippi tomorrow night - hear the crowds are going to be massive! look forward to it. | 2016-01-01 | Jan | 01 | 2016 | FALSE |
| “@casuperrunner: @georgehenryw huckabee is a good man…but he needs to get behind @realdonaldtrump” i agree! | 2016-01-01 | Jan | 01 | 2016 | FALSE |
| “@memeoryhead: i’m one of your biggest fans mr trump and i can’t wait for you to make america great ifagain! never forget you have support!” | 2016-01-01 | Jan | 01 | 2016 | FALSE |
| “@jodil792: we are standing with you! spreading the word…trump for president 2016!! i wake & sleep praying for you & the usa!+israel2” | 2016-01-01 | Jan | 01 | 2016 | FALSE |
There are 1186 days between the first day of the database and the last one, using this means that president trump is tweeting on average 9.6 tweet per day since january 2016, until april 2019, which makes him very active on this social media.
for the sake of this work, we will focus only on tweets on which he used the following words, since we consider them possibly related to the oil & gas market:
after filtering the complete database we found 421 positive tweets for oil related words. Already having this information in place, the first question that comes to my mind is weather or not there is a difference in brent prices, when trump tweets about oil and when he doesnt? to answer this question we will use a t.test with an alpha value of 5%
Welch Two Sample t-test
data: trump_oil2$value by trump_oil2$oil_keyword
t = -7.3975, df = 344.41, p-value = 1.068e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.974548 -5.204563
sample estimates:
mean in group FALSE mean in group TRUE
55.10396 62.19351
Interestingly there is indeed a significative difference between brent crude prices when trump tweets about oil, in comparisson with days on which he does not. the obtained p-value is 1.068e-12 which is lower that our alpha level making the test significative. Brent crude prices tend to be higher on days on which trumo tweets about oil, between 5.2 and 8.79 dollars higher.
Now obtaining this result only means that there is indeed a correlation, however it doesnt imply causation, it is important not to be confused and think that oil related tweet of trump cause higher oil prices. In fact we can conclude that is higher oil prices, what motivates Donald trump to tweet about oil prices.
To test this hypothesis we will look and the distribution of oil tweet of trump with oil prices ranges
| Price_range | Tweets |
|---|---|
| (25.9,46] | 16 |
| (46,66] | 93 |
| (66,86.1] | 79 |
As we can see, the majority of oil tweets of donald trump, are condensed on high oil price ranges, but the real question is, is this distribution sufficiently different from expected values, to consider it significative? to test this we will use a chi-square independence test, using an alpha value of 0.05-5%.
Pearson's Chi-squared test
data: table(cut(trump_oil2$value, 3), trump_oil2$oil_keyword)
X-squared = 43.082, df = 2, p-value = 4.414e-10
The results obtained by the chi-square test, suggest that there is no independence between oil tweets and oil price ranges. which means that trump oil tweets are triggered by differences in oil prices, and this relationship is significative since the obtained p-value of 4.414e-10 is below the 5% alpha value.
If Trump oil tweets and oil prices were independent, this will be the expected distribution:
| Price_range | Chisq_Expected_Tweets |
|---|---|
| (25.9,46] | 35.06258 |
| (46,66] | 105.64908 |
| (66,86.1] | 47.28834 |
This difference can be better observed by comparing the expected and observed distributions
So it is more clear, that there a more pronounced concentration of oil tweets into the 66-87 dollars brent price binge. And we can conclude that what mostly triggers Trump oil tweets is the raise in oil prices.
It might also be interesing to study how trump oil tweets have changed over time.
| Year | Tweet_by_year |
|---|---|
| 2016 | 22 |
| 2017 | 79 |
| 2018 | 126 |
| 2019 | 38 |
Looking with more detail at the table, it is easy to notice a trend, on which the amount of oil related tweets, have increased over time, just on the first quarter of 2019, Trump has already tweeted 38 oil tweets, if we expect him to continue with this trend, we should expect 152 oil related tweets by the end of the year.
1
179.6667
If we use a linear model on data of 2016,2017 and 2018 and predict 2019, tweets, we should expect at leats 179 tweets by the end of the year. So it is clear that Trump has been getting more and more active on oil tweets over time. Which move us to our next lines of questions.
Using logical thinking, as rational beings,we would only repeat and even increase a behavior if we expect it to produce the desired result, if not this might be considered to be insane or at least ilogical. The question is, does oil trump tweets, are able to decrease oil prices, at least on the shor term?
To be able to answer this question we will look at brent crude prices, the day of an oil tweet, the day after, two days after and three days after, and compare the price between these days and its behavior. We will also use linear regression inference to determine the magnitude of these effects.
After the variable creation we will perform an ANOVA test, with an alpha level of 0.05-5% to test if there is a significant level of difference on brent oil price values and the different tweet days.
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| trump_oil2$tweet_pattern | 8 | 24926.24 | 3115.7805 | 23.63862 | 0 |
| Residuals | 806 | 106237.98 | 131.8089 | NA | NA |
The obtained p-value of 0 of the F-statistic, confirms that there is indeed a significative difference, since it is below our selected alpha level. Looking at the mean values we can observe that there is a descendent pattern on which on average the brent oil market has reacted on a downslide pattern after donal trump oil tweets
| tweet_pattern | Mean_Price |
|---|---|
| Zero | 62.19351 |
| One | 60.85161 |
| Three | 60.36951 |
| Two | 60.17584 |
| Four | 58.45350 |
| Five | 56.01485 |
| Six | 55.74000 |
| Seven | 54.86548 |
| No_Tweet | 48.71600 |
The day after an oil tweet, values are aproximately equal to the prices of the day of the tweet, altought, two days after values tend to decrease so as third days values and the forth day and so up until even a week after.
Finally we would like to confirm this trend, and to do so we will a t.test between the brent prices on Trump’s oil tweet day and the prices four days after an oil tweet, we will perform a one sided test, on which the alternative hypothesis will be that the true difference in mean is less than 0. To perform this test, we will use a significance alpha level of 5%.
Welch Two Sample t-test
data: trump_oil_tweet_effect$value by trump_oil_tweet_effect$tweet_pattern
t = -3.2562, df = 39.559, p-value = 0.00116
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -3.537552
sample estimates:
mean in group Seven mean in group Zero
54.86548 62.19351
Based on the results obtained on the test, we can claim that there is indeed a significative difference, at a 95% confidence level, on brent price values between the tweet day and four days after a trump tweet. We can expect oil brent prices on trump’s twee day being at least 0.38 dollars higher in comparisson with four days after the tweet was made.
With such an interesting result it is important to confirm the results, due to this we will perform a similar test, but using simulation, we will simulate 10000 differences in the observed data group of brent prices on the day of tweet and four days after tweet and we will calculate a simulated p-value.
The simulated data, seems to fit very well a normal distribution, the next step is to calculate the simulated p-value, which is equivalent of calculating the proportion of values of difference more extremes or equal to the observed difference.
# A tibble: 1 x 1
p_value
<dbl>
1 0.000600
Equivalent to what was obtained on the formal t.test, the simulated p-value is below the 5% alpha level, which allow us to conclude that there is indeed a significative difference on brent crude prices, on the day of a tweet and four days after.
To conclude, this is the main explanation of why using daily oil price data, might not be advisible, since it might be volatile enought to be influenced by speculations created by social media interactions of powerfull entities such as the US president, and perhaps many other factors that were not evaluated during this analysis. Altought daily~weekly prices might be sensible enought to be influenced by these kind of effects, monthly prices are more related to macro economics and industry based trends and therefore are more usable to forecast on a long term basis.
On the previous section of this work, we answer already plenty of the questions that the early EDA uncovered, now with the answers of this questions we can begin with the task of creating a model with the objective of predict the behavior of our response variables, but before generating definitve models it is better to first study which variables affect the most our target response variables. On this section we will explore with more depth the linear relationship of our response and explanatory variables via various methods such as:
-Spearman Correlation matrix -Full linear model -StepAIC, akaike information criterion link -BIC Bayesian information criterion link -Boruta selection criteria link
All the code related to the models built on this section can be found on the following link
Correlation matrix Oil price
We will first make a correlation matrix of Spearman correlation coefficients to determine variables that have at least a weak (0.3,-0.3) correlation with Oil prices
| x | |
|---|---|
| Oil_Price | 1.0000000 |
| GDP | 0.6840554 |
| Year | 0.6832099 |
| US_Population | 0.6832099 |
| life_exp | 0.6822414 |
| Oil_Stock | 0.5769223 |
| Energy_Consumption | 0.4434374 |
| Drilled_Wells | 0.3802835 |
| Net_imports_Energy | 0.3655332 |
| Senate | 0.3486705 |
| Active_Conflicts | 0.2770792 |
| Active_Rigs | 0.2539703 |
| House | 0.2100427 |
| Gov_ctrl_type | 0.1996047 |
| President_Party | 0.1613554 |
| Eco_Rec | 0.1045710 |
| CDD | 0.0671465 |
| Month | 0.0393345 |
| Season | -0.0295112 |
| HDD | -0.0533247 |
| Gov_ctrl_party | -0.1241196 |
| Oil_Production | -0.5322219 |
| Oil_Reserves | -0.5824720 |
As we can see oil is correlated with various variables, but some of this variables are highly correlated with each others as well, one of such cases is: GDP, YEAR, US Population and life expectancy, allmost all these variables have been growing at almost the same rate with time and therefore they are perfectly correlated, this already indicate us that we should consider only using one of them for the model selection.
Correlation matrix Active Rigs
We will first make a correlation matrix of Spearman correlation coefficients to determine variables that have at least a weak (0.3,-0.3) correlation with Active Rigs.
| x | |
|---|---|
| Active_Rigs | 1.0000000 |
| Drilled_Wells | 0.9281927 |
| Oil_Reserves | 0.4381892 |
| Oil_Production | 0.3700018 |
| Eco_Rec | 0.2988132 |
| Oil_Price | 0.2539703 |
| President_Party | 0.2051462 |
| Gov_ctrl_type | 0.1149544 |
| Month | 0.0761974 |
| HDD | 0.0523462 |
| Season | -0.0083891 |
| Senate | -0.0539184 |
| CDD | -0.0775931 |
| Gov_ctrl_party | -0.1415784 |
| Net_imports_Energy | -0.2739901 |
| Energy_Consumption | -0.3396680 |
| GDP | -0.3614154 |
| Year | -0.3666279 |
| US_Population | -0.3666279 |
| life_exp | -0.3678292 |
| House | -0.4006386 |
| Oil_Stock | -0.4074189 |
| Active_Conflicts | -0.6697390 |
With active rigs we see a similar trend, we have highly correlated variables in our dataset, and we would have to drop some of them to avoid multicolinearity.
Full Linear model Y = Oil Price
We will first begin by making a linear model of oil price versus all of the available exploratory variables.
Call:
lm(formula = Oil_Price ~ ., data = OIL_MODIFIED_MODEL)
Residuals:
Min 1Q Median 3Q Max
-44.275 -2.575 0.076 2.771 39.407
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.551e+04 5.085e+03 -5.016 7.71e-07 ***
Month 2.759e-01 1.250e-01 2.207 0.027838 *
Year 1.390e+01 2.722e+00 5.107 4.93e-07 ***
Active_Rigs 3.909e-03 1.555e-03 2.513 0.012323 *
US_Population -7.068e+00 9.087e-01 -7.778 5.47e-14 ***
President_PartyRepublican -4.026e+00 1.094e+00 -3.678 0.000264 ***
SenateRepublican 2.555e+00 2.199e+00 1.162 0.245921
HouseRepublican -6.122e+00 2.710e+00 -2.259 0.024353 *
life_exp -5.437e+00 2.236e+00 -2.432 0.015420 *
GDP 2.526e+01 2.025e+00 12.470 < 2e-16 ***
Oil_Production 2.847e-03 1.418e-03 2.008 0.045261 *
Oil_Stock -9.132e-02 1.435e-02 -6.363 5.04e-10 ***
HDD 2.705e-03 4.770e-03 0.567 0.570874
CDD 1.958e-02 1.168e-02 1.676 0.094369 .
Energy_Consumption 2.127e-01 1.854e+00 0.115 0.908724
Net_imports_Energy -1.508e+01 2.566e+00 -5.875 8.46e-09 ***
Oil_Reserves -2.038e-03 4.586e-04 -4.444 1.13e-05 ***
Active_Conflicts 4.234e-02 2.940e-02 1.440 0.150501
Drilled_Wells 2.178e-03 8.049e-04 2.707 0.007067 **
Eco_RecRecession -4.553e-01 1.039e+00 -0.438 0.661531
SeasonSpring 1.360e+00 1.199e+00 1.134 0.257560
SeasonSummer -1.021e+00 1.760e+00 -0.580 0.562125
SeasonWinter -2.240e+00 1.615e+00 -1.387 0.166124
Gov_ctrl_typeTotal -1.438e+00 1.026e+00 -1.401 0.161852
Gov_ctrl_partyRepublican NA NA NA NA
Gov_ctrl_partyshared NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.597 on 432 degrees of freedom
(96 observations deleted due to missingness)
Multiple R-squared: 0.9068, Adjusted R-squared: 0.9018
F-statistic: 182.7 on 23 and 432 DF, p-value: < 2.2e-16
Analysis of Variance Table
Response: Oil_Price
Df Sum Sq Mean Sq F value Pr(>F)
Month 1 309 309 7.1060 0.007971 **
Year 1 91169 91169 2094.8581 < 2.2e-16 ***
Active_Rigs 1 50324 50324 1156.3228 < 2.2e-16 ***
US_Population 1 819 819 18.8116 1.798e-05 ***
President_Party 1 4745 4745 109.0386 < 2.2e-16 ***
Senate 1 15114 15114 347.2866 < 2.2e-16 ***
House 1 6137 6137 141.0032 < 2.2e-16 ***
life_exp 1 176 176 4.0327 0.045249 *
GDP 1 7642 7642 175.5931 < 2.2e-16 ***
Oil_Production 1 50 50 1.1425 0.285715
Oil_Stock 1 419 419 9.6215 0.002049 **
HDD 1 638 638 14.6708 0.000147 ***
CDD 1 0 0 0.0104 0.918874
Energy_Consumption 1 151 151 3.4625 0.063454 .
Net_imports_Energy 1 2809 2809 64.5378 9.106e-15 ***
Oil_Reserves 1 1665 1665 38.2601 1.437e-09 ***
Active_Conflicts 1 53 53 1.2128 0.271390
Drilled_Wells 1 266 266 6.1185 0.013761 *
Eco_Rec 1 2 2 0.0538 0.816692
Season 3 319 106 2.4455 0.063397 .
Gov_ctrl_type 1 85 85 1.9635 0.161852
Residuals 432 18801 44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A linear model of all variables included, give us an adjusted coefficient of determination of 0.9018, and an RMSE value of 6.597 dollars. The ANOVA test of the coefficients tell us that the variables which are non significant, at an alpha level of 5%, for this model are:
To evaluate with more detail the generated model we should look at the residuals to evaluate their normality and constant variability criteria
Residuals plot, show us that they are almost centered and scatered around 0, except for high oil prices levels, they also shown certain levels of “heteroscedasticity”, meaning that the residuals get larger as the prediction moves from small to large. This is something we need to keep in mind for further evaluation of variables transformation. The qqplot show us a similar story than the residuals plot.
Full Linear model Y = Active Rigs
We will first begin by making a linear model of Active rigs versus all of the available exploratory variables.
Call:
lm(formula = Active_Rigs ~ ., data = OIL_MODIFIED_MODEL)
Residuals:
Min 1Q Median 3Q Max
-579.89 -94.98 5.04 84.42 1157.95
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.177e+05 1.599e+05 -1.987 0.047572 *
Month 4.381e+00 3.855e+00 1.136 0.256415
Year 1.653e+02 8.571e+01 1.928 0.054481 .
US_Population -4.804e+01 2.971e+01 -1.617 0.106593
President_PartyRepublican -1.998e+02 3.275e+01 -6.100 2.36e-09 ***
SenateRepublican -1.155e+01 6.765e+01 -0.171 0.864463
HouseRepublican 3.164e+00 8.371e+01 0.038 0.969867
life_exp 4.052e+01 6.910e+01 0.586 0.557910
GDP -5.869e+01 7.249e+01 -0.810 0.418605
Oil_Production -1.978e-01 4.270e-02 -4.632 4.81e-06 ***
Oil_Stock -1.675e+00 4.538e-01 -3.692 0.000251 ***
Oil_Price 3.687e+00 1.467e+00 2.513 0.012323 *
HDD 1.562e-01 1.464e-01 1.068 0.286327
CDD -9.085e-02 3.598e-01 -0.252 0.800781
Energy_Consumption -2.245e+01 5.694e+01 -0.394 0.693599
Net_imports_Energy -6.835e+01 8.183e+01 -0.835 0.404064
Oil_Reserves 5.680e-02 1.414e-02 4.017 6.95e-05 ***
Active_Conflicts 9.646e-01 9.038e-01 1.067 0.286436
Drilled_Wells 4.099e-01 1.524e-02 26.892 < 2e-16 ***
Eco_RecRecession 1.725e+02 3.082e+01 5.597 3.89e-08 ***
SeasonSpring 6.016e+01 3.678e+01 1.636 0.102612
SeasonSummer 1.551e+01 5.406e+01 0.287 0.774318
SeasonWinter 5.255e+01 4.964e+01 1.059 0.290349
Gov_ctrl_typeTotal 1.494e+01 3.157e+01 0.473 0.636218
Gov_ctrl_partyRepublican NA NA NA NA
Gov_ctrl_partyshared NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 202.6 on 432 degrees of freedom
(96 observations deleted due to missingness)
Multiple R-squared: 0.9352, Adjusted R-squared: 0.9318
F-statistic: 271.2 on 23 and 432 DF, p-value: < 2.2e-16
Analysis of Variance Table
Response: Active_Rigs
Df Sum Sq Mean Sq F value Pr(>F)
Month 1 738085 738085 17.9807 2.731e-05 ***
Year 1 45785358 45785358 1115.3882 < 2.2e-16 ***
US_Population 1 13504188 13504188 328.9788 < 2.2e-16 ***
President_Party 1 4453281 4453281 108.4875 < 2.2e-16 ***
Senate 1 5932278 5932278 144.5177 < 2.2e-16 ***
House 1 81501703 81501703 1985.4827 < 2.2e-16 ***
life_exp 1 34160309 34160309 832.1875 < 2.2e-16 ***
GDP 1 8000118 8000118 194.8928 < 2.2e-16 ***
Oil_Production 1 469909 469909 11.4476 0.0007812 ***
Oil_Stock 1 4691129 4691129 114.2817 < 2.2e-16 ***
Oil_Price 1 11911533 11911533 290.1797 < 2.2e-16 ***
HDD 1 1046345 1046345 25.4903 6.567e-07 ***
CDD 1 89693 89693 2.1850 0.1400869
Energy_Consumption 1 333804 333804 8.1319 0.0045579 **
Net_imports_Energy 1 3174 3174 0.0773 0.7810762
Oil_Reserves 1 480056 480056 11.6948 0.0006863 ***
Active_Conflicts 1 369798 369798 9.0087 0.0028427 **
Drilled_Wells 1 41125657 41125657 1001.8721 < 2.2e-16 ***
Eco_Rec 1 1328248 1328248 32.3578 2.367e-08 ***
Season 3 122802 40934 0.9972 0.3939847
Gov_ctrl_type 1 9197 9197 0.2240 0.6362178
Residuals 432 17733086 41049
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A linear model of all variables included, give us an adjusted coefficient of determination of 0.9318, and an RMSE value of 203 Rigs. The ANOVA test of the coefficients tell us that the variables which are non significant, at an alpha level of 5%, for this model are:
To evaluate with more detail the generated model we should look at the residuals to evaluate their normality and constant variability criteria
Residuals plot, show us that they are almost centered and scatered around 0, except for high levels of active rigs, they also shown certain levels of “heteroscedasticity”, meaning that the residuals get larger as the prediction moves from small to large. This is something we need to keep in mind for further evaluation of variables transformation. The qqplot show us a similar story than the residuals plot.
Multicolinearity Evaluation
Before going any further we should evaluate multicolinearity on our model, the correlation plots already gave us a hint of which variables were highly correlated. We will first evaluate if within our datasets we have aliased variables, which means if weather or not we have perfect linear combinations of variables.
Model :
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party +
Senate + House + life_exp + GDP + Oil_Production + Oil_Stock +
HDD + CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves +
Active_Conflicts + Drilled_Wells + Eco_Rec + Season + Gov_ctrl_type +
Gov_ctrl_party
Complete :
(Intercept) Month Year Active_Rigs US_Population
Gov_ctrl_partyRepublican -1/2 0 0 0 0
Gov_ctrl_partyshared 1 0 0 0 0
President_PartyRepublican SenateRepublican
Gov_ctrl_partyRepublican 1/2 0
Gov_ctrl_partyshared 0 0
HouseRepublican life_exp GDP Oil_Production
Gov_ctrl_partyRepublican 1/2 0 0 0
Gov_ctrl_partyshared 0 0 0 0
Oil_Stock HDD CDD Energy_Consumption
Gov_ctrl_partyRepublican 0 0 0 0
Gov_ctrl_partyshared 0 0 0 0
Net_imports_Energy Oil_Reserves Active_Conflicts
Gov_ctrl_partyRepublican 0 0 0
Gov_ctrl_partyshared 0 0 0
Drilled_Wells Eco_RecRecession SeasonSpring
Gov_ctrl_partyRepublican 0 0 0
Gov_ctrl_partyshared 0 0 0
SeasonSummer SeasonWinter Gov_ctrl_typeTotal
Gov_ctrl_partyRepublican 0 0 1/2
Gov_ctrl_partyshared 0 0 -1
Model :
Active_Rigs ~ Month + Year + US_Population + President_Party +
Senate + House + life_exp + GDP + Oil_Production + Oil_Stock +
Oil_Price + HDD + CDD + Energy_Consumption + Net_imports_Energy +
Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec +
Season + Gov_ctrl_type + Gov_ctrl_party
Complete :
(Intercept) Month Year US_Population
Gov_ctrl_partyRepublican -1/2 0 0 0
Gov_ctrl_partyshared 1 0 0 0
President_PartyRepublican SenateRepublican
Gov_ctrl_partyRepublican 1/2 0
Gov_ctrl_partyshared 0 0
HouseRepublican life_exp GDP Oil_Production
Gov_ctrl_partyRepublican 1/2 0 0 0
Gov_ctrl_partyshared 0 0 0 0
Oil_Stock Oil_Price HDD CDD Energy_Consumption
Gov_ctrl_partyRepublican 0 0 0 0 0
Gov_ctrl_partyshared 0 0 0 0 0
Net_imports_Energy Oil_Reserves Active_Conflicts
Gov_ctrl_partyRepublican 0 0 0
Gov_ctrl_partyshared 0 0 0
Drilled_Wells Eco_RecRecession SeasonSpring
Gov_ctrl_partyRepublican 0 0 0
Gov_ctrl_partyshared 0 0 0
SeasonSummer SeasonWinter Gov_ctrl_typeTotal
Gov_ctrl_partyRepublican 0 0 1/2
Gov_ctrl_partyshared 0 0 -1
as we can see Gov_ctrl_type and Gov_ctrl_party are aliased with other political variables so we will drop them from the dataset, and then generate a VIF matrix to look for other colinear explanatory variables
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| Month | 1.890459 | 1 | 1.374939 |
| Year | 10877.031124 | 1 | 104.293006 |
| Active_Rigs | 3.968626 | 1 | 1.992141 |
| US_Population | 7889.673094 | 1 | 88.823832 |
| President_Party | 2.441082 | 1 | 1.562396 |
| Senate | 7.488026 | 1 | 2.736426 |
| House | 11.497575 | 1 | 3.390808 |
| life_exp | 204.129970 | 1 | 14.287406 |
| GDP | 532.624103 | 1 | 23.078650 |
| Oil_Production | 22.474923 | 1 | 4.740772 |
| Oil_Stock | 118.034523 | 1 | 10.864369 |
| HDD | 23.352858 | 1 | 4.832479 |
| CDD | 16.766321 | 1 | 4.094670 |
| Energy_Consumption | 25.308852 | 1 | 5.030790 |
| Net_imports_Energy | 13.723124 | 1 | 3.704474 |
| Oil_Reserves | 26.968245 | 1 | 5.193096 |
| Active_Conflicts | 7.728751 | 1 | 2.780063 |
| Eco_Rec | 1.574211 | 1 | 1.254676 |
| Season | 32.850729 | 3 | 1.789606 |
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| Month | 1.885898 | 1 | 1.373280 |
| Year | 11206.371199 | 1 | 105.860149 |
| US_Population | 7887.114065 | 1 | 88.809426 |
| President_Party | 2.466371 | 1 | 1.570468 |
| Senate | 7.292218 | 1 | 2.700411 |
| House | 11.227107 | 1 | 3.350687 |
| life_exp | 248.770202 | 1 | 15.772451 |
| GDP | 734.158067 | 1 | 27.095351 |
| Oil_Production | 36.535996 | 1 | 6.044501 |
| Oil_Stock | 117.483915 | 1 | 10.839000 |
| Oil_Price | 11.390239 | 1 | 3.374943 |
| HDD | 23.815865 | 1 | 4.880150 |
| CDD | 16.783883 | 1 | 4.096814 |
| Energy_Consumption | 27.020968 | 1 | 5.198170 |
| Net_imports_Energy | 26.840064 | 1 | 5.180740 |
| Oil_Reserves | 27.275467 | 1 | 5.222592 |
| Active_Conflicts | 9.332611 | 1 | 3.054932 |
| Eco_Rec | 1.484848 | 1 | 1.218543 |
| Season | 34.183125 | 3 | 1.801504 |
| predictions_oil | 41.542859 | 1 | 6.445375 |
Variables that are highly correlated that high a VIF value bigger than 5 are:
Before deciding to remove them from the models, we will evaluate the more advance feature selecion algorithms
StepAIC akaike information criterion Oil Price model
We will run the AIC algorithm backwards and see which variables the algorithm decides to drop
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party +
Senate + House + life_exp + GDP + Oil_Production + Oil_Stock +
HDD + CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves +
Active_Conflicts + Drilled_Wells + Eco_Rec + Season
Final Model:
Oil_Price ~ Month + Year + Active_Rigs + US_Population + President_Party +
Senate + House + life_exp + GDP + Oil_Production + Oil_Stock +
CDD + Energy_Consumption + Net_imports_Energy + Oil_Reserves +
Active_Conflicts + Drilled_Wells + Season
Step Df Deviance Resid. Df Resid. Dev AIC
1 433 18886.24 1744.005
2 - HDD 1 1.773351 434 18888.01 1742.048
3 - Eco_Rec 1 5.320678 435 18893.33 1740.177
The AIC algorithm decides to drop the following variables:
StepAIC akaike information criterion Active Rigs model
We will run the AIC algorithm backwards and see which variables the algorithm decides to drop
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
Active_Rigs ~ Month + Year + US_Population + President_Party +
Senate + House + life_exp + GDP + Oil_Production + Oil_Stock +
Oil_Price + HDD + CDD + Energy_Consumption + Net_imports_Energy +
Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec +
Season
Final Model:
Active_Rigs ~ Year + US_Population + President_Party + Oil_Production +
Oil_Stock + Oil_Price + HDD + CDD + Net_imports_Energy +
Oil_Reserves + Active_Conflicts + Drilled_Wells + Eco_Rec
Step Df Deviance Resid. Df Resid. Dev AIC
1 433 17742283 4865.450
2 - Season 3 122801.6609 436 17865085 4862.595
3 - House 1 262.9605 437 17865348 4860.602
4 - Month 1 638.7724 438 17865986 4858.618
5 - Energy_Consumption 1 2877.4922 439 17868864 4856.691
6 - Senate 1 9271.8029 440 17878136 4854.928
7 - life_exp 1 32495.7768 441 17910631 4853.756
8 - GDP 1 38134.0117 442 17948765 4852.726
The AIC algorithm decides to drop the following variables:
BIC Bayesian information criterion Oil Price model
We will run the BIC algorithm and see which variables has the highest likehood for being included
BIC is considerably more strict on the penalization of the number of terms in comparisson with AIC, the variables that are not included in the most likely model are:
BIC Bayesian information criterion Active Rigs model
We will run the BIC algorithm and see which variables has the highest likehood for being included
BIC is considerably more strict on the penalization of the number of terms in comparisson with AIC, the variables that are not included in the most likely model are:
Boruta Feature Selection Oil Price
We will run the Boruta feature selection algorithm and see which variables are included
Boruta performed 99 iterations in 21.89947 secs.
17 attributes confirmed important: Active_Conflicts, Active_Rigs,
CDD, Eco_Rec, Energy_Consumption and 12 more;
1 attributes confirmed unimportant: Season;
1 tentative attributes left: HDD;
| features | medianImpact | decision |
|---|---|---|
| GDP | 13.583942 | Confirmed |
| US_Population | 13.207721 | Confirmed |
| Year | 13.116314 | Confirmed |
| Active_Rigs | 12.602486 | Confirmed |
| Oil_Stock | 12.566571 | Confirmed |
| life_exp | 10.985583 | Confirmed |
| Senate | 10.465949 | Confirmed |
| Active_Conflicts | 8.949173 | Confirmed |
| Net_imports_Energy | 8.597149 | Confirmed |
| Oil_Reserves | 8.522154 | Confirmed |
| Oil_Production | 8.505690 | Confirmed |
| Month | 6.678523 | Tentative |
| Energy_Consumption | 6.076158 | Confirmed |
| CDD | 5.334423 | Confirmed |
| President_Party | 5.082532 | Confirmed |
| House | 4.744941 | Confirmed |
| Eco_Rec | 4.201994 | Confirmed |
| HDD | 2.801193 | Confirmed |
| Season | 1.878089 | Rejected |
Month Year Active_Rigs
Confirmed Confirmed Confirmed
US_Population President_Party Senate
Confirmed Confirmed Confirmed
House life_exp GDP
Confirmed Confirmed Confirmed
Oil_Production Oil_Stock HDD
Confirmed Confirmed Tentative
CDD Energy_Consumption Net_imports_Energy
Confirmed Confirmed Confirmed
Oil_Reserves Active_Conflicts Eco_Rec
Confirmed Confirmed Confirmed
Season
Rejected
Levels: Tentative Confirmed Rejected
Only one variable was dropped by borutas final decission:
interestigly the variables which are given the highest importance are economic variables like GDP and population
Boruta Feature Selection Active Rigs
Boruta performed 99 iterations in 21.22427 secs.
16 attributes confirmed important: Active_Conflicts, Eco_Rec,
Energy_Consumption, GDP, House and 11 more;
No attributes deemed unimportant.
3 tentative attributes left: CDD, HDD, Season;
| features | medianImpact | decision |
|---|---|---|
| Oil_Price | 21.231639 | Confirmed |
| Oil_Stock | 14.001810 | Confirmed |
| GDP | 13.771841 | Confirmed |
| Year | 13.462641 | Confirmed |
| US_Population | 13.449564 | Confirmed |
| life_exp | 12.746623 | Confirmed |
| Net_imports_Energy | 11.438301 | Confirmed |
| Active_Conflicts | 11.436812 | Confirmed |
| Oil_Reserves | 11.024660 | Confirmed |
| Oil_Production | 10.984145 | Confirmed |
| Eco_Rec | 8.375199 | Confirmed |
| President_Party | 8.094818 | Tentative |
| Senate | 8.037422 | Tentative |
| Energy_Consumption | 6.580326 | Confirmed |
| Month | 5.184240 | Confirmed |
| House | 4.492152 | Confirmed |
| CDD | 1.929082 | Confirmed |
| HDD | 1.739728 | Confirmed |
| Season | 1.684771 | Tentative |
Month Year US_Population
Confirmed Confirmed Confirmed
President_Party Senate House
Confirmed Confirmed Confirmed
life_exp GDP Oil_Production
Confirmed Confirmed Confirmed
Oil_Stock Oil_Price HDD
Confirmed Confirmed Tentative
CDD Energy_Consumption Net_imports_Energy
Tentative Confirmed Confirmed
Oil_Reserves Active_Conflicts Eco_Rec
Confirmed Confirmed Confirmed
Season
Tentative
Levels: Tentative Confirmed Rejected
Threee variables were classified as tentative by borutas final decision: