MATH1324 Assignment 3

Factors influencing the price of Crude Oil

Bowei Wu, Yinan Zhang

Last updated: 31 May, 2019

Problem Statement

Among all the investments, many investors prefer investing in commodities. The crude oil is called “the king of commodities”, as it has high fluctuation in price, where investors can earn profit.

Therefore it is profitable for investors if we can investigate how economic variables affect the price of crude oil and create a model to predict it.

In this project we are looking to investigate four publicly available variables and their potential to predict the price of crude oil individually.

The scope of the project is to use simple linear regression only to find the best predictor as that is what is within the course guidelines. The model can be further optimised with more complex models later on.

Method

The price of crude oil is the dependent variable.

Four predictor variables we are looking at are:

All data sets contain 9 years information from January of 2010 to January of 2019. The data sets, U.S. import of oil, U.S. production of oil and refinery operable Utilisation rate are sourced from EIA website, and the U.S. Dollar Index and Oil Price are sourced from investing website and FRED website respectively.

To analyse variable we will 1. Look at the summary descriptive statistics 2. Visualise with histogram with an overlayed normal distribution curve 3. Check for normality of distribution with shapiro test.

To find which is the best variable to use to predict Oil Price we will be: 1. Building linear regression model for each variable 2. Checking if the model produce meaningful results 3. Finding the best simple linear regression variable and analyse further on the effectiveness and limitations of the model.

Importing and Preprocessing of the Data

dataset = read.csv("Dataset.csv")
oilprice = ts(dataset$Crude.Oil.Price, frequency = 12, start = c(2010,1), end = c(2019,1))
rate = ts(dataset$U.S..Percent.Utilization.of.Refinery.Operable.Capacity, frequency = 12,
          start = c(2010,1), end =c(2019,1))
index = ts(dataset$U.S.Dollar.Index, frequency = 12, start = c(2010,1), end = c(2019,1))
importoil = ts(dataset$U.S..Imports.of.Crude.Oil..Thousand.Barrels., frequency = 12, 
               start = c(2010,1), end = c(2019,1))
oilproduction = ts(dataset$U.S..Field.Production.of.Crude.Oil..Thousand.Barrels., frequency = 12, 
                   start = c(2010,1), end =c(2019,1))

Creating Descriptive Summary of Each Data Set

summary.rate = as.matrix(summary(rate))
summary.index = as.matrix(summary(index))
summary.importoil = as.matrix(summary(importoil))
summary.oilproduction = as.matrix(summary(oilproduction))

IQRtoSD = as.matrix(rbind(IQR(rate)/sd(rate),
                          IQR(index)/sd(index),
                          IQR(importoil)/sd(importoil),
                          IQR(oilproduction)/sd(oilproduction))) 
colnames(IQRtoSD)=c("IQRtoSD")

RangetoSD = as.matrix(rbind((max(rate)-min(rate))/sd(rate),
                      (max(index)-min(index))/sd(index),
                      (max(importoil)-min(importoil))/sd(importoil),
                      (max(oilproduction)-min(oilproduction))/sd(oilproduction)))
colnames(RangetoSD) = c("RangetoSD")

Skewness = as.matrix(rbind(skewness(rate),
                           skewness(index),
                           skewness(importoil),
                           skewness(oilproduction)))
colnames(Skewness)=c("Skewness")

Kurtosis = as.matrix(rbind(kurtosis(rate),
                           kurtosis(index),
                           kurtosis(importoil),
                           kurtosis(oilproduction)))
colnames(Kurtosis) = c("kurtosis")

tab = t(cbind(summary.rate, summary.index, summary.importoil, summary.oilproduction))
tab = cbind(tab, IQRtoSD, RangetoSD, Skewness, Kurtosis) %>% data.frame()
rownames(tab) = c("Operable Utilisation Rate", "U.S. Dollar Index","U.S. Crude Oil import","U.S. Oil Production")
tab1 <- select(tab,Min.:Max.)
tab2 <- select(tab,-(Min.:Max.))

Summary Statistics and Discussion

kable(tab1, digits = round(2)) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
Operable Utilisation Rate 79.90 86.80 90.00 89.42 92.00 97.00
U.S. Dollar Index 94.62 100.45 103.93 109.93 120.58 128.56
U.S. Crude Oil import 237278.00 280530.00 299373.00 295549.82 311950.00 327476.00
U.S. Oil Production 151185.00 190983.00 260115.00 246070.06 285836.00 370853.00
kable(tab2, digits = round(2))%>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
IQRtoSD RangetoSD Skewness kurtosis
Operable Utilisation Rate 1.42 4.65 -0.36 -0.32
U.S. Dollar Index 1.86 3.14 0.27 -1.59
U.S. Crude Oil import 1.50 4.31 -0.77 0.15
U.S. Oil Production 1.67 3.87 0.05 -0.97

Visualisation

par(mfrow = c(2,2))
hist(rate, main = "Histogram of Refinery Operable Utilisation rate", xlab = "Refinery Operable Utilisation rate", prob = T, xlim=c(75,100))
curve(dnorm(x, mean(rate),sd(rate)), add = TRUE, lwd=2)

hist(index, main = "Histogram of U.S Dollar Index", xlab = "U.S Dollar Index", prob = T)
curve(dnorm(x, mean(index),sd(index)), add = TRUE, lwd=2)

hist(importoil, main = "Histogram of U.S Import Crude Oil", xlab = "U.S Import Crude Oil", prob = T)
curve(dnorm(x, mean(importoil),sd(importoil)), add = TRUE, lwd=2)

hist(oilproduction, main = "Histogram of U.S Production of Crude Oil", xlab = "U.S Production of Crude Oil", prob = T)
curve(dnorm(x, mean(oilproduction),sd(oilproduction)), add = TRUE, lwd=2)

When visualise the histogram of each variable, variables ‘rate’ and ‘importoil’ seems left skew, and other two variables does not have features of the normal distribution. We will further confirm this by performing the normality test, sharpiro test, to further examine whether each variable is normal distributed.

Normality Test

The null hypothesis of the shapiro test is that the data set is normal distributed, and the alternative hypothesis is the data set is not normal distributed. This project set the significance level at 5%.

\(H0:\) Data are normally distributed
\(H1:\) Data are not normally distributed

U.S. Dollar Index

shapiro.test(index)
## 
##  Shapiro-Wilk normality test
## 
## data:  index
## W = 0.86654, p-value = 1.804e-08

U.S. Crude Oil Import

shapiro.test(importoil)
## 
##  Shapiro-Wilk normality test
## 
## data:  importoil
## W = 0.94208, p-value = 0.0001319

U.S. Oil Production

shapiro.test(oilproduction)
## 
##  Shapiro-Wilk normality test
## 
## data:  oilproduction
## W = 0.94713, p-value = 0.0002853

Operable Utilisation Rate

shapiro.test(rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  rate
## W = 0.98412, p-value = 0.2223

From the p-value of the shapiro test of each variable, only the variable ‘rate’ has p-value greater than 0.05, which means we failed to reject the null hypothesis for the data set rate. Since the p-value of all other data sets are less than 0.05, for those variables, the null hypothesis is rejected. We make the observation that all data sets except ‘rate’ are not normally distributed.

Transforming Data

U.S. Crude Oil Import is left skewed. As it is, we found that it is signifiantly different from a normal distribution. As a predictor, we would like it to be as symetrical as possible. Therefore transformation of the variable is required. After a few attempts, we found that a \(x^6\) transformation creates the best result

importoil_t <- importoil^6
hist(importoil_t, main = "Histogram of U.S Import Crude Oil Transformed", xlab = "U.S Import Crude Oil Transformed", prob = T)
curve(dnorm(x, mean(importoil_t),sd(importoil_t)), add = TRUE, lwd=2)

skewness(importoil_t)
## [1] -0.0534415

Looks much more symetrical now. Conducting test for normality:

shapiro.test(importoil_t)
## 
##  Shapiro-Wilk normality test
## 
## data:  importoil_t
## W = 0.97686, p-value = 0.05439

Since the p value is 0.054, we fail to reject the hypothesis that the data is not normally distributed.

This may be a better variable to use as a predictor.

Simple Linear Regression Model

model1 = lm(oilprice~index)
model2 = lm(oilprice~importoil)
model3 = lm(oilprice~oilproduction)
model4 = lm(oilprice~rate)
model5 = lm(oilprice~importoil_t)

Applying the linear regression model to each of the following variables:

We then look at each model to find the best predictor of Oil price.

U.S. Dollar Index

summary(model1)
## 
## Call:
## lm(formula = oilprice ~ index)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2905  -9.3224  -0.2908   8.3091  27.3312 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 272.89457   10.81451   25.23   <2e-16 ***
## index        -1.80997    0.09791  -18.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11 on 107 degrees of freedom
## Multiple R-squared:  0.7616, Adjusted R-squared:  0.7593 
## F-statistic: 341.7 on 1 and 107 DF,  p-value: < 2.2e-16

Simple linear model resulted \[ \hat{oilprice} = 272.89 -1.81 * index\] with R-square equals to 0.7616, which means 76.16% variation of Oil Price is explained by index. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, index is an significant regressor.

U.S. Crude Oil Import

summary(model2)
## 
## Call:
## lm(formula = oilprice ~ importoil)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.037 -17.994   0.773  17.668  38.804 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.742e+02  2.914e+01   5.978 3.02e-08 ***
## importoil   -3.393e-04  9.835e-05  -3.450 0.000804 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.37 on 107 degrees of freedom
## Multiple R-squared:  0.1001, Adjusted R-squared:  0.09168 
## F-statistic:  11.9 on 1 and 107 DF,  p-value: 0.0008039

Simple linear model resulted \[ \hat{oilprice} = 174.2 -0.0003 * importoil\] with R-square equals to 0.1001, which means 10.01% variation of Oil Price is explained by oil import. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, oil import is an significant regressor.

U.S. Oil Production

summary(model3)
## 
## Call:
## lm(formula = oilprice ~ oilproduction)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.703 -16.332  -1.869  13.674  35.845 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.320e+02  7.744e+00  17.046  < 2e-16 ***
## oilproduction -2.361e-04  3.068e-05  -7.695 7.39e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.08 on 107 degrees of freedom
## Multiple R-squared:  0.3563, Adjusted R-squared:  0.3503 
## F-statistic: 59.22 on 1 and 107 DF,  p-value: 7.394e-12

Simple linear model resulted \[ \hat{oilprice} = 132.01 -0.0002 * oilproduction\] with R-square equals to 0.3563, which means 35.63% variation of Oil Price is explained by oil production. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, index is an significant regressor.

Operable Utilisation Rate

summary(model4)
## 
## Call:
## lm(formula = oilprice ~ rate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.906 -18.113   3.993  18.513  40.471 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 257.1246    49.7470   5.169  1.1e-06 ***
## rate         -2.0486     0.5558  -3.686  0.00036 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.22 on 107 degrees of freedom
## Multiple R-squared:  0.1127, Adjusted R-squared:  0.1044 
## F-statistic: 13.58 on 1 and 107 DF,  p-value: 0.0003599

Simple linear model resulted \[ \hat{oilprice} = 257.12 -2.05 * rate\] with R-square equals to 0.2122, which means 21.22% variation of Oil Price is explained by rate. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, rate is an significant regressor.

U.S. Crude Oil Import Transformed

summary(model5)
## 
## Call:
## lm(formula = oilprice ~ importoil_t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.140 -17.196   0.337  16.613  40.546 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.367e+01  5.915e+00  15.836  < 2e-16 ***
## importoil_t -2.769e-32  7.785e-33  -3.557 0.000561 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.31 on 107 degrees of freedom
## Multiple R-squared:  0.1057, Adjusted R-squared:  0.09738 
## F-statistic: 12.65 on 1 and 107 DF,  p-value: 0.0005606

Simple linear model resulted \[ \hat{oilprice} = 93.67 -2.769*10^{-33} * importoil\] with R-square equals to 0.1057, which means 10.57% variation of Oil Price is explained by oil import. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, oil import is an significant regressor.

Comparing to find highest R quared value.

r1 <- summary(model1)$r.squared 
r2 <- summary(model2)$r.squared 
r3 <- summary(model3)$r.squared
r4 <- summary(model4)$r.squared
r5 <- summary(model5)$r.squared
r <-data.frame(c(r1, r2, r3, r4, r5), 
           row.names = c("U.S. Dollar Index", "U.S. Crude Oil Import", 
                         "U.S. Oil Production", "Operable Utilisation Rate", "U.S. Crude Oil Import Transformed"))
kable(r, digits = round(2), col.names = "R Squared Value")%>% 
  kable_styling(bootstrap_options = c("striped", "hover"),full_width = F)
R Squared Value
U.S. Dollar Index 0.76
U.S. Crude Oil Import 0.10
U.S. Oil Production 0.36
Operable Utilisation Rate 0.11
U.S. Crude Oil Import Transformed 0.11

Overall, after comparing these 5 models, the final model choosed to perform further test is depends on the R-square of each value. The highest R-square value is 0.7616, which is the model between Oil Price and U.S. Dollar Index. This is the highest proportion of variability in the Oil Price that can be explained by a linear relationship with the U.s. Dollar Index.

Interesting to note that the transformation of the data on U.S. Crude Oil Import did result in an improvement of the R squared score of 0.01, or a 1% improvement of the model.

We proceed further by checking the simple linear regression model with U.S. Dollar Index variable with the normality test and residual check.

U.S. Dollar Index V.S Oil Price

par(mar = c(5,5,3,5))
plot.ts(oilprice, col="red", las=1)
par(new =T)
plot(index, ylab="",xaxt ="n", yaxt="n",col="blue", main = "Time Series Plot of U.S. Dollar Index V.S Oil Price")
axis(side =4)
mtext("Dollar Index", side =4, line =3)
legend("bottomright", legend =c("Oil Price", "Dollar Index"), col = c("red", "blue"), lty=1, box.lty=0)

Time Series Plot of U.S. Dollar Index V.S Oil Price suggests there is an reverse relation between Oil Price and Dollar Index, as when there is an increase in Dollar Index will simultaneously result a decrease in Oil Price and vise versa.

Scatter plot between Oil Price and U.S. Dollar Index

plot(dataset$U.S.Dollar.Index, dataset$Crude.Oil.Price, ylab = "Oil Price", xlab = "U.S. Dollar Index", las=1
     , main ="Scatter plot of U.S. Dollar Index vs Oil Price")
abline(lm(oilprice~index))

Plotting the U,S. dollar Index against the Oil Price, We can clearly see a trend ofnegative linear relationship between the variables.

Correlation between Oil price and U.S. Dollar Index

cor(index, oilprice)
## [1] -0.8726729

Furthermore, the correlation of -0.87 between them also indicates there is a strong correlation between the independent variable and dependent variable.

Test for Normally Distributed Errors

H0: Errors are normally distributed

H1: Errors are not normally distributed

shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.97432, p-value = 0.03322

Residual Check

par(mfrow = c(2,3), mar = c(4,2,5,2))
plot(model1)
acf(model1$residuals)

From the residual check, the first graph named ‘Residuals VS Fitted’ is used to check for non-randomness. The graph indicates the red lines of both model are not approximately horizontal at 0. Therefore, the residuals of both models have problem of non-randomness.

The second graph named ‘Normal Q-Q’ is used to test the normality assumption. The graph of simple regression model indicates the normal probability plot of residual are not following the red line, as there are one heavy tails at left end that deviate from straight line. Besides, the shapiro test of two models, the p-value of simple linear model is less than 5% significance level. Therefore, the errors of simple linear regression violate the assumption of normal distributed.

The third graph named ‘Scale-Location’ is used to check the homogeneity of variance of the residuals. Since the statistics of p-value from ncv test of both models are less than 5% significant level, which indicates the null hypothesis of errors are normal distributed is rejects.Therefore, the residuals of both models violate the assumption of homoscedasticity.

The fourth graph named ‘Residuals VS Leverage’ is used to check whether there is an influence point. From the graph, both models have no point go beyond the cook’s distance. Therefore, the residual of both model do not have influential point effect.

The last graph named ‘Series res’ is used to test the auto-correlation within the residuals. The graph shows the errors are not independent of each other, as both models have first sevral lags auto-correlated. The statisitics of p-value from durbinwaston test is less than 5% significant level, which means the null hypothesis of errors are uncorrelated is rejected. Therefore, the errors violate the assumption of first-order correlation.

Conclusion

After investigating the following four predictor variables on their potential to predict the price of crude oil. * Operable Utilisation rate * U.S. Dollar Index * U.S. crude Oil import * U.S. production of crude oil Using simple linear regression alone we can conclude that using the U.S. Dollar Index is the best variable to use for prediction of the Oil Price. The model have: * high negative correlation * high R2 value.
* Both results were statistically significant

Further investigation into the model we have found statically significant evidence that the residual errors is not normally distributed. Therefore, suggest the errors are not random, and there are other explanatory variables could be used.
In this project only choose 4 variables. Since we are only looking for the best variable to use as predictor. We can improve the model by combining predictors in a multivariate regression model. We can look to include other relevant variables to further improve the quality of the model.

Data Reference

  1. Investing.com. (2019). US Dollar Futures Historical Data. Retrieved from https://au.investing.com/quotes/us-dollar-index-historical-data.

  2. FRED. (2019). Crude Oil Prices: West Texas Intermediate (WTI) - Cushing, Oklahoma. Retrieved from https://fred.stlouisfed.org/series/DCOILWTICO.

  3. U.S. Energy Information Administration. (2019). Refinery Utilization and Capacity. Retrived from https://www.eia.gov/dnav/pet/pet_pnp_unc_dcu_nus_m.htm.

  4. U.S. Energy Information Administration. (2019). U.S. Field Production of Crude Oi. Retrived from https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=pet&s=mcrfpus1&f=m

  5. U.S. Energy Information Administration. (2019). U.S. Imports of Crude Oil. Retrived from https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=MCRIMUS1&f=M