Bowei Wu, Yinan Zhang
Last updated: 31 May, 2019
Among all the investments, many investors prefer investing in commodities. The crude oil is called “the king of commodities”, as it has high fluctuation in price, where investors can earn profit.
Therefore it is profitable for investors if we can investigate how economic variables affect the price of crude oil and create a model to predict it.
In this project we are looking to investigate four publicly available variables and their potential to predict the price of crude oil individually.
The scope of the project is to use simple linear regression only to find the best predictor as that is what is within the course guidelines. The model can be further optimised with more complex models later on.
The price of crude oil is the dependent variable.
Four predictor variables we are looking at are:
All data sets contain 9 years information from January of 2010 to January of 2019. The data sets, U.S. import of oil, U.S. production of oil and refinery operable Utilisation rate are sourced from EIA website, and the U.S. Dollar Index and Oil Price are sourced from investing website and FRED website respectively.
To analyse variable we will 1. Look at the summary descriptive statistics 2. Visualise with histogram with an overlayed normal distribution curve 3. Check for normality of distribution with shapiro test.
To find which is the best variable to use to predict Oil Price we will be: 1. Building linear regression model for each variable 2. Checking if the model produce meaningful results 3. Finding the best simple linear regression variable and analyse further on the effectiveness and limitations of the model.
dataset = read.csv("Dataset.csv")
oilprice = ts(dataset$Crude.Oil.Price, frequency = 12, start = c(2010,1), end = c(2019,1))
rate = ts(dataset$U.S..Percent.Utilization.of.Refinery.Operable.Capacity, frequency = 12,
start = c(2010,1), end =c(2019,1))
index = ts(dataset$U.S.Dollar.Index, frequency = 12, start = c(2010,1), end = c(2019,1))
importoil = ts(dataset$U.S..Imports.of.Crude.Oil..Thousand.Barrels., frequency = 12,
start = c(2010,1), end = c(2019,1))
oilproduction = ts(dataset$U.S..Field.Production.of.Crude.Oil..Thousand.Barrels., frequency = 12,
start = c(2010,1), end =c(2019,1))summary.rate = as.matrix(summary(rate))
summary.index = as.matrix(summary(index))
summary.importoil = as.matrix(summary(importoil))
summary.oilproduction = as.matrix(summary(oilproduction))
IQRtoSD = as.matrix(rbind(IQR(rate)/sd(rate),
IQR(index)/sd(index),
IQR(importoil)/sd(importoil),
IQR(oilproduction)/sd(oilproduction)))
colnames(IQRtoSD)=c("IQRtoSD")
RangetoSD = as.matrix(rbind((max(rate)-min(rate))/sd(rate),
(max(index)-min(index))/sd(index),
(max(importoil)-min(importoil))/sd(importoil),
(max(oilproduction)-min(oilproduction))/sd(oilproduction)))
colnames(RangetoSD) = c("RangetoSD")
Skewness = as.matrix(rbind(skewness(rate),
skewness(index),
skewness(importoil),
skewness(oilproduction)))
colnames(Skewness)=c("Skewness")
Kurtosis = as.matrix(rbind(kurtosis(rate),
kurtosis(index),
kurtosis(importoil),
kurtosis(oilproduction)))
colnames(Kurtosis) = c("kurtosis")
tab = t(cbind(summary.rate, summary.index, summary.importoil, summary.oilproduction))
tab = cbind(tab, IQRtoSD, RangetoSD, Skewness, Kurtosis) %>% data.frame()
rownames(tab) = c("Operable Utilisation Rate", "U.S. Dollar Index","U.S. Crude Oil import","U.S. Oil Production")
tab1 <- select(tab,Min.:Max.)
tab2 <- select(tab,-(Min.:Max.))kable(tab1, digits = round(2)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | |
|---|---|---|---|---|---|---|
| Operable Utilisation Rate | 79.90 | 86.80 | 90.00 | 89.42 | 92.00 | 97.00 |
| U.S. Dollar Index | 94.62 | 100.45 | 103.93 | 109.93 | 120.58 | 128.56 |
| U.S. Crude Oil import | 237278.00 | 280530.00 | 299373.00 | 295549.82 | 311950.00 | 327476.00 |
| U.S. Oil Production | 151185.00 | 190983.00 | 260115.00 | 246070.06 | 285836.00 | 370853.00 |
kable(tab2, digits = round(2))%>%
kable_styling(bootstrap_options = c("striped", "hover"))| IQRtoSD | RangetoSD | Skewness | kurtosis | |
|---|---|---|---|---|
| Operable Utilisation Rate | 1.42 | 4.65 | -0.36 | -0.32 |
| U.S. Dollar Index | 1.86 | 3.14 | 0.27 | -1.59 |
| U.S. Crude Oil import | 1.50 | 4.31 | -0.77 | 0.15 |
| U.S. Oil Production | 1.67 | 3.87 | 0.05 | -0.97 |
One important property of the theoretical normal distribution requires that the interquartile range and range is approximately equal to 1.33 and 6 standard deviations, respectively. Above statistics table shows only Rate has a value of IQR to Standard deviation is close to 1.33, whereas other variables all far away from 1.33. However, for the range to standard deviation, there are no variable has value that close to 6.
Kurtosis and skewness values are also important factors in determining normal distribution. Theoretically, the normal distributed variable should have skewness and excess kurtosis value both equal to 0. Above table shows all variables excpet ‘oilproduction’ has the value of skewness between -0.5 and 0.5, which indicates they are fairly symmetrical. Besides, all variables except ‘index’ has the value of kurtosis between -1 and 1, which is acceptable excess kurtosis value. Therefore, the normality of the each variable cannot be clear defined by using these statistics.
par(mfrow = c(2,2))
hist(rate, main = "Histogram of Refinery Operable Utilisation rate", xlab = "Refinery Operable Utilisation rate", prob = T, xlim=c(75,100))
curve(dnorm(x, mean(rate),sd(rate)), add = TRUE, lwd=2)
hist(index, main = "Histogram of U.S Dollar Index", xlab = "U.S Dollar Index", prob = T)
curve(dnorm(x, mean(index),sd(index)), add = TRUE, lwd=2)
hist(importoil, main = "Histogram of U.S Import Crude Oil", xlab = "U.S Import Crude Oil", prob = T)
curve(dnorm(x, mean(importoil),sd(importoil)), add = TRUE, lwd=2)
hist(oilproduction, main = "Histogram of U.S Production of Crude Oil", xlab = "U.S Production of Crude Oil", prob = T)
curve(dnorm(x, mean(oilproduction),sd(oilproduction)), add = TRUE, lwd=2)When visualise the histogram of each variable, variables ‘rate’ and ‘importoil’ seems left skew, and other two variables does not have features of the normal distribution. We will further confirm this by performing the normality test, sharpiro test, to further examine whether each variable is normal distributed.
The null hypothesis of the shapiro test is that the data set is normal distributed, and the alternative hypothesis is the data set is not normal distributed. This project set the significance level at 5%.
\(H0:\) Data are normally distributed
\(H1:\) Data are not normally distributed
shapiro.test(index)##
## Shapiro-Wilk normality test
##
## data: index
## W = 0.86654, p-value = 1.804e-08
shapiro.test(importoil)##
## Shapiro-Wilk normality test
##
## data: importoil
## W = 0.94208, p-value = 0.0001319
shapiro.test(oilproduction)##
## Shapiro-Wilk normality test
##
## data: oilproduction
## W = 0.94713, p-value = 0.0002853
shapiro.test(rate)##
## Shapiro-Wilk normality test
##
## data: rate
## W = 0.98412, p-value = 0.2223
From the p-value of the shapiro test of each variable, only the variable ‘rate’ has p-value greater than 0.05, which means we failed to reject the null hypothesis for the data set rate. Since the p-value of all other data sets are less than 0.05, for those variables, the null hypothesis is rejected. We make the observation that all data sets except ‘rate’ are not normally distributed.
U.S. Crude Oil Import is left skewed. As it is, we found that it is signifiantly different from a normal distribution. As a predictor, we would like it to be as symetrical as possible. Therefore transformation of the variable is required. After a few attempts, we found that a \(x^6\) transformation creates the best result
importoil_t <- importoil^6
hist(importoil_t, main = "Histogram of U.S Import Crude Oil Transformed", xlab = "U.S Import Crude Oil Transformed", prob = T)
curve(dnorm(x, mean(importoil_t),sd(importoil_t)), add = TRUE, lwd=2)skewness(importoil_t)## [1] -0.0534415
Looks much more symetrical now. Conducting test for normality:
shapiro.test(importoil_t)##
## Shapiro-Wilk normality test
##
## data: importoil_t
## W = 0.97686, p-value = 0.05439
Since the p value is 0.054, we fail to reject the hypothesis that the data is not normally distributed.
This may be a better variable to use as a predictor.
model1 = lm(oilprice~index)
model2 = lm(oilprice~importoil)
model3 = lm(oilprice~oilproduction)
model4 = lm(oilprice~rate)
model5 = lm(oilprice~importoil_t)Applying the linear regression model to each of the following variables:
We then look at each model to find the best predictor of Oil price.
summary(model1)##
## Call:
## lm(formula = oilprice ~ index)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2905 -9.3224 -0.2908 8.3091 27.3312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 272.89457 10.81451 25.23 <2e-16 ***
## index -1.80997 0.09791 -18.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11 on 107 degrees of freedom
## Multiple R-squared: 0.7616, Adjusted R-squared: 0.7593
## F-statistic: 341.7 on 1 and 107 DF, p-value: < 2.2e-16
Simple linear model resulted \[ \hat{oilprice} = 272.89 -1.81 * index\] with R-square equals to 0.7616, which means 76.16% variation of Oil Price is explained by index. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, index is an significant regressor.
summary(model2)##
## Call:
## lm(formula = oilprice ~ importoil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.037 -17.994 0.773 17.668 38.804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.742e+02 2.914e+01 5.978 3.02e-08 ***
## importoil -3.393e-04 9.835e-05 -3.450 0.000804 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.37 on 107 degrees of freedom
## Multiple R-squared: 0.1001, Adjusted R-squared: 0.09168
## F-statistic: 11.9 on 1 and 107 DF, p-value: 0.0008039
Simple linear model resulted \[ \hat{oilprice} = 174.2 -0.0003 * importoil\] with R-square equals to 0.1001, which means 10.01% variation of Oil Price is explained by oil import. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, oil import is an significant regressor.
summary(model3)##
## Call:
## lm(formula = oilprice ~ oilproduction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.703 -16.332 -1.869 13.674 35.845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.320e+02 7.744e+00 17.046 < 2e-16 ***
## oilproduction -2.361e-04 3.068e-05 -7.695 7.39e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.08 on 107 degrees of freedom
## Multiple R-squared: 0.3563, Adjusted R-squared: 0.3503
## F-statistic: 59.22 on 1 and 107 DF, p-value: 7.394e-12
Simple linear model resulted \[ \hat{oilprice} = 132.01 -0.0002 * oilproduction\] with R-square equals to 0.3563, which means 35.63% variation of Oil Price is explained by oil production. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, index is an significant regressor.
summary(model4)##
## Call:
## lm(formula = oilprice ~ rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.906 -18.113 3.993 18.513 40.471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 257.1246 49.7470 5.169 1.1e-06 ***
## rate -2.0486 0.5558 -3.686 0.00036 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.22 on 107 degrees of freedom
## Multiple R-squared: 0.1127, Adjusted R-squared: 0.1044
## F-statistic: 13.58 on 1 and 107 DF, p-value: 0.0003599
Simple linear model resulted \[ \hat{oilprice} = 257.12 -2.05 * rate\] with R-square equals to 0.2122, which means 21.22% variation of Oil Price is explained by rate. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, rate is an significant regressor.
summary(model5)##
## Call:
## lm(formula = oilprice ~ importoil_t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.140 -17.196 0.337 16.613 40.546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.367e+01 5.915e+00 15.836 < 2e-16 ***
## importoil_t -2.769e-32 7.785e-33 -3.557 0.000561 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.31 on 107 degrees of freedom
## Multiple R-squared: 0.1057, Adjusted R-squared: 0.09738
## F-statistic: 12.65 on 1 and 107 DF, p-value: 0.0005606
Simple linear model resulted \[ \hat{oilprice} = 93.67 -2.769*10^{-33} * importoil\] with R-square equals to 0.1057, which means 10.57% variation of Oil Price is explained by oil import. In this model, since the p-value of this explanatory variable is less than 5% significancce level, the null hypothesis of the estimated coefficient of index is no statistically different from 0 is rejected. Therefore, oil import is an significant regressor.
r1 <- summary(model1)$r.squared
r2 <- summary(model2)$r.squared
r3 <- summary(model3)$r.squared
r4 <- summary(model4)$r.squared
r5 <- summary(model5)$r.squared
r <-data.frame(c(r1, r2, r3, r4, r5),
row.names = c("U.S. Dollar Index", "U.S. Crude Oil Import",
"U.S. Oil Production", "Operable Utilisation Rate", "U.S. Crude Oil Import Transformed"))
kable(r, digits = round(2), col.names = "R Squared Value")%>%
kable_styling(bootstrap_options = c("striped", "hover"),full_width = F)| R Squared Value | |
|---|---|
| U.S. Dollar Index | 0.76 |
| U.S. Crude Oil Import | 0.10 |
| U.S. Oil Production | 0.36 |
| Operable Utilisation Rate | 0.11 |
| U.S. Crude Oil Import Transformed | 0.11 |
Overall, after comparing these 5 models, the final model choosed to perform further test is depends on the R-square of each value. The highest R-square value is 0.7616, which is the model between Oil Price and U.S. Dollar Index. This is the highest proportion of variability in the Oil Price that can be explained by a linear relationship with the U.s. Dollar Index.
Interesting to note that the transformation of the data on U.S. Crude Oil Import did result in an improvement of the R squared score of 0.01, or a 1% improvement of the model.
We proceed further by checking the simple linear regression model with U.S. Dollar Index variable with the normality test and residual check.
par(mar = c(5,5,3,5))
plot.ts(oilprice, col="red", las=1)
par(new =T)
plot(index, ylab="",xaxt ="n", yaxt="n",col="blue", main = "Time Series Plot of U.S. Dollar Index V.S Oil Price")
axis(side =4)
mtext("Dollar Index", side =4, line =3)
legend("bottomright", legend =c("Oil Price", "Dollar Index"), col = c("red", "blue"), lty=1, box.lty=0) Time Series Plot of U.S. Dollar Index V.S Oil Price suggests there is an reverse relation between Oil Price and Dollar Index, as when there is an increase in Dollar Index will simultaneously result a decrease in Oil Price and vise versa.
plot(dataset$U.S.Dollar.Index, dataset$Crude.Oil.Price, ylab = "Oil Price", xlab = "U.S. Dollar Index", las=1
, main ="Scatter plot of U.S. Dollar Index vs Oil Price")
abline(lm(oilprice~index)) Plotting the U,S. dollar Index against the Oil Price, We can clearly see a trend ofnegative linear relationship between the variables.
cor(index, oilprice)## [1] -0.8726729
Furthermore, the correlation of -0.87 between them also indicates there is a strong correlation between the independent variable and dependent variable.
H0: Errors are normally distributed
H1: Errors are not normally distributed
shapiro.test(model1$residuals)##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.97432, p-value = 0.03322
par(mfrow = c(2,3), mar = c(4,2,5,2))
plot(model1)
acf(model1$residuals)From the residual check, the first graph named ‘Residuals VS Fitted’ is used to check for non-randomness. The graph indicates the red lines of both model are not approximately horizontal at 0. Therefore, the residuals of both models have problem of non-randomness.
The second graph named ‘Normal Q-Q’ is used to test the normality assumption. The graph of simple regression model indicates the normal probability plot of residual are not following the red line, as there are one heavy tails at left end that deviate from straight line. Besides, the shapiro test of two models, the p-value of simple linear model is less than 5% significance level. Therefore, the errors of simple linear regression violate the assumption of normal distributed.
The third graph named ‘Scale-Location’ is used to check the homogeneity of variance of the residuals. Since the statistics of p-value from ncv test of both models are less than 5% significant level, which indicates the null hypothesis of errors are normal distributed is rejects.Therefore, the residuals of both models violate the assumption of homoscedasticity.
The fourth graph named ‘Residuals VS Leverage’ is used to check whether there is an influence point. From the graph, both models have no point go beyond the cook’s distance. Therefore, the residual of both model do not have influential point effect.
The last graph named ‘Series res’ is used to test the auto-correlation within the residuals. The graph shows the errors are not independent of each other, as both models have first sevral lags auto-correlated. The statisitics of p-value from durbinwaston test is less than 5% significant level, which means the null hypothesis of errors are uncorrelated is rejected. Therefore, the errors violate the assumption of first-order correlation.
After investigating the following four predictor variables on their potential to predict the price of crude oil. * Operable Utilisation rate * U.S. Dollar Index * U.S. crude Oil import * U.S. production of crude oil Using simple linear regression alone we can conclude that using the U.S. Dollar Index is the best variable to use for prediction of the Oil Price. The model have: * high negative correlation * high R2 value.
* Both results were statistically significant
Further investigation into the model we have found statically significant evidence that the residual errors is not normally distributed. Therefore, suggest the errors are not random, and there are other explanatory variables could be used.
In this project only choose 4 variables. Since we are only looking for the best variable to use as predictor. We can improve the model by combining predictors in a multivariate regression model. We can look to include other relevant variables to further improve the quality of the model.
Investing.com. (2019). US Dollar Futures Historical Data. Retrieved from https://au.investing.com/quotes/us-dollar-index-historical-data.
FRED. (2019). Crude Oil Prices: West Texas Intermediate (WTI) - Cushing, Oklahoma. Retrieved from https://fred.stlouisfed.org/series/DCOILWTICO.
U.S. Energy Information Administration. (2019). Refinery Utilization and Capacity. Retrived from https://www.eia.gov/dnav/pet/pet_pnp_unc_dcu_nus_m.htm.
U.S. Energy Information Administration. (2019). U.S. Field Production of Crude Oi. Retrived from https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=pet&s=mcrfpus1&f=m
U.S. Energy Information Administration. (2019). U.S. Imports of Crude Oil. Retrived from https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=MCRIMUS1&f=M