1 Background

Team Fuel Circle set out to answer when and where fuel could be bought most cheaply in NSW, so we analysed 2019-2020 fuel price data across regions of New South Wales to identify the times and places where fuel prices were cheapest and most expensive. The project team also investigated regional and socio-economic factors (e.g. house prices) and temporal factors, such as the lead-in to and out of school and public holidays. The team performed spatial analysis and time series analysis and applied a generalised linear regression model to the data sets. However, many questions remain unanswered.

As stated in a report from the Australian Institute of Petroleum, almost 45% of local fuel demand is met by imports from overseas and 65% can be met by Australian refineries. Australian refineries compete directly with Asian imports, and locally produced petroleum products must be priced to match imports. According to the ACCC’s report, Australian fuel prices can also be affected by the level of competition in different areas and by the pricing decisions of wholesalers and retailers. Wholesalers and retailers set their profit margins according to the level of competition in each region, and they add the costs of transportation, storage and salaries on top of base fuel prices. It is arguably true that fuel prices can be manipulated by wholesalers and retailers in Australia. This report introduces some new data sets and studies the relationships between petroleum wholesale prices and other factors such as imports/exports, production and sales.

2 Objective

It is well known that the Australian petroleum refining industry is a price taker and that fuel prices are largely driven by the Asian market. However, many other factors still play important roles in setting Australian fuel prices. Principal component analysis will be used to further analyse the correlations between petroleum wholesale prices and other factors such as imports/exports, production and sales.

3 Data source

The data sets used in the Fuel Circle project provide little insight under principal component analysis, as PCA works best on multivariate numeric data and most variables in the existing data sets are descriptive or categorical. It was therefore necessary to gather more variables that could potentially influence Australian fuel prices. The two data sets below were merged together.

  • 2010-2020 monthly petroleum statistics (fuel, diesel and LPG), collected from energy.gov.au.
  • 2010-2020 terminal gate prices (fuel, diesel and LPG), gathered from the Australian Institute of Petroleum.

There are six independent variables: petroleum imports, petroleum production, refinery production, petroleum sales, petroleum exports and end-of-month stock, and one dependent variable: terminal gate (TPG) price. For fuel and diesel, the terminal gate price is the spot price of full tanker loads of fuel sold to wholesale customers from seaboard terminals (BP). Terminal gate prices do not include added services such as business support, freight, branding or wages for staff employed at the service station. For LPG, the wholesale price is based on the benchmark price set in Saudi Arabia; it is used in the model because no TPG price is available for LPG. Unfortunately, international benchmark prices for fuel and diesel cannot be accessed with a free subscription, so this factor is excluded from the PCA/PCR analysis. The necessary data wrangling and merging steps were completed before the principal component regression model was applied (see the sketch below).
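As a rough illustration, the merging step might look like the following sketch. The file names and join keys are assumptions for illustration only, not the exact files used in the project:

library(dplyr)

## read the two monthly series (illustrative file names)
petro_stats <- read.csv("monthly_petroleum_statistics_2010_2020.csv")  # energy.gov.au
tgp_prices  <- read.csv("terminal_gate_prices_2010_2020.csv")          # AIP

## join on month and fuel type, keeping complete monthly records only
df2 <- petro_stats %>%
  inner_join(tgp_prices, by = c("date", "Fuel.Code")) %>%
  na.omit()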

4 Methodologies

4.1 Exploratory data analysis: principal component analysis (without data scaling/normalisation)

Before building and training the statistical model, we perform some basic exploratory data analysis using PCA. PCA is commonly used to avoid multicollinearity between predictors and to reduce dimensionality in multivariate analysis. In other words, PCA decomposes the x variables, generates new components and ranks them by how much of the variance in the data set each component explains.

As the table below shows, we have six independent variables (x) and one dependent variable (y): TPG price. Only numeric values can be used in PCA. The data set has been factorised into two matrices, rotation and x, which together can recover the original data set: rotation provides the principal component loadings and x contains the principal component score vectors. As mentioned earlier, PCA decomposes all independent variables and creates new components. The scree plot below shows how much of the variance in the data set each principal component explains. PC1 contributes the most, followed by PC2 and PC3; together the top three components explain approximately 87% of the variance, which reduces the complexity of the original data set.

Now let’s look at the PCA biplot. The different fuel products clearly cluster into three groups. The angles between the vectors show how the variables correlate with one another, and the vectors also show how much weight each variable carries on each principal component.

Table: fuel price factors (first six rows of the merged data set)
date Fuel.Code petroleum_imports petroleum_production Refinery_production petroleum_sales petroleum_export end_of_.month_stock petroleum_TPG_price benchmark.price
1/7/10 LPG 88.2 348.5 164.9 513.4 259.7 177.0 62.5 38.2
1/8/10 LPG 141.4 331.0 164.7 495.7 271.5 160.2 58.6 35.0
1/9/10 LPG 4.0 301.3 145.8 447.1 263.3 155.7 58.0 37.2
1/10/10 LPG 164.4 289.6 141.7 431.3 252.2 140.6 60.0 38.7
1/11/10 LPG 3.6 257.2 145.6 402.8 96.7 188.4 62.3 41.9
1/12/10 LPG 159.4 259.6 156.2 415.8 234.4 186.3 69.4 48.8
## [1] 1.8492066 1.3414505 0.6758471 0.5132693 0.2051562 0.1365339
##                             PC1        PC2         PC3          PC4         PC5
## petroleum_imports     0.2600999  0.5930519 -0.27450407  0.602823014  0.35540520
## petroleum_production  0.3429984  0.4292202  0.75582227 -0.005896393 -0.34953490
## Refinery_production   0.4804181 -0.2794685  0.26560206 -0.264240768  0.63081881
## petroleum_sales       0.4766845 -0.3321648 -0.05747291  0.206826526  0.08431445
## petroleum_export     -0.3822676 -0.3936873  0.47511071  0.667324609  0.13979920
## end_of_.month_stock   0.4588485 -0.3473236 -0.23195351  0.280440934 -0.57177564
##                              PC6
## petroleum_imports    -0.12480620
## petroleum_production  0.06815695
## Refinery_production  -0.39089101
## petroleum_sales       0.78054415
## petroleum_export     -0.09102565
## end_of_.month_stock  -0.45765427
##            PC1        PC2         PC3         PC4         PC5          PC6
## [1,] -1.889478 -1.2396006  0.35487343  0.18039598  0.07624745  0.019448280
## [2,] -1.953452 -1.2423778  0.36416724  0.26961049  0.14565152  0.002121723
## [3,] -2.044354 -1.3024468  0.31186508  0.12919268  0.07679708  0.004877205
## [4,] -1.999511 -1.1500333  0.20208023  0.16452926  0.14853663 -0.004948633
## [5,] -1.526934 -0.7970537 -0.45835897 -0.80948619 -0.13110591  0.056455727
## [6,] -1.911377 -1.1644071  0.06414407  0.07529108  0.11440689 -0.069397806

4.2 Principal component regression model

PCA can be of little use if data scaling or normalisation is not done properly, as the analysis is very sensitive to the variance of each variable. We performed the EDA without normalisation only to gain a basic understanding of how the raw data set behaves under principal component analysis. Before building a principal component regression model, there are two important steps that improve the chances of a successful PCR model: data normalisation and a train/test split.

  • Data normalisation: z-score normalisation is used in the model. The z-score transform, z = (x − mean) / standard deviation, rescales each variable so that it has a mean of 0 and a standard deviation of 1; the shape of the distribution is not changed.

  • Train/test split: the data set is split into an 80% training set and a 20% test set. The model is trained on the 80% and the remaining 20% is used for validation.

Before diving deeper into the PCR model, recall that one advantage of PCA is that it removes multicollinearity between independent variables. As the first corrplot shows, there are some significant correlations between the independent variables. The second corrplot is perfectly white except for the diagonal because principal components are orthogonal: there is zero correlation between any pair of principal components.

library(caret)       # createDataPartition, RMSE, R2
library(clusterSim)  # data.Normalization
library(dplyr)       # %>%

set.seed(123)

## split data into 80:20 train/test sets
training.samples <- df2$petroleum_TPG_price %>%
  createDataPartition(p = 0.8, list = FALSE)

df_train <- df2[training.samples, ]
df_test  <- df2[-training.samples, ]

## corrplot of correlations between the x variables
res <- cor(df_train[c(3, 4, 5, 6, 7, 8)], method = "pearson")
corrplot::corrplot(res, method = "color", order = "hclust", tl.pos = "n")

## z-score normalisation (clusterSim type "n1": (x - mean) / sd) of predictors and TPG price
norm_train <- data.Normalization(df_train[c(3, 4, 5, 6, 7, 8, 9)], type = "n1", normalization = "column")
norm_test  <- data.Normalization(df_test[c(3, 4, 5, 6, 7, 8, 9)], type = "n1", normalization = "column")
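As an illustration of the orthogonality point above, the second corrplot can be produced from the principal component scores of the normalised training predictors. This is a sketch and assumes the six predictors occupy the first six columns of norm_train:

## PCA on the normalised predictors, then correlations between the PC scores
pca_train <- prcomp(norm_train[, 1:6])           # predictors only; TPG price excluded
res_pc <- cor(pca_train$x, method = "pearson")
corrplot::corrplot(res_pc, method = "color")     # off-diagonal cells are ~0: the PCs are orthogonal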

The scree plot below shows the relationship between the root mean square error (RMSE) of cross-validation and the number of principal components. If all six PCs are used in the model, the cross-validated RMSE reaches its minimum, but using every PC is pointless because it does not reduce dimensionality. The goal is to maximise the variance the model can explain while minimising the error. The R package caret uses cross-validation to automatically identify the optimal number of principal components (ncomp) to incorporate in the model; here five PCs explain most of the variance. Five PCs capture 80.94% of the variation contained in the predictors and 74.37% of the information in the outcome variable (TPG price), which is a fairly good result. A sketch of the caret training call follows.
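A minimal sketch of the caret workflow described above; the tuning grid and model formula are assumptions consistent with the outputs shown below:

set.seed(123)
pcr_model <- train(
  petroleum_TPG_price ~ ., data = norm_train,
  method = "pcr",                                        # principal component regression via the pls package
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  tuneLength = 6                                         # consider 1 to 6 principal components
)

pcr_model$bestTune             # optimal ncomp chosen by cross-validation
summary(pcr_model$finalModel)  # % variance explained per component
plot(pcr_model)                # cross-validation RMSE vs number of PCs (scree-style plot)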

To further validate whether the model is a good fit, we fit it to the 20% test data set. There are multiple ways to generate prediction metrics; here we use RMSE and R-squared. In this example the RMSE is 0.528 and the R-squared is 0.716, which looks very promising. We compare these metrics with other statistical models in later sections. As the prediction plot shows, although the fit is not perfect, there is a clear linear relationship between predicted and observed values, which supports the prediction metrics. Lastly, we can compare the predictions based on the 80% training data with the 20% test results (a sketch of this validation step follows the output below).

##   ncomp
## 5     5
## Data:    X dimension: 294 6 
##  Y dimension: 294 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           57.49    87.19    94.75    99.06    99.69
## .outcome    49.46    71.68    72.29    73.08    74.37

##  [1] -1.12160076 -1.08901390 -1.21850541 -1.10305627 -1.13395535 -1.19024048
##  [7] -1.19517873 -1.26946269 -1.22722268 -1.22854402 -1.29456419 -1.22898906
## [13] -1.24485520 -1.24923711 -1.26953859 -1.26779775 -1.24878985 -1.29319077
## [19] -1.18576572 -1.22488591 -1.20909383  0.96799779  1.41736033  0.83928112
## [25]  0.71995858  0.74417501  0.69745713  0.47902538  0.79722518  0.41173042
## [31]  0.76136489  0.99115741  0.44139968 -0.24012826  0.09187276  0.18313261
## [37]  0.18097495 -0.06841660  0.13440512 -0.17854267 -0.17752350  0.13328111
## [43]  0.11483489  0.01118658  0.08431935  0.65775437 -0.01023034 -0.70147181
## [49]  0.60695960  0.29763199  0.77279805  0.93952859  1.01559589  1.04887182
## [55]  0.91394172  0.82291497  1.13040433  0.77247205  0.35728602  0.41943479
## [61]  0.76951707  0.37190789  0.80845744  0.77516096  0.67359361  0.42603791
## [67]  0.57426673  0.53173525  0.51056451  0.58430312  0.55295904  0.33356349
##  [1] -1.95946605 -2.11565287 -1.58301575 -1.53896306 -1.53095348 -1.64308761
##  [7] -1.05037864 -0.58181817 -1.03435948 -1.06639780 -1.34673312 -1.49090558
## [13] -1.71116905 -1.77124090 -1.19855588 -1.23459899 -0.97428762 -0.65790918
## [19] -0.51373673 -0.90220140 -0.95025888  0.19911595  0.44420912  0.92718684
## [25]  0.76939810  0.68209367  1.04012192  0.88593750  1.32566348  1.21272839
## [31]  1.37692479  1.36450994  0.63924241  0.27640841 -0.27425027 -0.39078967
## [37] -0.03716668 -0.04597722 -0.10244476  0.20272026 -0.01914512  0.68409606
## [43]  1.19390587  0.31645631  0.94801175  0.91837630  0.83707905 -0.71798104
## [49]  0.23155475  0.60440073  0.96883666  1.15706180  1.03251282  1.02490372
## [55]  1.30804240  1.33167066  1.33086970 -0.05078297 -0.76723996 -0.06840404
## [61]  0.02210422  0.17749008 -0.11325770  0.41016840  0.58237439  1.02650564
## [67]  1.15786276  0.94360648  1.00287737 -0.53976787 -0.31910392 -0.25502727
##        RMSE   Rsquare
## 1 0.5284522 0.7169223
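For completeness, a sketch of how the hold-out metrics and prediction plot above might be generated, assuming pcr_model and norm_test from the earlier steps:

## predictions on the 20% test set and the corresponding metrics
pcr_pred <- predict(pcr_model, newdata = norm_test)

data.frame(
  RMSE    = RMSE(pcr_pred, norm_test$petroleum_TPG_price),
  Rsquare = R2(pcr_pred, norm_test$petroleum_TPG_price)
)

## prediction plot: observed vs predicted normalised TPG price
plot(norm_test$petroleum_TPG_price, pcr_pred,
     xlab = "Observed (normalised TPG price)", ylab = "Predicted")
abline(0, 1, lty = 2)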

4.3 Partial least squares regression model

Partial least squares regression is similar to principal component regression; the difference is that PLS projects both the predicted variables and the observed variables into a new space. It is particularly useful when a data set contains a set of dependent variables and a very large set of independent variables. The optimal number of components selected by R is again five, and the prediction plot looks very similar to the PCR prediction plot. The cross-validated RMSE is 0.531, slightly higher than for the PCR model, and the R-squared is slightly lower at 0.714. In this case, the PCR model is the better fit.

##   ncomp
## 5     5
## Data:    X dimension: 294 6 
##  Y dimension: 294 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           54.50    86.98    93.35    97.92    99.60
## .outcome    67.38    72.46    73.58    74.38    75.57

##        RMSE   Rsquare
## 1 0.5312614 0.7138858
## Data:    X dimension: 294 6 
##  Y dimension: 294 1
## Fit method: kernelpls
## Number of components considered: 6
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.002   0.5772   0.5314   0.5222   0.5157   0.5054   0.5039
## adjCV        1.002   0.5769   0.5310   0.5218   0.5153   0.5048   0.5032
## 
## TRAINING: % variance explained
##                      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                      54.50    86.98    93.35    97.92    99.60   100.00
## petroleum_TPG_price    67.38    72.46    73.58    74.38    75.57    75.77
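A sketch of how the PLS fit above might be reproduced with caret, under the same assumptions as the PCR sketch:

set.seed(123)
pls_model <- train(
  petroleum_TPG_price ~ ., data = norm_train,
  method = "pls",                                        # partial least squares via the pls package
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 6
)

pls_model$bestTune                                       # ncomp selected by cross-validation
pls_pred <- predict(pls_model, newdata = norm_test)
data.frame(
  RMSE    = RMSE(pls_pred, norm_test$petroleum_TPG_price),
  Rsquare = R2(pls_pred, norm_test$petroleum_TPG_price)
)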

5 Reflection

PCA is a great tool for eliminating noise variables and simplifying the dimensional structure. However, the results are difficult to interpret because PCA decomposes all variables and creates new principal components; in this example we cannot determine which individual factor has a significant influence on the Australian petroleum price. We used a generalised linear regression model in the AT2 project and had concerns about multicollinearity; in this example, variables such as petroleum production and refinery production are highly correlated, and PCA eliminates that multicollinearity. Looking at things from a different angle, if we treat all of the x variables together as a single composite predictor while y (the terminal gate/wholesale price) stays the same, the PCA model describes the relationship between the dynamics of supply and demand in the Australian petroleum market and the petroleum TPG/wholesale price.

6 Conclusion

This report uses principal component based regression methods, namely principal component regression and partial least squares regression, to analyse how other factors influence the Australian petroleum price. These methods are very useful for multivariate data containing correlated predictors. Based on the results, the PCR model is a better fit than the PLS model. However, both models are difficult to interpret: principal component based regression methods do not allow variable selection, nor can the estimated coefficients of the original variables be plotted. Despite this limitation, principal component methods are widely used in the agricultural, biological and environmental sciences, and in machine learning, wherever large data sets are involved.