To what extent are submarkets in gasoline prices segmented accross distinct geographic teritories?
2018-12-10
To what extent are submarkets in gasoline prices segmented accross distinct geographic teritories?
Gasoline markets are very local. While there is correlation in price over time between cities this is almost entirely due to crude oil and refining costs. NYC is correlated with Calgary to some extent, but not when you look at margins.
These costs, including transport and called the “rack” in Canada. The (retail price) - rack is the margin. Because the rack to retail (R2R) is independant of the rack, margins are uncorrelated between cities.
Markets become much less correlated when you compare margins than when you compare prices.
Prices
## calgary edmonton kelowna red deer vancouver victoria ## calgary 1.00 0.99 0.95 0.94 0.85 0.56 ## edmonton 0.99 1.00 0.93 0.94 0.85 0.54 ## kelowna 0.95 0.93 1.00 0.92 0.90 0.66 ## red deer 0.94 0.94 0.92 1.00 0.84 0.56 ## vancouver 0.85 0.85 0.90 0.84 1.00 0.82 ## victoria 0.56 0.54 0.66 0.56 0.82 1.00
Margins
## calgary edmonton kelowna red deer vancouver victoria ## calgary 1.00 0.92 0.45 0.03 -0.22 -0.25 ## edmonton 0.92 1.00 0.51 0.23 -0.22 -0.21 ## kelowna 0.45 0.51 1.00 0.67 -0.03 0.35 ## red deer 0.03 0.23 0.67 1.00 0.13 0.57 ## vancouver -0.22 -0.22 -0.03 0.13 1.00 0.51 ## victoria -0.25 -0.21 0.35 0.57 0.51 1.00
so while there is some correlation, it’s not nearly as distinct as with prices.
Much work has been done analyzing gasoline markets because they are highly visible and people spend a lot of money on gas.
Existing literature focused on small sample sizes (although considerably more accurate than mine) and spatial regression.
I used data.table and TSclust to cluster time series data using correlation and eucludian distance. I then picked the number of clusters and used this as ‘arbitrary’ labes.
I then split this into train and test sets and performed KNN classification on this using latitude and longitude.
How does this perform?
## actuals ## preds Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 Cluster_6 ## Cluster_1 31 0 0 0 0 0 ## Cluster_2 0 20 0 0 1 0 ## Cluster_3 0 0 33 21 1 0 ## Cluster_4 1 0 14 25 0 1 ## Cluster_5 0 0 0 0 11 0 ## Cluster_6 0 0 0 0 0 0
## Confusion Matrix and Statistics ## ## actuals ## preds Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 Cluster_6 ## Cluster_1 31 0 0 0 0 0 ## Cluster_2 0 20 0 0 1 0 ## Cluster_3 0 0 33 21 1 0 ## Cluster_4 1 0 14 25 0 1 ## Cluster_5 0 0 0 0 11 0 ## Cluster_6 0 0 0 0 0 0 ## ## Overall Statistics ## ## Accuracy : 0.7547 ## 95% CI : (0.6803, 0.8194) ## No Information Rate : 0.2956 ## P-Value [Acc > NIR] : < 2.2e-16 ## ## Kappa : 0.678 ## ## Mcnemar's Test P-Value : NA ## ## Statistics by Class: ## ## Class: Cluster_1 Class: Cluster_2 Class: Cluster_3 ## Sensitivity 0.9688 1.0000 0.7021 ## Specificity 1.0000 0.9928 0.8036 ## Pos Pred Value 1.0000 0.9524 0.6000 ## Neg Pred Value 0.9922 1.0000 0.8654 ## Prevalence 0.2013 0.1258 0.2956 ## Detection Rate 0.1950 0.1258 0.2075 ## Detection Prevalence 0.1950 0.1321 0.3459 ## Balanced Accuracy 0.9844 0.9964 0.7528 ## Class: Cluster_4 Class: Cluster_5 Class: Cluster_6 ## Sensitivity 0.5435 0.84615 0.000000 ## Specificity 0.8584 1.00000 1.000000 ## Pos Pred Value 0.6098 1.00000 NaN ## Neg Pred Value 0.8220 0.98649 0.993711 ## Prevalence 0.2893 0.08176 0.006289 ## Detection Rate 0.1572 0.06918 0.000000 ## Detection Prevalence 0.2579 0.06918 0.000000 ## Balanced Accuracy 0.7009 0.92308 0.500000
Taking the most common lable by city:
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics). ## This warning is displayed once per session.
## raw_eucl_6.y ## raw_eucl_6.x Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 ## Cluster_1 107 1 1 0 0 ## Cluster_2 0 67 0 0 0 ## Cluster_3 0 0 84 73 0 ## Cluster_4 0 0 78 78 0 ## Cluster_5 0 9 1 2 32 ## Cluster_6 0 0 3 3 0
## [1] "The accuracy is: 0.688311688311688"
Not Great
These results are not incredibly promissing although they do fall within of the 95% CI for the accuracy measure.
Taking the most promissing result for Calgary, the adjusted correlation 2
## actuals ## preds Cluster_1 Cluster_2 ## Cluster_1 13 12 ## Cluster_2 9 12
If we chose the most common lable in calgary we get:
## calg_adj_cor_2 count_calg ## 1: Cluster_2 82 ## 2: Cluster_1 74
This would only give us an accuracy of 0.5384615.
This suggests that aggrigating over entire markets only captures the market, not the sub markets.
If we increase K to 5:
## [1] "The Accuracy Table is:"
## actuals ## preds Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 ## Cluster_1 0 0 0 0 0 ## Cluster_2 0 0 0 0 0 ## Cluster_3 0 0 4 0 1 ## Cluster_4 3 2 4 19 6 ## Cluster_5 0 1 1 2 2
For K=5 for Calgary
## Confusion Matrix and Statistics ## ## actuals ## preds Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 ## Cluster_1 0 0 0 0 0 ## Cluster_2 0 0 0 0 0 ## Cluster_3 0 0 4 0 1 ## Cluster_4 3 2 4 19 6 ## Cluster_5 0 1 1 2 2 ## ## Overall Statistics ## ## Accuracy : 0.5556 ## 95% CI : (0.4, 0.7036) ## No Information Rate : 0.4667 ## P-Value [Acc > NIR] : 0.1479 ## ## Kappa : 0.2574 ## ## Mcnemar's Test P-Value : NA ## ## Statistics by Class: ## ## Class: Cluster_1 Class: Cluster_2 Class: Cluster_3 ## Sensitivity 0.00000 0.00000 0.44444 ## Specificity 1.00000 1.00000 0.97222 ## Pos Pred Value NaN NaN 0.80000 ## Neg Pred Value 0.93333 0.93333 0.87500 ## Prevalence 0.06667 0.06667 0.20000 ## Detection Rate 0.00000 0.00000 0.08889 ## Detection Prevalence 0.00000 0.00000 0.11111 ## Balanced Accuracy 0.50000 0.50000 0.70833 ## Class: Cluster_4 Class: Cluster_5 ## Sensitivity 0.9048 0.22222 ## Specificity 0.3750 0.88889 ## Pos Pred Value 0.5588 0.33333 ## Neg Pred Value 0.8182 0.82051 ## Prevalence 0.4667 0.20000 ## Detection Rate 0.4222 0.04444 ## Detection Prevalence 0.7556 0.13333 ## Balanced Accuracy 0.6399 0.55556
If we chose the most common lable in calgary we get:
## calg_adj_cor_5 count_calg ## 1: Cluster_4 71 ## 2: Cluster_5 30 ## 3: Cluster_3 33 ## 4: Cluster_2 11 ## 5: Cluster_1 11
This would give an accuracy of 0.4551282 which is greater than the KNN accuracy. This leads me to believe that sub markets within cities do not exist.
There are several issues that I faced that might hide a different result.
While the cluster analysis worked well defining different cities, it was a poor predictor of within city markets.
Explanation
Nobody travels accross town for gas (Costco is an exception), but it is only marginally more work to move out of a neighborhood, especially since people commute great distances. Thus, submarkets merge accross cities.
See hierarchies