2018-12-10

Thesis

To what extent are submarkets in gasoline prices segmented accross distinct geographic teritories?

Problem Overview

Gasoline markets are very local. While there is correlation in price over time between cities this is almost entirely due to crude oil and refining costs. NYC is correlated with Calgary to some extent, but not when you look at margins.

These costs, including transport and called the “rack” in Canada. The (retail price) - rack is the margin. Because the rack to retail (R2R) is independant of the rack, margins are uncorrelated between cities.

Gas Markets

  • Gas prices are highly competitive
  • The price payed for gasoline only contains a bit of profit for the station
  • Margins of 0 to 10% are the norm.
  • There is significant fluctiation throught the day and year

Submarkets

Markets become much less correlated when you compare margins than when you compare prices.

Correlation

Prices

##           calgary edmonton kelowna red deer vancouver victoria
## calgary      1.00     0.99    0.95     0.94      0.85     0.56
## edmonton     0.99     1.00    0.93     0.94      0.85     0.54
## kelowna      0.95     0.93    1.00     0.92      0.90     0.66
## red deer     0.94     0.94    0.92     1.00      0.84     0.56
## vancouver    0.85     0.85    0.90     0.84      1.00     0.82
## victoria     0.56     0.54    0.66     0.56      0.82     1.00

Correlation

Margins

##           calgary edmonton kelowna red deer vancouver victoria
## calgary      1.00     0.92    0.45     0.03     -0.22    -0.25
## edmonton     0.92     1.00    0.51     0.23     -0.22    -0.21
## kelowna      0.45     0.51    1.00     0.67     -0.03     0.35
## red deer     0.03     0.23    0.67     1.00      0.13     0.57
## vancouver   -0.22    -0.22   -0.03     0.13      1.00     0.51
## victoria    -0.25    -0.21    0.35     0.57      0.51     1.00

so while there is some correlation, it’s not nearly as distinct as with prices.

Existing Literature

Much work has been done analyzing gasoline markets because they are highly visible and people spend a lot of money on gas.

Existing literature focused on small sample sizes (although considerably more accurate than mine) and spatial regression.

My Approach

Python

  • Scrape GasBuddy.com
    • Python/AWS/Selenium/Lots of errors
  • Pull location info from google maps API
    • I had grandiouse ideas to scrape weather and traffic but due to budget limitations, I did not
    • Python
  • Manipulate data using Pandas Resample

My Approach

R

I used data.table and TSclust to cluster time series data using correlation and eucludian distance. I then picked the number of clusters and used this as ‘arbitrary’ labes.

I then split this into train and test sets and performed KNN classification on this using latitude and longitude.

Graphs

Calgary 5 Sites

Western Canada Combined

Best Clustering

Comparison to Vancouver

Selection

  • I will test more exhaustively in the final project
  • Current analysis focuses on All Raw Euclidian 6.

How does this perform?

##            actuals
## preds       Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 Cluster_6
##   Cluster_1        31         0         0         0         0         0
##   Cluster_2         0        20         0         0         1         0
##   Cluster_3         0         0        33        21         1         0
##   Cluster_4         1         0        14        25         0         1
##   Cluster_5         0         0         0         0        11         0
##   Cluster_6         0         0         0         0         0         0

Confusion Matrix

## Confusion Matrix and Statistics
## 
##            actuals
## preds       Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 Cluster_6
##   Cluster_1        31         0         0         0         0         0
##   Cluster_2         0        20         0         0         1         0
##   Cluster_3         0         0        33        21         1         0
##   Cluster_4         1         0        14        25         0         1
##   Cluster_5         0         0         0         0        11         0
##   Cluster_6         0         0         0         0         0         0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7547          
##                  95% CI : (0.6803, 0.8194)
##     No Information Rate : 0.2956          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.678           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Cluster_1 Class: Cluster_2 Class: Cluster_3
## Sensitivity                    0.9688           1.0000           0.7021
## Specificity                    1.0000           0.9928           0.8036
## Pos Pred Value                 1.0000           0.9524           0.6000
## Neg Pred Value                 0.9922           1.0000           0.8654
## Prevalence                     0.2013           0.1258           0.2956
## Detection Rate                 0.1950           0.1258           0.2075
## Detection Prevalence           0.1950           0.1321           0.3459
## Balanced Accuracy              0.9844           0.9964           0.7528
##                      Class: Cluster_4 Class: Cluster_5 Class: Cluster_6
## Sensitivity                    0.5435          0.84615         0.000000
## Specificity                    0.8584          1.00000         1.000000
## Pos Pred Value                 0.6098          1.00000              NaN
## Neg Pred Value                 0.8220          0.98649         0.993711
## Prevalence                     0.2893          0.08176         0.006289
## Detection Rate                 0.1572          0.06918         0.000000
## Detection Prevalence           0.2579          0.06918         0.000000
## Balanced Accuracy              0.7009          0.92308         0.500000

Next Best Alternative

Taking the most common lable by city:

## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
##             raw_eucl_6.y
## raw_eucl_6.x Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5
##    Cluster_1       107         1         1         0         0
##    Cluster_2         0        67         0         0         0
##    Cluster_3         0         0        84        73         0
##    Cluster_4         0         0        78        78         0
##    Cluster_5         0         9         1         2        32
##    Cluster_6         0         0         3         3         0
## [1] "The accuracy is: 0.688311688311688"

Results

Not Great

These results are not incredibly promissing although they do fall within of the 95% CI for the accuracy measure.

Individal Cities

Taking the most promissing result for Calgary, the adjusted correlation 2

##            actuals
## preds       Cluster_1 Cluster_2
##   Cluster_1        13        12
##   Cluster_2         9        12

Analysis

If we chose the most common lable in calgary we get:

##    calg_adj_cor_2 count_calg
## 1:      Cluster_2         82
## 2:      Cluster_1         74

This would only give us an accuracy of 0.5384615.

This suggests that aggrigating over entire markets only captures the market, not the sub markets.

Increase K

If we increase K to 5:

## [1] "The Accuracy Table is:"
##            actuals
## preds       Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5
##   Cluster_1         0         0         0         0         0
##   Cluster_2         0         0         0         0         0
##   Cluster_3         0         0         4         0         1
##   Cluster_4         3         2         4        19         6
##   Cluster_5         0         1         1         2         2

Confusion Matrix

For K=5 for Calgary

## Confusion Matrix and Statistics
## 
##            actuals
## preds       Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5
##   Cluster_1         0         0         0         0         0
##   Cluster_2         0         0         0         0         0
##   Cluster_3         0         0         4         0         1
##   Cluster_4         3         2         4        19         6
##   Cluster_5         0         1         1         2         2
## 
## Overall Statistics
##                                        
##                Accuracy : 0.5556       
##                  95% CI : (0.4, 0.7036)
##     No Information Rate : 0.4667       
##     P-Value [Acc > NIR] : 0.1479       
##                                        
##                   Kappa : 0.2574       
##                                        
##  Mcnemar's Test P-Value : NA           
## 
## Statistics by Class:
## 
##                      Class: Cluster_1 Class: Cluster_2 Class: Cluster_3
## Sensitivity                   0.00000          0.00000          0.44444
## Specificity                   1.00000          1.00000          0.97222
## Pos Pred Value                    NaN              NaN          0.80000
## Neg Pred Value                0.93333          0.93333          0.87500
## Prevalence                    0.06667          0.06667          0.20000
## Detection Rate                0.00000          0.00000          0.08889
## Detection Prevalence          0.00000          0.00000          0.11111
## Balanced Accuracy             0.50000          0.50000          0.70833
##                      Class: Cluster_4 Class: Cluster_5
## Sensitivity                    0.9048          0.22222
## Specificity                    0.3750          0.88889
## Pos Pred Value                 0.5588          0.33333
## Neg Pred Value                 0.8182          0.82051
## Prevalence                     0.4667          0.20000
## Detection Rate                 0.4222          0.04444
## Detection Prevalence           0.7556          0.13333
## Balanced Accuracy              0.6399          0.55556

Analysis

If we chose the most common lable in calgary we get:

##    calg_adj_cor_5 count_calg
## 1:      Cluster_4         71
## 2:      Cluster_5         30
## 3:      Cluster_3         33
## 4:      Cluster_2         11
## 5:      Cluster_1         11

This would give an accuracy of 0.4551282 which is greater than the KNN accuracy. This leads me to believe that sub markets within cities do not exist.

Issues

There are several issues that I faced that might hide a different result.

  • The data collection was spoty due to AWS
  • GasBuddy does not have very granular pricing
  • Since no company to my knowlege tries to set prices at the neighborhood level there could still be potential gains from trying too.

Results

While the cluster analysis worked well defining different cities, it was a poor predictor of within city markets.

Explanation

Nobody travels accross town for gas (Costco is an exception), but it is only marginally more work to move out of a neighborhood, especially since people commute great distances. Thus, submarkets merge accross cities.

Further Work

See hierarchies